Terry,

You were right: the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I am no longer able to observe the "hdr->tag=0" error.
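For reference, here is how that workaround slots into the kind of launch line used elsewhere in this thread (a sketch only; the host file, process count and trailing application arguments are placeholders, elided as in the original reports):

  path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list \
      --mca btl openib,sm,self,tcp \
      --mca btl_openib_use_message_coalescing 0 [...]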
There are some Trac tickets associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing), but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest, Terry?

Eloi

On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> Eloi Gaudry wrote:
> > Terry,
> >
> > No, I haven't tried any other values than P,65536,256,192,128 yet.
> >
> > The reason why is quite simple. I've been reading and reading again this thread to understand the meaning of btl_openib_receive_queues, and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>
> Yeah, the size of the fragments and the number of them really should not cause this issue. So I too am a little perplexed about it.
>
> > Do you think that the default shared receive queue parameters are erroneous for this specific Mellanox card? Any help on finding the proper parameters would actually be much appreciated.
>
> I don't necessarily think it is the queue size for a specific card, but more so the handling of the queues by the BTL when using certain sizes. At least that is one gut feeling I have.
>
> In my mind, the tag being 0 means either something below OMPI is polluting the data fragment, or OMPI's internal protocol is somehow getting messed up. I can imagine (no empirical data here) that the queue sizes could change how the OMPI protocol sets things up. Another thing may be the coalescing feature in the openib BTL, which tries to gang multiple messages into one packet when resources are running low. I can see where changing the queue sizes might affect the coalescing. So, it might be interesting to turn off the coalescing. You can do that by setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.
>
> If that doesn't solve the issue then obviously there must be something else going on :-).
>
> Note, the reason I am interested in this is that I am seeing a similar error condition (hdr->tag == 0) on a development system. Though my failing case fails with np=8 using the connectivity test program, which is mainly point-to-point, and there is not a significant amount of data transfer going on either.
>
> --td
>
> > Eloi
> >
> > On Friday 24 September 2010 14:27:07 you wrote:
> >> That is interesting. So does the number of processes affect your runs any? The times I've seen hdr->tag be 0 have usually been due to protocol issues. The tag should never be 0. Have you tried receive_queue settings other than the default and the one you mention?
> >>
> >> I wonder whether a combination of the two receive queue definitions causes a failure or not. Something like
> >>
> >> P,128,256,192,128:P,65536,256,192,128
> >>
> >> I am wondering if it is the first queue definition causing the issue, or possibly the SRQ defined in the default.
> >>
> >> --td
> >>
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>>
> >>> The messages being sent/received can be of any size, but the error seems to happen more often with small messages (such as an int being broadcast or allreduced). The failing communication differs from one run to another, but some spots are more likely to fail than others, and as far as I know they are always located next to a small-message communication (an int being broadcast, for instance).
> >>> Other typical message sizes are >10k but can be very much larger.
> >>>
> >>> I've been checking the HCA being used; it's from Mellanox (with vendor_part_id=26428). There are no receive_queues parameters associated with it:
> >>>
> >>> $ cat share/openmpi/mca-btl-openib-device-params.ini
> >>> [...]
> >>> # A.k.a. ConnectX
> >>> [Mellanox Hermon]
> >>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> >>> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> >>> use_eager_rdma = 1
> >>> mtu = 2048
> >>> max_inline_data = 128
> >>> [..]
> >>>
> >>> $ ompi_info --param btl openib --parsable | grep receive_queues
> >>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> >>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> >>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> >>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> >>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> >>>
> >>> I was wondering if these parameters (automatically computed at openib btl init, from what I understood) were incorrect in some way, so I plugged in some other values: "P,65536,256,192,128" (someone on the list used those values when encountering a different issue). Since then, I haven't been able to observe the segfault (occurring as hdr->tag = 0 in btl_openib_component.c:2881) yet.
> >>>
> >>> Eloi
> >>>
> >>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
> >>>
> >>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> >>>> Eloi, I am curious about your problem. Can you tell me what size of job it is? Does it always fail on the same bcast, or the same process?
> >>>>
> >>>> Eloi Gaudry wrote:
> >>>>> Hi Nysal,
> >>>>>
> >>>>> Thanks for your suggestions.
> >>>>>
> >>>>> I'm now able to get the checksum computed and redirected to stdout, thanks (I forgot the "-mca pml_base_verbose 5" option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum), but I'll let you know when I am.
> >>>>>
> >>>>> I've got two other questions, which may be related to the error observed:
> >>>>>
> >>>>> 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI somehow depend on the btl being used (i.e. if I'm using openib, may I use the same number of MPI_Comm objects as with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
> >>>>>
> >>>>> 2/ the segfault only appears during an MPI collective call, with very small messages (one int being broadcast, for instance); I followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug build of OpenMPI asserts if I use a min-size different from 255. Anyway, if I deactivate eager_rdma, the segfaults remain. Does the openib btl handle very small messages differently than tcp (even with eager_rdma deactivated)?
> >>>>
> >>>> Others on the list: does coalescing happen with non-eager_rdma? If so, then that would possibly be one difference between the openib btl and tcp, aside from the actual protocol used.
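An aside on reading the receive_queues specifications quoted above: as far as I understand the openib BTL documentation (worth double-checking against the FAQ), each colon-separated entry has the form TYPE,buffer_size,num_buffers[,low_watermark[,window]], where P declares a per-peer queue and S a shared receive queue (SRQ). On that assumption, the three settings discussed in this thread compare as follows:

  # default, computed at openib btl init (from the ompi_info output above):
  --mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
  # Eloi's experiment: one large per-peer queue, no SRQs
  --mca btl_openib_receive_queues P,65536,256,192,128
  # Terry's suggested combination of the two per-peer definitions
  --mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128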
> >>>>> is there a way to make sure that large messages and small messages are handled the same way?
> >>>>
> >>>> Do you mean so they all look like eager messages? How large are the messages we are talking about here: 1K, 1M or 10M?
> >>>>
> >>>> --td
> >>>>
> >>>>> Regards,
> >>>>> Eloi
> >>>>>
> >>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> >>>>>> Hi Eloi,
> >>>>>> Create a debug build of OpenMPI (--enable-debug) and, while running with the csum PML, add "-mca pml_base_verbose 5" to the command line. This will print the checksum details for each fragment sent over the wire. I'm guessing it didn't catch anything because the BTL failed. The checksum verification is done in the PML, which the BTL calls via a callback function. In your case the PML callback is never called because the hdr->tag is invalid, so enabling checksum tracing also might not be of much use. Is it the first Bcast that fails or the nth Bcast, and what is the message size? I'm not sure what the problem could be at this moment. I'm afraid you will have to debug the BTL to find out more.
> >>>>>>
> >>>>>> --Nysal
> >>>>>>
> >>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>>>> Hi Nysal,
> >>>>>>>
> >>>>>>> thanks for your response.
> >>>>>>>
> >>>>>>> I've been unable so far to write a test case that could illustrate the hdr->tag=0 error. Actually, I'm only observing this issue when running an internode computation involving InfiniBand hardware from Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software.
> >>>>>>>
> >>>>>>> I checked, double-checked, and rechecked again every MPI use performed during a parallel computation, and I couldn't find any error so far. The fact that the very same parallel computation runs flawlessly when using tcp (and disabling openib support) might seem to indicate that the issue is located somewhere inside the openib btl or at the hardware/driver level.
> >>>>>>>
> >>>>>>> I've just used the "-mca pml csum" option and I haven't seen any related messages (when hdr->tag=0 and the segfault occurs). Any suggestion?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Eloi
> >>>>>>>
> >>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> >>>>>>>> Hi Eloi,
> >>>>>>>> Sorry for the delay in response. I haven't read the entire email thread, but do you have a test case which can reproduce this error? Without that it will be difficult to nail down the cause. Just to clarify, I do not work for an iWARP vendor. I can certainly try to reproduce it on an IB system. There is also a PML called csum, which you can use via "-mca pml csum"; it will checksum the MPI messages and verify them at the receiver side for any data corruption. You can try using it to see if it is able to catch anything.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> --Nysal
> >>>>>>>>
> >>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>>>>>> Hi Nysal,
> >>>>>>>>>
> >>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a chance to look at this error.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Eloi
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Eloi Gaudry
> >>>>>>>>>
> >>>>>>>>> Free Field Technologies
> >>>>>>>>> Company Website: http://www.fft.be
> >>>>>>>>> Company Phone: +32 10 487 959
> >>>>>>>>>
> >>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>> From: Eloi Gaudry <e...@fft.be>
> >>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
> >>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
> >>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using openib btl
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I was wondering if anybody got a chance to have a look at this issue.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Eloi
> >>>>>>>>>
> >>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> >>>>>>>>>> Hi Jeff,
> >>>>>>>>>>
> >>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from
> >>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp --suppressions=./suppressions.python.supp /opt/actran/bin/actranpy_mp ...
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Eloi
> >>>>>>>>>>
> >>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> >>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> >>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>> I did run our application through valgrind but it couldn't find any "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2 with the suppression file), "Use of uninitialized bytes" and "Conditional jump depending on uninitialized bytes" in different ompi routines. Some of them are located in btl_openib_component.c. I'll send you an output of valgrind shortly.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A lot of them in btl_openib_* are to be expected -- OpenFabrics uses OS-bypass methods for some of its memory, and therefore valgrind is unaware of them (and therefore incorrectly marks them as uninitialized).
> >>>>>>>>>>>
> >>>>>>>>>>> Would it help if I use the upcoming 1.5 version of openmpi? I read that a huge effort has been made to clean up the valgrind output, but maybe that doesn't concern this btl (for the reasons you mentioned).
> >>>>>>>>>>>
> >>>>>>>>>>>>> Another question: you said that the callback function pointer should never be 0. But can the tag be null (hdr->tag)?
> >>>>>>>>>>>>
> >>>>>>>>>>>> The tag is not a pointer -- it's just an integer.
> >>>>>>>>>>>
> >>>>>>>>>>> I was wondering whether its value could legitimately be null.
> >>>>>>>>>>>
> >>>>>>>>>>> I'll send a valgrind output soon (I need to build libpython without pymalloc first).
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Eloi
> >>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for your help,
> >>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
> >>>>>>>>>>>>>> Sorry for the delay in replying.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Odd; the values of the callback function pointer should never be 0. This seems to suggest some kind of memory corruption is occurring.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I don't know if it's possible, because the stack trace looks like you're calling through python, but can you run this application through valgrind, or some other memory-checking debugger?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> sorry, i just forgot to add the values of the function parameters:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (gdb) print reg->cbdata
> >>>>>>>>>>>>>>> $1 = (void *) 0x0
> >>>>>>>>>>>>>>> (gdb) print openib_btl->super
> >>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, btl_rdma_pipeline_send_length = 1048576, btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, btl_flags = 310, btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>, btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>, btl_register = 0, btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>, btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>, btl_free = 0x2b341eb91400 <mca_btl_openib_free>, btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>, btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>, btl_send = 0x2b341eb94517 <mca_btl_openib_send>, btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>, btl_put = 0x2b341eb94660 <mca_btl_openib_put>, btl_get = 0x2b341eb94c4e <mca_btl_openib_get>, btl_dump = 0x2b341acd45cb <mca_btl_base_dump>, btl_mpool = 0xf3f4110, btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>, btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> >>>>>>>>>>>>>>> (gdb) print hdr->tag
> >>>>>>>>>>>>>>> $3 = 0 '\0'
> >>>>>>>>>>>>>>> (gdb) print des
> >>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> >>>>>>>>>>>>>>> (gdb) print reg->cbfunc
> >>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Here is the output of a core file generated during a segmentation fault observed during a collective call (using openib):
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> >>>>>>>>>>>>>>>> (gdb) where
> >>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> >>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> >>>>>>>>>>>>>>>> #2 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
> >>>>>>>>>>>>>>>> #3 0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> >>>>>>>>>>>>>>>> #4 0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> >>>>>>>>>>>>>>>> #5 0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> >>>>>>>>>>>>>>>> #6 0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> >>>>>>>>>>>>>>>> #7 0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> >>>>>>>>>>>>>>>> #8 0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> >>>>>>>>>>>>>>>> #9 0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> >>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> >>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> >>>>>>>>>>>>>>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> >>>>>>>>>>>>>>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
> >>>>>>>>>>>>>>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
> >>>>>>>>>>>>>>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
> >>>>>>>>>>>>>>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> >>>>>>>>>>>>>>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> >>>>>>>>>>>>>>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> >>>>>>>>>>>>>>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> >>>>>>>>>>>>>>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> >>>>>>>>>>>>>>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
> >>>>>>>>>>>>>>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
> >>>>>>>>>>>>>>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
> >>>>>>>>>>>>>>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
> >>>>>>>>>>>>>>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
> >>>>>>>>>>>>>>>> (gdb) f 1
> >>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> >>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> >>>>>>>>>>>>>>>> Current language: auto; currently c
> >>>>>>>>>>>>>>>> (gdb) l
> >>>>>>>>>>>>>>>> 2876
> >>>>>>>>>>>>>>>> 2877        if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> >>>>>>>>>>>>>>>> 2878            /* call registered callback */
> >>>>>>>>>>>>>>>> 2879            mca_btl_active_message_callback_t* reg;
> >>>>>>>>>>>>>>>> 2880            reg = mca_btl_base_active_message_trigger + hdr->tag;
> >>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> >>>>>>>>>>>>>>>> 2882            if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> >>>>>>>>>>>>>>>> 2883                cqp = (hdr->credits >> 11) & 0x0f;
> >>>>>>>>>>>>>>>> 2884                hdr->credits &= 0x87ff;
> >>>>>>>>>>>>>>>> 2885            } else {
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>> Hi Edgar,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The only difference I could observe was that the segmentation fault sometimes appeared later during the parallel computation.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm running out of ideas here.
> >>>>>>>>>>>>>>>>> I wish I could use "--mca coll tuned" together with "--mca btl self,sm,tcp" so that I could check that the issue is not somehow limited to the tuned collective routines.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> >>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>> hi edgar,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> thanks for the tips, i'm gonna try this option as well. the segmentation fault i'm observing always happened during a collective communication indeed... it basically switches all collective communication to basic mode, right?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand networking card)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>> Edgar
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>>>>>> éloi
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> >>>>>>>>>>>>>>>>>>>> you could first try to use the algorithms in the basic module, e.g.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used to sometimes observe a (similar?) problem in the openib btl triggered from the tuned collective component, in cases where the ofed libraries were installed but no HCA was found on a node. It used to work however with the basic component.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>> Edgar
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>> hi Rolf,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that annoying segmentation fault when selecting another bcast algorithm. i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and see if that helps.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> regards,
> >>>>>>>>>>>>>>>>>>>>> éloi
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed the coll_tuned_use_dynamic_rules option.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I'll check whether the segmentation fault disappears when using the basic bcast linear algorithm with the proper command line you provided.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
> >>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, you need to add an extra mca parameter that tells the library to use dynamic selection: --mca coll_tuned_use_dynamic_rules 1
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in correctly is to use it with ompi_info. Do the following:
> >>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the different algorithms that can be selected for the various collectives. Therefore, you need this:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Rolf
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allows switching to the basic linear algorithm. Anyway, whatever the algorithm used, the segmentation fault remains.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to diagnose the issue I'm facing?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine, which seems to randomly segfault when using the openib btl. I'd like to know if there is any way to make OpenMPI switch to a different algorithm than the default one being selected for MPI_Bcast.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
> >>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation fault during an internode parallel computation involving the openib btl and OpenMPI-1.4.2 (the same issue can be observed with OpenMPI-1.3.3).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
> >>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
> >>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault /share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/bin/actranpy_mp '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872' '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp on the command line, for instance), I don't encounter any problem and the parallel computation runs flawlessly.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
> >>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
> >>>>>>>>>>>>>>>>>>>>>>>>>> - to understand why this issue is observed only when using the openib btl and not when using self,sm,tcp
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the configure scripts of OpenMPI are enclosed to this email, and some information on the infiniband drivers as well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when launching a parallel computation using infiniband:
> >>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
> >>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl self,sm,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
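Since Nysal asked for a reproducer and none was available at the time, here is a minimal sketch of the kind of test case this thread points at: many one-int broadcasts and allreduces, run over the openib btl. It is an untested illustration (the loop count and the use of MPI_COMM_WORLD alone are arbitrary choices), not code from the failing application:

  #include <mpi.h>
  #include <stdio.h>

  /* Repeatedly broadcast and allreduce a single int, the pattern that
   * triggered the hdr->tag=0 failures in this thread. Run it e.g. with:
   *   mpirun -np 8 --mca btl openib,self ./smallcoll */
  int main(int argc, char **argv)
  {
      int rank, value = 0, sum = 0, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 100000; i++) {
          value = (rank == 0) ? i : -1;
          MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
          MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      }
      if (rank == 0)
          printf("done: %d iterations, last sum %d\n", i, sum);
      MPI_Finalize();
      return 0;
  }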