Terry,

Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).
For information, Nysal in his first message referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side:

#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN (MCA_BTL_TAG_PML + 9)

and in ompi/mca/btl/btl.h:

#define MCA_BTL_TAG_PML 0x40

Eloi

On Monday 27 September 2010 14:36:59 Terry Dontje wrote: > I am thinking checking the value of *frag->hdr right before the return > in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. > It is line 548 in the trunk > https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_ope > nib_endpoint.h#548 > > --td > > Eloi Gaudry wrote: > > Hi Terry, > > > > Do you have any patch that I could apply to be able to do so ? I'm > > remotely working on a cluster (with a terminal) and I cannot use any > > parallel debugger or sequential debugger (with a call to xterm...). I > > can track frag->hdr->tag value in > > ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the > > SEND/RDMA_WRITE case, but this is all I can think of alone. > > > > You'll find a stacktrace (receive side) in this thread (10th or 11th > > message) but it might be pointless. > > > > Regards, > > Eloi > > > > On Monday 27 September 2010 11:43:55 Terry Dontje wrote: > >> So it sounds like coalescing is not your issue and that the problem has > >> something to do with the queue sizes. It would be helpful if we could > >> detect the hdr->tag == 0 issue on the sending side and get at least a > >> stack trace. There is something really odd going on here. > >> > >> --td > >> > >> Eloi Gaudry wrote: > >>> Hi Terry, > >>> > >>> I'm sorry to say that I might have missed a point here. > >>> > >>> I've lately been relaunching all previously failing computations with > >>> the message coalescing feature being switched off, and I saw the same > >>> hdr->tag=0 error several times, always during a collective call > >>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as > >>> soon as I switched to the peer queue option I was previously using > >>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using > >>> --mca btl_openib_use_message_coalescing 0), all computations ran > >>> flawlessly. > >>> > >>> As for the reproducer, I've already tried to write something but I > >>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it. > >>> > >>> Eloi > >>> > >>> On 24/09/2010 18:37, Terry Dontje wrote: > >>>> Eloi Gaudry wrote: > >>>>> Terry, > >>>>> > >>>>> You were right, the error indeed seems to come from the message > >>>>> coalescing feature. If I turn it off using the "--mca > >>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the > >>>>> "hdr->tag=0" error. > >>>>> > >>>>> There are some trac requests associated to very similar error > >>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are > >>>>> all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 > >>>>> that might be related), aren't they ? What would you suggest Terry ? 
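To make Terry's suggestion above concrete, here is a minimal sketch of a send-side guard. The helper name check_send_tag is hypothetical, and it assumes the fragment header is reachable as frag->hdr right before the return in post_send(), as the discussion describes; it is an illustrative sketch, not a tested patch against btl_openib_endpoint.h.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, illustrative only: call it on frag->hdr->tag right
 * before the return in post_send() (ompi/mca/btl/openib/btl_openib_endpoint.h)
 * to trap a zero tag on the sending side and get a core file / stack trace
 * where the bad value is generated.  The tag should never be 0; the PML tags
 * quoted above all start at MCA_BTL_TAG_PML + 1, i.e. 0x41. */
static void check_send_tag(uint8_t tag)
{
    if (0 == tag) {
        fprintf(stderr, "openib post_send: about to send hdr->tag == 0\n");
        abort();   /* dump core on the send side */
    }
}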
> >>>> > >>>> Interesting, though it looks to me like the segv in ticket 2352 would > >>>> have happened on the send side instead of the receive side like you > >>>> have. As to what to do next it would be really nice to have some > >>>> sort of reproducer that we can try and debug what is really going > >>>> on. The only other thing to do without a reproducer is to inspect > >>>> the code on the send side to figure out what might make it generate > >>>> at 0 hdr->tag. Or maybe instrument the send side to stop when it is > >>>> about ready to send a 0 hdr->tag and see if we can see how the code > >>>> got there. > >>>> > >>>> I might have some cycles to look at this Monday. > >>>> > >>>> --td > >>>> > >>>>> Eloi > >>>>> > >>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote: > >>>>>> Eloi Gaudry wrote: > >>>>>>> Terry, > >>>>>>> > >>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet. > >>>>>>> > >>>>>>> The reason why is quite simple. I've been reading and reading again > >>>>>>> this thread to understand the btl_openib_receive_queues meaning and > >>>>>>> I can't figure out why the default values seem to induce the hdr- > >>>>>>> > >>>>>>>> tag=0 issue > >>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php). > >>>>>> > >>>>>> Yeah, the size of the fragments and number of them really should not > >>>>>> cause this issue. So I too am a little perplexed about it. > >>>>>> > >>>>>>> Do you think that the default shared received queue parameters are > >>>>>>> erroneous for this specific Mellanox card ? Any help on finding the > >>>>>>> proper parameters would actually be much appreciated. > >>>>>> > >>>>>> I don't necessarily think it is the queue size for a specific card > >>>>>> but more so the handling of the queues by the BTL when using > >>>>>> certain sizes. At least that is one gut feel I have. > >>>>>> > >>>>>> In my mind the tag being 0 is either something below OMPI is > >>>>>> polluting the data fragment or OMPI's internal protocol is some how > >>>>>> getting messed up. I can imagine (no empirical data here) the > >>>>>> queue sizes could change how the OMPI protocol sets things up. > >>>>>> Another thing may be the coalescing feature in the openib BTL which > >>>>>> tries to gang multiple messages into one packet when resources are > >>>>>> running low. I can see where changing the queue sizes might > >>>>>> affect the coalescing. So, it might be interesting to turn off the > >>>>>> coalescing. You can do that by setting "--mca > >>>>>> btl_openib_use_message_coalescing 0" in your mpirun line. > >>>>>> > >>>>>> If that doesn't solve the issue then obviously there must be > >>>>>> something else going on :-). > >>>>>> > >>>>>> Note, the reason I am interested in this is I am seeing a similar > >>>>>> error condition (hdr->tag == 0) on a development system. Though my > >>>>>> failing case fails with np=8 using the connectivity test program > >>>>>> which is mainly point to point and there are not a significant > >>>>>> amount of data transfers going on either. > >>>>>> > >>>>>> --td > >>>>>> > >>>>>>> Eloi > >>>>>>> > >>>>>>> On Friday 24 September 2010 14:27:07 you wrote: > >>>>>>>> That is interesting. So does the number of processes affect your > >>>>>>>> runs any. The times I've seen hdr->tag be 0 usually has been due > >>>>>>>> to protocol issues. The tag should never be 0. Have you tried to > >>>>>>>> do other receive_queue settings other than the default and the one > >>>>>>>> you mention. 
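For quick reference, the two settings Eloi compares above go on the mpirun command line like the other MCA options in this thread, for example (command skeleton taken from the original report further down; the receive_queues value is the one that made the runs complete):

mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp --mca btl_openib_receive_queues P,65536,256,192,128 ...

whereas running with only --mca btl_openib_use_message_coalescing 0 still produced the hdr->tag=0 error.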
> >>>>>>>> > >>>>>>>> I wonder if you did a combination of the two receive queues causes > >>>>>>>> a failure or not. Something like > >>>>>>>> > >>>>>>>> P,128,256,192,128:P,65536,256,192,128 > >>>>>>>> > >>>>>>>> I am wondering if it is the first queuing definition causing the > >>>>>>>> issue or possibly the SRQ defined in the default. > >>>>>>>> > >>>>>>>> --td > >>>>>>>> > >>>>>>>> Eloi Gaudry wrote: > >>>>>>>>> Hi Terry, > >>>>>>>>> > >>>>>>>>> The messages being send/received can be of any size, but the > >>>>>>>>> error seems to happen more often with small messages (as an int > >>>>>>>>> being broadcasted or allreduced). The failing communication > >>>>>>>>> differs from one run to another, but some spots are more likely > >>>>>>>>> to be failing than another. And as far as I know, there are > >>>>>>>>> always located next to a small message (an int being broadcasted > >>>>>>>>> for instance) communication. Other typical messages size are > >>>>>>>>> > >>>>>>>>>> 10k but can be very much larger. > >>>>>>>>> > >>>>>>>>> I've been checking the hca being used, its' from mellanox (with > >>>>>>>>> vendor_part_id=26428). There is no receive_queues parameters > >>>>>>>>> associated to it. > >>>>>>>>> > >>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well: > >>>>>>>>> [...] > >>>>>>>>> > >>>>>>>>> # A.k.a. ConnectX > >>>>>>>>> [Mellanox Hermon] > >>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 > >>>>>>>>> vendor_part_id = > >>>>>>>>> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,264 > >>>>>>>>> 88 use_eager_rdma = 1 > >>>>>>>>> mtu = 2048 > >>>>>>>>> max_inline_data = 128 > >>>>>>>>> > >>>>>>>>> [..] > >>>>>>>>> > >>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues > >>>>>>>>> > >>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,1 > >>>>>>>>> 92 ,128 > >>>>>>>>> > >>>>>>>>> :S ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 > >>>>>>>>> > >>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:defau > >>>>>>>>> lt value > >>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable > >>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimi > >>>>>>>>> t ed, comma delimited list of receive queues: > >>>>>>>>> P,4096,8,6,4:P,32768,8,6,4 > >>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no > >>>>>>>>> > >>>>>>>>> I was wondering if these parameters (automatically computed at > >>>>>>>>> openib btl init for what I understood) were not incorrect in some > >>>>>>>>> way and I plugged some others values: "P,65536,256,192,128" > >>>>>>>>> (someone on the list used that values when encountering a > >>>>>>>>> different issue) . Since that, I haven't been able to observe the > >>>>>>>>> segfault (occuring as hrd->tag = 0 in > >>>>>>>>> btl_openib_component.c:2881) yet. > >>>>>>>>> > >>>>>>>>> Eloi > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/ > >>>>>>>>> > >>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote: > >>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what > >>>>>>>>>> size of job it is? Does it always fail on the same bcast, or > >>>>>>>>>> same process? > >>>>>>>>>> > >>>>>>>>>> Eloi Gaudry wrote: > >>>>>>>>>>> Hi Nysal, > >>>>>>>>>>> > >>>>>>>>>>> Thanks for your suggestions. 
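A note on reading the receive_queues strings quoted a bit earlier (a best-effort decoding based on the parameter help text shown in the ompi_info output, not an authoritative description): each colon-separated entry defines one queue, with P a per-peer queue and S a shared receive queue, and the comma-separated numbers giving the buffer size in bytes, the number of buffers, and then watermark/credit values. Read that way, the default mixes one small per-peer queue (128-byte buffers) with three shared queues of increasing buffer sizes, while the working setting P,65536,256,192,128 uses a single per-peer queue of 256 buffers of 64 KiB.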
> >>>>>>>>>>> > >>>>>>>>>>> I'm now able to get the checksum computed and redirected to > >>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5" option, > >>>>>>>>>>> you were right). I haven't been able to observe the > >>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml > >>>>>>>>>>> csum) but I 'll let you know when I am. > >>>>>>>>>>> > >>>>>>>>>>> I've got two others question, which may be related to the error > >>>>>>>>>>> observed: > >>>>>>>>>>> > >>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by > >>>>>>>>>>> OpenMPI somehow depends on the btl being used (i.e. if I'm > >>>>>>>>>>> using openib, may I use the same number of MPI_Comm object as > >>>>>>>>>>> with tcp) ? Is there something as MPI_COMM_MAX in OpenMPI ? > >>>>>>>>>>> > >>>>>>>>>>> 2/ the segfaults only appears during a mpi collective call, > >>>>>>>>>>> with very small message (one int is being broadcast, for > >>>>>>>>>>> instance) ; i followed the guidelines given at > >>>>>>>>>>> http://icl.cs.utk.edu/open- > >>>>>>>>>>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the > >>>>>>>>>>> debug-build of OpenMPI asserts if I use a different min-size > >>>>>>>>>>> that 255. Anyway, if I deactivate eager_rdma, the segfaults > >>>>>>>>>>> remains. Does the openib btl handle very small message > >>>>>>>>>>> differently (even with eager_rdma > >>>>>>>>>>> deactivated) than tcp ? > >>>>>>>>>> > >>>>>>>>>> Others on the list does coalescing happen with non-eager_rdma? > >>>>>>>>>> If so then that would possibly be one difference between the > >>>>>>>>>> openib btl and tcp aside from the actual protocol used. > >>>>>>>>>> > >>>>>>>>>>> is there a way to make sure that large messages and small > >>>>>>>>>>> messages are handled the same way ? > >>>>>>>>>> > >>>>>>>>>> Do you mean so they all look like eager messages? How large of > >>>>>>>>>> messages are we talking about here 1K, 1M or 10M? > >>>>>>>>>> > >>>>>>>>>> --td > >>>>>>>>>> > >>>>>>>>>>> Regards, > >>>>>>>>>>> Eloi > >>>>>>>>>>> > >>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote: > >>>>>>>>>>>> Hi Eloi, > >>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while > >>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to the > >>>>>>>>>>>> command line. This will print the checksum details for each > >>>>>>>>>>>> fragment sent over the wire. I'm guessing it didnt catch > >>>>>>>>>>>> anything because the BTL failed. The checksum verification is > >>>>>>>>>>>> done in the PML, which the BTL calls via a callback function. > >>>>>>>>>>>> In your case the PML callback is never called because the > >>>>>>>>>>>> hdr->tag is invalid. So enabling checksum tracing also might > >>>>>>>>>>>> not be of much use. Is it the first Bcast that fails or the > >>>>>>>>>>>> nth Bcast and what is the message size? I'm not sure what > >>>>>>>>>>>> could be the problem at this moment. I'm afraid you will have > >>>>>>>>>>>> to debug the BTL to find out more. > >>>>>>>>>>>> > >>>>>>>>>>>> --Nysal > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote: > >>>>>>>>>>>>> Hi Nysal, > >>>>>>>>>>>>> > >>>>>>>>>>>>> thanks for your response. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I've been unable so far to write a test case that could > >>>>>>>>>>>>> illustrate the hdr->tag=0 error. 
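Putting Nysal's checksum-tracing instructions from above onto one command line (same mpirun skeleton as elsewhere in this thread, with a --enable-debug build of Open MPI), the run would look roughly like:

mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp -mca pml csum -mca pml_base_verbose 5 ...

with the caveat Nysal already gives: the checksum is verified in the PML callback, so a corrupted hdr->tag that prevents the callback from ever being invoked will not be caught this way.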
> >>>>>>>>>>>>> Actually, I'm only observing this issue when running an > >>>>>>>>>>>>> internode computation involving infiniband hardware from > >>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0 > >>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI use > >>>>>>>>>>>>> performed during a parallel computation and I couldn't find > >>>>>>>>>>>>> any error so far. The fact that the very > >>>>>>>>>>>>> same parallel computation run flawlessly when using tcp (and > >>>>>>>>>>>>> disabling openib support) might seem to indicate that the > >>>>>>>>>>>>> issue is somewhere located inside the > >>>>>>>>>>>>> openib btl or at the hardware/driver level. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't seen > >>>>>>>>>>>>> any related messages (when hdr->tag=0 and the segfaults > >>>>>>>>>>>>> occurs). Any suggestion ? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Regards, > >>>>>>>>>>>>> Eloi > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote: > >>>>>>>>>>>>>> Hi Eloi, > >>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire > >>>>>>>>>>>>>> email thread, but do you have a test case which can > >>>>>>>>>>>>>> reproduce this error? Without that it will be difficult to > >>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not work for an > >>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce it on an IB > >>>>>>>>>>>>>> system. There is also a PML called csum, you can use it via > >>>>>>>>>>>>>> "-mca pml csum", which will checksum the MPI messages and > >>>>>>>>>>>>>> verify it at the receiver side for any data > >>>>>>>>>>>>>> corruption. You can try using it to see if it is able > >>>>>>>>>>>>> > >>>>>>>>>>>>> to > >>>>>>>>>>>>> > >>>>>>>>>>>>>> catch anything. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Regards > >>>>>>>>>>>>>> --Nysal > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Hi Nysal, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I'm sorry to intrrupt, but I was wondering if you had a > >>>>>>>>>>>>>>> chance to look > >>>>>>>>>>>>> > >>>>>>>>>>>>> at > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> this error. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Eloi Gaudry > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Free Field Technologies > >>>>>>>>>>>>>>> Company Website: http://www.fft.be > >>>>>>>>>>>>>>> Company Phone: +32 10 487 959 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ---------- Forwarded message ---------- > >>>>>>>>>>>>>>> From: Eloi Gaudry <e...@fft.be> > >>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org> > >>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200 > >>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using > >>>>>>>>>>>>>>> openib btl Hi, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at > >>>>>>>>>>>>>>> this issue. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>> Hi Jeff, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from > >>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host > >>>>>>>>>>>>>>>> pbn11,pbn10 --mca > >>>>>>>>>>>>> > >>>>>>>>>>>>> btl > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> openib,self --display-map --verbose --mca mpi_warn_on_fork > >>>>>>>>>>>>>>>> 0 --mca btl_openib_want_fork_support 0 -tag-output > >>>>>>>>>>>>>>>> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck > >>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/open > >>>>>>>>>>>>>>>> mp i- valgrind.supp > >>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp > >>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ... > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote: > >>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it > >>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch of > >>>>>>>>>>>>>>>>>>> "Invalid read" (I'm using > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1.4.2 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> with the suppression file), "Use of uninitialized > >>>>>>>>>>>>>>>>>>> bytes" and "Conditional jump depending on > >>>>>>>>>>>>>>>>>>> uninitialized bytes" in > >>>>>>>>>>>>> > >>>>>>>>>>>>> different > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> ompi > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> routines. Some of them are located in > >>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an output of > >>>>>>>>>>>>>>>>>>> valgrind shortly. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected -- > >>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its > >>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware of them (and > >>>>>>>>>>>>>>>>>> therefore incorrectly marks them as > >>>>>>>>>>>>>>>>>> uninitialized). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> would it help if i use the upcoming 1.5 version of > >>>>>>>>>>>>>>>>> openmpi ? i > >>>>>>>>>>>>> > >>>>>>>>>>>>> read > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> a huge effort has been done to clean-up the valgrind > >>>>>>>>>>>>>>>>> output ? but maybe that this doesn't concern this btl > >>>>>>>>>>>>>>>>> (for the reasons you mentionned). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Another question, you said that the callback function > >>>>>>>>>>>>>>>>>>> pointer > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> should > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> never be 0. But can the tag be null (hdr->tag) ? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I was worrying that its value could not be null. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build > >>>>>>>>>>>>>>>>> libpython without pymalloc first). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks for your help, > >>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote: > >>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying. 
> >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer > >>>>>>>>>>>>>>>>>>>> should never > >>>>>>>>>>>>> > >>>>>>>>>>>>> be > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> 0. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption > >>>>>>>>>>>>>>>>>>>> is occurring. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack trace > >>>>>>>>>>>>>>>>>>>> looks like you're calling through python, but can you > >>>>>>>>>>>>>>>>>>>> run this application through valgrind, or some other > >>>>>>>>>>>>>>>>>>>> memory-checking debugger? > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the > >>>>>>>>>>>>>>>>>>>>> function > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> parameters: > >>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata > >>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0 > >>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super > >>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit > >>>>>>>>>>>>>>>>>>>>> = > >>>>>>>>>>>>> > >>>>>>>>>>>>> 12288, > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = > >>>>>>>>>>>>>>>>>>>>> 65536, btl_rdma_pipeline_send_length = 1048576, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> btl_min_rdma_pipeline_size > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> = 1060864, btl_exclusivity = 1024, btl_latency = > >>>>>>>>>>>>>>>>>>>>> 10, btl_bandwidth = 800, btl_flags = 310, > >>>>>>>>>>>>>>>>>>>>> btl_add_procs = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb8ee47<mca_btl_openib_add_procs>, > >>>>>>>>>>>>>>>>>>>>> btl_del_procs = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb90156<mca_btl_openib_del_procs>, > >>>>>>>>>>>>>>>>>>>>> btl_register = 0, btl_finalize = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb93186<mca_btl_openib_finalize>, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> btl_alloc > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> = 0x2b341eb90a3e<mca_btl_openib_alloc>, btl_free = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb91400<mca_btl_openib_free>, > >>>>>>>>>>>>>>>>>>>>> btl_prepare_src = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb91813<mca_btl_openib_prepare_src>, > >>>>>>>>>>>>>>>>>>>>> btl_prepare_dst > >>>>>>>>>>>>> > >>>>>>>>>>>>> = > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb91f2e<mca_btl_openib_prepare_dst>, > >>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517<mca_btl_openib_send>, > >>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d<mca_btl_openib_sendi>, > >>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660<mca_btl_openib_put>, > >>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e<mca_btl_openib_get>, > >>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb<mca_btl_base_dump>, > >>>>>>>>>>>>>>>>>>>>> btl_mpool = 0xf3f4110, btl_register_error = > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb90565<mca_btl_openib_register_error_cb>, > >>>>>>>>>>>>>>>>>>>>> btl_ft_event > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 0x2b341eb952e7<mca_btl_openib_ft_event>} > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag > >>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0' > >>>>>>>>>>>>>>>>>>>>> (gdb) print des > >>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700 > >>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc > >>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0 > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>> > 
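Those values explain the mechanics of the crash: in btl_openib_handle_incoming() (the source listing appears further down in this thread), hdr->tag is used as an index into mca_btl_base_active_message_trigger, and slot 0 has no registered callback, so reg->cbfunc is NULL and calling it jumps to address 0x0. Purely as an illustration, a receive-side guard using the names from that listing might look like this (a sketch, not a proposed fix):

/* Sketch only, using the names from the btl_openib_component.c listing
 * quoted later in this thread (around line 2881): fail loudly instead of
 * jumping through a NULL callback when a fragment arrives with a tag that
 * has no registered handler. */
mca_btl_active_message_callback_t *reg =
    mca_btl_base_active_message_trigger + hdr->tag;
if (NULL == reg->cbfunc) {
    fprintf(stderr, "openib recv: no callback registered for hdr->tag 0x%x\n",
            (unsigned) hdr->tag);
    abort();                     /* instead of segfaulting at address 0x0 */
}
reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata);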
>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during a > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> segmentation > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call (using > >>>>>>>>>>>>>>>>>>>>>> openib): > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? () > >>>>>>>>>>>>>>>>>>>>>> (gdb) where > >>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? () > >>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming > >>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, > >>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2 0x00002aedbc4e25e2 in > >>>>>>>>>>>>>>>>>>>>>> handle_wc (device=0x19024ac0, cq=0, > >>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3 0x00002aedbc4e2e9d > >>>>>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> poll_device > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318 > >>>>>>>>>>>>> > >>>>>>>>>>>>> #4 > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device > >>>>>>>>>>>>> > >>>>>>>>>>>>> (device=0x19024ac0) > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5 > >>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in btl_openib_component_progress > >>>>>>>>>>>>>>>>>>>>>> () at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> #6 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at > >>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7 0x00002aedb859f497 in > >>>>>>>>>>>>>>>>>>>>>> opal_condition_wait (c=0x2aedb888ccc0, > >>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at ../opal/threads/condition.h:99 > >>>>>>>>>>>>>>>>>>>>>> #8 > >>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in ompi_request_default_wait_all > >>>>>>>>>>>>> > >>>>>>>>>>>>> (count=2, > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at > >>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9 0x00002aedbd7559ad in > >>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling > >>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, > >>>>>>>>>>>>>>>>>>>>>> dtype=0x6788220, op=0x6787a20, > >>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> coll_tuned_allreduce.c:223 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in > >>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed > >>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, > >>>>>>>>>>>>>>>>>>>>>> dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, > >>>>>>>>>>>>>>>>>>>>>> module=0x19d82b20) at > >>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63 > >>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> op=0x6787a20, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12 > >>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf > >>>>>>>>>>>>> > >>>>>>>>>>>>> in > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, > >>>>>>>>>>>>>>>>>>>>>> 
recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> op=0x6787a20, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13 > >>>>>>>>>>>>>>>>>>>>>> 0x0000000004058be8 in FEMTown::Domain::align (itf= > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Int > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> er fa ce>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = > >>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ = 0x6}}},<No data > >>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14 > >>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in > >>>>>>>>>>>>>>>>>>>>>> FEMTown::Field::detail::align_itfs_and_neighbhors > >>>>>>>>>>>>>>>>>>>>>> (dim=2, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> set={px > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, > >>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at check.cpp:63 #15 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 0x00000000040cbfa8 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements (set={px = > >>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at > >>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in > >>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0, > >>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17 > >>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in > >>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object* (*)(_object*, > >>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object> > >>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, > >>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception.hp > >>>>>>>>>>>>>>>>>>>>>> p: 463 > >>>>>>>>>>>>> > >>>>>>>>>>>>> #18 > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in PyField_align_elements_ewrap > >>>>>>>>>>>>> > >>>>>>>>>>>>> (self=0x0, > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20 > >>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx > >>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3, argcount=1, > >>>>>>>>>>>>>>>>>>>>>> kws=0x19ace4a0, kwcount=2, > >>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> args=0x7, argcount=1, kws=0x19acc418, kwcount=3, > >>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab759e958, defcount=6, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>> 
Python/ceval.c:3802 #24 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> args=0x6, argcount=1, kws=0x19abd328, kwcount=5, > >>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab891b7e8, defcount=3, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, > >>>>>>>>>>>>>>>>>>>>>> defs=0x0, defcount=0, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27 0x0000000004408f58 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> args=0x1, argcount=0, kws=0x19a89330, kwcount=0, > >>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab8b66668, defcount=1, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value > >>>>>>>>>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, > >>>>>>>>>>>>>>>>>>>>>> defcount=0, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode > >>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at > >>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32 0x000000000442853c in > >>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8 > >>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>> globals=0x192213d0, locals=0x192213d0, flags=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:1335 #33 0x0000000004429690 in > >>>>>>>>>>>>>>>>>>>>>> PyRun_SimpleStringFlags (command=0x192fd3d8 > >>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", flags=0x0) at > >>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34 0x0000000001fa1cf9 in > >>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application > >>>>>>>>>>>>> > >>>>>>>>>>>>> (this=0x7ffff279f650) > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> FEMTown::Main::Batch::run > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36 > >>>>>>>>>>>>> > >>>>>>>>>>>>> 0x0000000001f9aa25 > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10 > >>>>>>>>>>>>>>>>>>>>>> (gdb) f 1 #1 0x00002aedbc4e05f4 in > >>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming (openib_btl=0x1902f9b0, > >>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc( > >>>>>>>>>>>>>>>>>>>>>> 
&openib_btl->super, hdr->tag, des, reg->cbdata > >>>>>>>>>>>>> > >>>>>>>>>>>>> ); > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c > >>>>>>>>>>>>>>>>>>>>>> (gdb) > >>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming > >>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, > >>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc( > >>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des, reg->cbdata > >>>>>>>>>>>>> > >>>>>>>>>>>>> ); > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876 > >>>>>>>>>>>>>>>>>>>>>> 2877 if(OPAL_LIKELY(!(is_credit_msg = > >>>>>>>>>>>>>>>>>>>>>> is_credit_message(frag)))) { 2878 /* call > >>>>>>>>>>>>>>>>>>>>>> registered callback */ > >>>>>>>>>>>>>>>>>>>>>> 2879 mca_btl_active_message_callback_t* > >>>>>>>>>>>>>>>>>>>>>> reg; 2880 reg = > >>>>>>>>>>>>>>>>>>>>>> mca_btl_base_active_message_trigger + hdr->tag; 2881 > >>>>>>>>>>>>>>>>>>>>>> reg->cbfunc(&openib_btl->super, hdr->tag, des, > >>>>>>>>>>>>>>>>>>>>>> reg->cbdata ); 2882 > >>>>>>>>>>>>>>>>>>>>>> if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) { 2883 > >>>>>>>>>>>>>>>>>>>>>> cqp > >>>>>>>>>>>>> > >>>>>>>>>>>>> = > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> (hdr->credits>> 11)& 0x0f; > >>>>>>>>>>>>>>>>>>>>>> 2884 hdr->credits&= 0x87ff; > >>>>>>>>>>>>>>>>>>>>>> 2885 } else { > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>> Hi Edgar, > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> The only difference I could observed was that the > >>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later during > >>>>>>>>>>>>>>>>>>>>>>> the parallel computation. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> I'm running out of idea here. I wish I could use > >>>>>>>>>>>>>>>>>>>>>>> the "--mca > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> coll > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> tuned" with "--mca self,sm,tcp" so that I could > >>>>>>>>>>>>>>>>>>>>>>> check that the issue is not somehow limited to the > >>>>>>>>>>>>>>>>>>>>>>> tuned collective routines. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote: > >>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>> hi edgar, > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option as > >>>>>>>>>>>>>>>>>>>>>>>>> well. > >>>>>>>>>>>>> > >>>>>>>>>>>>> the > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault i'm observing always happened > >>>>>>>>>>>>>>>>>>>>>>>>> during a collective communication indeed... does > >>>>>>>>>>>>>>>>>>>>>>>>> it basically > >>>>>>>>>>>>> > >>>>>>>>>>>>> switch > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> collective communication to basic mode, right ? > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA ? 
> >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InifinBand networking > >>>>>>>>>>>>>>>>>>>>>>>> card) > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>>>>>>>>>>> Edgar > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> thanks, > >>>>>>>>>>>>>>>>>>>>>>>>> éloi > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel > >>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in the > >>>>>>>>>>>>>>>>>>>>>>>>>> basic > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> module, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> e.g. > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used > >>>>>>>>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>> > >>>>>>>>>>>>> observe > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> sometimes a (similar ?) problem in the openib > >>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective > >>>>>>>>>>>>>>>>>>>>>>>>>> component, in cases where the ofed libraries > >>>>>>>>>>>>>>>>>>>>>>>>>> were installed but no NCA was found on a node. > >>>>>>>>>>>>>>>>>>>>>>>>>> It used to work however with the basic > >>>>>>>>>>>>>>>>>>>>>>>>>> component. > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf, > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that > >>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting > >>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. i'm now going to > >>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive > >>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv) > >>>>>>>>>>>>>>>>>>>>>>>>>>> and see if > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> helps. > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf, > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I miss > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option. > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if I the segmentation fault > >>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> using > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic bcast linear algorithm using the > >>>>>>>>>>>>>>>>>>>>>>>>>>>> proper command line you provided. > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf > >>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart > >>>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, you > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to add an extra mca parameter that tells > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the library to use dynamic selection. 
--mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the following: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> --param > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> coll > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected for > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the various collectives. Therefore, you need > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed to > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm. Anyway > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whatever the algorithm used, the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does anyone could give some advice on ways > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>> > >>>>>>>>>>>>> diagnose > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue I'm facing ? > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seems to randomly segfault when using the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> openib btl. I'd > >>>>>>>>>>>>> > >>>>>>>>>>>>> like > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know if there is any way to make OpenMPI > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to > >>>>>>>>>>>>> > >>>>>>>>>>>>> a > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithm than the default one > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> being selected for MPI_Bcast. 
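A few messages above, Eloi mentions replacing MPI_Bcast with a naive implementation based on MPI_Send and MPI_Recv as a diagnostic step. For completeness, a minimal linear broadcast along those lines (an illustrative sketch, not the code actually used in the application) could look like:

#include <mpi.h>

/* Naive linear broadcast: the root sends to every other rank in turn.
 * Semantically equivalent to MPI_Bcast for this simple case, but it
 * avoids the tuned collective code paths entirely. */
static int naive_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, peer;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (peer = 0; peer < size; ++peer) {
            if (peer != root) {
                MPI_Send(buf, count, type, peer, /* tag */ 0, comm);
            }
        }
    } else {
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}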
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation fault > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> during > >>>>>>>>>>>>> > >>>>>>>>>>>>> an > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internode parallel computation involving > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>> > >>>>>>>>>>>>> openib > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> btl > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3). > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/help/ > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signal *** [pbn08:02624] Signal: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segmentation fault (11) [pbn08:02624] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Signal code: Address not mapped > >>>>>>>>>>>>> > >>>>>>>>>>>>> (1) > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil) > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0 [0x349540e4c0] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> message > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *** > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/R > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ed Ha tE L\ -5 \/ x 86 _6 4\ > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /bin\/actranpy_mp > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/Re > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dH at EL -5 /x 86 _ 64 /A c > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tran_11.0.rc2.41872' > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3D > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> re al _m 4_ n2 .d a t' > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ' '--mem=3200' '--threads=1' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--errorlevel=FATAL' '--t_max=0.1' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl (by > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using --mca btl self,sm,tcp on the command > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line, for instance), I don't encounter any > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> problem and the parallel computation runs > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flawlessly. 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl - understand why this > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue is observed only when > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> using > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the configure > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scripts of OpenMPI are enclosed to this > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> email, and some > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> information > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as well. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> parallel > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca mpi_warn_on_fork > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 0 --mca btl_openib_want_fork_support 0 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> infiniband: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> --mca > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 0 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ____________________________________________ > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> __ _ > >>>>>>>>>>> > >>>>>>>>>>> _______________________________________________ > >>>>>>>>>>> users mailing list > >>>>>>>>>>> us...@open-mpi.org > >>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959