Terry,

Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option). I'm displaying frag->hdr->tag here.
Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> Eloi, sorry, can you print out frag->hdr->tag? > > Unfortunately, from your last email I think it will still all have > non-zero values. > If that ends up being the case then there must be something odd with the > descriptor pointer to the fragment. > > --td > > Eloi Gaudry wrote: > > Terry, > > > > Please find enclosed the requested check outputs (using the -output-filename > > stdout.tag.null option). > > > > For information, Nysal in his first message referred to > > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on > > the receiving side. #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1) > > #define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2) > > #define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3) > > > > #define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4) > > > > #define MCA_PML_OB1_HDR_TYPE_NACK (MCA_BTL_TAG_PML + 5) > > #define MCA_PML_OB1_HDR_TYPE_FRAG (MCA_BTL_TAG_PML + 6) > > #define MCA_PML_OB1_HDR_TYPE_GET (MCA_BTL_TAG_PML + 7) > > > > #define MCA_PML_OB1_HDR_TYPE_PUT (MCA_BTL_TAG_PML + 8) > > > > #define MCA_PML_OB1_HDR_TYPE_FIN (MCA_BTL_TAG_PML + 9) > > and in ompi/mca/btl/btl.h > > #define MCA_BTL_TAG_PML 0x40 > > > > Eloi > > > > On Monday 27 September 2010 14:36:59 Terry Dontje wrote: > >> I am thinking of checking the value of *frag->hdr right before the return > >> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. > >> It is line 548 in the trunk > >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548 > >> > >> --td > >> > >> Eloi Gaudry wrote: > >>> Hi Terry, > >>> > >>> Do you have any patch that I could apply to be able to do so? I'm > >>> remotely working on a cluster (with a terminal) and I cannot use any > >>> parallel debugger or sequential debugger (with a call to xterm...). I > >>> can track the frag->hdr->tag value in > >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the > >>> SEND/RDMA_WRITE case, but this is all I can think of alone. > >>> > >>> You'll find a stacktrace (receive side) in this thread (10th or 11th > >>> message) but it might be pointless. > >>> > >>> Regards, > >>> Eloi > >>> > >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote: > >>>> So it sounds like coalescing is not your issue and that the problem > >>>> has something to do with the queue sizes. It would be helpful if we > >>>> could detect the hdr->tag == 0 issue on the sending side and get at > >>>> least a stack trace. There is something really odd going on here. > >>>> > >>>> --td > >>>> > >>>> Eloi Gaudry wrote: > >>>>> Hi Terry, > >>>>> > >>>>> I'm sorry to say that I might have missed a point here. > >>>>> > >>>>> I've lately been relaunching all previously failing computations with > >>>>> the message coalescing feature switched off, and I saw the same > >>>>> hdr->tag=0 error several times, always during a collective call > >>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as > >>>>> soon as I switched to the peer queue option I was previously using > >>>>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using > >>>>> --mca btl_openib_use_message_coalescing 0), all computations ran > >>>>> flawlessly. > >>>>> > >>>>> As for the reproducer, I've already tried to write something but I > >>>>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
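As a reference for the instrumentation Terry is asking for, here is a minimal sketch of a tag sanity check. The MCA_BTL_TAG_PML and MCA_PML_OB1_HDR_TYPE_* values are the ones quoted above from btl.h and pml_ob1_hdr.h; the helper itself (its name, and the idea of calling it right before post_send() returns on the send side or right before reg->cbfunc() is invoked on the receive side) is purely illustrative and not actual Open MPI code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Values quoted above, from ompi/mca/btl/btl.h and ompi/mca/pml/ob1/pml_ob1_hdr.h. */
#define MCA_BTL_TAG_PML            0x40
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)   /* 0x41 */
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)   /* 0x49 */

/* Hypothetical helper, not Open MPI code: call it with hdr->tag just before a
 * fragment is posted (send side) or just before reg->cbfunc() is invoked
 * (receive side). */
static void check_hdr_tag(uint8_t tag, const char *where)
{
    if (0 == tag) {
        /* A zero tag should never go over the wire: stop here to get a core
         * file and a stack trace on the offending side. */
        fprintf(stderr, "%s: hdr->tag is 0\n", where);
        abort();
    }
    if (tag < MCA_PML_OB1_HDR_TYPE_MATCH || tag > MCA_PML_OB1_HDR_TYPE_FIN) {
        /* Not necessarily an error (non-PML traffic uses other tags), just log it. */
        fprintf(stderr, "%s: hdr->tag=0x%02x is outside the PML OB1 range\n",
                where, (unsigned) tag);
    }
}

int main(void)
{
    /* Small demonstration of the two non-fatal cases. */
    check_hdr_tag(MCA_PML_OB1_HDR_TYPE_MATCH, "demo/valid");
    check_hdr_tag(0x20, "demo/non-pml");
    return 0;
}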
> >>>>> > >>>>> Eloi > >>>>> > >>>>> On 24/09/2010 18:37, Terry Dontje wrote: > >>>>>> Eloi Gaudry wrote: > >>>>>>> Terry, > >>>>>>> > >>>>>>> You were right, the error indeed seems to come from the message > >>>>>>> coalescing feature. If I turn it off using the "--mca > >>>>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the > >>>>>>> "hdr->tag=0" error. > >>>>>>> > >>>>>>> There are some trac requests associated to very similar error > >>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they > >>>>>>> are all closed (except > >>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be > >>>>>>> related), aren't they ? What would you suggest Terry ? > >>>>>> > >>>>>> Interesting, though it looks to me like the segv in ticket 2352 > >>>>>> would have happened on the send side instead of the receive side > >>>>>> like you have. As to what to do next it would be really nice to > >>>>>> have some sort of reproducer that we can try and debug what is > >>>>>> really going on. The only other thing to do without a reproducer > >>>>>> is to inspect the code on the send side to figure out what might > >>>>>> make it generate at 0 hdr->tag. Or maybe instrument the send side > >>>>>> to stop when it is about ready to send a 0 hdr->tag and see if we > >>>>>> can see how the code got there. > >>>>>> > >>>>>> I might have some cycles to look at this Monday. > >>>>>> > >>>>>> --td > >>>>>> > >>>>>>> Eloi > >>>>>>> > >>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote: > >>>>>>>> Eloi Gaudry wrote: > >>>>>>>>> Terry, > >>>>>>>>> > >>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 > >>>>>>>>> yet. > >>>>>>>>> > >>>>>>>>> The reason why is quite simple. I've been reading and reading > >>>>>>>>> again this thread to understand the btl_openib_receive_queues > >>>>>>>>> meaning and I can't figure out why the default values seem to > >>>>>>>>> induce the hdr- > >>>>>>>>> > >>>>>>>>>> tag=0 issue > >>>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php) > >>>>>>>>>> . > >>>>>>>> > >>>>>>>> Yeah, the size of the fragments and number of them really should > >>>>>>>> not cause this issue. So I too am a little perplexed about it. > >>>>>>>> > >>>>>>>>> Do you think that the default shared received queue parameters > >>>>>>>>> are erroneous for this specific Mellanox card ? Any help on > >>>>>>>>> finding the proper parameters would actually be much > >>>>>>>>> appreciated. > >>>>>>>> > >>>>>>>> I don't necessarily think it is the queue size for a specific card > >>>>>>>> but more so the handling of the queues by the BTL when using > >>>>>>>> certain sizes. At least that is one gut feel I have. > >>>>>>>> > >>>>>>>> In my mind the tag being 0 is either something below OMPI is > >>>>>>>> polluting the data fragment or OMPI's internal protocol is some > >>>>>>>> how getting messed up. I can imagine (no empirical data here) > >>>>>>>> the queue sizes could change how the OMPI protocol sets things > >>>>>>>> up. Another thing may be the coalescing feature in the openib BTL > >>>>>>>> which tries to gang multiple messages into one packet when > >>>>>>>> resources are running low. I can see where changing the queue > >>>>>>>> sizes might affect the coalescing. So, it might be interesting to > >>>>>>>> turn off the coalescing. You can do that by setting "--mca > >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line. 
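Since a reproducer keeps coming up in this thread, below is a skeleton of the kind of test being discussed. It only mimics the pattern described here (a sub-communicator built with MPI_Comm_create, a one-int MPI_Bcast and MPI_Allreduce, interleaved with larger messages); it is a sketch of the failing pattern with arbitrary sizes and iteration counts, not a confirmed reproducer:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Sub-communicator holding every rank, built with MPI_Comm_create as in
     * the failing application. */
    MPI_Group world_group;
    MPI_Comm  subcomm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Comm_create(MPI_COMM_WORLD, world_group, &subcomm);

    const int big_len = 1 << 16;   /* "larger" payload, well above 10k as in the thread */
    double *big = malloc(big_len * sizeof(double));
    memset(big, 0, big_len * sizeof(double));

    for (i = 0; i < 10000; ++i) {
        int small = i;             /* one int, broadcast from rank 0 */
        int sum = 0;
        MPI_Bcast(&small, 1, MPI_INT, 0, subcomm);
        MPI_Allreduce(&small, &sum, 1, MPI_INT, MPI_SUM, subcomm);

        /* Interleave a larger neighbour exchange. */
        int peer = (rank + 1) % size;
        int from = (rank + size - 1) % size;
        MPI_Sendrecv_replace(big, big_len, MPI_DOUBLE, peer, 0, from, 0,
                             subcomm, MPI_STATUS_IGNORE);
    }

    if (0 == rank) {
        printf("completed without error\n");
    }

    free(big);
    MPI_Group_free(&world_group);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}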
> >>>>>>>> > >>>>>>>> If that doesn't solve the issue then obviously there must be > >>>>>>>> something else going on :-). > >>>>>>>> > >>>>>>>> Note, the reason I am interested in this is I am seeing a similar > >>>>>>>> error condition (hdr->tag == 0) on a development system. Though > >>>>>>>> my failing case fails with np=8 using the connectivity test > >>>>>>>> program which is mainly point to point and there are not a > >>>>>>>> significant amount of data transfers going on either. > >>>>>>>> > >>>>>>>> --td > >>>>>>>> > >>>>>>>>> Eloi > >>>>>>>>> > >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote: > >>>>>>>>>> That is interesting. So does the number of processes affect > >>>>>>>>>> your runs any. The times I've seen hdr->tag be 0 usually has > >>>>>>>>>> been due to protocol issues. The tag should never be 0. Have > >>>>>>>>>> you tried to do other receive_queue settings other than the > >>>>>>>>>> default and the one you mention. > >>>>>>>>>> > >>>>>>>>>> I wonder if you did a combination of the two receive queues > >>>>>>>>>> causes a failure or not. Something like > >>>>>>>>>> > >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128 > >>>>>>>>>> > >>>>>>>>>> I am wondering if it is the first queuing definition causing the > >>>>>>>>>> issue or possibly the SRQ defined in the default. > >>>>>>>>>> > >>>>>>>>>> --td > >>>>>>>>>> > >>>>>>>>>> Eloi Gaudry wrote: > >>>>>>>>>>> Hi Terry, > >>>>>>>>>>> > >>>>>>>>>>> The messages being send/received can be of any size, but the > >>>>>>>>>>> error seems to happen more often with small messages (as an int > >>>>>>>>>>> being broadcasted or allreduced). The failing communication > >>>>>>>>>>> differs from one run to another, but some spots are more likely > >>>>>>>>>>> to be failing than another. And as far as I know, there are > >>>>>>>>>>> always located next to a small message (an int being > >>>>>>>>>>> broadcasted for instance) communication. Other typical > >>>>>>>>>>> messages size are > >>>>>>>>>>> > >>>>>>>>>>>> 10k but can be very much larger. > >>>>>>>>>>> > >>>>>>>>>>> I've been checking the hca being used, its' from mellanox (with > >>>>>>>>>>> vendor_part_id=26428). There is no receive_queues parameters > >>>>>>>>>>> associated to it. > >>>>>>>>>>> > >>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well: > >>>>>>>>>>> [...] > >>>>>>>>>>> > >>>>>>>>>>> # A.k.a. ConnectX > >>>>>>>>>>> [Mellanox Hermon] > >>>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 > >>>>>>>>>>> vendor_part_id = > >>>>>>>>>>> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,2 > >>>>>>>>>>> 64 88 use_eager_rdma = 1 > >>>>>>>>>>> mtu = 2048 > >>>>>>>>>>> max_inline_data = 128 > >>>>>>>>>>> > >>>>>>>>>>> [..] 
> >>>>>>>>>>> > >>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues > >>>>>>>>>>> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: > >>>>>>>>>>> P,4096,8,6,4:P,32768,8,6,4 > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no > >>>>>>>>>>> > >>>>>>>>>>> I was wondering if these parameters (automatically computed at > >>>>>>>>>>> openib btl init, from what I understood) were not incorrect in > >>>>>>>>>>> some way and I plugged in some other values: > >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values > >>>>>>>>>>> when encountering a different issue). Since then, I haven't > >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0 in > >>>>>>>>>>> btl_openib_component.c:2881) yet. > >>>>>>>>>>> > >>>>>>>>>>> Eloi > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/ > >>>>>>>>>>> > >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote: > >>>>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what > >>>>>>>>>>>> size of job it is? Does it always fail on the same bcast, or > >>>>>>>>>>>> same process? > >>>>>>>>>>>> > >>>>>>>>>>>> Eloi Gaudry wrote: > >>>>>>>>>>>>> Hi Nysal, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks for your suggestions. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'm now able to get the checksum computed and redirected to > >>>>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5" > >>>>>>>>>>>>> option, you were right). I haven't been able to observe the > >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml > >>>>>>>>>>>>> csum) but I'll let you know when I am. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I've got two other questions, which may be related to the > >>>>>>>>>>>>> error observed: > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by > >>>>>>>>>>>>> OpenMPI somehow depend on the btl being used (i.e. if I'm > >>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects as > >>>>>>>>>>>>> with tcp)? Is there something like MPI_COMM_MAX in OpenMPI? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2/ the segfault only appears during an MPI collective call, > >>>>>>>>>>>>> with very small messages (one int being broadcast, for > >>>>>>>>>>>>> instance); I followed the guidelines given at > >>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the > >>>>>>>>>>>>> debug-build of OpenMPI asserts if I use a different min-size > >>>>>>>>>>>>> than 255. Anyway, if I deactivate eager_rdma, the segfault > >>>>>>>>>>>>> remains. Does the openib btl handle very small messages > >>>>>>>>>>>>> differently (even with eager_rdma > >>>>>>>>>>>>> deactivated) than tcp? > >>>>>>>>>>>> > >>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma? > >>>>>>>>>>>> If so then that would possibly be one difference between the > >>>>>>>>>>>> openib btl and tcp aside from the actual protocol used.
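Concerning question 1/ above: the MPI standard does not define an MPI_COMM_MAX constant, and I am not aware of an equivalent in OpenMPI, so the simplest way to see whether the number of communicators depends on the btl is to probe it empirically. The snippet below is only a rough probe; the cap on the number of duplications is arbitrary:

#include <mpi.h>
#include <stdio.h>

#define MAX_COMMS 10000   /* arbitrary cap, just to bound the probe */

int main(int argc, char **argv)
{
    MPI_Comm comms[MAX_COMMS];
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting, so we can see where duplication stops. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    while (n < MAX_COMMS &&
           MPI_SUCCESS == MPI_Comm_dup(MPI_COMM_WORLD, &comms[n])) {
        ++n;
    }
    printf("rank %d: created %d communicators before the first failure (or hit the cap)\n",
           rank, n);

    while (n > 0) {
        MPI_Comm_free(&comms[--n]);
    }
    MPI_Finalize();
    return 0;
}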
> >>>>>>>>>>>> > >>>>>>>>>>>>> is there a way to make sure that large messages and small > >>>>>>>>>>>>> messages are handled the same way ? > >>>>>>>>>>>> > >>>>>>>>>>>> Do you mean so they all look like eager messages? How large > >>>>>>>>>>>> of messages are we talking about here 1K, 1M or 10M? > >>>>>>>>>>>> > >>>>>>>>>>>> --td > >>>>>>>>>>>> > >>>>>>>>>>>>> Regards, > >>>>>>>>>>>>> Eloi > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote: > >>>>>>>>>>>>>> Hi Eloi, > >>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while > >>>>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to > >>>>>>>>>>>>>> the command line. This will print the checksum details for > >>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didnt > >>>>>>>>>>>>>> catch anything because the BTL failed. The checksum > >>>>>>>>>>>>>> verification is done in the PML, which the BTL calls via a > >>>>>>>>>>>>>> callback function. In your case the PML callback is never > >>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling > >>>>>>>>>>>>>> checksum tracing also might not be of much use. Is it the > >>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what is the > >>>>>>>>>>>>>> message size? I'm not sure what could be the problem at > >>>>>>>>>>>>>> this moment. I'm afraid you will have to debug the BTL to > >>>>>>>>>>>>>> find out more. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> --Nysal > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Hi Nysal, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> thanks for your response. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I've been unable so far to write a test case that could > >>>>>>>>>>>>>>> illustrate the hdr->tag=0 error. > >>>>>>>>>>>>>>> Actually, I'm only observing this issue when running an > >>>>>>>>>>>>>>> internode computation involving infiniband hardware from > >>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0 > >>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI > >>>>>>>>>>>>>>> use performed during a parallel computation and I couldn't > >>>>>>>>>>>>>>> find any error so far. The fact that the very > >>>>>>>>>>>>>>> same parallel computation run flawlessly when using tcp > >>>>>>>>>>>>>>> (and disabling openib support) might seem to indicate that > >>>>>>>>>>>>>>> the issue is somewhere located inside the > >>>>>>>>>>>>>>> openib btl or at the hardware/driver level. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't > >>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the > >>>>>>>>>>>>>>> segfaults occurs). Any suggestion ? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote: > >>>>>>>>>>>>>>>> Hi Eloi, > >>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire > >>>>>>>>>>>>>>>> email thread, but do you have a test case which can > >>>>>>>>>>>>>>>> reproduce this error? Without that it will be difficult to > >>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not work for an > >>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce it on an IB > >>>>>>>>>>>>>>>> system. 
There is also a PML called csum, you can use it > >>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the MPI messages > >>>>>>>>>>>>>>>> and verify it at the receiver side for any data > >>>>>>>>>>>>>>>> corruption. You can try using it to see if it is able > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> catch anything. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Regards > >>>>>>>>>>>>>>>> --Nysal > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> Hi Nysal, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I'm sorry to intrrupt, but I was wondering if you had a > >>>>>>>>>>>>>>>>> chance to look > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> at > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> this error. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Eloi Gaudry > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Free Field Technologies > >>>>>>>>>>>>>>>>> Company Website: http://www.fft.be > >>>>>>>>>>>>>>>>> Company Phone: +32 10 487 959 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> ---------- Forwarded message ---------- > >>>>>>>>>>>>>>>>> From: Eloi Gaudry <e...@fft.be> > >>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org> > >>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200 > >>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using > >>>>>>>>>>>>>>>>> openib btl Hi, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at > >>>>>>>>>>>>>>>>> this issue. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>> Hi Jeff, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from > >>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host > >>>>>>>>>>>>>>>>>> pbn11,pbn10 --mca > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> btl > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> openib,self --display-map --verbose --mca > >>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 > >>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind > >>>>>>>>>>>>>>>>>> --tool=memcheck > >>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/o > >>>>>>>>>>>>>>>>>> pen mp i- valgrind.supp > >>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp > >>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ... 
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote: > >>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it > >>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch > >>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 1.4.2 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> with the suppression file), "Use of uninitialized > >>>>>>>>>>>>>>>>>>>>> bytes" and "Conditional jump depending on > >>>>>>>>>>>>>>>>>>>>> uninitialized bytes" in > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> different > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> ompi > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> routines. Some of them are located in > >>>>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an output of > >>>>>>>>>>>>>>>>>>>>> valgrind shortly. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected -- > >>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its > >>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware of them (and > >>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as > >>>>>>>>>>>>>>>>>>>> uninitialized). > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> would it help if i use the upcoming 1.5 version of > >>>>>>>>>>>>>>>>>>> openmpi ? i > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> read > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> a huge effort has been done to clean-up the valgrind > >>>>>>>>>>>>>>>>>>> output ? but maybe that this doesn't concern this btl > >>>>>>>>>>>>>>>>>>> (for the reasons you mentionned). > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Another question, you said that the callback function > >>>>>>>>>>>>>>>>>>>>> pointer > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> should > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> never be 0. But can the tag be null (hdr->tag) ? > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I was worrying that its value could not be null. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build > >>>>>>>>>>>>>>>>>>> libpython without pymalloc first). > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks for your help, > >>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote: > >>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer > >>>>>>>>>>>>>>>>>>>>>> should never > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 0. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption > >>>>>>>>>>>>>>>>>>>>>> is occurring. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack > >>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through python, but > >>>>>>>>>>>>>>>>>>>>>> can you run this application through valgrind, or > >>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger? 
> >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the > >>>>>>>>>>>>>>>>>>>>>>> function > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> parameters: > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata > >>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0 > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super > >>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, > >>>>>>>>>>>>>>>>>>>>>>> btl_eager_limit = > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 12288, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = > >>>>>>>>>>>>>>>>>>>>>>> 65536, btl_rdma_pipeline_send_length = 1048576, > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> btl_min_rdma_pipeline_size > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> = 1060864, btl_exclusivity = 1024, btl_latency = > >>>>>>>>>>>>>>>>>>>>>>> 10, btl_bandwidth = 800, btl_flags = 310, > >>>>>>>>>>>>>>>>>>>>>>> btl_add_procs = > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb8ee47<mca_btl_openib_add_procs>, > >>>>>>>>>>>>>>>>>>>>>>> btl_del_procs = > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb90156<mca_btl_openib_del_procs>, > >>>>>>>>>>>>>>>>>>>>>>> btl_register = 0, btl_finalize = > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb93186<mca_btl_openib_finalize>, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> btl_alloc > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> = 0x2b341eb90a3e<mca_btl_openib_alloc>, btl_free > >>>>>>>>>>>>>>>>>>>>>>> = 0x2b341eb91400<mca_btl_openib_free>, > >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_src = > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb91813<mca_btl_openib_prepare_src>, > >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb91f2e<mca_btl_openib_prepare_dst>, > >>>>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517<mca_btl_openib_send>, > >>>>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d<mca_btl_openib_sendi>, > >>>>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660<mca_btl_openib_put>, > >>>>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e<mca_btl_openib_get>, > >>>>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb<mca_btl_base_dump>, > >>>>>>>>>>>>>>>>>>>>>>> btl_mpool = 0xf3f4110, btl_register_error = > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb90565<mca_btl_openib_register_error_cb>, > >>>>>>>>>>>>>>>>>>>>>>> btl_ft_event > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb952e7<mca_btl_openib_ft_event>} > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag > >>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0' > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print des > >>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700 > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc > >>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0 > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during > >>>>>>>>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> segmentation > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call (using > >>>>>>>>>>>>>>>>>>>>>>>> openib): > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in 
?? () > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) where > >>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? () > >>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2 0x00002aedbc4e25e2 > >>>>>>>>>>>>>>>>>>>>>>>> in handle_wc (device=0x19024ac0, cq=0, > >>>>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3 0x00002aedbc4e2e9d > >>>>>>>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> poll_device > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> #4 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (device=0x19024ac0) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5 > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component_progress () at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> #6 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at > >>>>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7 0x00002aedb859f497 > >>>>>>>>>>>>>>>>>>>>>>>> in opal_condition_wait (c=0x2aedb888ccc0, > >>>>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at > >>>>>>>>>>>>>>>>>>>>>>>> ../opal/threads/condition.h:99 #8 > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in > >>>>>>>>>>>>>>>>>>>>>>>> ompi_request_default_wait_all > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (count=2, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at > >>>>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9 0x00002aedbd7559ad in > >>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20, > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> coll_tuned_allreduce.c:223 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in > >>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20, > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at > >>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63 > >>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1, > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> op=0x6787a20, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12 > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1, > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> op=0x6787a20, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13 > >>>>>>>>>>>>>>>>>>>>>>>> 
0x0000000004058be8 in FEMTown::Domain::align (itf= > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Int > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> er fa ce>> > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = > >>>>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ = 0x6}}},<No data > >>>>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14 > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Field::detail::align_itfs_and_neighbhors > >>>>>>>>>>>>>>>>>>>>>>>> (dim=2, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> set={px > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, > >>>>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at check.cpp:63 #15 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 0x00000000040cbfa8 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements (set={px = > >>>>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in > >>>>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0, > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17 > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object* (*)(_object*, > >>>>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object> > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, > >>>>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception. 
> >>>>>>>>>>>>>>>>>>>>>>>> hp p: 463 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> #18 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in PyField_align_elements_ewrap > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (self=0x0, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20 > >>>>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value optimized out>, > >>>>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3, > >>>>>>>>>>>>>>>>>>>>>>>> argcount=1, kws=0x19ace4a0, kwcount=2, > >>>>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120, > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x7, argcount=1, > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, > >>>>>>>>>>>>>>>>>>>>>>>> defcount=6, closure=0x0) at Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #24 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738, > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x6, argcount=1, > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, > >>>>>>>>>>>>>>>>>>>>>>>> defcount=3, closure=0x0) at Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198, > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0xb, argcount=1, > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, > >>>>>>>>>>>>>>>>>>>>>>>> closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27 0x0000000004408f58 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288, > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x1, argcount=0, > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, > >>>>>>>>>>>>>>>>>>>>>>>> defcount=1, closure=0x0) at Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value optimized out>) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30 0x000000000440aae9 in > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738, > >>>>>>>>>>>>>>>>>>>>>>>> 
globals=<value optimized out>, locals=<value > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x0, argcount=0, kws=0x0, > >>>>>>>>>>>>>>>>>>>>>>>> kwcount=0, defs=0x0, defcount=0, closure=0x0) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 > >>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32 0x000000000442853c in > >>>>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8 > >>>>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value optimized > >>>>>>>>>>>>>>>>>>>>>>>> out>, globals=0x192213d0, locals=0x192213d0, > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at Python/pythonrun.c:1335 #33 > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004429690 in PyRun_SimpleStringFlags > >>>>>>>>>>>>>>>>>>>>>>>> (command=0x192fd3d8 "DIRECT.Actran.main()", > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at > >>>>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34 0x0000000001fa1cf9 in > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (this=0x7ffff279f650) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> FEMTown::Main::Batch::run > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 0x0000000001f9aa25 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at > >>>>>>>>>>>>>>>>>>>>>>>> main.cpp:10 (gdb) f 1 #1 0x00002aedbc4e05f4 in > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming (openib_btl=0x1902f9b0, > >>>>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc( > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des, reg->cbdata > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ); > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) > >>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc( > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des, reg->cbdata > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ); > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876 > >>>>>>>>>>>>>>>>>>>>>>>> 2877 if(OPAL_LIKELY(!(is_credit_msg = > >>>>>>>>>>>>>>>>>>>>>>>> is_credit_message(frag)))) { 2878 /* > >>>>>>>>>>>>>>>>>>>>>>>> call registered callback */ > >>>>>>>>>>>>>>>>>>>>>>>> 2879 mca_btl_active_message_callback_t* > >>>>>>>>>>>>>>>>>>>>>>>> reg; 2880 reg = > >>>>>>>>>>>>>>>>>>>>>>>> mca_btl_base_active_message_trigger + hdr->tag; > >>>>>>>>>>>>>>>>>>>>>>>> 2881 reg->cbfunc(&openib_btl->super, hdr->tag, > >>>>>>>>>>>>>>>>>>>>>>>> des, reg->cbdata ); 2882 > >>>>>>>>>>>>>>>>>>>>>>>> if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) { 2883 > >>>>>>>>>>>>>>>>>>>>>>>> cqp > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> = > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> (hdr->credits>> 11)& 0x0f; > >>>>>>>>>>>>>>>>>>>>>>>> 2884 hdr->credits&= 0x87ff; > >>>>>>>>>>>>>>>>>>>>>>>> 2885 } else { > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar, > 
>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observed was that the > >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later > >>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation. > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of idea here. I wish I could use > >>>>>>>>>>>>>>>>>>>>>>>>> the "--mca > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> coll > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> tuned" with "--mca self,sm,tcp" so that I could > >>>>>>>>>>>>>>>>>>>>>>>>> check that the issue is not somehow limited to > >>>>>>>>>>>>>>>>>>>>>>>>> the tuned collective routines. > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel > >>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar, > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option > >>>>>>>>>>>>>>>>>>>>>>>>>>> as well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault i'm observing always > >>>>>>>>>>>>>>>>>>>>>>>>>>> happened during a collective communication > >>>>>>>>>>>>>>>>>>>>>>>>>>> indeed... does it basically > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> switch > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> collective communication to basic mode, right ? > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA ? > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InifinBand > >>>>>>>>>>>>>>>>>>>>>>>>>> networking card) > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks, > >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi > >>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel > >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> module, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> e.g. > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I > >>>>>>>>>>>>>>>>>>>>>>>>>>>> used to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> observe > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> sometimes a (similar ?) problem in the openib > >>>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective > >>>>>>>>>>>>>>>>>>>>>>>>>>>> component, in cases where the ofed libraries > >>>>>>>>>>>>>>>>>>>>>>>>>>>> were installed but no NCA was found on a node. > >>>>>>>>>>>>>>>>>>>>>>>>>>>> It used to work however with the basic > >>>>>>>>>>>>>>>>>>>>>>>>>>>> component. 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. i'm now going to > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv) > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> helps. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I miss > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if I the segmentation fault > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> using > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic bcast linear algorithm using the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proper command line you provided. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca parameter > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use dynamic > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the following: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 --param > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> coll > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives. 
Therefore, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed to > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway whatever the algorithm used, the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does anyone could give some advice on ways > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> diagnose > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue I'm facing ? > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly segfault when > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> like > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know if there is any way to make OpenMPI > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithm than the default one > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> being selected for MPI_Bcast. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> an > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internode parallel computation involving > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> openib > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> btl > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3). 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2 > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/hel > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> p/ [pbn08:02624] *** Process received > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signal *** [pbn08:02624] Signal: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segmentation fault (11) > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not mapped > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (1) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (nil) [pbn08:02624] [ 0] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0 [0x349540e4c0] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> message > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *** > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/R > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ed Ha tE L\ -5 \/ x 86 _6 4\ > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /bin\/actranpy_mp > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/Re > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dH at EL -5 /x 86 _ 64 /A c > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tran_11.0.rc2.41872' > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3D > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> re al _m 4_ n2 .d a t' > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ' '--mem=3200' '--threads=1' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--errorlevel=FATAL' '--t_max=0.1' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain' > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl self,sm,tcp on the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance), I don't > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the parallel > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> able: - to diagnose the issue I'm > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> facing with the openib btl - understand > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why this issue is observed only when > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> using > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated. > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> information > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as well. 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> parallel > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support 0 [...] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> infiniband: > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> --mca > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 0 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...] > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> __________________________________________ > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> __ __ _ > >>>>>>>>>>>>> > >>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>> users mailing list > >>>>>>>>>>>>> us...@open-mpi.org > >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959