Hi Nysal,

Thanks for your suggestions.

I'm now able to get the checksum computed and redirected to stdout, thanks (I forgot the "-mca pml_base_verbose 5" option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far when using pml csum, but I'll let you know when I do.

I've got two other questions, which may be related to the error observed:

1/ Does the maximum number of MPI_Comm objects that OpenMPI can handle depend on the btl being used (i.e. if I'm using openib, may I use the same number of MPI_Comm objects as with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?

2/ The segfault only appears during an MPI collective call, with very small messages (one int being broadcast, for instance). I followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug build of OpenMPI asserts if I use a different min-size than 255. Anyway, if I deactivate eager_rdma, the segfault remains. Does the openib btl handle very small messages differently than tcp (even with eager_rdma deactivated)? Is there a way to make sure that large messages and small messages are handled the same way?

Regards,
Eloi

On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> Hi Eloi,
> Create a debug build of OpenMPI (--enable-debug) and while running with the
> csum PML add "-mca pml_base_verbose 5" to the command line. This will print
> the checksum details for each fragment sent over the wire. I'm guessing it
> didn't catch anything because the BTL failed. The checksum verification is
> done in the PML, which the BTL calls via a callback function. In your case
> the PML callback is never called because the hdr->tag is invalid. So
> enabling checksum tracing also might not be of much use. Is it the first
> Bcast that fails or the nth Bcast and what is the message size? I'm not
> sure what could be the problem at this moment. I'm afraid you will have to
> debug the BTL to find out more.
>
> --Nysal
>
> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> > Hi Nysal,
> >
> > thanks for your response.
> >
> > I've been unable so far to write a test case that could illustrate the
> > hdr->tag=0 error.
> > Actually, I'm only observing this issue when running an internode
> > computation involving infiniband hardware from Mellanox (MT25418,
> > ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software.
> >
> > I checked, double-checked, and rechecked again every MPI use performed
> > during a parallel computation and I couldn't find any error so far. The
> > fact that the very same parallel computation runs flawlessly when using
> > tcp (and disabling openib support) might seem to indicate that the issue
> > is located somewhere inside the openib btl or at the hardware/driver
> > level.
> >
> > I've just used the "-mca pml csum" option and I haven't seen any related
> > messages (when hdr->tag=0 and the segfault occurs).
> > Any suggestion?
> >
> > Regards,
> > Eloi
> >
> > On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> > > Hi Eloi,
> > > Sorry for the delay in response. I haven't read the entire email
> > > thread, but do you have a test case which can reproduce this error?
> > > Without that it will be difficult to nail down the cause. Just to
> > > clarify, I do not work for an iwarp vendor. I can certainly try to
> > > reproduce it on an IB system. There is also a PML called csum, you can
> > > use it via "-mca pml csum", which will checksum the MPI messages and
> > > verify it at the receiver side for any data corruption. You can try
> > > using it to see if it is able to catch anything.
> > >
> > > Regards
> > > --Nysal
> > >
> > > On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> > > > Hi Nysal,
> > > >
> > > > I'm sorry to interrupt, but I was wondering if you had a chance to
> > > > look at this error.
> > > >
> > > > Regards,
> > > > Eloi
> > > >
> > > > --
> > > >
> > > > Eloi Gaudry
> > > >
> > > > Free Field Technologies
> > > > Company Website: http://www.fft.be
> > > > Company Phone: +32 10 487 959
> > > >
> > > > ---------- Forwarded message ----------
> > > > From: Eloi Gaudry <e...@fft.be>
> > > > To: Open MPI Users <us...@open-mpi.org>
> > > > Date: Wed, 15 Sep 2010 16:27:43 +0200
> > > > Subject: Re: [OMPI users] [openib] segfault when using openib btl
> > > > Hi,
> > > >
> > > > I was wondering if anybody got a chance to have a look at this issue.
> > > >
> > > > Regards,
> > > > Eloi
> > > >
> > > > On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> > > > > Hi Jeff,
> > > > >
> > > > > Please find enclosed the output (valgrind.out.gz) from
> > > > > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10
> > > > > --mca btl openib,self --display-map --verbose
> > > > > --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
> > > > > -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> > > > > --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
> > > > > --suppressions=./suppressions.python.supp
> > > > > /opt/actran/bin/actranpy_mp ...
> > > > >
> > > > > Thanks,
> > > > > Eloi
> > > > >
> > > > > On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > > > > > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > > > > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > > > > > I did run our application through valgrind but it couldn't
> > > > > > > > find any "Invalid write": there is a bunch of "Invalid read"
> > > > > > > > (I'm using 1.4.2 with the suppression file), "Use of
> > > > > > > > uninitialized bytes" and "Conditional jump depending on
> > > > > > > > uninitialized bytes" in different ompi routines. Some of
> > > > > > > > them are located in btl_openib_component.c.
> > > > > > > > I'll send you an output of valgrind shortly.
> > > > > > >
> > > > > > > A lot of them in btl_openib_* are to be expected -- OpenFabrics
> > > > > > > uses OS-bypass methods for some of its memory, and therefore
> > > > > > > valgrind is unaware of them (and therefore incorrectly marks
> > > > > > > them as uninitialized).
> > > > > >
> > > > > > Would it help if I used the upcoming 1.5 version of openmpi? I
> > > > > > read that a huge effort has been made to clean up the valgrind
> > > > > > output, but maybe this doesn't concern this btl (for the reasons
> > > > > > you mentioned).
> > > > > >
> > > > > > > > Another question, you said that the callback function pointer
> > > > > > > > should never be 0. But can the tag be null (hdr->tag) ?
> > > > > > >
> > > > > > > The tag is not a pointer -- it's just an integer.
> > > > > >
> > > > > > I was worrying that its value could not be null.
> > > > > >
> > > > > > I'll send a valgrind output soon (I need to build libpython
> > > > > > without pymalloc first).
> > > > > >
> > > > > > Thanks,
> > > > > > Eloi
> > > > > >
> > > > > > > > Thanks for your help,
> > > > > > > > Eloi
> > > > > > > >
> > > > > > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > > > > > >> Sorry for the delay in replying.
> > > > > > > >>
> > > > > > > >> Odd; the values of the callback function pointer should
> > > > > > > >> never be 0. This seems to suggest some kind of memory
> > > > > > > >> corruption is occurring.
> > > > > > > >>
> > > > > > > >> I don't know if it's possible, because the stack trace looks
> > > > > > > >> like you're calling through python, but can you run this
> > > > > > > >> application through valgrind, or some other memory-checking
> > > > > > > >> debugger?
> > > > > > > >>
> > > > > > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > > > > > >>> Hi,
> > > > > > > >>>
> > > > > > > >>> sorry, I just forgot to add the values of the function parameters:
> > > > > > > >>> (gdb) print reg->cbdata
> > > > > > > >>> $1 = (void *) 0x0
> > > > > > > >>> (gdb) print openib_btl->super
> > > > > > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > > > > > >>>   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > > > > > >>>   btl_rdma_pipeline_send_length = 1048576,
> > > > > > > >>>   btl_rdma_pipeline_frag_size = 1048576,
> > > > > > > >>>   btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
> > > > > > > >>>   btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
> > > > > > > >>>   btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
> > > > > > > >>>   btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>,
> > > > > > > >>>   btl_register = 0,
> > > > > > > >>>   btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
> > > > > > > >>>   btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
> > > > > > > >>>   btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
> > > > > > > >>>   btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
> > > > > > > >>>   btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
> > > > > > > >>>   btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
> > > > > > > >>>   btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
> > > > > > > >>>   btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
> > > > > > > >>>   btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
> > > > > > > >>>   btl_dump = 0x2b341acd45cb <mca_btl_base_dump>,
> > > > > > > >>>   btl_mpool = 0xf3f4110,
> > > > > > > >>>   btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
> > > > > > > >>>   btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> > > > > > > >>>
> > > > > > > >>> (gdb) print hdr->tag
> > > > > > > >>> $3 = 0 '\0'
> > > > > > > >>> (gdb) print des
> > > > > > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > > > > > >>> (gdb) print reg->cbfunc
> > > > > > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > > > > > >>>
> > > > > > > >>> Eloi
> > > > > > > >>>
> > > > > > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > > > > >>>> Hi,
> > > > > > > >>>>
> > > > > > > >>>> Here is the output of a core file generated during a segmentation
> > > > > > > >>>> fault observed during a collective call (using openib):
> > > > > > > >>>>
> > > > > > > >>>> #0 0x0000000000000000 in ?? ()
> > > > > > > >>>> (gdb) where
> > > > > > > >>>> #0 0x0000000000000000 in ?? ()
> > > > > > > >>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > > > >>>> #2 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
> > > > > > > >>>> #3 0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> > > > > > > >>>> #4 0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> > > > > > > >>>> #5 0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> > > > > > > >>>> #6 0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> > > > > > > >>>> #7 0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> > > > > > > >>>> #8 0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> > > > > > > >>>> #9 0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > > > > > >>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> > > > > > > >>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> > > > > > > >>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> > > > > > > >>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
> > > > > > > >>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
> > > > > > > >>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
> > > > > > > >>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> > > > > > > >>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > > > > > > >>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> > > > > > > >>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> > > > > > > >>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > > > >>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > > > >>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > > > >>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > > > >>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > > > >>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > > > > >>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> > > > > > > >>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
> > > > > > > >>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
> > > > > > > >>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
> > > > > > > >>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
> > > > > > > >>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
> > > > > > > >>>> (gdb) f 1
> > > > > > > >>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > > > >>>> 2881 reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > > > >>>> Current language: auto; currently c
> > > > > > > >>>> (gdb)
> > > > > > > >>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > > > >>>> 2881 reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > > > >>>> (gdb) l
> > > > > > > >>>> 2876
> > > > > > > >>>> 2877     if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> > > > > > > >>>> 2878         /* call registered callback */
> > > > > > > >>>> 2879         mca_btl_active_message_callback_t* reg;
> > > > > > > >>>> 2880         reg = mca_btl_base_active_message_trigger + hdr->tag;
> > > > > > > >>>> 2881         reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > > > >>>> 2882         if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> > > > > > > >>>> 2883             cqp = (hdr->credits >> 11) & 0x0f;
> > > > > > > >>>> 2884             hdr->credits &= 0x87ff;
> > > > > > > >>>> 2885         } else {
> > > > > > > >>>>
> > > > > > > >>>> Regards,
> > > > > > > >>>> Eloi
> > > > > > > >>>>
> > > > > > > >>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> > > > > > > >>>>> Hi Edgar,
> > > > > > > >>>>>
> > > > > > > >>>>> The only difference I could observe was that the
> > > > > > > >>>>> segmentation fault sometimes appeared later during the
> > > > > > > >>>>> parallel computation.
> > > > > > > >>>>>
> > > > > > > >>>>> I'm running out of ideas here. I wish I could use
> > > > > > > >>>>> "--mca coll tuned" with "--mca btl self,sm,tcp" so that I
> > > > > > > >>>>> could check that the issue is not somehow limited to the
> > > > > > > >>>>> tuned collective routines.
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks,
> > > > > > > >>>>> Eloi
> > > > > > > >>>>>
> > > > > > > >>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> > > > > > > >>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > > > > > > >>>>>>> hi edgar,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> thanks for the tips, I'm gonna try this option as well.
> > > > > > > >>>>>>> the segmentation fault i'm observing always happened
> > > > > > > >>>>>>> during a collective communication indeed... it basically
> > > > > > > >>>>>>> switches all collective communication to basic mode,
> > > > > > > >>>>>>> right ?
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> sorry for my ignorance, but what's a NCA ?
> > > > > > > >>>>>>
> > > > > > > >>>>>> sorry, I meant to type HCA (InfiniBand networking card)
> > > > > > > >>>>>>
> > > > > > > >>>>>> Thanks
> > > > > > > >>>>>> Edgar
> > > > > > > >>>>>>
> > > > > > > >>>>>>> thanks,
> > > > > > > >>>>>>> éloi
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> > > > > > > >>>>>>>> you could try first to use the algorithms in the basic
> > > > > > > >>>>>>>> module, e.g.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> mpirun -np x --mca coll basic ./mytest
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> and see whether this makes a difference. I used to
> > > > > > > >>>>>>>> observe sometimes a (similar ?) problem in the openib
> > > > > > > >>>>>>>> btl triggered from the tuned collective component, in
> > > > > > > >>>>>>>> cases where the ofed libraries were installed but no
> > > > > > > >>>>>>>> NCA was found on a node. It used to work however with
> > > > > > > >>>>>>>> the basic component.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks
> > > > > > > >>>>>>>> Edgar
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > > > > > > >>>>>>>>> hi Rolf,
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> unfortunately, i couldn't get rid of that annoying
> > > > > > > >>>>>>>>> segmentation fault when selecting another bcast
> > > > > > > >>>>>>>>> algorithm. i'm now going to replace MPI_Bcast with a
> > > > > > > >>>>>>>>> naive implementation (using MPI_Send and MPI_Recv)
> > > > > > > >>>>>>>>> and see if that helps.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> regards,
> > > > > > > >>>>>>>>> éloi
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> > > > > > > >>>>>>>>>> Hi Rolf,
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> thanks for your input. You're right, I missed the
> > > > > > > >>>>>>>>>> coll_tuned_use_dynamic_rules option.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I'll check if the segmentation fault disappears
> > > > > > > >>>>>>>>>> when using the basic bcast linear algorithm using
> > > > > > > >>>>>>>>>> the proper command line you provided.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>> Eloi
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > > > > > > >>>>>>>>>>> Hi Eloi:
> > > > > > > >>>>>>>>>>> To select the different bcast algorithms, you need
> > > > > > > >>>>>>>>>>> to add an extra mca parameter that tells the
> > > > > > > >>>>>>>>>>> library to use dynamic selection.
> > > > > > > >>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> One way to make sure you are typing this in
> > > > > > > >>>>>>>>>>> correctly is to use it with ompi_info. Do the
> > > > > > > >>>>>>>>>>> following:
> > > > > > > >>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> You should see lots of output with all the
> > > > > > > >>>>>>>>>>> different algorithms that can be selected for the
> > > > > > > >>>>>>>>>>> various collectives. Therefore, you need this:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Rolf
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> > > > > > > >>>>>>>>>>>> Hi,
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm
> > > > > > > >>>>>>>>>>>> 1" allowed to switch to the basic linear
> > > > > > > >>>>>>>>>>>> algorithm. Anyway whatever the algorithm used,
> > > > > > > >>>>>>>>>>>> the segmentation fault remains.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Could anyone give some advice on ways to diagnose
> > > > > > > >>>>>>>>>>>> the issue I'm facing ?
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>>>> Eloi
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > > > > > > >>>>>>>>>>>>> Hi,
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that seems
> > > > > > > >>>>>>>>>>>>> to randomly segfault when using the openib btl.
> > > > > > > >>>>>>>>>>>>> I'd like to know if there is any way to make
> > > > > > > >>>>>>>>>>>>> OpenMPI switch to a different algorithm than the
> > > > > > > >>>>>>>>>>>>> default one being selected for MPI_Bcast.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Thanks for your help,
> > > > > > > >>>>>>>>>>>>> Eloi
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > > > > > > >>>>>>>>>>>>>> Hi,
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> I'm observing a random segmentation fault during
> > > > > > > >>>>>>>>>>>>>> an internode parallel computation involving the
> > > > > > > >>>>>>>>>>>>>> openib btl and OpenMPI-1.4.2 (the same issue can
> > > > > > > >>>>>>>>>>>>>> be observed with OpenMPI-1.3.3).
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
> > > > > > > >>>>>>>>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > > > > > > >>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
> > > > > > > >>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872' '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> If I choose not to use the openib btl (by using
> > > > > > > >>>>>>>>>>>>>> --mca btl self,sm,tcp on the command line, for
> > > > > > > >>>>>>>>>>>>>> instance), I don't encounter any problem and the
> > > > > > > >>>>>>>>>>>>>> parallel computation runs flawlessly.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> I would like to get some help to be able:
> > > > > > > >>>>>>>>>>>>>> - to diagnose the issue I'm facing with the
> > > > > > > >>>>>>>>>>>>>>   openib btl
> > > > > > > >>>>>>>>>>>>>> - understand why this issue is observed only when
> > > > > > > >>>>>>>>>>>>>>   using the openib btl and not when using
> > > > > > > >>>>>>>>>>>>>>   self,sm,tcp
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Any help would be very much appreciated.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> The outputs of ompi_info and the configure
> > > > > > > >>>>>>>>>>>>>> scripts of OpenMPI are enclosed to this email,
> > > > > > > >>>>>>>>>>>>>> and some information on the infiniband drivers
> > > > > > > >>>>>>>>>>>>>> as well.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Here is the command line used when launching a
> > > > > > > >>>>>>>>>>>>>> parallel computation using infiniband:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> and the command line used if not using infiniband:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl self,sm,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>>> Eloi
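
For readers following the gdb listing quoted in this thread (btl_openib_component.c:2881): the code indexes a callback table with hdr->tag and calls reg->cbfunc without checking it, so an unregistered tag ends in a jump to address 0x0, which matches frame #0 of the backtrace. Below is a minimal, self-contained C sketch of that dispatch pattern with a guard added. It is not Open MPI code. The names trigger_table, register_callback, dispatch_incoming and count_cb are hypothetical, and the 256-entry table size is an assumption based on gdb printing the tag as a single char.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the types named in the thread; NOT Open MPI's. */
#define TAG_COUNT 256 /* assumption: the tag is one byte wide */

typedef void (*recv_cb_fn)(uint8_t tag, void *descriptor, void *cbdata);

typedef struct {
    recv_cb_fn cbfunc; /* stays NULL until a receiver registers for the tag */
    void *cbdata;
} active_message_callback;

/* Static storage, so every entry starts zeroed (cbfunc == NULL). */
static active_message_callback trigger_table[TAG_COUNT];

/* Register a receive callback for one tag. */
void register_callback(uint8_t tag, recv_cb_fn cb, void *cbdata)
{
    assert(cb != NULL); /* registering a null callback is a caller bug */
    trigger_table[tag].cbfunc = cb;
    trigger_table[tag].cbdata = cbdata;
}

/*
 * Guarded version of the dispatch at line 2881 of the listing.
 * Returns 0 if a callback ran, -1 if no callback is registered for the tag.
 */
int dispatch_incoming(uint8_t tag, void *descriptor)
{
    const active_message_callback *reg = trigger_table + tag;
    if (reg->cbfunc == NULL) {
        return -1; /* unregistered (or corrupted) tag: report, do not jump */
    }
    reg->cbfunc(tag, descriptor, reg->cbdata);
    return 0;
}

/* Example callback: counts deliveries through the int that cbdata points to. */
void count_cb(uint8_t tag, void *descriptor, void *cbdata)
{
    (void)tag;
    (void)descriptor;
    ++*(int *)cbdata;
}
```

Under these assumptions, dispatching tag 0 before anything is registered returns -1 instead of crashing. Such a guard would only turn the segfault into a reportable error; it would not explain why the header arrived with tag 0 in the first place.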