Hi Nysal,

thanks for your response.

So far I've been unable to write a test case that reproduces the hdr->tag=0 error.
In fact, I only observe this issue when running an internode computation with our 
time-domain software on Mellanox InfiniBand hardware (MT25418, ConnectX IB DDR, 
PCIe 2.0 2.5GT/s, rev a0).

I checked, double-checked, and rechecked every MPI call performed during a 
parallel computation and I couldn't find any error so far. The fact that the very 
same parallel computation runs flawlessly when using tcp (with openib support 
disabled) seems to indicate that the issue lies somewhere inside the openib btl 
or at the hardware/driver level.
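
In the meantime, to catch the corruption closer to its source, I'm thinking of 
adding a temporary guard just before the callback dispatch at 
btl_openib_component.c:2881. This is only a rough sketch based on the 1.4.2 
snippet quoted further down in this thread; the opal_output() diagnostic and the 
abort() are my own additions:

  /* temporary sanity check before dispatching the active-message callback */
  mca_btl_active_message_callback_t* reg =
      mca_btl_base_active_message_trigger + hdr->tag;
  if (OPAL_UNLIKELY(0 == hdr->tag || NULL == reg->cbfunc)) {
      opal_output(0, "btl_openib_handle_incoming: corrupted fragment "
                     "(hdr->tag=%d, cbfunc=%p)", (int) hdr->tag,
                     (void*) reg->cbfunc);
      abort(); /* dump core while the bad fragment is still at hand */
  }
  reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata);

That way the process would stop right at the offending fragment instead of 
jumping through a null callback.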

I've just run the computation with the "-mca pml csum" option and I haven't seen 
any checksum-related message reported when hdr->tag=0 and the segfault occurs.
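For reference, the command line I used looks roughly like this (the usual options 
from my earlier posts, with the csum PML added; application arguments elided):

  path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list \
      --mca btl openib,sm,self,tcp --mca pml csum \
      --display-map --verbose --mca mpi_warn_on_fork 0 \
      --mca btl_openib_want_fork_support 0 [...]
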
Any suggestions?

Regards,
Eloi



On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> Hi Eloi,
> Sorry for the delay in response. I haven't read the entire email thread,
> but do you have a test case which can reproduce this error? Without that
> it will be difficult to nail down the cause. Just to clarify, I do not
> work for an iwarp vendor. I can certainly try to reproduce it on an IB
> system. There is also a PML called csum, you can use it via "-mca pml
> csum", which will checksum the MPI messages and verify it at the receiver
> side for any data corruption. You can try using it to see if it is able to
> catch anything.
> 
> Regards
> --Nysal
> 
> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> > Hi Nysal,
> > 
> > I'm sorry to interrupt, but I was wondering if you had a chance to look at
> > this error.
> > 
> > Regards,
> > Eloi
> > 
> > 
> > 
> > --
> > 
> > 
> > Eloi Gaudry
> > 
> > Free Field Technologies
> > Company Website: http://www.fft.be
> > Company Phone:   +32 10 487 959
> > 
> > 
> > ---------- Forwarded message ----------
> > From: Eloi Gaudry <e...@fft.be>
> > To: Open MPI Users <us...@open-mpi.org>
> > Date: Wed, 15 Sep 2010 16:27:43 +0200
> > Subject: Re: [OMPI users] [openib] segfault when using openib btl
> > Hi,
> > 
> > I was wondering if anybody got a chance to have a look at this issue.
> > 
> > Regards,
> > Eloi
> > 
> > On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> > > Hi Jeff,
> > > 
> > > Please find enclosed the output (valgrind.out.gz) from
> > > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> > > openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> > > btl_openib_want_fork_support 0 -tag-output
> > > /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> > > --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> > > valgrind.supp --suppressions=./suppressions.python.supp
> > > /opt/actran/bin/actranpy_mp ...
> > > 
> > > Thanks,
> > > Eloi
> > > 
> > > On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > > > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > > > I did run our application through valgrind but it couldn't find
> > > > > > any "Invalid write": there is a bunch of "Invalid read" (I'm
> > > > > > using 1.4.2 with the suppression file), "Use of uninitialized
> > > > > > bytes" and "Conditional jump depending on uninitialized bytes"
> > > > > > in different ompi routines. Some of them are located in
> > > > > > btl_openib_component.c. I'll send you an output of valgrind
> > > > > > shortly.
> > > > > 
> > > > > A lot of them in btl_openib_* are to be expected -- OpenFabrics
> > > > > uses OS-bypass methods for some of its memory, and therefore
> > > > > valgrind is unaware of them (and therefore incorrectly marks them
> > > > > as
> > > > > uninitialized).
> > > > 
> > > > would it help if I used the upcoming 1.5 version of openmpi? I read
> > > > that a huge effort has been made to clean up the valgrind output,
> > > > but maybe that doesn't concern this btl (for the reasons you
> > > > mentioned).
> > > > 
> > > > > > Another question: you said that the callback function pointer
> > > > > > should never be 0. But can the tag (hdr->tag) be null?
> > > > > 
> > > > > The tag is not a pointer -- it's just an integer.
> > > > 
> > > > I was worried because it seems its value should never be null, and
> > > > yet I'm seeing hdr->tag = 0.
> > > > 
> > > > I'll send a valgrind output soon (i need to build libpython without
> > > > pymalloc first).
> > > > 
> > > > Thanks,
> > > > Eloi
> > > > 
> > > > > > Thanks for your help,
> > > > > > Eloi
> > > > > > 
> > > > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > > > >> Sorry for the delay in replying.
> > > > > >> 
> > > > > >> Odd; the value of the callback function pointer should never
> > > > > >> be 0. This seems to suggest some kind of memory corruption is
> > > > > >> occurring.
> > > > > >> 
> > > > > >> I don't know if it's possible, because the stack trace looks
> > > > > >> like you're calling through python, but can you run this
> > > > > >> application through valgrind, or some other memory-checking
> > > > > >> debugger?
> > > > > >> 
> > > > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > > > >>> Hi,
> > > > > >>> 
> > > > > >>> sorry, i just forgot to add the values of the function
> > > > > >>> parameters:
> > > > > >>> (gdb) print reg->cbdata
> > > > > >>> $1 = (void *) 0x0
> > > > > >>> (gdb) print openib_btl->super
> > > > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > > > >>>   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > > > >>>   btl_rdma_pipeline_send_length = 1048576,
> > > > > >>>   btl_rdma_pipeline_frag_size = 1048576,
> > > > > >>>   btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
> > > > > >>>   btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
> > > > > >>>   btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
> > > > > >>>   btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>,
> > > > > >>>   btl_register = 0,
> > > > > >>>   btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
> > > > > >>>   btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
> > > > > >>>   btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
> > > > > >>>   btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
> > > > > >>>   btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
> > > > > >>>   btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
> > > > > >>>   btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
> > > > > >>>   btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
> > > > > >>>   btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
> > > > > >>>   btl_dump = 0x2b341acd45cb <mca_btl_base_dump>,
> > > > > >>>   btl_mpool = 0xf3f4110,
> > > > > >>>   btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
> > > > > >>>   btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> > > > > >>> 
> > > > > >>> (gdb) print hdr->tag
> > > > > >>> $3 = 0 '\0'
> > > > > >>> (gdb) print des
> > > > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > > > >>> (gdb) print reg->cbfunc
> > > > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > > > >>> 
> > > > > >>> Eloi
> > > > > >>> 
> > > > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > > >>>> Hi,
> > > > > >>>> 
> > > > > >>>> Here is the output of a core file generated during a
> > > > > >>>> segmentation fault observed during a collective call (using
> > > > > >>>> openib):
> > > > > >>>> 
> > > > > >>>> #0  0x0000000000000000 in ?? ()
> > > > > >>>> (gdb) where
> > > > > >>>> #0  0x0000000000000000 in ?? ()
> > > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > >>>> #2  0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
> > > > > >>>> #3  0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> > > > > >>>> #4  0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> > > > > >>>> #5  0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> > > > > >>>> #6  0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> > > > > >>>> #7  0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> > > > > >>>> #8  0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> > > > > >>>> #9  0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > > > >>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> > > > > >>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> > > > > >>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> > > > > >>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
> > > > > >>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
> > > > > >>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
> > > > > >>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> > > > > >>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > > > > >>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> > > > > >>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> > > > > >>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > >>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > >>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > >>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > >>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > > >>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > > >>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> > > > > >>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
> > > > > >>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
> > > > > >>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
> > > > > >>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
> > > > > >>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
> > > > > >>>> (gdb) f 1
> > > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > >>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > >>>> Current language: auto; currently c
> > > > > >>>> (gdb)
> > > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > > >>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > >>>> (gdb) l 2876
> > > > > >>>> 2877        if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> > > > > >>>> 2878            /* call registered callback */
> > > > > >>>> 2879            mca_btl_active_message_callback_t* reg;
> > > > > >>>> 2880            reg = mca_btl_base_active_message_trigger + hdr->tag;
> > > > > >>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > > >>>> 2882            if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> > > > > >>>> 2883                cqp = (hdr->credits >> 11) & 0x0f;
> > > > > >>>> 2884                hdr->credits &= 0x87ff;
> > > > > >>>> 2885            } else {
> > > > > >>>> 
> > > > > >>>> Regards,
> > > > > >>>> Eloi
> > > > > >>>> 
> > > > > >>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> > > > > >>>>> Hi Edgar,
> > > > > >>>>> 
> > > > > >>>>> The only difference I could observe was that the
> > > > > >>>>> segmentation fault sometimes appeared later during the
> > > > > >>>>> parallel computation.
> > > > > >>>>> 
> > > > > >>>>> I'm running out of ideas here. I wish I could use "--mca coll
> > > > > >>>>> tuned" together with "--mca btl self,sm,tcp" so that I could
> > > > > >>>>> check that the issue is not somehow limited to the tuned
> > > > > >>>>> collective routines.
> > > > > >>>>> 
> > > > > >>>>> Thanks,
> > > > > >>>>> Eloi
> > > > > >>>>> 
> > > > > >>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> > > > > >>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > > > > >>>>>>> hi edgar,
> > > > > >>>>>>> 
> > > > > >>>>>>> thanks for the tips, I'm gonna try this option as well. the
> > > > > >>>>>>> segmentation fault i'm observing always happened during a
> > > > > >>>>>>> collective communication indeed... it basically switches all
> > > > > >>>>>>> collective communications to basic mode, right?
> > > > > >>>>>>> 
> > > > > >>>>>>> sorry for my ignorance, but what's an NCA?
> > > > > >>>>>> 
> > > > > >>>>>> sorry, I meant to type HCA (InfiniBand networking card)
> > > > > >>>>>> 
> > > > > >>>>>> Thanks
> > > > > >>>>>> Edgar
> > > > > >>>>>> 
> > > > > >>>>>>> thanks,
> > > > > >>>>>>> éloi
> > > > > >>>>>>> 
> > > > > >>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> > > > > >>>>>>>> you could try first to use the algorithms in the basic
> > > > > >>>>>>>> module, e.g.
> > > > > >>>>>>>> 
> > > > > >>>>>>>> mpirun -np x --mca coll basic ./mytest
> > > > > >>>>>>>> 
> > > > > >>>>>>>> and see whether this makes a difference. I used to
> > > > > >>>>>>>> sometimes observe a (similar?) problem in the openib btl
> > > > > >>>>>>>> triggered from the tuned collective component, in cases
> > > > > >>>>>>>> where the ofed libraries were installed but no NCA was
> > > > > >>>>>>>> found on a node. It used to work however with the basic
> > > > > >>>>>>>> component.
> > > > > >>>>>>>> 
> > > > > >>>>>>>> Thanks
> > > > > >>>>>>>> Edgar
> > > > > >>>>>>>> 
> > > > > >>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > > > > >>>>>>>>> hi Rolf,
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> unfortunately, i couldn't get rid of that annoying
> > > > > >>>>>>>>> segmentation fault when selecting another bcast
> > > > > >>>>>>>>> algorithm. i'm now going to replace MPI_Bcast with a
> > > > > >>>>>>>>> naive implementation (using MPI_Send and MPI_Recv) and
> > > > > >>>>>>>>> see if that helps.
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> regards,
> > > > > >>>>>>>>> éloi
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> > > > > >>>>>>>>>> Hi Rolf,
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> thanks for your input. You're right, I missed the
> > > > > >>>>>>>>>> coll_tuned_use_dynamic_rules option.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I'll check whether the segmentation fault disappears when
> > > > > >>>>>>>>>> using the basic linear bcast algorithm with the proper
> > > > > >>>>>>>>>> command line you provided.
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> Regards,
> > > > > >>>>>>>>>> Eloi
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > > > > >>>>>>>>>>> Hi Eloi:
> > > > > >>>>>>>>>>> To select the different bcast algorithms, you need to
> > > > > >>>>>>>>>>> add an extra mca parameter that tells the library to
> > > > > >>>>>>>>>>> use dynamic selection. --mca
> > > > > >>>>>>>>>>> coll_tuned_use_dynamic_rules 1
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> One way to make sure you are typing this in correctly
> > > > > >>>>>>>>>>> is to use it with ompi_info.  Do the following:
> > > > > >>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > > > > >>>>>>>>>>> You should see lots of output with all the different
> > > > > >>>>>>>>>>> algorithms that can be selected for the various
> > > > > >>>>>>>>>>> collectives. Therefore, you need this:
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
> > > > > >>>>>>>>>>> coll_tuned_bcast_algorithm 1
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> Rolf
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> > > > > >>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1"
> > > > > >>>>>>>>>>>> allowed me to switch to the basic linear algorithm.
> > > > > >>>>>>>>>>>> Anyway, whatever the algorithm used, the segmentation
> > > > > >>>>>>>>>>>> fault remains.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Could anyone give some advice on ways to diagnose the
> > > > > >>>>>>>>>>>> issue I'm facing?
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>> Eloi
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > > > > >>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that seems to
> > > > > >>>>>>>>>>>>> randomly segfault when using the openib btl. I'd like
> > > > > >>>>>>>>>>>>> to know if there is any way to make OpenMPI switch to
> > > > > >>>>>>>>>>>>> a different algorithm than the default one being
> > > > > >>>>>>>>>>>>> selected for MPI_Bcast.
> > > > > >>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>> Thanks for your help,
> > > > > >>>>>>>>>>>>> Eloi
> > > > > >>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > > > > >>>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> I'm observing a random segmentation fault during an
> > > > > >>>>>>>>>>>>>> internode parallel computation involving the openib
> > > > > >>>>>>>>>>>>>> btl and OpenMPI-1.4.2 (the same issue can be observed
> > > > > >>>>>>>>>>>>>> with OpenMPI-1.3.3).
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>    mpirun (Open MPI) 1.4.2
> > > > > >>>>>>>>>>>>>>    Report bugs to http://www.open-mpi.org/community/help/
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] *** Process received signal ***
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] Signal: Segmentation fault (11)
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] Signal code: Address not mapped (1)
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] Failing at address: (nil)
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > > > > >>>>>>>>>>>>>>    [pbn08:02624] *** End of error message ***
> > > > > >>>>>>>>>>>>>>    sh: line 1:  2624 Segmentation fault
> > > > > >>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > > > > >>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > > > > >>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > > > > >>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > > > > >>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL'
> > > > > >>>>>>>>>>>>>> '--t_max=0.1' '--parallel=domain'
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> If I choose not to use the openib btl (by using
> > > > > >>>>>>>>>>>>>> --mca btl self,sm,tcp on the command line, for
> > > > > >>>>>>>>>>>>>> instance), I don't encounter any problem and the
> > > > > >>>>>>>>>>>>>> parallel computation runs flawlessly.
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> I would like to get some help to be able:
> > > > > >>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
> > > > > >>>>>>>>>>>>>> - understand why this issue is observed only when using
> > > > > >>>>>>>>>>>>>>   the openib btl and not when using self,sm,tcp
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> Any help would be very much appreciated.
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> The outputs of ompi_info and of the OpenMPI configure
> > > > > >>>>>>>>>>>>>> script are enclosed with this email, along with some
> > > > > >>>>>>>>>>>>>> information on the infiniband drivers.
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> Here is the command line used when launching a
> > > > > >>>>>>>>>>>>>> parallel computation using infiniband:
> > > > > >>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS
> > > > > >>>>>>>>>>>>>>    --hostfile host.list --mca btl openib,sm,self,tcp
> > > > > >>>>>>>>>>>>>>    --display-map --verbose --version
> > > > > >>>>>>>>>>>>>>    --mca mpi_warn_on_fork 0
> > > > > >>>>>>>>>>>>>>    --mca btl_openib_want_fork_support 0 [...]
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> and the command line used if not using infiniband:
> > > > > >>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS
> > > > > >>>>>>>>>>>>>>    --hostfile host.list --mca btl self,sm,tcp
> > > > > >>>>>>>>>>>>>>    --display-map --verbose --version
> > > > > >>>>>>>>>>>>>>    --mca mpi_warn_on_fork 0
> > > > > >>>>>>>>>>>>>>    --mca btl_openib_want_fork_support 0 [...]
> > > > > >>>>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>> Eloi
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> _______________________________________________
> > > > > >>>>>>>>>>>> users mailing list
> > > > > >>>>>>>>>>>> us...@open-mpi.org
> > > > > >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > >>> 
> > > > > >>> --
> > > > > >>> 
> > > > > >>> 
> > > > > >>> Eloi Gaudry
> > > > > >>> 
> > > > > >>> Free Field Technologies
> > > > > >>> Company Website: http://www.fft.be
> > > > > >>> Company Phone:   +32 10 487 959
> > > > > >>> 
> > > > > >>> _______________________________________________
> > > > > >>> users mailing list
> > > > > >>> us...@open-mpi.org
> > > > > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > > 
> > > > > > _______________________________________________
> > > > > > users mailing list
> > > > > > us...@open-mpi.org
> > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > --
> > 
> > 
> > Eloi Gaudry
> > 
> > Free Field Technologies
> > Company Website: http://www.fft.be
> > Company Phone:   +32 10 487 959
> > 
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959
