Hi Eloi,
> Do you think that a thread race condition could explain the hdr->tag value?
Are there multiple threads invoking MPI functions in your application? The
openib BTL is not yet thread safe in the 1.4 release series. There have been
improvements to openib BTL thread safety in 1.5, but it is still not
officially supported.
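If it helps to rule that out, here is a minimal sketch (plain MPI calls, nothing
specific to your application or to Open MPI internals) that requests
MPI_THREAD_MULTIPLE at startup and prints the thread level the library actually
grants:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided = MPI_THREAD_SINGLE;

        /* Ask for the highest thread level; the library may grant less. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        /* Anything below MPI_THREAD_MULTIPLE means concurrent MPI calls
         * from several threads are not supported by this build. */
        printf("requested MPI_THREAD_MULTIPLE (%d), provided %d\n",
               MPI_THREAD_MULTIPLE, provided);

        MPI_Finalize();
        return 0;
    }

If the provided level comes back lower than what your application assumes, that
alone would point at a threading problem rather than at the openib BTL.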

--Nysal

On Tue, Aug 17, 2010 at 1:06 PM, Eloi Gaudry <e...@fft.be> wrote:

> Hi Nysal,
>
> This is what I was wondering, i.e. whether hdr->tag was expected to be null or not.
> I'll soon send a valgrind output to the list, hoping it will help locate an
> invalid memory access and explain why reg->cbfunc / hdr->tag are null.
>
> Do you think that a thread race condition could explain the hdr->tag value?
>
> Thanks for your help,
> Eloi
>
> On Monday 16 August 2010 20:46:39 Nysal Jan wrote:
> > The value of hdr->tag seems wrong.
> >
> > In ompi/mca/pml/ob1/pml_ob1_hdr.h
> > #define MCA_PML_OB1_HDR_TYPE_MATCH     (MCA_BTL_TAG_PML + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV      (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET      (MCA_BTL_TAG_PML + 3)
> > #define MCA_PML_OB1_HDR_TYPE_ACK       (MCA_BTL_TAG_PML + 4)
> > #define MCA_PML_OB1_HDR_TYPE_NACK      (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG      (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET       (MCA_BTL_TAG_PML + 7)
> > #define MCA_PML_OB1_HDR_TYPE_PUT       (MCA_BTL_TAG_PML + 8)
> > #define MCA_PML_OB1_HDR_TYPE_FIN       (MCA_BTL_TAG_PML + 9)
> >
> > and in ompi/mca/btl/btl.h
> > #define MCA_BTL_TAG_PML             0x40
> >
> > So hdr->tag should be a value >= 65.
> > Since the tag is incorrect, you are not getting the proper callback
> > function pointer, hence the SEGV.
> > I'm not sure at this point why you are getting an invalid/corrupt
> > message header.
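As a purely illustrative aside: the dispatch quoted further down the thread does
reg = mca_btl_base_active_message_trigger + hdr->tag before calling reg->cbfunc,
so a tag of 0 indexes a slot whose callback is NULL, which matches the SEGV at
address (nil). Below is a minimal sketch of the kind of sanity check one could
drop in while debugging, using only the MCA_BTL_TAG_PML value quoted above; the
helper name check_pml_tag is hypothetical, not Open MPI code:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MCA_BTL_TAG_PML 0x40   /* from ompi/mca/btl/btl.h, as quoted above */

    /* Hypothetical debugging helper: the PML (ob1) header types listed above
     * all sit at MCA_BTL_TAG_PML + 1 .. MCA_BTL_TAG_PML + 9 (i.e. 65..73), so
     * anything outside that range on a PML fragment indicates a corrupt
     * header. */
    static void check_pml_tag(uint8_t tag)
    {
        if (tag < MCA_BTL_TAG_PML + 1 || tag > MCA_BTL_TAG_PML + 9) {
            fprintf(stderr, "suspicious hdr->tag = %u\n", (unsigned)tag);
            assert(0 && "corrupt PML header tag");
        }
    }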
> >
> > --Nysal
> >
> > On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry <e...@fft.be> wrote:
> > > Hi,
> > >
> > > sorry, I just forgot to add the values of the function parameters:
> > > (gdb) print reg->cbdata
> > > $1 = (void *) 0x0
> > > (gdb) print openib_btl->super
> > > $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > >   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > >   btl_rdma_pipeline_send_length = 1048576,
> > >   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864,
> > >   btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
> > >   btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
> > >   btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>,
> > >   btl_register = 0, btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
> > >   btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
> > >   btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
> > >   btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
> > >   btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
> > >   btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
> > >   btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
> > >   btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
> > >   btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
> > >   btl_dump = 0x2b341acd45cb <mca_btl_base_dump>,
> > >   btl_mpool = 0xf3f4110,
> > >   btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
> > >   btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> > > (gdb) print hdr->tag
> > > $3 = 0 '\0'
> > > (gdb) print des
> > > $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > (gdb) print reg->cbfunc
> > > $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > >
> > > Eloi
> > >
> > > On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > Hi,
> > > >
> > > > Here is the output of a core file generated during a segmentation
> > > > fault observed during a collective call (using openib):
> > > >
> > > > #0  0x0000000000000000 in ?? ()
> > > > (gdb) where
> > > > #0  0x0000000000000000 in ?? ()
> > > > #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > #2  0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
> > > > #3  0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> > > > #4  0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> > > > #5  0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> > > > #6  0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> > > > #7  0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> > > > #8  0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> > > > #9  0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > > #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> > > > #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> > > > #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> > > > #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
> > > > #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
> > > > #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
> > > > #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> > > > #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > > > #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> > > > #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> > > > #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> > > > #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> > > > #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
> > > > #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
> > > > #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> > > > #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
> > > > #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
> > > > #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
> > > > #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
> > > > #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
> > > > (gdb) f 1
> > > > #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > Current language:  auto; currently c
> > > > (gdb)
> > > > #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > > > 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > (gdb) l
> > > > 2876
> > > > 2877        if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> > > > 2878            /* call registered callback */
> > > > 2879            mca_btl_active_message_callback_t* reg;
> > > > 2880            reg = mca_btl_base_active_message_trigger + hdr->tag;
> > > > 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > 2882            if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> > > > 2883                cqp = (hdr->credits >> 11) & 0x0f;
> > > > 2884                hdr->credits &= 0x87ff;
> > > > 2885            } else {
> > > >
> > > > Regards,
> > > > Eloi
> > > >
> > > > On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> > > > > Hi Edgar,
> > > > >
> > > > > The only difference I could observe was that the segmentation fault
> > > > > sometimes appeared later during the parallel computation.
> > > > >
> > > > > I'm running out of ideas here. I wish I could use "--mca coll tuned"
> > > > > with "--mca btl self,sm,tcp" so that I could check that the issue
> > > > > is not somehow limited to the tuned collective routines.
> > > > >
> > > > > Thanks,
> > > > > Eloi
> > > > >
> > > > > On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> > > > > > On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > > > > > > hi edgar,
> > > > > > >
> > > > > > > thanks for the tips, I'm gonna try this option as well. The
> > > > > > > segmentation fault I'm observing always happened during a
> > > > > > > collective communication indeed... it basically switches all
> > > > > > > collective communication to basic mode, right?
> > > > > > >
> > > > > > > sorry for my ignorance, but what's an NCA?
> > > > > >
> > > > > > sorry, I meant to type HCA (InfiniBand networking card)
> > > > > >
> > > > > > Thanks
> > > > > > Edgar
> > > > > >
> > > > > > > thanks,
> > > > > > > éloi
> > > > > > >
> > > > > > > On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> > > > > > >> you could try first to use the algorithms in the basic module, e.g.
> > > > > > >>
> > > > > > >> mpirun -np x --mca coll basic ./mytest
> > > > > > >>
> > > > > > >> and see whether this makes a difference. I used to sometimes
> > > > > > >> observe a (similar?) problem in the openib btl triggered from the
> > > > > > >> tuned collective component, in cases where the ofed libraries were
> > > > > > >> installed but no NCA was found on a node. It used to work
> > > > > > >> however with the basic component.
> > > > > > >>
> > > > > > >> Thanks
> > > > > > >> Edgar
> > > > > > >>
> > > > > > >> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > > > > > >>> hi Rolf,
> > > > > > >>>
> > > > > > >>> unfortunately, I couldn't get rid of that annoying segmentation
> > > > > > >>> fault when selecting another bcast algorithm. I'm now going to
> > > > > > >>> replace MPI_Bcast with a naive implementation (using MPI_Send
> > > > > > >>> and MPI_Recv) and see if that helps.
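For reference, a naive linear broadcast of the kind described above could look
like the sketch below; the function name naive_bcast and the message tag are
illustrative only, not taken from this thread:

    #include <mpi.h>

    /* Naive linear broadcast: the root sends the buffer to every other rank
     * in turn; everyone else posts a single matching receive. Same signature
     * as MPI_Bcast so it can be swapped in for testing. */
    static int naive_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
    {
        int rank, size, peer;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (peer = 0; peer < size; ++peer) {
                if (peer != root)
                    MPI_Send(buf, count, type, peer, 0 /* tag */, comm);
            }
        } else {
            MPI_Recv(buf, count, type, root, 0 /* tag */, comm,
                     MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }

If this variant also crashes inside the openib BTL, that would point away from
the tuned collective component and toward the point-to-point path itself.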
> > > > > > >>>
> > > > > > >>> regards,
> > > > > > >>> éloi
> > > > > > >>>
> > > > > > >>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> > > > > > >>>> Hi Rolf,
> > > > > > >>>>
> > > > > > >>>> thanks for your input. You're right, I missed the
> > > > > > >>>> coll_tuned_use_dynamic_rules option.
> > > > > > >>>>
> > > > > > >>>> I'll check whether the segmentation fault disappears when using
> > > > > > >>>> the basic bcast linear algorithm with the proper command
> > > > > > >>>> line you provided.
> > > > > > >>>>
> > > > > > >>>> Regards,
> > > > > > >>>> Eloi
> > > > > > >>>>
> > > > > > >>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > > > > > >>>>> Hi Eloi:
> > > > > > >>>>> To select the different bcast algorithms, you need to add an
> > > > > > >>>>> extra mca parameter that tells the library to use dynamic
> > > > > > >>>>> selection: --mca coll_tuned_use_dynamic_rules 1
> > > > > > >>>>>
> > > > > > >>>>> One way to make sure you are typing this in correctly is to
> > > > > > >>>>> use it with ompi_info.  Do the following:
> > > > > > >>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > > > > > >>>>>
> > > > > > >>>>> You should see lots of output with all the different
> > > > > > >>>>> algorithms that can be selected for the various collectives.
> > > > > > >>>>> Therefore, you need this:
> > > > > > >>>>>
> > > > > > >>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
> > > > > > >>>>> coll_tuned_bcast_algorithm 1
> > > > > > >>>>>
> > > > > > >>>>> Rolf
> > > > > > >>>>>
> > > > > > >>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> > > > > > >>>>>> Hi,
> > > > > > >>>>>>
> > > > > > >>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed
> > > > > > >>>>>> me to switch to the basic linear algorithm. Anyway, whatever
> > > > > > >>>>>> the algorithm used, the segmentation fault remains.
> > > > > > >>>>>>
> > > > > > >>>>>> Could anyone give some advice on ways to diagnose the issue
> > > > > > >>>>>> I'm facing?
> > > > > > >>>>>>
> > > > > > >>>>>> Regards,
> > > > > > >>>>>> Eloi
> > > > > > >>>>>>
> > > > > > >>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > > > > > >>>>>>> Hi,
> > > > > > >>>>>>>
> > > > > > >>>>>>> I'm focusing on the MPI_Bcast routine that seems to
> > > > > > >>>>>>> randomly segfault when using the openib btl. I'd like to
> > > > > > >>>>>>> know if there is any way to make OpenMPI switch to a
> > > > > > >>>>>>> different algorithm than the default one being selected
> > > > > > >>>>>>> for MPI_Bcast.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks for your help,
> > > > > > >>>>>>> Eloi
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > > > > > >>>>>>>> Hi,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I'm observing a random segmentation fault during an
> > > > > > >>>>>>>> internode parallel computation involving the openib btl
> > > > > > >>>>>>>> and OpenMPI-1.4.2 (the same issue can be observed with
> > > > > > >>>>>>>> OpenMPI-1.3.3).
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>    mpirun (Open MPI) 1.4.2
> > > > > > >>>>>>>>    Report bugs to http://www.open-mpi.org/community/help/
> > > > > > >>>>>>>>    [pbn08:02624] *** Process received signal ***
> > > > > > >>>>>>>>    [pbn08:02624] Signal: Segmentation fault (11)
> > > > > > >>>>>>>>    [pbn08:02624] Signal code: Address not mapped (1)
> > > > > > >>>>>>>>    [pbn08:02624] Failing at address: (nil)
> > > > > > >>>>>>>>    [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > > > > > >>>>>>>>    [pbn08:02624] *** End of error message ***
> > > > > > >>>>>>>>    sh: line 1:  2624 Segmentation fault
> > > > > > >>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > > > > > >>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > > > > > >>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > > > > > >>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > > > > > >>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> > > > > > >>>>>>>> '--parallel=domain'
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> If I choose not to use the openib btl (by using --mca btl
> > > > > > >>>>>>>> self,sm,tcp on the command line, for instance), I don't
> > > > > > >>>>>>>> encounter any problem and the parallel computation runs
> > > > > > >>>>>>>> flawlessly.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I would like to get some help to be able:
> > > > > > >>>>>>>> - to diagnose the issue I'm facing with the openib btl
> > > > > > >>>>>>>> - to understand why this issue is observed only when using
> > > > > > >>>>>>>>   the openib btl and not when using self,sm,tcp
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Any help would be very much appreciated.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> The outputs of ompi_info and the configure scripts of
> > > > > > >>>>>>>> OpenMPI are attached to this email, along with some
> > > > > > >>>>>>>> information on the infiniband drivers.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Here is the command line used when launching a parallel
> > > > > > >>>>>>>> computation using infiniband:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile
> > > > > > >>>>>>>>    host.list --mca btl openib,sm,self,tcp --display-map
> > > > > > >>>>>>>>    --verbose --version --mca mpi_warn_on_fork 0
> > > > > > >>>>>>>>    --mca btl_openib_want_fork_support 0 [...]
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> and the command line used if not using infiniband:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile
> > > > > > >>>>>>>>    host.list --mca btl self,sm,tcp --display-map --verbose
> > > > > > >>>>>>>>    --version --mca mpi_warn_on_fork 0
> > > > > > >>>>>>>>    --mca btl_openib_want_fork_support 0 [...]
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thanks,
> > > > > > >>>>>>>> Eloi
> > > > > > >>>>>>
> > >
> > > --
> > >
> > >
> > > Eloi Gaudry
> > >
> > > Free Field Technologies
> > > Company Website: http://www.fft.be
> > > Company Phone:   +32 10 487 959
> > >
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone:   +32 10 487 959
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
