Hi, Artem

ML is the collective component that invokes the calls into BCOL. The
triplet basesmuma,basesmuma,ptpcoll, for example, means I want three levels
of hierarchy: socket level, UMA level, and then network level. I am
guessing (only a guess after a quick glance) that maybe srun is not binding
processes, which could cause the socket subgrouping code to fail (it
should gracefully declare that there is nothing to subgroup, but this is
where the bug could be). It will always conclude that processes are bound
to the host, so the two-level command line should work. Also, you need to
look at the variable OMPI_MCA_sbgp_base_string (this defines the subgrouping
rules; the BCOLs are the collective primitives mapped onto a particular
communication substrate, e.g. shared memory, CORE-Direct, vanilla
point-to-point).
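
If it helps to double-check what your build defaults to, both strings can be
read back from ompi_info, e.g.:

ompi_info --all | grep -E "sbgp_base_string|bcol_base_string"   # just dumps the two MCA parameters mentioned above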

Can you try with:
srun -N .. --cpu_bind=cores ...
and see if this resolves the issue? Also, are you running on a
hyperthreaded machine?
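
For example (a sketch; the node/task counts and binary name are placeholders,
and adding "verbose" makes slurm print the binding mask it actually applied):

srun -N 2 -n 8 --cpu_bind=verbose,cores ./hellompi   # -N/-n values and binary are placeholders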

Another experiment to try (I'm assuming this will hang):
export OMPI_MCA_bcol_base_string=basesmuma,basesmuma,ptpcoll   (this says
map shared-memory collective primitives to both the socket-level group and
the host-level group of processes, with point-to-point for the leaders)
export OMPI_MCA_sbgp_base_string=basesmsocket,basesmuma,p2p

I would guess this will work:
export OMPI_MCA_bcol_base_string=basesmuma,ptpcoll   (this says form only a
single shared-memory subgroup consisting of the processes on the host and
then a single point-to-point subgroup consisting of all the host leaders)
export OMPI_MCA_sbgp_base_string=basesmuma,p2p

I'm speculating that this will hang, again because of the socket
subgrouping:
export OMPI_MCA_bcol_base_string=basesmuma,ptpcoll  (this says form groups
consisting of all procs on the same socket and then take a local leader
from each of these groups and form a point-to-point group)
export OMPI_MCA_sbgp_base_string=basesmsocket,p2p
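
To see which subgroups and BCOLs actually get picked while trying these
combinations, the framework verbose knobs may help (a sketch; I'm assuming
the usual <framework>_base_verbose parameter names, and 100 is an arbitrary
level):

export OMPI_MCA_coll_base_verbose=100   # assumed standard framework verbose parameters
export OMPI_MCA_sbgp_base_verbose=100
export OMPI_MCA_bcol_base_verbose=100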

In any case, Elena's suggestion to add -mca coll ^ml will silence all of
this.
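
Since you are launching with srun, the environment-variable form of that
option should work too, e.g.:

export OMPI_MCA_coll=^ml
srun -N 2 ./hellompi   # node count and binary name are placeholders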

Josh



On Fri, Oct 17, 2014 at 11:46 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hey, Lena :).
>
> 2014-10-17 22:07 GMT+07:00 Elena Elkina <elena.elk...@itseez.com>:
>
>> Hi Artem,
>>
>> Actually some time ago there was a known issue with coll ml. I used to
>> run my command lines with -mca coll ^ml to avoid these problems, so I don't
>> know if it was fixed or not. It looks like you have the same problem.
>>
>
> But mine is with the bcol framework, not coll. And as you can see, the
> modules themselves don't break the program, only some of their
> combinations do. Also, I am curious why the basesmuma module is listed twice.
>
>
>
>> Best regards,
>> Elena
>>
>> On Fri, Oct 17, 2014 at 7:01 PM, Artem Polyakov <artpo...@gmail.com>
>> wrote:
>>
>>> Gilles,
>>>
>>> I checked your patch and it doesn't solve the problem I observe. I think
>>> the reason is somewhere else.
>>>
>>> 2014-10-17 19:13 GMT+07:00 Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com>:
>>>
>>>> Artem,
>>>>
>>>> There is a known issue #235 with modex, and I made PR #238 with a
>>>> tentative fix.
>>>>
>>>> Could you please give it a try and report whether it solves your problem?
>>>>
>>>> Cheers
>>>>
>>>> Gilles
>>>>
>>>>
>>>> Artem Polyakov <artpo...@gmail.com> wrote:
>>>> Hello, I am having trouble with the latest trunk when I use PMI1.
>>>>
>>>> For example, if I use 2 nodes the application hangs. See the backtraces
>>>> from both nodes below. From them I can see that the second (non-launching)
>>>> node hangs in bcol component selection. Here is the default setting of the
>>>> bcol_base_string parameter according to ompi_info:
>>>> bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll,ugni"
>>>> I don't know whether it is correct for basesmuma to be duplicated.
>>>>
>>>> Experiments with this parameter showed that it directly influences the
>>>> bug:
>>>> export OMPI_MCA_bcol_base_string="" #  [SEGFAULT]
>>>> export OMPI_MCA_bcol_base_string="ptpcoll" #  [OK]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll" #  [OK]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload" #  [OK]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload,ugni" #
>>>>  [OK]
>>>> export
>>>> OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll,iboffload,ugni" #
>>>>  [HANG]
>>>> export
>>>> OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll" # [HANG]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload" # [OK]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ugni" #
>>>> [OK]
>>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll" #  [HANG]
>>>> export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma" #  [OK]
>>>> export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma,basesmuma" #  [HANG]
>>>>
>>>> I can provide other information if necessary.
>>>>
>>>> cn1:
>>>> (gdb) bt
>>>> 0  0x00007fdebd30ac6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>>>> 1  0x00007fdebcca64e0 in poll_dispatch (base=0x1d466b0,
>>>> tv=0x7fff71aab880) at poll.c:165
>>>> 2  0x00007fdebcc9b041 in opal_libevent2021_event_base_loop
>>>> (base=0x1d466b0, flags=2) at event.c:1631
>>>> 3  0x00007fdebcc35891 in opal_progress () at runtime/opal_progress.c:169
>>>> 4  0x00007fdeb32f78cb in opal_condition_wait (c=0x7fdebdb51bc0
>>>> <ompi_request_cond>, m=0x7fdebdb51cc0 <ompi_request_lock>) at
>>>> ../../../../opal/threads/condition.h:78
>>>> 5  0x00007fdeb32f79b8 in ompi_request_wait_completion
>>>> (req=0x7fff71aab920) at ../../../../ompi/request/request.h:381
>>>> 6  0x00007fdeb32f84b8 in mca_pml_ob1_recv (addr=0x7fff71aabd80,
>>>> count=1, datatype=0x6026c0 <ompi_mpi_int>, src=1, tag=0, comm=0x6020a0
>>>> <ompi_mpi_comm_world>,
>>>>     status=0x7fff71aabd90) at pml_ob1_irecv.c:109
>>>> 7  0x00007fdebd88f54d in PMPI_Recv (buf=0x7fff71aabd80, count=1,
>>>> type=0x6026c0 <ompi_mpi_int>, source=1, tag=0, comm=0x6020a0
>>>> <ompi_mpi_comm_world>,
>>>>     status=0x7fff71aabd90) at precv.c:78
>>>> 8  0x0000000000400c44 in main (argc=1, argv=0x7fff71aabe98) at
>>>> hellompi.c:33
>>>>
>>>> cn2:
>>>> (gdb) bt
>>>> 0  0x00007fa65aa78c6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>>>> 1  0x00007fa65a4144e0 in poll_dispatch (base=0x20e96b0,
>>>> tv=0x7fff46f44a80) at poll.c:165
>>>> 2  0x00007fa65a409041 in opal_libevent2021_event_base_loop
>>>> (base=0x20e96b0, flags=2) at event.c:1631
>>>> 3  0x00007fa65a3a3891 in opal_progress () at runtime/opal_progress.c:169
>>>> 4  0x00007fa65afbbc25 in opal_condition_wait (c=0x7fa65b2bfbc0
>>>> <ompi_request_cond>, m=0x7fa65b2bfcc0 <ompi_request_lock>) at
>>>> ../opal/threads/condition.h:78
>>>> 5  0x00007fa65afbc1b5 in ompi_request_default_wait_all (count=2,
>>>> requests=0x7fff46f44c70, statuses=0x0) at request/req_wait.c:287
>>>> 6  0x00007fa65afc7906 in comm_allgather_pml (src_buf=0x7fff46f44da0,
>>>> dest_buf=0x233dac0, count=288, dtype=0x7fa65b29fee0 <ompi_mpi_char>,
>>>> my_rank_in_group=1,
>>>>     n_peers=2, ranks_in_comm=0x210a760, comm=0x6020a0
>>>> <ompi_mpi_comm_world>) at patterns/comm/allgather.c:250
>>>> 7  0x00007fa64f14ba08 in bcol_basesmuma_smcm_allgather_connection
>>>> (sm_bcol_module=0x7fa64e64d010, module=0x232c800,
>>>>     peer_list=0x7fa64f3513e8 <mca_bcol_basesmuma_component+456>,
>>>> back_files=0x7fa64eae2690, comm=0x6020a0 <ompi_mpi_comm_world>, input=...,
>>>>     base_fname=0x7fa64f14ca8c "sm_ctl_mem_", map_all=false) at
>>>> bcol_basesmuma_smcm.c:205
>>>> 8  0x00007fa64f146525 in base_bcol_basesmuma_setup_ctl
>>>> (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220
>>>> <mca_bcol_basesmuma_component>) at bcol_basesmuma_setup.c:344
>>>> 9  0x00007fa64f146cbb in base_bcol_basesmuma_setup_library_buffers
>>>> (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220
>>>> <mca_bcol_basesmuma_component>)
>>>>     at bcol_basesmuma_setup.c:550
>>>> 10 0x00007fa64f1418d0 in mca_bcol_basesmuma_comm_query
>>>> (module=0x232c800, num_modules=0x232e570) at bcol_basesmuma_module.c:532
>>>> 11 0x00007fa64fd9e5f2 in mca_coll_ml_tree_hierarchy_discovery
>>>> (ml_module=0x232fbe0, topo=0x232fd98, n_hierarchies=3,
>>>> exclude_sbgp_name=0x0, include_sbgp_name=0x0)
>>>>     at coll_ml_module.c:1964
>>>> 12 0x00007fa64fd9f3a3 in mca_coll_ml_fulltree_hierarchy_discovery
>>>> (ml_module=0x232fbe0, n_hierarchies=3) at coll_ml_module.c:2211
>>>> 13 0x00007fa64fd9cbe4 in ml_discover_hierarchy (ml_module=0x232fbe0) at
>>>> coll_ml_module.c:1518
>>>> 14 0x00007fa64fda164f in mca_coll_ml_comm_query (comm=0x6020a0
>>>> <ompi_mpi_comm_world>, priority=0x7fff46f45358) at coll_ml_module.c:2970
>>>> 15 0x00007fa65b02f6aa in query_2_0_0 (component=0x7fa64fffe4e0
>>>> <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>,
>>>> priority=0x7fff46f45358,
>>>>     module=0x7fff46f45390) at base/coll_base_comm_select.c:374
>>>> 16 0x00007fa65b02f66e in query (component=0x7fa64fffe4e0
>>>> <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>,
>>>> priority=0x7fff46f45358, module=0x7fff46f45390)
>>>>     at base/coll_base_comm_select.c:357
>>>> 17 0x00007fa65b02f581 in check_one_component (comm=0x6020a0
>>>> <ompi_mpi_comm_world>, component=0x7fa64fffe4e0 <mca_coll_ml_component>,
>>>> module=0x7fff46f45390)
>>>>     at base/coll_base_comm_select.c:319
>>>> 18 0x00007fa65b02f3c7 in check_components (components=0x7fa65b2a9530
>>>> <ompi_coll_base_framework+80>, comm=0x6020a0 <ompi_mpi_comm_world>)
>>>>     at base/coll_base_comm_select.c:283
>>>> 19 0x00007fa65b027d45 in mca_coll_base_comm_select (comm=0x6020a0
>>>> <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:119
>>>> 20 0x00007fa65afbdb2c in ompi_mpi_init (argc=1, argv=0x7fff46f45a78,
>>>> requested=0, provided=0x7fff46f4590c) at runtime/ompi_mpi_init.c:858
>>>> 21 0x00007fa65aff20ef in PMPI_Init (argc=0x7fff46f4594c,
>>>> argv=0x7fff46f45940) at pinit.c:84
>>>> 22 0x0000000000400b66 in main (argc=1, argv=0x7fff46f45a78) at
>>>> hellompi.c:11
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards, Artem Y. Polyakov
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
>>>
>>
>>
>
>
>
> --
> Best regards, Artem Y. Polyakov
>
