Artem,

There is a known issue (#235) with the modex, and I made PR #238 with a tentative fix.

Could you please give it a try and report whether it solves your problem?

Cheers

Gilles

Artem Polyakov <artpo...@gmail.com> wrote:
>Hello, I am having trouble with the latest trunk when I use PMI1.
>
>
>For example, if I use 2 nodes, the application hangs. See the backtraces from
>both nodes below. They show that the second (non-launching) node hangs in
>bcol component selection. Here is the default setting of the bcol_base_string
>parameter:
>
>bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll,ugni"
>
>according to ompi_info. I don't know whether it is correct for basesmuma to 
>be listed twice.
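
One way to check a list for repeated entries, and to build a de-duplicated value to try as an override, is sketched below. The helper name dedup_bcol_list is hypothetical and not part of Open MPI; only the OMPI_MCA_bcol_base_string variable and the default list come from this thread. Whether dropping the duplicate is the right fix, rather than just a workaround, is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print a comma-separated
# list with repeated entries removed, keeping first-seen order.
dedup_bcol_list() {
    printf '%s\n' "$1" | tr ',' '\n' | awk 'NF && !seen[$0]++' | paste -sd, -
}

# Default value reported by ompi_info in this thread.
default_list="basesmuma,basesmuma,iboffload,ptpcoll,ugni"
deduped=$(dedup_bcol_list "$default_list")
echo "$deduped"

# Assumption: overriding the default with the de-duplicated list is
# worth trying as a workaround; it is not a confirmed fix.
export OMPI_MCA_bcol_base_string="$deduped"
```

The export only affects processes started from the same shell, so it has to be set before the job is launched.
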
>
>
>Experiments with this parameter showed that it directly influences the bug:
>
>export OMPI_MCA_bcol_base_string="" #  [SEGFAULT]
>
>export OMPI_MCA_bcol_base_string="ptpcoll" #  [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll" #  [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload" #  [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload,ugni" #  [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll,iboffload,ugni" #  [HANG]
>
>export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll" #  [HANG]
>
>export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload" # [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ugni" # [OK]
>
>export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll" #  [HANG]
>
>export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma" #  [OK]
>
>export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma,basesmuma" #  [HANG]
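
A sweep like the one above could be scripted roughly as follows, so that a hang is reported instead of blocking the terminal. This is only a sketch: classify_run, sweep and RUN_CMD are hypothetical names, the 30-second limit is arbitrary, and it assumes timeout(1) from GNU coreutils is available and that the launcher passes the application's exit status through (some launchers report a crash differently).

```shell
#!/bin/sh
# Hypothetical sweep helper: run a command with a given bcol_base_string
# and classify the outcome. Requires timeout(1) from GNU coreutils.
classify_run() {
    value=$1; limit=$2; shift 2
    # The assignment prefix sets the variable only for this one command.
    OMPI_MCA_bcol_base_string="$value" timeout "$limit" "$@" >/dev/null 2>&1
    case $? in
        0)   echo "OK" ;;
        124) echo "HANG" ;;      # timeout(1) had to stop the command
        139) echo "SEGFAULT" ;;  # 128 + SIGSEGV, if the launcher passes it on
        *)   echo "FAIL" ;;
    esac
}

# Placeholder command: prepend whatever launcher you actually use.
RUN_CMD=${RUN_CMD:-./hellompi}

sweep() {
    for v in "ptpcoll" "basesmuma,ptpcoll" "basesmuma,basesmuma,ptpcoll"; do
        # $RUN_CMD is intentionally unquoted so it splits into words.
        printf '%-45s [%s]\n' "$v" "$(classify_run "$v" 30 $RUN_CMD)"
    done
}
```

After sourcing the file, setting RUN_CMD to the real launch line and calling sweep prints one line per tested value.
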
>
>
>I can provide other information if necessary.
>
>
>cn1:
>
>(gdb) bt
>0  0x00007fdebd30ac6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>1  0x00007fdebcca64e0 in poll_dispatch (base=0x1d466b0, tv=0x7fff71aab880) at poll.c:165
>2  0x00007fdebcc9b041 in opal_libevent2021_event_base_loop (base=0x1d466b0, flags=2) at event.c:1631
>3  0x00007fdebcc35891 in opal_progress () at runtime/opal_progress.c:169
>4  0x00007fdeb32f78cb in opal_condition_wait (c=0x7fdebdb51bc0 <ompi_request_cond>, m=0x7fdebdb51cc0 <ompi_request_lock>) at ../../../../opal/threads/condition.h:78
>5  0x00007fdeb32f79b8 in ompi_request_wait_completion (req=0x7fff71aab920) at ../../../../ompi/request/request.h:381
>6  0x00007fdeb32f84b8 in mca_pml_ob1_recv (addr=0x7fff71aabd80, count=1, datatype=0x6026c0 <ompi_mpi_int>, src=1, tag=0, comm=0x6020a0 <ompi_mpi_comm_world>, status=0x7fff71aabd90) at pml_ob1_irecv.c:109
>7  0x00007fdebd88f54d in PMPI_Recv (buf=0x7fff71aabd80, count=1, type=0x6026c0 <ompi_mpi_int>, source=1, tag=0, comm=0x6020a0 <ompi_mpi_comm_world>, status=0x7fff71aabd90) at precv.c:78
>8  0x0000000000400c44 in main (argc=1, argv=0x7fff71aabe98) at hellompi.c:33
>
>
>cn2:
>
>(gdb) bt
>0  0x00007fa65aa78c6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>1  0x00007fa65a4144e0 in poll_dispatch (base=0x20e96b0, tv=0x7fff46f44a80) at poll.c:165
>2  0x00007fa65a409041 in opal_libevent2021_event_base_loop (base=0x20e96b0, flags=2) at event.c:1631
>3  0x00007fa65a3a3891 in opal_progress () at runtime/opal_progress.c:169
>4  0x00007fa65afbbc25 in opal_condition_wait (c=0x7fa65b2bfbc0 <ompi_request_cond>, m=0x7fa65b2bfcc0 <ompi_request_lock>) at ../opal/threads/condition.h:78
>5  0x00007fa65afbc1b5 in ompi_request_default_wait_all (count=2, requests=0x7fff46f44c70, statuses=0x0) at request/req_wait.c:287
>6  0x00007fa65afc7906 in comm_allgather_pml (src_buf=0x7fff46f44da0, dest_buf=0x233dac0, count=288, dtype=0x7fa65b29fee0 <ompi_mpi_char>, my_rank_in_group=1, n_peers=2, ranks_in_comm=0x210a760, comm=0x6020a0 <ompi_mpi_comm_world>) at patterns/comm/allgather.c:250
>7  0x00007fa64f14ba08 in bcol_basesmuma_smcm_allgather_connection (sm_bcol_module=0x7fa64e64d010, module=0x232c800, peer_list=0x7fa64f3513e8 <mca_bcol_basesmuma_component+456>, back_files=0x7fa64eae2690, comm=0x6020a0 <ompi_mpi_comm_world>, input=..., base_fname=0x7fa64f14ca8c "sm_ctl_mem_", map_all=false) at bcol_basesmuma_smcm.c:205
>8  0x00007fa64f146525 in base_bcol_basesmuma_setup_ctl (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220 <mca_bcol_basesmuma_component>) at bcol_basesmuma_setup.c:344
>9  0x00007fa64f146cbb in base_bcol_basesmuma_setup_library_buffers (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220 <mca_bcol_basesmuma_component>) at bcol_basesmuma_setup.c:550
>10 0x00007fa64f1418d0 in mca_bcol_basesmuma_comm_query (module=0x232c800, num_modules=0x232e570) at bcol_basesmuma_module.c:532
>11 0x00007fa64fd9e5f2 in mca_coll_ml_tree_hierarchy_discovery (ml_module=0x232fbe0, topo=0x232fd98, n_hierarchies=3, exclude_sbgp_name=0x0, include_sbgp_name=0x0) at coll_ml_module.c:1964
>12 0x00007fa64fd9f3a3 in mca_coll_ml_fulltree_hierarchy_discovery (ml_module=0x232fbe0, n_hierarchies=3) at coll_ml_module.c:2211
>13 0x00007fa64fd9cbe4 in ml_discover_hierarchy (ml_module=0x232fbe0) at coll_ml_module.c:1518
>14 0x00007fa64fda164f in mca_coll_ml_comm_query (comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358) at coll_ml_module.c:2970
>15 0x00007fa65b02f6aa in query_2_0_0 (component=0x7fa64fffe4e0 <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358, module=0x7fff46f45390) at base/coll_base_comm_select.c:374
>16 0x00007fa65b02f66e in query (component=0x7fa64fffe4e0 <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358, module=0x7fff46f45390) at base/coll_base_comm_select.c:357
>17 0x00007fa65b02f581 in check_one_component (comm=0x6020a0 <ompi_mpi_comm_world>, component=0x7fa64fffe4e0 <mca_coll_ml_component>, module=0x7fff46f45390) at base/coll_base_comm_select.c:319
>18 0x00007fa65b02f3c7 in check_components (components=0x7fa65b2a9530 <ompi_coll_base_framework+80>, comm=0x6020a0 <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:283
>19 0x00007fa65b027d45 in mca_coll_base_comm_select (comm=0x6020a0 <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:119
>20 0x00007fa65afbdb2c in ompi_mpi_init (argc=1, argv=0x7fff46f45a78, requested=0, provided=0x7fff46f4590c) at runtime/ompi_mpi_init.c:858
>21 0x00007fa65aff20ef in PMPI_Init (argc=0x7fff46f4594c, argv=0x7fff46f45940) at pinit.c:84
>22 0x0000000000400b66 in main (argc=1, argv=0x7fff46f45a78) at hellompi.c:11
>
>
>
>
>-- 
>С Уважением, Поляков Артем Юрьевич
>Best regards, Artem Y. Polyakov 
>
