Hey, Lena :).

2014-10-17 22:07 GMT+07:00 Elena Elkina <elena.elk...@itseez.com>:
> Hi Artem,
>
> Actually some time ago there was a known issue with coll ml. I used to run
> my command lines with -mca coll ^ml to avoid these problems, so I don't
> know if it was fixed or not. It looks like you have the same problem.
>

But mine is with the bcol framework, not coll. And as you can see, the modules themselves don't break the program, only some of their combinations. I am also curious why the basesmuma module is listed twice.

> Best regards,
> Elena
>
> On Fri, Oct 17, 2014 at 7:01 PM, Artem Polyakov <artpo...@gmail.com>
> wrote:
>
>> Gilles,
>>
>> I checked your patch and it doesn't solve the problem I observe. I think
>> the reason is somewhere else.
>>
>> 2014-10-17 19:13 GMT+07:00 Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com>:
>>
>>> Artem,
>>>
>>> There is a known issue #235 with modex and I made PR #238 with a
>>> tentative fix.
>>>
>>> Could you please give it a try and report whether it solves your problem?
>>>
>>> Cheers
>>>
>>> Gilles
>>>
>>>
>>> Artem Polyakov <artpo...@gmail.com> wrote:
>>> Hello, I have trouble with the latest trunk if I use PMI1.
>>>
>>> For example, if I use 2 nodes the application hangs. See backtraces from
>>> both nodes below. From them I can see that the second (non-launching) node
>>> hangs in bcol component selection. Here is the default setting of the
>>> bcol_base_string parameter:
>>> bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll,ugni"
>>> according to ompi_info. I don't know whether it is correct that basesmuma
>>> is duplicated.
>>>
>>> Experiments with this parameter showed that it directly influences the
>>> bug:
>>> export OMPI_MCA_bcol_base_string="" # [SEGFAULT]
>>> export OMPI_MCA_bcol_base_string="ptpcoll" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,ptpcoll,iboffload,ugni" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll,iboffload,ugni" # [HANG]
>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ptpcoll" # [HANG]
>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,iboffload,ugni" # [OK]
>>> export OMPI_MCA_bcol_base_string="basesmuma,basesmuma,ptpcoll" # [HANG]
>>> export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma" # [OK]
>>> export OMPI_MCA_bcol_base_string="ptpcoll,basesmuma,basesmuma" # [HANG]
>>>
>>> I can provide other information if necessary.
>>>
>>> cn1:
>>> (gdb) bt
>>> 0 0x00007fdebd30ac6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>>> 1 0x00007fdebcca64e0 in poll_dispatch (base=0x1d466b0, tv=0x7fff71aab880) at poll.c:165
>>> 2 0x00007fdebcc9b041 in opal_libevent2021_event_base_loop (base=0x1d466b0, flags=2) at event.c:1631
>>> 3 0x00007fdebcc35891 in opal_progress () at runtime/opal_progress.c:169
>>> 4 0x00007fdeb32f78cb in opal_condition_wait (c=0x7fdebdb51bc0 <ompi_request_cond>, m=0x7fdebdb51cc0 <ompi_request_lock>) at ../../../../opal/threads/condition.h:78
>>> 5 0x00007fdeb32f79b8 in ompi_request_wait_completion (req=0x7fff71aab920) at ../../../../ompi/request/request.h:381
>>> 6 0x00007fdeb32f84b8 in mca_pml_ob1_recv (addr=0x7fff71aabd80, count=1, datatype=0x6026c0 <ompi_mpi_int>, src=1, tag=0, comm=0x6020a0 <ompi_mpi_comm_world>, status=0x7fff71aabd90) at pml_ob1_irecv.c:109
>>> 7 0x00007fdebd88f54d in PMPI_Recv (buf=0x7fff71aabd80, count=1, type=0x6026c0 <ompi_mpi_int>, source=1, tag=0, comm=0x6020a0 <ompi_mpi_comm_world>, status=0x7fff71aabd90) at precv.c:78
>>> 8 0x0000000000400c44 in main (argc=1, argv=0x7fff71aabe98) at hellompi.c:33
>>>
>>> cn2:
>>> (gdb) bt
>>> 0 0x00007fa65aa78c6d in poll () from /lib/x86_64-linux-gnu/libc.so.6
>>> 1 0x00007fa65a4144e0 in poll_dispatch (base=0x20e96b0, tv=0x7fff46f44a80) at poll.c:165
>>> 2 0x00007fa65a409041 in opal_libevent2021_event_base_loop (base=0x20e96b0, flags=2) at event.c:1631
>>> 3 0x00007fa65a3a3891 in opal_progress () at runtime/opal_progress.c:169
>>> 4 0x00007fa65afbbc25 in opal_condition_wait (c=0x7fa65b2bfbc0 <ompi_request_cond>, m=0x7fa65b2bfcc0 <ompi_request_lock>) at ../opal/threads/condition.h:78
>>> 5 0x00007fa65afbc1b5 in ompi_request_default_wait_all (count=2, requests=0x7fff46f44c70, statuses=0x0) at request/req_wait.c:287
>>> 6 0x00007fa65afc7906 in comm_allgather_pml (src_buf=0x7fff46f44da0, dest_buf=0x233dac0, count=288, dtype=0x7fa65b29fee0 <ompi_mpi_char>, my_rank_in_group=1, n_peers=2, ranks_in_comm=0x210a760, comm=0x6020a0 <ompi_mpi_comm_world>) at patterns/comm/allgather.c:250
>>> 7 0x00007fa64f14ba08 in bcol_basesmuma_smcm_allgather_connection (sm_bcol_module=0x7fa64e64d010, module=0x232c800, peer_list=0x7fa64f3513e8 <mca_bcol_basesmuma_component+456>, back_files=0x7fa64eae2690, comm=0x6020a0 <ompi_mpi_comm_world>, input=..., base_fname=0x7fa64f14ca8c "sm_ctl_mem_", map_all=false) at bcol_basesmuma_smcm.c:205
>>> 8 0x00007fa64f146525 in base_bcol_basesmuma_setup_ctl (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220 <mca_bcol_basesmuma_component>) at bcol_basesmuma_setup.c:344
>>> 9 0x00007fa64f146cbb in base_bcol_basesmuma_setup_library_buffers (sm_bcol_module=0x7fa64e64d010, cs=0x7fa64f351220 <mca_bcol_basesmuma_component>) at bcol_basesmuma_setup.c:550
>>> 10 0x00007fa64f1418d0 in mca_bcol_basesmuma_comm_query (module=0x232c800, num_modules=0x232e570) at bcol_basesmuma_module.c:532
>>> 11 0x00007fa64fd9e5f2 in mca_coll_ml_tree_hierarchy_discovery (ml_module=0x232fbe0, topo=0x232fd98, n_hierarchies=3, exclude_sbgp_name=0x0, include_sbgp_name=0x0) at coll_ml_module.c:1964
>>> 12 0x00007fa64fd9f3a3 in mca_coll_ml_fulltree_hierarchy_discovery (ml_module=0x232fbe0, n_hierarchies=3) at coll_ml_module.c:2211
>>> 13 0x00007fa64fd9cbe4 in ml_discover_hierarchy (ml_module=0x232fbe0) at coll_ml_module.c:1518
>>> 14 0x00007fa64fda164f in mca_coll_ml_comm_query (comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358) at coll_ml_module.c:2970
>>> 15 0x00007fa65b02f6aa in query_2_0_0 (component=0x7fa64fffe4e0 <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358, module=0x7fff46f45390) at base/coll_base_comm_select.c:374
>>> 16 0x00007fa65b02f66e in query (component=0x7fa64fffe4e0 <mca_coll_ml_component>, comm=0x6020a0 <ompi_mpi_comm_world>, priority=0x7fff46f45358, module=0x7fff46f45390) at base/coll_base_comm_select.c:357
>>> 17 0x00007fa65b02f581 in check_one_component (comm=0x6020a0 <ompi_mpi_comm_world>, component=0x7fa64fffe4e0 <mca_coll_ml_component>, module=0x7fff46f45390) at base/coll_base_comm_select.c:319
>>> 18 0x00007fa65b02f3c7 in check_components (components=0x7fa65b2a9530 <ompi_coll_base_framework+80>, comm=0x6020a0 <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:283
>>> 19 0x00007fa65b027d45 in mca_coll_base_comm_select (comm=0x6020a0 <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:119
>>> 20 0x00007fa65afbdb2c in ompi_mpi_init (argc=1, argv=0x7fff46f45a78, requested=0, provided=0x7fff46f4590c) at runtime/ompi_mpi_init.c:858
>>> 21 0x00007fa65aff20ef in PMPI_Init (argc=0x7fff46f4594c, argv=0x7fff46f45940) at pinit.c:84
>>> 22 0x0000000000400b66 in main (argc=1, argv=0x7fff46f45a78) at hellompi.c:11
>>>
>>>
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/10/16055.php
>>>
>>
>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/10/16067.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/10/16068.php
>

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
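
P.S. In the experiments quoted above, every value of bcol_base_string that hangs lists basesmuma twice together with ptpcoll. As a possible stopgap, here is a minimal sketch that strips repeated entries from the list before exporting it. The helper name dedup_csv is hypothetical and not part of Open MPI. It is only an assumption that a duplicate-free list avoids the hang: the exact order it produces for the default value was not among the tested combinations, and the duplicate may well be intentional.

```shell
#!/bin/sh
# dedup_csv: print a comma-separated list with repeated entries removed,
# keeping the first occurrence of each name and the original order.
# Hypothetical helper for illustration, not part of Open MPI.
dedup_csv() {
    printf '%s\n' "$1" | tr ',' '\n' | awk 'NF && !seen[$0]++' | paste -sd, -
}

# Default value reported by ompi_info in this thread.
default_bcols="basesmuma,basesmuma,iboffload,ptpcoll,ugni"

# Export a duplicate-free value before launching the MPI job.
OMPI_MCA_bcol_base_string=$(dedup_csv "$default_bcols")
export OMPI_MCA_bcol_base_string
echo "$OMPI_MCA_bcol_base_string"
```

This only hides the symptom. The underlying question of why the duplicated basesmuma entry combined with ptpcoll hangs in component selection still needs a real fix.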