This should also be fixed once we stop firing up the usnic connectivity checker when there are no usNICs present.
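
Until that change lands, a possible interim workaround (untested in this thread) is to keep the usnic BTL from being opened at all, so its connectivity checker never starts. This is a minimal sketch, assuming the rank 1 hang in the backtrace below really comes from the usnic BTL being initialized on nodes with no usNIC hardware. It relies on the standard MCA convention that a leading "^" in a component list excludes the named components; the `./a.out` binary and the process count in the commented-out line are placeholders.

```shell
# Assumption: the hang is caused by the usnic BTL's connectivity checker.
# A leading "^" in an MCA component list means "exclude these components",
# so every BTL except usnic remains eligible for selection.
export OMPI_MCA_btl="^usnic"

# Equivalent per-run form (placeholder binary and process count):
# mpirun --mca btl ^usnic -np 4 ./a.out
```

Unlike OMPI_MCA_btl="openib,self", this still leaves Open MPI free to choose among the remaining BTLs.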

On Jun 1, 2014, at 9:12 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> 2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
> export OMPI_MCA_btl_openib_use_eager_rdma=0
>
> Gilles,
>
> I tested your approach. Both:
> a) export OMPI_MCA_btl_openib_use_eager_rdma=0
> b) applying your patch and running without "export OMPI_MCA_btl_openib_use_eager_rdma=0"
> work well for me.
> This fixes the first part of the problem: the case when OMPI_MCA_btl="openib,self"
>
> However, once I comment out this statement, thus giving OMPI the right to
> decide which BTL to use, the program hangs again. Here is additional
> information that can be useful:
>
> 1. If I set 1 slot per node, this problem doesn't arise.
>
> 2. If I use at least 2 cores per node, I can see this hang.
> Here are the backtraces for all ranks of the hung program:
>
> rank = 0
> (gdb) bt
> #0 0x00000039522df343 in poll () from /lib64/libc.so.6
> #1 0x00007f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2 0x00007f1e4faf601c in opal_libevent2021_event_base_loop (base=0x13973b0, flags=3)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3 0x00007f1e4fa9870a in opal_progress ()
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4 0x00007f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158, requested=0, provided=0x7fff2595cfc8)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5 0x00007f1e500f425e in PMPI_Init (argc=0x7fff2595d02c, argv=0x7fff2595d020) at pinit.c:84
> #6 0x0000000000400a6e in main ()
>
> rank = 1
> (gdb) bt
> #0 0x00000039522accdd in nanosleep () from /lib64/libc.so.6
> #1 0x00000039522e1e54 in usleep () from /lib64/libc.so.6
> >>>>>>>>>>>>> GOTCHA >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> #2 0x00007fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init ()
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
> #3 0x00007fae7a6a4b72 in usnic_component_init (num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false, want_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
> #4 0x00007fae7ed9958f in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
> <<<<<<<<<<<<< GOTCHA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> #5 0x00007fae7b5e6b48 in mca_bml_r2_component_init (priority=0x7fffc0a67d84, enable_progress_threads=false, enable_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
> #6 0x00007fae7ed98362 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
> #7 0x00007fae79e2dcb5 in mca_pml_ob1_component_init (priority=0x7fffc0a67eb0, enable_progress_threads=false, enable_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #8 0x00007fae7edc0251 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
> #9 0x00007fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8, requested=0, provided=0x7fffc0a68038)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
> #10 0x00007fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c, argv=0x7fffc0a68090) at pinit.c:84
> #11 0x0000000000400a6e in main ()
>
> rank = 2
> (gdb) bt
> #0 0x00000038e38df343 in poll () from /lib64/libc.so.6
> #1 0x00007fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2 0x00007fa40340801c in opal_libevent2021_event_base_loop (base=0x25e33b0, flags=3)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3 0x00007fa4033aa70a in opal_progress ()
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4 0x00007fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8, requested=0, provided=0x7fff1a081d58)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5 0x00007fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc, argv=0x7fff1a081db0) at pinit.c:84
> #6 0x0000000000400a6e in main ()
>
> rank = 3
> (gdb) bt
> #0 0x00000038e38df343 in poll () from /lib64/libc.so.6
> #1 0x00007f1ad8de7605 in poll_dispatch (base=0x21a73b0, tv=0x7fff0fa9f7f0)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2 0x00007f1ad8ddc01c in opal_libevent2021_event_base_loop (base=0x21a73b0, flags=3)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3 0x00007f1ad8d7e70a in opal_progress ()
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4 0x00007f1ad93a4b51 in ompi_mpi_init (argc=1, argv=0x7fff0fa9faf8, requested=0, provided=0x7fff0fa9f968)
>     at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5 0x00007f1ad93da25e in PMPI_Init (argc=0x7fff0fa9f9cc, argv=0x7fff0fa9f9c0) at pinit.c:84
> #6 0x0000000000400a6e in main ()
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/14928.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/