This should also be fixed when we stop firing up the usnic connectivity checker 
when there are no usNICs present.
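
Until that change lands, one possible workaround (a sketch, not tested on 
this setup) is to exclude the usnic BTL explicitly while still letting Open 
MPI choose among the remaining BTLs. The "^" prefix is the standard MCA 
exclusion syntax; the mpirun line and ./your_app are placeholders.

```shell
# Exclude the usnic BTL so that its component init, and with it the
# connectivity checker seen in the rank 1 backtrace, should not be run.
# Open MPI still selects among all other available BTLs.
export OMPI_MCA_btl="^usnic"

# Then launch as usual, for example (placeholder application name):
# mpirun -np 4 ./your_app
```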

On Jun 1, 2014, at 9:12 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> 
> 2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com>:
> export OMPI_MCA_btl_openib_use_eager_rdma=0
> 
> Gilles,
> 
> I tested your approach. Both:
> a) export OMPI_MCA_btl_openib_use_eager_rdma=0 
> b) applying your patch and running without "export 
> OMPI_MCA_btl_openib_use_eager_rdma=0" 
> work well for me. 
> This fixes the first part of the problem: when OMPI_MCA_btl="openib,self"
> 
> However, once I comment out this statement, thus giving OMPI the right to 
> decide which BTL to use, the program hangs again. Here is additional 
> information that may be useful:
> 
> 1. If I set 1 slot per node, this problem doesn't arise.
> 
> 2. If I use at least 2 cores per node, I can see this hang. 
> Here are the backtraces for all ranks of the hung program:
> 
> rank = 0
> (gdb) bt
> #0  0x00000039522df343 in poll () from /lib64/libc.so.6
> #1  0x00007f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x00007f1e4faf601c in opal_libevent2021_event_base_loop (base=0x13973b0, 
> flags=3)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3  0x00007f1e4fa9870a in opal_progress () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4  0x00007f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158, 
> requested=0, provided=0x7fff2595cfc8)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5  0x00007f1e500f425e in PMPI_Init (argc=0x7fff2595d02c, 
> argv=0x7fff2595d020) at pinit.c:84
> #6  0x0000000000400a6e in main ()
> 
> rank = 1
> (gdb) bt
> #0  0x00000039522accdd in nanosleep () from /lib64/libc.so.6
> #1  0x00000039522e1e54 in usleep () from /lib64/libc.so.6
> >>>>>>>>>>>>> GOTCHA >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> #2  0x00007fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
> #3  0x00007fae7a6a4b72 in usnic_component_init 
> (num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false, 
> want_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
> #4  0x00007fae7ed9958f in mca_btl_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
> <<<<<<<<<<<<< GOTCHA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> #5  0x00007fae7b5e6b48 in mca_bml_r2_component_init (priority=0x7fffc0a67d84, 
> enable_progress_threads=false, enable_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
> #6  0x00007fae7ed98362 in mca_bml_base_init (enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
> #7  0x00007fae79e2dcb5 in mca_pml_ob1_component_init 
> (priority=0x7fffc0a67eb0, enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #8  0x00007fae7edc0251 in mca_pml_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
> #9  0x00007fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8, 
> requested=0, provided=0x7fffc0a68038)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
> #10 0x00007fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c, 
> argv=0x7fffc0a68090) at pinit.c:84
> #11 0x0000000000400a6e in main ()
> 
> rank=2
> (gdb) bt
> #0  0x00000038e38df343 in poll () from /lib64/libc.so.6
> #1  0x00007fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x00007fa40340801c in opal_libevent2021_event_base_loop (base=0x25e33b0, 
> flags=3)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3  0x00007fa4033aa70a in opal_progress () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4  0x00007fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8, 
> requested=0, provided=0x7fff1a081d58)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5  0x00007fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc, 
> argv=0x7fff1a081db0) at pinit.c:84
> #6  0x0000000000400a6e in main ()
> 
> 
> rank=3
> (gdb) bt
> #0  0x00000038e38df343 in poll () from /lib64/libc.so.6
> #1  0x00007f1ad8de7605 in poll_dispatch (base=0x21a73b0, tv=0x7fff0fa9f7f0)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x00007f1ad8ddc01c in opal_libevent2021_event_base_loop (base=0x21a73b0, 
> flags=3)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3  0x00007f1ad8d7e70a in opal_progress () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4  0x00007f1ad93a4b51 in ompi_mpi_init (argc=1, argv=0x7fff0fa9faf8, 
> requested=0, provided=0x7fff0fa9f968)
>     at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5  0x00007f1ad93da25e in PMPI_Init (argc=0x7fff0fa9f9cc, 
> argv=0x7fff0fa9f9c0) at pinit.c:84
> #6  0x0000000000400a6e in main ()
> 
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14928.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
