2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:

> export OMPI_MCA_btl_openib_use_eager_rdma=0


Gilles,

I tested your approach. Both of the following work well for me:
a) export OMPI_MCA_btl_openib_use_eager_rdma=0
b) applying your patch and running without "export
OMPI_MCA_btl_openib_use_eager_rdma=0"
This fixes the first part of the problem: the case when
OMPI_MCA_btl="openib,self" is set.
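
For reference, the working setup for case (a) can be written out as a small
shell fragment. This is only a sketch of the environment described above; the
launch line is hypothetical (process count and binary name are assumptions)
and is left commented out.

```shell
# Restrict Open MPI to the openib and self BTLs (the case that now works).
export OMPI_MCA_btl="openib,self"

# Case (a): disable eager RDMA in the openib BTL.
# For case (b), apply the patch and leave this variable unset.
export OMPI_MCA_btl_openib_use_eager_rdma=0

# Hypothetical launch line, shown for illustration only:
# mpirun -np 4 ./a.out
```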

However, once I comment out this setting, thus letting OMPI decide which
BTL to use, the program hangs again. Here is some additional information
that may be useful:

1. If I set 1 slot per node, this problem doesn't arise.

2. If I use at least 2 cores per node, I can see this hang.
Here are the backtraces for all ranks of the hung program:

rank = 0
(gdb) bt
#0  0x00000039522df343 in poll () from /lib64/libc.so.6
#1  0x00007f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007f1e4faf601c in opal_libevent2021_event_base_loop
(base=0x13973b0, flags=3)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007f1e4fa9870a in opal_progress () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158,
requested=0, provided=0x7fff2595cfc8)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007f1e500f425e in PMPI_Init (argc=0x7fff2595d02c,
argv=0x7fff2595d020) at pinit.c:84
#6  0x0000000000400a6e in main ()

rank = 1
(gdb) bt
#0  0x00000039522accdd in nanosleep () from /lib64/libc.so.6
#1  0x00000039522e1e54 in usleep () from /lib64/libc.so.6
>>>>>>>>>>>>> GOTCHA >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#2  0x00007fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
#3  0x00007fae7a6a4b72 in usnic_component_init
(num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false,
want_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
#4  0x00007fae7ed9958f in mca_btl_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
<<<<<<<<<<<<< GOTCHA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#5  0x00007fae7b5e6b48 in mca_bml_r2_component_init
(priority=0x7fffc0a67d84, enable_progress_threads=false,
enable_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
#6  0x00007fae7ed98362 in mca_bml_base_init (enable_progress_threads=false,
enable_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
#7  0x00007fae79e2dcb5 in mca_pml_ob1_component_init
(priority=0x7fffc0a67eb0, enable_progress_threads=false,
enable_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
#8  0x00007fae7edc0251 in mca_pml_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
#9  0x00007fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8,
requested=0, provided=0x7fffc0a68038)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
#10 0x00007fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c,
argv=0x7fffc0a68090) at pinit.c:84
#11 0x0000000000400a6e in main ()

rank = 2
(gdb) bt
#0  0x00000038e38df343 in poll () from /lib64/libc.so.6
#1  0x00007fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007fa40340801c in opal_libevent2021_event_base_loop
(base=0x25e33b0, flags=3)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007fa4033aa70a in opal_progress () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8,
requested=0, provided=0x7fff1a081d58)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc,
argv=0x7fff1a081db0) at pinit.c:84
#6  0x0000000000400a6e in main ()


rank = 3
(gdb) bt
#0  0x00000038e38df343 in poll () from /lib64/libc.so.6
#1  0x00007f1ad8de7605 in poll_dispatch (base=0x21a73b0, tv=0x7fff0fa9f7f0)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007f1ad8ddc01c in opal_libevent2021_event_base_loop
(base=0x21a73b0, flags=3)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007f1ad8d7e70a in opal_progress () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007f1ad93a4b51 in ompi_mpi_init (argc=1, argv=0x7fff0fa9faf8,
requested=0, provided=0x7fff0fa9f968)
    at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007f1ad93da25e in PMPI_Init (argc=0x7fff0fa9f9cc,
argv=0x7fff0fa9f9c0) at pinit.c:84
#6  0x0000000000400a6e in main ()
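
The rank 1 backtrace shows that process sleeping inside
ompi_btl_usnic_connectivity_client_init while the other ranks wait in
ompi_mpi_init. One experiment that might help narrow this down, assuming the
usnic BTL is indeed what blocks startup, is to exclude that component and raise
the BTL verbosity. This is an untested sketch, not a confirmed fix; the launch
line is hypothetical and left commented out.

```shell
# Let Open MPI choose among all BTLs except usnic ("^" means "exclude").
export OMPI_MCA_btl="^usnic"

# Ask the BTL framework to log which components it opens and selects.
export OMPI_MCA_btl_base_verbose=100

# Hypothetical launch line, shown for illustration only:
# mpirun -np 4 ./a.out
```

If the hang disappears with usnic excluded, that would point at the usnic
connectivity client rather than at openib.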




-- 
Best regards, Artem Y. Polyakov
