2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
> export OMPI_MCA_btl_openib_use_eager_rdma=0

Gilles,

I tested your approach. Both

a) export OMPI_MCA_btl_openib_use_eager_rdma=0
b) applying your patch and running without "export OMPI_MCA_btl_openib_use_eager_rdma=0"

work well for me. This fixes the first part of the problem: the case where OMPI_MCA_btl="openib,self" is set. However, once I comment out that statement, thus letting OMPI decide which BTL to use, the program hangs again.

Here is additional information that may be useful:

1. If I set 1 slot per node, the problem doesn't arise.
2. If I use at least 2 cores per node, I can see the hang.

Here are the backtraces for all ranks of the hung program:

rank = 0
(gdb) bt
#0  0x00000039522df343 in poll () from /lib64/libc.so.6
#1  0x00007f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007f1e4faf601c in opal_libevent2021_event_base_loop (base=0x13973b0, flags=3)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007f1e4fa9870a in opal_progress ()
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158, requested=0, provided=0x7fff2595cfc8)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007f1e500f425e in PMPI_Init (argc=0x7fff2595d02c, argv=0x7fff2595d020) at pinit.c:84
#6  0x0000000000400a6e in main ()

rank = 1
(gdb) bt
#0  0x00000039522accdd in nanosleep () from /lib64/libc.so.6
#1  0x00000039522e1e54 in usleep () from /lib64/libc.so.6
>>>>>>>>>>>>> GOTCHA >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#2  0x00007fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init ()
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
#3  0x00007fae7a6a4b72 in usnic_component_init (num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false, want_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
#4  0x00007fae7ed9958f in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
<<<<<<<<<<<<< GOTCHA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#5  0x00007fae7b5e6b48 in mca_bml_r2_component_init (priority=0x7fffc0a67d84, enable_progress_threads=false, enable_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
#6  0x00007fae7ed98362 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
#7  0x00007fae79e2dcb5 in mca_pml_ob1_component_init (priority=0x7fffc0a67eb0, enable_progress_threads=false, enable_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
#8  0x00007fae7edc0251 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
#9  0x00007fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8, requested=0, provided=0x7fffc0a68038)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
#10 0x00007fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c, argv=0x7fffc0a68090) at pinit.c:84
#11 0x0000000000400a6e in main ()

rank = 2
(gdb) bt
#0  0x00000038e38df343 in poll () from /lib64/libc.so.6
#1  0x00007fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007fa40340801c in opal_libevent2021_event_base_loop (base=0x25e33b0, flags=3)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007fa4033aa70a in opal_progress ()
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8, requested=0, provided=0x7fff1a081d58)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc, argv=0x7fff1a081db0) at pinit.c:84
#6  0x0000000000400a6e in main ()

rank = 3
(gdb) bt
#0  0x00000038e38df343 in poll () from /lib64/libc.so.6
#1  0x00007f1ad8de7605 in poll_dispatch (base=0x21a73b0, tv=0x7fff0fa9f7f0)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007f1ad8ddc01c in opal_libevent2021_event_base_loop (base=0x21a73b0, flags=3)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x00007f1ad8d7e70a in opal_progress ()
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x00007f1ad93a4b51 in ompi_mpi_init (argc=1, argv=0x7fff0fa9faf8, requested=0, provided=0x7fff0fa9f968)
    at /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x00007f1ad93da25e in PMPI_Init (argc=0x7fff0fa9f9cc, argv=0x7fff0fa9f9c0) at pinit.c:84
#6  0x0000000000400a6e in main ()

--
Best regards,
Artem Y. Polyakov