No surprise there - that's known behavior. As has been said, we hope to extend THREAD_MULTIPLE support in the 1.9 series.
On Mon, Dec 2, 2013 at 6:33 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

> Hi,
>
> I just opened a new "chapter" with the same subject. ;-)
>
> We are using OpenMPI 1.6.5 (compiled with --enable-thread-multiple) with
> Petsc 3.4.3 (on the colosse supercomputer:
> http://www.calculquebec.ca/en/resources/compute-servers/colosse). We
> observed a deadlock with threads within the openib btl.
>
> We successfully bypassed the deadlock in two different ways:
>
> #1- launching the code with "--mca btl ^openib"
>
> #2- compiling OpenMPI 1.6.5 *without* the "--enable-thread-multiple"
> option.
>
> When the code hangs, here are some backtraces (on different processes)
> that we got:
>
> #0  0x00007fb4a6a03795 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x00007fb49db7ea7b in ?? () from /usr/lib64/libmlx4-rdmav2.so
> #2  0x00007fb4a878d469 in ibv_poll_cq () at /usr/include/infiniband/verbs.h:884
> #3  poll_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3563
> #4  progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3694
> #5  btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #6  0x00007fb4a8973d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #7  0x00007fb4a87404f0 in opal_condition_wait (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #8  ompi_request_default_wait_all (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
>
>
> #0  0x00007f731d1100b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
> #1  0x00007f731ee9b3b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
> #2  progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
> #3  btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4  0x00007f731f081d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5  0x00007f731ee4e4f0 in opal_condition_wait (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6  ompi_request_default_wait_all (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7  0x00007f731eec2644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0xca4ec20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8  0x00007f731eebe2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0x159d8330) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9  0x00007f731ee5cad9 in PMPI_Allreduce (sendbuf=0x1875fd0, recvbuf=0x0, count=25649104, datatype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80) at pallreduce.c:105
>
>
> #0  opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:206
> #1  0x00007f8e3d8844f0 in opal_condition_wait (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #2  ompi_request_default_wait_all (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #3  0x00007f8e3d8f8644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0xcb86ce0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #4  0x00007f8e3d8f42ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0x171d59a0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #5  0x00007f8e3d892ad9 in PMPI_Allreduce (sendbuf=0x0, recvbuf=0x0, count=1037994528, datatype=0x1, op=0x0, comm=0x60bb) at pallreduce.c:105
>
>
> #0  0x00007f7ef7d0b258 in pthread_mutex_lock@plt () from /software/MPI/openmpi/1.6.5_intel/lib/libmpi.so.1
> #1  0x00007f7ef7d72377 in opal_mutex_lock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:109
> #2  progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3650
> #3  btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4  0x00007f7ef7f58d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5  0x00007f7ef7d254f0 in opal_condition_wait (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6  ompi_request_default_wait_all (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7  0x00007f7ef7d99644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0xca6a360) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8  0x00007f7ef7d952ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0x14512a20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9  0x00007f7ef7d33ad9 in PMPI_Allreduce (sendbuf=0x1870390, recvbuf=0x0, count=-130924024, datatype=0x0, op=0x1874cb0, comm=0x60bc) at pallreduce.c:105
>
>
> #0  0x00007f1fe7bcd0b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
> #1  0x00007f1fe99583b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
> #2  progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
> #3  btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4  0x00007f1fe9b3ed32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5  0x00007f1fe990b4f0 in opal_condition_wait (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6  ompi_request_default_wait_all (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7  0x00007f1fe997f644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0xccda900) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8  0x00007f1fe997b2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0x170dbf00) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9  0x00007f1fe9919ad9 in PMPI_Allreduce (sendbuf=0x18788b0, recvbuf=0x0, count=25659568, datatype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6) at pallreduce.c:105
>
> Attached is the "ompi_info -all" output.
>
> Here is the command line:
>
> "mpiexec -mca mpi_show_mca_params all -mca oob_tcp_peer_retries 1000
> --output-filename PneuSurfaceLibre.out --timestamp-output
> --report-bindings -mca orte_num_sockets 2 -mca orte_num_cores 4
> --bind-to-socket -npersocket 1 our_housecode_executable_based_on_petsc_343
> and_parameters..."
>
> Hope it can help to debug!
>
> Thanks!
>
> Eric
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
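For anyone else hitting this hang, the two workarounds described above can be applied roughly as follows. This is only a sketch: the `--prefix` path and the `-np 8 ./your_application` invocation are illustrative placeholders, not taken from Eric's setup, and the other `mpiexec` options from his command line would be kept as-is.

```shell
# Workaround #1: disable the openib BTL at launch time, so the run
# falls back to other transports (e.g. TCP, shared memory) and never
# enters the openib progress path where the deadlock occurs.
mpiexec --mca btl ^openib -np 8 ./your_application

# Workaround #2: rebuild Open MPI 1.6.5 *without* thread-multiple
# support (simply leave out --enable-thread-multiple at configure time).
./configure --prefix=$HOME/openmpi-1.6.5-nothreads
make -j4 && make install
```

Note that workaround #2 means `MPI_Init_thread` can no longer grant `MPI_THREAD_MULTIPLE`, so it is only viable if the application can run at a lower thread-support level.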