Hi Doug,
Wow, it looks like some messages are getting lost (or even delivered to the wrong peer on the same node...). Could you also try with:

-mca coll_base_verbose 1 -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_bcast_algorithm <1,2,3,4,5,6>

The values 1-6 control which topology/algorithm is used internally.

Once we figure out which topology/sequence causes this we can look to see if it's a collective issue or a btl, bml, or pml issue.
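
For example, a single run forcing one particular algorithm (reusing the mpirun command from your mail below; repeat the run with each value 1 through 6) would look something like:

        mpirun -np 9 -mca coll_base_verbose 1 \
               -mca coll_tuned_use_dynamic_rules 1 \
               -mca coll_tuned_bcast_algorithm 2 ./a.out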

thanks
G
On Thu, 29 Jun 2006, Doug Gregor wrote:

I am running into a problem with a simple program (which performs several MPI_Bcast operations) hanging. Most processes hang in MPI_Finalize, the others hang in MPI_Bcast. Interestingly enough, this only happens when I oversubscribe the nodes. For instance, using IU's Odin cluster, I take 4 nodes (each has two Opteron processors) and run 9 processes:

        mpirun -np 9 ./a.out
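
In outline the program is just a short loop of broadcasts; the attached source has the real buffer sizes and iteration counts, so the sketch below is only a simplified stand-in:

        // bcast_hang.cpp -- simplified stand-in for the attached reproducer:
        // a handful of MPI_Bcast calls followed by MPI_Finalize.  The buffer
        // size and iteration count here are placeholders, not the real values.
        #include <mpi.h>
        #include <vector>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            std::vector<int> buf(1024, rank);
            for (int i = 0; i < 10; ++i) {
                // rank 0 broadcasts to all ranks; the hang shows up inside
                // one of these calls on some ranks and inside MPI_Finalize on
                // the rest, but only when the nodes are oversubscribed
                MPI_Bcast(&buf[0], static_cast<int>(buf.size()), MPI_INT, 0,
                          MPI_COMM_WORLD);
            }

            MPI_Finalize();
            return 0;
        }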

The backtraces from 7 of the 9 processes show that they're in MPI_Finalize:

#0  0x0000003d1b92e813 in sigprocmask () from /lib64/tls/libc.so.6
#1  0x0000002a9598f55f in poll_dispatch ()
  from /san/mpi/openmpi-1.1-gcc/lib/libopal.so.0
#2  0x0000002a9598e3f3 in opal_event_loop ()
  from /san/mpi/openmpi-1.1-gcc/lib/libopal.so.0
#3  0x0000002a960487c4 in mca_oob_tcp_msg_wait ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_oob_tcp.so
#4  0x0000002a9604ca13 in mca_oob_tcp_recv ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_oob_tcp.so
#5  0x0000002a9585d833 in mca_oob_recv_packed ()
  from /san/mpi/openmpi-1.1-gcc/lib/liborte.so.0
#6  0x0000002a9585dd37 in mca_oob_xcast ()
  from /san/mpi/openmpi-1.1-gcc/lib/liborte.so.0
#7  0x0000002a956cbfb0 in ompi_mpi_finalize ()
  from /san/mpi/openmpi-1.1-gcc/lib/libmpi.so.0
#8  0x000000000040bd3e in main ()

The other two processes are in MPI_Bcast:

#0  0x0000002a97c2cbe3 in mca_btl_mvapi_component_progress ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_btl_mvapi.so
#1  0x0000002a97b21072 in mca_bml_r2_progress ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_bml_r2.so
#2  0x0000002a95988a4a in opal_progress ()
  from /san/mpi/openmpi-1.1-gcc/lib/libopal.so.0
#3  0x0000002a97a13fe7 in mca_pml_ob1_recv ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_pml_ob1.so
#4  0x0000002a9846d0aa in ompi_coll_tuned_bcast_intra_chain ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_coll_tuned.so
#5  0x0000002a9846d100 in ompi_coll_tuned_bcast_intra_pipeline ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_coll_tuned.so
#6  0x0000002a9846a3d7 in ompi_coll_tuned_bcast_intra_dec_fixed ()
  from /san/mpi/openmpi-1.1-gcc/lib/openmpi/mca_coll_tuned.so
#7  0x0000002a956deae3 in PMPI_Bcast ()
  from /san/mpi/openmpi-1.1-gcc/lib/libmpi.so.0
#8  0x000000000040bcc7 in main ()

Other random information:
        - The two processes stuck in MPI_Bcast are not on the same node. This has been the case both times I've gone through the backtraces, but I can't conclude that it's a necessary condition.
        - If I force the use of the "basic" MCA component for collectives, this problem does not occur (one way to do that is shown after this list).
        - If I don't oversubscribe the nodes, things seem to work properly.
        - The C++ program source and the output of ompi_info are attached.
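
One way to force the basic collectives (not necessarily the exact flags I used) is to restrict the coll framework selection when launching:

        mpirun -np 9 -mca coll basic,self ./a.out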

This should be easy to reproduce for anyone with access to Odin. I'm using Open MPI 1.1 configured with no special options. It is available as the module "mpi/openmpi-1.1-gcc" on the cluster. I'm using SLURM interactively to allocate the nodes before executing mpirun:

        srun -A -N 4

        Cheers,
        Doug Gregor




Thanks,
        Graham.
----------------------------------------------------------------------
Dr Graham E. Fagg       | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept   | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: f...@cs.utk.edu  | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
----------------------------------------------------------------------
