On May 19, 2009, at 9:32 AM, Ashley Pittman wrote:

> Can you confirm that *all* processes are in PMPI_Allreduce at some
> point?  The collectives commonly get blamed for a lot of hangs, and
> they're not always the correct place to look.

For the Open MPI run, every single process showed one of those
two stack traces, mostly the first one.
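
In case it's useful, one way to confirm that every rank really enters the
collective (rather than relying on sampled stack traces) would be a PMPI
profiling wrapper that logs entry and exit per rank.  A rough sketch is
below -- the file name and message format are just illustrative, and whether
the Fortran bindings route through the C MPI_Allreduce symbol depends on
how the MPI library was built:

/* allreduce_log.c - rough sketch of a PMPI profiling wrapper that logs
 * when each rank enters and leaves MPI_Allreduce.  Note: MPI-2 era
 * headers declare sendbuf as plain "void *" rather than "const void *",
 * so match the prototype in your mpi.h. */
#include <mpi.h>
#include <stdio.h>

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    int rank, rc;

    PMPI_Comm_rank(comm, &rank);
    fprintf(stderr, "[rank %d] entering MPI_Allreduce, count=%d\n", rank, count);
    rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    fprintf(stderr, "[rank %d] leaving MPI_Allreduce\n", rank);
    return rc;
}

Linking that in ahead of the MPI library (or preloading it) should show
which ranks, if any, never arrive at the call.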


>> P.S. I get a similar hang with MVAPICH, in a nearby but different
>> part of the code (on an MPI_Bcast, specifically), increasing my
>> tendency to believe that it's OFED's fault.  But maybe the stack
>> trace will suggest to someone where it might be stuck, and therefore
>> perhaps an mca flag to try?

> This strikes me as a filesystem problem more than MPI per se.  Again,
> with MVAPICH, are all your processes in MPI_Bcast or just some of them?

I'd suspect the filesystem too, except that it's hung up in an MPI
call.  As I said before, the whole thing is bizarre.  It doesn't matter
where the executable is, just what the CWD is (i.e. I can do
mpirun /scratch/exec or mpirun /home/bernstei/exec, but if the CWD is
in /scratch it'll hang).  And I've been running other codes both from
NFS and from scratch directories for months, and never had a problem.
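
For what it's worth, a quick check that mpirun really gives every rank
the working directory you expect (e.g. that a node-local /scratch path
exists on the remote nodes too) could be something like the sketch below;
the file name is made up:

/* cwd_check.c - rough sketch: print each rank's hostname and working
 * directory, to verify that mpirun sets the same CWD on every node. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    char cwd[PATH_MAX], host[256];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    if (getcwd(cwd, sizeof(cwd)) == NULL)
        snprintf(cwd, sizeof(cwd), "(getcwd failed)");
    printf("rank %d on %s: cwd = %s\n", rank, host, cwd);
    MPI_Finalize();
    return 0;
}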

Using MVAPICH, every process is stuck in a collective, but they're not
all stuck in the same collective (see stack traces below).  The 2
processes on the head node are stuck in mpi_bcast, in various low-level
MPI routines.  The other 6 processes are stuck in an mpi_allreduce,
again in various low-level MPI routines.  I don't know enough about the
code to tell whether they're all supposed to be part of the same
communicator, and the fact that they're stuck in different collectives
is suspicious.  I can look into that.

So yes, it does seem to be a problem with collective communication.
But a very weird one.
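
If the two groups really are in different collectives on the same
communicator, that by itself would explain the hang.  Just to illustrate
the pattern (a contrived example, not CP2K code): if some ranks enter a
broadcast while the rest enter a reduction on the same communicator,
everyone can block exactly the way the traces below show.

/* mismatch.c - contrived illustration (not CP2K code) of a mismatched
 * collective: ranks 0-1 call MPI_Bcast while the rest call MPI_Allreduce
 * on the same communicator, so the program deliberately deadlocks with
 * some ranks inside Bcast and the rest inside Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double val = 1.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank < 2)
        MPI_Bcast(&val, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* "head node" ranks */
    else
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);  /* the others */

    printf("rank %d of %d got past the collective\n", rank, size);
    MPI_Finalize();
    return 0;
}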

                                                                        Noam

#0  0x0000000001b2c120 in MPIDI_CH3I_read_progress ()
#1  0x0000000001b2be44 in MPIDI_CH3I_Progress ()
#2  0x0000000001b0686b in MPIC_Wait ()
#3  0x0000000001b072a6 in MPIC_Send ()
#4  0x0000000001b01b16 in MPIR_Bcast ()
#5  0x0000000001b033ad in PMPI_Bcast ()
#6  0x0000000001b1ec52 in pmpi_bcast_ ()
#7  0x00000000007098d4 in message_passing_mp_mp_bcast_rm_ ()
#8  0x000000000091f9c0 in qs_mo_types_mp_read_mos_restart_low_ ()
#9  0x0000000000922485 in qs_mo_types_mp_read_mo_set_from_restart_ ()
#10 0x000000000158b00e in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#11 0x0000000000a013c5 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#12 0x00000000009fc78a in qs_scf_mp_init_scf_run_ ()
#13 0x00000000009e81bd in qs_scf_mp_scf_ ()
#14 0x0000000000847ed3 in qs_energy_mp_qs_energies_ ()
#15 0x0000000000856e5e in qs_force_mp_qs_forces_ ()
#16 0x00000000004b904b in force_env_methods_mp_force_env_calc_energy_force_ ()
#17 0x00000000004b899e in force_env_methods_mp_force_env_calc_energy_force_ ()
#18 0x00000000006c4ee4 in md_run_mp_qs_mol_dyn_ ()
#19 0x000000000040c3d2 in cp2k_runs_mp_cp2k_run_ ()
#20 0x000000000040af1a in cp2k_runs_mp_run_input_ ()
#21 0x0000000000409df9 in MAIN__ ()
#22 0x0000000000408e0c in main ()


#0  0x0000000001b3d4e4 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
#1  0x0000000001b2c1ae in MPIDI_CH3I_read_progress ()
#2  0x0000000001b2be44 in MPIDI_CH3I_Progress ()
#3  0x0000000001b0686b in MPIC_Wait ()
#4  0x0000000001b06c60 in MPIC_Sendrecv ()
#5  0x0000000001aff15a in MPIR_Allreduce ()
#6  0x0000000001b0036d in PMPI_Allreduce ()
#7  0x0000000001b1f1da in pmpi_allreduce_ ()
#8  0x0000000000700f9b in message_passing_mp_mp_sum_r1_ ()
#9  0x0000000000b68f9d in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#10 0x000000000158de4c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#11 0x0000000000a013c5 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#12 0x00000000009fc78a in qs_scf_mp_init_scf_run_ ()
#13 0x00000000009e81bd in qs_scf_mp_scf_ ()
#14 0x0000000000847ed3 in qs_energy_mp_qs_energies_ ()
#15 0x0000000000856e5e in qs_force_mp_qs_forces_ ()
#16 0x00000000004b904b in force_env_methods_mp_force_env_calc_energy_force_ ()
#17 0x00000000004b899e in force_env_methods_mp_force_env_calc_energy_force_ ()
#18 0x00000000006c4ee4 in md_run_mp_qs_mol_dyn_ ()
#19 0x000000000040c3d2 in cp2k_runs_mp_cp2k_run_ ()
#20 0x000000000040af1a in cp2k_runs_mp_run_input_ ()
#21 0x0000000000409df9 in MAIN__ ()
#22 0x0000000000408e0c in main ()



                                                                                
                        Noam
