Thanks,

I've tried padb first to get stack traces; they are appended at the end
of this message. This is from IMB-MPI1 hanging after one hour; the last
output was:
# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.09         0.05
            1         1000       253.40       335.35       293.06
            2         1000       266.93       346.65       306.23
            4         1000       303.52       382.41       342.21
            8         1000       383.89       493.56       439.34
           16         1000       501.27       627.84       569.80
           32         1000      1039.65      1259.70      1163.12
           64         1000      1710.12      2071.47      1910.62
          128         1000      3051.68      3653.44      3398.65
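
For reference, grouped traces like the ones appended below can be
collected with a padb invocation along these lines. This is only a
sketch: the rmgr=slurm setting assumes the Slurm resource manager, and
the exact option names may vary between padb versions:

    # collect and merge stack traces from every rank of the running job
    padb -O rmgr=slurm --all --stack-trace --tree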

On Fri, Dec 1, 2017 at 4:23 PM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
> FWIW,
>
> pstack <pid> is a gdb wrapper that displays the stack trace.
>
> PADB (http://padb.pittman.org.uk) is a great open-source tool that
> automatically collects the stack traces of all the MPI tasks (and can do
> some grouping, similar to dshbak).
>
> Cheers,
>
> Gilles
>
>
> Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> On Dec 1, 2017, at 8:10 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
> On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
> I have attached my Slurm job script; it simply does an mpirun
> IMB-MPI1 with 1024 processes. I haven't set any MCA parameters, so, for
> instance, vader is enabled.
>
> I have tested again with
>    mpirun --mca btl "^vader" IMB-MPI1
> and it made no difference.
>
>
> I’ve lost track of the earlier parts of this thread, but has anyone
> suggested logging into the nodes it’s running on, doing “gdb -p PID” for
> each of the MPI processes, and doing “where” to see where it’s hanging?
>
> I use this script (trace_all), which depends on a variable process that is a
> grep regexp that matches the mpi executable:
>
> echo "where" > /tmp/gf
>
> pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print \$2}'`
> for pid in $pids; do
>    echo $pid
>    prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
>    gdb -x /tmp/gf -batch $prog $pid
>    echo ""
> done
>
>



-- 
AL I:40: Do what thou wilt shall be the whole of the Law.
Stack trace(s) for thread: 1
-----------------
[0-1023] (1024 processes)
-----------------
main() at ?:?
  IMB_init_buffers_iter() at ?:?
    IMB_alltoall() at ?:?
      -----------------
      [0-31,35,42,118,163,235] (37 processes)
      -----------------
      PMPI_Barrier() at ?:?
        ompi_coll_base_barrier_intra_recursivedoubling() at ?:?
          ompi_request_default_wait() at ?:?
            opal_progress() at ?:?
      -----------------
      [32-34,36-41,43-117,119-162,164-234,236-1023] (987 processes)
      -----------------
      PMPI_Alltoall() at ?:?
        ompi_coll_base_alltoall_intra_basic_linear() at ?:?
          ompi_request_default_wait_all() at ?:?
            -----------------
            [32-34,36-41,43-117,119-162,164-234,236-413,415-532,534-651,653-744,746-894,896-1023] (982 processes)
            -----------------
            opal_progress() at ?:?
            -----------------
            [533] (1 processes)
            -----------------
            opal_progress@plt() at ?:?
Stack trace(s) for thread: 2
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
  progress_engine() at ?:?
    opal_libevent2022_event_base_loop() at event.c:1630
      epoll_dispatch() at epoll.c:407
        epoll_wait() at ?:?
Stack trace(s) for thread: 3
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
  progress_engine() at ?:?
    opal_libevent2022_event_base_loop() at event.c:1630
      poll_dispatch() at poll.c:165
        poll() at ?:?
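
If a closer look at an individual rank is needed, the same per-process
trace can also be pulled with plain gdb on the node, along the lines of
Noam's script; <pid> below is a placeholder for the process id of the
rank in question:

    # attach to the process, dump backtraces of all threads, then exit
    gdb -p <pid> -batch -ex 'thread apply all bt'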
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
