Hi,
I have hit an issue in which orted hangs during the finalization of a job. It
reproduces when running 80 ranks per node (yes, oversubscribed) on a 4-node
cluster running SLES12 with OMPI 1.10.2 (I also see it with 1.10.0). It is
independent of the binary used (I tried a very simple sample in which the ranks
init, do nothing, and finalize), and the hang happens after MPI_Finalize(). It
is not deterministic and takes a few attempts to reproduce. When the hang
occurs, the mpirun process does not get to wait() on its children (see (1)
below) and is stuck in a poll() (see (2) below). I logged in to the other nodes
and found that all the "other" orted processes are also blocked, in a different
poll() call (see (3) below). If I attach gdb to mpirun and let it continue, the
execution finishes with no issues. The same happens if I send SIGSTOP and then
SIGCONT to the hung mpirun.
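Since the hang is intermittent, a small driver loop can make it easier to count how often it shows up and whether the SIGSTOP/SIGCONT nudge really unblocks it. The following is only a minimal sketch: the function names `run_and_nudge` and `count_outcomes`, the timeout values, and the idea that "still running after the timeout" means "hung" are all assumptions for illustration, not part of Open MPI.

```python
import signal
import subprocess
import time


def run_and_nudge(cmd, hang_timeout=60.0, nudge_wait=10.0):
    """Run cmd once and classify how it ended.

    Returns "clean" if cmd exits within hang_timeout seconds,
    "finished_after_nudge" if it only exits after SIGSTOP + SIGCONT,
    and "still_hung" if it has to be killed.
    The timeouts are guesses; tune them to the real job length.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=hang_timeout)
        return "clean"
    except subprocess.TimeoutExpired:
        pass  # assumed hung: still running after hang_timeout

    # Same nudge as described above: stop the process, then resume it.
    proc.send_signal(signal.SIGSTOP)
    time.sleep(0.1)
    proc.send_signal(signal.SIGCONT)
    try:
        proc.wait(timeout=nudge_wait)
        return "finished_after_nudge"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return "still_hung"


def count_outcomes(cmd, attempts=20, **kwargs):
    """Repeat run_and_nudge and tally the three possible outcomes."""
    outcomes = {"clean": 0, "finished_after_nudge": 0, "still_hung": 0}
    for _ in range(attempts):
        outcomes[run_and_nudge(cmd, **kwargs)] += 1
    return outcomes


# Example call with the command line from (1) below:
# count_outcomes(["mpirun", "-np", "320", "--allow-run-as-root",
#                 "-machinefile", "./user/hostfile",
#                 "/scratch/user/osu_multi_lat"])
```

A nonzero "finished_after_nudge" count would match the behaviour described above, where the hung mpirun completes once it is stopped and resumed.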
(1)
root 164356 161186 0 16:50 pts/0 00:00:00 mpirun -np 320
--allow-run-as-root -machinefile ./user/hostfile /scratch/user/osu_multi_lat
root 164358 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n3
PATH=/scratch/user/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
DYLD
root 164359 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n2
PATH=/scratch/user/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
DYLD
root 164361 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164362 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164364 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164365 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164366 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164367 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164370 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164372 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164374 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164375 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164378 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
root 164379 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat]
(2)
gdb -p 164356
...
Missing separate debuginfos, use: zypper install
glibc-debuginfo-2.19-17.72.x86_64
(gdb) bt
#0 0x7f143177a3cd in poll () from /lib64/libc.so.6
#1 0x7f14325e0636 in poll_dispatch () from
/scratch/user/lib/libopen-pal.so.13
#2 0x7f14325d77bf in opal_libevent2021_event_base_loop () from
/scratch/user/lib/libopen-pal.so.13
#3 0x004051cd in orterun (argc=7, argv=0x7fff8c4bb428) at
orterun.c:1133
#4 0x00403a8d in main (argc=7, argv=0x7fff8c4bb428) at main.c:13
(3) (remote nodes orted)
(gdb) bt
#0 0x7f8c288d33b0 in __poll_nocancel () from /lib64/libc.so.6
#1 0x7f8c29941186 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
#2 0x7f8c2993830f in opal_libevent2021_event_base_loop () from
/scratch/user/lib/libopen-pal.so.13
#3 0x7f8c29be75c4 in orte_daemon () from
/scratch/user/lib/libopen-rte.so.12
#4 0x00400827 in main ()
Thanks,
_MAC