Are the procs still alive? Is this on a single node?
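
If the procs are gone but mpiexec is still stuck in that poll(), one way to
narrow it down would be a stripped-down reproducer. Something like the sketch
below (my own example, not taken from the HDF5 suite) would show whether
MPI_Abort by itself wedges mpiexec on that machine:

  /* abort_test.c: hypothetical minimal reproducer. One rank calls MPI_Abort
   * while the others sit in a barrier, which is roughly what the testphdf5
   * failure path does. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 3) {
          fprintf(stderr, "rank %d: calling MPI_Abort\n", rank);
          MPI_Abort(MPI_COMM_WORLD, 1);  /* should terminate the whole job */
      }

      /* The remaining ranks block here; mpiexec should tear them down. */
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }

Built with mpicc and run the same way (e.g. mpiexec -n 6 ./abort_test), that
would tell us whether it is MPI_Abort itself hanging or something in the HDF5
test harness.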
> On Jun 30, 2016, at 8:49 AM, Orion Poplawski <or...@cora.nwra.com> wrote:
>
> I'm seeing hangs when MPI_Abort is called. This is with Open MPI 1.10.3, e.g.:
>
> program output:
>
> Testing -- big dataset test (bigdset)
> Proc 3: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> aborting MPI processes
> Testing -- big dataset test (bigdset)
> Proc 0: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> aborting MPI processes
> Testing -- big dataset test (bigdset)
> Proc 2: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> Testing -- big dataset test (bigdset)
> Proc 5: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> aborting MPI processes
> aborting MPI processes
> Testing -- big dataset test (bigdset)
> Proc 1: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> aborting MPI processes
> Testing -- big dataset test (bigdset)
> Proc 4: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
> aborting MPI processes
>
>
> strace of the mpiexec process:
>
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=14, events=POLLIN}], 4, -1
>
> lsof output for the mpiexec process:
>
> mpiexec 21511 orion 1w REG 8,3 10547 17696145
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
> mpiexec 21511 orion 2w REG 8,3 10547 17696145
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
> mpiexec 21511 orion 3u unix 0xdaedbc80 0t0 4818918 type=STREAM
> mpiexec 21511 orion 4u unix 0xdaed8000 0t0 4818919 type=STREAM
> mpiexec 21511 orion 5u a_inode 0,11 0 8731 [eventfd]
> mpiexec 21511 orion 6u REG 0,38 0 4818921
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/dev/shm/open_mpi.0000
> (deleted)
> mpiexec 21511 orion 7r FIFO 0,10 0t0 4818922 pipe
> mpiexec 21511 orion 8w FIFO 0,10 0t0 4818922 pipe
> mpiexec 21511 orion 9r DIR 8,3 4096 15471703
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root
> mpiexec 21511 orion 10r DIR 0,16 0 82
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/sys/firmware/devicetree/base/cpus
> mpiexec 21511 orion 11u IPv4 4818926 0t0 TCP *:39619
> (LISTEN)
> mpiexec 21511 orion 12r FIFO 0,10 0t0 4818927 pipe
> mpiexec 21511 orion 13w FIFO 0,10 0t0 4818927 pipe
> mpiexec 21511 orion 14r FIFO 8,3 0t0 17965730
> /var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/tmp/openmpi-sessions-mockbuild@arm03-packager00_0/46622/0/debugger_attach_fifo
>
> Any suggestions on what to look for? FWIW, it was a 6-process run on a 4-core
> machine.
>
> Thanks.
>
> --
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane or...@nwra.com
> Boulder, CO 80301 http://www.nwra.com
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29573.php