According to the other reporter, it has been fixed in 1.10.7. I haven’t verified that, but I’d suggest trying it first.
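When presenting the upgrade case to the administrators, it may help to confirm at run time which Open MPI build mpirun actually launched (module systems and multiple installs make this easy to get wrong). A minimal sketch, not from this thread, using the standard MPI-3 call MPI_Get_library_version, which Open MPI 1.10 and later implement:

```c
#include <mpi.h>
#include <stdio.h>

/* Print the MPI library version string from rank 0 so you can verify
 * which Open MPI build is actually linked and launched at run time. */
int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("%s\n", version);  /* e.g. "Open MPI v1.10.7, ..." */

    MPI_Finalize();
    return 0;
}
```

Running this before and after an upgrade (with the same modules loaded as the failing job) rules out the case where the fix is installed but a stale build is still on PATH.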
> On Nov 8, 2017, at 8:26 AM, Nikolas Antolin <nanto...@gmail.com> wrote:
>
> Thank you for the replies. Do I understand correctly that since OpenMPI v1.10
> is no longer supported, I am unlikely to see a bug fix for this without
> moving to v2.x or v3.x? I am dealing with clusters whose administrators
> may be loath to update packages until it is absolutely necessary, and I want
> to present them with a complete outlook on the problem.
>
> Thanks,
> Nik
>
> 2017-11-07 19:00 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
> Glad to hear it has already been fixed :-)
>
> Thanks!
>
>> On Nov 7, 2017, at 4:13 PM, Tru Huynh <t...@pasteur.fr> wrote:
>>
>> Hi,
>>
>> On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
>>> Hello,
>>>
>>> While debugging a test of an application, I recently came across odd
>>> behavior for simultaneous MPI_Abort calls. Namely, while the MPI_Abort
>>> was acknowledged by the process output, the mpirun process failed to
>>> exit. I was able to duplicate this behavior on multiple machines with
>>> OpenMPI versions 1.10.2, 1.10.5, and 1.10.6 with the following simple
>>> program:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <stdbool.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     printf("I am process number %d\n", rank);
>>>     MPI_Abort(MPI_COMM_WORLD, 3);
>>>     return 0;
>>> }
>>>
>>> Is this a bug or a feature? Does this behavior exist in OpenMPI versions
>>> 2.0 and 3.0?
>>
>> I compiled your test case on CentOS-7 with openmpi 1.10.7/2.1.2 and
>> 3.0.0, and the program seems to run fine.
>> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do \
>>     module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && ldd aa-$i; \
>>     mpirun -n 2 ./aa-$i; done
>>
>> linux-vdso.so.1 => (0x00007ffe115bd000)
>> libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 (0x00007f40d7b4a000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f40d78f7000)
>> libc.so.6 => /lib64/libc.so.6 (0x00007f40d7534000)
>> libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 (0x00007f40d72b8000)
>> libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 (0x00007f40d6fd9000)
>> libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f40d6dcd000)
>> libdl.so.2 => /lib64/libdl.so.2 (0x00007f40d6bc9000)
>> librt.so.1 => /lib64/librt.so.1 (0x00007f40d69c0000)
>> libm.so.6 => /lib64/libm.so.6 (0x00007f40d66be000)
>> libutil.so.1 => /lib64/libutil.so.1 (0x00007f40d64bb000)
>> /lib64/ld-linux-x86-64.so.2 (0x000055f6d96c4000)
>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f40d62a4000)
>> I am process number 1
>> I am process number 0
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>> [borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> linux-vdso.so.1 => (0x00007fffaabcd000)
>> libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 (0x00007f5bcee39000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bcebe6000)
>> libc.so.6 => /lib64/libc.so.6 (0x00007f5bce823000)
>> libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 (0x00007f5bce5a0000)
>> libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 (0x00007f5bce2a7000)
>> libdl.so.2 => /lib64/libdl.so.2 (0x00007f5bce0a3000)
>> libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bcde97000)
>> libudev.so.1 => /lib64/libudev.so.1 (0x00007f5bcde81000)
>> librt.so.1 => /lib64/librt.so.1 (0x00007f5bcdc79000)
>> libm.so.6 => /lib64/libm.so.6 (0x00007f5bcd977000)
>> libutil.so.1 => /lib64/libutil.so.1 (0x00007f5bcd773000)
>> /lib64/ld-linux-x86-64.so.2 (0x000055718df01000)
>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bcd55d000)
>> libcap.so.2 => /lib64/libcap.so.2 (0x00007f5bcd357000)
>> libdw.so.1 => /lib64/libdw.so.1 (0x00007f5bcd110000)
>> libattr.so.1 => /lib64/libattr.so.1 (0x00007f5bccf0b000)
>> libelf.so.1 => /lib64/libelf.so.1 (0x00007f5bcccf2000)
>> libz.so.1 => /lib64/libz.so.1 (0x00007f5bccadc000)
>> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f5bcc8b6000)
>> libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f5bcc6a5000)
>> I am process number 1
>> I am process number 0
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [borma.bis.pasteur.fr:08534] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>> [borma.bis.pasteur.fr:08534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> linux-vdso.so.1 => (0x00007ffc09585000)
>> libmpi.so.40 => /c7/shared/openmpi/3.0.0/lib/libmpi.so.40 (0x00007fa208ffc000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208da9000)
>> libc.so.6 => /lib64/libc.so.6 (0x00007fa2089e6000)
>> libopen-rte.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-rte.so.40 (0x00007fa208734000)
>> libopen-pal.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-pal.so.40 (0x00007fa208431000)
>> libdl.so.2 => /lib64/libdl.so.2 (0x00007fa20822d000)
>> libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa208021000)
>> libudev.so.1 => /lib64/libudev.so.1 (0x00007fa20800b000)
>> librt.so.1 => /lib64/librt.so.1 (0x00007fa207e03000)
>> libm.so.6 => /lib64/libm.so.6 (0x00007fa207b01000)
>> libutil.so.1 => /lib64/libutil.so.1 (0x00007fa2078fd000)
>> libz.so.1 => /lib64/libz.so.1 (0x00007fa2076e7000)
>> /lib64/ld-linux-x86-64.so.2 (0x000055e717175000)
>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa2074d0000)
>> libcap.so.2 => /lib64/libcap.so.2 (0x00007fa2072cb000)
>> libdw.so.1 => /lib64/libdw.so.1 (0x00007fa207084000)
>> libattr.so.1 => /lib64/libattr.so.1 (0x00007fa206e7e000)
>> libelf.so.1 => /lib64/libelf.so.1 (0x00007fa206c66000)
>> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa206a40000)
>> libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fa20682f000)
>> I am process number 0
>> I am process number 1
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [borma.bis.pasteur.fr:08561] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>> [borma.bis.pasteur.fr:08561] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> When I increased the number of MPI processes from 2 to 6 (the number of
>> cores of the desktop), only the openmpi-1.10.7-built version hung (killed
>> with ctrl-c); there were no errors with the 2.1.2 and 3.0.0 versions.
>>
>> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do \
>>     module purge && module add openmpi/$i; echo $i; mpirun -n 6 ./aa-$i; done
>>
>> 1.10.7
>> I am process number 0
>> I am process number 1
>> I am process number 2
>> I am process number 3
>> I am process number 4
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> I am process number 5
>>
>> ^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>
>> 2.1.2
>> I am process number 2
>> I am process number 3
>> I am process number 4
>> I am process number 0
>> I am process number 1
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> I am process number 5
>> [borma.bis.pasteur.fr:10542] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
>> [borma.bis.pasteur.fr:10542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> 3.0.0
>> I am process number 2
>> I am process number 0
>> I am process number 3
>> I am process number 5
>> I am process number 4
>> I am process number 1
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
>> with errorcode 3.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [borma.bis.pasteur.fr:10570] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
>> [borma.bis.pasteur.fr:10570] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> -> some race condition on 1.10.7?
>> Cheers
>>
>> Tru
>>
>> --
>> Dr Tru Huynh | t...@pasteur.fr | tel/fax +33 1 45 68 87 37/19
>> https://research.pasteur.fr/en/team/structural-bioinformatics/
>> Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users