Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls

2017-11-08 Thread r...@open-mpi.org
I see. Then you understand correctly - we are not going to fix the v1.10 series.

> On Nov 8, 2017, at 10:47 AM, Nikolas Antolin  wrote:
> 
> That was not my interpretation. His message said he did not observe the race 
> condition for 2 processes, but did for 6 processes. I observe a failure to 
> exit mpirun around 25-30% of the time with 2 processes, causing an 
> inconsistent hang in both my example program and my larger application.
> 
> -Nik
> 
> On Nov 8, 2017 11:40, "r...@open-mpi.org" wrote:
> According to the other reporter, it has been fixed in 1.10.7. I haven’t 
> verified that, but I’d suggest trying it first.

Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls

2017-11-08 Thread Nikolas Antolin
That was not my interpretation. His message said he did not observe the
race condition for 2 processes, but did for 6 processes. I observe a
failure to exit mpirun around 25-30% of the time with 2 processes, causing
an inconsistent hang in both my example program and my larger application.

-Nik
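
A rough way to estimate that failure rate, assuming GNU coreutils timeout is available and the test program quoted later in the thread has been built as ./abort_test (the binary name and the 60-second limit are only examples), is a loop along these lines:

hangs=0; runs=20
for i in $(seq $runs); do
    # timeout exits with 124 when it had to kill mpirun, i.e. mpirun never exited on its own
    timeout 60 mpirun -n 2 ./abort_test > /dev/null 2>&1
    [ $? -eq 124 ] && hangs=$((hangs + 1))
done
echo "mpirun hung in $hangs of $runs runs"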

On Nov 8, 2017 11:40, "r...@open-mpi.org"  wrote:

> According to the other reporter, it has been fixed in 1.10.7. I haven’t
> verified that, but I’d suggest trying it first.
>

Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls

2017-11-08 Thread r...@open-mpi.org
According to the other reporter, it has been fixed in 1.10.7. I haven’t 
verified that, but I’d suggest trying it first.
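
Before retrying, the release a given cluster module actually provides can be confirmed with something like the following (the module name is only an example; mpirun --version prints the Open MPI release in use):

module add openmpi/1.10.7
mpirun --version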


> On Nov 8, 2017, at 8:26 AM, Nikolas Antolin  wrote:
> 
> Thank you for the replies. Do I understand correctly that since OpenMPI v1.10 
> is no longer supported, I am unlikely to see a bug fix for this without 
> moving to v2.x or v3.x? I am dealing with clusters where the administrators 
> may be loath to update packages until it is absolutely necessary, and want 
> to present them with a complete outlook on the problem.
> 
> Thanks,
> Nik

Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls

2017-11-08 Thread Nikolas Antolin
Thank you for the replies. Do I understand correctly that since OpenMPI
v1.10 is no longer supported, I am unlikely to see a bug fix for this
without moving to v2.x or v3.x? I am dealing with clusters where the
administrators may be loath to update packages until it is absolutely
necessary, and want to present them with a complete outlook on the problem.

Thanks,
Nik

2017-11-07 19:00 GMT-07:00 r...@open-mpi.org:

> Glad to hear it has already been fixed :-)
>
> Thanks!
>
> On Nov 7, 2017, at 4:13 PM, Tru Huynh  wrote:
>
> Hi,
>
> On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
>
> Hello,
>
> In debugging a test of an application, I recently came across odd behavior
> for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was
> acknowledged by the process output, the mpirun process failed to exit. I
> was able to duplicate this behavior on multiple machines with OpenMPI
> versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
>int rank;
>
>    MPI_Init(&argc, &argv);
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>printf("I am process number %d\n", rank);
>MPI_Abort(MPI_COMM_WORLD, 3);
>return 0;
> }
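
A minimal sketch of reproducing the hang with this program, assuming the source is saved as abort_test.c and GNU coreutils timeout is available (an exit status of 124 from timeout means mpirun never returned on its own):

mpicc abort_test.c -o abort_test
timeout 60 mpirun -n 2 ./abort_test
echo "mpirun exit status: $?"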
>
> Is this a bug or a feature? Does this behavior exist in OpenMPI versions
> 2.0 and 3.0?
>
> I compiled your test case on CentOS-7 with openmpi 1.10.7/2.1.2 and
> 3.0.0 and the program seems to run fine.
>
> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do \
>     module purge && module add openmpi/$i && \
>     mpicc aa.c -o aa-$i && ldd aa-$i; \
>     mpirun -n 2 ./aa-$i; done
>
>
> linux-vdso.so.1 =>  (0x7ffe115bd000)
> libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12
> (0x7f40d7b4a000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f40d78f7000)
> libc.so.6 => /lib64/libc.so.6 (0x7f40d7534000)
> libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12
> (0x7f40d72b8000)
> libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13
> (0x7f40d6fd9000)
> libnuma.so.1 => /lib64/libnuma.so.1 (0x7f40d6dcd000)
> libdl.so.2 => /lib64/libdl.so.2 (0x7f40d6bc9000)
> librt.so.1 => /lib64/librt.so.1 (0x7f40d69c)
> libm.so.6 => /lib64/libm.so.6 (0x7f40d66be000)
> libutil.so.1 => /lib64/libutil.so.1 (0x7f40d64bb000)
> /lib64/ld-linux-x86-64.so.2 (0x55f6d96c4000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f40d62a4000)
> I am process number 1
> I am process number 0
> --
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 3.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> [borma.bis.pasteur.fr:08511 ] 1 more
> process has sent help message help-mpi-api.txt / mpi-abort
> [borma.bis.pasteur.fr:08511 ] Set MCA
> parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> linux-vdso.so.1 =>  (0x7fffaabcd000)
> libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20
> (0x7f5bcee39000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f5bcebe6000)
> libc.so.6 => /lib64/libc.so.6 (0x7f5bce823000)
> libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20
> (0x7f5bce5a)
> libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20
> (0x7f5bce2a7000)
> libdl.so.2 => /lib64/libdl.so.2 (0x7f5bce0a3000)
> libnuma.so.1 => /lib64/libnuma.so.1 (0x7f5bcde97000)
> libudev.so.1 => /lib64/libudev.so.1 (0x7f5bcde81000)
> librt.so.1 => /lib64/librt.so.1 (0x7f5bcdc79000)
> libm.so.6 => /lib64/libm.so.6 (0x7f5bcd977000)
> libutil.so.1 => /lib64/libutil.so.1 (0x7f5bcd773000)
> /lib64/ld-linux-x86-64.so.2 (0x55718df01000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f5bcd55d000)
> libcap.so.2 => /lib64/libcap.so.2 (0x7f5bcd357000)
> libdw.so.1 => /lib64/libdw.so.1 (0x7f5bcd11)
> libattr.so.1 => /lib64/libattr.so.1 (0x7f5bccf0b000)
> libelf.so.1 => /lib64/libelf.so.1 (0x7f5bcccf2000)
> libz.so.1 => /lib64/libz.so.1 (0x7f5bccadc000)
> liblzma.so.5 => /lib64/liblzma.so.5 (0x7f5bcc8b6000)
> libbz2.so.1 => /lib64/libbz2.so.1 (0x7f5bcc6a5000)
> I am process number 1
> I am process number 0
> --
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 3.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
>
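
As the aggregated-help hint above suggests, the suppressed per-rank abort messages can be shown by disabling aggregation on the command line, e.g. reusing one of the aa-$i binaries built in the loop above:

mpirun --mca orte_base_help_aggregate 0 -n 2 ./aa-1.10.7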