Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-12-01 Thread Ralph Castain
Looks like this should be fixed in my PR #101 - could you please review it?

Thanks
Ralph


> On Nov 26, 2014, at 8:14 PM, Ralph Castain  wrote:
> 
> Aha - I see what happened. I have that param set to false in my default mca 
> param file. If I set it to true on the cmd line, then I run without 
> segfaulting.
> 
> Thanks!
> Ralph
> 
> 
>> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet 
>> > wrote:
>> 
>> Ralph,
>> 
>> let me correct and enhance my previous statement :
>> 
>> - i cannot reproduce your crash in my environment (RHEL6 like vs your RHEL7 
>> like)
>> (i configured with --enable-debug --enable-picky)
>> 
>> - i can reproduce the crash with
>> mpirun --mca mpi_param_check false
>> 
>> - if you configured with --without-mpi-param-check, i assume you would get 
>> the same crash
>> (and if i understand correctly, there would be no way to set --mca 
>> mpi_param_check true)
>> 
>> here is the relevant part of my config.status :
>> $ grep MPI_PARAM_CHECK config.status 
>> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
>> D["OMPI_PARAM_CHECK"]=" 1"
>> 
>> i will try on a centos7 box now.
>> in the meantime, can you check your config.status and try again with 
>> mpirun --mca mpi_param_check true
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>>> I will double check this (afk right now)
>>> Are you running on a rhel6 like distro with gcc ?
>>> 
>>> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
>>> this...
>>> 
>>> Cheers,
>>> 
>>> Gilles 
>>> 
 Ralph Castain wrote:
 I tried it with both the fortran and c versions - got the same result.
 
 
 This was indeed with a debug build. I wouldn’t expect a segfault even with 
 an optimized build, though - I would expect an MPI error, yes?
 
 
 
 
 On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
   
 wrote:
 
 
 I will have a look
 
 Btw, i was running the fortran version, not the c one.
 Did you configure with --enable-debug ?
 The program sends to a rank *not* in the communicator, so this behavior 
 could make some sense on an optimized build.
 
 Cheers,
 
 Gilles
 
 Ralph Castain wrote:
 Ick - I’m getting a segfault when trying to run that test:
 
 
 MPITEST info  (0): Starting MPI_Errhandler_fatal test
 
 MPITEST info  (0): This test will abort after printing the results message
 
 MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
 
 [bend001:07714] *** Process received signal ***
 
 [bend001:07714] Signal: Segmentation fault (11)
 
 [bend001:07714] Signal code: Address not mapped (1)
 
 [bend001:07714] Failing at address: 0x50
 
 [bend001:07715] *** Process received signal ***
 
 [bend001:07715] Signal: Segmentation fault (11)
 
 [bend001:07715] Signal code: Address not mapped (1)
 
 [bend001:07715] Failing at address: 0x50
 
 [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07713] *** Process received signal ***
 
 [bend001:07713] Signal: Segmentation fault (11)
 
 [bend001:07713] Signal code: Address not mapped (1)
 
 [bend001:07713] Failing at address: 0x50
 
 [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
 
 [bend001:07713] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
 
 [bend001:07713] [ 2] [bend001:07714] [ 0] 
 /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
 
 [bend001:07714] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
 
 [bend001:07714] [ 2] [bend001:07715] [ 0] 
 /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
 
 [bend001:07715] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
 
 [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests 
 PASSED (3)
 
 
 
 This is with the head of the 1.8 branch. Any suggestions?
 
 Ralph
 
 
 
 On Nov 26, 2014, at 8:46 AM, Ralph Castain  
  wrote:
 
 
 Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
 like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
 sure I remember how I fixed it) - thanks!

Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Aha - I see what happened. I have that param set to false in my default mca 
param file. If I set it to true on the cmd line, then I run without segfaulting.

Thanks!
Ralph


> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> let me correct and enhance my previous statement :
> 
> - i cannot reproduce your crash in my environment (RHEL6 like vs your RHEL7 
> like)
> (i configured with --enable-debug --enable-picky)
> 
> - i can reproduce the crash with
> mpirun --mca mpi_param_check false
> 
> - if you configured with --without-mpi-param-check, i assume you would get 
> the same crash
> (and if i understand correctly, there would be no way to set --mca 
> mpi_param_check true)
> 
> here is the relevant part of my config.status :
> $ grep MPI_PARAM_CHECK config.status 
> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
> D["OMPI_PARAM_CHECK"]=" 1"
> 
> i will try on a centos7 box now.
> in the meantime, can you check your config.status and try again with 
> mpirun --mca mpi_param_check true
> 
> Cheers,
> 
> Gilles
> 
> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>> I will double check this (afk right now)
>> Are you running on a rhel6 like distro with gcc ?
>> 
>> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
>> this...
>> 
>> Cheers,
>> 
>> Gilles 
>> 
>> Ralph Castain wrote:
>>> I tried it with both the fortran and c versions - got the same result.
>>> 
>>> 
>>> This was indeed with a debug build. I wouldn’t expect a segfault even with 
>>> an optimized build, though - I would expect an MPI error, yes?
>>> 
>>> 
>>> 
>>> 
>>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> 
>>> I will have a look
>>> 
>>> Btw, i was running the fortran version, not the c one.
>>> Did you configure with --enable-debug ?
>>> The program sends to a rank *not* in the communicator, so this behavior 
>>> could make some sense on an optimized build.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Ralph Castain wrote:
>>> Ick - I’m getting a segfault when trying to run that test:
>>> 
>>> 
>>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>>> 
>>> MPITEST info  (0): This test will abort after printing the results message
>>> 
>>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>> 
>>> [bend001:07714] *** Process received signal ***
>>> 
>>> [bend001:07714] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07714] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07714] Failing at address: 0x50
>>> 
>>> [bend001:07715] *** Process received signal ***
>>> 
>>> [bend001:07715] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07715] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07715] Failing at address: 0x50
>>> 
>>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07713] *** Process received signal ***
>>> 
>>> [bend001:07713] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07713] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07713] Failing at address: 0x50
>>> 
>>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>> 
>>> [bend001:07713] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>> 
>>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>> 
>>> [bend001:07714] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>> 
>>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>> 
>>> [bend001:07715] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>>> 
>>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>>> (3)
>>> 
>>> 
>>> 
>>> This is with the head of the 1.8 branch. Any suggestions?
>>> 
>>> Ralph
>>> 
>>> 
>>> 
>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain  
>>>  wrote:
>>> 
>>> 
>>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>>> sure I remember how I fixed it) - thanks!
>>> 
>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> Ralph,
>>> 
>>> i noted several hangs in mtt with the v1.8 branch.
>>> 
>>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>> from the intel_tests suite,
>>> invoke mpirun on one node and 

Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
Ralph,

let me correct and enhance my previous statement :

- i cannot reproduce your crash in my environment (RHEL6 like vs your
RHEL7 like)
(i configured with --enable-debug --enable-picky)

- i can reproduce the crash with
mpirun --mca mpi_param_check false

- if you configured with --without-mpi-param-check, i assume you would
get the same crash
(and if i understand correctly, there would be no way to set --mca
mpi_param_check true)

here is the relevant part of my config.status :
$ grep MPI_PARAM_CHECK config.status
D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
D["OMPI_PARAM_CHECK"]=" 1"

i will try on a centos7 box now.
in the meantime, can you check your config.status and try again with
mpirun --mca mpi_param_check true
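
For reference, here is a minimal sketch (not the intel_tests source, just an
illustration of the pattern under discussion): with MPI_ERRORS_RETURN installed
and mpi_param_check enabled, a send to an out-of-range rank should be reported
as MPI_ERR_RANK, while with parameter checking disabled the same call can take
the crashing path shown in the trace above.

/* sketch.c - hypothetical reproducer, not the actual test program */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, payload = 42, err, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* return errors to the caller instead of aborting */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* 'size' is one past the last valid rank in MPI_COMM_WORLD */
    err = MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        printf("rank %d: MPI_Send returned: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}

Running it once with --mca mpi_param_check true and once with false would be
one way to compare the two code paths.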

Cheers,

Gilles

On 2014/11/27 10:06, Gilles Gouaillardet wrote:
> I will double check this (afk right now)
> Are you running on a rhel6 like distro with gcc ?
>
> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
> this...
>
> Cheers,
>
> Gilles 
>
> Ralph Castain wrote:
>> I tried it with both the fortran and c versions - got the same result.
>>
>>
>> This was indeed with a debug build. I wouldn't expect a segfault even with 
>> an optimized build, though - I would expect an MPI error, yes?
>>
>>
>>
>>
>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>>  wrote:
>>
>>
>> I will have a look
>>
>> Btw, i was running the fortran version, not the c one.
>> Did you configure with --enable-debug ?
>> The program sends to a rank *not* in the communicator, so this behavior 
>> could make some sense on an optimized build.
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain wrote:
>> Ick - I'm getting a segfault when trying to run that test:
>>
>>
>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>>
>> MPITEST info  (0): This test will abort after printing the results message
>>
>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>
>> [bend001:07714] *** Process received signal ***
>>
>> [bend001:07714] Signal: Segmentation fault (11)
>>
>> [bend001:07714] Signal code: Address not mapped (1)
>>
>> [bend001:07714] Failing at address: 0x50
>>
>> [bend001:07715] *** Process received signal ***
>>
>> [bend001:07715] Signal: Segmentation fault (11)
>>
>> [bend001:07715] Signal code: Address not mapped (1)
>>
>> [bend001:07715] Failing at address: 0x50
>>
>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07713] *** Process received signal ***
>>
>> [bend001:07713] Signal: Segmentation fault (11)
>>
>> [bend001:07713] Signal code: Address not mapped (1)
>>
>> [bend001:07713] Failing at address: 0x50
>>
>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>
>> [bend001:07713] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>
>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>
>> [bend001:07714] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>
>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>
>> [bend001:07715] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>>
>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>> (3)
>>
>>
>>
>> This is with the head of the 1.8 branch. Any suggestions?
>>
>> Ralph
>>
>>
>>
>> On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>>
>>
>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>> like the fix indeed failed to come over to 1.8. I'll take a gander (pretty 
>> sure I remember how I fixed it) - thanks!
>>
>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>  wrote:
>>
>> Ralph,
>>
>> i noted several hangs in mtt with the v1.8 branch.
>>
>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>> from the intel_tests suite,
>> invoke mpirun on one node and run the tasks on another node :
>>
>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>
>> /* since this is a race condition, you might need to run this in a loop
>> in order to hit the bug */
>>
>> the attached tarball contains a patch (add debug + temporary hack) and
>> some log files obtained with
>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>
>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>> the hack, i can still reproduce the hang (though it might
>> be a different one) with -np 16 (log.ko.2.txt)
>>
>> i remember some similar hangs were fixed on the trunk/master a few
>> months ago.
>> i tried to 

Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain

> On Nov 26, 2014, at 5:06 PM, Gilles Gouaillardet 
>  wrote:
> 
> I will double check this (afk right now)
> Are you running on a rhel6 like distro with gcc ?

Yeah, I’m running CentOS7 and gcc 4.8.2

> 
> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
> this…

Sounds right - just surprised that you wouldn’t get it but I would. That makes 
debugging this problem a tad difficult, so I may need a different error to 
debug with.

> 
> Cheers,
> 
> Gilles 
> 
> Ralph Castain wrote:
> I tried it with both the fortran and c versions - got the same result.
> 
> This was indeed with a debug build. I wouldn’t expect a segfault even with an 
> optimized build, though - I would expect an MPI error, yes?
> 
> 
> 
>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>> > wrote:
>> 
>> I will have a look
>> 
>> Btw, i was running the fortran version, not the c one.
>> Did you configure with --enable-debug ?
>> The program sends to a rank *not* in the communicator, so this behavior 
>> could make some sense on an optimized build.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain wrote:
>> Ick - I’m getting a segfault when trying to run that test:
>> 
>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>> MPITEST info  (0): This test will abort after printing the results message
>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>> [bend001:07714] *** Process received signal ***
>> [bend001:07714] Signal: Segmentation fault (11)
>> [bend001:07714] Signal code: Address not mapped (1)
>> [bend001:07714] Failing at address: 0x50
>> [bend001:07715] *** Process received signal ***
>> [bend001:07715] Signal: Segmentation fault (11)
>> [bend001:07715] Signal code: Address not mapped (1)
>> [bend001:07715] Failing at address: 0x50
>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07713] *** Process received signal ***
>> [bend001:07713] Signal: Segmentation fault (11)
>> [bend001:07713] Signal code: Address not mapped (1)
>> [bend001:07713] Failing at address: 0x50
>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>> [bend001:07713] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>> [bend001:07714] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>> [bend001:07715] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>> (3)
>> 
>> 
>> This is with the head of the 1.8 branch. Any suggestions?
>> Ralph
>> 
>> 
>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain >> > wrote:
>>> 
>>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>>> sure I remember how I fixed it) - thanks!
>>> 
 On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
 > 
 wrote:
 
 Ralph,
 
 i noted several hangs in mtt with the v1.8 branch.
 
 a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
 from the intel_tests suite,
 invoke mpirun on one node and run the tasks on another node :
 
 node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
 
 /* since this is a race condition, you might need to run this in a loop
 in order to hit the bug */
 
 the attached tarball contains a patch (add debug + temporary hack) and
 some log files obtained with
 --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
 
 without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
 the hack, i can still reproduce the hang (though it might
 be a different one) with -np 16 (log.ko.2.txt)
 
 i remember some similar hangs were fixed on the trunk/master a few
 months ago.
 i tried to backport some commits but it did not help :-(
 
 could you please have a look at this ?
 
 Cheers,
 
 Gilles
 ___
 devel mailing list
 de...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
 

Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
I will double check this (afk right now)
Are you running on a rhel6 like distro with gcc ?

Iirc, crash vs mpi error is ruled by --with-param-check or something like 
this...

Cheers,

Gilles 

Ralph Castain wrote:
>I tried it with both the fortran and c versions - got the same result.
>
>
>This was indeed with a debug build. I wouldn’t expect a segfault even with an 
>optimized build, though - I would expect an MPI error, yes?
>
>
>
>
>On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
> wrote:
>
>
>I will have a look
>
>Btw, i was running the fortran version, not the c one.
>Did you configure with --enable-debug ?
>The program sends to a rank *not* in the communicator, so this behavior could 
>make some sense on an optimized build.
>
>Cheers,
>
>Gilles
>
>Ralph Castain wrote:
>Ick - I’m getting a segfault when trying to run that test:
>
>
>MPITEST info  (0): Starting MPI_Errhandler_fatal test
>
>MPITEST info  (0): This test will abort after printing the results message
>
>MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>
>[bend001:07714] *** Process received signal ***
>
>[bend001:07714] Signal: Segmentation fault (11)
>
>[bend001:07714] Signal code: Address not mapped (1)
>
>[bend001:07714] Failing at address: 0x50
>
>[bend001:07715] *** Process received signal ***
>
>[bend001:07715] Signal: Segmentation fault (11)
>
>[bend001:07715] Signal code: Address not mapped (1)
>
>[bend001:07715] Failing at address: 0x50
>
>[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] *** Process received signal ***
>
>[bend001:07713] Signal: Segmentation fault (11)
>
>[bend001:07713] Signal code: Address not mapped (1)
>
>[bend001:07713] Failing at address: 0x50
>
>[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>
>[bend001:07713] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>
>[bend001:07713] [ 2] [bend001:07714] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>
>[bend001:07714] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>
>[bend001:07714] [ 2] [bend001:07715] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>
>[bend001:07715] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>
>[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>
>
>
>This is with the head of the 1.8 branch. Any suggestions?
>
>Ralph
>
>
>
>On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>
>
>Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
>the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
>remember how I fixed it) - thanks!
>
>On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
> wrote:
>
>Ralph,
>
>i noted several hangs in mtt with the v1.8 branch.
>
>a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>from the intel_tests suite,
>invoke mpirun on one node and run the tasks on another node :
>
>node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
>/* since this is a race condition, you might need to run this in a loop
>in order to hit the bug */
>
>the attached tarball contains a patch (add debug + temporary hack) and
>some log files obtained with
>--mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
>without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>the hack, i can still reproduce the hang (though it might
>be a different one) with -np 16 (log.ko.2.txt)
>
>i remember some similar hangs were fixed on the trunk/master a few
>months ago.
>i tried to backport some commits but it did not help :-(
>
>could you please have a look at this ?
>
>Cheers,
>
>Gilles
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/11/16357.php
>
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/11/16364.php
>
>


Re: [OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
I tried it with both the fortran and c versions - got the same result.

This was indeed with a debug build. I wouldn’t expect a segfault even with an 
optimized build, though - I would expect an MPI error, yes?



> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>  wrote:
> 
> I will have a look
> 
> Btw, i was running the fortran version, not the c one.
> Did you configure with --enable-debug ?
> The program sends to a rank *not* in the communicator, so this behavior could 
> make some sense on an optimized build.
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain wrote:
> Ick - I’m getting a segfault when trying to run that test:
> 
> MPITEST info  (0): Starting MPI_Errhandler_fatal test
> MPITEST info  (0): This test will abort after printing the results message
> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
> [bend001:07714] *** Process received signal ***
> [bend001:07714] Signal: Segmentation fault (11)
> [bend001:07714] Signal code: Address not mapped (1)
> [bend001:07714] Failing at address: 0x50
> [bend001:07715] *** Process received signal ***
> [bend001:07715] Signal: Segmentation fault (11)
> [bend001:07715] Signal code: Address not mapped (1)
> [bend001:07715] Failing at address: 0x50
> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07713] *** Process received signal ***
> [bend001:07713] Signal: Segmentation fault (11)
> [bend001:07713] Signal code: Address not mapped (1)
> [bend001:07713] Failing at address: 0x50
> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
> [bend001:07713] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
> [bend001:07713] [ 2] [bend001:07714] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
> [bend001:07714] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
> [bend001:07714] [ 2] [bend001:07715] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
> [bend001:07715] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
> (3)
> 
> 
> This is with the head of the 1.8 branch. Any suggestions?
> Ralph
> 
> 
>> On Nov 26, 2014, at 8:46 AM, Ralph Castain > > wrote:
>> 
>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>> sure I remember how I fixed it) - thanks!
>> 
>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>> > 
>>> wrote:
>>> 
>>> Ralph,
>>> 
>>> i noted several hangs in mtt with the v1.8 branch.
>>> 
>>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>> from the intel_tests suite,
>>> invoke mpirun on one node and run the tasks on another node :
>>> 
>>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>> 
>>> /* since this is a race condition, you might need to run this in a loop
>>> in order to hit the bug */
>>> 
>>> the attached tarball contains a patch (add debug + temporary hack) and
>>> some log files obtained with
>>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>> 
>>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>>> the hack, i can still reproduce the hang (though it might
>>> be a different one) with -np 16 (log.ko.2.txt)
>>> 
>>> i remember some similar hangs were fixed on the trunk/master a few
>>> months ago.
>>> i tried to backport some commits but it did not help :-(
>>> 
>>> could you please have a look at this ?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/11/16357.php 
>>> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16364.php



Re: [OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
I will have a look

Btw, i was running the fortran version, not the c one.
Did you configure with --enable-debug ?
The program sends to a rank *not* in the communicator, so this behavior could 
make some sense on an optimized build.
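
As a rough illustration only (a sketch, not the intel_tests source), the
pattern is essentially: keep the default MPI_ERRORS_ARE_FATAL handler and send
to a rank outside MPI_COMM_WORLD, so with parameter checking enabled the
library aborts the job with an MPI error, while with checking disabled the
invalid peer lookup can segfault:

/* fatal_sketch.c - hypothetical stand-in for the MPI_Errhandler_fatal test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, payload = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d: sending to invalid rank %d\n", rank, size, size);

    /* destination 'size' is not in MPI_COMM_WORLD; with the default
       MPI_ERRORS_ARE_FATAL handler this is expected to abort the job */
    MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);

    /* normally not reached */
    MPI_Finalize();
    return 0;
}

That abort path is presumably what the mpirun hang reported below races against.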

Cheers,

Gilles

Ralph Castain wrote:
>Ick - I’m getting a segfault when trying to run that test:
>
>
>MPITEST info  (0): Starting MPI_Errhandler_fatal test
>
>MPITEST info  (0): This test will abort after printing the results message
>
>MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>
>[bend001:07714] *** Process received signal ***
>
>[bend001:07714] Signal: Segmentation fault (11)
>
>[bend001:07714] Signal code: Address not mapped (1)
>
>[bend001:07714] Failing at address: 0x50
>
>[bend001:07715] *** Process received signal ***
>
>[bend001:07715] Signal: Segmentation fault (11)
>
>[bend001:07715] Signal code: Address not mapped (1)
>
>[bend001:07715] Failing at address: 0x50
>
>[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] *** Process received signal ***
>
>[bend001:07713] Signal: Segmentation fault (11)
>
>[bend001:07713] Signal code: Address not mapped (1)
>
>[bend001:07713] Failing at address: 0x50
>
>[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>
>[bend001:07713] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>
>[bend001:07713] [ 2] [bend001:07714] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>
>[bend001:07714] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>
>[bend001:07714] [ 2] [bend001:07715] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>
>[bend001:07715] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>
>[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>
>
>
>This is with the head of the 1.8 branch. Any suggestions?
>
>Ralph
>
>
>
>On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>
>
>Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
>the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
>remember how I fixed it) - thanks!
>
>On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
> wrote:
>
>Ralph,
>
>i noted several hangs in mtt with the v1.8 branch.
>
>a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>from the intel_tests suite,
>invoke mpirun on one node and run the tasks on another node :
>
>node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
>/* since this is a race condition, you might need to run this in a loop
>in order to hit the bug */
>
>the attached tarball contains a patch (add debug + temporary hack) and
>some log files obtained with
>--mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
>without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>the hack, i can still reproduce the hang (though it might
>be a different one) with -np 16 (log.ko.2.txt)
>
>i remember some similar hangs were fixed on the trunk/master a few
>months ago.
>i tried to backport some commits but it did not help :-(
>
>could you please have a look at this ?
>
>Cheers,
>
>Gilles
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/11/16357.php
>
>


Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Ick - I’m getting a segfault when trying to run that test:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[bend001:07714] *** Process received signal ***
[bend001:07714] Signal: Segmentation fault (11)
[bend001:07714] Signal code: Address not mapped (1)
[bend001:07714] Failing at address: 0x50
[bend001:07715] *** Process received signal ***
[bend001:07715] Signal: Segmentation fault (11)
[bend001:07715] Signal code: Address not mapped (1)
[bend001:07715] Failing at address: 0x50
[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07713] *** Process received signal ***
[bend001:07713] Signal: Segmentation fault (11)
[bend001:07713] Signal code: Address not mapped (1)
[bend001:07713] Failing at address: 0x50
[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
[bend001:07713] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
[bend001:07713] [ 2] [bend001:07714] [ 0] 
/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
[bend001:07714] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
[bend001:07714] [ 2] [bend001:07715] [ 0] 
/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
[bend001:07715] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)


This is with the head of the 1.8 branch. Any suggestions?
Ralph


> On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
> 
> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
> sure I remember how I fixed it) - thanks!
> 
>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>> > wrote:
>> 
>> Ralph,
>> 
>> i noted several hangs in mtt with the v1.8 branch.
>> 
>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>> from the intel_tests suite,
>> invoke mpirun on one node and run the tasks on another node :
>> 
>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>> 
>> /* since this is a race condition, you might need to run this in a loop
>> in order to hit the bug */
>> 
>> the attached tarball contains a patch (add debug + temporary hack) and
>> some log files obtained with
>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>> 
>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>> the hack, i can still reproduce the hang (though it might
>> be a different one) with -np 16 (log.ko.2.txt)
>> 
>> i remember some similar hangs were fixed on the trunk/master a few
>> months ago.
>> i tried to backport some commits but it did not help :-(
>> 
>> could you please have a look at this ?
>> 
>> Cheers,
>> 
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/11/16357.php 
>> 


Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
remember how I fixed it) - thanks!

> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> i noted several hangs in mtt with the v1.8 branch.
> 
> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
> from the intel_tests suite,
> invoke mpirun on one node and run the tasks on another node :
> 
> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
> 
> /* since this is a race condition, you might need to run this in a loop
> in order to hit the bug */
> 
> the attached tarball contains a patch (add debug + temporary hack) and
> some log files obtained with
> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
> 
> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
> the hack, i can still reproduce the hang (though it might
> be a different one) with -np 16 (log.ko.2.txt)
> 
> i remember some similar hangs were fixed on the trunk/master a few
> months ago.
> i tried to backport some commits but it did not help :-(
> 
> could you please have a look at this ?
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16357.php



[OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
Ralph,

i noted several hangs in mtt with the v1.8 branch.

a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
from the intel_tests suite,
invoke mpirun on one node and run the tasks on another node :

node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f

/* since this is a race condition, you might need to run this in a loop
in order to hit the bug */

the attached tarball contains a patch (add debug + temporary hack) and
some log files obtained with
--mca errmgr_base_verbose 100 --mca odls_base_verbose 100

without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
the hack, i can still reproduce the hang (though it might
be a different one) with -np 16 (log.ko.2.txt)

i remember some similar hangs were fixed on the trunk/master a few
months ago.
i tried to backport some commits but it did not help :-(

could you please have a look at this ?

Cheers,

Gilles


abort_hang.tar.gz
Description: application/gzip