Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread George Bosilca
Ralph,

Things got better, in the sense that before I was getting about 1 deadlock
for every 300 runs; now it is more like 1 out of every 500.

  George.


On Tue, Jun 7, 2016 at 12:04 AM, George Bosilca  wrote:

> Ralph,
>
> Not there yet. I got similar deadlocks, but the stack now looks slightly
> different. I only have a single thread doing something useful (i.e., sitting in
> listen_thread_fn); every other thread has a similar stack:
>
>   * frame #0: 0x7fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait
> + 10
> frame #1: 0x7fff9a000e4a
> libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> frame #2: 0x7fff99ffe5f5
> libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> frame #3: 0x7fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> frame #4: 0x7fff6bbc3177
> dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int,
> ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> frame #5: 0x7fff6bbad063
> dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> frame #6: 0x7fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> frame #7: 0x0001019d39b0 libopen-pal.0.dylib`obj_order_type + 3776
>
>   George.
>
>
> On Mon, Jun 6, 2016 at 1:41 PM, Ralph Castain  wrote:
>
>> I think I have this fixed here:
>> https://github.com/open-mpi/ompi/pull/1756
>>
>> George - can you please try it on your system?
>>
>>
>> On Jun 5, 2016, at 4:18 PM, Ralph Castain  wrote:
>>
>> Yeah, I can reproduce on my box. What is happening is that we aren’t
>> properly protected during finalize, and so we tear down some component that
>> is registered for a callback, and then the callback occurs. So we just need
>> to ensure that we finalize in the right order
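
(As a minimal stand-alone sketch of the ordering hazard described above - not
the actual ORTE finalize code - the component below registers a callback with a
tiny one-slot "event core" and is then torn down. Freeing the component before
deregistering would let a late callback fire against freed memory; deregistering
first makes the pending callback harmless. All names here are illustrative.)

/* Minimal sketch (not the actual ORTE code) of the finalize-ordering hazard:
 * a component registers a callback, is freed while the callback is still
 * armed, and the late callback then touches freed memory.  Deregistering
 * before tearing the component down removes the race. */
#include <stdio.h>
#include <stdlib.h>

typedef void (*cb_fn_t)(void *ctx);

/* A one-slot stand-in for the event core: anything left registered here
 * can still fire after the component that registered it is gone. */
static cb_fn_t registered_cb  = NULL;
static void   *registered_ctx = NULL;

static void register_cb(cb_fn_t cb, void *ctx) { registered_cb = cb; registered_ctx = ctx; }
static void deregister_cb(void)                { registered_cb = NULL; registered_ctx = NULL; }
static void fire_pending(void)                 { if (registered_cb) registered_cb(registered_ctx); }

typedef struct { int fd; } component_t;

static void component_cb(void *ctx)
{
    component_t *c = ctx;
    printf("callback sees fd=%d\n", c->fd);  /* use-after-free if c was already freed */
}

int main(void)
{
    component_t *comp = malloc(sizeof(*comp));
    comp->fd = 42;
    register_cb(component_cb, comp);

    /* Wrong order (the bug): free(comp); fire_pending();  -> dangling callback */

    /* Right order: deregister first, then tear the component down. */
    deregister_cb();
    free(comp);
    fire_pending();  /* now a no-op; nothing dangles */
    return 0;
}

The same discipline applies whatever the real event machinery is: callbacks and
their owning components have to be shut down in dependency order during finalize.
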
>>
>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) 
>> wrote:
>>
>> Ok, good.
>>
>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e.,
>> I did something like George's shell script loop), and just now I ran
>> George's exact loop, but I haven't been able to reproduce.  In this case,
>> I'm falling on the wrong side of whatever race condition is happening...
>>
>>
>> On Jun 4, 2016, at 7:57 PM, Ralph Castain  wrote:
>>
>> I may have an idea of what’s going on here - I just need to finish
>> something else first and then I’ll take a look.
>>
>>
>> On Jun 4, 2016, at 4:20 PM, George Bosilca  wrote:
>>
>>
>> On Jun 5, 2016, at 07:53 , Ralph Castain  wrote:
>>
>>
>> On Jun 4, 2016, at 1:11 PM, George Bosilca  wrote:
>>
>>
>>
>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain  wrote:
>> He can try adding "-mca state_base_verbose 5", but if we are failing to
>> catch sigchld, I’m not sure what debugging info is going to help resolve
>> that problem. These aren’t even fast-running apps, so there was plenty of
>> time to register for the signal prior to termination.
>>
>> I vaguely recollect that we have occasionally seen this on Mac before and
>> it had something to do with oddness in sigchld handling…
>>
>> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking
>> instead of quitting, which would allow the OS to clean up all the children?
>>
>>
>> I don’t think mpirun is actually “deadlocked” - I think it may just be
>> waiting for sigchld to tell it that the local processes have terminated.
>>
>> However, that wouldn't explain why you see what looks like libraries
>> being unloaded. That implies mpirun is actually finalizing, but failing to
>> fully exit - which would indeed be more of a deadlock.
>>
>> So the question is: are you truly seeing us missing sigchld (as was
>> suggested earlier in this thread),
>>
>>
>> In theory the processes remain in a zombie state until the parent calls
>> waitpid on them, at which moment they are supposed to disappear. Based on
>> this, as the processes are still in the zombie state, I assumed that mpirun
>> was not calling waitpid. One could also assume we are again being hit by the
>> fork race condition we had a while back, but as all the local processes are
>> in the zombie state, that is hard to believe.
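
(For illustration only, and not mpirun's actual code: a minimal stand-alone
sketch of the mechanism being debated. Exited children stay zombies until the
parent reaps them with waitpid; a SIGCHLD handler that loops on waitpid with
WNOHANG clears them, and commenting out the sigaction call leaves the zombies
lingering, exactly as described above.)

/* Stand-alone sketch (not mpirun's code) of SIGCHLD handling and reaping:
 * children that exit stay zombies until the parent calls waitpid on them. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped = 0;

static void sigchld_handler(int sig)
{
    (void)sig;
    /* One SIGCHLD may stand for several exited children, so loop. */
    while (waitpid(-1, NULL, WNOHANG) > 0) {
        reaped = reaped + 1;
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGCHLD, &sa, NULL);   /* comment this out and the children stay zombies */

    for (int i = 0; i < 4; i++) {    /* mimic four local children, as in "mpirun -np 4" */
        if (fork() == 0) {
            _exit(0);                /* child terminates immediately */
        }
    }

    /* Wait briefly until the handler has reaped all four children. */
    for (int t = 0; t < 10 && reaped < 4; t++) {
        sleep(1);
    }
    printf("reaped %d children\n", (int)reaped);
    return 0;
}

WNOHANG keeps the handler from blocking, and the loop matters because multiple
child exits can be coalesced into a single SIGCHLD delivery.
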
>>
>> or did mpirun correctly see all the child processes terminate and is
>> actually hanging while trying to exit (as was also suggested earlier)?
>>
>>
>> One way or another the stack of the main thread looks busted. While the
>> discussion about this was going on I was able to replicate the bug with
>> only ORTE involved. Simply running
>>
>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>
>> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting
>> the 300th iteration. Unfortunately, adding the verbose option alters the
>> behavior enough that the issue does not reproduce.
>>
>> George.
>>
>>
>> Adding the state verbosity should tell us which of those two is true,
>> assuming it doesn’t affect the timing so much that everything works :-/

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread George Bosilca
Ralph,

Not there yet. I got similar deadlocks, but the stack now looks slightly
different. I only have a single thread doing something useful (i.e., sitting in
listen_thread_fn); every other thread has a similar stack:

  * frame #0: 0x7fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait
+ 10
frame #1: 0x7fff9a000e4a
libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
frame #2: 0x7fff99ffe5f5
libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
frame #3: 0x7fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
frame #4: 0x7fff6bbc3177
dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int,
ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
frame #5: 0x7fff6bbad063
dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
frame #6: 0x7fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
frame #7: 0x0001019d39b0 libopen-pal.0.dylib`obj_order_type + 3776

  George.


On Mon, Jun 6, 2016 at 1:41 PM, Ralph Castain  wrote:

> I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756
>
> George - can you please try it on your system?
>
>
> On Jun 5, 2016, at 4:18 PM, Ralph Castain  wrote:
>
> Yeah, I can reproduce on my box. What is happening is that we aren’t
> properly protected during finalize, and so we tear down some component that
> is registered for a callback, and then the callback occurs. So we just need
> to ensure that we finalize in the right order
>
> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) 
> wrote:
>
> Ok, good.
>
> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I
> did something like George's shell script loop), and just now I ran George's
> exact loop, but I haven't been able to reproduce.  In this case, I'm
> falling on the wrong side of whatever race condition is happening...
>
>
> On Jun 4, 2016, at 7:57 PM, Ralph Castain  wrote:
>
> I may have an idea of what’s going on here - I just need to finish
> something else first and then I’ll take a look.
>
>
> On Jun 4, 2016, at 4:20 PM, George Bosilca  wrote:
>
>
> On Jun 5, 2016, at 07:53 , Ralph Castain  wrote:
>
>
> On Jun 4, 2016, at 1:11 PM, George Bosilca  wrote:
>
>
>
> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain  wrote:
> He can try adding "-mca state_base_verbose 5", but if we are failing to
> catch sigchld, I’m not sure what debugging info is going to help resolve
> that problem. These aren’t even fast-running apps, so there was plenty of
> time to register for the signal prior to termination.
>
> I vaguely recollect that we have occasionally seen this on Mac before and
> it had something to do with oddness in sigchld handling…
>
> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking
> instead of quitting, which would allow the OS to clean up all the children?
>
>
> I don’t think mpirun is actually “deadlocked” - I think it may just be
> waiting for sigchld to tell it that the local processes have terminated.
>
> However, that wouldn't explain why you see what looks like libraries being
> unloaded. That implies mpirun is actually finalizing, but failing to fully
> exit - which would indeed be more of a deadlock.
>
> So the question is: are you truly seeing us missing sigchld (as was
> suggested earlier in this thread),
>
>
> In theory the processes remain in a zombie state until the parent calls
> waitpid on them, at which moment they are supposed to disappear. Based on
> this, as the processes are still in the zombie state, I assumed that mpirun
> was not calling waitpid. One could also assume we are again being hit by the
> fork race condition we had a while back, but as all the local processes are
> in the zombie state, that is hard to believe.
>
> or did mpirun correctly see all the child processes terminate and is
> actually hanging while trying to exit (as was also suggested earlier)?
>
>
> One way or another the stack of the main thread looks busted. While the
> discussion about this was going on I was able to replicate the bug with
> only ORTE involved. Simply running
>
> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>
> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting
> the 300th iteration. Unfortunately, adding the verbose option alters the
> behavior enough that the issue does not reproduce.
>
> George.
>
>
> Adding the state verbosity should tell us which of those two is true,
> assuming it doesn’t affect the timing so much that everything works :-/
>
>
>
> George.
>
>
>
>
> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) 
> wrote:
>
> Meh.  Ok.  Should George run with some verbose level to get more info?
>
> On Jun 4, 2016, at 6:43 AM, Ralph Castain  wrote:
>
> Neither of those threads has anything to do with catching the sigchld -
> threads 4-5 are listening for OOB and PMIx connection requests. It looks
> more like mpirun thought it had picked everything up and has begun shutting
> down, but I can’t really tell for certain.

Re: [OMPI devel] [1.10.3rc4] testing results

2016-06-06 Thread Christopher Samuel
On 06/06/16 15:09, Larry Baker wrote:

> An impressive accomplishment by the development team.  And impressive
> coverage by Paul's testbed.  Well done!

Agreed, it is very impressive to watch both on the breaking & the fixing
side of things. :-)

Thanks so much to all involved with this.

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] [1.10.3rc4] testing results

2016-06-06 Thread Paul Hargrove
I am pleased to report SUCCESS on 93 out of 95 distinct test configurations.

The two failures were NAG Fortran versions 5 and 6, which were not expected
to work with v1.10.  The NAG support is being tracked in issue #1284, and
the work (PR 1295) was merged to master just minutes ago.  While the issue
does currently list 1.10.3 as the target milestone, I am certainly not
going to be the one to insist on delaying this release for NAG Fortran
support.

This round of testing includes SPARC for the first time, with both V8+ and
V9 ABIs covered.
Other non-x86 CPUs tested include ia64, ppc32, ppc64, ppc64el, mips32,
mips64, mips64el, arm, and aarch64.
The multiple ISAs (e.g. ARMv5, v6, and v7) and ABIs (e.g. MIPS "32", "n32",
and "64") tested should cover all opal atomics except "alpha" and "win32"
(including the gcc and darwin built-ins).
Testing of x86 and x86-64 includes Linux, Mac OSX, Solaris, FreeBSD, NetBSD
and OpenBSD.
Tested compiler families include GNU, Clang, Intel, PGI, IBM, Sun/Oracle,
Pathscale, Open64 and Absoft (with multiple versions of most).

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread Ralph Castain
I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756 


George - can you please try it on your system?


> On Jun 5, 2016, at 4:18 PM, Ralph Castain  wrote:
> 
> Yeah, I can reproduce on my box. What is happening is that we aren’t properly 
> protected during finalize, and so we tear down some component that is 
> registered for a callback, and then the callback occurs. So we just need to 
> ensure that we finalize in the right order
> 
>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> Ok, good.
>> 
>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I 
>> did something like George's shell script loop), and just now I ran George's 
>> exact loop, but I haven't been able to reproduce.  In this case, I'm falling 
>> on the wrong side of whatever race condition is happening...
>> 
>> 
>>> On Jun 4, 2016, at 7:57 PM, Ralph Castain  wrote:
>>> 
>>> I may have an idea of what’s going on here - I just need to finish 
>>> something else first and then I’ll take a look.
>>> 
>>> 
 On Jun 4, 2016, at 4:20 PM, George Bosilca  wrote:
 
> 
> On Jun 5, 2016, at 07:53 , Ralph Castain  wrote:
> 
>> 
>> On Jun 4, 2016, at 1:11 PM, George Bosilca  wrote:
>> 
>> 
>> 
>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain  wrote:
>> He can try adding "-mca state_base_verbose 5", but if we are failing to 
>> catch sigchld, I’m not sure what debugging info is going to help resolve 
>> that problem. These aren’t even fast-running apps, so there was plenty 
>> of time to register for the signal prior to termination.
>> 
>> I vaguely recollect that we have occasionally seen this on Mac before 
>> and it had something to do with oddness in sigchld handling…
>> 
>> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking 
>> instead of quitting, which would allow the OS to clean up all the children?
> 
> I don’t think mpirun is actually “deadlocked” - I think it may just be 
> waiting for sigchld to tell it that the local processes have terminated.
> 
> However, that wouldn't explain why you see what looks like libraries 
> being unloaded. That implies mpirun is actually finalizing, but failing 
> to fully exit - which would indeed be more of a deadlock.
> 
> So the question is: are you truly seeing us missing sigchld (as was 
> suggested earlier in this thread),
 
 In theory the processes remain in a zombie state until the parent calls 
 waitpid on them, at which moment they are supposed to disappear. Based on 
 this, as the processes are still in the zombie state, I assumed that mpirun 
 was not calling waitpid. One could also assume we are again being hit by the 
 fork race condition we had a while back, but as all the local processes are 
 in the zombie state, that is hard to believe.
 
> or did mpirun correctly see all the child processes terminate and is 
> actually hanging while trying to exit (as was also suggested earlier)?
 
 One way or another the stack of the main thread looks busted. While the 
 discussion about this was going on I was able to replicate the bug with 
 only ORTE involved. Simply running 
 
 for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
 
 ‘deadlocks’ (or whatever name we want to call this) reliably before hitting 
 the 300th iteration. Unfortunately, adding the verbose option alters the 
 behavior enough that the issue does not reproduce.
 
 George.
 
> 
> Adding the state verbosity should tell us which of those two is true, 
> assuming it doesn’t affect the timing so much that everything works :-/
> 
> 
>> 
>> George.
>> 
>> 
>> 
>> 
>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> 
>>> Meh.  Ok.  Should George run with some verbose level to get more info?
>>> 
 On Jun 4, 2016, at 6:43 AM, Ralph Castain  wrote:
 
 Neither of those threads has anything to do with catching the sigchld 
 - threads 4-5 are listening for OOB and PMIx connection requests. It 
 looks more like mpirun thought it had picked everything up and has 
 begun shutting down, but I can’t really tell for certain.
 
> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) 
>  wrote:
> 
> On Jun 3, 2016, at 11:07 PM, George Bosilca  
> wrote:
>> 
>> After finalize. As I said in my original email I see all the output 
>> the application is generating, and all processes (which are local as