Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-25 Thread Andrej Prsa
Hi Jeff,

My apologies for the delay in replying; I was flying back from the UK
to the States, but now I'm here and can provide a more timely
response.

> I confirm that the hwloc message you sent (and your posts to the
> hwloc-users list) indicate that hwloc is getting confused by a buggy
> BIOS, but it's only dealing with the L3 cache, and that shouldn't
> affect the binding that OMPI is doing.

Great, good to know. I'd still be interested in learning how to build an
hwloc-parsable XML file as a workaround, especially if it fixes the
bindings (see below).
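
For reference, a minimal sketch (assuming the hwloc 1.x C API; lstopo can
produce the same file from the command line) of how such an XML snapshot
could be generated. The exported file could then be hand-edited to correct
the bogus L3 information and handed back to hwloc via
hwloc_topology_set_xml() or the HWLOC_XMLFILE environment variable:

#include <hwloc.h>

/* Sketch only: dump the discovered topology to an XML file that hwloc can
 * later re-load instead of re-querying the buggy BIOS. */
int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);                        /* discover this machine */
    hwloc_topology_export_xml(topo, "node-topo.xml"); /* write it out as XML */
    hwloc_topology_destroy(topo);
    return 0;
}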

> 1. Run with "--report-bindings" and send the output.  It'll
> prettyprint-render where OMPI thinks it is binding each process.

Please find it attached.

> 2. Run with "--bind-to none" and see if that helps.  I.e., if, per
> #1, OMPI thinks it is binding correctly (i.e., each of the 48
> processes is being bound to a unique core), then perhaps hwloc is
> doing something wrong in the actual binding (i.e., binding the 48
> processes only among the lower 32 cores).

BINGO! As soon as I did this, indeed all the cores went to 100%! Here's
the updated timing (compared to 13 minutes from before):

real    1m8.442s
user    0m0.077s
sys     0m0.071s

So I guess the conclusion is that hwloc is somehow messing things up on
this chipset?
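
For what it's worth, a minimal sketch (hwloc 1.x API plus MPI; untested on
this chipset and not part of the original report) of a per-rank check that
prints the CPU set each process actually ended up bound to, independently
of what --report-bindings claims:

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include "mpi.h"

/* Each rank loads the topology and asks hwloc where the current process is
 * bound. If the printed masks only ever cover the lower 32 cores, the
 * problem is in the actual binding step rather than in the mapping. */
int main(int argc, char *argv[])
{
    int rank;
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    char *mask = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    set = hwloc_bitmap_alloc();
    if (0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS)) {
        hwloc_bitmap_asprintf(&mask, set);
        printf("rank %d bound to cpuset %s\n", rank, mask);
        free(mask);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

Built with mpicc and -lhwloc, and run once with and once without
--bind-to none, the two sets of masks should make it obvious whether the
binding itself is what goes wrong.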

Thanks,
Andrej


test_report_bindings.stderr
Description: Binary data


Re: [OMPI devel] OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Ralph Castain
And that was indeed the problem - fixed, and now the trunk runs cleanly
through my MTT.

Thanks again!
Ralph

On Aug 25, 2014, at 7:38 AM, Ralph Castain  wrote:

> Yeah, that was going to be my first place to look once I finished breakfast 
> :-)
> 
> Thanks!
> Ralph
> 
> On Aug 25, 2014, at 7:32 AM, Gilles Gouaillardet 
>  wrote:
> 
>> Thanks for the explanation
>> 
>> In orte_dt_compare_sig(...) memcmp did not multiply value1->sz by 
>> sizeof(opal_identifier_t).
>> 
>> Being afk, I could not test but that looks like a good suspect
>> 
>> Cheers,
>> 
>> Gilles

Re: [OMPI devel] OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Ralph Castain
Yeah, that was going to be my first place to look once I finished breakfast :-)

Thanks!
Ralph

On Aug 25, 2014, at 7:32 AM, Gilles Gouaillardet 
 wrote:

> Thanks for the explanation
> 
> In orte_dt_compare_sig(...) memcmp did not multiply value1->sz by 
> sizeof(opal_identifier_t).
> 
> Being afk, I could not test but that looks like a good suspect
> 
> Cheers,
> 
> Gilles

Re: [OMPI devel] OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Gilles Gouaillardet
Thanks for the explanation.

In orte_dt_compare_sig(...), memcmp did not multiply value1->sz by
sizeof(opal_identifier_t).

Being AFK, I could not test, but that looks like a good suspect.

Cheers,

Gilles
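
For illustration, a hedged sketch of the kind of comparison Gilles is
pointing at (the struct and function names below are invented, not the
actual ORTE code): memcmp counts bytes, so the element count has to be
scaled by the element size.

#include <string.h>
#include <stdint.h>

typedef uint64_t opal_identifier_t;   /* stand-in for the real typedef */

/* Hypothetical signature type: the array of identifiers of all procs
 * participating in a collective, plus its length. */
typedef struct {
    opal_identifier_t *signature;
    size_t sz;                        /* number of elements, not bytes */
} coll_sig_t;

static int sig_equal(const coll_sig_t *a, const coll_sig_t *b)
{
    return a->sz == b->sz &&
           0 == memcmp(a->signature, b->signature,
                       a->sz * sizeof(opal_identifier_t)); /* bytes, not elements */
}

Without the sizeof() factor only the first sz bytes would be compared, so
two different signatures of the same length could wrongly compare equal,
which would let the daemons mix up the two disconnect collectives.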

Ralph Castain  wrote:
>Each collective is given a "signature" that is just the array of names for all 
>procs involved in the collective. Thus, even though task 0 is involved in both 
>of the disconnect barriers, the two collectives should be running in isolation 
>from each other.
>
>The "tags" are just receive callbacks and have no meaning other than to 
>associate a particular callback to a given send/recv pair. It is the signature 
>that counts as the daemons are using that to keep the various collectives 
>separated.
>
>I'll have to take a look at why task 2 is leaving early. The key will be to 
>look at that signature to ensure we aren't getting it confused.

Re: [OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Ralph Castain
Each collective is given a "signature" that is just the array of names for all 
procs involved in the collective. Thus, even though task 0 is involved in both 
of the disconnect barriers, the two collectives should be running in isolation 
from each other.

The "tags" are just receive callbacks and have no meaning other than to 
associate a particular callback to a given send/recv pair. It is the signature 
that counts as the daemons are using that to keep the various collectives 
separated.

I'll have to take a look at why task 2 is leaving early. The key will be to 
look at that signature to ensure we aren't getting it confused.


[OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Gilles Gouaillardet
Folks,

When I run
mpirun -np 1 ./intercomm_create
from the IBM test suite, it either:
- succeeds,
- hangs, or
- crashes (mpirun gets a SIGSEGV) soon after writing the following message:
ORTE_ERROR_LOG: Not found in file
../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566

Here is what happens.

First, the test program itself:
task 0 spawns task 1: the intercommunicator is ab_inter on task 0 and
parent on task 1.
Then task 0 spawns task 2: the intercommunicator is ac_inter on task 0
and parent on task 2.
Then come several operations (merge, barrier, ...),
and then, without any synchronization:
- task 0 calls MPI_Comm_disconnect(ab_inter) and then
MPI_Comm_disconnect(ac_inter)
- tasks 1 and 2 call MPI_Comm_disconnect(parent)

I applied the attached pmix_debug.patch and ran
mpirun -np 1 --mca pmix_base_verbose 90 ./intercomm_create

Basically, tasks 0 and 1 execute a native fence and, in parallel, tasks 0
and 2 execute a native fence.
Both fences use the *same* tags on different, though overlapping, sets of
tasks. Bottom line: task 2 leaves its fence *before* task 0 has entered it
(it seems task 1 told task 2 it is OK to leave the fence).

A simple workaround is to call MPI_Barrier before calling
MPI_Comm_disconnect.
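
For illustration, a minimal sketch of that workaround at the application
level (the helper name is ours, not part of the test suite; error handling
omitted):

#include "mpi.h"

/* Synchronize on the communicator before disconnecting it, so no task can
 * leave the disconnect phase before its peers have entered it. The argument
 * is whichever handle is being torn down (ab_inter, ac_inter, or parent in
 * the test program). */
static void barrier_then_disconnect(MPI_Comm *comm)
{
    MPI_Barrier(*comm);           /* explicit synchronization first */
    MPI_Comm_disconnect(comm);    /* sets *comm to MPI_COMM_NULL */
}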

At this stage, I doubt it is even possible to get this working at the
pmix level, so the fix might be to have MPI_Comm_disconnect invoke
MPI_Barrier. The attached comm_disconnect.patch always calls the barrier
before (indirectly) invoking pmix.

Could you please comment on this issue?

Cheers,

Gilles

Here are the relevant logs:

[soleil:00650] [[8110,3],0] pmix:native executing fence on 2 procs
[[8110,1],0] and [[8110,3],0]
[soleil:00650] [[8110,3],0]
[../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post
send to server
[soleil:00650] [[8110,3],0] posting recv on tag 5
[soleil:00650] [[8110,3],0] usock:send_nb: already connected to server -
queueing for send
[soleil:00650] [[8110,3],0] usock:send_handler called to send to server
[soleil:00650] [[8110,3],0] usock:send_handler SENDING TO SERVER
[soleil:00647] [[8110,2],0] pmix:native executing fence on 2 procs
[[8110,1],0] and [[8110,2],0]
[soleil:00647] [[8110,2],0]
[../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post
send to server
[soleil:00647] [[8110,2],0] posting recv on tag 5
[soleil:00647] [[8110,2],0] usock:send_nb: already connected to server -
queueing for send
[soleil:00647] [[8110,2],0] usock:send_handler called to send to server
[soleil:00647] [[8110,2],0] usock:send_handler SENDING TO SERVER
[soleil:00650] [[8110,3],0] usock:recv:handler called
[soleil:00650] [[8110,3],0] usock:recv:handler CONNECTED
[soleil:00650] [[8110,3],0] usock:recv:handler allocate new recv msg
[soleil:00650] usock:recv:handler read hdr
[soleil:00650] [[8110,3],0] usock:recv:handler allocate data region of
size 14
[soleil:00650] [[8110,3],0] RECVD COMPLETE MESSAGE FROM SERVER OF 14
BYTES FOR TAG 5
[soleil:00650] [[8110,3],0]
[../../../../../../src/ompi-trunk/opal/mca/pmix/native/usock_sendrecv.c:415]
post msg
[soleil:00650] [[8110,3],0] message received 14 bytes for tag 5
[soleil:00650] [[8110,3],0] checking msg on tag 5 for tag 5
[soleil:00650] [[8110,3],0] pmix:native recv callback activated with 14
bytes
[soleil:00650] [[8110,3],0] pmix:native fence released on 2 procs
[[8110,1],0] and [[8110,3],0]


Index: opal/mca/pmix/native/pmix_native.c
===
--- opal/mca/pmix/native/pmix_native.c  (revision 32594)
+++ opal/mca/pmix/native/pmix_native.c  (working copy)
@@ -390,9 +390,17 @@
 size_t i;
 uint32_t np;

-opal_output_verbose(2, opal_pmix_base_framework.framework_output,
+if (2 == nprocs) {
+opal_output_verbose(2, opal_pmix_base_framework.framework_output,
+"%s pmix:native executing fence on %u procs %s and %s",
+OPAL_NAME_PRINT(OPAL_PROC_MY_NAME), (unsigned int)nprocs,
+OPAL_NAME_PRINT(procs[0]),
+OPAL_NAME_PRINT(procs[1]));
+} else {
+opal_output_verbose(2, opal_pmix_base_framework.framework_output,
 "%s pmix:native executing fence on %u procs",
 OPAL_NAME_PRINT(OPAL_PROC_MY_NAME), (unsigned int)nprocs);
+}

 if (NULL == mca_pmix_native_component.uri) {
 /* no server available, so just return */
@@ -545,9 +553,17 @@

 OBJ_RELEASE(cb);

-opal_output_verbose(2, opal_pmix_base_framework.framework_output,
-"%s pmix:native fence released",
-OPAL_NAME_PRINT(OPAL_PROC_MY_NAME));
+if (2 == nprocs) {
+opal_output_verbose(2, opal_pmix_base_framework.framework_output,
+"%s pmix:native fence released on %u procs %s and %s",
+OPAL_NAME_PRINT(OPAL_PROC_MY_NAME), (unsigned int)nprocs,
+OPAL_NAME_PRINT(procs[0]),
+OPAL_NAM

Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-25 Thread Gilles Gouaillardet
Thanks, Ralph!

I confirm that all my test cases pass now :-)

FYI, I committed r32592 in order to fix a parsing bug on 32-bit platforms
(hence the MTT failures on trunk on x86).

Cheers,

Gilles


On 2014/08/23 4:59, Ralph Castain wrote:
> I think these are fixed now - at least, your test cases all pass for me
>
>
> On Aug 22, 2014, at 9:12 AM, Ralph Castain  wrote:
>
>> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
>>  wrote:
>>
>>> Ralph,
>>>
>>> Will do on Monday
>>>
>>> About the first test, in my case echo $? returns 0
>> My "showcode" is just an alias for the echo
>>
>>> I noticed this confusing message in your output :
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>>> signal 0 (Unknown signal 0).
>> I'll take a look at why that happened
>>
>>> About the second test, please note that my test program returns 3,
>>> whereas your mpi_no_op.c returns 0.
>> I didn't see that little cuteness - sigh
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Ralph Castain  wrote:
>>> You might want to try again with current head of trunk as something seems 
>>> off in what you are seeing - more below
>>>
>>>
>>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
>>>  wrote:
>>>
 Ralph,

 I tried again after the merge and found the same behaviour, though the
 internals are very different.

 I run without any batch manager.

 From node0:
 mpirun -np 1 --mca btl tcp,self -host node1 ./abort

 exits with exit code zero :-(
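
(For context, a hedged reconstruction of roughly what the abort test does,
inferred from the output quoted further down in this message; the actual
source from the ibm test suite is not included in the thread:)

#include <stdio.h>
#include "mpi.h"

/* Reconstruction only: rank 0 prints a greeting and then aborts with
 * errorcode 2, so mpirun's exit status should end up non-zero. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, World, I am %d of %d\n", rank, size);

    MPI_Abort(MPI_COMM_WORLD, 2);   /* matches "errorcode 2" in the output below */

    MPI_Finalize();                 /* never reached */
    return 0;
}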
>>> Hmmm...it works fine for me, without your patch:
>>>
>>> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>>> Hello, World, I am 0 of 1
>>> --
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --
>>> --
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>>> signal 0 (Unknown signal 0).
>>> --
>>> 07:35:56  $ showcode
>>> 130
>>>
 Short story: I applied pmix.2.patch and that fixed my problem.
 Could you please review it?

 Long story:
 I initially applied pmix.1.patch and it solved my problem.
 Then I ran
 mpirun -np 1 --mca btl openib,self -host node1 ./abort
 and came back to square one: the exit code is zero.
 So I used the debugger and was unable to reproduce the issue
 (one more race condition, yeah!).
 Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
 pmix.1.patch was no longer needed.
 Currently, and assuming pmix.2.patch is correct, I cannot tell whether
 pmix.1.patch is needed or not,
 since that part of the code is no longer executed.

 I also found one hang with the following trivial program within one node:

 int main (int argc, char *argv[]) {
     MPI_Init(&argc, &argv);
     MPI_Finalize();
     return 3;
 }

 From node0:
 $ mpirun -np 1 ./test
 ---
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code.. Per user-direction, the job has been aborted.
 ---

 AND THE PROGRAM HANGS
>>> This also works fine for me:
>>>
>>> 07:37:27  $ mpirun -n 1 ./mpi_no_op
>>> 07:37:36  $ cat mpi_no_op.c
>>> /* -*- C -*-
>>>  *
>>>  * $HEADER$
>>>  *
>>>  * The most basic of MPI applications
>>>  */
>>>
>>> #include 
>>> #include "mpi.h"
>>>
>>> int main(int argc, char* argv[])
>>> {
>>> MPI_Init(&argc, &argv);
>>>
>>> MPI_Finalize();
>>> return 0;
>>> }
>>>
>>>
 *but*
 $ mpirun -np 1 -host node1 ./test
 ---
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code.. Per user-direction, the job has been aborted.
 ---
 --
 mpirun detected that one or more processes exited with non-zero status,
 thus causing
 the job to be terminated. The first process to do so was:

  Process name: [[22080,1],0]
  Exit code:3
 --

 mpirun returns with exit code 3.
>>> Likewise here - works just fine for me
>>>
>>>
 Then I found a strange behaviour with helloworld if only the self BTL
 is used:
 $ mpirun -np 1 --mca btl self ./hw
 [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
 [he