Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

2014-06-06 Thread Ralph Castain

On Jun 6, 2014, at 12:50 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> Thanks for trying, Ralph. Looks like my issue has to do with a coll ml
> interaction. If I exclude coll ml, then all my tests pass. Do you know if
> there is a bug filed for this issue?

There is a known issue with coll ml for intercomm_create - Nathan is working on 
a fix. It was reported by Gilles (yesterday?)
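
One quick way to double-check whether the ml component is actually being
selected on your build (a sketch; the verbose level and host are just the ones
from my runs quoted further down the thread):

$ ompi_info --param coll ml
$ mpirun -n 3 --host bend001 --mca coll_base_verbose 10 ./intercomm_create

Excluding it with --mca coll ^ml, as in the intercomm_create run below, works
as a stopgap until the fix goes in.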

> If so, then I can run my nightly tests with coll ml disabled and wait for the 
> bug to be fixed.
>  
> Also, where do simple_spawn and spawn_multiple live?

I have a copy in my orte/test/mpi directory that I use - that's where these
came from. Note that I left coll ml "on" for those runs, as they weren't having
any trouble.


> I was running "spawn" and "spawn_multiple" from the ibm/dynamic test suite.
> Your output for spawn_multiple looks different from mine.

Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

2014-06-06 Thread Rolf vandeVaart
Thanks for trying, Ralph. Looks like my issue has to do with a coll ml
interaction. If I exclude coll ml, then all my tests pass. Do you know if
there is a bug filed for this issue?
If so, then I can run my nightly tests with coll ml disabled and wait for the 
bug to be fixed.
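
For the nightly harness, the ml component can be excluded per run or globally
(a sketch, assuming the usual MCA mechanisms; the test name is just an example):

$ mpirun -np 2 --mca coll ^ml spawn_multiple
$ export OMPI_MCA_coll=^ml        # or set once for the whole harness

or equivalently a "coll = ^ml" line in ~/.openmpi/mca-params.conf.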

Also, where do simple_spawn and spawn_multiple live?  I was running "spawn"
and "spawn_multiple" from the ibm/dynamic test suite.
Your output for spawn_multiple looks different from mine.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 06, 2014 3:19 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang 
on trunk

Works fine for me:

[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./simple_spawn
[pid 22777] starting up!
[pid 22778] starting up!
[pid 22779] starting up!
1 completed MPI_Init
Parent [pid 22778] about to spawn!
2 completed MPI_Init
Parent [pid 22779] about to spawn!
0 completed MPI_Init
Parent [pid 22777] about to spawn!
[pid 22783] starting up!
[pid 22784] starting up!
Parent done with spawn
Parent sending message to child
Parent done with spawn
Parent done with spawn
0 completed MPI_Init
Hello from the child 0 of 2 on host bend001 pid 22783
Child 0 received msg: 38
1 completed MPI_Init
Hello from the child 1 of 2 on host bend001 pid 22784
Child 1 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
22784: exiting
22778: exiting
22779: exiting
22777: exiting
22783: exiting
[rhc@bend001 mpi]$ make spawn_multiple
mpicc -g --openmpi:linkall spawn_multiple.c -o spawn_multiple
[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./spawn_multiple
Parent [pid 22797] about to spawn!
Parent [pid 22798] about to spawn!
Parent [pid 22799] about to spawn!
Parent done with spawn
Parent done with spawn
Parent sending message to children
Parent done with spawn
Hello from the child 0 of 2 on host bend001 pid 22803: argv[1] = foo
Child 0 received msg: 38
Hello from the child 1 of 2 on host bend001 pid 22804: argv[1] = bar
Child 1 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 -mca coll ^ml ./intercomm_create
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 3]
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 4]
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 5]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 3]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 4]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 5]
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
a: intercomm_merge(0) (0) [rank 2]
c: intercomm_merge(0) (0) [rank 8]
a: intercomm_merge(0) (0) [rank 0]
a: intercomm_merge(0) (0) [rank 1]
c: intercomm_merge(0) (0) [rank 7]
b: intercomm_merge(1) (0) [rank 4]
b: intercomm_merge(1) (0) [rank 5]
c: intercomm_merge(0) (0) [rank 6]
b: intercomm_merge(1) (0) [rank 3]
a: barrier (0)
b: barrier (0)
c: barrier (0)
a: barrier (0)
c: barrier (0)
b: barrier (0)
a: barrier (0)
c: barrier (0)
b: barrier (0)
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 0
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 0
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 1
dpm_base_disconnect_init: error -12 in isend to process 3
[rhc@bend001 mpi]$




[OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

2014-06-06 Thread Rolf vandeVaart
I am seeing an interesting failure on trunk.  intercomm_create, spawn, and 
spawn_multiple from the IBM tests hang if I explicitly list the hostnames to 
run on.  For example:

Good:
$ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
Parent: 1 of 2, drossetti-ivy0.nvidia.com (0 in init)
Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
$ 

Bad:
$ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 
spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (1 in init)
Parent: 1 of 2, drossetti-ivy0.nvidia.com (1 in init)
Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
[..and we are hung here...]

I see the exact same behavior for spawn and spawn_multiple.  Ralph, any 
thoughts?  Open MPI 1.8 is fine.  I can provide more information if needed, but 
I assume this is reproducible. 
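
(The same -host pattern triggers it for the other tests too, e.g.:

$ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 spawn
$ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 intercomm_create

assuming the binaries from the ibm/dynamic suite; only the explicit host list
differs from the passing runs.)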

Thanks,
Rolf