Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk
On Jun 6, 2014, at 12:50 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> Thanks for trying, Ralph. Looks like my issue has to do with coll ml
> interaction. If I exclude coll ml, then all my tests pass. Do you know if
> there is a bug for this issue?

There is a known issue with coll ml for intercomm_create - Nathan is working on a fix. It was reported by Gilles (yesterday?).

> If so, then I can run my nightly tests with coll ml disabled and wait for
> the bug to be fixed.
>
> Also, where do simple_spawn and spawn_multiple live?

I have a copy/version in my orte/test/mpi directory that I use - that's where these came from. Note that I left coll ml "on" for those as they weren't having trouble.

> I was running "spawn" and "spawn_multiple" from the ibm/dynamic test suite.
> Your output for spawn_multiple looks different than mine.
>
> [remainder of quoted message and test output snipped]
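Since the question of running nightly tests with coll ml disabled came up: besides passing the exclusion on the mpirun command line, it can be made persistent through a standard MCA parameter file. A sketch using Open MPI's usual file locations (the test binary name here is just illustrative):

```shell
# Disable the ml coll component for a single run:
mpirun -np 2 --mca coll ^ml ./spawn_multiple

# Or make the exclusion persistent for nightly runs by adding it to a
# per-user MCA parameter file that mpirun reads automatically:
echo "coll = ^ml" >> $HOME/.openmpi/mca-params.conf

# The same line can go in the system-wide file instead:
#   <prefix>/etc/openmpi-mca-params.conf
```

The `^` prefix means "all coll components except ml", so every other collective component stays eligible for selection.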
Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk
Thanks for trying, Ralph. Looks like my issue has to do with coll ml interaction. If I exclude coll ml, then all my tests pass. Do you know if there is a bug for this issue? If so, then I can run my nightly tests with coll ml disabled and wait for the bug to be fixed.

Also, where do simple_spawn and spawn_multiple live? I was running "spawn" and "spawn_multiple" from the ibm/dynamic test suite. Your output for spawn_multiple looks different than mine.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 06, 2014 3:19 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

Works fine for me:

[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./simple_spawn
[pid 22777] starting up!
[pid 22778] starting up!
[pid 22779] starting up!
1 completed MPI_Init
Parent [pid 22778] about to spawn!
2 completed MPI_Init
Parent [pid 22779] about to spawn!
0 completed MPI_Init
Parent [pid 22777] about to spawn!
[pid 22783] starting up!
[pid 22784] starting up!
Parent done with spawn
Parent sending message to child
Parent done with spawn
Parent done with spawn
0 completed MPI_Init
Hello from the child 0 of 2 on host bend001 pid 22783
Child 0 received msg: 38
1 completed MPI_Init
Hello from the child 1 of 2 on host bend001 pid 22784
Child 1 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
22784: exiting
22778: exiting
22779: exiting
22777: exiting
22783: exiting
[rhc@bend001 mpi]$ make spawn_multiple
mpicc -g --openmpi:linkall spawn_multiple.c -o spawn_multiple
[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./spawn_multiple
Parent [pid 22797] about to spawn!
Parent [pid 22798] about to spawn!
Parent [pid 22799] about to spawn!
Parent done with spawn
Parent done with spawn
Parent sending message to children
Parent done with spawn
Hello from the child 0 of 2 on host bend001 pid 22803: argv[1] = foo
Child 0 received msg: 38
Hello from the child 1 of 2 on host bend001 pid 22804: argv[1] = bar
Child 1 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
[rhc@bend001 mpi]$ mpirun -n 3 --host bend001 -mca coll ^ml ./intercomm_create
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 3]
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 4]
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, ) [rank 5]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 3]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 4]
c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, ) [rank 5]
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, ) (0)
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
c: intercomm_create (0)
c: barrier on inter-comm - before
c: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
a: intercomm_create (0)
a: barrier on inter-comm - before
a: barrier on inter-comm - after
b: intercomm_create (0)
b: barrier on inter-comm - before
b: barrier on inter-comm - after
a: intercomm_merge(0) (0) [rank 2]
c: intercomm_merge(0) (0) [rank 8]
a: intercomm_merge(0) (0) [rank 0]
a: intercomm_merge(0) (0) [rank 1]
c: intercomm_merge(0) (0) [rank 7]
b: intercomm_merge(1) (0) [rank 4]
b: intercomm_merge(1) (0) [rank 5]
c: intercomm_merge(0) (0) [rank 6]
b: intercomm_merge(1) (0) [rank 3]
a: barrier (0)
b: barrier (0)
c: barrier (0)
a: barrier (0)
c: barrier (0)
b: barrier (0)
a: barrier (0)
c: barrier (0)
b: barrier (0)
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 0
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 0
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 1
dpm_base_disconnect_init: error -12 in isend to process 3
[rhc@bend001 mpi]$

On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart <rvandeva...@nvi
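For context, the a/b/c trace above comes from the pattern this kind of test exercises: each group builds an inter-communicator with MPI_Intercomm_create, barriers across it, then flattens it with MPI_Intercomm_merge before disconnecting. A minimal sketch of that call sequence follows; it is not the actual ibm/dynamic test source, and the variable names and two-way split are illustrative only:

```c
/* Sketch of the MPI_Intercomm_create / MPI_Intercomm_merge pattern seen in
 * the trace above; names and group layout are illustrative, not the test's.
 * Build with mpicc and launch with an even number of ranks via mpirun. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_rank, world_size;
    MPI_Comm local, inter, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the world into two halves to act as the local groups. */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &local);

    /* Leaders are world ranks 0 and world_size/2; tag 201 as in the trace. */
    int remote_leader = (color == 0) ? world_size / 2 : 0;
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader, 201, &inter);

    MPI_Barrier(inter);                          /* "barrier on inter-comm" */
    MPI_Intercomm_merge(inter, color, &merged);  /* high side = color */
    MPI_Barrier(merged);                         /* final "barrier" lines  */

    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}
```

The dpm_base_disconnect_init errors above fire after this sequence, during communicator teardown, which is consistent with the collective path (coll ml) being the component at fault rather than the create/merge calls themselves.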
[OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk
I am seeing an interesting failure on trunk. intercomm_create, spawn, and spawn_multiple from the IBM tests hang if I explicitly list the hostnames to run on. For example:

Good:

$ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
Parent: 1 of 2, drossetti-ivy0.nvidia.com (0 in init)
Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
$

Bad:

$ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (1 in init)
Parent: 1 of 2, drossetti-ivy0.nvidia.com (1 in init)
Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
[..and we are hung here...]

I see the exact same behavior for spawn and spawn_multiple. Ralph, any thoughts? Open MPI 1.8 is fine. I can provide more information if needed, but I assume this is reproducible.

Thanks,
Rolf
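For readers unfamiliar with these tests, the dynamic-process path they exercise looks roughly like the following. This is a sketch only, not the actual ibm/dynamic or orte/test/mpi source; the structure and messages are made up to mirror the output quoted earlier in the thread:

```c
/* Hypothetical reconstruction of a simple_spawn-style test: the same binary
 * acts as parent (no parent communicator) or child (spawned). Compile with
 * mpicc and launch the parent side with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: all parent ranks collectively spawn 2 children,
         * with rank 0 as the root of the spawn. */
        printf("Parent [rank %d] about to spawn!\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("Parent done with spawn\n");
        MPI_Comm_disconnect(&intercomm);
        printf("Parent disconnected\n");
    } else {
        /* Child side: report in, then disconnect from the parent. */
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from the child %d of %d\n", rank, size);
        MPI_Comm_disconnect(&parent);
        printf("Child %d disconnected\n", rank);
    }
    MPI_Finalize();
    return 0;
}
```

The hang reported above occurs inside this spawn/disconnect handshake when hosts are listed explicitly, which is why the comparison of the `-host` and no-`-host` runs isolates the launch path rather than the test logic.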