I very much doubt that either of those mappers has ever been tested against 
comm_spawn. Just glancing through them, I don't see an immediate reason why 
loadbalance wouldn't work, but the error indicates that the system wound up 
mapping one or more processes to an unknown node.

We are revising the mappers at this time, so I doubt we'll try to fix it for 
1.5.2. You might try the 1.4 series to see if it behaves differently, though I 
suspect those mappers weren't tested against comm_spawn there either.
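
For reference, the call pattern being exercised here looks roughly like the sketch 
below. It is a minimal reconstruction from the log output in the mail; the hostnames, 
the ./mpi2_worker binary name, and the per-app "host" info keys are assumptions, not 
the actual mpi2_manager source:

/*
 * Minimal sketch, not the poster's actual mpi2_manager.c: each of two
 * app contexts gets a "host" info key so its worker lands on a specific
 * node, and all managers call MPI_Comm_spawn_multiple collectively.
 * Hostnames and the worker binary name are assumed from the log output.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char *cmds[2]     = { "./mpi2_worker", "./mpi2_worker" };
    char *hosts[2]    = { "cu2n29", "cu2n30" };
    int   maxprocs[2] = { 1, 1 };
    MPI_Info infos[2];
    int   errcodes[2];
    MPI_Comm intercomm;

    for (int i = 0; i < 2; i++) {
        MPI_Info_create(&infos[i]);
        MPI_Info_set(infos[i], "host", hosts[i]);   /* place worker i on host i */
    }

    /* Collective over MPI_COMM_WORLD; the root (rank 0) supplies the arguments. */
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm, errcodes);

    for (int i = 0; i < 2; i++)
        MPI_Info_free(&infos[i]);

    MPI_Finalize();
    return 0;
}

The "host" info key is what the mapper ultimately has to resolve, which appears to be 
where the -loadbalance case goes wrong.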


On Feb 21, 2011, at 12:59 PM, Skouson, Gary B wrote:

> I'm trying to use MPI_Comm_spawn_multiple and it doesn't seem to always work 
> like I'd expect.
> 
> The simple test code I have starts a couple of master processes and then 
> tries to spawn a couple of worker processes on each of the nodes running the 
> master processes.
> 
> I was using 1.5.1, but gave 1.5.2rc2 a try too.
> 
> If I do:
> [skouson@cu2n29 mpi2_example]$ mpirun -hostfile hostfile -n 2 -bynode 
> ./mpi2_manager
> MPI Initialized=0, Finalized=0
> MPI Initialized=0, Finalized=0
> I'm manager 0 of 2 on cu2n29 running MPI 2.1
> setting up host cu2n29 - ./mpi2_worker
> setting up host cu2n30 - ./mpi2_worker
> Spawning 2 worker processes running ./mpi2_worker
> Sleeping for a bit...
> I'm manager 1 of 2 on cu2n30 running MPI 2.1
> **** I'm worker 0 of 2 on cu2n29 running MPI 2.1
> **** Worker 0: number of parents = 2
> **** Worker 0: Success!
> **** I'm worker 1 of 2 on cu2n30 running MPI 2.1
> **** Worker 1: number of parents = 2
> **** Worker 1: Success!
> **** Worker 0: Value recd = 25
> 1: MPI Initialized=1, Finalized=1
> 0: MPI Initialized=1, Finalized=1
> 
> It seems to work as expected. However, if I use -loadbalance or -npernode 1 
> rather than the -bynode flag, I get an obscure error and things hang until I 
> ctrl-c out of it.
> 
> [skouson@cu2n29 mpi2_example]$ mpirun -hostfile hostfile -n 2 -loadbalance 
> ./mpi2_manager
> MPI Initialized=0, Finalized=0
> MPI Initialized=0, Finalized=0
> I'm manager 1 of 2 on cu2n30 running MPI 2.1
> I'm manager 0 of 2 on cu2n29 running MPI 2.1
> setting up host cu2n29 - ./mpi2_worker
> setting up host cu2n30 - ./mpi2_worker
> Spawning 2 worker processes running ./mpi2_worker
> Sleeping for a bit...
> [cu2n29:03088] [[62875,0],0] ORTE_ERROR_LOG: Not found in file 
> base/odls_base_default_fns.c at line 906
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> 
> My environment has:
> OMPI_MCA_btl_openib_ib_retry_count=7
> OMPI_MCA_mpi_keep_peer_hostnames=1
> OMPI_MCA_btl_openib_ib_timeout=31
> 
> I've included the sample code, along with config.log etc.
> 
> If anyone can point out what I'm missing to be able to run with the 
> -loadbalance flag, I'd appreciate it.
> 
> -----
> Gary Skouson
> <mpi2_example.tar.bz2>

