Thanks Ralph! I should have mentioned, though: without the Torque environment, spawning with ssh works OK. Under the Torque environment, it does not.

I started simple_spawn with 3 processes and spawned 9 more (3 per node on 3 nodes). Launching under Torque is not the problem, since all 9 processes are started on their respective nodes. But MPI_Comm_spawn in the parents and MPI_Init in the children "sometimes" don't return!

This is the output of simple_spawn, which confirms the above:

[pid 31208] starting up!
[pid 31209] starting up!
[pid 31210] starting up!
0 completed MPI_Init
Parent [pid 31208] about to spawn!
1 completed MPI_Init
Parent [pid 31209] about to spawn!
2 completed MPI_Init
Parent [pid 31210] about to spawn!
[pid 28630] starting up!
[pid 28631] starting up!
[pid 9846] starting up!
[pid 9847] starting up!
[pid 9848] starting up!
[pid 6363] starting up!
[pid 6361] starting up!
[pid 6362] starting up!
[pid 28632] starting up!

Any hints?

Best,
Suraj

On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:

> Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't
> work"? Is there some specific behavior you see?
>
> You might try the attached program. It's a simple spawn test we use - 1.7.4
> seems happy with it.
>
> <simple_spawn.c>
>
> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com>
> wrote:
>
>> I am using 1.7.4!
>>
>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>
>>> What OMPI version are you using?
>>>
>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran
>>> <suraj.prabhaka...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> I am having problem using MPI_Comm_spawn under torque. It doesn't work
>>>> when spawning more than 12 processes on various nodes. To be more precise,
>>>> "sometimes" it works, and "sometimes" it doesn't!
>>>>
>>>> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE
>>>> looks like below.
>>>>
>>>> node1
>>>> node1
>>>> node1
>>>> node2
>>>> node2
>>>> node2
>>>> node3
>>>> node3
>>>> node3
>>>> node4
>>>> node4
>>>> node4
>>>> node5
>>>> node5
>>>> node5
>>>>
>>>> I started a hello program (which just spawns itself and of course, the
>>>> children don't spawn), with
>>>>
>>>> mpiexec -np 3 ./hello
>>>>
>>>> Spawning 3 more processes (on node 2) - works!
>>>> spawning 6 more processes (node 2 and 3) - works!
>>>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>>>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>>>>
>>>> I ideally want to spawn about 32 processes with large number of nodes, but
>>>> this is at the moment impossible. I have attached my hello program to this
>>>> email.
>>>>
>>>> I will be happy to provide any more info or verbose outputs if you could
>>>> please tell me what exactly you would like to see.
>>>>
>>>> Best,
>>>> Suraj
>>>>
>>>> <hello.c>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
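
For reference, the placement described in the quoted message (3 parents on node1, spawned children filling node2 onward) can be checked with a small sketch. It assumes slots are handed out in nodefile order, which matches the placement reported above but is only an assumption about the mapping, not something taken from Open MPI itself. `slot_range` and `nodes_spanned` are hypothetical helper names, not Open MPI or Torque APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: `slots` holds one hostname per slot, in the order
 * they appear in $PBS_NODEFILE. Assuming slots are filled in file order,
 * report the first and last host used by `count` processes when `used`
 * slots are already taken. Returns 0 on success, -1 if they do not fit. */
int slot_range(const char *slots[], int nslots, int used, int count,
               const char **first, const char **last)
{
    assert(slots != NULL && first != NULL && last != NULL);
    if (used < 0 || count <= 0 || used + count > nslots)
        return -1;
    *first = slots[used];
    *last = slots[used + count - 1];
    return 0;
}

/* Hypothetical helper: number of distinct hosts covered by the same range.
 * Assumes entries for one host are contiguous, as in the nodefile above.
 * Returns -1 if the range does not fit. */
int nodes_spanned(const char *slots[], int nslots, int used, int count)
{
    assert(slots != NULL);
    if (used < 0 || count <= 0 || used + count > nslots)
        return -1;
    int nodes = 1;
    for (int i = used + 1; i < used + count; i++) {
        if (strcmp(slots[i], slots[i - 1]) != 0)
            nodes++;  /* hostname changed, so a new node starts here */
    }
    return nodes;
}
```

With the 15-entry nodefile above and `used = 3` (the three parents on node1), a spawn of 9 covers node2 through node4 and a spawn of 12 covers node2 through node5, which is the placement reported in the thread.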