Right, so I have the output here. Same case:

mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn

Output attached.

Best,
Suraj
<output>
On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:

> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>
>> Thanks Ralph!
>>
>> I should have mentioned, though: without the Torque environment, spawning
>> with ssh works OK. But under the Torque environment, it does not.
>
> Ah, no - you forgot to mention that point.
>
>> I started the simple_spawn with 3 processes and spawned 9 processes
>> (3 per node on 3 nodes).
>>
>> There is no problem with the Torque environment itself, because all 9
>> processes are started on the respective nodes. But the MPI_Comm_spawn of
>> the parent and the MPI_Init of the children "sometimes" don't return!
>
> Seems odd - the launch environment has nothing to do with MPI_Init, so if
> the processes are indeed being started, they should run. One possibility is
> that they aren't correctly getting some wireup info.
>
> Can you configure OMPI with --enable-debug and then rerun the example with
> "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5"
> on the command line?
>
>> This is the output of simple_spawn - which confirms the above statement.
>>
>> [pid 31208] starting up!
>> [pid 31209] starting up!
>> [pid 31210] starting up!
>> 0 completed MPI_Init
>> Parent [pid 31208] about to spawn!
>> 1 completed MPI_Init
>> Parent [pid 31209] about to spawn!
>> 2 completed MPI_Init
>> Parent [pid 31210] about to spawn!
>> [pid 28630] starting up!
>> [pid 28631] starting up!
>> [pid 9846] starting up!
>> [pid 9847] starting up!
>> [pid 9848] starting up!
>> [pid 6363] starting up!
>> [pid 6361] starting up!
>> [pid 6362] starting up!
>> [pid 28632] starting up!
>>
>> Any hints?
>>
>> Best,
>> Suraj
>>
>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>
>>> Hmmm...I don't see anything immediately glaring. What do you mean by
>>> "doesn't work"? Is there some specific behavior you see?
>>>
>>> You might try the attached program. It's a simple spawn test we use -
>>> 1.7.4 seems happy with it.
>>>
>>> <simple_spawn.c>
>>>
>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran
>>> <suraj.prabhaka...@gmail.com> wrote:
>>>
>>>> I am using 1.7.4!
>>>>
>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>>
>>>>> What OMPI version are you using?
>>>>>
>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran
>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't
>>>>>> work when spawning more than 12 processes on various nodes. To be more
>>>>>> precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>>
>>>>>> Here is my case. I obtain 5 nodes, 3 cores per node, and my
>>>>>> $PBS_NODEFILE looks like below.
>>>>>>
>>>>>> node1
>>>>>> node1
>>>>>> node1
>>>>>> node2
>>>>>> node2
>>>>>> node2
>>>>>> node3
>>>>>> node3
>>>>>> node3
>>>>>> node4
>>>>>> node4
>>>>>> node4
>>>>>> node5
>>>>>> node5
>>>>>> node5
>>>>>>
>>>>>> I started a hello program (which just spawns itself; the children, of
>>>>>> course, don't spawn again) with
>>>>>>
>>>>>> mpiexec -np 3 ./hello
>>>>>>
>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>>
>>>>>> I ideally want to spawn about 32 processes across a large number of
>>>>>> nodes, but at the moment this is impossible. I have attached my hello
>>>>>> program to this email.
>>>>>>
>>>>>> I will be happy to provide any more info or verbose output if you
>>>>>> could please tell me what exactly you would like to see.
>>>>>>
>>>>>> Best,
>>>>>> Suraj
>>>>>>
>>>>>> <hello.c>
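For readers of the archive who don't have the attachments: the self-spawn pattern under discussion looks roughly like the sketch below. This is not the actual hello.c or simple_spawn.c from the thread; the spawn count (9, matching the 3-per-node-on-3-nodes case) and the print statements (modeled on the output quoted above) are illustrative.

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank;

    printf("[pid %ld] starting up!\n", (long)getpid());
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d completed MPI_Init\n", rank);

    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* Parent job: MPI_Comm_spawn is collective over MPI_COMM_WORLD,
           so every parent rank makes this call. */
        printf("Parent [pid %ld] about to spawn!\n", (long)getpid());
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9 /* illustrative count */,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                       &intercomm, MPI_ERRCODES_IGNORE);
    }
    /* Children see a non-null parent communicator and simply fall
       through to MPI_Finalize without spawning again. */

    MPI_Finalize();
    return 0;
}

Built with mpicc and run as "mpiexec -np 3 ./hello" inside the allocation, a healthy run should show every child print both "starting up!" and "completed MPI_Init". The failure mode reported above - children print "starting up!" but never "completed MPI_Init" - is what points the suspicion at the wireup exchange during MPI_Init rather than at the Torque launch itself.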