On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
> Thanks Ralph!
>
> I must have mentioned though. Without the Torque environment, spawning with
> ssh works ok. But Under the torque environment, not.

Ah, no - you forgot to mention that point.

>
> I started the simple_spawn with 3 processes and spawned 9 processes (3 per
> node on 3 nodes).
>
> There is no problem with the Torque environment because all the 9 processes
> are started on the respective nodes. But the MPI_Comm_spawn of the parent and
> MPI_Init of the children, "sometimes" don't return!

Seems odd - the launch environment has nothing to do with MPI_Init, so if the
processes are indeed being started, they should run. One possibility is that
they aren't correctly getting some wireup info.

Can you configure OMPI --enable-debug and then rerun the example with
"-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5"
on the command line?

>
> This is the output of simple_spawn - which confirms the above statement.
>
> [pid 31208] starting up!
> [pid 31209] starting up!
> [pid 31210] starting up!
> 0 completed MPI_Init
> Parent [pid 31208] about to spawn!
> 1 completed MPI_Init
> Parent [pid 31209] about to spawn!
> 2 completed MPI_Init
> Parent [pid 31210] about to spawn!
> [pid 28630] starting up!
> [pid 28631] starting up!
> [pid 9846] starting up!
> [pid 9847] starting up!
> [pid 9848] starting up!
> [pid 6363] starting up!
> [pid 6361] starting up!
> [pid 6362] starting up!
> [pid 28632] starting up!
>
> Any hints?
>
> Best,
> Suraj
>
> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>
>> Hmmm...I don't see anything immediately glaring. What do you mean by
>> "doesn't work"? Is there some specific behavior you see?
>>
>> You might try the attached program. It's a simple spawn test we use - 1.7.4
>> seems happy with it.
>>
>> <simple_spawn.c>
>>
>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran
>> <suraj.prabhaka...@gmail.com> wrote:
>>
>>> I am using 1.7.4!
>>>
>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>
>>>> What OMPI version are you using?
>>>>
>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran
>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I am having problem using MPI_Comm_spawn under torque. It doesn't work
>>>>> when spawning more than 12 processes on various nodes. To be more
>>>>> precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>
>>>>> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE
>>>>> looks like below.
>>>>>
>>>>> node1
>>>>> node1
>>>>> node1
>>>>> node2
>>>>> node2
>>>>> node2
>>>>> node3
>>>>> node3
>>>>> node3
>>>>> node4
>>>>> node4
>>>>> node4
>>>>> node5
>>>>> node5
>>>>> node5
>>>>>
>>>>> I started a hello program (which just spawns itself and of course, the
>>>>> children don't spawn), with
>>>>>
>>>>> mpiexec -np 3 ./hello
>>>>>
>>>>> Spawning 3 more processes (on node 2) - works!
>>>>> spawning 6 more processes (node 2 and 3) - works!
>>>>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>>>>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>>>>>
>>>>> I ideally want to spawn about 32 processes with large number of nodes,
>>>>> but this is at the moment impossible. I have attached my hello program to
>>>>> this email.
>>>>>
>>>>> I will be happy to provide any more info or verbose outputs if you could
>>>>> please tell me what exactly you would like to see.
>>>>>
>>>>> Best,
>>>>> Suraj
>>>>>
>>>>> <hello.c>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
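
For readers following the numbers in the original report, the slot arithmetic behind the quoted $PBS_NODEFILE can be sketched in a few lines of C. This is only an illustration of the bookkeeping (one hostname per line, repeated once per slot); it is not how Open MPI or Torque actually read the file, and the names `node_slots`, `count_slots`, and `total_slots` are hypothetical helpers invented for this sketch.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64   /* assumed upper bound on distinct hosts */
#define MAX_NAME  64   /* assumed upper bound on hostname length */

/* One entry per distinct host: its name and how many lines (slots) it has. */
struct node_slots {
    char name[MAX_NAME];
    int  slots;
};

/*
 * Count slots per host in nodefile-style text: one hostname per line,
 * repeated once for every slot on that host. Empty or overlong lines are
 * skipped. Returns the number of distinct hosts written to `out`.
 */
int count_slots(const char *text, struct node_slots *out, int max_nodes)
{
    int n = 0;
    const char *p = text;

    while (*p != '\0') {
        size_t len = strcspn(p, "\n");   /* length of the current line */

        if (len > 0 && len < MAX_NAME) {
            int i;
            /* Look for a host we have already seen. */
            for (i = 0; i < n; i++) {
                if (strlen(out[i].name) == len &&
                    strncmp(out[i].name, p, len) == 0) {
                    out[i].slots++;
                    break;
                }
            }
            /* Not found: record a new host if there is room. */
            if (i == n && n < max_nodes) {
                memcpy(out[n].name, p, len);
                out[n].name[len] = '\0';
                out[n].slots = 1;
                n++;
            }
        }

        p += len;
        if (*p == '\n') {
            p++;                         /* step over the newline */
        }
    }
    return n;
}

/* Sum the slots over all hosts. */
int total_slots(const struct node_slots *nodes, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += nodes[i].slots;
    }
    return total;
}
```

Fed the 15-line nodefile from the report, `count_slots` finds 5 hosts with 3 slots each and `total_slots` returns 15. With the 3 parents started by `mpiexec -np 3`, 12 slots remain, which matches the largest spawn attempted in the report (12 processes across node 2 to node 5).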