Well, that all looks fine. However, I note that the procs on grsacc04 all stopped making progress at the same point, which is why the job hung. All the procs on the other nodes were just fine.
So let's try a couple of things:

1. Add "--bind-to none" to your cmd line so we avoid any contention issues.
2. If possible, remove grsacc04 from the allocation (you can just use the -host option on the cmd line to ignore it), and/or replace that host with another one.

Let's see if the problem has something to do with that specific node.

On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:

> Right, so I have the output here. Same case,
>
> mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn
>
> Output attached.
>
> Best,
> Suraj
>
> <output>
>
> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
>
>> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>
>>> Thanks Ralph!
>>>
>>> I should have mentioned, though: without the Torque environment, spawning with ssh works OK, but under the Torque environment it does not.
>>
>> Ah, no - you forgot to mention that point.
>>
>>> I started simple_spawn with 3 processes and spawned 9 processes (3 per node on 3 nodes).
>>>
>>> There is no problem with the Torque environment, because all 9 processes are started on the respective nodes. But the MPI_Comm_spawn of the parent and the MPI_Init of the children "sometimes" don't return!
>>
>> Seems odd - the launch environment has nothing to do with MPI_Init, so if the processes are indeed being started, they should run. One possibility is that they aren't correctly getting some wireup info.
>>
>> Can you configure OMPI with --enable-debug and then rerun the example with "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the command line?
>>
>>> This is the output of simple_spawn, which confirms the above statement.
>>>
>>> [pid 31208] starting up!
>>> [pid 31209] starting up!
>>> [pid 31210] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 31208] about to spawn!
>>> 1 completed MPI_Init
>>> Parent [pid 31209] about to spawn!
>>> 2 completed MPI_Init
>>> Parent [pid 31210] about to spawn!
>>> [pid 28630] starting up!
>>> [pid 28631] starting up!
>>> [pid 9846] starting up!
>>> [pid 9847] starting up!
>>> [pid 9848] starting up!
>>> [pid 6363] starting up!
>>> [pid 6361] starting up!
>>> [pid 6362] starting up!
>>> [pid 28632] starting up!
>>>
>>> Any hints?
>>>
>>> Best,
>>> Suraj
>>>
>>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>>
>>>> Hmmm... I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see?
>>>>
>>>> You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it.
>>>>
>>>> <simple_spawn.c>
>>>>
>>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>
>>>>> I am using 1.7.4!
>>>>>
>>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>>>
>>>>>> What OMPI version are you using?
>>>>>>
>>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes across several nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>>>
>>>>>>> Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like below.
>>>>>>>
>>>>>>> node1
>>>>>>> node1
>>>>>>> node1
>>>>>>> node2
>>>>>>> node2
>>>>>>> node2
>>>>>>> node3
>>>>>>> node3
>>>>>>> node3
>>>>>>> node4
>>>>>>> node4
>>>>>>> node4
>>>>>>> node5
>>>>>>> node5
>>>>>>> node5
>>>>>>>
>>>>>>> I started a hello program (which just spawns itself; the children, of course, don't spawn) with
>>>>>>>
>>>>>>> mpiexec -np 3 ./hello
>>>>>>>
>>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>>>
>>>>>>> Ideally I want to spawn about 32 processes across a large number of nodes, but this is at the moment impossible. I have attached my hello program to this email.
>>>>>>>
>>>>>>> I will be happy to provide any more info or verbose output if you could please tell me what exactly you would like to see.
>>>>>>>
>>>>>>> Best,
>>>>>>> Suraj
>>>>>>>
>>>>>>> <hello.c>
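For reference, since the hello.c and simple_spawn.c attachments are not reproduced in this thread, here is a rough sketch of the self-spawning pattern being discussed: the parent processes collectively call MPI_Comm_spawn on their own binary, and the spawned children detect the parent communicator via MPI_Comm_get_parent and exit without spawning again. This is only an illustrative sketch, not the actual attached code; the spawn count, the output strings, and the lack of error handling are assumptions.

/* spawn_sketch.c - illustrative only, not the hello.c/simple_spawn.c
 * attachments from this thread. Parents collectively spawn copies of
 * this same binary; children detect the parent communicator and do
 * not spawn again. */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[pid %d] rank %d completed MPI_Init\n", (int) getpid(), rank);

    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* Parent side: collectively spawn 9 more copies of this binary
         * (the count is an assumption; where they land depends on the
         * allocation and the mapping policy in use). argv[0] must be
         * resolvable on the remote nodes. */
        printf("Parent rank %d about to spawn\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("Parent rank %d returned from MPI_Comm_spawn\n", rank);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child side: report and disconnect, but do not spawn again. */
        printf("Child rank %d running\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Run it inside the Torque allocation with something like "mpiexec -np 3 ./spawn_sketch" (a hypothetical binary name); whether the 9 children end up 3 per node on 3 nodes depends on the allocation and the mapper, as discussed above.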