Strange - it all looks just fine. How was OMPI configured?

On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
> Ok, I figured out that it was not a problem with the node grsacc04, because I have now run the same case on a completely different set of nodes.
> 
> I must say that with the --bind-to none option the program completed "many" more times than before, but it still "sometimes" hangs! I am attaching the output of the same case run on the different set of nodes with the --bind-to none option:
> 
> mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 --bind-to none -np 3 ./example
> 
> Best,
> Suraj
> 
> <output.rtf>
> 
> On Feb 21, 2014, at 5:03 PM, Ralph Castain wrote:
> 
>> Well, that all looks fine. However, I note that the procs on grsacc04 all stopped making progress at the same point, which is why the job hung. All the procs on the other nodes were just fine.
>> 
>> So let's try a couple of things:
>> 
>> 1. Add "--bind-to none" to your cmd line so we avoid any contention issues.
>> 
>> 2. If possible, remove grsacc04 from the allocation (you can just use the -host option on the cmd line to ignore it), and/or replace that host with another one. Let's see if the problem has something to do with that specific node.
>> 
>> On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>> 
>>> Right, so I have the output here. Same case:
>>> 
>>> mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn
>>> 
>>> Output attached.
>>> 
>>> Best,
>>> Suraj
>>> 
>>> <output>
>>> 
>>> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
>>> 
>>>> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>> 
>>>>> Thanks Ralph!
>>>>> 
>>>>> I should have mentioned: without the Torque environment, spawning with ssh works fine. Under the Torque environment, it does not.
>>>> 
>>>> Ah, no - you forgot to mention that point.
>>>> 
>>>>> I started simple_spawn with 3 processes and spawned 9 processes (3 per node on 3 nodes).
>>>>> 
>>>>> There is no problem with the Torque environment itself, because all 9 processes are started on the respective nodes. But the MPI_Comm_spawn of the parent and the MPI_Init of the children "sometimes" don't return!
>>>> 
>>>> Seems odd - the launch environment has nothing to do with MPI_Init, so if the processes are indeed being started, they should run. One possibility is that they aren't correctly getting some wireup info.
>>>> 
>>>> Can you configure OMPI with --enable-debug and then rerun the example with "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the command line?
>>>> 
>>>>> This is the output of simple_spawn, which confirms the above statement:
>>>>> 
>>>>> [pid 31208] starting up!
>>>>> [pid 31209] starting up!
>>>>> [pid 31210] starting up!
>>>>> 0 completed MPI_Init
>>>>> Parent [pid 31208] about to spawn!
>>>>> 1 completed MPI_Init
>>>>> Parent [pid 31209] about to spawn!
>>>>> 2 completed MPI_Init
>>>>> Parent [pid 31210] about to spawn!
>>>>> [pid 28630] starting up!
>>>>> [pid 28631] starting up!
>>>>> [pid 9846] starting up!
>>>>> [pid 9847] starting up!
>>>>> [pid 9848] starting up!
>>>>> [pid 6363] starting up!
>>>>> [pid 6361] starting up!
>>>>> [pid 6362] starting up!
>>>>> [pid 28632] starting up!
>>>>> 
>>>>> Any hints?
>>>>> 
>>>>> Best,
>>>>> Suraj
>>>>> 
>>>>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>>>> 
>>>>>> Hmmm... I don't see anything immediately glaring. What do you mean by "doesn't work"?
>>>>>> Is there some specific behavior you see?
>>>>>> 
>>>>>> You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it.
>>>>>> 
>>>>>> <simple_spawn.c>
>>>>>> 
>>>>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am using 1.7.4!
>>>>>>> 
>>>>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>> What OMPI version are you using?
>>>>>>>> 
>>>>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hello!
>>>>>>>>> 
>>>>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes across various nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>>>>> 
>>>>>>>>> Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like this:
>>>>>>>>> 
>>>>>>>>> node1
>>>>>>>>> node1
>>>>>>>>> node1
>>>>>>>>> node2
>>>>>>>>> node2
>>>>>>>>> node2
>>>>>>>>> node3
>>>>>>>>> node3
>>>>>>>>> node3
>>>>>>>>> node4
>>>>>>>>> node4
>>>>>>>>> node4
>>>>>>>>> node5
>>>>>>>>> node5
>>>>>>>>> node5
>>>>>>>>> 
>>>>>>>>> I started a hello program (which just spawns itself; the children, of course, don't spawn again) with
>>>>>>>>> 
>>>>>>>>> mpiexec -np 3 ./hello
>>>>>>>>> 
>>>>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>>>>> 
>>>>>>>>> Ideally I want to spawn about 32 processes across a large number of nodes, but at the moment this is impossible. I have attached my hello program to this email.
>>>>>>>>> 
>>>>>>>>> I will be happy to provide any more info or verbose output if you could please tell me exactly what you would like to see.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Suraj
>>>>>>>>> 
>>>>>>>>> <hello.c>
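[Editor's note: the hello.c and simple_spawn.c attachments are not reproduced in this archive. For readers who want to reproduce the scenario, a minimal self-spawning test in the same spirit might look like the sketch below. It is an illustration only, under assumptions: the spawn count, argument handling, and printed messages are chosen to resemble the output quoted in the thread, not taken from the actual attachments.]

/* Hypothetical sketch of a minimal self-spawning MPI test (not the original
 * hello.c/simple_spawn.c attachments). The parent ranks collectively spawn
 * N copies of this same executable; spawned children detect the parent
 * communicator via MPI_Comm_get_parent and do not spawn again. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm parent, intercomm;

    printf("[pid %ld] starting up!\n", (long)getpid());
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d completed MPI_Init\n", rank);

    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn count is an assumed command-line argument,
         * e.g. 3, 6, 9, or 12 as in the cases described above. */
        int nspawn = (argc > 1) ? atoi(argv[1]) : 3;
        printf("Parent [pid %ld] about to spawn!\n", (long)getpid());
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("Parent [pid %ld] done with spawn\n", (long)getpid());
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child side: report in and disconnect from the parent job. */
        printf("Hello from spawned child, rank %d\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc, it would be launched inside the Torque allocation exactly as in the thread, e.g. "mpiexec -np 3 ./hello", optionally with the verbose MCA flags Ralph suggested (-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5) and with --bind-to none.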