Hmm... but in actual fact, the MPI_Comm_spawn of the parents and the MPI_Init of the children never returned!
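For reference, the spawn pattern under discussion looks roughly like the sketch below (a minimal illustration only, not the actual hello.c or simple_spawn.c attached in this thread; the spawn count of 9 is just an example):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, child;
    int rank;

    printf("[pid %ld] starting up!\n", (long) getpid());

    MPI_Init(&argc, &argv);               /* the spawned children reportedly hang here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* original processes: collectively spawn fresh copies of this binary
           (count of 9 is illustrative, e.g. 3 per node on 3 nodes) */
        printf("%d completed MPI_Init\n", rank);
        printf("Parent [pid %ld] about to spawn!\n", (long) getpid());
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        /* the parents reportedly hang inside MPI_Comm_spawn */
    } else {
        /* spawned children: a parent exists, so do not spawn again */
        printf("%d (child) completed MPI_Init\n", rank);
    }

    MPI_Finalize();
    return 0;
}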
I configured MPI with ./configure --prefix=/dir/ --enable-debug --with-tm=/usr/local/

On Feb 22, 2014, at 12:53 AM, Ralph Castain wrote:

> Strange - it all looks just fine. How was OMPI configured?
>
> On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>
>> Ok, I figured out that it was not a problem with the node grsacc04, because I have now conducted the same test on a totally different set of nodes.
>>
>> I must really say that with the --bind-to none option, the program completed "many" more times than before, but it still "sometimes" hangs! Attaching now the output of the same case, conducted on a different set of nodes, with the --bind-to none option.
>>
>> mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 --bind-to none -np 3 ./example
>>
>> Best,
>> Suraj
>>
>> <output.rtf>
>>
>> On Feb 21, 2014, at 5:03 PM, Ralph Castain wrote:
>>
>>> Well, that all looks fine. However, I note that the procs on grsacc04 all stopped making progress at the same point, which is why the job hung. All the procs on the other nodes were just fine.
>>>
>>> So let's try a couple of things:
>>>
>>> 1. Add "--bind-to none" to your cmd line so we avoid any contention issues.
>>>
>>> 2. If possible, remove grsacc04 from the allocation (you can just use the -host option on the cmd line to ignore it), and/or replace that host with another one. Let's see if the problem has something to do with that specific node.
>>>
>>> On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>
>>>> Right, so I have the output here. Same case:
>>>>
>>>> mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn
>>>>
>>>> Output attached.
>>>>
>>>> Best,
>>>> Suraj
>>>>
>>>> <output>
>>>>
>>>> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
>>>>
>>>>> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ralph!
>>>>>>
>>>>>> I must have mentioned it, though: without the Torque environment, spawning with ssh works OK, but under the Torque environment it does not.
>>>>>
>>>>> Ah, no - you forgot to mention that point.
>>>>>
>>>>>> I started the simple_spawn program with 3 processes and spawned 9 processes (3 per node on 3 nodes).
>>>>>>
>>>>>> There is no problem with the Torque environment itself, because all 9 processes are started on the respective nodes. But the MPI_Comm_spawn of the parents and the MPI_Init of the children "sometimes" don't return!
>>>>>
>>>>> Seems odd - the launch environment has nothing to do with MPI_Init, so if the processes are indeed being started, they should run. One possibility is that they aren't correctly getting some wireup info.
>>>>>
>>>>> Can you configure OMPI with --enable-debug and then rerun the example with "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the command line?
>>>>>
>>>>>> This is the output of simple_spawn, which confirms the above statement:
>>>>>>
>>>>>> [pid 31208] starting up!
>>>>>> [pid 31209] starting up!
>>>>>> [pid 31210] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 31208] about to spawn!
>>>>>> 1 completed MPI_Init
>>>>>> Parent [pid 31209] about to spawn!
>>>>>> 2 completed MPI_Init
>>>>>> Parent [pid 31210] about to spawn!
>>>>>> [pid 28630] starting up!
>>>>>> [pid 28631] starting up!
>>>>>> [pid 9846] starting up!
>>>>>> [pid 9847] starting up!
>>>>>> [pid 9848] starting up!
>>>>>> [pid 6363] starting up!
>>>>>> [pid 6361] starting up!
>>>>>> [pid 6362] starting up!
>>>>>> [pid 28632] starting up!
>>>>>>
>>>>>> Any hints?
>>>>>>
>>>>>> Best,
>>>>>> Suraj
>>>>>>
>>>>>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Hmmm... I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see?
>>>>>>>
>>>>>>> You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it.
>>>>>>>
>>>>>>> <simple_spawn.c>
>>>>>>>
>>>>>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am using 1.7.4!
>>>>>>>>
>>>>>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> What OMPI version are you using?
>>>>>>>>>
>>>>>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello!
>>>>>>>>>>
>>>>>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes on various nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>>>>>>
>>>>>>>>>> Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like the one below:
>>>>>>>>>>
>>>>>>>>>> node1
>>>>>>>>>> node1
>>>>>>>>>> node1
>>>>>>>>>> node2
>>>>>>>>>> node2
>>>>>>>>>> node2
>>>>>>>>>> node3
>>>>>>>>>> node3
>>>>>>>>>> node3
>>>>>>>>>> node4
>>>>>>>>>> node4
>>>>>>>>>> node4
>>>>>>>>>> node5
>>>>>>>>>> node5
>>>>>>>>>> node5
>>>>>>>>>>
>>>>>>>>>> I started a hello program (which just spawns itself; the children, of course, don't spawn again) with
>>>>>>>>>>
>>>>>>>>>> mpiexec -np 3 ./hello
>>>>>>>>>>
>>>>>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>>>>>>
>>>>>>>>>> I ideally want to spawn about 32 processes across a larger number of nodes, but at the moment this is impossible. I have attached my hello program to this email.
>>>>>>>>>>
>>>>>>>>>> I will be happy to provide any more info or verbose output if you could please tell me what exactly you would like to see.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Suraj
>>>>>>>>>>
>>>>>>>>>> <hello.c>