Hmm... but in actual fact, the MPI_Comm_spawn of the parents and the MPI_Init of 
the children never returned!

I configured OMPI with:

./configure --prefix=/dir/ --enable-debug --with-tm=/usr/local/
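
(If it helps, I believe I can also double-check that the Torque support was 
actually built by running something like "ompi_info | grep tm" and looking for 
the tm components (e.g. plm and ras) in the listing - the exact output may vary 
between versions.)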


On Feb 22, 2014, at 12:53 AM, Ralph Castain wrote:

> Strange - it all looks just fine. How was OMPI configured?
> 
> On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
> wrote:
> 
>> OK, I figured out that it was not a problem with the node grsacc04, because I 
>> have now run the same test on a totally different set of nodes. 
>> 
>> I must say that with the --bind-to none option, the program completed "many" 
>> more times than before, but it still "sometimes" hangs! I am attaching the 
>> output of the same case, run on a different set of nodes with the 
>> --bind-to none option.
>> 
>> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
>> grpcomm_base_verbose 5 --bind-to none -np 3 ./example
>> 
>> Best,
>> Suraj
>> 
>> <output.rtf>
>> 
>> 
>> On Feb 21, 2014, at 5:03 PM, Ralph Castain wrote:
>> 
>>> Well, that all looks fine. However, I note that the procs on grsacc04 all 
>>> stopped making progress at the same point, which is why the job hung. All 
>>> the procs on the other nodes were just fine.
>>> 
>>> So let's try a couple of things:
>>> 
>>> 1. add "--bind-to none" to your cmd line so we avoid any contention issues
>>> 
>>> 2. if possible, remove grsacc04 from the allocation (you can just use the 
>>> -host option on the cmd line to ignore it), and/or replace that host with 
>>> another one. Let's see if the problem has something to do with that 
>>> specific node.
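>>> 
>>> For example, something along these lines (the hostnames here are just 
>>> placeholders for whichever other hosts are in your allocation):
>>> 
>>> mpiexec -host nodeA,nodeB,nodeC --bind-to none -np 3 ./example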
>>> 
>>> 
>>> On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran 
>>> <suraj.prabhaka...@gmail.com> wrote:
>>> 
>>>> Right, so I have the output here. Same case: 
>>>> 
>>>> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
>>>> grpcomm_base_verbose 5  -np 3 ./simple_spawn
>>>> 
>>>> Output attached. 
>>>> 
>>>> Best,
>>>> Suraj
>>>> 
>>>> <output>
>>>> 
>>>> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
>>>> 
>>>>> 
>>>>> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran 
>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks Ralph!
>>>>>> 
>>>>>> I should have mentioned this, though: without the Torque environment, 
>>>>>> spawning with ssh works OK, but under the Torque environment it does not. 
>>>>> 
>>>>> Ah, no - you forgot to mention that point.
>>>>> 
>>>>>> 
>>>>>> I started the simple_spawn with 3 processes and spawned 9 processes (3 
>>>>>> per node on 3 nodes). 
>>>>>> 
>>>>>> There is no problem with the Torque environment itself, because all 9 
>>>>>> processes are started on the respective nodes. But the MPI_Comm_spawn of 
>>>>>> the parents and the MPI_Init of the children "sometimes" don't return!
>>>>> 
>>>>> Seems odd - the launch environment has nothing to do with MPI_Init, so if 
>>>>> the processes are indeed being started, they should run. One possibility 
>>>>> is that they aren't correctly getting some wireup info.
>>>>> 
>>>>> Can you configure OMPI --enable-debug and then rerun the example with 
>>>>> "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
>>>>> grpcomm_base_verbose 5" on the command line?
>>>>> 
>>>>> 
>>>>>> 
>>>>>> This is the output of simple_spawn - which confirms the above statement. 
>>>>>> 
>>>>>> [pid 31208] starting up!
>>>>>> [pid 31209] starting up!
>>>>>> [pid 31210] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 31208] about to spawn!
>>>>>> 1 completed MPI_Init
>>>>>> Parent [pid 31209] about to spawn!
>>>>>> 2 completed MPI_Init
>>>>>> Parent [pid 31210] about to spawn!
>>>>>> [pid 28630] starting up!
>>>>>> [pid 28631] starting up!
>>>>>> [pid 9846] starting up!
>>>>>> [pid 9847] starting up!
>>>>>> [pid 9848] starting up!
>>>>>> [pid 6363] starting up!
>>>>>> [pid 6361] starting up!
>>>>>> [pid 6362] starting up!
>>>>>> [pid 28632] starting up!
>>>>>> 
>>>>>> Any hints?
>>>>>> 
>>>>>> Best,
>>>>>> Suraj
>>>>>> 
>>>>>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>>>>> 
>>>>>>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>>>>>>> "doesn't work"? Is there some specific behavior you see?
>>>>>>> 
>>>>>>> You might try the attached program. It's a simple spawn test we use - 
>>>>>>> 1.7.4 seems happy with it.
>>>>>>> 
>>>>>>> <simple_spawn.c>
>>>>>>> 
>>>>>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I am using 1.7.4! 
>>>>>>>> 
>>>>>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>>> What OMPI version are you using?
>>>>>>>>> 
>>>>>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>>>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello!
>>>>>>>>>> 
>>>>>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't 
>>>>>>>>>> work when spawning more than 12 processes across multiple nodes. To be 
>>>>>>>>>> more precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>>>>>> 
>>>>>>>>>> Here is my case. I obtain 5 nodes with 3 cores per node, and my 
>>>>>>>>>> $PBS_NODEFILE looks like the following.
>>>>>>>>>> 
>>>>>>>>>> node1
>>>>>>>>>> node1
>>>>>>>>>> node1
>>>>>>>>>> node2
>>>>>>>>>> node2
>>>>>>>>>> node2
>>>>>>>>>> node3
>>>>>>>>>> node3
>>>>>>>>>> node3
>>>>>>>>>> node4
>>>>>>>>>> node4
>>>>>>>>>> node4
>>>>>>>>>> node5
>>>>>>>>>> node5
>>>>>>>>>> node5
>>>>>>>>>> 
>>>>>>>>>> I started a hello program (which just spawns itself; the children, of 
>>>>>>>>>> course, do not spawn further) with 
>>>>>>>>>> 
>>>>>>>>>> mpiexec -np 3 ./hello
>>>>>>>>>> 
>>>>>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>>>>>> Spawning 9 more processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>>>>>> Spawning 12 more processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>>>>>> 
>>>>>>>>>> Ideally, I want to spawn about 32 processes across a large number of 
>>>>>>>>>> nodes, but at the moment this is impossible. I have attached my hello 
>>>>>>>>>> program to this email. 
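>>>>>>>>>> 
>>>>>>>>>> In case it is useful right away, the program is essentially the usual 
>>>>>>>>>> parent/child spawn pattern, roughly along these lines (a simplified 
>>>>>>>>>> sketch, not the exact attached hello.c; the spawn count of 9 is just 
>>>>>>>>>> an example):
>>>>>>>>>> 
>>>>>>>>>> /* Simplified sketch of a self-spawning hello (not the exact attached
>>>>>>>>>>  * hello.c). Parents collectively spawn copies of this binary; children
>>>>>>>>>>  * detect that they were spawned via MPI_Comm_get_parent and stop. */
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <mpi.h>
>>>>>>>>>> 
>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>> {
>>>>>>>>>>     int rank, size;
>>>>>>>>>>     MPI_Comm parent, intercomm;
>>>>>>>>>> 
>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>     MPI_Comm_get_parent(&parent);
>>>>>>>>>> 
>>>>>>>>>>     if (parent == MPI_COMM_NULL) {
>>>>>>>>>>         /* Parent side: spawn 9 more copies of this same executable. */
>>>>>>>>>>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9, MPI_INFO_NULL, 0,
>>>>>>>>>>                        MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>>>>>>>>>         printf("parent %d of %d: MPI_Comm_spawn returned\n", rank, size);
>>>>>>>>>>         MPI_Comm_disconnect(&intercomm);
>>>>>>>>>>     } else {
>>>>>>>>>>         /* Child side: just report in; do not spawn again. */
>>>>>>>>>>         printf("child %d of %d: MPI_Init returned\n", rank, size);
>>>>>>>>>>         MPI_Comm_disconnect(&parent);
>>>>>>>>>>     }
>>>>>>>>>> 
>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>     return 0;
>>>>>>>>>> }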
>>>>>>>>>> 
>>>>>>>>>> I will be happy to provide more info or verbose output if you tell me 
>>>>>>>>>> exactly what you would like to see.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Suraj
>>>>>>>>>> 
>>>>>>>>>> <hello.c>