Right, so here is the output for the same case:

mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn

Output attached. 

Best,
Suraj

Attachment: output


On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:

> 
> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
> wrote:
> 
>> Thanks Ralph!
>> 
>> I thought I had mentioned it, though: without the Torque environment, spawning 
>> with ssh works fine, but under the Torque environment it does not.
> 
> Ah, no - you forgot to mention that point.
> 
>> 
>> I started the simple_spawn with 3 processes and spawned 9 processes (3 per 
>> node on 3 nodes). 
>> 
>> The Torque environment itself is not the problem, because all 9 processes 
>> are started on the respective nodes. But the parent's MPI_Comm_spawn and the 
>> children's MPI_Init "sometimes" don't return!
> 
> Seems odd - the launch environment has nothing to do with MPI_Init, so if the 
> processes are indeed being started, they should run. One possibility is that 
> they aren't correctly getting some wireup info.
> 
> Can you configure OMPI with --enable-debug and then rerun the example with "-mca 
> plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on 
> the command line?
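> 
> For example, the rebuild could look something like this (just a sketch; 
> substitute your actual install prefix and whatever other configure options 
> you already use):
> 
>   ./configure --enable-debug --prefix=<install-prefix>
>   make all install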
> 
> 
>> 
>> This is the output of simple_spawn - which confirms the above statement. 
>> 
>> [pid 31208] starting up!
>> [pid 31209] starting up!
>> [pid 31210] starting up!
>> 0 completed MPI_Init
>> Parent [pid 31208] about to spawn!
>> 1 completed MPI_Init
>> Parent [pid 31209] about to spawn!
>> 2 completed MPI_Init
>> Parent [pid 31210] about to spawn!
>> [pid 28630] starting up!
>> [pid 28631] starting up!
>> [pid 9846] starting up!
>> [pid 9847] starting up!
>> [pid 9848] starting up!
>> [pid 6363] starting up!
>> [pid 6361] starting up!
>> [pid 6362] starting up!
>> [pid 28632] starting up!
>> 
>> Any hints?
>> 
>> Best,
>> Suraj
>> 
>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>> 
>>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>>> "doesn't work"? Is there some specific behavior you see?
>>> 
>>> You might try the attached program. It's a simple spawn test we use - 1.7.4 
>>> seems happy with it.
>>> 
>>> <simple_spawn.c>
>>> 
>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>> <suraj.prabhaka...@gmail.com> wrote:
>>> 
>>>> I am using 1.7.4! 
>>>> 
>>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>> 
>>>>> What OMPI version are you using?
>>>>> 
>>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>> 
>>>>>> Hello!
>>>>>> 
>>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work 
>>>>>> when spawning 12 or more processes across several nodes. To be more 
>>>>>> precise, "sometimes" it works, and "sometimes" it doesn't!
>>>>>> 
>>>>>> Here is my case: I obtain 5 nodes with 3 cores per node, and my $PBS_NODEFILE 
>>>>>> looks like this:
>>>>>> 
>>>>>> node1
>>>>>> node1
>>>>>> node1
>>>>>> node2
>>>>>> node2
>>>>>> node2
>>>>>> node3
>>>>>> node3
>>>>>> node3
>>>>>> node4
>>>>>> node4
>>>>>> node4
>>>>>> node5
>>>>>> node5
>>>>>> node5
>>>>>> 
>>>>>> I started a hello program (which simply spawns itself; the children, of 
>>>>>> course, don't spawn again) with
>>>>>> 
>>>>>> mpiexec -np 3 ./hello
>>>>>> 
>>>>>> Spawning 3 more processes (on node 2) - works!
>>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>> 
>>>>>> Ideally, I want to spawn about 32 processes across a large number of nodes, 
>>>>>> but at the moment this is impossible. I have attached my hello program to 
>>>>>> this email; its core logic is sketched below.
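>>>>>> 
>>>>>> In essence, the program does the following (a rough sketch, not the exact 
>>>>>> attached file; the "./hello" path and the maxprocs value of 3 are just 
>>>>>> illustrative, and error handling is omitted):
>>>>>> 
>>>>>> #include <mpi.h>
>>>>>> 
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>     MPI_Comm parent, intercomm;
>>>>>> 
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_get_parent(&parent);
>>>>>> 
>>>>>>     if (MPI_COMM_NULL == parent) {
>>>>>>         /* Original processes: collectively spawn more copies of this
>>>>>>          * binary. MPI_Comm_spawn normally does not return until the
>>>>>>          * children have started up. */
>>>>>>         MPI_Comm_spawn("./hello", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
>>>>>>                        0, MPI_COMM_WORLD, &intercomm,
>>>>>>                        MPI_ERRCODES_IGNORE);
>>>>>>     }
>>>>>>     /* Children get a non-null parent communicator and do not spawn. */
>>>>>> 
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }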
>>>>>> 
>>>>>> I will be happy to provide more info or verbose output if you tell me 
>>>>>> exactly what you would like to see.
>>>>>> 
>>>>>> Best,
>>>>>> Suraj
>>>>>> 
>>>>>> <hello.c>