Maybe the issue is caused by how the hostfile is specified. I used 
orte_default_hostfile= in my mca-params.conf.
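
For reference, such a default-hostfile setting is a single line in mca-params.conf (the file path and hostfile location below are hypothetical, just to illustrate the form):

```
# $HOME/.openmpi/mca-params.conf
orte_default_hostfile = /path/to/my_hostfile
```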

  george.

On Nov 17, 2011, at 19:17 , Ralph Castain wrote:

> I'm still building on odin, but will check there again to see if I can 
> replicate - perhaps something didn't get committed cleanly.
> 
>> 
>> george.
>> 
>> On Nov 17, 2011, at 19:06 , Ralph Castain wrote:
>> 
>>> Hmmm...well, things seem to work just fine for me:
>>> 
>>> [rhc@odin ~/ompi-hwloc]$ mpirun -np 2 -bynode -mca plm rsh hostname
>>> odin090.cs.indiana.edu
>>> odin091.cs.indiana.edu
>>> 
>>> [rhc@odin mpi]$ mpirun -np 2 -bynode -mca plm rsh ./hello_nodename
>>> Hello, World, I am 1 of 2 on host odin091.cs.indiana.edu from app number 0 
>>> universe size 8
>>> Hello, World, I am 0 of 2 on host odin090.cs.indiana.edu from app number 0 
>>> universe size 8
>>> 
>>> 
>>> I'll get a fresh checkout and see if I can replicate from that...
>>> 
>>> On Nov 17, 2011, at 7:42 PM, George Bosilca wrote:
>>> 
>>>> I guess I reached one of those corner cases that didn't get tested. I 
>>>> can't start any apps (not even hostname) after this commit using the rsh 
>>>> PLM (as soon as I add a hostfile). mpirun is blocked in an infinite loop 
>>>> (after it spawns the daemons) in orte_rmaps_base_compute_vpids. Attaching 
>>>> with gdb indicates that cnt is never incremented, so mpirun is stuck 
>>>> forever in the while loop at line 397.
>>>> 
>>>> I used "mpirun -np 2 --bynode ./tp_lb_ub_ng" to start my application, and 
>>>> I have a machine file containing two nodes:
>>>> 
>>>> node01 slots=8
>>>> node02 slots=8
>>>> 
>>>> In addition CTRL+C seems to be broken …
>>>> 
>>>> george.
>>>> 
>>>> Begin forwarded message:
>>>> 
>>>>> Author: rhc
>>>>> Date: 2011-11-14 22:40:11 EST (Mon, 14 Nov 2011)
>>>>> New Revision: 25476
>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25476
>>>>> 
>>>>> Log:
>>>>> At long last, the fabled revision to the affinity system has arrived. A 
>>>>> more detailed explanation of how this all works will be presented here:
>>>>> 
>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
>>>>> 
>>>>> The wiki page is incomplete at the moment, but I hope to complete it over 
>>>>> the next few days. I will provide updates on the devel list. As the wiki 
>>>>> page states, the default and most commonly used options remain unchanged 
>>>>> (except as noted below). New, esoteric and complex options have been 
>>>>> added, but unless you are a true masochist, you are unlikely to use many 
>>>>> of them beyond perhaps an initial curiosity-motivated experimentation.
>>>>> 
>>>>> In a nutshell, this commit revamps the map/rank/bind procedure to take 
>>>>> into account topology info on the compute nodes. I have, for the most 
>>>>> part, preserved the default behaviors, with three notable exceptions:
>>>>> 
>>>>> 1. I have at long last bowed my head in submission to the system admins 
>>>>> of managed clusters. For years, they have complained about our default of 
>>>>> allowing users to oversubscribe nodes - i.e., to run more processes on a 
>>>>> node than allocated slots. Accordingly, I have modified the default 
>>>>> behavior: if you are running off of hostfile/dash-host allocated nodes, 
>>>>> then the default is to allow oversubscription. If you are running off of 
>>>>> RM-allocated nodes, then the default is to NOT allow oversubscription. 
>>>>> Flags to override these behaviors are provided, so this only affects the 
>>>>> default behavior.
>>>>> 
>>>>> 2. Both cpus/rank and stride have been removed. The latter's removal was 
>>>>> demanded by those who didn't understand the purpose behind it - and I 
>>>>> agreed, as the users who requested it are no longer using it. The former 
>>>>> was removed temporarily, pending implementation.
>>>>> 
>>>>> 3. vm launch is now the sole method for starting OMPI. It was just too 
>>>>> darned hard to maintain multiple launch procedures - maybe someday, 
>>>>> provided someone can demonstrate a reason to do so.
>>>>> 
>>>>> As Jeff stated, it is impossible to fully test a change of this size. I 
>>>>> have tested it on Linux and Mac, covering all the default and simple 
>>>>> options, singletons, and comm_spawn. That said, I'm sure others will find 
>>>>> problems, so I'll be watching MTT results until this stabilizes.
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>> 
>> 
> 
> 

