On Feb 28, 2013, at 6:17 AM, Reuti <re...@staff.uni-marburg.de> wrote:

> On Feb 28, 2013, at 8:58 AM, Reuti wrote:
> 
>> On Feb 28, 2013, at 6:55 AM, Ralph Castain wrote:
>> 
>>> I don't off-hand see a problem, though I do note that your "working" 
>>> version incorrectly reports the universe size as 2!
>> 
>> Yes, it was 2 in the case where it worked, i.e. when giving only the two 
>> hostnames without any dedicated slot count. What should it be in this case - 
>> "unknown", "infinity"?
> 
> As an add-on:
> 
> a) I tried it again on the command line and still get:
> 
> Total: 64
> Universe: 2
> 
> with a hostfile
> 
> node006
> node007
> 

My bad - since no slots were given, we default to a value of 1 for each node, 
so this is correct.
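
To spell that out (a sketch based only on the numbers in this thread, not on
actual output): without slot counts each listed host contributes the default
of one slot, while explicit "slots=" counts simply add up, matching the 128
reported in case b1) further down.

    # hostfile without slot counts -> 1 slot assumed per node
    node006
    node007
    # => Universe: 2

    # hostfile with explicit slot counts
    node006 slots=64
    node007 slots=64
    # => Universe: 128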

> 
> b) In a job script under SGE, with Open MPI compiled --with-sge, I get the 
> following after mangling the hostfile:
> 
> #!/bin/sh
> #$ -pe openmpi* 128
> #$ -l exclusive
> cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines
> mpiexec -cpus-per-proc 2 -report-bindings -hostfile $TMPDIR/machines -np 64 
> ./mpihello
> 
> Here:
> 
> Total: 64
> Universe: 128

This would be correct, as SGE is allocating a total of 128 slots (PE slots).
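
For reference, a sketch of what is happening on the SGE side (hostnames and
queue name are only assumed for illustration): with "-pe openmpi* 128" and
"-l exclusive" on two 64-core nodes, $PE_HOSTFILE would contain one line per
host in the form "hostname slots queue processor-range", e.g.

    node006 64 all.q@node006 UNDEFINED
    node007 64 all.q@node007 UNDEFINED

and the cut -f 1 -d" " in the job script keeps only the hostnames:

    node006
    node007

Since Open MPI was built --with-sge, it presumably still reads the full
128-slot allocation from the original $PE_HOSTFILE, which is why the universe
reports 128 even though the mangled hostfile carries no slot counts.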

> 
> Maybe the allocation found by SGE and the one derived from the command-line 
> arguments are getting mixed up here.
> 
> -- Reuti
> 
> 
>> -- Reuti
>> 
>> 
>>> 
>>> I'll have to take a look at this and get back to you on it.
>>> 
>>> On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have an issue with the option -cpus-per-proc 2. As I have Bulldozer 
>>>> machines and want only one process per FP core, I thought -cpus-per-proc 2 
>>>> would be the way to go. Initially I hit this issue inside GridEngine, but 
>>>> then tried it outside any queuing system and faced exactly the same 
>>>> behavior.
>>>> 
>>>> @) Each machine has 4 CPUs, each with 16 integer cores, hence 64 integer 
>>>> cores per machine in total (i.e. 32 FP cores, since two integer cores 
>>>> share one FP unit). The Open MPI version used is 1.6.4.
>>>> 
>>>> 
>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
>>>> ./mpihello
>>>> 
>>>> and a hostfile containing only the two lines listing the machines:
>>>> 
>>>> node006
>>>> node007
>>>> 
>>>> This works as I would like (see working.txt) when initiated on node006.
>>>> 
>>>> 
>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
>>>> ./mpihello
>>>> 
>>>> But changing the hostfile so that it has a slot count, which might mimic 
>>>> the behavior of a machinefile parsed out of a queuing system:
>>>> 
>>>> node006 slots=64
>>>> node007 slots=64
>>>> 
>>>> This fails with:
>>>> 
>>>> --------------------------------------------------------------------------
>>>> An invalid physical processor ID was returned when attempting to bind
>>>> an MPI process to a unique processor on node:
>>>> 
>>>> Node: node006
>>>> 
>>>> This usually means that you requested binding to more processors than
>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>> M), or that the node has an unexpectedly different topology.
>>>> 
>>>> Double check that you have enough unique processors for all the
>>>> MPI processes that you are launching on this host, and that all nodes
>>>> have identical topologies.
>>>> 
>>>> Your job will now abort.
>>>> --------------------------------------------------------------------------
>>>> 
>>>> (see failed.txt)
>>>> 
>>>> 
>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 
>>>> ./mpihello
>>>> 
>>>> This works, and the reported universe is 128 as expected (see only32.txt).
>>>> 
>>>> 
>>>> c) Maybe the machinefile is not being parsed correctly, so I checked:
>>>> 
>>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>> 
>>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>> 
>>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>> 
>>>> So the slot counts are parsed correctly (2 x 64 = 128 slots in total).
>>>> 
>>>> What am I missing?
>>>> 
>>>> -- Reuti
>>>> 
>>>> <failed.txt><only32.txt><working.txt>