Hmmm... the problem is that we are mapping procs against the provided slot
counts instead of dividing the slots by cpus-per-proc. So we put too many
procs on the first node, and the backend daemon aborts the job because it
doesn't have enough processors to satisfy cpus-per-proc=2.
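
To spell out the accounting, here is a minimal sketch in Python -- purely
illustrative, not the actual ORTE mapper code, and the names are made up:

    def map_by_slot(nodes, np, cpus_per_proc, divide_slots):
        placement = {}
        remaining = np
        for name, slots in nodes:
            # The fix divides each node's slot count by cpus-per-proc;
            # the current code consumes one slot per proc instead.
            capacity = slots // cpus_per_proc if divide_slots else slots
            n = min(remaining, capacity)
            placement[name] = n
            remaining -= n
        return placement

    nodes = [("node006", 64), ("node007", 64)]
    print(map_by_slot(nodes, 64, 2, divide_slots=False))  # current: {'node006': 64, 'node007': 0}
    print(map_by_slot(nodes, 64, 2, divide_slots=True))   # fixed:   {'node006': 32, 'node007': 32}

Binding 64 procs at 2 cpus each on node006 would need 128 processors where
only 64 exist, hence the abort on the backend.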

Given that there are no current plans for a 1.6.5, this may not get fixed.

On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
> 
> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer
> machines and want only one process per FP core, I thought -cpus-per-proc 2
> would be the way to go. Initially I hit this issue inside GridEngine, but
> then tried it outside any queuing system and faced exactly the same
> behavior.
> 
> @) Each machine has 4 CPUs, each with 16 integer cores, hence 64 integer
> cores per machine in total. The Open MPI version used is 1.6.4.
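> 
> To make the numbers explicit (a quick sketch in Python; the halving
> reflects the Bulldozer design, where two integer cores share one FP unit):
> 
>     sockets_per_node   = 4
>     int_cores_per_cpu  = 16
>     int_cores_per_node = sockets_per_node * int_cores_per_cpu  # 64
>     fp_cores_per_node  = int_cores_per_node // 2               # 32
> 
>     # One process per FP core via -cpus-per-proc 2:
>     procs_per_node = fp_cores_per_node                         # 32
>     total_procs    = procs_per_node * 2                        # 64 across both nodes, hence -np 64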
> 
> 
> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
> ./mpihello
> 
> and a hostfile containing only the two lines listing the machines:
> 
> node006
> node007
> 
> This works as I would like it to (see working.txt) when started on node006.
> 
> 
> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
> ./mpihello
> 
> But changing the hostfile so that it carries slot counts, mimicking a
> machinefile as it would be parsed out of a queuing system:
> 
> node006 slots=64
> node007 slots=64
> 
> This fails with:
> 
> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor on node:
> 
> Node: node006
> 
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M), or that the node has an unexpectedly different topology.
> 
> Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host, and that all nodes
> have identical topologies.
> 
> Your job will now abort.
> --------------------------------------------------------------------------
> 
> (see failed.txt)
> 
> 
> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 
> ./mpihello
> 
> This works, and the universe found is 128 as expected (see only32.txt).
> 
> 
> c) Maybe the machinefile is not parsed correctly, so I checked:
> 
> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
> 
> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
> 
> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
> 
> So it got the slot counts right.
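> 
> The check these runs exercise reduces to comparing -np against the summed
> slot counts -- a quick sketch of that comparison (Python, slot values from
> the hostfile above):
> 
>     slots = {"node006": 64, "node007": 64}
>     for np in (64, 128, 129):
>         print(np, "ok" if np <= sum(slots.values()) else "rejected")
>     # -> 64 ok, 128 ok, 129 rejected, matching c1-c3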
> 
> What do I miss?
> 
> -- Reuti
> 
> <failed.txt><only32.txt><working.txt>

