Hi,

I thought that a "slot" was the smallest manageable entity, so that I
had to set "slots=4" for a dual-processor, dual-core machine with one
hardware thread per core. Today I learned about the new keyword
"sockets" for a hostfile (I didn't find it in "man orte_hosts").
How would I specify a system with two dual-core processors so that

  mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
    -cpus-per-proc 2 -bind-to-core hostname

or even

  mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2 \
    -cpus-per-proc 4 -bind-to-core hostname

would work in the same way as the command below?

tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
  -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B]
sunpc0
[sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B]
sunpc1
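
My best guess, purely as a sketch that I could not verify against
"man orte_hosts", would be a hostfile along these lines (assuming the
"sockets" keyword takes a count in the same way "slots" does):

  sunpc0 slots=4 sockets=2
  sunpc1 slots=4 sockets=2

but I don't know whether this is the intended syntax, which is why I
am asking.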


Thank you very much in advance for your help.


Kind regards

Siegmar



> > I noticed another problem with process bindings. The command
> > works if I use "-host", but it breaks if I use "-hostfile" with
> > the same machines.
> > 
> > tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
> >  -cpus-per-proc 2 -bind-to-core hostname
> > sunpc1
> > [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> > [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> > sunpc0
> > [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> > sunpc0
> > [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> > sunpc1
> > 
> > 
> 
> Yes, this works because you told us there is only ONE slot on each
> host. As a result, we split the 4 processes across the two hosts
> (both of which are now oversubscribed), resulting in TWO processes
> running on each host. Since there are 4 cores on each host, and
> you asked for 2 cores/process, we can make this work.
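> 
> To illustrate (just a sketch, not tested on your machines): using
> "-host sunpc0,sunpc1" behaves roughly like a hostfile that declares
> one slot per host, e.g.
> 
>   sunpc0 slots=1
>   sunpc1 slots=1
> 
> which is why asking for 4 processes oversubscribes both hosts and
> spreads two processes onto each node.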
> 
> 
> > tyr fd1026 179 cat host_sunpc0_1 
> > sunpc0 slots=4
> > sunpc1 slots=4
> > 
> > 
> > tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
> >  -cpus-per-proc 2 -bind-to-core hostname
> 
> And this will of course not work. In your hostfile, you told us there
> are FOUR slots on each host. Since the default is to map by slot, we
> correctly mapped all four processes to the first node. We then tried
> to bind 2 cores to each process, which requires 8 cores on that node,
> more than the 4 it has.
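> 
> If the goal is two processes per node with this hostfile, one
> possible workaround (an untested sketch, not the only option) would
> be to cap the number of processes per node on the command line, e.g.
> 
>   mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
>     -npernode 2 -cpus-per-proc 2 -bind-to-core hostname
> 
> or to change the hostfile entries to "slots=2" so the mapper only
> places two processes on each host.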
> 
> 
> > --------------------------------------------------------------------------
> > An invalid physical processor ID was returned when attempting to bind
> > an MPI process to a unique processor.
> > 
> > This usually means that you requested binding to more processors than
> > exist (e.g., trying to bind N MPI processes to M processors, where N >
> > M).  Double check that you have enough unique processors for all the
> > MPI processes that you are launching on this host.
> > 
> > You job will now abort.
> > --------------------------------------------------------------------------
> > sunpc0
> > [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> > sunpc0
> > [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> > --------------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it encountered
> >  an error
> > on node sunpc0. More information may be available above.
> > --------------------------------------------------------------------------
> > 4 total processes failed to start
> > 
> > 
> > Perhaps this error is related to the other errors. Thank you very
> > much in advance for any help.
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> 
> 
