Hi,

I thought that "slot" is the smallest manageable entity, so that I must set
"slots=4" for a dual-processor dual-core machine with one hardware thread
per core. Today I learned about the new keyword "sockets" for a hostfile
(I didn't find it in "man orte_hosts"). How would I have to specify a system
with two dual-core processors so that

  mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
    -cpus-per-proc 2 -bind-to-core hostname

or even

  mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2 \
    -cpus-per-proc 4 -bind-to-core hostname

would work in the same way as the "-host" commands below?
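For reference, my current hostfile host_sunpc0_1 contains (see also the
quoted mail further down):

  sunpc0 slots=4
  sunpc1 slots=4

My untested guess for the new keyword would be a line like
"sunpc0 sockets=2", but since "man orte_hosts" does not mention it,
I may have the syntax completely wrong.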
tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
  -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
sunpc0
[sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
sunpc1

Thank you very much for your help in advance.

Kind regards

Siegmar


> > I recognized another problem with process bindings. The command
> > works if I use "-host", and it breaks if I use "-hostfile" with
> > the same machines.
> >
> > tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
> >   -cpus-per-proc 2 -bind-to-core hostname
> > sunpc1
> > [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> > [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> > sunpc0
> > [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> > sunpc0
> > [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> > sunpc1
>
> Yes, this works because you told us there is only ONE slot on each
> host. As a result, we split the 4 processes across the two hosts
> (both of which are now oversubscribed), resulting in TWO processes
> running on each host. Since there are 4 cores on each host, and
> you asked for 2 cores/process, we can make this work.
>
> > tyr fd1026 179 cat host_sunpc0_1
> > sunpc0 slots=4
> > sunpc1 slots=4
> >
> > tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
> >   -cpus-per-proc 2 -bind-to-core hostname
>
> And this will of course not work. In your hostfile, you told us there
> are FOUR slots on each host. Since the default is to map by slot, we
> correctly mapped all four processes to the first node. We then tried
> to bind 2 cores for each process, resulting in 8 cores - which is
> more than you have.
>
> > --------------------------------------------------------------------------
> > An invalid physical processor ID was returned when attempting to bind
> > an MPI process to a unique processor.
> >
> > This usually means that you requested binding to more processors than
> > exist (e.g., trying to bind N MPI processes to M processors, where
> > N > M). Double check that you have enough unique processors for all
> > the MPI processes that you are launching on this host.
> >
> > Your job will now abort.
> > --------------------------------------------------------------------------
> > sunpc0
> > [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> > sunpc0
> > [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> > --------------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it encountered
> > an error on node sunpc0. More information may be available above.
> > --------------------------------------------------------------------------
> > 4 total processes failed to start
> >
> > Perhaps this error is related to the other errors. Thank you very
> > much for any help in advance.
> >
> > Kind regards
> >
> > Siegmar
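P.S. Working through the map-by-slot arithmetic from the explanation
quoted above myself (my own back-of-the-envelope, not tested), assuming
the default mapper fills each host up to its slot count before moving on:

  slots=4, -np 4, -cpus-per-proc 2:  4 ranks on sunpc0 -> 4 x 2 = 8 cores needed, 4 available  => abort
  slots=2, -np 4, -cpus-per-proc 2:  2 ranks per host  -> 2 x 2 = 4 cores needed, 4 available  => should work
  slots=1, -np 2, -cpus-per-proc 4:  1 rank per host   -> 1 x 4 = 4 cores needed, 4 available  => should work

So perhaps

  sunpc0 slots=2
  sunpc1 slots=2

is already the hostfile I am looking for, with or without a "sockets"
keyword.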