Hello,

With OGS/GE 2011.11 and OpenMPI 1.8.3 we have a problem with core/memory
binding when multiple OpenMPI jobs run on the same machine.

"qsub binding linear:1 job" works fine, if the job is the only one running
on the machine. As hwloc-ps and numastat show, each MPI thread is bound to
one core and allocates memory that belongs to the socket containing the
core.

However, when two or more jobs run on the same machine, "-binding linear:1"
causes them to be bound to the same cores. For instance, when two jobs with
6 MPI processes each are started on a 12-core machine (2 x Xeon L5640,
hyper-threading disabled), each of the two jobs is bound to these cores:

[lx012:16840] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././.][./././././.]
[lx012:16840] MCW rank 1 bound to socket 1[core 6[hwt 0]]:
[./././././.][B/././././.]
[lx012:16840] MCW rank 2 bound to socket 0[core 1[hwt 0]]:
[./B/./././.][./././././.]
[lx012:16840] MCW rank 3 bound to socket 1[core 7[hwt 0]]:
[./././././.][./B/./././.]
[lx012:16840] MCW rank 4 bound to socket 0[core 2[hwt 0]]:
[././B/././.][./././././.]
[lx012:16840] MCW rank 5 bound to socket 1[core 8[hwt 0]]:
[./././././.][././B/././.]
("mpirun -report-binding" output)

Thus each MPI process gets only 50% of a core, while the remaining 6 cores
sit idle.

This is clearly not what we want. Is there a communication problem between
grid engine and OpenMPI? We do not fully understand how this communication
is supposed to work: the machines file created by grid engine contains only
machine names, with no information about which cores to use on those
machines.
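
For example, for a 6-slot job on lx012 the machines file is essentially
just the host name repeated once per slot (the exact layout depends on
the PE configuration, but ours looks like this):

  lx012
  lx012
  lx012
  lx012
  lx012
  lx012

Nothing in it tells mpirun which cores have been reserved for the job.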

One could fix the binding by telling mpirun explicitly (via command line
parameters or a rankfile) which cores to use. However, grid engine seems to
provide only the core for the first MPI process: when "qsub -binding env
linear:1" is used, grid engine sets SGE_BINDING to 0 for the first job, 6
for the second, 1 for the third, 7 for the fourth, and so on. To construct
a rankfile for OpenMPI, one needs to know all the cores the job is supposed
to use.
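
One workaround we are considering (an untested sketch, so please correct
us if we misunderstand): submit with "-binding pe linear:1", which, as far
as we can tell from the qsub man page, makes grid engine record the
selected cores as socket,core pairs in a fourth column of the pe_hostfile.
The job script could then translate that column into an OpenMPI rankfile
and hand it to mpirun, bypassing mpirun's own placement. The program name
is a placeholder, and we are assuming the fourth column has the form
"0,0:0,1" with one socket,core pair per reserved core:

  # inside the job script; $PE_HOSTFILE and $TMPDIR are set by grid engine
  rankfile=$TMPDIR/rankfile
  rank=0
  : > "$rankfile"
  while read host nslots queue binding; do
      # assumed binding column format: socket,core pairs joined by ':',
      # e.g. "0,0:0,1" (no core ranges)
      for pair in $(echo "$binding" | tr ':' ' '); do
          socket=${pair%,*}
          core=${pair#*,}
          echo "rank $rank=$host slot=$socket:$core" >> "$rankfile"
          rank=$((rank + 1))
      done
  done < "$PE_HOSTFILE"

  mpirun -np $rank -rf "$rankfile" ./my_mpi_program

Is something along these lines the intended way to couple the two, or is
there a supported mechanism we are overlooking?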

How can we force grid engine and OpenMPI to manage core binding in a
reasonable way?

Maybe we are missing some OpenMPI setting we are not aware of (I thought
binding was supposed to be enabled by default). If you need any details
about our queue configuration, I am happy to provide them.

Thank you
Michael