[gridengine users] MPI jobs on a multi-architecture cluster?

bergman Wed, 12 Dec 2012 15:29:00 -0800

I've got a question that's very similar to Joseph Farran's query "How do I
request the CPU type in qrsh / qsub with SGE 8.1.2?" [1], but which is a
problem specifically with MPI jobs.


I think we're running into a chipset-architecture issue (AMD vs. Intel)
in OpenMPI jobs. We're using SGE 6.2u5 and OpenMPI 1.3.3 with tight
integration. All MPI jobs are launched by SGE.

We've got a locally-written program that dynamically links against
a package that can be compiled with chipset-specific optimizations
(ATLAS[2]). We've built multiple versions of ATLAS, one optimized
for each architecture in our cluster.

This is fine for serial jobs--the login environment
sets the path according to the chipset on each server
(i.e., ATLASDIR=/opt/ATLAS/3.8.3/Intel/Xeon/Westmere or
ATLASDIR=/opt/ATLAS/3.8.3/AMD/Opteron). We do the same thing for other
packages that provide chipset optimization (LAPACK[3] and BLAS[4]).
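
For illustration, a login-time selection of this kind might look like the
sketch below. It is not our actual profile script: the helper names
atlas_dir_for_vendor and cpu_vendor are hypothetical, and reading the
vendor_id field from /proc/cpuinfo is an assumption that only holds on
Linux x86 hosts. Only the two ATLASDIR paths are taken from the text above.

```shell
#!/bin/sh
# Sketch only: map a CPU vendor string to an ATLAS install directory.
# atlas_dir_for_vendor is a hypothetical helper; the two paths are the
# ones quoted in the message.
atlas_dir_for_vendor() {
    case "$1" in
        GenuineIntel) echo "/opt/ATLAS/3.8.3/Intel/Xeon/Westmere" ;;
        AuthenticAMD) echo "/opt/ATLAS/3.8.3/AMD/Opteron" ;;
        *)            return 1 ;;   # unknown vendor: leave ATLASDIR alone
    esac
}

# Print the vendor of the first CPU listed in a cpuinfo-style file
# (defaults to /proc/cpuinfo; prints nothing if the field is absent).
cpu_vendor() {
    awk -F': ' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}" 2>/dev/null
}

# Set ATLASDIR only when the local vendor is one we recognize.
if dir=$(atlas_dir_for_vendor "$(cpu_vendor)"); then
    ATLASDIR=$dir
    export ATLASDIR
fi
```

The point of the sketch is that the choice depends only on the host the
shell starts on, which is why it works for serial jobs.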

Our executables are dynamically linked, so there's no problem running
the same program on either the Intel or AMD machines. Users simply submit
the job to SGE and the executable uses the correct library for the server
at runtime.

Everything is fine if all of a job's processes (the MPI master and its
slaves) run on nodes of the same chip architecture.

However, there seems to be a problem with OpenMPI jobs if a slave
process runs on a different chipset than the master. I believe
that the slave processes are launched without going through a shell, so
they don't get the environment settings that would be applied in an
interactive session or SGE job. Instead, each slave process seems to run
with the same paths as the parent. For example, if the master MPI process
is launched on an Intel node, LD_LIBRARY_PATH may be set to include
"/opt/ATLAS/3.8.3/Intel/Xeon/Westmere/lib". This seems to be passed
to slave MPI processes running on AMD nodes, so they pick up the wrong
library and die with a segmentation fault.
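
One possible workaround, sketched here under assumptions rather than
tested on our cluster, is to have each MPI process start through a small
wrapper that rebuilds LD_LIBRARY_PATH for the local node before running
the real program. The helper names strip_atlas_paths and local_ld_path
are hypothetical; the /opt/ATLAS/ prefix is the one from the example
above.

```shell
#!/bin/sh
# Sketch only: remove ATLAS entries inherited from the master's
# environment. $1 is a colon-separated path list.
strip_atlas_paths() {
    printf '%s' "$1" | tr ':' '\n' | grep -v '^/opt/ATLAS/' | paste -sd: -
}

# Build a library path for this node.
#   $1 = ATLAS directory that matches the local chipset
#   $2 = inherited LD_LIBRARY_PATH (may be empty)
local_ld_path() {
    cleaned=$(strip_atlas_paths "$2")
    # Put the local ATLAS lib directory first, then whatever remains.
    printf '%s\n' "$1/lib${cleaned:+:$cleaned}"
}
```

A wrapper built on these helpers would have to work out the correct
ATLAS directory on the node where it runs (for example from the CPU
vendor in /proc/cpuinfo), export the result of local_ld_path as
LD_LIBRARY_PATH, and then exec the real program with its original
arguments.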

I could set up separate per-chipset MPI parallel environments within SGE
(i.e., submit jobs with "-pe mpi-intel" or "-pe mpi-amd"), but that adds a
complication for users and limits SGE's freedom in scheduling.

I'm wondering if there's a way to force SGE to select slave nodes from the
same architecture type as the master MPI process, at run-time. We've already
got the architecture as an attribute within SGE. In other words, when SGE
determines which nodes have resources available to make up the "machine list"
passed to OpenMPI, could that list be restricted to nodes of the same
architecture as the node that SGE selects for the master process?

Thanks,

Mark

        [1] http://gridengine.org/pipermail/users/2012-December/005329.html
        [2] http://math-atlas.sourceforge.net/
        [3] http://www.netlib.org/lapack/
        [4] http://www.netlib.org/blas/