Hi all,

I'm having trouble running an Open MPI job under Slurm.  I suspect
the trouble may be in my Slurm configuration, but since the error
itself involves mpirun crashing, I thought I'd best ask here first.
The error message I get is:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

This shows up when I run my MPI job with the following script:

#!/bin/sh
set -ev
hostname
mpirun pw.x < pw.in > pw.out 2> errors_pw
(end of submit.sh)

and submit it with

sbatch -c 2 submit.sh
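
For reference, my reading of the sbatch man page is that these two
options ask for quite different allocations (corrections welcome if
I've misread it):

sbatch -c 2 submit.sh   # -c is --cpus-per-task: one task with 2 cpus
sbatch -N 2 submit.sh   # -N is --nodes: an allocation spanning 2 nodes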

If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
separate nodes, rather than two separate cores on a single node (which
makes it extremely slow).  I know that the problem is related somehow
to the environment variables that are passed to openmpi by slurm,
since I can fix the crash by changing my script to read:

#!/bin/sh
set -ev
hostname
# clear SLURM environment variables
for i in `env | awk -F= '$1 ~ /SLURM/ {print $1}'`; do
  echo unsetting $i
  unset $i
done
mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
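
For what it's worth, another variant I've been meaning to try
(untested, so take the exact directives with a grain of salt) keeps
the SLURM variables and instead asks explicitly for two tasks on one
node with #SBATCH directives:

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=2
set -ev
hostname
mpirun pw.x < pw.in > pw.out 2> errors_pw
(end of alternative submit.sh)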

So in the working version I just clear all of the SLURM environment
variables and then specify the number of processes by hand.  I suppose
I could use a bisection approach to figure out which environment
variable is triggering the crash (a rough sketch of what I have in
mind is below), and then either edit my script to modify just that
variable or figure out how to make Slurm pass things differently.  But
I thought that before embarking on that laborious process it would be
worth asking on the list whether anyone has a suggestion as to what
might be going wrong.  I'll be happy to provide my Slurm config (or
anything else that seems useful) if you think that would be helpful!
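
The sketch I have in mind is the following.  It is untested, and it is
really the one-variable-at-a-time version of the idea rather than a
true bisection; it also assumes that mpirun exits with a nonzero
status when it dies the way it does above:

#!/bin/sh
# For each SLURM_* variable, unset just that one in a subshell and see
# whether mpirun then starts.  The variable names are simply whatever
# env reports; nothing beyond the SLURM_ prefix is assumed.
for var in `env | awk -F= '/^SLURM_/ {print $1}'`; do
  echo "trying with only $var unset"
  if ( unset $var; mpirun pw.x < pw.in > pw_$var.out 2> errors_$var ); then
    echo "$var seems to be the variable triggering the crash"
  fi
done
(end of sketch)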
-- 
David Roundy
