We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

We have been running with this setup for over 9 months.  Last week, all
user jobs stopped executing (cluster load dropped to zero).  Users can
still schedule jobs, but when the jobs try to execute they fail with
errors of the form:

-----------------------------------------
[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with
status 1.
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")
error: executing task of job 11901 failed: failed sending task to
execd@compute-5-9.local: can't find connection
[compute-2-5.local:12321] ERROR: A daemon on node compute-5-9.local
failed to start as expected.
[compute-2-5.local:12321] ERROR: There may be more information available
from
[compute-2-5.local:12321] ERROR: the 'qstat -t' command on the Grid
Engine tasks.
[compute-2-5.local:12321] ERROR: If the problem persists, please restart
the
[compute-2-5.local:12321] ERROR: Grid Engine PE job
[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with
status 1.
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")


When the job is run interactively, we see:

---------------------------------------------
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")
error: executing task of job 12094 failed: failed sending task to
execd@compute-4-11.local: can't find connection
--------------------------------------------------------------------------
A daemon (pid 4938) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
---------------------------------------------

This seems to be an error with SGE, but it only affects OpenMPI jobs.
Users can successfully launch and run jobs with MVAPICH.

Some changes were made to the Rocks setup that may have caused this, but
I have not found where the actual problem lies.
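The commlib message suggests execd is resolving the client's IP back to
an empty host name, so one sanity check I have been running on the
affected nodes is a forward/reverse DNS round trip.  A rough sketch
(the node name is just an example; on a Rocks cluster getent will
consult /etc/hosts first):

```shell
# Round-trip name resolution check.  "localhost" is a placeholder --
# substitute a node named in the commlib error, e.g. compute-5-9.local.
# If the reverse step prints nothing, that matches commlib resolving
# the client IP to host name "".
node=${1:-localhost}

# forward lookup: host name -> IP address (first answer only)
ip=$(getent hosts "$node" | awk '{print $1; exit}')
echo "forward: $node -> ${ip:-<none>}"

# reverse lookup: IP address -> host name (should round-trip)
name=$(getent hosts "$ip" | awk '{print $2; exit}')
echo "reverse: ${ip:-<none>} -> ${name:-<empty>}"
```

SGE also ships its own resolver test utilities (gethostbyname and
gethostbyaddr under $SGE_ROOT/utilbin), which show exactly what commlib
sees, so comparing their output against getent on the same node may be
worthwhile.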

-- 

 Ray Muno
 University of Minnesota
