-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

G'day Stu! ;-)

On 19/09/13 17:19, Stu Midgley wrote:

> SGE has a special job error state of 100 (ie. exit 100) which puts
> the job in E state in the queue.  The job leaves the allocated
> node(s) and goes back into the queue in E state.  This means we can
> easily know which jobs have failed, look at their log, fix the
> problem (usually a system problem - like an unmounted file system
> or crashed ypbind) and then clear the error and the job goes into
> Q state.

I can't comment on the special exit status, but we make much use of
the health check within Slurm (and Torque before it) to spot system
issues and mark nodes as DRAIN if we see something wrong.

With Torque we would run the health check scripts from cron, and
pbs_mom would just run a script to cat the file the cron job produced
(in /dev/shm) to avoid any blocking. For Slurm we've ported that
directly across, except that the script invoked by slurmd no longer
cats the file but uses scontrol to knock the node offline (or bring it
back online if the checks are passing and it was an automatic check
that took it offline last).
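
A minimal sketch of the Slurm-side logic described above, not the
actual script. It assumes the cron job leaves an empty file when all
checks pass and a short error description otherwise. The file path,
the "auto-healthcheck:" reason prefix, and the function names are
hypothetical, and it assumes the Slurm NodeName matches the short
hostname.

```python
import socket
import subprocess
from pathlib import Path
from typing import List, Optional

# Hypothetical file written by the cron-driven health checks.
STATUS_FILE = Path("/dev/shm/node_health.status")
# Hypothetical marker so an automatic drain can be told apart from an admin's.
AUTO_PREFIX = "auto-healthcheck:"


def decide_action(node: str, status_text: str, is_drained: bool,
                  current_reason: str) -> Optional[List[str]]:
    """Return the scontrol command to run, or None if nothing should change."""
    problem = status_text.strip()
    if problem:
        if is_drained:
            return None  # already offline; keep whatever reason is set
        first_line = problem.splitlines()[0]
        return ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
                f"Reason={AUTO_PREFIX} {first_line}"]
    # Checks pass: only resume nodes that an automatic check drained.
    if is_drained and current_reason.startswith(AUTO_PREFIX):
        return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]
    return None


def main() -> None:
    node = socket.gethostname().split(".")[0]
    if not STATUS_FILE.exists():
        return  # cron job has not reported yet; do nothing
    # Ask Slurm for the node's extended state (%T) and drain reason (%E).
    out = subprocess.run(
        ["sinfo", "-h", "-n", node, "-o", "%T|%E"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if not out:
        return  # node unknown to Slurm under this name
    state, _, reason = out.splitlines()[0].partition("|")
    cmd = decide_action(node, STATUS_FILE.read_text(),
                        "drain" in state.lower(), reason)
    if cmd:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main()
```

Reading a file that cron has already written keeps the script run by
slurmd quick, and the reason prefix stops it from resuming a node that
an administrator drained by hand.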

Works well for us and may help your situation.

All the best,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlI7GBwACgkQO2KABBYQAh+zAACeP+SPRJeLfroG9Za4rCzpR6Nw
mBwAoJMMlPeLGTDDAcVv6qqNeDok9x2f
=LEAV
-----END PGP SIGNATURE-----
