Dear all,

It sometimes happens that a job hangs while waiting for an unresponsive file system. Such a job cannot be killed, so we have to reboot the node. My idea would be to
1) set the node to "drain",
2) force the batch system to forget the CG (completing) jobs
  (otherwise the node would never reach the drained state),
3) reboot the node via Slurm,
4) set the node to "resume".
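
As a rough sketch of what I have in mind (not a tested procedure): the function name drain_reboot_resume, the RUN variable and the node name node001 are made up for illustration, and I assume that a RebootProgram is configured in slurm.conf so that "scontrol reboot" works. Step 2 is left as a comment because it is exactly the part I am missing.

```shell
#!/bin/sh
# Sketch only. RUN defaults to "echo", so the commands are printed, not executed.
# Set RUN="" to really run them.
RUN="${RUN-echo}"

drain_reboot_resume() {
    node="$1"   # hypothetical node name, e.g. node001

    # 1) drain the node so that no new jobs start on it
    $RUN scontrol update NodeName="$node" State=DRAIN Reason="hung file system"

    # 2) make Slurm forget the stuck CG jobs: this is the step I am asking about

    # 3) reboot the node via Slurm (assumes RebootProgram is set in slurm.conf)
    $RUN scontrol reboot "$node"

    # 4) once the node is back up, return it to service
    $RUN scontrol update NodeName="$node" State=RESUME
}
```

Calling "drain_reboot_resume node001" with the default RUN only prints the three scontrol commands. In real use, step 4 would have to wait until the node has actually come back.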

The second step is inspired by LSF, which has a forced cancellation for zombie processes. Is there something similar in Slurm?

Thank you,
Ulf

PS: It would be great to have a reboot flag that automatically returns the rebooted node to the "resume" state. Or does that exist already?


--
___________________________________________________________________
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640      WWW:  http://www.tu-dresden.de/zih
