Dear all,

it happens that a job hangs waiting for an unresponsive file system. Such a job cannot be killed, so we have to reboot the node. My idea would be to

1) set the node to drain,
2) force the batch system to forget the CG (completing) jobs (otherwise the node would never reach the drained state),
3) reboot the node via Slurm,
4) set the node to "resume".

The second step is inspired by LSF, which has some form of forced cancellation for zombie processes. Do we have something similar in Slurm?
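
To make the intended sequence concrete, here is a minimal sketch of steps 1, 3 and 4 as scontrol calls, built in Python and only printed, not executed. The node name "node042" and the reason text are placeholders I made up. Step 2 is deliberately left empty, since that is exactly the part I do not know how to do.

```python
import shlex


def reboot_plan(node, reason="job hung on unresponsive file system"):
    """Return the scontrol invocations for steps 1, 3 and 4 as argv lists.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [
        # 1) drain the node so that no new jobs are scheduled on it
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"],
        # 2) open question: make Slurm forget the CG jobs (no command known to me)
        # 3) ask Slurm to reboot the node
        ["scontrol", "reboot", node],
        # 4) put the node back into service
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
    ]


if __name__ == "__main__":
    # "node042" is a hypothetical node name used only for illustration
    for argv in reboot_plan("node042"):
        print(shlex.join(argv))
```

Printing the commands first makes it possible to review them before running anything on a production node.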

Thank you,
Ulf

PS: It would be great to have a reboot flag that brings the rebooted node back into the "resume" state automatically. Or do we have that already?

-- 
___________________________________________________________________
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640
WWW: http://www.tu-dresden.de/zih