What helps then is to set the node down and resume it afterwards:
scontrol update nodename=<nodename> state=drain reason=stuck; scontrol update nodename=<nodename> state=resume
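If this has to be done more than once, the two commands above can be wrapped in a small shell function. This is only a sketch: the name `reset_node` and the `DRY_RUN` variable are made up for illustration and are not part of Slurm. By default it only prints the commands, so they can be checked before anything is run.

```shell
# Hypothetical helper (not part of Slurm): wraps the two scontrol calls above.
# DRY_RUN is an assumption of this sketch; with DRY_RUN=1 (the default)
# the commands are only printed, with DRY_RUN=0 they are executed.
reset_node() {
    node=$1
    reason=${2:-stuck}
    if [ -z "$node" ]; then
        echo "usage: reset_node <nodename> [reason]" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = 1 ]; then
        run=echo
    else
        run=
    fi
    # Same sequence as above: drain the node with a reason, then resume it.
    $run scontrol update nodename="$node" state=drain reason="$reason" &&
    $run scontrol update nodename="$node" state=resume
}
```

For example, `reset_node awn-047` prints the two commands for that node, and `DRY_RUN=0 reset_node awn-047` runs them. Afterwards `squeue -w awn-047 -t CG` should show whether the completing job is gone.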
Best
Marcus

On 20.09.2023 at 09:11, Ole Holm Nielsen wrote:
On 9/20/23 01:39, Feng Zhang wrote:
Restarting the slurmd daemon on the compute node should work, if the node is still online and otherwise normal.

Probably not. If the filesystem used by the job is hung, the node will probably have to be rebooted and the filesystem checked.

/Ole

On Tue, Sep 19, 2023 at 8:03 AM Felix <fe...@itim-cj.ro> wrote:
Hello

I have a job on my system which has been running longer than its allotted time, more than 4 days.

1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message, as if the job had been canceled, but when I look up the job, it is still there

[@arc7-node ~]# squeue | grep awn-047
1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

Is there anything else I can do to kill the job?