Re: [slurm-users] help with canceling or deleteing a job
Restarting the slurmd daemon on the compute node should work, if the node is still online and responding normally.

Best,
Feng

On Tue, Sep 19, 2023 at 8:03 AM Felix wrote:
> Hello
>
> I have a job on my system which has been running past its time limit,
> for more than 4 days.
>
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
>
> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message, as if the job was canceled, but when I query the job
> it is still there
>
> [@arc7-node ~]# squeue | grep awn-047
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
>
> Can I do anything else to kill the job?
>
> Thank you
>
> Felix
>
> --
> Dr. Eng. Farcas Felix
> National Institute of Research and Development of Isotopic and Molecular
> Technology,
> IT - Department - Cluj-Napoca, Romania
> Mobile: +40742195323
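For anyone following the suggestion above to restart slurmd, here is a minimal sketch of how it might be scripted. It assumes that slurmd runs as a systemd unit named `slurmd` and that root SSH to the compute node is allowed; neither is stated in the thread. The `run` and `restart_slurmd` helpers are hypothetical names, and the script only prints the commands unless `RUN=1` is set.

```shell
# Print the command by default; execute it only when RUN=1.
# The dry-run default avoids restarting a daemon by accident.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Restart slurmd on the given node, then recheck the job in the queue.
# Assumptions: slurmd is a systemd unit and root SSH to the node works.
restart_slurmd() {
    node=$1
    jobid=$2
    run ssh "root@${node}" systemctl restart slurmd
    run squeue -j "$jobid"
}
```

Calling `restart_slurmd awn-047 1808851` prints the two commands for review; `RUN=1 restart_slurmd awn-047 1808851` would execute them.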
Re: [slurm-users] help with canceling or deleteing a job
On 9/19/23 13:59, Felix wrote:
> Hello
>
> I have a job on my system which has been running past its time limit,
> for more than 4 days.
>
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

The job has state "CG", which means "Completing". The Completing state is explained in "man sinfo". It means that Slurm is trying to cancel the job, but the job is hanging for some reason.

> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message, as if the job was canceled, but when I query the job
> it is still there
>
> [@arc7-node ~]# squeue | grep awn-047
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

What is your UnkillableStepTimeout parameter? The default of 60 seconds can be changed in slurm.conf. On my cluster:

$ scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout = 126 sec

> Can I do anything else to kill the job?

It may be impossible to kill the job's processes, for example if a filesystem is hanging. You can log in to the node and give the job's processes a "kill -9". Or just reboot the node.

/Ole
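The two checks discussed above (which jobs are stuck in CG, and what UnkillableStepTimeout is set to) can be sketched as small filters over the command output. This is only an illustration: the helper names `completing_jobs` and `unkillable_timeout` are hypothetical, and the column positions assume the default squeue layout shown in this thread, so a job name containing spaces would break the parsing.

```shell
# Read default-format squeue output on stdin and print "JOBID NODELIST"
# for every job whose state column (5th field) is CG (Completing).
completing_jobs() {
    awk '$5 == "CG" { print $1, $8 }'
}

# Read `scontrol show config` output on stdin and print the numeric
# UnkillableStepTimeout value (the line looks like "Name = 126 sec").
unkillable_timeout() {
    awk '$1 == "UnkillableStepTimeout" { print $3 }'
}
```

Typical use would be `squeue | completing_jobs` and `scontrol show config | unkillable_timeout`. If column parsing is too fragile, `squeue -h -t CG -o "%i %N"` asks squeue itself for the job ID and node list of completing jobs.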
[slurm-users] help with canceling or deleteing a job
Hello

I have a job on my system which has been running past its time limit, for more than 4 days.

1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message, as if the job was canceled, but when I query the job it is still there

[@arc7-node ~]# squeue | grep awn-047
1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047

Can I do anything else to kill the job?

Thank you

Felix

--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323