Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Feng Zhang


Best,

Feng




Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Wagner, Marcus
Even after rebooting, sometimes nodes are stuck because of "completing 
jobs".


What helps then is to set the node down and resume it afterwards:

scontrol update nodename= state=drain reason=stuck; scontrol update nodename= state=resume
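
The node name is missing from the command as archived. A minimal sketch of the same drain/resume pair with the node passed as an argument; the function name `node_reset_cmds` is made up for illustration, and it only prints the two commands so they can be reviewed before anything is changed:

```shell
# Print (not run) the drain/resume pair for one node so it can be reviewed.
# The argument is a placeholder node name, e.g. awn-047 from this thread.
node_reset_cmds() {
    node=$1
    if [ -z "$node" ]; then
        echo "usage: node_reset_cmds NODE" >&2
        return 1
    fi
    printf 'scontrol update nodename=%s state=drain reason=stuck\n' "$node"
    printf 'scontrol update nodename=%s state=resume\n' "$node"
}

# Usage, as a Slurm administrator:
#   node_reset_cmds awn-047        # review the two commands
#   node_reset_cmds awn-047 | sh   # then execute them
```

Printing first is a deliberate choice here: both commands change node state cluster-wide, so a typo in the node name is cheaper to catch before execution.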



Best
Marcus







Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Ole Holm Nielsen

On 9/20/23 01:39, Feng Zhang wrote:

> Restarting the slurmd daemon of the compute node should work, if the
> node is still online and normal.


Probably not.  If the filesystem used by the job is hung, the node must 
probably be rebooted, and the filesystem must be checked.
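
A hung filesystem usually shows up as processes stuck in uninterruptible sleep (state "D" in ps), which ignore even SIGKILL. A minimal sketch for spotting them on the node; the function name `blocked_procs` is made up for illustration, and it assumes the three-column ps output format given in the usage comment:

```shell
# List processes in uninterruptible sleep (state D) from
# "ps -eo pid=,stat=,comm=" output supplied on stdin.
# Prints the PID and the command name of each match.
blocked_procs() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on the compute node:
#   ps -eo pid=,stat=,comm= | blocked_procs
```

If the stuck job's processes appear in this list, restarting slurmd is unlikely to clear them.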


/Ole






Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Feng Zhang
Restarting the slurmd daemon of the compute node should work, if the
node is still online and normal.

Best,

Feng




Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Ole Holm Nielsen




On 9/19/23 13:59, Felix wrote:

> Hello
>
> I have a job on my system that has been running past its time limit, for
> more than 4 days.
>
> 1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047


The job has state "CG" which means "Completing".  The Completing status is 
explained in "man sinfo".


This means that Slurm is trying to cancel the job, but it hangs for some 
reason.
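
To see every job stuck in this state rather than grepping for one node, the state column of the squeue output can be filtered. A minimal sketch, assuming the default squeue column layout seen in this thread (state in the fifth column); the function name `completing_jobs` is made up for illustration:

```shell
# Print the job ID, elapsed time and node list of every job in state CG.
# Reads squeue output on stdin; assumes the default column order
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON).
completing_jobs() {
    awk '$5 == "CG" { print $1, $6, $8 }'
}

# Usage on a host with the Slurm client tools:
#   squeue | completing_jobs
```

squeue can also do the filtering itself with "squeue --states=COMPLETING"; the awk version is only useful when the output has already been captured.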



> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message, as if the job had been canceled, but when I query the
> job, it is still there
>
> [@arc7-node ~]# squeue | grep awn-047
>     1808851 debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047


What is your UnkillableStepTimeout parameter?  The default of 60 seconds 
can be changed in slurm.conf.  My cluster:


$ scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 126 sec
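
For use in a script, the number of seconds can be pulled out of that output. A minimal sketch, assuming the "Name = value sec" line format shown above; the function name `unkillable_step_timeout` is made up for illustration:

```shell
# Extract the UnkillableStepTimeout value (in seconds) from
# "scontrol show config" output supplied on stdin.
unkillable_step_timeout() {
    awk -F= '$1 ~ /^UnkillableStepTimeout[ \t]*$/ {
        gsub(/[^0-9]/, "", $2)   # strip spaces and the "sec" suffix
        print $2
    }'
}

# Usage on a host with the Slurm client tools:
#   scontrol show config | unkillable_step_timeout
```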


> Is there anything else I can do to kill the job?


It may be impossible to kill the job's processes, for example, if a 
filesystem is hanging.


You may log in to the node and give the job's processes a "kill -9".  Or 
just reboot the node.
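
To find which processes to kill, "scontrol listpids" run on the compute node itself reports the PIDs that slurmd started for a job. A minimal sketch that extracts them; the function name `job_pids` is made up for illustration, and the column order PID JOBID STEPID LOCALID GLOBALID is an assumption to check against your Slurm version:

```shell
# Print the PIDs belonging to one job, given "scontrol listpids" output on
# stdin. Assumes the column order PID JOBID STEPID LOCALID GLOBALID.
# The header line is skipped because its second field is not the job ID.
job_pids() {
    awk -v job="$1" '$2 == job { print $1 }'
}

# Usage, as root on the compute node (awn-047 in this thread):
#   scontrol listpids 1808851 | job_pids 1808851 | xargs -r kill -9
```

Processes blocked on a hung filesystem will not react to this signal either, in which case the reboot remains the only option.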


/Ole



[slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Felix

Hello

I have a job on my system that has been running past its time limit, for
more than 4 days.


1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047

I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message, as if the job had been canceled, but when I query the
job, it is still there


[@arc7-node ~]# squeue | grep awn-047
   1808851 debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047

Is there anything else I can do to kill the job?

Thank you

Felix


--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular 
Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323


