Hello Steffen,

Steffen Grunewald
<[email protected]> writes:

> Hello all,
>
> I've got a rather newly setup cluster, which at the moment is completely idle
> ("squeue" doesn't return anything.)
>
> From the testing phases, a couple of now unused accounts and associations are
> left, which I'd like to get rid of:
>
> [root@login ~]# sacctmgr show assoc
>    Cluster    Account       User  Partition     Share GrpJobs       GrpTRES 
> GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode 
> MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS 
> GrpTRESRunMin 
> ---------- ---------- ---------- ---------- --------- ------- ------------- 
> --------- ----------- ------------- ------- ------------- -------------- 
> --------- ----------- ------------- -------------------- --------- 
> ------------- 
> [...]
>    cluster    default                               1                         
>                                                                               
>                                                      normal               
>    cluster    default        tom                    1                         
>                                                                               
>                                                      normal               
> [...]
> [root@login ~]# sacctmgr delete user name=tom account=default
>  Error with request: Job(s) active, cancel job(s) before remove
>   JobID = 15498      C = cluster    A = default    U = tom      
>   JobID = 15500      C = cluster    A = default    U = tom      
>   JobID = 15501      C = cluster    A = default    U = tom      
>   JobID = 15502      C = cluster    A = default    U = tom      
>   JobID = 15503      C = cluster    A = default    U = tom      
>   JobID = 15504      C = cluster    A = default    U = tom      
>   JobID = 15505      C = cluster    A = default    U = tom      
>   JobID = 15506      C = cluster    A = default    U = tom      
>   JobID = 15508      C = cluster    A = default    U = tom      
>   JobID = 15509      C = cluster    A = default    U = tom      
> [root@login ~]# scontrol show jobid -dd 15500
> slurm_load_jobs error: Invalid job id specified
> [root@login ~]# sacct -j 15500
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
> ------------ ---------- ---------- ---------- ---------- ---------- -------- 
> 15500        intel-test  partition    default         48    RUNNING      0:0 
>
>
> Is there a "gold standard" way to repair this?

I don't think there is a "gold standard" for this.  You probably just
have to go into the database and fix it yourself.
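For the orphaned jobs above, a manual fix might look roughly like the
following sketch.  This is only an illustration, not a recipe: the table
name follows the usual slurmdbd MySQL convention ("<clustername>_job_table",
so "cluster_job_table" here), the numeric state codes (1 = RUNNING,
3 = COMPLETE) come from Slurm's internal job-state enum, and the database
name "slurm_acct_db" is the default, which your site may have changed.
Verify everything against your own schema, stop slurmdbd, and take a
backup before touching anything.

```shell
# Hypothetical sketch, assuming the default slurmdbd MySQL schema.
# It marks one phantom "RUNNING" job (no longer known to slurmctld)
# as completed by setting a state and an end time.
SQL="UPDATE cluster_job_table
     SET state = 3, time_end = time_start + 1
     WHERE id_job = 15500 AND state = 1 AND time_end = 0;"

# Print the statement for review before running it.
echo "$SQL"

# To actually apply it (deliberately commented out here):
# mysql -u slurm -p slurm_acct_db -e "$SQL"
```

After repeating this for each phantom JobID, the "sacctmgr delete user"
command should no longer complain about active jobs.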

A while ago I posted some code to fix anomalous jobs.  It was intended
to make the data plausible (e.g. by adding a missing completion date for
a job with status "RUNNING" which no longer exists), and not for
deleting jobs completely, but it might help:

https://groups.google.com/forum/#!msg/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
