A lot of this is automated in newer versions of Slurm. You should
just need to run:
sacctmgr show runawayjobs
It will then give you the option to clean them up, and Slurm will
handle the rest. If you add the -i option it will clean them up
automatically, without prompting.
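For example (a minimal sketch; -i is sacctmgr's --immediate option,
which commits changes without asking for confirmation):

# list runaway jobs and answer the clean-up prompt interactively
sacctmgr show runawayjobs

# or let sacctmgr clean them up without prompting
sacctmgr -i show runawayjobs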
-Paul Edmon-
On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
Hi Brian,
Others probably have better suggestions before going the route I'm
about to detail. If you do go this route, be warned: you can
irrevocably lose data or destroy your Slurm accounting database.
Proceed at your own risk. I got here with some Google-fu after running
out of other options known to me. Someone please save Brian from
having to do what comes below ;-)
Last warning: I'd recommend turning off slurmdbd and backing up the
database (mysqldump) before going forward.
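Something along these lines, for example (a sketch only, assuming
slurmdbd runs under systemd and the accounting database is named
slurm_acct_db as it is below; adjust the service name, database name
and credentials for your site):

# stop the accounting daemon so nothing writes while you work
systemctl stop slurmdbd

# dump the accounting database to a file you can restore from
mysqldump -u root -p slurm_acct_db > slurm_acct_db.backup.sql

# ...do the surgery, then bring the daemon back:
systemctl start slurmdbd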
In my case, the runaway jobs did not show up with `sacctmgr list
runawayjobs`. My problem was that I could not remove a user from the
Slurm database because Slurm thought they still had active jobs. The
likely cause was slurmdbd not shutting down gracefully at some point.
The job was long gone, but the database still showed it in a pending
state:
# sacct -j 899139
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
899639            equil   gpu-long     p-1234         20    PENDING      0:0
# scontrol show job 899139
slurm_load_jobs error: Invalid job id specified
# mysql -u root -p
...
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 7453
Server version: 5.1.73 Source distribution
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select state,time_end,time_start,time_submit,id_assoc,partition from
banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
| 0 | 0 | 0 | 1546880711 | 2078 | gpu-long |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)
mysql> update banana_job_table set state=3 where id_job=899139;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select state,time_end,time_start,time_submit,id_assoc,partition from
banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
| 3 | 0 | 0 | 1546880711 | 2078 | gpu-long |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)
mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select state,time_end,time_start,time_submit,id_assoc,partition from
banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
| 3 | 0 | 1546880712 | 1546880711 | 2078 | gpu-long |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)
mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select state,time_end,time_start,time_submit,id_assoc,partition from
banana_job_table where id_job=899139;
+-------+------------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+------------+------------+-------------+----------+-----------+
| 3 | 1546880713 | 1546880712 | 1546880711 | 2078 | gpu-long |
+-------+------------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)
In this case, for job ID 899139 on the banana cluster, the job's
state was never updated, and neither were its start or end times. I
went in and manually edited the job entry so that Slurm considered it
complete, with plausible start and end times. Again, this worked for
me; I don't know whether this is your problem or not. If you choose
this route, be careful and good luck!
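For what it's worth, the three updates above could also be issued as a
single statement. A sketch only: as I understand it, state 3
corresponds to JOB_COMPLETE (and 0 to JOB_PENDING) in Slurm's job
state numbering, and the timestamps only need to be plausible Unix
times with submit <= start <= end:

-- consolidated version of the edits shown above (banana_job_table is
-- this cluster's job table; substitute your own cluster name)
update banana_job_table
   set state      = 3,          -- mark the job completed
       time_start = 1546880712, -- one second after time_submit
       time_end   = 1546880713  -- one second after time_start
 where id_job = 899139;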
On 3/6/19 10:15 AM, Brian Andrus wrote:
It shows several jobs that all have "Unknown" for end_time. Some are
PENDING and some are RUNNING (none are truly in either state).
It asked whether to fix them, which I did, but nothing seems to have
changed: they still show up with that command and in the reports.
Brian
On 3/5/2019 10:34 PM, Chris Samuel wrote:
On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
Does anyone have a process they use to handle empty (aka "Unknown")
end times for jobs that are not running?
What does:
sacctmgr list runawayjobs
say?