Hi all.
We have an installation that has been working fine for the last ~2 years (~400k
jobs).
Last week it was upgraded from 16.05.8 to 17.02.3.
All nodes were drained/down, all queues were down, slurmctld and slurmdbd were
stopped, and no jobs were running.
The packages were updated with yum, using RPMs built on the same system as for
the previous version.
slurmdbd started successfully and performed the DB schema modifications; we had
scheduled enough time for that.
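For context, the sequence was roughly the equivalent of the following (service,
package and reason names are approximate, shown only to make the procedure
explicit; exact commands may have differed slightly):

  scontrol update NodeName=ALL State=DRAIN Reason="upgrade to 17.02"
  systemctl stop slurmctld          # on the controller
  systemctl stop slurmdbd           # on the database host
  yum update 'slurm*'               # RPMs rebuilt locally, as with 16.05.8
  systemctl start slurmdbd          # started first so it can convert the DB schema
  # slurmctld was only started later, after the schema conversion had finished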
From slurmdbd.log after the update:
[2017-05-26T22:24:07.582] adding column federation after flags in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_id after federation in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_state after fed_id in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_weight after fed_state in table cluster_table
[2017-05-26T22:24:12.676] adding column admin_comment after account in table "aris_job_table"
[2017-05-26T22:24:43.295] Warning: Note very large processing time from make table current "aris_job_table": usec=30619903 began=22:24:12.675
[2017-05-26T22:25:16.970] Warning: Note very large processing time from make table current "aris_step_table": usec=33550038 began=22:24:43.420
[2017-05-26T22:25:17.142] Accounting storage MYSQL plugin loaded
[2017-05-26T22:25:18.363] slurmdbd version 17.02.3 started
When slurmctld started:
[2017-05-27T15:17:14.777] slurmctld version 17.02.3 started on cluster aris
[2017-05-27T15:17:15.822] layouts: no layout to initialize
[2017-05-27T15:17:16.181] layouts: loading entities/relations information
[2017-05-27T15:17:16.183] Recovered state of 534 nodes
[2017-05-27T15:17:16.184] Recovered JobID=335290 State=0x0 NodeCnt=0 Assoc=1047
......
[2017-05-27T15:17:16.193] Registering slurmctld at port 6817 with slurmdbd.
[2017-05-27T15:17:16.400] No parameter for mcs plugin, default values set
[2017-05-27T15:17:16.400] mcs: MCSParameters = (null). ondemand set.
We see some error records in the logs, and in addition the job state is not
updated for the majority of jobs, although it does seem to be updated for a
few of them.
slurmdbd.log (this message is written every ~10 seconds):
[2017-05-27T13:26:54.901] error: CONN:11 Failed to unpack DBD_NODE_STATE message
[2017-05-27T13:27:04.933] error: CONN:11 Failed to unpack DBD_NODE_STATE message
slurmctld.log (this message is written every time there is a job state change):
[2017-05-29T21:00:59.908] error: slurmdbd: agent queue is full, discarding request
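As far as we can tell, the growing backlog is also visible in sdiag, which
reports the DBD agent queue size (illustrative command only):

  sdiag | grep -i 'dbd agent'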
Jobs are running and squeue reports the real status of the jobs in the queue;
the problem is with accounting.
Jobs that finished successfully appear in sacct as either PENDING or RUNNING.
The job_completions records contain what is expected.
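For reference, the mismatch can be seen with something like the following (the
time window and field list are just examples):

  sacct -X -S 2017-05-27 -o JobID,State,Start,End,Elapsed

Jobs that squeue no longer lists still show State=PENDING or State=RUNNING in
that output.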
The main problem is the budget: jobs whose accounting record stays in PENDING
run without consuming any budget, while jobs stuck in RUNNING keep consuming
budget even though they have already finished.
Any idea what may be wrong or how to fix this issue?