Hi all.

We have an installation that has been working fine for the last ~2 years (~400k jobs).
Last week it was updated from 16.05.8 to 17.02.3.
All nodes were drained/down, all queues down, slurmctld and slurmdbd stopped, no jobs running. We updated the packages with yum, using RPMs built on the same system as the previous version. slurmdbd started successfully and performed the DB schema modifications; we had allowed time for that.
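For reference, this is roughly the sequence that was run (the node range and the systemd unit names here are illustrative):

  # drain the nodes and stop the daemons before the upgrade
  scontrol update NodeName=node[001-534] State=DRAIN Reason="slurm upgrade"
  systemctl stop slurmctld
  systemctl stop slurmdbd

  # install the new RPMs, built on this same system
  yum update 'slurm*'

  # start slurmdbd first so it can convert the DB schema, then slurmctld
  systemctl start slurmdbd
  systemctl start slurmctld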

From slurmdbd.log after the update:
[2017-05-26T22:24:07.582] adding column federation after flags in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_id after federation in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_state after fed_id in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_weight after fed_state in table cluster_table
[2017-05-26T22:24:12.676] adding column admin_comment after account in table "aris_job_table"
[2017-05-26T22:24:43.295] Warning: Note very large processing time from make table current "aris_job_table": usec=30619903 began=22:24:12.675
[2017-05-26T22:25:16.970] Warning: Note very large processing time from make table current "aris_step_table": usec=33550038 began=22:24:43.420
[2017-05-26T22:25:17.142] Accounting storage MYSQL plugin loaded
[2017-05-26T22:25:18.363] slurmdbd version 17.02.3 started
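The schema conversion itself appears to have gone through; the new columns from the log can be verified with something like the following (assuming the default slurm_acct_db database name):

  mysql slurm_acct_db -e "SHOW COLUMNS FROM aris_job_table LIKE 'admin_comment'"
  mysql slurm_acct_db -e "SHOW COLUMNS FROM cluster_table LIKE 'fed%'"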

When slurmctld started:
[2017-05-27T15:17:14.777] slurmctld version 17.02.3 started on cluster aris
[2017-05-27T15:17:15.822] layouts: no layout to initialize
[2017-05-27T15:17:16.181] layouts: loading entities/relations information
[2017-05-27T15:17:16.183] Recovered state of 534 nodes
[2017-05-27T15:17:16.184] Recovered JobID=335290 State=0x0 NodeCnt=0 Assoc=1047
......
[2017-05-27T15:17:16.193] Registering slurmctld at port 6817 with slurmdbd.
[2017-05-27T15:17:16.400] No parameter for mcs plugin, default values set
[2017-05-27T15:17:16.400] mcs: MCSParameters = (null). ondemand set.

We see some error records in the logs, and in addition the job state is not updated for the majority of jobs,
although it does seem to be updated for a few of them.
slurmdbd.log: this is written every ~10 sec:
[2017-05-27T13:26:54.901] error: CONN:11 Failed to unpack DBD_NODE_STATE message
[2017-05-27T13:27:04.933] error: CONN:11 Failed to unpack DBD_NODE_STATE message

slurmctld.log: this is written every time we have a job state change:
[2017-05-29T21:00:59.908] error: slurmdbd: agent queue is full, discarding request
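The size of the slurmctld -> slurmdbd agent queue can also be watched from the command line; something like the following prints the "DBD Agent queue size" counter:

  sdiag | grep -i 'DBD Agent'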

Jobs are running and squeue reports the real status of the jobs in the queue; the problem we face is with accounting. Jobs that finished successfully appear in sacct as either PENDING or RUNNING.
The job_completions records contain what is expected.
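For example, a query along these lines (the dates and output fields are just an illustration) shows jobs that have already finished still reported as RUNNING or PENDING:

  sacct -a -X -S 2017-05-27 -E now -o JobID,JobName,State,Start,End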

The main problem is the budget: the jobs stuck in PENDING run without consuming any budget, while those stuck in RUNNING keep consuming budget even after they have finished.

Any idea what may be wrong, or how to fix this issue?






