Hi all.
We have an installation that has been working fine for the last ~2 years (~400k
jobs).
Last week it was upgraded from 16.05.8 to 17.02.3.
All nodes were drained/down, all queues were down, slurmctld and slurmdbd were
stopped, and no jobs were running.
The packages were updated with yum, using RPMs built on the same system as for
the previous version.
slurmdbd started successfully and performed the DB schema modifications; we had
scheduled enough time for that.
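For context, the sequence was roughly the equivalent of the following (service,
package and reason names are approximate, shown only to make the procedure
explicit; exact commands may have differed slightly):

  scontrol update NodeName=ALL State=DRAIN Reason="upgrade to 17.02"
  systemctl stop slurmctld          # on the controller
  systemctl stop slurmdbd           # on the database host
  yum update 'slurm*'               # RPMs rebuilt locally, as with 16.05.8
  systemctl start slurmdbd          # started first so it can convert the DB schema
  # slurmctld was only started later, after the schema conversion had finished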
From slurmdbd.log after the update:
[2017-05-26T22:24:07.582] adding column federation after flags in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_id after federation in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_state after fed_id in table cluster_table
[2017-05-26T22:24:07.582] adding column fed_weight after fed_state in table cluster_table
[2017-05-26T22:24:12.676] adding column admin_comment after account in table "aris_job_table"
[2017-05-26T22:24:43.295] Warning: Note very large processing time from make table current "aris_job_table": usec=30619903 began=22:24:12.675
[2017-05-26T22:25:16.970] Warning: Note very large processing time from make table current "aris_step_table": usec=33550038 began=22:24:43.420
[2017-05-26T22:25:17.142] Accounting storage MYSQL plugin loaded
[2017-05-26T22:25:18.363] slurmdbd version 17.02.3 started
When slurmctld started:
[2017-05-27T15:17:14.777] slurmctld version 17.02.3 started on cluster aris
[2017-05-27T15:17:15.822] layouts: no layout to initialize
[2017-05-27T15:17:16.181] layouts: loading entities/relations information
[2017-05-27T15:17:16.183] Recovered state of 534 nodes
[2017-05-27T15:17:16.184] Recovered JobID=335290 State=0x0 NodeCnt=0 Assoc=1047
......
[2017-05-27T15:17:16.193] Registering slurmctld at port 6817 with slurmdbd.
[2017-05-27T15:17:16.400] No parameter for mcs plugin, default values set
[2017-05-27T15:17:16.400] mcs: MCSParameters = (null). ondemand set.
We see some error records in the logs, and in addition the job state is not
updated for the majority of jobs, although it does seem to be updated for a
few of them.
slurmdbd.log (this message is written every ~10 seconds):
[2017-05-27T13:26:54.901] error: CONN:11 Failed to unpack DBD_NODE_STATE message
[2017-05-27T13:27:04.933] error: CONN:11 Failed to unpack DBD_NODE_STATE message
slurmctld.log (this message is written every time there is a job state change):
[2017-05-29T21:00:59.908] error: slurmdbd: agent queue is full, discarding request
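As far as we can tell, the growing backlog is also visible in sdiag, which
reports the DBD agent queue size (illustrative command only):

  sdiag | grep -i 'dbd agent'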
Jobs are running and squeue reports the real status of the jobs in the queue;
the problem is with accounting.
Jobs that finished successfully appear in sacct as either PENDING or RUNNING.
The job_completions records contain what is expected.
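For reference, the mismatch can be seen with something like the following (the
time window and field list are just examples):

  sacct -X -S 2017-05-27 -o JobID,State,Start,End,Elapsed

Jobs that squeue no longer lists still show State=PENDING or State=RUNNING in
that output.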
The main problem is the budget: jobs whose accounting record stays in PENDING
run without consuming any budget, while jobs stuck in RUNNING keep consuming
budget even though they have already finished.
Any idea what may be wrong or how to fix this issue?