Hello,
   I have a ticket open with SchedMD, but this may be an issue the community
has seen before and can answer quickly.

Slurmctld segfaulted (signal 11) on us and now segfaults again on every
restart. I'm not aware of an obvious trigger for this behavior.
We upgraded this cluster from 20.02.5 to 20.11.4 a week ago (Feb 23rd).
Slurmdbd runs on a different machine from the scheduler and seems to be
fine: no obvious errors, and sacct returns information.

The last log lines before the crash were...

[2021-03-01T08:28:20.944] error: The modification time of /no_backup/shared/slurm/slurmstate/job_state moved backwards by 31 seconds
[2021-03-01T08:28:20.944] error: The clock of the file system and this computer appear to not be synchronized
[2021-03-01T08:28:30.072] error: Nodes un1 not responding
[2021-03-01T08:30:33.208] error: Nodes un1 not responding, setting DOWN
[2021-03-01T08:31:02.240] error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
[2021-03-01T08:31:02.241] error: job_update_tres_cnt: problem getting offset of JobId=2386112_2091(2386112)
[2021-03-01T08:31:02.241] cleanup_completing: JobId=2386112_2091(2386112) completion process took 478645 seconds

The modification-time error looks like it has been occurring for a while,
and we need to check the NTP service on the file server (the Slurm state
directory is NFS-mounted). The ntpd service is working on the scheduler
and the time looks correct, though someone may have fixed it after the
crash and before I got on site.
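
In case it's useful, this is roughly how I'm planning to probe the skew
from the scheduler itself: write a file into the NFS-mounted state
directory and compare its mtime (stamped by the NFS server) against the
local clock. Just a quick Python sketch, not anything official, and
client-side attribute caching can blur the result; the statedir path is
ours from the logs above.

#!/usr/bin/env python3
# Quick-and-dirty probe for clock skew between this host and the NFS
# server hosting the state directory: write a file there and compare
# its mtime (set by the server) to the local clock.
import os
import time
import tempfile

STATEDIR = "/no_backup/shared/slurm/slurmstate"  # our StateSaveLocation

with tempfile.NamedTemporaryFile(dir=STATEDIR) as f:
    before = time.time()
    f.write(b"clock probe\n")
    f.flush()
    os.fsync(f.fileno())  # push the write out to the NFS server
    mtime = os.fstat(f.fileno()).st_mtime
    after = time.time()

# mtime should land between 'before' and 'after'; a large offset either
# way means the server's clock disagrees with this machine's.
skew = mtime - (before + after) / 2.0
print(f"apparent server/client skew: {skew:+.3f} seconds")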

An attempt at a restart gives similar errors...

[2021-03-01T13:39:00.054] _sync_nodes_to_comp_job: JobId=2386112_2091(2386112) in completing state
<CUT list of debug2 lines with reasonable usage values, including 900 tres cpu seconds, for job 2386112_2091(2386112)>
[2021-03-01T13:39:00.055] debug2: We have already ran the job_fini for JobId=2386112_2091(2386112)
[2021-03-01T13:39:00.055] select/cons_tres: job_res_rm_job: plugin still initializing
[2021-03-01T13:39:00.055] cleanup_completing: JobId=2386112_2091(2386112) completion process took 497131 seconds

I'd guess that the spooled state information for the job mentioned is
corrupt, but I don't know the proper way to verify and fix that. Has
anyone run into this before, and does anyone have suggestions on how to
solve it? (I'm backing up the state directory before touching anything;
sketch below.)
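
For reference, this is the minimal snapshot step I'm taking first, with
slurmctld stopped, so anything I try on the job_state files is
reversible. Paths are our own; nothing here is Slurm-specific.

#!/usr/bin/env python3
# Snapshot the StateSaveLocation before poking at the job_state files,
# so any experiment is reversible. Run with slurmctld stopped.
import shutil
import time

STATEDIR = "/no_backup/shared/slurm/slurmstate"  # our StateSaveLocation
BACKUP = STATEDIR + ".bak-" + time.strftime("%Y%m%dT%H%M%S")

# copy2 preserves modification times, which seems worth doing given
# the clock warnings above.
shutil.copytree(STATEDIR, BACKUP, copy_function=shutil.copy2)
print("state dir copied to", BACKUP)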

Thanks.
