Hello,

Due to the recent CVE posted by Tim, we upgraded from SLURM 20.11.3 to 20.11.9.

Today, I received a ticket from a user whose output files are populated with the "slurmstepd: error: Exceeded job memory limit" message. However, the jobs are still running, and it seems that the controller is misidentifying the job and/or step ID. Please see below.

# slurmd log
[2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit (7973>5120), cancelling it
[2022-05-18T09:33:31.291] debug: _rpc_job_notify, uid = 65536, JobId=7733409
[2022-05-18T09:33:31.291] [7733409.0] debug: Handling REQUEST_STEP_UID
[2022-05-18T09:33:31.300] send notification to StepId=7733409.batch
[2022-05-18T09:33:31.300] [7733409.batch] debug: Handling REQUEST_JOB_NOTIFY
[2022-05-18T09:33:31.302] [7733409.batch] error: Exceeded job memory limit

# controller log
[2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=0
[2022-05-18T09:33:31.293] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0
[2022-05-18T09:33:31.293] kill_job_step: invalid JobId=4367416
[2022-05-18T09:33:31.293] debug2: slurm_send_timeout: Socket no longer there

A restart of the controller doesn't help either, as there are a lot of misidentified jobs (truncated):

[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731684+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731625+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731634+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731629+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731632+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724375+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731650+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731681+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731651+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.134] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0

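For anyone wanting to check their own controller log for the same pattern, here is a rough Python sketch that pulls the StepId fields out of the REQUEST_CANCEL_JOB_STEP lines. It only assumes the line format shown in the excerpts above; the function names are made up for illustration, and the trailing "+0" is matched but not interpreted. Reading the second number as the real job ID is an inference from the slurmd log (where 7733409 is the job), not something confirmed.

```python
import re
from collections import Counter

# Matches the StepId field as it appears in the controller log excerpts,
# e.g. "REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0".
STEP_RE = re.compile(r"REQUEST_CANCEL_JOB_STEP StepId=(\d+)\.(\d+)\+\d+")


def suspect_cancels(lines):
    """Return (reported_job_id, reported_step_id) for each cancel request."""
    pairs = []
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2))))
    return pairs


def summarize(lines):
    """Count cancel requests per reported job ID and list distinct step fields."""
    pairs = suspect_cancels(lines)
    by_job = Counter(job for job, _ in pairs)
    # Assumption: the "step" field is actually the ID of the job slurmd
    # tried to cancel, as the slurmd log for 7733409 suggests.
    step_fields = sorted({step for _, step in pairs})
    return by_job, step_fields


# Three lines copied from the excerpts above, plus one that should be skipped.
sample = [
    "[2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=0",
    "[2022-05-18T09:33:31.293] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0",
    "[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
    "[2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
]

if __name__ == "__main__":
    by_job, step_fields = summarize(sample)
    print("cancel requests per reported JobId:", dict(by_job))
    print("distinct values in the step field:", step_fields)
```

To run it against a real log, replace `sample` with the lines of the slurmctld log file (for example `open(path).read().splitlines()`, with `path` set to wherever SlurmctldLogFile points on your system). If every request reports the same JobId while the step field varies, that matches what is shown above.
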
These jobs were started post upgrade, too.

Has anyone else seen this?

Thank you,
John DeSantis