Hello,

Following the recent CVE posted by Tim, we upgraded from SLURM 20.11.3 to 
20.11.9.

Today I received a ticket from a user whose output files are filled with the 
"slurmstepd: error: Exceeded job memory limit" message.  However, the jobs are 
still running, and it appears that the controller is misidentifying the job 
and/or step ID.  Please see the logs below.

# slurmd log

[2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit (7973>5120), cancelling it
[2022-05-18T09:33:31.291] debug:  _rpc_job_notify, uid = 65536, JobId=7733409
[2022-05-18T09:33:31.291] [7733409.0] debug:  Handling REQUEST_STEP_UID
[2022-05-18T09:33:31.300] send notification to StepId=7733409.batch
[2022-05-18T09:33:31.300] [7733409.batch] debug:  Handling REQUEST_JOB_NOTIFY
[2022-05-18T09:33:31.302] [7733409.batch] error: Exceeded job memory limit

# controller log

[2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=0
[2022-05-18T09:33:31.293] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0
[2022-05-18T09:33:31.293] kill_job_step: invalid JobId=4367416
[2022-05-18T09:33:31.293] debug2: slurm_send_timeout: Socket no longer there

A restart of the controller doesn't help either, as there are a lot of 
misidentified jobs (truncated):

[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731684+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731625+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731634+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731629+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731632+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724375+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731650+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731681+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731651+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.134] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
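
In the excerpts above, the job field of every StepId is 4367416, while the 
step field (e.g. 7733409) matches the job ID that slurmd reported.  To see how 
many distinct jobs are affected, the cancel requests can be tallied from the 
controller log.  Below is a minimal Python sketch, assuming the log line format 
shown above; the function name `tally_cancel_requests` and the sample list are 
illustrative only and are not part of SLURM.

```python
import re
from collections import Counter

# Matches the StepId field as it appears in the controller log excerpts,
# e.g. "REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0".
# Assumption: the three numeric fields are always separated by "." and "+".
STEP_RE = re.compile(r"REQUEST_CANCEL_JOB_STEP StepId=(\d+)\.(\d+)\+(\d+)")


def tally_cancel_requests(lines):
    """Count cancel requests per (reported job field, reported step field)."""
    counts = Counter()
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            job_field = int(match.group(1))
            step_field = int(match.group(2))
            counts[(job_field, step_field)] += 1
    return counts


# Sample lines copied from the controller log above.
sample = [
    "[2022-05-18T09:33:31.293] STEPS: Processing RPC details: "
    "REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0",
    "[2022-05-18T09:41:27.129] STEPS: Processing RPC details: "
    "REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
    "[2022-05-18T09:41:27.129] STEPS: Processing RPC details: "
    "REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
]

counts = tally_cancel_requests(sample)
for (job_field, step_field), n in counts.most_common():
    print(f"job field {job_field}, step field {step_field}: {n} request(s)")
```

Because the pattern only looks at the REQUEST_CANCEL_JOB_STEP part of a line, 
it works whether or not the log lines have been wrapped by the mail client.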

These jobs were started after the upgrade, too.

Has anyone else seen this?

Thank you,
John DeSantis
