Hello,

It also appears that random jobs are being identified as using too much memory, 
despite being well within limits.

For example, a job is running that requested 2048 MB per CPU and all processes 
are within the limit.  But, the job is identified as being over limit when it 
isn't.  Please see below:

[2022-05-18T10:26:45.020] Job 7724384 exceeded memory limit (13140>2056), 
cancelling it
[2022-05-18T10:26:53.736] Job 7724384 exceeded memory limit (13140>2056), 
cancelling it

   PID PSR   RSS    VSZ     ELAPSED COMMAND
 16893   1 844172 1337540     15:53 vasp
 16885   0 842844 1337424     15:53 vasp
 16892  14 841216 1337652     15:53 vasp
 16888   6 840848 1337540     15:53 vasp
 16896   7 840756 1337536     15:53 vasp
 16891  12 840664 1337544     15:53 vasp
 16898  11 840556 1337540     15:53 vasp
 16900  15 840512 1337656     15:53 vasp
 16890  10 840500 1337540     15:53 vasp
 16889   8 840468 1337536     15:53 vasp
 16895   5 840416 1337544     15:53 vasp
 16886   2 840356 1337540     15:53 vasp
 16887   4 840348 1337540     15:53 vasp
 16897   9 840288 1337544     15:53 vasp
 16894   3 840128 1337536     15:53 vasp
 16899  13 839952 1337544     15:53 vasp

ps -U USER -o pid,psr,rss,vsz,etime,args h --sort=-rss|awk '{rss+=$3} END 
{printf("%.2f\n",rss/1024)}'
13140.52

This job isn't hasn't exceeded its memory request on a per CPU basis, yet it 
has been identified as having done so.  Unfortunately, this will cause concern 
with users because their output logs indicate that the job exceeded its memory 
limit.

I ran a few test jobs requesting 1K of memory and they were all terminated and 
identified correctly, so this new behavior is random and frequent.

The commonality between both properly and improperly misidentified jobs is that 
they are not terminated and their output files log the message from slurmstepd 
that the memory limit has been exceeded.

Are any other users on the 20.11.9 branch experiencing this issue?  Are any 
users on the 21.08.8 branch experiencing this issue?  We use 
JobAcctGatherParams=OverMemoryKill in our environment to monitor and kill jobs 
when the physical memory limit has been exceeded.

Thank you,
John DeSantis

On 5/18/22 09:45, John DeSantis wrote:
Hello,

Due to the recent CVE posted by Tim, we did upgrade from SLURM 20.11.3 to 
20.11.9.

Today, I received a ticket from a user with their output files populated with the 
"slurmstepd: error: Exceeded job memory limit" message.  But, the jobs are 
still running and it seems that the controller is misidentifying the job and/or step ID.  
Please see below.

# slurmd log

[2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit (7973>5120), 
cancelling it
[2022-05-18T09:33:31.291] debug:  _rpc_job_notify, uid = 65536, JobId=7733409
[2022-05-18T09:33:31.291] [7733409.0] debug:  Handling REQUEST_STEP_UID
[2022-05-18T09:33:31.300] send notification to StepId=7733409.batch
[2022-05-18T09:33:31.300] [7733409.batch] debug:  Handling REQUEST_JOB_NOTIFY
[2022-05-18T09:33:31.302] [7733409.batch] error: Exceeded job memory limit

# controller log

[2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from 
UID=0
[2022-05-18T09:33:31.293] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0
[2022-05-18T09:33:31.293] kill_job_step: invalid JobId=4367416
[2022-05-18T09:33:31.293] debug2: slurm_send_timeout: Socket no longer there

A restart of the controller doesn't help either, as there are a log of 
misidentified jobs (truncated):

[2022-05-18T09:41:27.128] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731684+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731625+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731634+0
[2022-05-18T09:41:27.128] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731629+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731632+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724375+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731650+0
[2022-05-18T09:41:27.129] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731681+0
[2022-05-18T09:41:27.130] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7731651+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.131] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
[2022-05-18T09:41:27.133] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
[2022-05-18T09:41:27.134] STEPS: Processing RPC details: 
REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0

These jobs were started post upgrade, too.

Has anyone else seen this?

Thank you,
John DeSantis

Attachment: OpenPGP_signature
Description: OpenPGP digital signature

Reply via email to