Here's the seff output, in case it makes any difference. In any case, the user
ran the exact same job on their laptop with 16 GB of RAM with no problem.

Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB
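
As a sanity check on the seff figures above, here is a minimal Python sketch that recomputes the CPU efficiency from the reported times. The helper name slurm_time_to_seconds is made up for illustration and is not part of Slurm or seff; the only inputs are the values shown in the output above.

```python
def slurm_time_to_seconds(text: str) -> int:
    """Convert a Slurm duration in [D-]HH:MM:SS form to seconds.

    Hypothetical helper for illustration; not a Slurm API.
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the seff output above.
cores = 16
cpu_used = slurm_time_to_seconds("06:50:30")       # CPU Utilized
wall_clock = slurm_time_to_seconds("03:34:26")     # Job Wall-clock time
core_walltime = cores * wall_clock                 # cores x wall-clock

cpu_efficiency = 100 * cpu_used / core_walltime
print(f"core-walltime: {core_walltime} s")
print(f"CPU efficiency: {cpu_efficiency:.2f}%")
```

The product of 16 cores and the wall-clock time equals the 2-09:10:56 core-walltime that seff reports, so the 11.96% figure is internally consistent.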


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Paul 
Edmon <ped...@cfa.harvard.edu>
Sent: Monday, March 15, 2021 14:02
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and 
MaxVMSize are under the ReqMem value


Keep in mind that sacct results for memory usage are not accurate for Out Of
Memory (OOM) jobs. The job is typically terminated before the next sacct
polling period, and also before it reaches its full memory allocation. So I
wouldn't trust any of the memory usage results for a job that was terminated
by OOM. sacct just can't pick up a sudden memory spike like that, and even if
it did, it would not correctly record the peak memory, because the job was
terminated before that point.
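
To illustrate the point, here is a toy Python simulation of periodic sampling missing a short spike. Everything in it is invented for illustration: the memory trace, the 30-second poll interval, and the function names. The real sampling interval is site-specific (it is set by JobAcctGatherFrequency in slurm.conf), and this is not how Slurm itself is implemented.

```python
def memory_at(t: int) -> float:
    """Hypothetical memory use in GB at time t (seconds).

    The job sits at 1.5 GB, then spikes for a few seconds just
    before it is killed. These numbers are made up.
    """
    if 100 <= t < 105:
        return 130.0
    return 1.5


def sampled_peak(poll_interval: int, kill_time: int) -> float:
    """Peak memory as seen by a poller that samples every poll_interval
    seconds until the job is killed at kill_time."""
    peak = 0.0
    t = 0
    while t <= kill_time:
        peak = max(peak, memory_at(t))
        t += poll_interval
    return peak


kill_time = 105  # job is terminated right after the spike

coarse = sampled_peak(poll_interval=30, kill_time=kill_time)
fine = sampled_peak(poll_interval=1, kill_time=kill_time)
print(f"peak seen with 30 s polling: {coarse} GB")
print(f"peak seen with 1 s polling:  {fine} GB")
```

With the coarse interval the poller samples at 0, 30, 60 and 90 seconds, so it only ever sees the baseline and never the spike that triggered the kill.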


-Paul Edmon-


On 3/15/2021 1:52 PM, Chin,David wrote:
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it
was actually terminated by Slurm: the job was a MATLAB script, and its output
was incomplete.

Here's sacct output:

JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1
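
For comparing the memory columns directly, here is a small Python sketch that converts the sacct fields to bytes. The function name sacct_mem_to_bytes is hypothetical, and the sketch assumes the K/M/G suffixes are 1024-based and that a trailing "n" or "c" on ReqMem means per node or per core; the 1024-based reading does reproduce the 1.54 GB and 1.21% that seff reports for this job.

```python
# Assumption: sacct memory suffixes are binary (1024-based) multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def sacct_mem_to_bytes(field: str) -> int:
    """Convert a sacct memory field such as '1617705K' or '128Gn' to bytes.

    Hypothetical helper for illustration; not part of Slurm.
    """
    field = field.strip()
    # ReqMem may carry a trailing 'n' (per node) or 'c' (per core).
    if field and field[-1] in "nc":
        field = field[:-1]
    unit = field[-1].upper()
    if unit in UNITS:
        return int(float(field[:-1]) * UNITS[unit])
    return int(float(field))  # no suffix: treat the value as bytes


# Values copied from the 83387.batch row above.
req_mem = sacct_mem_to_bytes("128Gn")
max_rss = sacct_mem_to_bytes("1617705K")
max_vmsize = sacct_mem_to_bytes("7880672K")

print(f"MaxRSS:    {max_rss / 2**30:.2f} GB")
print(f"MaxVMSize: {max_vmsize / 2**30:.2f} GB")
print(f"MaxRSS / ReqMem: {100 * max_rss / req_mem:.2f}%")
```

Both recorded values are far below the 128 GB request, which is exactly why the OUT_OF_MEMORY state looks surprising from the accounting data alone.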

Thanks in advance,
    Dave



