Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
One possible datapoint: on the node where the job ran, there were two 
slurmstepd processes running, both at 100% CPU, even after the job had ended.


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users on behalf of Chin,David
Sent: Monday, March 15, 2021 13:52
To: Slurm-Users List
Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value


External.

Hi, all:

I'm trying to understand why a job exited with an error condition. I think it 
was actually terminated by Slurm: the job was a MATLAB script, and its output 
was incomplete.

Here's sacct output:

       JobID     JobName    User  Partition  NodeList   Elapsed       State  ExitCode  ReqMem    MaxRSS  MaxVMSize                 AllocTRES  AllocGRE
------------  ----------  ------  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
       83387  ProdEmisI+    foob        def   node001  03:34:26  OUT_OF_ME+     0:125   128Gn                       billing=16,cpu=16,node=1
 83387.batch       batch                      node001  03:34:26  OUT_OF_ME+     0:125   128Gn  1617705K   7880672K       cpu=16,mem=0,node=1
83387.extern      extern                      node001  03:34:26   COMPLETED       0:0   128Gn      460K    153196K  billing=16,cpu=16,node=1
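
The comparison in the subject line can be checked directly from these fields. The following is a minimal Python sketch, not part of the original message. It assumes that sacct's K/M/G/T suffixes are powers of 1024 and that the trailing "n" in ReqMem means "per node" ("c" would mean "per CPU"). The helper name `to_bytes` is hypothetical.

```python
# Binary (1024-based) multipliers; assumed to match sacct's size suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(field: str) -> int:
    """Convert a sacct size field such as '1617705K' or '128Gn' to bytes."""
    field = field.strip()
    # ReqMem is assumed to carry a trailing 'n' (per node) or 'c' (per CPU).
    if field and field[-1] in "nc":
        field = field[:-1]
    if field and field[-1] in UNITS:
        return int(float(field[:-1]) * UNITS[field[-1]])
    return int(field)


req_mem = to_bytes("128Gn")
max_rss = to_bytes("1617705K")
max_vmsize = to_bytes("7880672K")

for name, value in [("MaxRSS", max_rss), ("MaxVMSize", max_vmsize)]:
    share = 100 * value / req_mem
    print(f"{name}: {value / 1024**3:.2f} GiB = {share:.2f}% of ReqMem")
```

Under those assumptions, both recorded values for the batch step are far below the 128 GB request, which is what makes the OUT_OF_MEMORY state surprising.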

Thanks in advance,
Dave




Drexel Internal Data




Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi Michael:

I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.

There are some "static" arrays allocated with ones() or zeros(), but those use 
small subsets (< 10 columns) of the loaded data, and outputs are arrays of 
6x10. Certainly there are not 16e9 rows in the original file.

The saved output .mat file is only 1.8 kB.




From: slurm-users on behalf of Renfro, Michael
Sent: Monday, March 15, 2021 14:04
To: Slurm User Community List
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value



Just a starting guess, but are you certain the MATLAB script didn’t try to 
allocate enormous amounts of memory for variables? Exceeding the 128 GB request 
would take about 16e9 double-precision floating point values (8 bytes each), 
if I did the units correctly.






Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Here's seff output, if it makes any difference. In any case, the exact same job 
was run by the user on their laptop with 16 GB RAM with no problem.

Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB
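
For reference, the seff figures above can be reproduced from the sacct fields. This is a minimal Python sketch of the arithmetic only, not seff's actual source; the helper name `to_seconds` is hypothetical, and MaxRSS is assumed to be in units of 1024 bytes.

```python
def to_seconds(stamp: str) -> int:
    """Parse Slurm's [days-]HH:MM:SS duration format into seconds."""
    days = 0
    if "-" in stamp:
        day_part, stamp = stamp.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


cores = 16
wall = to_seconds("03:34:26")       # Job Wall-clock time
cpu_used = to_seconds("06:50:30")   # CPU Utilized
core_walltime = cores * wall        # all 16 cores reserved for the whole run

cpu_efficiency = 100 * cpu_used / core_walltime
# MaxRSS of the batch step (1617705K) against the 128 GB request.
mem_efficiency = 100 * (1617705 * 1024) / (128 * 1024**3)

print(f"core-walltime: {core_walltime} s")
print(f"CPU efficiency: {cpu_efficiency:.2f}%")
print(f"Memory efficiency: {mem_efficiency:.2f}%")
```

The memory figure matches the MaxRSS that sacct recorded for the batch step, so seff appears to be reporting the same accounting sample rather than an independent measurement.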




From: slurm-users on behalf of Paul Edmon
Sent: Monday, March 15, 2021 14:02
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value



One should keep in mind that sacct results for memory usage are not accurate 
for Out Of Memory (OOM) jobs. This is because the job is typically terminated 
before the next sacct polling period, and also before it reaches its full 
memory allocation. Thus I wouldn't trust any of the memory usage results if 
the job was terminated by OOM. sacct just can't pick up a sudden memory spike 
like that, and even if it did, it would not correctly record the peak memory, 
because the job was terminated before reaching that point.


-Paul Edmon-




Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Renfro, Michael
Just a starting guess, but are you certain the MATLAB script didn’t try to 
allocate enormous amounts of memory for variables? Exceeding the 128 GB request 
would take about 16e9 double-precision floating point values (8 bytes each), 
if I did the units correctly.
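
The unit conversion behind that estimate, as a short Python sketch. It assumes 8-byte doubles (MATLAB's default numeric class) and shows both readings of "GB", since the thread does not say which one applies.

```python
BYTES_PER_DOUBLE = 8  # MATLAB's default numeric class is double

req_mem_gb = 128
values_decimal = req_mem_gb * 10**9 // BYTES_PER_DOUBLE  # GB read as 10^9 bytes
values_binary = req_mem_gb * 2**30 // BYTES_PER_DOUBLE   # GB read as 2^30 bytes

print(f"decimal GB: {values_decimal:.3e} doubles")
print(f"binary GB:  {values_binary:.3e} doubles")
```

Either way the count is on the order of 16e9 to 17e9 values.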



Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Paul Edmon
One should keep in mind that sacct results for memory usage are not 
accurate for Out Of Memory (OOM) jobs. This is because the job is 
typically terminated before the next sacct polling period, and also 
before it reaches its full memory allocation. Thus I wouldn't trust any 
of the memory usage results if the job was terminated by OOM. sacct just 
can't pick up a sudden memory spike like that, and even if it did, it 
would not correctly record the peak memory, because the job was 
terminated before reaching that point.
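
To illustrate the sampling problem, here is a toy Python simulation. It is a sketch added for clarity, not Slurm's accounting code: the memory trace, the 30-second polling period, and the function name `sampled_peak` are all made-up assumptions.

```python
def sampled_peak(trace, interval):
    """Highest value seen when reading the trace only every `interval` ticks."""
    return max(trace[::interval])


# Hypothetical per-second memory trace in GB: flat at 1.5 GB for 100 s,
# then a spike past a 128 GB limit within three seconds.
trace = [1.5] * 100 + [40.0, 90.0, 130.0]
limit_gb = 128.0
poll_interval = 30  # assumed polling period in seconds

true_peak = max(trace)
recorded = sampled_peak(trace, poll_interval)

print(f"true peak: {true_peak} GB (over limit: {true_peak > limit_gb})")
print(f"recorded peak: {recorded} GB")
```

In this toy trace the samples land at 0, 30, 60, and 90 seconds, all before the spike, so the recorded maximum stays at the baseline even though the process went over the limit.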



-Paul Edmon-

