Hi Dave,

Hope you're doing well.

(It's very possible you have already done these things.)

Maybe the logs on the compute node (system and slurmd.log) would yield more
info?
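
For what it's worth, here is a rough sketch of how one might scan a log for out-of-memory events. It is only an illustration: the helper name find_oom_lines and the search pattern are my own assumptions, the exact wording of OOM messages varies by kernel and Slurm version, and the location of slurmd.log depends on how the site has configured it.

```python
import re

# Assumption: OOM events usually show up as "oom-kill", "oom_kill",
# "oom-killer" or "Out of memory"; adjust the pattern for your logs.
OOM_PATTERN = re.compile(r"oom[-_ ]?kill|out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines that mention an OOM event."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]
```

It accepts any iterable of lines, so an open file handle for slurmd.log or the system log can be passed in directly.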

This is a bit of a guess, but it may also be worth looking for runaway
processes or jobs on that compute node and confirming that the node is
healthy (no hardware issues, etc.).
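
A minimal sketch of the runaway-process check, in case it is useful. The function names busy_processes and snapshot and the 90% threshold are hypothetical choices of mine, and it assumes ps reports the process name as slurmstepd (the processes mentioned in your note below).

```python
import subprocess


def busy_processes(ps_output, name="slurmstepd", min_cpu=90.0):
    """Return (pid, cpu_percent, elapsed) for processes called `name`
    whose CPU usage is at or above `min_cpu`."""
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, pcpu, etime, comm
        if len(fields) < 4:
            continue
        pid, cpu, elapsed, command = fields
        if command.strip() == name and float(cpu) >= min_cpu:
            matches.append((int(pid), float(cpu), elapsed))
    return matches


def snapshot():
    """Run ps on the local node; the empty '=' headers suppress the header row."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "pcpu=", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running busy_processes(snapshot()) on the compute node would list the candidates. Note that the pcpu value from ps is averaged over the lifetime of the process, so a lower threshold may be more appropriate for a process that only recently started spinning.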

Cheers,
Chad

------------------------------------------------------------

Chad DeWitt, CISSP | University Research Computing

UNC Charlotte | Office of OneIT

ccdew...@uncc.edu | https://oneit.uncc.edu

------------------------------------------------------------




On Mon, Mar 15, 2021 at 2:50 PM Chin,David <dw...@drexel.edu> wrote:

> One possible datapoint: on the node where the job ran, there were two
> slurmstepd processes running, both at 100% CPU even after the job had ended.
>
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dw...@drexel.edu                     215.571.4335 (o)
> For URCF support: urcf-supp...@drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Chin,David <dw...@drexel.edu>
> *Sent:* Monday, March 15, 2021 13:52
> *To:* Slurm-Users List <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS
> and MaxVMSize are under the ReqMem value
>
> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I think
> it was actually terminated by Slurm: the job was a MATLAB script, and its
> output was incomplete.
>
> Here's sacct output:
>
> JobID        JobName    User Partition NodeList Elapsed  State      ExitCode ReqMem MaxRSS   MaxVMSize AllocTRES                AllocGRE
> ------------ ---------- ---- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
> 83387        ProdEmisI+ foob def       node001  03:34:26 OUT_OF_ME+ 0:125    128Gn                     billing=16,cpu=16,node=1
> 83387.batch  batch                     node001  03:34:26 OUT_OF_ME+ 0:125    128Gn  1617705K 7880672K  cpu=16,mem=0,node=1
> 83387.extern extern                    node001  03:34:26 COMPLETED  0:0      128Gn  460K     153196K   billing=16,cpu=16,node=1
>
> Thanks in advance,
>     Dave
>
>
> Drexel Internal Data
>
