Hi Dave,

Hope you're doing well.

(...it's very possible you have already done these things...)

Maybe the logs on the compute node (the system log and slurmd.log) would
yield more info? Rolling the dice, it may also be worth looking for runaway
processes or jobs on that compute node, and confirming that the node is
healthy (no hardware issues, etc.).

Cheers,
Chad

------------------------------------------------------------
Chad DeWitt, CISSP | University Research Computing
UNC Charlotte | Office of OneIT
ccdew...@uncc.edu | https://oneit.uncc.edu
------------------------------------------------------------

On Mon, Mar 15, 2021 at 2:50 PM Chin,David <dw...@drexel.edu> wrote:

> One possible datapoint: on the node where the job ran, there were two
> slurmstepd processes running, both at 100% CPU even after the job had
> ended.
>
> --
> David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
> dw...@drexel.edu 215.571.4335 (o)
> For URCF support: urcf-supp...@drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
> ------------------------------
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Chin,David <dw...@drexel.edu>
> Sent: Monday, March 15, 2021 13:52
> To: Slurm-Users List <slurm-users@lists.schedmd.com>
> Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS
> and MaxVMSize are under the ReqMem value
>
> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I think
> it was actually terminated by Slurm: the job was a MATLAB script, and its
> output was incomplete.
>
> Here's the sacct output:
>
> JobID        JobName    User Partition NodeList Elapsed  State      ExitCode ReqMem MaxRSS   MaxVMSize AllocTRES                AllocGRE
> ------------ ---------- ---- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
> 83387        ProdEmisI+ foob def       node001  03:34:26 OUT_OF_ME+ 0:125    128Gn                     billing=16,cpu=16,node=1
> 83387.batch  batch                     node001  03:34:26 OUT_OF_ME+ 0:125    128Gn  1617705K 7880672K  cpu=16,mem=0,node=1
> 83387.extern extern                    node001  03:34:26 COMPLETED  0:0      128Gn  460K     153196K   billing=16,cpu=16,node=1
>
> Thanks in advance,
> Dave
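
As a quick unit check on the sacct figures quoted above, the memory columns can be put on a common scale. The sketch below is only an illustration: parse_mem is a hypothetical helper, not part of Slurm, and it assumes that the K/M/G/T suffixes are powers of 1024 and that the trailing "n" on ReqMem means "per node" (a trailing "c" would mean "per CPU").

```python
import re

# Assumption: sacct memory suffixes are binary multiples (1K = 1024 bytes).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_mem(value):
    """Convert a sacct memory string such as '1617705K' or '128Gn' to bytes.

    Returns (bytes, scope), where scope is 'n' (per node), 'c' (per CPU),
    or '' when the field carries no such suffix.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)([nc]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised memory value: {value!r}")
    number, unit, scope = match.groups()
    return int(float(number) * _UNITS[unit]), scope


# Values copied from the 83387.batch row of the sacct output.
req_bytes, scope = parse_mem("128Gn")
max_rss, _ = parse_mem("1617705K")
max_vm, _ = parse_mem("7880672K")

print(f"ReqMem    : {req_bytes} bytes (scope: {scope or 'none'})")
print(f"MaxRSS    : {max_rss} bytes ({max_rss / req_bytes:.1%} of ReqMem)")
print(f"MaxVMSize : {max_vm} bytes ({max_vm / req_bytes:.1%} of ReqMem)")
```

This only confirms the arithmetic in the subject line: both recorded peaks sit far below the 128G request. It does not explain the OUT_OF_MEMORY state, which is why the node-side logs are still worth checking.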