Re: [slurm-users] Possible to get cluster utilization by partition?

2022-08-24 Thread Chin,David
e/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode researchgate.NET:David-Chin-6 From: slurm-users on behalf of Chin,

Re: [slurm-users] Memory usage not tracked

2022-04-06 Thread Chin,David
Hi, Xand: How does adding "ReqMem" to the sacct change the output? E.g. on my cluster running Slurm 20.02.7 (on RHEL8), our GPU nodes have TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43: $ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING JobID

Re: [slurm-users] Possible to get cluster utilization by partition?

2021-11-05 Thread Chin,David
ensilecode From: slurm-users on behalf of Ole Holm Nielsen Sent: Friday, November 5, 2021 03:26 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Possible to get cluster utilization by partition? External. Hi Dave, On 11/4/21 21:47, Chin,David wrote: > I am running Slurm 20.02.7. I

[slurm-users] Possible to get cluster utilization by partition?

2021-11-04 Thread Chin,David
Hi, I am running Slurm 20.02.7. I would like to generate a cluster utilization report based on the billing TRES, but separated by partition. I can get full cluster utilization using: sreport cluster utilization -T billing start=2021-01-01 end=2021-06-30 but it would be useful for understandin
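One possible workaround for the question above, offered as a sketch rather than an answer from the thread: sum billing-seconds per partition from per-job accounting records instead of from sreport. The function name billing_by_partition and the sample records are made up for illustration. The input layout (Partition|ElapsedRaw|AllocTRES, pipe-delimited, no header) is an assumption about what sacct would be asked to emit.

```shell
#!/bin/bash
# Sum billing-seconds per partition.
# Reads pipe-delimited records "Partition|ElapsedRaw|AllocTRES" on stdin
# and prints "<partition> <billing-seconds>" sorted by partition name.
billing_by_partition() {
    awk -F'|' '
        NF < 3 { next }                      # skip blank or malformed lines
        {
            billing = 0
            n = split($3, tres, ",")         # AllocTRES is a comma-separated list
            for (i = 1; i <= n; i++) {
                if (tres[i] ~ /^billing=/) {
                    sub(/^billing=/, "", tres[i])
                    billing = tres[i] + 0
                }
            }
            total[$1] += billing * $2        # billing units x elapsed seconds
        }
        END {
            for (p in total) printf "%s %.0f\n", p, total[p]
        }
    ' | sort
}

# Hypothetical sample records (invented for illustration, not real accounting data).
sample='def|3600|billing=4,cpu=4,mem=16G,node=1
gpu|100|billing=43,cpu=1,gres/gpu=1,mem=4G,node=1
def|60|billing=2,cpu=2,mem=8G,node=1'

printf '%s\n' "$sample" | billing_by_partition
```

On a real cluster the input could come from something like "sacct -a -X -n -P -S 2021-01-01 -E 2021-06-30 --format=Partition,ElapsedRaw,AllocTRES | billing_by_partition". Note that this counts each job's full elapsed time even if the job started before or ended after the reporting window, so the totals will not match sreport exactly.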

Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi, Sean: Slurm version 20.02.6 (via Bright Cluster Manager) ProctrackType=proctrack/cgroup JobAcctGatherType=jobacct_gather/linux JobAcctGatherParams=UsePss,NoShared I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode From: slurm-users on behalf of Chin,David Sent: Monday, March 15, 2021 13:52 To: Slurm-Users List Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi Michael: I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB. There are some "static" arrays allocated with ones() or zeros(), but those use small subsets (< 10 columns) of the loaded data, and outputs are arrays of 6x10. Certainly there are not 16e9 rows in the original

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
to that point. -Paul Edmon- On 3/15/2021 1:52 PM, Chin,David wrote: Hi, all: I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: job was a Matlab script, and its output was incomplete. Here's sacct output: J

[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi, all: I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: job was a Matlab script, and its output was incomplete. Here's sacct output: JobID JobName User Partition NodeList Elapsed State Exit

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
My mistake - from slurm.conf(5): SrunProlog runs on the node where the "srun" is executing. i.e. the login node, which explains why the directory is not being created on the compute node, while the echos work. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu
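Since SrunProlog runs where srun executes, one alternative is to do the per-job directory setup in a script that runs on the compute node. A minimal sketch, assuming it is used as a TaskProlog (where Slurm applies any "export NAME=value" line the script prints to the task's environment). The function names job_tmpdir and emit_tmpdir_export and the base directory are hypothetical, and this is not the script from the thread.

```shell
#!/bin/bash
# Build a per-job scratch path from Slurm's job identifiers.
# Array tasks get <array job id>_<task id>; other jobs get <job id>.
job_tmpdir() {
    local base="$1"
    if [[ -n "${SLURM_ARRAY_JOB_ID:-}" ]]; then
        printf '%s/%s_%s\n' "$base" "$SLURM_ARRAY_JOB_ID" "$SLURM_ARRAY_TASK_ID"
    else
        printf '%s/%s\n' "$base" "${SLURM_JOB_ID:-unknown}"
    fi
}

# Create the directory, then print the "export" line that a TaskProlog
# would write to stdout so the job sees TMPDIR.
emit_tmpdir_export() {
    local dir
    dir="$(job_tmpdir "$1")"
    mkdir -p "$dir" && printf 'export TMPDIR=%s\n' "$dir"
}

# Demo with a hypothetical job ID and a throwaway base directory.
demo_base="$(mktemp -d)"
SLURM_JOB_ID=12345 emit_tmpdir_export "$demo_base"
```

In a real TaskProlog the last line would be a call such as emit_tmpdir_export with a node-local scratch path in place of the demo. The parent directory needs to be writable by the job's user (the chmod 1777 suggestion elsewhere in this thread covers that).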

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
creating the directory in (chmod 1777 for the parent directory is good) Brian Andrus On 3/4/2021 9:03 AM, Chin,David wrote: Hi, Brian: So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes for easy cleanup at end of job: #!/bin/bash if [[ -z ${SLURM_ARRAY_JOB

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
o change a particular one (or more), use something like --export=ALL,MYVAR=othervalue do 'man srun' and look at the --export option Brian Andrus On 3/3/2021 9:28 PM, Chin,David wrote: ahmet.mer...@uhem.itu.edu.tr wrote: > Prolog and Ta

Re: [slurm-users] prolog not passing env var to job

2021-03-03 Thread Chin,David
shell on the compute node does not have the env variables set. I use the same prolog script as TaskProlog, which sets it properly for jobs submitted with sbatch. Thanks in advance, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.57

[slurm-users] sreport cluster AccountUtilizationByUser showing utilization of a deleted account

2021-02-09 Thread Chin,David
Hello, all: Details: * slurm 20.02.6 * MariaDB 10.3.17 * RHEL 8.1 I have a fairshare setup. I went through a couple of iterations in testing of manually creating accounts and users that I later deleted before putting in what is to be the production setup. One of the deleted accounts

Re: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-09 Thread Chin,David
github:prehensilecode From: slurm-users on behalf of Chin,David Sent: Friday, February 5, 2021 15:47 To: Slurm-Users List Subject: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged? External. Hi all: I have a new cluster, and

[slurm-users] Unsetting a QOS Flag?

2021-02-08 Thread Chin,David
Hello all: I have a QOS defined which has the Flag DenyOnLimit set: $ sacctmgr show qos foo format=name,flags Name Flags -- foo DenyOnLimit How can I "unset" that Flag? I tried "sacctmgr modify qos foo unset Flags=DenyOnLimit",

[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-05 Thread Chin,David
;s".) Is there something I am missing? Thanks, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data