Still observing, but it looks like clearing out the runaway jobs and then restarting slurmdbd got my user back up to 986 CPU-days remaining out of their allowed 1000. I'm not certain the runaways were related, but it definitely started behaving better after the late-afternoon/early-evening slurmdbd restart.
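In case it helps anyone else, the cleanup was roughly the sequence below. This is a sketch rather than an exact transcript: the slurmdbd service unit name can differ between sites, USER and ACCOUNT are placeholders, and showuserlimits comes from Ole's Slurm_tools repository.

# List runaway/orphaned jobs in the accounting database; sacctmgr will
# offer to fix any that it finds
sacctmgr show runawayjobs

# Restart the accounting daemon (the step that appeared to clear the
# stale usage here)
systemctl restart slurmdbd

# Verify the user's GrpTRESRunMins usage afterwards
showuserlimits -u USER -A ACCOUNT -s cpu
scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc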
Thanks.

> On May 8, 2020, at 11:47 AM, Renfro, Michael <ren...@tntech.edu> wrote:
>
> Working on something like that now. From an SQL export, I see 16 jobs from
> my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in
> sacct, and may also have some duplicate job entries found via sacct
> --duplicates.
>
>> On May 8, 2020, at 11:34 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>>
>> Hi Michael,
>>
>> You can inquire the database for a job summary of a particular user and
>> time period using the slurmacct command:
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
>>
>> You can also call "sacct --user=USER" directly like in slurmacct:
>>
>> # Request job data
>> export FORMAT="JobID,User${ulen},Group${glen},Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,CPUTimeRAW,State"
>> # Request job states: Cancelled, Completed, Failed, Timeout, Preempted
>> export STATE="ca,cd,f,to,pr"
>> # Get Slurm individual job accounting records using the "sacct" command
>> sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT -s $STATE
>>
>> There are numerous output fields which you can inquire, see "sacct -e".
>>
>> /Ole
>>
>>
>> On 08-05-2020 16:54, Renfro, Michael wrote:
>>> Slurm 19.05.3 (packaged by Bright). For the three running jobs, the
>>> total GrpTRESRunMins requested is 564480 CPU-minutes as shown by
>>> 'showjob', and their remaining usage that the limit would check against
>>> is less than that.
>>>
>>> My download of your scripts dated to August 21, 2019, and I've just now
>>> done a clone of your repository to see if there were any differences.
>>> None that I see -- 'showuserlimits -u USER -A ACCOUNT -s cpu' returns
>>> "Limit = 1440000, current value = 1399895".
>>>
>>> So I assume there's something lingering in the database from some jobs
>>> that already completed, but still get counted against the user's current
>>> requests.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>>> *Sent:* Friday, May 8, 2020 9:27 AM
>>> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
>>> *Cc:* Renfro, Michael <ren...@tntech.edu>
>>> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue
>>>
>>> Hi Michael,
>>>
>>> Yes, my Slurm tools use and trust the output of Slurm commands such as
>>> sacct, and any discrepancy would have to come from the Slurm database.
>>> Which version of Slurm are you running on the database server and the
>>> node where you run sacct?
>>>
>>> Did you add up the GrpTRESRunMins values of all the user's running jobs?
>>> They had better add up to current value = 1402415. The "showjob"
>>> command prints #CPUs and time limit in minutes, so you need to multiply
>>> these numbers together. Example:
>>>
>>> This job requests 160 CPUs and has a time limit of 2-00:00:00
>>> (days-hh:mm:ss) = 2880 min.
>>>
>>> Did you download the latest versions of my Slurm tools from Github? I
>>> make improvements of them from time to time.
>>>
>>> /Ole
>>>
>>>
>>> On 08-05-2020 16:12, Renfro, Michael wrote:
>>>> Thanks, Ole. Your showuserlimits script is actually where I got started
>>>> today, and where I found the sacct command I sent earlier.
>>>>
>>>> Your script gives the same output for that user: the only line that's
>>>> not a "Limit = None" is for the user's GrpTRESRunMins value, which is
>>>> at "Limit = 1440000, current value = 1402415".
>>>>
>>>> The limit value is correct, but the current value is not (due to the
>>>> incorrect sacct output).
>>>>
>>>> I've also gone through sacctmgr show runaway to clean up any runaway
>>>> jobs. I had lots, but they were all from a different user, and had no
>>>> effect on this particular user's values.
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>>>> *Sent:* Friday, May 8, 2020 8:54 AM
>>>> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
>>>> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue
>>>>
>>>> Hi Michael,
>>>>
>>>> Maybe you will find a couple of my Slurm tools useful for displaying
>>>> data from the Slurm database in a more user-friendly format:
>>>>
>>>> showjob: Show status of Slurm job(s). Both queue information and
>>>> accounting information is printed.
>>>>
>>>> showuserlimits: Print Slurm resource user limits and usage
>>>>
>>>> The user's limits are printed in detail by showuserlimits.
>>>>
>>>> These tools are available from
>>>> https://github.com/OleHolmNielsen/Slurm_tools
>>>>
>>>> /Ole
>>>>
>>>> On 08-05-2020 15:34, Renfro, Michael wrote:
>>>>> Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minutes) GrpTRESMins
>>>>> limit applied to each user for years. It generally works as intended,
>>>>> but I have one user I've noticed whose usage is highly inflated from
>>>>> reality, causing the GrpTRESMins limit to be enforced much earlier than
>>>>> necessary:
>>>>>
>>>>> squeue output, showing roughly 340 CPU-days in running jobs, and all
>>>>> other jobs blocked:
>>>>>
>>>>> # squeue -u USER
>>>>>  JOBID PARTI NAME USER ST       TIME CPUS NODES NODELIST(REASON) PRIORITY TRES_P START_TIME          TIME_LEFT
>>>>> 747436 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 10-00:00:00
>>>>> 747437 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 4-04:00:00
>>>>> 747438 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 10-00:00:00
>>>>> 747439 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 4-04:00:00
>>>>> 747440 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 10-00:00:00
>>>>> 747441 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 4-14:00:00
>>>>> 747442 batch  job USER PD       0:00   28     1 (AssocGrpCPURunM     4784    N/A N/A                 10-00:00:00
>>>>> 747446 batch  job USER PD       0:00   14     1 (AssocGrpCPURunM     4778    N/A N/A                 4-00:00:00
>>>>> 747447 batch  job USER PD       0:00   14     1 (AssocGrpCPURunM     4778    N/A N/A                 4-00:00:00
>>>>> 747448 batch  job USER PD       0:00   14     1 (AssocGrpCPURunM     4778    N/A N/A                 4-00:00:00
>>>>> 747445 batch  job USER  R    8:39:17   14     1 node002              4778    N/A 2020-05-07T23:02:19 3-15:20:43
>>>>> 747444 batch  job USER  R   16:03:13   14     1 node003              4515    N/A 2020-05-07T15:38:23 3-07:56:47
>>>>> 747435 batch  job USER  R 1-10:07:42   28     1 node005              3784    N/A 2020-05-06T21:33:54 8-13:52:18
>>>>>
>>>>> scontrol output, showing roughly 980 CPU-days in use on the second line,
>>>>> and thus blocking additional jobs:
>>>>>
>>>>> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
>>>>> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21
>>>>> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00 UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9) Lft=1197
>>>>> DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)
>>>>> GrpSubmitJobs=N(14) GrpWall=N(616142.94)
>>>>> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>>>>> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>>>>> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>>>>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
>>>>> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0 ID=56
>>>>> SharesRaw/Norm/Level/Factor=1/0.08/13/0.00 UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218
>>>>> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13) GrpWall=N(227625.69)
>>>>> GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)
>>>>> GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>>>>> GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>>>>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
>>>>>
>>>>> Where can I investigate to find the cause of this difference? Thanks.
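For reference, the "state of 7" check mentioned above was done against an SQL export of the accounting database. Querying it in place would look roughly like the following; this is only a sketch, assuming the default slurm_acct_db database name, the usual <cluster>_job_table naming, and a placeholder uid.

# Show this user's job records whose raw state is 3 or 7 (sacct decodes
# both as COMPLETED); adjust the cluster name and uid for your site
mysql slurm_acct_db -e "SELECT id_job, state, time_start, time_end \
    FROM mycluster_job_table WHERE id_user=1234 AND state IN (3,7);"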