[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-05 Thread Chin,David
Hi all:

I have a new cluster, and I am attempting to dump all the accounting data that 
I generated in the test period before our official opening.

Installation info:

  *   Bright Cluster Manager 9.0
  *   Slurm 20.02.6
  *   Red Hat 8.1

In slurmdbd.conf, I have:

ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveSuspend=yes
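
(Side note: if I'm reading the slurmdbd.conf man page right, the archive
directory and the purge retention can also be set directly in slurmdbd.conf
rather than on the sacctmgr command line; a rough sketch, just mirroring the
values I'm trying below, would be something like:

# untested sketch - not my actual slurmdbd.conf
ArchiveDir=/data/Backups/Slurm
PurgeEventAfter=1hour
PurgeJobAfter=1hour
PurgeStepAfter=1hour
PurgeSuspendAfter=1hour

I have not configured it that way; for now I'm only passing the Purge*
options on the command line, as shown below.)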

On the commandline, I do:

$ sudo sacctmgr archive dump Directory=/data/Backups/Slurm \
    PurgeEventAfter=1hours PurgeJobAfter=1hours PurgeStepAfter=1hours \
    PurgeSuspendAfter=1hours
This may result in loss of accounting database records (if Purge* options 
enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: slurmdbd: SUCCESS

However, no dump file is produced, and if I run sreport I still see data from 
last month. (I also tried "1hour", i.e. dropping the "s".)

Is there something I am missing?

Thanks,
Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode





Re: [slurm-users] [EXT] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?

2021-02-05 Thread Sean Crosby
Hi Thomas,

Add the -d flag to scontrol show job

e.g.

# scontrol show job 23891862 -d
JobId=23891862 JobName=SPI_DOWN
   UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A
   Priority=586 Nice=0 Account=group1 QOS=qos1
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28
   AccrueTime=2021-02-03T19:19:31
   StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31
   Partition=gpgpu AllocNode:Sid=spartan-login3:222306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spartan-gpgpu007
   BatchHost=spartan-gpgpu007
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=gpu:1
 Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)
   MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Note the CPU_IDs and GPU IDX in the output
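
If you want to gather this for every running job straight from the
ControlMachine, a quick (untested) loop along these lines should work - it
just pulls the per-node CPU_IDs/GRES line out of each detailed job record:

# untested sketch: list core/GPU assignments of all running jobs
for j in $(squeue -h -t R -o %i); do
    echo "Job $j"
    scontrol show job $j -d | grep CPU_IDs
done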

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <
thomas.zei...@rrze.uni-erlangen.de> wrote:

> Dear All,
>
> we are running Slurm-20.02.6 and using
> "SelectType=select/cons_tres" with
> "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
> and "ProctrackType=proctrack/cgroup". Nodes can be shared between
> multiple jobs with the partition defaults "ExclusiveUser=no
> OverSubscribe=No"
>
> For monitoring purposes, we'd like to know on the ControlMachine
> which cores of a batch node are assigned to a specific job. Is
> there any way (except looking on each batch node itself into
> /sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or
> GPU IDs?
>
> E.g., with Torque we are used to qstat showing the assigned cores.
> With Slurm, however, even "scontrol show job JOBID" does not seem
> to have any information in that direction.
>
> Knowing which GPU is allocated (in the case of gres/gpu) would of
> course also be of interest on the ControlMachine.
>
>
> Here's the output we get from scontrol show job; it has the node
> name and the number of cores assigned but not the "core IDs" (e.g.
> 32-63)
>
> JobId=886 JobName=br-14
>UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
>Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
>JobState=RUNNING Reason=None Dependency=(null)
>Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
>SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
>AccrueTime=2021-02-04T07:26:51
>StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
>PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
>SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
>Partition=a100 AllocNode:Sid=gpu001:1743663
>ReqNodeList=(null) ExcNodeList=(null)
>NodeList=gpu001
>BatchHost=gpu001
>NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=32,mem=12M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
>Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
>Features=(null) DelayBoot=00:00:00
>OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>Command=/var/tmp/slurmd_spool/job00877/slurm_script
>WorkDir=/home/hpc114/run2
>StdErr=/home/hpc114//run2/br-14.o886
>StdIn=/dev/null
>StdOut=/home/hpc114/run2/br-14.o886
>Power=
>TresPerNode=gpu:a100:1
>MailUser=(null) MailType=NONE
>
> Also "scontrol show node" is not helpful
>
> NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
>CPUAlloc=128 CPUTot=128 CPULoad=4.09
>AvailableFeatures=hwperf
>ActiveFeatures=hwperf
>Gres=gpu:a100:4(S:0-1)
>NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
>OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
>RealMemory=51 AllocMem=48 FreeMem=495922 Sockets=2 Boards=1
>State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A
> MCS_label=N/A
>Partitions=a100
>BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
>CfgTRES=cpu=128,mem=51M,billing=128,gres/gpu=4,gres/gpu:a100=4
>AllocTRES=cpu=128,mem=48M,gres/gpu=4,gres/gpu:a100=4
>CapWatts=n/a
>CurrentWatts=0 AveWatts=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> There is no information included about the four currently running
> jobs; neither