Re: [slurm-users] Accounting core-hours usages
Dear Sushil: please share the slurm.conf, if possible.

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Supercomputing Facility & Information System and Technology Facility
Academic Block 5, Room 110A
Indian Institute of Technology Gandhinagar [https://iitgn.ac.in/]
Palaj, Gujarat 382055, INDIA
*IITGN: Celebrating 10 years of educational excellence <http://sites.iitgn.ac.in/10/>*

On Mon, Oct 10, 2022 at 8:27 PM Jörg Striewski wrote:
> Did you enter the information Slurm needs for the database in
> slurmdbd.conf and slurm.conf?
>
> Mit freundlichen Grüßen / kind regards
> --
> Jörg Striewski
> Information Systems and Machine Learning Lab (ISMLL)
> Institute of Computer Science
> University of Hildesheim, Germany
> post address: Universitätsplatz 1, D-31141 Hildesheim, Germany
> visitor address: Samelsonplatz 1, D-31141 Hildesheim, Germany
> Tel. (+49) 05121 / 883-40392
> http://www.ismll.uni-hildesheim.de
>
> On 10.10.22 16:38, Sushil Mishra wrote:
> > Dear all,
> >
> > I am pretty new to system administration and am looking for some help
> > setting up slurmdbd and MariaDB on a GPU cluster. We bought a machine,
> > but the vendor simply installed Slurm and did not install any database
> > for accounting. I tried installing MariaDB and then slurmdbd as
> > described in the manual, but it looks like I am missing something.
> > I wonder if someone can help us with this off the list? I only need to
> > keep an account of the core-hours used by each user. Is there any
> > alternative way of keeping track of core-hour usage per user without
> > installing a DB?
> >
> > Best,
> > Sushil
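For readers landing on this thread: a minimal accounting setup has three pieces, shown below as a sketch. It assumes MariaDB runs on the same host as slurmdbd; the database name follows the Slurm docs, and the password and hostnames are placeholders to replace.

    # 1) Create the database and grant access (password is a placeholder):
    mysql -u root -p -e "CREATE DATABASE slurm_acct_db;"
    mysql -u root -p -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'CHANGE_ME';"

    # 2) Minimal /etc/slurm/slurmdbd.conf (standard keys; values are examples):
    #    AuthType=auth/munge
    #    DbdHost=localhost
    #    SlurmUser=slurm
    #    StorageType=accounting_storage/mysql
    #    StorageHost=localhost
    #    StorageUser=slurm
    #    StoragePass=CHANGE_ME
    #    StorageLoc=slurm_acct_db

    # 3) Point slurm.conf at slurmdbd, then restart the daemons:
    #    AccountingStorageType=accounting_storage/slurmdbd
    #    AccountingStorageHost=localhost
    systemctl restart slurmdbd slurmctld

    # Per-user core-hours then come from sreport, e.g.:
    sreport cluster AccountUtilizationByUser -t Hours start=2022-01-01 end=2022-12-31

As for avoiding a database entirely: Slurm can log completed jobs to a flat file (JobCompType=jobcomp/filetxt plus JobCompLoc=<path> in slurm.conf), which you could post-process yourself, but sreport against slurmdbd is the far easier route for core-hour reporting.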
[slurm-users] Requirement of one GPU job should run in GPU nodes in a cluster
Hello All: Can we restrict a GPU node to a single GPU job? That is:

a) We submit a GPU job to an empty node (say gpu2) requesting 16 cores, since that gives the best GPU performance.
b) Another user then floods the remaining CPU cores on gpu2, sharing the node with the GPU job. The net result is that the GPU job took a roughly 40% performance hit on its next run.

Can we make some change in the Slurm configuration such that when a GPU job is submitted to a GPU node, no other job can enter that node? I am attaching my slurm.conf file with this email. Any help will be deeply appreciated! I apologize if this is a repeated email.

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System and Technology Facility
Academic Block 5, Room 110A
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382055, INDIA

[Attachment: slurm.conf]
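No reply is archived for this message; two standard Slurm mechanisms fit the requirement, shown here as a sketch (the partition and node names are taken from earlier posts in this archive and may differ from the attached config):

    # Option 1 (slurm.conf): every job in the GPU partition gets whole nodes
    PartitionName=gpu Nodes=gpu[1-4] OverSubscribe=EXCLUSIVE State=UP
    # (on older releases the same knob is spelled Shared=EXCLUSIVE)

    # Option 2 (per job): users submit GPU jobs with exclusive node access
    sbatch --exclusive --gres=gpu:1 -p gpu job.sh

Option 1 enforces the policy for everyone; option 2 relies on users remembering the flag, so the partition-level setting is usually the safer choice when CPU jobs must never land on GPU nodes.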
Re: [slurm-users] Nodes not returning from DRAINING
You may try this workaround:

scontrol update NodeName=<nodename> State=IDLE

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System and Technology Facility
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355, INDIA

On Wed, Oct 28, 2020 at 5:41 PM Diego Zuccato wrote:
> Hello all.
>
> I've found that sometimes some jobs leave the nodes in DRAINING state.
>
> In slurmctld.log I find:
> -8<--
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to: Kill task failed
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to DRAINING
> -8<--
> while on the node (slurmd.log):
> -8<--
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup: /slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup: /slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in job are currently core dumping
> [2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING WITH SIGNALS ***
> [2020-10-28T11:30:19.306] [8975.0] done with job
> -8<--
>
> It seems slurmd takes a bit too much time to close the job. Is there some
> timeout I could change to avoid having to fix it manually?
>
> TIA.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
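The timeout the quoted message asks about is UnkillableStepTimeout (default 60 seconds): "Kill task failed" drains a node exactly when a step outlives it, which core dumps commonly cause. A sketch, with the values as examples only:

    # slurm.conf: give slow-exiting job steps more time before the node drains
    UnkillableStepTimeout=180
    # optionally run a script when a step still will not die
    # (the path below is a hypothetical example):
    #UnkillableStepProgram=/usr/local/sbin/notify_unkillable.sh

    # push the change out, then have the daemons re-read the config:
    scontrol reconfigure

This removes the need for the manual scontrol update above in the common case, at the cost of nodes lingering a bit longer in COMPLETING.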
Re: [slurm-users] unable to start slurmd process.
Hi: please share the output of:

cat /etc/redhat-release    OR    cat /etc/lsb-release

Also, please let us know the detailed log reports, probably available at /var/log/slurm/slurmctld.log, and the status of:

ps -ef | grep slurmctld

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/06/20 5:54 pm, navin srivastava wrote:

Hi Team,

When I try to start the slurmd process I get the error below:

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)

The Slurm version is 17.11.8. The server and Slurm have been running for a long time and we have not made any changes, but today when I start it, it gives this error message. Any idea what could be wrong here?

Regards,
Navin.
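Since the failing daemon here is slurmd (not slurmctld), running it in the foreground on the compute node usually reveals why systemd times out. A hedged debugging sketch (log path depends on SlurmdLogFile in your slurm.conf):

    # run slurmd in the foreground with verbose logging:
    slurmd -D -vvv

    # check the node-side log and the basic prerequisites:
    tail -n 50 /var/log/slurm/slurmd.log
    munge -n | unmunge              # munge must work between node and controller
    scontrol show config | grep -i timeout

Typical culprits for a start timeout are a broken or unsynchronized munge, the controller being unreachable, or a full/missing SlurmdSpoolDir.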
Re: [slurm-users] Problem with permisions. CentOS 7.8
Also check:

a) whether NTP has been set up and is communicating with the master node
b) whether iptables has been flushed (iptables -L)
c) whether SELinux is disabled. To check: getenforce
   To disable: vim /etc/sysconfig/selinux (change SELINUX=enforcing to SELINUX=disabled, save the file, and reboot)

Thanks & Regards,
Sudeep Narayan Banerjee

On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee <snbaner...@iitgn.ac.in> wrote:
> I have not checked on CentOS 7.8.
>
> a) If the /var/run/munge folder does not exist, please double-check whether munge has been installed at all.
> b) As root or a sudo user:
>
> ps -ef | grep munge
> kill -9 <PID>   // where PID is the process ID of munged (if the process is running at all); else:
>
> which munged
> /etc/init.d/munge start
>
> Please let me know the output of:
>
> $ munge -n
> $ munge -n | unmunge
> $ sudo systemctl status --full munge
>
> Thanks & Regards,
> Sudeep Narayan Banerjee
> System Analyst | Scientist B
> Indian Institute of Technology Gandhinagar
> Gujarat, INDIA
>
> On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik wrote:
>> Ferran Planas Padros writes:
>>
>> > I run the command as slurm user, and the /var/log/munge folder does belong to slurm.
>>
>> For security reasons, I strongly advise that you run munged as a separate user, which is unprivileged and not used for anything else.
>>
>> --
>> Regards,
>> Bjørn-Helge Mevik, dr. scient,
>> Department for Research Computing, University of Oslo
Re: [slurm-users] Problem with permisions. CentOS 7.8
I have not checked on CentOS 7.8.

a) If the /var/run/munge folder does not exist, please double-check whether munge has been installed at all.
b) As root or a sudo user:

ps -ef | grep munge
kill -9 <PID>   // where PID is the process ID of munged (if the process is running at all); else:

which munged
/etc/init.d/munge start

Please let me know the output of:

$ munge -n
$ munge -n | unmunge
$ sudo systemctl status --full munge

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Indian Institute of Technology Gandhinagar
Gujarat, INDIA

On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik wrote:
> Ferran Planas Padros writes:
>
> > I run the command as slurm user, and the /var/log/munge folder does belong to slurm.
>
> For security reasons, I strongly advise that you run munged as a separate user, which is unprivileged and not used for anything else.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
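As Bjørn-Helge's point about the wrong owner suggests, munge failures on fresh installs are most often ownership and permission problems on its directories and key. A checklist sketch, using the paths the packages create by default:

    # munge runs as its own unprivileged user; directories and key must match:
    chown -R munge:munge /etc/munge /var/log/munge /var/lib/munge /run/munge
    chmod 0700 /etc/munge /var/log/munge /var/lib/munge
    chmod 0755 /run/munge
    chmod 0400 /etc/munge/munge.key

    # the same munge.key must be present on every node, then:
    systemctl enable --now munge

If /var/log/munge belongs to the slurm user, as in the quoted report, the chown above is the fix; munged refuses to start (or logs errors) when its directories are writable by other users.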
Re: [slurm-users] require info on merging diff core count nodes under single queue or partition
Dear Loris:

Many thanks for your response. I changed State=IDLE to State=UNKNOWN in the NodeName configuration and reloaded slurmctld; two GPU nodes (gpu3 & gpu4) then came up in drain mode, and I have manually updated them back to IDLE. But how do I change CoresPerSocket and ThreadsPerCore in the NodeName parameters?

Thanks & Regards,
Sudeep Narayan Banerjee

On 18/05/20 7:29 pm, Loris Bennett wrote:

Hi Sudeep,

I am not sure if this is the cause of the problem, but in your slurm.conf you have

# COMPUTE NODES
NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16 RealMemory=6 State=IDLE
NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE
NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 State=IDLE
NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

But if you read man slurm.conf you will find the following under the description of the parameter "State" for nodes:

"IDLE" should not be specified in the node configuration, but set the node state to "UNKNOWN" instead.

Cheers,
Loris

Sudeep Narayan Banerjee writes:

Dear Loris: I am very sorry to have addressed you as Support; it has become a bad habit, which I will change. Sincere apologies!

Yes, I tried it while adding this hybrid mix of hardware, but when slurmctld runs it shows a core-count mismatch: the existing 32-core nodes go to Down/Drng state while the new 40-core nodes are set to IDLE. Any help or pointer to a link will be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 18/05/20 6:30 pm, Loris Bennett wrote:

Dear Sudeep,

Sudeep Narayan Banerjee writes:

Dear Support,

This mailing list is not really the Slurm support list. It is just the Slurm User Community List, so basically a bunch of people just like you.

node11-22 have 2 sockets x 16 cores and node23-24 have 2 sockets x 20 cores. In the slurm.conf file (attached), can we merge all the nodes 11-24 (having different core counts) and have a single queue or partition name?

Yes, you can have a partition consisting of heterogeneous nodes. Have you tried this? Was there a problem?

Cheers,
Loris
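To answer the CoresPerSocket/ThreadsPerCore question left open above: the authoritative values come from slurmd itself, and the NodeName lines should simply be edited to match. A sketch (node names from the thread; the counts must come from your own hardware):

    # on each node, ask slurmd what it detects:
    slurmd -C
    # e.g. NodeName=node23 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 ...

    # then edit slurm.conf to match, with State=UNKNOWN as man slurm.conf advises:
    NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
    NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN

Copy the updated slurm.conf to all nodes and restart slurmctld and the slurmds; a mismatch between the configured counts and what slurmd reports is exactly what drains nodes with a "low socket*core*thread count" style reason.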
Re: [slurm-users] require info on merging diff core count nodes under single queue or partition
Dear Loris:

I am very sorry to have addressed you as Support; it has become a bad habit, which I will change. Sincere apologies!

Yes, I tried it while adding this hybrid mix of hardware, but when slurmctld runs it shows a core-count mismatch: the existing 32-core nodes go to Down/Drng state while the new 40-core nodes are set to IDLE. Any help or pointer to a link will be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 18/05/20 6:30 pm, Loris Bennett wrote:

Dear Sudeep,

Sudeep Narayan Banerjee writes:

Dear Support,

This mailing list is not really the Slurm support list. It is just the Slurm User Community List, so basically a bunch of people just like you.

node11-22 have 2 sockets x 16 cores and node23-24 have 2 sockets x 20 cores. In the slurm.conf file (attached), can we merge all the nodes 11-24 (having different core counts) and have a single queue or partition name?

Yes, you can have a partition consisting of heterogeneous nodes. Have you tried this? Was there a problem?

Cheers,
Loris
[slurm-users] require info on merging diff core count nodes under single queue or partition
Dear Support,

node11-22 have 2 sockets x 16 cores each and node23-24 have 2 sockets x 20 cores each. In the slurm.conf file (attached), can we merge all the nodes 11-24 (which have different core counts) and have a single queue or partition name?

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hpc
#ControlAddr=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
#CryptoType=crypto/none
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=4
MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
MessageTimeout=80
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
DebugFlags=NO_CONF_HASH
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/mysql
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster-iitgn
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/mysql
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
GresTypes=gpu
#
#
# COMPUTE NODES
NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16 RealMemory=6 State=IDLE
NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE
NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 State=IDLE
NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE
#NodeName=hpc CPUs=12 State=UNKNOWN
PartitionName=serial Nodes=gpu1 Default=YES Shared=YES Priority=20 PreemptMode=suspend MaxTime=1-0:0 MaxCPUsPerNode=10 State=UP
PartitionName=main Nodes=node[1-10] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
PartitionName=main_new Nodes=node[11-22] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
#PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
PartitionName=gsgroup Nodes=node[23-24] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=2-0:0 State=UP Allowgroups=GauravS_grp
PartitionName=pdgroup Nodes=node[9-10] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=3-0:0 State=UP Allowgroups=PD_grp
PartitionName=ssmgroup Nodes=gpu[3-4] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=7-0:0 State=UP Allowgroups=SSM_grp
PartitionName=gpu Nodes=gpu[1-2] Default=NO Shared=yes MaxTime=3-0:0 State=UP
PartitionName=gpu_new Nodes=gpu[3-4] Default=NO Shared
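For the question itself: heterogeneous nodes can share a partition, and this config already carries the needed line commented out. A sketch of the change, reusing that commented-out line:

    # replace the node[11-22]-only partition with the merged range:
    #PartitionName=main_new Nodes=node[11-22] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
    PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP

With mixed core counts in one partition, jobs that request cores (-n) rather than whole nodes schedule cleanly; jobs that assume every node has the same core count will not.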
[slurm-users] need to use unused cores | wherein all compute nodes are ALLOC
Dear All,

I have 360 CPU cores in my cluster: 9 compute nodes with 2 sockets x 20 cores each. I am running Slurm 18.08.7 with multifactor priority (fairshare) and backfill enabled. I am running jobs with a small ntasks-per-node in the script, and at some point all my compute nodes are ALLOC (with about 300 cores in use overall). Since not all the cores are used, around 60 tasks' worth of cores are still free, distributed across the 9 nodes.

Question: how can I submit another job that gets those unused cores? I know the status of such nodes would then change to MIX. So, what options have to be tweaked in the slurm.conf file? Currently the scheduler shows (Resources) as the reason the job is not started.

--
Thanks & Regards,
Sudeep Narayan Banerjee
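No reply is archived for this message. A common cause (stated here as an assumption, since the poster's slurm.conf is not shown) is that the select plugin is handing out whole nodes, or whole-node memory, even when few cores are requested. A sketch of the core-level setup that lets a second job use leftover cores:

    # slurm.conf: allocate individual cores and memory, so a node can host
    # several jobs (its state then shows as MIX):
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    # with memory as a consumable, give jobs a sane default so one job does
    # not reserve a node's entire RAM (the value is an example):
    DefMemPerCPU=4000

    # submit by core count, not node count:
    sbatch -n 4 --mem-per-cpu=4000 job.sh

Also worth checking: jobs submitted with --exclusive will mark a node ALLOC regardless of how many cores they use.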
Re: [slurm-users] sacct -c does not honor -M clustername
Dear Fred: that should be possible, e.g.:

sacct --format=user,state --starttime=04/01/19 --endtime=03/31/20 | grep COMPLETED

Please let us know if this helps.

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 26/04/20 9:27 pm, Fred Liu wrote:

Hi,

Is it possible to get job completion stats per cluster?

Thanks.

Fred
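Since Fred asked specifically about per-cluster stats: when slurmdbd serves several clusters, both sacct and sreport can select one explicitly. A sketch (the cluster name is a placeholder):

    # completed jobs on a specific cluster in the window:
    sacct -M cluster1 --starttime=04/01/19 --endtime=03/31/20 --state=COMPLETED --format=user,jobid,state -X

    # or a per-cluster utilization summary:
    sreport cluster Utilization cluster=cluster1 start=04/01/19 end=03/31/20

The -X flag restricts output to job allocations (no step lines), which keeps completion counts honest.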
[slurm-users] Require help in setting up Priority in slurm
Dear All:

I want to set up priority queuing in Slurm (slurm-18.08.7). Say one user, userA from group USER1-grp, has 4 jobs running and has also submitted 4 more jobs now in PD status. Now userB from USER2-grp wants to submit a job, and that job should get top priority rather than userA's pending jobs. Currently the scheduler behaves as FIFO and no fairshare policy has been implemented yet.

I have gone through this PDF <https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf> once and am studying link1 <https://slurm.schedmd.com/priority_multifactor.html> and link2 <https://slurm.schedmd.com/classic_fair_share.html>.

I have already set up the queues:

[root@aneesur ~]# sinfo -v
-----------------------------
dead        = false
exact       = 0
filtering   = false
format      = %9P %.5a %.10l %.6D %.6t %N
iterate     = 0
long        = false
no_header   = false
node_field  = false
node_format = false
nodes       = n/a
part_field  = true
partition   = n/a
responding  = false
states      = (null)
sort        = (null)
summarize   = false
verbose     = 1
-----------------------------
all_flag          = false
alloc_mem_flag    = false
avail_flag        = true
cpus_flag         = false
default_time_flag = false
disk_flag         = false
features_flag     = false
features_flag_act = false
groups_flag       = false
gres_flag         = false
job_size_flag     = false
max_time_flag     = true
memory_flag       = false
partition_flag    = true
port_flag         = false
priority_job_factor_flag = false
priority_tier_flag = false
reason_flag       = false
reason_timestamp_flag = false
reason_user_flag  = false
reservation_flag  = false
root_flag         = false
oversubscribe_flag = false
state_flag        = true
weight_flag       = false
-----------------------------
Thu Apr 23 19:46:33 2020
sinfo: Consumable Resources (CR) Node Selection plugin loaded with argument 1
sinfo: Cray node selection plugin loaded
sinfo: Linear node selection plugin loaded with argument 1
sinfo: Serial Job Resource Selection plugin loaded with argument 1
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
short*     up     1:00:00         9  idle   node[1-9]
medium     up     2-00:00:00      9  idle   node[1-9]
long       up     4-00:00:00      9  idle   node[1-9]
intensive  up     7-00:00:00      9  idle   node[1-9]
gpu        up     infinite        4  idle   gpu[1-4]

Attaching the slurm.conf file. Any help or guide will genuinely help. I know the PDFs and links are the best guide, but I need to set this up and release it a bit early!

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=aneesur
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#StateSaveLocation=/var/spool/slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
Waittime=0
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerParameters=assoc_limit_continue
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=cluster
AccountingStorageEnforce=limits,qos
AccountingStorageTRES=cpu,gres/gpu
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
#SlurmctldLogFile=
#SlurmdDebug=info
#SlurmdLogFile=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityCalcPeriod=0-0:05
#PriorityFavorSmall=NO
#PriorityMaxAge=7-0
#PriorityWeightAge=1
#PriorityWeightFairshare=10
#PriorityWeightJobSize=1000
#PriorityWeightPartition=5000
#PriorityWeightQOS=1000
#PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3
##
#
# COMPUTE NODES
NodeName=node[1-9] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=gpu[1-4] Procs=40 Gres=gpu:1 State=IDLE
NodeName=aneesur Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=aneesur1 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
#PartitionName=main Nodes=node[1-9] Default=YES MaxTime=INFINITE State=UP
PartitionName=short Nodes=node[1-9] AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL Default=YES MaxTime=60 MinNodes=1 MaxNodes=1 Priority=1 State=UP
PartitionName=medium Nodes=node[1-9] AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL Default=NO MaxTime=2880 MaxNodes=2 Priority=2 State=UP
PartitionName=long Nodes=node[1-9] AllowGroups=A
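No reply is archived for this message. The standard route, shown as a hedged sketch (the weights and share values are examples, not recommendations), is to enable the multifactor lines that are already present but commented out in the config above, then give the groups' accounts fairshare values with sacctmgr:

    # slurm.conf: turn on multifactor priority (keys exist commented-out above)
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=14-0
    PriorityWeightFairshare=100000
    PriorityWeightAge=1000
    PriorityWeightQOS=1000

    # create one account per group and attach the users (names follow the
    # thread; shares are examples):
    sacctmgr add account user1-grp Fairshare=100
    sacctmgr add account user2-grp Fairshare=100
    sacctmgr add user userA DefaultAccount=user1-grp
    sacctmgr add user userB DefaultAccount=user2-grp

    # inspect the resulting priorities and usage:
    sprio -l
    sshare -a

Note that fairshare only has teeth when jobs carry accounts, so AccountingStorageEnforce should include associations (this config already enforces limits,qos). With userA's heavy recent usage decaying his fairshare factor, userB's new job then jumps ahead of userA's pending ones.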
Re: [slurm-users] Need to calculate total runtime/walltime for one year
Thank you so much; it already started working. I optimized it a little:

sacct --format=user,ncpus,state,elapsed --starttime=04/1/17 --endtime=03/31/18 | grep COMPLETED | grep mithunr | awk '{print $4}'

Thanks again!

Thanks & Regards,
Sudeep Narayan Banerjee

On 12/04/20 5:43 am, Pablo Flores wrote:

You can optimize your query as follows:

[root@hpc ~]# sacct --format=user,ncpus,state,elapsed --starttime=01/1/20 --endtime=03/31/20 --state=COMPLETED -u mithunr

On Sat, 11 Apr 2020 at 11:37, Sudeep Narayan Banerjee (snbaner...@iitgn.ac.in) wrote:

Dear Michael:

Thank you, I did the same now and it seems to work! Thanks.

[root@hpc ~]# sacct --format=user,ncpus,state,elapsed --starttime=01/1/20 --endtime=03/31/20 | grep COMPLETED | grep mithunr
mithunr  32  COMPLETED  1-12:36:02
mithunr  32  COMPLETED  1-08:36:56
mithunr  32  COMPLETED  1-14:54:28
mithunr  32  COMPLETED  1-02:46:46
mithunr  32  COMPLETED  1-11:07:10
mithunr  32  COMPLETED  1-21:47:19
mithunr  32  COMPLETED  1-12:38:04
mithunr  32  COMPLETED  1-21:44:05
mithunr  32  COMPLETED  1-09:34:25
mithunr  32  COMPLETED  1-09:25:49
mithunr  32  COMPLETED  20:46:55
mithunr  32  COMPLETED  22:56:59
mithunr  32  COMPLETED  16:05:14
mithunr  32  COMPLETED  1-16:32:38
mithunr  32  COMPLETED  1-23:55:13
mithunr  32  COMPLETED  16:36:48
mithunr  32  COMPLETED  1-11:40:56

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/04/20 9:00 pm, Renfro, Michael wrote:

Unless I'm misreading it, you have a wall time limit of 2 days, and jobs that use up to 32 CPUs. So a total CPU time of up to 64 CPU-days would be possible for a single job. So if you want total wall time for jobs instead of CPU time, you'll want to use the Elapsed attribute, not CPUTime.

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

On Apr 11, 2020, at 10:05 AM, Sudeep Narayan Banerjee wrote:

Hi,

I want to calculate the total walltime or runtime for all jobs submitted by each user in a year. I am using the syntax below, and it generates some output. We have the walltime for the queues (main & main_new) set to 48 hours only, but the output below gives values ranging from 15 hours to 56 hours or even more. Am I missing something from a logical/analytical point of view, or is the syntax not correct for the desired information? Many thanks for any suggestion.

[root@hpc ~]# sacct --format=user,ncpus,state,CPUTime --starttime=04/01/19 --endtime=03/31/20 | grep mithunr
mithunr  32  COMPLETED   15-10:34:40
mithunr  16  COMPLETED   00:02:56
mithunr  16  COMPLETED   02:22:40
mithunr  16  COMPLETED   00:00:48
mithunr  16  COMPLETED   00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:48
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr   0  CANCELLED+  00:00:00
mithunr  16  FAILED      00:00:32
mithunr  32  COMPLETED   00:02:08
mithunr   0  CANCELLED+  00:00:00
mithunr  32  COMPLETED   00:01:36
mithunr  16  FAILED      00:00:48
mithunr  32  COMPLETED   33-02:58:08
mithunr  32  COMPLETED   56-01:23:12
...
..
..

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
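To go from the list of elapsed times to the total the original question asked for, the D-HH:MM:SS values still have to be summed. A hedged awk sketch (the field layout matches the command above; -n drops the header, -X drops step lines so jobs are not double-counted):

    sacct --format=user,ncpus,state,elapsed --starttime=04/01/19 --endtime=03/31/20 \
      --state=COMPLETED -u mithunr -n -X | \
    awk '{
      t = $4; d = 0
      if (t ~ /-/) { split(t, a, "-"); d = a[1]; t = a[2] }   # optional days part
      split(t, h, ":")                                        # HH:MM:SS
      total += d*86400 + h[1]*3600 + h[2]*60 + h[3]
    } END { printf "total walltime: %.1f hours\n", total/3600 }'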
Re: [slurm-users] Need to calculate total runtime/walltime for one year
Dear Michael:

Thank you, I did the same now and it seems to work! Thanks.

[root@hpc ~]# sacct --format=user,ncpus,state,elapsed --starttime=01/1/20 --endtime=03/31/20 | grep COMPLETED | grep mithunr
mithunr  32  COMPLETED  1-12:36:02
mithunr  32  COMPLETED  1-08:36:56
mithunr  32  COMPLETED  1-14:54:28
mithunr  32  COMPLETED  1-02:46:46
mithunr  32  COMPLETED  1-11:07:10
mithunr  32  COMPLETED  1-21:47:19
mithunr  32  COMPLETED  1-12:38:04
mithunr  32  COMPLETED  1-21:44:05
mithunr  32  COMPLETED  1-09:34:25
mithunr  32  COMPLETED  1-09:25:49
mithunr  32  COMPLETED  20:46:55
mithunr  32  COMPLETED  22:56:59
mithunr  32  COMPLETED  16:05:14
mithunr  32  COMPLETED  1-16:32:38
mithunr  32  COMPLETED  1-23:55:13
mithunr  32  COMPLETED  16:36:48
mithunr  32  COMPLETED  1-11:40:56

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/04/20 9:00 pm, Renfro, Michael wrote:

Unless I'm misreading it, you have a wall time limit of 2 days, and jobs that use up to 32 CPUs. So a total CPU time of up to 64 CPU-days would be possible for a single job. So if you want total wall time for jobs instead of CPU time, you'll want to use the Elapsed attribute, not CPUTime.

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

On Apr 11, 2020, at 10:05 AM, Sudeep Narayan Banerjee wrote:

Hi,

I want to calculate the total walltime or runtime for all jobs submitted by each user in a year. I am using the syntax below, and it generates some output. We have the walltime for the queues (main & main_new) set to 48 hours only, but the output below gives values ranging from 15 hours to 56 hours or even more. Am I missing something from a logical/analytical point of view, or is the syntax not correct for the desired information? Many thanks for any suggestion.

[root@hpc ~]# sacct --format=user,ncpus,state,CPUTime --starttime=04/01/19 --endtime=03/31/20 | grep mithunr
mithunr  32  COMPLETED   15-10:34:40
mithunr  16  COMPLETED   00:02:56
mithunr  16  COMPLETED   02:22:40
mithunr  16  COMPLETED   00:00:48
mithunr  16  COMPLETED   00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:48
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr   0  CANCELLED+  00:00:00
mithunr  16  FAILED      00:00:32
mithunr  32  COMPLETED   00:02:08
mithunr   0  CANCELLED+  00:00:00
mithunr  32  COMPLETED   00:01:36
mithunr  16  FAILED      00:00:48
mithunr  32  COMPLETED   33-02:58:08
mithunr  32  COMPLETED   56-01:23:12
...
..
..

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
[slurm-users] Need to calculate total runtime/walltime for one year
Hi,

I want to calculate the total walltime or runtime for all jobs submitted by each user in a year. I am using the syntax below, and it generates some output. We have the walltime for the queues (main & main_new) set to 48 hours only, but the output below gives values ranging from 15 hours to 56 hours or even more. Am I missing something from a logical/analytical point of view, or is the syntax not correct for the desired information? Many thanks for any suggestion.

[root@hpc ~]# sacct --format=user,ncpus,state,CPUTime --starttime=04/01/19 --endtime=03/31/20 | grep mithunr
mithunr  32  COMPLETED   15-10:34:40
mithunr  16  COMPLETED   00:02:56
mithunr  16  COMPLETED   02:22:40
mithunr  16  COMPLETED   00:00:48
mithunr  16  COMPLETED   00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:48
mithunr  16  FAILED      00:00:32
mithunr  16  FAILED      00:00:32
mithunr   0  CANCELLED+  00:00:00
mithunr  16  FAILED      00:00:32
mithunr  32  COMPLETED   00:02:08
mithunr   0  CANCELLED+  00:00:00
mithunr  32  COMPLETED   00:01:36
mithunr  16  FAILED      00:00:48
mithunr  32  COMPLETED   33-02:58:08
mithunr  32  COMPLETED   56-01:23:12
...
..
..

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
[slurm-users] How to find the average downtime of compute nodes in a cluster?
That is, any node in the *down* state (not drng, drain, IDLE, or ALLOC).

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
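No answer is archived for this thread. One hedged approach, assuming slurmdbd accounting is enabled (and noting that the States= filter and the date arithmetic below should be verified against your sacctmgr version), is to pull node DOWN events from the accounting database and average their durations:

    # nodes currently down:
    sinfo -N -t down

    # DOWN events recorded by slurmdbd (parsable, no header):
    sacctmgr -nP list event Format=NodeName,TimeStart,TimeEnd,State,Reason States=DOWN Start=2020-01-01 End=2020-12-31

    # rough average duration in hours via GNU date + awk
    # (events with no TimeEnd yet are skipped):
    sacctmgr -nP list event Format=NodeName,TimeStart,TimeEnd States=DOWN Start=2020-01-01 | \
    awk -F'|' '{
      c1 = "date -d \"" $2 "\" +%s"; c1 | getline s; close(c1)
      c2 = "date -d \"" $3 "\" +%s"; c2 | getline e; close(c2)
      if (e > s) { sum += e - s; n++ }
    } END { if (n) printf "avg downtime: %.1f h over %d events\n", sum/(n*3600), n }'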
Re: [slurm-users] How to get the Average number of CPU cores used by jobs per day?
Dear Steven:

Yes, but I am unable to get the desired data; I am not sure which flags to use.

Thanks & Regards,
Sudeep Narayan Banerjee

On 03/04/20 10:42 am, Steven Dick wrote:

Have you looked at sreport?

On Fri, Apr 3, 2020 at 1:09 AM Sudeep Narayan Banerjee wrote:

How do I get the average number of CPU cores used by jobs per day for a particular group? By "group" I mean faculty group1, group2, etc.; each group has a certain number of students.

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
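Picking up Steven's pointer: sreport's cluster reports, with -t to choose the unit, get close to this. A sketch, assuming each faculty group maps to a Slurm account (the account name is a placeholder):

    # core-hours per user under account "group1" for March 2020:
    sreport cluster AccountUtilizationByUser account=group1 start=2020-03-01 end=2020-04-01 -t Hours

    # dividing the reported core-hours by (24 x days in the window) gives the
    # average number of cores in use per day for that group.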
[slurm-users] How to get the Average number of CPU cores used by jobs per day?
How do I get the average number of CPU cores used by jobs per day for a particular group? By "group" I mean faculty group1, group2, etc.; each group has a certain number of students.

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
Re: [slurm-users] How many users are running jobs per day on average in slurm ?
Dear Peter:

I am trying sacct with multiple flags, but I am not getting the desired output for this query.

Thanks & Regards,
Sudeep Narayan Banerjee

On 02/04/20 5:23 pm, Peter Kjellström wrote:

On Thu, 2 Apr 2020 16:57:46 +0530 Sudeep Narayan Banerjee wrote:
> any help in getting the right flags?

You may need to clarify that question a bit...

How many users ran jobs on each day? (weekly, monthly average?)
How many jobs per day did each user run? (weekly, monthly average?)

And what counts as job activity for a day? Started a job that day? Completed a job? Had at least one job running?

/Peter
Re: [slurm-users] How many users are running jobs per day on average in slurm ?
Well, I am looking for: how many users ran jobs on each day, on average (daily average), with at least one job running?

Thanks & Regards,
Sudeep Narayan Banerjee

On 02/04/20 5:34 pm, Ole Holm Nielsen wrote:

On 02-04-2020 13:27, Sudeep Narayan Banerjee wrote:
> any help in getting the right flags?

The question is not well-defined. If you just want to know the JobID number in the cluster, you could run this command every day and watch NEXT_JOB_ID increase:

# scontrol show config | grep NEXT_JOB_ID
NEXT_JOB_ID             = 2377393

Job accounting is probably what you are looking for. You may take a look at my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting

/Ole
Re: [slurm-users] How many users are running jobs per day on average in slurm ?
Dear Peter:

Thank you for your response. I am looking for: how many users ran jobs on each day, on average (daily average), with at least one job running?

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 02/04/20 5:23 pm, Peter Kjellström wrote:

On Thu, 2 Apr 2020 16:57:46 +0530 Sudeep Narayan Banerjee wrote:
> any help in getting the right flags?

You may need to clarify that question a bit...

How many users ran jobs on each day? (weekly, monthly average?)
How many jobs per day did each user run? (weekly, monthly average?)

And what counts as job activity for a day? Started a job that day? Completed a job? Had at least one job running?

/Peter
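For "distinct users with at least one job per day", sacct can be scripted day by day. A minimal bash sketch using only standard flags (-a all users, -X allocations only, -n no header); the dates are examples:

    # users with at least one job inside a single day's window:
    sacct -a -S 2020-04-01 -E 2020-04-01T23:59:59 -X -n --format=User | sort -u | grep -c .

    # loop over April 2020 and average:
    total=0; days=0
    for d in $(seq -w 1 30); do
      n=$(sacct -a -S 2020-04-$d -E 2020-04-${d}T23:59:59 -X -n --format=User | sort -u | grep -c .)
      total=$((total+n)); days=$((days+1))
    done
    echo "average users/day: $((total/days))"

Note that sacct's -S/-E window includes jobs that were running during the window, which matches the "at least one job running" definition above.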
[slurm-users] How many users are running jobs per day on average in slurm ?
Any help in getting the right flags?

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA