Re: [slurm-users] Accounting core-hours usages

2022-10-10 Thread Sudeep Narayan Banerjee
Dear Sushil: please share the slurm.conf, if possible.
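In the meantime, for reference, a minimal sketch of the pieces involved
(assuming slurmdbd and MariaDB run on the head node; the hostnames, password
and database name below are placeholders, not values from your site):

In slurmdbd.conf:
  AuthType=auth/munge
  DbdHost=localhost
  SlurmUser=slurm
  StorageType=accounting_storage/mysql
  StorageHost=localhost
  StorageUser=slurm
  StoragePass=CHANGE_ME
  StorageLoc=slurm_acct_db

In slurm.conf:
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=localhost
  JobAcctGatherType=jobacct_gather/linux

Then register the cluster and query per-user core-hours, e.g.:
  sacctmgr add cluster <clustername>
  sreport cluster UserUtilizationByAccount start=2022-01-01 end=2022-10-01 -t hours

Without a database, JobCompType=jobcomp/filetxt can at least log completed
jobs to a text file, but per-user core-hour summaries of the sreport kind
really do need slurmdbd.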

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Supercomputing Facility & Information System and Technology Facility
Academic Block 5, Room 110A
Indian Institute of Technology Gandhinagar [https://iitgn.ac.in/]
Palaj, Gujarat 382055, INDIA
*IITGN: Celebrating 10 years of educational excellence
<http://sites.iitgn.ac.in/10/>*


On Mon, Oct 10, 2022 at 8:27 PM Jörg Striewski  wrote:

> Did you enter the information Slurm needs for the database in
> slurmdbd.conf and slurm.conf?
>
>
> Mit freundlichen Grüßen / kind regards
>
> --
> Jörg Striewski
>
> Information Systems and Machine Learning Lab (ISMLL)
> Institute of Computer Science
> University of Hildesheim Germany
> post address: Universitätsplatz 1, D-31141 Hildesheim, Germany
> visitor address: Samelsonplatz 1, D-31141 Hildesheim, Germany
> Tel.(+49) 05121 / 883-40392
> http://www.ismll.uni-hildesheim.de
>
> On 10.10.22 16:38, Sushil Mishra wrote:
> > Dear all,
> >
> > I am pretty new to system administration and am looking for some help
> > setting up slurmdbd and MariaDB on a GPU cluster. We bought a machine,
> > but the vendor simply installed Slurm and did not install any database
> > for accounting. I tried installing MariaDB and then slurmdbd as described
> > in the manual, but it looks like I am missing something. I wonder if
> > someone can help us with this off the list? I only need to keep an
> > account of the core hours used by each user. Is there any
> > alternate way of keeping an account of core-hour usage per user
> > without installing a DB?
> >
> > Best,
> > Sushil
> >
>
>


[slurm-users] Requirement: only one GPU job should run on a GPU node in a cluster

2021-12-16 Thread Sudeep Narayan Banerjee
Hello All: Can we restrict a GPU node to running only one GPU job, with no other jobs on it?

That is,
a) We submit a GPU job on an empty node (say gpu2) requesting 16 cores, as
that gives the best GPU performance.
b) Then another user floods the remaining CPU cores on gpu2, sharing the
node with the GPU job. The net result is that the GPU job takes roughly a
40% performance hit in the next run.

Can we make some change in the Slurm configuration such that when a GPU
job is submitted to a GPU node, no other job can enter that node?

I am attaching my slurm.conf file along with this email. Any help will be
deeply appreciated!

I apologize if this is a repeated email.
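For context, the kind of change I have in mind (only a sketch, not something
we have tested; keywords may differ across Slurm versions, and the partition
line is adapted from the attached file) is either making the GPU partition
hand out whole nodes:

  PartitionName=gpu Nodes=gpu[1-4] OverSubscribe=EXCLUSIVE MaxTime=3-0:0 State=UP
  (on older Slurm releases the equivalent keyword is Shared=EXCLUSIVE)

or requesting node exclusivity per job, without changing the partition:

  sbatch --exclusive --gres=gpu:1 job.sh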


Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System and Technology Facility
Academic Block 5, Room 110A
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382055, INDIA




Re: [slurm-users] Nodes not returning from DRAINING

2020-10-28 Thread Sudeep Narayan Banerjee
You may try this workaround:

scontrol update NodeName= State=IDLE
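For the node in your logs that would be, for example (a sketch; RESUME clears
the drain state, and the timeout value below is only an illustration, not a
recommendation):

  scontrol update NodeName=str957-mtx-11 State=RESUME

If this happens often while processes are core dumping, the relevant knob in
slurm.conf is UnkillableStepTimeout, e.g.:

  UnkillableStepTimeout=180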

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System and Technology Facility
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355, INDIA


On Wed, Oct 28, 2020 at 5:41 PM Diego Zuccato 
wrote:

> Hello all.
>
> I've found that sometimes, some jobs leave the nodes in DRAINING state.
>
> In slurmctld.log I find:
> -8<--
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to:
> Kill task failed
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to
> DRAINING
> -8<--
> while on the node (slurmd.log):
> -8<--
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup:
> /slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB
> mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup:
> /slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB
> mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in
> job are currently core dumping
> [2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD
> TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING
> WITH SIGNALS ***
> [2020-10-28T11:30:19.306] [8975.0] done with job
> -8<--
>
> Seems slurmd takes a bit too much time to close the job. Is there some
> timeout I could change to avoid having to fix it manually?
>
> TIA.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>


Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Sudeep Narayan Banerjee

Hi: please share the output of the commands below.

cat /etc/redhat-release

OR

cat /etc/lsb-release

Also, please share the detailed logs that are probably 
available at /var/log/slurm/slurmctld.log


And the status of:
ps -ef | grep slurmctld
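If slurmd itself is what times out, running it in the foreground on the
affected node usually shows the reason directly (a sketch):

slurmd -D -vvv

The node-side log (often /var/log/slurm/slurmd.log, depending on
SlurmdLogFile) is also worth checking.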

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/06/20 5:54 pm, navin srivastava wrote:

Hi Team,

When I try to start the slurmd process, I get the error below.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node 
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: 
Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start 
Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: 
Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: 
Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: 
pam_unix(crond:session): session opened for user root by (uid=0)


Slurm version is 17.11.8

The server and Slurm have been running for a long time and we have not made 
any changes, but today when I start it, it gives this error message.

Any idea what could be wrong here?

Regards
Navin.






Re: [slurm-users] Problem with permissions. CentOS 7.8

2020-05-28 Thread Sudeep Narayan Banerjee
Also check:
a) whether NTP has been set up and is communicating with the master node
b) whether the iptables rules have been flushed (iptables -L)
c) that SELinux is disabled; to check:
getenforce
vim /etc/sysconfig/selinux
(change SELINUX=enforcing to SELINUX=disabled, save the file, and reboot)

Thanks & Regards,
Sudeep Narayan Banerjee


On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee <
snbaner...@iitgn.ac.in> wrote:

> I have not checked this on CentOS 7.8.
> a) If the /var/run/munge folder does not exist, please double-check
> whether munge has been installed at all.
> b) As root or a sudo user, run
> ps -ef | grep munge
> kill -9  //where PID is the process ID of munge (if the process is
> running at all); else
>
> which munged
> /etc/init.d/munge start
>
> please let me know the output of:
>
> $ munge -n
>
> $ munge -n | unmunge
>
> $ sudo systemctl status --full munge
>
> Thanks & Regards,
> Sudeep Narayan Banerjee
> System Analyst | Scientist B
> Indian Institute of Technology Gandhinagar
> Gujarat, INDIA
>
>
> On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik 
> wrote:
>
>> Ferran Planas Padros  writes:
>>
>> > I run the command as slurm user, and the /var/log/munge folder does
>> belong to slurm.
>>
>> For security reasons, I strongly advise that you run munged as a
>> separate user, which is unprivileged and not used for anything else.
>>
>> --
>> Regards,
>> Bjørn-Helge Mevik, dr. scient,
>> Department for Research Computing, University of Oslo
>>
>


Re: [slurm-users] Problem with permissions. CentOS 7.8

2020-05-28 Thread Sudeep Narayan Banerjee
I have not checked this on CentOS 7.8.
a) If the /var/run/munge folder does not exist, please double-check whether
munge has been installed at all.
b) As root or a sudo user, run
ps -ef | grep munge
kill -9  //where PID is the process ID of munge (if the process is
running at all); else

which munged
/etc/init.d/munge start

please let me know the output of:

$ munge -n

$ munge -n | unmunge

$ sudo systemctl status --full munge
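Since this thread is about permissions: munged is strict about the ownership
and modes of its directories and key. A typical layout (a sketch, assuming
the stock packaged paths) looks like:

  chown -R munge:munge /etc/munge /var/lib/munge /var/log/munge /var/run/munge
  chmod 0700 /etc/munge /var/lib/munge /var/log/munge
  chmod 0755 /var/run/munge
  chmod 0400 /etc/munge/munge.key
  systemctl restart munge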

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Indian Institute of Technology Gandhinagar
Gujarat, INDIA


On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik 
wrote:

> Ferran Planas Padros  writes:
>
> > I run the command as slurm user, and the /var/log/munge folder does
> belong to slurm.
>
> For security reasons, I strongly advise that you run munged as a
> separate user, which is unprivileged and not used for anything else.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>


Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee

Dear Loris: Many thanks for your response.

I changed State=IDLE to State=UNKNOWN in the NodeName configuration,
then reloaded slurmctld; two GPU nodes (gpu3 and gpu4) came up in drain
mode, which I then manually set back to IDLE.


But how do I change CoresPerSocket and ThreadsPerCore in the
NodeName parameter?
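To make the question concrete, what I think the node and partition lines
would have to look like (a sketch based on the attached file; I have not
applied this yet) is:

  NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
  NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN
  PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP

i.e. one NodeName line per hardware type, and a single partition spanning
node[11-24].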



Thanks & Regards,
Sudeep Narayan Banerjee

On 18/05/20 7:29 pm, Loris Bennett wrote:

Hi Sudeep,

I am not sure if this is the cause of the problem but in your slurm.conf
you have

   # COMPUTE NODES

   NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16 RealMemory=6 State=IDLE
   NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE

   NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 State=IDLE
   NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
   NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

But if you read

   man slurm.conf

you will find the following under the description of the parameter
"State" for nodes:

   "IDLE" should not be specified in the node configuration, but set the
   node state to "UNKNOWN" instead.

Cheers,

Loris


Sudeep Narayan Banerjee  writes:


Dear Loris: I am very sorry for addressing you as "Support"; it has
become a bad habit of mine, which I will change. Sincere apologies!

Yes, I tried this while adding the hybrid hardware, but when
slurmctld starts it reports a core-count mismatch, the
existing 32-core nodes go into Down/Drng mode, and the new 40-core nodes
are set to IDLE.

Any help/guide to some link will be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
On 18/05/20 6:30 pm, Loris Bennett wrote:

  Dear Sudeep,

Sudeep Narayan Banerjee  writes:

  Dear Support,


This mailing list is not really the Slurm support list.  It is just the
Slurm User Community List, so basically a bunch of people just like you.

  Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x
20 cores. In the slurm.conf file (attached), can we merge all the nodes
11-24 (which have different core counts) into a single queue or
partition?


Yes, you can have a partition consisting of heterogeneous nodes.  Have
you tried this?  Was there a problem?

Cheers,

Loris



Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee
Dear Loris: I am very sorry for addressing you as "Support"; it has
become a bad habit of mine, which I will change. Sincere apologies!


Yes, I tried this while adding the hybrid hardware, but when slurmctld
starts it reports a core-count mismatch, the existing 32-core nodes go
into Down/Drng mode, and the new 40-core nodes are set to IDLE.


Any help/guide to some link will be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 18/05/20 6:30 pm, Loris Bennett wrote:

Dear Sudeep,

Sudeep Narayan Banerjee  writes:


Dear Support,

This mailing list is not really the Slurm support list.  It is just the
Slurm User Community List, so basically a bunch of people just like you.


Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x
20 cores. In the slurm.conf file (attached), can we merge all the nodes
11-24 (which have different core counts) into a single queue or
partition?

Yes, you can have a partition consisting of heterogeneous nodes.  Have
you tried this?  Was there a problem?

Cheers,

Loris



[slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee

Dear Support,

Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x
20 cores. In the slurm.conf file (attached), can we merge all the nodes
11-24 (which have different core counts) into a single queue or
partition?




--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hpc
#ControlAddr=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
#CryptoType=crypto/none
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=4
MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
MessageTimeout=80
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic

PriorityType=priority/multifactor
#PriorityDecayHalfLife=
DebugFlags=NO_CONF_HASH
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/mysql
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster-iitgn
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/mysql
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
GresTypes=gpu
#
#
# COMPUTE NODES

NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16 RealMemory=6 State=IDLE
NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE

NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 State=IDLE
NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

#NodeName=hpc CPUs=12 State=UNKNOWN

PartitionName=serial Nodes=gpu1 Default=YES Shared=YES Priority=20 PreemptMode=suspend MaxTime=1-0:0 MaxCPUsPerNode=10 State=UP


PartitionName=main Nodes=node[1-10] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
PartitionName=main_new Nodes=node[11-22] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP
#PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP

PartitionName=gsgroup Nodes=node[23-24] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=2-0:0 State=UP Allowgroups=GauravS_grp
PartitionName=pdgroup Nodes=node[9-10] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=3-0:0 State=UP Allowgroups=PD_grp
PartitionName=ssmgroup Nodes=gpu[3-4] Default=NO Shared=YES Priority=30 PreemptMode=suspend MaxTime=7-0:0 State=UP Allowgroups=SSM_grp


PartitionName=gpu Nodes=gpu[1-2] Default=NO Shared=yes  MaxTime=3-0:0 State=UP
PartitionName=gpu_new Nodes=gpu[3-4] Default=NO Shared

[slurm-users] need to use unused cores | wherein all compute nodes are ALLOC

2020-04-27 Thread Sudeep Narayan Banerjee

Dear All,

I have 360 CPU cores in my cluster: 9 compute nodes with 2 sockets x 20
cores each.


I have Slurm 18.08.7 and have multifactor (fair-share) priority and
backfill enabled.


I am running jobs with a reduced ntasks_per_node in the script, and at some
point all my compute nodes are ALLOC (with about 300 cores in use). But since
I have not used all the cores, around 60 cores are still unused
(distributed across the 9 nodes).


Question: how can I still submit another job that runs on those unused
cores? I know the status of such nodes will then change to
MIX. So, which options have to be tweaked in the slurm.conf file?


Currently the pending job shows (Resources) as the Reason it is not being
scheduled.
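For reference, the core-level scheduling settings I believe are relevant, and
the commands I use to compare free cores against a pending job's request (a
sketch; <jobid> is a placeholder):

In slurm.conf:
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory

Free vs. allocated cores per node (allocated/idle/other/total):
  sinfo -N -o "%N %C"

What the pending job actually asked for, and why it is pending:
  squeue -j <jobid> -o "%i %D %C %R"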


--
Thanks & Regards,
Sudeep Narayan Banerjee



Re: [slurm-users] sacct -c not honoring -M clustername

2020-04-26 Thread Sudeep Narayan Banerjee

Dear Fred: this should be possible, e.g.:

sacct --format=user,state --starttime=04/01/19 --endtime=03/31/20 | grep COMPLETED


Please let us know if this helps.
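To restrict that to a single cluster, sacct's --clusters/-M option can be
added (a sketch; "cluster1" is a placeholder for the cluster name registered
in slurmdbd):

  sacct -M cluster1 --format=user,state --starttime=04/01/19 --endtime=03/31/20 --state=COMPLETED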

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 26/04/20 9:27 pm, Fred Liu wrote:


Hi,

Is it possible to get job completion stats per cluster?

Thanks.

Fred


[slurm-users] Require help in setting up Priority in slurm

2020-04-23 Thread Sudeep Narayan Banerjee

Dear All:

I want to set up priority queuing in Slurm (slurm-18.08.7). Say a user,
userA, from group USER1-grp has submitted 4 jobs that are running, and the
same userA has also submitted 4 more jobs that are in PD status. Now userB
from User2-grp wants to submit a job, and that job should get priority over
userA's pending jobs.


Currently the scheduler behaves as FIFO, and no fair-share policy has
been implemented yet.


I have gone through this PDF
<https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf> once and am
studying link1 <https://slurm.schedmd.com/priority_multifactor.html> and
link2 <https://slurm.schedmd.com/classic_fair_share.html>.


I have already set up the queues:
[root@aneesur ~]# sinfo -v
-
dead    = false
exact   = 0
filtering   = false
format  = %9P %.5a %.10l %.6D %.6t %N
iterate = 0
long    = false
no_header   = false
node_field  = false
node_format = false
nodes   = n/a
part_field  = true
partition   = n/a
responding  = false
states  = (null)
sort    = (null)
summarize   = false
verbose = 1
-
all_flag    = false
alloc_mem_flag  = false
avail_flag  = true
cpus_flag   = false
default_time_flag =false
disk_flag   = false
features_flag   = false
features_flag_act = false
groups_flag = false
gres_flag   = false
job_size_flag   = false
max_time_flag   = true
memory_flag = false
partition_flag  = true
port_flag   = false
priority_job_factor_flag   = false
priority_tier_flag   = false
reason_flag = false
reason_timestamp_flag = false
reason_user_flag = false
reservation_flag = false
root_flag   = false
oversubscribe_flag  = false
state_flag  = true
weight_flag = false
-

Thu Apr 23 19:46:33 2020
sinfo: Consumable Resources (CR) Node Selection plugin loaded with 
argument 1

sinfo: Cray node selection plugin loaded
sinfo: Linear node selection plugin loaded with argument 1
sinfo: Serial Job Resource Selection plugin loaded with argument 1
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*   up    1:00:00  9   idle node[1-9]
medium   up 2-00:00:00  9   idle node[1-9]
long up 4-00:00:00  9   idle node[1-9]
intensive    up 7-00:00:00  9   idle node[1-9]
gpu  up   infinite  4   idle gpu[1-4]

I am attaching the slurm.conf file. Any help or pointers would genuinely
help. I know the PDFs and links are the best guide, but I need to set this
up and release it a bit early!
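The direction I am planning to take (only a sketch; the weights are example
values, not tested recommendations) is to uncomment and adjust what is
already in the attached file:

  PriorityType=priority/multifactor
  PriorityDecayHalfLife=14-0
  PriorityMaxAge=7-0
  PriorityWeightAge=1000
  PriorityWeightFairshare=100000
  PriorityWeightQOS=10000

and to create the accounts/associations the fair-share factor needs in
slurmdbd, e.g.:

  sacctmgr add account name=USER1-grp
  sacctmgr add account name=User2-grp
  sacctmgr add user name=userA account=USER1-grp fairshare=10
  sacctmgr add user name=userB account=User2-grp fairshare=10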


--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=aneesur
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#StateSaveLocation=/var/spool/slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
Waittime=0
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerParameters=assoc_limit_continue
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=cluster
AccountingStorageEnforce=limits,qos
AccountingStorageTRES=cpu,gres/gpu
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
#SlurmctldLogFile=
#SlurmdDebug=info
#SlurmdLogFile=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityCalcPeriod=0-0:05
#PriorityFavorSmall=NO
#PriorityMaxAge=7-0
#PriorityWeightAge=1
#PriorityWeightFairshare=10
#PriorityWeightJobSize=1000
#PriorityWeightPartition=5000
#PriorityWeightQOS=1000
#PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3
##
#
# COMPUTE NODES
NodeName=node[1-9] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=gpu[1-4] Procs=40 Gres=gpu:1 State=IDLE
NodeName=aneesur Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
NodeName=aneesur1 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 State=IDLE
#PartitionName=main Nodes=node[1-9] Default=YES MaxTime=INFINITE State=UP
PartitionName=short Nodes=node[1-9] AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL Default=YES MaxTime=60 MinNodes=1 MaxNodes=1 Priority=1 State=UP
PartitionName=medium Nodes=node[1-9] AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL Default=NO MaxTime=2880 MaxNodes=2 Priority=2 State=UP
PartitionName=long Nodes=node[1-9] AllowGroups=ALL AllowAccounts=A

Re: [slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Sudeep Narayan Banerjee
Thank you so much; it has already started working, and I optimized it a
little. Thanks again!


sacct --format=user,ncpus,state,elapsed --starttime=04/1/17 --endtime=03/31/18 | grep COMPLETED | grep mithunr | awk '{print $4}'
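To reduce that Elapsed column to a single total, one possible continuation of
the pipeline (a sketch; -n drops the header and -X keeps only allocation
records so job steps are not counted twice) is:

  sacct --format=user,ncpus,state,elapsed --starttime=04/1/17 --endtime=03/31/18 \
        --state=COMPLETED -u mithunr -n -X |
  awk '{ n = split($4, t, /[-:]/); if (n == 4) s += t[1]*86400 + t[2]*3600 + t[3]*60 + t[4]; else s += t[1]*3600 + t[2]*60 + t[3] }
       END { printf "total elapsed: %.1f hours\n", s/3600 }'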

Thanks & Regards,
Sudeep Narayan Banerjee

On 12/04/20 5:43 am, Pablo Flores wrote:

You can optimize your query as follows

[root@hpc ~]# sacct --format=user,ncpus,state,elapsed --starttime=01/1/20 --endtime=03/31/20 --state=COMPLETED -u mithunr



On Sat, Apr 11, 2020 at 11:37, Sudeep Narayan Banerjee
(snbaner...@iitgn.ac.in) wrote:


Dear Michael: Thank you. I did the same just now and it seems to work.
Thanks!

[root@hpc ~]# sacct --format=user,ncpus,state,elapsed
--starttime=01/1/20 --endtime=03/31/20 | grep COMPLETED | grep mithunr
  mithunr 32  COMPLETED 1-12:36:02
  mithunr 32  COMPLETED 1-08:36:56
  mithunr 32  COMPLETED 1-14:54:28
  mithunr 32  COMPLETED 1-02:46:46
  mithunr 32  COMPLETED 1-11:07:10
  mithunr 32  COMPLETED 1-21:47:19
  mithunr 32  COMPLETED 1-12:38:04
  mithunr 32  COMPLETED 1-21:44:05
  mithunr 32  COMPLETED 1-09:34:25
  mithunr 32  COMPLETED 1-09:25:49
  mithunr 32  COMPLETED   20:46:55
  mithunr 32  COMPLETED   22:56:59
  mithunr 32  COMPLETED   16:05:14
  mithunr 32  COMPLETED 1-16:32:38
  mithunr 32  COMPLETED 1-23:55:13
  mithunr 32  COMPLETED   16:36:48
  mithunr 32  COMPLETED 1-11:40:56

Thanks & Regards,
    Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/04/20 9:00 pm, Renfro, Michael wrote:

Unless I’m misreading it, you have a wall time limit of 2 days,
and jobs that use up to 32 CPUs. So a total CPU time of up to 64
CPU-days would be possible for a single job.

So if you want total wall time for jobs instead of CPU time, then
you’ll want to use the Elapsed attribute, not CPUTime.

--
Mike Renfro, PhD  / HPC Systems Administrator, Information
Technology Services
931 372-3601       / Tennessee Tech University


On Apr 11, 2020, at 10:05 AM, Sudeep Narayan Banerjee
 <mailto:snbaner...@iitgn.ac.in> wrote:

Hi,

I want to calculate the total walltime or runtime for all jobs
submitted by each user in a year. I am using the syntax below,
and it does generate some output.

We have the walltime for the queues (main & main_new) set to 48 hours
only, but the output below gives me times ranging from 15 hours to 56
hours or even more. Am I missing something from a logical/analytical
point of view, or is the syntax not correct for the
desired information? Many thanks for any suggestion.

[root@hpc ~]# sacct  --format=user,ncpus,state,CPUTime
--starttime=04/01/19 --endtime=03/31/20 | grep mithunr
  mithunr 32  COMPLETED 15-10:34:40
  mithunr 16  COMPLETED   00:02:56
  mithunr 16  COMPLETED   02:22:40
  mithunr 16  COMPLETED   00:00:48
  mithunr 16  COMPLETED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:48
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr  0 CANCELLED+   00:00:00
  mithunr 16 FAILED   00:00:32
  mithunr 32  COMPLETED   00:02:08
  mithunr  0 CANCELLED+   00:00:00
  mithunr 32  COMPLETED   00:01:36
  mithunr 16 FAILED   00:00:48
  mithunr 32  COMPLETED 33-02:58:08
  mithunr 32  COMPLETED 56-01:23:12
...
..
..

    -- 
Thanks & Regards,

Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA




--



Re: [slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Sudeep Narayan Banerjee
Dear Michael: Thank you. I did the same just now and it seems to work.
Thanks!


[root@hpc ~]# sacct --format=user,ncpus,state,elapsed 
--starttime=01/1/20 --endtime=03/31/20 | grep COMPLETED | grep mithunr

  mithunr 32  COMPLETED 1-12:36:02
  mithunr 32  COMPLETED 1-08:36:56
  mithunr 32  COMPLETED 1-14:54:28
  mithunr 32  COMPLETED 1-02:46:46
  mithunr 32  COMPLETED 1-11:07:10
  mithunr 32  COMPLETED 1-21:47:19
  mithunr 32  COMPLETED 1-12:38:04
  mithunr 32  COMPLETED 1-21:44:05
  mithunr 32  COMPLETED 1-09:34:25
  mithunr 32  COMPLETED 1-09:25:49
  mithunr 32  COMPLETED   20:46:55
  mithunr 32  COMPLETED   22:56:59
  mithunr 32  COMPLETED   16:05:14
  mithunr 32  COMPLETED 1-16:32:38
  mithunr 32  COMPLETED 1-23:55:13
  mithunr 32  COMPLETED   16:36:48
  mithunr 32  COMPLETED 1-11:40:56

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/04/20 9:00 pm, Renfro, Michael wrote:
Unless I’m misreading it, you have a wall time limit of 2 days, and 
jobs that use up to 32 CPUs. So a total CPU time of up to 64 CPU-days 
would be possible for a single job.


So if you want total wall time for jobs instead of CPU time, then 
you’ll want to use the Elapsed attribute, not CPUTime.


--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology 
Services

931 372-3601       / Tennessee Tech University

On Apr 11, 2020, at 10:05 AM, Sudeep Narayan Banerjee 
 wrote:


Hi,

I want to calculate the total walltime or runtime for all jobs
submitted by each user in a year. I am using the syntax below, and
it does generate some output.


We have the walltime for the queues (main & main_new) set to 48 hours
only, but the output below gives me times ranging from 15 hours to 56
hours or even more. Am I missing something from a logical/analytical
point of view, or is the syntax not correct for the desired information?
Many thanks for any suggestion.


[root@hpc ~]# sacct --format=user,ncpus,state,CPUTime 
--starttime=04/01/19 --endtime=03/31/20 |  grep mithunr

  mithunr 32  COMPLETED 15-10:34:40
  mithunr 16  COMPLETED   00:02:56
  mithunr 16  COMPLETED   02:22:40
  mithunr 16  COMPLETED   00:00:48
  mithunr 16  COMPLETED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:48
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr  0 CANCELLED+   00:00:00
  mithunr 16 FAILED   00:00:32
  mithunr 32  COMPLETED   00:02:08
  mithunr  0 CANCELLED+   00:00:00
  mithunr 32  COMPLETED   00:01:36
  mithunr 16 FAILED   00:00:48
  mithunr 32  COMPLETED 33-02:58:08
  mithunr 32  COMPLETED 56-01:23:12
...
..
..

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA


[slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Sudeep Narayan Banerjee

Hi,

I want to calculate the total walltime or runtime for all jobs submitted
by each user in a year. I am using the syntax below, and it does
generate some output.


We have the walltime for the queues (main & main_new) set to 48 hours only,
but the output below gives me times ranging from 15 hours to 56 hours or
even more. Am I missing something from a logical/analytical point of view,
or is the syntax not correct for the desired information? Many thanks for
any suggestion.


[root@hpc ~]# sacct --format=user,ncpus,state,CPUTime 
--starttime=04/01/19 --endtime=03/31/20 |  grep mithunr

  mithunr 32  COMPLETED 15-10:34:40
  mithunr 16  COMPLETED   00:02:56
  mithunr 16  COMPLETED   02:22:40
  mithunr 16  COMPLETED   00:00:48
  mithunr 16  COMPLETED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:48
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr  0 CANCELLED+   00:00:00
  mithunr 16 FAILED   00:00:32
  mithunr 32  COMPLETED   00:02:08
  mithunr  0 CANCELLED+   00:00:00
  mithunr 32  COMPLETED   00:01:36
  mithunr 16 FAILED   00:00:48
  mithunr 32  COMPLETED 33-02:58:08
  mithunr 32  COMPLETED 56-01:23:12
...
..
..

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA



[slurm-users] How to find the average downtime of compute nodes in a cluster?

2020-04-02 Thread Sudeep Narayan Banerjee

That is, time during which a node is in the DOWN state

(not DRNG, DRAIN, IDLE, or ALLOC).
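One possible way to get at this (a sketch; the dates are only examples) is
the accounting database, e.g.:

  sreport cluster utilization start=2019-04-01 end=2020-04-01 -t hours

which reports a "Down" column alongside allocated and idle time, and

  sacctmgr show event start=2019-04-01 end=2020-04-01

which lists the individual node DOWN/DRAIN events with their start and end
times.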

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA



Re: [slurm-users] How to get the Average number of CPU cores used by jobs per day?

2020-04-02 Thread Sudeep Narayan Banerjee
Dear Steven: Yes, but I am unable to get the desired data. I am not sure
which flags to use.
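For example, something along these lines might be a starting point (a sketch;
"group1" stands for an account name defined in slurmdbd, and the dates are
only examples):

  sreport cluster AccountUtilizationByUser Accounts=group1 start=2020-04-01 end=2020-04-02 -t hours

Dividing the reported allocated CPU-hours for a single day by 24 gives the
average number of cores that group kept busy over that day.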


Thanks & Regards,
Sudeep Narayan Banerjee

On 03/04/20 10:42 am, Steven Dick wrote:

Have you looked at sreport?

On Fri, Apr 3, 2020 at 1:09 AM Sudeep Narayan Banerjee
 wrote:

How can I get the average number of CPU cores used by jobs per day for a
particular group?

By group I mean, say, faculty group1, group2, etc., where each group has a
certain number of students.

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA


[slurm-users] How to get the Average number of CPU cores used by jobs per day?

2020-04-02 Thread Sudeep Narayan Banerjee
How can I get the average number of CPU cores used by jobs per day for a
particular group?


By group I mean, say, faculty group1, group2, etc., where each group has a
certain number of students.


--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA



Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Sudeep Narayan Banerjee
Dear Peter: I am trying with sacct and various flags, but I am not
getting the output I am after.


Thanks & Regards,
Sudeep Narayan Banerjee

On 02/04/20 5:23 pm, Peter Kjellström wrote:

On Thu, 2 Apr 2020 16:57:46 +0530
Sudeep Narayan Banerjee  wrote:


any help in getting the right flags ?

You may need to clarify that question a bit...

How many users ran jobs on each day? (weekly, monthly average?)

How many jobs/per day did each user run? (weekly, monthly average?)

And what counts as job activity for a day? Started a job that day?
completed a job? Had at least one job running?

/Peter


Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Sudeep Narayan Banerjee
Well, I am looking for how many users ran jobs on each day, on average
(daily average), with at least one job running.
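For a single day, something like the following might do (a sketch; loop it
over the dates of interest and average the counts; -X keeps only allocation
records and -n drops the header):

  sacct -a -X -n -S 2020-04-01T00:00 -E 2020-04-02T00:00 -o user | sort -u | awk 'NF' | wc -l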


Thanks & Regards,
Sudeep Narayan Banerjee

On 02/04/20 5:34 pm, Ole Holm Nielsen wrote:

On 02-04-2020 13:27, Sudeep Narayan Banerjee wrote:

any help in getting the right flags ?


The question is not well-defined.  If you just want to know the JobID 
number in the cluster, you could run this command every day and watch 
the NEXT_JOB_ID increase:


# scontrol show config | grep NEXT_JOB_ID
NEXT_JOB_ID = 2377393

Job accounting is probably what you are looking for.  You may take a 
look at my Slurm Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting


/Ole



Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Sudeep Narayan Banerjee
Dear Peter: Thank you for your response. Well, I am looking for how many
users ran jobs on each day, on average (daily average), with at least one
job running.



Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 02/04/20 5:23 pm, Peter Kjellström wrote:

On Thu, 2 Apr 2020 16:57:46 +0530
Sudeep Narayan Banerjee  wrote:


any help in getting the right flags ?

You may need to clarify that question a bit...

How many users ran jobs on each day? (weekly, monthly average?)

How many jobs/per day did each user run? (weekly, monthly average?)

And what counts as job activity for a day? Started a job that day?
completed a job? Had at least one job running?

/Peter


[slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Sudeep Narayan Banerjee

Any help in getting the right flags?

--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA