Re: [slurm-users] Fairshare + FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Also found my answer for the weight values here: https://slurm.schedmd.com/priority_multifactor.html#fairshare "IMPORTANT: The weight values should be high enough to get a good set of significant digits since all the factors are floating point numbers from 0.0 to 1.0. For example, one job could
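The referenced page applies these weights in slurm.conf. A minimal sketch of how such settings might look, assuming the standard multifactor keys (the specific values below are illustrative, not taken from the thread):

    # slurm.conf -- illustrative values; weights are kept large so the
    # floating-point factors (0.0 to 1.0) keep enough significant digits
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000
    PriorityWeightAge=1000
    PriorityWeightPartition=10000
    PriorityWeightJobSize=1000
    PriorityWeightQOS=10000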

Re: [slurm-users] Fairshare + FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Fixed the issue with TRESBillingWeights. It seems I need to set it on a PartitionName line for it to work: https://bugs.schedmd.com/show_bug.cgi?id=3753 PartitionName=DEFAULT TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" From: slurm-users On Behalf Of Yap, Mike Sent: Wednesday, 7 April 2021
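For context, a hedged sketch of how that line might sit in slurm.conf alongside a concrete partition (the partition and node names below are made up for illustration):

    # slurm.conf -- TRESBillingWeights is a partition-level parameter,
    # so it belongs on a PartitionName line; DEFAULT applies it to all partitions
    PartitionName=DEFAULT TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
    PartitionName=batch Nodes=node[01-10] Default=YES State=UP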

Re: [slurm-users] Fairshare + FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Thanks Luke. Will go through the 2 commands (will try to digest them). Wondering if you're able to advise on TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0". Tried to include it in slurm.conf but Slurm fails to start. Also wondering if anyone can advise on the fairshare value. I recall

[slurm-users] RawUsage 0??

2021-04-06 Thread Matthias Leopold
Hi, I'm very new to Slurm and am trying to understand basic concepts. One of them is the "Multifactor Priority Plugin". For this I submitted some jobs and looked at sshare output. To my surprise I don't get any numbers for "RawUsage"; regardless of what I do, RawUsage stays 0 (same in "scontrol show
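For anyone hitting the same symptom, a hedged checklist: RawUsage only accumulates when job accounting is stored and the multifactor plugin is active, so these standard checks are a reasonable first step (the expected values in the comments are assumptions about a typical fairshare setup):

    # confirm the priority and accounting plugins in the running config
    scontrol show config | grep -E 'PriorityType|AccountingStorageType'
    #   PriorityType          = priority/multifactor
    #   AccountingStorageType = accounting_storage/slurmdbd
    # then inspect per-association usage in long format
    sshare -l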

[slurm-users] Updated "pestat" tool for printing Slurm nodes status with 1 line per node including job info

2021-04-06 Thread Ole Holm Nielsen
I have updated the "pestat" tool for printing Slurm nodes status with 1 line per node including job info. The download page is https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat (also listed in https://slurm.schedmd.com/download.html). The pestat tool can print a large variety
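A short, hedged sketch of fetching and running it (the -f flag is a commonly used option of the tool; consult pestat -h for the authoritative list):

    # grab the tool from the repository above
    git clone https://github.com/OleHolmNielsen/Slurm_tools.git
    cd Slurm_tools/pestat
    # one line per node; -f restricts output to nodes flagged with problems
    ./pestat
    ./pestat -f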

Re: [slurm-users] Cannot run interactive jobs

2021-04-06 Thread Manalo, Kevin L
Sajesh, For those other users that may have run into this: I found a reason why srun cannot run interactive jobs, and it may not necessarily be related to RHEL/CentOS 7. If one straces the slurmd, one may see (see arg 3 for the gid): chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)
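A hedged sketch of reproducing that observation on a compute node (assumes strace is installed and slurmd is running as a single traceable process):

    # attach to the running slurmd and watch for the failing chown
    strace -f -e trace=chown -p $(pidof slurmd)
    # a failure looks like this (argument 3 is the gid being set):
    #   chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)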

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
I just checked my cluster and my spool dir is SlurmdSpoolDir=/var/spool/slurm (i.e. without the d at the end). It doesn't really matter, as long as the directory exists and has the correct permissions on all nodes. -- Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead Research Computing
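A hedged example of the kind of permission check meant here; ownership should match the SlurmdUser set in slurm.conf (root by default; the slurm user below is only an illustration):

    # verify the spool dir exists on every node and note its owner and mode
    ls -ld /var/spool/slurm
    # if it is missing or wrongly owned, something like:
    mkdir -p /var/spool/slurm
    chown slurm:slurm /var/spool/slurm
    chmod 755 /var/spool/slurm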

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
I think I've worked out a problem. I see in your slurm.conf you have this: SlurmdSpoolDir=/var/spool/slurm/d It should be: SlurmdSpoolDir=/var/spool/slurmd You'll need to restart slurmd on all the nodes after you make that change. I would also double check the permissions on that directory on
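A hedged sketch of the change and restart (the systemd unit name is assumed to be the packaged slurmd.service):

    # slurm.conf, identical on all nodes
    SlurmdSpoolDir=/var/spool/slurmd

    # then, on every compute node
    systemctl restart slurmd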

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
It looks like your ctld isn't contacting the slurmdbd properly; the control host, control port etc. are all blank. The first thing I would do is change the ClusterName in your slurm.conf from upper case TUC to lower case tuc. You'll then need to restart your ctld. Then recheck sacctmgr show cluster
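A hedged sketch of that sequence (unit name assumed; note that slurmctld may refuse to start if the clustername file under StateSaveLocation still records the old name, in which case that file must be removed first):

    # slurm.conf: match the case of the name registered in slurmdbd
    ClusterName=tuc

    # restart the controller, then re-check the registration
    systemctl restart slurmctld
    sacctmgr show cluster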

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread ibotsis
sinfo -N -o "%N %T %C %m %P %a"
NODELIST STATE   CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001    drained 0/0/2/2       3934   TUC*      up
wn002    drained 0/0/2/2       3934   TUC*      up
wn003    drained 0/0/2/2       3934   TUC*      up
wn004    drained 0/0/2/2       3934   TUC*      up
wn005    drained 0/0/2/2       3934   TUC*      up
wn006    drained 0/0/2/2       3934   TUC*
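With every node drained nothing can be scheduled, which also explains the frozen srun further down the thread. A hedged sketch of clearing the state once the underlying cause is fixed (node names taken from the output above):

    # return the drained nodes to service
    scontrol update NodeName=wn[001-006] State=RESUME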

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread ibotsis
sacctmgr list cluster
   Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- ----------- ----------- --- ----- ------- ------- --------- ------- ------- --------- ------- --- -------

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
It looks like your attachment of sinfo -R didn't come through. It also looks like your dbd isn't set up correctly. Can you also show the output of sacctmgr list cluster and scontrol show config | grep ClusterName Sean -- Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead Research

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Ioannis Botsis
Hi Sean, I am trying to submit a simple job but it freezes:
srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15
daemons are
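The srun messages point at node state rather than at the job itself; a hedged pair of checks:

    # list unavailable nodes together with the recorded reason
    sinfo -R
    # or inspect a single node in full, including its Reason= field
    scontrol show node wn001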

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Ole Holm Nielsen
Hi Ioannis, On 06-04-2021 07:56, Ioannis Botsis wrote: "slurmctld is active and running, but on system reboot it doesn't start automatically... I have to start it manually." Maybe you will find my Slurm Wiki pages of use for setting up your Slurm system: https://wiki.fysik.dtu.dk/niflheim/SLURM
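On the quoted boot problem, a hedged note: under systemd the usual fix is to enable the unit so it starts automatically (the unit name is assumed to be the packaged slurmctld.service):

    # make slurmctld start on boot and start it immediately
    systemctl enable --now slurmctld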