[slurm-users] Show "maxjobs"

2020-05-18 Thread Gestió Servidors
Hi,

I have applied "maxjobs" in accounting only for a user (not account), so the 
others users in the same account have "infinite" maxjobs, but a user have 3 
(the number I have configured). If I run "sacctmgr -s show user MYUSER 
format=user,maxjobs" I can see that 3 but how could I run "sacctmgr" to show 
maxjobs limit for all users? I have test with "sacctmgr -s show account 
MYACCOUNT format=user,account,maxjobs" and it works, but I have configured 
several accounts, so I would like to show all accounts with only one command 
execution "sacctmgr -s show".

Thanks.




Re: [slurm-users] QOS cutting off users before CPU limit is reached

2020-05-18 Thread Greg Wickham
Something to try . .

If you restart “slurmctld” does the new QOS apply?

We had a situation where slurmdbd was running as a different user than 
slurmctld and hence sacctmgr changes weren’t being reflected in slurmctld.
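
Also worth checking: the pending reason "QOSMaxCpuPerUserLimit" usually points 
at the QOS's per-user TRES limit (MaxTRESPU) rather than at the association's 
GrpTRES, so it may be worth inspecting the QOS itself, e.g.:

  sacctmgr show qos format=name%20,maxtrespu%30,grptres%30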

   -greg


On 27 Apr 2020, at 12:57, Simon Andrews <simon.andr...@babraham.ac.uk> wrote:

I’m trying to use QoS limits to dynamically change the number of CPUs a user is 
allowed to use on our cluster.  As far as I can see I’m setting the appropriate 
GrpTRES=cpu value and I can read that back, but then jobs are being stopped 
before the user has reached that limit.

In squeue I see loads of lines like:

166599  normal  nf-BISMARK_(288)  auser  PD  0:00  1 (QOSMaxCpuPerUserLimit)

..but if I run:

squeue -t running -p normal --format="%.12u %.2t %C "

Then the total for that user is 288 cores, but in the QoS configuration they 
should be allowed more.  If I run:

sacctmgr show user WithAssoc format=user%12,GrpTRES

..then I get:

auser  cpu=512

What am I missing?  Why is 'auser' not being allowed to use all 512 of their 
allowed CPUs before the QOS limit kicks in?

Thanks for any help you can offer.

Simon.

The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered 
Charity No. 1053902.



[slurm-users] MaxJobs not working

2020-05-18 Thread Gestió Servidors
Hi,

Some minutes ago, I applied "MaxJobs=3" for a user. After that, when I ran 
"sacctmgr -s show user MYUSER format=account,user,maxjobs", the system showed a "3" 
in the maxjobs column. However, I have now run "squeue" and I'm seeing 4 
jobs (from that user) in the "running" state... Shouldn't there be just 3 and not 4 
in the "running" state?

Thanks.


Re: [slurm-users] MaxJobs not working

2020-05-18 Thread Marcus Boden
Hi,

> Some minutes ago, I applied "MaxJobs=3" for a user. After that, when I 
> ran "sacctmgr -s show user MYUSER format=account,user,maxjobs", the system showed 
> a "3" in the maxjobs column. However, I have now run "squeue" and I'm 
> seeing 4 jobs (from that user) in the "running" state... Shouldn't there be just 3 
> and not 4 in the "running" state?

were the 4 jobs running beforehand? Slurm wouldn't cancel the jobs if
they were already running, but just prevent new jobs from starting.
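
A quick way to check how many of that user's jobs are actually running right 
now (MYUSER is a placeholder):

  squeue -u MYUSER -t RUNNING -h | wc -l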

> I have applied "maxjobs" in accounting only for a user (not for the account), so 
> the other users in the same account have "infinite" maxjobs, but that one user 
> has 3 (the number I configured). If I run "sacctmgr -s show user MYUSER 
> format=user,maxjobs" I can see the 3, but how could I run "sacctmgr" to show the 
> maxjobs limit for all users? I have tested "sacctmgr -s show account 
> MYACCOUNT format=user,account,maxjobs" and it works, but I have configured 
> several accounts, so I would like to show all accounts with a single command 
> execution of "sacctmgr -s show".

Try:
sacctmgr -s show assoc format=user,account,maxjobs
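
If you only want the associations where a MaxJobs limit is actually set, the 
parsable output can be filtered, e.g. (untested sketch):

  sacctmgr -nP -s show assoc format=cluster,account,user,maxjobs | awk -F'|' '$4 != ""'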

Best,
Marcus

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.:   +49 (0)551 201-2191
E-Mail: mbo...@gwdg.de
---
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL:http://www.gwdg.de
E-Mail: g...@gwdg.de
Tel.:   +49 (0)551 201-1510
Fax:+49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender:
Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
---





Re: [slurm-users] Gres GPU Resource Issue

2020-05-18 Thread Marcus Wagner

Andrew,

you could try changing it to the following:

/etc/slurm/slurm.conf:
NodeName=node[1-3]      CPUs=40 RealMemory=48000 Sockets=2 
CoresPerSocket=10 ThreadsPerCore=2 Feature="p4000" Gres=gpu:pascal:8 
State=UNKNOWN
NodeName=node[4-5,7-10] CPUs=8  RealMemory=48000 Sockets=2 
CoresPerSocket=4  ThreadsPerCore=1 Feature="p1000" Gres=gpu:pascal:4 
State=UNKNOWN
NodeName=node[6]        CPUs=24 RealMemory=3 Sockets=2 
CoresPerSocket=6  ThreadsPerCore=2 Feature="p1000" Gres=gpu:pascal:4 
State=UNKNOWN


/etc/slurm/gres.conf
NodeName=node[1-3]  Name=gpu Type=pascal File=/dev/nvidia[0-7]
NodeName=node[4-10] Name=gpu Type=pascal File=/dev/nvidia[0-4]
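
After putting the new files in place you will probably need to restart slurmd 
on the nodes (gres.conf changes may not be picked up by a reconfigure alone) 
and can then check what the controller sees, for example:

  slurmd -C                                # on a node: hardware configuration slurmd detects
  scontrol show node node1 | grep -i gres  # on the controller: Gres should no longer be (null)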

Best
Marcus

On 5/15/20 11:48 PM, Speer, Andrew wrote:
I've run into a bit of an issue when trying to define GPUs in our 
slurm.conf. Any insight is appreciated.

Hopefully relevant lines from the configs below.

Error:
[2020-05-15T16:35:14.862] error: gres_plugin_node_config_unpack: No 
plugin configured to process GRES data from node node3 (Name:gpu 
Type:(null) PluginID:7696487 Count:2)
[2020-05-15T16:35:15.321] error: gres_plugin_node_config_unpack: No 
plugin configured to process GRES data from node node4 (Name:gpu 
Type:(null) PluginID:7696487 Count:1)
[2020-05-15T16:35:15.738] error: gres_plugin_node_config_unpack: No 
plugin configured to process GRES data from node node5 (Name:gpu 
Type:(null) PluginID:7696487 Count:1)
[2020-05-15T16:35:16.229] error: gres_plugin_node_config_unpack: No 
plugin configured to process GRES data from node node6 (Name:gpu 
Type:(null) PluginID:7696487 Count:1)


/etc/slurm/slurm.conf:
GresTypes=gpu
NodeName=node[1-3]      CPUs=40 RealMemory=48000 Sockets=2 
CoresPerSocket=10 ThreadsPerCore=2 Feature="pascal,p4000" Gres=gpu:8 
State=UNKNOWN
NodeName=node[4-5,7-10] CPUs=8  RealMemory=48000 Sockets=2 
CoresPerSocket=4  ThreadsPerCore=1 Feature="pascal,p1000" Gres=gpu:8 
State=UNKNOWN
NodeName=node[6]        CPUs=24 RealMemory=3 Sockets=2 
CoresPerSocket=6  ThreadsPerCore=2 Feature="pascal,p1000" Gres=gpu:8 
State=UNKNOWN


/etc/slurm/gres.conf
NodeName=node[1-3]  Name=gpu File=/dev/nvidia[0-7]
NodeName=node[4-10] Name=gpu File=/dev/nvidia[0-4]

scontrol show node node1
NodeName=node1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=1.75
   AvailableFeatures=pascal,p4000
   ActiveFeatures=pascal,p4000
   Gres=(null) <
   NodeAddr=node1 NodeHostName=node1
   OS=Linux 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
   RealMemory=48000 AllocMem=0 FreeMem=57465 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=pharmacy
   BootTime=2020-05-15T09:26:45 SlurmdStartTime=2020-05-15T16:35:13
   CfgTRES=cpu=40,mem=48000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s





--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de



[slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee

Dear Support,

Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20 
cores. In the slurm.conf file (attached), can we merge all the nodes 
11-24 (which have different core counts) into a single queue or partition 
name?




--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hpc
#ControlAddr=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
#CryptoType=crypto/none
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=4
MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
MessageTimeout=80
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic

PriorityType=priority/multifactor
#PriorityDecayHalfLife=
DebugFlags=NO_CONF_HASH
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/mysql
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster-iitgn
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/mysql
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
GresTypes=gpu
#
#
# COMPUTE NODES

NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16  
RealMemory=6  State=IDLE
NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE

NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 
State=IDLE
NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 
State=IDLE
NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

#NodeName=hpc CPUs=12 State=UNKNOWN

PartitionName=serial Nodes=gpu1 Default=YES Shared=YES Priority=20 
PreemptMode=suspend MaxTime=1-0:0 MaxCPUsPerNode=10 State=UP


PartitionName=main Nodes=node[1-10] Default=YES Shared=YES Priority=10 
PreemptMode=suspend MaxTime=2-0:0 State=UP
PartitionName=main_new Nodes=node[11-22] Default=YES Shared=YES Priority=10 
PreemptMode=suspend MaxTime=2-0:0 State=UP
#PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 
PreemptMode=suspend MaxTime=2-0:0 State=UP

PartitionName=gsgroup Nodes=node[23-24] Default=NO Shared=YES Priority=30 
PreemptMode=suspend MaxTime=2-0:0 State=UP Allowgroups=GauravS_grp 
PartitionName=pdgroup Nodes=node[9-10] Default=NO Shared=YES Priority=30 
PreemptMode=suspend MaxTime=3-0:0 State=UP Allowgroups=PD_grp 
PartitionName=ssmgroup Nodes=gpu[3-4] Default=NO Shared=YES Priority=30 
PreemptMode=suspend MaxTime=7-0:0 State=UP Allowgroups=SSM_grp 


PartitionName=gpu Nodes=gpu[1-2] Default=NO Shared=yes  MaxTime=3-0:0 State=UP
PartitionName=gpu_new Nodes=gpu[3-4] Default=NO Shared=yes  MaxTim

Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Loris Bennett
Dear Sudeep,

Sudeep Narayan Banerjee  writes:

> Dear Support,

This mailing list is not really the Slurm support list.  It is just the
Slurm User Community List, so basically a bunch of people just like you.

> Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20
> cores. In the slurm.conf file (attached), can we merge all the nodes
> 11-24 (which have different core counts) into a single queue or
> partition name?

Yes, you can have a partition consisting of heterogeneous nodes.  Have
you tried this?  Was there a problem?
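
The commented-out line in your attached slurm.conf already points in the right 
direction, e.g. (a sketch, adjust the limits as needed):

  PartitionName=main_new Nodes=node[11-24] Default=YES Shared=YES Priority=10 PreemptMode=suspend MaxTime=2-0:0 State=UP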

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee
Dear Loris: I am very sorry to have addressed you as "Support"; it has 
become a bad habit of mine, which I will change. Sincere apologies!


Yes, I have tried adding the hybrid mix of hardware, but when slurmctld 
runs it reports a core-count mismatch, the existing 32-core nodes go into 
Down/Drng mode, and the new 40-core nodes are set to IDLE.


Any help or pointer to a relevant link would be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 18/05/20 6:30 pm, Loris Bennett wrote:

Dear Sudeep,

Sudeep Narayan Banerjee  writes:


Dear Support,

This mailing list is not really the Slurm support list.  It is just the
Slurm User Community List, so basically a bunch of people just like you.


Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20
cores. In the slurm.conf file (attached), can we merge all the nodes
11-24 (which have different core counts) into a single queue or
partition name?

Yes, you can have a partition consisting of heterogeneous nodes.  Have
you tried this?  Was there a problem?

Cheers,

Loris



Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Loris Bennett
Hi Sudeep,

I am not sure if this is the cause of the problem but in your slurm.conf
you have 

  # COMPUTE NODES

  NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16  
RealMemory=6  State=IDLE
  NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE

  NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 
State=IDLE
  NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 
State=IDLE
  NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

But if you read

  man slurm.conf

you will find the following under the description of the parameter
"State" for nodes:

  "IDLE" should not be specified in the node configuration, but set the
  node state to "UNKNOWN" instead.

Cheers,

Loris


Sudeep Narayan Banerjee  writes:

> Dear Loris: I am very sorry to have addressed you as "Support"; it has
> become a bad habit of mine, which I will change. Sincere apologies!
>
> Yes, I have tried adding the hybrid mix of hardware, but when slurmctld
> runs it reports a core-count mismatch, the existing 32-core nodes go into
> Down/Drng mode, and the new 40-core nodes are set to IDLE.
>
> Any help or pointer to a relevant link would be highly appreciated!
>
> Thanks & Regards,
> Sudeep Narayan Banerjee
> System Analyst | Scientist B
> Information System Technology Facility
> Academic Block 5 | Room 110
> Indian Institute of Technology Gandhinagar
> Palaj, Gujarat 382355 INDIA
> On 18/05/20 6:30 pm, Loris Bennett wrote:
>
>  Dear Sudeep,
>
> Sudeep Narayan Banerjee  writes:
>
>  Dear Support,
>
>
> This mailing list is not really the Slurm support list.  It is just the
> Slurm User Community List, so basically a bunch of people just like you.
>
>  Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20
> cores. In the slurm.conf file (attached), can we merge all the nodes
> 11-24 (which have different core counts) into a single queue or
> partition name?
>
>
> Yes, you can have a partition consisting of heterogeneous nodes.  Have
> you tried this?  Was there a problem?
>
> Cheers,
>
> Loris
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] How to detect Job submission by srun / interactive jobs

2020-05-18 Thread Stephan Roth

Dear all,

Does anybody know of a way to detect whether a job is submitted with 
srun, preferably in job_submit.lua?


The goal is to allow interactive jobs only on specific partitions.

Any recommendation or best practice on how to handle interactive jobs is 
welcome.


Thank you,
Stephan



Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Sudeep Narayan Banerjee

Dear Loris: Many thanks for your response.

I did change the IDLE state to UNKNOWN in the NodeName configuration, 
then restarted *slurmctld* and got 2 GPU nodes (gpu3 & gpu4) in drain mode, 
although I have since manually set them back to IDLE.


But how do I change the CoresPerSocket and ThreadsPerCore in the 
NodeName parameter?



Thanks & Regards,
Sudeep Narayan Banerjee

On 18/05/20 7:29 pm, Loris Bennett wrote:

Hi Sudeep,

I am not sure if this is the cause of the problem but in your slurm.conf
you have

   # COMPUTE NODES

   NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16  
RealMemory=6  State=IDLE
   NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE

   NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 
State=IDLE
   NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 
State=IDLE
   NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE

But if you read

   man slurm.conf

you will find the following under the description of the parameter
"State" for nodes:

   "IDLE" should not be specified in the node configuration, but set the
   node state to "UNKNOWN" instead.

Cheers,

Loris


Sudeep Narayan Banerjee  writes:


Dear Loris: I am very sorry to have addressed you as "Support"; it has
become a bad habit of mine, which I will change. Sincere apologies!

Yes, I have tried adding the hybrid mix of hardware, but when slurmctld
runs it reports a core-count mismatch, the existing 32-core nodes go into
Down/Drng mode, and the new 40-core nodes are set to IDLE.

Any help or pointer to a relevant link would be highly appreciated!

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA
On 18/05/20 6:30 pm, Loris Bennett wrote:

  Dear Sudeep,

Sudeep Narayan Banerjee  writes:

  Dear Support,


This mailing list is not really the Slurm support list.  It is just the
Slurm User Community List, so basically a bunch of people just like you.

  Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20
cores. In the slurm.conf file (attached), can we merge all the nodes
11-24 (which have different core counts) into a single queue or
partition name?


Yes, you can have a partition consisting of heterogeneous nodes.  Have
you tried this?  Was there a problem?

Cheers,

Loris



Re: [slurm-users] [External] How to detect Job submission by srun / interactive jobs

2020-05-18 Thread Florian Zillner
Hi Stephan,

From the slurm.conf docs:
---
BatchFlag
Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs submitted 
using other commands have BatchFlag set to 0.
---
You can look that up e.g. with "scontrol show job <jobid>". I haven't checked, 
though, how to access that via lua. If you know, let me know; I'd be interested 
as well.

Example:
# scontrol show job 128922
JobId=128922 JobName=sleep
   ...
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:54 TimeLimit=00:30:00 TimeMin=N/A
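
For the partition restriction itself, a minimal job_submit.lua sketch could use 
the common heuristic that jobs submitted via srun/salloc arrive without a batch 
script (an assumption on my side, not verified against BatchFlag; the 
"interactive" partition name is just an example):

-- Sketch only: treat jobs without a batch script as interactive and
-- confine them to a hypothetical "interactive" partition.
function slurm_job_submit(job_desc, part_list, submit_uid)
   local is_interactive = (job_desc.script == nil or job_desc.script == '')
   -- job_desc.partition may be nil if the user did not request a partition.
   if is_interactive and job_desc.partition ~= "interactive" then
      slurm.log_user("Interactive jobs are only allowed in the 'interactive' partition.")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end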

Cheers,
Florian

-Original Message-
From: slurm-users  On Behalf Of Stephan 
Roth
Sent: Montag, 18. Mai 2020 16:04
To: slurm-users@lists.schedmd.com
Subject: [External] [slurm-users] How to detect Job submission by srun / 
interactive jobs

Dear all,

Does anybody know of a way to detect whether a job is submitted with 
srun, preferably in job_submit.lua?

The goal is to allow interactive jobs only on specific partitions.

Any recommendation or best practice on how to handle interactive jobs is 
welcome.

Thank you,
Stephan



Re: [slurm-users] require info on merging diff core count nodes under single queue or partition

2020-05-18 Thread Loris Bennett
Sudeep Narayan Banerjee  writes:

> Dear Loris: Many thanks for your response. 
>
> I did change the IDLE state to UNKNOWN in the NodeName configuration,
> then restarted slurmctld and got 2 GPU nodes (gpu3 & gpu4) in drain mode,
> although I have since manually set them back to IDLE.

That shouldn't be necessary.  At some point the slurmds on the nodes
should contact the slurmctld and inform it about their actual status.
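
If a node should stay drained even after the configuration has been corrected, 
it can usually be returned to service with, e.g.:

  scontrol update NodeName=gpu3 State=RESUME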

> But how do I change the CoresPerSocket and ThreadsPerCore in the
> NodeName parameter?

Why do you need to change them if they are correct?  What is the problem
you are seeing?

Whatever that is, what is probably also incorrect is that you are 
overspecifying the number of cores/procs:

  NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 
State=IDLE
  NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 
State=IDLE

If you look at

  man slurm.conf

you will find for 'Procs' or rather 'CPUs'

  CPUs   Number of logical processors on the node (e.g. "2"). CPUs and Boards
         are mutually exclusive. It can be set to the total number of sockets,
         cores or threads. This can be useful when you want to schedule only
         the cores on a hyper-threaded node. If CPUs is omitted, it will be set
         equal to the product of Sockets, CoresPerSocket, and ThreadsPerCore.
         The default value is 1.

So you should probably omit the 'Procs' specification.
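
Putting that together with the earlier point about State=UNKNOWN, the node 
definitions might then look something like this (a sketch; add RealMemory and 
other options as appropriate):

  NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
  NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN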

Cheers,

Loris

>
> Thanks & Regards,
> Sudeep Narayan Banerjee
> On 18/05/20 7:29 pm, Loris Bennett wrote:
>
>  Hi Sudeep,
>
> I am not sure if this is the cause of the problem but in your slurm.conf
> you have 
>
>   # COMPUTE NODES
>
>   NodeName=node[1-10] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Procs=16  
> RealMemory=6  State=IDLE
>   NodeName=gpu[1-2] CPUs=16 Gres=gpu:2 State=IDLE
>
>   NodeName=node[11-22] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Procs=32 
> State=IDLE
>   NodeName=node[23-24] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Procs=40 
> State=IDLE
>   NodeName=gpu[3-4] CPUs=32 Gres=gpu:1 State=IDLE
>
> But if you read
>
>   man slurm.conf
>
> you will find the following under the description of the parameter
> "State" for nodes:
>
>   "IDLE" should not be specified in the node configuration, but set the
>   node state to "UNKNOWN" instead.
>
> Cheers,
>
> Loris
>
>
> Sudeep Narayan Banerjee  writes:
>
>  Dear Loris: I am very sorry to have addressed you as "Support"; it has
> become a bad habit of mine, which I will change. Sincere apologies!
>
> Yes, I have tried adding the hybrid mix of hardware, but when slurmctld
> runs it reports a core-count mismatch, the existing 32-core nodes go into
> Down/Drng mode, and the new 40-core nodes are set to IDLE.
>
> Any help or pointer to a relevant link would be highly appreciated!
>
> Thanks & Regards,
> Sudeep Narayan Banerjee
> System Analyst | Scientist B
> Information System Technology Facility
> Academic Block 5 | Room 110
> Indian Institute of Technology Gandhinagar
> Palaj, Gujarat 382355 INDIA
> On 18/05/20 6:30 pm, Loris Bennett wrote:
>
>  Dear Sudeep,
>
> Sudeep Narayan Banerjee  writes:
>
>  Dear Support,
>
>
> This mailing list is not really the Slurm support list.  It is just the
> Slurm User Community List, so basically a bunch of people just like you.
>
>  Nodes 11-22 have 2 sockets x 16 cores and nodes 23-24 have 2 sockets x 20
> cores. In the slurm.conf file (attached), can we merge all the nodes
> 11-24 (which have different core counts) into a single queue or
> partition name?
>
>
> Yes, you can have a partition consisting of heterogeneous nodes.  Have
> you tried this?  Was there a problem?
>
> Cheers,
>
> Loris