Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in down state

2020-08-31 Thread Steininger, Herbert
Hi Guys,

Thanks for your answers.

I would prefer not to patch the Slurm source code the way Jacek does, to keep 
things simple.
But I think it may well be the way to go.

With the solutions Florian and Angelos suggested, Slurm will still think that 
the nodes are "powered down", even if they are not.
Then again, it is better for Slurm to merely think they are down than for the 
nodes to actually power down while we are upgrading something.


What we really need is some state like "MAINT", for maintenance, which tells 
Slurm not to utilize the node but also not to power it down.
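
(A hedged sketch of one stop-gap I could imagine, with "node01" and the 
duration as placeholders; whether the power-saving logic of a given Slurm 
version actually leaves a drained or reserved node alone is exactly what is 
unclear here:)

# Take the node out of scheduling, with a reason:
scontrol update NodeName=node01 State=DRAIN Reason="maintenance"

# Optionally cover it with a maintenance reservation:
scontrol create reservation ReservationName=maint_node01 \
    StartTime=now Duration=08:00:00 Flags=MAINT Nodes=node01 Users=root

# When the work is done:
scontrol update NodeName=node01 State=RESUME
scontrol delete ReservationName=maint_node01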

Thanks,
Herbert



From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Florian Zillner
Sent: Wednesday, 26 August 2020 10:36
To: Slurm User Community List 
Subject: Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in 
down state

Hi Herbert,

Just like Angelos described, we also have logic in our poweroff script that 
checks whether the node is really IDLE and only sends the poweroff command if 
that's the case.

Excerpt:
# $1 is the hostlist handed over by slurmctld; $OUTFILE is our log file.
hosts=$(scontrol show hostnames "$1")
for host in $hosts; do
    # Only power off nodes that are idle and already marked for power saving.
    if scontrol show node "$host" | tr ' ' '\n' | grep -q 'State=IDLE+POWER$'; then
        echo "node $host IDLE" >>"$OUTFILE"
    else
        echo "node $host NOT IDLE" >>"$OUTFILE"
        continue
    fi
    ssh "$host" poweroff
    ...
    sleep 1
    ...
done

Best,
Florian


From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Steininger, Herbert <herbert_steinin...@psych.mpg.de>
Sent: Monday, 24 August 2020 10:52
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External] [slurm-users] [slurm 20.02.3] don't suspend nodes in down 
state

Hi,

how can I prevent Slurm from suspending nodes that I have set to the down 
state for maintenance?
I know about "SuspendExcNodes", but rolling out slurm.conf every time this 
changes doesn't seem like the right way.
Is there a state I can set so that the nodes don't get suspended?

It has happened a few times that I was working on a server and, after our 
idle time (1h), Slurm decided to suspend the node.
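
(For reference, a minimal sketch of the SuspendExcNodes approach I would 
rather not manage by hand; node names are placeholders:)

# In slurm.conf: nodes listed here are never considered for power saving.
SuspendTime=3600
SuspendExcNodes=node[01-02]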

TIA,
Herbert

--
Herbert Steininger
Leiter EDV & HPC
Administrator
Max-Planck-Institut für Psychiatrie
Kraepelinstr.  2-10
80804 München
Tel  +49 (0)89 / 30622-368
Mail   herbert_steinin...@psych.mpg.de
Web  https://www.psych.mpg.de




Re: [slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Chris Samuel
On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:

> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU is available
> whatever the CPU workload on the node.

What I've done in the past (waves to Swinburne folks on the list) was to have 
overlapping partitions on the GPU nodes, where the GPU job partition had access 
to all the cores and the CPU-only job partition had access to only a subset 
(limited by the MaxCPUsPerNode parameter on the partition).
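
(A minimal sketch of that layout; node names, core counts, and partition names 
are illustrative assumptions:)

# GPU jobs may use all cores; CPU-only jobs on the same nodes are capped at 28
# of e.g. 32 cores, so a few cores always remain free for GPU jobs.
PartitionName=gpu     Nodes=gpunode[1-8] MaxTime=168:00:00 State=UP
PartitionName=cpu_gpu Nodes=gpunode[1-8] MaxCPUsPerNode=28 MaxTime=168:00:00 State=UP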

The problem you run into there, though, is that there's no way to reserve cores 
on a particular socket. That means problems for folks who care about locality 
for GPU codes: they can wait in the queue with GPUs free and cores free, but 
not the right cores on the right socket to be able to use the GPUs. :-(

Here's the bug I filed on this issue back when I was in Australia, where I 
suggested a MaxCPUsPerSocket parameter for partitions:

https://bugs.schedmd.com/show_bug.cgi?id=4717

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Tim Wickberg
We're still nailing down a few details with the streaming platform (and 
will add them to the website when resolved), but do expect to have the 
video available for one or two weeks afterwards.


- Tim

On 8/31/20 7:07 AM, Ole Holm Nielsen wrote:

On 8/28/20 10:45 PM, Tim Wickberg wrote:
The Slurm User Group Meeting (SLUG'20) this fall will be moving 
online. In lieu of an in-person meeting, SchedMD will broadcast a 
select set of presentations on Tuesday, September 15th, 2020, from 9am 
to noon (MDT).


The agenda is now posted online at:
https://slurm.schedmd.com/slurm_ug_agenda.html

Links to the broadcasts will be added there when available, and an 
update will be sent to slurm-announce and slurm-users lists.


The broadcast timing is a bit awkward for European customers due to the 
8 hour time difference.  I will most likely need to view the 
presentations later on.  Can the broadcasts be made available for 
viewing later on?


Thanks,
Ole





Re: [slurm-users] [ext] Re: Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Holtgrewe, Manuel
Thank you for your reply.

I think I found the issue. We have only a few "skylake" nodes and this job is 
requesting them. Thus, the user is limited to the (relatively few) 
Skylake-generation CPU nodes.

d'oh!
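
(In case it helps others, a quick, hedged way to see which nodes carry a given 
feature; the output columns are assumptions:)

# Node sets and their available features:
sinfo -o "%20N %f"

# Only the nodes in the medium partition that advertise the skylake feature:
sinfo -p medium -o "%20N %f" | grep -w skylake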

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de

From: slurm-users [slurm-users-boun...@lists.schedmd.com] on behalf of Renfro, 
Michael [ren...@tntech.edu]
Sent: Monday, August 31, 2020 19:36
To: Slurm User Community List
Subject: [ext] Re: [slurm-users] Jobs getting StartTime 3 days in the future?

One pending job in this partition should have a reason of “Resources”. That job 
has the highest priority, and if your job below would delay the 
highest-priority job’s start, it’ll get pushed back like you see here.
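
(One hedged way to spot that top-priority pending job, with the partition name 
taken from your output below:)

# Pending jobs in the medium partition, highest priority first, with reasons:
squeue -p medium -t PENDING -o "%.10i %.10Q %.12r %.20u" --sort=-Q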

On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel  
wrote:

Dear all,

I'm seeing a user's job get a StartTime 3 days in the future although there 
are plenty of resources available in the partition (and the user is well below 
the partition's MaxTRESPU).

Attached is our slurm.conf and the dump of "sacctmgr list qos -P". I'd be 
grateful for any insight and happy to provide more information.

The scontrol show job output is as follows:

JobId=2902252 JobName=X
   UserId=X(X GroupId=X(X MCS_label=N/A
   Priority=796 Nice=0 Account=hpc-ag-kehr QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2020-08-31T16:34:16 EligibleTime=2020-08-31T16:34:16
   AccrueTime=2020-08-31T16:34:16
   StartTime=2020-09-03T12:43:58 EndTime=2020-09-04T12:42:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-31T19:11:13
   Partition=medium AllocNode:Sid=med0107:7749
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=112000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=skylake DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=X
   StdErr=X
   StdIn=/dev/null
   StdOut=X
   Power=
   MailUser=(null) MailType=NONE


Best wishes,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de




Re: [slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Renfro, Michael
One pending job in this partition should have a reason of “Resources”. That job 
has the highest priority, and if your job below would delay the 
highest-priority job’s start, it’ll get pushed back like you see here.

On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel  
wrote:

Dear all,

I'm seeing a user's job get a StartTime 3 days in the future although there 
are plenty of resources available in the partition (and the user is well below 
the partition's MaxTRESPU).

Attached is our slurm.conf and the dump of "sacctmgr list qos -P". I'd be 
grateful for any insight and happy to provide more information.

The scontrol show job output is as follows:

JobId=2902252 JobName=X
   UserId=X(X GroupId=X(X MCS_label=N/A
   Priority=796 Nice=0 Account=hpc-ag-kehr QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2020-08-31T16:34:16 EligibleTime=2020-08-31T16:34:16
   AccrueTime=2020-08-31T16:34:16
   StartTime=2020-09-03T12:43:58 EndTime=2020-09-04T12:42:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-31T19:11:13
   Partition=medium AllocNode:Sid=med0107:7749
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=112000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=skylake DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=X
   StdErr=X
   StdIn=/dev/null
   StdOut=X
   Power=
   MailUser=(null) MailType=NONE


Best wishes,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de




[slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Holtgrewe, Manuel
Dear all,

I'm seeing a user's job get a StartTime 3 days in the future although there 
are plenty of resources available in the partition (and the user is well below 
the partition's MaxTRESPU).

Attached is our slurm.conf and the dump of "sacctmgr list qos -P". I'd be 
grateful for any insight and happy to provide more information.

The scontrol show job output is as follows:

JobId=2902252 JobName=X
   UserId=X(X GroupId=X(X MCS_label=N/A
   Priority=796 Nice=0 Account=hpc-ag-kehr QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2020-08-31T16:34:16 EligibleTime=2020-08-31T16:34:16
   AccrueTime=2020-08-31T16:34:16
   StartTime=2020-09-03T12:43:58 EndTime=2020-09-04T12:42:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-31T19:11:13
   Partition=medium AllocNode:Sid=med0107:7749
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=112000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=skylake DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=X
   StdErr=X
   StdIn=/dev/null
   StdOut=X
   Power=
   MailUser=(null) MailType=NONE


Best wishes,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|0|00:00:00|||cluster|||1.00|||cpu=512||
debug|0|00:00:00|||cluster|||1.00|||cpu=1000||
medium|0|00:00:00|||cluster|||1.00|||cpu=512||
critical|0|00:00:00|||cluster|||1.00|||cpu=2000||
long|0|00:00:00|||cluster|||1.00|||cpu=64||
highmem|0|00:00:00|||cluster|||1.00|
gpu|0|00:00:00|||cluster|||1.00|
gpu-interactive|0|00:00:00|||cluster|||1.00|


slurm.conf
Description: slurm.conf


Re: [slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Stephan Schott
Hi,
I'm also very interested in how this could be done properly. At the moment
what we are doing is setting up partitions with MaxCPUsPerNode set to
CPUs minus GPUs. Maybe this helps you in the meantime, but it is a
suboptimal solution (in fact we have nodes with different numbers of CPUs,
so we had to create a partition per "node type"). Someone else may have a
better idea.
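
(A hedged sketch of what that looks like, with node names and core/GPU counts 
as placeholder assumptions:)

# 40-core nodes with 4 GPUs each: cap CPU-only jobs at 36 cores.
PartitionName=cpu_typeA Nodes=gpunode[1-4] MaxCPUsPerNode=36 MaxTime=168:00:00 State=UP
# 24-core nodes with 2 GPUs each: cap CPU-only jobs at 22 cores.
PartitionName=cpu_typeB Nodes=gpunode[5-8] MaxCPUsPerNode=22 MaxTime=168:00:00 State=UP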
Cheers,

El lun., 31 ago. 2020 a las 16:45, Manuel BERTRAND (<
manuel.bertr...@lis-lab.fr>) escribió:

> Hi list,
>
> I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
> cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks
> OpenHPC team for making those!)
> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU is available
> whatever the CPU workload on the node. I'm asking this because, as things
> stand, a CPU-only user can monopolize the whole node, preventing a
> GPU user from getting in as there is no CPU available even if the GPU is
> free. I'm not sure what is the best way to enforce this. Hope this is
> clear :)
>
> Any help greatly appreciated !
>
> Here is my gres.conf, cgroup.conf, partitions configuration, followed by
> the output of 'scontrol show config':
>
> ### gres.conf 
> NodeName=gpunode1 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode1 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode1 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode1 Name=gpu  File=/dev/nvidia3
> NodeName=gpunode2 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode2 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode2 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia3
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia4
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia5
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia6
> NodeName=gpunode3 Name=gpu  File=/dev/nvidia7
> NodeName=gpunode4 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode4 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia3
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia4
> NodeName=gpunode5 Name=gpu  File=/dev/nvidia5
> NodeName=gpunode6 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode6 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode6 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode6 Name=gpu  File=/dev/nvidia3
> NodeName=gpunode7 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode7 Name=gpu  File=/dev/nvidia1
> NodeName=gpunode7 Name=gpu  File=/dev/nvidia2
> NodeName=gpunode7 Name=gpu  File=/dev/nvidia3
> NodeName=gpunode8 Name=gpu  File=/dev/nvidia0
> NodeName=gpunode8 Name=gpu  File=/dev/nvidia1
>
> ### cgroup.conf 
> CgroupAutomount=yes
> TaskAffinity=no
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> ConstrainKmemSpace=no
> ConstrainDevices=yes
>
>
> ### partitions configuration ###
> PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5
> Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
> PartitionName=gpu
> Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8
>
> Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
> PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00
> State=UP
>
>
> ### Slurm configuration ###
> Configuration data as of 2020-08-31T16:23:54
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost   = sms.mycluster
> AccountingStorageLoc= N/A
> AccountingStoragePort   = 6819
> AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
> AccountingStorageType   = accounting_storage/slurmdbd
> AccountingStorageUser   = N/A
> AccountingStoreJobComment = No
> AcctGatherEnergyType= acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq  = 0 sec
> AcctGatherProfileType   = acct_gather_profile/none
> AllowSpecResourcesUsage = No
> AuthAltTypes= (null)
> AuthInfo= (null)
> AuthType= auth/munge
> BatchStartTimeout   = 10 sec
>
> EpilogMsgTime   = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType  = ext_sensors/none
> ExtSensorsFreq  = 0 sec
> FederationParameters= (null)
> FirstJobId  = 1
> GetEnvTimeout   = 2 sec
> GresTypes   = gpu
> GpuFreqDef  = high,memory=high
> GroupUpdateForce= 1
> GroupUpdateTime = 600 sec
> HASH_VAL= Match
> HealthCheckInterval = 300 sec
> HealthCheckNodeState= ANY
> HealthCh

[slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Manuel BERTRAND

Hi list,

I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU 
cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks 
OpenHPC team for making those!)
Everything works great so far, but now I would like to bind a specific 
core to each GPU on each node. By "bind" I mean to make a particular 
core not assignable to a CPU-only job, so that the GPU is available 
whatever the CPU workload on the node. I'm asking this because, as things 
stand, a CPU-only user can monopolize the whole node, preventing a 
GPU user from getting in as there is no CPU available even if the GPU is 
free. I'm not sure what is the best way to enforce this. Hope this is 
clear :)


Any help greatly appreciated !

Here is my gres.conf, cgroup.conf, partitions configuration, followed by 
the output of 'scontrol show config':


### gres.conf 
NodeName=gpunode1 Name=gpu  File=/dev/nvidia0
NodeName=gpunode1 Name=gpu  File=/dev/nvidia1
NodeName=gpunode1 Name=gpu  File=/dev/nvidia2
NodeName=gpunode1 Name=gpu  File=/dev/nvidia3
NodeName=gpunode2 Name=gpu  File=/dev/nvidia0
NodeName=gpunode2 Name=gpu  File=/dev/nvidia1
NodeName=gpunode2 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia0
NodeName=gpunode3 Name=gpu  File=/dev/nvidia1
NodeName=gpunode3 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia3
NodeName=gpunode3 Name=gpu  File=/dev/nvidia4
NodeName=gpunode3 Name=gpu  File=/dev/nvidia5
NodeName=gpunode3 Name=gpu  File=/dev/nvidia6
NodeName=gpunode3 Name=gpu  File=/dev/nvidia7
NodeName=gpunode4 Name=gpu  File=/dev/nvidia0
NodeName=gpunode4 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia0
NodeName=gpunode5 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia2
NodeName=gpunode5 Name=gpu  File=/dev/nvidia3
NodeName=gpunode5 Name=gpu  File=/dev/nvidia4
NodeName=gpunode5 Name=gpu  File=/dev/nvidia5
NodeName=gpunode6 Name=gpu  File=/dev/nvidia0
NodeName=gpunode6 Name=gpu  File=/dev/nvidia1
NodeName=gpunode6 Name=gpu  File=/dev/nvidia2
NodeName=gpunode6 Name=gpu  File=/dev/nvidia3
NodeName=gpunode7 Name=gpu  File=/dev/nvidia0
NodeName=gpunode7 Name=gpu  File=/dev/nvidia1
NodeName=gpunode7 Name=gpu  File=/dev/nvidia2
NodeName=gpunode7 Name=gpu  File=/dev/nvidia3
NodeName=gpunode8 Name=gpu  File=/dev/nvidia0
NodeName=gpunode8 Name=gpu  File=/dev/nvidia1

### cgroup.conf 
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no
ConstrainDevices=yes


### partitions configuration ###
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 
Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu 
Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 
Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 
State=UP



### Slurm configuration ###
Configuration data as of 2020-08-31T16:23:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = sms.mycluster
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = No
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq  = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes    = (null)
AuthInfo    = (null)
AuthType    = auth/munge
BatchStartTimeout   = 10 sec

EpilogMsgTime   = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType  = ext_sensors/none
ExtSensorsFreq  = 0 sec
FederationParameters    = (null)
FirstJobId  = 1
GetEnvTimeout   = 2 sec
GresTypes   = gpu
GpuFreqDef  = high,memory=high
GroupUpdateForce    = 1
GroupUpdateTime = 600 sec
HASH_VAL    = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram  = /usr/sbin/nhc
InactiveLimit   = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType   = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc  = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType    = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend   = 0
JobRequeue  = 1
JobSubmitPlugins    = (null)
KeepAliveTime   = SYSTEM_DEFAULT
KillOnBadExit   = 0
KillWa

Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Ole Holm Nielsen

On 8/28/20 10:45 PM, Tim Wickberg wrote:
The Slurm User Group Meeting (SLUG'20) this fall will be moving online. In 
lieu of an in-person meeting, SchedMD will broadcast a select set of 
presentations on Tuesday, September 15th, 2020, from 9am to noon (MDT).


The agenda is now posted online at:
https://slurm.schedmd.com/slurm_ug_agenda.html

Links to the broadcasts will be added there when available, and an update 
will be sent to slurm-announce and slurm-users lists.


The broadcast timing is a bit awkward for European customers due to the 8 
hour time difference.  I will most likely need to view the presentations 
later on.  Can the broadcasts be made available for viewing later on?


Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Bjørn-Helge Mevik
Just wondering, will we get our t-shirts by email? :D

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature