Re: [slurm-users] slurm node weights

2019-09-05 Thread Merlin Hartley
I believe this is so that small jobs will naturally go on the older, slower nodes 
first - leaving the bigger, better ones free for jobs that actually need them.
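
For example, a minimal slurm.conf sketch of what the manpage is recommending (node 
names and sizes here are invented): jobs go to the lowest weight that satisfies 
them, so weighting the capable nodes up keeps them free until something actually 
needs their size:

# small/old nodes get the low weight, so ordinary jobs land here first
NodeName=old[01-10]  CPUs=32 RealMemory=128000 Weight=10
# big/fast nodes get the high weight, so they stay free for demanding jobs
NodeName=epyc[01-04] CPUs=64 RealMemory=512000 Weight=100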


Merlin
--
Merlin Hartley
IT Support Engineer
MRC Mitochondrial Biology Unit
University of Cambridge
Cambridge, CB2 0XY
United Kingdom

> On 5 Sep 2019, at 16:48, Douglas Duckworth  wrote:
> 
> Hello
> 
> We added some newer Epyc nodes, with NVMe scratch, to our cluster and so want 
> jobs to run on these over others.  So we added "Weight=100" to the older 
> nodes and left the new ones blank.  So indeed, ceteris paribus, srun reveals 
> that the faster nodes will accept jobs over older ones.
> 
> We have the desired outcome though I am a bit confused by two statements in 
> the manpage <https://slurm.schedmd.com/slurm.conf.html> that seem to be 
> contradictory:
> 
> "All things being equal, jobs will be allocated the nodes with the lowest 
> weight which satisfies their requirements."
> 
> "...larger weights should be assigned to nodes with more processors, memory, 
> disk space, higher processor speed, etc."
> 
> 100 is larger than 1 and we do see jobs preferring the new nodes which have 
> the default weight of 1.  Yet we're also told to assign larger weights to 
> faster nodes?
> 
> Thanks!
> Doug
> 
> -- 
> Thanks,
> 
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit <https://scu.med.cornell.edu/>
> Weill Cornell Medicine
> E: d...@med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690



Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Merlin Hartley
max_array_tasks
Specify the maximum number of tasks that can be included in a job array. The 
default limit is MaxArraySize, but this option can be used to set a lower 
limit. For example, max_array_tasks=1000 and MaxArraySize=100001 would permit a 
maximum task ID of 100000, but limit the number of tasks in any single job 
array to 1000.
https://slurm.schedmd.com/slurm.conf.html

SchedulerParameters=max_array_tasks=1000

MaxArraySize=100000

See commit:
https://github.com/SchedMD/slurm/commit/09c13fb292a4a6a56b4078de840aae0d4db70309
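
To double-check what the running controller has actually picked up after a 
change, something along these lines should do (the grep pattern is just 
illustrative):

scontrol show config | grep -i -E 'MaxArraySize|MaxJobCount|SchedulerParameters'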



--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
University of Cambridge
Cambridge, CB2 0XY
United Kingdom

> On 26 Feb 2019, at 14:27, Jeffrey Frey  wrote:
> 
> Also see "https://slurm.schedmd.com/slurm.conf.html" for MaxArraySize/MaxJobCount.
> 
> We just went through a user-requested adjustment to MaxArraySize to bump it 
> from 1000 to 10000; as the documentation states, since each index of an array 
> job is essentially "a job," you must be sure to also adjust MaxJobCount (from 
> 10000 to 100000 in our case).  Adjusting MaxJobCount requires a restart of 
> slurmctld; though the documentation doesn't state it, so does adjustment of 
> MaxArraySize (scontrol reconfigure will succeed but leave the previous limit 
> in effect, see "https://bugs.schedmd.com/show_bug.cgi?id=6553").
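> 
> A minimal sketch of the corresponding change (values as above; the restart 
> command assumes a systemd-managed slurmctld):
> 
> # slurm.conf
> MaxArraySize=10000
> MaxJobCount=100000
> 
> # "scontrol reconfigure" is not enough for these two, so restart the controller
> systemctl restart slurmctld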
> 
> The "MaxArraySize" is a bit of a misnomer since it's really 1 + the top of 
> the valid range of indices -- "MaxArrayIndex" would be more apt.  Our users 
> were very happy with Grid Engine's allowance of any index range and striding 
> that produces no more than "max_aj_tasks" indices; since moving to Slurm 
> they're forced to come up with their own index-mapping functionality at 
> times, but the relatively low MaxArraySize versus what we had in GridEngine 
> (75000) has been especially frustrating for them.
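> 
> A rough sketch of the sort of index-mapping workaround meant above (purely 
> illustrative - the OFFSET variable, the script name and the payload command 
> are invented):
> 
> #!/bin/sh
> #SBATCH --array=0-9999
> # submit one batch per 10000-index block, e.g.
> #   sbatch --export=ALL,OFFSET=0     mapped_array.sh
> #   sbatch --export=ALL,OFFSET=10000 mapped_array.sh
> REAL_INDEX=$(( ${OFFSET:-0} + SLURM_ARRAY_TASK_ID ))
> exec ./process_one_index "$REAL_INDEX"   # placeholder for the real work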
> 
> So far the 10000/100000 combo hasn't come close to exhausting resources on 
> our slurmctld nodes; but we haven't actually submitted a couple 10000-index 
> array jobs and enough other jobs to hit 100000 active jobs, so current memory 
> usage isn't an adequate measure of usage under load.  Since the slurm.conf 
> documentation states:
> 
> 
> Performance can suffer with more than a few hundred thousand jobs. 
> 
> 
> we're reluctant to increase MaxJobCount too much higher.
> 
> 
> 
> 
>> On Feb 26, 2019, at 3:18 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>> 
>> On 2/26/19 9:07 AM, Marcus Wagner wrote:
>>> Does anyone know why, by default, the number of array elements is limited 
>>> to 1000?
>>> We have one user, who would like to have 100k array elements!
>>> What is more difficult for the scheduler, one array job with 100k elements 
>>> or 100k non-array jobs?
>>> Where did you set the limit? Do your users use array jobs at all?
>> 
>> Google is your friend :-)
>> 
>> https://slurm.schedmd.com/job_array.html
>> 
>>> A new configuration parameter has been added to control the maximum job 
>>> array size: MaxArraySize. The smallest index that can be specified by a 
>>> user is zero and the maximum index is MaxArraySize minus one. The default 
>>> value of MaxArraySize is 1001. The maximum MaxArraySize supported in Slurm 
>>> is 4000001. Be mindful about the value of MaxArraySize as job arrays offer 
>>> an easy way for users to submit large numbers of jobs very quickly.
>> 
>> /Ole
>> 
> 
> 
> ::
> Jeffrey T. Frey, Ph.D.
> Systems Programmer V / HPC Management
> Network & Systems Services / College of Engineering
> University of Delaware, Newark DE  19716
> Office: (302) 831-6034  Mobile: (302) 419-4976
> ::
> 
> 
> 
> 



Re: [slurm-users] How to get the CPU usage of history jobs at each compute node?

2019-02-15 Thread Merlin Hartley
using sacct [1] - assuming you have accounting [2] enabled:

sacct -j <jobid>
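
If you want the per-node and CPU-time breakdown rather than the default columns, 
a format list along these lines works (field names as listed in the sacct man 
page):

sacct -j <jobid> --format=JobID,JobName,NodeList,AllocCPUS,TotalCPU,Elapsed,State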

Hope this helps!


Merlin


[1] https://slurm.schedmd.com/sacct.html
[2] https://slurm.schedmd.com/accounting.html


--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
University of Cambridge
Cambridge, CB2 0XY
United Kingdom

> On 15 Feb 2019, at 10:05, hu...@sugon.com wrote:
> 
> Dear there,
> How can I view the CPU usage of historical jobs on each compute node?
> However, this command (scontrol show job <jobid> --details) can only show the 
> CPU usage of a currently running job on each compute node.
> Appreciatively,
> Menglong






Re: [slurm-users] How to request ONLY one CPU instead of one socket or one node?

2019-02-15 Thread Merlin Hartley
Seems like you aren't specifying a --mem option, so the default would be to ask 
for a whole node’s worth of RAM - thus each job would use the whole node.
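
For instance, something like this in the batch script should do it - the 4G 
figure is only a placeholder, so set it to what the program actually needs:

#SBATCH --mem=4G

With the memory request bounded like that, --ntasks=1 --cpus-per-task=1 should 
then really claim just one CPU on a shared node.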

Hope this is useful!


Merlin
--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
University of Cambridge
Cambridge, CB2 0XY
United Kingdom

> On 14 Feb 2019, at 02:21, Wang, Liaoyuan <wan...@alfred.edu> wrote:
> 
> Dear there,
>  
> I wrote an analytic program to analyze my data. The analysis takes around 
> twenty days to run through all the data for one species. When I submit my job 
> to the cluster, it always requests one node instead of one CPU. I am wondering 
> how I can request ONLY one CPU using the “sbatch” command? Below is my batch 
> file. Any comments and help would be highly appreciated.
>  
> Appreciatively,
> Leon
> 
> #!/bin/sh 
> 
> #SBATCH --ntasks=1 
> #SBATCH --cpus-per-task=1 
> #SBATCH -t 45-00:00:00 
> #SBATCH -J 9625%j 
> #SBATCH -o 9625.out 
> #SBATCH -e 9625.err 
> 
> /home/scripts/wcnqn.auto.pl
> ===
> Where wcnqn.auto.pl is my program. 9625 denotes the species number.






Re: [slurm-users] Reserve CPUs/MEM for GPUs

2019-02-15 Thread Merlin Hartley
You could instead only allow the cpu partition to use 192GB RAM and 20 CPUs on 
those nodes, leaving 2 CPUs and 8GB free for each of the 8 GPUs...
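
One possible starting point in slurm.conf (node and partition names are 
invented; MaxMemPerNode is in MB - check the slurm.conf man page for the exact 
semantics of these per-partition limits on your version):

# aim: cap cpu-partition usage at 20 cores / 192 GB per node,
# so 16 cores and 64 GB stay available for GPU jobs
PartitionName=cpu Nodes=gpunode[01-08] MaxCPUsPerNode=20 MaxMemPerNode=196608
PartitionName=gpu Nodes=gpunode[01-08]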

--
Merlin Hartley


> On 13 Feb 2019, at 07:38, Quirin Lohr  wrote:
> 
> Hi all,
> 
> we have a slurm cluster running on nodes with 2x18 cores, 256GB RAM and 8 
> GPUs. Is there a way to reserve a bare minimum of two CPUs and 8GB RAM for 
> each GPU, so a high-CPU job cannot render the GPUs "unusable"?
> 
> Thanks in advance
> Quirin
> -- 
> Quirin Lohr
> Systemadministration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Mustererkennung
> 
> Boltzmannstrasse 3
> 85748 Garching
> 
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
> 
> quirin.l...@in.tum.de
> www.vision.in.tum.de
> 






Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Merlin Hartley
damn autocorrect - I meant:

# scontrol show job 6982



--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

> On 29 Nov 2017, at 16:08, Merlin Hartley <merlin-sl...@mrc-mbu.cam.ac.uk> 
> wrote:
> 
> Can you give us the output of 
> # control show job 6982
> 
> Could be an issue with requesting too many CPUs or something…
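> 
> Alongside that, a quick way to see whether it really is memory rather than 
> CPUs that is exhausted on those nodes (sinfo field names as in the man page):
> 
> sinfo -N -O NodeList,CPUsState,Memory,AllocMem,FreeMem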
> 
> 
> Merlin
> --
> Merlin Hartley
> Computer Officer
> MRC Mitochondrial Biology Unit
> Cambridge, CB2 0XY
> United Kingdom
> 
>> On 29 Nov 2017, at 15:21, Christian Anthon <ant...@rth.dk> wrote:
>> 
>> Hi,
>> 
>> I have a problem with a newly set up slurm-17.02.7-1.el6.x86_64: jobs seem 
>> to be stuck in ReqNodeNotAvail:
>> 
>>   6982 panic  Morgensferro PD   0:00 1 
>> (ReqNodeNotAvail, UnavailableNodes:)
>>   6981 panic SPECferro PD   0:00 1 
>> (ReqNodeNotAvail, UnavailableNodes:)
>> 
>> The nodes are fully allocated in terms of memory, but not all cpu resources 
>> are consumed
>> 
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> _default up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> _default up   infinite 11  alloc alone[02-08,10-13]
>> fastlane up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> fastlane up   infinite 11  alloc alone[02-08,10-13]
>> panicup   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> panicup   infinite 12  alloc alone[02-08,10-13,15]
>> free*up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> free*up   infinite 11  alloc alone[02-08,10-13]
>> 
>> Possibly relevant lines in slurm.conf (full slurm.conf attached)
>> 
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> TaskPlugin=task/none
>> FastSchedule=1
>> 
>> Any advice?
>> 
>> Cheers, Christian.
>> 
>> 
>