See the task/cgroup plugin for constraining jobs to the specific CPUs and memory they were allocated. Also see the --exclusive option in srun.
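To make that concrete, here is a minimal sketch of what enabling cgroup confinement looks like. This is one possible configuration, not a drop-in for the poster's setup; exact parameters and defaults vary by Slurm version, so check the slurm.conf and cgroup.conf man pages for yours.

```shell
# In slurm.conf: have Slurm place each job's tasks into a cgroup
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# In cgroup.conf (same directory as slurm.conf):
# pin job processes to the allocated cores and cap their memory,
# so 'stress -c 10' in an 8-CPU allocation shares only those 8 cores
ConstrainCores=yes
ConstrainRAMSpace=yes
```

With ConstrainCores=yes a job that over-subscribes its allocation only oversubscribes itself; the other 24 cores on the node stay available to the jobs they were allocated to.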
On June 23, 2015 4:44:27 AM PDT, "Rémi Piatek" <[email protected]> wrote:
>
> I had considered this simple explanation, but it seemed unlikely to me,
> as it would imply that we have to rely entirely on users to specify
> correctly the number of CPUs they need. People use my server for
> CPU-intensive jobs, so it is important for me to make sure that
> resources are shared fairly. I was hoping Slurm would allow me to do
> this and prevent people from free-riding (so far, it would be easy to
> request a small number of CPUs and use a much larger number, thus
> slowing down the other users).
>
> I read that jobs which exceed the memory requested and allocated by
> Slurm are automatically interrupted. Is there nothing similar for
> the use of CPUs?
>
> Thanks for the help! Much appreciated.
>
>
> On 06/23/2015 12:05 PM, Loris Bennett wrote:
>>
>> Rémi Piatek <[email protected]> writes:
>>
>>> Hello,
>>>
>>> I am getting started with SLURM and I am having a hard time
>>> understanding how it allocates CPUs to users depending on the
>>> resources they request. The problem I am facing can be summarized
>>> as follows. Consider a bash script test.sh that requests 8 CPUs
>>> but actually starts a job that uses 10 CPUs:
>>>
>>> #!/bin/sh
>>> #SBATCH --ntasks=8
>>> stress -c 10
>>>
>>> On a server with 32 CPUs, if I start this script 5 times with
>>> sbatch test.sh, 4 of them start running right away and the last
>>> one appears as pending, as shown by the squeue command:
>>>
>>> JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>>>     5      main  test.sh jack PD  0:00     1 (Resources)
>>>     1      main  test.sh jack  R  0:08     1 server
>>>     2      main  test.sh jack  R  0:08     1 server
>>>     3      main  test.sh jack  R  0:05     1 server
>>>     4      main  test.sh jack  R  0:05     1 server
>>>
>>> The problem is that these 4 jobs are actually using 40 CPUs and
>>> overload the server.
>>> I would, on the contrary, expect SLURM either not to start jobs
>>> that actually use more resources than the user requested, or to
>>> put them on hold until there are enough resources to start them.
>>> How can I make sure that the users of my server do not start jobs
>>> that use too many CPUs?
>>>
>>> Some useful details about my slurm.conf file:
>>>
>>> # SCHEDULING
>>> #DefMemPerCPU=0
>>> FastSchedule=1
>>> #MaxMemPerCPU=0
>>> SchedulerType=sched/backfill
>>> SchedulerPort=7321
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_CPU
>>> # COMPUTE NODES
>>> NodeName=server CPUs=32 RealMemory=10000 State=UNKNOWN
>>> # PARTITIONS
>>> PartitionName=main Nodes=server Default=YES Shared=YES MaxTime=INFINITE State=UP
>>>
>>> I am probably making a trivial mistake in the configuration file,
>>> or just misunderstanding a basic concept of SLURM. Any help or
>>> advice would be much appreciated.
>>>
>>> Many thanks in advance!
>>
>> Slurm just keeps track of how many cores have been assigned to
>> running jobs - it doesn't check how many processes are actually
>> started within a given job. So, it is up to the user to make sure
>> she starts the correct number of processes.
>>
>> Cheers,
>>
>> Loris

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
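As Loris says, the select/cons_res plugin with CR_CPU only does bookkeeping on *requested* CPUs; it never inspects how many processes a job actually runs. The following toy shell sketch (not Slurm code, just an illustration of that bookkeeping) shows why exactly four 8-CPU jobs start on a 32-CPU node and the fifth pends with (Resources), regardless of what the jobs really do:

```shell
#!/bin/sh
# Toy model of cons_res CPU accounting: subtract each job's
# *requested* CPU count from the node total; actual usage
# (e.g. 'stress -c 10' inside an 8-CPU allocation) is never checked.
total_cpus=32   # NodeName=server CPUs=32
request=8       # #SBATCH --ntasks=8
free=$total_cpus

for job in 1 2 3 4 5; do
    if [ "$free" -ge "$request" ]; then
        free=$((free - request))
        echo "job $job: R  (free CPUs left: $free)"
    else
        echo "job $job: PD (Resources)"
    fi
done
```

Jobs 1-4 consume all 32 bookkept CPUs, so job 5 pends; meanwhile the four running jobs may really be burning 40 CPUs, which is exactly the oversubscription the cgroup confinement above prevents.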
