28.02.2020 20:53, Renfro, Michael wrote:
> When I made similar queues, and only wanted my GPU jobs to use up to 8 cores
> per GPU, I set Cores=0-7 and 8-15 for each of the two GPU devices in
> gres.conf. Have you tried reducing those values to Cores=0 and Cores=20?
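For reference, that suggested change corresponds to a gres.conf roughly like this (a sketch, keeping the two-V100 layout from the original post and binding each GPU to a single core as suggested):

```
# gres.conf -- bind each V100 to one core instead of a whole socket's range
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
Name=gpu Type=v100 File=/dev/nvidia1 Cores=20
```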
Yes, I've tried that. Unfortunately, it does not work. Moreover, if I set
the CPU core limit of the 'cpu' partition to any number between 20 and 37,
the GPU job still does not start after the CPU one, despite there being
unallocated CPU cores.

Another test was the following: I limited the 'cpu' partition to 39 cores
and the 'gpu' partition to 2 cores (MaxCPUsPerNode=39 and MaxCPUsPerNode=2,
respectively). When I start a CPU job with ntasks-per-node=38 (one task
less than the limit), the GPU job starts without any problem.

It looks like I have to limit ntasks-per-node of my CPU jobs to
MaxCPUsPerNode-1 or less. If I use all available CPU cores, the GPU job
does not start in the 'gpu' partition on the same nodes.

--
Pavel Vashchenkov

>
>> On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov <vas...@itam.nsc.ru> wrote:
>>
>> Hello,
>>
>> I have a hybrid cluster with 2 GPUs and two 20-core CPUs on each node.
>>
>> I created two partitions:
>> - "cpu" for CPU-only jobs, which are allowed to allocate up to 38 cores per node
>> - "gpu" for GPU-only jobs, which are allowed to allocate up to 2 GPUs and 2 CPU cores
>>
>> Respective sections in slurm.conf:
>>
>> # NODES
>> NodeName=node[01-06] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1
>> Gres=gpu:2(S:0-1) RealMemory=257433
>>
>> # PARTITIONS
>> PartitionName=cpu Default=YES Nodes=node[01-06] MaxNodes=6 MinNodes=0
>> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=38
>> PartitionName=gpu Nodes=node[01-06] MaxNodes=6 MinNodes=0
>> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=2
>>
>> and in gres.conf:
>> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
>> Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
>>
>> However, it does not seem to be working properly.
>> If I first submit the GPU job, using all resources available in the "gpu"
>> partition, and then a CPU job allocating the rest of the CPU cores (i.e.
>> 38 cores per node) in the "cpu" partition, it works perfectly fine: both
>> jobs start running. But if I reverse the submission order and start the
>> CPU job before the GPU job, the "cpu" job starts running while the "gpu"
>> job stays in the queue with PENDING status and Resources reason.
>>
>> My first guess was that the "cpu" job allocates cores assigned to the
>> respective GPUs in gres.conf and prevents the GPU devices from running.
>> However, that seems not to be the case, because a job with 37 cores per
>> node instead of 38 solves the problem.
>>
>> Another thought was that it has something to do with specialized core
>> reservation, but I tried changing the CoreSpecCount option without success.
>>
>> So, any ideas how to fix this behavior and where I should look?
>>
>> Thanks!
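The pattern above (38 cores per node blocks the GPU job, 37 does not) is consistent with the idea that what matters is not only how many cores the CPU job takes, but *which* cores it takes relative to the Cores= bindings in gres.conf. A minimal sketch of that arithmetic (illustrative only; the core IDs and the assumption that Slurm packs cores contiguously from 0 are mine, not from the thread):

```python
# Node from the original post: 40 cores, GPU0 bound to cores 0-19,
# GPU1 bound to cores 20-39 (per gres.conf).
GPU_CORE_RANGES = {
    "nvidia0": set(range(0, 20)),
    "nvidia1": set(range(20, 40)),
}

def free_cores_per_gpu(allocated):
    """Count cores still free inside each GPU's Cores= range
    after a CPU job has been allocated the given core IDs."""
    free = set(range(40)) - set(allocated)
    return {gpu: len(rng & free) for gpu, rng in GPU_CORE_RANGES.items()}

# If a 38-core CPU job happens to receive cores 0-37, only cores 38 and 39
# remain free -- both in GPU1's range, leaving GPU0 with no usable core:
print(free_cores_per_gpu(range(38)))
# -> {'nvidia0': 0, 'nvidia1': 2}

# The same 38-core job placed differently (cores 0-18 and 20-38) leaves
# one free core in each GPU's range, so a 2-GPU/2-core job would fit:
print(free_cores_per_gpu(list(range(19)) + list(range(20, 39))))
# -> {'nvidia0': 1, 'nvidia1': 1}
```

Whether the scheduler actually refuses to place the GPU job for this reason is not confirmed here; the sketch only shows that with 38 of 40 cores taken, the feasibility of a one-core-per-GPU allocation depends entirely on placement.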