Hello, I have a hybrid cluster with 2 GPUs and two 20-core CPUs on each node.
I created two partitions:

- "cpu" for CPU-only jobs, which are allowed to allocate up to 38 cores per node
- "gpu" for GPU-only jobs, which are allowed to allocate up to 2 GPUs and 2 CPU cores per node

The relevant sections in slurm.conf:

# NODES
NodeName=node[01-06] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Gres=gpu:2(S:0-1) RealMemory=257433

# PARTITIONS
PartitionName=cpu Default=YES Nodes=node[01-06] MaxNodes=6 MinNodes=0 DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=38
PartitionName=gpu Nodes=node[01-06] MaxNodes=6 MinNodes=0 DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=2

and in gres.conf:

Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39

However, it does not seem to work properly. If I first submit a GPU job that uses all the resources available in the "gpu" partition, and then a CPU job that allocates the rest of the CPU cores (i.e. 38 cores per node) in the "cpu" partition, everything works fine: both jobs start running. But if I reverse the submission order and start the CPU job before the GPU job, the CPU job starts running while the GPU job stays in the queue in PENDING state with reason RESOURCES.

My first guess was that the CPU job allocates the cores assigned to the respective GPUs in gres.conf and thereby blocks the GPU devices. However, that does not seem to be the case, because requesting 37 cores per node instead of 38 makes the problem go away. Another thought was that it has something to do with specialized core reservation, but changing the CoreSpecCount option did not help.

So, any ideas how to fix this behavior, or where I should look? Thanks!
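P.S. For reference, the two jobs are submitted roughly like this (a sketch only; the script names are placeholders, but the resource requests match the per-node limits described above):

# GPU job: 2 GPUs and 2 cores per node in the "gpu" partition
sbatch --partition=gpu --nodes=6 --ntasks-per-node=2 --gres=gpu:2 gpu_job.sh

# CPU job: the remaining 38 cores per node in the "cpu" partition
sbatch --partition=cpu --nodes=6 --ntasks-per-node=38 cpu_job.sh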