Dear SLURM developers,

we are observing strange behavior regarding --ntasks and --ntasks-per-node in connection with gres.conf.
On our GPU nodes (Procs=24 and Gres=gpu:4), we have the following gres.conf:

    Name=gpu File=/dev/nvidia0 CPUs=0-5
    Name=gpu File=/dev/nvidia1 CPUs=6-11
    Name=gpu File=/dev/nvidia2 CPUs=12-17
    Name=gpu File=/dev/nvidia3 CPUs=18-23

Now, depending on whether we use --ntasks or --ntasks-per-node, SLURM behaves entirely differently.
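For completeness, the corresponding node definition in slurm.conf would look roughly like this (the NodeName and memory value here are placeholders for illustration, not our actual configuration):

    NodeName=gpu[01-04] Procs=24 Gres=gpu:4 RealMemory=64000 State=UNKNOWN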
A) Consider the following batch file:

    #SBATCH --gres=gpu:2
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2

I can submit this job, but it never starts (it stays pending with reason ReqNodeNotAvail). Probably this is because SLURM wants to allocate the two tasks on core ids 0-1 if they are free, so no core id from the cpuset of the second GPU (according to gres.conf) is used, but that is just my guess.
B) Now we try the same with two nodes:

    #SBATCH --gres=gpu:2
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=2

Job submission fails with the error message "Requested node configuration is not available", okay. My guess is that, once again, SLURM only tries to allocate core ids from the cpuset of the first GPU, which conflicts with gres.conf when two GPUs are requested, because no core from the cpuset of the second GPU is allocated.
C) To support my guess, I tried adding the --cpus-per-task parameter, to see whether such jobs start once I request enough cores that at least one core id from the second GPU's cpuset is used:

    #SBATCH --gres=gpu:2
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=3

This still fails. Why? Because 2 tasks with 3 CPUs per task each only amount to 6 cores, so core ids 0-5 get allocated, which is still only the cpuset of the first GPU. However, as soon as I increase --cpus-per-task to 4, I get core ids 0-7 and voilà, the job can be submitted!
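To make the arithmetic explicit, here is a small sketch (not SLURM code, just my mental model as an assumption) of how block allocation of the lowest free core ids interacts with the cpusets from our gres.conf:

```python
# Model of our gres.conf: each GPU device is tied to a cpuset of 6 cores.
cpusets = {
    "/dev/nvidia0": range(0, 6),
    "/dev/nvidia1": range(6, 12),
    "/dev/nvidia2": range(12, 18),
    "/dev/nvidia3": range(18, 24),
}

def covered_gpus(ntasks_per_node, cpus_per_task):
    """Return the GPUs whose cpuset contains at least one allocated core,
    assuming cores are packed block-wise from core id 0 upwards
    (as with CR_CORE_DEFAULT_DIST_BLOCK)."""
    allocated = set(range(ntasks_per_node * cpus_per_task))
    return [gpu for gpu, cores in cpusets.items() if allocated & set(cores)]

# 2 tasks x 3 cpus = cores 0-5: only the first GPU's cpuset is touched.
print(covered_gpus(2, 3))  # only /dev/nvidia0
# 2 tasks x 4 cpus = cores 0-7: the second GPU's cpuset is now reached as well.
print(covered_gpus(2, 4))  # /dev/nvidia0 and /dev/nvidia1
```

This matches what we observe: a request for two GPUs only becomes satisfiable once the allocated core block reaches into the second GPU's cpuset.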
Now, after all this testing, it certainly seems that with a gres.conf like ours you have to make sure the job always gets a core from the cpuset of every GPU it requests, right? Wrong...
D) Let's try with just the parameter --ntasks instead of --ntasks-per-node:

    #SBATCH --gres=gpu:2
    #SBATCH --nodes=2
    #SBATCH --ntasks=4

This job can be submitted just fine, and it does not even stay in the pending state! Wait, what? That makes little sense, because with this request you can end up with 3 CPUs on the first node and only 1 on the second, even though you are requesting 2 GPUs per node... This is, however, currently the only way to successfully submit multi-node GPU jobs on our cluster. We had to advise our users to use --mincpus to ensure an evenly balanced core distribution across the nodes, but intuitively one would use --ntasks-per-node for that, which unfortunately does not work.
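For illustration, the workaround we currently suggest to our users looks roughly like this (the --mincpus value is just an example, chosen so that each node gets at least 2 cores):

    #SBATCH --gres=gpu:2
    #SBATCH --nodes=2
    #SBATCH --ntasks=4
    #SBATCH --mincpus=2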
Now my questions:

1. Is it intended that the core id distribution has to match the gres.conf entries when requesting GPU resources, meaning the job has to have a core id from the cpuset of each GPU that is being requested?
2. If 1 is true, why is this completely ignored when using just --ntasks instead of --ntasks-per-node?
3. How can we ensure that SLURM allocates cores "round-robin" across the cpusets from gres.conf for jobs with generic resources, in order to make --ntasks-per-node work again? We probably have to change SelectTypeParameters for this, but we do not want to do so globally for all jobs... is this even possible at the moment?
4. Why does it fail at job submission in one case (see B), while in the other case (A) the job can be submitted but stays pending with ReqNodeNotAvail? This is highly confusing for our users.
For your reference, our current select parameters are:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE,CR_ALLOCATE_FULL_SOCKET,CR_PACK_NODES

We are using SLURM version 14.11.7. Thank you kindly in advance.

--
Maik Schmidt
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Willers-Bau A116
D-01062 Dresden
Telefon: +49 351 463-32836
