Hi,

On a cluster made up of quad-processor nodes, I encountered the following issue during a job allocation for an application that has 4 tasks, each requiring 3 processors (this is exactly the example you provide in the --cpus-per-task section of the salloc manpage).

Here are the facts:

First, the cluster configuration: 4 nodes, 1 socket/node, 4 cores/socket, 1 thread/core

> sinfo -V
slurm 17.11.9-2

> sinfo
PARTITION        AVAIL  TIMELIMIT NODES   CPUS(A/I/O/T) STATE  NODELIST
any*             up       2:00:00     4       0/16/0/16 idle~  n[101-104]

1) With the CR_CORE consumable resource:

> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

> salloc -n4 -c3
salloc: Granted job allocation 218
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 218   normal       43          any bash     sila   R       0:03      3    10  n[101-103]
> srun hostname
srun: error: Unable to create step for job 218: More processors requested than permitted

We can see that the number of granted processors and nodes is completely wrong: 10 CPUs instead of 12, and only 3 nodes instead of 4. The correct behaviour when requesting 4 tasks (-n4) with 3 processors per task (-c3) on a cluster of quad-core nodes is for the controller to grant an allocation of 4 nodes, one for each of the 4 tasks.
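
To make the expected arithmetic explicit, here is a small standalone illustration of that reasoning (plain C, not Slurm code; the variable names are mine):

        #include <stdio.h>

        int main(void)
        {
                int ntasks        = 4;   /* salloc -n4 */
                int cpus_per_task = 3;   /* salloc -c3 */
                int cpus_per_node = 4;   /* 1 socket x 4 cores x 1 thread */

                /* whole tasks that fit on one node */
                int tasks_per_node = cpus_per_node / cpus_per_task;                /* 1  */
                /* nodes needed, rounding up */
                int nodes_needed = (ntasks + tasks_per_node - 1) / tasks_per_node; /* 4  */
                int cpus_needed  = ntasks * cpus_per_task;                         /* 12 */

                printf("tasks/node=%d nodes=%d cpus=%d\n",
                       tasks_per_node, nodes_needed, cpus_needed);
                return 0;
        }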

Note that when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 221
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 221   normal       43          any bash     sila   R       0:03      4    12  n[101-104]
> srun hostname
n101
n103
n102
n104

2) With the CR_SOCKET consumable resource:

> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET

> salloc -n4 -c3
salloc: Granted job allocation 226
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 226   normal       43          any bash     sila   R       0:02      3    12  n[101-103]

Here, Slurm allocates the right number of processors (12) but the wrong number of nodes: 3 instead of 4. As a result, 2 tasks are placed on the same node (n101):

> srun hostname
n102
n101
n101
n103

Again, when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 230
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 230   normal       43          any bash     sila   R       0:03      4    16  n[101-104]

Note that 16 processors have been allocated instead of 12, but this is correct because Slurm is configured with the CR_Socket consumable resource: each node has 1 socket with 4 cores/socket, so a whole socket is allocated per task (see the sketch after the output below). The output of srun is as expected:

sila@master2-l422:~> srun hostname
n101
n102
n103
n104
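
To spell out where the 16 CPUs come from, this is the socket-granularity rounding I assume CR_Socket applies, as a standalone sketch (plain C, not Slurm code; names are mine):

        #include <stdio.h>

        int main(void)
        {
                int ntasks           = 4;  /* salloc -n4, one task per node */
                int cpus_per_task    = 3;  /* salloc -c3 */
                int cores_per_socket = 4;

                /* With CR_Socket the allocation unit is a whole socket, so the
                 * 3 requested CPUs per task are rounded up to a full socket. */
                int sockets_per_task = (cpus_per_task + cores_per_socket - 1) /
                                       cores_per_socket;                    /* 1 socket */
                int cpus_per_alloc   = sockets_per_task * cores_per_socket; /* 4 CPUs   */

                printf("allocated cpus = %d\n", ntasks * cpus_per_alloc);   /* 16       */
                return 0;
        }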

3) Conclusion and fix:

I thought that --tasks-per-node should not be mandatory to obtain the right behaviour, so I did some investigation.

I think a bug was introduced when the allocation code for CR_Socket and CR_Core was unified (commit 6fa3d5ad), in Step 3 of the _allocate_sc() function (src/plugins/select/cons_res/job_test.c), where avail_cpus is computed:

src/plugins/select/cons_res/job_test.c, _allocate_sc(...):

        if (cpus_per_task < 2) {
                avail_cpus = num_tasks;
        } else if ((ntasks_per_core == 1) &&
                   (cpus_per_task > threads_per_core)) {
                /* find out how many cores a task will use */
                int task_cores = (cpus_per_task + threads_per_core - 1) /
                                 threads_per_core;
                int task_cpus  = task_cores * threads_per_core;
                /* find out how many tasks can fit on a node */
                int tasks = avail_cpus / task_cpus;
                /* how many cpus the job would use on the node */
                avail_cpus = tasks * task_cpus;
                /* subtract out the extra cpus. */
                avail_cpus -= (tasks * (task_cpus - cpus_per_task));
        } else {
                j = avail_cpus / cpus_per_task;
                num_tasks = MIN(num_tasks, j);
                if (job_ptr->details->ntasks_per_node) <- problem
                        avail_cpus = num_tasks * cpus_per_task;
        }


The 'if (job_ptr->details->ntasks_per_node)' condition marked above as 'problem' prevents avail_cpus from being computed correctly when --ntasks-per-node is NOT specified (and cpus_per_task > 1). Before _allocate_sockets() and _allocate_cores() were unified into _allocate_sc(), this condition was only present in the _allocate_sockets() code. It appears to be unnecessary in _allocate_sc(), and avail_cpus should be computed unconditionally.
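
To illustrate the effect on this cluster, here is a standalone sketch of that branch with the job's numbers plugged in (my own rewrite of the logic, not the Slurm code itself): without the assignment, each 4-core node appears to offer 4 usable CPUs instead of 3, which is why the controller packs the job onto 3 nodes.

        #include <stdio.h>

        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        /* Standalone rewrite of the 'else' branch above, for one 4-core node
         * of the cluster and the -n4 -c3 request, with and without the fix. */
        static int step3_avail_cpus(int apply_fix)
        {
                int avail_cpus      = 4;  /* free CPUs on the node */
                int cpus_per_task   = 3;  /* -c3 */
                int num_tasks       = 4;  /* -n4 */
                int ntasks_per_node = 0;  /* --ntasks-per-node NOT specified */
                int j;

                j = avail_cpus / cpus_per_task;   /* 1 task fits on the node */
                num_tasks = MIN(num_tasks, j);

                if (apply_fix || ntasks_per_node)
                        avail_cpus = num_tasks * cpus_per_task;

                return avail_cpus;
        }

        int main(void)
        {
                printf("without fix: avail_cpus = %d\n", step3_avail_cpus(0)); /* 4 */
                printf("with fix:    avail_cpus = %d\n", step3_avail_cpus(1)); /* 3 */
                return 0;
        }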

With the following patch, the Slurm controller with select/cons_res (CR_CORE and CR_SOCKET) does its job correctly when allocating with --cpus-per-task (-c), without the need to specify the --ntasks-per-node option:

diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 25e0b8875b..4e704e8b65 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -474,8 +474,7 @@ static uint16_t _allocate_sc(struct job_record *job_ptr, bitstr_t *core_map,
        } else {
                j = avail_cpus / cpus_per_task;
                num_tasks = MIN(num_tasks, j);
-               if (job_ptr->details->ntasks_per_node)
-                       avail_cpus = num_tasks * cpus_per_task;
+               avail_cpus = num_tasks * cpus_per_task;
        }

        if ((job_ptr->details->ntasks_per_node &&

Test results OK after applying the patch:
1) CR_CORE:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE
> salloc -n4 -c3
salloc: Granted job allocation 234
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 234   normal       43          any bash     sila   R       0:02      4    12  n[101-104]

2) CR_SOCKET:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET
> salloc -n4 -c3
salloc: Granted job allocation 233
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 233   normal       43          any bash     sila   R       0:03      4    16  n[101-104]

What do you think?

Best regards,

Didier
