Hi,

On a cluster made up of quad-processor nodes, I encountered the following issue during a job allocation for an application that has 4 tasks, each requiring 3 processors (this is exactly the example you provide in the --cpus-per-task section of the salloc manpage).

Here are the facts:

First, the cluster configuration: 4 nodes, 1 socket/node, 4 cores/socket, 1 thread/core

> sinfo -V
slurm 17.11.9-2

> sinfo
PARTITION        AVAIL  TIMELIMIT NODES   CPUS(A/I/O/T) STATE  NODELIST
any*             up       2:00:00     4       0/16/0/16 idle~  n[101-104]

1) With the CR_CORE consumable resource:

> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

> salloc -n4 -c3
salloc: Granted job allocation 218
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 218   normal       43          any bash     sila   R       0:03      3    10  n[101-103]
> srun hostname
srun: error: Unable to create step for job 218: More processors requested than permitted

We can see that the number of granted processors and nodes is completely wrong: 10 CPUs instead of 12, and only 3 nodes instead of 4. The correct behaviour when requesting 4 tasks (-n4) with 3 processors per task (-c3) on a cluster of quad-core nodes is for the controller to grant an allocation of 4 nodes, one for each of the 4 tasks.
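
To make the expected arithmetic explicit, here is a small standalone illustration of that reasoning (plain C, not Slurm code; the variable names are mine):

        #include <stdio.h>

        int main(void)
        {
                int ntasks        = 4;   /* salloc -n4 */
                int cpus_per_task = 3;   /* salloc -c3 */
                int cpus_per_node = 4;   /* 1 socket x 4 cores x 1 thread */

                /* whole tasks that fit on one node */
                int tasks_per_node = cpus_per_node / cpus_per_task;                /* 1  */
                /* nodes needed, rounding up */
                int nodes_needed = (ntasks + tasks_per_node - 1) / tasks_per_node; /* 4  */
                int cpus_needed  = ntasks * cpus_per_task;                         /* 12 */

                printf("tasks/node=%d nodes=%d cpus=%d\n",
                       tasks_per_node, nodes_needed, cpus_needed);
                return 0;
        }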

Note that when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 221
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 221   normal       43          any bash     sila   R       0:03      4    12  n[101-104]
> srun hostname
n101
n103
n102
n104

2) With the CR_SOCKET consumable resource:

> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET

> salloc -n4 -c3
salloc: Granted job allocation 226
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 226   normal       43          any bash     sila   R       0:02      3    12  n[101-103]

Here, Slurm allocates the right number of processors (12) but the wrong number of nodes: 3 instead of 4. As a result, 2 tasks are placed on the same node (n101):

> srun hostname
n102
n101
n101
n103

Again, when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 230
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 230   normal       43          any bash     sila   R       0:03      4    16  n[101-104]

Note that 16 processors have been allocated instead of 12, but this is correct because Slurm is configured with the CR_Socket consumable resource: each node has 1 socket with 4 cores/socket, so a whole socket is allocated per task (see the sketch after the output below). The output of srun is as expected:

sila@master2-l422:~> srun hostname
n101
n102
n103
n104
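
To spell out where the 16 CPUs come from, this is the socket-granularity rounding I assume CR_Socket applies, as a standalone sketch (plain C, not Slurm code; names are mine):

        #include <stdio.h>

        int main(void)
        {
                int ntasks           = 4;  /* salloc -n4, one task per node */
                int cpus_per_task    = 3;  /* salloc -c3 */
                int cores_per_socket = 4;

                /* With CR_Socket the allocation unit is a whole socket, so the
                 * 3 requested CPUs per task are rounded up to a full socket. */
                int sockets_per_task = (cpus_per_task + cores_per_socket - 1) /
                                       cores_per_socket;                    /* 1 socket */
                int cpus_per_alloc   = sockets_per_task * cores_per_socket; /* 4 CPUs   */

                printf("allocated cpus = %d\n", ntasks * cpus_per_alloc);   /* 16       */
                return 0;
        }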

3) Conclusion and fix:

I thought that --tasks-per-node should not be mandatory to obtain the right behaviour, so I did some investigation.

I think a bug was introduced when the allocation code for CR_Socket and CR_Core was unified (commit 6fa3d5ad), in Step 3 of the _allocate_sc() function (src/plugins/select/cons_res/job_test.c), where avail_cpus is computed:

src/plugins/select/cons_res/job_test.c, _allocate_sc(...):

        if (cpus_per_task < 2) {
                avail_cpus = num_tasks;
        } else if ((ntasks_per_core == 1) &&
                   (cpus_per_task > threads_per_core)) {
                /* find out how many cores a task will use */
                int task_cores = (cpus_per_task + threads_per_core - 1) /
                                 threads_per_core;
                int task_cpus  = task_cores * threads_per_core;
                /* find out how many tasks can fit on a node */
                int tasks = avail_cpus / task_cpus;
                /* how many cpus the job would use on the node */
                avail_cpus = tasks * task_cpus;
                /* subtract out the extra cpus. */
                avail_cpus -= (tasks * (task_cpus - cpus_per_task));
        } else {
                j = avail_cpus / cpus_per_task;
                num_tasks = MIN(num_tasks, j);
                if (job_ptr->details->ntasks_per_node) <- problem
                        avail_cpus = num_tasks * cpus_per_task;
        }


The 'if (job_ptr->details->ntasks_per_node)' condition marked above as 'problem' prevents avail_cpus from being computed correctly when --ntasks-per-node is NOT specified (and cpus_per_task > 1). Before _allocate_sockets() and _allocate_cores() were unified into _allocate_sc(), this condition was only present in the _allocate_sockets() code. It appears to be unnecessary in _allocate_sc(), and avail_cpus should be computed unconditionally.
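
To illustrate the effect on this cluster, here is a standalone sketch of that branch with the job's numbers plugged in (my own rewrite of the logic, not the Slurm code itself): without the assignment, each 4-core node appears to offer 4 usable CPUs instead of 3, which is why the controller packs the job onto 3 nodes.

        #include <stdio.h>

        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        /* Standalone rewrite of the 'else' branch above, for one 4-core node
         * of the cluster and the -n4 -c3 request, with and without the fix. */
        static int step3_avail_cpus(int apply_fix)
        {
                int avail_cpus      = 4;  /* free CPUs on the node */
                int cpus_per_task   = 3;  /* -c3 */
                int num_tasks       = 4;  /* -n4 */
                int ntasks_per_node = 0;  /* --ntasks-per-node NOT specified */
                int j;

                j = avail_cpus / cpus_per_task;   /* 1 task fits on the node */
                num_tasks = MIN(num_tasks, j);

                if (apply_fix || ntasks_per_node)
                        avail_cpus = num_tasks * cpus_per_task;

                return avail_cpus;
        }

        int main(void)
        {
                printf("without fix: avail_cpus = %d\n", step3_avail_cpus(0)); /* 4 */
                printf("with fix:    avail_cpus = %d\n", step3_avail_cpus(1)); /* 3 */
                return 0;
        }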

With the following patch, the Slurm controller with select/cons_res (CR_CORE and CR_SOCKET) does its job correctly when allocating with --cpus-per-task (-c), without the need to specify the --ntasks-per-node option:

diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 25e0b8875b..4e704e8b65 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -474,8 +474,7 @@ static uint16_t _allocate_sc(struct job_record *job_ptr, bitstr_t *core_map,
        } else {
                j = avail_cpus / cpus_per_task;
                num_tasks = MIN(num_tasks, j);
-               if (job_ptr->details->ntasks_per_node)
-                       avail_cpus = num_tasks * cpus_per_task;
+               avail_cpus = num_tasks * cpus_per_task;
        }

        if ((job_ptr->details->ntasks_per_node &&

Test results OK after applying the patch:
1) CR_CORE:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE
> salloc -n4 -c3
salloc: Granted job allocation 234
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 234   normal       43          any bash     sila   R       0:02      4    12  n[101-104]

2) CR_SOCKET:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET
> salloc -n4 -c3
salloc: Granted job allocation 233
> squeue
               JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                 233   normal       43          any bash     sila   R       0:03      4    16  n[101-104]

What do you think?

Best regards,

Didier
