It seems that the plane distribution does not work when plane_size > 1.

The job has 64 tasks on 2 nodes with plane_size = 32.

I expected 32 tasks on each node, but I got 63 tasks on one node and 1 task on the other.
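
For reference, the expected layout can be written down as a short Python sketch. It assumes the documented plane behaviour (tasks are grouped into blocks of plane_size and the blocks are handed to the nodes round-robin) and ignores how many CPUs were allocated on each node. `plane_layout` is a hypothetical helper for illustration, not a Slurm function.

```python
def plane_layout(ntasks, nnodes, plane_size):
    """Assign task IDs to nodes in blocks of plane_size, round-robin over nodes."""
    layout = [[] for _ in range(nnodes)]
    for task in range(ntasks):
        block = task // plane_size   # index of the block this task falls into
        node = block % nnodes        # blocks are dealt to the nodes cyclically
        layout[node].append(task)
    return layout


# Task counts per node for the two plane sizes used in this report.
for size in (32, 1):
    counts = [len(tasks) for tasks in plane_layout(64, 2, size)]
    print(f"plane={size}: tasks per node = {counts}")
```

Under these assumptions plane=32 would put tasks 0-31 on the first node and tasks 32-63 on the second, which is not what the run below shows.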

-------------------------------------------------

Slurm Version:
slurm 21.08.8-2

------------------------------------------------

SelectType:

SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY

-------------------------------------------------

CMD:   srun -v -N 2 -n 64  -m plane=32 -p amd_512  sleep 5

srun: defined options
srun: -------------------- --------------------
srun: distribution        : plane=32
srun: nodes               : 2
srun: ntasks              : 64
srun: partition           : amd_512
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: job 774895 queued and waiting for resources
srun: job 774895 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes g1008,h0107 are ready for job
srun: jobid 774895: nodes(2):`g1008,h0107', cpu counts: 63(x1),1(x1)         # <------------ strange cpu counts
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=774895.0 on host g1008, 63 tasks: [0-31,33-63]
srun: launching StepId=774895.0 on host h0107, 1 tasks: 32
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node h0107, 1 tasks started
srun: launch/slurm: _task_start: Node g1008, 63 tasks started
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=774895.0 (status=0x0000).
srun: launch/slurm: _task_finish: h0107: task 32: Completed
srun: launch/slurm: _setup_max_wait_timer: First task exited. Terminating job in 60s
srun: launch/slurm: _task_finish: Received task exit notification for 63 tasks of StepId=774895.0 (status=0x0000).
srun: launch/slurm: _task_finish: g1008: tasks 0-31,33-63: Completed
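
To compare runs without counting task IDs by hand, the `launching StepId=...` lines can be tallied with a small Python sketch. The regular expression assumes the verbose format shown in the output above, and `expand_ids` and `tasks_per_host` are hypothetical helper names.

```python
import re

# Matches e.g. "on host g1008, 63 tasks: [0-31,33-63]" or "on host h0107, 1 tasks: 32".
LAUNCH_RE = re.compile(r"on host (\S+), (\d+) tasks?: \[?([\d,\-]+)\]?")


def expand_ids(spec):
    """Expand a task list such as '0-31,33-63' into individual task IDs."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def tasks_per_host(log_text):
    """Map each host name to the task IDs that srun reported launching on it."""
    return {host: expand_ids(spec) for host, _count, spec in LAUNCH_RE.findall(log_text)}


sample = """\
srun: launching StepId=774895.0 on host g1008, 63 tasks: [0-31,33-63]
srun: launching StepId=774895.0 on host h0107, 1 tasks: 32
"""

for host, ids in tasks_per_host(sample).items():
    print(host, len(ids))
```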

---------------------------------------------------------------

Then I tried plane_size = 1 and got 32 tasks on each node.

CMD: srun -v -N 2 -n 64  -m plane=1 -p amd_512  sleep 5
srun: defined options
srun: -------------------- --------------------
srun: distribution        : plane=1
srun: nodes               : 2
srun: ntasks              : 64
srun: partition           : amd_512
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: job 774896 queued and waiting for resources
srun: job 774896 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes g1008,h0107 are ready for job
srun: jobid 774896: nodes(2):`g1008,h0107', cpu counts: 63(x1),1(x1)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=774896.0 on host g1008, 32 tasks: [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62]
srun: launching StepId=774896.0 on host h0107, 32 tasks: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63]
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node h0107, 32 tasks started
srun: launch/slurm: _task_start: Node g1008, 32 tasks started
srun: launch/slurm: _task_finish: Received task exit notification for 32 tasks of StepId=774896.0 (status=0x0000).
srun: launch/slurm: _task_finish: g1008: tasks 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62: Completed
srun: launch/slurm: _setup_max_wait_timer: First task exited. Terminating job in 60s
srun: launch/slurm: _task_finish: Received task exit notification for 32 tasks of StepId=774896.0 (status=0x0000).
srun: launch/slurm: _task_finish: h0107: tasks 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63: Completed

--
Thanks
毛登峰 Mao, Dengfeng

