It seems the plane distribution does not work when plane_size > 1.
The job has 64 tasks on 2 nodes with plane_size = 32 (srun -m plane=32).
I expected 32 tasks on each node, but instead I got
63 tasks on one node and 1 task on the other node.
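
To make the expectation concrete, here is a minimal sketch of the layout I would expect from a plane distribution: tasks are grouped into blocks of plane_size, and the blocks are dealt out to the nodes in round-robin order. The function name plane_distribution and the node indices 0/1 are only illustrative (they are not part of Slurm), and the sketch deliberately ignores how many CPUs were actually allocated on each node.

```python
from collections import defaultdict


def plane_distribution(ntasks, nnodes, plane_size):
    """Idealized plane layout: blocks of plane_size tasks dealt round-robin to nodes.

    Hypothetical helper for illustration only; it does not model the
    per-node CPU counts that Slurm takes into account.
    """
    layout = defaultdict(list)
    for task in range(ntasks):
        node = (task // plane_size) % nnodes  # block index, wrapped over the nodes
        layout[node].append(task)
    return dict(layout)


# The case from this report: 64 tasks, 2 nodes, plane=32.
expected = plane_distribution(64, 2, 32)
print({node: len(tasks) for node, tasks in expected.items()})  # {0: 32, 1: 32}
```

Under this model plane=32 would put tasks 0-31 on the first node and 32-63 on the second, and plane=1 would alternate tasks between the two nodes.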
-------------------------------------------------
Slurm Version:
slurm 21.08.8-2
------------------------------------------------
SelectType:
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
-------------------------------------------------
CMD: srun -v -N 2 -n 64 -m plane=32 -p amd_512 sleep 5
srun: defined options
srun: -------------------- --------------------
srun: distribution : plane=32
srun: nodes : 2
srun: ntasks : 64
srun: partition : amd_512
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: job 774895 queued and waiting for resources
srun: job 774895 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes g1008,h0107 are ready for job
srun: jobid 774895: nodes(2):`g1008,h0107', cpu counts: 63(x1),1(x1) # <------------ strange cpu counts
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=774895.0 on host g1008, 63 tasks: [0-31,33-63]
srun: launching StepId=774895.0 on host h0107, 1 tasks: 32
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node h0107, 1 tasks started
srun: launch/slurm: _task_start: Node g1008, 63 tasks started
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=774895.0 (status=0x0000).
srun: launch/slurm: _task_finish: h0107: task 32: Completed
srun: launch/slurm: _setup_max_wait_timer: First task exited. Terminating job in 60s
srun: launch/slurm: _task_finish: Received task exit notification for 63 tasks of StepId=774895.0 (status=0x0000).
srun: launch/slurm: _task_finish: g1008: tasks 0-31,33-63: Completed
---------------------------------------------------------------
Then I tried plane_size = 1 and got 32 tasks on each node, although the reported cpu counts are still 63(x1),1(x1).
CMD: srun -v -N 2 -n 64 -m plane=1 -p amd_512 sleep 5
srun: defined options
srun: -------------------- --------------------
srun: distribution : plane=1
srun: nodes : 2
srun: ntasks : 64
srun: partition : amd_512
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: job 774896 queued and waiting for resources
srun: job 774896 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes g1008,h0107 are ready for job
srun: jobid 774896: nodes(2):`g1008,h0107', cpu counts: 63(x1),1(x1)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=774896.0 on host g1008, 32 tasks: [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62]
srun: launching StepId=774896.0 on host h0107, 32 tasks: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63]
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node h0107, 32 tasks started
srun: launch/slurm: _task_start: Node g1008, 32 tasks started
srun: launch/slurm: _task_finish: Received task exit notification for 32 tasks of StepId=774896.0 (status=0x0000).
srun: launch/slurm: _task_finish: g1008: tasks 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62: Completed
srun: launch/slurm: _setup_max_wait_timer: First task exited. Terminating job in 60s
srun: launch/slurm: _task_finish: Received task exit notification for 32 tasks of StepId=774896.0 (status=0x0000).
srun: launch/slurm: _task_finish: h0107: tasks 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63: Completed
--
Thanks
毛登峰 Mao, Dengfeng