Hi,

I'm investigating some job issues and would like to figure out the exact
CPU distribution of a job from the accounting info. Right now SLURM does
not offer anything like the "exec_host" field in Torque, which makes this
difficult. The best I can do is guess from the NodeList, AllocCPUS, and
Layout fields, but after some testing I find this approach extremely
unreliable. For example, on a shared cluster, if I acquire the resources
with "--ntasks=13", I get 8 processes running on node 0 and 5 on node 1.
However, the Layout of the job is recorded as "Cyclic" rather than
"Block" as I would imagine. If the nodes are partially used, the tasks
may even end up distributed as 3/5/5 across three nodes, so I have no
idea how many processes were actually launched on any given node.
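
For reference, the Torque field I have in mind looks roughly like this
in "qstat -f" output (host names are made up), listing the exact
node/core pairs the job ran on:

  exec_host = n0001/0+n0001/1+...+n0001/7+n0002/0+...+n0002/4

whereas on the SLURM side all I can pull from accounting is something
along the lines of (the job ID is just a placeholder):

  sacct -j 12345 --format=JobID,NodeList,AllocCPUS,Layout

and that is where the guesswork starts.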

So my question is: how can one recreate the CPU distribution of a job
from the accounting info? This would be extremely useful for debugging a
job in a shared environment after something bad has happened. If there
is no way to do it under the current framework, would it be possible to
add this as an extra field in the accounting info?

Thanks,

Yong Qin
