Just ran into an issue while removing old nodes that I had in a
partition.  We use a job_submit.lua script that submits every job to
every partition:

local submit_part = ""
if asc_job_partition == "" then
  asc_job_partition = "dmc,uv"
end
if string.find( asc_job_partition , 'dmc' ) then
  submit_part = (submit_part .. "dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,")
end
if string.find( asc_job_partition , 'uv' ) then
  submit_part = (submit_part .. "uv,")
end
if string.find( asc_job_partition , 'knl' ) then
  submit_part = (submit_part .. "knl,")
end
--[[ Strips the last comma off the string and writes the new info ]]--
job_desc.partition = string.sub(submit_part, 1, -2)

I previously had gpu_fermi and gpu_tesla in that string.  All of the
currently running jobs had those partitions in their submit partition
list, even though every node in the gpu_fermi and gpu_tesla partitions
was drained.

I went ahead and removed that line from my job_submit.lua, removed the
partitions and their nodes from slurm.conf, and then ran an scontrol
reconfigure.

After a minute or two I noticed that all of the jobs had disappeared.
Nothing pending, nothing running.

The logs showed 'error: Invalid partition (gpu_fermi) for job 70496'
and that SLURM had sent a terminate command to all of the running jobs
because of this.

So, a cautionary tale for all.  In the future I think I will edit my
job_submit.lua script and wait for all of the jobs that have run
through it to finish before removing partitions.
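
One thing I am also considering, sketched below (untested), is to have
the script only append partitions that actually exist at submit time,
so a retired partition can never end up in job_desc.partition.  The
part_list handling is my assumption based on the sample job_submit.lua
that ships with Slurm, so take this as a sketch rather than working
code.

-- Untested sketch: map a requested partition group to real partitions,
-- dropping any name that no longer exists in slurm.conf.
local group_map = {
  dmc = "dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal",
  uv  = "uv",
  knl = "knl",
}

function slurm_job_submit(job_desc, part_list, submit_uid)
  -- job_desc.partition may be nil if the user did not request one.
  local asc_job_partition = job_desc.partition or ""
  if asc_job_partition == "" then
    asc_job_partition = "dmc,uv"
  end

  -- Build a set of partitions that exist right now.  (Assumes the keys
  -- of part_list are partition names, as in the sample job_submit.lua.)
  local existing = {}
  for name in pairs(part_list) do
    existing[name] = true
  end

  local keep = {}
  for group, parts in pairs(group_map) do
    if string.find( asc_job_partition , group ) then
      for p in string.gmatch(parts, "[^,]+") do
        if existing[p] then
          table.insert(keep, p)
        end
      end
    end
  end

  if #keep > 0 then
    job_desc.partition = table.concat(keep, ",")
  end
  return slurm.SUCCESS
end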

My question for the group is: other than the method mentioned above,
is there something I could have done differently to prevent SLURM from
killing jobs when removing partitions?

Thanks!

-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority
