Just ran into an issue while removing old nodes that I had in a partition. We use a job_submit.lua script that submits every job to every eligible partition:
local submit_part = ""

if asc_job_partition == "" then
    asc_job_partition = "dmc,uv"
end

if string.find(asc_job_partition, 'dmc') then
    submit_part = submit_part .. "dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,"
end

if string.find(asc_job_partition, 'uv') then
    submit_part = submit_part .. "uv,"
end

if string.find(asc_job_partition, 'knl') then
    submit_part = submit_part .. "knl,"
end

--[[ Strip the last comma off the string and write the new partition list ]]--
job_desc.partition = string.sub(submit_part, 1, -2)

I previously had gpu_fermi and gpu_tesla in that string. All currently running jobs had been submitted with those partitions in their partition lists, although all of the nodes in the gpu_fermi and gpu_tesla partitions were drained.

I went ahead and removed that line from my job_submit.lua, removed the partitions and nodes from slurm.conf, and then ran "scontrol reconfigure". A minute or two later I noticed that all the jobs had disappeared: nothing pending, nothing running. The logs showed:

error: Invalid partition (gpu_fermi) for job 70496

and that SLURM had sent a terminate command to all of the running jobs because of this.

So, a cautionary tale for all. In the future I will edit my job_submit.lua script first and wait for all the jobs that have run through it to finish before removing partitions.

My question for the group is: other than the above-mentioned method, is there something I could have done differently to prevent SLURM from killing jobs when removing partitions?

Thanks!

--
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority
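P.S. For anyone in a similar spot, one sanity check before deleting a partition is to grep squeue output for jobs whose partition list still references it. This is only a sketch: the job id (70496) and partition names come from my case above, the piped-in sample stands in for live squeue output, and note that "%P" may show only the partition a running job landed in rather than its full submit-time list, depending on version and job state.

```shell
# Look for jobs that still reference gpu_fermi in their partition list.
# Against a live cluster this would be:
#   squeue --noheader --format="%i %P" | grep -w gpu_fermi
# Here, sample output is piped in instead, since no cluster is assumed.
printf '70496 dmc-ivy-bridge,gpu_fermi\n70497 uv\n' | grep -w gpu_fermi
```

An empty result (grep exits non-zero) would suggest it is safe to drop the partition; any hit means a job would be invalidated by the removal.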