On 6/8/22 10:43 am, David Magda wrote:

It seems that the new srun(1) cannot talk to the old slurmd(8).

Is this 'on purpose'? Does the backwards compatibility of the protocol not 
extend to srun(1)?

That's expected; what you're hoping for here is forward compatibility.

Newer daemons know how to talk to older utilities, but it doesn't work the other way around.
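As a rough illustration of that rule, here is a minimal shell sketch that compares the release of a client command against the release of a daemon. The function names and the version strings are hypothetical, and the check is deliberately simplified: it only encodes "the client must not be newer than the daemon" and ignores any limit on how many older releases a daemon accepts.

```shell
#!/bin/sh
# Hypothetical helper: turn "22.05.2" into 2205 (release only, patch level ignored).
release_num() {
    echo "$1" | awk -F. '{ printf "%d%02d\n", $1, $2 }'
}

# Hypothetical helper: succeeds when the client release is not newer than the
# daemon release. This is a simplification of the rule described above.
client_ok() {
    [ "$(release_num "$1")" -le "$(release_num "$2")" ]
}

# Example values only: a newer srun against an older slurmd.
if client_ok "22.05.2" "21.08.8"; then
    echo "client should be able to talk to this daemon"
else
    echo "client is newer than the daemon; expect it to fail"
fi
```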

What we do in this situation is upgrade slurmdbd, then slurmctld, and then change our compute node images to ones that have the new Slurm version. Then, before we bring partitions back up, we issue an "scontrol reboot ASAP nextstate=resume" for all the compute nodes.

This means existing jobs will keep going, but from that point on no new jobs will start on compute nodes with older versions of Slurm. As jobs on nodes finish, the nodes get rebooted into the new images and will accept jobs again. (The "ASAP" flag drains the node; once the node has successfully started its slurmd as the final thing on boot, it undrains. slurmctld is also smart about planning its scheduling for this situation.)
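The reboot step described above could be wrapped in something like the following. This is only a sketch: the node list is a made-up example, the run helper is hypothetical, and by default it just prints the command rather than executing it. It assumes slurmdbd and slurmctld have already been upgraded and the node images already updated.

```shell
#!/bin/sh
# Dry run by default; set DRY_RUN=0 to actually execute the command.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical node list; substitute your own compute nodes.
NODES="${NODES:-node[001-100]}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Reboot each node once its running jobs finish, then return it to service.
run scontrol reboot ASAP nextstate=resume "$NODES"
```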

It's also safe to restart slurmds with running jobs, though you may want to drain the nodes first so slurmctld won't try to send them a job in the middle of the restart.
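A drain, restart, resume cycle for a single node might look roughly like this. It is a sketch under assumptions: the node name and reason text are made up, the run helper is hypothetical, and restarting slurmd through systemctl assumes a systemd-managed install with that step run on the compute node itself. It prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Dry run by default; set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical node name.
NODE="${NODE:-node001}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop slurmctld from sending new jobs to the node; running jobs continue.
run scontrol update NodeName="$NODE" State=DRAIN Reason="slurmd restart"

# Restart the daemon (assumes systemd; run this on the compute node).
run systemctl restart slurmd

# Put the node back into service.
run scontrol update NodeName="$NODE" State=RESUME
```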

The one issue where backwards compatibility in the Slurm protocol can't help is incompatible config file changes. In that case you need to bite the bullet and upgrade the slurmds and commands at the same time everywhere the new config file goes (and for those of us running in configless mode, that means everywhere).

Hope this helps! All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

