Re: [slurm-users] srun: error: io_init_msg_unpack: unpack error

Chris Samuel Sat, 06 Aug 2022 12:14:46 -0700

On 6/8/22 10:43 am, David Magda wrote:

It seems that the new srun(1) cannot talk to the old slurmd(8).


Is this 'on purpose'? Does the backwards compatibility of the protocol not 
extend to srun(1)?


That's expected; what you're hoping for here is forward compatibility, i.e. an older slurmd understanding a newer srun.

Newer daemons know how to talk to older utilities, but it doesn't work the other way around.
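
As a rough illustration of that rule, here is a minimal Python sketch. The function names are made up for this example and the version strings are only samples. It models just the direction of compatibility described above; it does not model how many releases back a daemon will actually accept, which you should check in the Slurm upgrade documentation.

```python
from __future__ import annotations


def parse_release(version: str) -> tuple[int, int]:
    """Return the release pair from a version string, e.g. '22.05.2' -> (22, 5)."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def receiver_understands(receiver: str, sender: str) -> bool:
    """True if the receiving side is the same release as the sender or newer.

    Hypothetical helper: it encodes only "newer can read older", nothing else.
    """
    return parse_release(receiver) >= parse_release(sender)


# A newer slurmd receiving from an older srun is fine ...
print(receiver_understands("22.05.2", "21.08.8"))
# ... an older slurmd receiving from a newer srun is not.
print(receiver_understands("21.08.8", "22.05.2"))
```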

What we do in this situation is upgrade slurmdbd, then slurmctld, change our images for compute nodes to be ones that have the new Slurm version, then before we bring partitions back up we issue an "scontrol reboot ASAP nextstate=resume" for all the compute nodes.
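
If you want to script that last step, something like the following Python sketch would do it. The helper names are invented for this example, and it defaults to a dry run that only prints the command. The node list you pass in is site specific; "ALL" is used in the usage line purely as an example.

```python
from __future__ import annotations

import shlex
import subprocess


def reboot_asap_command(nodes: str) -> list[str]:
    """Build the scontrol call quoted above for a node list (or 'ALL')."""
    return ["scontrol", "reboot", "ASAP", "nextstate=resume", nodes]


def run(cmd: list[str], dry_run: bool = True) -> str:
    """Print the command line; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    print(line)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


# Dry run: shows what would be issued once slurmdbd and slurmctld are upgraded.
run(reboot_asap_command("ALL"))
```

Keeping the dry run as the default is deliberate, since a reboot request for the wrong node list is not something you want to discover after the fact.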

This means existing jobs will keep going but no new jobs will start on compute nodes with older versions of Slurm from that point on. As jobs on nodes finish they'll get rebooted into the new images and will accept jobs again (the "ASAP" flag drains the node, then once it's successfully started its slurmd as the final thing on boot it'll undrain at that point - and also slurmctld is smart with planning its scheduling for this situation).
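
To see how far the rollout has got, one option is to list which nodes are still running an old slurmd. The sketch below assumes you feed it the output of something like `sinfo -h -N -o "%N %v"` (node name and slurmd version, one pair per line); that exact invocation and the helper name are my assumptions, so check the sinfo man page for your version before relying on it.

```python
from __future__ import annotations


def nodes_not_on(sinfo_output: str, target_version: str) -> list[str]:
    """Return sorted node names whose reported slurmd version differs from the target.

    Expects lines of the form '<node> <version>'. Other lines are skipped, and
    nodes listed more than once (e.g. once per partition) are reported once.
    """
    stale = set()
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        node, version = fields
        if version != target_version:
            stale.add(node)
    return sorted(stale)
```

When the returned list is empty, every node that reported a version has rebooted into the new image.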

It's also safe to restart slurmd's with running jobs, though you may want to drain them before that so slurmctld won't try and send them a job in the middle.
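
For the drain-then-restart route, the steps could be laid out as in this Python sketch. It only builds the command lists and runs nothing. It assumes slurmd is managed by systemd and that you reach the node over ssh, neither of which is stated above, and the function name and default reason text are made up for the example.

```python
from __future__ import annotations


def restart_slurmd_plan(node: str, reason: str = "slurmd upgrade") -> list[list[str]]:
    """Return the commands, in order, to drain a node, restart slurmd and resume it."""
    return [
        # Stop slurmctld sending new work; running jobs carry on.
        ["scontrol", "update", f"nodename={node}", "state=drain", f"reason={reason}"],
        # Assumption: systemd unit named slurmd, reached via ssh.
        ["ssh", node, "systemctl", "restart", "slurmd"],
        # Put the node back into service.
        ["scontrol", "update", f"nodename={node}", "state=resume"],
    ]
```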

The one issue you can get where backwards compatibility in the Slurm protocol can't help is if there are incompatible config file changes needed. Then you need to bite the bullet and upgrade the slurmd's and commands at the same time everywhere the new config file goes (and for those of us running in configless mode that means everywhere).


Hope this helps! All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
