Re: [slurm-users] srun: error: io_init_msg_unpack: unpack error

2022-08-06 Thread Chris Samuel

On 6/8/22 10:43 am, David Magda wrote:


It seems that the new srun(1) cannot talk to the old slurmd(8).

Is this 'on purpose'? Does the backwards compatibility of the protocol not 
extend to srun(1)?


That's expected; what you're hoping for here is forward compatibility.

Newer daemons know how to talk to older utilities, but it doesn't work 
the other way around.


What we do in this situation is upgrade slurmdbd, then slurmctld, and 
change our compute node images to ones that have the new Slurm version. 
Then, before we bring partitions back up, we issue an "scontrol reboot 
ASAP nextstate=resume" for all the compute nodes.
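
For reference, that last step is a single scontrol call; a rough sketch 
(the node list is just a placeholder for whatever fits your site):

# reboot each node once its running jobs finish, then return it to service
$ scontrol reboot ASAP nextstate=resume ALL
# or target a specific (hypothetical) node range instead of ALL:
$ scontrol reboot ASAP nextstate=resume node[001-200]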


This means existing jobs will keep going, but no new jobs will start on 
compute nodes running the older Slurm version from that point on. As jobs 
on a node finish, it gets rebooted into the new image and starts accepting 
jobs again (the "ASAP" flag drains the node; once it has successfully 
started its slurmd as the final step on boot, it undrains at that point - 
and slurmctld is also smart about planning its scheduling around this 
situation).


It's also safe to restart slurmd's with running jobs, though you may 
want to drain the nodes first so slurmctld won't try to send them a 
job in the middle.
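
A rough sketch of that, assuming a systemd-managed slurmd and a made-up 
node range:

# drain so slurmctld won't hand the node a new job mid-restart
$ scontrol update nodename=node[001-010] state=drain reason="slurmd upgrade"
# on the node itself, restart the (now upgraded) slurmd
$ systemctl restart slurmd
# put the node back in service once slurmd is up
$ scontrol update nodename=node[001-010] state=resume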


The one issue where backwards compatibility in the Slurm protocol can't 
help is when incompatible config file changes are needed; then you need 
to bite the bullet and upgrade the slurmd's and commands at the same time 
everywhere the new config file goes (and for those of us running in 
configless mode that means everywhere).
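
As an aside, sinfo can report which slurmd version each node is running, 
which is handy for keeping track of a rolling upgrade like this; a sketch, 
assuming sinfo's %v format field (which should print the slurmd version):

$ sinfo -N -o "%N %v"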


Hope this helps! All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] srun: error: io_init_msg_unpack: unpack error

2022-08-06 Thread David Magda
Hello,

We are testing the upgrade process of going from 20.11.9 to 22.05.2. The 
master server is running 22.05.2 slurmctld/slurmdbd, and the compute nodes are 
(currently) running slurmd 20.11.9. We are running this 'mixed environment' 
because our production cluster has a reasonable number of nodes (~200), so it 
will take a while to get through them all.

Back to our smaller (test) cluster: things are generally working fine in that 
jobs are scheduled, launched, and finish cleanly.

The main issue we're experiencing is with srun(1). If you execute the "new" 
binary, the following output is generated:

$ /opt/slurm-22.05.2/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939765 queued and waiting for resources
srun: job 1939765 has been allocated resources
srun: error: io_init_msg_unpack: unpack error
srun: error: io_init_msg_read_from_fd: io_init_msg_unpack failed: rc=-1
srun: error: failed reading io init message

If I SSH into the host manually (as root), I do see a shell session for my user 
running bash.  Running the "old" binary:

$ /opt/slurm-20.11.9b/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939768 queued and waiting for resources
srun: job 1939768 has been allocated resources
dmagda@wsgpu11:~$ 

It seems that the new srun(1) cannot talk to the old slurmd(8).

Is this 'on purpose'? Does the backwards compatibility of the protocol not 
extend to srun(1)?

Is there any way around this, or should we simply upgrade slurmd(8) on the 
worker nodes, but leave the paths to the older user CLI utilities alone until 
all the compute nodes have been upgraded?

Thanks for any info.

Regards,
David