Re: [slurm-users] srun: error: io_init_msg_unpack: unpack error
On 6/8/22 10:43 am, David Magda wrote:

> It seems that the new srun(1) cannot talk to the old slurmd(8). Is this
> 'on purpose'? Does the backwards compatibility of the protocol not
> extend to srun(1)?

That's expected; what you're hoping for here is forward compatibility. Newer daemons know how to talk to older utilities, but it doesn't work the other way around.

What we do in this situation is upgrade slurmdbd, then slurmctld, change our images for compute nodes to ones that have the new Slurm version, then, before we bring partitions back up, issue an "scontrol reboot ASAP nextstate=resume" for all the compute nodes. This means existing jobs will keep going, but no new jobs will start on compute nodes with older versions of Slurm from that point on. As jobs on a node finish, the node gets rebooted into the new image and will accept jobs again. (The "ASAP" flag drains the node; once it has successfully started its slurmd as the final step on boot, it undrains at that point - and slurmctld is also smart about planning its scheduling for this situation.)

It's also safe to restart slurmd's with running jobs, though you may want to drain the nodes first so slurmctld won't try to send them a job in the middle.

The one case where backwards compatibility in the Slurm protocol can't help is when incompatible config file changes are needed; then you need to bite the bullet and upgrade the slurmd's and commands at the same time everywhere the new config file goes (and for those of us running in configless mode, that means everywhere).

Hope this helps!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
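[For reference, a minimal dry-run sketch of the rolling-upgrade sequence described above. The node range and use of systemctl are assumptions about a typical deployment; with DRY_RUN=1 (the default) it only prints the commands it would issue, so nothing here touches a real cluster.]

```shell
#!/bin/sh
# Rolling-upgrade sketch: upgrade controller daemons first, then queue an
# "ASAP" reboot for every compute node so each one drains, reboots into
# the new image, and resumes on its own once its new slurmd is up.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Upgrade daemons on the controller first - slurmdbd before slurmctld.
run systemctl restart slurmdbd
run systemctl restart slurmctld

# 2. Ask every compute node (hypothetical node range) to reboot into the
#    new image as soon as it is idle; running jobs finish first because
#    "ASAP" drains the node before rebooting it.
run scontrol reboot ASAP nextstate=resume "node[001-200]"
```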
[slurm-users] srun: error: io_init_msg_unpack: unpack error
Hello,

We are testing the upgrade process of going from 20.11.9 to 22.05.2. The master server is running the 22.05.2 slurmctld/slurmdbd, and the compute nodes are (currently) running the 20.11.9 slurmd. We are running this 'mixed environment' because our production cluster has a reasonable number of nodes (~200), so it will take a while to get through them all.

Back to our smaller (test) cluster: things are generally working fine, in that jobs are scheduled, launched, and finish cleanly. The main issue we're experiencing is with srun(1). Executing the "new" binary generates the following output:

$ /opt/slurm-22.05.2/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939765 queued and waiting for resources
srun: job 1939765 has been allocated resources
srun: error: io_init_msg_unpack: unpack error
srun: error: io_init_msg_read_from_fd: io_init_msg_unpack failed: rc=-1
srun: error: failed reading io init message

If I SSH into the host manually (as root), I do see a shell session for my user running bash.

Running the "old" binary works:

$ /opt/slurm-20.11.9b/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939768 queued and waiting for resources
srun: job 1939768 has been allocated resources
dmagda@wsgpu11:~$

It seems that the new srun(1) cannot talk to the old slurmd(8). Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?

Is there any way around this, or should we simply upgrade slurmd(8) on the worker nodes, but leave the paths to the older user CLI utilities alone until all the compute nodes have been upgraded?

Thanks for any info.

Regards,
David
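[A runnable sketch of the PATH-pinning workaround proposed above: keep users resolving the old CLI tree while slurmd is upgraded node by node, then flip to the new tree at the end. The /opt/slurm-* layout mirrors the paths in the message; the temp-dir demo with stub srun scripts is purely illustrative so it can run anywhere.]

```shell
#!/bin/sh
# PATH-pinning sketch: users keep the old CLI tree first in PATH until
# every compute node runs the new slurmd, then switch to the new tree.
# A temp dir stands in for /opt so the demo is self-contained.
set -eu

demo=$(mktemp -d)
mkdir -p "$demo/slurm-20.11.9b/bin" "$demo/slurm-22.05.2/bin"
printf '#!/bin/sh\necho old srun\n' > "$demo/slurm-20.11.9b/bin/srun"
printf '#!/bin/sh\necho new srun\n' > "$demo/slurm-22.05.2/bin/srun"
chmod +x "$demo"/slurm-*/bin/srun

# During the rolling upgrade: old tree first, so 'srun' is the old binary.
PATH="$demo/slurm-20.11.9b/bin:$PATH"
command -v srun

# After the last node is upgraded: prepend the new tree to flip users over.
PATH="$demo/slurm-22.05.2/bin:$PATH"
command -v srun
```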