On 30-05-2022 19:34, Chris Samuel wrote:
> On 30/5/22 10:06 am, Chris Samuel wrote:
>> If you switch that symlink those jobs will pick up the 20.11 srun
>> binary and that's where you may come unstuck.
> Just to quickly fix that, srun talks to slurmctld (which would also be
> 20.11 for you), slurmctld will talk to the slurmd's running the job
> (which would be 19.05, so OK) but then the slurmd would try and launch a
> 20.11 slurmstepd and that is where I suspect things could come undone.
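Before flipping the symlink it may help to confirm which version each component in that chain actually reports. A hedged sketch (it only prints the commands as a dry run; sinfo's '%v' format field for the per-node slurmd version is an assumption about the releases involved):

```shell
#!/bin/sh
# Dry-run sketch: print the commands that report each component's version.
# Assumption: sinfo's '%v' field (slurmd version per node) is available
# on your releases; adjust if your sinfo lacks it.
version_checks() {
    echo "srun --version"              # client / login-node side
    echo "scontrol version"            # what slurmctld side is running
    echo "sinfo -h -o '%n %v'"         # slurmd version as each node reports it
}
version_checks
```

Running the printed commands before and after the switch shows exactly which piece of the srun -> slurmctld -> slurmd -> slurmstepd chain has moved to 20.11.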
How about restarting all slurmd's at version 20.11 in one shot? No
reboot is required. The already running job steps will keep their 19.05
slurmstepd's even though slurmd is now at 20.11. You could perhaps
restart the 20.11 slurmd one partition at a time, starting with a small
partition, to verify that it works correctly before touching the rest
of the cluster.
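That partition-at-a-time rollout could be scripted along these lines. A hedged sketch: it only prints the command rather than running it, and pdsh, systemd-managed slurmd units, and pdsh accepting Slurm-style hostlist expressions are all assumptions about your setup:

```shell
#!/bin/sh
# Build (but do not execute) the restart command for one partition's nodes.
# The nodelist would normally come from: sinfo -p <partition> -h -o '%N'
rolling_restart_cmd() {
    nodelist="$1"
    # Assumption: pdsh is installed and understands Slurm-style hostlists.
    echo "pdsh -w $nodelist systemctl restart slurmd"
}

# Example with a hypothetical partition nodelist:
rolling_restart_cmd "test[001-016]"
```

Watching the small partition's jobs for a while after the restart, before repeating on the larger partitions, keeps the blast radius limited if a 19.05 slurmstepd does react badly to the new slurmd.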
I think we have done this successfully when installing new RPMs on *all*
compute nodes in one shot, and I'm not aware of any job crashes. Your
mileage may vary depending on the job types!
Question: Does anyone have bad experiences with upgrading slurmd while
the cluster is running production?
/Ole