On Tue, Sep 2, 2014 at 12:53 AM, Всеволод Никоноров <[email protected]> wrote:
> Thank you very much, but what I am searching for is not exactly the
> upgrade procedure; I am rather trying to understand what happened in my
> environment and how to avoid such problems in the future. We are testing
> two installations of Slurm on adjacent nodes, so that users who test the
> new version can have all the network-mounted filesystems (NFS, Lustre)
> from the main installation. It seems that slurmctld 2.5.7 addressed a
> node running slurmctld 14.11 and slurmd 14.11 simultaneously, and then
> some of the nodes controlled by slurmctld 2.5.7 got confused and lost
> jobs.

If you read the first paragraph on the page Moe linked you to carefully,
it explains the problem: 2.5.7 can only interact with 2.6.x and 14.03.
To upgrade to 14.11, you'll need to upgrade to a version somewhere in
between first.

You also intermix references to slurmctld (which runs on the master) with
slurmd (which runs on the nodes), so it's not clear you followed the
proper upgrade procedure in terms of order of operations; this is also
documented on the page Moe provided.

Any time you're upgrading more than a single major version step, we
strongly recommend you try the upgrade on a non-production system first.
This will help you avoid any surprises (like job loss) during the actual
upgrade.

HTH,
Michael

--
Michael Jennings <[email protected]>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E          W: 510-495-2687
MS 050B-3209            F: 510-486-8615
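P.S. If it helps to see the constraint concretely, here is a rough
Python sketch of the documented rule (the release list and function
name are my own illustration, not anything that ships with Slurm): a
daemon can talk to daemons at most two major releases behind it, and
slurmd should never be newer than the slurmctld managing it.

# Releases are ordered by name rather than compared numerically,
# since the scheme changed from 2.x to year-based (14.03, 14.11).
RELEASE_ORDER = ["2.4", "2.5", "2.6", "14.03", "14.11"]

def rpc_compatible(ctld_release: str, d_release: str) -> bool:
    """True if a slurmctld at ctld_release can talk to a slurmd at
    d_release: slurmd may lag by at most two major releases and
    must never be newer than slurmctld."""
    i = RELEASE_ORDER.index(ctld_release)
    j = RELEASE_ORDER.index(d_release)
    return 0 <= i - j <= 2

assert rpc_compatible("14.11", "14.03")    # one release apart: fine
assert not rpc_compatible("2.5", "14.11")  # slurmd newer than slurmctld
assert not rpc_compatible("14.11", "2.5")  # three releases apart

2.5 and 14.11 are three steps apart on that list, in both directions,
which is why the two controllers addressing each other's nodes ended
in confusion and lost jobs.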
