On Tue, Sep 2, 2014 at 12:53 AM, Всеволод Никоноров <[email protected]> wrote:
> Thank you very much, but what I am searching for is not exactly the
> upgrade procedure; I am rather trying to understand what happened in my
> environment and how to avoid such problems in the future. We are testing
> two installations of Slurm on adjacent nodes, so that users who test the
> new version can have all the network-mounted filesystems (NFS, Lustre)
> from the main installation. It seems that slurmctld 2.5.7 addressed a
> node running slurmctld 14.11 and slurmd 14.11 simultaneously, and then
> some of the nodes controlled by slurmctld 2.5.7 got confused and lost
> jobs.

If you read the first paragraph on the page Moe linked you to carefully,
it explains the problem: 2.5.7 can only interact with 2.6.x and 14.03.
To upgrade to 14.11, you'll need to upgrade to a version somewhere in
between first.

You also intermix references to slurmctld (which runs on the master) with
slurmd (which runs on the nodes), so it's not clear you followed the
proper upgrade procedure in terms of order of operations; this is also
documented on the page Moe provided.

Any time you're upgrading more than a single major version step, we
strongly recommend you try the upgrade on a non-production system first.
This will help you avoid any surprises (like job loss) during the actual
upgrade.

HTH,
Michael

--
Michael Jennings <[email protected]>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E          W: 510-495-2687
MS 050B-3209            F: 510-486-8615
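P.S. If it helps to see the constraint concretely, here is a rough
Python sketch of the documented rule (the release list and function
name are my own illustration, not anything that ships with Slurm): a
daemon can talk to daemons at most two major releases behind it, and
slurmd should never be newer than the slurmctld managing it.

# Releases are ordered by name rather than compared numerically,
# since the scheme changed from 2.x to year-based (14.03, 14.11).
RELEASE_ORDER = ["2.4", "2.5", "2.6", "14.03", "14.11"]

def rpc_compatible(ctld_release: str, d_release: str) -> bool:
    """True if a slurmctld at ctld_release can talk to a slurmd at
    d_release: slurmd may lag by at most two major releases and
    must never be newer than slurmctld."""
    i = RELEASE_ORDER.index(ctld_release)
    j = RELEASE_ORDER.index(d_release)
    return 0 <= i - j <= 2

assert rpc_compatible("14.11", "14.03")    # one release apart: fine
assert not rpc_compatible("2.5", "14.11")  # slurmd newer than slurmctld
assert not rpc_compatible("14.11", "2.5")  # three releases apart

2.5 and 14.11 are three steps apart on that list, in both directions,
which is why the two controllers addressing each other's nodes ended
in confusion and lost jobs.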
