So I’ve found some more info on this. It seems like the primary controller is
writing “none” as the AllocNodes value in the partition state file when it
shuts down. It does this even with the backup out of the picture, and it still
happens when I swap the primary and backup controller nodes in the config.

When the primary starts up, it ignores these “none” values and sets
AllocNodes=ALL on all partitions (what we want), but when the backup starts
up, it “honors” the “none” values and every partition ends up with
AllocNodes=none. Again, the slurm.conf on both nodes is the same, and this
happens even when the primary/backup roles of the nodes are swapped. I am
digging through the source to try to find some hints.
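
In case it helps anyone reproduce this, a rough way to peek at what the
primary wrote at shutdown, without reading the source, is to run strings
against the partition state file in the state save directory (the path below
is only an assumption; use whatever your StateSaveLocation points at, where
the partition state should live in a file named part_state):

    # dump readable strings from the binary partition state file
    # path is an assumption; substitute your own StateSaveLocation
    sudo strings /var/spool/slurmctld/part_state | less

The partition names, and the “none” values if they really are being saved,
should be visible in that output.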

Does anyone have any ideas?

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Dave 
Sizer <dsi...@nvidia.com>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Tuesday, December 17, 2019 at 1:05 PM
To: Brian Andrus <toomuc...@gmail.com>, "slurm-us...@schedmd.com" 
<slurm-us...@schedmd.com>
Subject: Re: [slurm-users] Issues with HA config and AllocNodes


Thanks for the response.

I have confirmed that the slurm.conf files are the same and that
StateSaveLocation is being used successfully; we see logs like the following
on the backup controller:
Recovered state of 9 partitions
Recovered JobId=124 Assoc=6
Recovered JobId=125 Assoc=6
Recovered JobId=126 Assoc=6
Recovered JobId=127 Assoc=6
Recovered JobId=128 Assoc=6
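
For reference, this is roughly how we checked that (the hostnames and config
path below are placeholders/assumptions for our two controllers):

    # confirm the controller reports the expected StateSaveLocation
    scontrol show config | grep -i StateSaveLocation

    # confirm the two configs really are byte-for-byte identical
    diff <(ssh slurm-ctrl-01 cat /etc/slurm/slurm.conf) \
         <(ssh slurm-ctrl-02 cat /etc/slurm/slurm.conf)

An empty diff, plus a StateSaveLocation that sits on storage shared by both
controllers, is what we expect to see.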

I do see the following error when the backup takes control, but I'm not sure
it is related, since the backup continues to start up fine:

error: _shutdown_bu_thread:send/recv slurm-ctrl-02: Connection refused

We also see a lot of these messages on the backup while it is in standby mode,
but from what I’ve researched these may be unrelated as well:

error: Invalid RPC received 1002 while in standby mode

and similar messages with other RPC codes. We no longer see these once the 
backup controller has taken control.

I do agree that there is some issue with the saving/loading of partition state
during takeover; I'm just stumped on why it is happening and how to stop the
partitions from being loaded with AllocNodes=none.
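
If it helps anyone else dig into this, one thing we can do is run the backup
slurmctld in the foreground with extra verbosity during a takeover and watch
the partition state recovery messages (the "slurm" user below is an
assumption; use whatever account owns your slurmctld):

    # on the backup controller: -D keeps it in the foreground,
    # each -v raises the log verbosity
    sudo -u slurm slurmctld -D -vvv

That should show what the backup reads back for each partition around the
point where AllocNodes ends up as “none”.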



From: Brian Andrus <toomuc...@gmail.com>
Date: Tuesday, December 17, 2019 at 12:30 PM
To: Dave Sizer <dsi...@nvidia.com>
Subject: Re: [slurm-users] Issues with HA config and AllocNodes



Double-check that your slurm.conf files are the same and that both systems are
successfully using your state save directory.

Brian Andrus
On 12/17/2019 9:23 AM, Dave Sizer wrote:
Hello friends,

We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and
one backup controller. However, we are observing that when the backup takes
over, AllocNodes is for some reason set to “none” on all of our partitions.
We can remedy this by manually setting AllocNodes=ALL on each partition, but
that is not feasible in production, since any jobs launched just before the
takeover fail to submit before the partitions can be manually updated. For
reference, the backup controller has the correct config if it is restarted
AFTER the primary is taken down, so this issue seems isolated to the takeover
flow.
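
For completeness, the manual remedy is just a loop over scontrol (the
one-liner below is only a sketch; partition names come from sinfo):

    # reset AllocNodes on every partition after the backup takes over
    for p in $(sinfo -h -o '%R' | sort -u); do
        scontrol update PartitionName="$p" AllocNodes=ALL
    done

but running this by hand still leaves a window where submissions fail.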

Has anyone seen this issue before?  Or any hints for how I can debug this 
problem?

Thanks in advance!

Dave
