Hello friends,

We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and
one backup controller.  We are observing that when the backup takes over,
AllocNodes is, for some reason, set to “none” on all of our partitions.  We can
remedy this by manually setting AllocNodes=ALL on each partition, but that is
not feasible in production, since any jobs submitted just before the takeover
still fail before the partitions can be updated by hand.  For reference, the
backup controller comes up with the correct configuration if it is restarted
AFTER the primary is taken down, so the issue seems isolated to the takeover
flow.
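
In case it helps illustrate what we're doing today, below is a rough sketch
(in Python, just shelling out to scontrol) of the manual workaround we apply
after a takeover.  It assumes scontrol is on PATH and is run as a user with
admin rights; it only papers over the symptom, of course.

    import subprocess

    def list_partitions():
        # "scontrol -o show partitions" prints one line per partition,
        # each containing a "PartitionName=<name>" field.
        out = subprocess.run(
            ["scontrol", "-o", "show", "partitions"],
            check=True, capture_output=True, text=True,
        ).stdout
        names = []
        for line in out.splitlines():
            for field in line.split():
                if field.startswith("PartitionName="):
                    names.append(field.split("=", 1)[1])
        return names

    def reset_allocnodes():
        # Put AllocNodes back to ALL on every partition after the
        # backup controller has taken over.
        for part in list_partitions():
            subprocess.run(
                ["scontrol", "update", f"PartitionName={part}",
                 "AllocNodes=ALL"],
                check=True,
            )

    if __name__ == "__main__":
        reset_allocnodes()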

Has anyone seen this issue before?  Or does anyone have hints on how I can
debug it?

Thanks in advance!

Dave
