You're welcome :)

If you aren't pleased with the timeouts, you may want to look at the SlurmctldTimeout in slurm.conf:

SlurmctldTimeout
The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533.
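
As a rough illustration of the setting described above (the value 30 is an arbitrary example, not a recommendation), shortening the takeover delay would be a one-line change:

```
# slurm.conf (fragment)
# Seconds the backup controller waits for the primary before assuming control.
# The default is 120; 30 is only an illustrative value.
SlurmctldTimeout=30
```

The controllers have to re-read slurm.conf before a new value takes effect, and a very short timeout may trigger a takeover during a brief network hiccup.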

Brian Andrus

On 7/3/2019 2:45 PM, Tina Fora wrote:
Thanks Brian Andrus and Chris Samuel.

I was able to get it to work on our dev setup as primary/backup. Already
had the shared state directory. If I take primary down it takes about two
minutes for slurm commands to work again as the backup takes over. When I
bring the primary back up it is a bit faster.

Cheers.



On 2/7/19 1:48 pm, Tina Fora wrote:

We run mysql on a dedicated machine with slurmctld and slurmdbd running
on another machine. Now I want to add another machine running slurmctld
and slurmdbd and this machine will be on CentOS 7. Existing one is CentOS 6.
Is this possible? Can I run two separate slurmctld and slurmdbd pointing to
the same slurm config and database?

Are you trying to set up an HA system (where one controller runs both
and a second waits in the wings in case the first fails and will take
over)?

Or do you want them to run separate clusters?

If you want the second, and are happy to have the same users and QOS's
on both, then you can run one slurmctld per system and point them at the
same slurmdbd (having created a cluster for each there first).
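
A minimal sketch of that separate-clusters layout, assuming hypothetical cluster names (cluster_a, cluster_b) and a hypothetical slurmdbd host (dbdhost):

```
# slurm.conf on the first controller (fragment; names are hypothetical)
ClusterName=cluster_a
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbdhost

# slurm.conf on the second controller (fragment; names are hypothetical)
ClusterName=cluster_b
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbdhost
```

Each cluster would be registered in the accounting database beforehand, for example with "sacctmgr add cluster cluster_a".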

If you want HA then it's a lot more complicated as you'll need a (fast)
shared filesystem between them both (we use GPFS for this) as both
slurmctld's need to see the same state directory all the time.
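
For the HA case, the relevant slurm.conf lines might look roughly like this. The host names (ctl1, ctl2) and the path are hypothetical, and older releases use ControlMachine/BackupController instead of SlurmctldHost:

```
# slurm.conf, identical on both controllers (fragment; names and path are hypothetical)
SlurmctldHost=ctl1                      # first entry is the primary
SlurmctldHost=ctl2                      # second entry is the backup
StateSaveLocation=/shared/slurm/state   # must be on the shared filesystem
```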

We also run slurmdbd in failover mode talking to the same MySQL/MariaDB
instance (but with a backup in case that fails).
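
A sketch of what slurmdbd failover could look like, assuming hypothetical host names (dbd1, dbd2, mysqlhost, mysqlbackup). It is not necessarily the exact setup described above:

```
# slurm.conf (fragment; host names are hypothetical)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2

# slurmdbd.conf (fragment; host names are hypothetical)
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
StorageHost=mysqlhost
# Optional: a second database host to try if StorageHost is not responding.
# Keeping the two databases consistent is up to the database setup itself.
StorageBackupHost=mysqlbackup
```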

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




