[slurm-dev] RE: Slurm with High Availabilty/Automatic failover

J. Smith Tue, 25 Jul 2017 11:17:38 -0700

Hi,

Thank you both for sharing. I would definitely like to hear more about
it.


Davide, what type of issues did you run into with the parallel filesystem?
Are you using keepalived for both the controller and the database daemon?
And what type of setup is the database itself: master/slave?

On my side, Slurm is installed on our GPFS parallel filesystem. We use two
servers, each running the controller and database daemons (slurmctld and
slurmdbd) and a MariaDB database. For the time being, we just replicate
the master database to a slave on the second server, and we want to change
this configuration to get better, automated failover. Failover works fine
as master/slave for slurmctld, but we are having issues failing over
slurmdbd.
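
As a rough illustration of the kind of check an automated failover would
need, here is a minimal Python sketch that tests whether a daemon is
accepting TCP connections. The function names are hypothetical, and the
ports in the comments are the Slurm defaults (6817 for slurmctld, 6819 for
slurmdbd); they are assumptions that must be adjusted if slurm.conf or
slurmdbd.conf override them. A successful connection only shows that the
daemon is listening, not that it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def exit_code_for(host, port, timeout=2.0):
    """Map the check onto a shell-style exit code: 0 = up, 1 = down."""
    return 0 if port_is_open(host, port, timeout) else 1


# Assumed defaults, shown for illustration only:
#   exit_code_for("localhost", 6817)  # slurmctld
#   exit_code_for("localhost", 6819)  # slurmdbd
```

A wrapper that passes the result to sys.exit() could be used anywhere a
0/non-zero exit status is expected, for example as a keepalived
vrrp_script check.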

On Tue, Jul 25, 2017 at 12:31 PM, Vanzo, Davide <davide.va...@vanderbilt.edu
> wrote:

> Gary,
>
> Would it be possible to get some additional details on your experience
> with DRBD?
> Thank you.
>
>
> --
> *Davide Vanzo, PhD*
> Application Developer
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> (615)-875-9137
> www.accre.vanderbilt.edu
>
>
> On 2017-07-25 11:13:30-05:00 Skouson, Gary B wrote:
>
> We use an NFS appliance for storing state files.  The NFS has been VERY
> stable.  We tried a DRBD shared volume but found that our problems were
> more likely to come from DRBD than from slurmctld.
>
>
>
> -----
>
> Gary Skouson
>
>
>
>
>
> *From:* Vanzo, Davide [mailto:davide.va...@vanderbilt.edu]
> *Sent:* Tuesday, July 25, 2017 8:39 AM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Cc:* slurm-dev@schedmd.com
> *Subject:* [slurm-dev] RE: Slurm with High Availabilty/Automatic failover
>
>
>
> We are currently experimenting with keepalived+DRBD to have an HA cluster
> with two nodes where both the controller and the database are hosted on the
> same node. We are pursuing this route because we experienced significant
> performance and stability issues when keeping the state files on the
> cluster parallel filesystem.
>
> We are still in the early stages of testing but I will be happy to share
> our experience if you are interested.
>
> --
>
> *Davide Vanzo, PhD*
>
> Application Developer
>
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
>
> Advanced Computing Center for Research and Education (ACCRE)
>
> Vanderbilt University - Hill Center 201
>
> (615)-875-9137
>
> www.accre.vanderbilt.edu
>
>
>
> On 2017-07-25 09:20:55-05:00 J. Smith wrote:
>
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a controller daemon,
> database daemon and MySQL database (e.g. replication vs. Galera cluster)?
>
> Any input would be appreciated.
>
> Thanks!
>
>
