Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
On Wed, Jul 29, 2020 at 10:45 PM Strahil Nikolov wrote: > You got plenty of options: > - IPMI based fencing like HP iLO, DELL iDRAC > - SCSI-3 persistent reservations (which can be extended to fence the > node when the reservation(s) were removed) > > - Shared disk (even iSCSI) and

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
I don't know of a stonith method that acts upon a filesystem directly. You'd generally want to act upon the power state of the node or upon the underlying shared storage. What kind of hardware or virtualization platform are these systems running on? If there is a hardware watchdog timer, then sbd

Re: [ClusterLabs] Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes

2020-07-29 Thread Reid Wahl
Addressing only the first paragraph of your message, inline below. I'll have to defer to others to answer the remainder. On Wed, Jul 29, 2020 at 4:12 PM Toby Haynes wrote: > In Corosync 1.x there was a limit on the maximum number of active nodes in > a corosync cluster - browsing the mailing

[ClusterLabs] Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes

2020-07-29 Thread Toby Haynes
In Corosync 1.x there was a limit on the maximum number of active nodes in a corosync cluster - browsing the mailing list says 64 hosts. The Pacemaker 1.1 documentation says scalability goes up to 16 nodes. The Pacemaker 2.0 documentation says the same, although I can't find a maximum number of

Re: [ClusterLabs] why is node fenced ?

2020-07-29 Thread Ken Gaillot
On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote: > Hi, > > a few days ago one of my nodes was fenced and I don't know why, which > is something I really don't like. > What I did: > I put one node (ha-idg-1) in standby. The resources on it (mostly > virtual domains) were migrated to

Re: [ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd
On 29 Jul 2020 at 17:26, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote: Hi, sorry, I forgot to mention: OS: SLES 12 SP4 kernel: 4.12.14-95.32 pacemaker: pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64 Bernd

[ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd
Hi, a few days ago one of my nodes was fenced and I don't know why, which is something I really don't like. What I did: I put one node (ha-idg-1) in standby. The resources on it (mostly virtual domains) were migrated to ha-idg-2, except one domain (vm_nextcloud). On ha-idg-2 a mountpoint

[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Ulrich Windl
>>> Gabriele Bulfon wrote on 29.07.2020 at 14:18 in message <479956351.444.1596025101064@www>: > Hi, it's a single controller, shared to both nodes, SM server. You mean an external controller, like NAS or SAN? I thought you were talking about an internal controller like SCSI... I don't know what

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
Thanks a lot for the extensive explanation! Any idea about a ZFS stonith? Gabriele Sonicle S.r.l.: http://www.sonicle.com Music: http://www.gabrielebulfon.com Quantum Mechanics: http://www.cdbaby.com/cd/gabrielebulfon From: Reid Wahl To: Cluster Labs - All topics related to open-source

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
It is a ZFS based illumos system. I don't think SBD is an option. Is there a reliable ZFS based stonith? Gabriele From: Andrei Borzenkov To: Cluster Labs

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
Hi, it's a single controller, shared to both nodes, SM server. Thanks! Gabriele

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
On Wed, Jul 29, 2020 at 2:48 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Reid Wahl wrote on 29.07.2020 at 11:39 in message: > > "As it stated in the comments, we don't want to halt or boot via ssh, only reboot." > > > > Generally speaking, a stonith reboot

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Strahil Nikolov
Do you have a reason not to use any stonith already available? Best Regards, Strahil Nikolov On 28 July 2020 13:26:52 GMT+03:00, Gabriele Bulfon wrote: >Thanks, I attach here the script. >It basically runs ssh on the other node with no password (must be >preconfigured via authorization

[ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Ulrich Windl
>>> Reid Wahl wrote on 29.07.2020 at 11:39 in message: > "As it stated in the comments, we don't want to halt or boot via ssh, only reboot." > > Generally speaking, a stonith reboot action consists of the following basic > sequence of events: > > 1. Execute the fence agent with the

[ClusterLabs] Pacemaker crashed and produce a coredump file

2020-07-29 Thread lkxjtu
Hi Reid Wahl, There is more log information below. The reason seems to be that communication with DBUS timed out. Any suggestions? 1672712 Jul 24 21:20:17 [3945305] B0610011 lrmd: info: pcmk_dbus_timeout_dispatch:Timeout 0x147bbd0 expired 1672713 Jul 24 21:20:17 [3945305]

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
"As it stated in the comments, we don't want to halt or boot via ssh, only reboot." Generally speaking, a stonith reboot action consists of the following basic sequence of events: 1. Execute the fence agent with the "off" action. 2. Poll the power status of the fenced node until it is

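The first two steps quoted in the message above (run the fence agent's "off" action, then poll the node's power status) can be sketched roughly as follows. This is only an illustration, not code from Pacemaker or any fence agent: `fence_off_and_wait`, `run_action`, and `get_status` are hypothetical names, and the `target="off"` value, timeout, and polling interval are assumptions, since the snippet is cut off before it says what the poll waits for.

```python
import time
from typing import Callable


def fence_off_and_wait(
    run_action: Callable[[str], None],   # hypothetical: invokes the fence agent
    get_status: Callable[[], str],       # hypothetical: returns the power state
    target: str = "off",                 # assumed state to wait for
    timeout_s: float = 20.0,             # assumed overall timeout
    interval_s: float = 1.0,             # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Request power-off, then poll until the node reports `target`.

    Returns True if the target state was observed before the timeout,
    False otherwise.
    """
    # Step 1: execute the fence agent with the "off" action.
    run_action("off")

    # Step 2: poll the power status until it matches or the timeout expires.
    deadline = clock() + timeout_s
    while True:
        if get_status() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing the agent call and the status query in as callables keeps the sketch independent of any particular fencing device; a caller would wrap whatever mechanism it actually uses.
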
Re: [ClusterLabs] Pacemaker crashed and produce a coredump file

2020-07-29 Thread Reid Wahl
Hi, It looks like this is a bug that was fixed in later releases. The `path` variable was a null pointer when it was passed to `systemd_unit_exec_with_unit` as the `unit` argument. Commit 62a0d26a

[ClusterLabs] [Announce] libqb 2.0.1 released

2020-07-29 Thread Christine Caulfield
We are pleased to announce the release of libqb 2.0.1. This is the latest stable release of libqb. Source code is available at: https://github.com/ClusterLabs/libqb/releases/download/2.0.1/libqb-2.0.1.tar.xz Please use the signed .tar.gz or .tar.xz files with the version number in rather than

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Andrei Borzenkov
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon wrote: > That one was taken from a specific implementation on Solaris 11. > The situation is a dual node server with shared storage controller: both > nodes see the same disks concurrently. > Here we must be sure that the two nodes are not going to

[ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Ulrich Windl
>>> Gabriele Bulfon wrote on 29.07.2020 at 08:01 in message <603366395.379.1596002482554@www>: > That one was taken from a specific implementation on Solaris 11. > The situation is a dual node server with shared storage controller: both > nodes see the same disks concurrently. You mean you

[ClusterLabs] Pacemaker crashed and produce a coredump file

2020-07-29 Thread lkxjtu
RPM Version Information: corosync-2.3.4-7.el7_2.1.x86_64 pacemaker-1.1.12-22.el7.x86_64 Coredump file backtrace: ``` warning: .dynamic section for "/lib64/libk5crypto.so.3" is not at the expected address (wrong library or version mismatch?) Missing separate debuginfo for Try: yum

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
That one was taken from a specific implementation on Solaris 11. The situation is a dual node server with shared storage controller: both nodes see the same disks concurrently. Here we must be sure that the two nodes are not going to import/mount the same zpool at the same time, or we will