[ClusterLabs] Antw: [EXT] Cluster breaks after pcs unstandby node
Hi!

I'm using SLES, but I think your configuration misses many colocations (IMHO every ordering should have a corresponding colocation).

From the logs of node1, this looks odd to me:
attrd[11024]: error: Connection to the CPG API failed: Library error (2)

After "systemd[1]: Unit pacemaker.service entered failed state." it's expected that the node be fenced. However this is not fencing IMHO:
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Power key pressed.
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Powering Off...

The main question is what makes the cluster think the node is lost:
Jan 04 13:58:27 kvm03-node01 corosync[10995]: [TOTEM ] A processor failed, forming new configuration.
Jan 04 13:58:27 kvm03-node02 corosync[28814]: [TOTEM ] A processor failed, forming new configuration.

The answer seems to be node3:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node02_monitor_6 on kvm03-node02.avigol-gcs.dk
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node03_monitor_6 on kvm03-node01.avigol-gcs.dk
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [TOTEM ] A new membership (172.31.0.31:1044) was formed. Members
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.

Before:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost

No idea why, but then:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost

Why "shutdown" and not "fencing"?

(A side-note on "pe-input-497.bz2": you may want to limit the number of policy files being kept; here I use 100 as the limit.)

Node2 then seems to have rejoined before being fenced:
Jan 04 13:57:21 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE

Then node3 seems unavailable, moving resources to node2:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node02 ( kvm03-node03.avigol-gcs.dk -> kvm03-node02.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node03 ( kvm03-node03.avigol-gcs.dk -> kvm03-node01.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Stop dlm:2 ( kvm03-node03.avigol-gcs.dk ) due to node availability

Then node1 seems gone:
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.
Then suddenly node1 is here again:
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Transition aborted: Node failure
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 dlm_controld[39252]: 5452 cpg_mcast_joined retry 300 plock
Jan 04 13:58:33 kvm03-node03 stonith-ng[37815]: notice: Node kvm03-node01.avigol-gcs.dk state is now member

And it's lost again:
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1 to be down
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched

Then it seems only node1 can fence node1, but communication with node1 is lost:
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: sta
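To make the colocation remark at the top of this message and the pe-input housekeeping side-note concrete, a minimal pcs sketch; the resource names (dlm-clone, lvmlockd-clone) are hypothetical stand-ins, so adapt them to the orderings actually present in the configuration:

"""
# Pair an existing ordering with a matching colocation (names are hypothetical;
# use the resources from your own 'pcs constraint order' output):
pcs constraint order start dlm-clone then start lvmlockd-clone
pcs constraint colocation add lvmlockd-clone with dlm-clone INFINITY

# Limit how many pe-input-*.bz2 policy files Pacemaker keeps per node
# (100 is the limit mentioned above):
pcs property set pe-input-series-max=100
"""

The point of the pairing: an ordering only says "start B after A", while the colocation keeps B on the same node as A; without it the cluster may start B on a node where A never ran.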
Re: [ClusterLabs] How to set up "active-active" cluster by balancing multiple exports across servers?
Hi.

I would run nfsserver and nfsnotify as a separate cloned group and make both other groups colocated/ordered with it. So the NFS server will be just a per-host service, and then you attach exports (with LVs, filesystems, IP addresses) to it. The NFS server in Linux is an in-kernel creature, not a userspace process, and it is not designed to have several instances bound to different addresses. But with the approach above you can overcome that.

On Tue, 2021-01-12 at 11:04 -0700, Billy Wilson wrote:
> I'm having trouble setting up what seems like it should be a
> straightforward NFS-HA design. It is similar to what Christoforos
> Christoforou attempted to do earlier in 2020
> (https://www.mail-archive.com/users@clusterlabs.org/msg09671.html).
>
> My goal is to balance multiple NFS exports across two nodes to
> effectively have an "active-active" configuration. Each export should
> only be available from one node at a time, but they should be able to
> freely fail back and forth to balance between the two nodes.
>
> I'm also hoping to isolate each exported filesystem to its own set of
> underlying disks, to prevent heavy IO on one exported filesystem from
> affecting another one. So each filesystem to be exported should be
> backed by a unique volume group.
>
> I've set up two nodes with fencing, an ethmonitor clone, and the
> following two resource groups.
>
> """
> * Resource Group: ha1:
>   * alice_lvm        (ocf::heartbeat:LVM-activate):  Started host1
>   * alice_xfs        (ocf::heartbeat:Filesystem):    Started host1
>   * alice_nfs        (ocf::heartbeat:nfsserver):     Started host1
>   * alice_ip         (ocf::heartbeat:IPaddr2):       Started host1
>   * alice_nfsnotify  (ocf::heartbeat:nfsnotify):     Started host1
>   * alice_login01    (ocf::heartbeat:exportfs):      Started host1
>   * alice_login02    (ocf::heartbeat:exportfs):      Started host1
> * Resource Group: ha2:
>   * bob_lvm          (ocf::heartbeat:LVM-activate):  Started host2
>   * bob_xfs          (ocf::heartbeat:Filesystem):    Started host2
>   * bob_nfs          (ocf::heartbeat:nfsserver):     Started host2
>   * bob_ip           (ocf::heartbeat:IPaddr2):       Started host2
>   * bob_nfsnotify    (ocf::heartbeat:nfsnotify):     Started host2
>   * bob_login01      (ocf::heartbeat:exportfs):      Started host2
>   * bob_login02      (ocf::heartbeat:exportfs):      Started host2
> """
>
> We had an older storage appliance that used Red Hat HA on RHEL 6 (back
> when it still used RGManager and not Pacemaker), and it was capable of
> load-balanced NFS-HA like this.
>
> The problem with this approach using Pacemaker is that the "nfsserver"
> resource agent only wants one instance per host. During a failover
> event, both "nfsserver" RAs will try to bind mount the NFS shared info
> directory to /var/lib/nfs/. Only one will claim the directory.
>
> If I convert everything to a single resource group as Christoforos did,
> then the cluster is active-passive, and all the resources fail as a
> single unit. Having one node serve all the exports while the other is
> idle doesn't seem very ideal.
>
> I'd like to eventually have something like this:
>
> """
> * Resource Group: ha1:
>   * alice_lvm          (ocf::heartbeat:LVM-activate):  Started host1
>   * alice_xfs          (ocf::heartbeat:Filesystem):    Started host1
>   * charlie_lvm        (ocf::heartbeat:LVM-activate):  Started host1
>   * charlie_xfs        (ocf::heartbeat:Filesystem):    Started host1
>   * ha1_nfs            (ocf::heartbeat:nfsserver):     Started host1
>   * alice_ip           (ocf::heartbeat:IPaddr2):       Started host1
>   * charlie_ip         (ocf::heartbeat:IPaddr2):       Started host1
>   * ha1_nfsnotify      (ocf::heartbeat:nfsnotify):     Started host1
>   * alice_login01      (ocf::heartbeat:exportfs):      Started host1
>   * alice_login02      (ocf::heartbeat:exportfs):      Started host1
>   * charlie_login01    (ocf::heartbeat:exportfs):      Started host1
>   * charlie_login02    (ocf::heartbeat:exportfs):      Started host1
> * Resource Group: ha2:
>   * bob_lvm            (ocf::heartbeat:LVM-activate):  Started host2
>   * bob_xfs            (ocf::heartbeat:Filesystem):    Started host2
>   * david_lvm          (ocf::heartbeat:LVM-activate):  Started host2
>   * david_xfs          (ocf::heartbeat:Filesystem):    Started host2
>   * ha2_nfs            (ocf::heartbeat:nfsserver):     Started host2
>   * bob_ip             (ocf::heartbeat:IPaddr2):       Started host2
>   * david_ip           (ocf::heartbeat:IPaddr2):       Started host2
>   * ha2_nfsnotify      (ocf::heartbeat:nfsnotify):     Started host2
>   * bob_login01        (ocf::heartbeat:exportfs):      Started host2
>   * bob_login02        (ocf::heartbeat:exportfs):      Started host2
>   * david_login01      (ocf::heartbeat:exportfs):      Started host2
>   * david_login02      (ocf::heartbeat:exportfs):      Started host2
> """
>
> Or even this:
>
> """
> * Resource Group: alice_research:
>   * alice_lvm (ocf::he
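As a rough pcs sketch of the cloned-nfsserver suggestion at the top of this reply — untested, the resource and clone names (nfsd, nfsd-notify, nfs-base) are invented, and the per-group nfsserver/nfsnotify resources would first be removed from ha1/ha2:

"""
# One NFS server (plus notify) per host, cloned across the cluster:
pcs resource create nfsd ocf:heartbeat:nfsserver
pcs resource create nfsd-notify ocf:heartbeat:nfsnotify
pcs resource group add nfs-base nfsd nfsd-notify
pcs resource clone nfs-base

# Each export group then runs only where an NFS server instance runs,
# and starts only after it:
pcs constraint colocation add ha1 with nfs-base-clone INFINITY
pcs constraint order start nfs-base-clone then start ha1
pcs constraint colocation add ha2 with nfs-base-clone INFINITY
pcs constraint order start nfs-base-clone then start ha2
"""

With this layout the ha1/ha2 groups keep only the LVs, filesystems, IP addresses and exportfs resources, so they can be balanced across the two nodes independently of the single per-host NFS server.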
[ClusterLabs] How to set up "active-active" cluster by balancing multiple exports across servers?
I'm having trouble setting up what seems like it should be a straightforward NFS-HA design. It is similar to what Christoforos Christoforou attempted to do earlier in 2020 (https://www.mail-archive.com/users@clusterlabs.org/msg09671.html).

My goal is to balance multiple NFS exports across two nodes to effectively have an "active-active" configuration. Each export should only be available from one node at a time, but they should be able to freely fail back and forth to balance between the two nodes.

I'm also hoping to isolate each exported filesystem to its own set of underlying disks, to prevent heavy IO on one exported filesystem from affecting another one. So each filesystem to be exported should be backed by a unique volume group.

I've set up two nodes with fencing, an ethmonitor clone, and the following two resource groups.

"""
* Resource Group: ha1:
  * alice_lvm        (ocf::heartbeat:LVM-activate):  Started host1
  * alice_xfs        (ocf::heartbeat:Filesystem):    Started host1
  * alice_nfs        (ocf::heartbeat:nfsserver):     Started host1
  * alice_ip         (ocf::heartbeat:IPaddr2):       Started host1
  * alice_nfsnotify  (ocf::heartbeat:nfsnotify):     Started host1
  * alice_login01    (ocf::heartbeat:exportfs):      Started host1
  * alice_login02    (ocf::heartbeat:exportfs):      Started host1
* Resource Group: ha2:
  * bob_lvm          (ocf::heartbeat:LVM-activate):  Started host2
  * bob_xfs          (ocf::heartbeat:Filesystem):    Started host2
  * bob_nfs          (ocf::heartbeat:nfsserver):     Started host2
  * bob_ip           (ocf::heartbeat:IPaddr2):       Started host2
  * bob_nfsnotify    (ocf::heartbeat:nfsnotify):     Started host2
  * bob_login01      (ocf::heartbeat:exportfs):      Started host2
  * bob_login02      (ocf::heartbeat:exportfs):      Started host2
"""

We had an older storage appliance that used Red Hat HA on RHEL 6 (back when it still used RGManager and not Pacemaker), and it was capable of load-balanced NFS-HA like this.

The problem with this approach using Pacemaker is that the "nfsserver" resource agent only wants one instance per host. During a failover event, both "nfsserver" RAs will try to bind mount the NFS shared info directory to /var/lib/nfs/. Only one will claim the directory.

If I convert everything to a single resource group as Christoforos did, then the cluster is active-passive, and all the resources fail as a single unit. Having one node serve all the exports while the other is idle doesn't seem very ideal.
I'd like to eventually have something like this:

"""
* Resource Group: ha1:
  * alice_lvm          (ocf::heartbeat:LVM-activate):  Started host1
  * alice_xfs          (ocf::heartbeat:Filesystem):    Started host1
  * charlie_lvm        (ocf::heartbeat:LVM-activate):  Started host1
  * charlie_xfs        (ocf::heartbeat:Filesystem):    Started host1
  * ha1_nfs            (ocf::heartbeat:nfsserver):     Started host1
  * alice_ip           (ocf::heartbeat:IPaddr2):       Started host1
  * charlie_ip         (ocf::heartbeat:IPaddr2):       Started host1
  * ha1_nfsnotify      (ocf::heartbeat:nfsnotify):     Started host1
  * alice_login01      (ocf::heartbeat:exportfs):      Started host1
  * alice_login02      (ocf::heartbeat:exportfs):      Started host1
  * charlie_login01    (ocf::heartbeat:exportfs):      Started host1
  * charlie_login02    (ocf::heartbeat:exportfs):      Started host1
* Resource Group: ha2:
  * bob_lvm            (ocf::heartbeat:LVM-activate):  Started host2
  * bob_xfs            (ocf::heartbeat:Filesystem):    Started host2
  * david_lvm          (ocf::heartbeat:LVM-activate):  Started host2
  * david_xfs          (ocf::heartbeat:Filesystem):    Started host2
  * ha2_nfs            (ocf::heartbeat:nfsserver):     Started host2
  * bob_ip             (ocf::heartbeat:IPaddr2):       Started host2
  * david_ip           (ocf::heartbeat:IPaddr2):       Started host2
  * ha2_nfsnotify      (ocf::heartbeat:nfsnotify):     Started host2
  * bob_login01        (ocf::heartbeat:exportfs):      Started host2
  * bob_login02        (ocf::heartbeat:exportfs):      Started host2
  * david_login01      (ocf::heartbeat:exportfs):      Started host2
  * david_login02      (ocf::heartbeat:exportfs):      Started host2
"""

Or even this:

"""
* Resource Group: alice_research:
  * alice_lvm          (ocf::heartbeat:LVM-activate):  Started host1
  * alice_xfs          (ocf::heartbeat:Filesystem):    Started host1
  * alice_nfs          (ocf::heartbeat:nfsserver):     Started host1
  * alice_ip           (ocf::heartbeat:IPaddr2):       Started host1
  * alice_nfsnotify    (ocf::heartbeat:nfsnotify):     Started host1
  * alice_login01      (ocf::heartbeat:exportfs):      Started host1
  * alice_login02      (ocf::heartbeat:exportfs):      Started host1
* Resource Group: charlie_research:
  * charlie_lvm        (ocf::heartbeat:LVM-activate):  Started host1
  * charlie_xfs        (ocf::heartbeat:Filesystem):    Started host1
  * charlie_nfs        (ocf::heartbeat:nfsserver):     Started host1
  * charlie_ip         (ocf::heartbeat:IPaddr2)
Re: [ClusterLabs] Q: List resources affected by utilization limits
On 1/13/21 9:14 AM, Ulrich Windl wrote:
> Hi!
>
> I had made a test: I had configured RAM requirements for some test VMs
> together with node RAM capacities. Things were running fine. Then as a
> test I reduced the RAM capacity of all nodes, and test VMs were stopped
> due to not enough RAM.
>
> Now I wonder: is there a command that can list those resources that
> couldn't start because of "not enough node capacity"? Preferably
> combined with the utilization attribute that could not be fulfilled?

crm_simulate -LU should give some hints.

Regards,
Yan

> Regards,
> Ulrich
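Spelled out with the long option names, a minimal sketch of that check (option spellings may vary slightly between Pacemaker versions):

"""
# Show current cluster state plus node and resource utilization, from the live CIB:
crm_simulate --live-check --show-utilization        # short form: crm_simulate -LU

# Optionally add allocation scores to see why a particular resource was not placed:
crm_simulate --live-check --show-utilization --show-scores
"""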
Re: [ClusterLabs] Antw: [EXT] Re: Questions about the infamous TOTEM retransmit list
On 1/13/21 3:31 PM, Ulrich Windl wrote:
> Roger Zhou wrote on 13.01.2021 at 05:32 in message
> <97ac2305-85b4-cbb0-7133-ac1372143...@suse.com>:
>> On 1/12/21 4:23 PM, Ulrich Windl wrote:
>>> Hi!
>>>
>>> Before setting up our first pacemaker cluster we thought one low-speed
>>> redundant network would be good in addition to the normal high-speed
>>> network. However, as it seems now (SLES15 SP2), there is NO reasonable
>>> RRP mode to drive such a configuration with corosync. Passive RRP mode
>>> with UDPU still sends each packet through both nets,
>>
>> Indeed, packets are sent in the round-robin fashion.
>>
>>> being throttled by the slower network. (Originally we were using
>>> multicast, but that was even worse.)
>>>
>>> Now I realized that even under modest load, I see messages about
>>> "retransmit list", like this:
>>> Jan 08 10:57:56 h16 corosync[3562]: [TOTEM ] Retransmit List: 3e2
>>> Jan 08 10:57:56 h16 corosync[3562]: [TOTEM ] Retransmit List: 3e2 3e4
>>> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 60e 610 612 614
>>> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 610 614
>>> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 614
>>> Jan 08 11:13:41 h16 corosync[3562]: [TOTEM ] Retransmit List: 6ed
>>
>> What's the latency of this low speed link?
>
> The normal net is fibre-based:
> 4 packets transmitted, 4 received, 0% packet loss, time 3058ms
> rtt min/avg/max/mdev = 0.131/0.175/0.205/0.027 ms
>
> The redundant net is copper-based:
> 5 packets transmitted, 5 received, 0% packet loss, time 4104ms
> rtt min/avg/max/mdev = 0.293/0.304/0.325/0.019 ms

Aha, RTT < 1 ms, the network is fast enough. That clears my doubt; I had guessed the latency of the slow link might be in the tens or even hundreds of ms. Then I might wonder whether the corosync packets simply got unlucky and were delayed by workload on one of the links.

>>> Questions on that:
>>> Will the situation be much better with knet?
>>
>> knet provides "link_mode: passive", which is not round-robin and could
>> fit your idea somewhat. But it still doesn't fit your game well, since
>> knet again assumes similar latency among links. You may have to tune
>> parameters for the low speed link and likely sacrifice the benefit from
>> the fast link.
>
> Well, in the past when using HP Service Guard, everything worked quite
> differently: There was a true heartbeat on each cluster net, determining
> its "being alive", and when the cluster performed no action there was no
> traffic on the cluster links (except that heartbeat). When the cluster
> actually had to talk, it used the link that was flagged "alive", with a
> preference of primary first, then secondary when both were available.

"link_mode: passive" together with knet_link_priority would be useful. Also, using sctp in knet could be an alternative.

Cheers,
Roger
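To make that concrete, a hedged corosync.conf fragment (corosync 3 / knet syntax; the cluster name, addresses, and priority values are invented, only one node is shown, and this is untested here):

"""
totem {
    version: 2
    cluster_name: hacluster
    transport: knet
    # Use only the highest-priority available link instead of round-robin:
    link_mode: passive
    interface {
        linknumber: 0            # fast fibre net, preferred
        knet_link_priority: 2
    }
    interface {
        linknumber: 1            # slow copper net, used only if link 0 fails
        knet_link_priority: 1
        # knet_transport: sctp   # optional per-link transport, as mentioned above
    }
}

nodelist {
    node {
        name: h16
        nodeid: 1
        ring0_addr: 192.168.1.16   # example addresses, one per link
        ring1_addr: 10.0.0.16
    }
}
"""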
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: A bug? (SLES15 SP2 with "crm resource refresh")
On 1/12/21 8:23 AM, Ulrich Windl wrote:
> Ken Gaillot wrote on 11.01.2021 at 16:45 in message
> <3e78312a1c92cde0a1cdd82c2fed33a679f63770.ca...@redhat.com>:
>
> ...
>
>> from growing indefinitely). (Plus some timing issues to consider.)
>>> Wouldn't a temporary local status variable do also?
>
> Hi Ken,
>
> I appreciate your comments.
>
>> No, the scheduler is stateless. All information that the scheduler
>> needs must be contained within the CIB.
>>
>> The main advantages of that approach are (1) the scheduler can crash
>> and respawn without causing any problems; (2) the DC can be changed to
>
> I think it's nice to be able to recover smoothly after a crash, but
> program design should not be biased towards frequent crashes ;-)
>
>> another node at any time without causing any problems; and (3) saved
>
> Well, if every status update is stored in the CIB (as it seems to be),
> changing DCs shouldn't be a big problem until there are multiple at the
> same time.
>
>> CIBs can be replayed for debugging and testing purposes with the
>> identical result as a live cluster.
>
> Are you talking about the whole CIB, or about the configuration section
> of the CIB? I can't see any sense in replaying the status section of the
> CIB unless you want to debug resource recovery and probing.

That is the whole CIB. All the scheduler regression tests work like that: feed the CIB into crm_simulate and see what it does.

Klaus

> ...
>
> Regards,
> Ulrich
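For illustration, a minimal sketch of that replay workflow (file names are invented; the pengine directory path is the usual default and may differ on your distribution):

"""
# Grab the full CIB (configuration + status) from a live cluster:
cibadmin --query > /tmp/cib-snapshot.xml

# Replay it through the scheduler, without touching the live cluster,
# and show the resulting transition plus allocation scores:
crm_simulate --xml-file /tmp/cib-snapshot.xml --simulate --show-scores

# The same works with a saved scheduler input file, e.g.:
crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2 --simulate
"""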
Re: [ClusterLabs] Questions about the infamous TOTEM retransmit list
On 1/12/21 4:23 PM, Ulrich Windl wrote:
> Hi!
>
> Before setting up our first pacemaker cluster we thought one low-speed
> redundant network would be good in addition to the normal high-speed
> network. However, as it seems now (SLES15 SP2), there is NO reasonable
> RRP mode to drive such a configuration with corosync. Passive RRP mode
> with UDPU still sends each packet through both nets,

Indeed, packets are sent in the round-robin fashion.

> being throttled by the slower network. (Originally we were using
> multicast, but that was even worse.)
>
> Now I realized that even under modest load, I see messages about
> "retransmit list", like this:
> Jan 08 10:57:56 h16 corosync[3562]: [TOTEM ] Retransmit List: 3e2
> Jan 08 10:57:56 h16 corosync[3562]: [TOTEM ] Retransmit List: 3e2 3e4
> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 60e 610 612 614
> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 610 614
> Jan 08 11:13:21 h16 corosync[3562]: [TOTEM ] Retransmit List: 614
> Jan 08 11:13:41 h16 corosync[3562]: [TOTEM ] Retransmit List: 6ed

What's the latency of this low speed link? I guess it is rather large, and probably not suitable for this use unless the default corosync.conf is tuned carefully. Put another way, by default corosync is meant for a local network with small latency; it is not designed for links with very different latencies.

> Questions on that:
> Will the situation be much better with knet?

knet provides "link_mode: passive", which is not round-robin and could fit your idea somewhat. But it still doesn't fit your game well, since knet again assumes similar latency among links. You may have to tune parameters for the low speed link and likely sacrifice the benefit from the fast link.

> Is there a smooth migration path from UDPU to knet?

Off the top of my head, corosync 3 needs a restart when switching from "transport: udpu" to "transport: knet".

Cheers,
Roger
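A rough outline of what that switch might look like, offered only as a sketch, not a tested procedure; presumably the whole cluster has to be down at once, since nodes speaking udpu and knet cannot form a membership together:

"""
# 1. Stop the cluster stack on every node:
systemctl stop pacemaker corosync

# 2. On every node, change the transport in /etc/corosync/corosync.conf:
#        totem {
#            transport: knet      # was: udpu
#            ...
#        }
#    plus any knet link options (link_mode, knet_link_priority, ...)
#    and per-link ring addresses in the nodelist.

# 3. Start the stack again on every node:
systemctl start corosync pacemaker
"""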
[ClusterLabs] Q: List resources affected by utilization limits
Hi!

I had made a test: I had configured RAM requirements for some test VMs together with node RAM capacities. Things were running fine. Then as a test I reduced the RAM capacity of all nodes, and test VMs were stopped due to not enough RAM.

Now I wonder: is there a command that can list those resources that couldn't start because of "not enough node capacity"? Preferably combined with the utilization attribute that could not be fulfilled?

Regards,
Ulrich
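For context, such a setup might look roughly like this in "crm configure show" output (a sketch only: node names, the resource name, the VirtualDomain agent choice, and all numbers are invented):

"""
# Node RAM capacities and per-VM RAM requirements, plus the placement
# strategy that makes the scheduler honour them:
node ha-node1 \
        utilization memory=65536
node ha-node2 \
        utilization memory=65536
primitive test-vm1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/test-vm1.xml" \
        utilization memory=16384
property cib-bootstrap-options: \
        placement-strategy=balanced
"""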