Re: [ClusterLabs] Redundant ring not recovering after node is back
David,

> BTW, where can I download Corosync 3.x? I've only seen Corosync 2.99.3
> Alpha4 at http://corosync.github.io/corosync/

Yes, that's Alpha 4 of Corosync 3.

2018-08-23 9:11 GMT+02:00 David Tolosa:
>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>> Here you have my current YAML configuration:
>>
>> # This file describes the network interfaces available on your system
>> # For more information, see netplan(5).
>> network:
>>   version: 2
>>   renderer: networkd
>>   ethernets:
>>     eno1:
>>       addresses: [192.168.0.1/24]
>>     enp4s0f0:
>>       addresses: [192.168.1.1/24]
>>     enp5s0f0: {}
>>   vlans:
>>     vlan.XXX:
>>       id: XXX
>>       link: enp5s0f0
>>       addresses: [ 10.1.128.5/29 ]
>>       gateway4: 10.1.128.1
>>       nameservers:
>>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>>         search: [ foo.com, bar.com ]
>>     vlan.YYY:
>>       id: YYY
>>       link: enp5s0f0
>>       addresses: [ 10.1.128.5/29 ]
>>
>> So, eno1 and enp4s0f0 are the two ethernet ports connected to node2 with
>> crossover cables. The enp5s0f0 port is used to connect to outside
>> services, using the VLANs defined in the same file.
>>
>> In short, I'm using systemd-networkd, the default Ubuntu 18.04 server
>> service for managing networks.

Ok, so systemd-networkd is really doing ifdown, and somebody actually
tried to fix that and merge it upstream (sadly, without much luck :( ):
https://github.com/systemd/systemd/pull/7403

>> I'm not detecting any NetworkManager-config-server package in my
>> repository either.

I'm not sure what it's called in Debian-based distributions, but it's just
one small file in /etc, so you can extract it from the RPM.

>> So the only solution that I have left, I suppose, is to test corosync
>> 3.x and see if it works better handling RRP.

You may also reconsider trying either a completely static network
configuration or NetworkManager + NetworkManager-config-server.

Corosync 3.x with knet will work for sure, but be prepared for quite a
long compile path, because you first have to compile knet and then
corosync.
What may help you a bit is that we have Ubuntu 18.04 in our Jenkins, so building there should be possible (corosync build log:
https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
knet build log:
https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).

Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf about the changes in corosync configuration.

Regards,
  Honza
Re: [ClusterLabs] Redundant ring not recovering after node is back
I'm currently using an Ubuntu 18.04 server configuration with netplan.
Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0: {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]

So, eno1 and enp4s0f0 are the two ethernet ports connected to node2 with crossover cables.
The enp5s0f0 port is used to connect to outside services, using the VLANs defined in the same file.

In short, I'm using systemd-networkd, the default Ubuntu 18.04 server service for managing networks. I'm not detecting any NetworkManager-config-server package in my repository either.
So the only solution that I have left, I suppose, is to test corosync 3.x and see if it works better handling RRP.

Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse:
> David,
>
>> Hello,
>> I'm getting crazy about this problem, which I expect to resolve here
>> with your help, guys:
>>
>> I have 2 nodes with the Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NICs. Both nodes are
>> connected to each other by two crossover cables.
>
> I believe this is the root of the problem. Are you using NetworkManager?
> If so, have you installed NetworkManager-config-server? If not, please
> install it and test again.
>
>> I configured both nodes with rrp mode passive.
>> Everything is working well at this point, but when I shutdown 1 node to
>> test failover, and this node returns to be online, corosync is marking
>> the interface as FAULTY and rrp
>
> I believe it's because, with the crossover cable configuration, when the
> other side is shut down, NetworkManager detects it and does an ifdown of
> the interface. And corosync is unable to handle ifdown properly. Ifdown
> is bad with a single ring, but it's just a killer with RRP (127.0.0.1
> poisons every node in the cluster).
>
>> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = ring 1 active with no faults
>>
>> 2. When I shutdown node 2, all continues with no faults. Sometimes the
>> ring IDs are bonding with 127.0.0.1 and then bond back to their
>> respective heartbeat IPs.
>
> Again, result of ifdown.
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>  Main PID: 1439 (corosync)
>>     Tasks: 2 (limit: 4915)
>>    CGroup: /system.slice/corosync.service
>>            └─1439 /usr/sbin/corosync -f
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]: [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>> If I execute corosync-cfgtool, it clears the faulty error, but after
>> some seconds it returns to being FAULTY.
Re: [ClusterLabs] Redundant ring not recovering after node is back
BTW, where can I download Corosync 3.x? I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
[ClusterLabs] Q: automatically remove expired location constraints
Hi!

I have a non-trivial question: how can I remove expired manual migration requests, like the following?

location cli-standby-rsc rsc rule -inf: #uname eq host and date lt "2013-06-12 13:47:26Z"

One problem is that the date value is not a constant; it has to be compared against the current date & time.

Regards,
Ulrich

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Antw: Re: Q: ordering for a monitoring op only?
>>> Ryan Thomas schrieb am 21.08.2018 um 17:38 in Nachricht:
> You could accomplish this by creating a custom RA which normally acts as
> a pass-through and calls the "real" RA. However, it intercepts "monitor"
> actions, checks nfs, and if nfs is down it returns success; otherwise it
> passes through the monitor action to the real RA. If nfs fails while the
> monitor action is in-flight, the custom RA can intercept the failure,
> check if nfs is down, and if so change the failure to a success.

Hi!

This sounds like an interesting approach, but I wonder how to avoid a monitoring timeout, i.e. what value to return when NFS is down? I'm missing a return value like CANNOT_CHECK_AT_THE_MOMENT_SO_PLEASE_ASSUME_RESOURCE_STILL_HAS_ITS_LAST_STATE ;-) Unless I can return such a value, the wrapper RA will have to wait (possibly causing a timeout). OK, the wrapper RA could cache its last return value and reuse that when NFS is down.

Regards,
Ulrich

> On Mon, Aug 20, 2018 at 3:51 AM Ulrich Windl wrote:
>
>> Hi!
>>
>> I wonder whether it's possible to run a monitoring op only if some
>> specific resource is up.
>> Background: We have some resource that runs fine without NFS, but the
>> start, stop and monitor operations will just hang if NFS is down. In
>> effect the monitor operation will time out, the cluster will try to
>> recover, calling the stop operation, which in turn will time out,
>> making things worse (i.e.: causing a node fence).
>>
>> So my idea was to pause the monitoring operation while NFS is down (NFS
>> itself is controlled by the cluster and should recover "rather soon" TM).
>>
>> Is that possible?
>> And before you ask: no, I have not written the RA that has the problem;
>> a multi-million-dollar company wrote it. (Years before, I had written a
>> monitor for HP-UX's cluster that did not have this problem, even though
>> the configuration files were read from NFS. It's not magic: just
>> periodically copy them to shared memory, and read the config from
>> shared memory.)
>>
>> Regards,
>> Ulrich
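The caching idea discussed above can be sketched as a small wrapper around the real agent. This is only an illustration, not a complete OCF resource agent: the real RA path, the NFS mount point, the state file, and the timeout-based NFS check are all assumptions made up for the sketch.

```shell
#!/bin/sh
# Sketch of a wrapper RA that intercepts "monitor" and falls back to the
# cached last result when NFS is down (all paths are hypothetical).
REAL_RA="/usr/lib/ocf/resource.d/vendor/realRA"  # assumed path to the real RA
NFS_DIR="/mnt/nfs"                               # assumed NFS mount point
STATE_FILE="/var/run/wrapper-ra.last-rc"         # cache of the last monitor rc

nfs_is_up() {
    # Non-blocking check: stat the mount with a hard timeout so a dead
    # NFS server cannot hang the monitor operation itself.
    timeout 5 stat "$NFS_DIR" >/dev/null 2>&1
}

monitor() {
    if nfs_is_up; then
        "$REAL_RA" monitor
        rc=$?
        echo "$rc" > "$STATE_FILE"   # remember the last real result
        return $rc
    fi
    # NFS is down: reuse the cached result instead of hanging/timing out.
    if [ -r "$STATE_FILE" ]; then
        return "$(cat "$STATE_FILE")"
    fi
    return 0   # no history yet; assume "running" (OCF_SUCCESS)
}
```

A real agent would still have to dispatch start/stop and handle OCF environment variables; the sketch only covers the monitor interception that the thread discusses.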
[ClusterLabs] Antw: Re: Antw: Re: Spurious node loss in corosync cluster
>>> Prasad Nagaraj schrieb am 22.08.2018 um 02:59 in Nachricht:
> Thanks Ken and Ulrich. There is definitely high IO on the system, with
> IOWAITs of up to 90% sometimes.
> I have come across some previous posts saying that IOWAIT is also
> considered CPU load by Corosync. Is this true? Does having high IO lead
> corosync

It's not Corosync, it's Linux: a process busy with I/O also adds to the (CPU) load. One typical effect that we see: if some stale NFS or CIFS share is still being used, the "CPU load" goes up...

> to complain as in "Corosync main process was not scheduled for..." or
> "High CPU load detected.."?
>
> I will surely monitor the system more.

As recommended, try sar's disk activity (the numbers below are from a fast disk system, BTW):

00:00:01   DEV            tps   rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
08:40:01   dev253-13  1225.65   18393.28   1469.77     16.21      0.53   0.44   0.33  40.76
08:50:01   dev253-51  4972.41   38796.90    977.88      8.00      2.26   0.46   0.11  55.19
09:10:01   dev253-51  4709.03   36692.07    975.01      8.00      2.73   0.58   0.14  64.57
09:20:01   dev253-51  4445.17   34708.88    847.96      8.00      1.70   0.38   0.12  55.03
10:10:01   dev253-51  4246.66   32944.55   1023.61      8.00      3.12   0.73   0.18  77.83
11:00:01   dev253-51  5500.39   42984.68   1012.82      8.00      4.55   0.83   0.14  76.91
19:50:01   dev253-51 49618.88  396396.53    547.83      8.00    139.60   2.81   0.01  60.98

The %util is the column to look at; you could also look at await, but note that network-related I/O is not included there.

Regards,
Ulrich

> Thanks for your help.
> Prasad
>
> On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot wrote:
>
>> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>>>>>> Prasad Nagaraj schrieb am 21.08.2018 um 11:42 in Nachricht:
>>>> Hi Ken - Thanks for your response.
>>>>
>>>> We do have seen messages in other cases like
>>>> corosync [MAIN ] Corosync main process was not scheduled for
>>>> 17314.4746 ms (threshold is 8000. ms). Consider token timeout
>>>> increase.
>>>> corosync [TOTEM ] A processor failed, forming new configuration.
>>>>
>>>> Is this the indication of a failure due to CPU load issues, and will
>>>> this get resolved if I upgrade to the Corosync 2.x series?
>>
>> Yes, most definitely this is a CPU issue. It means corosync isn't
>> getting enough CPU cycles to handle the cluster token before the
>> timeout is reached.
>>
>> Upgrading may indeed help, as recent versions ensure that corosync runs
>> with real-time priority in the kernel, and thus are more likely to get
>> CPU time when something of lower priority is consuming all the CPU.
>>
>> But of course, there is some underlying problem that should be
>> identified and addressed. Figure out what's maxing out the CPU or I/O.
>> Ulrich's monitoring suggestion is a good start.
>>
>>> Hi!
>>>
>>> I'd strongly recommend starting monitoring on your nodes, at least
>>> until you know what's going on. The good old UNIX sa (sysstat
>>> package) could be a starting point. I'd monitor CPU idle
>>> specifically. Then go for 100% device utilization, then look for
>>> network bottlenecks...
>>>
>>> A new corosync release cannot fix those, most likely.
>>>
>>> Regards,
>>> Ulrich
>>>
>>>> In any case, for the current scenario, we did not see any
>>>> scheduling-related messages.
>>>>
>>>> Thanks for your help.
>>>> Prasad
>>>>
>>>> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot wrote:
>>>>
>>>>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>>>>>> Hi:
>>>>>>
>>>>>> One of these days, I saw a spurious node loss on my 3-node
>>>>>> corosync cluster, with the following logged in the corosync.log of
>>>>>> one of the nodes.
>>>>>>
>>>>>> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
>>>>>> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
>>>>>> Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transiti
[ClusterLabs] Antw: Re: Antw: Re: Spurious node loss in corosync cluster
>>> Prasad Nagaraj schrieb am 22.08.2018 um 19:00 in Nachricht:
> Hi - My systems are single-core CPU VMs running on the Azure platform.

OK, so you don't have any control over overprovisioning of CPU power or over the VM being migrated between nodes, I guess. Be aware that the CPU time you are seeing is purely virtual, and there may be times when a "100% busy CPU" gets no CPU cycles at all. An interesting experiment would be to compare the CLOCK_MONOTONIC values against CLOCK_REALTIME (on a real host, so that CLOCK_REALTIME actually is not a virtual time) over some time. I wouldn't be surprised if you see jumps. I think clouds are no good for real-time demands.

Regards,
Ulrich

> I am running MySQL on the nodes, which does generate high IO load. And
> my bad, I meant to say 'High CPU load detected' is logged by crmd and
> not corosync. Corosync logs messages like 'Corosync main process was not
> scheduled for...', which in turn makes the pacemaker monitor action fail
> sometimes. Is increasing the token timeout a solution for this, or are
> there other ways?
>
> Thanks for the help
> Prasad
>
> On Wed, 22 Aug 2018, 11:55 am Jan Friesse, wrote:
>
>> Prasad,
>>
>>> Thanks Ken and Ulrich. There is definitely high IO on the system with
>>> sometimes IOWAITs of up to 90%.
>>> I have come across some previous posts that IOWAIT is also considered
>>> as CPU load by Corosync. Is this true? Does having high IO lead
>>> corosync to complain as in "Corosync main process was not scheduled
>>> for..." or "Hi CPU load detected.."?
>>
>> Yes it can.
>>
>> Corosync never logs "Hi CPU load detected...".
>>
>>> I will surely monitor the system more.
>>
>> Is that system a VM or a physical machine? Because "Corosync main
>> process was not scheduled for..." usually happens on VMs where the
>> hosts are highly overloaded.
>>
>> Honza
>>
>>> Thanks for your help.
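For reference, the token timeout that comes up twice in this thread is set in the totem section of /etc/corosync/corosync.conf. A minimal sketch follows; the cluster name is a made-up placeholder and the value is purely illustrative, not a recommendation:

```
totem {
    version: 2
    cluster_name: mycluster   # hypothetical name
    # Token timeout in milliseconds. Raising it gives corosync more slack
    # when the VM is not scheduled for a while, at the cost of slower
    # failure detection.
    token: 15000
}
```

As the log message itself says ("Consider token timeout increase"), this only hides short scheduling stalls; it does not fix the underlying CPU/IO starvation.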
Re: [ClusterLabs] Q: automaticlly remove expired location constraints
On Thu, 2018-08-23 at 12:27 +0200, Ulrich Windl wrote:
> Hi!
>
> I have a non-trivial question: How can I remove expired manual
> migration requests, like the following?:
> location cli-standby-rsc rsc rule -inf: #uname eq host and date lt
> "2013-06-12 13:47:26Z"
>
> One problem is that the date value is not a constant, and it has to
> be compared against the current date & time.
>
> Regards,
> Ulrich

crm_resource --clear -r RSC will clear all cli-* constraints
--
Ken Gaillot
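Since the question was specifically about comparing the rule's date against the current time, the clearing could be scripted around this answer. A rough sketch, assuming GNU date and the exact rule format from the question; "rsc" is the resource name from the example, and the crm_resource call is only echoed here rather than executed:

```shell
#!/bin/sh
# Decide whether a rule date like "2013-06-12 13:47:26Z" lies in the past,
# and only then clear the cli-* constraints for the resource.
constraint_expired() {
    # $1 is the date literal from the rule; strip the trailing Z and let
    # GNU date parse it explicitly as UTC.
    rule_epoch=$(date -u -d "${1%Z} UTC" +%s) || return 2
    [ "$(date -u +%s)" -gt "$rule_epoch" ]
}

if constraint_expired "2013-06-12 13:47:26Z"; then
    # Per the answer above; "rsc" is the resource from the example.
    echo "would run: crm_resource --clear -r rsc"
fi
```

Extracting the date literal from the live CIB (e.g. via cibadmin output) is left out here; the sketch only covers the date comparison that was called out as the awkward part.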
Re: [ClusterLabs] Q: (SLES11 SP4) lrm_rsc_op without last-run?
On Thu, 2018-08-23 at 08:08 +0200, Ulrich Windl wrote:
> Hi!
>
> Many years ago I wrote a parser that could format the CIB XML in a
> flexible way. Today I used it again to print some statistics for
> "exec-time". Thereby I discovered one operation that has a valid
> "exec-time", a valid "last-rc-change", but no "last-run".
> All other operations had "last-run". Can someone explain how this can
> happen? The operation in question is "monitor", so it should be run
> frequently (specified as "op monitor interval=600 timeout=30").

Recurring actions never get last-run, because the crmd doesn't initiate
each run of a recurring action.

> I see no failed actions regarding the resource.
>
> The original XML part for the operation looks like this:
>
> <lrm_rsc_op operation_key="prm_cron-cleanup_monitor_60"
>   operation="monitor" crm-debug-origin="build_active_RAs"
>   crm_feature_set="3.0.10"
>   transition-key="92:3:0:6c6eff09-0d57-4844-9c3c-bc300c095bb6"
>   transition-magic="0:0;92:3:0:6c6eff09-0d57-4844-9c3c-bc300c095bb6"
>   on_node="h06" call-id="151" rc-code="0" op-status="0" interval="60"
>   last-rc-change="1513367408" exec-time="13" queue-time="0"
>   op-digest="2351b51e5316689a0eb89e8061445728"/>
>
> The node is not completely up-to-date, and it's using
> pacemaker-1.1.12-18.1...
>
> Regards,
> Ulrich
--
Ken Gaillot
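For the kind of exec-time statistics described above, the attribute can be pulled out of lrm_rsc_op elements with a plain sed sketch. The sample below reuses the attributes from the quoted XML; a real script would of course read the whole CIB (e.g. cibadmin -Q output) instead of a literal string:

```shell
#!/bin/sh
# Extract the exec-time attribute from an lrm_rsc_op element (the sample
# attribute values are taken from the message above).
xml='<lrm_rsc_op operation="monitor" last-rc-change="1513367408" exec-time="13" queue-time="0"/>'
printf '%s\n' "$xml" | sed -n 's/.*exec-time="\([0-9]*\)".*/\1/p'   # prints 13
```

For anything beyond a quick one-off, an XML-aware tool is safer than sed, since attribute order and line wrapping in the CIB are not guaranteed.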
Re: [ClusterLabs] Redundant ring not recovering after node is back
> I tried to install corosync 3.x and it works pretty well.

Cool.

> But when I install pacemaker, it installs the previous version of
> corosync as a dependency and breaks the whole setup. Any suggestions?

I can see at least the following "solutions":
- make a proper Debian package
- install corosync 3 to /usr/local
- (ugly) install the packaged corosync and then reinstall corosync 3 from source over it

Regards,
  Honza