Re: [ClusterLabs] I'm doing something stupid probably but...
Sigh... never mind. node2 was in standby (not sure how that happened). "pcs node unstandby node2" and now it's working.

---

Regards,

Kevin Martin

On Tue, Oct 12, 2021 at 3:43 PM kevin martin wrote:

> Ok, so I'm doing more wrong than I thought. I did a "pcs cluster stop
> node1" on the main node, expecting it would roll the virtual IP over to
> node2 -- no joy. So "graceful" failover doesn't work either. The actual
> message is: (pcmk__native_allocate) info: Resource virtual_ip cannot run
> anywhere
>
> On Tue, Oct 12, 2021 at 3:32 PM kevin martin wrote:
>
>> I'm trying to replace a 2-node cluster running on RHEL 6 with a 2-node
>> cluster on EL8, using the versions of pacemaker/corosync/pcsd that are
>> in the repos (pacemaker 1.1.20, pcs 0.9, corosync 2.4.3 on EL6; 2.0.5,
>> 0.10, and 3.1 on EL8), and I must be doing something wrong. When I shut
>> down the main node of the cluster (like with a reboot after patching) I
>> expect the virtual IP to move to the 2nd node; however, I'm not seeing
>> that. I'm seeing a message in the pacemaker log that says the virtual IP
>> cannot run anywhere. I'm not sure what I'm supposed to configure to
>> allow that to happen (again, this is with a reboot of the main node, so
>> it's an ungraceful failover). Any help is appreciated.

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
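[A standby node is easy to overlook, and "Resource X cannot run anywhere" is exactly the symptom. A minimal illustrative sketch of spotting it follows; the node listing is a captured sample so the snippet is self-contained -- on a live cluster you would pipe the real `pcs status nodes` output instead, and clear the state with `pcs node unstandby <node>`.]

```shell
# Sample listing in the shape `pcs status nodes` prints it
# (captured here for illustration; values are from this thread)
status='Pacemaker Nodes:
 Online: node1
 Standby: node2
 Maintenance:
 Offline:'

# Extract the standby line; any non-empty output means that node
# cannot host resources until it is un-standby'd.
printf '%s\n' "$status" | sed -n 's/^ *Standby: *//p'
```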
Re: [ClusterLabs] I'm doing something stupid probably but...
Ok, so I'm doing more wrong than I thought. I did a "pcs cluster stop node1" on the main node, expecting it would roll the virtual IP over to node2 -- no joy. So "graceful" failover doesn't work either. The actual message is:

(pcmk__native_allocate) info: Resource virtual_ip cannot run anywhere

---

Regards,

Kevin Martin

On Tue, Oct 12, 2021 at 3:32 PM kevin martin wrote:

> I'm trying to replace a 2-node cluster running on RHEL 6 with a 2-node
> cluster on EL8, using the versions of pacemaker/corosync/pcsd that are in
> the repos (pacemaker 1.1.20, pcs 0.9, corosync 2.4.3 on EL6; 2.0.5, 0.10,
> and 3.1 on EL8), and I must be doing something wrong. When I shut down the
> main node of the cluster (like with a reboot after patching) I expect the
> virtual IP to move to the 2nd node; however, I'm not seeing that. I'm
> seeing a message in the pacemaker log that says the virtual IP cannot run
> anywhere. I'm not sure what I'm supposed to configure to allow that to
> happen (again, this is with a reboot of the main node, so it's an
> ungraceful failover). Any help is appreciated.
>
> Regards,
>
> Kevin Martin
[ClusterLabs] I'm doing something stupid probably but...
I'm trying to replace a 2-node cluster running on RHEL 6 with a 2-node cluster on EL8, using the versions of pacemaker/corosync/pcsd that are in the repos (pacemaker 1.1.20, pcs 0.9, corosync 2.4.3 on EL6; 2.0.5, 0.10, and 3.1 on EL8), and I must be doing something wrong. When I shut down the main node of the cluster (like with a reboot after patching) I expect the virtual IP to move to the 2nd node; however, I'm not seeing that. I'm seeing a message in the pacemaker log that says the virtual IP cannot run anywhere. I'm not sure what I'm supposed to configure to allow that to happen (again, this is with a reboot of the main node, so it's an ungraceful failover). Any help is appreciated.

---

Regards,

Kevin Martin
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Coming in Pacemaker 2.1.2: new fencing configuration options
On Tue, 2021-10-12 at 20:48 +0300, Andrei Borzenkov wrote:
> On 12.10.2021 09:27, Ulrich Windl wrote:
> > >>> Andrei Borzenkov schrieb am 11.10.2021 um 11:43 in Nachricht:
> > > On Mon, Oct 11, 2021 at 9:29 AM Ulrich Windl wrote:
> > > > > > Also how long would such a delay be: Long enough until the
> > > > > > other node is fenced, or long enough until the other node was
> > > > > > fenced, booted (assuming it does) and is running pacemaker?
> > > > >
> > > > > The delay should be on the less-preferred node, long enough for
> > > > > that node to get fenced. The other node, with no delay, will
> > > > > fence it if it can. If the other node is for whatever reason
> > > > > unable to fence, the node with the delay will fence it after the
> > > > > delay.
> > > >
> > > > So the "fence intention" will be lost when the node is being
> > > > fenced? Otherwise the surviving node would have to clean up the
> > > > "fence intention". Or does it mean the "fence intention" does not
> > > > make it to the CIB and stays local on the node?
> > >
> > > Two nodes cannot communicate with each other, so the surviving node
> > > is not aware of anything the fenced node did or intended to do. When
> > > the
> >
> > I thought (local) CIB writes do not need a quorum.
> >
> > > fenced node reboots and pacemaker starts, it should pull the CIB
> > > from the surviving node, so whatever intentions the fenced node had
> > > before the reboot should be lost at this point.
> >
> > If the surviving node has a newer CIB (as per
> > modification/configuration count) than the fenced node, that is true;
> > but if the fenced node has a newer CIB, the surviving node would pull
> > the "other" CIB, right?
>
> Indeed. I honestly did not expect it.
>
> I am not sure what consequences it has in practice, though. It is
> certainly one more argument against running without mandatory stonith,
> because in that case both nodes happily continue and it is unpredictable
> which one will win after they rejoin.
>
> Assuming we do run with mandatory stonith, then we have a relatively
> small window before the DC is killed (because only the DC can update the
> CIB). But I am not sure whether CIB changes will be committed locally
> until all nodes are either confirmed to be offline or have acknowledged
> the CIB changes. I guess only Ken can answer that :)

In general each node maintains its own copy of the CIB (writing locally), and only changes (diffs) are passed between nodes. Checksums are used to make sure the content remains functionally the same on all nodes. However, full CIB replacements can be done, whether by user request (pcs generally uses this for config changes, BTW) or when the CIB gets out of sync across the nodes.

When a node joins an existing cluster (like a fenced node rejoining), the CIB versions are compared, and the newest one wins (actually, more like the one with the most changes). Generally, the existing cluster had more activity after the node was fenced, and the fenced node has little to no activity before it rejoins the cluster, so it works out well. However, I have seen scripts that start the cluster on a node and immediately set some node attributes or whatnot, causing the fenced node to look "newer" when it rejoins.

> > I think I had a few cases in the past when the "last dying node" did
> > not have the "latest" CIB, causing some "extra noise" when the cluster
> > was formed again.
>
> Details of what happened are certainly interesting.
>
> > Probably some period to wait for all nodes to join (and thus sync the
> > CIBs) before performing any actions would help there.

--
Ken Gaillot
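[Ken's "newest one wins" comparison can be pictured as an ordered-tuple check over the CIB's version attributes (admin_epoch, epoch, num_updates). The sketch below is my own illustration of that ordering, not Pacemaker code.]

```shell
# Compare two CIB version triples given as "admin_epoch epoch num_updates";
# prints "A" if the first is newer, "B" otherwise. Illustrative only.
newer_cib() {
  for i in 1 2 3; do
    a=$(echo "$1" | cut -d' ' -f"$i")
    b=$(echo "$2" | cut -d' ' -f"$i")
    if [ "$a" -gt "$b" ]; then echo A; return; fi
    if [ "$a" -lt "$b" ]; then echo B; return; fi
  done
  echo B   # equal versions: keep the existing cluster's copy
}

newer_cib "0 42 7" "0 42 3"    # → A (same epochs, more num_updates)
newer_cib "0 41 99" "0 42 0"   # → B (higher epoch wins before num_updates)
```

This is why a freshly rebooted node that immediately racks up local changes (node attributes set by a startup script, say) can look "newer" than the surviving cluster when it rejoins.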
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Coming in Pacemaker 2.1.2: new fencing configuration options
On 12.10.2021 09:27, Ulrich Windl wrote:
> >>> Andrei Borzenkov schrieb am 11.10.2021 um 11:43 in Nachricht:
>> On Mon, Oct 11, 2021 at 9:29 AM Ulrich Windl wrote:
>>
>> >> Also how long would such a delay be: Long enough until the other
>> >> node is fenced, or long enough until the other node was fenced,
>> >> booted (assuming it does) and is running pacemaker?
>> >
>> > The delay should be on the less-preferred node, long enough for that
>> > node to get fenced. The other node, with no delay, will fence it if
>> > it can. If the other node is for whatever reason unable to fence, the
>> > node with the delay will fence it after the delay.
>>
>> So the "fence intention" will be lost when the node is being fenced?
>> Otherwise the surviving node would have to clean up the "fence
>> intention". Or does it mean the "fence intention" does not make it to
>> the CIB and stays local on the node?
>>
>
> Two nodes cannot communicate with each other, so the surviving node is
> not aware of anything the fenced node did or intended to do. When the

I thought (local) CIB writes do not need a quorum.

> fenced node reboots and pacemaker starts, it should pull the CIB from
> the surviving node, so whatever intentions the fenced node had before
> reboot should be lost at this point.

If the surviving node has a newer CIB (as per modification/configuration count) than the fenced node, that is true; but if the fenced node has a newer CIB, the surviving node would pull the "other" CIB, right?

Indeed. I honestly did not expect it.

I am not sure what consequences it has in practice, though. It is certainly one more argument against running without mandatory stonith, because in that case both nodes happily continue and it is unpredictable which one will win after they rejoin.

Assuming we do run with mandatory stonith, we have a relatively small window before the DC is killed (because only the DC can update the CIB). But I am not sure whether CIB changes will be committed locally until all nodes are either confirmed to be offline or have acknowledged the CIB changes. I guess only Ken can answer that :)

> I think I had a few cases in the past when the "last dying node" did not
> have the "latest" CIB, causing some "extra noise" when the cluster was
> formed again.

Details of what happened are certainly interesting.

> Probably some period to wait for all nodes to join (and thus sync the
> CIBs) before performing any actions would help there.
[ClusterLabs] Antw: [EXT] Re: Possible timing bug in SLES15
>>> Roger Zhou via Users schrieb am 12.10.2021 um 09:55 in Nachricht:
...
>> # Time syncs can make the clock jump backward, which messes with logging
>> # and failure timestamps, so wait until it's done.
>> After=time-sync.target
>> ...
>>
>> Oct 05 14:58:10 h16 pacemakerd[6974]:  notice: Starting Pacemaker 2.0.4+20200616.2deceaa3a-3.9.1
>>
>> But still it does not "Require" time-sync.target...
>
> Actually `After=` is a more strict dependency than `Require=`.

From discussions on the systemd development list, there is hardly a scenario where After= without Requires= makes sense, because (as I understood it) "After" only has an effect if both units are started in the same "transaction". The way I understood it, this would mean that if you start pacemaker manually and your clock is not in sync, pacemaker would start anyway. I may be wrong, though.

Maybe a counter-argument is that pacemaker might stop if the time is not in sync (although I believe a dependency on NTP would be bad, but time-sync is probably OK).

>> Doesn't corosync need synchronized clocks?
>
> Seems good to have, but low priority.

Well, at least when comparing log timestamps it seems useful if all nodes have the same time.

Regards,
Ulrich
Re: [ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
On Tue, 12 Oct 2021 09:46:04 +0200
"Ulrich Windl" wrote:

> >>> Jehan-Guillaume de Rorthais schrieb am 12.10.2021 um 09:35 in
> Nachricht <20211012093554.4bb761a2@firost>:
> > On Tue, 12 Oct 2021 08:42:49 +0200
> > "Ulrich Windl" wrote:
> >
> > ...
> >> "watch cat /proc/meminfo" could be your friend.
> >
> > Or even better, make sure you have the sysstat or pcp tool family
> > installed and harvesting system metrics. You'll have the full history
> > of the dirty-page variations during the day/week/month.
>
> Actually I think the 10-minute granularity of sysstat (sar) is too
> coarse to learn what's going on, specifically if your node is fenced
> before the latest record is written.

Indeed. You can still set it down to 1 min in the crontab if needed.

But the point is to gather a better understanding of how the dirty pages (and many other useful metrics) evolve over a long time frame. You will always lose a small part of the information after a fencing, no matter whether your period is 10 min, 1 min or even 1 s.
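[For reference, on RHEL/CentOS-family systems the sysstat collection interval usually lives in /etc/cron.d/sysstat; a sketch of the 1-minute change discussed above -- the exact path and the location of the sa1 helper vary by distribution, so treat both as assumptions:]

```
# /etc/cron.d/sysstat -- collect a snapshot every minute instead of
# the default every 10 minutes (path to sa1 may differ per distro)
* * * * * root /usr/lib64/sa/sa1 1 1
```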
Re: [ClusterLabs] Possible timing bug in SLES15
On 10/12/21 3:32 PM, Ulrich Windl wrote:
> Hi!
>
> I just examined the corosync.service unit in SLES15. It contains:
>
> # /usr/lib/systemd/system/corosync.service
> [Unit]
> Description=Corosync Cluster Engine
> Documentation=man:corosync man:corosync.conf man:corosync_overview
> ConditionKernelCommandLine=!nocluster
> Requires=network-online.target
> After=network-online.target
> ...
>
> However the documentation says corosync requires synchronized system
> clocks. With this configuration corosync starts before the clocks are
> synchronized:

The point looks valid and makes sense. That said, it sounds like there are no (or very seldom) victims of this in real life.

> Oct 05 14:57:47 h16 ntpd[6767]: ntpd 4.2.8p15@1.3728-o Tue Jun 15 12:00:00 UTC 2021 (1): Starting
> ...
> Oct 05 14:57:48 h16 systemd[1]: Starting Wait for ntpd to synchronize system clock...
> ...
> Oct 05 14:57:48 h16 corosync[6793]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
> ...
> Oct 05 14:57:48 h16 systemd[1]: Started Corosync Cluster Engine.
> ...
> Oct 05 14:58:10 h16 systemd[1]: Started Wait for ntpd to synchronize system clock.
> Oct 05 14:58:10 h16 systemd[1]: Reached target System Time Synchronized.
>
> Only pacemaker.service has:
>
> # /usr/lib/systemd/system/pacemaker.service
> [Unit]
> Description=Pacemaker High Availability Cluster Manager
> Documentation=man:pacemakerd
> Documentation=https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html
>
> # DefaultDependencies takes care of sysinit.target,
> # basic.target, and shutdown.target
>
> # We need networking to bind to a network address. It is recommended not to
> # use Wants or Requires with network.target, and not to use
> # network-online.target for server daemons.
> After=network.target
>
> # Time syncs can make the clock jump backward, which messes with logging
> # and failure timestamps, so wait until it's done.
> After=time-sync.target
> ...
>
> Oct 05 14:58:10 h16 pacemakerd[6974]:  notice: Starting Pacemaker 2.0.4+20200616.2deceaa3a-3.9.1
>
> But still it does not "Require" time-sync.target...

Actually `After=` is a more strict dependency than `Require=`.

> Doesn't corosync need synchronized clocks?

Seems good to have, but low priority.

BR,
Roger
Re: [ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
On Tue, 12 Oct 2021 08:42:49 +0200
"Ulrich Windl" wrote:

> ...
> >> sysctl -a | grep dirty
> >> vm.dirty_background_bytes = 0
> >> vm.dirty_background_ratio = 10
> >
> > Considering your 256GB of physical memory, this means you can dirty up
> > to 25GB of pages in cache before the kernel starts to write them to
> > storage.
> >
> > You might want to trigger these background, lighter syncs well before
> > hitting this limit.
> >
> >> vm.dirty_bytes = 0
> >> vm.dirty_expire_centisecs = 3000
> >> vm.dirty_ratio = 20
> >
> > This is 20% of your 256GB physical memory. After this limit, writes
> > have to go to disks, directly. Considering the time to write to SSD
> > compared to memory, and the amount of data to sync in the background
> > as well (52GB), this could be very painful.
>
> However (unless doing really large commits) databases should flush
> buffers rather frequently, so I doubt database operations would fill the
> dirty buffer rate.

It depends on your database setup, your concurrency, your active dataset, your query profile, batches, and so on.

> "watch cat /proc/meminfo" could be your friend.

Or even better, make sure you have the sysstat or pcp tool family installed and harvesting system metrics. You'll have the full history of the dirty-page variations during the day/week/month.
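[To make the quoted percentages concrete, the thresholds for a 256GB box can be computed directly. Simple arithmetic, not a tuning recommendation; integer division truncates, which is why 25.6 and 51.2 GiB show up as 25 and 51 here:]

```shell
mem_bytes=$((256 * 1024 * 1024 * 1024))   # 256 GiB of RAM
bg=$((mem_bytes * 10 / 100))              # vm.dirty_background_ratio = 10
hard=$((mem_bytes * 20 / 100))            # vm.dirty_ratio = 20

echo "background writeback starts at: $((bg / 1024 / 1024 / 1024)) GiB of dirty pages"
echo "synchronous writes forced at:   $((hard / 1024 / 1024 / 1024)) GiB of dirty pages"
```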
[ClusterLabs] Possible timing bug in SLES15
Hi!

I just examined the corosync.service unit in SLES15. It contains:

# /usr/lib/systemd/system/corosync.service
[Unit]
Description=Corosync Cluster Engine
Documentation=man:corosync man:corosync.conf man:corosync_overview
ConditionKernelCommandLine=!nocluster
Requires=network-online.target
After=network-online.target
...

However the documentation says corosync requires synchronized system clocks. With this configuration corosync starts before the clocks are synchronized:

Oct 05 14:57:47 h16 ntpd[6767]: ntpd 4.2.8p15@1.3728-o Tue Jun 15 12:00:00 UTC 2021 (1): Starting
...
Oct 05 14:57:48 h16 systemd[1]: Starting Wait for ntpd to synchronize system clock...
...
Oct 05 14:57:48 h16 corosync[6793]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
...
Oct 05 14:57:48 h16 systemd[1]: Started Corosync Cluster Engine.
...
Oct 05 14:58:10 h16 systemd[1]: Started Wait for ntpd to synchronize system clock.
Oct 05 14:58:10 h16 systemd[1]: Reached target System Time Synchronized.

Only pacemaker.service has:

# /usr/lib/systemd/system/pacemaker.service
[Unit]
Description=Pacemaker High Availability Cluster Manager
Documentation=man:pacemakerd
Documentation=https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html

# DefaultDependencies takes care of sysinit.target,
# basic.target, and shutdown.target

# We need networking to bind to a network address. It is recommended not to
# use Wants or Requires with network.target, and not to use
# network-online.target for server daemons.
After=network.target

# Time syncs can make the clock jump backward, which messes with logging
# and failure timestamps, so wait until it's done.
After=time-sync.target
...

Oct 05 14:58:10 h16 pacemakerd[6974]:  notice: Starting Pacemaker 2.0.4+20200616.2deceaa3a-3.9.1

But still it does not "Require" time-sync.target...

Doesn't corosync need synchronized clocks?

Regards,
Ulrich
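[If one wanted corosync to wait for time sync the way pacemaker does, a systemd drop-in along these lines could express it. This is a sketch, not a vendor-supported change, and whether a hard dependency is appropriate is exactly what the thread debates: Wants= is used here rather than Requires= so corosync still starts when no time-sync provider is enabled, and After= only orders the units when both are in the same transaction. The target is only meaningful if some unit actually provides it (the "Wait for ntpd" service in the logs above, or systemd-time-wait-sync / chrony-wait on other setups).]

```
# /etc/systemd/system/corosync.service.d/time-sync.conf  (hypothetical drop-in)
[Unit]
# Order corosync after the clock is synchronized, without making it a
# hard requirement; run `systemctl daemon-reload` after creating this.
Wants=time-sync.target
After=time-sync.target
```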
[ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
>>> Jehan-Guillaume de Rorthais schrieb am 11.10.2021 um 11:57 in
Nachricht <2021105737.7cc99e69@firost>:
> Hi,
>
> I kept the full answer in history to keep the list informed of your full
> answer.
>
> My answer down below.
>
> On Mon, 11 Oct 2021 11:33:12 +0200
> damiano giuliani wrote:
>
>> Hey guys, sorry for being late, was busy during the WE.
>>
>> Here I am:
>>
>> > Did you see the swap activity (in/out, not just swap occupation)
>> > happen at the same time the member was lost on the corosync side?
>> > Did you check corosync or some of its libs were indeed in swap?
>>
>> No, and I don't know how to do it; I just noticed the swap occupation,
>> which suggested to me (and my colleague) to find out if it could cause
>> some trouble.
>>
>> > First, corosync now sits on a lot of memory because of knet. Did you
>> > try to switch back to udpu, which is using way less memory?
>>
>> No, I haven't moved to udpd, cast stop processes at all.
>>
>> "Could not lock memory of service to avoid page faults"
>>
>> grep -rn 'Could not lock memory of service to avoid page faults' /var/log/*
>> returns nothing

Maybe the expression is too specific (try "lock memory", maybe), or syslog is in the journal only (journalctl -b | grep "lock memory").

> This message should appear on corosync startup. Make sure the logs
> hadn't been rotated to a blackhole in the meantime...
>
>> > On my side, mlocks is unlimited in the ulimit settings. Check the
>> > values in /proc/$(coro PID)/limits (be careful with the ulimit
>> > command, check the proc itself).
>>
>> cat /proc/101350/limits
>> Limit                     Soft Limit     Hard Limit     Units
>> Max cpu time              unlimited      unlimited      seconds
>> Max file size             unlimited      unlimited      bytes
>> Max data size             unlimited      unlimited      bytes
>> Max stack size            8388608        unlimited      bytes
>> Max core file size        0              unlimited      bytes
>> Max resident set          unlimited      unlimited      bytes
>> Max processes             770868         770868         processes
>> Max open files            1024           4096           files
>> Max locked memory         unlimited      unlimited      bytes
>> Max address space         unlimited      unlimited      bytes
>> Max file locks            unlimited      unlimited      locks
>> Max pending signals       770868         770868         signals
>> Max msgqueue size         819200         819200         bytes
>> Max nice priority         0              0
>> Max realtime priority     0              0
>> Max realtime timeout      unlimited      unlimited      us
>>
>> Ah... That's the first thing I change.
>>
>> > In SLES, that is defaulted to 10s and so far I have never seen an
>> > environment that is stable enough for the default 1s timeout.
>>
>> Old versions have a 10s default.
>> You are not going to fix the problem this way; a 1s timeout for a
>> bonded network and overkill hardware is an enormous amount of time.
>>
>> hostnamectl | grep Kernel
>>    Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
>> [root@ltaoperdbs03 ~]# cat /etc/os-release
>> NAME="CentOS Linux"
>> VERSION="7 (Core)"
>>
>> > Indeed. But it's an arbitrage between swapping process mem or freeing
>> > mem by removing data from cache. For database servers, it is advised
>> > to use a lower value for swappiness anyway, around 5-10, as a swapped
>> > process means longer queries, longer data in caches, piling sessions,
>> > etc.
>>
>> Totally agree; for a DB server swappiness has to be 5-10.
>>
>> kernel?
>> > What are your settings for vm.dirty_* ?
>>
>> hostnamectl | grep Kernel
>>    Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
>> [root@ltaoperdbs03 ~]# cat /etc/os-release
>> NAME="CentOS Linux"
>> VERSION="7 (Core)"
>>
>> sysctl -a | grep dirty
>> vm.dirty_background_bytes = 0
>> vm.dirty_background_ratio = 10
>
> Considering your 256GB of physical memory, this means you can dirty up
> to 25GB of pages in cache before the kernel starts to write them to
> storage.
>
> You might want to trigger these background, lighter syncs well before
> hitting this limit.
>
>> vm.dirty_bytes = 0
>> vm.dirty_expire_centisecs = 3000
>> vm.dirty_ratio = 20
>
> This is 20% of your 256GB physical memory. After this limit, writes have
> to go to disks, directly. Considering the time to write to SSD compared
> to memory, and the amount of data to sync in the background as well
> (52GB), this could be very painful.

However (unless doing really large commits) databases should flush buffers rather frequently, so I doubt database operations would fill the dirty buffer rate.

"watch cat /proc/meminfo" could be your friend.

>> vm.dirty_writeback_centisecs = 500
>>
>> > Do you
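[The `watch cat /proc/meminfo` suggestion boils down to tracking the Dirty and Writeback lines. A self-contained sketch follows, parsing a captured sample so it runs anywhere; on a live box you would read /proc/meminfo itself, and the values here are made up for illustration.]

```shell
# Sample lines in the shape /proc/meminfo prints them (values invented)
meminfo='Dirty:           1048576 kB
Writeback:         32768 kB'

# Convert the kB counters to MiB for easier eyeballing; on a live system,
# replace `printf '%s\n' "$meminfo"` with `cat /proc/meminfo`.
printf '%s\n' "$meminfo" \
  | awk '/^(Dirty|Writeback):/ { printf "%s %d MiB\n", $1, $2 / 1024 }'
```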
[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Coming in Pacemaker 2.1.2: new fencing configuration options
>>> Andrei Borzenkov schrieb am 11.10.2021 um 11:43 in Nachricht:
> On Mon, Oct 11, 2021 at 9:29 AM Ulrich Windl wrote:
>
>> >> Also how long would such a delay be: Long enough until the other
>> >> node is fenced, or long enough until the other node was fenced,
>> >> booted (assuming it does) and is running pacemaker?
>> >
>> > The delay should be on the less-preferred node, long enough for that
>> > node to get fenced. The other node, with no delay, will fence it if
>> > it can. If the other node is for whatever reason unable to fence, the
>> > node with the delay will fence it after the delay.
>>
>> So the "fence intention" will be lost when the node is being fenced?
>> Otherwise the surviving node would have to clean up the "fence
>> intention". Or does it mean the "fence intention" does not make it to
>> the CIB and stays local on the node?
>>
>
> Two nodes cannot communicate with each other, so the surviving node is
> not aware of anything the fenced node did or intended to do. When the

I thought (local) CIB writes do not need a quorum.

> fenced node reboots and pacemaker starts, it should pull the CIB from
> the surviving node, so whatever intentions the fenced node had before
> reboot should be lost at this point.

If the surviving node has a newer CIB (as per modification/configuration count) than the fenced node, that is true; but if the fenced node has a newer CIB, the surviving node would pull the "other" CIB, right?

I think I had a few cases in the past when the "last dying node" did not have the "latest" CIB, causing some "extra noise" when the cluster was formed again.

Probably some period to wait for all nodes to join (and thus sync the CIBs) before performing any actions would help there.

Regards,
Ulrich