On 08.10.2021 16:00, damiano giuliani wrote:
> Hi guys, after months of sudden, unexpected failovers, and after
> checking every corner and every type of log without any luck (no logs,
> no reasons, no problems were shown anywhere), I was on the edge of
> madness when I finally managed to find out what the problem behind
> these sudden switches was.
> It was a tough bout, but I think I finally got it.
> I am quite sure this can be useful, especially for high-load database
> clusters.
> The servers are all resource overkill, with 80 CPUs and 256 GB of RAM,
> even though the DB ingests millions of records per day; the network is
> bonded 10 Gb/s, and the disks are SSDs.
> I found out that under high load the DB suddenly switched over for no
> reason, kicking out the master because of lost communication with it.
> The network works flawlessly without dropping a packet, the RAM was
> never saturated, and the CPUs are quite overkill.
> So it turned out that a little bit of swap was used, and I suspect the
> corosync process was swapped to disk, creating a lag for which the 1s
> default corosync timeout was not enough.

But you do not know whether corosync was swapped out at all. So it is
just a guess.
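If it happens again, there is a way to check rather than guess: the
VmSwap field in /proc/<pid>/status reports how much of a process
currently sits in swap. A minimal sketch, assuming corosync is running
and your kernel exposes VmSwap:

    # How much of corosync's memory is in swap right now (0 kB = none)
    grep VmSwap /proc/$(pidof corosync)/status

Sampling this periodically (e.g. from cron) would tell you whether
corosync really was swapped out around the time of a failover.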
> So it is: swap does not log anything, and moving a process from
> allocated RAM to swap takes more than the 1s default timeout (probably
> many times more).
> I fixed it by changing the swappiness of each server to 10 (at
> minimum), to keep the corosync process from swapping.

The swappiness kernel parameter does not really prevent swap from being
used. What is your kernel version?

On several consecutive kernel versions I observed the following effect:
once swap started being used at all, the system experienced periodic
stalls of several seconds. It felt like a frozen system. It did not
matter how much swap was allocated; several megabytes were already
enough. As far as I understand, the problem was not really the time
needed to swap out/in, but the time the kernel spent traversing page
tables to make the decision. I think it started with kernel 5.3 (or
maybe 5.2), and I have not seen it any more since, I believe, kernel
5.7.

> This issue, which should have been an easy one, drove me crazy,
> because process swapping is not tracked in any log, yet it makes
> corosync trigger the timeout and the cluster fail over.
>
> I really hope this can help the community.
>
> Best
>
> On Fri, 23 Jul 2021 at 15:46, Jehan-Guillaume de Rorthais
> <j...@dalibo.com> wrote:
>
>> On Fri, 23 Jul 2021 12:52:00 +0200
>> damiano giuliani <damianogiulian...@gmail.com> wrote:
>>
>>> The query time is not the problem; it is known that it takes its
>>> time. The network is 10 Gb/s bonded, quite impossible to saturate
>>> with queries :)
>>
>> Everything is possible, it's just harder :)
>>
>> [...]
>>> Checking the logs again, what is not clear to me is the cause of
>>> the loss of quorum and of the node then being fenced.
>>
>> As said before, according to the logs from the other nodes,
>> ltaoperdbs02 did not answer the TOTEM protocol anymore, so it left
>> the communication group. Worse, it did so without saying goodbye
>> properly:
>>
>> > [TOTEM ] Failed to receive the leave message. failed: 1
>>
>> From that exact time, the node is considered "unclean", i.e. its
>> state is "unknown". To solve this, the cluster needs to fence it to
>> set a predictable state: OFF. So the reaction to the trouble is sane.
>>
>> Now, from the starting point of this conversation, the question is:
>> what happened? Logs on the other nodes will probably not help, as
>> they just witnessed a node disappearing without any explanation.
>>
>> Logs from ltaoperdbs02 might help, but the corosync log you sent
>> stops at 00:38:44, almost two minutes before the fencing as reported
>> by the other nodes:
>>
>> > Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node:
>> > Cluster node ltaoperdbs02 will be fenced: peer is no longer part of
>>
>>> So the cluster worked flawlessly, as expected: as soon as
>>> ltaoperdbs02 became "unreachable", it formed a new quorum, fenced
>>> the lost node, and promoted a new master.
>>
>> Exactly.
>>
>>> What I can't find out is WHY it happened. There is no useful
>>> information in the system logs, nor in the iDRAC motherboard logs.
>>
>> Probably because some logs were not yet synced to disk when the
>> server was fenced.
>>
>> Either the server clocks were not in sync (which I doubt), or you
>> really lost almost two minutes of logs.
>>
>>> Is there a way to improve or configure the logging system for a
>>> fenced/failed node?
>>
>> Yes:
>>
>> 1. Set up rsyslog to export logs to dedicated logging servers. Such
>> servers should receive and save the logs from your clusters and
>> other hardware (network gear?) and keep them safe. You will not lose
>> messages anymore.
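For reference, the forwarding described above is a one-line rsyslog
rule. A minimal sketch, assuming the classic rsyslog syntax and a
central host reachable as "loghost" (the name and port are
placeholders):

    # /etc/rsyslog.d/90-forward.conf on each cluster node
    # forward everything; @@ means TCP, a single @ would be UDP
    *.* @@loghost:514

With TCP (or RELP) forwarding, messages leave the node as soon as they
are emitted, so they survive even if the node is fenced before syslog
flushes them to the local disk.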
>> 2. Gather a lot of system metrics and keep them safe (e.g. export
>> them using pcp, collectd, etc.). Metrics and visualization are
>> important to cross-compare with the logs and to pinpoint something
>> behaving outside of its usual scope.
>>
>> Looking at your log, I still find your query times suspicious. I'm
>> not convinced they are the root cause; they might just be a bad
>> symptom/signal of something going wrong there. Having a one-row
>> INSERT take 649.754 ms is suspicious. Maybe it's just a locking
>> problem, maybe there are some CPU-bound PostGIS operations involved,
>> maybe with some GIN or GiST indexes, but it's still suspicious
>> considering the server is over-sized in performance, as you
>> stated...
>>
>> And maybe the network or SAN had a hiccup and corosync was too
>> sensitive to it. Check the retransmit and timeout parameters?
>>
>> Regards,
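Those retransmit and timeout parameters live in the totem section of
corosync.conf; raising them makes the cluster tolerate short stalls at
the cost of slower failure detection. A sketch only -- the values below
are illustrative, not a recommendation:

    totem {
        # ms without the token before it is declared lost (default 1000)
        token: 3000
        # retransmit attempts before the token is declared lost
        token_retransmits_before_loss_const: 10
    }

The same corosync.conf must be updated on every node, and the new
timeouts take effect once corosync reloads or restarts.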