On Fri, 23 Jul 2021 12:52:00 +0200 damiano giuliani <damianogiulian...@gmail.com> wrote:
> the time query isnt the problem, is known that took its time. the network
> is 10gbs bonding, quite impossible to sature with queries :=).

Everything is possible, it's just harder :)

[...]
> checking again the logs what for me is not clear its the cause of the loss
> of quorum and then fence the node.

As said before, according to the logs from the other nodes, ltaoperdbs02 did
not answer the TOTEM protocol anymore, so it left the communication group.
But worse, it did so without saying goodbye properly:

> [TOTEM ] Failed to receive the leave message. failed: 1

From this exact time, the node is considered "unclean", aka its state is
"unknown". To solve this trouble, the cluster needs to fence it to set a
predictable state: OFF. So, the reaction to the trouble is sane.

Now, from the starting point of this conversation, the question is: what
happened? Logs on the other nodes will probably not help, as they just
witnessed a node disappearing without any explanation.

Logs from ltaoperdbs02 might help, but the corosync log you sent stops at
00:38:44, almost 2 minutes before the fencing as reported by the other nodes:

> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node:
> Cluster node ltaoperdbs02 will be fenced: peer is no longer part of

> So the cluster works flawessy as expected: as soon ltaoperdbs02 become
> "unreachable", it formed a new quorum, fenced the lost node and promoted
> the new master.

Exactly.

> What i cant findout is WHY its happened.
> there are no useful information into the system logs neither into the
> Idrac motherboard logs.

Because I suppose some logs were not synced to disk when the server was
fenced.

Either the server clocks were not synced (which I doubt), or you really lost
almost 2 minutes of logs.

> There is a way to improve or configure a log system for fenced / failed
> node?

Yes:

1. Set up rsyslog to export logs to some dedicated logging servers. Such
   servers should receive and save logs from your clusters and other hardware
   (network?) and keep them safe. You will not lose messages anymore.

2. Gather a lot of system metrics and keep them safe (e.g. export them using
   pcp, collectd, etc). Metrics and visualization are important to
   cross-compare with logs and pinpoint something behaving outside of the
   usual scope.

Looking at your log, I still find your query times suspicious. I'm not
convinced they are the root cause; they might just be a symptom/signal of
something going wrong there. Having a one-row INSERT take 649.754ms is
suspicious. Maybe it's just a locking problem, maybe there are some CPU-bound
PostGIS things involved, maybe with some GIN or GiST indexes, but it's still
suspicious considering the server is over-sized in performance, as you
stated...

And maybe the network or the SAN had a hiccup and corosync was too sensitive
to it. Check the retransmit and timeout parameters?

Regards,
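
As a side note, the "almost 2 minutes" gap between the last local corosync
entry and the fence decision can be checked with a few lines of Python. This
is only a minimal sketch: the two times are the ones quoted in this thread,
while the date of the last local entry (Jul 13), the year (2021, taken from
the mail date) and the helper names are assumptions made for illustration.

```python
from datetime import datetime

# Times quoted in this thread. The date of the last local entry and the year
# are ASSUMPTIONS: syslog prefixes carry no year, and only the time of the
# last corosync entry (00:38:44) was given.
LAST_LOCAL_LOG = "Jul 13 00:38:44"   # last corosync entry on ltaoperdbs02
FENCE_DECISION = "Jul 13 00:40:37"   # pe_fence_node warning on ltaoperdbs03


def parse_syslog_ts(stamp, year=2021):
    """Parse a 'Mon DD HH:MM:SS' syslog prefix (hypothetical helper).

    %b expects English month abbreviations under the default C locale.
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def log_gap_seconds(last_local, remote_event, year=2021):
    """Seconds between the last local log entry and an event seen remotely."""
    delta = parse_syslog_ts(remote_event, year) - parse_syslog_ts(last_local, year)
    return delta.total_seconds()


gap = log_gap_seconds(LAST_LOCAL_LOG, FENCE_DECISION)
print(f"{gap:.0f} s without local logs before the fence decision")
```

Running the same comparison against the last timestamp of each log file on
the fenced node might show whether every log stops at the same moment or only
the corosync one, assuming the clocks of the nodes were in sync.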