Hi guys, thanks for the support. The query time isn't the problem; it is known to take a while. The network is 10 Gb/s bonded, which is practically impossible to saturate with queries :). The servers are heavily overprovisioned: at full database load, only about 20% of the resources are in use. Checking the logs again, what is still not clear to me is the cause of the loss of quorum and the subsequent fencing of the node. There is no useful information in the logs (not even in the iDRAC / motherboard event logs).
The only clear log entries are:

[228684] ltaoperdbs03 corosync notice  [TOTEM ] A processor failed, forming new configuration.
[228684] ltaoperdbs03 corosync notice  [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
[228684] ltaoperdbs03 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
[228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
[228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
Jul 13 00:40:37 [228695] ltaoperdbs03 cib:     info: pcmk_cpg_membership: Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03 cib:     info: crm_update_peer_proc: pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
Jul 13 00:40:37 [228700] ltaoperdbs03 crmd:    info: pcmk_cpg_membership: Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03 cib:     notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: determine_online_status: Node ltaoperdbs02 is unclean
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogNodeActions: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogAction: * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: info: LogActions: Leave pgsqld:1 (Slave ltaoperdbs04)

So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became "unreachable", it formed a new membership, fenced the lost node and promoted the new master. What I can't find out is WHY it happened. There is no useful information in the system logs, nor in the iDRAC / motherboard logs.
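For context, the "A processor failed, forming new configuration." message means corosync did not see the TOTEM token from ltaoperdbs02 within the token timeout, so a brief network stall or host freeze shorter than the fencing itself can still trigger a membership change. If short stalls are suspected, the detection window can be widened in corosync.conf. A minimal sketch of the relevant knobs; the values below are illustrative, not recommendations:

```
# /etc/corosync/corosync.conf (excerpt) -- illustrative values only
totem {
    version: 2
    # Milliseconds to wait for the token before declaring a processor
    # failed (the "A processor failed, forming new configuration." event).
    token: 3000
    # Extra milliseconds added per node beyond two (corosync >= 2.x).
    token_coefficient: 650
    # Token retransmits attempted before the token is considered lost.
    token_retransmits_before_loss_const: 10
}
```

Raising `token` trades slower failure detection (and therefore slower failover) for tolerance of short stalls, so it is worth changing only if metrics show transient load or network hiccups around the incident.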
Is there a way to improve or configure logging for a fenced / failed node?

Thanks
Damiano

Il giorno gio 22 lug 2021 alle ore 15:06 Jehan-Guillaume de Rorthais <
j...@dalibo.com> ha scritto:

> Hi,
>
> On Wed, 14 Jul 2021 07:58:14 +0200
> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote:
> [...]
> > Could it be that a command saturated the network?
> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936
> > UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT
> > xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN
> > ism_files f ON f.id_file = xmf.file_id JOIN ism_files_path fp ON
> > f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE
> > xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND
> > o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT
> > 7265 ;
>
> I doubt such a query could saturate the network. The query time itself isn't
> proportional to the result set size.
>
> Moreover, there are only three fields per row and, according to their names, I
> doubt the row size is really big.
>
> Plus, even if the result set were that big, chances are that the frontend will
> not be able to cope with it as fast as the network, unless the frontend is
> doing nothing really fancy with the dataset. So the frontend itself might
> saturate before the network, giving some break to the latter.
>
> However, if this query time is unusual, that might illustrate some pressure on
> the server by some other means (CPU? MEM? IO?). Detailed metrics would help.
>
> Regards,
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
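On the logging question above: one option is to make corosync's own logging more verbose on all nodes, so the surviving nodes record more detail about the membership loss even when the fenced node's logs are cut short. A sketch of a corosync.conf logging section, assuming corosync 2.x/3.x; the log file path is illustrative:

```
# /etc/corosync/corosync.conf (excerpt) -- illustrative
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    # Per-subsystem debug is very noisy; enable only while investigating.
    logger_subsys {
        subsys: TOTEM
        debug: on
    }
}
```

Corosync also keeps an in-memory "blackbox" flight recorder that can be dumped with the `corosync-blackbox` tool after an incident; running it on the surviving nodes right after a fence event may capture TOTEM state that never reached syslog.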
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/