Hi all, I'm back with another PAF Postgres cluster problem. Last night, at 00:40:XX, the cluster fenced the master node and promoted the PAF resource on another node. The failover itself went fine; the problem is that I really don't know why it happened. This morning I noticed that the old master had been fenced by SBD and a new master promoted. Filtering the logs, I cannot find any reason for the fencing of the old master, nor for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us has been able to find the real reason. The cluster had worked flawlessly for days with no issues until now, so it's crucial for me to understand why this switchover occurred.
I attached the current status, configuration and logs. In the old master's logs I cannot find any reason; on the new master the only relevant entries are the fencing and the promotion. PS: could this be the reason for the fencing?

grep -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)

Any thoughts and help are really appreciated.

Damiano
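To reason about that warning, I put together a tiny helper (my own sketch, not part of sbd; the threshold comparison is my simplified reading of how the inquisitor treats servant age, not sbd's exact decision logic). It extracts the reported age and compares it against the SBD_WATCHDOG_TIMEOUT=5 from my config:

```shell
# My own triage sketch (not part of sbd): pull the servant age out of
# the warning above and compare it with SBD_WATCHDOG_TIMEOUT.  The
# comparison is my simplified reading of the inquisitor logic.
SBD_WATCHDOG_TIMEOUT=5   # from /etc/sysconfig/sbd

line='Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)'

# extract the number of seconds from "(age: 4)"
age=$(printf '%s\n' "$line" | sed -n 's/.*(age: \([0-9]*\)).*/\1/p')

if [ "$age" -ge "$SBD_WATCHDOG_TIMEOUT" ]; then
    msg="age ${age}s reached the ${SBD_WATCHDOG_TIMEOUT}s watchdog timeout"
else
    msg="age ${age}s stayed under the ${SBD_WATCHDOG_TIMEOUT}s watchdog timeout"
fi
echo "$msg"
```

Since the servant recovered (age: 0) one second later, and this was almost ten hours before the fencing, I suspect it is unrelated, but I'd appreciate confirmation.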
pcs status
Cluster name: ltaoperdbscluster
Stack: corosync
Current DC: ltaoperdbs03 (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Tue Jul 13 10:06:01 2021
Last change: Tue Jul 13 00:41:05 2021 by root via crm_attribute on ltaoperdbs03

3 nodes configured
4 resource instances configured

Online: [ ltaoperdbs03 ltaoperdbs04 ]
OFFLINE: [ ltaoperdbs02 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ ltaoperdbs03 ]
     Slaves: [ ltaoperdbs04 ]
     Stopped: [ ltaoperdbs02 ]
 pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Started ltaoperdbs03

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

[root@ltaoperdbs03 pengine]# pcs config show
Cluster Name: ltaoperdbscluster
Corosync Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04
Pacemaker Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-13/bin pgdata=/workspace/pdgs-db/13/data pgport=5432
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=25s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=25s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=172.18.2.10
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ltaoperdbscluster
 dc-version: 1.1.23-1.el7-9acf116022
 have-watchdog: true
 last-lrm-refresh: 1625090339
 stonith-enabled: true
 stonith-watchdog-timeout: 10s

Quorum:
  Options:

stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

#### SBD CONFIG
grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_TIMEOUT_ACTION=flush,reboot
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
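As a sanity check on the timeouts above (based on my understanding that watchdog-only SBD wants stonith-watchdog-timeout to be at least twice SBD_WATCHDOG_TIMEOUT; please correct me if that rule of thumb is wrong), the configured values seem consistent:

```shell
# Sanity check (my own helper, not cluster tooling): for watchdog-only
# SBD the usual recommendation is
#   stonith-watchdog-timeout >= 2 * SBD_WATCHDOG_TIMEOUT
# Values below are copied from the configuration above.
sbd_watchdog_timeout=5        # SBD_WATCHDOG_TIMEOUT (seconds)
stonith_watchdog_timeout=10   # stonith-watchdog-timeout cluster property (seconds)

if [ "$stonith_watchdog_timeout" -ge $((2 * sbd_watchdog_timeout)) ]; then
    result="ok"
else
    result="too-low"
fi
echo "stonith-watchdog-timeout check: $result"
```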
[root@ltaoperdbs03 cluster]# stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:40:" /var/log/messages
Jul 13 00:40:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85454 of user ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85455 of user nmon.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85456 of user nmon.
Jul 13 00:40:02 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:40:02 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
Jul 13 00:40:35 ltaoperdbs03 corosync[228685]: [TOTEM ] A processor failed, forming new configuration.
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 cib[228695]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 cib[228695]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [QUORUM] Members[2]: 2 3
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Lost attribute writer ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Removing all ltaoperdbs02 attributes for peer loss
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: State transition S_NOT_DC -> S_ELECTION
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 pacemakerd[228694]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: State transition S_ELECTION -> S_INTEGRATION
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsql-master-ip_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Stop pgsqld:2 ( Master ltaoperdbs02 ) due to node availability
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Move pgsql-master-ip ( ltaoperdbs02 -> ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Requesting fencing (reboot) of node ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Stonith operation 2/1:0:0:665567e9-db35-4f6f-a502-d3e9d33ee25b: OK (0)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Peer ltaoperdbs02 was terminated (reboot) by ltaoperdbs03 on behalf of crmd.228700: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Promoting instance on node "ltaoperdbs03"
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10469-1] 2021-07-13 00:40:48.399 UTC [228792] LOG: restartpoint complete: wrote 84513 buffers (2.1%); 0 WAL file(s) added, 0 removed, 36 recycled; write=222.219 s, sync=0.011 s, total=222.261 s; sync files=505, longest=0.002 s, average=0.000 s; distance=683179 kB, estimate=792192 kB
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-1] 2021-07-13 00:40:48.400 UTC [228792] LOG: recovery restart point at D5B/A815EE88
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-2] 2021-07-13 00:40:48.400 UTC [228792] DETAIL: Last completed transaction was at log time 2021-07-13 00:40:34.15804+00.
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Current node TL#LSN: 12#14688449270880
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]: notice: Initiating promote operation pgsqld_promote_0 locally on ltaoperdbs03
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [11-1] 2021-07-13 00:40:48.735 UTC [228791] LOG: received promote request
Jul 13 00:40:48 ltaoperdbs03 postgres[228796]: [9-1] 2021-07-13 00:40:48.736 UTC [228796] FATAL: terminating walreceiver process due to administrator command
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [12-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: invalid resource manager ID 32 at D5B/EBCD1460
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [13-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: redo done at D5B/EBCD1438
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [14-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: last completed transaction was at log time 2021-07-13 00:40:34.15804+00
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [15-1] 2021-07-13 00:40:48.754 UTC [228791] LOG: selected new timeline ID: 13
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [16-1] 2021-07-13 00:40:48.784 UTC [228791] LOG: archive recovery complete
Jul 13 00:40:49 ltaoperdbs03 postgres[228792]: [10471-1] 2021-07-13 00:40:49.046 UTC [228792] LOG: checkpoint starting: force
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34771]: INFO: Promote complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of promote operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating start operation pgsql-master-ip_start_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating monitor operation pgsqld_monitor_15000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Adding inet address 172.18.2.10/24 with broadcast address 172.18.2.255 to device bond0
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Bringing device bond0 up
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.18.2.10 bond0 172.18.2.10 auto not_used not_used
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of start operation for pgsql-master-ip on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating monitor operation pgsql-master-ip_monitor_10000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: No secondary connected to the master
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: "ltaoperdbs04" is not connected to the primary
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition aborted by nodes-3-master-pgsqld doing modify master-pgsqld=-1000: Configuration change
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: ltaoperdbs03-pgsqld_monitor_15000:27 [ /tmp:5432 - accepting connections\n ]
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition 0 (Complete=41, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]: notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-2185.bz2
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition 1 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2185.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jul 13 00:40:50 ltaoperdbs03 postgres[228789]: [8-1] 2021-07-13 00:40:50.107 UTC [228789] LOG: database system is ready to accept connections
Jul 13 00:40:50 ltaoperdbs03 ntpd[1471]: Listen normally on 7 bond0 172.18.2.10 UDP 123
Jul 13 00:40:53 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: ARPING 172.18.2.10 from 172.18.2.10 bond0#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s)

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85451 of user ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85453 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85452 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
[root@ltaoperdbs03 cluster]# grep stonith-ng /var/log/messages
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK

[root@ltaoperdbs03 cluster]# grep -e sbd /var/log/messages
[root@ltaoperdbs03 cluster]#
[root@ltaoperdbs02 cluster]# stonith_admin --verbose --history "*"
Could not connect to fencer: Transport endpoint is not connected
[root@ltaoperdbs02 cluster]# ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021
-bash: ltaoperdbs03: command not found
[root@ltaoperdbs02 cluster]# grep "Jul 13 00:40:" /var/log/messages
[root@ltaoperdbs02 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs02 postgres[211289]: [13-1] 2021-07-13 00:39:01.520 UTC [211289] LOG: duration: 5927.941 ms execute <unnamed>: SELECT partition_id, id_medium, online, capacity, used_space, library_id FROM ism_v_available_media WHERE library_synchronizing = 'f' ORDER BY partition_id, online DESC, used_space DESC, medium ASC ;
Jul 13 00:39:01 ltaoperdbs02 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85420 of user ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85421 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85419 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs02 systemd: Removed slice User Slice of ltauser.
Jul 13 00:39:22 ltaoperdbs02 postgres[172262]: [18-1] 2021-07-13 00:39:22.372 UTC [172262] LOG: duration: 664.280 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
Jul 13 00:39:24 ltaoperdbs02 postgres[219378]: [13-1] 2021-07-13 00:39:24.681 UTC [219378] LOG: duration: 871.726 ms statement: UPDATE t_srv_inventory SET validity = 't' WHERE (( (t_srv_inventory.id = 76806878) )) ;
Jul 13 00:39:27 ltaoperdbs02 postgres[219466]: [13-1] 2021-07-13 00:39:27.172 UTC [219466] LOG: duration: 649.754 ms statement: INSERT INTO t_srv_inventory ("name", contenttype, contentlength, origindate, checksum, validitystart, validitystop, footprint, validity, filetype_id, satellite_id, mission) VALUES ('S2A_OPER_MSI_L0__GR_SGS__20160810T190057_S20160810T134256_D01_N02.04.tar', 'application/octet-stream', 18388992, '2020-12-20 09:28:19.625', '[{"Algorithm":"MD5","ChecksumDate":"2021-07-13T00:40:07.859Z","Value":"e59dd78dd277a3c18da471056c85b2bc"},{"Algorithm":"XXH","ChecksumDate":"2021-07-13T00:40:07.869Z","Value":"6fcb8cb8f8d0d353"}]', '2016-08-10 13:42:56.000', '2016-08-10 13:42:56.000', 'POLYGON((-28.801828467124398 69.405031066987505,-29.2051411416635 69.059478097980403,-28.570372603499099 68.977450843291805,-28.160811855338 69.320498378428198,-28.801828467124398 69.405031066987505))', 'f', 373, 38, -1) ;
Jul 13 00:39:27 ltaoperdbs02 postgres[172262]: [19-1] 2021-07-13 00:39:27.270 UTC [172262] LOG: duration: 516.499 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;

[root@ltaoperdbs02 cluster]# grep stonith-ng /var/log/messages
[root@ltaoperdbs02 cluster]# grep -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
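The only trigger I can see on the surviving nodes is the corosync membership change at 00:40:35 ("A processor failed, forming new configuration"). One possibility I'm considering (purely an assumption on my part; the logs don't prove it) is a transient network or scheduling stall on ltaoperdbs02 longer than the totem token timeout. If that turned out to be the cause, would raising the token timeout in corosync.conf, e.g. something like the fragment below, be a reasonable mitigation, at the cost of slower failure detection?

```
totem {
    # hypothetical tuning, not my current setting; corosync 2.x
    # defaults to a 1000 ms token timeout when none is configured
    token: 5000
}
```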
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/