Hi all, I'm back with another PAF Postgres cluster problem. Last night, at 00:40:XX, the cluster fenced the master node and promoted the PAF resource on another node. The failover itself went fine; the problem is that I really don't know why it happened. This morning I noticed that the old master had been fenced by SBD and a new master promoted. Filtering the logs, I cannot find any reason for the fencing of the old master, nor for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us has been able to find the real reason. The cluster had worked flawlessly for days with no issues until now, so it's crucial for me to understand why this switchover occurred.
I attached the current status, configuration and logs. In the old master's logs I cannot find any reason; on the new master the only relevant entries are the fencing and the promotion. PS: could this be the reason for the fencing?

grep -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)

Any thoughts and help are really appreciated.

Damiano
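To reason about that warning, I put together a tiny helper (my own sketch, not part of sbd; the threshold comparison is my simplified reading of how the inquisitor treats servant age, not sbd's exact decision logic). It extracts the reported age and compares it against the SBD_WATCHDOG_TIMEOUT=5 from my config:

```shell
# My own triage sketch (not part of sbd): pull the servant age out of
# the warning above and compare it with SBD_WATCHDOG_TIMEOUT.  The
# comparison is my simplified reading of the inquisitor logic.
SBD_WATCHDOG_TIMEOUT=5   # from /etc/sysconfig/sbd

line='Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)'

# extract the number of seconds from "(age: 4)"
age=$(printf '%s\n' "$line" | sed -n 's/.*(age: \([0-9]*\)).*/\1/p')

if [ "$age" -ge "$SBD_WATCHDOG_TIMEOUT" ]; then
    msg="age ${age}s reached the ${SBD_WATCHDOG_TIMEOUT}s watchdog timeout"
else
    msg="age ${age}s stayed under the ${SBD_WATCHDOG_TIMEOUT}s watchdog timeout"
fi
echo "$msg"
```

Since the servant recovered (age: 0) one second later, and this was almost ten hours before the fencing, I suspect it is unrelated, but I'd appreciate confirmation.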
pcs status
Cluster name: ltaoperdbscluster
Stack: corosync
Current DC: ltaoperdbs03 (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Tue Jul 13 10:06:01 2021
Last change: Tue Jul 13 00:41:05 2021 by root via crm_attribute on ltaoperdbs03

3 nodes configured
4 resource instances configured

Online: [ ltaoperdbs03 ltaoperdbs04 ]
OFFLINE: [ ltaoperdbs02 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ ltaoperdbs03 ]
     Slaves: [ ltaoperdbs04 ]
     Stopped: [ ltaoperdbs02 ]
 pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Started ltaoperdbs03

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

[root@ltaoperdbs03 pengine]# pcs config show
Cluster Name: ltaoperdbscluster
Corosync Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04
Pacemaker Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-13/bin pgdata=/workspace/pdgs-db/13/data pgport=5432
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=25s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=25s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=172.18.2.10
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ltaoperdbscluster
 dc-version: 1.1.23-1.el7-9acf116022
 have-watchdog: true
 last-lrm-refresh: 1625090339
 stonith-enabled: true
 stonith-watchdog-timeout: 10s

Quorum:
  Options:

stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

#### SBD CONFIG
grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_TIMEOUT_ACTION=flush,reboot
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
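As a sanity check on the timeouts above (based on my understanding that watchdog-only SBD wants stonith-watchdog-timeout to be at least twice SBD_WATCHDOG_TIMEOUT; please correct me if that rule of thumb is wrong), the configured values seem consistent:

```shell
# Sanity check (my own helper, not cluster tooling): for watchdog-only
# SBD the usual recommendation is
#   stonith-watchdog-timeout >= 2 * SBD_WATCHDOG_TIMEOUT
# Values below are copied from the configuration above.
sbd_watchdog_timeout=5        # SBD_WATCHDOG_TIMEOUT (seconds)
stonith_watchdog_timeout=10   # stonith-watchdog-timeout cluster property (seconds)

if [ "$stonith_watchdog_timeout" -ge $((2 * sbd_watchdog_timeout)) ]; then
    result="ok"
else
    result="too-low"
fi
echo "stonith-watchdog-timeout check: $result"
```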
[root@ltaoperdbs03 cluster]# stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:40:" /var/log/messages
Jul 13 00:40:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85454 of user ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85455 of user nmon.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85456 of user nmon.
Jul 13 00:40:02 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:40:02 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
Jul 13 00:40:35 ltaoperdbs03 corosync[228685]: [TOTEM ] A processor failed, forming new configuration.
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 cib[228695]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 cib[228695]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [QUORUM] Members[2]: 2 3
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Lost attribute writer ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Removing all ltaoperdbs02 attributes for peer loss
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: State transition S_NOT_DC -> S_ELECTION
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 pacemakerd[228694]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: State transition S_ELECTION -> S_INTEGRATION
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsql-master-ip_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Stop pgsqld:2 ( Master ltaoperdbs02 ) due to node availability
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice:  * Move pgsql-master-ip ( ltaoperdbs02 -> ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Requesting fencing (reboot) of node ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Stonith operation 2/1:0:0:665567e9-db35-4f6f-a502-d3e9d33ee25b: OK (0)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Peer ltaoperdbs02 was terminated (reboot) by ltaoperdbs03 on behalf of crmd.228700: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Promoting instance on node "ltaoperdbs03"
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10469-1] 2021-07-13 00:40:48.399 UTC [228792] LOG: restartpoint complete: wrote 84513 buffers (2.1%); 0 WAL file(s) added, 0 removed, 36 recycled; write=222.219 s, sync=0.011 s, total=222.261 s; sync files=505, longest=0.002 s, average=0.000 s; distance=683179 kB, estimate=792192 kB
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-1] 2021-07-13 00:40:48.400 UTC [228792] LOG: recovery restart point at D5B/A815EE88
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-2] 2021-07-13 00:40:48.400 UTC [228792] DETAIL: Last completed transaction was at log time 2021-07-13 00:40:34.15804+00.
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Current node TL#LSN: 12#14688449270880
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]: notice: Initiating promote operation pgsqld_promote_0 locally on ltaoperdbs03
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [11-1] 2021-07-13 00:40:48.735 UTC [228791] LOG: received promote request
Jul 13 00:40:48 ltaoperdbs03 postgres[228796]: [9-1] 2021-07-13 00:40:48.736 UTC [228796] FATAL: terminating walreceiver process due to administrator command
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [12-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: invalid resource manager ID 32 at D5B/EBCD1460
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [13-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: redo done at D5B/EBCD1438
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [14-1] 2021-07-13 00:40:48.737 UTC [228791] LOG: last completed transaction was at log time 2021-07-13 00:40:34.15804+00
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [15-1] 2021-07-13 00:40:48.754 UTC [228791] LOG: selected new timeline ID: 13
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [16-1] 2021-07-13 00:40:48.784 UTC [228791] LOG: archive recovery complete
Jul 13 00:40:49 ltaoperdbs03 postgres[228792]: [10471-1] 2021-07-13 00:40:49.046 UTC [228792] LOG: checkpoint starting: force
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34771]: INFO: Promote complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of promote operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating notify operation pgsqld_post_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating start operation pgsql-master-ip_start_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating monitor operation pgsqld_monitor_15000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Adding inet address 172.18.2.10/24 with broadcast address 172.18.2.255 to device bond0
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Bringing device bond0 up
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.18.2.10 bond0 172.18.2.10 auto not_used not_used
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Result of start operation for pgsql-master-ip on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Initiating monitor operation pgsql-master-ip_monitor_10000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: No secondary connected to the master
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: "ltaoperdbs04" is not connected to the primary
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition aborted by nodes-3-master-pgsqld doing modify master-pgsqld=-1000: Configuration change
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: ltaoperdbs03-pgsqld_monitor_15000:27 [ /tmp:5432 - accepting connections\n ]
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition 0 (Complete=41, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]: notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-2185.bz2
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: Transition 1 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2185.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jul 13 00:40:50 ltaoperdbs03 postgres[228789]: [8-1] 2021-07-13 00:40:50.107 UTC [228789] LOG: database system is ready to accept connections
Jul 13 00:40:50 ltaoperdbs03 ntpd[1471]: Listen normally on 7 bond0 172.18.2.10 UDP 123
Jul 13 00:40:53 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: ARPING 172.18.2.10 from 172.18.2.10 bond0#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s)

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85451 of user ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85453 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85452 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
[root@ltaoperdbs03 cluster]# grep stonith-ng /var/log/messages
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK

[root@ltaoperdbs03 cluster]# grep -e sbd /var/log/messages
[root@ltaoperdbs03 cluster]#
[root@ltaoperdbs02 cluster]# stonith_admin --verbose --history "*"
Could not connect to fencer: Transport endpoint is not connected
[root@ltaoperdbs02 cluster]# ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021
-bash: ltaoperdbs03: command not found
[root@ltaoperdbs02 cluster]# grep "Jul 13 00:40:" /var/log/messages
[root@ltaoperdbs02 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs02 postgres[211289]: [13-1] 2021-07-13 00:39:01.520 UTC [211289] LOG: duration: 5927.941 ms execute <unnamed>: SELECT partition_id, id_medium, online, capacity, used_space, library_id FROM ism_v_available_media WHERE library_synchronizing = 'f' ORDER BY partition_id, online DESC, used_space DESC, medium ASC ;
Jul 13 00:39:01 ltaoperdbs02 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85420 of user ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85421 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85419 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs02 systemd: Removed slice User Slice of ltauser.
Jul 13 00:39:22 ltaoperdbs02 postgres[172262]: [18-1] 2021-07-13 00:39:22.372 UTC [172262] LOG: duration: 664.280 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
Jul 13 00:39:24 ltaoperdbs02 postgres[219378]: [13-1] 2021-07-13 00:39:24.681 UTC [219378] LOG: duration: 871.726 ms statement: UPDATE t_srv_inventory SET validity = 't' WHERE (( (t_srv_inventory.id = 76806878) )) ;
Jul 13 00:39:27 ltaoperdbs02 postgres[219466]: [13-1] 2021-07-13 00:39:27.172 UTC [219466] LOG: duration: 649.754 ms statement: INSERT INTO t_srv_inventory ("name", contenttype, contentlength, origindate, checksum, validitystart, validitystop, footprint, validity, filetype_id, satellite_id, mission) VALUES ('S2A_OPER_MSI_L0__GR_SGS__20160810T190057_S20160810T134256_D01_N02.04.tar', 'application/octet-stream', 18388992, '2020-12-20 09:28:19.625', '[{"Algorithm":"MD5","ChecksumDate":"2021-07-13T00:40:07.859Z","Value":"e59dd78dd277a3c18da471056c85b2bc"},{"Algorithm":"XXH","ChecksumDate":"2021-07-13T00:40:07.869Z","Value":"6fcb8cb8f8d0d353"}]', '2016-08-10 13:42:56.000', '2016-08-10 13:42:56.000', 'POLYGON((-28.801828467124398 69.405031066987505,-29.2051411416635 69.059478097980403,-28.570372603499099 68.977450843291805,-28.160811855338 69.320498378428198,-28.801828467124398 69.405031066987505))', 'f', 373, 38, -1) ;
Jul 13 00:39:27 ltaoperdbs02 postgres[172262]: [19-1] 2021-07-13 00:39:27.270 UTC [172262] LOG: duration: 516.499 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;

[root@ltaoperdbs02 cluster]# grep stonith-ng /var/log/messages
[root@ltaoperdbs02 cluster]# grep -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
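The only trigger I can see on the surviving nodes is the corosync membership change at 00:40:35 ("A processor failed, forming new configuration"). One possibility I'm considering (purely an assumption on my part; the logs don't prove it) is a transient network or scheduling stall on ltaoperdbs02 longer than the totem token timeout. If that turned out to be the cause, would raising the token timeout in corosync.conf, e.g. something like the fragment below, be a reasonable mitigation, at the cost of slower failure detection?

```
totem {
    # hypothetical tuning, not my current setting; corosync 2.x
    # defaults to a 1000 ms token timeout when none is configured
    token: 5000
}
```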
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/