On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <j...@dalibo.com>
wrote:
> It seems to me the problem comes from here:
>
> Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com crmd: notice:
> te_rsc_command: Initiating notify operation pgsqld_pre_notify_promote_0
> on ph-sql-05 | action 67
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com lrmd: warning:
> operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
>
> and here:
>
> Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: info:
> log_execute: executing - rsc:pgsqld action:notify call_id:38
> pgsqlms(pgsqld)[20881]:
> Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: warning:
> operation_finished: pgsqld_notify_0:20881 - timed out after 60000ms
>
> The pgsql election occurs during the pre-promote action, where all
> remaining nodes set their LSN location. During the promote action, the
> designated primary checks that its LSN location is the highest one. If it
> is not, it cancels the promotion so the next round can elect the best node.
>
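> To illustrate the mechanism (a minimal sketch, not the actual pgsqlms
> code; the attrd_updater handling is simplified and $OTHER_LSN stands in
> for another node's published value):
>
>     # pre-promote: each node publishes its last received LSN as a
>     # private node attribute
>     MY_LSN=$(psql -AtX -c "SELECT pg_last_wal_receive_lsn()")
>     attrd_updater --name lsn_location --update "$MY_LSN" --private
>
>     # promote: the designated primary verifies it is not behind any
>     # other node before proceeding
>     psql -AtX -c "SELECT pg_wal_lsn_diff('$MY_LSN', '$OTHER_LSN') >= 0"
>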
> Back to your issue. According to the logs, both standby nodes timed out
> during the pre-promote action. No LSN location has been set across the
> whole cluster. I couldn't see any messages from the attrd daemon related
> to the lsn_location attribute, or any other cleanup actions.
>
> I couldn't even find the INFO message from pgsqlms giving its current
> status before actually setting it ("Current node TL#LSN: %s"). But this
> message appears soon after a... checkpoint. See:
> https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
>
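> For context, what happens around that line boils down to something like
> this (illustrative shell, not a verbatim excerpt of the script):
>
>     # an immediate checkpoint; on a standby this shows up as
>     # "restartpoint starting: immediate wait" in the PostgreSQL logs
>     psql -AtX -c "CHECKPOINT"
>     # then read the current timeline and last received LSN
>     psql -AtX -c "SELECT timeline_id FROM pg_control_checkpoint()"
>     psql -AtX -c "SELECT pg_last_wal_receive_lsn()"
>
> If the CHECKPOINT call blocks for longer than the notify timeout, the
> whole pre-promote action is killed before the LSN is ever set.
>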
> Maybe you have very long checkpoints on both nodes that timed out the
> pre-promote action? Do you have any PostgreSQL logs showing useful
> information around this time frame?
>
Yes, I have logs that show at what time checkpoints were taken; they're
attached. Currently checkpoint_timeout is set to 15 minutes. I already
lowered it from the previous value of 60 minutes. I could experiment with
even lower values for checkpoint_timeout, as it seems the problem is that
forced checkpoints are taking too long (more than 60 s). Perhaps I should
also increase the notify timeout?
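Concretely, I was thinking along these lines (untested values, just to
illustrate; pgsqld is the resource name from our cluster):

    # lower checkpoint_timeout so each cycle accumulates fewer dirty
    # buffers, making a forced (immediate) restartpoint faster
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '5min'"
    psql -c "SELECT pg_reload_conf()"

    # and/or give the notify operation more headroom in Pacemaker
    pcs resource update pgsqld op notify timeout=120s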
Thank you very much for your time.
2019-07-03 19:30:10.887 CEST [1214]: LOG: restartpoint complete: wrote 345241
buffers (32.9%); 0 WAL file(s) added, 0 removed, 9 recycled; write=810.270 s,
sync=0.973 s, total=811.445 s; sync files=594, longest=0.064 s, average=0.001
s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:30:10.887 CEST [1214]: LOG: recovery restart point at
444/DB777640
2019-07-03 19:30:10.887 CEST [1214]: DETAIL: Last completed transaction was at
log time 2019-07-03 19:29:45.48093+02.
2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting: immediate wait
2019-07-03 19:32:19.721 CEST [30315]: FATAL: terminating walreceiver due to
timeout
2019-07-03 19:32:19.813 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:19.821 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:19.821 P00 INFO: archive-get command end: completed
successfully (9ms)
2019-07-03 19:32:19.831 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:19.941 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:19.941 P00 INFO: archive-get command end: completed
successfully (110ms)
2019-07-03 19:32:23.955 CEST [1714]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:23.966 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:23.970 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:23.970 P00 INFO: archive-get command end: completed
successfully (5ms)
2019-07-03 19:32:24.835 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:24.938 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:24.938 P00 INFO: archive-get command end: completed
successfully (104ms)
2019-07-03 19:32:26.959 CEST [2580]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:26.971 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:26.976 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:26.976 P00 INFO: archive-get command end: completed
successfully (6ms)
2019-07-03 19:32:29.838 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:29.941 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:29.941 P00 INFO: archive-get command end: completed
successfully (104ms)
2019-07-03 19:32:29.965 CEST [3823]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:29.975 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:29.980 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:29.980 P00 INFO: archive-get command end: completed
successfully (6ms)
2019-07-03 19:32:34.844 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:34.946 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:34.946 P00 INFO: archive-get command end: completed
successfully (103ms)
2019-07-03 19:32:37.956 CEST [4903]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:37.967 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:37.971 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:37.971 P00 INFO: archive-get command end: completed
successfully (5ms)
2019-07-03 19:32:39.682 CEST [702]: LOG: received fast shutdown request
2019-07-03 19:32:39.692 CEST [702]: LOG: aborting any active transactions
2019-07-03 19:32:39.693 CEST [30963]: FATAL: terminating connection due to
administrator command
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
2019-07-03 19:32:39.693 CEST [30963]: LOG: could not send data to client:
Broken pipe
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
2019-07-03 19:32:39.694 CEST [5868]: FATAL: terminating connection due to
administrator command
2019-07-03 19:32:39.694 CEST [5868]: STATEMENT: SELECT SUM(idx_scan::bigint)
FROM pg_stat_all_tables WHERE schemaname <> 'pg_catalog' AND schemaname
<> 'information_schema' AND schemaname !~ '^pg_toast'
2019-07-03 19:32:39.699 CEST [5886]: FATAL: the database system is shutting
down
2019-07-03 19:29:52.438 CEST [23255]: LOG: checkpoint complete: wrote 91743
buffers (8.7%); 0 WAL file(s) added, 0 removed, 66 recycled; write=809.647 s,
sync=5.
2019-07-03 19:31:11.526 CEST [23255]: LOG: checkpoint starting: time
2019-07-03 19:31:20.048 CEST [26823]: LOG: restartpoint complete: wrote 362269
buffers (34.5%); 0 WAL file(s) added, 0 removed, 9 recycled; write=809.797 s,
sync=0.823 s, total=810.785 s; sync files=597, longest=0.146 s, average=0.001
s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:31:20.048 CEST [26823]: LOG: recovery restart point at
444/DB777640
2019-07-03 19:31:20.048 CEST [26823]: DETAIL: Last completed transaction was
at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:31:38.628 CEST [26823]: LOG: restartpoint starting: immediate
wait
2019-07-03 19:32:19.715 CEST [9468]: FATAL: terminating walreceiver due to
timeout
2019-07-03 19:32:22.720 CEST [34253]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
...
2019-07-03 19:33:52.948 CEST [24754]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:33:53.204 CEST [26574]: LOG: received fast shutdown request
2019-07-03 19:33:53.212 CEST [26574]: LOG: aborting any active transactions
2019-07-03 19:33:53.212 CEST [20902]: FATAL: terminating connection due to
administrator command
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT: CHECKPOINT
2019-07-03 19:33:53.212 CEST [20902]: LOG: could not send data to client:
Broken pipe
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT: CHECKPOINT
2019-07-03 19:33:53.240 CEST [24843]: FATAL: the database system is shutting
down
...
2019-07-03 19:38:02.296 CEST [27169]: FATAL: the database system is shutting
down
2019-07-03 19:38:20.448 CEST [26823]: LOG: restartpoint complete: wrote 745698
buffers (71.1%); 0 WAL file(s) added, 16 removed, 50 recycled; write=400.770 s,
sync=0.002 s, total=401.819 s; sync files=538, longest=0.000 s, average=0.000
s; distance=1077509 kB, estimate=2291924 kB
2019-07-03 19:38:20.448 CEST [26823]: LOG: recovery restart point at
445/1D3B8D20
2019-07-03 19:38:20.448 CEST [26823]: DETAIL: Last completed transaction was
at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:38:20.450 CEST [26823]: LOG: shutting down
2019-07-03 19:38:21.560 CEST [26574]: LOG: database system is shut down