On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <j...@dalibo.com>
wrote:
> It seems to me the problem comes from here:
>
> Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com crmd: notice:
> te_rsc_command: Initiating notify operation pgsqld_pre_notify_promote_0
> on ph-sql-05 | action 67
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com lrmd: warning:
> operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
>
> and here:
>
> Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: info:
> log_execute: executing - rsc:pgsqld action:notify call_id:38
> pgsqlms(pgsqld)[20881]:
> Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: warning:
> operation_finished: pgsqld_notify_0:20881 - timed out after 60000ms
>
> The pgsql election occurs during the pre-promote action, where all
> remaining nodes set their LSN location. During the promote action, the
> designated primary checks that its LSN location is the highest one. If it
> is not, it cancels the promotion so the next round can elect the best node.
>
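> To illustrate the mechanism (a minimal sketch, not the actual pgsqlms
> code; the attrd_updater handling is simplified and $OTHER_LSN stands in
> for another node's published value):
>
>     # pre-promote: each node publishes its last received LSN as a
>     # private node attribute
>     MY_LSN=$(psql -AtX -c "SELECT pg_last_wal_receive_lsn()")
>     attrd_updater --name lsn_location --update "$MY_LSN" --private
>
>     # promote: the designated primary verifies it is not behind any
>     # other node before proceeding
>     psql -AtX -c "SELECT pg_wal_lsn_diff('$MY_LSN', '$OTHER_LSN') >= 0"
>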
> Back to your issue. According to the logs, both standby nodes timed out
> during the pre-promote action. No LSN location has been set across the
> whole cluster. I couldn't see any messages from the attrd daemon related
> to the lsn_location attribute, or any other cleanup actions.
>
> I couldn't even find the INFO message from pgsqlms giving its current
> status before actually setting it ("Current node TL#LSN: %s"). But this
> message appears soon after a... checkpoint. See:
> https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
>
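> For context, what happens around that line boils down to something like
> this (illustrative shell, not a verbatim excerpt of the script):
>
>     # an immediate checkpoint; on a standby this shows up as
>     # "restartpoint starting: immediate wait" in the PostgreSQL logs
>     psql -AtX -c "CHECKPOINT"
>     # then read the current timeline and last received LSN
>     psql -AtX -c "SELECT timeline_id FROM pg_control_checkpoint()"
>     psql -AtX -c "SELECT pg_last_wal_receive_lsn()"
>
> If the CHECKPOINT call blocks for longer than the notify timeout, the
> whole pre-promote action is killed before the LSN is ever set.
>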
> Maybe you have very long checkpoints on both nodes that timed out the
> pre-promote action? Do you have any PostgreSQL logs showing useful
> information around this time frame?
>
Yes, I have logs that show at what time checkpoints were taken; they're
attached. Currently checkpoint_timeout is set to 15 minutes. I already
lowered it from the previous value of 60 minutes. I could experiment with
even lower values for checkpoint_timeout, as it seems the problem is that
forced checkpoints are taking too long (more than 60 s). Perhaps I should
also increase the notify timeout?
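Concretely, I was thinking along these lines (untested values, just to
illustrate; pgsqld is the resource name from our cluster):

    # lower checkpoint_timeout so each cycle accumulates fewer dirty
    # buffers, making a forced (immediate) restartpoint faster
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '5min'"
    psql -c "SELECT pg_reload_conf()"

    # and/or give the notify operation more headroom in Pacemaker
    pcs resource update pgsqld op notify timeout=120s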
Thank you very much for your time.
2019-07-03 19:30:10.887 CEST [1214]: LOG: restartpoint complete: wrote 345241
buffers (32.9%); 0 WAL file(s) added, 0 removed, 9 recycled; write=810.270 s,
sync=0.973 s, total=811.445 s; sync files=594, longest=0.064 s, average=0.001
s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:30:10.887 CEST [1214]: LOG: recovery restart point at
444/DB777640
2019-07-03 19:30:10.887 CEST [1214]: DETAIL: Last completed transaction was at
log time 2019-07-03 19:29:45.48093+02.
2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting: immediate wait
2019-07-03 19:32:19.721 CEST [30315]: FATAL: terminating walreceiver due to
timeout
2019-07-03 19:32:19.813 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:19.821 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:19.821 P00 INFO: archive-get command end: completed
successfully (9ms)
2019-07-03 19:32:19.831 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:19.941 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:19.941 P00 INFO: archive-get command end: completed
successfully (110ms)
2019-07-03 19:32:23.955 CEST [1714]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:23.966 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:23.970 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:23.970 P00 INFO: archive-get command end: completed
successfully (5ms)
2019-07-03 19:32:24.835 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:24.938 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:24.938 P00 INFO: archive-get command end: completed
successfully (104ms)
2019-07-03 19:32:26.959 CEST [2580]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:26.971 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:26.976 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:26.976 P00 INFO: archive-get command end: completed
successfully (6ms)
2019-07-03 19:32:29.838 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:29.941 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:29.941 P00 INFO: archive-get command end: completed
successfully (104ms)
2019-07-03 19:32:29.965 CEST [3823]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:29.975 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:29.980 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:29.980 P00 INFO: archive-get command end: completed
successfully (6ms)
2019-07-03 19:32:34.844 P00 INFO: archive-get command begin 2.15:
[0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:34.946 P00 INFO: unable to find 0000000400000445000000C7 in
the archive
2019-07-03 19:32:34.946 P00 INFO: archive-get command end: completed
successfully (103ms)
2019-07-03 19:32:37.956 CEST [4903]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:32:37.967 P00 INFO: archive-get command begin 2.15:
[00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress
--db-timeout=3600 --log-level-console=info --log-level-file=detail
--pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data
--pg2-path=/var/lib/pgsql/11/data --process-max=4
--repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest
--stanza=pgdb2
2019-07-03 19:32:37.971 P00 INFO: unable to find 00000005.history in the
archive
2019-07-03 19:32:37.971 P00 INFO: archive-get command end: completed
successfully (5ms)
2019-07-03 19:32:39.682 CEST [702]: LOG: received fast shutdown request
2019-07-03 19:32:39.692 CEST [702]: LOG: aborting any active transactions
2019-07-03 19:32:39.693 CEST [30963]: FATAL: terminating connection due to
administrator command
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
2019-07-03 19:32:39.693 CEST [30963]: LOG: could not send data to client:
Broken pipe
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
2019-07-03 19:32:39.694 CEST [5868]: FATAL: terminating connection due to
administrator command
2019-07-03 19:32:39.694 CEST [5868]: STATEMENT: SELECT SUM(idx_scan::bigint)
FROM pg_stat_all_tables WHERE schemaname <> 'pg_catalog' AND schemaname
<> 'information_schema' AND schemaname !~ '^pg_toast'
2019-07-03 19:32:39.699 CEST [5886]: FATAL: the database system is shutting
down
2019-07-03 19:29:52.438 CEST [23255]: LOG: checkpoint complete: wrote 91743
buffers (8.7%); 0 WAL file(s) added, 0 removed, 66 recycled; write=809.647 s,
sync=5.
2019-07-03 19:31:11.526 CEST [23255]: LOG: checkpoint starting: time
2019-07-03 19:31:20.048 CEST [26823]: LOG: restartpoint complete: wrote 362269
buffers (34.5%); 0 WAL file(s) added, 0 removed, 9 recycled; write=809.797 s,
sync=0.823 s, total=810.785 s; sync files=597, longest=0.146 s, average=0.001
s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:31:20.048 CEST [26823]: LOG: recovery restart point at
444/DB777640
2019-07-03 19:31:20.048 CEST [26823]: DETAIL: Last completed transaction was
at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:31:38.628 CEST [26823]: LOG: restartpoint starting: immediate
wait
2019-07-03 19:32:19.715 CEST [9468]: FATAL: terminating walreceiver due to
timeout
2019-07-03 19:32:22.720 CEST [34253]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
...
2019-07-03 19:33:52.948 CEST [24754]: FATAL: could not connect to the primary
server: could not connect to server: No route to host
2019-07-03 19:33:53.204 CEST [26574]: LOG: received fast shutdown request
2019-07-03 19:33:53.212 CEST [26574]: LOG: aborting any active transactions
2019-07-03 19:33:53.212 CEST [20902]: FATAL: terminating connection due to
administrator command
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT: CHECKPOINT
2019-07-03 19:33:53.212 CEST [20902]: LOG: could not send data to client:
Broken pipe
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT: CHECKPOINT
2019-07-03 19:33:53.240 CEST [24843]: FATAL: the database system is shutting
down
...
2019-07-03 19:38:02.296 CEST [27169]: FATAL: the database system is shutting
down
2019-07-03 19:38:20.448 CEST [26823]: LOG: restartpoint complete: wrote 745698
buffers (71.1%); 0 WAL file(s) added, 16 removed, 50 recycled; write=400.770 s,
sync=0.002 s, total=401.819 s; sync files=538, longest=0.000 s, average=0.000
s; distance=1077509 kB, estimate=2291924 kB
2019-07-03 19:38:20.448 CEST [26823]: LOG: recovery restart point at
445/1D3B8D20
2019-07-03 19:38:20.448 CEST [26823]: DETAIL: Last completed transaction was
at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:38:20.450 CEST [26823]: LOG: shutting down
2019-07-03 19:38:21.560 CEST [26574]: LOG: database system is shut down