On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
> ## Analysis for failure 1
>
> The failure is caused by a time lag between the moment the walreceiver
> finishes and the moment pg_is_in_recovery() starts returning false.
>
> According to the output [1], the tool failed at wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, the
> lines "redo done at..." and "terminating walreceiver process due to
> administrator command" mean that the walreceiver was requested to shut
> down by XLogShutdownWalRcv().
>
> According to the source, the walreceiver is shut down in
> StartupXLOG() -> FinishWalRecovery() -> XLogShutdownWalRcv(), while
> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which makes
> pg_is_in_recovery() return false) only in the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and the state change,
> the check in wait_for_end_recovery() can fail during that window. Since we
> allow the walreceiver to be missing up to NUM_CONN_ATTEMPTS (10) times,
> checked once per second, the failure occurs if the time lag is longer than
> 10 seconds.
>
> I do not have a good way to fix it. One approach is to make
> NUM_CONN_ATTEMPTS larger, but that is not a fundamental solution.
I was expecting that slow hosts might have issues in wait_for_end_recovery(). As you said, a lot happens between FinishWalRecovery() (where the walreceiver is shut down via XLogShutdownWalRcv()) and the point where SharedRecoveryState is set to RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS * WAIT_INTERVAL (10 seconds), the tool aborts the execution. That's a bad decision because the promotion has already finished and the server is just doing the final preparation to become a primary.

        /*
         * If it is still in recovery, make sure the target server is
         * connected to the primary so it can receive the required WAL to
         * finish the recovery process. If it is disconnected try
         * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
         */
        res = PQexec(conn, "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
        if (PQntuples(res) == 0)
        {
            if (++count > NUM_CONN_ATTEMPTS)
            {
                stop_standby_server(subscriber_dir);
                pg_log_error("standby server disconnected from the primary");
                break;
            }
        }
        else
            count = 0;          /* reset counter if it connects again */

This code was added to defend against the death/crash of the target server. There are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL. We discussed this constant and I decided to use 10 seconds because even on some slow hosts this limit wasn't reached during my tests. It seems I forgot to test the combination of slow host, asserts enabled, and ubsan. I also didn't notice that pg_promote() uses 60 seconds as its default wait; maybe that's a reasonable value. I checked the 004_timeline_switch test and the last runs took 39.2s (serinus), 33.1s (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) when the walreceiver is not available on the target server, check whether the primary is running, and only increase the connection attempt counter if the primary is not running. Hence, the described case does not increment the count variable.
(3) set the recovery_timeout default to a nonzero value and remove the pg_stat_wal_receiver protection against the death/crash of the target server. I explained in a previous message that a timeout may occur in cases where WAL replay takes more than recovery-timeout seconds to reach a consistent state.

Option (1) is the easiest fix; however, we can hit the same issue again if a slow host decides to be even slower, and then we have to adjust this value again. Option (2) interprets the walreceiver's absence as the end of recovery; if the primary server is running, that absence can indicate that the target server is on the verge of finishing recovery. Option (3) is not as resilient as the other options.

The first patch implements a combination of (1) and (2).

> ## Analysis for failure 2
>
> According to [2], the physical replication slot specified as
> primary_slot_name was not used by the walsender process. At that time the
> walsender did not exist yet.
>
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
> ...
> ```
>
> Currently the standby must be stopped before running the command, and the
> current code does not block until replication has started. So there is a
> possibility that the check runs before the walsender is launched.
>
> One possible approach is to wait until replication starts. An alternative
> is to ease the condition.

That's my suggestion too. I reused NUM_CONN_ATTEMPTS (which was renamed to NUM_ATTEMPTS in the first patch). See the second patch.

--
Euler Taveira
EDB   https://www.enterprisedb.com/
From b2cfa13627bd844b28fe050ffb35e8b9696fdf2e Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.tave...@enterprisedb.com>
Date: Mon, 25 Mar 2024 22:01:52 -0300
Subject: [PATCH v1 1/2] Improve the code that checks if the recovery is
 finishing

The recovery process has a window between the walreceiver shutdown and
the moment the pg_is_in_recovery function starts returning false. It
means the pg_stat_wal_receiver check can make the tool abort while the
server is finishing the recovery (even though it already reached the
recovery target).

Since the tool checks pg_stat_wal_receiver to verify the primary is
available, if that query does not return a row, PQping the primary
server. If the primary is up and running, it can indicate that the
target server is finishing the recovery process, hence we shouldn't
count it as a failed attempt. This avoids premature failures on slow
hosts.

While at it, increase the number of attempts (10 to 60). The wait time
is the same the pg_promote function uses by default.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 30 +++++++++++++--------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index b8f8269340..cca93d8c25 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -30,6 +30,8 @@
 
 #define DEFAULT_SUB_PORT "50432"
 
+#define NUM_ATTEMPTS 60
+
 /* Command-line options */
 struct CreateSubscriberOptions
 {
@@ -93,7 +95,7 @@ static void pg_ctl_status(const char *pg_ctl_cmd, int rc);
 static void start_standby_server(const struct CreateSubscriberOptions *opt,
 								 bool restricted_access);
 static void stop_standby_server(const char *datadir);
-static void wait_for_end_recovery(const char *conninfo,
+static void wait_for_end_recovery(const struct LogicalRepInfo *dbinfo,
 								  const struct CreateSubscriberOptions *opt);
 static void create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
@@ -1354,18 +1356,16 @@ stop_standby_server(const char *datadir)
  * the recovery process. By default, it waits forever.
  */
 static void
-wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt)
+wait_for_end_recovery(const struct LogicalRepInfo *dbinfo, const struct CreateSubscriberOptions *opt)
 {
 	PGconn	   *conn;
 	int			status = POSTMASTER_STILL_STARTING;
 	int			timer = 0;
 	int			count = 0;		/* number of consecutive connection attempts */
 
-#define NUM_CONN_ATTEMPTS 10
-
 	pg_log_info("waiting for the target server to reach the consistent state");
 
-	conn = connect_database(conninfo, true);
+	conn = connect_database(dbinfo->subconninfo, true);
 
 	for (;;)
 	{
@@ -1384,16 +1384,24 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions
 		}
 
 		/*
-		 * If it is still in recovery, make sure the target server is
-		 * connected to the primary so it can receive the required WAL to
-		 * finish the recovery process. If it is disconnected try
-		 * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
+		 * If it is still in recovery, make sure the target server is connected
+		 * to the primary so it can receive the required WAL to finish the
+		 * recovery process. If the walreceiver process is not running, it
+		 * should indicate that (i) the recovery is almost finished or (ii) the
+		 * primary is not running or is not accepting connections. It should
+		 * only count as an attempt if (ii) is true. In this case, try
+		 * NUM_ATTEMPTS in a row and bail out if it does not succeed.
 		 */
 		res = PQexec(conn, "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
 		if (PQntuples(res) == 0)
 		{
-			if (++count > NUM_CONN_ATTEMPTS)
+			if (PQping(dbinfo->pubconninfo) != PQPING_OK)
+				count++;
+			else
+				count = 0;		/* reset counter if it connects again */
+
+			if (count > NUM_ATTEMPTS)
 			{
 				stop_standby_server(subscriber_dir);
 				pg_log_error("standby server disconnected from the primary");
 				break;
@@ -2113,7 +2121,7 @@ main(int argc, char **argv)
 	start_standby_server(&opt, true);
 
 	/* Waiting the subscriber to be promoted */
-	wait_for_end_recovery(dbinfo[0].subconninfo, &opt);
+	wait_for_end_recovery(&dbinfo[0], &opt);
 
 	/*
 	 * Create the subscription for each database on subscriber. It does not
-- 
2.30.2
From 872510ed67b91cb7482dea3bce831549d3519e80 Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.tave...@enterprisedb.com>
Date: Mon, 25 Mar 2024 23:25:30 -0300
Subject: [PATCH v1 2/2] Improve the code that checks if the primary slot is
 available

The target server is started a few instructions before the primary
server is checked, so the replication slot might not be active yet.
Instead of failing on the first try, retry a few times.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 35 +++++++++++++++------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index cca93d8c25..f17e9bde9c 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -887,6 +887,8 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
 	{
 		PQExpBuffer str = createPQExpBuffer();
 		char	   *psn_esc = PQescapeLiteral(conn, primary_slot_name, strlen(primary_slot_name));
+		int			ntuples;
+		int			count = 0;
 
 		appendPQExpBuffer(str,
						  "SELECT 1 FROM pg_catalog.pg_replication_slots "
@@ -897,25 +899,38 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
 
 		pg_log_debug("command is: %s", str->data);
 
-		res = PQexec(conn, str->data);
-		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		/*
+		 * The replication slot might take some time to become active; try a
+		 * few times if necessary.
+		 */
+		do
 		{
-			pg_log_error("could not obtain replication slot information: %s",
-						 PQresultErrorMessage(res));
-			disconnect_database(conn, true);
-		}
+			res = PQexec(conn, str->data);
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+			{
+				pg_log_error("could not obtain replication slot information: %s",
+							 PQresultErrorMessage(res));
+				disconnect_database(conn, true);
+			}
 
-		if (PQntuples(res) != 1)
+			ntuples = PQntuples(res);
+			PQclear(res);
+
+			if (ntuples == 1)	/* replication slot is already active */
+				break;
+			else
+				count++;
+		} while (count < NUM_ATTEMPTS);
+
+		if (ntuples != 1)
 		{
 			pg_log_error("could not obtain replication slot information: got %d rows, expected %d row",
-						 PQntuples(res), 1);
+						 ntuples, 1);
 			disconnect_database(conn, true);
 		}
 		else
 			pg_log_info("primary has replication slot \"%s\"", primary_slot_name);
-
-		PQclear(res);
 	}
 
 	disconnect_database(conn, false);
-- 
2.30.2