On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure is caused by a time lag between the walreceiver finishing
>> and pg_is_in_recovery() returning false.
>>
>> According to the output [1], it seems that the tool failed at
>> wait_for_end_recovery() with the message "standby server disconnected
>> from the primary". Also, the lines "redo done at..." and "terminating
>> walreceiver process due to administrator command" mean that the
>> walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> According to the source, the walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
>> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which means
>> pg_is_in_recovery() returns false) in the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and the state
>> change, the check in wait_for_end_recovery() can fail during that
>> window. Since we allow the walreceiver to be missing up to 10 times,
>> checked once per second, the failure occurs if the time lag is longer
>> than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is to make
>> NUM_CONN_ATTEMPTS larger, but that is not a fundamental solution.
>
> I was expecting that slow hosts might have issues in
> wait_for_end_recovery(). As you said, a lot of steps happen between
> FinishWalRecovery() (where the walreceiver is shut down --
> XLogShutdownWalRcv()) and the point where SharedRecoveryState is set to
> RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), the tool aborts. That's a bad decision
> because the promotion has already finished and the server is just doing
> the final preparation to become a primary.
>
>     /*
>      * If it is still in recovery, make sure the target server is
>      * connected to the primary so it can receive the required WAL to
>      * finish the recovery process. If it is disconnected, try
>      * NUM_CONN_ATTEMPTS in a row and bail out if we don't succeed.
>      */
>     res = PQexec(conn,
>                  "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
>     if (PQntuples(res) == 0)
>     {
>         if (++count > NUM_CONN_ATTEMPTS)
>         {
>             stop_standby_server(subscriber_dir);
>             pg_log_error("standby server disconnected from the primary");
>             break;
>         }
>     }
>     else
>         count = 0;      /* reset counter if it connects again */
>
> This code was added to defend against the death/crash of the target
> server. There are at least 3 options:
>
> (1) Increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed
> this constant and I decided to use 10 seconds because even on some slow
> hosts this limit wasn't reached during my tests. It seems I forgot to
> test the combination of slow host, asserts enabled, and ubsan. I didn't
> notice that pg_promote() uses 60 seconds as its default wait. Maybe
> that's a reasonable value. I checked the 004_timeline_switch test and
> the last run took: 39.2s (serinus), 33.1s (culicidae), 18.31s
> (calliphoridae) and 27.52s (olingo).
>
> (2) Check whether the primary is running when the walreceiver is not
> available on the target server, and increment the connection-attempt
> counter only if the primary is not running (see the sketch after this
> list). Hence, the described case doesn't increment the count variable.
>
> (3) Set the recovery_timeout default to a nonzero value and remove the
> pg_stat_wal_receiver check that protects against the death/crash of the
> target server. I explained in a previous message that a timeout may
> occur in cases where WAL replay takes more than recovery-timeout seconds
> to reach a consistent state.
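> For illustration only, here is a minimal sketch of option (2). The
> primary_is_running() helper and the primary_conninfo variable are
> assumptions of the sketch, not necessarily what the patch introduces
> under those names:
>
>     #include <stdbool.h>
>     #include <libpq-fe.h>
>
>     /*
>      * Hypothetical helper: report whether the primary currently
>      * accepts connections, using the conninfo the standby uses to
>      * reach it.
>      */
>     static bool
>     primary_is_running(const char *conninfo)
>     {
>         PGconn     *conn = PQconnectdb(conninfo);
>         bool        running = (PQstatus(conn) == CONNECTION_OK);
>
>         PQfinish(conn);
>         return running;
>     }
>
>     /* in wait_for_end_recovery(), replacing the counting logic above */
>     if (PQntuples(res) == 0)
>     {
>         /* only count the miss if the primary itself seems to be down */
>         if (!primary_is_running(primary_conninfo) &&
>             ++count > NUM_CONN_ATTEMPTS)
>         {
>             stop_standby_server(subscriber_dir);
>             pg_log_error("standby server disconnected from the primary");
>             break;
>         }
>     }
>     else
>         count = 0;      /* reset counter if it connects again */
>
> With this, a standby that is merely slow to flip SharedRecoveryState
> never accumulates misses as long as the primary stays up.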
> Option (1) is the easiest fix; however, we could hit the same issue
> again if a slow host decides to be even slower, and then we would have
> to adjust this value again. Option (2) interprets the walreceiver's
> absence as the end of recovery, and if the primary server is running it
> can indicate that the target server is on the verge of finishing
> recovery. Option (3) is not as resilient as the other options.
>
> The first patch implements a combination of (1) and (2).
>
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot specified as
>> primary_slot_name was not used by the walsender process. At that time
>> the walsender did not exist yet.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM
>> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information:
>> got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently the standby must be stopped before the command runs, and the
>> current code does not wait for replication to start. So there is a
>> possibility that the check runs before the walsender is launched.
>>
>> One possible approach is to wait until the replication starts. An
>> alternative is to relax the condition.
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (which was renamed
> to NUM_ATTEMPTS in the first patch). See the second patch.
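> In rough terms, the idea is a bounded retry instead of a single check.
> This is a sketch only, not the literal second patch; conn and the
> hard-coded slot name from the log above are assumed from the
> surrounding context:
>
>     /*
>      * Sketch: poll until the physical slot is in use by a walsender,
>      * giving up after NUM_ATTEMPTS tries, WAIT_INTERVAL seconds apart.
>      */
>     int         count = 0;
>
>     for (;;)
>     {
>         PGresult   *res = PQexec(conn,
>                                  "SELECT 1 FROM pg_catalog.pg_replication_slots "
>                                  "WHERE active AND slot_name = 'physical_slot'");
>
>         if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
>         {
>             PQclear(res);
>             break;              /* walsender has started using the slot */
>         }
>         PQclear(res);
>
>         if (++count > NUM_ATTEMPTS)
>             pg_fatal("could not obtain replication slot information");
>
>         pg_usleep(WAIT_INTERVAL * 1000000L);    /* interval is in seconds */
>     }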
Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide whether wait_for_end_recovery()
has waited long enough?

IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick. Increasing
PG_TEST_DEFAULT_TIMEOUT is already how buildfarm animals running things
like ubsan/valgrind deal with exactly this kind of timeout problem.

Or is there a deeper problem with deciding whether the system is in
recovery?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company