On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure is caused by a time lag between the walreceiver finishing
>> and pg_is_in_recovery() returning false.
>>
>> According to the output [1], it seems that the tool failed at
>> wait_for_end_recovery() with the message "standby server disconnected
>> from the primary". Also, the lines "redo done at..." and "terminating
>> walreceiver process due to administrator command" mean that the
>> walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> According to the source, the walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
>> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which means
>> pg_is_in_recovery() returns false) in the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and the state
>> change, the check in wait_for_end_recovery() can fail during that
>> window. Since we allow the walreceiver to be missing up to 10 times,
>> checked once per second, the failure occurs if the time lag is longer
>> than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is to make
>> NUM_CONN_ATTEMPTS larger, but that is not a fundamental solution.
>
> I was expecting that slow hosts might have issues in
> wait_for_end_recovery(). As you said, a lot of steps happen between
> FinishWalRecovery() (where the walreceiver is shut down --
> XLogShutdownWalRcv()) and the point where SharedRecoveryState is set to
> RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), the tool aborts. That's a bad decision
> because the promotion has already finished and the server is just doing
> the final preparation to become a primary.
>
>     /*
>      * If it is still in recovery, make sure the target server is
>      * connected to the primary so it can receive the required WAL to
>      * finish the recovery process. If it is disconnected, try
>      * NUM_CONN_ATTEMPTS in a row and bail out if we don't succeed.
>      */
>     res = PQexec(conn,
>                  "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
>     if (PQntuples(res) == 0)
>     {
>         if (++count > NUM_CONN_ATTEMPTS)
>         {
>             stop_standby_server(subscriber_dir);
>             pg_log_error("standby server disconnected from the primary");
>             break;
>         }
>     }
>     else
>         count = 0;      /* reset counter if it connects again */
>
> This code was added to defend against the death/crash of the target
> server. There are at least 3 options:
>
> (1) Increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed
> this constant and I decided to use 10 seconds because even on some slow
> hosts this limit wasn't reached during my tests. It seems I forgot to
> test the combination of slow host, asserts enabled, and ubsan. I didn't
> notice that pg_promote() uses 60 seconds as its default wait. Maybe
> that's a reasonable value. I checked the 004_timeline_switch test and
> the last run took: 39.2s (serinus), 33.1s (culicidae), 18.31s
> (calliphoridae) and 27.52s (olingo).
>
> (2) Check whether the primary is running when the walreceiver is not
> available on the target server, and increment the connection-attempt
> counter only if the primary is not running (see the sketch after this
> list). Hence, the described case doesn't increment the count variable.
>
> (3) Set the recovery_timeout default to a nonzero value and remove the
> pg_stat_wal_receiver check that protects against the death/crash of the
> target server. I explained in a previous message that a timeout may
> occur in cases where WAL replay takes more than recovery-timeout seconds
> to reach a consistent state.
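> For illustration only, here is a minimal sketch of option (2). The
> primary_is_running() helper and the primary_conninfo variable are
> assumptions of the sketch, not necessarily what the patch introduces
> under those names:
>
>     #include <stdbool.h>
>     #include <libpq-fe.h>
>
>     /*
>      * Hypothetical helper: report whether the primary currently
>      * accepts connections, using the conninfo the standby uses to
>      * reach it.
>      */
>     static bool
>     primary_is_running(const char *conninfo)
>     {
>         PGconn     *conn = PQconnectdb(conninfo);
>         bool        running = (PQstatus(conn) == CONNECTION_OK);
>
>         PQfinish(conn);
>         return running;
>     }
>
>     /* in wait_for_end_recovery(), replacing the counting logic above */
>     if (PQntuples(res) == 0)
>     {
>         /* only count the miss if the primary itself seems to be down */
>         if (!primary_is_running(primary_conninfo) &&
>             ++count > NUM_CONN_ATTEMPTS)
>         {
>             stop_standby_server(subscriber_dir);
>             pg_log_error("standby server disconnected from the primary");
>             break;
>         }
>     }
>     else
>         count = 0;      /* reset counter if it connects again */
>
> With this, a standby that is merely slow to flip SharedRecoveryState
> never accumulates misses as long as the primary stays up.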
> Option (1) is the easiest fix; however, we could hit the same issue
> again if a slow host decides to be even slower, and then we would have
> to adjust this value again. Option (2) interprets the walreceiver's
> absence as the end of recovery, and if the primary server is running it
> can indicate that the target server is on the verge of finishing
> recovery. Option (3) is not as resilient as the other options.
>
> The first patch implements a combination of (1) and (2).
>
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot specified as
>> primary_slot_name was not used by the walsender process. At that time
>> the walsender did not exist yet.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM
>> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information:
>> got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently the standby must be stopped before the command runs, and the
>> current code does not wait for replication to start. So there is a
>> possibility that the check runs before the walsender is launched.
>>
>> One possible approach is to wait until the replication starts. An
>> alternative is to relax the condition.
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (which was renamed
> to NUM_ATTEMPTS in the first patch). See the second patch.
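> In rough terms, the idea is a bounded retry instead of a single check.
> This is a sketch only, not the literal second patch; conn and the
> hard-coded slot name from the log above are assumed from the
> surrounding context:
>
>     /*
>      * Sketch: poll until the physical slot is in use by a walsender,
>      * giving up after NUM_ATTEMPTS tries, WAIT_INTERVAL seconds apart.
>      */
>     int         count = 0;
>
>     for (;;)
>     {
>         PGresult   *res = PQexec(conn,
>                                  "SELECT 1 FROM pg_catalog.pg_replication_slots "
>                                  "WHERE active AND slot_name = 'physical_slot'");
>
>         if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
>         {
>             PQclear(res);
>             break;              /* walsender has started using the slot */
>         }
>         PQclear(res);
>
>         if (++count > NUM_ATTEMPTS)
>             pg_fatal("could not obtain replication slot information");
>
>         pg_usleep(WAIT_INTERVAL * 1000000L);    /* interval is in seconds */
>     }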
Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide whether wait_for_end_recovery()
has waited long enough?

IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick. Increasing
PG_TEST_DEFAULT_TIMEOUT is already how buildfarm animals running things
like ubsan/valgrind deal with exactly this kind of timeout problem.

Or is there a deeper problem with deciding whether the system is in
recovery?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company