Re: pg_upgrade: optimize replication slot caught-up check

Chao Li Thu, 15 Jan 2026 18:52:41 -0800

> On Jan 15, 2026, at 01:54, Masahiko Sawada <[email protected]> wrote:
> 
> On Sun, Jan 11, 2026 at 7:31 PM Chao Li <[email protected]> wrote:
>> 
>> 
>> 
>>> On Jan 10, 2026, at 05:32, Masahiko Sawada <[email protected]> wrote:
>>> 
>>> On Fri, Jan 9, 2026 at 10:03 AM Masahiko Sawada <[email protected]> 
>>> wrote:
>>>> 
>>>> On Thu, Jan 8, 2026 at 6:59 PM Chao Li <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> I just looked into v3. Basically, it now does a shared WAL scan to find 
>>>>> the newest decodable LSN and uses that to compare with all slots’ 
>>>>> confirmed_flush_lsn, which significantly reduces WAL scan effort when 
>>>>> there are many slots.
>>>> 
>>>> Thank you for reviewing the patch!
>>>> 
>>>>> 
>>>>> One thing I'm thinking about is that if all slots are far behind, the 
>>>>> shared scan may still take a long time. Before this change, a scan could 
>>>>> stop as soon as it found a pending WAL. So after the change, when there 
>>>>> are only a few slots and they are far behind, the scan might end up doing 
>>>>> more work than before.
>>>> 
>>>> That's a valid concern.
>>>> 
>>>>> As a possible optimization, maybe we could also pass the newest 
>>>>> confirmed_flush_lsn to the scan. Once it finds a decodable WAL record 
>>>>> newer than that confirmed_flush_lsn, we already know all slots are 
>>>>> behind, so the scan could stop at that point.
>>>> 
>>>> Sounds like a reasonable idea. I'll give it a try and see how it's 
>>>> worthwhile.
>>>> 
>>>>> 
>>>>> WRT the code change, I got a few comments:
>>>>> 
>>>>> 1
>>>>> ···
>>>>> + * otherwise false. If last_pending_wal_p is set by the caller, it 
>>>>> continues
>>>>> + * scanning WAL even after detecting a decodable WAL record, in order to
>>>>> + * get the last decodable WAL record's LSN.
>>>>> */
>>>>> bool
>>>>> -LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal)
>>>>> +LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal,
>>>>> +                                                                       
>>>>> XLogRecPtr *last_pending_wal_p)
>>>>> {
>>>>>       bool            has_pending_wal = false;
>>>>> 
>>>>>       Assert(MyReplicationSlot);
>>>>> 
>>>>> +       if (last_pending_wal_p != NULL)
>>>>> +               *last_pending_wal_p = InvalidXLogRecPtr;
>>>>> ···
>>>>> 
>>>>> The header comment seems to conflict to the code. last_pending_wal_p is 
>>>>> unconditionally set to InvalidXLogRecPtr, so whatever a caller set is 
>>>>> overwritten. I think you meant to say “if last_pending_wal_p is not NULL”.
>>>> 
>>>> Agreed.
>>>> 
>>>>> 
>>>>> 2
>>>>> ```
>>>>> @@ -286,9 +287,9 @@ 
>>>>> binary_upgrade_logical_slot_has_caught_up(PG_FUNCTION_ARGS)
>>>>> {
>>>>>       Name            slot_name;
>>>>>       XLogRecPtr      end_of_wal;
>>>>> -       bool            found_pending_wal;
>>>>> +       XLogRecPtr      last_pending_wal;
>>>>> ```
>>>>> 
>>>>> The function header comment still says “returns true if …”, that should 
>>>>> be updated.
>>>>> 
>>>>> Also, with the change, the function name becomes misleading, where “has” 
>>>>> implies a boolean result, but now it will return the newest docodeable 
>>>>> wal when no catching up. The function name doesn’t reflect to the actual 
>>>>> behavior. Looks like the function is only used by pg_upgrade, so maybe 
>>>>> rename it.
>>>> 
>>>> Agreed, I'll incorporate the comment in the next version patch.
>>>> 
>>> 
>>> I've attached the updated patch that addressed all review comments.
>>> 
>>> Regards,
>>> 
>>> --
>>> Masahiko Sawada
>>> Amazon Web Services: https://aws.amazon.com
>>> <v4-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>
>> 
>> A few comments on v4:
>> 
>> 1
>> ```
>> -        * slot is considered caught up is done by an upgrade function. This
>> -        * regards the slot as caught up if we don't find any decodable 
>> changes.
>> -        * See binary_upgrade_logical_slot_has_caught_up().
>> +        * slot is considered caught up is done by an upgrade function, 
>> unless the
>> +        * caller sets skip_caught_up_check. This regards the slot as caught 
>> up if
>> +        * we don't find any decodable changes. See
>> +        * binary_upgrade_logical_slot_has_caught_up().
>> ```
>> 
>> binary_upgrade_logical_slot_has_caught_up has been renamed, so this commend 
>> needs to be updated.
>> 
>> 2
>> ```
>> +                                       "temporary IS FALSE "
>> +                                       "ORDER BY 1;",
>> +                                       (skip_caught_up_check || 
>> user_opts.live_check) ? "FALSE" :
>>                                        "(CASE WHEN invalidation_reason IS 
>> NOT NULL THEN FALSE "
>>                                        "ELSE (SELECT 
>> pg_catalog.binary_upgrade_logical_slot_has_caught_up(slot_name)) "
>>                                        "END)");
>> ```
>> 
>> pg_catalog.binary_upgrade_logical_slot_has_caught_up has been renamed and it 
>> takes two parameters now.
> 
> binary_upgrade_logical_slot_has_caught_up() is still used when the old
> cluster is PG18 or older, so we cannot use
> binary_upgrade_check_logical_slot_pending_wal() here.
> 
>> 
>> 
>> 3
>> ```
>> +                               if (last_pending_wal > scan_cutoff_lsn)
>> +                                       break;
>> ```
>> 
>> In LogicalReplicationSlotHasPendingWal() we early break when 
>> last_pending_wal > scan_cutoff_lsn, and later the SQL check 
>> “confirmed_flush_lsn > last_pending_wal”. So there is an edge case, where 
>> last_pending_wal == scan_cutoff_lsn and confirmed_flush_lsn == 
>> last_pending_wal, then neither early break nor caught up happens.
>> 
>> So, I think we should use “>=“ for the both checks.
> 
> Good catch, but I think we should use ">=" only in
> LogicalReplicationSlotHasPendingWal(). We can terminate the scan early
> if we find any decodable WAL whose LSN is >= confirmed_flush_lsn. If
> scan_cutoff (refers to a slot's confirmed_flush_lsn) ==
> last_pending_wal, the slot should not ignore the last_pending_wal.
> 
> I've attached the updated patch.
> 
> Regards,
> 
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v5-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

Hi Masahiko-san,

One typo, otherwise V5 looks good to me.

```
+                * using another query, it not during a live_check.
```
it not -> if not

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: pg_upgrade: optimize replication slot caught-up check

Reply via email to