> On Jan 8, 2026, at 05:08, Masahiko Sawada <[email protected]> wrote:
> 
> On Wed, Jan 7, 2026 at 12:18 PM Masahiko Sawada <[email protected]> wrote:
>> 
>> On Tue, Jan 6, 2026 at 12:09 PM Masahiko Sawada <[email protected]> 
>> wrote:
>>> 
>>> On Mon, Jan 5, 2026 at 6:55 PM Chao Li <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jan 6, 2026, at 02:02, Masahiko Sawada <[email protected]> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical
>>>>> slots. Currently, to check if the slots are ready to be migrated, we
>>>>> call binary_upgrade_logical_slot_has_caught_up() for every single
>>>>> slot. This checks if there are any unconsumed WAL records. However, we
>>>>> noticed a performance issue. If there are many slots (e.g., 100 or
>>>>> more) or if there is a WAL backlog, checking all slots one by one
>>>>> takes a long time.
>>>>> 
>>>>> Here are some test results from my environment:
>>>>> With an empty cluster: 1.55s
>>>>> With 200 slots and 30MB backlog: 15.51s
>>>>> 
>>>>> Commit 6d3d2e8e5 introduced parallel checks per database, but a single
>>>>> job might still have to check too many slots, causing delays.
>>>>> 
>>>>> Since binary_upgrade_logical_slot_has_caught_up() essentially checks
>>>>> if any decodable record exists in the database, IIUC it is not
>>>>> necessary to check every slot. We can optimize this by checking only
>>>>> the slot with the minimum confirmed_flush_lsn. If that slot is caught
>>>>> up, we can assume others are too. The attached patch implements this
>>>>> optimization. With the patch, the test with 200 slots finished in
>>>>> 2.512s. The execution time is now stable regardless of the number of
>>>>> slots.
>>>>> 
>>>>> One thing to note is that DecodeTXNNeedSkip() also considers
>>>>> replication origin filters. Theoretically, a plugin could filter out
>>>>> specific origins, which might lead to different results. However, this
>>>>> is a very rare case. Even if it happens, it would just result in a
>>>>> false positive (the upgrade fails safely), so the impact is minimal.
>>>>> Therefore, the patch simplifies the check to be per-database instead
>>>>> of per-slot.
>>>>> 
>>>>> Feedback is very welcome.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> --
>>>>> Masahiko Sawada
>>>>> Amazon Web Services: https://aws.amazon.com
>>>>> <v1-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>
>>>> 
>>>> Hi Masahiko-san,
>>>> 
>>>> Basically, the optimization idea makes sense to me. Reducing the
>>>> check from O(slots × WAL) to O(databases × WAL) sounds like a win.
>>>> Just a few small comments:
>>> 
>>> Thank you for reviewing the patch!
>>> 
>>>> 
>>>> 1
>>>> ```
>>>> +static void process_old_cluter_logical_slot_catchup_infos(DbInfo *dbinfo, PGresult *res, void *arg);
>>>> ```
>>>> 
>>>> Typo in the function name: cluter->cluster
>>>> 
>>>> 2
>>>> ```
>>>> -               logical_slot_infos_query = get_old_cluster_logical_slot_infos_query();
>>>> +               const char *slot_info_query = "SELECT slot_name, plugin, two_phase, failover, "
>>>> ```
>>>> 
>>>> logical_slot_infos_query is no longer used and should be removed.
>>> 
>>> Fixed.
>>> 
>>>> 
>>>> 3
>>>> ```
>>>> +                               "  ORDER BY confirmed_flush_lsn ASC "
>>>> ```
>>>> 
>>>> I am wondering whether confirmed_flush_lsn can be NULL. If it can, we may
>>>> want to add "NULLS LAST".
>>> 
>>> Given that the query is not executed during a live check, I think
>>> confirmed_flush cannot be NULL. Similarly, temporary should never be true,
>>> but the check is kept for consistency with slot_info_query.
>>> 
>>>> 
>>>> 4
>>>> ```
>>>> +       Assert(num_slots == dbinfo->slot_arr.nslots);
>>>> +
>>>> +       for (int i = 0; i < num_slots; i++)
>>>> +       {
>>>> +               char       *slotname;
>>>> +               bool            caught_up;
>>>> +
>>>> +               slotname = PQgetvalue(res, i, PQfnumber(res, "slot_name"));
>>>> +               caught_up = (strcmp(PQgetvalue(res, i, PQfnumber(res, "caught_up")), "t") == 0);
>>>> +
>>>> +               for (int slotnum = 0; slotnum < dbinfo->slot_arr.nslots; slotnum++)
>>>> +               {
>>>> ```
>>>> 
>>>> As num_slots == dbinfo->slot_arr.nslots, this is still an O(num_slots^2)
>>>> check. Given that this patch's goal is to eliminate the burden of a large
>>>> number of slots, maybe we could build a hash table to make the lookup faster.
>>>> 
>>> 
>>> Or, if we sort both queries in the same order, we can simply update the
>>> slots sequentially.
>>> 
>>> One problem I can see with this approach is that we could wrongly report
>>> invalid slots in invalid_logical_slots.txt. Even if the slot with the
>>> oldest confirmed_flush_lsn has unconsumed decodable records, other slots
>>> might already have caught up. I'll think about how to deal with this
>>> problem.
>> 
>> I came up with a simple idea: instead of
>> binary_upgrade_logical_slot_has_caught_up() returning true/false, we
>> can have it return the last decodable WAL record's LSN, and consider a
>> logical slot caught up if its confirmed_flush_lsn has already passed
>> that LSN. This way, we can keep scanning at most one logical slot per
>> database while still reporting the same contents as today. The function
>> now needs to scan the entire WAL backlog, but it's still much better
>> than the current method.
>> 
>> I've implemented this idea in the v2 patch with some updates:
>> 
>> - incorporated the review comments.
>> - added logic to support PG18 or older.
>> - added regression tests.
>> 
>> Feedback is very welcome.
>> 
> 
> I've fixed the test failures reported by CI.
> 
> Regards,
> 
> -- 
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v3-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

I just looked into v3. Basically, it now does a shared WAL scan to find the
newest decodable LSN and compares that against each slot's
confirmed_flush_lsn, which significantly reduces the WAL scan effort when
there are many slots.
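
IIUC, the per-slot decision then effectively reduces to a comparison like
this (a minimal sketch; the names are mine, not the patch's actual
identifiers):

```
/*
 * last_decodable_lsn: the newest decodable record's LSN found by the
 * shared WAL scan. A slot is caught up iff it has already confirmed
 * past that record.
 */
for (int i = 0; i < nslots; i++)
        slots[i].caught_up = (slots[i].confirmed_flush_lsn >= last_decodable_lsn);
```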

One thing I'm thinking about is that if all slots are far behind, the shared
scan may still take a long time. Before this change, a scan could stop as
soon as it found a pending WAL record. So after the change, when there are
only a few slots and they are far behind, the scan might end up doing more
work than before.

As a possible optimization, maybe we could also pass the newest 
confirmed_flush_lsn to the scan. Once it finds a decodable WAL record newer 
than that confirmed_flush_lsn, we already know all slots are behind, so the 
scan could stop at that point.
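
Roughly, the existing scan loop in LogicalReplicationSlotHasPendingWal()
could take such a cutoff; this is only a sketch, where cutoff_lsn is a
hypothetical new parameter and I'm glossing over how the patch detects the
last decodable record on each iteration:

```
/* cutoff_lsn: the newest confirmed_flush_lsn among the database's slots */
while (ctx->reader->EndRecPtr < end_of_wal)
{
        XLogRecord *record;
        char       *errm = NULL;

        record = XLogReadRecord(ctx->reader, &errm);
        if (errm)
                elog(ERROR, "could not find record: %s", errm);
        if (record != NULL)
                LogicalDecodingProcessRecord(ctx, ctx->reader);

        if (ctx->processing_required)
        {
                *last_pending_wal_p = ctx->reader->ReadRecPtr;

                /*
                 * Even the most advanced slot has not confirmed past this
                 * record, so every slot is behind; stop scanning early.
                 */
                if (ctx->reader->ReadRecPtr > cutoff_lsn)
                        break;
        }

        CHECK_FOR_INTERRUPTS();
}
```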

Regarding the code changes, I have a few comments:

1
```
+ * otherwise false. If last_pending_wal_p is set by the caller, it continues
+ * scanning WAL even after detecting a decodable WAL record, in order to
+ * get the last decodable WAL record's LSN.
  */
 bool
-LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal)
+LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal,
+                                                                       XLogRecPtr *last_pending_wal_p)
 {
        bool            has_pending_wal = false;
 
        Assert(MyReplicationSlot);
 
+       if (last_pending_wal_p != NULL)
+               *last_pending_wal_p = InvalidXLogRecPtr;
```

The header comment seems to conflict with the code: last_pending_wal_p is
unconditionally set to InvalidXLogRecPtr, so whatever the caller set is
overwritten. I think you meant to say "if last_pending_wal_p is not NULL".
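
For example, the sentence could read something like:

```
+ * otherwise false. If last_pending_wal_p is not NULL, the function keeps
+ * scanning WAL even after detecting a decodable WAL record, in order to
+ * return the last decodable WAL record's LSN through it.
```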

2
```
@@ -286,9 +287,9 @@ binary_upgrade_logical_slot_has_caught_up(PG_FUNCTION_ARGS)
 {
        Name            slot_name;
        XLogRecPtr      end_of_wal;
-       bool            found_pending_wal;
+       XLogRecPtr      last_pending_wal;
```

The function header comment still says "returns true if …", which should be
updated.

Also, with this change the function name becomes misleading: "has" implies a
boolean result, but the function now reports the newest decodable WAL
record's LSN when the slot has not caught up, so the name no longer reflects
the actual behavior. It looks like the function is only used by pg_upgrade,
so maybe rename it.
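
Purely as an illustration (the name below is mine, not something from the
patch), the rename could look like:

```
-binary_upgrade_logical_slot_has_caught_up(PG_FUNCTION_ARGS)
+binary_upgrade_logical_slot_get_last_pending_lsn(PG_FUNCTION_ARGS)
```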

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
