On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizh...@garret.ru> wrote:
>
> There is a well-known Postgres problem that a logical replication
> subscriber cannot catch up with the publisher simply because LR changes
> are applied by a single worker, while at the publisher the changes are
> made by multiple concurrent backends. The problem is not specific to
> logical replication: the physical replication stream is also handled by
> a single walreceiver. But for physical replication Postgres now
> implements prefetch: by looking at the blocks referenced by WAL records
> it is quite easy to predict which pages will be required for redo and
> prefetch them. With logical replication the situation is much more
> complicated.
>
> My first idea was to implement parallel apply of transactions. But to do
> that we need to track dependencies between transactions. Right now
> Postgres can apply transactions in parallel, but only if they are
> streamed (which is done only for large transactions), and it serializes
> them by commits. It is possible to enforce parallel apply of short
> transactions using `debug_logical_replication_streaming`, but then
> performance is about 2x slower than in the case of sequential apply by a
> single worker.
>
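To make the dependency-tracking idea above concrete, here is a minimal
illustrative sketch (plain Python, not PostgreSQL code; the function name
and the key sets are made up): each transaction is reduced to the set of
replica-identity keys it touches, and transactions with disjoint key sets
could be applied in parallel, while transactions sharing a key must keep
commit order.

```
# Toy model, not PostgreSQL code: transactions are listed in commit
# order, each represented by the set of replica-identity keys it
# touches. Transactions within one batch touch disjoint keys and could
# be applied in parallel; batches themselves must run sequentially.

def plan_parallel_batches(transactions):
    batch_of_key = {}   # key -> index of the last batch that touched it
    batches = []
    for keys in transactions:
        # must run after every earlier transaction sharing a key
        idx = max((batch_of_key.get(k, -1) for k in keys), default=-1) + 1
        if idx == len(batches):
            batches.append([])
        batches[idx].append(keys)
        for k in keys:
            batch_of_key[k] = idx
    return batches

# Example: the third and fifth transactions conflict with earlier ones,
# the rest are mutually independent.
print(plan_parallel_batches([{1}, {2}, {1, 3}, {4}, {3}]))
# [[{1}, {2}, {4}], [{1, 3}], [{3}]]
```

Even this toy version shows why the real thing is invasive: the key sets
are only fully known after decoding the whole transaction, and anything
not covered by the replica identity (triggers, foreign keys) can create
dependencies this model misses.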
What is the reason for such a large slowdown? Is it because the amount of
network transfer has increased without giving any significant advantage
because of the serialization of commits?

> By removing serialization by commits, it is possible to speed up apply
> 3x and make the subscriber apply changes faster than the producer can
> produce them, even with multiple clients. But this is possible only if
> the transactions are independent, and that can be enforced only by
> tracking dependencies, which seems to be very non-trivial and invasive.
>
> I still have not completely given up on the dependency-tracking
> approach, but I decided to first try a simpler solution: prefetching.
>

Sounds reasonable, but in the long run, we should track transaction
dependencies and allow parallel apply of all the transactions.

> It is already used for physical replication. Certainly in the case of
> physical replication it is much simpler, because each WAL record
> contains a list of accessed blocks.
>
> In the case of logical replication, prefetching can be done either by
> prefetching accesses through the replica identity index (usually the
> primary key), or by executing the replication command in some background
> worker. Certainly the first case is much easier.
>

It seems there is only one case described, so what exactly are you
referring to as the first and second?

> We just perform an index lookup in the prefetch worker, and it loads the
> accessed index and heap pages into shared buffers, so the main apply
> worker does not need to read anything from disk. But this works well
> only for DELETE and HOT UPDATE operations.
>
> In the second case we normally execute the LR command in a background
> worker and then abort the transaction. Certainly in this case we are
> doing the same work twice. But the assumption is the same: parallel
> prefetch workers should load the affected pages, speeding up the work of
> the main apply worker.
>
> I have implemented a PoC (see attached patch) and got first results on
> the efficiency of such prefetching.
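The first (replica-identity) variant can be modeled roughly as below.
This is an illustrative Python sketch only: index_lookup_pages is a
made-up stand-in for the btree descent plus heap fetch, and shared
buffers are modeled as a set of page ids.

```
# Toy model of replica-identity prefetch (illustrative only): a prefetch
# worker walks ahead in the change stream and touches the pages that an
# index lookup on the replica identity would read, so the apply worker
# later finds them already cached and never blocks on I/O.

def index_lookup_pages(pk):
    # made-up stand-in for the index descent plus the heap fetch
    return {("idx", pk % 3), ("heap", pk // 10)}

def prefetch(stream, shared_buffers, lookahead):
    for pk in stream[:lookahead]:
        shared_buffers |= index_lookup_pages(pk)

def apply_changes(stream, shared_buffers):
    disk_reads = 0
    for pk in stream:
        for page in index_lookup_pages(pk):
            if page not in shared_buffers:
                disk_reads += 1          # apply worker would block here
                shared_buffers.add(page)
    return disk_reads

stream = [1, 2, 3, 42]
cold = set()
print(apply_changes(stream, cold))       # several misses without prefetch

warm = set()
prefetch(stream, warm, lookahead=len(stream))
print(apply_changes(stream, warm))       # 0: everything is already cached
```

The model also makes the stated limitation visible: it only warms the
pages an index lookup reads, which is why non-HOT updates (which dirty
additional index pages) do not benefit in the same way.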
>
> *** First scenario (update-only).
>
> Publisher:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,100000), 0);
> create publication pub1 for table t;
> ```
>
> Subscriber:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> create subscription sub1 connection 'port=54321 dbname=postgres'
> publication pub1;
> ```
>
> Then I wait until replication is synced, stop the subscriber, and do
> random single-row updates in 10 sessions at the publisher:
>
> ```
> pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
> ```
>
> where update.sql is:
>
> ```
> \set pk random(1, 100000)
> update t set counter=counter+1 where pk=:pk;
> ```
>
> Then I start the subscriber and measure how much time it needs to catch
> up. Results:
>
> no prefetch: 2:00 min
> prefetch (replica identity only): 0:55 min
> prefetch (all): 1:10 min
>
> This is definitely the best case for replica-identity-only prefetch
> (update-only and no other indexes).
> How should these results be interpreted?
>
> Without prefetch, applying the updates takes about twice as long at the
> subscriber as performing the updates at the publisher. It means that
> under a heavy workload the subscriber has no chance to catch up.
>
> With prefetching of the replica identity index, apply time is even
> smaller than the time needed to perform the updates at the publisher.
> Executing the whole operation and then aborting the transaction
> certainly adds more overhead, but the improvement is still quite
> significant.
>
> Please also notice that these results were obtained on a system with a
> large amount of RAM (64GB) and a fast SSD. With a data set that does not
> fit in RAM and much slower disks, the difference is expected to be more
> significant.
But what about worst cases where these additional prefetches could evict
pages from shared_buffers that are required by the workload on the
subscriber? I think you should try such workloads as well.

> I have tried to simulate that by adding a 0.1 msec delay to pg_preadv.
> When I add an artificial 0.1 msec `preadv` delay, I get the following
> results:
>
> no prefetch: 7:40 min
> prefetch (replica identity only): 3:10 min
> prefetch (all): 3:09 min
>
> In this case apply takes much more time than the 100 seconds during
> which updates are performed at the publisher. Prefetch can improve speed
> about two times, but it doesn't allow the subscriber to catch up.
> ...
>
> Please notice that this approach requires minimal changes in Postgres,
> because all the infrastructure of parallel apply workers is already
> present and we can reuse the same apply code (with minimal changes) for
> performing prefetch. I only had to introduce extra tuple lock types
> (no-lock and try-lock) to minimize overhead and lock conflicts between
> prefetch and main apply workers. Still, this cannot completely prevent
> lock conflicts and deadlocks in prefetch workers. It looks like more
> work is needed here.
>

I understand that it is just a POC, so you haven't figured out all the
details, but it would be good to know the reason for these deadlocks.

> I wonder whether such an LR prefetching approach is considered useful,
> or whether it is better to investigate other ways to improve LR apply
> speed (parallel apply)?
>

I think it could be a good intermediate step till we are able to find a
solution for tracking the dependencies. Do you think this work will be
useful once we have parallel apply, and if so, how?

--
With Regards,
Amit Kapila.