On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizh...@garret.ru> wrote:
>
> There is a well-known Postgres problem that a logical replication
> subscriber cannot catch up with the publisher simply because LR changes
> are applied by a single worker, while at the publisher the changes are
> made by multiple concurrent backends. The problem is not specific to
> logical replication: the physical replication stream is also handled by
> a single walreceiver. But for physical replication Postgres now
> implements prefetch: by looking at the blocks referenced by WAL records
> it is quite easy to predict which pages will be required for redo and
> prefetch them. With logical replication the situation is much more
> complicated.
>
> My first idea was to implement parallel apply of transactions. But to do
> that we need to track dependencies between transactions. Right now
> Postgres can apply transactions in parallel, but only if they are
> streamed (which is done only for large transactions), and it serializes
> them by commits. It is possible to enforce parallel apply of short
> transactions using `debug_logical_replication_streaming`, but then
> performance is about 2x slower than in the case of sequential apply by a
> single worker.
>
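To make the dependency-tracking idea above concrete, here is a minimal
illustrative sketch (plain Python, not PostgreSQL code; the function name
and the key sets are made up): each transaction is reduced to the set of
replica-identity keys it touches, and transactions with disjoint key sets
could be applied in parallel, while transactions sharing a key must keep
commit order.

```
# Toy model, not PostgreSQL code: transactions are listed in commit
# order, each represented by the set of replica-identity keys it
# touches. Transactions within one batch touch disjoint keys and could
# be applied in parallel; batches themselves must run sequentially.

def plan_parallel_batches(transactions):
    batch_of_key = {}   # key -> index of the last batch that touched it
    batches = []
    for keys in transactions:
        # must run after every earlier transaction sharing a key
        idx = max((batch_of_key.get(k, -1) for k in keys), default=-1) + 1
        if idx == len(batches):
            batches.append([])
        batches[idx].append(keys)
        for k in keys:
            batch_of_key[k] = idx
    return batches

# Example: the third and fifth transactions conflict with earlier ones,
# the rest are mutually independent.
print(plan_parallel_batches([{1}, {2}, {1, 3}, {4}, {3}]))
# [[{1}, {2}, {4}], [{1, 3}], [{3}]]
```

Even this toy version shows why the real thing is invasive: the key sets
are only fully known after decoding the whole transaction, and anything
not covered by the replica identity (triggers, foreign keys) can create
dependencies this model misses.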
What is the reason for such a large slowdown? Is it because the amount of
network transfer has increased without giving any significant advantage
because of the serialization of commits?

> By removing serialization by commits, it is possible to speed up apply
> 3x and make the subscriber apply changes faster than the producer can
> produce them, even with multiple clients. But this is possible only if
> the transactions are independent, and that can be enforced only by
> tracking dependencies, which seems to be very non-trivial and invasive.
>
> I still have not completely given up on the dependency-tracking
> approach, but I decided to first try a simpler solution: prefetching.
>

Sounds reasonable, but in the long run, we should track transaction
dependencies and allow parallel apply of all the transactions.

> It is already used for physical replication. Certainly in the case of
> physical replication it is much simpler, because each WAL record
> contains a list of accessed blocks.
>
> In the case of logical replication, prefetching can be done either by
> prefetching accesses through the replica identity index (usually the
> primary key), or by executing the replication command in some background
> worker. Certainly the first case is much easier.
>

It seems there is only one case described, so what exactly are you
referring to as the first and second?

> We just perform an index lookup in the prefetch worker, and it loads the
> accessed index and heap pages into shared buffers, so the main apply
> worker does not need to read anything from disk. But this works well
> only for DELETE and HOT UPDATE operations.
>
> In the second case we normally execute the LR command in a background
> worker and then abort the transaction. Certainly in this case we are
> doing the same work twice. But the assumption is the same: parallel
> prefetch workers should load the affected pages, speeding up the work of
> the main apply worker.
>
> I have implemented a PoC (see attached patch) and got first results on
> the efficiency of such prefetching.
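The first (replica-identity) variant can be modeled roughly as below.
This is an illustrative Python sketch only: index_lookup_pages is a
made-up stand-in for the btree descent plus heap fetch, and shared
buffers are modeled as a set of page ids.

```
# Toy model of replica-identity prefetch (illustrative only): a prefetch
# worker walks ahead in the change stream and touches the pages that an
# index lookup on the replica identity would read, so the apply worker
# later finds them already cached and never blocks on I/O.

def index_lookup_pages(pk):
    # made-up stand-in for the index descent plus the heap fetch
    return {("idx", pk % 3), ("heap", pk // 10)}

def prefetch(stream, shared_buffers, lookahead):
    for pk in stream[:lookahead]:
        shared_buffers |= index_lookup_pages(pk)

def apply_changes(stream, shared_buffers):
    disk_reads = 0
    for pk in stream:
        for page in index_lookup_pages(pk):
            if page not in shared_buffers:
                disk_reads += 1          # apply worker would block here
                shared_buffers.add(page)
    return disk_reads

stream = [1, 2, 3, 42]
cold = set()
print(apply_changes(stream, cold))       # several misses without prefetch

warm = set()
prefetch(stream, warm, lookahead=len(stream))
print(apply_changes(stream, warm))       # 0: everything is already cached
```

The model also makes the stated limitation visible: it only warms the
pages an index lookup reads, which is why non-HOT updates (which dirty
additional index pages) do not benefit in the same way.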
>
> *** First scenario (update-only).
>
> Publisher:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,100000), 0);
> create publication pub1 for table t;
> ```
>
> Subscriber:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> create subscription sub1 connection 'port=54321 dbname=postgres'
> publication pub1;
> ```
>
> Then I wait until replication is synced, stop the subscriber, and do
> random single-row updates in 10 sessions at the publisher:
>
> ```
> pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
> ```
>
> where update.sql is:
>
> ```
> \set pk random(1, 100000)
> update t set counter=counter+1 where pk=:pk;
> ```
>
> Then I start the subscriber and measure how much time it needs to catch
> up. Results:
>
> no prefetch: 2:00 min
> prefetch (replica identity only): 0:55 min
> prefetch (all): 1:10 min
>
> This is definitely the best case for replica-identity-only prefetch
> (update-only and no other indexes).
> How should these results be interpreted?
>
> Without prefetch, applying the updates takes about twice as long at the
> subscriber as performing the updates at the publisher. It means that
> under a heavy workload the subscriber has no chance to catch up.
>
> With prefetching of the replica identity index, apply time is even
> smaller than the time needed to perform the updates at the publisher.
> Executing the whole operation and then aborting the transaction
> certainly adds more overhead, but the improvement is still quite
> significant.
>
> Please also notice that these results were obtained on a system with a
> large amount of RAM (64GB) and a fast SSD. With a data set that does not
> fit in RAM and much slower disks, the difference is expected to be more
> significant.
But what about worst cases where these additional prefetches could evict
pages from shared_buffers that are required by the workload on the
subscriber? I think you should try such workloads as well.

> I have tried to simulate that by adding a 0.1 msec delay to pg_preadv.
> When I add an artificial 0.1 msec `preadv` delay, I get the following
> results:
>
> no prefetch: 7:40 min
> prefetch (replica identity only): 3:10 min
> prefetch (all): 3:09 min
>
> In this case apply takes much more time than the 100 seconds during
> which updates are performed at the publisher. Prefetch can improve speed
> about two times, but it doesn't allow the subscriber to catch up.
> ...
>
> Please notice that this approach requires minimal changes in Postgres,
> because all the infrastructure of parallel apply workers is already
> present and we can reuse the same apply code (with minimal changes) for
> performing prefetch. I only had to introduce extra tuple lock types
> (no-lock and try-lock) to minimize overhead and lock conflicts between
> prefetch and main apply workers. Still, this cannot completely prevent
> lock conflicts and deadlocks in prefetch workers. It looks like more
> work is needed here.
>

I understand that it is just a POC, so you haven't figured out all the
details, but it would be good to know the reason for these deadlocks.

> I wonder whether such an LR prefetching approach is considered useful,
> or whether it is better to investigate other ways to improve LR apply
> speed (parallel apply)?
>

I think it could be a good intermediate step till we are able to find a
solution for tracking the dependencies. Do you think this work will be
useful once we have parallel apply, and if so, how?

--
With Regards,
Amit Kapila.