Hi,

On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi hackers,
>
> I'd like to propose an optimization for logical decoding's snapshot
> building mechanism that eliminates repeated sorting overhead once a
> replication slot reaches the CONSISTENT state.
>
> 1) Problem
>
> Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array
> on every call using qsort(). Once a logical decoding slot reaches the
> CONSISTENT state, snapshots are built frequently, after every
> catalog-modifying transaction commit. This repeated O(n log n) sorting
> becomes a performance bottleneck as the committed transaction array grows.
>
> The existing code even has a TODO comment acknowledging this inefficiency:
> "TODO: It's unclear whether that reasoning has much merit. Every time we
> add something here after becoming consistent will also require
> distributing a snapshot."
>
> 2) Proposed Solution
>
> Maintain sorted order via a batch merge on each commit:
>
> 1. Collect all relevant XIDs (including subtransactions) into a batch array
> 2. Sort the batch: O(m log m), where m is typically small (1 + nsubxacts)
> 3. Merge into the global array: O(n + m) via a reverse in-place merge
>    (a rough sketch follows below)
>
> While the per-commit cost increases from O(m) to O(m log m + n), this is
> offset by eliminating the O(n log n) sort at each snapshot build. Since
> the CONSISTENT state builds snapshots after each catalog-modifying commit,
> the amortized cost is significantly lower.
>
> 3) Performance Impact
>
> Expected improvements in the CONSISTENT state:
> - Eliminates the O(n log n) qsort on every snapshot build
> - Replaces it with an O(m log m + n) merge per commit
> - For typical cases where m << n and snapshots are frequent, this is a win
>
> The benchmark results for this patch are shown below.
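> To make step 3 above concrete, here is a minimal, self-contained sketch of
> the reverse in-place merge. It is illustrative only: the function and
> variable names are made up for this example and differ from the patch,
> which operates on SnapBuild->committed.xip and assumes the array already
> has enough spare capacity.
>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdio.h>
>
> typedef uint32_t TransactionId;
>
> /*
>  * Merge a sorted batch of m XIDs into a sorted array of n XIDs that has
>  * room for at least n + m entries.  Filling slots from the high end means
>  * no element of xip is overwritten before it has been consumed, so no
>  * temporary array is needed: O(n + m) comparisons and moves.
>  */
> void
> merge_sorted_batch(TransactionId *xip, size_t n,
>                    const TransactionId *batch, size_t m)
> {
>     size_t  i = n;      /* unplaced tail of xip */
>     size_t  j = m;      /* unplaced tail of batch */
>     size_t  k = n + m;  /* one past the next slot to fill, moving down */
>
>     while (j > 0)
>     {
>         if (i > 0 && xip[i - 1] > batch[j - 1])
>             xip[--k] = xip[--i];
>         else
>             xip[--k] = batch[--j];
>     }
>     /* xip[0 .. i) is already in its final position */
> }
>
> int
> main(void)
> {
>     TransactionId xip[8] = {100, 205, 310, 742};  /* n = 4, capacity 8 */
>     TransactionId batch[2] = {180, 500};          /* already sorted */
>
>     merge_sorted_batch(xip, 4, batch, 2);
>
>     for (int k = 0; k < 6; k++)
>         printf("%u ", (unsigned) xip[k]);
>     printf("\n");                 /* prints: 100 180 205 310 500 742 */
>     return 0;
> }
>
> Per commit, the batch (the committing XID plus its subtransaction XIDs) is
> first qsort'ed and then folded in as above, which is where the
> O(m log m + n) per-commit figure comes from.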
>
> 4) Configuration
>
> shared_buffers = '4GB'
> wal_level = logical
> max_replication_slots = 10
> max_wal_senders = 10
> log_min_messages = warning
> max_connections = 600
> autovacuum = off
> checkpoint_timeout = 15min
> max_wal_size = 4GB
>
> 5) Workloads
>
> # Workload 1: DDL - high-frequency, small catalog commits
> create_ddl_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/ddl.sql" <<'SQL'
> -- High-frequency small catalog commits
> -- Each commit triggers SnapBuildBuildSnapshot
> DO $$
> DECLARE
>     tbl text := format('temp_%s_%s',
>                        current_setting('application_name', true),
>                        floor(random()*1e9)::int);
> BEGIN
>     EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
>     EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
>     EXECUTE format('SELECT * FROM %I', tbl);
> END$$;
> SQL
> }
>
> # Workload 2: MIXED - 50% DDL / 50% DML
> create_mixed_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> -- DDL workload (catalog changes)
> DO $$
> DECLARE
>     tbl text := format('t_%s_%s',
>                        current_setting('application_name', true),
>                        floor(random()*1e9)::int);
> BEGIN
>     EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
>     EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> END$$;
> SQL
>     cat >"$ROOT/mixed_dml.sql" <<'SQL'
> -- DML workload (no catalog changes)
> INSERT INTO app_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
> # Workload 3: CONTROL - pure DML, no catalog changes
> create_control_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/control.sql" <<'SQL'
> -- Pure DML, no catalog changes
> -- Should show no difference between baseline and patched
> INSERT INTO control_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
> 6) Test strategy
>
> Clients:     100 concurrent connections per workload
> Duration:    40 seconds per run
> Repetitions: 1 run per workload type
> Replication: logical replication slot using the `test_decoding` plugin
>              with `EXPORT_SNAPSHOT=true`
>
> Workload types:
> 1. DDL     - pure catalog churn (temp table create/drop)
> 2. Mixed   - 50% DDL + 50% DML
> 3. Control - pure DML (no catalog changes)
>
> Measurement:
> - Metrics captured from pg_stat_replication_slots before/after each run
> - Primary metrics: total_txns (transactions decoded) and total_bytes
>   (data volume)
> - Compared baseline (vanilla PostgreSQL) vs patched (sorted committed.xip
>   optimization)
>
> 7) Performance results
>
> DDL Workload: +235% Decoder Improvement
>     Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
>     Throughput (MB/s):  672.67 → 1747.22 MB/s (+159%)
>     Decode efficiency:  9.46% → 33.29% (+23.83 points)
>
> Mixed Workload: No Change
>     Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
>     Decode efficiency:  40.47% → 40.47% (unchanged)
>
> In the flamegraphs for the mixed workload we can see that the qsort
> overhead in SnapBuild has been eliminated. However, the performance
> improvement seen in the DDL workload was not observed here. My guess is
> that, for DML workloads, ReorderBufferApplyChange has become the new
> hotspot.
>
> DML Workload: No Change
>     Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
>     Decode efficiency:  49.97% → 49.97% (unchanged)
>
> === Workload: ddl ===
> Client commits/sec:
>     Baseline: 7545.76 commits/sec
>     Patched:  7198.21 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 713.76 txns/sec (672.67 MB/s)
>     Patched:  2396.52 txns/sec (1747.22 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 309376 committed → 29264 decoded (9.46%)
>     Patched:  302325 committed → 100654 decoded (33.29%)
>
> Total decoded (all reps):
>     Baseline: 29264 txns (27579.53 MB)
>     Patched:  100654 txns (73383.10 MB)
>
> Decoder improvement:    +235.00% (txns/sec)
> Decoder improvement:    +159.00% (MB/s)
> Efficiency improvement: +23.83 percentage points (more transactions
>                         decoded per committed transaction)
>
> === Workload: mixed ===
> Client commits/sec:
>     Baseline: 6797.50 commits/sec
>     Patched:  6745.26 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 2751.10 txns/sec (210.35 MB/s)
>     Patched:  2730.00 txns/sec (205.64 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 285495 committed → 115546 decoded (40.47%)
>     Patched:  283301 committed → 114660 decoded (40.47%)
>
> Total decoded (all reps):
>     Baseline: 115546 txns (8834.71 MB)
>     Patched:  114660 txns (8636.96 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> === Workload: DML ===
> Client commits/sec:
>     Baseline: 6129.24 commits/sec
>     Patched:  6136.93 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 3062.57 txns/sec (0.26 MB/s)
>     Patched:  3066.37 txns/sec (0.26 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 257428 committed → 128628 decoded (49.97%)
>     Patched:  251614 committed → 125721 decoded (49.97%)
>
> Total decoded (all reps):
>     Baseline: 128628 txns (10.98 MB)
>     Patched:  125721 txns (10.69 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> 8) Potential regression
>
> A potential regression could occur before the slot reaches the CONSISTENT
> state, particularly when building_full_snapshot is set to true. In this
> phase, all transactions, including those that don't modify the catalog,
> must be added to the committed.xip array. These XIDs don't require later
> snapshot builds or sorting, so the batch-insert logic increases the
> per-insert cost from O(1) to O(m + n) without providing a direct benefit.
>
> However, the impact of this regression could be limited. The system
> remains in the pre-CONSISTENT phase only briefly during initial snapshot
> building, and the building_full_snapshot = true case is rare, mainly used
> when creating replication slots with the EXPORT_SNAPSHOT option.
>
> Once the slot becomes CONSISTENT, only catalog-modifying transactions are
> tracked in committed.xip, and the patch reduces overall snapshot-building
> overhead by eliminating repeated full-array sorts.
>
> We could also adopt a two-phase approach: keeping the current behavior
> before reaching the CONSISTENT state and maintaining a sorted array only
> after that point. This would preserve the performance benefits while
> avoiding potential regressions. However, it would introduce additional
> complexity and potential risks in handling the state transitions.
>
> if (builder->state < SNAPBUILD_CONSISTENT)
> {
>     /* ensure that only commits after this are getting replayed */
>     if (builder->start_decoding_at <= lsn)
>         builder->start_decoding_at = lsn + 1;
>
>     /*
>      * If building an exportable snapshot, force xid to be tracked, even
>      * if the transaction didn't modify the catalog.
>      */
>     if (builder->building_full_snapshot)
>     {
>         needs_timetravel = true;
>     }
> }
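> For the two-phase variant mentioned at the end of section 8, the gating
> could look roughly like the sketch below. This is illustrative only: the
> struct, enum and function names are invented here (the real code would
> live in snapbuild.c and work on SnapBuild->committed), array growth is
> omitted, and it reuses the merge_sorted_batch() helper from the sketch in
> section 2.
>
> #include <stdbool.h>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdlib.h>
>
> typedef uint32_t TransactionId;
>
> typedef enum { SKETCH_BUILDING, SKETCH_CONSISTENT } SketchState;
>
> typedef struct SketchBuilder
> {
>     SketchState    state;
>     TransactionId *xip;     /* committed (catalog-modifying) XIDs */
>     size_t         xcnt;    /* number of valid entries in xip */
>     bool           sorted;  /* is xip currently known to be sorted? */
> } SketchBuilder;
>
> static int
> xid_cmp(const void *a, const void *b)
> {
>     TransactionId xa = *(const TransactionId *) a;
>     TransactionId xb = *(const TransactionId *) b;
>
>     return (xa > xb) - (xa < xb);
> }
>
> /* from the sketch in section 2 */
> extern void merge_sorted_batch(TransactionId *xip, size_t n,
>                                const TransactionId *batch, size_t m);
>
> static void
> add_committed_xids(SketchBuilder *b, TransactionId *batch, size_t m)
> {
>     if (b->state < SKETCH_CONSISTENT)
>     {
>         /*
>          * Phase 1 (pre-consistent, incl. building_full_snapshot): plain
>          * O(1) appends as today; the array may become unsorted.
>          */
>         for (size_t i = 0; i < m; i++)
>             b->xip[b->xcnt++] = batch[i];
>         b->sorted = false;
>     }
>     else
>     {
>         /* Phase 2: keep xip sorted so snapshot builds can skip qsort(). */
>         if (!b->sorted)
>         {
>             /* one-time sort of whatever phase 1 appended unsorted */
>             qsort(b->xip, b->xcnt, sizeof(TransactionId), xid_cmp);
>             b->sorted = true;
>         }
>         qsort(batch, m, sizeof(TransactionId), xid_cmp);
>         merge_sorted_batch(b->xip, b->xcnt, batch, m);
>         b->xcnt += m;
>     }
> }
>
> The main correctness concern is the state transition mentioned in section
> 8: anything appended unsorted in phase 1 has to be sorted exactly once
> before the first phase-2 merge, which is what the sorted flag above is
> meant to capture.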
After some consideration, a two-phase sorting strategy seems feasible to
implement in a relatively straightforward manner, so that is what v2 does. I
also plan to run benchmarks to evaluate potential regressions of the original
"sort-at-all-stages" approach of v1.

> 9) Additional benefit
>
> With this patch applied, we can also optimize SnapBuildPurgeOlderTxn() to
> use two binary searches, rather than iterating and copying, to find the
> interval to keep in the sorted committed.xip array. [1]
>
> Feedback welcome.
>
> [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com

V2 fixes several issues in v1, including a potential memory leak and some
type inconsistencies, and applies pgindent to the files.

--
Best,
Xuneng
v2-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch
Description: Binary data
