Hi,

On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi hackers,
>
> I'd like to propose an optimization for logical decoding's snapshot
> building mechanism that eliminates repeated sorting overhead once a
> replication slot reaches the CONSISTENT state.
>
> 1) Problem
>
> Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array
> on every call using qsort(). Once a logical decoding slot reaches the
> CONSISTENT state, snapshots are built frequently, after every
> catalog-modifying transaction commit. This repeated O(n log n) sorting
> becomes a performance bottleneck as the committed transaction array grows.
>
> The existing code even has a TODO comment acknowledging this inefficiency:
> "TODO: It's unclear whether that reasoning has much merit. Every time we
> add something here after becoming consistent will also require
> distributing a snapshot."
>
> 2) Proposed Solution
>
> Maintain sorted order via a batch merge on each commit:
>
> 1. Collect all relevant XIDs (including subtransactions) into a batch array
> 2. Sort the batch: O(m log m), where m is typically small (1 + nsubxacts)
> 3. Merge into the global array: O(n + m) via a reverse in-place merge
>    (a rough sketch follows below)
>
> While the per-commit cost increases from O(m) to O(m log m + n), this is
> offset by eliminating the O(n log n) sort at each snapshot build. Since
> the CONSISTENT state builds snapshots after each catalog-modifying commit,
> the amortized cost is significantly lower.
>
> 3) Performance Impact
>
> Expected improvements in the CONSISTENT state:
> - Eliminates the O(n log n) qsort on every snapshot build
> - Replaces it with an O(m log m + n) merge per commit
> - For typical cases where m << n and snapshots are frequent, this is a win
>
> The benchmark results for this patch are shown below.
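> To make step 3 above concrete, here is a minimal, self-contained sketch of
> the reverse in-place merge. It is illustrative only: the function and
> variable names are made up for this example and differ from the patch,
> which operates on SnapBuild->committed.xip and assumes the array already
> has enough spare capacity.
>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdio.h>
>
> typedef uint32_t TransactionId;
>
> /*
>  * Merge a sorted batch of m XIDs into a sorted array of n XIDs that has
>  * room for at least n + m entries.  Filling slots from the high end means
>  * no element of xip is overwritten before it has been consumed, so no
>  * temporary array is needed: O(n + m) comparisons and moves.
>  */
> void
> merge_sorted_batch(TransactionId *xip, size_t n,
>                    const TransactionId *batch, size_t m)
> {
>     size_t  i = n;      /* unplaced tail of xip */
>     size_t  j = m;      /* unplaced tail of batch */
>     size_t  k = n + m;  /* one past the next slot to fill, moving down */
>
>     while (j > 0)
>     {
>         if (i > 0 && xip[i - 1] > batch[j - 1])
>             xip[--k] = xip[--i];
>         else
>             xip[--k] = batch[--j];
>     }
>     /* xip[0 .. i) is already in its final position */
> }
>
> int
> main(void)
> {
>     TransactionId xip[8] = {100, 205, 310, 742};  /* n = 4, capacity 8 */
>     TransactionId batch[2] = {180, 500};          /* already sorted */
>
>     merge_sorted_batch(xip, 4, batch, 2);
>
>     for (int k = 0; k < 6; k++)
>         printf("%u ", (unsigned) xip[k]);
>     printf("\n");                 /* prints: 100 180 205 310 500 742 */
>     return 0;
> }
>
> Per commit, the batch (the committing XID plus its subtransaction XIDs) is
> first qsort'ed and then folded in as above, which is where the
> O(m log m + n) per-commit figure comes from.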
>
> 4) Configuration
>
> shared_buffers = '4GB'
> wal_level = logical
> max_replication_slots = 10
> max_wal_senders = 10
> log_min_messages = warning
> max_connections = 600
> autovacuum = off
> checkpoint_timeout = 15min
> max_wal_size = 4GB
>
> 5) Workloads
>
> # Workload 1: DDL - high-frequency, small catalog commits
> create_ddl_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/ddl.sql" <<'SQL'
> -- High-frequency small catalog commits
> -- Each commit triggers SnapBuildBuildSnapshot
> DO $$
> DECLARE
>     tbl text := format('temp_%s_%s',
>                        current_setting('application_name', true),
>                        floor(random()*1e9)::int);
> BEGIN
>     EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
>     EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
>     EXECUTE format('SELECT * FROM %I', tbl);
> END$$;
> SQL
> }
>
> # Workload 2: MIXED - 50% DDL / 50% DML
> create_mixed_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> -- DDL workload (catalog changes)
> DO $$
> DECLARE
>     tbl text := format('t_%s_%s',
>                        current_setting('application_name', true),
>                        floor(random()*1e9)::int);
> BEGIN
>     EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
>     EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> END$$;
> SQL
>     cat >"$ROOT/mixed_dml.sql" <<'SQL'
> -- DML workload (no catalog changes)
> INSERT INTO app_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
> # Workload 3: CONTROL - pure DML, no catalog changes
> create_control_workload() {
>     local ROOT="$1"
>     cat >"$ROOT/control.sql" <<'SQL'
> -- Pure DML, no catalog changes
> -- Should show no difference between baseline and patched
> INSERT INTO control_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
> 6) Test strategy
>
> Clients:     100 concurrent connections per workload
> Duration:    40 seconds per run
> Repetitions: 1 run per workload type
> Replication: logical replication slot using the `test_decoding` plugin
>              with `EXPORT_SNAPSHOT=true`
>
> Workload types:
> 1. DDL     - pure catalog churn (temp table create/drop)
> 2. Mixed   - 50% DDL + 50% DML
> 3. Control - pure DML (no catalog changes)
>
> Measurement:
> - Metrics captured from pg_stat_replication_slots before/after each run
> - Primary metrics: total_txns (transactions decoded) and total_bytes
>   (data volume)
> - Compared baseline (vanilla PostgreSQL) vs patched (sorted committed.xip
>   optimization)
>
> 7) Performance results
>
> DDL Workload: +235% Decoder Improvement
>     Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
>     Throughput (MB/s):  672.67 → 1747.22 MB/s (+159%)
>     Decode efficiency:  9.46% → 33.29% (+23.83 points)
>
> Mixed Workload: No Change
>     Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
>     Decode efficiency:  40.47% → 40.47% (unchanged)
>
> In the flamegraphs for the mixed workload we can see that the qsort
> overhead in SnapBuild has been eliminated. However, the performance
> improvement seen in the DDL workload was not observed here. My guess is
> that, for DML workloads, ReorderBufferApplyChange has become the new
> hotspot.
>
> DML Workload: No Change
>     Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
>     Decode efficiency:  49.97% → 49.97% (unchanged)
>
> === Workload: ddl ===
> Client commits/sec:
>     Baseline: 7545.76 commits/sec
>     Patched:  7198.21 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 713.76 txns/sec (672.67 MB/s)
>     Patched:  2396.52 txns/sec (1747.22 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 309376 committed → 29264 decoded (9.46%)
>     Patched:  302325 committed → 100654 decoded (33.29%)
>
> Total decoded (all reps):
>     Baseline: 29264 txns (27579.53 MB)
>     Patched:  100654 txns (73383.10 MB)
>
> Decoder improvement:    +235.00% (txns/sec)
> Decoder improvement:    +159.00% (MB/s)
> Efficiency improvement: +23.83 percentage points (more transactions
>                         decoded per committed transaction)
>
> === Workload: mixed ===
> Client commits/sec:
>     Baseline: 6797.50 commits/sec
>     Patched:  6745.26 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 2751.10 txns/sec (210.35 MB/s)
>     Patched:  2730.00 txns/sec (205.64 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 285495 committed → 115546 decoded (40.47%)
>     Patched:  283301 committed → 114660 decoded (40.47%)
>
> Total decoded (all reps):
>     Baseline: 115546 txns (8834.71 MB)
>     Patched:  114660 txns (8636.96 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> === Workload: DML ===
> Client commits/sec:
>     Baseline: 6129.24 commits/sec
>     Patched:  6136.93 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
>     Baseline: 3062.57 txns/sec (0.26 MB/s)
>     Patched:  3066.37 txns/sec (0.26 MB/s)
>
> Transaction efficiency (decoded vs committed):
>     Baseline: 257428 committed → 128628 decoded (49.97%)
>     Patched:  251614 committed → 125721 decoded (49.97%)
>
> Total decoded (all reps):
>     Baseline: 128628 txns (10.98 MB)
>     Patched:  125721 txns (10.69 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> 8) Potential regression
>
> A potential regression could occur before the slot reaches the CONSISTENT
> state, particularly when building_full_snapshot is set to true. In this
> phase, all transactions, including those that don't modify the catalog,
> must be added to the committed.xip array. These XIDs don't require later
> snapshot builds or sorting, so the batch-insert logic increases the
> per-insert cost from O(1) to O(m + n) without providing a direct benefit.
>
> However, the impact of this regression could be limited. The system
> remains in the pre-CONSISTENT phase only briefly during initial snapshot
> building, and the building_full_snapshot = true case is rare, mainly used
> when creating replication slots with the EXPORT_SNAPSHOT option.
>
> Once the slot becomes CONSISTENT, only catalog-modifying transactions are
> tracked in committed.xip, and the patch reduces overall snapshot-building
> overhead by eliminating repeated full-array sorts.
>
> We could also adopt a two-phase approach: keeping the current behavior
> before reaching the CONSISTENT state and maintaining a sorted array only
> after that point. This would preserve the performance benefits while
> avoiding potential regressions. However, it would introduce additional
> complexity and potential risks in handling the state transitions.
>
> if (builder->state < SNAPBUILD_CONSISTENT)
> {
>     /* ensure that only commits after this are getting replayed */
>     if (builder->start_decoding_at <= lsn)
>         builder->start_decoding_at = lsn + 1;
>
>     /*
>      * If building an exportable snapshot, force xid to be tracked, even
>      * if the transaction didn't modify the catalog.
>      */
>     if (builder->building_full_snapshot)
>     {
>         needs_timetravel = true;
>     }
> }
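> For the two-phase variant mentioned at the end of section 8, the gating
> could look roughly like the sketch below. This is illustrative only: the
> struct, enum and function names are invented here (the real code would
> live in snapbuild.c and work on SnapBuild->committed), array growth is
> omitted, and it reuses the merge_sorted_batch() helper from the sketch in
> section 2.
>
> #include <stdbool.h>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdlib.h>
>
> typedef uint32_t TransactionId;
>
> typedef enum { SKETCH_BUILDING, SKETCH_CONSISTENT } SketchState;
>
> typedef struct SketchBuilder
> {
>     SketchState    state;
>     TransactionId *xip;     /* committed (catalog-modifying) XIDs */
>     size_t         xcnt;    /* number of valid entries in xip */
>     bool           sorted;  /* is xip currently known to be sorted? */
> } SketchBuilder;
>
> static int
> xid_cmp(const void *a, const void *b)
> {
>     TransactionId xa = *(const TransactionId *) a;
>     TransactionId xb = *(const TransactionId *) b;
>
>     return (xa > xb) - (xa < xb);
> }
>
> /* from the sketch in section 2 */
> extern void merge_sorted_batch(TransactionId *xip, size_t n,
>                                const TransactionId *batch, size_t m);
>
> static void
> add_committed_xids(SketchBuilder *b, TransactionId *batch, size_t m)
> {
>     if (b->state < SKETCH_CONSISTENT)
>     {
>         /*
>          * Phase 1 (pre-consistent, incl. building_full_snapshot): plain
>          * O(1) appends as today; the array may become unsorted.
>          */
>         for (size_t i = 0; i < m; i++)
>             b->xip[b->xcnt++] = batch[i];
>         b->sorted = false;
>     }
>     else
>     {
>         /* Phase 2: keep xip sorted so snapshot builds can skip qsort(). */
>         if (!b->sorted)
>         {
>             /* one-time sort of whatever phase 1 appended unsorted */
>             qsort(b->xip, b->xcnt, sizeof(TransactionId), xid_cmp);
>             b->sorted = true;
>         }
>         qsort(batch, m, sizeof(TransactionId), xid_cmp);
>         merge_sorted_batch(b->xip, b->xcnt, batch, m);
>         b->xcnt += m;
>     }
> }
>
> The main correctness concern is the state transition mentioned in section
> 8: anything appended unsorted in phase 1 has to be sorted exactly once
> before the first phase-2 merge, which is what the sorted flag above is
> meant to capture.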
After some consideration, a two-phase sorting strategy seems feasible to
implement in a relatively straightforward manner, so that is what v2 does. I
also plan to run benchmarks to evaluate potential regressions of the original
"sort-at-all-stages" approach of v1.

> 9) Additional benefit
>
> With this patch applied, we can also optimize SnapBuildPurgeOlderTxn() to
> use two binary searches, rather than iterating and copying, to find the
> interval to keep in the sorted committed.xip array. [1]
>
> Feedback welcome.
>
> [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com

V2 fixes several issues in v1, including a potential memory leak and some
type inconsistencies, and applies pgindent to the files.

--
Best,
Xuneng
v2-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch
Description: Binary data
