Hi hackers,

When a long-running write transaction coexists with many concurrent
catalog-modifying commits (e.g., autovacuum updating pg_class stats for
many tables, or batch DDL), the reorder buffer's spill files for logical
decoding can grow quadratically, reaching ~80GB in our production with
N=100K extrapolation.  This patch eliminates the problem by deferring
snapshot distribution from commit-time to data-change-time.

== Problem ==

The trigger requires two conditions:

  (i)  Something prevents xmin from advancing.  This can be the long write
       transaction itself (its own XID holds xmin back), or an unrelated
       xmin-holder like pg_dump's REPEATABLE READ session or a slow
       logical-replication consumer.
  (ii) A transaction that has at least one data change is present in the
       reorder buffer.  Pure read-only sessions (pg_dump, long SELECT) do
       not appear in the reorder buffer and are not victims themselves;
       they only enable condition (i).

When both are met, the chain reaction runs across four stages:

1) VACUUM updates pg_class statistics via heap_inplace_update_and_unlock(),
   which triggers CacheInvalidateHeapTuple().  At command end,
   LogLogicalInvalidations() emits an XLOG_XACT_INVALIDATIONS WAL record.

2) During logical decoding, xact_decode() processes XLOG_XACT_INVALIDATIONS
   by calling ReorderBufferXidSetCatalogChanges(), marking the VACUUM
   transaction as having catalog changes.

3) When the VACUUM transaction commits, SnapBuildCommitTxn() sees it has
   catalog changes, builds a new snapshot via SnapBuildBuildSnapshot(), and
   calls SnapBuildDistributeSnapshotAndInval() to distribute the snapshot
   to ALL in-progress transactions.

4) The snapshot includes the entire builder->committed.xip array.  Since a
   long-running transaction prevents SnapBuildPurgeOlderTxn() from cleaning
   old entries (builder->xmin cannot advance), the committed array grows
   monotonically.  Each successive snapshot is larger than the last:

     Snapshot N contains N xids, size = 192 + 4*N bytes

   Total disk usage = sum(i=1..N)(192 + 4*i) = 2*N^2 bytes

   In our measurements (see Performance impact below), master spills
   247 MB at N=5000 and 813 MB at N=10000, scaling cleanly as O(N^2).
   Extrapolation to N=100K yields ~80 GB.

Beyond disk usage, these accumulated snapshots also degrade decoding
performance:

a) Memory pressure: the snapshots consume logical_decoding_work_mem
   quickly, forcing the reorder buffer to spill to disk much earlier
   than necessary.

b) Spill/restore I/O: each INTERNAL_SNAPSHOT change is serialized
   during spill and deserialized during commit-time processing,
   adding significant disk I/O for what is essentially dead data.

c) Commit-time iteration overhead: ReorderBufferProcessTXN() iterates
   all changes via a binary heap.  100K extra INTERNAL_SNAPSHOT entries
   mean 100K additional heap pop operations and snapshot_now switches
   (refcount inc/dec, potential old snapshot free).

d) Replication lag: the combined CPU and I/O overhead directly
   increases the time between a transaction's commit and its decoded
   output reaching subscribers.

== Why not filter VACUUM? ==

One might consider filtering out VACUUM-generated XLOG_XACT_INVALIDATIONS
so they don't trigger snapshot rebuilds.  However, the O(N^2) problem is
not specific to VACUUM.  Any workload that generates many catalog-modifying
commits triggers it: batch partition creation, online migration tools
running thousands of DDLs, extension installations creating many functions
and types, etc.  A general solution at the distribution layer is needed.

== Prior work ==

The distribution function modified by this patch was last touched by the
"long-standing data loss bug in initial sync of logical replication"
thread started by Tomas Vondra in 2023 [1], committed by Amit Kapila in
2025.  That fix renamed
SnapBuildDistributeNewCatalogSnapshot
to SnapBuildDistributeSnapshotAndInval and added invalidation
distribution alongside the existing snapshot distribution, to ensure
that in-progress transactions invalidate their relcache when concurrent
publication DDL changes the set of published tables.  The commit
message acknowledged "some performance regression ... primarily during
frequent execution of publication DDL statements".  The present patch
addresses the snapshot-distribution side of that overhead while keeping
the invalidation-distribution semantics unchanged, since invalidations
cannot be deduplicated the way snapshots can (see "Correctness" below).

Related but orthogonal work in the same area:

  - Sawada's "Using per-transaction memory contexts for storing decoded
    tuples" [2] addresses a different root cause of decoding memory
    bloat under long-running transactions: GenerationContext
    fragmentation in the reorder buffer's tuple context.  His patch is
    complementary; both contribute to keeping decoding memory usage
    within logical_decoding_work_mem.

  - Tachoires's "Compress ReorderBuffer spill files using LZ4" [3]
    reduces the disk cost of spilled changes via compression.  The two
    approaches are complementary: this patch reduces what needs to be
    spilled in the first place (eliminating O(N^2) INTERNAL_SNAPSHOT
    entries for write transactions with few data changes), while [3]
    reduces the cost of whatever does spill.

[1] 
https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
[2] 
https://www.postgresql.org/message-id/flat/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAA4eK1Kd6vP1g5pJXoiPgfLQsn54hecsc06hxpNWV-9_xMa%2B5Q%40mail.gmail.com

== Solution: Lazy Snapshot Distribution ==

Instead of distributing a snapshot to every in-progress transaction on each
catalog-modifying commit, we defer distribution until a transaction actually
needs to decode a data change (in SnapBuildProcessChange()).

A generation counter (SnapBuild.snapshot_generation) is incremented each
time a new catalog snapshot is built.  Each transaction tracks the last
generation it received (ReorderBufferTXN.last_snapshot_generation).  When
SnapBuildProcessChange() is called for a data change, it checks whether the
transaction's generation is behind the builder's and distributes the current
snapshot only if needed.

This means:
- A long-running write transaction with K data changes accumulates at
  most min(K, G) snapshots in its change list (where G is the number of
  catalog-modifying commits during its lifetime), instead of G snapshots
  each of growing size.  The generation counter deduplicates further:
  consecutive data changes without intervening DDLs share one snapshot.
- Total disk usage attributable to snapshot distribution drops from
  O(N^2) to O(min(K, G) * N), where N is the committed xid array size.
  For the common production case where K << G (e.g. a long batch INSERT
  during a vacuum storm), the savings are substantial.

Invalidation messages are still distributed eagerly as before, since they
are small (100K vacuum commits = 14MB) and already have overflow protection
via MAX_DISTR_INVAL_MSG_PER_TXN.

== Correctness ==

The lazy approach is equivalent to eager distribution because:

1) WAL is processed strictly in LSN order.  When SnapBuildProcessChange()
   is called at LSN X, builder->snapshot reflects exactly all catalog-
   modifying commits with LSN < X.  It is neither "too new" (future commits
   haven't been processed) nor "too old" (all past commits are included).

2) The committed array is append-only (only grows, never shrinks while a
   long transaction holds xmin back).  Snapshot N is a superset of snapshots
   1..N-1.  Skipping intermediate snapshots does not affect visibility.

3) In the original code, when a transaction's changes are decoded at commit
   time, each data change uses the most recent snapshot preceding it in the
   change list.  Intermediate snapshots that have no data changes between
   them are effectively unused.  Lazy distribution simply avoids creating
   these unused entries.

4) This works correctly with streaming mode (changes decoded before commit)
   because SnapBuildProcessChange() is always called before each data change
   regardless of whether the change is later streamed or buffered.

5) The lazy snapshot is distributed to the toplevel transaction (using
   txn->xid), consistent with the original eager distribution.  When a
   data change belongs to a subtransaction, the snapshot and the data
   change reside in different change lists (toplevel vs subtransaction).
   This is safe because eager invalidation distribution always places an
   invalidation change entry in the toplevel's list at the DDL commit
   LSN (which is strictly less than the data change LSN).  During binary
   heap iteration, this invalidation entry ensures the toplevel is at
   the heap root when the equal-LSN comparison between the snapshot and
   the data change occurs, so the snapshot is always processed first.

== Why not eliminate ReorderBufferAddSnapshot entirely? ==

One might ask: if we can defer snapshot distribution, why not skip storing
snapshots in the reorder buffer altogether and just use builder->snapshot
directly at decode time?

This doesn't work because DecodeInsert/DecodeUpdate only package the raw
WAL tuple into a ReorderBufferChange and enqueue it.  The actual catalog
lookup (RelationIdGetRelation, tuple interpretation) happens later during
commit-time processing in ReorderBufferProcessTXN.  At that point,
builder->snapshot has advanced past all changes and no longer reflects the
catalog state at each individual change's LSN.

The INTERNAL_SNAPSHOT entries in the change list serve as markers that tell
commit-time processing "switch to this catalog snapshot from here on."
Without them, all changes would use the base_snapshot, and an INSERT that
happened before an ALTER TABLE ADD COLUMN would be decoded with the new
schema, producing incorrect output (extra columns with NULL values).

Since each INTERNAL_SNAPSHOT entry serves all subsequent data changes until
the next generation bump, the total number of entries is min(K, G), which
is already minimal.

== Notes ==

- RBTXN_HAS_CATALOG_CHANGES is still needed: it is used by
  SnapBuildCommitTxn() to decide whether to rebuild the snapshot,
  ReorderBufferBuildTupleCidHash() for combo CID handling, and
  SnapBuildSerialize() for serialization.  This patch does not change
  its semantics.

- SNAPBUILD_VERSION is bumped from 6 to 7 because snapshot_generation
  is added to the SnapBuild struct which is serialized to disk.  On
  upgrade, existing serialized snapshots will be invalidated and rebuilt
  automatically, which is the standard behavior for version mismatches.

== Changes ==

src/include/replication/snapbuild_internal.h:
  - Add uint64 snapshot_generation to struct SnapBuild

src/include/replication/reorderbuffer.h:
  - Add uint64 last_snapshot_generation to ReorderBufferTXN
  - Declare ReorderBufferTXNByXid() as extern

src/backend/replication/logical/snapbuild.c:
  - SnapBuildCommitTxn(): increment snapshot_generation after building
    a new snapshot
  - Rename SnapBuildDistributeSnapshotAndInval() to SnapBuildDistributeInval()
    and remove snapshot distribution code; keep invalidation distribution
  - SnapBuildProcessChange(): do a single ReorderBufferTXNByXid() lookup,
    reuse the txn pointer for both base snapshot check and lazy catalog
    snapshot distribution.  The lazy snapshot is distributed to the
    toplevel transaction, consistent with the original eager approach.
    This also eliminates redundant hash lookups that existed in the
    original code path.
  - Bump SNAPBUILD_VERSION from 6 to 7

src/backend/replication/logical/reorderbuffer.c:
  - Remove 'static' from ReorderBufferTXNByXid()

== Performance impact ==

Theoretical:

                         Before          After
  Snapshot distributions G (per commit)  0 or 1 (per data change)
  Disk usage growth      O(N^2)          O(min(K, G) * N)

  where N = committed xid array size at the time of the K-th data change,
        G = number of catalog-modifying commits during the long txn,
        K = number of data changes in the long txn.

The hot path (SnapBuildProcessChange for each data change) now does a single
hash lookup instead of 2-4 separate lookups in the original code, so there
is no performance regression for the common case.

Measured (single-threaded benchmark, scripts attached):

Scenario: one long-running transaction with K=1 INSERT, coexisting with
N concurrent CREATE/DROP TABLE pairs (each its own catalog-modifying
commit), then drained via pg_logical_slot_get_changes() with the
test_decoding output plugin.  3-run median per cell.

  logical_decoding_work_mem = 64MB (production-like default):

      N       decode_ms        spill_bytes      total_bytes
      ----    -------------    -------------    -------------
      1000    master  45       master  0        master   22 MB
              patch   39       patch   0        patch    13 MB
      5000    master  629      master  207 MB   master  429 MB
              patch   261      patch   0        patch   227 MB
      10000   master  1865     master  813 MB   master 1659 MB
              patch   814      patch   0        patch   855 MB

    At ldwm=64MB the patched code does not spill at all up to N=10000;
    master spills 207 MB at N=5000 and 813 MB at N=10000, scaling
    quadratically in N.  Decoding time drops 2.3-2.4x at N>=5000,
    purely from avoided spill I/O.  Memory residency (total_bytes)
    is halved as well, because INTERNAL_SNAPSHOT entries no longer
    accumulate in the long transaction's change list.

  logical_decoding_work_mem = 64kB (stress, forces spill even at small N):

      N       decode_ms        spill_bytes      total_bytes
      ----    -------------    -------------    -------------
      500     master  28       master  2.6 MB   master   6.9 MB
              patch   24       patch   0.4 MB   patch    4.7 MB
      1000    master  57       master  9.3 MB   master  22 MB
              patch   42       patch   0.9 MB   patch   13 MB
      2000    master  143      master  35 MB    master  76 MB
              patch   82       patch   1.8 MB   patch   43 MB
      5000    master  607      master  247 MB   master  429 MB
              patch   289      patch   25 MB    patch   227 MB

    Under stress, the patch reduces spill bytes by 6-19x and decoding
    time by 1.2-2.1x.  Master's spill grows as ~10*N^2 bytes (clean
    O(N^2) signal: 5000^2 / 1000^2 = 25, spill 247 MB / 9.3 MB = 26.6).
    The patch's residual spill is dominated by invalidation-message
    entries (still distributed eagerly by design); these are bounded
    by MAX_DISTR_INVAL_MSG_PER_TXN.

  The decoded change count is identical between master and patch in
  every cell (3 changes: BEGIN + INSERT + COMMIT of the long
  transaction; DDL transactions emit no test_decoding output), so
  the patch preserves decoding output exactly.

Extrapolating to N=100K using the observed quadratic coefficient on
master and linear on patch:

      N=100K  master  ~80 GB spill      (vs the 2*N^2 lower-bound
                                         estimate of 20 GB; the observed
                                         coefficient is ~4x higher due to
                                         per-entry ReorderBufferChange
                                         overhead and accumulated
                                         invalidation entries)
              patch    0 spill at 64MB ldwm
                       <500 MB at 64kB ldwm
                       (residual is invalidation entries, bounded by
                        MAX_DISTR_INVAL_MSG_PER_TXN)

Scripts to reproduce are in the second patch (v1-0002):
bench/setup_cluster.sh, bench/lazy_snapshot_bench.sh,
bench/run_matrix.sh, bench/aggregate.sh.

== Limitations ==

When the long-running transaction itself has continuous data changes
(e.g. COPY to many tables), K approaches G and lazy distribution
degenerates toward eager behavior:

                         K small (typical)  K~G (bulk COPY)
  Eager (before)         O(G^2)             O(G^2)
  Lazy (this patch)      O(K*N)             O(G^2)

For the bulk COPY scenario, the root cause is that VACUUM's pg_class
statistics updates (relpages, reltuples, relallvisible) are treated as
catalog changes even though they do not affect tuple decoding.  A
complementary optimization could filter these non-schema-affecting
catalog modifications from triggering snapshot rebuilds.  That is a
separate change with its own considerations and is left as future work.

== Tests ==

1) Isolation test (contrib/test_decoding/specs/lazy_snapshot_distribution.spec):
   Three scenarios verifying decoding correctness with lazy distribution:
   - Long transaction + multiple ALTER TABLE ADD COLUMN + INSERT with
     new schema
   - Long transaction + many CREATE TABLE (catalog changes) + INSERT
   - Subtransaction + DDL + INSERT with new schema

2) TAP test (contrib/test_decoding/t/002_lazy_snapshot_spill.pl):
   Verifies that a long transaction with 200 catalog-modifying DDLs in
   between its two INSERTs results in spill_bytes=0, confirming that lazy
   distribution eliminates snapshot-caused spilling.

3) All existing tests pass without modification:
   - test_decoding: 20 regression + 15 isolation + 1 TAP
   - src/test/subscription: 39 TAP tests (581 test cases), covering
     streaming, subtransactions, DDL, two-phase commit, etc.
   - src/test/recovery: 51 TAP tests (635 test cases), including
     006_logical_decoding and 038_save_logical_slots_shutdown
   - contrib/pg_logicalinspect: 2 tests

Two patches attached:

  v1-0001-Lazy-snapshot-distribution-in-logical-decoding.patch
    The lazy distribution change itself plus isolation and TAP tests.

  v1-0002-Benchmark-scripts-for-lazy-snapshot-distribution.patch
    Reproducer scripts for the measurements above (bench/).  Kept as a
    separate patch since it lives outside the contrib/ tree and is not
    intended for backport or for production builds.

Thanks,
Rui Zhao

Attachment: v1-0001-Lazy-snapshot-distribution-in-logical-decoding.patch
Description: Binary data

Attachment: v1-0002-Benchmark-scripts-for-lazy-snapshot-distribution.patch
Description: Binary data

Reply via email to