Hi Bharath,

Thank you for the thorough and thoughtful review!


> Having a lot of spill files also increases crash/recovery times.
> However, files spilling to disk causing no-space-left-on-disk issues
> leading to downtime applies to WAL files, historical catalog snapshot
> files, subtransaction overflow files, CLOG (and all the subsystems
> backed by SLRU data structure), etc. - basically any Postgres
> subsystem writing files to disk. I'm a bit worried that we may end up
> solving disk space issues, which IMHO are outside of the database
> scope, in the database. Others may have different opinions though.

That's a fair philosophical point.  However, I'd argue that spill files
are uniquely unprotected compared to the other subsystems you mentioned:

  - WAL files → max_slot_wal_keep_size
  - Temp files → temp_file_limit
  - SLRU/CLOG → bounded by design (sized proportionally to active XIDs)

Spill files are the gap: they can grow proportionally to the *data
volume* of in-progress transactions, not to the number of transactions,
and there is no existing upper bound.  A single bulk-load transaction
replicated via logical decoding can generate tens of GBs of spill files
with no way for the DBA to cap it.

So this is not about solving a general OS-level disk problem in the
database; it is about providing the same kind of safety net that already
exists for WAL (max_slot_wal_keep_size) and temp files (temp_file_limit)
— a per-subsystem guardrail that the DBA can configure.



> How common is this issue? Could you please add a test case to the
> proposed patch that without this feature would otherwise hit the issue
> described?

Typically triggered by bulk data migrations or long-running DDL on large
tables,
combined with a subscriber that fell behind.  Each incident caused the
instance
to go read-only, affecting all workloads — not just replication.

We have also seen reports of this in pgsql-general and on community
forums, though the root cause is not always identified as spill files.

Absolutely.  I will add a TAP test that:
  1. Creates 10 logical replication slots with test_decoding.
  2. Starts a large transaction that generates spill files exceeding
     the configured logical_decoding_spill_limit.
  3. Verifies that the walsender ERRORs out with the expected
     ERRCODE_CONFIGURATION_LIMIT_EXCEEDED.
  4. Verifies that spill files are cleaned up after the error.

I will include the 4 in v2.

Test case:
1. build a table:

psql -p 5432 -d postgres -c "
CREATE TABLE spill_load (
    id      bigint,
    payload text
);
"

2. build some slots:
#base

for i in $(seq 1 10); do
    psql -p 5432 -d postgres -c \
        "SELECT pg_create_logical_replication_slot('spill_slot_${i}',
'test_decoding');"
done

psql -p 5432 -d postgres -c \
    "SELECT slot_name, plugin, active FROM pg_replication_slots ORDER BY
slot_name;"

3. start pg_recvlogical:
#bash
mkdir -p /tmp/spill_test/pids

for i in $(seq 1 10); do
    pg_recvlogical \
        --dbname="port=5432 dbname=postgres" \
        --slot="spill_slot_${i}" \
        --start \
        --no-loop \
        --status-interval=0 \
        --file=/dev/null &
    echo $! > /tmp/spill_test/pids/recv_${i}.pid
    echo "Started pg_recvlogical for spill_slot_${i}, PID=$!"
done

sleep 2

psql -p 5432 -d postgres -c \
    "SELECT slot_name, active, active_pid
     FROM pg_replication_slots
     WHERE slot_name LIKE 'spill_slot_%'
     ORDER BY slot_name;"

4. make a big transaction:

psql -p 5432 -d postgres -c "
BEGIN;
INSERT INTO spill_load
    SELECT i, repeat('X', 100000)
    FROM generate_series(1, 10000000) i;
COMMIT;

"

> Having said that, were alternatives like disabling subscriptions when
> seen occupying the disk space considered?

Yes, we considered monitoring-based approaches, but they have inherent
limitations:

  1. Reaction latency: Even with 1-second polling, a high-throughput
     bulk load can spill gigabytes between two polls.  The spill rate
     can easily exceed 100 MB/s, so a 5-second monitoring gap means
     500 MB of uncontrolled growth.

  2. External dependency: Relying on an external tool means the
     protection is not available by default.  Every deployment must
     set up and maintain the monitoring — and if it fails or is
     misconfigured, the disk-full scenario still occurs.

  3. No subscriber-side control: pg_recvlogical users and third-party
     CDC tools (Debezium, etc.) that consume directly from the
     publisher have no "subscription" to disable.

  4. Disabling a subscription does not immediately stop spilling:
     the walsender continues processing WAL records already in flight.

The proposed GUC acts as a synchronous, in-process safety valve — it
checks the limit at the exact point where the next spill write would
occur, with zero latency.  Monitoring and the GUC are complementary:
monitoring for alerting, the GUC as a hard backstop.


> When the logical_decoding_spill_limit is exceeded, ERRORing out in the
> walsender is even more problematic, right? The replication slot would
> be inactive, causing bloat and preventing tuple freezing, WAL files
> growth and eventually the system may hit disk-space issues - it is
> like "we avoided disk space issues for one subsystem, but introduced
> it for another". This looks a bit problematic IMHO. Others may have
> different opinions though.

This is a great point and deserves a careful answer.

First, the "inactive slot causes WAL growth" problem already has a
well-established solution: max_slot_wal_keep_size.  When this is
configured (which is recommended practice), an inactive slot will be
invalidated once WAL retention exceeds the limit, releasing both WAL
files and catalog bloat.  So the combination of:

logical_decoding_spill_limit + max_slot_wal_keep_size

provides defense in depth: the spill limit prevents the data disk from
filling up with spill files, while max_slot_wal_keep_size prevents WAL
disk from growing unboundedly due to an inactive slot.

Second, compare the two failure modes:

  (a) WITHOUT spill limit: disk fills up → the ENTIRE instance goes
      read-only or shutdown.  All workloads stop.  Recovery requires manual
      intervention (rm spill files, restart).  Data loss is possible
      if WAL writes also fail.

  (b) WITH spill limit: walsender ERRORs → only replication is
      interrupted.  The primary continues serving all read/write
      workloads normally.  The DBA receives a clear error message
      and can act: increase the limit, increase work_mem, enable
      streaming, or investigate the large transaction.

cenario (b) is strictly better than (a): it converts an
instance-wide outage into a replication-only interruption with a
clear, actionable error message.

Additionally, in the updated patch (v2), I have ensured that spill

files are properly cleaned up in the error path (via
WalSndErrorCleanup), so exceeding the limit does not leave orphaned
files on disk.

Best regards,
Shawn
From 3693062b63acb189f657c819c1df0f55fdffa118 Mon Sep 17 00:00:00 2001
From: Shawn Wang <[email protected]>
Date: Mon, 23 Mar 2026 20:46:08 +0800
Subject: [PATCH v2] Add logical_decoding_spill_limit GUC to cap spill file
 disk usage.

When logical decoding exceeds logical_decoding_work_mem, changes are
spilled to disk under pg_replslot/<slot>/.  Currently there is no
limit on the total size of these spill files, which can lead to
uncontrolled disk space consumption for large or numerous concurrent
transactions.

Add a new GUC logical_decoding_spill_limit (integer, unit kB,
default 0 meaning unlimited) that limits the total on-disk spill file
size per replication slot.  The implementation tracks the current
on-disk footprint in ReorderBuffer.spillBytesOnDisk (incremented
after each successful write in ReorderBufferSerializeChange,
decremented when spill files are removed in
ReorderBufferRestoreCleanup) and per-transaction in
ReorderBufferTXN.serialized_size.

Before serializing a transaction to disk, ReorderBufferCheckMemoryLimit
checks whether the projected write would exceed the configured limit.
If so, it raises an ERROR.  The replication slot state (restart_lsn,
confirmed_flush) is preserved, so the subscriber can reconnect after
the administrator increases the limit, raises logical_decoding_work_mem,
or switches to a streaming-capable output plugin.

The check is only on the spill-to-disk path (not the streaming path,
which involves no disk I/O) and only when the limit is set (> 0), so
there is zero overhead on the normal hot path and full backward
compatibility with existing configurations.

Additionally, ensure spill files are cleaned up during error recovery.
Export ReorderBufferCleanupSerializedTXNs() and call it in
WalSndErrorCleanup() before releasing the replication slot, so that
spill files left behind by the ERROR are properly removed.  Without
this, the error path would release the slot (setting MyReplicationSlot
to NULL) before any chance to clean up, leaving orphaned spill files
on disk until the next walsender connects to the same slot or the
server restarts.
---
 doc/src/sgml/config.sgml                      | 24 ++++++++
 .../replication/logical/reorderbuffer.c       | 60 ++++++++++++++++++-
 src/backend/replication/walsender.c           | 12 ++++
 src/backend/utils/misc/guc_parameters.dat     | 10 ++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/reorderbuffer.h       | 18 ++++++
 6 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..e6fc9653fef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2120,6 +2120,30 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-spill-limit" xreflabel="logical_decoding_spill_limit">
+      <term><varname>logical_decoding_spill_limit</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_spill_limit</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum total size of on-disk spill files that logical
+        decoding may use per replication slot. When changes exceed the memory
+        limit set by <varname>logical_decoding_work_mem</varname>, they are
+        spilled to disk. This parameter limits the total amount of disk space
+        used by those spill files. If the limit would be exceeded, the
+        walsender process raises an error and disconnects. The replication
+        slot's state is preserved, so the subscriber can reconnect after the
+        administrator increases this limit or switches to a streaming-capable
+        output plugin.
+        If this value is specified without units, it is taken as kilobytes.
+        The default value is <literal>0</literal>, which means there is no
+        limit on spill file disk usage.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-timestamp-buffers" xreflabel="commit_timestamp_buffers">
       <term><varname>commit_timestamp_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 682d13c9f22..95e5b0f0507 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -72,6 +72,15 @@
  *	  counter is updated; however transactions with size 0 are not stored in
  *	  the heap, because they have no changes to evict.
  *
+ *	  To prevent spill files from consuming unbounded disk space, the
+ *	  logical_decoding_spill_limit GUC can be set to limit the total size
+ *	  of spill files per replication slot.  We track the current on-disk
+ *	  footprint in ReorderBuffer.spillBytesOnDisk (incremented on each
+ *	  successful write, decremented when spill files are cleaned up) and
+ *	  per-transaction in ReorderBufferTXN.serialized_size.  Before
+ *	  serializing a transaction, we check whether the projected write would
+ *	  exceed the configured limit and raise an ERROR if so.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -224,6 +233,7 @@ typedef struct ReorderBufferDiskChange
  * like.
  */
 int			logical_decoding_work_mem;
+int			logical_decoding_spill_limit;
 static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* GUC variable */
@@ -276,7 +286,7 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
 static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+/* ReorderBufferCleanupSerializedTXNs is declared in reorderbuffer.h */
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
@@ -3885,6 +3895,10 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
  * If debug_logical_replication_streaming is set to "immediate", stream or
  * serialize the changes immediately.
  *
+ * When spilling to disk, if logical_decoding_spill_limit is set (> 0),
+ * we check whether the projected write would exceed the configured limit
+ * and raise an ERROR if so, to prevent unbounded disk usage.
+ *
  * XXX At this point we select the transactions until we reach under the memory
  * limit, but we might also adapt a more elaborate eviction strategy - for example
  * evicting enough transactions to free certain fraction (e.g. 50%) of the memory
@@ -3964,6 +3978,36 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
 				continue;
 
+			/*
+			 * Check the spill-to-disk size limit before actually serializing.
+			 * We use the transaction's in-memory size as an estimate of how
+			 * much will be written, which is a reasonable approximation.
+			 *
+			 * We only check when logical_decoding_spill_limit is set (> 0).
+			 * When the limit would be exceeded, raise an ERROR.  The
+			 * walsender will exit, but the replication slot's restart_lsn is
+			 * preserved so the subscriber can reconnect after the DBA
+			 * increases the limit or switches to a streaming-capable output
+			 * plugin.
+			 */
+			if (logical_decoding_spill_limit > 0 &&
+				rb->spillBytesOnDisk + txn->size >
+				(Size) logical_decoding_spill_limit * 1024)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+						 errmsg("logical decoding spill file size limit exceeded"),
+						 errdetail("Current on-disk spill size is %zu bytes, "
+								   "transaction to spill is %zu bytes, "
+								   "limit is %d kB.",
+								   rb->spillBytesOnDisk, txn->size,
+								   logical_decoding_spill_limit),
+						 errhint("Consider increasing %s, %s, "
+								 "or using a streaming-capable output plugin.",
+								 "logical_decoding_spill_limit",
+								 "logical_decoding_work_mem")));
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
@@ -4286,6 +4330,13 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * Update the on-disk spill size accounting for both the transaction and
+	 * the reorder buffer.  This is used to enforce logical_decoding_spill_limit.
+	 */
+	txn->serialized_size += ondisk->size;
+	rb->spillBytesOnDisk += ondisk->size;
+
 	/*
 	 * Keep the transaction's final_lsn up to date with each change we send to
 	 * disk, so that ReorderBufferRestoreCleanup works correctly.  (We used to
@@ -4872,13 +4923,18 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					(errcode_for_file_access(),
 					 errmsg("could not remove file \"%s\": %m", path)));
 	}
+
+	/* Update the on-disk spill size accounting. */
+	Assert(rb->spillBytesOnDisk >= txn->serialized_size);
+	rb->spillBytesOnDisk -= txn->serialized_size;
+	txn->serialized_size = 0;
 }
 
 /*
  * Remove any leftover serialized reorder buffers from a slot directory after a
  * prior crash or decoding session exit.
  */
-static void
+void
 ReorderBufferCleanupSerializedTXNs(const char *slotname)
 {
 	DIR		   *spill_dir;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..2ca99b31c5b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -356,6 +356,18 @@ WalSndErrorCleanup(void)
 	if (xlogreader != NULL && xlogreader->seg.ws_file >= 0)
 		wal_segment_close(xlogreader);
 
+	/*
+	 * Clean up spill files before releasing the slot, because
+	 * ReorderBufferCleanupSerializedTXNs() needs the slot name to locate the
+	 * directory.  We use the static logical_decoding_ctx to determine whether
+	 * we are in a logical decoding session.
+	 */
+	if (logical_decoding_ctx != NULL && MyReplicationSlot != NULL)
+	{
+		ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
+		logical_decoding_ctx = NULL;
+	}
+
 	if (MyReplicationSlot != NULL)
 		ReplicationSlotRelease();
 
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0c9854ad8fc..3168d23f387 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1849,6 +1849,16 @@
   boot_val => 'false',
 },
 
+{ name => 'logical_decoding_spill_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the maximum total size of spill files for logical decoding.',
+  long_desc => 'Limits the total disk space used by spill files per replication slot during logical decoding. 0 means no limit.',
+  flags => 'GUC_UNIT_KB',
+  variable => 'logical_decoding_spill_limit',
+  boot_val => '0',
+  min => '0',
+  max => 'MAX_KILOBYTES',
+},
+
 { name => 'logical_decoding_work_mem', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the maximum memory to be used for logical decoding.',
   long_desc => 'This much memory can be used by each internal reorder buffer before spilling to disk.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4abe6c0077..04b4fb73737 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -149,6 +149,8 @@
 #maintenance_work_mem = 64MB            # min 64kB
 #autovacuum_work_mem = -1               # min 64kB, or -1 to use maintenance_work_mem
 #logical_decoding_work_mem = 64MB       # min 64kB
+#logical_decoding_spill_limit = 0       # limits spill file disk usage per slot;
+                                        # 0 means no limit
 #max_stack_depth = 2MB                  # min 100kB
 #shared_memory_type = mmap              # the default is the first option
                                         # supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2d717a9e152..6158643571f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -25,6 +25,7 @@
 
 /* GUC variables */
 extern PGDLLIMPORT int logical_decoding_work_mem;
+extern PGDLLIMPORT int logical_decoding_spill_limit;
 extern PGDLLIMPORT int debug_logical_replication_streaming;
 
 /* possible values for debug_logical_replication_streaming */
@@ -461,6 +462,14 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
+	/*
+	 * Size of this transaction's changes currently serialized to disk (in
+	 * bytes).  Tracked per-transaction so that we can accurately update the
+	 * ReorderBuffer-level spillBytesOnDisk counter when spill files are
+	 * cleaned up.
+	 */
+	Size		serialized_size;
+
 	/*
 	 * Private data pointer of the output plugin.
 	 */
@@ -685,6 +694,14 @@ struct ReorderBuffer
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillBytes;		/* amount of data spilled to disk */
 
+	/*
+	 * Current total size of spill files on disk for this reorder buffer (in
+	 * bytes).  Unlike spillBytes which is a cumulative statistic counter, this
+	 * tracks the actual on-disk footprint right now and is decremented when
+	 * spill files are cleaned up.  Used to enforce logical_decoding_spill_limit.
+	 */
+	Size		spillBytesOnDisk;
+
 	/* Statistics about transactions streamed to the decoding output plugin */
 	int64		streamTxns;		/* number of transactions streamed */
 	int64		streamCount;	/* streaming invocation counter */
@@ -782,5 +799,6 @@ extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
 											SharedInvalidationMessage **msgs);
 
 extern void StartupReorderBuffer(void);
+extern void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 
 #endif
-- 
2.43.7

Reply via email to