Re: Inval reliability, especially for inplace updates

Noah Misch Thu, 04 Dec 2025 16:19:24 -0800

Thanks for the review.

On Wed, Dec 03, 2025 at 12:45:58PM -0800, Paul A Jungwirth wrote:
> Surya Poondla (cc'd) and I took another look at this as part of the
> November Patch Review Workshop.
> 
> We think it looks good. But I couldn't get the latest patch to apply
> on top of REL_17_STABLE until I did this:
> 
> ```
> git am inplace160-inval-durability-inplace-v7.patch_v17
> git revert bc6bad8857 # revert the revert of "WAL-log inplace update"
> git am inplace280-comment-fix-v1.patch.nocfbot # attached
> git am inplace290-comments202508-v1.patch
> ```
> 
> The inplace280 step adds a small comment change that seems to be in
> your git history, but I couldn't find it in the email chain.

inplace290 targets master.  inplace280-comment-fix-v1 is doing
s/heap_inplace_update/heap_inplace_update_and_unlock/ on decode.c, like
11012c5 did on master.  I'll incorporate inplace280-comment-fix-v1 when
back-patching.

> Also the
> 290 patch has context from reverting the WAL-log revert.

That makes sense.  I've not yet probed that level of detail.  FYI, if it ends
up making the back-patch clearer, I may split inplace290 into two patches, one
for the inplace160 bits and one for the inplace180 ("WAL-log inplace update")
bits.

> The patch avoids deadlocks by reordering invalidation prep before
> buffer locking. While no explicit assertion exists to detect future
> violations, would it be helpful to add a helper or macro that enforces
> this lock ordering rule more visibly? Probably not for a backpatch,
> but in master?

I think f4ece89 added that.  I plan to back-patch it on the same day as
$SUBJECT.

> On Sun, Aug 24, 2025 at 4:39 PM Noah Misch <[email protected]> wrote:
> > > Some of the comments felt a bit compressed.

> Thanks for expanding on this! Here are some thoughts about the new comment:
> 
> +    * Register shared cache invals if necessary.  Our input to inval can be
> +    * weaker than heap_update() input to inval in these ways:
> 
> Perhaps "than the heap_update() input" or "than heap_update()'s input"?

I think adding "the" would make it less correct, since we're contemplating a
class of inputs, not a specific population.  I think it's fine with or without
"'s", because there's no rule about treating a C function as a noun adjunct.

> +    * - This passes only the old version of the tuple.  Inval reacts only to
> +    * catcache lookup key columns and pg_class.oid values stored in
> +    * relcache-relevant catalog columns.  All of those columns are indexed.
> +    * Inplace update mustn't be used for any operations that could change
> +    * those.  Hence, the new tuple would provide no additional inval-relevant
> +    * information.  Those facts also make it fine to skip updating indexes.
> 
> This is confusing to me. "Inval only reacts": who is inval? Are you
> talking about the other backends when they receive the message? After
> spending a lot of time, I think you mean
> CacheInvalidateHeapTupleCommon, how it decides whether to invalidate
> first the catcache and then the relcache.

Right, the last interpretation.

> Also I wondered, if an
> inplace update never changes index keys, and [something] only cares
> about inval messages that change index keys, why are we sending an
> inval message at all?

An invalidation message contains only the key, not the value to cache.  Hence,
any change to a cacheable tuple must queue an invalidation message, but the
keys suffice to compute which invalidation message to insert.

The cache value is the whole tuple, but the cache keys are a function of:
- list of columns in syscache_info.h
- indexed columns that reference pg_class.oid, e.g. pg_constraint.conrelid

> After a few months away from this patch, it was
> hard for me to remember. I think the comment is a bit misleading. We
> *do* send catcache invalidations if non-key columns change. What about
> this?:
> 
> +    * - This passes only the old version of the tuple. Catcache
> invalidation doesn't need newtuple because inplace updates never
> change key columns, so it only needs to invalidate one hash value, not
> two. [For the same reason, we don't need to update indexes.] Relcache
> invalidation (in CacheInvalidateHeapTupleCommon) ignores newtuple
> altogether, even for regular updates, because an update can never move
> a tuple from one relcache entry to another.

This influenced some of my new wording.

> I bracketed the line about indexes, because I don't understand why
> we're talking about updating indexes here. I don't see anything about
> that in CacheInvalidateHeapTupleCommon or PrepareInvalidationState
> (which doesn't have access to newtuple anyway).

It's basically saying "don't worry about cache key inplace updates until we
solve inplace updates of indexed cols".  If we ever wanted to inplace-update a
cache key column, we'd first need to solve inplace-update of indexed columns,
since all cache key columns are indexed.  Solving indexed columns is hard with
the buffer locks involved, and there's no reason to expect a wish to do it.

> Also it feels like this comment (or something similar) really belongs
> on CacheInvalidateHeapTupleInplace. And that function doesn't need its
> newtuple parameter.

I like that, so I've made edits along those lines.

> +    * - The xwait found below may COMMIT between now and this function
> +    * returning, making the tuple dead.  That can change inval decisions, so
> 
> Is this third bullet point still explaining why an inplace update can
> be looser about invalidating caches than heap_update? Or is it making
> a separate point? It seems like it should be a paragraph, not a bullet
> point.

It's a separate point.  Fixed.

> Incidentally, this comment on heap_inplace_lock looks suspicious:
> 
>  * One could modify this to return true for tuples with delete in progress,
>  * All inplace updaters take a lock that conflicts with DROP.  If explicit
>  * "DELETE FROM pg_class" is in progress, we'll wait for it like we would an
>  * update.
> 
> I think it should be "While one could modify this . . . , all inplace
> updaters . . . ." Something to consider for a non-backpatch commit
> anyway.

I did break English grammar there.  But I now find the paragraph's argument
faulty, so I've rewritten it:

 * heap_delete() is a rarer source of blocking transactions (xwait).  We'll
 * wait for such a transaction just like for the normal heap_update() case.
 * Normal concurrent DROP commands won't cause that, because all inplace
 * updaters take some lock that conflicts with DROP.  An explicit SQL "DELETE
 * FROM pg_class" can cause it.  By waiting, if the concurrent transaction
 * executed both "DELETE FROM pg_class" and "INSERT INTO pg_class", our caller
 * can find the successor tuple.

I considered just deleting the paragraph as being too esoteric to be worth
hackers reading.  But if you're reading heap_inplace_lock(), esoterica is
expected.

The attached version doesn't need a comprehensive re-review, but I'd
particularly value hearing about any places where you find it's reducing
comprehensibility rather than enhancing.

From: Noah Misch <[email protected]>



diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 843c2e5..16f7d78 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -199,3 +199,35 @@ under a reader holding a pin.  A reader of a heap_fetch() 
result tuple may
 witness a torn read.  Current inplace-updated fields are aligned and are no
 wider than four bytes, and current readers don't need consistency across
 fields.  Hence, they get by with just fetching each field once.
+
+During logical decoding, caches reflect an inplace update no later than the
+next XLOG_XACT_INVALIDATIONS.  That record witnesses the end of a command.
+Tuples of its cmin are then visible to decoding, as are inplace updates of any
+lower LSN.  Inplace updates of a higher LSN may also be visible, even if those
+updates would have been invisible to a non-historic snapshot matching
+decoding's historic snapshot.  (In other words, decoding may see inplace
+updates that were not visible to a similar snapshot taken during original
+transaction processing.)  That's a consequence of inplace update violating
+MVCC: there are no snapshot-specific versions of inplace-updated values.  This
+all makes it hard to reason about inplace-updated column reads during logical
+decoding, but the behavior does suffice for relhasindex.  A relhasindex=t in
+CREATE INDEX becomes visible no later than the new pg_index row.  While it may
+be visible earlier, that's harmless.  Finding zero indexes despite
+relhasindex=t is normal in more cases than this, e.g. after DROP INDEX.
+Example of a case that meaningfully reacts to the inplace inval:
+
+CREATE TABLE cat (c int) WITH (user_catalog_table = true);
+CREATE TABLE normal (d int);
+...
+CREATE INDEX ON cat (c)\; INSERT INTO normal VALUES (1);
+
+If the output plugin reads "cat" during decoding of the INSERT, it's fair to
+want that read to see relhasindex=t and use the new index.
+
+An alternative would be to have decoding of XLOG_HEAP_INPLACE immediately
+execute its invals.  That would behave more like invals during original
+transaction processing.  It would remove the decoding-specific delay in e.g. a
+decoding plugin witnessing a relfrozenxid change.  However, a good use case
+for that is unlikely, since the plugin would still witness relfrozenxid
+changes prematurely.  Hence, inplace update takes the trivial approach of
+delegating to XLOG_XACT_INVALIDATIONS.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4d382a0..7858691 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6349,10 +6349,13 @@ heap_abort_speculative(Relation relation, const 
ItemPointerData *tid)
  * Since this is intended for system catalogs and SERIALIZABLE doesn't cover
  * DDL, this doesn't guarantee any particular predicate locking.
  *
- * One could modify this to return true for tuples with delete in progress,
- * All inplace updaters take a lock that conflicts with DROP.  If explicit
- * "DELETE FROM pg_class" is in progress, we'll wait for it like we would an
- * update.
+ * heap_delete() is a rarer source of blocking transactions (xwait).  We'll
+ * wait for such a transaction just like for the normal heap_update() case.
+ * Normal concurrent DROP commands won't cause that, because all inplace
+ * updaters take some lock that conflicts with DROP.  An explicit SQL "DELETE
+ * FROM pg_class" can cause it.  By waiting, if the concurrent transaction
+ * executed both "DELETE FROM pg_class" and "INSERT INTO pg_class", our caller
+ * can find the successor tuple.
  *
  * Readers of inplace-updated fields expect changes to those fields are
  * durable.  For example, vac_truncate_clog() reads datfrozenxid from
@@ -6393,15 +6396,17 @@ heap_inplace_lock(Relation relation,
        Assert(BufferIsValid(buffer));
 
        /*
-        * Construct shared cache inval if necessary.  Because we pass a tuple
-        * version without our own inplace changes or inplace changes other
-        * sessions complete while we wait for locks, inplace update mustn't
-        * change catcache lookup keys.  But we aren't bothering with index
-        * updates either, so that's true a fortiori.  After LockBuffer(), it
-        * would be too late, because this might reach a
-        * CatalogCacheInitializeCache() that locks "buffer".
+        * Register shared cache invals if necessary.  Other sessions may finish
+        * inplace updates of this tuple between this step and LockTuple().  
Since
+        * inplace updates don't change cache keys, that's harmless.
+        *
+        * While it's tempting to register invals only after confirming we can
+        * return true, the following obstacle precludes reordering steps that
+        * way.  Registering invals might reach a CatalogCacheInitializeCache()
+        * that locks "buffer".  That would hang indefinitely if running after 
our
+        * own LockBuffer().  Hence, we must register invals before 
LockBuffer().
         */
-       CacheInvalidateHeapTupleInplace(relation, oldtup_ptr, NULL);
+       CacheInvalidateHeapTupleInplace(relation, oldtup_ptr);
 
        LockTuple(relation, &oldtup.t_self, InplaceUpdateTupleLock);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -6639,10 +6644,6 @@ heap_inplace_update_and_unlock(Relation relation,
        /*
         * Send invalidations to shared queue.  SearchSysCacheLocked1() assumes 
we
         * do this before UnlockTuple().
-        *
-        * If we're mutating a tuple visible only to this transaction, there's 
an
-        * equivalent transactional inval from the action that created the 
tuple,
-        * and this inval is superfluous.
         */
        AtInplace_Inval();
 
@@ -6653,10 +6654,10 @@ heap_inplace_update_and_unlock(Relation relation,
        AcceptInvalidationMessages();   /* local processing of just-sent inval 
*/
 
        /*
-        * Queue a transactional inval.  The immediate invalidation we just sent
-        * is the only one known to be necessary.  To reduce risk from the
-        * transition to immediate invalidation, continue sending a 
transactional
-        * invalidation like we've long done.  Third-party code might rely on 
it.
+        * Queue a transactional inval, for logical decoding and for third-party
+        * code that might have been relying on it since long before inplace
+        * update adopted immediate invalidation.  See README.tuplock section
+        * "Reading inplace-updated columns" for logical decoding details.
         */
        if (!IsBootstrapProcessingMode())
                CacheInvalidateHeapTuple(relation, tuple, NULL);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c969170..5f7b0ca 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -781,10 +781,11 @@ systable_endscan_ordered(SysScanDesc sysscan)
  * systable_inplace_update_begin --- update a row "in place" (overwrite it)
  *
  * Overwriting violates both MVCC and transactional safety, so the uses of
- * this function in Postgres are extremely limited.  Nonetheless we find some
- * places to use it.  See README.tuplock section "Locking to write
- * inplace-updated tables" and later sections for expectations of readers and
- * writers of a table that gets inplace updates.  Standard flow:
+ * this function in Postgres are extremely limited.  This makes no effort to
+ * support updating cache key columns or other indexed columns.  Nonetheless
+ * we find some places to use it.  See README.tuplock section "Locking to
+ * write inplace-updated tables" and later sections for expectations of
+ * readers and writers of a table that gets inplace updates.  Standard flow:
  *
  * ... [any slow preparation not requiring oldtup] ...
  * systable_inplace_update_begin([...], &tup, &inplace_state);
diff --git a/src/backend/replication/logical/decode.c 
b/src/backend/replication/logical/decode.c
index cc03f07..5e15cb1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -521,18 +521,9 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 
                        /*
                         * Inplace updates are only ever performed on catalog 
tuples and
-                        * can, per definition, not change tuple visibility.  
Inplace
-                        * updates don't affect storage or interpretation of 
table rows,
-                        * so they don't affect logicalrep_write_tuple() 
outcomes.  Hence,
-                        * we don't process invalidations from the original 
operation.  If
-                        * inplace updates did affect those things, 
invalidations wouldn't
-                        * make it work, since there are no snapshot-specific 
versions of
-                        * inplace-updated values.  Since we also don't decode 
catalog
-                        * tuples, we're not interested in the record's 
contents.
-                        *
-                        * WAL contains likely-unnecessary commit-time invals 
from the
-                        * CacheInvalidateHeapTuple() call in
-                        * heap_inplace_update_and_unlock(). Excess 
invalidation is safe.
+                        * can, per definition, not change tuple visibility.  
Since we
+                        * also don't decode catalog tuples, we're not 
interested in the
+                        * record's contents.
                         */
                        break;
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 06f736c..f75c247 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1583,13 +1583,17 @@ CacheInvalidateHeapTuple(Relation relation,
  *             implied.
  *
  * Like CacheInvalidateHeapTuple(), but for inplace updates.
+ *
+ * Just before and just after the inplace update, the tuple's cache keys must
+ * match those in key_equivalent_tuple.  Cache keys consist of catcache lookup
+ * key columns and columns referencing pg_class.oid values,
+ * e.g. pg_constraint.conrelid, which would trigger relcache inval.
  */
 void
 CacheInvalidateHeapTupleInplace(Relation relation,
-                                                               HeapTuple tuple,
-                                                               HeapTuple 
newtuple)
+                                                               HeapTuple 
key_equivalent_tuple)
 {
-       CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
+       CacheInvalidateHeapTupleCommon(relation, key_equivalent_tuple, NULL,
                                                                   
PrepareInplaceInvalidationState);
 }
 
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index af46625..345733d 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,8 +61,7 @@ extern void CacheInvalidateHeapTuple(Relation relation,
                                                                         
HeapTuple tuple,
                                                                         
HeapTuple newtuple);
 extern void CacheInvalidateHeapTupleInplace(Relation relation,
-                                                                               
        HeapTuple tuple,
-                                                                               
        HeapTuple newtuple);
+                                                                               
        HeapTuple key_equivalent_tuple);
 
 extern void CacheInvalidateCatalog(Oid catalogId);

Re: Inval reliability, especially for inplace updates

Reply via email to