Re: race condition in pg_class

2024-09-08 Thread Nitin Motiani
On Sat, Sep 7, 2024 at 12:25 AM Noah Misch  wrote:
>
> On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> > Thanks. I have no other comments.
>
> https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
> When someone moves it to status="Ready for Committer", I will commit
> inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
> with that, please modify the status.

Done.




Re: race condition in pg_class

2024-09-06 Thread Noah Misch
On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> Thanks. I have no other comments.

https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
When someone moves it to status="Ready for Committer", I will commit
inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
with that, please modify the status.




Re: race condition in pg_class

2024-09-06 Thread Nitin Motiani
On Fri, Sep 6, 2024 at 3:34 AM Noah Misch  wrote:
>
> On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> > On Thu, Sep 5, 2024 at 1:27 AM Noah Misch  wrote:
> > > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > > >  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> > > >
> > > > This can safeguard against users of ResultRelInfo missing this field.
> > >
> > > v10 does the rename and adds that assertion.  This question remains open:
> >
> > Looks good. A couple of minor comments :
> > 1. In the inplace110 commit message, there are still references to
> > heap_inplace_update. Should it be clarified that the function has been
> > renamed?
>
> PGXN has only one caller of this function, so I think that wouldn't help
> readers enough.  If someone gets a compiler error about the old name, they'll
> figure it out without commit log guidance.  If a person doesn't get a compiler
> error, they didn't need to read about the fact of the rename.
>
> > 2. Should there be a comment above the ri_needLockTag definition in
> > execNodes.h that we are caching this value to avoid function calls to
> > IsInPlaceUpdateRelation for every tuple? Similar to how the comment
> > above ri_TrigFunctions mentions that it is cached lookup info.
>
> Current comment:
>
> /* updates do LockTuple() before oldtup read; see README.tuplock */
> bool        ri_needLockTagTuple;
>
> Once the comment doesn't fit in one line, pgindent rules make it take a
> minimum of four lines.  I don't think words about avoiding function calls
> would add enough value to justify the vertical space, because a person
> starting to remove it would see where it's called.  That's not to say the
> addition would be negligent.  If someone else were writing the patch and had
> included that, I wouldn't be deleting the material.

Thanks. I have no other comments.




Re: race condition in pg_class

2024-09-05 Thread Noah Misch
On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> On Thu, Sep 5, 2024 at 1:27 AM Noah Misch  wrote:
> > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > >  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> > >
> > > This can safeguard against users of ResultRelInfo missing this field.
> >
> > v10 does the rename and adds that assertion.  This question remains open:
> 
> Looks good. A couple of minor comments :
> 1. In the inplace110 commit message, there are still references to
> heap_inplace_update. Should it be clarified that the function has been
> renamed?

PGXN has only one caller of this function, so I think that wouldn't help
readers enough.  If someone gets a compiler error about the old name, they'll
figure it out without commit log guidance.  If a person doesn't get a compiler
error, they didn't need to read about the fact of the rename.

> 2. Should there be a comment above the ri_needLockTag definition in
> execNodes.h that we are caching this value to avoid function calls to
> IsInPlaceUpdateRelation for every tuple? Similar to how the comment
> above ri_TrigFunctions mentions that it is cached lookup info.

Current comment:

/* updates do LockTuple() before oldtup read; see README.tuplock */
bool        ri_needLockTagTuple;

Once the comment doesn't fit in one line, pgindent rules make it take a
minimum of four lines.  I don't think words about avoiding function calls
would add enough value to justify the vertical space, because a person
starting to remove it would see where it's called.  That's not to say the
addition would be negligent.  If someone else were writing the patch and had
included that, I wouldn't be deleting the material.
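Concretely, the v10 assertion referenced above amounts to roughly the following (a sketch; the resultRelInfo variable name is assumed from the executor context, while the field and function names follow the patch as quoted in this thread):

    /* sketch: catch ResultRelInfo users that skipped InitResultRelInfo() */
    Assert(resultRelInfo->ri_needLockTagTuple ==
           IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc));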




Re: race condition in pg_class

2024-09-05 Thread Nitin Motiani
On Thu, Sep 5, 2024 at 1:27 AM Noah Misch  wrote:
>
> On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > How about this alternative then? The tuple length check
> > and the elog(ERROR) gets its own function. Something like
> > heap_inplace_update_validate or
> > heap_inplace_update_validate_tuple_length. So in that case, it would
> > look like this :
> >
> >   genam.c:systable_inplace_update_finish
> > heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
> > PreInplace_Inval
> > START_CRIT_SECTION
> > heapam.c:heap_inplace_update
> > BUFFER_LOCK_UNLOCK
> > AtInplace_Inval
> > END_CRIT_SECTION
> > UnlockTuple
> > AcceptInvalidationMessages
> >
> > This is starting to get complicated though so I don't have any issues
> > with just renaming the heap_inplace_update to
> > heap_inplace_update_and_unlock.
>
> Complexity aside, I don't see the _precheck design qualifying as a modularity
> improvement.
>
> > >  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> >
> > This can safeguard against users of ResultRelInfo missing this field.
>
> v10 does the rename and adds that assertion.  This question remains open:
>

Looks good. A couple of minor comments :
1. In the inplace110 commit message, there are still references to
heap_inplace_update. Should it be clarified that the function has been
renamed?
2. Should there be a comment above the ri_needLockTag definition in
execNodes.h that we are caching this value to avoid function calls to
IsInPlaceUpdateRelation for every tuple? Similar to how the comment
above ri_TrigFunctions mentions that it is cached lookup info.

> On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> > > always require a tuple lock on indexes, if that would make a difference.
> >
> > Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
> > I'll squash it into inplace120.
>
> If nobody has an opinion, I'll discard inplace125.  I feel it's not a net
> improvement, but either way is fine with me.

Seems moderately simpler to me. But there is still special handling
for the RELKIND_INDEX. Just that instead of doing it in
systable_inplace_update_begin, we have a special case in heap_update.
So overall it's only a small improvement and I'm fine either way.

Thanks & Regards
Nitin Motiani
Google




Re: race condition in pg_class

2024-09-04 Thread Noah Misch
On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> How about this alternative then? The tuple length check
> and the elog(ERROR) gets its own function. Something like
> heap_inplace_update_validate or
> heap_inplace_update_validate_tuple_length. So in that case, it would
> look like this :
> 
>   genam.c:systable_inplace_update_finish
> heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
> PreInplace_Inval
> START_CRIT_SECTION
> heapam.c:heap_inplace_update
> BUFFER_LOCK_UNLOCK
> AtInplace_Inval
> END_CRIT_SECTION
> UnlockTuple
> AcceptInvalidationMessages
> 
> This is starting to get complicated though so I don't have any issues
> with just renaming the heap_inplace_update to
> heap_inplace_update_and_unlock.

Complexity aside, I don't see the _precheck design qualifying as a modularity
improvement.

>  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> 
> This can safeguard against users of ResultRelInfo missing this field.

v10 does the rename and adds that assertion.  This question remains open:

On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> > always require a tuple lock on indexes, if that would make a difference.
> 
> Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
> I'll squash it into inplace120.

If nobody has an opinion, I'll discard inplace125.  I feel it's not a net
improvement, but either way is fine with me.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by Heikki Linnakangas.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 83b99a9..51d52c4 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2255,6 +2255,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+
+   /*
+    * Tuple locks are currently held only for short durations within a
+    * transaction. Check that we didn't forget to release one.
+    */
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+       elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was taken via
 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by Heikki Linnakangas, Nitin Motiani and Alexander Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..ddb2def 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,14 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Reading inplace-updated columns
+---
+
+Inplace updates create an exception to the rule that tuple data won't change
+under a reader holding a pin.  A reader of a heap_fetch() result tuple may
+witness a torn read.  Current inplace-updated fields are aligned and are no
+wider than four bytes, and current readers don't need consistency across
+fields.  Hence

Re: race condition in pg_class

2024-09-04 Thread Nitin Motiani
On Wed, Sep 4, 2024 at 2:53 AM Noah Misch  wrote:
>
>
> > So this also pulls the invalidation to the genam.c
> > layer. Am I understanding this correctly?
>
> Compared to the v9 patch, the "call both" alternative would just move the
> systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
> move anything across the genam:heapam boundary.
> systable_inplace_update_finish() would remain a thin wrapper around a heapam
> function.
>

Thanks for the clarification.

> > > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > > >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> > > > >
> > > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > > becomes less attractive with the patch. I see there is a new
> > > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > > while holding the lock. And then there is an
> > > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > > layer too. Is the concern that one invalidation call will be in the
> > > > heapam layer and the other will be in the genam layer?
> > >
> > > That, or a critical section would start in heapam.c, then end in genam.c.
> > > Current call tree at inplace160 v4:
>
> > How about this alternative?
> >
> >  genam.c:systable_inplace_update_finish
> >PreInplace_Inval
> >START_CRIT_SECTION
> >heapam.c:heap_inplace_update
> >BUFFER_LOCK_UNLOCK
> >AtInplace_Inval
> >END_CRIT_SECTION
> >UnlockTuple
> >AcceptInvalidationMessages
>
> > Looking at inplace160, it seems that the start of the critical section
> > is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> > and END_CRIT_SECTION out to the genam.c layer?
>
> heap_inplace_update() has an elog(ERROR) that needs to happen outside any
> critical section.  Since the condition for that elog deals with tuple header
> internals, it belongs at the heapam layer more than the systable layer.
>

Understood. How about this alternative then? The tuple length check
and the elog(ERROR) gets its own function. Something like
heap_inplace_update_validate or
heap_inplace_update_validate_tuple_length. So in that case, it would
look like this :

  genam.c:systable_inplace_update_finish
heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
PreInplace_Inval
START_CRIT_SECTION
heapam.c:heap_inplace_update
BUFFER_LOCK_UNLOCK
AtInplace_Inval
END_CRIT_SECTION
UnlockTuple
AcceptInvalidationMessages

This is starting to get complicated though so I don't have any issues
with just renaming the heap_inplace_update to
heap_inplace_update_and_unlock.
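For illustration, the proposed pre-check helper would be tiny; a minimal sketch, assuming a hypothetical signature and approximating the existing length check in heap_inplace_update():

    /*
     * Sketch only.  Must run before START_CRIT_SECTION() because it can
     * elog(ERROR).  A caller would pass oldlen = ItemIdGetLength(lp) - oldhdr->t_hoff.
     */
    static void
    heap_inplace_update_validate_tuple_length(HeapTupleHeader oldhdr, uint32 oldlen,
                                              HeapTuple newtup)
    {
        uint32      newlen = newtup->t_len - newtup->t_data->t_hoff;

        if (oldlen != newlen || oldhdr->t_hoff != newtup->t_data->t_hoff)
            elog(ERROR, "wrong tuple length");
    }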

> > Alternatively since
> > heap_inplace_update is commented as a subroutine of
> > systable_inplace_update_finish, should everything just be moved to the
> > genam.c layer? Although it looks like you already considered and
> > rejected this approach.
>
> Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
> violation than the one that led to the changes between v8 and v9.  I think
> even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
> violation than the one attributed to v8.  Modularity would have the
> heap_inplace function resemble heap_update() handling of invals.
>

Understood. Thanks.

> > If the above alternatives are not possible, it's probably fine to go
> > ahead with the current patch with the function renamed to
> > heap_inplace_update_and_unlock (or something similar) as mentioned
> > earlier?
>
> I like that name.  The next version will use it.
>

So either we go with this or try the above approach of having a
separate function _validate/_precheck/_validate_tuple_length. I don't
have a strong opinion on either of these approaches.

> > > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > > we can just call
> > > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > > big advantage of storing a separate bool field? Also there is another
> > >
> > > No, not a big advantage.  I felt it was more in line with the typical style of
> > > src/backend/executor.
>
> > just find it a smell that there are two fields which are not
> > independent and they go out of sync. And that's why my preference is
> > to not have a dependent field unless there is a specific advantage.
>
> Got it.  This check happens for every tuple of every UPDATE, so performance
> may be a factor.  Some designs and their merits:
>

Thanks. If performance is a factor, it makes sense to keep it.

>  a. ri_needLockTagTuple
> Performance: best: check one value for nonzero
> Drawback: one more value lifecycle to understand
> Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this
>
>  b. call IsInplaceUpdateRelation
> Performance: worst: two extern function calls, then compare against two values

Re: race condition in pg_class

2024-09-03 Thread Noah Misch
On Tue, Sep 03, 2024 at 09:24:52PM +0530, Nitin Motiani wrote:
> On Sat, Aug 31, 2024 at 6:40 AM Noah Misch  wrote:
> > On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
> > > > - In the cancel case, call both systable_inplace_update_cancel and
> > > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > > >   _end would own systable_endscan().
> > > >
> > > What happens to CacheInvalidateHeapTuple() in this approach?  I think
> > > it will still need to be brought to the genam.c layer if we are
> > > releasing the lock in systable_inplace_update_finish.

> understanding is that the code in this approach would look like below
> :
> if (dirty)
>    systable_inplace_update_finish(inplace_state, tup);
> else
>    systable_inplace_update_cancel(inplace_state);
> systable_inplace_update_end(inplace_state);
> 
> And that in this structure, both _finish and _cancel will call
> heap_inplace_unlock and then _end will call systable_endscan. So even
> with this structure, the invalidation has to happen inside _finish
> after the unlock.

Right.

> So this also pulls the invalidation to the genam.c
> layer. Am I understanding this correctly?

Compared to the v9 patch, the "call both" alternative would just move the
systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
move anything across the genam:heapam boundary.
systable_inplace_update_finish() would remain a thin wrapper around a heapam
function.

> > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> > > >
> > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > becomes less attractive with the patch. I see there is a new
> > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > while holding the lock. And then there is an
> > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > layer too. Is the concern that one invalidation call will be in the
> > > heapam layer and the other will be in the genam layer?
> >
> > That, or a critical section would start in heapam.c, then end in genam.c.
> > Current call tree at inplace160 v4:

> How about this alternative?
> 
>  genam.c:systable_inplace_update_finish
>PreInplace_Inval
>START_CRIT_SECTION
>heapam.c:heap_inplace_update
>BUFFER_LOCK_UNLOCK
>AtInplace_Inval
>END_CRIT_SECTION
>UnlockTuple
>AcceptInvalidationMessages

> Looking at inplace160, it seems that the start of the critical section
> is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> and END_CRIT_SECTION out to the genam.c layer?

heap_inplace_update() has an elog(ERROR) that needs to happen outside any
critical section.  Since the condition for that elog deals with tuple header
internals, it belongs at the heapam layer more than the systable layer.

> Alternatively since
> heap_inplace_update is commented as a subroutine of
> systable_inplace_update_finish, should everything just be moved to the
> genam.c layer? Although it looks like you already considered and
> rejected this approach.

Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
violation than the one that led to the changes between v8 and v9.  I think
even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
violation than the one attributed to v8.  Modularity would have the
heap_inplace function resemble heap_update() handling of invals.

> If the above alternatives are not possible, it's probably fine to go
> ahead with the current patch with the function renamed to
> heap_inplace_update_and_unlock (or something similar) as mentioned
> earlier?

I like that name.  The next version will use it.

> > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > we can just call
> > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > big advantage of storing a separate bool field? Also there is another
> >
> > No, not a big advantage.  I felt it was more in line with the typical style of
> > src/backend/executor.

> just find it a smell that there are two fields which are not
> independent and they go out of sync. And that's why my preference is
> to not have a dependent field unless there is a specific advantage.

Got it.  This check happens for every tuple of every UPDATE, so performance
may be a factor.  Some designs and their merits:

 a. ri_needLockTagTuple
Performance: best: check one value for nonzero
Drawback: one more value lifecycle to understand
Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this

 b. call IsInplaceUpdateRelation
Performance: worst: two extern function calls, then compare against two values

 c. make IsInplaceUpdateRelation

Re: race condition in pg_class

2024-09-03 Thread Nitin Motiani
On Sat, Aug 31, 2024 at 6:40 AM Noah Misch  wrote:
>
> On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
> > > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > > My order of preference is: 2, 1, 3.
> > >
> > > I kept tuple locking responsibility in heapam.c.  That's simpler and better
> > > for modularity, but it does mean we release+acquire after any xmax wait.
> > > Before, we avoided that if the next genam.c scan found the same TID.  (If the
> > > next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> > > are rare enough to justify simplifying as this version does.  I don't expect
> > > anyone to notice the starvation outside of tests built to show it.  (With
> > > previous versions, one can show it with a purpose-built test that commits
> > > instead of aborting, like the "001_pgbench_grant@9" test.)
> > >
> > > This move also loses the optimization of unpinning before XactLockTableWait().
> > > heap_update() doesn't optimize that way, so that's fine.
> > >
> > > The move ended up more like (1), though I did do
> > > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> > > felt that worked better than (2) to achieve lock release before
> > > CacheInvalidateHeapTuple().  Alternatives that could be fine:
> > >
> > From a consistency point of view, I find it cleaner if we can have all
> > the heap_inplace_lock and heap_inplace_unlock in the same set of
> > functions. So here those would be the systable_inplace_* functions.
>
> That will technically be the case after inplace160, and I could make it so
> here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that
> be cleaner or less clean?
>

I am not sure. It seems more inconsistent to take the lock using
heap_inplace_lock but then just unlock by calling LockBuffer. On the
other hand, it doesn't seem that different from the way
SearchSysCacheLocked1 and UnlockTuple are used in inplace120. If we
are doing it this way, perhaps it would be good to rename
heap_inplace_update to heap_inplace_update_and_unlock.

> > > - In the cancel case, call both systable_inplace_update_cancel and
> > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > >   _end would own systable_endscan().
> > >
> > What happens to CacheInvalidateHeapTuple() in this approach?  I think
> > it will still need to be brought to the genam.c layer if we are
> > releasing the lock in systable_inplace_update_finish.
>
> Cancel scenarios don't do invalidation.  (Same under other alternatives.)
>

Sorry, I wasn't clear about this one. Let me rephrase. My
understanding is that the code in this approach would look like below
:
if (dirty)
   systable_inplace_update_finish(inplace_state, tup);
else
   systable_inplace_update_cancel(inplace_state);
systable_inplace_update_end(inplace_state);

And that in this structure, both _finish and _cancel will call
heap_inplace_unlock and then _end will call systable_endscan. So even
with this structure, the invalidation has to happen inside _finish
after the unlock. So this also pulls the invalidation to the genam.c
layer. Am I understanding this correctly?

> > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > >   tolerable now, this gets less attractive after the inplace160 patch from
> > >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> > >
> > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > becomes less attractive with the patch. I see there is a new
> > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > while holding the lock. And then there is an
> > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > layer too. Is the concern that one invalidation call will be in the
> > heapam layer and the other will be in the genam layer?
>
> That, or a critical section would start in heapam.c, then end in genam.c.
> Current call tree at inplace160 v4:
>
> genam.c:systable_inplace_update_finish
>  heapam.c:heap_inplace_update
>   PreInplace_Inval
>   START_CRIT_SECTION
>   BUFFER_LOCK_UNLOCK
>   AtInplace_Inval
>   END_CRIT_SECTION
>   UnlockTuple
>   AcceptInvalidationMessages
>
> If we hoisted all of invalidation up to the genam.c layer, a critical section
> that starts in heapam.c would end in genam.c:
>
> genam.c:systable_inplace_update_finish
>  PreInplace_Inval
>  heapam.c:heap_inplace_update
>   START_CRIT_SECTION
>  BUFFER_LOCK_UNLOCK
>  AtInplace_Inval
>  END_CRIT_SECTION
>  UnlockTuple
>  AcceptInvalidationMessages
>
> If we didn't accept splitting the critical section but did accept splitting
> invalidation responsibilities, one gets perhaps:
>
> genam.c:systable_inplace_update_finish
>  PreInplace_Inval
>  heapam.c:heap_inplace_update
>   START_CRIT_SECTION
>   BUFFER_LOCK_UNLOCK
>   AtInplace_Inval
>   END_CRIT_SECTION

Re: race condition in pg_class

2024-08-30 Thread Noah Misch
On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > My order of preference is: 2, 1, 3.
> >
> > I kept tuple locking responsibility in heapam.c.  That's simpler and better
> > for modularity, but it does mean we release+acquire after any xmax wait.
> > Before, we avoided that if the next genam.c scan found the same TID.  (If the
> > next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> > are rare enough to justify simplifying as this version does.  I don't expect
> > anyone to notice the starvation outside of tests built to show it.  (With
> > previous versions, one can show it with a purpose-built test that commits
> > instead of aborting, like the "001_pgbench_grant@9" test.)
> >
> > This move also loses the optimization of unpinning before XactLockTableWait().
> > heap_update() doesn't optimize that way, so that's fine.
> >
> > The move ended up more like (1), though I did do
> > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> > felt that worked better than (2) to achieve lock release before
> > CacheInvalidateHeapTuple().  Alternatives that could be fine:
> >
> From a consistency point of view, I find it cleaner if we can have all
> the heap_inplace_lock and heap_inplace_unlock in the same set of
> functions. So here those would be the systable_inplace_* functions.

That will technically be the case after inplace160, and I could make it so
here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that
be cleaner or less clean?

> > - In the cancel case, call both systable_inplace_update_cancel and
> >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> >   _end would own systable_endscan().
> >
> What happens to CacheInvalidateHeapTuple() in this approach?  I think
> it will still need to be brought to the genam.c layer if we are
> releasing the lock in systable_inplace_update_finish.

Cancel scenarios don't do invalidation.  (Same under other alternatives.)

> > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> >   tolerable now, this gets less attractive after the inplace160 patch from
> >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> >
> I skimmed through the inplace160 patch. It wasn't clear to me why this
> becomes less attractive with the patch. I see there is a new
> CacheInvalidateHeapTupleInPlace but that looks like it would be called
> while holding the lock. And then there is an
> AcceptInvalidationMessages which can perhaps be moved to the genam.c
> layer too. Is the concern that one invalidation call will be in the
> heapam layer and the other will be in the genam layer?

That, or a critical section would start in heapam.c, then end in genam.c.
Current call tree at inplace160 v4:

genam.c:systable_inplace_update_finish
 heapam.c:heap_inplace_update
  PreInplace_Inval
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
  AcceptInvalidationMessages

If we hoisted all of invalidation up to the genam.c layer, a critical section
that starts in heapam.c would end in genam.c:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
 BUFFER_LOCK_UNLOCK
 AtInplace_Inval
 END_CRIT_SECTION
 UnlockTuple
 AcceptInvalidationMessages

If we didn't accept splitting the critical section but did accept splitting
invalidation responsibilities, one gets perhaps:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
 AcceptInvalidationMessages

That's how I ended up at inplace120 v9's design.
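Spelled out as code, the first call tree above (the shape inplace120 v9 keeps, with the function later renamed heap_inplace_update_and_unlock elsewhere in this thread) would look roughly like this sketch; the argument list and the overwrite/WAL step are placeholders, not the committed code:

    static void
    heap_inplace_update_and_unlock(Relation relation, HeapTuple oldtup,
                                   HeapTuple tuple, Buffer buffer)
    {
        PreInplace_Inval();
        START_CRIT_SECTION();
        /* overwrite the catalog tuple in place, MarkBufferDirty(), XLogInsert(RM_HEAP_ID) */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        AtInplace_Inval();
        END_CRIT_SECTION();
        UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
        AcceptInvalidationMessages();
    }

With that division of labor, genam.c's systable_inplace_update_finish() stays a thin wrapper that calls the function above and then systable_endscan().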

> Also I have a small question from inplace120.
> 
> I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> we can just call
> IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> big advantage of storing a separate bool field? Also there is another

No, not a big advantage.  I felt it was more in line with the typical style of
src/backend/executor.

> write to ri_RelationDesc in CatalogOpenIndexes in
> src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
> be set there also to keep it consistent with ri_RelationDesc. Please
> let me know if I am missing something about the usage of the new
> field.

Can you say more about consequences you found?

Only the full executor reads the field, doing so when it fetches the most
recent version of a row.  CatalogOpenIndexes() callers lack the full
executor's practice of fetching the most recent version of a row, so they
couldn't benefit reading the field.

I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the
full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in
part to k

Re: race condition in pg_class

2024-08-29 Thread Nitin Motiani
On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
>
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > My order of preference is: 2, 1, 3.
>
> I kept tuple locking responsibility in heapam.c.  That's simpler and better
> for modularity, but it does mean we release+acquire after any xmax wait.
> Before, we avoided that if the next genam.c scan found the same TID.  (If the
> next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> are rare enough to justify simplifying as this version does.  I don't expect
> anyone to notice the starvation outside of tests built to show it.  (With
> previous versions, one can show it with a purpose-built test that commits
> instead of aborting, like the "001_pgbench_grant@9" test.)
>
> This move also loses the optimization of unpinning before XactLockTableWait().
> heap_update() doesn't optimize that way, so that's fine.
>
> The move ended up more like (1), though I did do
> s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> felt that worked better than (2) to achieve lock release before
> CacheInvalidateHeapTuple().  Alternatives that could be fine:
>
From a consistency point of view, I find it cleaner if we can have all
the heap_inplace_lock and heap_inplace_unlock in the same set of
functions. So here those would be the systable_inplace_* functions.

> - In the cancel case, call both systable_inplace_update_cancel and
>   systable_inplace_update_end.  _finish or _cancel would own unlock, while
>   _end would own systable_endscan().
>
What happens to CacheInvalidateHeapTuple() in this approach?  I think
it will still need to be brought to the genam.c layer if we are
releasing the lock in systable_inplace_update_finish.

> - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
>   tolerable now, this gets less attractive after the inplace160 patch from
>   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
>
I skimmed through the inplace160 patch. It wasn't clear to me why this
becomes less attractive with the patch. I see there is a new
CacheInvalidateHeapTupleInPlace but that looks like it would be called
while holding the lock. And then there is an
AcceptInvalidationMessages which can perhaps be moved to the genam.c
layer too. Is the concern that one invalidation call will be in the
heapam layer and the other will be in the genam layer?

Also I have a small question from inplace120.

I see that all the places we check resultRelInfo->ri_needLockTagTuple,
we can just call
IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
big advantage of storing a separate bool field? Also there is another
write to ri_RelationDesc in CatalogOpenIndexes in
src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
be set there also to keep it consistent with ri_RelationDesc. Please
let me know if I am missing something about the usage of the new
field.

Thanks & Regards,
Nitin Motiani
Google




Re: race condition in pg_class

2024-08-22 Thread Noah Misch
On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> On 17/08/2024 07:07, Noah Misch wrote:
> > On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> > > I wonder if the functions should be called "systable_*" and placed in
> > > genam.c rather than in heapam.c. The interface looks more like the existing
> > > systable functions. It feels like a modularity violation for a function in
> > > heapam.c to take an argument like "indexId", and call back into systable_*
> > > functions.
> > 
> > Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
> > options follow.  Any preference or other ideas?
> > 
> >  direct s/heap_/systable_/ rename [option 1]
> > 
> >   systable_inplace_update_scan([...], &tup, &inplace_state);
> >   if (!HeapTupleIsValid(tup))
> > elog(ERROR, [...]);
> >   ... [buffer is exclusive-locked; mutate "tup"] ...
> >   if (dirty)
> > systable_inplace_update_finish(inplace_state, tup);
> >   else
> > systable_inplace_update_cancel(inplace_state);
> > 
> >  make the first and last steps more systable-like [option 2]
> > 
> >   systable_inplace_update_begin([...], &tup, &inplace_state);
> >   if (!HeapTupleIsValid(tup))
> > elog(ERROR, [...]);
> >   ... [buffer is exclusive-locked; mutate "tup"] ...
> >   if (dirty)
> > systable_inplace_update(inplace_state, tup);
> >   systable_inplace_update_end(inplace_state);

> My order of preference is: 2, 1, 3.

I kept tuple locking responsibility in heapam.c.  That's simpler and better
for modularity, but it does mean we release+acquire after any xmax wait.
Before, we avoided that if the next genam.c scan found the same TID.  (If the
next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
are rare enough to justify simplifying as this version does.  I don't expect
anyone to notice the starvation outside of tests built to show it.  (With
previous versions, one can show it with a purpose-built test that commits
instead of aborting, like the "001_pgbench_grant@9" test.)

This move also loses the optimization of unpinning before XactLockTableWait().
heap_update() doesn't optimize that way, so that's fine.

The move ended up more like (1), though I did do
s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
felt that worked better than (2) to achieve lock release before
CacheInvalidateHeapTuple().  Alternatives that could be fine:

- In the cancel case, call both systable_inplace_update_cancel and
  systable_inplace_update_end.  _finish or _cancel would own unlock, while
  _end would own systable_endscan().

- Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
  tolerable now, this gets less attractive after the inplace160 patch from
  https://postgr.es/m/flat/20240523000548.58.nmi...@google.com

I made the other changes we discussed, also.

> > > Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
> > > call heap_update() on pg_class or pg_database? That'd make the rule 
> > > simple.
> > 
> > We could.  That would change more code sites.  Rough estimate:
> > 
> > $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
> > 23

> How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> always require a tuple lock on indexes, if that would make a difference.

Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
I'll squash it into inplace120.

> > Another option here would be to preface that README section with a simplified
> > view, something like, "If a warning brought you here, take a tuple lock.  The
> > rest of this section is just for people needing to understand the conditions
> > for --enable-casserts emitting that warning."  How about that instead of
> > simplifying the rules?
> 
> Works for me. Or perhaps the rules could just be explained more succinctly.
> Something like:
> 
> -

I largely used your text instead.


While doing these updates, I found an intra-grant-inplace.spec permutation
being flaky on inplace110 but stable on inplace120.  That turned out not to be
v9-specific.  As of patch v1, I now see it was already flaky (~5% failure
here).  I've now added to inplace110 a minimal tweak to stabilize that spec,
which inplace120 removes.

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 04

Re: race condition in pg_class

2024-08-20 Thread Heikki Linnakangas

On 17/08/2024 07:07, Noah Misch wrote:

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:

On 14/07/2024 20:48, Noah Misch wrote:

+ * ... [any slow preparation not requiring oldtup] ...
+ * heap_inplace_update_scan([...], &tup, &inplace_state);
+ * if (!HeapTupleIsValid(tup))
+ * elog(ERROR, [...]);
+ * ... [buffer is exclusive-locked; mutate "tup"] ...
+ * if (dirty)
+ * heap_inplace_update_finish(inplace_state, tup);
+ * else
+ * heap_inplace_update_cancel(inplace_state);


I wonder if the functions should be called "systable_*" and placed in
genam.c rather than in heapam.c. The interface looks more like the existing
systable functions. It feels like a modularity violation for a function in
heapam.c to take an argument like "indexId", and call back into systable_*
functions.


Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
options follow.  Any preference or other ideas?

 direct s/heap_/systable_/ rename [option 1]

  systable_inplace_update_scan([...], &tup, &inplace_state);
  if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
  ... [buffer is exclusive-locked; mutate "tup"] ...
  if (dirty)
systable_inplace_update_finish(inplace_state, tup);
  else
systable_inplace_update_cancel(inplace_state);

 make the first and last steps more systable-like [option 2]

  systable_inplace_update_begin([...], &tup, &inplace_state);
  if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
  ... [buffer is exclusive-locked; mutate "tup"] ...
  if (dirty)
systable_inplace_update(inplace_state, tup);
  systable_inplace_update_end(inplace_state);

 no systable_ wrapper for middle step, more like CatalogTupleUpdate [option 3]

  systable_inplace_update_begin([...], &tup, &inplace_state);
  if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
  ... [buffer is exclusive-locked; mutate "tup"] ...
  if (dirty)
    heap_inplace_update(relation,
                        systable_inplace_old_tuple(inplace_state),
                        tup,
                        systable_inplace_buffer(inplace_state));
  systable_inplace_update_end(inplace_state);


My order of preference is: 2, 1, 3.


Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
call heap_update() on pg_class or pg_database? That'd make the rule simple.


We could.  That would change more code sites.  Rough estimate:

$ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
23

If the count were 2, I'd say let's simplify the rule like you're exploring.
(I originally had a complicated rule for pg_database, but I abandoned that
when it helped few code sites.)  If it were 100, I'd say the complicated rule
is worth it.  A count of 23 makes both choices fair.


Ok.

How many of those for RELKIND_INDEX vs tables? I'm thinking if we should 
always require a tuple lock on indexes, if that would make a difference.



Long-term, I hope relfrozenxid gets reimplemented with storage outside
pg_class, removing the need for inplace updates.  So the additional 23 code
sites might change back at a future date.  That shouldn't be a big
consideration, though.

Another option here would be to preface that README section with a simplified
view, something like, "If a warning brought you here, take a tuple lock.  The
rest of this section is just for people needing to understand the conditions
for --enable-casserts emitting that warning."  How about that instead of
simplifying the rules?


Works for me. Or perhaps the rules could just be explained more 
succinctly. Something like:


-
pg_class heap_inplace_update_scan() callers: before the call, acquire a 
lock on the relation in mode ShareUpdateExclusiveLock or stricter. If 
the update targets a row of RELKIND_INDEX (but not 
RELKIND_PARTITIONED_INDEX), that lock must be on the table, locking the 
index rel is not necessary.  (This allows VACUUM to overwrite per-index 
pg_class while holding a lock on the table alone.) 
heap_inplace_update_scan() acquires and releases LOCKTAG_TUPLE in 
InplaceUpdateTupleLock, an alias for ExclusiveLock, on each tuple it 
overwrites.


pg_class heap_update() callers: before copying the tuple to modify, take 
a lock on the tuple, or a ShareUpdateExclusiveLock or stricter on the 
relation.


SearchSysCacheLocked1() is one convenient way to acquire the tuple lock. 
Most heap_update() callers already hold a suitable lock on the relation 
for other reasons, and can skip the tuple lock. If you do acquire the 
tuple lock, release it immediately after the update.



pg_database: before copying the tuple to modify, all updaters of 
pg_database rows acquire LOCKTAG_TUPLE.  (Few updaters acquire 
LOCKTAG_OBJECT on the database OID, so it wasn't worth extending that as 
a second option.)

-
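As an illustration of the pg_class heap_update() rule above, a caller lacking a suitable relation lock might follow this sketch; SearchSysCacheLocked1(), InplaceUpdateTupleLock, and the release-immediately-after-update rule come from this thread's patches, while the surrounding details are assumptions rather than committed code:

    /* sketch only: take the tuple lock, update, release it right away */
    Relation    classRel = table_open(RelationRelationId, RowExclusiveLock);
    HeapTuple   oldtup = SearchSysCacheLocked1(RELOID, ObjectIdGetDatum(relid));
    HeapTuple   newtup;

    if (!HeapTupleIsValid(oldtup))
        elog(ERROR, "cache lookup failed for relation %u", relid);
    newtup = heap_copytuple(oldtup);
    /* ... modify newtup's pg_class fields ... */
    CatalogTupleUpdate(classRel, &newtup->t_self, newtup);
    UnlockTuple(classRel, &oldtup->t_self, InplaceUpdateTupleLock);
    ReleaseSysCache(oldtup);
    table_close(classRel, RowExclusiveLock);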

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: race condition in pg_class

2024-08-16 Thread Noah Misch
Thanks for reviewing.

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> On 14/07/2024 20:48, Noah Misch wrote:
> > I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
> > the remaining patches.
> 
> inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in
> order, it looks pretty random as it is. Something like:
> 
> /*
>  * Tuple locks are currently only held for short durations within a
>  * transaction. Check that we didn't forget to release one.
>  */

Will add.

> inplace110-successors-v8.patch: Makes sense.
> 
> The README changes would be better as part of the third patch, as this patch
> doesn't actually do any of the new locking described in the README, and it
> fixes the "inplace update updates wrong tuple" bug even without those tuple
> locks.

That should work.  Will confirm.

> > + * ... [any slow preparation not requiring oldtup] ...
> > + * heap_inplace_update_scan([...], &tup, &inplace_state);
> > + * if (!HeapTupleIsValid(tup))
> > + * elog(ERROR, [...]);
> > + * ... [buffer is exclusive-locked; mutate "tup"] ...
> > + * if (dirty)
> > + * heap_inplace_update_finish(inplace_state, tup);
> > + * else
> > + * heap_inplace_update_cancel(inplace_state);
> 
> I wonder if the functions should be called "systable_*" and placed in
> genam.c rather than in heapam.c. The interface looks more like the existing
> systable functions. It feels like a modularity violation for a function in
> heapam.c to take an argument like "indexId", and call back into systable_*
> functions.

Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
options follow.  Any preference or other ideas?

 direct s/heap_/systable_/ rename

 systable_inplace_update_scan([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
systable_inplace_update_finish(inplace_state, tup);
 else
systable_inplace_update_cancel(inplace_state);

 make the first and last steps more systable-like

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
systable_inplace_update(inplace_state, tup);
 systable_inplace_update_end(inplace_state);

 no systable_ wrapper for middle step, more like CatalogTupleUpdate

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    heap_inplace_update(relation,
                        systable_inplace_old_tuple(inplace_state),
                        tup,
                        systable_inplace_buffer(inplace_state));
 systable_inplace_update_end(inplace_state);

> > /*--
> >  * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> >  *
> >  * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> >  * ["R" is a VACUUM tbl]
> >  * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
> >  * D: systable_getnext() returns pg_class tuple of tbl
> >  * R: memcpy() into pg_class tuple of tbl
> >  * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> >  * [crash]
> >  * [recovery restores datfrozenxid w/o relfrozenxid]
> >  */
> 
> Hmm, that's a tight race, but feels bad to leave it unfixed. One approach
> would be to modify the tuple on the buffer only after WAL-logging it. That
> way, D cannot read the updated value before it has been WAL logged. Just
> need to make sure that the change still gets included in the WAL record.
> Maybe something like:
> 
> if (RelationNeedsWAL(relation))
> {
> /*
>  * Make a temporary copy of the page that includes the change, in
>  * case a full-page image is logged
>  */
> PGAlignedBlock tmppage;
> 
> memcpy(tmppage.data, page, BLCKSZ);
> 
> /* copy the tuple to the temporary copy */
> memcpy(...);
> 
> XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
> XLogInsert();
> }
> 
> /* copy the tuple to the buffer */
> memcpy(...);

Yes, that's the essence of
inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from
https://postgr.es/m/flat/20240620012908.92.nmi...@google.com.

> >   pg_class heap_inplace_update_scan() callers: before the call, acquire
> >   LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
> >   (VACUUM), or a mode with strictly more conflicts.  If the update targets a
> >   row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
> >   on the table.  Locking the index rel is optional.  (This allows VACUUM to
> >   overwrite per-index pg_class while holding a lock on the table alone.)  We
> >   could allow weaker locks, in which case the ne

Re: race condition in pg_class

2024-08-16 Thread Heikki Linnakangas

On 14/07/2024 20:48, Noah Misch wrote:

I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
the remaining patches.


inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would 
be in order, it looks pretty random as it is. Something like:


/*
 * Tuple locks are currently only held for short durations within a
 * transaction. Check that we didn't forget to release one.
 */

inplace110-successors-v8.patch: Makes sense.

The README changes would be better as part of the third patch, as this 
patch doesn't actually do any of the new locking described in the 
README, and it fixes the "inplace update updates wrong tuple" bug even 
without those tuple locks.



+ * ... [any slow preparation not requiring oldtup] ...
+ * heap_inplace_update_scan([...], &tup, &inplace_state);
+ * if (!HeapTupleIsValid(tup))
+ * elog(ERROR, [...]);
+ * ... [buffer is exclusive-locked; mutate "tup"] ...
+ * if (dirty)
+ * heap_inplace_update_finish(inplace_state, tup);
+ * else
+ * heap_inplace_update_cancel(inplace_state);


I wonder if the functions should be called "systable_*" and placed in 
genam.c rather than in heapam.c. The interface looks more like the 
existing systable functions. It feels like a modularity violation for a 
function in heapam.c to take an argument like "indexId", and call back 
into systable_* functions.



/*--
 * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
 *
 * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
 * ["R" is a VACUUM tbl]
 * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
 * D: systable_getnext() returns pg_class tuple of tbl
 * R: memcpy() into pg_class tuple of tbl
 * D: raise pg_database.datfrozenxid, XLogInsert(), finish
 * [crash]
 * [recovery restores datfrozenxid w/o relfrozenxid]
 */


Hmm, that's a tight race, but feels bad to leave it unfixed. One 
approach would be to modify the tuple on the buffer only after 
WAL-logging it. That way, D cannot read the updated value before it has 
been WAL logged. Just need to make sure that the change still gets 
included in the WAL record. Maybe something like:


if (RelationNeedsWAL(relation))
{
/*
 * Make a temporary copy of the page that includes the change, in
 * case a full-page image is logged
 */
PGAlignedBlock tmppage;

memcpy(tmppage.data, page, BLCKSZ);

/* copy the tuple to the temporary copy */
memcpy(...);

XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
XLogInsert();
}

/* copy the tuple to the buffer */
memcpy(...);



  pg_class heap_inplace_update_scan() callers: before the call, acquire
  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
  (VACUUM), or a mode with strictly more conflicts.  If the update targets a
  row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
  on the table.  Locking the index rel is optional.  (This allows VACUUM to
  overwrite per-index pg_class while holding a lock on the table alone.)  We
  could allow weaker locks, in which case the next paragraph would simply call
  for stronger locks for its class of commands.  heap_inplace_update_scan()
  acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
  ExclusiveLock, on each tuple it overwrites.

  pg_class heap_update() callers: before copying the tuple to modify, take a
  lock that conflicts with at least one of those from the preceding paragraph.
  SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
  After heap_update(), release any LOCKTAG_TUPLE.  Most of these callers opt
  to acquire just the LOCKTAG_RELATION.


These rules seem complicated. Phrasing this slightly differently, if I 
understand correctly: for a heap_update() caller, it's always sufficient 
to hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the 
relation that conflicts with those mentioned in the first paragraph, 
then you can skip the LOCKTAG_TUPLE lock.


Could we just stipulate that you must always hold LOCKTAG_TUPLE when you 
call heap_update() on pg_class or pg_database? That'd make the rule simple.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
>> Is it time to worry yet?  If this were HEAD only, I'd not be too
>> concerned; but two of these three are on allegedly-stable branches.
>> And we have releases coming up fast.

> I don't know; neither decision feels terrible to me.

Yeah, same here.  Obviously, it'd be better to spend effort on getting
the bug fix committed than to spend effort on some cosmetic
workaround.

The fact that the failure is in the isolation tests not the core
regression tests reduces my level of concern somewhat about shipping
it this way.  I think that packagers typically run the core tests
not check-world during package verification, so they won't hit this.

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Noah Misch
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test is going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> > separate change to stabilize the test.  One flake in three weeks is low enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.




Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
>> A recent buildfarm test failure [1] showed that the
>> intra-grant-inplace-db.spec test added with 0844b3968 may fail
>> on a slow machine

>> But as the test is going to be modified by the inplace110-successors-v8.patch
>> and the modified test (with all three latest patches applied) passes
>> reliably in the same conditions, maybe this failure doesn't deserve a
>> deeper exploration.

> Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> separate change to stabilize the test.  One flake in three weeks is low enough
> to make that okay.

It's now up to three similar failures in the past ten days: in
addition to

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

I see

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37

Is it time to worry yet?  If this were HEAD only, I'd not be too
concerned; but two of these three are on allegedly-stable branches.
And we have releases coming up fast.

(BTW, I don't think taipan qualifies as a slow machine.)

regards, tom lane




Re: race condition in pg_class

2024-07-20 Thread Noah Misch
On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed.
> 
> A recent buildfarm test failure [1] showed that the
> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> on a slow machine

> But as the test is going to be modified by the inplace110-successors-v8.patch
> and the modified test (with all three latest patches applied) passes
> reliably in the same conditions, maybe this failure doesn't deserve a
> deeper exploration.

Agreed.  Let's just wait for code review of the actual bug fix, not develop a
separate change to stabilize the test.  One flake in three weeks is low enough
to make that okay.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08




Re: race condition in pg_class

2024-07-20 Thread Alexander Lakhin

Hello Noah,

28.06.2024 08:13, Noah Misch wrote:

Pushed.


A recent buildfarm test failure [1] showed that the
intra-grant-inplace-db.spec test added with 0844b3968 may fail
on a slow machine (per my understanding):

test intra-grant-inplace-db   ... FAILED 4302 ms

@@ -21,8 +21,7 @@
 WHERE datname = current_catalog
 AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);

-?column?
---
-datfrozenxid retreated
-(1 row)
+?column?
+
+(0 rows)

whilst the previous (successful) run shows much shorter duration:
test intra-grant-inplace-db   ... ok  540 ms

I reproduced this failure on a VM slowed down so that the test duration
reached 4+ seconds, with 100 copies of "test: intra-grant-inplace-db" in
isolation_schedule:
test intra-grant-inplace-db   ... ok 4324 ms
test intra-grant-inplace-db   ... FAILED 4633 ms
test intra-grant-inplace-db   ... ok 4649 ms
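
(For concreteness, the schedule tweak behind that run was presumably just the
existing line repeated, along these lines:)

# temporary repro tweak: list the test many times (100 copies in the run above)
test: intra-grant-inplace-db
test: intra-grant-inplace-db
test: intra-grant-inplace-db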

But as the test is going to be modified by the inplace110-successors-v8.patch
and the modified test (with all three latest patches applied) passes
reliably in the same conditions, maybe this failure doesn't deserve a
deeper exploration.

What do you think?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

Best regards,
Alexander




Re: race condition in pg_class

2024-07-14 Thread Noah Misch
On Thu, Jul 04, 2024 at 03:08:16PM -0700, Noah Misch wrote:
> On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> > 28.06.2024 08:13, Noah Misch wrote:
> > > Pushed. ...
> > 
> > Please also look at another anomaly I've discovered.
> > 
> > An Assert added with d5f788b41 may be falsified with:
> > CREATE TABLE t(a int PRIMARY KEY);
> > INSERT INTO t VALUES (1);
> > CREATE VIEW v AS SELECT * FROM t;
> > 
> > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
> >   WHEN MATCHED THEN DO NOTHING
> >   WHEN NOT MATCHED THEN DO NOTHING;
> > 
> > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: 
> > "nodeModifyTable.c", Line: 2891, PID: 1590670
> 
> Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
> returns true

I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
the remaining patches.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was 
taken via
 * the relation fast-path (and is not known to have been 
transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME and Alexander Lakhin.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+
+  pg_class heap_inplace_update_scan() callers: before the call, acquire
+  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
+  (VACUUM), or a mode with strictly more conflicts.  If the update targets a
+  row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
+  on the table.  Locking the index rel is optional.  (This allows VACUUM to
+  overwrite per-index pg_class while holding a lock on the table alone.)  We
+  could allow weaker locks, in which case the next paragraph wou

Re: race condition in pg_class

2024-07-04 Thread Noah Misch
On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed. ...
> 
> Please also look at another anomaly I've discovered.
> 
> An Assert added with d5f788b41 may be falsified with:
> CREATE TABLE t(a int PRIMARY KEY);
> INSERT INTO t VALUES (1);
> CREATE VIEW v AS SELECT * FROM t;
> 
> MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
>   WHEN MATCHED THEN DO NOTHING
>   WHEN NOT MATCHED THEN DO NOTHING;
> 
> TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", 
> Line: 2891, PID: 1590670

Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
returns true, so we use the wholerow code regardless of the view's triggers or
auto update capability.  The behavior is fine, so I'm fixing the new assertion
and comments with new patch inplace087-merge-DO-NOTHING-v8.patch.  The closest
relevant tests processed zero rows, so they narrowly avoided witnessing this
assertion.
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: 
https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case 
acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any 
fields
 * in their pg_class entries except for reltuples and 
relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, 
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out 
b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (past_inh_parent);
+INSERT INTO past_inh_child DEFAULT VALUES;
+INSERT INTO past_inh_child DEFAULT VALUES;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | t
+(1 row)
+
+DROP TABLE past_inh_child;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | f
+(1 row)
+
+COMMIT;
+-- Test ANALYZE setting relhassubclass=f for partitioning
+BEGIN;
+CREATE TABLE past_parted (i int) PARTITION BY LIST(i);
+CREATE TABLE past_part PARTITION OF past_parted FOR VALUES IN (1);
+INSERT INTO past_parted VALUES (1),(1);
+ANALYZE past_parted;
+DROP TABLE past_part;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+---+
+ 2 | t
+(1 row)
+
+ANALYZE past_parted;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | f
+(1 row)
+
+COMMIT;
 VACUUM FULL pg_am;
 VACUUM FULL pg_class;
 VACUUM FULL pg_database;
diff --git a/src/test/regress/sql/vacuum.sql b/src/test/regress/sql/vacuum.sql
index 0b63ef8..548cd7a 100644
--- a/src/test/regress/sql/vacuum.s

Re: race condition in pg_class

2024-07-03 Thread Alexander Lakhin

Hello Noah,

28.06.2024 08:13, Noah Misch wrote:

Pushed. ...


Please also look at another anomaly I've discovered.

An Assert added with d5f788b41 may be falsified with:
CREATE TABLE t(a int PRIMARY KEY);
INSERT INTO t VALUES (1);
CREATE VIEW v AS SELECT * FROM t;

MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
  WHEN MATCHED THEN DO NOTHING
  WHEN NOT MATCHED THEN DO NOTHING;

TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", 
Line: 2891, PID: 1590670

Best regards,
Alexander




Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 04:09:54PM -0700, Noah Misch wrote:
> On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> > 29.06.2024 05:42, Noah Misch wrote:
> > > Good point, any effort on (2) would be wasted once the fixes get 
> > > certified.  I
> > > pushed (1).  I'm attaching the rebased fix patches.
> > 
> > Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> > CREATE TABLE t (i int) PARTITION BY LIST(i);
> > CREATE TABLE p1 (i int);
> > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> > ALTER TABLE t DETACH PARTITION p1;
> > ANALYZE t;
> > 
> > triggers unexpected
> > ERROR:  tuple to be updated was already modified by an operation triggered 
> > by the current command
> 
> Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
> without an intervening CommandCounterIncrement().

Correction: it's not okay today.  If code does that, heap_inplace_update()
mutates a tuple that is going to become invisible at CCI.  The lack of CCI
yields a minor live bug in v14+.  Its consequences seem to be limited to
failing to update reltuples for a partitioned table having zero partitions.

> The patch makes the CCI
> required.  The ANALYZE in your example reaches this with a heap_update to set
> relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
> tests in vacuum.sql).

That's still the right fix, but I've separated it into its own patch and
expanded the test.  All the non-comment changes between v5 and v6 are now part
of the separate patch.

> The alternative would be to allow inplace updates on TM_SelfModified tuples.
> I can't think of a specific problem with allowing that, but I feel that would
> make system state interactions harder to reason about.  It might be optimal to
> allow that in back branches only, to reduce the chance of releasing a bug like
> the one you found.

Allowing a mutation of a TM_SelfModified tuple is bad, since that tuple is
going to become dead soon.  Mutating its successor could be okay.  Since we'd
expect such code to be unreachable, I'm not keen to carry such code.  For that
scenario, I'd rather keep the error you encountered.  Other opinions?
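
To make the ordering requirement concrete, the caller-side pattern the patch
enforces looks roughly like the sketch below.  The wrapper function is
hypothetical; SetRelationHasSubclass() and CommandCounterIncrement() are the
real routines, and the inplace step is only indicated by a comment:

#include "postgres.h"

#include "access/xact.h"
#include "commands/tablecmds.h"

/* Hypothetical sketch of the required ordering, not the analyze.c change. */
static void
reset_relhassubclass_then_inplace(Oid relid)
{
	/* MVCC path: heap_update() of the relation's pg_class row */
	SetRelationHasSubclass(relid, false);

	/*
	 * Required before any inplace update of the same row, so the inplace
	 * code finds the successor tuple instead of failing with "tuple to be
	 * updated was already modified by an operation triggered by the
	 * current command".
	 */
	CommandCounterIncrement();

	/* ... inplace update (e.g. via vac_update_relstats()) would follow ... */
}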
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: 
https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case 
acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any 
fields
 * in their pg_class entries except for reltuples and 
relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, 
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out 
b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (past_inh_parent);
+INSERT INTO past_inh_child DEFAULT VALUES;
+INSERT INTO past_inh_child DEFAULT VALUES;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclas

Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> 29.06.2024 05:42, Noah Misch wrote:
> > Good point, any effort on (2) would be wasted once the fixes get certified. 
> >  I
> > pushed (1).  I'm attaching the rebased fix patches.
> 
> Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> CREATE TABLE t (i int) PARTITION BY LIST(i);
> CREATE TABLE p1 (i int);
> ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> ALTER TABLE t DETACH PARTITION p1;
> ANALYZE t;
> 
> triggers unexpected
> ERROR:  tuple to be updated was already modified by an operation triggered by 
> the current command

Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
without an intervening CommandCounterIncrement().  The patch makes the CCI
required.  The ANALYZE in your example reaches this with a heap_update to set
relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
tests in vacuum.sql).

The alternative would be to allow inplace updates on TM_SelfModified tuples.
I can't think of a specific problem with allowing that, but I feel that would
make system state interactions harder to reason about.  It might be optimal to
allow that in back branches only, to reduce the chance of releasing a bug like
the one you found.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was 
taken via
 * the relation fast-path (and is not known to have been 
transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+
+  pg_class heap_inplace_update_scan() callers: before the call, acquire
+  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), Sha

Re: race condition in pg_class

2024-07-02 Thread Alexander Lakhin

Hello Noah,

29.06.2024 05:42, Noah Misch wrote:

Good point, any effort on (2) would be wasted once the fixes get certified.  I
pushed (1).  I'm attaching the rebased fix patches.


Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
CREATE TABLE t (i int) PARTITION BY LIST(i);
CREATE TABLE p1 (i int);
ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
ALTER TABLE t DETACH PARTITION p1;
ANALYZE t;

triggers unexpected
ERROR:  tuple to be updated was already modified by an operation triggered by 
the current command

Best regards,
Alexander




Re: race condition in pg_class

2024-06-28 Thread Noah Misch
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, 
> > almost
> > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> > testing an extant failure to inval a cache entry.  Naturally, inexorable 
> > inval
> > masks the extant bug.  Ideally, I'd just skip the test under any kind of 
> > cache
> > clobber option.  I don't know a pleasant way to do that, so these are
> > known-feasible things I'm considering:
> 
> > 1. Neutralize the test in all branches, probably by having it just not 
> > report
> >the final answer.  Undo in the later fix patch.
> 
> > 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
> >uses heuristics on that to deduce whether caches are getting released.
> >Have a separate expected output for the cache-release scenario.  Perhaps
> >also have the test treat installcheck like cache-release, since
> >installcheck could experience sinval reset with similar consequences.
> >Neutralize the test in v12 & v13.
> 
> > 3. Add a test module with a C function that reports whether any kind of 
> > cache
> >clobber is active.  Call it in this test.  Have a separate expected 
> > output
> >for the cache-release scenario.
> 
> > Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> > more thought over the next day.
> 
> I'd just go for (1).  We were doing fine without this test case.
> I can't see expending effort towards hiding its result rather
> than actually fixing anything.

Good point, any effort on (2) would be wasted once the fixes get certified.  I
pushed (1).  I'm attaching the rebased fix patches.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was 
taken via
 * the relation fast-path (and is not known to have been 
transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_up

Re: race condition in pg_class

2024-06-27 Thread Tom Lane
Noah Misch  writes:
> Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
> surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> testing an extant failure to inval a cache entry.  Naturally, inexorable inval
> masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
> clobber option.  I don't know a pleasant way to do that, so these are
> known-feasible things I'm considering:

> 1. Neutralize the test in all branches, probably by having it just not report
>the final answer.  Undo in the later fix patch.

> 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
>uses heuristics on that to deduce whether caches are getting released.
>Have a separate expected output for the cache-release scenario.  Perhaps
>also have the test treat installcheck like cache-release, since
>installcheck could experience sinval reset with similar consequences.
>Neutralize the test in v12 & v13.

> 3. Add a test module with a C function that reports whether any kind of cache
>clobber is active.  Call it in this test.  Have a separate expected output
>for the cache-release scenario.

> Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> more thought over the next day.

I'd just go for (1).  We were doing fine without this test case.
I can't see expending effort towards hiding its result rather
than actually fixing anything.

regards, tom lane




Re: race condition in pg_class

2024-06-27 Thread Noah Misch
On Fri, Jun 21, 2024 at 02:28:42PM -0700, Noah Misch wrote:
> On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> > I think the attached covers all comments to date.  I gave everything v3, but
> > most patches have just a no-conflict rebase vs. v2.  The exceptions are
> > inplace031-inj-wait-event (implements the holding from that branch of the
> > thread) and inplace050-tests-inj (updated to cooperate with inplace031).  
> > Much
> > of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> > infrastructure common to the two custom wait event types.
> 
> Starting 2024-06-27, I'd like to push
> inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
> them if needed.  Then I'll submit the last three to the commitfest.  Does
> anyone want me to delay that step?

Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
testing an extant failure to inval a cache entry.  Naturally, inexorable inval
masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
clobber option.  I don't know a pleasant way to do that, so these are
known-feasible things I'm considering:

1. Neutralize the test in all branches, probably by having it just not report
   the final answer.  Undo in the later fix patch.

2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
   uses heuristics on that to deduce whether caches are getting released.
   Have a separate expected output for the cache-release scenario.  Perhaps
   also have the test treat installcheck like cache-release, since
   installcheck could experience sinval reset with similar consequences.
   Neutralize the test in v12 & v13.

3. Add a test module with a C function that reports whether any kind of cache
   clobber is active.  Call it in this test.  Have a separate expected output
   for the cache-release scenario.

Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
more thought over the next day.

Thanks,
nm




Re: race condition in pg_class

2024-06-21 Thread Noah Misch
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category
> > 
> > I've replied on that branch of the thread.
> 
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Starting 2024-06-27, I'd like to push
inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
them if needed.  Then I'll submit the last three to the commitfest.  Does
anyone want me to delay that step?

Two more test-related changes compared to v3:

- In inplace010-tests, add to 027_stream_regress.pl a test that catalog
  contents match between primary and standby.  If one of these patches broke
  replay of inplace updates, this would help catch it.

- In inplace031-inj-wait-event, make sysviews.sql indifferent to whether
  InjectionPoint wait events exist.  installcheck needs this if other activity
  created such an event since the last postmaster restart.
Author: Noah Misch 
Commit: Noah Misch 

Improve test coverage for changes to inplace-updated catalogs.

This covers both regular and inplace changes, since bugs arise at their
intersection.  Where marked, these witness extant bugs.  Back-patch to
v12 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl 
b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 59ea538..956e290 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -68,6 +68,34 @@ $node->pgbench(
  "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
pg_temp.e;"
});
 
+# Test inplace updates from VACUUM concurrent with heap_update from GRANT.
+# The PROC_IN_VACUUM environment can't finish MVCC table scans consistently,
+# so this fails rarely.  To reproduce consistently, add a sleep after
+# GetCatalogSnapshot(non-catalog-rel).
+Test::More->builder->todo_start('PROC_IN_VACUUM scan breakage');
+$node->safe_psql('postgres', 'CREATE TABLE ddl_target ()');
+$node->pgbench(
+   '--no-vacuum --client=5 --protocol=prepared --transactions=50',
+   0,
+   [qr{processed: 250/250}],
+   [qr{^$}],
+   'concurrent GRANT/VACUUM',
+   {
+   '001_pgbench_grant@9' => q(
+   DO $$
+   BEGIN
+   PERFORM pg_advisory_xact_lock(42);
+   FOR i IN 1 .. 10 LOOP
+   GRANT SELECT ON ddl_target TO PUBLIC;
+   REVOKE SELECT ON ddl_target FROM PUBLIC;
+   END LOOP;
+   END
+   $$;
+),
+   '001_pgbench_vacuum_ddl_target@1' => "VACUUM ddl_target;",
+   });
+Test::More->builder->todo_end;
+
 # Trigger various connection errors
 $node->pgbench(
'no-such-database',
diff --git a/src/test/isolation/expected/eval-plan-qual.out 
b/src/test/isolation/expected/eval-plan-qual.out
index 0237271..032d420 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -1337,3 +1337,29 @@ a|b|c|   d
 2|2|2|1004
 (2 rows)
 
+
+starting permutation: sys1 sysupd2 c1 c2
+step sys1: 
+   UPDATE pg_class SET reltuples = 123 WHERE oid = 'accounts'::regclass;
+
+step sysupd2: 
+   UPDATE pg_class SET reltuples = reltuples * 2
+   WHERE oid = 'accounts'::regclass;
+ 
+step c1: COMMIT;
+step sysupd2: <... completed>
+step c2: COMMIT;
+
+starting permutation: sys1 sysmerge2 c1 c2
+step sys1: 
+   UPDATE pg_class SET reltuples = 123 WHERE oid = 'accounts'::regclass;
+
+step sysmerge2: 
+   MERGE INTO pg_class
+   USING (SELECT 'accounts'::regclass AS o) j
+   ON o = oid
+   WHEN MATCHED THEN UPDATE SET reltuples = reltuples * 2;
+ 
+step c1: COMMIT;
+step sysmerge2: <... completed>
+step c2: COMMIT;
diff --git a/src/test/isolation/expected/inplace-inval.out 
b/src/test/isolation/expected/inplace-inval.out
new file mode 100644
index 000..67b34ad
--- /dev/null
+++ b/src/test/isolation/expected/inplace-inval.out
@@ -0,0 +1,32 @@
+Parsed test spec with 3 sessions
+
+starting permutation: cachefill3 cir1 cic2 ddl3 read1
+step cachefill3: TABLE newly_indexed;
+c
+-
+(0 row

Re: race condition in pg_class

2024-06-16 Thread Michael Paquier
On Sun, Jun 16, 2024 at 07:07:08AM -0700, Noah Misch wrote:
> It would be odd to detect exactly 0x0B00U and not other invalid inputs,
> like 0x0A01U where only 0x0B01U is valid.  I'm attaching roughly what
> it would take.  Shall I squash this into inplace031?

Agreed that merging both together is cleaner.  Moving the event class
into the key of WaitEventCustomEntryByInfo leads to a more consistent
final result.

> The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
> that are actually corrupting data out there.

Agreed to focus on that first.
--
Michael




Re: race condition in pg_class

2024-06-16 Thread Noah Misch
On Sun, Jun 16, 2024 at 09:28:05AM +0900, Michael Paquier wrote:
> On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> > On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> >> GetWaitEventCustomIdentifier() is incorrect, and should return
> >> "InjectionPoint" in the default case of this class name, no?
> > 
> > I intentionally didn't provide a default event ID for InjectionPoint.
> > PG_WAIT_EXTENSION needs a default case for backward compatibility, if 
> > nothing
> > else.  For this second custom type, it's needless complexity.  The value
> > 0x0B00U won't just show up like PG_WAIT_EXTENSION does.
> > GetLWLockIdentifier() also has no default case.  How do you see it?
> 
> I would add a default for consistency as this is just a few extra
> lines, but if you feel strongly about that, I'm OK as well.  It makes
> a bit easier the detection of incorrect wait event numbers set
> incorrectly in extensions depending on the class wanted.

It would be odd to detect exactly 0x0B00U and not other invalid inputs,
like 0x0A01U where only 0x0B01U is valid.  I'm attaching roughly what
it would take.  Shall I squash this into inplace031?

The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
that are actually corrupting data out there.  I think we should all limit our
interest in the verbiage of strings that appear only when running developer
tests, especially when $SUBJECT is a bug fix.  When the string appears only
after C code passes invalid input to other C code, it matters even less.

> > The patch added to xfunc.sgml an example of using it.  I'd be more inclined 
> > to
> > delete the WaitEventExtensionNew() docbook documentation than to add its 
> > level
> > of detail for WaitEventInjectionPointNew().  We don't have that kind of
> > documentation for most extension-facing C functions.
> 
> It's one of the areas where I think that we should have more
> documentation, not less of it, so I'd rather keep it and maintaining
> it is not really a pain (?).  The backend gets complicated enough
> these days that limiting what developers have to guess on their own is
> a better long-term approach because the Postgres out-of-core ecosystem
> is expanding a lot (aka have also in-core documentation for hooks,
> even if there's been a lot of reluctance historically about having
> them).

[getting deeply off topic -- let's move this to another thread if it needs to
expand] I like reducing the need to guess.  So far in this inplace update
project (this thread plus postgr.es/m/20240615223718.42.nmi...@google.com),
three patches just fix comments.  Even comments carry quite a price, but I
value them.  When we hand-maintain documentation of a C function in both its
header comment and another place, I get skeptical about whether hackers
(including myself) will actually keep them in sync and skeptical of the
incremental value of maintaining the second version.
diff --git a/src/backend/utils/activity/wait_event.c 
b/src/backend/utils/activity/wait_event.c
index 300de90..aaf9f3d 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -47,11 +47,11 @@ uint32 *my_wait_event_info = 
&local_my_wait_event_info;
  * Hash tables for storing custom wait event ids and their names in
  * shared memory.
  *
- * WaitEventCustomHashById is used to find the name from an event id.
- * Any backend can search it to find custom wait events.
+ * WaitEventCustomHashByInfo is used to find the name from wait event
+ * information.  Any backend can search it to find custom wait events.
  *
- * WaitEventCustomHashByName is used to find the ID from a name.
- * It is used to ensure that no duplicated entries are registered.
+ * WaitEventCustomHashByName is used to find the wait event information from a
+ * name.  It is used to ensure that no duplicated entries are registered.
  *
  * For simplicity, we use the same ID counter across types of custom events.
  * We could end that anytime the need arises.
@@ -61,18 +61,18 @@ uint32 *my_wait_event_info = 
&local_my_wait_event_info;
  * unlikely that the number of entries will reach
  * WAIT_EVENT_CUSTOM_HASH_MAX_SIZE.
  */
-static HTAB *WaitEventCustomHashById;  /* find names from IDs */
-static HTAB *WaitEventCustomHashByName; /* find IDs from names */
+static HTAB *WaitEventCustomHashByInfo; /* find names from infos */
+static HTAB *WaitEventCustomHashByName; /* find infos from names */
 
 #define WAIT_EVENT_CUSTOM_HASH_INIT_SIZE   16
 #define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE128
 
 /* hash table entries */
-typedef struct WaitEventCustomEntryById
+typedef struct WaitEventCustomEntryByInfo
 {
-   uint16  event_id;   /* hash key */
+   uint32  wait_event_info;/* hash key */
charwait_event_name[NAMEDATALEN];   /* custom wait event 
name */
-} WaitEventCustomEntryById;
+} WaitEventCustomEntryByInfo;
 
 typedef struct

Re: race condition in pg_class

2024-06-15 Thread Michael Paquier
On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
>> GetWaitEventCustomIdentifier() is incorrect, and should return
>> "InjectionPoint" in the default case of this class name, no?
> 
> I intentionally didn't provide a default event ID for InjectionPoint.
> PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
> else.  For this second custom type, it's needless complexity.  The value
> 0x0B00U won't just show up like PG_WAIT_EXTENSION does.
> GetLWLockIdentifier() also has no default case.  How do you see it?

I would add a default for consistency as this is just a few extra
lines, but if you feel strongly about that, I'm OK as well.  It makes
a bit easier the detection of incorrect wait event numbers set
incorrectly in extensions depending on the class wanted.

> The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
> delete the WaitEventExtensionNew() docbook documentation than to add its level
> of detail for WaitEventInjectionPointNew().  We don't have that kind of
> documentation for most extension-facing C functions.

It's one of the areas where I think that we should have more
documentation, not less of it, so I'd rather keep it and maintaining
it is not really a pain (?).  The backend gets complicated enough
these days that limiting what developers have to guess on their own is
a better long-term approach because the Postgres out-of-core ecosystem
is expanding a lot (aka have also in-core documentation for hooks,
even if there's been a lot of reluctance historically about having
them).
--
Michael




Re: race condition in pg_class

2024-06-13 Thread Noah Misch
On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> Looking at inplace031-inj-wait-event..
> 
> The comment at the top of GetWaitEventCustomNames() requires an
> update, still mentioning extensions.

Thanks.  Fixed locally.

> GetWaitEventCustomIdentifier() is incorrect, and should return
> "InjectionPoint" in the default case of this class name, no?

I intentionally didn't provide a default event ID for InjectionPoint.
PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
else.  For this second custom type, it's needless complexity.  The value
0x0B00U won't just show up like PG_WAIT_EXTENSION does.
GetLWLockIdentifier() also has no default case.  How do you see it?

> I would
> just pass the classID to GetWaitEventCustomIdentifier().

As you say, that would allow eventId==0 to raise "could not find custom wait
event" for PG_WAIT_INJECTIONPOINT instead of wrongly returning "Extension".
Even if 0x0B00U somehow does show up, having pg_stat_activity report
"Extension" instead of an error, in a developer test run, feels unimportant to
me.

> It is suboptimal to have pg_get_wait_events() do two scans of
> WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
> returning a set of (class_name,event_name) fed to the tuplestore of
> this SRF?

Micro-optimization of pg_get_wait_events() doesn't matter.  I did consider
that or pushing more of the responsibility into wait_events.c, but I
considered it on code repetition grounds, not performance grounds.

>  uint32
>  WaitEventExtensionNew(const char *wait_event_name)
>  {
> + return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
> +}
> +
> +uint32
> +WaitEventInjectionPointNew(const char *wait_event_name)
> +{
> + return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
> +}
> 
> Hmm.  The advantage of two routines is that it is possible to control
> the class IDs allowed to use the custom wait events.  Shouldn't the
> second routine be documented in xfunc.sgml?

The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
delete the WaitEventExtensionNew() docbook documentation than to add its level
of detail for WaitEventInjectionPointNew().  We don't have that kind of
documentation for most extension-facing C functions.
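
For what it's worth, extension-side usage of the new routine mirrors the
existing WaitEventExtensionNew() pattern; a sketch, assuming the declaration
ends up in utils/wait_event.h and with the wait-point body left as a comment:

#include "postgres.h"

#include "utils/wait_event.h"

/* cache the assigned ID once per backend; 0 means "not assigned yet" */
static uint32 my_injection_wait_event = 0;

/* Hypothetical extension code showing where the custom wait event is used. */
static void
wait_at_my_injection_point(void)
{
	if (my_injection_wait_event == 0)
		my_injection_wait_event =
			WaitEventInjectionPointNew("my-injection-point");

	pgstat_report_wait_start(my_injection_wait_event);
	/* ... block here until the injection point is woken ... */
	pgstat_report_wait_end();
}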




Re: race condition in pg_class

2024-06-13 Thread Michael Paquier
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Looking at inplace031-inj-wait-event..

The comment at the top of GetWaitEventCustomNames() requires an
update, still mentioning extensions.

GetWaitEventCustomIdentifier() is incorrect, and should return
"InjectionPoint" in the default case of this class name, no?  I would
just pass the classID to GetWaitEventCustomIdentifier().

It is suboptimal to have pg_get_wait_events() do two scans of
WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
returning a set of (class_name,event_name) fed to the tuplestore of
this SRF?

 uint32
 WaitEventExtensionNew(const char *wait_event_name)
 {
+   return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
+}
+
+uint32
+WaitEventInjectionPointNew(const char *wait_event_name)
+{
+   return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
+}

Hmm.  The advantage of two routines is that it is possible to control
the class IDs allowed to use the custom wait events.  Shouldn't the
second routine be documented in xfunc.sgml?

wait_event_names.txt also needs tweaks, in the shape of a new class
name for the new class "InjectionPoint" so that it can be documented for
its default case.  That's a fallback if an event ID cannot be found,
which should not happen; still, that's more correct than showing
"Extension" for all class IDs covered by custom wait events.
--
Michael




Re: race condition in pg_class

2024-06-13 Thread Noah Misch
On Wed, Jun 12, 2024 at 10:02:00PM +0200, Michail Nikolaev wrote:
> > Can you say more about the connection you see between $SUBJECT and that?
> That
> > looks like a valid report of an important bug, but I'm not following the
> > potential relationship to $SUBJECT.
> 
> I was guided by the following logic:
> * A pg_class race condition can cause table indexes to look stale.
> * REINDEX updates indexes
> * errors can be explained by different backends using different arbiter
> indexes

Got it.  The race condition of $SUBJECT involves inplace updates, and the
wrong content becomes permanent.  Hence, I suspect they're unrelated.




Re: race condition in pg_class

2024-06-12 Thread Michael Paquier
On Wed, Jun 12, 2024 at 12:32:23PM -0700, Noah Misch wrote:
> On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
>> Personally, I think the fact that injection point wait events were put
>> under Extension is a design mistake that should be corrected before 17
>> is out of beta.

Well, isolation tests and the way to wait for specific points in them
is something I've thought about when working on the initial injpoint
infrastructure, but all my ideas went down to the fact that this is
not specific to injection points: I've also wanted to be able to cause
an isolation to wait for a specific event (class,name).  A hardcoded
sleep is an example.  Even if I discourage anything like that in the
in-core tests because they're slow on fast machines and can be
unreliable on slow machines, it is a fact that they are used by
out-of-core code and that extension developers find them acceptable.

> Works for me.  I don't personally have a problem with the use of Extension,
> since it is a src/test/modules extension creating them.

That's the original reason why Extension has been used in this case,
because the points are assigned in an extension.

> Yes, the last line does refer to that.  Updated table:
> 
> STRATEGY| Paquier | Misch | Haas
> 
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming | yes | yes   | no
> isolation spec says event names | yes | no| yes
> 
> I find that's adequate support for the first line.  If there are no objections
> in the next 24hr, I will implement that.

OK.  That sounds like a consensus to me, useful enough for the cases
at hand.
--
Michael




Re: race condition in pg_class

2024-06-12 Thread Michail Nikolaev
Hello!

> Can you say more about the connection you see between $SUBJECT and that?
That
> looks like a valid report of an important bug, but I'm not following the
> potential relationship to $SUBJECT.

I was guided by the following logic:
* A pg_class race condition can cause table indexes to look stale.
* REINDEX updates indexes
* errors can be explained by different backends using different arbiter
indexes

> On your other thread, it would be useful to see stack traces from the
high-CPU
> processes once the live lock has ended all query completion.
I'll do.

Best regards,
Mikhail.


Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Wed, Jun 12, 2024 at 03:02:43PM +0200, Michail Nikolaev wrote:
> I am not sure, but I think that issue may be related to the issue described
> in
> https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com
> 
> It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE
> in some strange way.

Can you say more about the connection you see between $SUBJECT and that?  That
looks like a valid report of an important bug, but I'm not following the
potential relationship to $SUBJECT.

On your other thread, it would be useful to see stack traces from the high-CPU
processes once the live lock has ended all query completion.




Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
> On Wed, Jun 12, 2024 at 1:54 PM Noah Misch  wrote:
> > If I were making a list of changes always welcome post-beta, it wouldn't
> > include adding wait event types.  But I don't hesitate to add one if it
> > unblocks a necessary test for a bug present in all versions.
> 
> However, injection points themselves are not present in all versions,
> so even if we invent a new wait-event type, we'll have difficulty
> testing older versions, unless we're planning to back-patch all of
> that infrastructure, which I assume we aren't.

Right.  We could put the injection point tests in v18 only instead of v17+v18.
I feel that would be an overreaction to a dispute about names that show up
only in tests.  Even so, I could accept that.

> Personally, I think the fact that injection point wait events were put
> under Extension is a design mistake that should be corrected before 17
> is out of beta.

Works for me.  I don't personally have a problem with the use of Extension,
since it is a src/test/modules extension creating them.

> > Here's what I'm reading for each person's willingness to tolerate each 
> > option:
> >
> > STRATEGY| Paquier | Misch | Haas
> > 
> > new "Injection Point" wait type | maybe   | yes   | yes
> > INJECTION_POINT(...) naming | yes | yes   | unknown
> > isolation spec says event names | yes | no| unknown
> >
> > Corrections and additional strategy lines welcome.  Robert, how do you judge
> > the lines where I've listed you as "unknown"?
> 
> I'd tolerate INJECTION_POINT() if we had no other option but I think
> it's clearly inferior. Does the last line refer to putting the
> specific wait event names in the isolation spec file? If so, I'd also
> be fine with that.

Yes, the last line does refer to that.  Updated table:

STRATEGY| Paquier | Misch | Haas

new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming | yes | yes   | no
isolation spec says event names | yes | no| yes

I find that's adequate support for the first line.  If there are no objections
in the next 24hr, I will implement that.




Re: race condition in pg_class

2024-06-12 Thread Robert Haas
On Wed, Jun 12, 2024 at 1:54 PM Noah Misch  wrote:
> If I were making a list of changes always welcome post-beta, it wouldn't
> include adding wait event types.  But I don't hesitate to add one if it
> unblocks a necessary test for a bug present in all versions.

However, injection points themselves are not present in all versions,
so even if we invent a new wait-event type, we'll have difficulty
testing older versions, unless we're planning to back-patch all of
that infrastructure, which I assume we aren't.

Personally, I think the fact that injection point wait events were put
under Extension is a design mistake that should be corrected before 17
is out of beta.

> Here's what I'm reading for each person's willingness to tolerate each option:
>
> STRATEGY| Paquier | Misch | Haas
> 
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming | yes | yes   | unknown
> isolation spec says event names | yes | no| unknown
>
> Corrections and additional strategy lines welcome.  Robert, how do you judge
> the lines where I've listed you as "unknown"?

I'd tolerate INJECTION_POINT() if we had no other option but I think
it's clearly inferior. Does the last line refer to putting the
specific wait event names in the isolation spec file? If so, I'd also
be fine with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Tue, Jun 11, 2024 at 01:37:21PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> > On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> >> I think the core code should provide an "Injection Point" wait event
> >> type and let extensions add specific wait events there, just like you
> >> did for "Extension".
> > 
> > Michael, could you accept the core code offering that, or not?  If so, I am
> > content to implement that.  If not, for injection point wait events, I have
> > just one priority.  The isolation tester already detects lmgr locks without
> > the test writer teaching it about each lock individually.  I want it to have
> > that same capability for injection points. Do you think we can find 
> > something
> > everyone can accept, having that property?  These wait events show up in 
> > tests
> > only, and I'm happy to make the cosmetics be anything compatible with that
> > detection ability.
> 
> Adding a wait event class for injection point is an interesting
> suggestion that would simplify the detection in the isolation function
> quite a bit.  Are you sure that this is something that would be fit
> for v17 material?  TBH, I am not sure.

If I were making a list of changes always welcome post-beta, it wouldn't
include adding wait event types.  But I don't hesitate to add one if it
unblocks a necessary test for a bug present in all versions.

> At the end, the test coverage has the highest priority and the bugs
> you are addressing are complex enough that isolation tests of this
> level are a necessity, so I don't object to what
> inplace050-tests-inj-v2.patch introduces with the naming dependency
> for the time being on HEAD.  I'll just adapt and live with that
> depending on what I deal with, while trying to improve HEAD later on.

Here's what I'm reading for each person's willingness to tolerate each option:

STRATEGY| Paquier | Misch | Haas

new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming | yes | yes   | unknown
isolation spec says event names | yes | no| unknown

Corrections and additional strategy lines welcome.  Robert, how do you judge
the lines where I've listed you as "unknown"?

> I'm still wondering if there is something that could be more elegant
> than a dedicated class for injection points, but I cannot think about
> something that would be better for isolation tests on top of my head.
> If there is something I can think of, I'll just go and implement it :)

I once considered changing them to use advisory lock waits instead of
ConditionVariableSleep(), but I recall that was worse from the perspective of
injection points in critical sections.




Re: race condition in pg_class

2024-06-12 Thread Michail Nikolaev
Hello, everyone.

I am not sure, but I think that issue may be related to the issue described
in
https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com

It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE
in some strange way.

Best regards,
Mikhail.


Re: race condition in pg_class

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
>> I think the core code should provide an "Injection Point" wait event
>> type and let extensions add specific wait events there, just like you
>> did for "Extension".
> 
> Michael, could you accept the core code offering that, or not?  If so, I am
> content to implement that.  If not, for injection point wait events, I have
> just one priority.  The isolation tester already detects lmgr locks without
> the test writer teaching it about each lock individually.  I want it to have
> that same capability for injection points. Do you think we can find something
> everyone can accept, having that property?  These wait events show up in tests
> only, and I'm happy to make the cosmetics be anything compatible with that
> detection ability.

Adding a wait event class for injection point is an interesting
suggestion that would simplify the detection in the isolation function
quite a bit.  Are you sure that this is something that would be fit
for v17 material?  TBH, I am not sure.

At the end, the test coverage has the highest priority and the bugs
you are addressing are complex enough that isolation tests of this
level are a necessity, so I don't object to what
inplace050-tests-inj-v2.patch introduces with the naming dependency
for the time being on HEAD.  I'll just adapt and live with that
depending on what I deal with, while trying to improve HEAD later on.

I'm still wondering if there is something that could be more elegant
than a dedicated class for injection points, but I cannot think about
something that would be better for isolation tests on top of my head.
If there is something I can think of, I'll just go and implement it :)
--
Michael




Re: race condition in pg_class

2024-06-10 Thread Noah Misch
On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category - which they are not - instead of being a new
> > > wait_event_type. That would have avoided the ugly wait-event naming
> > > pattern, inconsistent with everything else, introduced by
> > > inplace050-tests-inj-v1.patch.
> >
> > Not sure to agree with that.  The set of core backend APIs supporting
> > injection points have nothing to do with wait events.  The library
> > attached to one or more injection points *may* decide to use a wait
> > event like what the wait/wakeup calls in modules/injection_points do,
> > but that's entirely optional.  These rely on custom wait events,
> > plugged into the Extension category as the code run is itself in an
> > extension.  I am not arguing against the point that it may be
> > interesting to plug in custom wait event categories, but the current
> > design of wait events makes that much harder than what core is
> > currently able to handle, and I am not sure that this brings much at
> > the end as long as the wait event strings can be customized.
> >
> > I've voiced upthread concerns over the naming enforced by the patch
> > and the way it plugs the namings into the isolation functions, by the
> > way.
> 
> I think the core code should provide an "Injection Point" wait event
> type and let extensions add specific wait events there, just like you
> did for "Extension".

Michael, could you accept the core code offering that, or not?  If so, I am
content to implement that.  If not, for injection point wait events, I have
just one priority.  The isolation tester already detects lmgr locks without
the test writer teaching it about each lock individually.  I want it to have
that same capability for injection points.  Do you think we can find something
everyone can accept, having that property?  These wait events show up in tests
only, and I'm happy to make the cosmetics be anything compatible with that
detection ability.




Re: race condition in pg_class

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category - which they are not - instead of being a new
> > wait_event_type. That would have avoided the ugly wait-event naming
> > pattern, inconsistent with everything else, introduced by
> > inplace050-tests-inj-v1.patch.
>
> Not sure to agree with that.  The set of core backend APIs supporting
> injection points have nothing to do with wait events.  The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional.  These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension.  I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event
type and let extensions add specific wait events there, just like you
did for "Extension". Then this ugly naming would go away. As I see it,
"Extension" is only supposed to be used as a catch-all when we have no
other information, but here we do. If we refuse to use the
wait_event_type field to categorize waits, then people are going to
have to find some other way to get that data into the system, as Noah
has done.
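
For illustration only: this is roughly how the distinction would look to
someone querying pg_stat_activity.  The "InjectionPoint" type in the second
query is hypothetical at this point; today such waits report as "Extension".

-- today: injection point waits hide under the Extension catch-all, so the
-- event name itself has to carry all of the identifying information
SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Extension';

-- with a dedicated type (hypothetical), the type column alone would say
-- "this backend is waiting inside an injection point"
SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'InjectionPoint';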

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-06-06 Thread Michael Paquier
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category - which they are not - instead of being a new
> wait_event_type. That would have avoided the ugly wait-event naming
> pattern, inconsistent with everything else, introduced by
> inplace050-tests-inj-v1.patch.

Not sure to agree with that.  The set of core backend APIs supporting
injection points have nothing to do with wait events.  The library
attached to one or more injection points *may* decide to use a wait
event like what the wait/wakeup calls in modules/injection_points do,
but that's entirely optional.  These rely on custom wait events,
plugged into the Extension category as the code run is itself in an
extension.  I am not arguing against the point that it may be
interesting to plug in custom wait event categories, but the current
design of wait events makes that much harder than what core is
currently able to handle, and I am not sure that this brings much at
the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch
and the way it plugs the namings into the isolation functions, by the
way.
--
Michael




Re: race condition in pg_class

2024-06-06 Thread Robert Haas
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch  wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
>   Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
>   Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
>   If it fails to fix the bug it targets, I expect it's a no-op rather than
>   breaking things.
>
> I'll leave the last three of the ten needing review.  Those three are beyond
> my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that
the injection point wait events have been shoehorned into the
"Extension" category - which they are not - instead of being a new
wait_event_type. That would have avoided the ugly wait-event naming
pattern, inconsistent with everything else, introduced by
inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could,
in some places, use improvement. For instance,
inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of
comments, which makes it hard to see what actually changed, and the
commit message doesn't tell you, either. A good bit of it seems to be
changing "a view" to "a view INSTEAD OF trigger" or "a view having an
INSTEAD OF trigger," but the reasoning behind that change is not
spelled out anywhere. The reader is left to guess what the other case
is and why the same principles don't apply to it. I don't doubt that
the new comments are more correct than the old ones, but I expect
future patch authors to have difficulty maintaining that state of
affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments.
IMHO, the commit message also isn't very informative. It disclaims
knowledge of what bug it's fixing, while at the same time leaving the
reader to figure out for themselves how the behavior has changed.
Consequently, I expect writing the release notes for a release
including this patch to be difficult: "We added some locks that block
... something ... in some circumstances ... to prevent ... something."
It's not really the job of the release note author to fill in those
blanks, but rather of the patch author or committer. I don't want to
overburden the act of fixing bugs, but I just feel like more
explanation is needed here. When I see for example that we're adding a
lock acquisition to the end of heap_create(), I can't help but wonder
if it's really true that we don't take a lock on a just-created
relation today. I'm certainly under the impression that we lock
newly-created, uncommitted relations, and a quick test seems to
confirm that. I don't quite know whether that happens, but evidently
this call is guarding against something more subtle than a categorical
failure to lock a relation on creation so I think there should be a
comment explaining what that thing is.

It's also quite surprising that SetRelationHasSubclass() says "take X
lock before calling" and 2 of 4 callers just don't. I guess that's how
it is. But shouldn't we then have an assertion inside that function to
guard against future mistakes? If the reason why we failed to add this
initially is discernible from the commit messages that introduced the
bug, it would be nice to mention what it seems to have been; if not,
it would at least be nice to mention the offending commit(s). I'm also
a bit worried that this is going to cause deadlocks, but I suppose if
it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation
instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another
place where spelling out the rationale in more detail would be helpful
to future readers; for instance, the commit message says that
PgDatabaseToastTable is the only one affected, but it doesn't say why
the others are not, or why this one is. The lengthy comment in
CatalogCacheCreateEntry is also difficult to correlate with the code
which follows. I can't guess whether the two cases called out in the
comment always needed to be handled and were handled save only for
in-place updates, and thus the comment changes were simply taking the
opportunity to elaborate on the existing comments; or whether one of
those cases is preexisting and the other arises from the desire to
handle inplace updates. It could be helpful to mention relevant
identifiers from the code in the comment text e.g.
"systable_recheck_tuple detects ordinary updates by noting changes to
the tuple's visibility information, while the equalTuple() case
detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place
update altogether. That sounds difficult and not back-patchable, but I
can't classify what this patch set does as anything better

Re: race condition in pg_class

2024-06-05 Thread Noah Misch
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than
  breaking things.

I'll leave the last three of the ten needing review.  Those three are beyond
my skill to self-certify.




Re: race condition in pg_class

2024-05-13 Thread Noah Misch
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch  wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
> >   xmin does not stop pruning, an MVCC scan in that backend can find zero
> >   tuples when one is live.  This is like what all backends got in the days of
> >   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
> >   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
> >   setting that flag later and unsetting it earlier.)
> 
> Are you saying that this is a problem already, or that the patch
> causes it to start happening? If it's the former, that's horrible.

The former.




Re: race condition in pg_class

2024-05-13 Thread Robert Haas
On Sun, May 12, 2024 at 7:29 PM Noah Misch  wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
>   xmin does not stop pruning, an MVCC scan in that backend can find zero
>   tuples when one is live.  This is like what all backends got in the days of
>   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
>   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
>   setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch
causes it to start happening? If it's the former, that's horrible. If
it's the latter, I'd say that is a fatal defect.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-05-13 Thread Noah Misch
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
> 
> + /* Check if blocked_pid is in injection_wait(). */
> + proc = BackendPidGetProc(blocked_pid);
> + if (proc == NULL)
> + PG_RETURN_BOOL(false);  /* session gone: definitely unblocked */
> + wait_event =
> +     pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> + if (wait_event && strncmp("INJECTION_POINT(",
> +                           wait_event,
> +                           strlen("INJECTION_POINT(")) == 0)
> +     PG_RETURN_BOOL(true);
> 
> Hmm.  I am not sure that this is the right interface for the job
> because this is not only related to injection points but to the
> monitoring of one or more wait events when running a permutation
> step.

Could you say more about that?  Permutation steps don't monitor wait events
today.  This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec
> files with some property area listing the wait events we're expected
> to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all.  The isolation
tester's needs are limited to distinguishing:

  (a) process is waiting on another test session
  (b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave
injection_wait(), so all injection point wait events fall in (a).  (The tester
ignores (b), since those clear up without intervention.  Failing to ignore
them, as the tester did long ago, made output unstable.)
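
For context, the check behind (a) is exposed as a SQL-callable function that
isolationtester already uses, and it appears to be what the quoted hunk
extends.  A hedged example, with placeholder pids:

-- true if backend 12345 is blocked behind any of the listed backends;
-- isolationtester passes the pids of the other test sessions here
SELECT pg_catalog.pg_isolation_test_session_is_blocked(12345, '{23456, 34567}');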




Re: race condition in pg_class

2024-05-13 Thread Michael Paquier
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.  It turns out we
> don't just lose inplace updates.  We also overwrite unrelated tuples,
> reproduced at inplace.spec.  Good starting points are README.tuplock and the
> heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+   /* Check if blocked_pid is in injection_wait(). */
+   proc = BackendPidGetProc(blocked_pid);
+   if (proc == NULL)
+   PG_RETURN_BOOL(false);  /* session gone: definitely unblocked */
+   wait_event =
+       pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+   if (wait_event && strncmp("INJECTION_POINT(",
+                             wait_event,
+                             strlen("INJECTION_POINT(")) == 0)
+       PG_RETURN_BOOL(true);

Hmm.  I am not sure that this is the right interface for the job
because this is not only related to injection points but to the
monitoring of one or more wait events when running a permutation
step.  Perhaps this is something that should be linked to the spec
files with some property area listing the wait events we're expected
to wait on instead when running a step that we know will wait?
--
Michael




Re: race condition in pg_class

2023-11-01 Thread Noah Misch
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch  writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> > 
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
> 
> That could be okay.  It would be weird to reuse a short-term lock like that
> one as something held till end of transaction.  But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater.  This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
> 
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest.  heap_inplace_update() would assert that the backend holds one of
> the acceptable locks.  It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update().  After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
> 
> A pgxn search finds "citus" using heap_inplace_update().
> 
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
> 
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions.  That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize.  (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.)  We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way?  I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
> 
> Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
> could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
> heap_update() could complain if the field changed vs. last seen value.  This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format.  I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state.  A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases.  For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber".  The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

  == session 1
  drop table t;
  create table t (c int);
  begin;
  -- in gdb, set breakpoint on heap_modify_tuple
  grant select on t to public;

  == session 2
  alter table t add primary key (c);
  begin; alter table t rename to t2; rollback;

  == back in session 1
  -- release breakpoint
  -- want error (would get it w/o the begin;alter;rollback)
  commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update().  The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update().  Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2
in t_infomask2, and 0 in xmax.  I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this.  Even if I did, 2^(3+

Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> > problematic orders of events.  The easy one is heap_inplace_update() mutating
> > a tuple that already has an xmax.  That's the one in the test case upthread,
> > and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.




Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not.  I think this is what your B3 option is, but maybe
>> I misinterpreted.  It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock
infrastructure, rather than inventing new locktags (a choice that
spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making
heap_inplace_update protect itself, without high-level cooperation
from the possibly-interfering updater.  This is basically because
I'm still afraid that we're defining the problem too narrowly.
For one thing, I have nearly zero confidence that GRANT et al are
the only problematic source of conflicting transactional updates.
For another, I'm worried that some extension may be using
heap_inplace_update against a catalog we're not considering here.
I'd also like to find a solution that fixes the case of a conflicting
manual UPDATE (although certainly that's a stretch goal we may not be
able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple
header as just-updated in a way that regular heap_update could
recognize.  (For standard catalog updates, we'd then end up erroring
in simple_heap_update, which I think is fine.)  We can't update xmin,
because the VACUUM callers don't have an XID; but maybe there's some
other way?  I'm speculating about putting a funny value into xmax,
or something like that, and having heap_update check that what it
sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like
a mighty big lift.  All of the existing callers, except maybe
the johnny-come-lately dropdb usage, have solid documented reasons
to do it that way.

regards, tom lane




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a
> >> lock in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.
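
As an illustration (untested sketch, not a confirmed reproducer): the
datfrozenxid analog of the pg_class recipe upthread would look something like
the following, and it only bites when vac_update_datfrozenxid() actually has a
newer value to write at that moment:

== session 1
begin;
grant connect on database postgres to public;  -- transactional pg_database update

== session 2 (connected to the same database)
vacuum;  -- may inplace-update datfrozenxid/datminmxid in pg_database

== back in session 1
-- on commit, GRANT's new tuple version carries the datfrozenxid it fetched
-- before the in-place update, effectively rolling the advancement back
commit;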

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.
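
A rough sketch of how such a test case might manufacture a pg_database row
with a toast pointer (untested; the loop count is a guess, whether the value
actually goes out of line depends on compression, and this assumes a branch
where pg_database has a TOAST table):

-- inflate pg_database.datacl until it no longer fits inline
DO $$
BEGIN
  FOR i IN 1..1000 LOOP
    EXECUTE format('CREATE ROLE inplace_toast_role_%s', i);
    EXECUTE format('GRANT CONNECT ON DATABASE postgres TO inplace_toast_role_%s', i);
  END LOOP;
END $$;

-- then exercise the pg_database in-place updaters against that row, e.g. a
-- database-wide VACUUM (datfrozenxid/datminmxid) or DROP DATABASE (datconnlimit)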

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-efforts thing to catch conflict with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.




Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a
>> lock in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient
> relcache builds.  Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it
seems to me that any call of heap_inplace_update brings some
risk of this kind.  Excluding the bootstrap-mode-only usage in
create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's
  relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's
  relpages, reltuples, relallvisible, relhasindex, relhasrules,
  relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's
  datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases
as GRANTs on relations.  Also, it looks like we can lose
knowledge of the presence of rules and triggers, which seems
nearly as bad as forgetting about indexes.  The rest of these
updates might not be correctness-critical, although I wonder
how bollixed things could get if we forget an advancement of
relfrozenxid or datfrozenxid (especially if the calling
transaction goes on to make other changes that assume that
the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that
it cannot use the syscache copy of a tuple as the basis for in-place
update, because syscache will have detoasted any toastable fields.
These other callers are ignoring that, which seems like it should
result in heap_inplace_update failing with "wrong tuple length".
I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to
make sure that it's operating on the latest version of the tuple
(including, I guess, waiting for an uncommitted update?) and throw
error if not.  I think this is what your B3 option is, but maybe
I misinterpreted.  It might be better to throw error immediately
instead of waiting to see if the other updater commits.

regards, tom lane




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating index on relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)
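
For anyone checking a suspect table by hand, the SQL-level equivalent via the
amcheck extension is roughly this (index name taken from the report; adjust as
needed):

CREATE EXTENSION IF NOT EXISTS amcheck;
-- with heapallindexed => true, bt_index_check() re-scans the heap and reports
-- heap tuples that have no entry in the index
SELECT bt_index_check('example_pkey'::regclass, true);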

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
transactional pg_class updates without holding some stronger lock.  New
asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
Anything performing a transactional update of a pg_class row would acquire
the lock in exclusive mode before fetching the old tuple and hold it till
end of transaction.  relhasindex=true in-place updates would acquire it
the same way, but they would release it after the inplace update.  I
expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
the same key as LOCKTAG_RELATION.  This has less blocking than the
previous option, but it's more complicated to explain to both users and
developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
tuple versions.  Like the previous option, this would require new locking,
but the new lock would not need to persist till end of xact.  It would be
even more complicated to explain to users and developers.  (If this is
promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
commands on the same table would block each other.  If we did it the way
most DDL does today, they'd get "tuple concurrently updated" failures
after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
transactional updates of pg_class set relhasindex=true pessimistically.
(VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
nontransactional columns would help.  I'm ruling that out as too hard to
back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?
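
(For anyone who wants to gauge that prevalence on their own systems, here is a
rough count of index-free user tables, purely illustrative and independent of
the possibly-stale relhasindex flag:)

SELECT count(*) AS tables_without_indexes
FROM pg_class c
WHERE c.relkind = 'r'
  AND c.relnamespace NOT IN ('pg_catalog'::regnamespace,
                             'information_schema'::regnamespace)
  AND NOT EXISTS (SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid);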

Thanks,
nm




Re: race condition in pg_class

2023-10-27 Thread Smolkin Grigory
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a
> lock in GRANT.

Oh, that explains it. Thank you very much.

> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.

Sure, but it will take some time; it's a large db with lots of WAL segments
to apply.

> extensions

      extname       | extversion
--------------------+------------
 plpgsql            | 1.0
 pg_stat_statements | 1.8
 pg_buffercache     | 1.3
 pgstattuple        | 1.5


Re: race condition in pg_class

2023-10-26 Thread Noah Misch
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.

> Looking into the WAL via waldump given us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations,
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.
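
(For the two "save all rows" steps, something along these lines from psql
would do; the file names are placeholders:)

\copy (SELECT * FROM pg_class ORDER BY oid) TO 'pg_class_before.tsv'
-- ...after the second PITR...
\copy (SELECT * FROM pg_class ORDER BY oid) TO 'pg_class_after.tsv'
-- then, outside psql:  diff pg_class_before.tsv pg_class_after.tsv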

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.

On the off-chance it's relevant, what extensions do you have (\dx in psql)?




Re: race condition in pg_class

2023-10-26 Thread Smolkin Grigory
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?

No chance. Our infrastructure doesn't do that, and users just don't have the
privileges to mess with pg_catalog.

Wed, Oct 25, 2023 at 21:06, Tom Lane :

> Smolkin Grigory  writes:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating index on relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.
> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
>
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?
>
> regards, tom lane
>


Re: race condition in pg_class

2023-10-25 Thread Tom Lane
Smolkin Grigory  writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?

regards, tom lane




Re: race condition in pg_class

2023-10-25 Thread Andrey M. Borodin



> On 25 Oct 2023, at 13:39, Smolkin Grigory  wrote:
> 
> We are running PG13.10 and recently we have encountered what appears to be a 
> bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and 
> some other catalog-writer, possibly ANALYZE.

> I've tried to reproduce this scenario with CREATE INDEX and various 
> concurrent statements, but no luck.

Maybe it would be possible to reproduce this by modifying the tests for
concurrent index creation, for example by adding "ANALYZE" here [0].
Keep in mind that for easier reproduction it would make sense to increase
the transaction count substantially.


Best regards, Andrey Borodin.


[0] 
https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34