Re: race condition in pg_class

2024-07-04 Thread Noah Misch
On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed. ...
> 
> Please also look at another anomaly I've discovered.
> 
> An Assert added with d5f788b41 may be falsified with:
> CREATE TABLE t(a int PRIMARY KEY);
> INSERT INTO t VALUES (1);
> CREATE VIEW v AS SELECT * FROM t;
> 
> MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
>   WHEN MATCHED THEN DO NOTHING
>   WHEN NOT MATCHED THEN DO NOTHING;
> 
> TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", 
> Line: 2891, PID: 1590670

Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
returns true, so we use the wholerow code regardless of the view's triggers or
auto-update capability.  The behavior is fine, so I'm fixing the new assertion
and comments in the new patch, inplace087-merge-DO-NOTHING-v8.patch.  The
closest relevant tests processed zero rows, so they narrowly avoided witnessing
this assertion failure.
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: 
https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any fields
 * in their pg_class entries except for reltuples and relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (past_inh_parent);
+INSERT INTO past_inh_child DEFAULT VALUES;
+INSERT INTO past_inh_child DEFAULT VALUES;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | t
+(1 row)
+
+DROP TABLE past_inh_child;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | f
+(1 row)
+
+COMMIT;
+-- Test ANALYZE setting relhassubclass=f for partitioning
+BEGIN;
+CREATE TABLE past_parted (i int) PARTITION BY LIST(i);
+CREATE TABLE past_part PARTITION OF past_parted FOR VALUES IN (1);
+INSERT INTO past_parted VALUES (1),(1);
+ANALYZE past_parted;
+DROP TABLE past_part;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+---+
+ 2 | t
+(1 row)
+
+ANALYZE past_parted;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+---+
+ 0 | f
+(1 row)
+
+COMMIT;
 VACUUM FULL pg_am;
 VACUUM FULL pg_class;
 VACUUM FULL pg_database;
diff --git a/src/test/regress/sql/vacuum.sql b/src/test/regress/sql/vacuum.sql
index 0b63ef8..548cd7a 100644
--- 

Re: race condition in pg_class

2024-07-03 Thread Alexander Lakhin

Hello Noah,

28.06.2024 08:13, Noah Misch wrote:

Pushed. ...


Please also look at another anomaly I've discovered.

An Assert added with d5f788b41 may be falsified with:
CREATE TABLE t(a int PRIMARY KEY);
INSERT INTO t VALUES (1);
CREATE VIEW v AS SELECT * FROM t;

MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
  WHEN MATCHED THEN DO NOTHING
  WHEN NOT MATCHED THEN DO NOTHING;

TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", 
Line: 2891, PID: 1590670

Best regards,
Alexander




Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 04:09:54PM -0700, Noah Misch wrote:
> On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> > 29.06.2024 05:42, Noah Misch wrote:
> > > Good point, any effort on (2) would be wasted once the fixes get 
> > > certified.  I
> > > pushed (1).  I'm attaching the rebased fix patches.
> > 
> > Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> > CREATE TABLE t (i int) PARTITION BY LIST(i);
> > CREATE TABLE p1 (i int);
> > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> > ALTER TABLE t DETACH PARTITION p1;
> > ANALYZE t;
> > 
> > triggers unexpected
> > ERROR:  tuple to be updated was already modified by an operation triggered 
> > by the current command
> 
> Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
> without an intervening CommandCounterIncrement().

Correction: it's not okay today.  If code does that, heap_inplace_update()
mutates a tuple that is going to become invisible at CCI.  The lack of CCI
yields a minor live bug in v14+.  Its consequences seem to be limited to
failing to update reltuples for a partitioned table having zero partitions.
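To make that concrete, a minimal SQL illustration of that limited consequence
(table names are arbitrary; the vacuum.sql test in the attached patch runs
essentially the same sequence):

CREATE TABLE demo_parted (i int) PARTITION BY LIST (i);
CREATE TABLE demo_part PARTITION OF demo_parted FOR VALUES IN (1);
INSERT INTO demo_parted VALUES (1), (1);
ANALYZE demo_parted;   -- reltuples becomes 2
DROP TABLE demo_part;  -- the partitioned table now has zero partitions
ANALYZE demo_parted;   -- sets relhassubclass=f; without the CCI, the
                       -- reltuples=0 inplace update lands on the dying tuple
SELECT reltuples, relhassubclass
  FROM pg_class WHERE oid = 'demo_parted'::regclass;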

> The patch makes the CCI
> required.  The ANALYZE in your example reaches this with a heap_update to set
> relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
> tests in vacuum.sql).

That's still the right fix, but I've separated it into its own patch and
expanded the test.  All the non-comment changes between v5 and v6 are now part
of the separate patch.

> The alternative would be to allow inplace updates on TM_SelfModified tuples.
> I can't think of a specific problem with allowing that, but I feel that would
> make system state interactions harder to reason about.  It might be optimal to
> allow that in back branches only, to reduce the chance of releasing a bug like
> the one you found.

Allowing a mutation of a TM_SelfModified tuple is bad, since that tuple is
going to become dead soon.  Mutating its successor could be okay.  Since we'd
expect such code to be unreachable, I'm not keen to carry such code.  For that
scenario, I'd rather keep the error you encountered.  Other opinions?
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: 
https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any fields
 * in their pg_class entries except for reltuples and relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (past_inh_parent);
+INSERT INTO past_inh_child DEFAULT VALUES;
+INSERT INTO past_inh_child DEFAULT VALUES;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 

Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> 29.06.2024 05:42, Noah Misch wrote:
> > Good point, any effort on (2) would be wasted once the fixes get certified. 
> >  I
> > pushed (1).  I'm attaching the rebased fix patches.
> 
> Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> CREATE TABLE t (i int) PARTITION BY LIST(i);
> CREATE TABLE p1 (i int);
> ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> ALTER TABLE t DETACH PARTITION p1;
> ANALYZE t;
> 
> triggers unexpected
> ERROR:  tuple to be updated was already modified by an operation triggered by 
> the current command

Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
without an intervening CommandCounterIncrement().  The patch makes the CCI
required.  The ANALYZE in your example reaches this with a heap_update to set
relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
tests in vacuum.sql).

The alternative would be to allow inplace updates on TM_SelfModified tuples.
I can't think of a specific problem with allowing that, but I feel that would
make system state interactions harder to reason about.  It might be optimal to
allow that in back branches only, to reduce the chance of releasing a bug like
the one you found.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was taken via
 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com
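
A hedged two-session sketch of the interaction the commit message describes
(the table name mirrors the ddl_target pgbench test elsewhere in this thread;
nothing beyond the commit message's claims is asserted here):

CREATE TABLE ddl_target ();

-- session 1
BEGIN;
GRANT SELECT ON ddl_target TO PUBLIC;  -- ordinary heap_update() of its pg_class row

-- session 2, concurrently
VACUUM ddl_target;  -- inplace update of the same row; with the patch it waits
                    -- for session 1 to commit, like a DELETE or UPDATE would

-- session 1
COMMIT;             -- session 2's VACUUM then proceeds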

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+
+  pg_class heap_inplace_update_scan() callers: before the call, acquire
+  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), 

Re: race condition in pg_class

2024-07-02 Thread Alexander Lakhin

Hello Noah,

29.06.2024 05:42, Noah Misch wrote:

Good point, any effort on (2) would be wasted once the fixes get certified.  I
pushed (1).  I'm attaching the rebased fix patches.


Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
CREATE TABLE t (i int) PARTITION BY LIST(i);
CREATE TABLE p1 (i int);
ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
ALTER TABLE t DETACH PARTITION p1;
ANALYZE t;

triggers unexpected
ERROR:  tuple to be updated was already modified by an operation triggered by 
the current command

Best regards,
Alexander




Re: race condition in pg_class

2024-06-28 Thread Noah Misch
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, 
> > almost
> > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> > testing an extant failure to inval a cache entry.  Naturally, inexorable 
> > inval
> > masks the extant bug.  Ideally, I'd just skip the test under any kind of 
> > cache
> > clobber option.  I don't know a pleasant way to do that, so these are
> > known-feasible things I'm considering:
> 
> > 1. Neutralize the test in all branches, probably by having it just not 
> > report
> >the final answer.  Undo in the later fix patch.
> 
> > 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
> >uses heuristics on that to deduce whether caches are getting released.
> >Have a separate expected output for the cache-release scenario.  Perhaps
> >also have the test treat installcheck like cache-release, since
> >installcheck could experience sinval reset with similar consequences.
> >Neutralize the test in v12 & v13.
> 
> > 3. Add a test module with a C function that reports whether any kind of 
> > cache
> >clobber is active.  Call it in this test.  Have a separate expected 
> > output
> >for the cache-release scenario.
> 
> > Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> > more thought over the next day.
> 
> I'd just go for (1).  We were doing fine without this test case.
> I can't see expending effort towards hiding its result rather
> than actually fixing anything.

Good point, any effort on (2) would be wasted once the fixes get certified.  I
pushed (1).  I'm attaching the rebased fix patches.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was taken via
 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called

Re: race condition in pg_class

2024-06-27 Thread Tom Lane
Noah Misch  writes:
> Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
> surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> testing an extant failure to inval a cache entry.  Naturally, inexorable inval
> masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
> clobber option.  I don't know a pleasant way to do that, so these are
> known-feasible things I'm considering:

> 1. Neutralize the test in all branches, probably by having it just not report
>the final answer.  Undo in the later fix patch.

> 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
>uses heuristics on that to deduce whether caches are getting released.
>Have a separate expected output for the cache-release scenario.  Perhaps
>also have the test treat installcheck like cache-release, since
>installcheck could experience sinval reset with similar consequences.
>Neutralize the test in v12 & v13.

> 3. Add a test module with a C function that reports whether any kind of cache
>clobber is active.  Call it in this test.  Have a separate expected output
>for the cache-release scenario.

> Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> more thought over the next day.

I'd just go for (1).  We were doing fine without this test case.
I can't see expending effort towards hiding its result rather
than actually fixing anything.

regards, tom lane




Re: race condition in pg_class

2024-06-27 Thread Noah Misch
On Fri, Jun 21, 2024 at 02:28:42PM -0700, Noah Misch wrote:
> On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> > I think the attached covers all comments to date.  I gave everything v3, but
> > most patches have just a no-conflict rebase vs. v2.  The exceptions are
> > inplace031-inj-wait-event (implements the holding from that branch of the
> > thread) and inplace050-tests-inj (updated to cooperate with inplace031).  
> > Much
> > of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> > infrastructure common to the two custom wait event types.
> 
> Starting 2024-06-27, I'd like to push
> inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
> them if needed.  Then I'll submit the last three to the commitfest.  Does
> anyone want me to delay that step?

Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
testing an extant failure to inval a cache entry.  Naturally, inexorable inval
masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
clobber option.  I don't know a pleasant way to do that, so these are
known-feasible things I'm considering:

1. Neutralize the test in all branches, probably by having it just not report
   the final answer.  Undo in the later fix patch.

2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
   uses heuristics on that to deduce whether caches are getting released.
   Have a separate expected output for the cache-release scenario.  Perhaps
   also have the test treat installcheck like cache-release, since
   installcheck could experience sinval reset with similar consequences.
   Neutralize the test in v12 & v13.

3. Add a test module with a C function that reports whether any kind of cache
   clobber is active.  Call it in this test.  Have a separate expected output
   for the cache-release scenario.

Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
more thought over the next day.

Thanks,
nm




Re: race condition in pg_class

2024-06-21 Thread Noah Misch
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category
> > 
> > I've replied on that branch of the thread.
> 
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Starting 2024-06-27, I'd like to push
inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
them if needed.  Then I'll submit the last three to the commitfest.  Does
anyone want me to delay that step?

Two more test-related changes compared to v3:

- In inplace010-tests, add to 027_stream_regress.pl a test that catalog
  contents match between primary and standby.  If one of these patches broke
  replay of inplace updates, this would help catch it.

- In inplace031-inj-wait-event, make sysviews.sql indifferent to whether
  InjectionPoint wait events exist.  installcheck needs this if other activity
  created such an event since the last postmaster restart; a sketch of such a
  query follows.
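
A rough sketch of one way to express that indifference in SQL (pg_wait_events
and its columns are real; whether this matches the committed sysviews.sql
query isn't asserted here):

SELECT type, count(*) > 0 AS ok
  FROM pg_wait_events
 WHERE type <> 'InjectionPoint'
 GROUP BY type
 ORDER BY type COLLATE "C";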
Author: Noah Misch 
Commit: Noah Misch 

Improve test coverage for changes to inplace-updated catalogs.

This covers both regular and inplace changes, since bugs arise at their
intersection.  Where marked, these witness extant bugs.  Back-patch to
v12 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 59ea538..956e290 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -68,6 +68,34 @@ $node->pgbench(
  "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
pg_temp.e;"
});
 
+# Test inplace updates from VACUUM concurrent with heap_update from GRANT.
+# The PROC_IN_VACUUM environment can't finish MVCC table scans consistently,
+# so this fails rarely.  To reproduce consistently, add a sleep after
+# GetCatalogSnapshot(non-catalog-rel).
+Test::More->builder->todo_start('PROC_IN_VACUUM scan breakage');
+$node->safe_psql('postgres', 'CREATE TABLE ddl_target ()');
+$node->pgbench(
+   '--no-vacuum --client=5 --protocol=prepared --transactions=50',
+   0,
+   [qr{processed: 250/250}],
+   [qr{^$}],
+   'concurrent GRANT/VACUUM',
+   {
+   '001_pgbench_grant@9' => q(
+   DO $$
+   BEGIN
+   PERFORM pg_advisory_xact_lock(42);
+   FOR i IN 1 .. 10 LOOP
+   GRANT SELECT ON ddl_target TO PUBLIC;
+   REVOKE SELECT ON ddl_target FROM PUBLIC;
+   END LOOP;
+   END
+   $$;
+),
+   '001_pgbench_vacuum_ddl_target@1' => "VACUUM ddl_target;",
+   });
+Test::More->builder->todo_end;
+
 # Trigger various connection errors
 $node->pgbench(
'no-such-database',
diff --git a/src/test/isolation/expected/eval-plan-qual.out 
b/src/test/isolation/expected/eval-plan-qual.out
index 0237271..032d420 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -1337,3 +1337,29 @@ a|b|c|   d
 2|2|2|1004
 (2 rows)
 
+
+starting permutation: sys1 sysupd2 c1 c2
+step sys1: 
+   UPDATE pg_class SET reltuples = 123 WHERE oid = 'accounts'::regclass;
+
+step sysupd2: 
+   UPDATE pg_class SET reltuples = reltuples * 2
+   WHERE oid = 'accounts'::regclass;
+ <waiting ...>
+step c1: COMMIT;
+step sysupd2: <... completed>
+step c2: COMMIT;
+
+starting permutation: sys1 sysmerge2 c1 c2
+step sys1: 
+   UPDATE pg_class SET reltuples = 123 WHERE oid = 'accounts'::regclass;
+
+step sysmerge2: 
+   MERGE INTO pg_class
+   USING (SELECT 'accounts'::regclass AS o) j
+   ON o = oid
+   WHEN MATCHED THEN UPDATE SET reltuples = reltuples * 2;
+ <waiting ...>
+step c1: COMMIT;
+step sysmerge2: <... completed>
+step c2: COMMIT;
diff --git a/src/test/isolation/expected/inplace-inval.out b/src/test/isolation/expected/inplace-inval.out
new file mode 100644
index 000..67b34ad
--- /dev/null
+++ b/src/test/isolation/expected/inplace-inval.out
@@ -0,0 +1,32 @@
+Parsed test spec with 3 sessions
+
+starting permutation: cachefill3 cir1 cic2 ddl3 read1
+step cachefill3: TABLE newly_indexed;
+c
+-
+(0 

Re: race condition in pg_class

2024-06-16 Thread Michael Paquier
On Sun, Jun 16, 2024 at 07:07:08AM -0700, Noah Misch wrote:
> It would be odd to detect exactly 0x0B00U and not other invalid inputs,
> like 0x0A01U where only 0x0B01U is valid.  I'm attaching roughly what
> it would take.  Shall I squash this into inplace031?

Agreed that merging both together is cleaner.  Moving the event class
into the key of WaitEventCustomEntryByInfo leads to a more consistent
final result.

> The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
> that are actually corrupting data out there.

Agreed to focus on that first.
--
Michael




Re: race condition in pg_class

2024-06-16 Thread Noah Misch
On Sun, Jun 16, 2024 at 09:28:05AM +0900, Michael Paquier wrote:
> On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> > On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> >> GetWaitEventCustomIdentifier() is incorrect, and should return
> >> "InjectionPoint" in the default case of this class name, no?
> > 
> > I intentionally didn't provide a default event ID for InjectionPoint.
> > PG_WAIT_EXTENSION needs a default case for backward compatibility, if 
> > nothing
> > else.  For this second custom type, it's needless complexity.  The value
> > 0x0B00U won't just show up like PG_WAIT_EXTENSION does.
> > GetLWLockIdentifier() also has no default case.  How do you see it?
> 
> I would add a default for consistency as this is just a few extra
> lines, but if you feel strongly about that, I'm OK as well.  It makes
> a bit easier the detection of incorrect wait event numbers set
> incorrectly in extensions depending on the class wanted.

It would be odd to detect exactly 0x0B00U and not other invalid inputs,
like 0x0A01U where only 0x0B01U is valid.  I'm attaching roughly what
it would take.  Shall I squash this into inplace031?

The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
that are actually corrupting data out there.  I think we should all limit our
interest in the verbiage of strings that appear only when running developer
tests, especially when $SUBJECT is a bug fix.  When the string appears only
after C code passes invalid input to other C code, it matters even less.

> > The patch added to xfunc.sgml an example of using it.  I'd be more inclined 
> > to
> > delete the WaitEventExtensionNew() docbook documentation than to add its 
> > level
> > of detail for WaitEventInjectionPointNew().  We don't have that kind of
> > documentation for most extension-facing C functions.
> 
> It's one of the areas where I think that we should have more
> documentation, not less of it, so I'd rather keep it and maintaining
> it is not really a pain (?).  The backend gets complicated enough
> these days that limiting what developers have to guess on their own is
> a better long-term approach because the Postgres out-of-core ecosystem
> is expanding a lot (aka have also in-core documentation for hooks,
> even if there's been a lot of reluctance historically about having
> them).

[getting deeply off topic -- let's move this to another thread if it needs to
expand] I like reducing the need to guess.  So far in this inplace update
project (this thread plus postgr.es/m/20240615223718.42.nmi...@google.com),
three patches just fix comments.  Even comments carry quite a price, but I
value them.  When we hand-maintain documentation of a C function in both its
header comment and another place, I get skeptical about whether hackers
(including myself) will actually keep them in sync and skeptical of the
incremental value of maintaining the second version.
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 300de90..aaf9f3d 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -47,11 +47,11 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
  * Hash tables for storing custom wait event ids and their names in
  * shared memory.
  *
- * WaitEventCustomHashById is used to find the name from an event id.
- * Any backend can search it to find custom wait events.
+ * WaitEventCustomHashByInfo is used to find the name from wait event
+ * information.  Any backend can search it to find custom wait events.
  *
- * WaitEventCustomHashByName is used to find the ID from a name.
- * It is used to ensure that no duplicated entries are registered.
+ * WaitEventCustomHashByName is used to find the wait event information from a
+ * name.  It is used to ensure that no duplicated entries are registered.
  *
  * For simplicity, we use the same ID counter across types of custom events.
  * We could end that anytime the need arises.
@@ -61,18 +61,18 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
  * unlikely that the number of entries will reach
  * WAIT_EVENT_CUSTOM_HASH_MAX_SIZE.
  */
-static HTAB *WaitEventCustomHashById;  /* find names from IDs */
-static HTAB *WaitEventCustomHashByName; /* find IDs from names */
+static HTAB *WaitEventCustomHashByInfo; /* find names from infos */
+static HTAB *WaitEventCustomHashByName; /* find infos from names */
 
 #define WAIT_EVENT_CUSTOM_HASH_INIT_SIZE   16
 #define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE128
 
 /* hash table entries */
-typedef struct WaitEventCustomEntryById
+typedef struct WaitEventCustomEntryByInfo
 {
-   uint16  event_id;   /* hash key */
+   uint32  wait_event_info;/* hash key */
char    wait_event_name[NAMEDATALEN];   /* custom wait event name */
-} WaitEventCustomEntryById;
+} WaitEventCustomEntryByInfo;
 
 typedef struct 

Re: race condition in pg_class

2024-06-15 Thread Michael Paquier
On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
>> GetWaitEventCustomIdentifier() is incorrect, and should return
>> "InjectionPoint" in the default case of this class name, no?
> 
> I intentionally didn't provide a default event ID for InjectionPoint.
> PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
> else.  For this second custom type, it's needless complexity.  The value
> 0x0B00U won't just show up like PG_WAIT_EXTENSION does.
> GetLWLockIdentifier() also has no default case.  How do you see it?

I would add a default for consistency as this is just a few extra
lines, but if you feel strongly about that, I'm OK as well.  It makes
a bit easier the detection of incorrect wait event numbers set
incorrectly in extensions depending on the class wanted.

> The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
> delete the WaitEventExtensionNew() docbook documentation than to add its level
> of detail for WaitEventInjectionPointNew().  We don't have that kind of
> documentation for most extension-facing C functions.

It's one of the areas where I think that we should have more
documentation, not less of it, so I'd rather keep it and maintaining
it is not really a pain (?).  The backend gets complicated enough
these days that limiting what developers have to guess on their own is
a better long-term approach because the Postgres out-of-core ecosystem
is expanding a lot (aka have also in-core documentation for hooks,
even if there's been a lot of reluctance historically about having
them).
--
Michael




Re: race condition in pg_class

2024-06-13 Thread Noah Misch
On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> Looking at inplace031-inj-wait-event..
> 
> The comment at the top of GetWaitEventCustomNames() requires an
> update, still mentioning extensions.

Thanks.  Fixed locally.

> GetWaitEventCustomIdentifier() is incorrect, and should return
> "InjectionPoint" in the default case of this class name, no?

I intentionally didn't provide a default event ID for InjectionPoint.
PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
else.  For this second custom type, it's needless complexity.  The value
0x0B00U won't just show up like PG_WAIT_EXTENSION does.
GetLWLockIdentifier() also has no default case.  How do you see it?

> I would
> just pass the classID to GetWaitEventCustomIdentifier().

As you say, that would allow eventId==0 to raise "could not find custom wait
event" for PG_WAIT_INJECTIONPOINT instead of wrongly returning "Extension".
Even if 0x0B00U somehow does show up, having pg_stat_activity report
"Extension" instead of an error, in a developer test run, feels unimportant to
me.

> It is suboptimal to have pg_get_wait_events() do two scans of
> WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
> returning a set of (class_name,event_name) fed to the tuplestore of
> this SRF?

Micro-optimization of pg_get_wait_events() doesn't matter.  I did consider
that or pushing more of the responsibility into wait_events.c, but I
considered it on code repetition grounds, not performance grounds.

>  uint32
>  WaitEventExtensionNew(const char *wait_event_name)
>  {
> + return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
> +}
> +
> +uint32
> +WaitEventInjectionPointNew(const char *wait_event_name)
> +{
> + return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
> +}
> 
> Hmm.  The advantage of two routines is that it is possible to control
> the class IDs allowed to use the custom wait events.  Shouldn't the
> second routine be documented in xfunc.sgml?

The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
delete the WaitEventExtensionNew() docbook documentation than to add its level
of detail for WaitEventInjectionPointNew().  We don't have that kind of
documentation for most extension-facing C functions.
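
As a hedged aside on visibility: events registered this way surface to SQL
roughly as follows (the view and column names are real; the 'InjectionPoint'
class string is what this patch series proposes):

-- events registered via WaitEventInjectionPointNew() show up here
SELECT type, name FROM pg_wait_events WHERE type = 'InjectionPoint';

-- and while a backend sleeps at such a point
SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
 WHERE wait_event_type = 'InjectionPoint';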




Re: race condition in pg_class

2024-06-13 Thread Michael Paquier
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Looking at inplace031-inj-wait-event..

The comment at the top of GetWaitEventCustomNames() requires an
update, still mentioning extensions.

GetWaitEventCustomIdentifier() is incorrect, and should return
"InjectionPoint" in the default case of this class name, no?  I would
just pass the classID to GetWaitEventCustomIdentifier().

It is suboptimal to have pg_get_wait_events() do two scans of
WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
returning a set of (class_name,event_name) fed to the tuplestore of
this SRF?

 uint32
 WaitEventExtensionNew(const char *wait_event_name)
 {
+   return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
+}
+
+uint32
+WaitEventInjectionPointNew(const char *wait_event_name)
+{
+   return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
+}

Hmm.  The advantage of two routines is that it is possible to control
the class IDs allowed to use the custom wait events.  Shouldn't the
second routine be documented in xfunc.sgml?

wait_event_names.txt also needs tweaks, in the shape of a new class
name for the new class "InjectionPoint" so that it can be documented for
its default case.  That's a fallback if an event ID cannot be found,
which should not be the case, still that's more correct than showing
"Extension" for all class IDs covered by custom wait events.
--
Michael




Re: race condition in pg_class

2024-06-13 Thread Noah Misch
On Wed, Jun 12, 2024 at 10:02:00PM +0200, Michail Nikolaev wrote:
> > Can you say more about the connection you see between $SUBJECT and that?
> That
> > looks like a valid report of an important bug, but I'm not following the
> > potential relationship to $SUBJECT.
> 
> I was guided by the following logic:
> * A pg_class race condition can cause table indexes to look stale.
> * REINDEX updates indexes
> * errors can be explained by different backends using different arbiter
> indexes

Got it.  The race condition of $SUBJECT involves inplace updates, and the
wrong content becomes permanent.  Hence, I suspect they're unrelated.




Re: race condition in pg_class

2024-06-12 Thread Michael Paquier
On Wed, Jun 12, 2024 at 12:32:23PM -0700, Noah Misch wrote:
> On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
>> Personally, I think the fact that injection point wait events were put
>> under Extension is a design mistake that should be corrected before 17
>> is out of beta.

Well, isolation tests and the way to wait for specific points in them
is something I've thought about when working on the initial injpoint
infrastructure, but all my ideas went down to the fact that this is
not specific to injection points: I've also wanted to be able to cause
an isolation test to wait for a specific event (class,name).  A hardcoded
sleep is an example.  Even if I discourage anything like that in the
in-core tests because they're slow on fast machines and can be
unreliable on slow machines, it is a fact that they are used by
out-of-core code and that extension developers find them acceptable.

> Works for me.  I don't personally have a problem with the use of Extension,
> since it is a src/test/modules extension creating them.

That's the original reason why Extension has been used in this case,
because the points are assigned in an extension.

> Yes, the last line does refer to that.  Updated table:
> 
> STRATEGY| Paquier | Misch | Haas
> 
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming | yes | yes   | no
> isolation spec says event names | yes | no| yes
> 
> I find that's adequate support for the first line.  If there are no objections
> in the next 24hr, I will implement that.

OK.  That sounds like a consensus to me, useful enough for the cases
at hand.
--
Michael




Re: race condition in pg_class

2024-06-12 Thread Michail Nikolaev
Hello!

> Can you say more about the connection you see between $SUBJECT and that?
That
> looks like a valid report of an important bug, but I'm not following the
> potential relationship to $SUBJECT.

I was guided by the following logic:
* A pg_class race condition can cause table indexes to look stale.
* REINDEX updates indexes
* errors can be explained by different backends using different arbiter
indexes

> On your other thread, it would be useful to see stack traces from the
high-CPU
> processes once the live lock has ended all query completion.
I'll do.

Best regards,
Mikhail.


Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Wed, Jun 12, 2024 at 03:02:43PM +0200, Michail Nikolaev wrote:
> I am not sure, but I think that issue may be related to the issue described
> in
> https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com
> 
> It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE
> in some strange way.

Can you say more about the connection you see between $SUBJECT and that?  That
looks like a valid report of an important bug, but I'm not following the
potential relationship to $SUBJECT.

On your other thread, it would be useful to see stack traces from the high-CPU
processes once the live lock has ended all query completion.




Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
> On Wed, Jun 12, 2024 at 1:54 PM Noah Misch  wrote:
> > If I were making a list of changes always welcome post-beta, it wouldn't
> > include adding wait event types.  But I don't hesitate to add one if it
> > unblocks a necessary test for a bug present in all versions.
> 
> However, injection points themselves are not present in all versions,
> so even if we invent a new wait-event type, we'll have difficulty
> testing older versions, unless we're planning to back-patch all of
> that infrastructure, which I assume we aren't.

Right.  We could put the injection point tests in v18 only instead of v17+v18.
I feel that would be an overreaction to a dispute about names that show up
only in tests.  Even so, I could accept that.

> Personally, I think the fact that injection point wait events were put
> under Extension is a design mistake that should be corrected before 17
> is out of beta.

Works for me.  I don't personally have a problem with the use of Extension,
since it is a src/test/modules extension creating them.

> > Here's what I'm reading for each person's willingness to tolerate each 
> > option:
> >
> > STRATEGY| Paquier | Misch | Haas
> > 
> > new "Injection Point" wait type | maybe   | yes   | yes
> > INJECTION_POINT(...) naming | yes | yes   | unknown
> > isolation spec says event names | yes | no| unknown
> >
> > Corrections and additional strategy lines welcome.  Robert, how do you judge
> > the lines where I've listed you as "unknown"?
> 
> I'd tolerate INJECTION_POINT() if we had no other option but I think
> it's clearly inferior. Does the last line refer to putting the
> specific wait event names in the isolation spec file? If so, I'd also
> be fine with that.

Yes, the last line does refer to that.  Updated table:

STRATEGY| Paquier | Misch | Haas

new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming | yes | yes   | no
isolation spec says event names | yes | no| yes

I find that's adequate support for the first line.  If there are no objections
in the next 24hr, I will implement that.




Re: race condition in pg_class

2024-06-12 Thread Robert Haas
On Wed, Jun 12, 2024 at 1:54 PM Noah Misch  wrote:
> If I were making a list of changes always welcome post-beta, it wouldn't
> include adding wait event types.  But I don't hesitate to add one if it
> unblocks a necessary test for a bug present in all versions.

However, injection points themselves are not present in all versions,
so even if we invent a new wait-event type, we'll have difficulty
testing older versions, unless we're planning to back-patch all of
that infrastructure, which I assume we aren't.

Personally, I think the fact that injection point wait events were put
under Extension is a design mistake that should be corrected before 17
is out of beta.

> Here's what I'm reading for each person's willingness to tolerate each option:
>
> STRATEGY| Paquier | Misch | Haas
> 
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming | yes | yes   | unknown
> isolation spec says event names | yes | no| unknown
>
> Corrections and additional strategy lines welcome.  Robert, how do you judge
> the lines where I've listed you as "unknown"?

I'd tolerate INJECTION_POINT() if we had no other option but I think
it's clearly inferior. Does the last line refer to putting the
specific wait event names in the isolation spec file? If so, I'd also
be fine with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-06-12 Thread Noah Misch
On Tue, Jun 11, 2024 at 01:37:21PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> > On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> >> I think the core code should provide an "Injection Point" wait event
> >> type and let extensions add specific wait events there, just like you
> >> did for "Extension".
> > 
> > Michael, could you accept the core code offering that, or not?  If so, I am
> > content to implement that.  If not, for injection point wait events, I have
> > just one priority.  The isolation tester already detects lmgr locks without
> > the test writer teaching it about each lock individually.  I want it to have
> > that same capability for injection points. Do you think we can find 
> > something
> > everyone can accept, having that property?  These wait events show up in 
> > tests
> > only, and I'm happy to make the cosmetics be anything compatible with that
> > detection ability.
> 
> Adding a wait event class for injection point is an interesting
> suggestion that would simplify the detection in the isolation function
> quite a bit.  Are you sure that this is something that would be fit
> for v17 material?  TBH, I am not sure.

If I were making a list of changes always welcome post-beta, it wouldn't
include adding wait event types.  But I don't hesitate to add one if it
unblocks a necessary test for a bug present in all versions.

> At the end, the test coverage has the highest priority and the bugs
> you are addressing are complex enough that isolation tests of this
> level are a necessity, so I don't object to what
> inplace050-tests-inj-v2.patch introduces with the naming dependency
> for the time being on HEAD.  I'll just adapt and live with that
> depending on what I deal with, while trying to improve HEAD later on.

Here's what I'm reading for each person's willingness to tolerate each option:

STRATEGY| Paquier | Misch | Haas

new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming | yes | yes   | unknown
isolation spec says event names | yes | no| unknown

Corrections and additional strategy lines welcome.  Robert, how do you judge
the lines where I've listed you as "unknown"?

> I'm still wondering if there is something that could be more elegant
> than a dedicated class for injection points, but I cannot think about
> something that would be better for isolation tests on top of my head.
> If there is something I can think of, I'll just go and implement it :)

I once considered changing them to use advisory lock waits instead of
ConditionVariableSleep(), but I recall that was worse from the perspective of
injection points in critical sections.




Re: race condition in pg_class

2024-06-12 Thread Michail Nikolaev
Hello, everyone.

I am not sure, but I think that issue may be related to the issue described
in
https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com

It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE
in some strange way.

Best regards,
Mikhail.


Re: race condition in pg_class

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
>> I think the core code should provide an "Injection Point" wait event
>> type and let extensions add specific wait events there, just like you
>> did for "Extension".
> 
> Michael, could you accept the core code offering that, or not?  If so, I am
> content to implement that.  If not, for injection point wait events, I have
> just one priority.  The isolation tester already detects lmgr locks without
> the test writer teaching it about each lock individually.  I want it to have
> that same capability for injection points. Do you think we can find something
> everyone can accept, having that property?  These wait events show up in tests
> only, and I'm happy to make the cosmetics be anything compatible with that
> detection ability.

Adding a wait event class for injection point is an interesting
suggestion that would simplify the detection in the isolation function
quite a bit.  Are you sure that this is something that would be fit
for v17 material?  TBH, I am not sure.

At the end, the test coverage has the highest priority and the bugs
you are addressing are complex enough that isolation tests of this
level are a necessity, so I don't object to what
inplace050-tests-inj-v2.patch introduces with the naming dependency
for the time being on HEAD.  I'll just adapt and live with that
depending on what I deal with, while trying to improve HEAD later on.

I'm still wondering if there is something that could be more elegant
than a dedicated class for injection points, but I cannot think about
something that would be better for isolation tests on top of my head.
If there is something I can think of, I'll just go and implement it :)
--
Michael




Re: race condition in pg_class

2024-06-10 Thread Noah Misch
On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category - which they are not - instead of being a new
> > > wait_event_type. That would have avoided the ugly wait-event naming
> > > pattern, inconsistent with everything else, introduced by
> > > inplace050-tests-inj-v1.patch.
> >
> > Not sure to agree with that.  The set of core backend APIs supporting
> > injection points have nothing to do with wait events.  The library
> > attached to one or more injection points *may* decide to use a wait
> > event like what the wait/wakeup calls in modules/injection_points do,
> > but that's entirely optional.  These rely on custom wait events,
> > plugged into the Extension category as the code run is itself in an
> > extension.  I am not arguing against the point that it may be
> > interesting to plug in custom wait event categories, but the current
> > design of wait events makes that much harder than what core is
> > currently able to handle, and I am not sure that this brings much at
> > the end as long as the wait event strings can be customized.
> >
> > I've voiced upthread concerns over the naming enforced by the patch
> > and the way it plugs the namings into the isolation functions, by the
> > way.
> 
> I think the core code should provide an "Injection Point" wait event
> type and let extensions add specific wait events there, just like you
> did for "Extension".

Michael, could you accept the core code offering that, or not?  If so, I am
content to implement that.  If not, for injection point wait events, I have
just one priority.  The isolation tester already detects lmgr locks without
the test writer teaching it about each lock individually.  I want it to have
that same capability for injection points.  Do you think we can find something
everyone can accept, having that property?  These wait events show up in tests
only, and I'm happy to make the cosmetics be anything compatible with that
detection ability.
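
For context, that detection is driven from SQL: the isolation tester polls a
support function to decide whether a session counts as blocked, and today that
function recognizes heavyweight-lock waits without the spec naming them.  A
hedged example of the call (arguments here are illustrative; the tester passes
the watched backend's pid and the other sessions' pids):

SELECT pg_catalog.pg_isolation_test_session_is_blocked(pg_backend_pid(), '{}');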




Re: race condition in pg_class

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category - which they are not - instead of being a new
> > wait_event_type. That would have avoided the ugly wait-event naming
> > pattern, inconsistent with everything else, introduced by
> > inplace050-tests-inj-v1.patch.
>
> Not sure to agree with that.  The set of core backend APIs supporting
> injection points have nothing to do with wait events.  The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional.  These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension.  I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event
type and let extensions add specific wait events there, just like you
did for "Extension". Then this ugly naming would go away. As I see it,
"Extension" is only supposed to be used as a catch-all when we have no
other information, but here we do. If we refuse to use the
wait_event_type field to categorize waits, then people are going to
have to find some other way to get that data into the system, as Noah
has done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-06-06 Thread Michael Paquier
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category - which they are not - instead of being a new
> wait_event_type. That would have avoided the ugly wait-event naming
> pattern, inconsistent with everything else, introduced by
> inplace050-tests-inj-v1.patch.

I'm not sure I agree with that.  The set of core backend APIs supporting
injection points have nothing to do with wait events.  The library
attached to one or more injection points *may* decide to use a wait
event like what the wait/wakeup calls in modules/injection_points do,
but that's entirely optional.  These rely on custom wait events,
plugged into the Extension category as the code run is itself in an
extension.  I am not arguing against the point that it may be
interesting to plug in custom wait event categories, but the current
design of wait events makes that much harder than what core is
currently able to handle, and I am not sure that this brings much at
the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch
and the way it plugs the namings into the isolation functions, by the
way.
--
Michael




Re: race condition in pg_class

2024-06-06 Thread Robert Haas
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch  wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
>   Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
>   Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
>   If it fails to fix the bug it targets, I expect it's a no-op rather than
>   breaking things.
>
> I'll leave the last three of the ten needing review.  Those three are beyond
> my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that
the injection point wait events have been shoehorned into the
"Extension" category - which they are not - instead of being a new
wait_event_type. That would have avoided the ugly wait-event naming
pattern, inconsistent with everything else, introduced by
inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could,
in some places, use improvement. For instance,
inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of
comments, which makes it hard to see what actually changed, and the
commit message doesn't tell you, either. A good bit of it seems to be
changing "a view" to "a view INSTEAD OF trigger" or "a view having an
INSTEAD OF trigger," but the reasoning behind that change is not
spelled out anywhere. The reader is left to guess what the other case
is and why the same principles don't apply to it. I don't doubt that
the new comments are more correct than the old ones, but I expect
future patch authors to have difficulty maintaining that state of
affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments.
IMHO, the commit message also isn't very informative. It disclaims
knowledge of what bug it's fixing, while at the same time leaving the
reader to figure out for themselves how the behavior has changed.
Consequently, I expect writing the release notes for a release
including this patch to be difficult: "We added some locks that block
... something ... in some circumstances ... to prevent ... something."
It's not really the job of the release note author to fill in those
blanks, but rather of the patch author or committer. I don't want to
overburden the act of fixing bugs, but I just feel like more
explanation is needed here. When I see for example that we're adding a
lock acquisition to the end of heap_create(), I can't help but wonder
if it's really true that we don't take a lock on a just-created
relation today. I'm certainly under the impression that we lock
newly-created, uncommitted relations, and a quick test seems to
confirm that, so I don't quite know what's happening here; evidently
this call is guarding against something more subtle than a categorical
failure to lock a relation on creation, so I think there should be a
comment explaining what that thing is.
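
(One way to run such a quick test, as a hedged sketch with hypothetical names:)

  -- session 1: create a relation and keep the transaction open
  BEGIN;
  CREATE TABLE locktest_new (c int);

  -- session 2: pg_locks already shows session 1's lock on the uncommitted
  -- relation's OID, before any commit
  SELECT locktype, relation, mode, granted
  FROM pg_locks
  WHERE locktype = 'relation' AND pid <> pg_backend_pid();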

It's also quite surprising that SetRelationHasSubclass() says "take X
lock before calling" and 2 of 4 callers just don't. I guess that's how
it is. But shouldn't we then have an assertion inside that function to
guard against future mistakes? If the reason why we failed to add this
initially is discernible from the commit messages that introduced the
bug, it would be nice to mention what it seems to have been; if not,
it would at least be nice to mention the offending commit(s). I'm also
a bit worried that this is going to cause deadlocks, but I suppose if
it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation
instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another
place where spelling out the rationale in more detail would be helpful
to future readers; for instance, the commit message says that
PgDatabaseToastTable is the only one affected, but it doesn't say why
the others are not, or why this one is. The lengthy comment in
CatalogCacheCreateEntry is also difficult to correlate with the code
which follows. I can't guess whether the two cases called out in the
comment always needed to be handled and were handled save only for
in-place updates, and thus the comment changes were simply taking the
opportunity to elaborate on the existing comments; or whether one of
those cases is preexisting and the other arises from the desire to
handle inplace updates. It could be helpful to mention relevant
identifiers from the code in the comment text e.g.
"systable_recheck_tuple detects ordinary updates by noting changes to
the tuple's visibility information, while the equalTuple() case
detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place
update altogether. That sounds difficult and not back-patchable, but I
can't classify what this patch set does as anything 

Re: race condition in pg_class

2024-06-05 Thread Noah Misch
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than
  breaking things.

I'll leave the last three of the ten needing review.  Those three are beyond
my skill to self-certify.




Re: race condition in pg_class

2024-05-13 Thread Noah Misch
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch  wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM 
> > backend's
> >   xmin does not stop pruning, an MVCC scan in that backend can find zero
> >   tuples when one is live.  This is like what all backends got in the days 
> > of
> >   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
> >   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
> >   setting that flag later and unsetting it earlier.)
> 
> Are you saying that this is a problem already, or that the patch
> causes it to start happening? If it's the former, that's horrible.

The former.




Re: race condition in pg_class

2024-05-13 Thread Robert Haas
On Sun, May 12, 2024 at 7:29 PM Noah Misch  wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
>   xmin does not stop pruning, an MVCC scan in that backend can find zero
>   tuples when one is live.  This is like what all backends got in the days of
>   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
>   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
>   setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch
causes it to start happening? If it's the former, that's horrible. If
it's the latter, I'd say that is a fatal defect.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: race condition in pg_class

2024-05-13 Thread Noah Misch
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
> 
> + /* Check if blocked_pid is in injection_wait(). */
> + proc = BackendPidGetProc(blocked_pid);
> + if (proc == NULL)
> + PG_RETURN_BOOL(false);  /* session gone: definitely unblocked */
> + wait_event =
> +     pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> + if (wait_event && strncmp("INJECTION_POINT(",
> +                           wait_event,
> +                           strlen("INJECTION_POINT(")) == 0)
> +     PG_RETURN_BOOL(true);
> 
> Hmm.  I am not sure that this is the right interface for the job
> because this is not only related to injection points but to the
> monitoring of one or more wait events when running a permutation
> step.

Could you say more about that?  Permutation steps don't monitor wait events
today.  This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec
> files with some property area listing the wait events we're expected
> to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all.  The isolation
tester's needs are limited to distinguishing:

  (a) process is waiting on another test session
  (b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave
injection_wait(), so all injection point wait events fall in (a).  (The tester
ignores (b), since those clear up without intervention.  Failing to ignore
them, as the tester did long ago, made output unstable.)




Re: race condition in pg_class

2024-05-13 Thread Michael Paquier
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.  It turns out we
> don't just lose inplace updates.  We also overwrite unrelated tuples,
> reproduced at inplace.spec.  Good starting points are README.tuplock and the
> heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+   /* Check if blocked_pid is in injection_wait(). */
+   proc = BackendPidGetProc(blocked_pid);
+   if (proc == NULL)
+   PG_RETURN_BOOL(false);  /* session gone: definitely unblocked */
+   wait_event =
+       pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+   if (wait_event && strncmp("INJECTION_POINT(",
+                             wait_event,
+                             strlen("INJECTION_POINT(")) == 0)
+       PG_RETURN_BOOL(true);

Hmm.  I am not sure that this is the right interface for the job
because this is not only related to injection points but to the
monitoring of one or more wait events when running a permutation
step.  Perhaps this is something that should be linked to the spec
files with some property area listing the wait events we're expected
to wait on instead when running a step that we know will wait?
--
Michael




Re: race condition in pg_class

2023-11-01 Thread Noah Misch
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch  writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> > 
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
> 
> That could be okay.  It would be weird to reuse a short-term lock like that
> one as something held till end of transaction.  But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater.  This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
> 
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest.  heap_inplace_update() would assert that the backend holds one of
> the acceptable locks.  It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update().  After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
> 
> A pgxn search finds "citus" using heap_inplace_update().
> 
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
> 
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions.  That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize.  (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.)  We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way?  I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
> 
> Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
> could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
> heap_update() could complain if the field changed vs. last seen value.  This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format.  I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state.  A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases.  For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber".  The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

  == session 1
  drop table t;
  create table t (c int);
  begin;
  -- in gdb, set breakpoint on heap_modify_tuple
  grant select on t to public;

  == session 2
  alter table t add primary key (c);
  begin; alter table t rename to t2; rollback;

  == back in session 1
  -- release breakpoint
  -- want error (would get it w/o the begin;alter;rollback)
  commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update().  The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update().  Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2
in t_infomask2, and 0 in xmax.  I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this.  Even if I did, 

Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> > problematic orders of events.  The easy one is heap_inplace_update() 
> > mutating
> > a tuple that already has an xmax.  That's the one in the test case upthread,
> > and detecting it is trivial.  The harder one is heap_inplace_update() 
> > mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters 
> > heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.




Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not.  I think this is what your B3 option is, but maybe
>> I misinterpreted.  It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock
infrastructure, rather than inventing new locktags (a choice that
spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making
heap_inplace_update protect itself, without high-level cooperation
from the possibly-interfering updater.  This is basically because
I'm still afraid that we're defining the problem too narrowly.
For one thing, I have nearly zero confidence that GRANT et al are
the only problematic source of conflicting transactional updates.
For another, I'm worried that some extension may be using
heap_inplace_update against a catalog we're not considering here.
I'd also like to find a solution that fixes the case of a conflicting
manual UPDATE (although certainly that's a stretch goal we may not be
able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple
header as just-updated in a way that regular heap_update could
recognize.  (For standard catalog updates, we'd then end up erroring
in simple_heap_update, which I think is fine.)  We can't update xmin,
because the VACUUM callers don't have an XID; but maybe there's some
other way?  I'm speculating about putting a funny value into xmax,
or something like that, and having heap_update check that what it
sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like
a mighty big lift.  All of the existing callers, except maybe
the johnny-come-lately dropdb usage, have solid documented reasons
to do it that way.

regards, tom lane




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a 
> >> lock
> >> in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.
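
A hedged sketch of manufacturing such a row (names hypothetical; it relies on
pg_class having a TOAST table and on a sufficiently large relacl being pushed out
of line, so increase the loop count if the row is not toasted yet):

  CREATE TABLE toasted_acl_tbl (c int);
  DO $$
  BEGIN
    FOR i IN 1..2000 LOOP
      EXECUTE format('CREATE ROLE acl_role_%s', i);
      EXECUTE format('GRANT SELECT ON toasted_acl_tbl TO acl_role_%s', i);
    END LOOP;
  END $$;
  -- now exercise an inplace update against the (hopefully) toasted pg_class row
  VACUUM ANALYZE toasted_acl_tbl;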

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-efforts thing to catch conflict with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.




Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a 
>> lock
>> in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient
> relcache builds.  Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it
seems to me that any call of heap_inplace_update brings some
risk of this kind.  Excluding the bootstrap-mode-only usage in
create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's
  relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's
  relpages, reltuples, relallvisible, relhasindex, relhasrules,
  relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's
  datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases
as GRANTs on relations.  Also, it looks like we can lose
knowledge of the presence of rules and triggers, which seems
nearly as bad as forgetting about indexes.  The rest of these
updates might not be correctness-critical, although I wonder
how bollixed things could get if we forget an advancement of
relfrozenxid or datfrozenxid (especially if the calling
transaction goes on to make other changes that assume that
the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that
it cannot use the syscache copy of a tuple as the basis for in-place
update, because syscache will have detoasted any toastable fields.
These other callers are ignoring that, which seems like it should
result in heap_inplace_update failing with "wrong tuple length".
I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to
make sure that it's operating on the latest version of the tuple
(including, I guess, waiting for an uncommitted update?) and throw
error if not.  I think this is what your B3 option is, but maybe
I misinterpreted.  It might be better to throw error immediately
instead of waiting to see if the other updater commits.

regards, tom lane




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)
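
Besides pg_amcheck, the inconsistent state itself is easy to spot at the catalog
level; a hedged cross-check query:

  -- tables claiming to have no indexes while pg_index says otherwise
  SELECT c.oid::regclass AS tbl
  FROM pg_class c
  WHERE NOT c.relhasindex
    AND EXISTS (SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid);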

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
transactional pg_class updates without holding some stronger lock.  New
asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
Anything performing a transactional update of a pg_class row would acquire
the lock in exclusive mode before fetching the old tuple and hold it till
end of transaction.  relhasindex=true in-place updates would acquire it
the same way, but they would release it after the inplace update.  I
expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
the same key as LOCKTAG_RELATION.  This has less blocking than the
previous option, but it's more complicated to explain to both users and
developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
tuple versions.  Like the previous option, this would require new locking,
but the new lock would not need to persist till end of xact.  It would be
even more complicated to explain to users and developers.  (If this is
promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
commands on the same table would block each other.  If we did it the way
most DDL does today, they'd get "tuple concurrently updated" failures
after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
transactional updates of pg_class set relhasindex=true pessimistically.
(VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
nontransactional columns would help.  I'm ruling that out as too hard to
back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?
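
For scale, the work that (R1) would add to each relcache or planner build is
roughly this per-table probe of pg_index (an SQL analogue of the internal scan;
the table name is hypothetical):

  SELECT indexrelid::regclass
  FROM pg_index
  WHERE indrelid = 'some_table'::regclass;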

Thanks,
nm




Re: race condition in pg_class

2023-10-27 Thread Smolkin Grigory
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

Oh, that explains it. Thank you very much.

> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.

Sure, but it will take some time; it's a large db with lots of WAL segments
to apply.

> extensions

       extname        | extversion
----------------------+------------
 plpgsql              | 1.0
 pg_stat_statements   | 1.8
 pg_buffercache       | 1.3
 pgstattuple          | 1.5


Re: race condition in pg_class

2023-10-26 Thread Noah Misch
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.

> Looking into the WAL via waldump gave us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations,
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.
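
A hedged sketch of the save-and-diff steps, assuming you can run queries at each
PITR point and compare the dumped files afterwards (file paths are examples):

  -- at the PITR point just before the COMMIT record
  COPY (SELECT * FROM pg_class ORDER BY oid)
    TO '/tmp/pg_class_before.csv' WITH (FORMAT csv, HEADER);

  -- at the PITR point just after the COMMIT record
  COPY (SELECT * FROM pg_class ORDER BY oid)
    TO '/tmp/pg_class_after.csv' WITH (FORMAT csv, HEADER);

  -- then compare outside the server, e.g.:
  --   diff /tmp/pg_class_before.csv /tmp/pg_class_after.csv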

On the off-chance it's relevant, what extensions do you have (\dx in psql)?




Re: race condition in pg_class

2023-10-26 Thread Smolkin Grigory
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?

No chance. Our infrastructure doesn't do that, and users simply don't have the
privileges to mess with pg_catalog.

ср, 25 окт. 2023 г. в 21:06, Tom Lane :

> Smolkin Grigory  writes:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.
> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
>
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?
>
> regards, tom lane
>


Re: race condition in pg_class

2023-10-25 Thread Tom Lane
Smolkin Grigory  writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?
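
For illustration, the kind of direct catalog write meant here (not something
regular DDL or supported tooling would issue) would be along these lines:

  -- a manual row update like this bypasses relation-level locking entirely
  UPDATE pg_class SET relhasindex = false WHERE relname = 'example';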

regards, tom lane




Re: race condition in pg_class

2023-10-25 Thread Andrey M. Borodin



> On 25 Oct 2023, at 13:39, Smolkin Grigory  wrote:
> 
> We are running PG13.10 and recently we have encountered what appears to be a 
> bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and 
> some other catalog-writer, possibly ANALYZE.

> I've tried to reproduce this scenario with CREATE INDEX and various 
> concurrent statements, but no luck.

Maybe it would be possible to reproduce by modifying the tests for concurrent
index creation. For example, add "ANALYZE" here [0].
Keep in mind that for easier reproduction it would make sense to increase the
transaction count radically.


Best regards, Andrey Borodin.


[0] 
https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34