Re: Weird test mixup

2024-05-12 Thread Michael Paquier
On Sat, May 11, 2024 at 11:45:33AM +0500, Andrey M. Borodin wrote:
> I see that you store the condition in private_data. So "private" means
> that this is data specific to the extension, do I understand it right?

Yes, it is possible to pass down some custom data, generated in a
module, to the registered callbacks.  One example would be a more
complex condition grammar, like a JSON-based thing.  I don't really
see the need for this amount of complexity in the tree yet, but one
could do that outside the tree easily.

> As long as I started anyway, I also want to ask some more stupid
> questions:
> 1. Where is the border between the responsibility of an extension and
> the core part? I mean, can we define in simple words what
> functionality must be in the extension?

Rule 0 I've been using here: keep the footprint on the backend as
simple as possible.  An injection point has, as an absolute minimum
requirement:
- A function name.
- A library name.
- A point name.

The private area contents and size are added to address the
concurrency cases with runtime checks.  I didn't see a strong use for
that at first, but Noah has been convincing enough with his use cases
and the fact that the race between detach and run was not completely
closed because we lacked consistency with the shmem hash table lookup.

> 2. If we have some concurrency issues, why can't we just protect
> everything with one giant LWLock/SpinLock?  Why do we have a locking
> model instead of serializing access from enter until exit?

This reduces the flexibility of the test infrastructure, because one
may want to attach or detach injection points while a point is
running.  So it is by design that the LWLock protecting the shmem hash
table is not held while a point is running.  This has been covered a
bit upthread, and I want to be able to do that as well.
--
Michael




Re: Weird test mixup

2024-05-12 Thread Michael Paquier
On Sun, May 12, 2024 at 10:48:51AM -0700, Noah Misch wrote:
> This looks correct, and it works well in my tests.  Thanks.

Thanks for looking.  While looking at it yesterday I decided to
split the change into two commits, one for the infra and one for the
module.  While doing so, I noticed that the case of a private area
passed as NULL was not completely correct, as memcpy() from a NULL
source is undefined behavior.

The open item has been marked as fixed.
--
Michael




Re: Weird test mixup

2024-05-12 Thread Noah Misch
On Fri, May 10, 2024 at 10:04:17AM +0900, Michael Paquier wrote:
> Attached is an updated patch for now, indented with a happy CI.  I am
> still planning to look at that a second time on Monday with a fresher
> mind, in case I'm missing something now.

This looks correct, and it works well in my tests.  Thanks.




Re: Weird test mixup

2024-05-10 Thread Andrey M. Borodin



> On 10 May 2024, at 06:04, Michael Paquier  wrote:
> 
> Attached is an updated patch for now

Can you please add some more comments regarding the purpose of the
private data?  I somewhat lost track of the discussion for a week or so,
and I hoped to grasp the idea of private_data from the resulting code,
but I cannot do so from the current patch version...

I see that you store the condition in private_data.  So "private" means that
this is data specific to the extension, do I understand it right?

As long as I started anyway, I also want to ask some more stupid questions:
1. Where is the border between the responsibility of an extension and the
core part?  I mean, can we define in simple words what functionality must
be in the extension?
2. If we have some concurrency issues, why can't we just protect everything
with one giant LWLock/SpinLock?  Why do we have a locking model instead of
serializing access from enter until exit?

Most probably, this was discussed somewhere, but I could not find it.

Thanks!


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-05-09 Thread Michael Paquier
On Thu, May 09, 2024 at 04:39:00PM -0700, Noah Misch wrote:

Thanks for the feedback.

> The return-bool approach sounds fine.  Up to you whether to do it in this
> patch, else I'll do it when I add the test.

I see no reason not to change the signature of the routine now if we
know that we're going to do it anyway in the future.  I was briefly
wondering if doing the same for InjectionPointAttach() would make
sense, but it has more error states, so I'm not really tempted without
an actual reason (I cannot think of a case where I'd want to put more
control into a module after a failed attach).

>> It could
>> always be possible that a concurrent backend does a detach followed by
>> an attach with the same name, causing the shmem exit callback to drop
>> a point it should not, but that's not really a plausible case IMO :)
> 
> Agreed.  It's reasonable to expect test cases to serialize backend exits,
> attach calls, and detach calls.  If we need to fix that later, we can use
> attachment serial numbers.

Okay by me.

> I'd name this INJ_CONDITION_UNCONDITIONAL or INJ_CONDITION_ALWAYS.  INVALID
> sounds like a can't-happen event or an injection point that never runs.
> Otherwise, the patch looks good and makes src/test/modules/gin safe for
> installcheck.  Thanks.

INJ_CONDITION_ALWAYS sounds like a good compromise here.

Attached is an updated patch for now, indented with a happy CI.  I am
still planning to look at that a second time on Monday with a fresher
mind, in case I'm missing something now.
--
Michael
From 233e710c8fc2242bdd4e40d5a20f8de2828765b7 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 10 May 2024 09:40:42 +0900
Subject: [PATCH v4] Add chunk area for injection points

---
 src/include/utils/injection_point.h   |   9 +-
 src/backend/utils/misc/injection_point.c  |  53 -
 .../expected/injection_points.out |   2 +-
 .../injection_points/injection_points.c   | 191 +++---
 doc/src/sgml/xfunc.sgml   |  14 +-
 src/tools/pgindent/typedefs.list  |   1 +
 6 files changed, 131 insertions(+), 139 deletions(-)

diff --git a/src/include/utils/injection_point.h b/src/include/utils/injection_point.h
index 55524b568f..a61d5d4439 100644
--- a/src/include/utils/injection_point.h
+++ b/src/include/utils/injection_point.h
@@ -23,15 +23,18 @@
 /*
  * Typedef for callback function launched by an injection point.
  */
-typedef void (*InjectionPointCallback) (const char *name);
+typedef void (*InjectionPointCallback) (const char *name,
+		const void *private_data);
 
 extern Size InjectionPointShmemSize(void);
 extern void InjectionPointShmemInit(void);
 
 extern void InjectionPointAttach(const char *name,
  const char *library,
- const char *function);
+ const char *function,
+ const void *private_data,
+ int private_data_size);
 extern void InjectionPointRun(const char *name);
-extern void InjectionPointDetach(const char *name);
+extern bool InjectionPointDetach(const char *name);
 
 #endif			/* INJECTION_POINT_H */
diff --git a/src/backend/utils/misc/injection_point.c b/src/backend/utils/misc/injection_point.c
index 0cf4d51cac..d5a8726644 100644
--- a/src/backend/utils/misc/injection_point.c
+++ b/src/backend/utils/misc/injection_point.c
@@ -42,6 +42,7 @@ static HTAB *InjectionPointHash;	/* find points from names */
 #define INJ_NAME_MAXLEN		64
 #define INJ_LIB_MAXLEN		128
 #define INJ_FUNC_MAXLEN		128
+#define INJ_PRIVATE_MAXLEN	1024
 
 /* Single injection point stored in InjectionPointHash */
 typedef struct InjectionPointEntry
@@ -49,6 +50,12 @@ typedef struct InjectionPointEntry
 	char		name[INJ_NAME_MAXLEN];	/* hash key */
 	char		library[INJ_LIB_MAXLEN];	/* library */
 	char		function[INJ_FUNC_MAXLEN];	/* function */
+
+	/*
+	 * Opaque data area that modules can use to pass some custom data to
+	 * callbacks, registered when attached.
+	 */
+	char		private_data[INJ_PRIVATE_MAXLEN];
 } InjectionPointEntry;
 
 #define INJECTION_POINT_HASH_INIT_SIZE	16
@@ -61,6 +68,7 @@ typedef struct InjectionPointEntry
 typedef struct InjectionPointCacheEntry
 {
 	char		name[INJ_NAME_MAXLEN];
+	char		private_data[INJ_PRIVATE_MAXLEN];
 	InjectionPointCallback callback;
 } InjectionPointCacheEntry;
 
@@ -73,7 +81,8 @@ static HTAB *InjectionPointCache = NULL;
  */
 static void
 injection_point_cache_add(const char *name,
-		  InjectionPointCallback callback)
+		  InjectionPointCallback callback,
+		  const void *private_data)
 {
 	InjectionPointCacheEntry *entry;
 	bool		found;
@@ -99,6 +108,7 @@ injection_point_cache_add(const char *name,
 	Assert(!found);
 	strlcpy(entry->name, name, sizeof(entry->name));
 	entry->callback = callback;
+	memcpy(entry->private_data, private_data, INJ_PRIVATE_MAXLEN);
 }
 
 /*
@@ -124,11 +134,14 @@ injection_point_cache_remove(const char *name)
  * Retrieve an injection point from the local cache, if any.
  */
 static InjectionPointCallback
-injection_po

Re: Weird test mixup

2024-05-09 Thread Noah Misch
On Thu, May 09, 2024 at 04:40:43PM +0900, Michael Paquier wrote:
> So I like your suggestion.  This version closes the race window
> between the shmem exit detach in backend A and a concurrent backend B
> running a point local to A, so that B will never run the local point of
> A.  However, it can cause a failure in the shmem exit callback of
> backend A if backend B does a detach of a point local to A because A
> tracks its local points with a static list in its TopMemoryContext, at
> least in the attached.  The second case could be solved by tracking
> the list of local points in the module's InjectionPointSharedState,
> but is this case really worth the complications it adds in the module
> knowing that the first case would be solid enough?  Perhaps not.
> 
> Another idea I have for the second case is to make
> InjectionPointDetach() a bit "softer", by returning a boolean status
> rather than failing if the detach cannot be done, so that the shmem exit
> callback can still loop through the entries it has in store.

The return-bool approach sounds fine.  Up to you whether to do it in this
patch, else I'll do it when I add the test.

> It could
> always be possible that a concurrent backend does a detach followed by
> an attach with the same name, causing the shmem exit callback to drop
> a point it should not, but that's not really a plausible case IMO :)

Agreed.  It's reasonable to expect test cases to serialize backend exits,
attach calls, and detach calls.  If we need to fix that later, we can use
attachment serial numbers.

> --- a/src/test/modules/injection_points/injection_points.c
> +++ b/src/test/modules/injection_points/injection_points.c

> +typedef enum InjectionPointConditionType
> +{
> + INJ_CONDITION_INVALID = 0,

I'd name this INJ_CONDITION_UNCONDITIONAL or INJ_CONDITION_ALWAYS.  INVALID
sounds like a can't-happen event or an injection point that never runs.
Otherwise, the patch looks good and makes src/test/modules/gin safe for
installcheck.  Thanks.




Re: Weird test mixup

2024-05-09 Thread Michael Paquier
On Thu, May 09, 2024 at 01:47:45PM +0900, Michael Paquier wrote:
> That sounds fine to do at the end.  I'm not sure how large this
> chunk area added to each InjectionPointEntry should be, though.  128B
> stored in each InjectionPointEntry would be more than enough I guess?
> Or more like 1024?  The in-core module does not need much currently,
> but larger is likely better for pluggability.

Actually, this is leading to simplifications in the module, giving me
the attached:
4 files changed, 117 insertions(+), 134 deletions(-) 

So I like your suggestion.  This version closes the race window
between the shmem exit detach in backend A and a concurrent backend B
running a point local to A, so that B will never run the local point of
A.  However, it can cause a failure in the shmem exit callback of
backend A if backend B does a detach of a point local to A because A
tracks its local points with a static list in its TopMemoryContext, at
least in the attached.  The second case could be solved by tracking
the list of local points in the module's InjectionPointSharedState,
but is this case really worth the complications it adds in the module
knowing that the first case would be solid enough?  Perhaps not.

Another idea I have for the second case is to make
InjectionPointDetach() a bit "softer", by returning a boolean status
rather than failing if the detach cannot be done, so that the shmem exit
callback can still loop through the entries it has in store.  It could
always be possible that a concurrent backend does a detach followed by
an attach with the same name, causing the shmem exit callback to drop
a point it should not, but that's not really a plausible case IMO :)

This stuff can be adjusted in subtle ways depending on the cases you
are most interested in.  What do you think?
--
Michael
From 27b0ba88d3530c50cb87d1495439372885f0773b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 9 May 2024 16:29:16 +0900
Subject: [PATCH v3] Add chunk area for injection points

---
 src/include/utils/injection_point.h   |   7 +-
 src/backend/utils/misc/injection_point.c  |  45 -
 .../injection_points/injection_points.c   | 189 +++---
 doc/src/sgml/xfunc.sgml   |  10 +-
 4 files changed, 117 insertions(+), 134 deletions(-)

diff --git a/src/include/utils/injection_point.h b/src/include/utils/injection_point.h
index 55524b568f..a8a11de5f2 100644
--- a/src/include/utils/injection_point.h
+++ b/src/include/utils/injection_point.h
@@ -23,14 +23,17 @@
 /*
  * Typedef for callback function launched by an injection point.
  */
-typedef void (*InjectionPointCallback) (const char *name);
+typedef void (*InjectionPointCallback) (const char *name,
+		const void *private_data);
 
 extern Size InjectionPointShmemSize(void);
 extern void InjectionPointShmemInit(void);
 
 extern void InjectionPointAttach(const char *name,
  const char *library,
- const char *function);
+ const char *function,
+ const void *private_data,
+ int private_data_size);
 extern void InjectionPointRun(const char *name);
 extern void InjectionPointDetach(const char *name);
 
diff --git a/src/backend/utils/misc/injection_point.c b/src/backend/utils/misc/injection_point.c
index 0cf4d51cac..d46ad8b48f 100644
--- a/src/backend/utils/misc/injection_point.c
+++ b/src/backend/utils/misc/injection_point.c
@@ -42,6 +42,7 @@ static HTAB *InjectionPointHash;	/* find points from names */
 #define INJ_NAME_MAXLEN		64
 #define INJ_LIB_MAXLEN		128
 #define INJ_FUNC_MAXLEN		128
+#define INJ_PRIVATE_MAXLEN	1024
 
 /* Single injection point stored in InjectionPointHash */
 typedef struct InjectionPointEntry
@@ -49,6 +50,12 @@ typedef struct InjectionPointEntry
 	char		name[INJ_NAME_MAXLEN];	/* hash key */
 	char		library[INJ_LIB_MAXLEN];	/* library */
 	char		function[INJ_FUNC_MAXLEN];	/* function */
+
+	/*
+	 * Opaque data area that modules can use to pass some custom data to
+	 * callbacks, registered when attached.
+	 */
+	char		private_data[INJ_PRIVATE_MAXLEN];
 } InjectionPointEntry;
 
 #define INJECTION_POINT_HASH_INIT_SIZE	16
@@ -61,6 +68,7 @@ typedef struct InjectionPointEntry
 typedef struct InjectionPointCacheEntry
 {
 	char		name[INJ_NAME_MAXLEN];
+	char		private_data[INJ_PRIVATE_MAXLEN];
 	InjectionPointCallback callback;
 } InjectionPointCacheEntry;
 
@@ -73,7 +81,8 @@ static HTAB *InjectionPointCache = NULL;
  */
 static void
 injection_point_cache_add(const char *name,
-		  InjectionPointCallback callback)
+		  InjectionPointCallback callback,
+		  const void *private_data)
 {
 	InjectionPointCacheEntry *entry;
 	bool		found;
@@ -99,6 +108,7 @@ injection_point_cache_add(const char *name,
 	Assert(!found);
 	strlcpy(entry->name, name, sizeof(entry->name));
 	entry->callback = callback;
+	memcpy(entry->private_data, private_data, INJ_PRIVATE_MAXLEN);
 }
 
 /*
@@ -124,11 +134,14 @@ injection_point_cache_remove(const char *name)
  * Retrieve an injection

Re: Weird test mixup

2024-05-08 Thread Michael Paquier
On Wed, May 08, 2024 at 08:15:53PM -0700, Noah Misch wrote:
> Yes, that would be a loss for test readability.  Also, I wouldn't be surprised
> if some test will want to attach POINT-B while POINT-A is in injection_wait().

Not impossible, still annoying with more complex scenarios.

> Various options avoiding those limitations:
> 
> 1. The data area formerly called a "status" area is immutable after attach.
>The core code copies it into a stack buffer to pass in a const callback
>argument.
> 
> 3. Move the PID check into core code.

The PID checks are specific to the module, and there could be many
more conditions, like running only in specific backend types (first
example coming to mind), so I want to avoid that kind of knowledge
in the backend.

> (1) is, I think, simple and sufficient.  How about that?

That sounds fine to do at the end.  I'm not sure how large this
chunk area added to each InjectionPointEntry should be, though.  128B
stored in each InjectionPointEntry would be more than enough, I guess?
Or more like 1024?  The in-core module does not need much currently,
but larger is likely better for pluggability.
--
Michael




Re: Weird test mixup

2024-05-08 Thread Noah Misch
On Thu, May 09, 2024 at 09:37:54AM +0900, Michael Paquier wrote:
> On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> > The problem I'm trying to tackle in this thread is to make
> > src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
> > that work, having seen the intarray test suite break when run concurrently
> > with the injection_points test suite.  That combination still does break at
> > the exit-time race condition.  To reproduce, apply this attachment to add
> > sleeps, and run:
> > 
> > make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make 
> > -C contrib/intarray installcheck USE_MODULE_DB=1
> 
> Thanks for that.  I am not really sure how to protect that without a
> small cost in flexibility for the cases of detach vs run paths.  This
> comes down to the fact that a custom routine could be run while it
> could be detached concurrently, removing any stuff a callback could
> depend on in the module.
> 
> It was mentioned upthread to add to InjectionPointCacheEntry a fixed
> area of memory that modules could use to store some "status" data, but
> it would not close the run/detach race because a backend could still
> hold a pointer to a callback, with concurrent backends playing with
> the contents of InjectionPointCacheEntry (concurrent detaches and
> attaches that would cause the same entries to be reused).

> One way to close entirely the window would be to hold
> InjectionPointLock longer in InjectionPointRun() until the callback
> finishes or until it triggers an ERROR.  This would mean that the case
> you've mentioned in [1] would change, by blocking the detach() of s3
> until the callback of s2 finishes.

> [1] https://www.postgresql.org/message-id/20240501231214...@rfd.leadboat.com

Yes, that would be a loss for test readability.  Also, I wouldn't be surprised
if some test will want to attach POINT-B while POINT-A is in injection_wait().
Various options avoiding those limitations:

1. The data area formerly called a "status" area is immutable after attach.
   The core code copies it into a stack buffer to pass in a const callback
   argument.

2. InjectionPointAttach() returns an attachment serial number, and the
   callback receives that as an argument.  injection_points.c would maintain
   an InjectionPointCondition for every still-attached serial number,
   including global attachments.  Finding no condition matching the serial
   number argument means detach already happened and callback should do
   nothing.

3. Move the PID check into core code.

4. Separate the concept of "make ineligible to fire" from "detach", with
   stronger locking for detach.  v1 pointed in this direction, though not
   using that terminology.

5. Injection point has multiple callbacks.  At least one runs with the lock
   held and can mutate the "status" data.  At least one runs without the lock.

(1) is, I think, simple and sufficient.  How about that?




Re: Weird test mixup

2024-05-08 Thread Michael Paquier
On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
>> Always resetting
>> condition->name when detaching a point is a simpler flow and saner
>> IMO.
>> 
>> Overall, this switches from one detach behavior to a different one,
> 
> Can you say more about that?  The only behavior change known to me is that a
> given injection point workload uses more of INJ_MAX_CONDITION.  If there's
> another behavior change, it was likely unintended.

As far as I read the previous patch, the conditions stored in
InjectionPointSharedState would never really be gone, even if the
points are removed from InjectionPointHash.

> The problem I'm trying to tackle in this thread is to make
> src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
> that work, having seen the intarray test suite break when run concurrently
> with the injection_points test suite.  That combination still does break at
> the exit-time race condition.  To reproduce, apply this attachment to add
> sleeps, and run:
> 
> make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C 
> contrib/intarray installcheck USE_MODULE_DB=1

Thanks for that.  I am not really sure how to protect that without a
small cost in flexibility for the detach vs run paths.  This comes
down to the fact that a custom routine could be run while it is being
detached concurrently, removing any stuff a callback could depend on
in the module.

It was mentioned upthread to add to InjectionPointCacheEntry a fixed
area of memory that modules could use to store some "status" data, but
it would not close the run/detach race because a backend could still
hold a pointer to a callback, with concurrent backends playing with
the contents of InjectionPointCacheEntry (concurrent detaches and
attaches that would cause the same entries to be reused).

One way to close entirely the window would be to hold
InjectionPointLock longer in InjectionPointRun() until the callback
finishes or until it triggers an ERROR.  This would mean that the case
you've mentioned in [1] would change, by blocking the detach() of s3
until the callback of s2 finishes.  We don't have tests in the tree
that do any of that, so holding InjectionPointLock longer would not
break anything on HEAD.  A detach being possible while the callback is
run is something I've considered as valid in d86d20f0ba79, but perhaps
that was too cute of me, even more with the use case of local points.

> Separately, I see injection_points_attach() populates InjectionPointCondition
> after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
> avoid the same sort of race?  I've not tried to reproduce that one.

Good point.  You could run into the case of a concurrent backend
running an injection point that should be local if it runs between
InjectionPointAttach() and the condition getting registered in
injection_points_attach().  That should be reversed.

[1] https://www.postgresql.org/message-id/20240501231214...@rfd.leadboat.com
--
Michael




Re: Weird test mixup

2024-05-07 Thread Noah Misch
On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> > Overall, this switches from one detach behavior to a different one,
> 
> Can you say more about that?  The only behavior change known to me is that a
> given injection point workload uses more of INJ_MAX_CONDITION.  If there's
> another behavior change, it was likely unintended.

I see patch inplace030-inj-exit-race-v1.patch does not fix the race seen with
repro-inj-exit-race-v1.patch.  I withdraw inplace030-inj-exit-race-v1.patch,
and I withdraw the above question.

> To reproduce, apply [repro-inj-exit-race-v1.patch] to add
> sleeps, and run:
> 
> make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C 
> contrib/intarray installcheck USE_MODULE_DB=1
> 
> Separately, I see injection_points_attach() populates InjectionPointCondition
> after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
> avoid the same sort of race?  I've not tried to reproduce that one.




Re: Weird test mixup

2024-05-07 Thread Noah Misch
On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> On Mon, May 06, 2024 at 02:23:24PM -0700, Noah Misch wrote:
> > Here's how I've patched it locally.  It does avoid changing the 
> > backend-side,
> > which has some attraction.  Shall I just push this?
> 
> It looks like you did not rebase on top of HEAD

Yes, the base was 713cfaf (Sunday).

> A side effect is that this causes the conditions to pile
> up on a running server when running installcheck, and assuming that
> many test suites are run on a server left running this could cause
> spurious failures when failing to find a new slot.

Yes, we'd be raising INJ_MAX_CONDITION more often under this approach.

> Always resetting
> condition->name when detaching a point is a simpler flow and saner
> IMO.
> 
> Overall, this switches from one detach behavior to a different one,

Can you say more about that?  The only behavior change known to me is that a
given injection point workload uses more of INJ_MAX_CONDITION.  If there's
another behavior change, it was likely unintended.

> which may or may not be intuitive depending on what one is looking
> for.  FWIW, I see InjectionPointCondition as something that should be
> around as long as its injection point exists, with the condition
> entirely gone once the point is detached because it should not exist
> anymore on the server running, with no information left in shmem.
> 
> Through your patch, you give conditions a different meaning: the
> definition is nominally "local", but it becomes kind of permanent as it
> keeps a trace of the point's name in shmem.  I find the behavior of the patch
> less intuitive.  Perhaps it would be interesting to see first the bug
> and/or problem you are trying to tackle with this different behavior
> as I feel like we could do something even with the module as-is.  As
> far as I understand, the implementation of the module on HEAD allows
> one to emulate a breakpoint with a wait/wake, which can avoid the
> window mentioned in step 2.  Even if a wait point is detached
> concurrently, it can be awakened with its traces in shmem removed.

The problem I'm trying to tackle in this thread is to make
src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
that work, having seen the intarray test suite break when run concurrently
with the injection_points test suite.  That combination still does break at
the exit-time race condition.  To reproduce, apply this attachment to add
sleeps, and run:

make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C 
contrib/intarray installcheck USE_MODULE_DB=1

Separately, I see injection_points_attach() populates InjectionPointCondition
after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
avoid the same sort of race?  I've not tried to reproduce that one.
diff --git a/src/test/modules/gin/expected/gin_incomplete_splits.out 
b/src/test/modules/gin/expected/gin_incomplete_splits.out
index 15574e5..4bc5ef1 100644
--- a/src/test/modules/gin/expected/gin_incomplete_splits.out
+++ b/src/test/modules/gin/expected/gin_incomplete_splits.out
@@ -126,6 +126,12 @@ SELECT 
injection_points_attach('gin-leave-leaf-split-incomplete', 'error');
 select insert_until_fail(:next_i) as next_i
 \gset
 NOTICE:  failed with: error triggered for injection point 
gin-leave-leaf-split-incomplete
+SELECT pg_sleep(10);
+ pg_sleep 
+--
+ 
+(1 row)
+
 SELECT injection_points_detach('gin-leave-leaf-split-incomplete');
  injection_points_detach 
 -
diff --git a/src/test/modules/gin/sql/gin_incomplete_splits.sql 
b/src/test/modules/gin/sql/gin_incomplete_splits.sql
index ebf0f62..fb66bac 100644
--- a/src/test/modules/gin/sql/gin_incomplete_splits.sql
+++ b/src/test/modules/gin/sql/gin_incomplete_splits.sql
@@ -118,6 +118,7 @@ select verify(:next_i);
 SELECT injection_points_attach('gin-leave-leaf-split-incomplete', 'error');
 select insert_until_fail(:next_i) as next_i
 \gset
+SELECT pg_sleep(10);
 SELECT injection_points_detach('gin-leave-leaf-split-incomplete');
 
 -- Verify that a scan works even though there's an incomplete split
diff --git a/src/test/modules/injection_points/injection_points.c 
b/src/test/modules/injection_points/injection_points.c
index a74a4a2..f9e8a1f 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "fmgr.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "storage/condition_variable.h"
 #include "storage/dsm_registry.h"
@@ -213,6 +214,14 @@ injection_points_cleanup(int code, Datum arg)
 void
 injection_error(const char *name)
 {
+   if (strcmp(name, "gin-leave-leaf-split-incomplete") == 0 &&
+   !injection_point_allowed(name))
+   {
+   sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+   pg_usleep(20 * USECS_PER_SEC); /* let other suite detach */
+   sigprocmask

Re: Weird test mixup

2024-05-06 Thread Michael Paquier
On Mon, May 06, 2024 at 02:23:24PM -0700, Noah Misch wrote:
> Here's how I've patched it locally.  It does avoid changing the backend-side,
> which has some attraction.  Shall I just push this?

It looks like you did not rebase on top of HEAD to avoid the spinlock
taken with InjectionPointDetach() in the shmem callback.  I think that
you'd mean the attached, once rebased (apologies I've reset the author
field).

+ *s1: local-attach to POINT
+ *s2: yield CPU before InjectionPointRun(POINT) calls injection_callback()
+ *s1: exit()
+ *s2: run POINT as though it had been non-local

I see.  So you are taking a shortcut by never resetting the name of a
condition, so that the point of step 4 can check the runtime condition
against a condition still stored even though the point has been
detached and removed from the hash table.

 if (strcmp(condition->name, name) == 0)
 {
+condition->valid = false;
 condition->pid = 0;
-condition->name[0] = '\0';
 }
 }

As of HEAD, we rely on InjectionPointCondition->name to be set to
check if a condition is valid.  Your patch adds a different variable
to do mostly the same thing, and never clears the name in any existing
conditions.  A side effect is that this causes the conditions to pile
up on a running server when running installcheck; assuming that many
test suites are run on a server left running, this could cause
spurious failures when it fails to find a new slot.  Always resetting
condition->name when detaching a point is a simpler flow and saner
IMO.

Overall, this switches from one detach behavior to a different one,
which may or may not be intuitive depending on what one is looking
for.  FWIW, I see InjectionPointCondition as something that should be
around as long as its injection point exists, with the condition
entirely gone once the point is detached because it should not exist
anymore on the server running, with no information left in shmem.

Through your patch, you give conditions a different meaning: the
definition is nominally "local", but it becomes kind of permanent as it
keeps a trace of the point's name in shmem.  I find the behavior of the patch
less intuitive.  Perhaps it would be interesting to see first the bug
and/or problem you are trying to tackle with this different behavior
as I feel like we could do something even with the module as-is.  As
far as I understand, the implementation of the module on HEAD allows
one to emulate a breakpoint with a wait/wake, which can avoid the
window mentioned in step 2.  Even if a wait point is detached
concurrently, it can be awakened with its traces in shmem removed.
--
Michael
From 9cf38b2f6dadd5b4bb81fe5ccea350d259e6a241 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 7 May 2024 10:15:28 +0900
Subject: [PATCH v2] Fix race condition in backend exit after
 injection_points_set_local().

Symptoms were like those prompting commit
f587338dec87d3c35b025e131c5977930ac69077.  That is, under installcheck,
backends from unrelated tests could run the injection points.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME
---
 .../injection_points/injection_points.c   | 27 ++-
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index a74a4a28af..8c9a3ebd74 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -37,7 +37,13 @@ PG_MODULE_MAGIC;
 
 /*
  * Conditions related to injection points.  This tracks in shared memory the
- * runtime conditions under which an injection point is allowed to run.
+ * runtime conditions under which an injection point is allowed to run.  Once
+ * set, a name is never changed.  That avoids a race condition:
+ *
+ *	s1: local-attach to POINT
+ *	s2: yield CPU before InjectionPointRun(POINT) calls injection_callback()
+ *	s1: exit()
+ *	s2: run POINT as though it had been non-local
  *
  * If more types of runtime conditions need to be tracked, this structure
  * should be expanded.
@@ -49,6 +55,9 @@ typedef struct InjectionPointCondition
 
 	/* ID of the process where the injection point is allowed to run */
 	int			pid;
+
+	/* Should "pid" run this injection point, or not? */
+	bool		valid;
 } InjectionPointCondition;
 
 /* Shared state information for injection points. */
@@ -133,7 +142,7 @@ injection_point_allowed(const char *name)
 	{
 		InjectionPointCondition *condition = &inj_state->conditions[i];
 
-		if (strcmp(condition->name, name) == 0)
+		if (condition->valid && strcmp(condition->name, name) == 0)
 		{
 			/*
 			 * Check if this injection point is allowed to run in this
@@ -175,9 +184,11 @@ injection_points_cleanup(int code, Datum arg)
 	{
 		InjectionPointCondition *condition = &inj_state->conditions[i];
 
-		if (condition->name[0] == '\0')
+		if (!condition->valid

Re: Weird test mixup

2024-05-06 Thread Noah Misch
On Mon, May 06, 2024 at 10:03:37AM +0900, Michael Paquier wrote:
> On Thu, May 02, 2024 at 12:35:55PM -0700, Noah Misch wrote:
> > I should have given a simpler example:
> > 
> > s1: local-attach to POINT
> > s2: enter InjectionPointRun(POINT), yield CPU just before 
> > injection_callback()
> > s1: exit
> > s2: wake up and run POINT as though it had been non-local

Here's how I've patched it locally.  It does avoid changing the
backend side, which has some attraction.  Shall I just push this?

> Hmm.  Even if you were to emulate that in a controlled manner, you
> would need a second injection point that does a wait in s2, which is
> something that would happen before injection_callback() and before
> scanning the local entry.  This relies on the fact that we're holding
> CPU in s2 between the backend shmem hash table lookup and the callback
> being called.

Right.  We would need "second-level injection points" to write a test for that
race in the injection point system.
Author: Noah Misch 
Commit: Noah Misch 

Fix race condition in backend exit after injection_points_set_local().

Symptoms were like those prompting commit
f587338dec87d3c35b025e131c5977930ac69077.  That is, under installcheck,
backends from unrelated tests could run the injection points.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/modules/injection_points/injection_points.c 
b/src/test/modules/injection_points/injection_points.c
index ace386d..75466d3 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -37,7 +37,13 @@ PG_MODULE_MAGIC;
 
 /*
  * Conditions related to injection points.  This tracks in shared memory the
- * runtime conditions under which an injection point is allowed to run.
+ * runtime conditions under which an injection point is allowed to run.  Once
+ * set, a name is never changed.  That avoids a race condition:
+ *
+ * s1: local-attach to POINT
+ * s2: yield CPU before InjectionPointRun(POINT) calls injection_callback()
+ * s1: exit()
+ * s2: run POINT as though it had been non-local
  *
  * If more types of runtime conditions need to be tracked, this structure
  * should be expanded.
@@ -49,6 +55,9 @@ typedef struct InjectionPointCondition
 
/* ID of the process where the injection point is allowed to run */
int pid;
+
+   /* Should "pid" run this injection point, or not? */
+   boolvalid;
 } InjectionPointCondition;
 
 /* Shared state information for injection points. */
@@ -133,7 +142,7 @@ injection_point_allowed(const char *name)
{
InjectionPointCondition *condition = &inj_state->conditions[i];
 
-   if (strcmp(condition->name, name) == 0)
+   if (condition->valid && strcmp(condition->name, name) == 0)
{
/*
 * Check if this injection point is allowed to run in 
this
@@ -168,15 +177,16 @@ injection_points_cleanup(int code, Datum arg)
{
InjectionPointCondition *condition = &inj_state->conditions[i];
 
-   if (condition->name[0] == '\0')
+   if (!condition->valid)
continue;
+   Assert(condition->name[0] != '\0');
 
if (condition->pid != MyProcPid)
continue;
 
/* Detach the injection point and unregister condition */
InjectionPointDetach(condition->name);
-   condition->name[0] = '\0';
+   condition->valid = false;
condition->pid = 0;
}
SpinLockRelease(&inj_state->lock);
@@ -299,11 +309,13 @@ injection_points_attach(PG_FUNCTION_ARGS)
{
InjectionPointCondition *condition = 
&inj_state->conditions[i];
 
-   if (condition->name[0] == '\0')
+   if (strcmp(condition->name, name) == 0 ||
+   condition->name[0] == '\0')
{
index = i;
strlcpy(condition->name, name, INJ_NAME_MAXLEN);
condition->pid = MyProcPid;
+   condition->valid = true;
break;
}
}
@@ -416,8 +428,8 @@ injection_points_detach(PG_FUNCTION_ARGS)
 
if (strcmp(condition->name, name) == 0)
{
+   condition->valid = false;
condition->pid = 0;
-   condition->name[0] = '\0';
}
}
SpinLockRelease(&inj_state->lock);


Re: Weird test mixup

2024-05-05 Thread Michael Paquier
On Thu, May 02, 2024 at 12:35:55PM -0700, Noah Misch wrote:
> I should have given a simpler example:
> 
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s1: exit
> s2: wake up and run POINT as though it had been non-local

Hmm.  Even if you were to emulate that in a controlled manner, you
would need a second injection point that does a wait in s2, which is
something that would happen before injection_callback() and before
scanning the local entry.  This relies on the fact that we're holding
CPU in s2 between the backend shmem hash table lookup and the callback
being called.

>> Fun.  One thing I would ask is why it makes sense to be able to detach
>> a local point from a different session than the one who defined it as
>> local.  Shouldn't the operation of s3 be restricted rather than
>> authorized as a safety measure, instead?
> 
> (That's orthogonal to the race condition.)  When s1 would wait at the
> injection point multiple times in one SQL statement, I like issuing the detach
> from s3 so s1 waits at just the first encounter with the injection point.
> This mimics setting a gdb breakpoint and deleting that breakpoint before
> "continue".  The alternative, waking s1 repeatedly until it finishes the SQL
> statement, is less convenient.  (I also patched _detach() to wake the waiter,
> and I plan to propose that.)

Okay.

>> Indeed.  That's a brain fade.  This one could be fixed by collecting
>> the point names when cleaning up the conditions and detach after
>> releasing the spinlock.  This opens a race condition between the
>> moment when the spinlock is released and the detach, where another
>> backend could come in and detach a point before the shmem_exit
>> callback has the time to do its cleanup, even if detach() is
>> restricted for local points.  So we could do the callback cleanup in
> 
> That race condition seems fine.  The test can be expected to control the
> timing of backend exits vs. detach calls.  Unlike the InjectionPointRun()
> race, it wouldn't affect backends unrelated to the test.

Sure.  The fact that there are two spinlocks, one in the backend code
and one in the module, opens concurrency issues.  Making that more
robust if there is a case for it is OK by me, but I'd rather avoid
making the backend side more complicated than it needs to be.

>> three steps in the shmem exit callback: 
>> - Collect the names of the points to detach, while holding the
>> spinlock.
>> - Do the Detach.
>> - Take again the spinlock, clean up the conditions.
>> 
>> Please see the attached.
> 
> The injection_points_cleanup() parts look good.  Thanks.

Thanks for the check.

>> @@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
>>  {
>>  char   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
>>  
>> +if (!injection_point_allowed(name))
>> +elog(ERROR, "cannot detach injection point \"%s\" not allowed 
>> to run",
>> + name);
>> +
> 
> As above, I disagree with the injection_points_detach() part.

Okay, noted.  Fine by me to expand this stuff as you see fit; the code
has been written to be extended depending on what people want to
support.  There should be tests in the tree that rely on any
new behavior, though.

I've applied the patch to fix the spinlock logic in the exit callback
for now.
--
Michael


signature.asc
Description: PGP signature


Re: Weird test mixup

2024-05-04 Thread Michael Paquier
On Thu, May 02, 2024 at 01:52:20PM +0500, Andrey M. Borodin wrote:
> As far as I understand this will effectively forbid calling
> injection_points_detach() for a local injection point of another
> backend. Do I get it right?

Yes, that would be the intention.  Noah has other use cases in mind
with this interface that I did not think about supporting, hence he's
objecting to this restriction.
--
Michael




Re: Weird test mixup

2024-05-02 Thread Noah Misch
On Thu, May 02, 2024 at 04:27:12PM +0900, Michael Paquier wrote:
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> > While writing an injection point test, I encountered a variant of the race
> > condition that f4083c4 fixed.  It had three sessions and this sequence of
> > events:
> > 
> > s1: local-attach to POINT
> > s2: enter InjectionPointRun(POINT), yield CPU just before 
> > injection_callback()
> > s3: detach POINT, deleting the InjectionPointCondition record
> > s2: wake up and run POINT as though it had been non-local

I should have given a simpler example:

s1: local-attach to POINT
s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
s1: exit
s2: wake up and run POINT as though it had been non-local

> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

(That's orthogonal to the race condition.)  When s1 would wait at the
injection point multiple times in one SQL statement, I like issuing the detach
from s3 so s1 waits at just the first encounter with the injection point.
This mimics setting a gdb breakpoint and deleting that breakpoint before
"continue".  The alternative, waking s1 repeatedly until it finishes the SQL
statement, is less convenient.  (I also patched _detach() to wake the waiter,
and I plan to propose that.)
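Spelled out, the gdb-like flow is something like the following sketch,
assuming the wake-on-detach behavior of _detach() that is planned but
not yet proposed:

```
s1: local-attach POINT with a 'wait' callback
s1: run a SQL statement that would reach POINT many times; it waits at
    the first encounter
s3: detach POINT    -- like deleting a gdb breakpoint before "continue"
s3: wake up s1      -- with the planned change, the detach does this itself
s1: finishes the SQL statement without waiting at later encounters
```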

> > Separately, injection_points_cleanup() breaks the rules by calling
> > InjectionPointDetach() while holding a spinlock.  The latter has an
> > elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I 
> > haven't
> > given as much thought to solutions for this one.
> 
> Indeed.  That's a brain fade.  This one could be fixed by collecting
> the point names when cleaning up the conditions and detach after
> releasing the spinlock.  This opens a race condition between the
> moment when the spinlock is released and the detach, where another
> backend could come in and detach a point before the shmem_exit
> callback has the time to do its cleanup, even if detach() is
> restricted for local points.  So we could do the callback cleanup in

That race condition seems fine.  The test can be expected to control the
timing of backend exits vs. detach calls.  Unlike the InjectionPointRun()
race, it wouldn't affect backends unrelated to the test.

> three steps in the shmem exit callback: 
> - Collect the names of the points to detach, while holding the
> spinlock.
> - Do the Detach.
> - Take again the spinlock, clean up the conditions.
> 
> Please see the attached.

The injection_points_cleanup() parts look good.  Thanks.

> @@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
>  {
>   char   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
>  
> + if (!injection_point_allowed(name))
> + elog(ERROR, "cannot detach injection point \"%s\" not allowed 
> to run",
> +  name);
> +

As above, I disagree with the injection_points_detach() part.




Re: Weird test mixup

2024-05-02 Thread Andrey M. Borodin



> On 2 May 2024, at 13:43, Michael Paquier  wrote:
> 
> A detach is not a wakeup.

Oh, now I see. Sorry for the noise.

Detaching the local injection point of another backend seems useless and
could be forbidden.
As far as I understand, your patch is already doing this in
+   if (!injection_point_allowed(name))
+   elog(ERROR, "cannot detach injection point \"%s\" not allowed 
to run",
+name);
+

As far as I understand this will effectively forbid calling
injection_points_detach() for a local injection point of another backend.
Do I get it right?


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-05-02 Thread Michael Paquier
On Thu, May 02, 2024 at 01:33:45PM +0500, Andrey M. Borodin wrote:
> That seems to prevent a meaningful use case. If we want exactly one
> session to be waiting just before some specific point, the only way
> to achieve this is to create a local injection point. But the session
> must be resumable from another session.
> Without this, local waiting injection points are meaningless.

I am not quite sure to follow your argument here.  It is still
possible to attach a local injection point with a wait callback that
can be awaken by a different backend:
s1: select injection_points_set_local();
s1: select injection_points_attach('popo', 'wait');
s1: select injection_points_run('popo'); -- waits
s2: select injection_points_wakeup('popo');
s1: -- ready for action.

A detach is not a wakeup.
--
Michael




Re: Weird test mixup

2024-05-02 Thread Andrey M. Borodin



> On 2 May 2024, at 12:27, Michael Paquier  wrote:
> 
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
>> While writing an injection point test, I encountered a variant of the race
>> condition that f4083c4 fixed.  It had three sessions and this sequence of
>> events:
>> 
>> s1: local-attach to POINT
>> s2: enter InjectionPointRun(POINT), yield CPU just before 
>> injection_callback()
>> s3: detach POINT, deleting the InjectionPointCondition record
>> s2: wake up and run POINT as though it had been non-local
> 
> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

That seems to prevent a meaningful use case. If we want exactly one session
to be waiting just before some specific point, the only way to achieve this
is to create a local injection point. But the session must be resumable from
another session.
Without this, local waiting injection points are meaningless.


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-05-02 Thread Michael Paquier
On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> While writing an injection point test, I encountered a variant of the race
> condition that f4083c4 fixed.  It had three sessions and this sequence of
> events:
> 
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s3: detach POINT, deleting the InjectionPointCondition record
> s2: wake up and run POINT as though it had been non-local

Fun.  One thing I would ask is why it makes sense to be able to detach
a local point from a different session than the one who defined it as
local.  Shouldn't the operation of s3 be restricted rather than
authorized as a safety measure, instead?

> On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
>> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
>>> Wrt. the spinlock and shared memory handling, I think this would be simpler
>>> if you could pass some payload in the InjectionPointAttach() call, which
>>> would be passed back to the callback function:
>>> 
>>> In this case, the payload would be the "slot index" in shared memory.
>>> 
>>> Or perhaps always allocate, say, 1024 bytes of working area for every
>>> attached injection point that the test module can use any way it wants. Like
>>> for storing extra conditions, or for the wakeup counter stuff in
>>> injection_wait(). A fixed size working area is a little crude, but would be
>>> very handy in practice.
> 
> That would be one good way to solve it.  (Storing a slot index has the same
> race condition, but it fixes the race to store a struct containing the PID.)
> 
> The best alternative I see is to keep an InjectionPointCondition forever after
> creating it.  Give it a "bool valid" field that we set on detach.  I don't see
> a major reason to prefer one of these over the other.  One puts a negligible
> amount of memory pressure on the main segment, but it simplifies the module
> code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
> opinions?

If more state data is needed, the fixed area injection_point.c would
be better.  Still, I am not sure that this is required here, either.

> Separately, injection_points_cleanup() breaks the rules by calling
> InjectionPointDetach() while holding a spinlock.  The latter has an
> elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
> given as much thought to solutions for this one.

Indeed.  That's a brain fade.  This one could be fixed by collecting
the point names when cleaning up the conditions and detach after
releasing the spinlock.  This opens a race condition between the
moment when the spinlock is released and the detach, where another
backend could come in and detach a point before the shmem_exit
callback has the time to do its cleanup, even if detach() is
restricted for local points.  So we could do the callback cleanup in
three steps in the shmem exit callback: 
- Collect the names of the points to detach, while holding the
spinlock.
- Do the Detach.
- Take again the spinlock, clean up the conditions.

Please see the attached.
--
Michael
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index ace386da50..caf58ef9db 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -159,10 +159,39 @@ injection_point_allowed(const char *name)
 static void
 injection_points_cleanup(int code, Datum arg)
 {
+	char		names[INJ_MAX_CONDITION][INJ_NAME_MAXLEN] = {0};
+	int			count = 0;
+
 	/* Leave if nothing is tracked locally */
 	if (!injection_point_local)
 		return;
 
+	/*
+	 * This is done in three steps: detect the points to detach,
+	 * detach them and release their conditions.
+	 */
+	SpinLockAcquire(&inj_state->lock);
+	for (int i = 0; i < INJ_MAX_CONDITION; i++)
+	{
+		InjectionPointCondition *condition = &inj_state->conditions[i];
+
+		if (condition->name[0] == '\0')
+			continue;
+
+		if (condition->pid != MyProcPid)
+			continue;
+
+		/* Extract the point name to detach */
+		strlcpy(names[count], condition->name, INJ_NAME_MAXLEN);
+		count++;
+	}
+	SpinLockRelease(&inj_state->lock);
+
+	/* Detach, without holding the spinlock */
+	for (int i = 0; i < count; i++)
+		InjectionPointDetach(names[i]);
+
+	/* Clear all the conditions */
 	SpinLockAcquire(&inj_state->lock);
 	for (int i = 0; i < INJ_MAX_CONDITION; i++)
 	{
@@ -174,8 +203,6 @@ injection_points_cleanup(int code, Datum arg)
 		if (condition->pid != MyProcPid)
 			continue;
 
-		/* Detach the injection point and unregister condition */
-		InjectionPointDetach(condition->name);
 		condition->name[0] = '\0';
 		condition->pid = 0;
 	}
@@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
 {
 	char	   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
 
+	if (!injection_point_allowed(name))
+		elog(ERROR, "cannot detach injection point \"%s\" not allowed to run",
+			 name);

Re: Weird test mixup

2024-05-01 Thread Noah Misch
While writing an injection point test, I encountered a variant of the race
condition that f4083c4 fixed.  It had three sessions and this sequence of
events:

s1: local-attach to POINT
s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
s3: detach POINT, deleting the InjectionPointCondition record
s2: wake up and run POINT as though it had been non-local

On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
> > Wrt. the spinlock and shared memory handling, I think this would be simpler
> > if you could pass some payload in the InjectionPointAttach() call, which
> > would be passed back to the callback function:
> > 
> > In this case, the payload would be the "slot index" in shared memory.
> > 
> > Or perhaps always allocate, say, 1024 bytes of working area for every
> > attached injection point that the test module can use any way it wants. Like
> > for storing extra conditions, or for the wakeup counter stuff in
> > injection_wait(). A fixed size working area is a little crude, but would be
> > very handy in practice.

That would be one good way to solve it.  (Storing a slot index has the same
race condition, but it fixes the race to store a struct containing the PID.)

The best alternative I see is to keep an InjectionPointCondition forever after
creating it.  Give it a "bool valid" field that we set on detach.  I don't see
a major reason to prefer one of these over the other.  One puts a negligible
amount of memory pressure on the main segment, but it simplifies the module
code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
opinions?


Separately, injection_points_cleanup() breaks the rules by calling
InjectionPointDetach() while holding a spinlock.  The latter has an
elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
given as much thought to solutions for this one.

Thanks,
nm




Re: Weird test mixup

2024-04-09 Thread Michael Paquier
On Tue, Apr 09, 2024 at 12:41:57PM +0900, Michael Paquier wrote:
> Applied that after a second check.  And thanks to Bharath for the
> poke.

And now that the buildfarm is cooler, I've also applied the final
patch in the series as of 5105c9079681 to make the GIN module
concurrent-safe using injection_points_set_local().
--
Michael




Re: Weird test mixup

2024-04-08 Thread Michael Paquier
On Mon, Apr 08, 2024 at 12:29:43PM +0300, Andrey M. Borodin wrote:
> On 8 Apr 2024, at 11:55, Michael Paquier  wrote:
>> the point of the test is to make sure that the local cleanup happens
>
> Uh, I did not understand this, because the commit message was about
> stabilizing tests, not extending coverage.

Okay, it is about stabilizing an existing test.

> Also, should we drop function wait_pid() at the end of a test?

Sure.

> Given that tweaks with --dlpath are nothing new, I think the patch looks good.

Applied that after a second check.  And thanks to Bharath for the
poke.
--
Michael




Re: Weird test mixup

2024-04-08 Thread Andrey M. Borodin



> On 8 Apr 2024, at 11:55, Michael Paquier  wrote:
> 
>  the point of the test is to make sure that the local
> cleanup happens
Uh, I did not understand this, because the commit message was about
stabilizing tests, not extending coverage.
Also, should we drop function wait_pid() at the end of a test?
Given that tweaks with --dlpath are nothing new, I think the patch looks
good.


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-04-08 Thread Michael Paquier
On Mon, Apr 08, 2024 at 10:42:08AM +0300, Andrey M. Borodin wrote:
> As an alternative we can make local injection points mutually exclusive.

Sure.  Now, the point of the test is to make sure that the local
cleanup happens, so I'd rather keep it as-is and use the same names
across reloads.
--
Michael




Re: Weird test mixup

2024-04-08 Thread Andrey M. Borodin



> On 8 Apr 2024, at 10:33, Michael Paquier  wrote:
> 
> Thoughts?

As an alternative we can make local injection points mutually exclusive.


Best regards, Andrey Borodin.




Re: Weird test mixup

2024-04-08 Thread Michael Paquier
On Mon, Apr 08, 2024 at 10:22:40AM +0900, Michael Paquier wrote:
> For now I have applied 997db123c054 to make the GIN tests with
> injection points repeatable as it was an independent issue, and
> f587338dec87 to add the local function pieces.

Bharath has reported to me offlist that one of the new tests has a race
condition when doing the reconnection.  When the backend creating the
local points is very slow to exit, the backend created after the
reconnection may detect that a local point previously created still
exists, causing a failure.  The failure can be reproduced with a sleep
in the shmem exit callback, like:
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -163,6 +163,8 @@ injection_points_cleanup(int code, Datum arg)
if (!injection_point_local)
return;
 
+   pg_usleep(100 * 1L);
+
SpinLockAcquire(&inj_state->lock);
for (int i = 0; i < INJ_MAX_CONDITION; i++)
{

At first I was looking at a loop with a scan of pg_stat_activity, but
I've noticed that regress.so includes a wait_pid() that we can use to
make sure that a given process exits before moving on to the next
parts of a test, so I propose to just reuse that here.  This requires
tweaks with --dlpath for meson and ./configure, nothing new.  The CI
is clean.  Patch attached.

Thoughts?
--
Michael
From 8f43807000c1f33a0238d7bbcc148a196e4134e4 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 8 Apr 2024 16:28:17 +0900
Subject: [PATCH] Stabilize injection point test

---
 src/test/modules/injection_points/Makefile   |  1 +
 .../expected/injection_points.out| 16 
 src/test/modules/injection_points/meson.build|  1 +
 .../injection_points/sql/injection_points.sql| 15 +++
 4 files changed, 33 insertions(+)

diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 1cb395c37a..31bd787994 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -7,6 +7,7 @@ DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
 REGRESS = injection_points
+REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 # The injection points are cluster-wide, so disable installcheck
 NO_INSTALLCHECK = 1
diff --git a/src/test/modules/injection_points/expected/injection_points.out b/src/test/modules/injection_points/expected/injection_points.out
index 3d94016ac9..d2a69b54e8 100644
--- a/src/test/modules/injection_points/expected/injection_points.out
+++ b/src/test/modules/injection_points/expected/injection_points.out
@@ -1,4 +1,11 @@
 CREATE EXTENSION injection_points;
+\getenv libdir PG_LIBDIR
+\getenv dlsuffix PG_DLSUFFIX
+\set regresslib :libdir '/regress' :dlsuffix
+CREATE FUNCTION wait_pid(int)
+  RETURNS void
+  AS :'regresslib'
+  LANGUAGE C STRICT;
 SELECT injection_points_attach('TestInjectionBooh', 'booh');
 ERROR:  incorrect action "booh" for injection point creation
 SELECT injection_points_attach('TestInjectionError', 'error');
@@ -156,8 +163,17 @@ NOTICE:  notice triggered for injection point TestConditionLocal2
  
 (1 row)
 
+select pg_backend_pid() as oldpid \gset
 -- reload, local injection points should be gone.
 \c
+-- Wait for the previous backend process to exit, ensuring that its local
+-- injection points are cleaned up.
+SELECT wait_pid(:'oldpid');
+ wait_pid 
+--
+ 
+(1 row)
+
 SELECT injection_points_run('TestConditionLocal1'); -- nothing
  injection_points_run 
 --
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index a29217f75f..8e1b5b4539 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -33,6 +33,7 @@ tests += {
 'sql': [
   'injection_points',
 ],
+'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
 # The injection points are cluster-wide, so disable installcheck
 'runningcheck': false,
   },
diff --git a/src/test/modules/injection_points/sql/injection_points.sql b/src/test/modules/injection_points/sql/injection_points.sql
index 2aa76a542b..e18146eb8c 100644
--- a/src/test/modules/injection_points/sql/injection_points.sql
+++ b/src/test/modules/injection_points/sql/injection_points.sql
@@ -1,5 +1,14 @@
 CREATE EXTENSION injection_points;
 
+\getenv libdir PG_LIBDIR
+\getenv dlsuffix PG_DLSUFFIX
+\set regresslib :libdir '/regress' :dlsuffix
+
+CREATE FUNCTION wait_pid(int)
+  RETURNS void
+  AS :'regresslib'
+  LANGUAGE C STRICT;
+
 SELECT injection_points_attach('TestInjectionBooh', 'booh');
 SELECT injection_points_attach('TestInjectionError', 'error');
 SELECT injection_points_attach('TestInjectionLog', 'notice');
@@ -40,8 +49,14 @@ SELECT injection_points_attach('TestConditionLocal1', 'error');
 SELECT injection_points_at

Re: Weird test mixup

2024-04-07 Thread Michael Paquier
On Sat, Apr 06, 2024 at 10:34:46AM +0500, Andrey M. Borodin wrote:
> I find the name of the function "injection_points_local()" strange,
> because there is no verb in it. How about
> "injection_points_set_local"? 

That makes sense.

> I'm not sure if we should refactor anything here, but
> InjectionPointSharedState has a singular name, a plural wait_counts,
> and a singular condition.
> InjectionPointSharedState already holds an array of injection points;
> maybe we could add an optional pid there instead of inventing a
> separate array of pids?

Perhaps we could unify these two concepts, indeed, with a "kind" added
to InjectionPointCondition.  Now waits/wakeups are a different beast
than the conditions that could be assigned to a point to filter
whether it should be executed.  More runtime conditions that
immediately come to mind, and that could be added to this structure,
relate mostly to global objects, like:
- Specific database name and/or OID.
- Specific role(s).
So that's mostly cross-checking states coming from miscadmin.h for
now.

> Can we set a global point to 'notice', but the same local one to
> 'wait'? Looks like now we can't, but allowing to do so would make the
> code simpler.

You mean using the same point name with more than one callback?  Not
sure we'd want to be able to do that.  Perhaps you're right, though,
if there is a use case that justifies it.

> Besides this opportunity to simplify stuff, both patches looks good
> to me.

Yeah, this module can always be tweaked more if necessary.  That
said, naming the new thing "condition" in InjectionPointSharedState
felt strange, as you said, because it is an array of multiple
conditions.

For now I have applied 997db123c054 to make the GIN tests with
injection points repeatable as it was an independent issue, and
f587338dec87 to add the local function pieces.

Attached is the last piece to switch the GIN test to use local
injection points.  85f65d7a26fc should keep the buildfarm at bay,
but I'd rather not take a bet and accidentally freeze the buildfarm as
it would impact folks who aim at getting patches committed just before
the finish line.  So I am holding on this one for a few more days
until we're past the freeze and the buildfarm is more stable.
--
Michael
From bc2ce072cad0efa94084297330db3102ee346c25 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 8 Apr 2024 10:20:19 +0900
Subject: [PATCH v3] Switch GIN tests with injection points to be
 concurrent-safe

---
 src/test/modules/gin/Makefile   | 3 ---
 src/test/modules/gin/expected/gin_incomplete_splits.out | 7 +++
 src/test/modules/gin/meson.build| 2 --
 src/test/modules/gin/sql/gin_incomplete_splits.sql  | 3 +++
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/src/test/modules/gin/Makefile b/src/test/modules/gin/Makefile
index da4c9cea5e..e007e38ac2 100644
--- a/src/test/modules/gin/Makefile
+++ b/src/test/modules/gin/Makefile
@@ -4,9 +4,6 @@ EXTRA_INSTALL = src/test/modules/injection_points
 
 REGRESS = gin_incomplete_splits
 
-# The injection points are cluster-wide, so disable installcheck
-NO_INSTALLCHECK = 1
-
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/gin/expected/gin_incomplete_splits.out b/src/test/modules/gin/expected/gin_incomplete_splits.out
index 973a8ce6c8..15574e547a 100644
--- a/src/test/modules/gin/expected/gin_incomplete_splits.out
+++ b/src/test/modules/gin/expected/gin_incomplete_splits.out
@@ -12,6 +12,13 @@
 -- This uses injection points to cause errors that leave some page
 -- splits in "incomplete" state
 create extension injection_points;
+-- Make all injection points local to this process, for concurrency.
+SELECT injection_points_set_local();
+ injection_points_set_local 
+
+ 
+(1 row)
+
 -- Use the index for all the queries
 set enable_seqscan=off;
 -- Print a NOTICE whenever an incomplete split gets fixed
diff --git a/src/test/modules/gin/meson.build b/src/test/modules/gin/meson.build
index 5ec0760a27..9734b51de2 100644
--- a/src/test/modules/gin/meson.build
+++ b/src/test/modules/gin/meson.build
@@ -12,7 +12,5 @@ tests += {
 'sql': [
   'gin_incomplete_splits',
 ],
-# The injection points are cluster-wide, so disable installcheck
-'runningcheck': false,
   },
 }
diff --git a/src/test/modules/gin/sql/gin_incomplete_splits.sql b/src/test/modules/gin/sql/gin_incomplete_splits.sql
index ea3667b38d..ebf0f620f0 100644
--- a/src/test/modules/gin/sql/gin_incomplete_splits.sql
+++ b/src/test/modules/gin/sql/gin_incomplete_splits.sql
@@ -14,6 +14,9 @@
 -- splits in "incomplete" state
 create extension injection_points;
 
+-- Make all injection points local to this process, for concurrency.
+SELECT injection_points_set_local();
+
 -- Use the index for all the queries
 set enable_seqscan=off;
 
-- 
2.43.0



signature.asc
Description: PGP signature


Re: Weird test mixup

2024-04-05 Thread Andrey M. Borodin



> On 5 Apr 2024, at 07:19, Michael Paquier  wrote:
> 
> It's been a couple of weeks since this has been sent, and this did not
> get any reviews.  I'd still be happy with the simplicity of a single
> injection_points_local() that can be used to link all the injection
> points created in a single process to it, discarding them once the
> process exits, with a shmem exit callback.

OK, makes sense.
I find the name of the function "injection_points_local()" strange, because 
there is no verb in it.  How about "injection_points_set_local"?

>  And I don't really see an
> argument to tweak the backend-side routines, as well.
>  Comments and/or
> objections?

I'm not sure if we should refactor anything here, but InjectionPointSharedState 
has a singular name, a plural wait_counts, and a singular condition.
InjectionPointSharedState already holds an array of injection points; maybe 
let's add an optional pid there instead of inventing a separate array of pids?

Can we set a global point to 'notice', but the same local one to 'wait'? Looks 
like we can't right now, but allowing that would make the code simpler.

Besides this opportunity to simplify stuff, both patches look good to me.


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-04-04 Thread Michael Paquier
On Mon, Mar 18, 2024 at 10:04:45AM +0900, Michael Paquier wrote:
> Please find a patch to do exactly that, without touching the backend
> APIs.  0001 adds a new function call injection_points_local() that can
> be added on top of a SQL test to make it concurrent-safe.  0002 is the
> fix for the GIN tests.
> 
> I am going to add an open item to not forget about all that.

It's been a couple of weeks since this has been sent, and this did not
get any reviews.  I'd still be happy with the simplicity of a single
injection_points_local() that can be used to link all the injection
points created in a single process to it, discarding them once the
process exits, with a shmem exit callback.  And I don't really see an
argument to tweak the backend-side routines, as well.  Comments and/or
objections?
--
Michael




Re: Weird test mixup

2024-03-17 Thread Michael Paquier
On Mon, Mar 18, 2024 at 10:50:25AM +0500, Andrey M. Borodin wrote:
> Maybe consider function injection_points_attach_local(‘point name’)
> instead of a static switch?
> Or even injection_points_attach_global(‘point name’), while function
> injection_points_attach(‘point name’) will be global? This would
> favour writing concurrent tests by default… 

The point is to limit accidents like the one in this thread.  So, for 
cases already in the tree, not giving the point name in the SQL
function would be simple enough.

What you are suggesting can be simply done, as well, though I'd rather
wait for a reason to justify doing so.
--
Michael




Re: Weird test mixup

2024-03-17 Thread Andrey M. Borodin



> On 18 Mar 2024, at 06:04, Michael Paquier  wrote:
> 
> new function call injection_points_local() that can
> be added on top of a SQL test to make it concurrent-safe.

Maybe consider a function injection_points_attach_local(‘point name’) instead 
of a static switch?
Or even injection_points_attach_global(‘point name’), while the function 
injection_points_attach(‘point name’) will be global? This would favour writing 
concurrent tests by default…


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-03-17 Thread Michael Paquier
On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
> Linking all the points to a PID with a injection_points_attach_local()
> that switches a static flag while registering a before_shmem_exit() to
> do an automated cleanup sounds like the simplest approach to me based
> on what I'm reading on this thread.

Please find a patch to do exactly that, without touching the backend
APIs.  0001 adds a new function call injection_points_local() that can
be added on top of a SQL test to make it concurrent-safe.  0002 is the
fix for the GIN tests.

I am going to add an open item to not forget about all that.

Comments are welcome.
--
Michael
From a4c219f0142a36b2a009af1f9e9affdeab6ab383 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 18 Mar 2024 09:53:05 +0900
Subject: [PATCH v2 1/2] Introduce runtime conditions in module
 injection_points

This adds a new SQL function injection_points_local() that can be used
to force injection points to be run only on the process where they are
attached.  This is handy for SQL tests to:
- detach automatically local injection points when the process exits.
- allow tests with injection points to run concurrently with other test
suites.
---
 .../expected/injection_points.out |  77 
 .../injection_points--1.0.sql |  11 ++
 .../injection_points/injection_points.c   | 176 ++
 .../injection_points/sql/injection_points.sql |  20 ++
 src/tools/pgindent/typedefs.list  |   1 +
 5 files changed, 285 insertions(+)

diff --git a/src/test/modules/injection_points/expected/injection_points.out b/src/test/modules/injection_points/expected/injection_points.out
index 827456ccc5..f24a602def 100644
--- a/src/test/modules/injection_points/expected/injection_points.out
+++ b/src/test/modules/injection_points/expected/injection_points.out
@@ -115,4 +115,81 @@ NOTICE:  notice triggered for injection point TestInjectionLog2
  
 (1 row)
 
+SELECT injection_points_detach('TestInjectionLog2');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Runtime conditions
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+-- Any follow-up injection point attached will be local to this process.
+SELECT injection_points_local();
+ injection_points_local 
+
+ 
+(1 row)
+
+SELECT injection_points_attach('TestConditionLocal1', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_attach('TestConditionLocal2', 'notice');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionLocal1'); -- error
+ERROR:  error triggered for injection point TestConditionLocal1
+SELECT injection_points_run('TestConditionLocal2'); -- notice
+NOTICE:  notice triggered for injection point TestConditionLocal2
+ injection_points_run 
+--
+ 
+(1 row)
+
+-- reload, local injection points should be gone.
+\c
+SELECT injection_points_run('TestConditionLocal1'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionLocal2'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- error
+ERROR:  error triggered for injection point TestConditionError
+SELECT injection_points_detach('TestConditionError');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Attaching injection points that use the same name as one defined locally
+-- previously should work.
+SELECT injection_points_attach('TestConditionLocal1', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_detach('TestConditionLocal1');
+ injection_points_detach 
+-
+ 
+(1 row)
+
 DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql
index 0a2e59aba7..1813a4d6d8 100644
--- a/src/test/modules/injection_points/injection_points--1.0.sql
+++ b/src/test/modules/injection_points/injection_points--1.0.sql
@@ -34,6 +34,17 @@ RETURNS void
 AS 'MODULE_PATHNAME', 'injection_points_wakeup'
 LANGUAGE C STRICT PARALLEL UNSAFE;
 
+--
+-- injection_points_local()
+--
+-- Trigger switch to link any future injection points attached to the
+-- current process, useful to make SQL tests concurrently-safe.
+--
+CREATE FUNCTION injection_points_local()
+RETURNS void
+AS 'MODULE_PATHNAME', 'injection_points_local'
+LANGUAGE C STRICT PARALLEL UNSAFE;
+
 --
 -- injection_points_detach()
 --
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index 0730254d54..8be081db99 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_point

Re: Weird test mixup

2024-03-15 Thread Michael Paquier
On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
> For the gin test, a single "SELECT injection_points_attach_local()" at the
> top of the test file would be most convenient.
> 
> If I have to do "SELECT
> injection_points_condition('gin-finish-incomplete-split', :'datname');" for
> every injection point in the test, I will surely forget it sometimes.

So will I, most likely.  The odds never play in favor of hackers.  I
have a few more tests in mind that can be linked to a specific
backend with SQL queries, but I've not been able to get back to it
yet.

> Wrt. the spinlock and shared memory handling, I think this would be simpler
> if you could pass some payload in the InjectionPointAttach() call, which
> would be passed back to the callback function:
> 
> In this case, the payload would be the "slot index" in shared memory.
>
> Or perhaps always allocate, say, 1024 bytes of working area for every
> attached injection point that the test module can use any way it wants. Like
> for storing extra conditions, or for the wakeup counter stuff in
> injection_wait(). A fixed size working area is a little crude, but would be
> very handy in practice.

Perhaps.  I am not sure that we need more than the current signature,
all that can just be handled in some module-specific shmem area.  The
key is to be able to link a point name to some state related to it.
Using a hash table would be more efficient, but performance-wise an
array is not going to matter, as there will most likely never be more
than 8 points.  4 is already a lot; I'm just doubling that on safety
grounds.
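As a standalone illustration of that name-to-state linkage (all names, sizes, and the per-point state are hypothetical, not the real module layout, and the shmem/spinlock handling is omitted):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the module-side area: a small fixed-size array
 * linking an injection point name to some per-point state, looked up by
 * linear scan since only a handful of points are ever attached. */
#define INJ_MAX_ENTRIES  8
#define INJ_NAME_MAXLEN  64

typedef struct InjectionPointEntry
{
	char	name[INJ_NAME_MAXLEN];	/* empty string means free slot */
	int		run_count;				/* example of per-point state */
} InjectionPointEntry;

static InjectionPointEntry entries[INJ_MAX_ENTRIES];

static InjectionPointEntry *
entry_find(const char *name)
{
	for (int i = 0; i < INJ_MAX_ENTRIES; i++)
	{
		if (strcmp(entries[i].name, name) == 0)
			return &entries[i];
	}
	return NULL;
}

static InjectionPointEntry *
entry_register(const char *name)
{
	for (int i = 0; i < INJ_MAX_ENTRIES; i++)
	{
		if (entries[i].name[0] == '\0')
		{
			strncpy(entries[i].name, name, INJ_NAME_MAXLEN - 1);
			entries[i].run_count = 0;
			return &entries[i];
		}
	}
	return NULL;			/* array full */
}
```

In the actual module this array would live in shared memory and be protected by a spinlock; the sketch only shows the lookup shape.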

> It would be nice to automatically detach all the injection points on process
> exit. You wouldn't always want that, but I think most tests hold a session
> open throughout the test, and for those it would be handy.

Linking all the points to a PID with a injection_points_attach_local()
that switches a static flag while registering a before_shmem_exit() to
do an automated cleanup sounds like the simplest approach to me based
on what I'm reading on this thread.
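A minimal standalone model of that approach, with invented names: a static flag flipped once per backend, recorded at attach time, and a cleanup routine doing what the before_shmem_exit() callback would do.

```c
#include <assert.h>
#include <string.h>

#define MAX_POINTS 8

static char attached[MAX_POINTS][64];	/* empty string means free slot */
static int	is_local[MAX_POINTS];
static int	injection_point_local = 0;	/* static switch, set once per backend */

/* What injection_points_local() would do. */
static void
points_set_local(void)
{
	injection_point_local = 1;
}

static void
point_attach(const char *name)
{
	for (int i = 0; i < MAX_POINTS; i++)
	{
		if (attached[i][0] == '\0')
		{
			strncpy(attached[i], name, 63);
			is_local[i] = injection_point_local;
			return;
		}
	}
}

static int
point_is_attached(const char *name)
{
	for (int i = 0; i < MAX_POINTS; i++)
		if (strcmp(attached[i], name) == 0)
			return 1;
	return 0;
}

/* What the before_shmem_exit() callback would do: detach only the
 * points this process marked as local, leaving global ones alone. */
static void
cleanup_local_points(void)
{
	for (int i = 0; i < MAX_POINTS; i++)
	{
		if (is_local[i])
		{
			attached[i][0] = '\0';
			is_local[i] = 0;
		}
	}
}
```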

(Just saw the buildfarm storm, wow.)
--
Michael




Re: Weird test mixup

2024-03-15 Thread Thomas Munro
On Sat, Mar 16, 2024 at 7:27 AM Tom Lane  wrote:
> Are there limits on the runtime of CI or cfbot jobs?  Maybe
> somebody should go check those systems.

Those get killed at a higher level after 60 minutes (configurable but
we didn't change it AFAIK):

https://cirrus-ci.org/faq/#instance-timed-out

It's a fresh virtual machine for each run, and after that it's gone
(well the ccache directory survives but only by being
uploaded/downloaded in explicit steps to transmit it between runs).




Re: Weird test mixup

2024-03-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 15/03/2024 16:00, Tom Lane wrote:
>> It may be even worse than it appears from the buildfarm status page.
>> My animals were stuck in infinite loops that required a manual "kill"
>> to get out of, and it's reasonable to assume there are others that
>> will require owner intervention.  Why would this test have done that,
>> if the module failed to load?

> The gin_incomplete_split test inserts rows until it hits the injection 
> point, at page split. There is a backstop, it should give up after 1 
> iterations, but that was broken. Fixed that, thanks for the report!

Duh ...

> Hmm, don't we have any timeout that would kill tests if they get stuck?

AFAIK, the only constraint on a buildfarm animal's runtime is the
wait_timeout setting, which is infinite by default, and was on my
machines.  (Not anymore ;-).)  We do have timeouts in (most?) TAP
tests, but this wasn't a TAP test.

If this is a continuous-insertion loop, presumably it will run the
machine out of disk space eventually, which could be unpleasant
if there are other services running besides the buildfarm.
I think I'll go notify the buildfarm owners list to check for
trouble.

Are there limits on the runtime of CI or cfbot jobs?  Maybe
somebody should go check those systems.

regards, tom lane




Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 16:00, Tom Lane wrote:

Heikki Linnakangas  writes:

On 15/03/2024 13:09, Heikki Linnakangas wrote:

I committed a patch to do that, to put out the fire.



That's turning the buildfarm quite red. Many, but not all animals are
failing like this:


It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention.  Why would this test have done that,
if the module failed to load?


The gin_incomplete_split test inserts rows until it hits the injection 
point, at page split. There is a backstop, it should give up after 1 
iterations, but that was broken. Fixed that, thanks for the report!


Hmm, don't we have any timeout that would kill tests if they get stuck?

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Weird test mixup

2024-03-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 15/03/2024 13:09, Heikki Linnakangas wrote:
>> I committed a patch to do that, to put out the fire.

> That's turning the buildfarm quite red. Many, but not all animals are 
> failing like this:

It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention.  Why would this test have done that,
if the module failed to load?

regards, tom lane




Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 14:10, Heikki Linnakangas wrote:

On 15/03/2024 13:09, Heikki Linnakangas wrote:

I committed a patch to do that, to put out the fire.


That's turning the buildfarm quite red. Many, but not all animals are
failing like this:


--- 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out
 2024-03-15 12:41:16.363286975 +0100
+++ 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out
  2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
  CREATE EXTENSION injection_points;
+ERROR:  extension "injection_points" is not available
+DETAIL:  Could not open extension control file 
"/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control":
 No such file or directory.
+HINT:  The extension must first be installed on the system where PostgreSQL is 
running.
...


Looks like adding NO_INSTALLCHECK somehow affected how the modules are
installed in tmp_install. I'll investigate..


I think this is a bug in the buildfarm client. In the make_misc_check 
step, it does this (reduced to just the interesting parts):



# run the modules that can't be run with installcheck
sub make_misc_check
{
...
my @dirs = glob("$pgsql/src/test/modules/* $pgsql/contrib/*");
foreach my $dir (@dirs)
{
next unless -e "$dir/Makefile";
my $makefile = file_contents("$dir/Makefile");
next unless $makefile =~ /^NO_INSTALLCHECK/m;
my $test = basename($dir);

# skip redundant TAP tests which are called elsewhere
my @out = run_log("cd $dir && $make $instflags TAP_TESTS= check");
...
}


So it scans src/test/modules, and runs "make check" for all 
subdirectories that have NO_INSTALLCHECK in the makefile. But the 
injection fault tests are also conditional on the 
enable_injection_points in the parent Makefile:



ifeq ($(enable_injection_points),yes)
SUBDIRS += injection_points gin
else
ALWAYS_SUBDIRS += injection_points gin
endif


The buildfarm client doesn't pay any attention to that, and runs the 
test anyway.


I committed an ugly hack to the subdirectory Makefiles, to turn "make 
check" into a no-op if injection points are disabled. Normally when you 
run "make check" at the parent level, it doesn't even recurse to the 
directories, but this works around the buildfarm script. I hope...


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 13:09, Heikki Linnakangas wrote:

On 15/03/2024 01:13, Tom Lane wrote:

Michael Paquier  writes:

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
less times.


No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.


I committed a patch to do that, to put out the fire.


That's turning the buildfarm quite red. Many, but not all animals are 
failing like this:



--- 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out
 2024-03-15 12:41:16.363286975 +0100
+++ 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out
  2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
 CREATE EXTENSION injection_points;
+ERROR:  extension "injection_points" is not available
+DETAIL:  Could not open extension control file 
"/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control":
 No such file or directory.
+HINT:  The extension must first be installed on the system where PostgreSQL is 
running.
... 


Looks like adding NO_INSTALLCHECK somehow affected how the modules are 
installed in tmp_install. I'll investigate..


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 01:13, Tom Lane wrote:

Michael Paquier  writes:

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
less times.


No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.


I committed a patch to do that, to put out the fire.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 09:39, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:

I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in.  Then only tests that require a global injection
point need to be NO_INSTALLCHECK.


Attached is a POC of what could be done.  I have extended the module
injection_points so as it is possible to register what I'm calling a
"condition" in the module that can be defined with a new SQL function.

The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks.  With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name.  Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, doing the whole thing
while holding the spinlock of the shmem state for the module only once.


For the gin test, a single "SELECT injection_points_attach_local()" at 
the top of the test file would be most convenient.


If I have to do "SELECT 
injection_points_condition('gin-finish-incomplete-split', :'datname');" 
for every injection point in the test, I will surely forget it sometimes.


In the 'gin' test, they could actually be scoped to the same backend.


Wrt. the spinlock and shared memory handling, I think this would be 
simpler if you could pass some payload in the InjectionPointAttach() 
call, which would be passed back to the callback function:


 void
 InjectionPointAttach(const char *name,
 const char *library,
-const char *function)
+const char *function,
+uint64 payload)

In this case, the payload would be the "slot index" in shared memory.

Or perhaps always allocate, say, 1024 bytes of working area for every 
attached injection point that the test module can use any way it wants. 
Like for storing extra conditions, or for the wakeup counter stuff in 
injection_wait(). A fixed size working area is a little crude, but would 
be very handy in practice.
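As a standalone sketch of that working-area idea (the slot layout and all names are invented, and the actual shared-memory allocation is not modeled): each attached point gets a flat scratch buffer that the test module can reinterpret however it wants.

```c
#include <assert.h>
#include <string.h>

#define INJ_WORKAREA_SIZE 1024		/* crude fixed size, as suggested */
#define INJ_MAX_POINTS 8

typedef struct InjectionPointSlot
{
	char			name[64];		/* empty string means free slot */
	unsigned char	workarea[INJ_WORKAREA_SIZE];
} InjectionPointSlot;

static InjectionPointSlot slots[INJ_MAX_POINTS];

/* Return the per-point working area, allocating a slot on first use. */
static void *
workarea_for(const char *name)
{
	for (int i = 0; i < INJ_MAX_POINTS; i++)
	{
		if (strcmp(slots[i].name, name) == 0)
			return slots[i].workarea;
		if (slots[i].name[0] == '\0')
		{
			strncpy(slots[i].name, name, 63);
			memset(slots[i].workarea, 0, INJ_WORKAREA_SIZE);
			return slots[i].workarea;
		}
	}
	return NULL;				/* out of slots */
}
```

The module could then store extra conditions or a wakeup counter for injection_wait() in that area without any backend API change.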



By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.


Oops.

It would be nice to automatically detach all the injection points on 
process exit. You wouldn't always want that, but I think most tests hold 
a session open throughout the test, and for those it would be handy.


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Weird test mixup

2024-03-15 Thread Michael Paquier
On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> Or we could just disable runningcheck because of the concurrency
>> requirement in this test.  The test would still be able to run, just
>> less times.
> 
> No, actually we *must* mark all these tests NO_INSTALLCHECK if we
> stick with the current definition of injection points.  The point
> of installcheck mode is that the tests are supposed to be safe to
> run in a live installation.  Side-effects occurring in other
> databases are completely not OK.

I really don't want to plug any runtime conditions into the backend
APIs, because there can be so much more that can be done there than
only restricting a callback to a database.  I can imagine process type
restrictions, process PID restrictions, etc.  So this knowledge should
stick into the test module itself, and be expanded there.  That's
easier ABI-wise, as well.

> I can see that some tests would want to be able to inject code
> cluster-wide, but I bet that's going to be a small minority.
> I suggest that we invent a notion of "global" vs "local"
> injection points, where a "local" one only fires in the DB it
> was defined in.  Then only tests that require a global injection
> point need to be NO_INSTALLCHECK.

Attached is a POC of what could be done.  I have extended the module
injection_points so as it is possible to register what I'm calling a
"condition" in the module that can be defined with a new SQL function.

The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks.  With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name.  Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, doing the whole thing
while holding the spinlock of the shmem state for the module only once.
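A minimal standalone model of that cross-check, with a database name standing in for the backend's MyDatabaseId lookup; all identifiers here are invented for illustration and do not match the real injection_points.c.

```c
#include <assert.h>
#include <string.h>

#define MAX_CONDITIONS 4

typedef struct PointCondition
{
	char	name[64];		/* injection point name; empty means free */
	char	dbname[64];		/* only run in this database */
} PointCondition;

static PointCondition conditions[MAX_CONDITIONS];
static char current_db[64] = "postgres";	/* stand-in for MyDatabaseId */

/* What injection_points_condition() would store in shmem. */
static void
register_condition(const char *point, const char *db)
{
	for (int i = 0; i < MAX_CONDITIONS; i++)
	{
		if (conditions[i].name[0] == '\0')
		{
			strncpy(conditions[i].name, point, 63);
			strncpy(conditions[i].dbname, db, 63);
			return;
		}
	}
}

/* The check the error/notice callbacks would run before acting. */
static int
point_should_run(const char *point)
{
	for (int i = 0; i < MAX_CONDITIONS; i++)
	{
		if (strcmp(conditions[i].name, point) == 0)
			return strcmp(conditions[i].dbname, current_db) == 0;
	}
	return 1;				/* no condition registered: always run */
}
```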

By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.  Adding a pg_sleep of a
few seconds after 'gin-leave-leaf-split-incomplete' is registered
enlarges the window, and the patch avoids failures when running
installcheck in parallel for modules/gin/ and something else using
gin, like contrib/btree_gin/:
while make USE_MODULE_DB=1 installcheck; do :; done

0001 is the condition facility for the module, 0002 is a fix for the
GIN test.  Thoughts are welcome.
--
Michael
From 5f00092a0bd193f70d9daff77caacfc2ba83ca52 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 15 Mar 2024 16:03:05 +0900
Subject: [PATCH 1/2] Introduce runtime conditions in module injection_points

This adds a new SQL function injection_points_condition() that can be
used to register a runtime condition to an injection point, declared or
not.  For now, it is possible to register a database name to make an
injection point run only on a defined database.
---
 .../expected/injection_points.out |  54 
 .../injection_points--1.0.sql |  13 ++
 .../injection_points/injection_points.c   | 131 ++
 .../injection_points/sql/injection_points.sql |  14 ++
 4 files changed, 212 insertions(+)

diff --git a/src/test/modules/injection_points/expected/injection_points.out b/src/test/modules/injection_points/expected/injection_points.out
index 827456ccc5..ccb83b291f 100644
--- a/src/test/modules/injection_points/expected/injection_points.out
+++ b/src/test/modules/injection_points/expected/injection_points.out
@@ -115,4 +115,58 @@ NOTICE:  notice triggered for injection point TestInjectionLog2
  
 (1 row)
 
+SELECT injection_points_detach('TestInjectionLog2');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Conditions
+SELECT current_database() AS datname \gset
+-- Register injection point to run on database 'postgres'.
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_condition('TestConditionError', 'postgres');
+ injection_points_condition 
+
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_detach('TestConditionError');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Register injection point to run on current database
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_condition('TestConditionError', :'datname');
+ injection_points_condition 
+
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- error
+ERROR:  error triggered for injection point 

Re: Weird test mixup

2024-03-14 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
>> I wonder if it'd be wise to adjust the injection point stuff so that
>> it's active in only the specific database the injection point was
>> activated in.

> It can be made optional by extending InjectionPointAttach() to
> specify a database OID or a database name.  Note that
> 041_checkpoint_at_promote.pl wants an injection point to run in the
> checkpointer, where we don't have a database requirement.

> Or we could just disable runningcheck because of the concurrency
> requirement in this test.  The test would still be able to run, just
> less times.

No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.

I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in.  Then only tests that require a global injection
point need to be NO_INSTALLCHECK.

regards, tom lane




Re: Weird test mixup

2024-03-14 Thread Michael Paquier
On Fri, Mar 15, 2024 at 07:53:57AM +0900, Michael Paquier wrote:
> It can be made optional by extending InjectionPointAttach() to
> specify a database OID or a database name.  Note that
> 041_checkpoint_at_promote.pl wants an injection point to run in the
> checkpointer, where we don't have a database requirement.

Slight correction here.  It is also possible to not touch
InjectionPointAttach() at all: just tweak the callbacks to do that as
long as the database that should be used is tracked in shmem with its
point name, say with new fields in InjectionPointSharedState.  That
keeps the backend APIs in a cleaner state.
--
Michael




Re: Weird test mixup

2024-03-14 Thread Michael Paquier
On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.
> 
> I wonder if it'd be wise to adjust the injection point stuff so that
> it's active in only the specific database the injection point was
> activated in.

It can be made optional by extending InjectionPointAttach() to
specify a database OID or a database name.  Note that
041_checkpoint_at_promote.pl wants an injection point to run in the
checkpointer, where we don't have a database requirement.

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
fewer times.
--
Michael




Re: Weird test mixup

2024-03-14 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Mar 15, 2024 at 11:19 AM Tom Lane  wrote:
>> Do they?  It'd be fairly easy to explain this if these things were
>> being run in "installcheck" style.  I'm not sure about CI, but from
>> memory, the buildfarm does use installcheck for some things.

> Right, as mentioned here:
> https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com
> That's the "running" test, which is like the old installcheck.

Hmm.  Seems like maybe we need to institute a rule that anything
using injection points has to be marked NO_INSTALLCHECK.  That's
kind of a big hammer though.

regards, tom lane




Re: Weird test mixup

2024-03-14 Thread Tom Lane
I wrote:
> Heikki Linnakangas  writes:
>> Somehow the 'gin-leave-leaf-split-incomplete' injection point was active 
>> in the 'intarray' test. That makes no sense. That injection point is 
>> only used by the test in src/test/modules/gin/. Perhaps that ran at the 
>> same time as the intarray test? But they run in separate instances, with 
>> different data directories.

> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.

Hmm, Munro's comment yesterday[1] says that current CI does use
installcheck mode in some cases.

regards, tom lane

[1] https://www.postgresql.org/message-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==avnw84f-+8y...@mail.gmail.com




Re: Weird test mixup

2024-03-14 Thread Thomas Munro
On Fri, Mar 15, 2024 at 11:19 AM Tom Lane  wrote:
> Heikki Linnakangas  writes:
> > Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
> > in the 'intarray' test. That makes no sense. That injection point is
> > only used by the test in src/test/modules/gin/. Perhaps that ran at the
> > same time as the intarray test? But they run in separate instances, with
> > different data directories.
>
> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.

Right, as mentioned here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com

That's the "running" test, which is like the old installcheck.




Re: Weird test mixup

2024-03-14 Thread Tom Lane
Heikki Linnakangas  writes:
> Somehow the 'gin-leave-leaf-split-incomplete' injection point was active 
> in the 'intarray' test. That makes no sense. That injection point is 
> only used by the test in src/test/modules/gin/. Perhaps that ran at the 
> same time as the intarray test? But they run in separate instances, with 
> different data directories.

Do they?  It'd be fairly easy to explain this if these things were
being run in "installcheck" style.  I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.

I wonder if it'd be wise to adjust the injection point stuff so that
it's active in only the specific database the injection point was
activated in.

regards, tom lane




Weird test mixup

2024-03-14 Thread Heikki Linnakangas
I got a weird test failure while testing my forking refactor patches on 
Cirrus CI 
(https://cirrus-ci.com/task/5880724448870400?logs=test_running#L121):



[16:52:39.753] Summary of Failures:
[16:52:39.753] 
[16:52:39.753] 66/73 postgresql:intarray-running / intarray-running/regress   ERROR 6.27s   exit status 1
[16:52:39.753] 
[16:52:39.753] Ok: 72  
[16:52:39.753] Expected Fail:  0   
[16:52:39.753] Fail:   1   
[16:52:39.753] Unexpected Pass:0   
[16:52:39.753] Skipped:0   
[16:52:39.753] Timeout:0   
[16:52:39.753] 
[16:52:39.753] Full log written to /tmp/cirrus-ci-build/build/meson-logs/testlog-running.txt


And:


diff -U3 /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out
--- /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out	2024-03-14 16:48:48.690367000 +
+++ /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out	2024-03-14 16:52:05.759444000 +
@@ -804,6 +804,7 @@
 
 DROP INDEX text_idx;

 CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
+ERROR:  error triggered for injection point gin-leave-leaf-split-incomplete
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count 
 ---

@@ -877,6 +878,7 @@
 (1 row)
 
 DROP INDEX text_idx;

+ERROR:  index "text_idx" does not exist
 -- Repeat the same queries with an extended data set. The data set is the
 -- same that we used before, except that each element in the array is
 -- repeated three times, offset by 1000 and 2000. For example, {1, 5}


Somehow the 'gin-leave-leaf-split-incomplete' injection point was active 
in the 'intarray' test. That makes no sense. That injection point is 
only used by the test in src/test/modules/gin/. Perhaps that ran at the 
same time as the intarray test? But they run in separate instances, with 
different data directories. And the 'gin' test passed.


I'm completely stumped. Anyone have a theory?

--
Heikki Linnakangas
Neon (https://neon.tech)