Re: Faster "SET search_path"

2023-07-29 Thread Nathan Bossart
On Sat, Jul 29, 2023 at 08:59:01AM -0700, Jeff Davis wrote:
> 0001: Transform the settings in proconfig into a List for faster
> processing. This is simple and speeds up any proconfig setting.

This looks straightforward.  It might be worth combining the common parts
of ProcessGUCArray() and TransformGUCArray() if there is a clean way to do
so.

> 0002: Introduce CheckIdentifierString(), which is a faster version of
> SplitIdentifierString() that only validates, and can be used in
> check_search_path().

This also looks straightforward.  And I'd make the same comment as above
for SplitIdentifierString() and CheckIdentifierString().  Perhaps we could
have folks set SplitIdentifierString's namelist argument to NULL to
indicate that it only needs to validate.
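
For instance, check_search_path() could then skip building the list
entirely, along these lines (a sketch of the suggested convention only;
today SplitIdentifierString() requires a non-NULL namelist):

	rawname = pstrdup(*newval);

	/* hypothetical: NULL namelist would mean "validate only" */
	if (!SplitIdentifierString(rawname, ',', NULL))
	{
		/* syntax error in name list */
		GUC_check_errdetail("List syntax is invalid.");
		pfree(rawname);
		return false;
	}
	pfree(rawname);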

> 0003: Cache of previous search_path settings. The key is the raw
> namespace_search_path string and the role OID, and it caches the
> computed OID list. Changes to the search_path setting or the role can
> retrieve the cached OID list as long as nothing else invalidates the
> cache (changes to the temp schema or a syscache invalidation of
> pg_namespace or pg_role).

At a glance, this looks pretty reasonable, too.

> One behavior change in 0003 is that retrieving a cached OID list
> doesn't call InvokeNamespaceSearchHook(). It would be easy enough to
> disable caching when a hook exists, but I didn't see a reason to expect
> that "SET search_path" must invoke that hook each time. Invoking it
> when computing for the first time, or after a real invalidation, seemed
> fine to me. Feedback on that is welcome.

Could a previously cached list be used to circumvent a hook that was just
reconfigured to ERROR for certain namespaces?  If something like that is
possible, then we might need to disable the cache when there is a hook.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: buildfarm instance bichir stuck

2023-07-29 Thread Thomas Munro
On Fri, Apr 9, 2021 at 6:11 PM Thomas Munro  wrote:
> On Wed, Apr 7, 2021 at 7:31 PM Robins Tharakan  wrote:
> > Correct. This is easily reproducible on this test-instance, so let me know 
> > if you want me to test a patch.
>
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race.  Let's go back to basics...

I was looking into the portability of SIGURG and OOB socket data for
something totally different (hallway track discussion from PGCon,
could we use that for query cancel, like FTP does, instead of opening
another socket?), and lo and behold, someone has figured out a
workaround for this latch problem:

https://github.com/microsoft/WSL/issues/8619

I don't really want to add code to scrape uname() output to detect
different kernels at runtime as shown there, but it doesn't seem to
make a difference on Linux if we just always do what was suggested.  I
didn't look too hard into whether that is the right place to put the
call, or really understand *why* it works.  Since I am not a Windows
user and we don't have a WSL1 CI, I can't confirm that it works or
explore whether some other ordering of operations would be better but
still work.  But if that does the trick, maybe we should just do
something like the attached.

Thoughts?
From 7c077ff2f0d922a2d13948db8d6c01eb791c83c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 30 Jul 2023 10:43:27 +1200
Subject: [PATCH] Work around signalfd() oddity on WSL1.

It's not clear why setting SIGURG's disposition to SIG_DFL (which should
be no change) affects the behavior of WSL1 (one of the ways that Windows
can run Linux programs).  It's harmless to do that on Linux, so let's add
that workaround to help WSL1 users.  Based on research done by Elvis
Pranskevichus in a WSL bug report:

https://github.com/microsoft/WSL/issues/8619

Discussion: https://postgr.es/m/CAEP4nAymAZP1VEBNoWAQca85ZtU5YxuwS95%2BVu%2BXW%2B-eMfq_vQ%40mail.gmail.com

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index db889385b7..06dc7c503a 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -307,6 +307,12 @@ InitializeLatchSupport(void)
 	if (signal_fd < 0)
 		elog(FATAL, "signalfd() failed");
 	ReserveExternalFD();
+
+	/*
+	 * Setting this to its default disposition (ignored) appears to be
+	 * redundant, but seems to fix a problem with signalfd() on WSL1 kernels.
+	 */
+	pqsignal(SIGURG, SIG_DFL);
 #endif
 
 #ifdef WAIT_USE_KQUEUE
-- 
2.39.2



Postmaster doesn't correctly handle crashes in PM_STARTUP state

2023-07-29 Thread Andres Freund
Hi,

While testing something I made the checkpointer process intentionally crash as
soon as it started up.  The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:

2023-07-29 14:32:39.241 PDT [65031] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG:  checkpointer process (PID 65032) was 
terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG:  terminating any other active server 
processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG:  process 65035 taking over ProcSignal 
slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG:  shutting down because 
restart_after_crash is off

Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.

I could not initially reproduce this on Linux.

After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...

The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:

	/* We now transit into a state of waiting for children to die */
	if (pmState == PM_RECOVERY ||
		pmState == PM_HOT_STANDBY ||
		pmState == PM_RUN ||
		pmState == PM_STOP_BACKENDS ||
		pmState == PM_SHUTDOWN)
		pmState = PM_WAIT_BACKENDS;

Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on Linux as well.

I haven't fully dug through the history, but this looks to be a quite old problem.


Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.
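
If the fix is simply to treat PM_STARTUP (and perhaps PM_SHUTDOWN_2) the
same as the other states, it'd be something like this (untested sketch):

	/* We now transit into a state of waiting for children to die */
	if (pmState == PM_STARTUP ||
		pmState == PM_RECOVERY ||
		pmState == PM_HOT_STANDBY ||
		pmState == PM_RUN ||
		pmState == PM_STOP_BACKENDS ||
		pmState == PM_SHUTDOWN ||
		pmState == PM_SHUTDOWN_2)
		pmState = PM_WAIT_BACKENDS;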

Greetings,

Andres Freund




Re: should frontend tools use syncfs() ?

2023-07-29 Thread Nathan Bossart
On Wed, Apr 13, 2022 at 06:54:12AM -0500, Justin Pryzby wrote:
> I didn't pursue this patch, as it's easier for me to use /bin/sync -f.  
> Someone
> should adopt it if interested.

I was about to start a new thread, but I found this one with some good
preliminary discussion.  I came to the same conclusion about introducing a
new option instead of using syncfs() by default wherever it is available.
The attached patch is still a work-in-progress, but it seems to behave as
expected.  I began investigating this because I noticed that the
sync-data-directory step in pg_upgrade takes quite a while when there are
many files, and I am looking for ways to reduce the amount of downtime
required for pg_upgrade.

The attached patch adds a new --sync-method option to the relevant frontend
utilities, but I am not wedded to that name/approach.
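
The core syncfs() call is tiny - roughly the following, sketched with
illustrative names (the patch routes this through fsync_pgdata() and
friends in src/common/file_utils.c):

	#include <fcntl.h>
	#include <unistd.h>			/* syncfs() needs _GNU_SOURCE on Linux */

	/* pg_fatal() is from common/logging.h in a real frontend build */
	static void
	sync_dir_syncfs(const char *dir)
	{
		int			fd = open(dir, O_RDONLY);

		if (fd < 0)
			pg_fatal("could not open \"%s\": %m", dir);

		/* one call flushes the whole filesystem containing "dir" */
		if (syncfs(fd) < 0)
			pg_fatal("could not synchronize file system for \"%s\": %m", dir);
		close(fd);
	}

One wrinkle is that pg_wal and tablespaces can live on different
filesystems, so each such directory needs its own syncfs() call.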

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 4f484cd0268e63da8226d78dd21a8d7e29aa2b78 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 28 Jul 2023 15:56:24 -0700
Subject: [PATCH v1 1/1] allow syncfs in frontend utilities

---
 src/bin/initdb/initdb.c   | 11 +++-
 src/bin/pg_basebackup/pg_basebackup.c |  8 ++-
 src/bin/pg_checksums/pg_checksums.c   |  8 ++-
 src/bin/pg_rewind/file_ops.c  |  2 +-
 src/bin/pg_rewind/pg_rewind.c |  8 +++
 src/bin/pg_rewind/pg_rewind.h |  2 +
 src/bin/pg_upgrade/option.c   | 12 
 src/bin/pg_upgrade/pg_upgrade.c   |  6 +-
 src/bin/pg_upgrade/pg_upgrade.h   |  1 +
 src/common/file_utils.c   | 79 ++-
 src/fe_utils/option_utils.c   | 18 ++
 src/include/common/file_utils.h   | 15 -
 src/include/fe_utils/option_utils.h   |  3 +
 src/include/storage/fd.h  |  4 ++
 src/tools/pgindent/typedefs.list  |  1 +
 15 files changed, 168 insertions(+), 10 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3f4167682a..908263ee62 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -76,6 +76,7 @@
 #include "common/restricted_token.h"
 #include "common/string.h"
 #include "common/username.h"
+#include "fe_utils/option_utils.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
 #include "mb/pg_wchar.h"
@@ -165,6 +166,7 @@ static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
 static int	wal_segment_size_mb;
+static SyncMethod sync_method = SYNC_METHOD_FSYNC;
 
 
 /* internal vars */
@@ -3106,6 +3108,7 @@ main(int argc, char *argv[])
 		{"locale-provider", required_argument, NULL, 15},
 		{"icu-locale", required_argument, NULL, 16},
 		{"icu-rules", required_argument, NULL, 17},
+		{"sync-method", required_argument, NULL, 18},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3285,6 +3288,10 @@ main(int argc, char *argv[])
 			case 17:
 icu_rules = pg_strdup(optarg);
 break;
+			case 18:
+if (!parse_sync_method(optarg, &sync_method))
+	exit(1);
+break;
 			default:
 /* getopt_long already emitted a complaint */
 pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3332,7 +3339,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		fsync_pgdata(pg_data, PG_VERSION_NUM);
+		fsync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
 		check_ok();
 		return 0;
 	}
@@ -3409,7 +3416,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		fsync_pgdata(pg_data, PG_VERSION_NUM);
+		fsync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..548e764b2f 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -148,6 +148,7 @@ static bool verify_checksums = true;
 static bool manifest = true;
 static bool manifest_force_encode = false;
 static char *manifest_checksums = NULL;
+static SyncMethod sync_method = SYNC_METHOD_FSYNC;
 
 static bool success = false;
 static bool made_new_pgdata = false;
@@ -2203,7 +2204,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) fsync_pgdata(basedir, serverVersion);
+			(void) fsync_pgdata(basedir, serverVersion, sync_method);
 		}
 	}
 
@@ -2281,6 +2282,7 @@ main(int argc, char **argv)
 		{"no-manifest", no_argument, NULL, 5},
 		{"manifest-force-encode", no_argument, NULL, 6},
 		{"manifest-checksums", required_argument, NULL, 7},
+		{"sync-method", required_argument, NULL, 8},
 		{NULL, 0, NULL, 0}
 	};
 	int			c;
@@ -2452,6 +2454,10 @@ main(int argc, char **argv)
 			case 7:
 manifest_checksums = pg_strdup(optarg);
 break;
+			case 8:
+if (!parse_sync_method(optarg, &sync_method))
+	exit(1);
+break;
 			default:
 /* getopt_long already emitted a complaint */
 pg_log_error_hint("Try \"%s --help\" 

Re: add timing information to pg_upgrade

2023-07-29 Thread Nathan Bossart
On Sat, Jul 29, 2023 at 12:17:40PM +0530, Bharath Rupireddy wrote:
> While on this, I noticed a thing unrelated to your patch: there's
> no space between the longest status message (60 bytes) and "ok" -
> 'Checking for incompatible "aclitem" data type in user tablesok
> 23.932 ms'. I think MESSAGE_WIDTH needs to be bumped up to 64 or more.

Good catch.  I think I'd actually propose just removing "in user tables" or
the word "incompatible" from these messages to keep them succinct.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Faster "SET search_path"

2023-07-29 Thread Isaac Morland
On Sat, 29 Jul 2023 at 11:59, Jeff Davis  wrote:

Unfortunately, adding a "SET search_path" clause to functions slows
> them down. The attached patches close the performance gap
> substantially.
>
> Changes:
>
> 0001: Transform the settings in proconfig into a List for faster
> processing. This is simple and speeds up any proconfig setting.
>
> 0002: Introduce CheckIdentifierString(), which is a faster version of
> SplitIdentifierString() that only validates, and can be used in
> check_search_path().
>
> 0003: Cache of previous search_path settings. The key is the raw
> namespace_search_path string and the role OID, and it caches the
> computed OID list. Changes to the search_path setting or the role can
> retrieve the cached OID list as long as nothing else invalidates the
> cache (changes to the temp schema or a syscache invalidation of
> pg_namespace or pg_role).
>

I'm glad to see this work. Something related to consider (not sure if this
is helpful): can the case where the caller's search_path happens to be the
same as the SET search_path setting be optimized? Essentially, "just"
observe efficiently (somehow) that no change is needed, and skip changing
it. I ask because substantially all my functions are written using "SET
search_path FROM CURRENT", and many of them call each other. As a result,
in my usage the common case is a function being called by another function,
where both have the same search_path setting. So ideally, the search_path
would not be changed at all when entering and exiting the callee.
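
Something like the following is what I imagine, at the point where the
function's SET clauses are applied (entirely hypothetical, and it ignores
the GUC nesting/source bookkeeping):

	const char *cur = GetConfigOption("search_path", false, false);

	/* skip the (expensive) SET when nothing would actually change */
	if (cur != NULL && strcmp(cur, newval) == 0)
		continue;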


Re: brininsert optimization opportunity

2023-07-29 Thread Soumyadeep Chakraborty
Attached v4 of the patch, rebased against latest HEAD.

Regards,
Soumyadeep (VMware)
From b5fb12fbb8b0c1b2963a05a2877b5063bbc75f58 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Sat, 29 Jul 2023 09:17:49 -0700
Subject: [PATCH v4 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap, which had some overhead.

To mitigate, we set up the revmap once and reuse it across calls,
introducing a new AM callback to clean up this state at the end of all
inserts; the revmap cleanup is performed there.
---
 contrib/bloom/blutils.c  |  1 +
 doc/src/sgml/indexam.sgml| 14 +-
 src/backend/access/brin/brin.c   | 74 +++-
 src/backend/access/gin/ginutil.c |  1 +
 src/backend/access/gist/gist.c   |  1 +
 src/backend/access/hash/hash.c   |  1 +
 src/backend/access/index/indexam.c   | 15 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/spgist/spgutils.c |  1 +
 src/backend/executor/execIndexing.c  |  5 ++
 src/include/access/amapi.h   |  3 ++
 src/include/access/brin_internal.h   |  1 +
 src/include/access/genam.h   |  2 +
 13 files changed, 108 insertions(+), 12 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbdf..9720f69de09 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->aminsertcleanup = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index e813e2b620a..17f7d6597f1 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -139,6 +139,7 @@ typedef struct IndexAmRoutine
 ambuild_function ambuild;
 ambuildempty_function ambuildempty;
 aminsert_function aminsert;
+aminsertcleanup_function aminsertcleanup;
 ambulkdelete_function ambulkdelete;
 amvacuumcleanup_function amvacuumcleanup;
 amcanreturn_function amcanreturn;   /* can be NULL */
@@ -355,11 +356,22 @@ aminsert (Relation indexRelation,
within an SQL statement, it can allocate space
   in indexInfo->ii_Context and store a pointer to the
   data in indexInfo->ii_AmCache (which will be NULL
-   initially).
+   initially). If the cached data contains structures that can simply be
+   pfree'd, they will be freed implicitly. But if it contains more complex
+   data, such as Buffers or TupleDescs, additional cleanup is necessary.
+   That additional cleanup can be performed in aminsertcleanup.
   
 
   
 
+bool
+aminsertcleanup (IndexInfo *indexInfo);
+
   Clean up state that was maintained across successive inserts in
   indexInfo->ii_AmCache. This is useful if the cached
   data is complex - like Buffers or TupleDescs, which need additional
   cleanup - unlike simple structures that will be pfree'd implicitly.
+
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
   IndexBulkDeleteResult *stats,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..2da59f2ab03 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -58,6 +58,17 @@ typedef struct BrinBuildState
 	BrinMemTuple *bs_dtuple;
 } BrinBuildState;
 
+/*
+ * We use a BrinInsertState to capture running state spanning multiple
+ * brininsert invocations, within the same command.
+ */
+typedef struct BrinInsertState
+{
+	BrinRevmap *bs_rmAccess;
+	BrinDesc   *bs_desc;
+	BlockNumber bs_pages_per_range;
+}			BrinInsertState;
+
 /*
  * Struct used as "opaque" during index scans
  */
@@ -72,6 +83,7 @@ typedef struct BrinOpaque
 
 static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
   BrinRevmap *revmap, BlockNumber pagesPerRange);
+static BrinInsertState *initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 		  bool include_partial, double *numSummarized, double *numExisting);
@@ -117,6 +129,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->aminsertcleanup = brininsertcleanup;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -140,6 +153,28 @@ brinhandler(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(amroutine);
 }
 
+/*
+ * Initialize a BrinInsertState to maintain state to be used across multiple
+ * tuple inserts, within the same command.
+ */
+static BrinInsertState *
+initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo)
+{
+	BrinInsertState *bistate;
+	MemoryContext oldcxt;
+
+	
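
Given the BrinInsertState above, the cleanup callback is roughly this
shape (a sketch based on the description, modulo the exact signature; the
attached patch is authoritative):

	static void
	brininsertcleanup(IndexInfo *indexInfo)
	{
		BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

		if (bistate == NULL)
			return;

		/* release what a plain pfree of the cache would leak */
		brinRevmapTerminate(bistate->bs_rmAccess);
		brin_free_desc(bistate->bs_desc);
		pfree(bistate);
		indexInfo->ii_AmCache = NULL;
	}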

Re: POC, WIP: OR-clause support for indexes

2023-07-29 Thread Peter Geoghegan
On Wed, Jul 26, 2023 at 6:30 PM Alena Rybakina  wrote:
> > I didn't intend to imply that you might have the same problem here. I
> > just meant that OR optimizations can have problems with duplicate
> > elimination, in general. I suspect that your patch doesn't have that
> > problem, because you are performing a transformation of one kind of OR
> > into another kind of OR.
>
> Yes, you are right, but I studied this topic and two other sources to
> accumulate my knowledge. It was an exciting experience for me)

Cool! Yeah, a lot of the value with these sorts of things comes from
the way that they can interact with each other. This is hard to
describe exactly, but still important.

> I was especially inspired after studying the interview with Goetz Graefe
> [2], his life experience is the most inspiring, and from this article I
> was able to get a broad understanding of the field of databases:
> current problems, future development, how it works... Thank you for the
> recommendation.

I also think that his perspective is very interesting.

> I think it really helps to speed up the search with similar deep
> filtering compared to cluster indexes, but do we have cases where we
> don't use this algorithm because it takes longer than a regular index scan?
> I thought about the situation with wide indexes (with a lot of multiple
> columns) and having a lot of filtering predicates for them.

I think that it should be possible for the optimizer to only use
multi-column SAOP index paths when there is at least likely to be some
small advantage -- that's definitely my goal. Importantly, we may not
really need to accurately model the costs where the new approach turns
out to be much faster. The only essential thing is that we avoid cases
where the new approach is much slower than the old approach. Which is
possible (in at least some cases) by making the runtime behavior
adaptive.

The best decision that the planner can make may be no decision at all.
Better to wait until runtime where at all possible, since that gives
us the latest and most accurate picture of things.

> But I'm not sure about this, so it seems to me that this is rather a
> problem of improper use of indexes.

It's hard to say how true that is.

Certainly, workloads similar to the TPC-DS benchmark kinda need
something like MDAM. It's just not practical to have enough indexes to
support every possible query -- the benchmark is deliberately designed
to have unpredictable, hard-to-optimize access paths. It seems to
require having fewer, more general indexes that can support
multi-dimensional access reasonably efficiently.

Of course, with OLTP it's much more likely that the workload will have
predictable access patterns. That makes having exactly the right
indexes much more practical. So maybe you're right there. But, I still
see a lot of value in a design that is as forgiving as possible. Users
really like that kind of thing in my experience.

> > Currently, the optimizer doesn't recognize multi-column indexes with
> > SAOPs on every column as having a valid sort order, except on the
> > first column. It seems possible that that has consequences for your
> > patch. (I'm really only guessing, though; don't trust anything that I
> > say about the optimizer too much.)
> >
> Honestly, I couldn't understand your concerns very well, could you
> describe it in more detail?

Well, I'm not sure if there is any possible scenario where the
transformation from your patch makes it possible to go from an access
path that has a valid sort order (usually because there is an
underlying index scan) into an access path that doesn't. In fact, the
opposite situation seems more likely (which is good news) --
especially if you assume that my own patch is also present.

Going from a bitmap OR (which cannot return sorted output) to a
multi-column SAOP index scan (which now can) may have significant
value in some specific circumstances. Most obviously, it's really
useful when it enables us to feed tuples into a GroupAggregate without
a separate sort step, and without a hash aggregate (that's why I see
value in combining your patch with my own patch). You just need to be
careful about allowing the opposite situation to take place.

More generally, there is a need to think about strange second order
effects. We want to be open to useful second order effects that make
query execution much faster in some specific context, while avoiding
harmful second order effects. Intuitively, I think that it should be
possible to do this with the transformations performed by your patch.

In other words, "helpful serendipity" is an important advantage, while
"harmful anti-serendipity" is what we really want to avoid. Ideally by
making the harmful cases impossible "by construction".

-- 
Peter Geoghegan




Faster "SET search_path"

2023-07-29 Thread Jeff Davis
Improve performance of "SET search_path".

Motivation:

Creating functions with a "SET search_path" clause is safer and more
secure because the function behavior doesn't change based on the
caller's search_path setting.

Setting search_path in the function declaration is especially important
for SECURITY DEFINER functions[1], but even SECURITY INVOKER functions
can be executed more like SECURITY DEFINER in some contexts (e.g.
REINDEX executing an index function). Also, it's just error-prone to
depend on the caller's search_path unless there's a specific reason you
want to do that.

Unfortunately, adding a "SET search_path" clause to functions slows
them down. The attached patches close the performance gap
substantially.

Changes:

0001: Transform the settings in proconfig into a List for faster
processing. This is simple and speeds up any proconfig setting.

0002: Introduce CheckIdentifierString(), which is a faster version of
SplitIdentifierString() that only validates, and can be used in
check_search_path().

0003: Cache of previous search_path settings. The key is the raw
namespace_search_path string and the role OID, and it caches the
computed OID list. Changes to the search_path setting or the role can
retrieve the cached OID list as long as nothing else invalidates the
cache (changes to the temp schema or a syscache invalidation of
pg_namespace or pg_role).

One behavior change in 0003 is that retrieving a cached OID list
doesn't call InvokeNamespaceSearchHook(). It would be easy enough to
disable caching when a hook exists, but I didn't see a reason to expect
that "SET search_path" must invoke that hook each time. Invoking it
when computing for the first time, or after a real invalidation, seemed
fine to me. Feedback on that is welcome.
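
For concreteness, a cache entry conceptually looks like this (names are
illustrative, not necessarily what the patch uses):

	typedef struct SearchPathCacheEntry
	{
		/* lookup key */
		const char *searchPath;	/* raw namespace_search_path string */
		Oid			roleid;		/* role the path was resolved under */

		/* cached result, until invalidated */
		List	   *oidlist;	/* resolved schema OIDs */
	} SearchPathCacheEntry;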

Test:

  CREATE SCHEMA a;
  CREATE SCHEMA b;
  CREATE TABLE big(i) AS SELECT generate_series(1,2000);
  VACUUM big; CHECKPOINT;
  CREATE FUNCTION inc(int) RETURNS INT
LANGUAGE plpgsql
AS $$ begin return $1+1; end; $$;
  CREATE FUNCTION inc_ab(int) RETURNS INT
LANGUAGE plpgsql SET search_path = a, b
AS $$ begin return $1+1; end; $$;

  -- baseline
  EXPLAIN ANALYZE SELECT inc(i) FROM big;

  -- test query
  EXPLAIN ANALYZE SELECT inc_ab(i) FROM big;

Results:

  baseline:  4.3s
  test query:
without patch:  14.7s
0001:   13.6s
0001,0002:  10.4s
0001,0002,0003:  8.6s

Timings were inconsistent for me so I took the middle of three runs.

It's a lot faster than without the patch. It's still 2X worse than not
specifying any search_path (baseline), but I think it brings it into
"usable" territory for more use cases.

Regards,
Jeff Davis

[1]
https://www.postgresql.org/docs/current/sql-createfunction.html#SQL-CREATEFUNCTION-SECURITY
From c05ee846d6d9099642d58933924a00d80262 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 27 Jul 2023 15:11:42 -0700
Subject: [PATCH v1 1/3] Transform proconfig for faster execution.

Store function config settings as a list of (name, value) pairs to
avoid the need to parse for each function invocation.
---
 src/backend/utils/fmgr/fmgr.c | 27 ++--
 src/backend/utils/misc/guc.c  | 58 +++
 src/include/utils/guc.h   |  1 +
 3 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 9208c31fe0..3b49684991 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -612,7 +612,7 @@ struct fmgr_security_definer_cache
 {
 	FmgrInfo	flinfo;			/* lookup info for target function */
 	Oid			userid;			/* userid to set, or InvalidOid */
-	ArrayType  *proconfig;		/* GUC values to set, or NULL */
+	List	   *configList;		/* GUC values to set, or NULL */
 	Datum		arg;			/* passthrough argument for plugin modules */
 };
 
@@ -634,6 +634,7 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 	FmgrInfo   *save_flinfo;
 	Oid			save_userid;
 	int			save_sec_context;
+	ListCell   *lc;
 	volatile int save_nestlevel;
 	PgStat_FunctionCallUsage fcusage;
 
@@ -666,8 +667,10 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 								&isnull);
 		if (!isnull)
 		{
+			ArrayType *array;
 			oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
-			fcache->proconfig = DatumGetArrayTypePCopy(datum);
+			array = DatumGetArrayTypeP(datum);
+			fcache->configList = TransformGUCArray(array);
 			MemoryContextSwitchTo(oldcxt);
 		}
 
@@ -680,7 +683,7 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 
 	/* GetUserIdAndSecContext is cheap enough that no harm in a wasted call */
	GetUserIdAndSecContext(&save_userid, &save_sec_context);
-	if (fcache->proconfig)		/* Need a new GUC nesting level */
+	if (fcache->configList != NIL)		/* Need a new GUC nesting level */
 		save_nestlevel = NewGUCNestLevel();
 	else
 		save_nestlevel = 0;		/* keep compiler quiet */
@@ -689,12 +692,18 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 		SetUserIdAndSecContext(fcache->userid,
 			   save_sec_context | 

Re: logical decoding and replication of sequences, take 2

2023-07-29 Thread Tomas Vondra



On 7/29/23 06:54, Amit Kapila wrote:
> On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
>  wrote:
>>
>> On 7/28/23 11:42, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>>>  wrote:

 On 7/26/23 09:27, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  
> wrote:

 Anyway, I was thinking about this a bit more, and it seems it's not as
 difficult to use the page LSN to ensure sequences don't go backwards.

>>>
>>> While studying the changes for this proposal and related areas, I have
>>> a few comments:
>>> 1. I think you need to advance the origin if it is changed due to
>>> copy_sequence(), otherwise, if the sync worker restarts after
>>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
>>> value.
>>>
>>
>> True, we want to restart at the new origin_startpos.
>>
>>> 2. Between the time of SYNCDONE and READY state, the patch can skip
>>> applying non-transactional sequence changes even if it should apply
>>> it. The reason is that during that state change
>>> should_apply_changes_for_rel() decides whether to apply change based
>>> on the value of remote_final_lsn which won't be set for
>>> non-transactional change. I think we need to send the start LSN of a
>>> non-transactional record and then use that as remote_final_lsn for
>>> such a change.
>>
>> Good catch. remote_final_lsn is set in apply_handle_begin, but that
>> won't happen for sequences. We're already sending the LSN, but
>> logicalrep_read_sequence ignores it - it should be enough to add it to
>> LogicalRepSequence and then set it in apply_handle_sequence().
>>
> 
> As per my understanding, the LSN sent is EndRecPtr of record which is
> the beginning of the next record (means current_record_end + 1). For
> comparing the current record, we use the start_position of the record
> as we do when we use the remote_final_lsn via apply_handle_begin().
> 
>>>
>>> 3. For non-transactional sequence change apply, we don't set
>>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
>>> we are doing in apply_handle_commit_internal() before calling
>>> CommitTransactionCommand(). So, that can lead to the origin moving
>>> backwards after restart which will lead to requesting and applying the
>>> same changes again and for that period of time sequence can go
>>> backwards. This needs some more thought as to what is the correct
>>> behaviour/solution for this.
>>>
>>
>> I think saying "origin moves backwards" is a bit misleading. AFAICS the
>> origin position is not actually moving backwards, it's more that we
>> don't (and can't) move it forwards for each non-transactional change. So
>> yeah, we may re-apply those, and IMHO that's expected - the sequence is
>> allowed to be "ahead" on the subscriber.
>>
> 
> But, if this happens then for a period of time the sequence will go
> backwards relative to what one would have observed before restart.
> 

That is true, but is it really a problem? This whole sequence decoding
thing was meant to allow logical failover - to make sure that after a
switch to the subscriber, the sequences don't generate duplicate values.
From this POV, the sequence going backwards (back to the confirmed origin
position) is not an issue - it's still far enough ahead of the publisher.

Is that great / ideal? No, I agree with that. But it was considered
acceptable and good enough for the failover use case ...

The only idea I have for improving that is that we could keep the
non-transactional changes (instead of applying them immediately), and
then apply them at the nearest "commit". That'd make them subject to the
position tracking, and the sequence would not go backwards, I think.

So every time we decode a commit, we'd check if we decoded any sequence
changes since the last commit, and merge them (a bit like a subxact).

This would however also mean sequence changes from rolled-back xacts may
not be replicated. I think that'd be fine, but IIRC Andres suggested
it's a valid use case.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)

2023-07-29 Thread Ranier Vilela
Hi,

With the MSVC compiler, the pg_leftmost_one_pos32 function
relies on an Assert to guarantee a correct result.

But the MSVC documentation [1] clearly says that
undefined behavior can occur.

Fix by checking the return value of the intrinsic to avoid
possible undefined behavior.
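
The idea, sketched (the attached patch is authoritative):

	static inline int
	pg_leftmost_one_pos32(uint32 word)
	{
		unsigned long result;

		/*
		 * _BitScanReverse() returns 0 when its input is 0, in which case
		 * "result" is undefined per the MSVC documentation, so check the
		 * return value instead of only asserting on the input.
		 */
		if (!_BitScanReverse(&result, word))
			return -1;		/* callers must never pass 0 anyway */
		return (int) result;
	}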

patch attached.

best regards,
Ranier Vilela

[1]
https://learn.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=msvc-170


0001-Avoid-undefined-behavior-with-msvc-compiler.patch
Description: Binary data


Re: logical decoding and replication of sequences, take 2

2023-07-29 Thread Tomas Vondra



On 7/28/23 14:44, Ashutosh Bapat wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>  wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>> The 0005 change does that, by:
>>
>> 1) adding pg_sequence_state, that returns both the sequence state and
>>the page LSN
>>
>> 2) copy_sequence returns the page LSN
>>
>> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>>just the LSN of the replication slot)
>>
>> AFAICS this makes it work - we start decoding at the page LSN, so that
>> we  skip the increments that could lead to the sequence going backwards.
>>
> 
> I like this design very much. It makes things simpler, not more complex.
> Thanks for doing this.
> 

I agree it seems simpler. It'd be good to try testing / reviewing it a
bit more, so that it doesn't misbehave in some way.

> I am wondering whether we could reuse pg_sequence_last_value() instead
> of adding a new function. But the name of the function doesn't leave
> much space for expanding its functionality. So we are good with a new
> one. Probably some code deduplication.
> 

I don't think we should do that; the pg_sequence_last_value() function
is meant to do something different, and I don't think making it also do
what pg_sequence_state() does would make anything simpler.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Postgres v15 windows bincheck regression test failures

2023-07-29 Thread Noah Misch
On Fri, Jul 28, 2023 at 04:00:00PM +0300, Alexander Lakhin wrote:
> 28.07.2023 14:42, Noah Misch wrote:
> >That was about a bug that appears when using TCP sockets. ...
> 
> Yes, and according to the failed test output, TCP sockets were used:
> 
> #   'pg_amcheck: error: connection to server at
> "127.0.0.1", port 49393 failed: server closed the connection
> unexpectedly
> #   This probably means the server terminated abnormally
> #   before or while processing the request.

I think we were talking about different details.  Agreed, bug #16678 probably
did cause the failure in the original post.  I was saying that bug has no
connection to the "scary comment", though.




Re: add timing information to pg_upgrade

2023-07-29 Thread Bharath Rupireddy
On Sat, Jul 29, 2023 at 12:17 AM Nathan Bossart
 wrote:
>
> On Fri, Jul 28, 2023 at 10:38:14AM -0700, Nathan Bossart wrote:
> > I'm okay with simply adding the time.  However, I wonder if we want to
> > switch to seconds, minutes, hours, etc. if the step takes longer.  This
> > would add a bit of complexity, but it would improve human readability.
> > Alternatively, we could keep the code simple and machine readable by always
> > using milliseconds.
>
> ... or maybe we show both like psql does.  Attached is a new version of the
> patch in which I've made use of the INSTR_TIME_* macros and enhanced the
> output formatting (largely borrowed from PrintTiming() in
> src/bin/psql/common.c).

The v2 patch LGTM. I tested it with an upgrade of a 22GB database;
the output is here [1].

While on this, I noticed a thing unrelated to your patch: there's
no space between the longest status message (60 bytes) and "ok" -
'Checking for incompatible "aclitem" data type in user tablesok
23.932 ms'. I think MESSAGE_WIDTH needs to be bumped up to 64 or more.

[1]
Performing Consistency Checks
-
Checking cluster versions   ok  0.000 ms
Checking database user is the install user  ok  1.767 ms
Checking database connection settings   ok  2.210 ms
Checking for prepared transactions  ok  1.411 ms
Checking for system-defined composite types in user tables  ok  29.471 ms
Checking for reg* data types in user tables ok  26.251 ms
Checking for contrib/isn with bigint-passing mismatch   ok  0.000 ms
Checking for incompatible "aclitem" data type in user tablesok  23.932 ms
Checking for user-defined encoding conversions  ok  8.350 ms
Checking for user-defined postfix operators ok  8.229 ms
Checking for incompatible polymorphic functions ok  15.271 ms
Checking for tables WITH OIDS   ok  6.120 ms
Checking for invalid "sql_identifier" user columns  ok  24.509 ms
Creating dump of global objects ok  14.007 ms
Creating dump of database schemas
ok  176.690 ms
Checking for presence of required libraries ok  0.011 ms
Checking database user is the install user  ok  2.566 ms
Checking for prepared transactions  ok  2.065 ms
Checking for new cluster tablespace directories ok  0.000 ms

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
--
Setting locale and encoding for new cluster ok  3.014 ms
Analyzing all rows in the new cluster   ok  373.270 ms
Freezing all rows in the new clusterok  81.064 ms
Deleting files from new pg_xact ok  0.089 ms
Copying old pg_xact to new server   ok  2.204 ms
Setting oldest XID for new cluster  ok  38.314 ms
Setting next transaction ID and epoch for new cluster   ok  111.503 ms
Deleting files from new pg_multixact/offsetsok  0.078 ms
Copying old pg_multixact/offsets to new server  ok  1.790 ms
Deleting files from new pg_multixact/membersok  0.050 ms
Copying old pg_multixact/members to new server  ok  1.532 ms
Setting next multixact ID and offset for new clusterok  36.770 ms
Resetting WAL archives  ok  37.182 ms
Setting frozenxid and minmxid counters in new cluster   ok  47.879 ms
Restoring global objects in the new cluster ok  11.615 ms
Restoring database schemas in the new cluster
ok  141.839 ms
Copying user relation files
ok
151308.601 ms (02:31.309)
Setting next OID for new clusterok  44.800 ms
Sync data directory to disk ok
4461.213 ms (00:04.461)
Creating script to delete old cluster   ok  0.059 ms
Checking for extension updates  ok  66.899 ms

Upgrade Complete

Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
/home/ubuntu/postgres/HEAD/bin/vacuumdb --all --analyze-in-stages
Running this script will delete the old cluster's data files:
./delete_old_cluster.sh

real2m38.133s
user0m0.151s
sys 0m21.556s

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com