Re: GUC names in messages

2023-11-29 Thread Peter Smith
On Thu, Nov 30, 2023 at 4:59 PM Michael Paquier  wrote:
>
> On Tue, Nov 28, 2023 at 11:54:33AM +1100, Peter Smith wrote:
> > Here is patch set v3.
> >
> > Patches 0001 and 0002 are unchanged from v2.
>
> After some grepping, I've noticed that 0002 had a mistake with
> track_commit_timestamp: some alternate output of modules/commit_ts/
> was not updated.  meson was able to reproduce the failure as well.
>
> I am not sure what we should do about mixed-case names either, so I
> have discarded DateStyle for now, and applied the rest.
>
> Also applied 0001 from Alvaro.
>

Thanks for pushing those parts.

> > Patch 0003 now uses a "%s%s%s" format specifier with GUC_FORMAT macro
> > in guc.c, as recently suggested by Michael [1].
>
> I cannot think of a better idea, as these strings need to be
> translated, so they need three %s.
>
> +   if (*p == '_')
> +   underscore = true;
> +   else if ('a' <= *p && *p <= 'z')
> +   lowercase = true;
>
> An issue with this code is that it would forget to quote GUCs that use
> dots, like the ones from an extension.  I don't really see why we
> cannot just make the macro return true only if all the characters of a
> GUC name are lower-case alphabetical characters?

Not forgotten. I felt the dot separator in such names might be
mistaken for a period in a sentence which is why I left quotes for
those ones. YMMV.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Testing autovacuum wraparound (including failsafe)

2023-11-29 Thread Masahiko Sawada
On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada  wrote:
>
> On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson  wrote:
> >
> > > On 28 Nov 2023, at 03:00, Masahiko Sawada  wrote:
> > >
> > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson  
> > > wrote:
> > >>
> > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada  wrote:
> > >>
> > >>> Is it true that we can modify the timeout after creating
> > >>> BackgroundPsql object? If so, it seems we don't need to introduce the
> > >>> new timeout argument. But how?
> > >>
> > >> I can't remember if that's a leftover that incorrectly remains from an
> > >> earlier version of the BackgroundPsql work, or if it's a very bad
> > >> explanation of ->set_query_timer_restart().  The timeout will use the
> > >> timeout_default value and that cannot be overridden; it can only be
> > >> reset per query.
> > >
> > > Thank you for confirming this. I see there is the same problem also in
> > > interactive_psql(). So I've attached the 0001 patch to fix these
> > > documentation issues.
> >
> > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
> > -which can be modified later.
> > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up.
> >
> > Since it cannot be modified, I think we should just say "A timeout of .."
> > and not call it a default timeout.  This obviously only matters for the
> > backpatch since the sentence is removed in 0002.
>
> Agreed.
>
> I've attached new version patches (0002 and 0003 are unchanged except
> for the commit message). I'll push them, barring any objections.
>

Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> Good catch, thank you for reporting! I will investigate more about it and 
> post my
> analysis.
>

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

In the reported case, neither is_prim_bucket_same_wrt nor
is_prev_bucket_same_wrt is set in the WAL record, and ntups is also zero.
This means that the wbuf is not registered in the WAL record on the primary
side (see [1]). Also, there is no reason to read (and lock) the other buffer
page, because we do not modify it. Based on that, I think that just
initializing writebuf as InvalidBuffer is sufficient.


I want to add a test case for it as well. I've tested on my environment and
found that the proposed number of tuples seems sufficient.
(We must fill the primary bucket, so the initial 500 is needed. Also, we must
add many dead pages to trigger the squeeze operation. The final page seems OK
with a smaller value.)

I compared the execution time before/after patching, and they are not so
different (1077 ms -> 1125 ms).

What do you think?

[1]:
```
else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
{
	uint8		wbuf_flags;

	/*
	 * A write buffer needs to be registered even if no tuples are
	 * added to it to ensure that we can acquire a cleanup lock on it
	 * if it is the same as primary bucket buffer or update the
	 * nextblkno if it is same as the previous bucket buffer.
	 */
	Assert(xlrec.ntups == 0);

	wbuf_flags = REGBUF_STANDARD;
	if (!xlrec.is_prev_bucket_same_wrt)
		wbuf_flags |= REGBUF_NO_CHANGE;
	XLogRegisterBuffer(1, wbuf, wbuf_flags);
}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



initialize_writebuf.patch
Description: initialize_writebuf.patch


[PATCH] plpython function causes server panic

2023-11-29 Thread Hao Zhang
Hi hackers,
I found a problem when executing plpython functions:
after a plpython function raises an error, continuing to execute plpython
functions in the same session can cause a server panic.

*Reproduce*
preparation

SET max_parallel_workers_per_gather=4;
SET parallel_setup_cost=1;
SET min_parallel_table_scan_size ='4kB';

CREATE TABLE t(i int);
INSERT INTO t SELECT generate_series(1, 1);

CREATE EXTENSION plpython3u;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
yield (i)

$$ LANGUAGE plpython3u parallel safe;

execute the function twice in the same session

postgres=# SELECT test_func() from t where i>10 and i<100;
ERROR:  error fetching next item from iterator
DETAIL:  Exception: cannot start subtransactions during a parallel
operation
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"

postgres=# SELECT test_func();
server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

*Analysis*

   - There is an SPI call in test_func(): plpy.execute().
   - The server then starts a subtransaction via
   PLy_spi_subtransaction_begin(); BUT the Python interpreter does not know
   whether an error happened during PLy_spi_subtransaction_begin().
   - If an error occurs in PLy_spi_subtransaction_begin(), the SPI call
   is terminated, but the Python error indicator isn't set and the
   PyObject isn't freed.
   - The next plpython UDF in the same session then fails due to the
   corrupted Python environment.


*Solution*
Use PG_TRY/PG_CATCH to catch the error that occurs in
PLy_spi_subtransaction_begin(), and set the Python error indicator.

With Regards
Hao Zhang
From 891ff26ba9b52f3eba6f0d112657c3b288524de2 Mon Sep 17 00:00:00 2001
From: Hao Zhang 
Date: Thu, 30 Nov 2023 14:51:07 +0800
Subject: [PATCH v1] Fix bug: plpython function causes server panic.

---
 src/pl/plpython/plpy_cursorobject.c | 12 ++---
 src/pl/plpython/plpy_spi.c  | 41 +++--
 src/pl/plpython/plpy_spi.h  |  2 +-
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/src/pl/plpython/plpy_cursorobject.c b/src/pl/plpython/plpy_cursorobject.c
index 57e8f8ec21..8bfe05815b 100644
--- a/src/pl/plpython/plpy_cursorobject.c
+++ b/src/pl/plpython/plpy_cursorobject.c
@@ -98,7 +98,8 @@ PLy_cursor_query(const char *query)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -196,7 +197,8 @@ PLy_cursor_plan(PyObject *ob, PyObject *args)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -333,7 +335,8 @@ PLy_cursor_iternext(PyObject *self)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -403,7 +406,8 @@ PLy_cursor_fetch(PyObject *self, PyObject *args)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
diff --git a/src/pl/plpython/plpy_spi.c b/src/pl/plpython/plpy_spi.c
index ff87b27de0..58a274eda9 100644
--- a/src/pl/plpython/plpy_spi.c
+++ b/src/pl/plpython/plpy_spi.c
@@ -77,7 +77,8 @@ PLy_spi_prepare(PyObject *self, PyObject *args)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -217,7 +218,8 @@ PLy_spi_execute_plan(PyObject *ob, PyObject *list, long limit)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -313,7 +315,8 @@ PLy_spi_execute_query(char *query, long limit)
 	oldcontext = CurrentMemoryContext;
 	oldowner = CurrentResourceOwner;
 
-	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+	if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+		return NULL;
 
 	PG_TRY();
 	{
@@ -556,7 +559,9 @@ PLy_rollback(PyObject *self, PyObject *args)
  *	MemoryContext oldcontext = CurrentMemoryContext;
  *	ResourceOwner oldowner = CurrentResourceOwner;
  *
- *	PLy_spi_subtransaction_begin(oldcontext, oldowner);
+ *	if 

Re: Synchronizing slots from primary to standby

2023-11-29 Thread Peter Smith
Here are some review comments for v41-0001.

==
doc/src/sgml/ref/alter_subscription.sgml

1.
+ 
+  When altering the
+  slot_name,
+  the failover property of the new slot may
differ from the
+  failover
+  parameter specified in the subscription, you need to adjust the
+  failover property when creating the slot so that it
+  matches the value specified in subscription.
+ 

For the second part a) it should be a separate sentence, and b) IMO
you are not really "adjusting" something if you are "creating" it.

SUGGESTION
When altering the slot_name, the failover property of the new slot may
differ from the failover parameter specified in the subscription. When
creating the slot, ensure the slot's failover property matches the
failover parameter value of the subscription.

==
src/backend/catalog/pg_subscription.c

2. AlterSubscription

+ if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed
when failover is enabled"),

This is another example where the "two_phase" and "failover" should be
extracted to make a common message for the translator.

(Already posted this comment before -- see [1] #13)

~~~

3.
+ /*
+ * See comments above for twophasestate, same holds true for
+ * 'failover'
+ */
+ if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed
when failover is enabled"),
+ errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false,
or use DROP/CREATE SUBSCRIPTION.")));

This is another example where the "two_phase" and "failover" should be
extracted to make a common message for the translator.

(Already posted this comment before -- see [1] #14)

==
src/backend/replication/walsender.c

4.
+/*
+ * Wake up logical walsenders with failover-enabled slots if the physical slot
+ * of the current walsender is specified in standby_slot_names GUC.
+ */
+void
+PhysicalWakeupLogicalWalSnd(void)

Is it better to refer to "walsender processes" being woken instead of
just walsenders?

e.g.
"Wake up logical walsenders..." --> "Wake up the logical walsender processes..."

==
[1] v35-0001 review.
https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: GUC names in messages

2023-11-29 Thread Michael Paquier
On Thu, Nov 30, 2023 at 03:29:49PM +0900, Kyotaro Horiguchi wrote:
> In this patch, the quotation marks cannot be changed from double
> quotes.

Indeed, that's a good point.  I completely forgot about that.
--
Michael




Re: Python installation selection in Meson

2023-11-29 Thread Peter Eisentraut

On 28.11.23 19:16, Andres Freund wrote:

On 2023-11-28 19:02:42 +0100, Peter Eisentraut wrote:

I noticed that under meson, the selection of the Python installation using
the 'PYTHON' option doesn't work completely.  The 'PYTHON' option determines
the Python binary that will be used to call the various build support
programs.  But it doesn't affect the Python installation used for PL/Python.
For that, we need to pass the program determined by the 'PYTHON' option back
into the find_installation() routine of the python module.  (Otherwise,
find_installation() will just search for an installation on its own.)  See
attached patch.  I ran this through Cirrus, seems to work.


Makes sense!


I have committed this, and also backpatched to 16 to keep the behavior 
consistent.






Re: GUC names in messages

2023-11-29 Thread Kyotaro Horiguchi
At Thu, 30 Nov 2023 14:59:21 +0900, Michael Paquier  wrote 
in 
> > Patch 0003 now uses a "%s%s%s" format specifier with GUC_FORMAT macro
> > in guc.c, as recently suggested by Michael [1].
> 
> I cannot think of a better idea, as these strings need to be
> translated, so they need three %s.


In this patch, the quotation marks cannot be changed from double
quotes.

After a brief review of the use of quotation marks in various
languages, it's observed that French uses guillemets (« »), German
uses lower quotation marks („ “), and Spanish uses angular quotation marks
(« ») or, alternatively, lower quotation marks. Japanese commonly uses
corner brackets (「」), but can also adopt double or single quotation
marks in certain contexts. I took a look at the backend's fr.po file
as a trial, and it indeed seems that guillemets are being used.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: proposal: possibility to read dumped table's name from file

2023-11-29 Thread Pavel Stehule
Hi

On Thu, Nov 30, 2023 at 4:40, Tom Lane  wrote:

> Daniel Gustafsson  writes:
> > I took another look at this, found some more polish that was needed,
> added
> > another testcase and ended up pushing it.
>
> mamba is unhappy because this uses  functions without
> casting their arguments to unsigned char:
>
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-11-30%2002%3A53%3A25
>
> (I had not realized that we still had buildfarm animals that would
> complain about this ... but I'm glad we do, because it's a hazard.
> POSIX is quite clear that the behavior is undefined for signed chars.)
>

here is a patch

Regards

Pavel


>
> regards, tom lane
>
commit 29a570f6b003513ef93691d92e01ea80fab11844
Author: ok...@github.com 
Date:   Thu Nov 30 07:10:58 2023 +0100

fix warning array subscript has type 'char'

diff --git a/src/bin/pg_dump/filter.c b/src/bin/pg_dump/filter.c
index dff871c7cd..d79b6da27c 100644
--- a/src/bin/pg_dump/filter.c
+++ b/src/bin/pg_dump/filter.c
@@ -185,14 +185,14 @@ filter_get_keyword(const char **line, int *size)
 	*size = 0;
 
 	/* Skip initial whitespace */
-	while (isspace(*ptr))
+	while (isspace((unsigned char) *ptr))
 		ptr++;
 
-	if (isalpha(*ptr))
+	if (isalpha((unsigned char) *ptr))
 	{
 		result = ptr++;
 
-		while (isalpha(*ptr) || *ptr == '_')
+		while (isalpha((unsigned char) *ptr) || *ptr == '_')
 			ptr++;
 
 		*size = ptr - result;
@@ -301,7 +301,7 @@ read_pattern(FilterStateData *fstate, const char *str, PQExpBuffer pattern)
 	bool		found_space = false;
 
 	/* Skip initial whitespace */
-	while (isspace(*str))
+	while (isspace((unsigned char) *str))
 		str++;
 
 	if (*str == '\0')
@@ -312,7 +312,7 @@ read_pattern(FilterStateData *fstate, const char *str, PQExpBuffer pattern)
 
 	while (*str && *str != '#')
 	{
-		while (*str && !isspace(*str) && !strchr("#,.()\"", *str))
+		while (*str && !isspace((unsigned char) *str) && !strchr("#,.()\"", *str))
 		{
 			/*
 			 * Append space only when it is allowed, and when it was found in
@@ -351,7 +351,7 @@ read_pattern(FilterStateData *fstate, const char *str, PQExpBuffer pattern)
 		found_space = false;
 
 		/* skip ending whitespaces */
-		while (isspace(*str))
+		while (isspace((unsigned char) *str))
 		{
 			found_space = true;
 			str++;
@@ -400,7 +400,7 @@ filter_read_item(FilterStateData *fstate,
 		fstate->lineno++;
 
 		/* Skip initial white spaces */
-		while (isspace(*str))
+		while (isspace((unsigned char) *str))
 			str++;
 
 		/*


Re: GUC names in messages

2023-11-29 Thread Michael Paquier
On Tue, Nov 28, 2023 at 11:54:33AM +1100, Peter Smith wrote:
> Here is patch set v3.
> 
> Patches 0001 and 0002 are unchanged from v2.

After some grepping, I've noticed that 0002 had a mistake with
track_commit_timestamp: some alternate output of modules/commit_ts/
was not updated.  meson was able to reproduce the failure as well.

I am not sure what we should do about mixed-case names either, so I
have discarded DateStyle for now, and applied the rest.

Also applied 0001 from Alvaro.

> Patch 0003 now uses a "%s%s%s" format specifier with GUC_FORMAT macro
> in guc.c, as recently suggested by Michael [1].

I cannot think of a better idea, as these strings need to be
translated, so they need three %s.

+   if (*p == '_')
+   underscore = true;
+   else if ('a' <= *p && *p <= 'z')
+   lowercase = true;

An issue with this code is that it would forget to quote GUCs that use
dots, like the ones from an extension.  I don't really see why we
cannot just make the macro return true only if all the characters of a
GUC name are lower-case alphabetical characters?

With an extra indentation applied, I finish with the attached for
0003.
--
Michael
From 77bd6b6f58ab1b0671f6afa12dc4e00900b456a8 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 30 Nov 2023 14:58:32 +0900
Subject: [PATCH v4] GUC names - maybe add quotes

---
 src/backend/utils/misc/guc.c  | 278 --
 .../expected/test_oat_hooks.out   |  16 +-
 .../unsafe_tests/expected/guc_privs.out   |  58 ++--
 .../unsafe_tests/expected/rolenames.out   |   2 +-
 src/test/regress/expected/compression.out |   4 +-
 src/test/regress/expected/compression_1.out   |   6 +-
 src/test/regress/expected/create_am.out   |   4 +-
 src/test/regress/expected/guc.out |  30 +-
 src/test/regress/expected/password.out|   4 +-
 src/test/regress/expected/subscription.out|   2 +-
 src/test/regress/expected/transactions.out|   8 +-
 contrib/auto_explain/t/001_auto_explain.pl|   2 +-
 src/pl/plperl/expected/plperl_init.out|   4 +-
 13 files changed, 259 insertions(+), 159 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e76c083003..aad72db9e5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -265,6 +265,30 @@ static bool call_string_check_hook(struct config_string *conf, char **newval,
 static bool call_enum_check_hook(struct config_enum *conf, int *newval,
  void **extra, GucSource source, int elevel);
 
+/* Macro for handling optional quotes around GUC names */
+#define GUC_FORMAT(s)\
+	quotes_needed_for_GUC_name(s) ? "\"" : "",\
+	s,\
+	quotes_needed_for_GUC_name(s) ? "\"" : ""
+
+/*
+ * Return whether the GUC name should be enclosed in double-quotes.
+ *
+ * Quoting is intended for names which could be mistaken for normal English
+ * words.  Quotes are only applied to GUC names that are written entirely with
+ * lower-case alphabetical characters.
+ */
+static bool
+quotes_needed_for_GUC_name(const char *name)
+{
+	for (const char *p = name; *p; p++)
+	{
+		if ('a' > *p || *p > 'z')
+			return false;
+	}
+
+	return true;
+}
 
 /*
  * This function handles both actual config file (re)loads and execution of
@@ -420,8 +444,9 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 			/* Invalid non-custom variable, so complain */
 			ereport(elevel,
 	(errcode(ERRCODE_UNDEFINED_OBJECT),
-	 errmsg("unrecognized configuration parameter \"%s\" in file \"%s\" line %d",
-			item->name,
+			/* translator: %s%s%s is for an optionally quoted GUC name */
+	 errmsg("unrecognized configuration parameter %s%s%s in file \"%s\" line %d",
+			GUC_FORMAT(item->name),
 			item->filename, item->sourceline)));
 			item->errmsg = pstrdup("unrecognized configuration parameter");
 			error = true;
@@ -460,10 +485,13 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 			gconf->status |= GUC_PENDING_RESTART;
 			ereport(elevel,
 	(errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
-	 errmsg("parameter \"%s\" cannot be changed without restarting the server",
-			gconf->name)));
-			record_config_file_error(psprintf("parameter \"%s\" cannot be changed without restarting the server",
-			  gconf->name),
+			/* translator: %s%s%s is for an optionally quoted GUC name */
+	 errmsg("parameter %s%s%s cannot be changed without restarting the server",
+			GUC_FORMAT(gconf->name))));
+			record_config_file_error(
+			/* translator: %s%s%s is for an optionally quoted GUC name */
+	 psprintf("parameter %s%s%s cannot be changed without restarting the server",
+			  GUC_FORMAT(gconf->name)),
 	 NULL, 0,
 	 , );
 			error = true;
@@ -496,8 +524,9 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 			/* Log the change 

Re: possibility to define only local cursors

2023-11-29 Thread Pavel Stehule
On Thu, Nov 30, 2023 at 6:45, Pavel Stehule  wrote:

> Hi
>
> one of my customers migrated a pretty large application from Oracle, and
> when he did performance tests, he found very high memory usage, probably
> related to unclosed cursors. The overhead is significantly bigger than on
> Oracle (perhaps Oracle closes cursors when the cursor variable leaves its
> scope, I don't know. Maybe it just uses a different pattern with shorter
> transactions on Oracle). He cannot use a FOR loop, because he needs to
> keep the code in a form that allows automatic translation from PL/SQL to
> PL/pgSQL for some years (for some years he will support both platforms).
>
> DECLARE qNAJUPOSPL refcursor;
> BEGIN
>   OPEN qNAJUPOSPL FOR EXECUTE mSqNAJUPOSPL;
>   LOOP
> FETCH qNAJUPOSPL INTO mID_NAJVUPOSPL , mID_NAJDATSPLT , mID_PREDPIS;
> EXIT WHEN NOT FOUND; /* apply on qNAJUPOSPL */
>   END LOOP;
> END;
>
> Because cursors in plpgsql and postgres can be referenced just by name, it
> is not possible to use reference counting and close cursors when the
> reference count drops to zero. Can we introduce some modifier that forces
> closing an unclosed cursor before the related scope is left?
>
> Something like `DECLARE curvar refcursor LOCAL`
>
> Another way to solve this issue is just a warning when the number of open
> cursors crosses some limit. Later this warning can be disabled, increased
> or acted upon. But investigating related memory issues would then be easy.
>

It could be implemented as an extra warning for the OPEN statement.



>
> Comments, notes?
>
> Regards
>
> Pavel
>
>
>
>
>


possibility to define only local cursors

2023-11-29 Thread Pavel Stehule
Hi

one of my customers migrated a pretty large application from Oracle, and
when he did performance tests, he found very high memory usage, probably
related to unclosed cursors. The overhead is significantly bigger than on
Oracle (perhaps Oracle closes cursors when the cursor variable leaves its
scope, I don't know. Maybe it just uses a different pattern with shorter
transactions on Oracle). He cannot use a FOR loop, because he needs to
keep the code in a form that allows automatic translation from PL/SQL to
PL/pgSQL for some years (for some years he will support both platforms).

DECLARE qNAJUPOSPL refcursor;
BEGIN
  OPEN qNAJUPOSPL FOR EXECUTE mSqNAJUPOSPL;
  LOOP
FETCH qNAJUPOSPL INTO mID_NAJVUPOSPL , mID_NAJDATSPLT , mID_PREDPIS;
EXIT WHEN NOT FOUND; /* apply on qNAJUPOSPL */
  END LOOP;
END;

Because cursors in plpgsql and postgres can be referenced just by name, it
is not possible to use reference counting and close cursors when the
reference count drops to zero. Can we introduce some modifier that forces
closing an unclosed cursor before the related scope is left?

Something like `DECLARE curvar refcursor LOCAL`

Another way to solve this issue is just a warning when the number of open
cursors crosses some limit. Later this warning can be disabled, increased
or acted upon. But investigating related memory issues would then be easy.

Comments, notes?

Regards

Pavel


Re: Custom explain options

2023-11-29 Thread Andrei Lepikhov

On 21/10/2023 19:16, Konstantin Knizhnik wrote:
The EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST, ...) which help to provide useful details of query execution.
In Neon we have added a PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for the
Neon architecture because of the separation of compute and storage, so it
is implemented not only for bitmap heap scan as in vanilla Postgres, but
also for seqscan, indexscan and index-only scan). Another possible
candidate for explain options is the local file cache (an extra caching
layer above shared buffers which is used to somehow replace the file
system cache in standalone Postgres).


I think that it would be nice to have a generic mechanism which allows
extensions to add their own options to EXPLAIN.


Generally, I welcome this idea: extensions can already do a lot of work,
and they should have a tool to report their state, not only into the log.
But I think your approach needs to be elaborated. First, it would be
better to allow registering extended instruments for specific node types,
to avoid unneeded calls.
Second, looking at how Instrumentation is used, I don't see a reason to
limit the size: as far as I can see, it always exists locally or in the
DSA, where its size is calculated on the fly. So, by registering an
extended instrument, we can reserve a slot for the extension. The actual
size of the underlying data can be provided by the extension routine.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: should check collations when creating partitioned index

2023-11-29 Thread Peter Eisentraut

On 23.11.23 11:01, Peter Eisentraut wrote:

On 20.11.23 17:25, Tom Lane wrote:

Peter Eisentraut  writes:

On 14.11.23 17:15, Tom Lane wrote:
I don't love the patch details though.  It seems entirely wrong to check
this before we check the opclass match.



Not sure why?  The order doesn't seem to matter?


The case that was bothering me was if we had a non-collated type
versus a collated type.  That would result in throwing an error
about collation mismatch, when complaining about the opclass seems
more apropos.  However, if we do this:


I see.  That means we shouldn't raise an error on a mismatch but just do
  if (key->partcollation[i] != collationIds[j])
  continue;


it might not matter much.


Here is an updated patch that works as indicated above.

The behavior if you try to create an index with mismatching collations 
now is that it will skip over the column and complain at the end with 
something like


ERROR:  0A000: unique constraint on partitioned table must include all 
partitioning columns
DETAIL:  UNIQUE constraint on table "t1" lacks column "b" which is part 
of the partition key.


which perhaps isn't intuitive, but I think it would be the same if you 
somehow tried to build an index with different operator classes than the 
partitioning.  I think these less-specific error messages are ok in such 
edge cases.


If there are no further comments on this patch version, I plan to go 
ahead and commit it soon.






Re: patch: improve "user mapping not found" error message

2023-11-29 Thread Peter Eisentraut

On 23.11.23 09:41, Peter Eisentraut wrote:

On 20.11.23 02:25, Ian Lawrence Barwick wrote:

2023年7月3日(月) 18:22 Peter Eisentraut :


On 23.06.23 09:45, Ian Lawrence Barwick wrote:

 	if (!HeapTupleIsValid(tp))
+	{
+		ForeignServer *server = GetForeignServer(serverid);
+
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
-				 errmsg("user mapping not found for \"%s\"",
-						MappingUserName(userid))));
+				 errmsg("user mapping not found for user \"%s\", server \"%s\"",
+						MappingUserName(userid),
+						server->servername)));
+	}


What if the foreign server does not exist either?  Then this would show
a "cache lookup failed" error message, which I think we should avoid.

There is existing logic for handling this in
get_object_address_usermapping().


Apologies, missed this response somewhere. Does the attached fix that?


Hmm, now that I look at this again, under what circumstances would the 
server not be found?  Maybe the first patch was right and it should give 
a "scary" error in that case, instead of just omitting it.


In any case, this patch appears to be missing an update in the 
postgres_fdw test output.


I have committed the first version of the patch together with the 
required test changes.






Re: Improve rowcount estimate for UNNEST(column)

2023-11-29 Thread jian he
On Mon, Nov 27, 2023 at 3:05 PM jian he  wrote:
>
> Hi.
> Since both array_op_test and arrtest are not dropped at the end of
> src/test/regress/sql/arrays.sql,
> I found using the table array_op_test more convincing.
>
> select
> reltuples * 10 as original,
> reltuples * (select
> floor(elem_count_histogram[array_length(elem_count_histogram,1)])
> from    pg_stats
> where   tablename = 'array_op_test' and attname = 'i')
> as with_patch
> ,(select (elem_count_histogram[array_length(elem_count_histogram,1)])
> from    pg_stats
> where   tablename = 'array_op_test' and attname = 'i')
> as elem_count_histogram_last_element
> from pg_class where relname = 'array_op_test';
>  original | with_patch | elem_count_histogram_last_element
> --++---
>  1030 |412 | 4.7843137
> (1 row)
>
> without patch:
> explain select unnest(i)  from array_op_test;
>   QUERY PLAN
> --
>  ProjectSet  (cost=0.00..9.95 rows=1030 width=4)
>->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
> (2 rows)
>
> with patch:
>  explain select unnest(i)  from array_op_test;
>   QUERY PLAN
> --
>  ProjectSet  (cost=0.00..6.86 rows=412 width=4)
>->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
> (2 rows)
> 

Hi.
I made a minor change: estimate_array_length's return type is now double,
and `int ntuples` inside the cost_tidscan function is now `double ntuples`.

`clamp_row_est(get_function_rows(root, expr->funcid, clause));` will
round 4.7843137 to 5,
so with your patch and my refactor, the rows will be 103 * 5 = 515.

 explain select unnest(i)  from array_op_test;
  QUERY PLAN
--
 ProjectSet  (cost=0.00..7.38 rows=515 width=4)
   ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
(2 rows)
From 448e0149d0dd1cfc0ab629a6fc164c6071169926 Mon Sep 17 00:00:00 2001
From: pgaddict 
Date: Thu, 30 Nov 2023 12:15:01 +0800
Subject: [PATCH v1 1/1] minor refactor.

make estimate_array_length return double instead of int.
similarly, change the variable ntuples in estimate_array_length's caller
function cost_tidscan from type int to double.

because in estimate_array_length, we may need the "average count of distinct
element values" info, and that is a float4 data type.

---
 src/backend/optimizer/path/costsize.c |  6 +++---
 src/backend/utils/adt/selfuncs.c  | 18 +-
 src/include/utils/selfuncs.h  |  2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8bcdcc7f..65854fbe 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1227,7 +1227,7 @@ cost_tidscan(Path *path, PlannerInfo *root,
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
 	QualCost	tid_qual_cost;
-	int			ntuples;
+	double			ntuples;
 	ListCell   *l;
 	double		spc_random_page_cost;
 
@@ -1242,7 +1242,7 @@ cost_tidscan(Path *path, PlannerInfo *root,
 		path->rows = baserel->rows;
 
 	/* Count how many tuples we expect to retrieve */
-	ntuples = 0;
+	ntuples = 0.0;
 	foreach(l, tidquals)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
@@ -4741,7 +4741,7 @@ cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
 		Node	   *arraynode = (Node *) lsecond(saop->args);
 		QualCost	sacosts;
 		QualCost	hcosts;
-		int			estarraylen = estimate_array_length(context->root, arraynode);
+		double			estarraylen = estimate_array_length(context->root, arraynode);
 
 		set_sa_opfuncid(saop);
 		sacosts.startup = sacosts.per_tuple = 0;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6346b41d..f6519c7f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2130,7 +2130,7 @@ scalararraysel(PlannerInfo *root,
  *
  * It's important that this agree with scalararraysel.
  */
-int
+double
 estimate_array_length(PlannerInfo *root, Node *arrayexpr)
 {
 	/* look through any binary-compatible relabeling of arrayexpr */
@@ -2145,12 +2145,12 @@ estimate_array_length(PlannerInfo *root, Node *arrayexpr)
 		if (arrayisnull)
 			return 0;
 		arrayval = DatumGetArrayTypeP(arraydatum);
-		return ArrayGetNItems(ARR_NDIM(arrayval), ARR_DIMS(arrayval));
+		return (double) ArrayGetNItems(ARR_NDIM(arrayval), ARR_DIMS(arrayval));
 	}
 	else if (arrayexpr && IsA(arrayexpr, ArrayExpr) &&
 			 !((ArrayExpr *) arrayexpr)->multidims)
 	{
-		return list_length(((ArrayExpr *) arrayexpr)->elements);
+		return (double) list_length(((ArrayExpr *) arrayexpr)->elements);
 	}
 	else if (arrayexpr && 

Re: [PoC] Reducing planning time when tables have many partitions

2023-11-29 Thread Yuya Watari
Hello,

On Wed, Nov 22, 2023 at 2:32 PM Yuya Watari  wrote:
> Unfortunately, I've been busy due to work, so I won't be able to
> respond for several weeks. I'm really sorry for not being able to see
> the patches. As soon as I'm not busy, I will look at them, consider
> the above approach, and reply to this thread. If there is no
> objection, I will move this CF entry forward to next CF.

Since the end of this month is approaching, I moved this CF entry to
the next CF (January CF). I will reply to this thread in a few weeks.
Again, I appreciate your kind reviews and patches.

-- 
Best regards,
Yuya Watari




Re: [PATCH] Native spinlock support on RISC-V

2023-11-29 Thread Thomas Munro
On Wed, Nov 3, 2021 at 11:55 AM Thomas Munro  wrote:
> 1.  Even though we're using generic built-ins for atomics, I guess we
> could still use a src/include/port/atomics/arch-riscv.h file so we
> have a place to define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY when
> building for 64 bit.  We'd need to find the chapter-and-verse to
> justify that, of course, which I can try to do; if someone more
> familiar with the ISA/spec/manual can point to it I'm all ears.

I randomly remembered this topic after seeing an s390x announcement
from Christoph[1], and figured he or someone else might be interested
in the same observation about that platform.  That is, we finally got
around to defining this for ARM, but I bet one internet point that
RISCV64 and s390x have that property too (but would need to find a
written reference, or perhaps a similar declaration in the Linux
sources, etc):

$ git grep '#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY'
src/include/port/atomics/arch-arm.h:#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
src/include/port/atomics/arch-ppc.h:#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
src/include/port/atomics/arch-x86.h:#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY

[1] 
https://www.postgresql.org/about/news/postgresql-on-s390x-on-debian-and-ubuntu-2752/
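To make the suggestion concrete, a hypothetical src/include/port/atomics/arch-riscv.h could mirror arch-arm.h's guard — untested, and the chapter-and-verse justification would still be needed; __riscv and __riscv_xlen are the standard RISC-V compiler-defined macros:

```c
/* Hypothetical sketch of src/include/port/atomics/arch-riscv.h.
 * Claim 8-byte single-copy atomicity only when targeting 64-bit RISC-V. */
#if defined(__riscv) && defined(__riscv_xlen) && __riscv_xlen == 64
#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
#endif

/* Helper so the guard can be exercised on any build host. */
static int
have_8byte_single_copy_atomicity(void)
{
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    return 1;
#else
    return 0;
#endif
}
```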




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-29 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Sorry, my analysis was not complete. On looking closely, I think the
> reason is that we are allowed to upgrade the slot iff there is no
> pending WAL to be processed. 

Yes, the guard will strongly protect against data loss, but I did not take
care of that in the test.

> The test first disables the subscription
> to avoid unnecessary LOGs on the subscriber and then stops the
> publisher node.

Right. Unnecessary ERRORs would appear if we did not disable it.

> It is quite possible that just before the shutdown of
> the server, autovacuum generates some WAL record that needs to be
> processed, 

Yeah, pg_upgrade does not ensure that autovacuum is not running *before* the
upgrade.

> so you propose just disabling the autovacuum for this test.

Absolutely correct.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Postgres Partitions Limitations (5.11.2.3)

2023-11-29 Thread Shubham Khanna
On Thu, Nov 9, 2023 at 10:00 PM shihao zhong  wrote:
>
> That looks good to me!
>
> The new status of this patch is: Ready for Committer


I have reviewed the patch and it is working fine.

Thanks and Regards,
Shubham Khanna.




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-29 Thread Tom Lane
Alexander Lakhin  writes:
> And a warning:
> $ CC=gcc-12 CFLAGS="-Wall -Wextra -Wno-unused-parameter -Wno-sign-compare 
> -Wno-clobbered 
> -Wno-missing-field-initializers" ./configure -q && make -s
> slru.c:63:1: warning: ‘inline’ is not at beginning of declaration 
> [-Wold-style-declaration]
>     63 | static int  inline
>    | ^~

> Maybe it's worth fixing before committing...

This should have been fixed before commit, because there are now a
dozen buildfarm animals complaining about it, as well as who-knows-
how-many developers' compilers.

 calliphoridae | 2023-11-30 02:48:59 | 
/home/bf/bf-build/calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 canebrake | 2023-11-29 14:22:10 | 
/home/bf/bf-build/canebrake/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 culicidae | 2023-11-30 02:49:06 | 
/home/bf/bf-build/culicidae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 desmoxytes| 2023-11-30 03:11:15 | 
/home/bf/bf-build/desmoxytes/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 flaviventris  | 2023-11-30 02:53:19 | 
/home/bf/bf-build/flaviventris/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 francolin | 2023-11-30 02:26:08 | 
../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not at 
beginning of declaration [-Wold-style-declaration]
 grassquit | 2023-11-30 02:58:36 | 
/home/bf/bf-build/grassquit/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 komodoensis   | 2023-11-30 03:07:52 | 
/home/bf/bf-build/komodoensis/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 phycodurus| 2023-11-29 14:29:02 | 
/home/bf/bf-build/phycodurus/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 piculet   | 2023-11-30 02:32:57 | 
../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not at 
beginning of declaration [-Wold-style-declaration]
 pogona| 2023-11-29 14:22:31 | 
/home/bf/bf-build/pogona/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 rorqual   | 2023-11-30 02:32:41 | 
../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not at 
beginning of declaration [-Wold-style-declaration]
 serinus   | 2023-11-30 02:47:05 | 
/home/bf/bf-build/serinus/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 skink | 2023-11-29 14:23:05 | 
/home/bf/bf-build/skink-master/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 taipan| 2023-11-30 03:03:52 | 
/home/bf/bf-build/taipan/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
 tamandua  | 2023-11-30 02:49:50 | 
/home/bf/bf-build/tamandua/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
 warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]

regards, tom lane




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-29 Thread Amit Kapila
On Thu, Nov 30, 2023 at 8:40 AM Amit Kapila  wrote:
>
> On Wed, Nov 29, 2023 at 2:56 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > > >
> > > > > Pushed!
> > > >
> > > > Hi all, the CF entry for this is marked RfC, and CI is trying to apply
> > > > the last patch committed. Is there further work that needs to be
> > > > re-attached and/or rebased?
> > > >
> > >
> > > No. I have marked it as committed.
> > >
> >
> > I found another failure related to the commit [1]. I think it is caused
> > by the autovacuum. I want to propose a patch which disables the autovacuum
> > for the old publisher.
> >
> > More detail, please see below.
> >
> > # Analysis of the failure
> >
> > Summary: this failure occurs when the autovacuum starts after the 
> > subscription
> > is disabled but before doing pg_upgrade.
> >
> > According to the regress file, the pg_upgrade unexpectedly failed [2]. There
> > is no possibility that the slots were invalidated, so some WAL seems to have
> > been generated after disabling the subscriber.
> >
> > Also, the server log from the old publisher said that an autovacuum worker
> > was terminated when it stopped. This occurred after the walsender released
> > the logical slots. WAL records generated by autovacuum workers could not be
> > consumed by the slots, so the upgrading function returned false.
> >
> > # How to reproduce
> >
> > I made a small file for reproducing the failure. Please see reproduce.txt.
> > This contains changes for launching the autovacuum worker very often and
> > for ensuring actual work is done. After applying it, I could reproduce the
> > same failure every time.
> >
> > # How to fix
> >
> > I think it is sufficient to fix only the test code.
> > The easiest way is to disable the autovacuum on the old publisher. PSA the
> > patch file.
> >
>
> Agreed, for now, we should change the test as you proposed. I'll take
> care of that. However, I wonder, if we should also ensure that
> autovacuum or any other worker is shut down before walsender processes
> the last set of WAL before shutdown. We can analyze more on this and
> probably start a separate thread to discuss this point.
>

Sorry, my analysis was not complete. On looking closely, I think the
reason is that we are allowed to upgrade the slot iff there is no
pending WAL to be processed. The test first disables the subscription
to avoid unnecessary LOGs on the subscriber and then stops the
publisher node. It is quite possible that just before the shutdown of
the server, autovacuum generates some WAL record that needs to be
processed, so you propose just disabling the autovacuum for this test.
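For reference, the test-side change being discussed would amount to something like the following in the old publisher's configuration (a sketch — the actual patch may apply this through the TAP framework's append_conf instead):

```
# postgresql.conf on the old publisher for the pg_upgrade TAP test:
# keep autovacuum from emitting WAL between subscription disable and shutdown
autovacuum = off
```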

-- 
With Regards,
Amit Kapila.




Re: [PATCH] LockAcquireExtended improvement

2023-11-29 Thread Jingxian Li
Hi Andres, Thanks for your quick reply!

On 2023/11/29 0:51, Andres Freund wrote:
> Hi,
>
> On 2023-11-28 20:52:31 +0800, Jingxian Li wrote:
>> postgres=*# lock table test in exclusive mode ;
>>
>>
>> T4
>>
>> Case 1:
>>
>> postgres=*# lock table test in share row exclusive mode   nowait;
>>
>> ERROR: could not   obtain lock on relation "test"
>>
>> 
>>
>> Case 2:
>>
>> postgres=*# lock table test in share row exclusive mode;
>>
>> LOCK TABLE
>>
>>
>>
>> At moment T4 in session A, (case 1) when executing the SQL “lock table test in
>> share row exclusive mode nowait;”, an error occurs with the message “could not
>> obtain lock on relation test”; however, (case 2) when executing the SQL above
>> without nowait, the lock can be obtained successfully.
>>
>> Digging into the source code, I find that in case 2 the lock was obtained in
>> the function ProcSleep instead of LockAcquireExtended. Due to nowait logic
>> processed before WaitOnLock-ProcSleep, acquiring lock failed in case
>> 1. Can any changes be made so that the act of such lock granted occurs
>> before WaitOnLock?
> I don't think that'd make sense - lock reordering is done to prevent deadlocks
> and is quite expensive. Why should NOWAIT incur that cost?
>
>
>>
>> Providing a more universal case:
>>
>> Transaction A already holds an n-mode lock on table test. If then 
>> transaction A requests an m-mode lock on table Test, m and n have the 
>> following constraints:
>>
>> (lockMethodTable->conflictTab[n] &
>> lockMethodTable->conflictTab[m]) == lockMethodTable->conflictTab[m]
>>
>> Obviously, in this case, m<=n.
>>
>> Should the m-mode lock be granted before WaitOnLock?
>>
>>
>> In the case of m=n (i.e. we already hold the lock), the m-mode lock is
>> immediately granted in the LocalLock path, without the need of lock conflict
>> check.
> Sure - it'd not help anybody to wait for a lock we already hold - in fact it'd
> create a lot of deadlocks.
>
>
>> Based on the facts above, can we obtain a weaker lock (m<n) on the same object within the same transaction without doing lock conflict check?
> Perhaps. There's no inherent "lock strength" ordering for all locks though.



I also noticed that there is no inherent "lock strength" ordering for all locks.
So I use the following method in the code to determine the strength of the lock:
if (m<n and (lockMethodTable->conflictTab[n] &
lockMethodTable->conflictTab[m]) == lockMethodTable->conflictTab[m])
then we can say that the m-mode lock is weaker than the n-mode lock.


Transaction A already holds an n-mode lock on table test;
that is, no held locks conflict with the n-mode lock on table test.
If transaction A then requests an m-mode lock on table test,
then, as n's conflictTab covers m's, it can be concluded that
there are no locks conflicting with the requested m-mode lock.
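The subset test above can be sketched with toy bitmasks (illustrative values only, not PostgreSQL's actual lock.h conflict table):

```c
#include <stdint.h>

typedef uint32_t LockMask;

/* True when m's conflict set is covered by n's: every mode that conflicts
 * with m also conflicts with n, so a held n-mode lock already excludes any
 * locker that could conflict with the requested m-mode lock. */
static int
covers(LockMask conflict_n, LockMask conflict_m)
{
    return (conflict_n & conflict_m) == conflict_m;
}
```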



>
> Greetings,
>
> Andres Freund
>

With regards,

Jingxian Li


Re: proposal: possibility to read dumped table's name from file

2023-11-29 Thread Tom Lane
Daniel Gustafsson  writes:
> I took another look at this, found some more polish that was needed, added
> another testcase and ended up pushing it.

mamba is unhappy because this uses <ctype.h> functions without
casting their arguments to unsigned char:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba=2023-11-30%2002%3A53%3A25

(I had not realized that we still had buildfarm animals that would
complain about this ... but I'm glad we do, because it's a hazard.
POSIX is quite clear that the behavior is undefined for signed chars.)
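For readers unfamiliar with the hazard, the portable idiom is to cast through unsigned char before calling any <ctype.h> classifier, since plain char may be signed and passing a negative value other than EOF is undefined behavior:

```c
#include <ctype.h>

/* The <ctype.h> classifiers take an int that must be either EOF or a value
 * representable as unsigned char.  A high-bit byte (e.g. from UTF-8 text)
 * is negative when plain char is signed, so always cast first. */
static int
safe_isalpha(char c)
{
    return isalpha((unsigned char) c);
}
```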

regards, tom lane




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-29 Thread Amit Kapila
On Wed, Nov 29, 2023 at 2:56 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > >
> > > > Pushed!
> > >
> > > Hi all, the CF entry for this is marked RfC, and CI is trying to apply
> > > the last patch committed. Is there further work that needs to be
> > > re-attached and/or rebased?
> > >
> >
> > No. I have marked it as committed.
> >
>
> I found another failure related to the commit [1]. I think it is caused by
> the autovacuum. I want to propose a patch which disables the autovacuum for
> the old publisher.
>
> More detail, please see below.
>
> # Analysis of the failure
>
> Summary: this failure occurs when the autovacuum starts after the subscription
> is disabled but before doing pg_upgrade.
>
> According to the regress file, the pg_upgrade unexpectedly failed [2]. There
> is no possibility that the slots were invalidated, so some WAL seems to have
> been generated after disabling the subscriber.
>
> Also, the server log from the old publisher said that an autovacuum worker
> was terminated when it stopped. This occurred after the walsender released
> the logical slots. WAL records generated by autovacuum workers could not be
> consumed by the slots, so the upgrading function returned false.
>
> # How to reproduce
>
> I made a small file for reproducing the failure. Please see reproduce.txt.
> This contains changes for launching the autovacuum worker very often and for
> ensuring actual work is done. After applying it, I could reproduce the same
> failure every time.
>
> # How to fix
>
> I think it is sufficient to fix only the test code.
> The easiest way is to disable the autovacuum on the old publisher. PSA the
> patch file.
>

Agreed, for now, we should change the test as you proposed. I'll take
care of that. However, I wonder, if we should also ensure that
autovacuum or any other worker is shut down before walsender processes
the last set of WAL before shutdown. We can analyze more on this and
probably start a separate thread to discuss this point.


-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> I've discovered that that patch introduced a code path leading to an
> uninitialized memory access.
> With the following addition to hash_index.sql:
>   -- Fill overflow pages by "dead" tuples.
>   BEGIN;
>   INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000)
> as i;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as
> i;
>   ROLLBACK;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as
> i;
> 
> make check -C src/test/recovery/ PROVE_TESTS="t/027*"
> when executed under Valgrind, triggers:
> ==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised
> value(s)
> ==00:00:02:30.285 97744==at 0x227585: BufferIsValid (bufmgr.h:303)
> ==00:00:02:30.285 97744==by 0x227585: hash_xlog_squeeze_page
> (hash_xlog.c:781)
> ==00:00:02:30.285 97744==by 0x228133: hash_redo (hash_xlog.c:1083)
> ==00:00:02:30.285 97744==by 0x2C2801: ApplyWalRecord
> (xlogrecovery.c:1943)
> ==00:00:02:30.285 97744==by 0x2C5C52: PerformWalRecovery
> (xlogrecovery.c:1774)
> ==00:00:02:30.285 97744==by 0x2B63A1: StartupXLOG (xlog.c:5559)
> ==00:00:02:30.285 97744==by 0x558165: StartupProcessMain
> (startup.c:282)
> ==00:00:02:30.285 97744==by 0x54DFE8: AuxiliaryProcessMain
> (auxprocess.c:141)
> ==00:00:02:30.285 97744==by 0x5546B0: StartChildProcess
> (postmaster.c:5331)
> ==00:00:02:30.285 97744==by 0x557A53: PostmasterMain
> (postmaster.c:1458)
> ==00:00:02:30.285 97744==by 0x4720C2: main (main.c:198)
> ==00:00:02:30.285 97744==
> (in 027_stream_regress_standby_1.log)
> 
> That is, when line
> https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.ht
> ml#661
> is reached, writebuf stays uninitialized.

Good catch, thank you for reporting! I will investigate more about it and post 
my analysis.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



pg_dump/nls.mk is missing a file

2023-11-29 Thread Kyotaro Horiguchi
Hello.

Upon reviewing my translation, I discovered that filter.c was not
included in the nls.mk of pg_dump. Additionally, it appears that two '.h'
files have been included for a long time, but they seem unnecessary as
their removal does not affect the results. The attached patch is
intended to correct these issues.

For Meson, on the other hand, I believe there is nothing in particular
that needs to be done.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_dump/nls.mk b/src/bin/pg_dump/nls.mk
index cd91737f48..ca7ce74275 100644
--- a/src/bin/pg_dump/nls.mk
+++ b/src/bin/pg_dump/nls.mk
@@ -15,13 +15,12 @@ GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
compress_zstd.c \
pg_dump.c \
common.c \
+   filter.c \
pg_dump_sort.c \
pg_restore.c \
pg_dumpall.c \
parallel.c \
-   parallel.h \
pg_backup_utils.c \
-   pg_backup_utils.h \
../../common/compression.c \
../../common/exec.c \
../../common/fe_memutils.c \


Re: pg_upgrade and logical replication

2023-11-29 Thread Amit Kapila
On Thu, Nov 30, 2023 at 6:37 AM Peter Smith  wrote:
>
> Here are some review comments for patch v20-0001
>
> ==
>
> 1. getSubscriptions
>
> + if (dopt->binary_upgrade && fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query, " s.subenabled\n");
> + else
> + appendPQExpBufferStr(query, " false AS subenabled\n");
>
> Probably I misunderstood this logic... AFAIK the CREATE SUBSCRIPTION
> is normally default *enabled*, so why does this code set default
> differently as 'false'. OTOH, if this is some special case default
> needed because the subscription upgrade is not supported before PG17
> then maybe it needs a comment to explain.
>

Yes, it is for prior versions. By default subscriptions are restored
disabled even if they are enabled before dump. See docs [1] for
reasons (When dumping logical replication subscriptions, ..). I don't
think we need a comment here as that is a norm we use at other similar
places where we do version checking. We can argue that there could be
more comments as to why the 'connect' is false and if those are really
required, we should do that as a separate patch.

[1] - https://www.postgresql.org/docs/devel/app-pgdump.html

-- 
With Regards,
Amit Kapila.




Re: encoding affects ICU regex character classification

2023-11-29 Thread Thomas Munro
On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis  wrote:
> Character classification is not localized at all in libc or ICU as far
> as I can tell.

Really?  POSIX isalpha()/isalpha_l() and friends clearly depend on a
locale.  See eg d522b05c for a case where that broke something.
Perhaps you mean glibc wouldn't do that to you because you know that,
as an unstandardised detail, it sucks in (some version of) Unicode's
data which shouldn't vary between locales.  But you are allowed to
make your own locales, including putting whatever classifications you
want into the LC_TYPE file using POSIX-standardised tools like
localedef.  Perhaps that is a bit of a stretch, and no one really does
that in practice, but anyway it's still "localized".

Not knowing anything about how glibc generates its charmaps, Unicode
or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
have an old hand-crafted table, but for UTF-8 encoding it's fully
outsourced to Unicode, and that's why you see a difference.  Another
problem seen in a few parts of our tree is that we sometimes feed
individual UTF-8 bytes to the isXXX() functions which is about as well
defined as trying to pay for a pint with the left half of a $10 bill.

As for ICU, it's "not localized" only if there is only one ICU library
in the universe, but of course different versions of ICU might give
different answers because they correspond to different versions of
Unicode (as do glibc versions, FreeBSD libc versions, etc) and also
might disagree with tables built by PostgreSQL.  Maybe irrelevant for
now, but I think with thus-far-imagined variants of the multi-version
ICU proposal, you have to choose whether to call u_isUAlphabetic() in
the library we're linked against, or via the dlsym() we look up in a
particular dlopen'd library.  So I guess we'd have to access it via
our pg_locale_t, so again it'd be "localized" by some definitions.

Thinking about how to apply that thinking to libc, ... this is going
to sound far fetched and handwavy but here goes:  we could even
imagine a multi-version system based on different base locale paths.
Instead of using the system-provided locales under /usr/share/locale
to look when we call newlocale(..., "en_NZ.UTF-8", ...), POSIX says
we're allowed to specify an absolute path eg newlocale(...,
"/foo/bar/unicode11/en_NZ.UTF-8", ...).  If it is possible to use
$DISTRO's localedef to compile $OLD_DISTRO's locale sources to get
historical behaviour, that might provide a way to get them without
assuming the binary format is stable (it definitely isn't, but the
source format is nailed down by POSIX).  One fly in the ointment is
that glibc failed to implement absolute path support, so you might
need to use versioned locale names instead, or see if the LOCPATH
environment variable can be swizzled around without confusing glibc's
locale cache.  Then it wouldn't be fundamentally different from the
hypothesised multi-version ICU case: you could probably come up with
different isalpha_l() results for different locales because you have
different LC_CTYPE versions (for example Unicode 15.0 added new
extended Cyrillic characters 1E030..1E08F, they look alphabetical to
me but what would I know).  That is an extremely hypothetical
pie-in-the-sky thought and I don't know if it'd really work very well,
but it is a concrete way that someone might finish up getting
different answers out of isalpha_l(), to observe that it really is
localised.  And localized.




Re: about help message for new pg_dump's --filter option

2023-11-29 Thread Kyotaro Horiguchi
At Thu, 30 Nov 2023 10:20:40 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Hello.
> 
> Recently, a new --filter option was added to pg_dump. I might be
> wrong, but the syntax of the help message for this feels off. Is the
> word 'on' not necessary after 'based'?
> 
> >  --filter=FILENAMEinclude or exclude objects and data from dump
> >   based expressions in FILENAME

Hmm. A similar message is spelled as "based on expression". Thus, the
attached patch corrects this message on this line.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 57c6836b88..eea320b731 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1119,7 +1119,7 @@ help(const char *progname)
 			 "   including child and partition tables\n"));
 	printf(_("  --extra-float-digits=NUM override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAMEinclude or exclude objects and data from dump\n"
-			 "   based expressions in FILENAME\n"));
+			 "   based on expressions in FILENAME\n"));
 	printf(_("  --if-exists  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "   include data of foreign tables on foreign\n"


Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Peter Smith
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra
 wrote:
>
>
>
> On 11/27/23 23:06, Peter Smith wrote:
> > FWIW, here are some more minor review comments for v20231127-3-0001
> >
> > ==
> > .../replication/logical/reorderbuffer.c
> >
> > 3.
> > + *   To decide if a sequence change is transactional, we maintain a hash
> > + *   table of relfilenodes created in each (sub)transactions, along with
> > + *   the XID of the (sub)transaction that created the relfilenode. The
> > + *   entries from substransactions are copied to the top-level transaction
> > + *   to make checks cheaper. The hash table gets cleaned up when the
> > + *   transaction completes (commit/abort).
> >
> > /substransactions/subtransactions/
> >
>
> Will fix.

FYI - I think this typo still exists in the patch v20231128-0001.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Extra periods in pg_dump messages

2023-11-29 Thread Kyotaro Horiguchi
Sorry for the sequential mails.

In the bleeding-edge version of pg_dump, when a condition specifying an
index, for example, is described in an object filter file, the
following message is output. However, there is a period at the end of
the line. Shouldn't this be removed?

$ pg_dump --filter=/tmp/hoge.filter
pg_dump: error: invalid format in filter read from "/tmp/hoge.filter" on line 
1: include filter for "index" is not allowed.
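For context, each line of a --filter file has the form `{ include | exclude } OBJECT_TYPE PATTERN`; the message above fires because `index` is not a valid object type for an include filter in pg_dump. A hypothetical file (object names made up):

```
include table mytable
exclude table_data huge_log_table
```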

The attached patch includes modifications related to the calls to
pg_log_filter_error().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 57c6836b88..ce50566c3a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -18812,7 +18812,7 @@ read_dump_filters(const char *filename, DumpOptions *dopt)
 case FILTER_OBJECT_TYPE_TABLE_DATA:
 case FILTER_OBJECT_TYPE_TABLE_DATA_AND_CHILDREN:
 case FILTER_OBJECT_TYPE_TRIGGER:
-	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
+	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed"),
 		"include",
 		filter_object_type_name(objtype));
 	exit_nicely(1);
@@ -18851,7 +18851,7 @@ read_dump_filters(const char *filename, DumpOptions *dopt)
 case FILTER_OBJECT_TYPE_TRIGGER:
 case FILTER_OBJECT_TYPE_EXTENSION:
 case FILTER_OBJECT_TYPE_FOREIGN_DATA:
-	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
+	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed"),
 		"exclude",
 		filter_object_type_name(objtype));
 	exit_nicely(1);
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 1b974cf7e8..92389353a4 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -1969,7 +1969,7 @@ read_dumpall_filters(const char *filename, SimpleStringList *pattern)
 	{
 		if (comtype == FILTER_COMMAND_TYPE_INCLUDE)
 		{
-			pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
+			pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed"),
 "include",
 filter_object_type_name(objtype));
 			exit_nicely(1);
@@ -1989,7 +1989,7 @@ read_dumpall_filters(const char *filename, SimpleStringList *pattern)
 			case FILTER_OBJECT_TYPE_SCHEMA:
 			case FILTER_OBJECT_TYPE_TABLE:
 			case FILTER_OBJECT_TYPE_TABLE_AND_CHILDREN:
-pg_log_filter_error(&fstate, _("unsupported filter object."));
+pg_log_filter_error(&fstate, _("unsupported filter object"));
 exit_nicely(1);
 break;
 
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 1459e02263..c3beacdec1 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -535,7 +535,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
 case FILTER_OBJECT_TYPE_DATABASE:
 case FILTER_OBJECT_TYPE_EXTENSION:
 case FILTER_OBJECT_TYPE_FOREIGN_DATA:
-	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
+	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed"),
 		"include",
 		filter_object_type_name(objtype));
 	exit_nicely(1);
@@ -581,7 +581,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
 case FILTER_OBJECT_TYPE_TABLE:
 case FILTER_OBJECT_TYPE_TABLE_AND_CHILDREN:
 case FILTER_OBJECT_TYPE_TRIGGER:
-	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
+	pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed"),
 		"exclude",
 		filter_object_type_name(objtype));
 	exit_nicely(1);


Bug in pgbench prepared statements

2023-11-29 Thread Lev Kokotov
Hi,

I noticed something that looks like a bug in pgbench when using the
prepared protocol. pgbench assumes that all prepared statements are
prepared correctly, even if they contain errors (e.g. syntax, column/table
doesn't exist, etc.).

My test script is just:

SELECT one;

The output looks something like this:

$ pgbench -f test.sql --protocol prepared -d postgres

[...]
pgbench: client 0 executing script "test.sql"
pgbench: client 0 preparing P_0
pgbench: error: ERROR:  column "one" does not exist
LINE 1: SELECT one;
   ^
pgbench: client 0 sending P_0
pgbench: client 0 receiving
pgbench: client 0 receiving
pgbench: error: client 0 script 0 aborted in command 0 query 0: ERROR:
 prepared statement "P_0" does not exist
transaction type: test.sql
[...]

Normally this wouldn't be a big deal, although by itself the output is
confusing, since the second error, while technically true, is not what
caused the test run to fail. In my case, I was using pgbench to validate
the correctness of prepared statements implementation in our pooler. Having
the second error sent me on quite a debugging session until I realized that
my fix was actually working.

Patch attached, if there is any interest in fixing this small bug.

Cheers!

Lev
postgresml.org


pgbench.patch
Description: Binary data


about help message for new pg_dump's --filter option

2023-11-29 Thread Kyotaro Horiguchi
Hello.

Recently, a new --filter option was added to pg_dump. I might be
wrong, but the syntax of the help message for this feels off. Is the
word 'on' not necessary after 'based'?

>  --filter=FILENAME        include or exclude objects and data from dump
>                           based expressions in FILENAME

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BackgroundPsql's set_query_timer_restart() may not work

2023-11-29 Thread Masahiko Sawada
On Wed, Nov 29, 2023 at 7:48 PM Bharath Rupireddy
 wrote:
>
> On Wed, Nov 29, 2023 at 1:49 PM Masahiko Sawada  wrote:
> >
> > On Wed, Nov 29, 2023 at 4:30 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Nov 28, 2023 at 12:23 PM Tom Lane  wrote:
> > > >
> > > > Bharath Rupireddy  writes:
> > > > > A nitpick on the patch - how about honoring the passed-in parameter
> > > > > with something like $self->{query_timer_restart} = 1 if !defined
> > > > > $self->{query_timer_restart}; instead of just setting it to 1 (a value
> > > > > other than undef) $self->{query_timer_restart} = 1;?
> > > >
> > > > I wondered about that too, but the evidence of existing callers is
> > > > that nobody cares.  If we did make the code do something like that,
> > > > (a) I don't think your fragment is right, and (b) we'd need to rewrite
> > > > the function's comment to explain it.  I'm not seeing a reason to
> > > > think it's worth spending effort on.
> >
> > Agreed.
> >
> > > Hm. I don't mind doing just the $self->{query_timer_restart} = 1; like
> > > in Sawada-san's patch.
> >
> > Okay, I've attached the patch that I'm going to push through v16,
> > barring any objections.
>
> How about the commit message summary 'Fix TAP function
> set_query_timer_restart() issue without argument.'? Also, it's good to
> specify the commit 664d7575 that introduced the TAP function in the
> commit message description.

Thanks! I've incorporated your comment and pushed the patch.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [Proposal] global sequence implemented by snowflake ID

2023-11-29 Thread Michael Paquier
On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> It is interesting to see you want to work towards globally distributed
> sequences. I think it would be important to discuss how and what we
> want to achieve with sequences w.r.t logical replication and or
> active-active configuration. There is a patch [1] for logical
> replication of sequences which will primarily achieve the failover
> case, i.e. if the publisher goes down and the subscriber takes over
> the role, one can re-direct connections to it. Now, if we have global
> sequences, one can imagine that even after failover the clients can
> still get unique values of sequences. It will be a bit more flexible
> to use global sequences, for example, we can use the sequence on both
> nodes at the same time which won't be possible with the replication of
> sequences as they will become inconsistent. Now, it is also possible
> that both serve different use cases and we need both functionalities
> but it would be better to have some discussion on the same.
> 
> Thoughts?
> 
> [1] - https://commitfest.postgresql.org/45/3823/

Thanks for pointing this out.  I've read through the patch proposed by
Tomas and both are independent things IMO.  The logical decoding patch
relies on the SEQ_LOG records to find out which last_value/is_called
to transfer, which is something directly depending on the in-core
sequence implementation.  Sequence AMs are concepts that cover much
more ground, leaving it up to the implementor to do what they want
while hiding the activity with a RELKIND_SEQUENCE (generated columns
included).

To put it short, I have the impression that one and the other don't
really conflict, but just cover different ground.  However, I agree
that depending on the sequence AM implementation used in a cluster
(snowflake IDs guarantee unicity with their machine ID), replication
may not be necessary because the sequence implementation may be able
to ensure that no replication is required from the start.
--
Michael




Re: pg_upgrade and logical replication

2023-11-29 Thread Peter Smith
On Thu, Nov 30, 2023 at 12:06 PM Peter Smith  wrote:
>
> Here are some review comments for patch v20-0001
>
> 3.
> +# The subscription's running status should be preserved
> +my $result =
> +  $new_sub->safe_psql('postgres',
> + "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub'");
> +is($result, qq(f),
> + "check that the subscriber that was disable on the old subscriber
> should be disabled in the new subscriber"
> +);
> +$result =
> +  $new_sub->safe_psql('postgres',
> + "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub1'");
> +is($result, qq(t),
> + "check that the subscriber that was enabled on the old subscriber
> should be enabled in the new subscriber"
> +);
> +$new_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
> +
>
> BEFORE
> check that the subscriber that was disable on the old subscriber
> should be disabled in the new subscriber
>
> SUGGESTION
> check that a subscriber that was disabled on the old subscriber is
> disabled on the new subscriber
>
> ~
>
> BEFORE
> check that the subscriber that was enabled on the old subscriber
> should be enabled in the new subscriber
>
> SUGGESTION
> check that a subscriber that was enabled on the old subscriber is
> enabled on the new subscriber
>

Oops. I think that should have been "subscription", not "subscriber". i.e.

SUGGESTION
check that a subscription that was disabled on the old subscriber is
disabled on the new subscriber

and

SUGGESTION
check that a subscription that was enabled on the old subscriber is
enabled on the new subscriber

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pg_upgrade and logical replication

2023-11-29 Thread Peter Smith
Here are some review comments for patch v20-0001

==

1. getSubscriptions

+ if (dopt->binary_upgrade && fout->remoteVersion >= 17)
+ appendPQExpBufferStr(query, " s.subenabled\n");
+ else
+ appendPQExpBufferStr(query, " false AS subenabled\n");

Probably I misunderstood this logic... AFAIK the CREATE SUBSCRIPTION
is normally default *enabled*, so why does this code set default
differently as 'false'. OTOH, if this is some special case default
needed because the subscription upgrade is not supported before PG17
then maybe it needs a comment to explain.

~~~

2. dumpSubscription

+ if (strcmp(subinfo->subenabled, "t") == 0)
+ {
+ appendPQExpBufferStr(query,
+ "\n-- For binary upgrade, must preserve the subscriber's running state.\n");
+ appendPQExpBuffer(query, "ALTER SUBSCRIPTION %s ENABLE;\n", qsubname);
+ }

(this is a bit similar to previous comment)

Probably I misunderstood this logic... but AFAIK the CREATE
SUBSCRIPTION is normally default *enabled*. In the CREATE SUBSCRIPTION
top of this function I did not see any "enabled=xxx" code, so won't
this just default to enabled=true per normal. In other words, what
happens if the subscription being upgraded was already DISABLED -- How
does it remain disabled still after upgrade?

But I saw there is a test case for this so perhaps the code is fine?
Maybe it just needs more explanatory comments for this area?

==
src/bin/pg_upgrade/t/004_subscription.pl

3.
+# The subscription's running status should be preserved
+my $result =
+  $new_sub->safe_psql('postgres',
+ "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub'");
+is($result, qq(f),
+ "check that the subscriber that was disable on the old subscriber
should be disabled in the new subscriber"
+);
+$result =
+  $new_sub->safe_psql('postgres',
+ "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub1'");
+is($result, qq(t),
+ "check that the subscriber that was enabled on the old subscriber
should be enabled in the new subscriber"
+);
+$new_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+

BEFORE
check that the subscriber that was disable on the old subscriber
should be disabled in the new subscriber

SUGGESTION
check that a subscriber that was disabled on the old subscriber is
disabled on the new subscriber

~

BEFORE
check that the subscriber that was enabled on the old subscriber
should be enabled in the new subscriber

SUGGESTION
check that a subscriber that was enabled on the old subscriber is
enabled on the new subscriber

~~~

4.
+is($result, qq($remote_lsn), "remote_lsn should have been preserved");
+
+
+# Check the number of rows for each table on each server


Double blank lines.

~~~

5.
+$old_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub1 DISABLE");
+$old_sub->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub1 SET (slot_name = none)");
+$old_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+

Probably it would be tidier to combine all of those.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: proposal: change behavior on collation version mismatch

2023-11-29 Thread Jeremy Schneider
On 11/28/23 2:12 AM, Daniel Verite wrote:
> Jeremy Schneider wrote:
>> 1) "collation changes are uncommon" (which is relatively correct)
>> 2) "most users would rather have ease-of-use than 100% safety, since
>> it's uncommon"
>>
>> And I think this led to the current behavior of issuing a warning rather
>> than an error,
> There's a technical reason for this being a warning.
> If it was an error, any attempt to do anything with the collation
> would fail, which includes REINDEX on indexes  using
> that collation.
> And yet that's precisely what you're supposed to do in that
> situation.


Indexes are the most obvious and impactful corruption, so the focus is
understandable, but there's a bit of a myth in the general public that
REINDEX means you fixed your database.  I'm concerned that too many
people believe this falsehood, and don't realize that things like
constraints and partitions can also be affected by a major OS update
when leaving PG data files in place.  Also there's a tendency to use
amcheck to validate btree indexes, but skip other index types.  And of
course none of this is possible when people mistakenly use a different
major OS for the hot standby (but Postgres willingly sends incorrect
query results to users).

This is why my original proposal included an update to the ALTER ...
REFRESH/COLLATION docs.  Today's conventional wisdom suggests this is a
safe command.  It's really not, if you're using unicode (which everyone
is). Fifteen years ago, you needed to buy a french keyboard to type
french accented characters.  Today it's a quick tap on your phone to get
chinese, russian, tibetan, emojis, and any other character you can dream
of.  All of those surprising characters eventually get stored in Postgres
databases, often to the surprise of devs and admins, after they discover
corruption from an OS upgrade.

And to recap some data about historical ICU versions from the torture test:

ICU Version     | OS Version         | characters whose collation changed
                |                    |   en-US | zh-Hans-CN |   fr-FR
----------------+--------------------+---------+------------+--------
55.1-7ubuntu0.5 | Ubuntu 16.04.7 LTS | 286,654 |    286,654 | 286,654
60.2-3ubuntu3.1 | Ubuntu 18.04.6 LTS |  23,741 |     24,415 |  23,741
63.1-6          | Ubuntu 19.04       |     688 |        688 |     688
66.1-2ubuntu2   | Ubuntu 20.04.3 LTS |   6,497 |      6,531 |   6,497
70.1-2          | Ubuntu 22.04 LTS   |     879 |        887 |     879

The very clear trend here is that most changes are made in the root
collation rules, affecting all locales.  This means that worrying about
specific collation versions of different locales is really focusing on
an irrelevant edge case.  In ICU development, all the locales tend to
change.

If anyone thinks the Collation Apocalypse is bad now, I predict the
Kubernetes wave will be mayhem.  Fifteen years ago it was rare to
physically move PG datafiles to a new major OS.  Most people would dump
and load their databases, sized in GBs.  Today's multi-TB Postgres
databases have meant an increase of in-place OS upgrades in recent
years.  People started to either detach/attach their storage, or they
used a hot standby. Kubernetes will make these moves across major OS's a
daily, effortless occurrence.

ICU doesn't fix anything directly.  We do need ICU - only because it
finally enables us to compile that old version of ICU forever on every
new OS we move to going forward. This was simply impossible with glibc.
Over the past couple decades, not even Oracle or IBM has managed to
deprecate a single version of ICU from a relational database, and not
for lack of desire.

-Jeremy

-- 
Jeremy Schneider
Performance Engineer
Amazon Web Services


Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Tomas Vondra
On 11/29/23 23:59, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
>  wrote:
>>
>> On 11/29/23 21:30, Matthias van de Meent wrote:
>>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>>  wrote:
 I did try to measure how much it actually saves, but none of the tests I
 did actually found measurable improvement. So I'm tempted to just not
 include this part, and accept that we may deserialize some of the tuples
 unnecessarily.

 Did you actually observe measurable improvements in some cases?
>>>
>>> The improvements would mostly stem from brin indexes with multiple
>>> (potentially compressed) by-ref types, as they go through more complex
>>> and expensive code to deserialize, requiring separate palloc() and
>>> memcpy() calls each.
>>> For single-column and by-value types the improvements are expected to
>>> be negligible, because there is no meaningful difference between
>>> copying a single by-ref value and copying its container; the
>>> additional work done for each tuple is marginal for those.
>>>
>>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>>> (sha256((id+1)::text::bytea)::text),
>>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>>> measured a difference of 10x less time spent in the main loop of
>>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>>> ranges. It's not a lot, but worth at least something, I guess?
>>>
>>
>> It is something, but I can't really convince myself it's worth the extra
>> code complexity. It's a somewhat extreme example, and the parallelism
>> certainly saves much more than this.
> 
> True. For this, I usually keep in mind that the docs on multi-column
> indexes still indicate to use 1 N-column brin index over N 1-column
> brin indexes (assuming the same storage parameters), so multi-column
> BRIN indexes should not be considered to be uncommon:
> 
> "The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter."
> 
> Note that most of the time in my example index is spent in creating
> the actual tuples due to the use of hashing for data generation; for
> index or plain to-text formatting the improvement is much more
> pronounced: If I use an 8-column index (id::text, id, ...), index
> creation takes ~500ms with 4+ workers. Of this, deforming takes some
> 20ms, though when skipping the deforming step (i.e.,with my patch) it
> takes ~3.5ms. That's a 3% shaved off the build time when the index
> shape is beneficial.
> 

That's all true, and while 3.5% is not something to ignore, my POV is
that the parallelism speeds this up from ~2000ms to ~500ms. Yes, it
would be great to shave off the extra 1% (relative to the original
duration). But I don't have a great idea how to do code that in a way
that is readable, and I don't want to stall the patch indefinitely
because of a comparatively small improvement.

Therefore I propose we get the simpler code committed and leave this as
a future improvement.

>>> The attached patch fixes the issue that you called out.
>>> It also further updates _brin_end_parallel: the final 'write empty
>>> tuples' loop is never hit and is thus removed, because if there were
>>> any tuples in the spool we'd have filled the empty ranges at the end
>>> of the main loop, and if there were no tuples in the spool then the
>>> memtuple would still be at its original initialized value of 0 thus
>>> resulting in a constant false condition. I also updated some comments.
>>>
>>
>> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
>> insert the empty ranges in the main loop, because we're already looking
>> at the *next* summary.
> 
> Yes, merging adds some significant complexity here. I don't think we
> can easily get around that though...
> 
>> But I think the idea was to insert empty ranges if there's a chunk of
>> empty ranges at the end of the table, after the last tuple the index
>> build reads. But I'm not sure that can actually happen ...
> 
> This would be trivial to construct with partial indexes; e.g. WHERE
> (my_pk IS NULL) would consist of exclusively empty ranges.
> I don't see a lot of value in partial BRIN indexes, but I may be
> overlooking something.
> 

Oh, I haven't even thought about partial BRIN indexes! I'm sure for
those it's even more important to actually fill-in the empty ranges,
otherwise we end up scanning the whole supposedly filtered-out part of
the table. I'll do some testing with that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Thomas Munro
On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> On 29/11/2023 21:39, Thomas Munro wrote:
> > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > callers believe that short writes set errno: they error out with a
> > message including %m.  We have historically set errno = ENOSPC inside
> > FileWrite() if the write size was unexpectedly small AND the kernel
> > didn't set errno to a non-zero value (having set it to zero ourselves
> > earlier).  In FileWriteV(), I didn't want to do that because it is
> > expensive to compute the total write size from the vector array and we
> > managed to measure an effect due to that in some workloads.
> >
> > Note that the smgr patch actually handles short writes by continuing,
> > instead of raising an error.  Short writes do already occur in the
> > wild on various systems for various rare technical reasons other than
> > ENOSPC I have heard (imagine transient failure to acquire some
> > temporary memory that the kernel chooses not to wait for, stuff like
> > that, though certainly many people and programs believe they should
> > not happen[1]), and it seems like a good idea to actually handle them
> > as our write sizes increase and the probability of short writes might
> > presumably increase.
>
> Maybe we should bite the bullet and always retry short writes in
> FileWriteV(). Is that what you meant by "handling them"?
> If the total size is expensive to calculate, how about passing it as an
> extra argument? Presumably it is cheap for the callers to calculate at
> the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there.  If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC".  What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

> > With the previous version of the patch, we'd have to change a couple
> > of other callers not to believe that short writes are errors and set
> > errno (callers are inconsistent on this point).  I don't really love
> > that we have "fake" system errors but I also want to stay focused
> > here, so in this new version V3 I tried a new approach: I realised I
> > can just always set errno without needing the total size, so that
> > (undocumented) aspect of the interface doesn't change.  The point
> > being that it doesn't matter if you clobber errno with a bogus value
> > when the write was non-short.  Thoughts?
>
> Feels pretty ugly, but I don't see anything outright wrong with that.

Cool.  I would consider cleaning up all the callers and get rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence desire to preserve existing behaviour.




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra
On 11/29/23 15:41, Tomas Vondra wrote:
> ...
>>
>> One thing that worries me about that approach is that it can suck with
>> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
>> records. We have previously fixed some such workloads in logical
>> decoding where decoding a transaction containing truncation of a table
>> with a lot of partitions (1000 or more) used to take a very long time.
>> Don't we face performance issues in such scenarios?
>>
> 
> I don't think we do, really. We will have to decode the SMGR records and
> add the relfilenodes to the hash table(s), but I think that affects the
> lookup performance too much. What I think might be a problem is if we
> have many top-level transactions, especially if those transactions do
> something that creates a relfilenode. Because then we'll have to do a
> hash_search for each of them, and that might be measurable even if each
> lookup is O(1). And we do the lookup for every sequence change ...
> 

I did some micro-benchmarking today, trying to identify cases where this
would cause unexpected problems, either due to having to maintain all
the relfilenodes, or due to having to do hash lookups for every sequence
change. But I think it's fine, mostly ...

I did all the following tests with 64 clients. I may try more, but even
with this there should be fair number of concurrent transactions, which
determines the number of top-level transactions in reorderbuffer. I'll
try with more clients tomorrow, but I don't think it'll change stuff.

The test is fairly simple - run a particular number of transactions
(might be 1000 * 64, or more). And then measure how long it takes to
decode the changes using test_decoding.

Now, the various workloads I tried:

1) "good case" - small OLTP transactions, a couple nextval('s') calls

  begin;
  insert into t values (1);
  select nextval('s');
  insert into t values (1);
  commit;

This is pretty fine, the sequence part of reorderbuffer is really not
measurable, it's like 1% of the total CPU time. Which is expected,
because we only wal-log every 32-nd increment or so.

2) "good case" - same as (1), but with enough nextval calls to always WAL-log

  begin;
  insert into t values (1);
  select nextval('s') from generate_series(1,40);
  insert into t values (1);
  commit;

Here sequences are more measurable, it's like 15% of CPU time, but most
of that comes to AbortCurrentTransaction() in the non-transactional
branch of ReorderBufferQueueSequence. I don't think there's a way around
that, and it's entirely unrelated to relfilenodes. The function checking
if the change is transactional (ReorderBufferSequenceIsTransactional) is
less than 1% of the profile - and this is the version that always walks
all top-level transactions.

3) "bad case" - small transactions that generate a lot of relfilenodes

  select alter_sequence();

where the function is defined like this (I did create 1000 sequences
before the test):

  CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
  DECLARE
  v INT;
  BEGIN
  v := 1 + (random() * 999)::int;
  execute format('alter sequence s%s restart with 1000', v);
  perform nextval('s');
  END;
  $$ LANGUAGE plpgsql;

This performs terribly, but it's entirely unrelated to sequences.
Current master has exactly the same problem, if transactions do DDL.
Like this, for example:

  CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
  DECLARE
  v INT;
  BEGIN
  v := 1 + (random() * 999)::int;
  execute format('create table t%s (a int)', v);
  execute format('drop table t%s', v);
  insert into t values (1);
  END;
  $$ LANGUAGE plpgsql;

This has the same impact on master. The perf report shows this:

  --98.06%--pg_logical_slot_get_changes_guts
   |
--97.88%--LogicalDecodingProcessRecord
 |
 --97.56%--xact_decode
  |
   --97.51%--DecodeCommit
|
|--91.92%--SnapBuildCommitTxn
| |
|  --91.65%--SnapBuildBuildSnapshot
|   |
|   --91.14%--pg_qsort

The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot
takes so long is because:

-
  Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
  at snapbuild.c:498
  498		+ sizeof(TransactionId) * builder->committed.xcnt
  (gdb) p builder->committed.xcnt
  $4 = 11532
-

And with each iteration it grows by 1. That looks quite weird, possibly
a bug worth fixing, but unrelated to this patch. I can't investigate
this more at the moment, not sure when/if I'll get to that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: encoding affects ICU regex character classification

2023-11-29 Thread Tom Lane
Jeff Davis  writes:
> The problem seems to be confusion between pg_wchar and a unicode code
> point in pg_wc_isalpha() and related functions.

Yeah, that's an ancient sore spot: we don't really know what the
representation of wchar is.  We assume it's Unicode code points
for UTF8 locales, but libc isn't required to do that AFAIK.  See
comment block starting about line 20 in regc_pg_locale.c.

I doubt that ICU has much to do with this directly.

We'd have to find an alternate source of knowledge to replace the
 functions if we wanted to fix it fully ... can ICU do that?

regards, tom lane




encoding affects ICU regex character classification

2023-11-29 Thread Jeff Davis
The following query:

SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu";

returns true if the server encoding is UTF8, and false if the server
encoding is LATIN9. That's a bug -- any behavior involving ICU should
be encoding-independent.

The problem seems to be confusion between pg_wchar and a unicode code
point in pg_wc_isalpha() and related functions.

It might be good to introduce some infrastructure here that can convert
a pg_wchar into a Unicode code point, or decode a string of bytes into
a string of 32-bit code points. Right now, that's possible, but it
involves pg_wchar2mb() followed by encoding conversion to UTF8,
followed by decoding the UTF8 to a code point. (Is there an easier path
that I missed?)

One wrinkle is MULE_INTERNAL, which doesn't have any conversion path to
UTF8. That's not important for ICU (because ICU is not allowed for that
encoding), but I'd like it if we could make this infrastructure
independent of ICU, because I have some follow-up proposals to simplify
character classification here and in ts_locale.c.

Thoughts?

Regards,
Jeff Davis





Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Heikki Linnakangas

On 29/11/2023 21:39, Thomas Munro wrote:

> One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> callers believe that short writes set errno: they error out with a
> message including %m.  We have historically set errno = ENOSPC inside
> FileWrite() if the write size was unexpectedly small AND the kernel
> didn't set errno to a non-zero value (having set it to zero ourselves
> earlier).  In FileWriteV(), I didn't want to do that because it is
> expensive to compute the total write size from the vector array and we
> managed to measure an effect due to that in some workloads.
>
> Note that the smgr patch actually handles short writes by continuing,
> instead of raising an error.  Short writes do already occur in the
> wild on various systems for various rare technical reasons other than
> ENOSPC I have heard (imagine transient failure to acquire some
> temporary memory that the kernel chooses not to wait for, stuff like
> that, though certainly many people and programs believe they should
> not happen[1]), and it seems like a good idea to actually handle them
> as our write sizes increase and the probability of short writes might
> presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

> With the previous version of the patch, we'd have to change a couple
> of other callers not to believe that short writes are errors and set
> errno (callers are inconsistent on this point).  I don't really love
> that we have "fake" system errors but I also want to stay focused
> here, so in this new version V3 I tried a new approach: I realised I
> can just always set errno without needing the total size, so that
> (undocumented) aspect of the interface doesn't change.  The point
> being that it doesn't matter if you clobber errno with a bogus value
> when the write was non-short.  Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
 wrote:
>
> On 11/29/23 21:30, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>  wrote:
>>> I did try to measure how much it actually saves, but none of the tests I
>>> did actually found measurable improvement. So I'm tempted to just not
>>> include this part, and accept that we may deserialize some of the tuples
>>> unnecessarily.
>>>
>>> Did you actually observe measurable improvements in some cases?
>>
>> The improvements would mostly stem from brin indexes with multiple
>> (potentially compressed) by-ref types, as they go through more complex
>> and expensive code to deserialize, requiring separate palloc() and
>> memcpy() calls each.
>> For single-column and by-value types the improvements are expected to
>> be negligible, because there is no meaningful difference between
>> copying a single by-ref value and copying its container; the
>> additional work done for each tuple is marginal for those.
>>
>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>> (sha256((id+1)::text::bytea)::text),
>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>> measured a difference of 10x less time spent in the main loop of
>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>> ranges. It's not a lot, but worth at least something, I guess?
>>
>
> It is something, but I can't really convince myself it's worth the extra
> code complexity. It's a somewhat extreme example, and the parallelism
> certainly saves much more than this.

True. For this, I usually keep in mind that the docs on multi-column
indexes still recommend one N-column BRIN index over N one-column
BRIN indexes (assuming the same storage parameters), so multi-column
BRIN indexes should not be considered uncommon:

"The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter."

Note that most of the time in my example index is spent in creating
the actual tuples due to the use of hashing for data generation; for
index or plain to-text formatting the improvement is much more
pronounced: if I use an 8-column index (id::text, id, ...), index
creation takes ~500ms with 4+ workers. Of this, deforming takes some
20ms, though when skipping the deforming step (i.e., with my patch) it
takes ~3.5ms. That's 3% shaved off the build time when the index
shape is beneficial.

> > The attached patch fixes the issue that you called out.
> > It also further updates _brin_end_parallel: the final 'write empty
> > tuples' loop is never hit and is thus removed, because if there were
> > any tuples in the spool we'd have filled the empty ranges at the end
> > of the main loop, and if there were no tuples in the spool then the
> > memtuple would still be at its original initialized value of 0 thus
> > resulting in a constant false condition. I also updated some comments.
> >
>
> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
> insert the empty ranges in the main loop, because we're already looking
> at the *next* summary.

Yes, merging adds some significant complexity here. I don't think we
can easily get around that though...

> But I think the idea was to insert empty ranges if there's a chunk of
> empty ranges at the end of the table, after the last tuple the index
> build reads. But I'm not sure that can actually happen ...

This would be trivial to construct with partial indexes; e.g. WHERE
(my_pk IS NULL) would consist of exclusively empty ranges.
I don't see a lot of value in partial BRIN indexes, but I may be
overlooking something.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-29 Thread Nathan Bossart
On Wed, Nov 29, 2023 at 04:02:11PM -0500, Robert Haas wrote:
> I'd fully support having good documentation that says "hey, here are
> the low security authentication configurations, here are the
> medium-security ones, here are the high security ones, and here's why
> these ones are better than those ones and what they protect against
> and what risks remain." That would be awesome.

+1.  IMO the "Password Authentication" section [0] does this pretty well
already.

[0] https://www.postgresql.org/docs/devel/auth-password.html

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




optimize atomic exchanges

2023-11-29 Thread Nathan Bossart
On Fri, Nov 10, 2023 at 08:55:29PM -0600, Nathan Bossart wrote:
> On Fri, Nov 10, 2023 at 06:48:39PM -0800, Andres Freund wrote:
>> Yes. We should optimize pg_atomic_exchange_u32() one of these days - it can 
>> be
>> done *far* faster than a cmpxchg. When I was adding the atomic abstraction
>> there was concern with utilizing too many different atomic instructions. I
>> didn't really agree back then, but these days I really don't see a reason to
>> not use a few more intrinsics.
> 
> I might give this a try, if for no other reason than it'd force me to
> improve my mental model of this stuff.  :)

Here's a first draft.  I haven't attempted to add implementations for
PowerPC, and I only added the __atomic version for gcc since
__sync_lock_test_and_set() only supports setting the value to 1 on some
platforms.  Otherwise, I tried to add specialized atomic exchange
implementations wherever there existed other specialized atomic
implementations.

I haven't done any sort of performance testing on this yet.  Some
preliminary web searches suggest that there is unlikely to be much
difference between cmpxchg and xchg, but presumably there's some difference
between xchg and doing cmpxchg in a while loop (as is done in
atomics/generic.h today).  I'll report back once I've had a chance to do
some testing...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 44374af7026642f2cffa28dccea71f2d370a3ce1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 29 Nov 2023 15:00:58 -0600
Subject: [PATCH v1 1/1] optimized atomic exchange

---
 config/c-compiler.m4  | 34 ++-
 configure | 72 +--
 configure.ac  |  2 +
 meson.build   | 12 
 src/include/pg_config.h.in|  6 ++
 src/include/port/atomics/arch-x86.h   | 28 +
 src/include/port/atomics/generic-gcc.h| 28 +
 src/include/port/atomics/generic-msvc.h   | 18 ++
 src/include/port/atomics/generic-sunpro.h | 14 +
 src/tools/msvc/Solution.pm|  2 +
 10 files changed, 210 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..440c90d411 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -571,7 +571,7 @@ fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
 # Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
 # types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
 AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
-[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic compare/exchange, pgac_cv_gcc_atomic_int32_cas,
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([],
   [int val = 0;
int expect = 0;
@@ -587,7 +587,7 @@ fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
 # Check if the C compiler understands __atomic_compare_exchange_n() for 64bit
 # types, and define HAVE_GCC__ATOMIC_INT64_CAS if so.
 AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT64_CAS],
-[AC_CACHE_CHECK(for builtin __atomic int64 atomic operations, pgac_cv_gcc_atomic_int64_cas,
+[AC_CACHE_CHECK(for builtin __atomic int64 atomic compare/exchange, pgac_cv_gcc_atomic_int64_cas,
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([],
   [PG_INT64_TYPE val = 0;
PG_INT64_TYPE expect = 0;
@@ -598,6 +598,36 @@ if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
   AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *, int64).])
 fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 
+# PGAC_HAVE_GCC__ATOMIC_INT32_XCHG
+# 
+# Check if the C compiler understands __atomic_exchange_n() for 32bit types,
+# and define HAVE_GCC__ATOMIC_INT32_XCHG if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_XCHG],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic exchange, pgac_cv_gcc_atomic_int32_xchg,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+  [int val = 0;
+   __atomic_exchange_n(, 37, __ATOMIC_SEQ_CST);])],
+  [pgac_cv_gcc_atomic_int32_xchg="yes"],
+  [pgac_cv_gcc_atomic_int32_xchg="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_xchg" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_XCHG, 1, [Define to 1 if you have __atomic_exchange_n(int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_XCHG
+
+# PGAC_HAVE_GCC__ATOMIC_INT64_XCHG
+# 
+# Check if the C compiler understands __atomic_exchange_n() for 64bit types,
+# and define HAVE_GCC__ATOMIC_INT64_XCHG if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT64_XCHG],
+[AC_CACHE_CHECK(for builtin __atomic int64 atomic exchange, pgac_cv_gcc_atomic_int64_xchg,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+  [PG_INT64_TYPE val = 0;
+   __atomic_exchange_n(, (PG_INT64_TYPE) 37, __ATOMIC_SEQ_CST);])],
+  [pgac_cv_gcc_atomic_int64_xchg="yes"],
+  [pgac_cv_gcc_atomic_int64_xchg="no"])])
+if test x"$pgac_cv_gcc_atomic_int64_xchg" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT64_XCHG, 1, 

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-29 Thread Robert Haas
On Wed, Nov 29, 2023 at 2:12 AM Stephen Frost  wrote:
> I had been hoping you might shed some light on just what use cases you
> were referring to so that we could have a constructive discussion about
> if ldap is actually a reasonable solution.  I even explicitly pointed
> out that there may still exist some environments, such as those with
> OpenLDAP only and which don't have Kerberos that could be a valid reason
> to keep ldap, but I haven't heard of any such in quite a long time.

I mean, I just have a hard time seeing this as a serious discussion. I
can't believe a proposal to remove LDAP authentication would reach
consensus for any reason other than because people didn't notice the
discussion or didn't want to argue about it, and I'm sure we'd have
unhappy users later. I don't want to spend a huge amount of time
arguing over something that I think is fairly obviously a non-viable
idea. If I'm wrong and a whole bunch of people show up and say they
like it, well, then I'm wrong. But either way, literally nobody wants
to read 100 emails where you and I go around in circles. And I don't
want to write them, either.

That said, I think that the case you mention, where someone has only
LDAP and not AD, is one example of where you would want LDAP
authentication. And really, that ought to end discussion of the topic,
because surely that is a valid use case, and we don't typically remove
features that someone might find useful because one hacker suggests
it. But in my experience, that's actually missing the point
completely. The most common complaint that I see from users about
authentication goes like this:

"I have log_statement=all enabled and so every time a user changes
their password the new password ends up in plaintext in the log file
and that means PostgreSQL is insecure."

I want to make it clear that, number one, this is not a joke, and,
number two, this is not the only concern that regularly reaches me
which reflects this degree of failing to understand how to set up a
secure environment. It's just an example of the level that many people
are apparently at. You're talking about removing authentication
methods because they have subtle, conditional security hazards, but a
lot of people are still running systems with flagrant, hideous
security hazards. Making it harder for such a user to store passwords
outside the database, or making them feel like that's somehow bad or
insecure when what they're doing instead is WAY worse, is not the
right thing at all.

I'd fully support having good documentation that says "hey, here are
the low security authentication configurations, here are the
medium-security ones, here are the high security ones, and here's why
these ones are better than those ones and what they protect against
and what risks remain." That would be awesome. But I'm not in favor of
removing options. We don't know where any particular user is coming
from today, and as in the example above, it might be something REALLY
bad. Moreover, users are entitled to make prudential judgments about
which things they care about in practice. They're allowed to say "hey,
yeah, so LDAP has this weakness if the server gets suborned, but I
don't care about that because of X reason," just as they're allowed to
accept the risks of any other authentication method, including trust.
Some of the reasons people might have for accepting such risks are
doubtless extremely strong, like "it's a physically disconnected
network" or "this is a burner setup that I'm tearing down tomorrow and
will contain only non-critical data anyway," and others are doubtless
poor reasons where users take inappropriate risk. But removing
everything from PostgreSQL that can be used to hose yourself is not a
viable strategy, and it's also not right for us to try to substitute
our judgement about what is appropriate in a particular user's
environment for that user's own judgement, especially if they're
right, but even if they're wrong. We're here to provide software, not
compel obedience.

> I quoted the part where you agreed that md5 has a known vulnerability to
> point out that, given that, we should be making our users aware of that
> through the normal process that we use for that.  I wasn't claiming that
> you agreed that we should remove md5, unless you're referring to some
> other part of my response which you didn't quote, in which case it'd be
> helpful if you could clarify which you were referring to so that I can
> try to clarify.

You made it sound like I supported what you wanted to do when you knew
I didn't. That's a poor foundation for any kind of reasonable
discussion of alternatives.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Tomas Vondra
On 11/29/23 21:30, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>  wrote:
>>
>> On 11/29/23 15:52, Tomas Vondra wrote:
>>>> ...
>>>>
>>>> This also made me think a bit more about how we're working with the
>>>> tuples. With your latest patch, we always deserialize and re-serialize
>>>> the sorted brin tuples, just in case the next tuple will also be a
>>>> BRIN tuple of the same page range. Could we save some of that
>>>> deserialization time by optimistically expecting that we're not going
>>>> to need to merge the tuple and only store a local copy of it locally?
>>>> See attached 0002; this saves some cycles in common cases.
>>>>
>>>
>>> Good idea!
>>>
>>
>> FWIW there's a bug, in this part of the optimization:
>>
>> --
>> +if (memtuple == NULL)
>> +memtuple = brin_deform_tuple(state->bs_bdesc, btup,
>> + memtup_holder);
>> +
>>  union_tuples(state->bs_bdesc, memtuple, btup);
>>  continue;
>> --
>>
>> The deforming should use prevbtup, otherwise union_tuples() just combines
>> two copies of the same tuple.
> 
> Good point. There were some more issues as well, fixes are attached.
> 
>> Which however brings me to the bigger issue with this - my stress test
>> found this issue pretty quickly, but then I spent quite a bit of time
>> trying to find what went wrong. I find this reworked code pretty hard to
>> understand, and not necessarily because of how it's written. The problem
>> is that the same loop tries to juggle multiple pieces of information with
>> different lifespans, and so on. I find it really hard to reason about
>> how it behaves ...
> 
> Yeah, it'd be nice if we had a peek option for sortsupport, that'd
> improve context handling.
> 
>> I did try to measure how much it actually saves, but none of the tests I
>> did actually found measurable improvement. So I'm tempted to just not
>> include this part, and accept that we may deserialize some of the tuples
>> unnecessarily.
>>
>> Did you actually observe measurable improvements in some cases?
> 
> The improvements would mostly stem from brin indexes with multiple
> (potentially compressed) by-ref types, as they go through more complex
> and expensive code to deserialize, requiring separate palloc() and
> memcpy() calls each.
> For single-column and by-value types the improvements are expected to
> be negligible, because there is no meaningful difference between
> copying a single by-ref value and copying its container; the
> additional work done for each tuple is marginal for those.
> 
> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
> (sha256((id+1)::text::bytea)::text),
> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
> measured a difference of 10x less time spent in the main loop of
> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
> ranges. It's not a lot, but worth at least something, I guess?
> 

It is something, but I can't really convince myself it's worth the extra
code complexity. It's a somewhat extreme example, and the parallelism
certainly saves much more than this.

> The attached patch fixes the issue that you called out.
> It also further updates _brin_end_parallel: the final 'write empty
> tuples' loop is never hit and is thus removed, because if there were
> any tuples in the spool we'd have filled the empty ranges at the end
> of the main loop, and if there were no tuples in the spool then the
> memtuple would still be at its original initialized value of 0 thus
> resulting in a constant false condition. I also updated some comments.
> 

Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
insert the empty ranges in the main loop, because we're already looking
at the *next* summary.

But I think the idea was to insert empty ranges if there's a chunk of
empty ranges at the end of the table, after the last tuple the index
build reads. But I'm not sure that can actually happen ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
 wrote:
>
> On 11/29/23 15:52, Tomas Vondra wrote:
> >> ...
> >>
> >> This also made me think a bit more about how we're working with the
> >> tuples. With your latest patch, we always deserialize and re-serialize
> >> the sorted brin tuples, just in case the next tuple will also be a
> >> BRIN tuple of the same page range. Could we save some of that
> >> deserialization time by optimistically expecting that we're not going
> >> to need to merge the tuple and only store a local copy of it locally?
> >> See attached 0002; this saves some cycles in common cases.
> >>
> >
> > Good idea!
> >
>
> FWIW there's a bug, in this part of the optimization:
>
> --
> +if (memtuple == NULL)
> +memtuple = brin_deform_tuple(state->bs_bdesc, btup,
> + memtup_holder);
> +
>  union_tuples(state->bs_bdesc, memtuple, btup);
>  continue;
> --
>
> The deforming should use prevbtup, otherwise union_tuples() just combines
> two copies of the same tuple.

Good point. There were some more issues as well, fixes are attached.

> Which however brings me to the bigger issue with this - my stress test
> found this issue pretty quickly, but then I spent quite a bit of time
> trying to find what went wrong. I find this reworked code pretty hard to
> understand, and not necessarily because of how it's written. The problem
> is that the same loop tries to juggle multiple pieces of information with
> different lifespans, and so on. I find it really hard to reason about
> how it behaves ...

Yeah, it'd be nice if we had a peek option for sortsupport, that'd
improve context handling.

> I did try to measure how much it actually saves, but none of the tests I
> did actually found measurable improvement. So I'm tempted to just not
> include this part, and accept that we may deserialize some of the tuples
> unnecessarily.
>
> Did you actually observe measurable improvements in some cases?

The improvements would mostly stem from brin indexes with multiple
(potentially compressed) by-ref types, as they go through more complex
and expensive code to deserialize, requiring separate palloc() and
memcpy() calls each.
For single-column and by-value types the improvements are expected to
be negligible, because there is no meaningful difference between
copying a single by-ref value and copying its container; the
additional work done for each tuple is marginal for those.

For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
(sha256((id+1)::text::bytea)::text),
(sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
measured a difference of 10x less time spent in the main loop of
_brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
ranges. It's not a lot, but worth at least something, I guess?

The attached patch fixes the issue that you called out.
It also further updates _brin_end_parallel: the final 'write empty
tuples' loop is never hit and is thus removed, because if there were
any tuples in the spool we'd have filled the empty ranges at the end
of the main loop, and if there were no tuples in the spool then the
memtuple would still be at its original initialized value of 0 thus
resulting in a constant false condition. I also updated some comments.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v6-0003-NOCOMMIT-Instrumentation-for-time-spent-in-_brin_.patch
Description: Binary data


v6-0002-Reduce-de-forming-of-BRIN-tuples-in-parallel-BRIN.patch
Description: Binary data


v6-0001-Allow-BRIN-to-build-its-index-in-parallel.patch
Description: Binary data


Re: remaining sql/json patches

2023-11-29 Thread Andrew Dunstan




> On Nov 29, 2023, at 2:41 PM, Andres Freund  wrote:
> 
> Hi,
> 
>> On 2023-11-29 14:21:59 -0500, Andrew Dunstan wrote:
>> On 2023-11-29 We 12:42, Andres Freund wrote:
>>>> I do have a CFLAGS setting, but for meson I used '-Ddebug=true' and no
>>>> buildtype or optimization setting. However, I see that in meson.build we're
>>>> defaulting to "buildtype=debugoptimized" as opposed to the standard meson
>>>> "buildtype=debug", so I guess that accounts for it.
>>>>
>>>> Still getting used to this stuff.
>>> What I meant was whether you set CFLAGS for the *autoconf* build,
>> 
>> That's what I meant too.
>> 
>>> because that
>>> will result in an unoptimized build unless you explicitly add -O2 (or 
>>> whatnot)
>>> to the flags.  Doing benchmarking without compiler optimizations is pretty
>>> pointless.
>>> 
>> 
>> Right. My latest reported results should all be at -O2.
> 
> Why are the results suddenly so much slower?
> 
> 


As I mentioned, I increased the iteration count to 5000.

Cheers 

Andrew



Re: Custom explain options

2023-11-29 Thread Pavel Stehule
Hi

On Sat, 25 Nov 2023 at 08:23, Konstantin Knizhnik 
wrote:

> Hi hackers,
>
> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...)
> which help to provide useful details of query execution.
> In Neon we have added PREFETCH option which shows information about page
> prefetching during query execution (prefetching is more critical for Neon
> architecture because of separation of compute and storage, so it is
> implemented not only for bitmap heap scan as in Vanilla Postgres, but also
> for seqscan, indexscan and indexonly scan). Another possible candidate  for
> explain options is local file cache (extra caching layer above shared
> buffers which is used to somehow replace file system cache in standalone
> Postgres).
>
> I think that it will be nice to have a generic mechanism which allows
> extensions to add its own options to EXPLAIN.
> I have attached the patch with implementation of such mechanism (also
> available as PR:  https://github.com/knizhnik/postgres/pull/1 )
>
> I have demonstrated this mechanism using Bloom extension - just to report
> number of Bloom matches.
> Not sure that it is really useful information but it is used mostly as
> example:
>
> explain (analyze,bloom) select * from t where pk=2000;
>QUERY PLAN
> -
>  Bitmap Heap Scan on t  (cost=15348.00..15352.01 rows=1 width=4) (actual 
> time=25.244..25.939 rows=1 loops=1)
>Recheck Cond: (pk = 2000)
>Rows Removed by Index Recheck: 292
>Heap Blocks: exact=283
>Bloom: matches=293
>->  Bitmap Index Scan on t_pk_idx  (cost=0.00..15348.00 rows=1 width=0) 
> (actual time=25.147..25.147 rows=293 loops=1)
>  Index Cond: (pk = 2000)
>  Bloom: matches=293
>  Planning:
>Bloom: matches=0
>  Planning Time: 0.387 ms
>  Execution Time: 26.053 ms
> (12 rows)
>
> There are two known issues with this proposal:
>
> 1. I have to limit total size of all custom metrics - right now it is
> limited by 128 bytes. It is done to keep Instrumentation and some other
> data structures fixes size. Otherwise maintaining varying parts of this
> structure is ugly, especially in shared memory
>
> 2. Custom extension is added by means of RegisterCustomInsrumentation function
> which is called from _PG_init
> But _PG_init is called when extension is loaded and it is loaded on
> demand when some of extension functions is called (except when extension is
> included
> in shared_preload_libraries list), Bloom extension doesn't require it. So
> if your first statement executed in your session is:
>
>  explain (analyze,bloom) select * from t where pk=2000;
>
> ...you will get error:
>
> ERROR:  unrecognized EXPLAIN option "bloom"
> LINE 1: explain (analyze,bloom) select * from t where pk=2000;
>
> It happens because at the moment when explain statement parses options,
> Bloom index is not yet selected and so bloom extension is not loaded and
> RegisterCustomInsrumentation is not yet called. If we repeat the query,
> then proper result will be displayed (see above).
>
>
This patch has a lot of whitespace and formatting issues. I fixed some of them.

I don't understand how selecting some custom instrumentation can be safe.

List *pgCustInstr is a global variable. The "selected" attribute is set by
the NewExplainState routine:

+   /* Reset custom instrumentations selection flag */
+   foreach (lc, pgCustInstr)
+   {
+   CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
+   ci->selected = false;
+   }

and this attribute is used in several places. But queries can be
nested: theoretically EXPLAIN ANALYZE can run another EXPLAIN ANALYZE,
and then this attribute of the global list can be overwritten. The
list of selected custom instrumentations should be part of the explain
state, I think.

Regards

Pavel
From fcc48bee889e4a052ba20f10f08d7f5b5eeaa43f Mon Sep 17 00:00:00 2001
From: "ok...@github.com" 
Date: Wed, 29 Nov 2023 07:51:51 +0100
Subject: [PATCH 2/2] fix whitespaces and formatting

---
 contrib/bloom/blscan.c|  1 +
 contrib/bloom/blutils.c   | 21 +---
 src/backend/access/nbtree/nbtsort.c   |  7 +--
 src/backend/commands/explain.c| 17 ---
 src/backend/commands/vacuumparallel.c |  5 +-
 src/backend/executor/execParallel.c   |  5 +-
 src/backend/executor/instrument.c | 72 ---
 7 files changed, 79 insertions(+), 49 deletions(-)

diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 8890951943..7fc43f174d 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -167,5 +167,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	}
 	FreeAccessStrategy(bas);
 	bloomUsage.matches += ntids;
+
 	return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d8fff4f057..1ba7afc607 100644
--- a/contrib/bloom/blutils.c
+++ 

Re: meson: Stop using deprecated way getting path of files

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-29 13:11:23 -0600, Tristan Partin wrote:
> This looks good to me.

Cool.


> What is our limiting factor on bumping the minimum Meson version?

Old distro versions, particularly ones where the distro just has an older
python. It's one thing to require installing meson but quite another to also
require building python. There's some other ongoing discussion about
establishing the minimum baseline in a somewhat more, uh, formalized way:
https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com

> While we are on the topic of Meson, it would be great if you could take a
> look at this thread[0], where I am trying to compile Postgres with
> -fsanitize=address,undefined (-Db_sanitize=address,undefined).

Done. Not sure it helps you much though :)

Greetings,

Andres Freund




Re: remaining sql/json patches

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-29 14:21:59 -0500, Andrew Dunstan wrote:
> On 2023-11-29 We 12:42, Andres Freund wrote:
> > > I do have a CFLAGS setting, but for meson I used '-Ddebug=true' and no
> > > buildtype  or optimization setting. However, I see that in meson.build 
> > > we're
> > > defaulting to "buildtype=debugoptimized" as opposed to the standard meson
> > > "buildtype=debug", so I guess that accounts for it.
> > > 
> > > Still getting used to this stuff.
> > What I meant was whether you set CFLAGS for the *autoconf* build,
>
> That's what I meant too.
> 
> > because that
> > will result in an unoptimized build unless you explicitly add -O2 (or 
> > whatnot)
> > to the flags.  Doing benchmarking without compiler optimizations is pretty
> > pointless.
> > 
> 
> Right. My latest reported results should all be at -O2.

Why are the results suddenly so much slower?

Greetings,

Andres Freund




Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Thomas Munro
On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas  wrote:
> LGTM. I think this 0001 patch is ready for commit, independently of the
> rest of the patches.

Done.

> In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
>
> > +/* Filename components */
> > +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> > +
>
> These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase.  Fixed.

> Other than that,
> v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
> v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
> good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
callers believe that short writes set errno: they error out with a
message including %m.  We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier).  In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error.  Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC, I have heard (imagine a transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes
presumably increases along with them.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point).  I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change.  The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short.  Thoughts?

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
From 7808d7241088f8dc02d7d828dc4ee0227e8e2a01 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v3 1/7] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Reviewed-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 src/backend/storage/file/fd.c | 33 +++--
 src/include/storage/fd.h  | 30 --
 2 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f691ba0932..aa89c755b3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,8 +2166,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2213,11 +2220,17 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	

Re: Fix some memory leaks in ecpg.addons

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-08 22:00:00 +0300, Alexander Lakhin wrote:
> Hello Tristan,
>
> 08.11.2023 20:37, Tristan Partin wrote:
> > Are people using some suppression file or setting ASAN_OPTIONS to something?
> >
>
> I use the following:
> ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:print_stacktrace=1:\
> disable_coredump=0:strict_string_checks=1:check_initialization_order=1:\
> strict_init_order=1:detect_stack_use_after_return=0

I wonder if we should add some of these options by default ourselves. We could
e.g. add something like the __ubsan_default_options() in
src/backend/main/main.c to src/port/... instead, and return a combination of
"our" options (like detect_leaks=0) and the ones from the environment.
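For instance, a minimal sketch of the ASan side (the hook name is the documented ASan counterpart of __ubsan_default_options(); the particular option string is only an example):

```c
/*
 * Sketch only: if the binary defines __asan_default_options(), ASan
 * uses the returned string as its default options.  ASAN_OPTIONS from
 * the environment still overrides anything set here, which is the
 * combination behavior described above.
 */
const char *
__asan_default_options(void)
{
	return "detect_leaks=0:abort_on_error=1";
}
```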


> (You'll need to add detect_stack_use_after_return=0 with a newer clang
> (I use clang-18) to workaround an incompatibility of check_stack_depth()
> with that sanitizer feature enabled by default.)

I have been wondering if we should work on fixing that.  There are a few ways:

- We can add a compiler parameter explicitly disabling the use-after-return
  checks - however, the checks are quite useful, so that'd be somewhat of a
  shame.

- We could exempt the stack depth checking functions from being validated with
  asan, I think that should fix this issue. Looks like
  __attribute__((no_sanitize("address")))
  would work

- Asan has an interface for getting the real stack address. See
  
https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/sanitizer/asan_interface.h#L322


ISTM that, if it actually works as I theorize it should, using
__attribute__((no_sanitize("address"))) would be the easiest approach
here. Something like

#ifdef __has_feature
#if __has_feature(address_sanitizer)
#define pg_attribute_no_asan __attribute__((no_sanitize("address")))
#endif
#endif
#ifndef pg_attribute_no_asan
#define pg_attribute_no_asan
#endif

or such should work.
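Applied to a stack-depth probe, the idea would look roughly like this (a sketch; stack_base is an invented stand-in for PostgreSQL's real reference pointer, not the actual variable name):

```c
#include <stddef.h>

#ifdef __has_feature
#if __has_feature(address_sanitizer)
#define pg_attribute_no_asan __attribute__((no_sanitize("address")))
#endif
#endif
#ifndef pg_attribute_no_asan
#define pg_attribute_no_asan
#endif

/*
 * Sketch: exempt the stack-depth probe from ASan instrumentation so
 * that detect_stack_use_after_return (which relocates locals onto a
 * fake stack) cannot break the address arithmetic.
 */
static char *stack_base;		/* hypothetical reference point */

static pg_attribute_no_asan long
current_stack_depth(void)
{
	char		here;
	long		depth = (long) (stack_base - &here);	/* stack grows down */

	return depth < 0 ? -depth : depth;
}
```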


> So I would say that fixing ecpg won't make postgres sanitizer-friendly as
> a whole.

One thing that's been holding me back on trying to do something around this is
the basically non-existing documentation around all of this. I haven't even
found documentation referencing the fact that there are headers like
sanitizer/asan_interface.h, you just have to figure that out yourself. Compare
that to something like valgrind, which has documented this at least somewhat.

Greetings,

Andres Freund




Re: remaining sql/json patches

2023-11-29 Thread Andrew Dunstan



On 2023-11-29 We 12:42, Andres Freund wrote:

Hi,

On 2023-11-29 07:37:53 -0500, Andrew Dunstan wrote:

On 2023-11-28 Tu 21:10, Andres Freund wrote:

Hi,

On 2023-11-28 20:58:41 -0500, Andrew Dunstan wrote:

On 2023-11-28 Tu 19:32, Tom Lane wrote:

Andrew Dunstan  writes:
So I'm now a bit baffled.  Can you provide more color on what
your test setup is?

*sigh* yes, you're right. I inadvertently used a setup that used meson for
building REL16_STABLE and HEAD. When I switch it to autoconf I get results
that are similar to the earlier branches:


 REL_16_STABLE 
Time: 3401.625 ms (00:03.402)
 HEAD 
Time: 3419.088 ms (00:03.419)


It's not clear to me why that should be. I didn't have assertions enabled
anywhere. It's the same version of bison, same compiler throughout. Maybe
meson sets a higher level of optimization? It shouldn't really matter, ISTM.

Is it possible that you have CFLAGS set in your environment? For reasons that
I find very debatable, configure.ac only adds -O2 when CFLAGS is not set:

# C[XX]FLAGS are selected so:
# If the user specifies something in the environment, that is used.
# else:  If the template file set something, that is used.
# else:  If coverage was enabled, don't set anything.
# else:  If the compiler is GCC, then we use -O2.
# else:  If the compiler is something else, then we use -O, unless debugging.

if test "$ac_env_CFLAGS_set" = set; then
CFLAGS=$ac_env_CFLAGS_value
elif test "${CFLAGS+set}" = set; then
: # (keep what template set)
elif test "$enable_coverage" = yes; then
: # no optimization by default
elif test "$GCC" = yes; then
CFLAGS="-O2"
else
# if the user selected debug mode, don't use -O
if test "$enable_debug" != yes; then
  CFLAGS="-O"
fi
fi

So if you have CFLAGS set in the environment, we'll not add -O2 to the
compilation flags.

I'd check what the actual flags are when building a some .o.


I do have a CFLAGS setting, but for meson I used '-Ddebug=true' and no
buildtype  or optimization setting. However, I see that in meson.build we're
defaulting to "buildtype=debugoptimized" as opposed to the standard meson
"buildtype=debug", so I guess that accounts for it.

Still getting used to this stuff.

What I meant was whether you set CFLAGS for the *autoconf* build,




That's what I meant too.



because that
will result in an unoptimized build unless you explicitly add -O2 (or whatnot)
to the flags.  Doing benchmarking without compiler optimizations is pretty
pointless.



Right. My latest reported results should all be at -O2.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fix some memory leaks in ecpg.addons

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-08 11:37:46 -0600, Tristan Partin wrote:
> On Wed Nov 8, 2023 at 11:18 AM CST, Michael Meskes wrote:
> > Am Mittwoch, dem 08.11.2023 um 12:07 -0500 schrieb Tom Lane:
> > > "Tristan Partin"  writes:
> > > > clang and gcc both now support -fsanitize=address,undefined. These
> > > > are really useful to me personally when trying to debug issues.
> > > > Unfortunately ecpg code has a ton of memory leaks, which makes
> > > > builds really painful. It would be great to fix all of them, but
> > > > I don't have the patience to try to read flex/bison code. Here are
> > > > two memory leak fixes in any case.
> > >
> > > I'm kind of failing to see the point.  As you say, the ecpg
> > > preprocessor leaks memory like there's no tomorrow.  But given its
> > > usage (process one source file and exit) I'm not sure that is worth
> > > much effort to fix.  And what does it buy to fix just two spots?
> > 
> > Agreed, it's not exactly uncommon for tools like ecpg to not worry
> > about memory. After all it gets freed when the program ends.
> 
> In the default configuration of AddressSanitizer, I can't even complete a
> full build of Postgres.

I don't find the leak checks very useful for the moment. Leaks that happen
once in the lifetime of the program aren't problematic, and often tracking
them would make code more complicated.  Perhaps we'll eventually change our
tune on this, but I don't think it's worth fighting this windmill at this
point. I think at the very least we'd first want to port the memory context
infrastructure to frontend programs.

> 
> Are people using some suppression file or setting ASAN_OPTIONS to something?

You pretty much have to. Locally I use this:

export 
ASAN_OPTIONS='debug=1:print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
 UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'

CI uses something similar.

Greetings,

Andres Freund




Re: meson: Stop using deprecated way getting path of files

2023-11-29 Thread Tristan Partin
This looks good to me. What is our limiting factor on bumping the 
minimum Meson version?


While we are on the topic of Meson, it would be great if you could take 
a look at this thread[0], where I am trying to compile Postgres with 
-fsanitize=address,undefined (-Db_sanitize=address,undefined).


[0]: https://www.postgresql.org/message-id/CWTM35CAUKRT.1733OSMXUZW7%40neon.tech

--
Tristan Partin
Neon (https://neon.tech)




Re: Fix assertion in autovacuum worker

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-29 11:52:01 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote:
> > On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> >> Right.  Perhaps we could add a LWLockReleaseAll() to
> >> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> >> is still just a hack.
> >
> > Yea, we'd need that in just about all before_shmem_exit() callbacks.  I
> > could see an argument for doing it in proc_exit_prepare(). While that'd
> > be a fairly gross layering violation, we already reset a bunch of stuff
> > in there:
>
> Gross layering violations aside, that at least seems more future-proof
> against other sigsetjmp() blocks that proc_exit() without doing any
> preliminary cleanup.

It's not just sigsetjmp() blocks not doing cleanup that are problematic -
consider what happens if there's either no sigsetjmp() block established (and
thus ERROR is promoted to FATAL), or if a FATAL error is raised. Then cleanup
in the sigsetjmp() site doesn't matter.  So we really need a better answer
here.

If we don't want to add LWLockReleaseAll() to proc_exit_prepare(), ISTM we
should at least add some assertion infrastructure verifying it's being
called in relevant paths, triggering an assertion if there was no
LWLockReleaseAll() before reaching important before_shmem_exit() routines,
even if we don't actually hold an lwlock at the time of the error. Otherwise
problematic paths are way too hard to find.

Greetings,

Andres Freund




meson: Stop using deprecated way getting path of files

2023-11-29 Thread Andres Freund
Hi,

The just released meson 1.3 strongly deprecated a hack we were using, emitting
a noisy warning (the hack basically depended on an implementation detail to
work). Turns out there has been a better way available for a while, I just
hadn't found it. 1.4 added a more convenient approach, but we can't rely on
that.

Everything continues to work, but the warning is annoying.

The warning:

Message: checking for file conflicts between source and build directory
../home/andres/src/postgresql/meson.build:2972: DEPRECATION: Project uses 
feature that was always broken, and is now deprecated since '1.3.0': 
str.format: Value other than strings, integers, bools, options, dictionaries 
and lists thereof..
[...]
WARNING: Broken features used:
 * 1.3.0: {'str.format: Value other than strings, integers, bools, options, 
dictionaries and lists thereof.'}

I plan to apply this soon, unless I hear some opposition / better ideas / 

Greetings,

Andres Freund
>From d296c088ceadeac281a54d6f7f748112c5efb33a Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 29 Nov 2023 10:47:20 -0800
Subject: [PATCH v1] meson: Stop using deprecated way getting path of files

The just released meson 1.3 strongly deprecated a hack we were using, emitting
a noisy warning (the hack basically depended on an implementation detail to
work). Turns out there has been a better way available for a while, I just
hadn't found it. 1.4 added a more convenient approach, but we can't rely on
that.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch: 16-, where the meson build was added.
---
 meson.build | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index 0095fb183af..d8ff5cb51f0 100644
--- a/meson.build
+++ b/meson.build
@@ -2924,8 +2924,12 @@ potentially_conflicting_files = []
 foreach t : potentially_conflicting_files_t
   potentially_conflicting_files += t.full_path()
 endforeach
-foreach t : configure_files
-  t = '@0@'.format(t)
+foreach t1 : configure_files
+  if meson.version().version_compare('>=0.59')
+t = fs.parent(t1) / fs.name(t1)
+  else
+t = '@0@'.format(t1)
+  endif
   potentially_conflicting_files += meson.current_build_dir() / t
 endforeach
 foreach sub, fnames : generated_sources_ac
-- 
2.38.0



Re: pgsql: meson: docs: Add {html,man} targets, rename install-doc-*

2023-11-29 Thread Andres Freund
Hi,

This started at 
https://www.postgresql.org/message-id/746ba786-85bb-d1f7-b613-57bec35c642a%40dunslane.net
but seems worth discussing on -hackers.

On 2023-11-29 07:20:59 -0500, Andrew Dunstan wrote:
> On 2023-11-28 Tu 21:28, Andres Freund wrote:
> > On 2023-11-23 08:32:21 -0500, Andrew Dunstan wrote:
> > > On 2023-11-20 Mo 20:53, Andres Freund wrote:
> > > > meson: docs: Add {html,man} targets, rename install-doc-*
> > > >
> > > > We have toplevel html, man targets in the autoconf build as well. It'd 
> > > > be odd
> > > > to have an 'html' target but have the install target be 
> > > > 'install-doc-html',
> > > > thus rename the install targets to match.
> > >
> > > This commit or one of its nearby friends appears to have broken
> > > crake's docs build:
> > >
> > > ERROR: Can't invoke target `html`: ambiguous name. Add target type and/or
> > > path:
> > > - ./doc/src/sgml/html:custom
> > > - ./doc/src/sgml/html:alias
> > >
> > > See
> > Ah, I realize now that this is from meson compile html, not 'ninja html'.
> > That explains why I couldn't reproduce this initially and why CI didn't
> > complain.  I don't really understand why meson compile complains in this
> > case.  I assume you don't want to disambiguate as suggested, by building
> > html:alias instead?
> >
>
> I've done that as a temporary fix to get crake out of the hole, but it's
> pretty ugly, and I don't want to do it in a release if at all possible.

If we want to prevent these kind of conflicts, which doesn't seem
unreasonable, I think we need an automatic check that prevents reintroducing
them. I think most people will just use ninja and not see them. Meson stores
the relevant information in meson-info/intro-targets.json, so that's just a
bit of munging of that file.

I think the background for this issue existing is that meson supports a "flat"
build directory layout (which is deprecated), so the directory name can't be
used to deconflict with meson compile, which tries to work across all "build
execution" systems.

Prototype of such a check, as well as a commit deconflicting the target names,
attached.

Greetings,

Andres Freund
>From 7b44707b9caa4528f1a6e8dd218644fa114007b3 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 29 Nov 2023 10:33:47 -0800
Subject: [PATCH v1 1/2] meson: Rename target names that conflict with global
 target names

Author:
Reviewed-by:
Discussion: https://postgr.es/m/703110e4-b495-e409-26a5-28e9fca8f...@dunslane.net
Backpatch:
---
 src/common/unicode/meson.build | 2 +-
 doc/src/sgml/meson.build   | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/unicode/meson.build b/src/common/unicode/meson.build
index 6af46122c4e..c548710c293 100644
--- a/src/common/unicode/meson.build
+++ b/src/common/unicode/meson.build
@@ -139,7 +139,7 @@ endif
 
 # Use a custom target, as run targets serialize the output, making this harder
 # to debug, and don't deal well with targets with multiple outputs.
-update_unicode = custom_target('update-unicode',
+update_unicode = custom_target('tgt-update-unicode',
   depends: update_unicode_dep,
   output: ['dont-exist'],
   input: update_unicode_targets,
diff --git a/doc/src/sgml/meson.build b/doc/src/sgml/meson.build
index e1a85dc607b..c604bdbd644 100644
--- a/doc/src/sgml/meson.build
+++ b/doc/src/sgml/meson.build
@@ -135,7 +135,7 @@ endif
 # Full documentation as html, text
 #
 if docs_dep.found()
-  html = custom_target('html',
+  html = custom_target('tgt-build-html',
 input: ['stylesheet.xsl', postgres_full_xml],
 output: 'html',
 depfile: 'html.d',
@@ -144,7 +144,7 @@ if docs_dep.found()
   )
   alldocs += html
 
-  install_doc_html = custom_target('install-html',
+  install_doc_html = custom_target('tgt-install-html',
 output: 'install-html',
 command: [
   python, install_files, '--prefix', dir_prefix,
@@ -224,7 +224,7 @@ endif
 #
 if docs_dep.found()
   # FIXME: implement / consider sqlmansectnum logic
-  man = custom_target('man',
+  man = custom_target('tgt-build-man',
 input: ['stylesheet-man.xsl', postgres_full_xml],
 output: ['man1', 'man3', 'man7'],
 depfile: 'man.d',
-- 
2.38.0

>From a15cc895abfc9b426af75051d069fd625391b618 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 29 Nov 2023 10:30:21 -0800
Subject: [PATCH v1 2/2] meson: Add test checking if there are conflicting
 target names

Author:
Reviewed-by:
Discussion: https://postgr.es/m/703110e4-b495-e409-26a5-28e9fca8f...@dunslane.net
Backpatch:
---
 meson.build   |  8 ++
 src/tools/check_meson | 67 +++
 2 files changed, 75 insertions(+)
 create mode 100755 src/tools/check_meson

diff --git a/meson.build b/meson.build
index 0095fb183af..e0895f41f78 100644
--- a/meson.build
+++ b/meson.build
@@ -3320,6 +3320,14 @@ if meson.version().version_compare('>=0.57')
 endif
 
 
+# Tests 

Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Tomas Vondra
On 11/29/23 15:52, Tomas Vondra wrote:
>> ...
>>
>> This also made me think a bit more about how we're working with the
>> tuples. With your latest patch, we always deserialize and re-serialize
>> the sorted brin tuples, just in case the next tuple will also be a
>> BRIN tuple of the same page range. Could we save some of that
>> deserialization time by optimistically expecting that we're not going
>> to need to merge the tuple and only store a local copy of it locally?
>> See attached 0002; this saves some cycles in common cases.
>>
> 
> Good idea!
> 

FWIW there's a bug, in this part of the optimization:

--
+		if (memtuple == NULL)
+			memtuple = brin_deform_tuple(state->bs_bdesc, btup,
+										 memtup_holder);
+
 		union_tuples(state->bs_bdesc, memtuple, btup);
 		continue;
--

The deforming should use prevbtup, otherwise union_tuples() just combines
two copies of the same tuple.
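Put differently, the intended hunk differs by a single argument (sketched here against the snippet above; prevbtup is the variable the preceding paragraph refers to):

--
-		memtuple = brin_deform_tuple(state->bs_bdesc, btup,
+		memtuple = brin_deform_tuple(state->bs_bdesc, prevbtup,
 									 memtup_holder);
--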

Which however brings me to the bigger issue with this - my stress test
found this issue pretty quickly, but then I spent quite a bit of time
trying to find what went wrong. I find this reworked code pretty hard to
understand, and not necessarily because of how it's written. The problem
is that the same loop tries to juggle multiple pieces of information with
different lifespans, and so on. I find it really hard to reason about
how it behaves ...

I did try to measure how much it actually saves, but none of the tests I
did actually found measurable improvement. So I'm tempted to just not
include this part, and accept that we may deserialize some of the tuples
unnecessarily.

Did you actually observe measurable improvements in some cases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix assertion in autovacuum worker

2023-11-29 Thread Nathan Bossart
On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote:
> On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
>> Right.  Perhaps we could add a LWLockReleaseAll() to
>> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
>> is still just a hack.
> 
> Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
> see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
> gross layering violation, we already reset a bunch of stuff in
> there:

Gross layering violations aside, that at least seems more future-proof
against other sigsetjmp() blocks that proc_exit() without doing any
preliminary cleanup.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: psql not responding to SIGINT upon db reconnection

2023-11-29 Thread Tristan Partin

On Thu Nov 23, 2023 at 3:19 AM CST, Heikki Linnakangas wrote:

On 22/11/2023 23:29, Tristan Partin wrote:
> Ha, you're right. I had this working yesterday, but convinced myself it
> didn't. I had a do while loop wrapping the blocking call. Here is a v4,
> which seems to pass the tests that were pointed out to be failing
> earlier.

Thanks! This suffers from a classic race condition:

> +  if (cancel_pressed)
> +  break;
> +
> +  sock = PQsocket(n_conn);
> +  if (sock == -1)
> +  break;
> +
> +  rc = pqSocketPoll(sock, for_read, !for_read, (time_t)-1);
> +  Assert(rc != 0); /* Timeout is impossible. */
> +  if (rc == -1)
> +  {
> +  success = false;
> +  break;
> +  }

If a signal arrives between the "if (cancel_pressed)" check and 
pqSocketPoll(), pgSocketPoll will "miss" the signal and block 
indefinitely. There are three solutions to this:


1. Use a timeout, so that you wake up periodically to check for any 
missed signals. Easy solution, the downsides are that you will not react 
as quickly if the signal is missed, and you waste some cycles by waking 
up unnecessarily.


2. The Self-pipe trick: https://cr.yp.to/docs/selfpipe.html. We also use 
that in src/backend/storage/ipc/latch.c. It's portable, but somewhat 
complicated.


3. Use pselect() or ppoll(). They were created specifically to address 
this problem. Not fully portable, I think it's missing on Windows at least.


Maybe use pselect() if it's available, and select() with a timeout if 
it's not.


First I have ever heard of this race condition, and now I will commit it 
to durable storage :). I went ahead and did option #3 that you proposed. 
On Windows, I have a 1 second timeout for the select/poll. That seemed 
like a good balance of user experience and spinning the CPU. But I am 
open to other values. I don't have a Windows VM, but maybe I should set 
one up...


I am not completely in love with the code I have written. Lots of 
conditional compilation which makes it hard to read. Looking forward to 
another round of review to see what y'all think.


For what it's worth, I did try #2, but since psql installs its own 
SIGINT handler, the code became really hairy.
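For reference, the core of option #3 can be sketched as follows (a minimal illustration with invented names, not the patch's actual code; the caller is assumed to keep SIGINT blocked):

```c
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>				/* for the pipe-based usage example */

static volatile sig_atomic_t cancel_pressed = 0;

/*
 * Sketch of option #3: because SIGINT stays blocked outside pselect(),
 * it cannot arrive between the cancel_pressed check and the wait.
 * pselect() atomically installs 'unblocked' (a mask with SIGINT
 * enabled) only for the duration of the wait.  Returns 1 when fd is
 * readable, 0 on cancellation.
 */
static int
wait_readable_or_cancel(int fd, const sigset_t *unblocked)
{
	fd_set		rfds;

	for (;;)
	{
		if (cancel_pressed)		/* safe: the signal is blocked here */
			return 0;

		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);

		if (pselect(fd + 1, &rfds, NULL, NULL, NULL, unblocked) > 0)
			return 1;
		/* EINTR: a handler ran; loop around and re-check the flag */
	}
}
```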


--
Tristan Partin
Neon (https://neon.tech)
From 4fdccf9211ac78bbd7430488d515646f9e9cce9b Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Mon, 24 Jul 2023 11:12:59 -0500
Subject: [PATCH v5] Allow SIGINT to cancel psql database reconnections

After installing the SIGINT handler in psql, SIGINT can no longer cancel
database reconnections. For instance, if the user starts a reconnection
and then needs to do some form of interaction (ie psql is polling),
there is no way to cancel the reconnection process currently.

Use PQconnectStartParams() in order to insert a CancelRequested check
into the polling loop.
---
 meson.build|   1 +
 src/bin/psql/command.c | 222 -
 src/include/pg_config.h.in |   3 +
 src/tools/msvc/Solution.pm |   1 +
 4 files changed, 226 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index ee58ee7a06..2d63485c53 100644
--- a/meson.build
+++ b/meson.build
@@ -2440,6 +2440,7 @@ func_checks = [
   ['posix_fadvise'],
   ['posix_fallocate'],
   ['ppoll'],
+  ['pselect'],
   ['pstat'],
   ['pthread_barrier_wait', {'dependencies': [thread_dep]}],
   ['pthread_is_threaded_np', {'dependencies': [thread_dep]}],
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 82cc091568..c325fedb51 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -11,6 +11,11 @@
 #include 
 #include 
 #include 
+#if HAVE_POLL
+#include <poll.h>
+#else
+#include <sys/select.h>
+#endif
 #ifndef WIN32
 #include <sys/stat.h>			/* for stat() */
 #include <sys/time.h>			/* for setitimer() */
@@ -40,6 +45,7 @@
 #include "large_obj.h"
 #include "libpq-fe.h"
 #include "libpq/pqcomm.h"
+#include "libpq/pqsignal.h"
 #include "mainloop.h"
 #include "portability/instr_time.h"
 #include "pqexpbuffer.h"
@@ -3251,6 +3257,169 @@ copy_previous_query(PQExpBuffer query_buf, PQExpBuffer previous_buf)
 	return false;
 }
 
+/*
+ * Check a file descriptor for read and/or write data, possibly waiting.
+ * If neither forRead nor forWrite are set, immediately return a timeout
+ * condition (without waiting).  Return >0 if condition is met, 0
+ * if a timeout occurred, -1 if an error or interrupt occurred.
+ *
+ * Timeout is infinite if end_time is -1.  Timeout is immediate (no blocking)
+ * if end_time is 0 (or indeed, any time before now).
+ */
+static int
+pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
+{
+	/*
+	 * We use functions in the following order if available:
+	 *   - ppoll(2) OR pselect(2)
+	 *   - poll(2) OR select(2)
+	 */
+#ifdef HAVE_POLL
+	struct pollfd input_fd;
+#ifdef HAVE_PPOLL
+	int			rc;
+	sigset_t		

Re: remaining sql/json patches

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-29 07:37:53 -0500, Andrew Dunstan wrote:
> On 2023-11-28 Tu 21:10, Andres Freund wrote:
> > Hi,
> >
> > On 2023-11-28 20:58:41 -0500, Andrew Dunstan wrote:
> > > On 2023-11-28 Tu 19:32, Tom Lane wrote:
> > > > Andrew Dunstan  writes:
> > > > So I'm now a bit baffled.  Can you provide more color on what
> > > > your test setup is?
> > >
> > > *sigh* yes, you're right. I inadvertently used a setup that used meson for
> > > building REL16_STABLE and HEAD. When I switch it to autoconf I get results
> > > that are similar to the earlier branches:
> > >
> > >
> > >  REL_16_STABLE 
> > > Time: 3401.625 ms (00:03.402)
> > >  HEAD 
> > > Time: 3419.088 ms (00:03.419)
> > >
> > >
> > > It's not clear to me why that should be. I didn't have assertions enabled
> > > anywhere. It's the same version of bison, same compiler throughout. Maybe
> > > meson sets a higher level of optimization? It shouldn't really matter, 
> > > ISTM.
> > Is it possible that you have CFLAGS set in your environment? For reasons 
> > that
> > I find very debatable, configure.ac only adds -O2 when CFLAGS is not set:
> >
> > # C[XX]FLAGS are selected so:
> > # If the user specifies something in the environment, that is used.
> > # else:  If the template file set something, that is used.
> > # else:  If coverage was enabled, don't set anything.
> > # else:  If the compiler is GCC, then we use -O2.
> > # else:  If the compiler is something else, then we use -O, unless 
> > debugging.
> >
> > if test "$ac_env_CFLAGS_set" = set; then
> >CFLAGS=$ac_env_CFLAGS_value
> > elif test "${CFLAGS+set}" = set; then
> >: # (keep what template set)
> > elif test "$enable_coverage" = yes; then
> >: # no optimization by default
> > elif test "$GCC" = yes; then
> >CFLAGS="-O2"
> > else
> ># if the user selected debug mode, don't use -O
> >if test "$enable_debug" != yes; then
> >  CFLAGS="-O"
> >fi
> > fi
> >
> > So if you have CFLAGS set in the environment, we'll not add -O2 to the
> > compilation flags.
> >
> > I'd check what the actual flags are when building a some .o.
> >
>
> I do have a CFLAGS setting, but for meson I used '-Ddebug=true' and no
> buildtype  or optimization setting. However, I see that in meson.build we're
> defaulting to "buildtype=debugoptimized" as opposed to the standard meson
> "buildtype=debug", so I guess that accounts for it.
>
> Still getting used to this stuff.

What I meant was whether you set CFLAGS for the *autoconf* build, because that
will result in an unoptimized build unless you explicitly add -O2 (or whatnot)
to the flags.  Doing benchmarking without compiler optimizations is pretty
pointless.

Greetings,

Andres Freund




Re: Whose Cirrus CI credits are used when making a PR to the GitHub mirror?

2023-11-29 Thread Andres Freund
Hi,

On 2023-11-29 11:30:25 -0600, Tristan Partin wrote:
> I enabled CI on my personal Postgres fork. I then tried to open a PR against
> my fork, and since GitHub defaults to creating PRs against upstream, I
> accidentally opened a PR against the Postgres mirror, which the
> postgres-mirror bot then closed, which is good. Stupid me.

> What the bot didn't do however was cancel the Cirrus CI build that arose
> from my immediately closed PR. Here[0] is the current run. It seems like you
> could very easily waste all the CI credits by creating a whole bunch of PRs
> against the mirror.
> 
> If I am just wasting my own credits, please ignore :).

It's currently using custom compute resources provided by google (formerly
provided by me), the same as cfbot. There is a hard limits to the number of
concurrent tasks and warnings when getting closer to those. So I am currently
not too worried about this threat.

I do wish github would allow disabling PRs...

Greetings,

Andres Freund




Whose Cirrus CI credits are used when making a PR to the GitHub mirror?

2023-11-29 Thread Tristan Partin
I enabled CI on my personal Postgres fork. I then tried to open a PR 
against my fork, and since GitHub defaults to creating PRs against 
upstream, I accidentally opened a PR against the Postgres mirror, which 
the postgres-mirror bot then closed, which is good. Stupid me.


What the bot didn't do however was cancel the Cirrus CI build that arose 
from my immediately closed PR. Here[0] is the current run. It seems like 
you could very easily waste all the CI credits by creating a whole bunch 
of PRs against the mirror. 


If I am just wasting my own credits, please ignore :).

[0]: https://cirrus-ci.com/build/6235510532734976

--
Tristan Partin
Neon (https://neon.tech)




add AVX2 support to simd.h

2023-11-29 Thread Nathan Bossart
On Wed, Nov 22, 2023 at 12:49:35PM -0600, Nathan Bossart wrote:
> On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
>> For reference, executing the page checksum 10M times on an AMD 3900X CPU:
>> 
>> clang-14 -O2                  4.292s (17.8 GiB/s)
>> clang-14 -O2 -msse4.1         2.859s (26.7 GiB/s)
>> clang-14 -O2 -msse4.1 -mavx2  1.378s (55.4 GiB/s)
> 
> Nice.  I've noticed similar improvements with AVX2 intrinsics in simd.h.

I've alluded to this a few times now, so I figured I'd park the patch and
preliminary benchmarks in a new thread while we iron out how to support
newer instructions (see discussion here [0]).

Using the same benchmark as we did for the SSE2 linear searches in
XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

  writers    sse2    avx2      %
      256    1195    1188     -1
      512     928    1054    +14
     1024     633     716    +13
     2048     332     420    +27
     4096     162     203    +25
     8192     162     182    +12

It's been a while since I ran these benchmarks, but I vaguely recall also
seeing something like a 50% improvement for a dedicated pg_lfind32()
benchmark on long arrays.
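For context, the operation being vectorized is a plain linear search; a scalar equivalent of pg_lfind32() (a sketch, not the actual simd.h code) is simply:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar sketch of what pg_lfind32() computes.  The SSE2 and AVX2
 * versions test 4 or 8 elements per iteration respectively, which is
 * where the speedups in the benchmarks above come from.
 */
static bool
lfind32_scalar(uint32_t key, const uint32_t *base, size_t nelem)
{
	for (size_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}
```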

As is, the patch likely won't do anything unless you add -mavx2 or
-march=native to your CFLAGS.  I don't intend for this patch to be
seriously considered until we have better support for detecting/compiling
AVX2 instructions and a buildfarm machine that uses them.

I plan to start another thread for AVX2 support for the page checksums.

[0] https://postgr.es/m/20231107024734.GB729644%40nathanxps13
[1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9...@2ndquadrant.com
[2] https://postgr.es/m/20220713170950.GA3116318%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5a90f1597fdc64aa6df6b9d0ffd959af7df41abd Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 29 Nov 2023 10:01:32 -0600
Subject: [PATCH v1 1/1] add avx2 support in simd.h

---
 src/include/port/simd.h | 50 -
 1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..0e698dcfab 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,7 +18,15 @@
 #ifndef SIMD_H
 #define SIMD_H
 
-#if (defined(__x86_64__) || defined(_M_AMD64))
+#if defined(__AVX2__)
+
+#include 
+#define USE_AVX2
+typedef __m256i Vector8;
+typedef __m256i Vector32;
+
+#elif (defined(__x86_64__) || defined(_M_AMD64))
+
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
  * that compilers targeting this architecture understand SSE2 intrinsics.
@@ -105,7 +113,9 @@ static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 static inline void
 vector8_load(Vector8 *v, const uint8 *s)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u8(s);
@@ -118,7 +128,9 @@ vector8_load(Vector8 *v, const uint8 *s)
 static inline void
 vector32_load(Vector32 *v, const uint32 *s)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u32(s);
@@ -132,7 +144,9 @@ vector32_load(Vector32 *v, const uint32 *s)
 static inline Vector8
 vector8_broadcast(const uint8 c)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	return _mm256_set1_epi8(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi8(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u8(c);
@@ -145,7 +159,9 @@ vector8_broadcast(const uint8 c)
 static inline Vector32
 vector32_broadcast(const uint32 c)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_set1_epi32(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi32(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u32(c);
@@ -268,7 +284,9 @@ vector8_has_le(const Vector8 v, const uint8 c)
 static inline bool
 vector8_is_highbit_set(const Vector8 v)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_movemask_epi8(v) != 0;
+#elif defined(USE_SSE2)
 	return _mm_movemask_epi8(v) != 0;
 #elif defined(USE_NEON)
 	return vmaxvq_u8(v) > 0x7F;
@@ -305,7 +323,9 @@ vector32_is_highbit_set(const Vector32 v)
 static inline Vector8
 vector8_or(const Vector8 v1, const Vector8 v2)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_or_si256(v1, v2);
+#elif defined(USE_SSE2)
 	return _mm_or_si128(v1, v2);
 #elif defined(USE_NEON)
 	return vorrq_u8(v1, v2);
@@ -318,7 +338,9 @@ vector8_or(const Vector8 v1, const Vector8 v2)
 static inline Vector32
 vector32_or(const Vector32 v1, const Vector32 v2)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_or_si256(v1, v2);
+#elif defined(USE_SSE2)
 	return _mm_or_si128(v1, v2);
 #elif defined(USE_NEON)
 	return vorrq_u32(v1, v2);
@@ -336,7 +358,9 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 

Re: remaining sql/json patches

2023-11-29 Thread Andrew Dunstan



On 2023-11-28 Tu 20:58, Andrew Dunstan wrote:


On 2023-11-28 Tu 19:32, Tom Lane wrote:

Andrew Dunstan  writes:

Cool, I took this and ran with it a bit. (See attached) Here are
comparative timings for 1000 iterations parsing most of the
information_schema.sql, all the way back to 9.3:
...
 REL_15_STABLE 
Time: 3372.491 ms (00:03.372)
 REL_16_STABLE 
Time: 1654.056 ms (00:01.654)
 HEAD 
Time: 1614.949 ms (00:01.615)
This is fairly repeatable.

These results astonished me, because I didn't recall us having done
anything that'd be likely to double the speed of the raw parser.
So I set out to replicate them, intending to bisect to find where
the change happened.  And ... I can't replicate them.  What I got
is essentially level performance from HEAD back to d10b19e22
(Stamp HEAD as 14devel):

HEAD: 3742.544 ms
d31d30973a (16 stamp): 3871.441 ms
596b5af1d (15 stamp): 3759.319 ms
d10b19e22 (14 stamp): 3730.834 ms

The run-to-run variation is a couple percent, which means that
these differences are down in the noise.  This is using your
test code from github (but with 5000 iterations not 1000).
Builds are pretty vanilla with asserts off, on an M1 MacBook Pro.
The bison version might matter here: it's 3.8.2 from MacPorts.

I wondered if you'd tested assert-enabled builds, but there
doesn't seem to be much variation with that turned on either.

So I'm now a bit baffled.  Can you provide more color on what
your test setup is?



*sigh* yes, you're right. I inadvertently used a setup that used meson 
for building REL16_STABLE and HEAD. When I switch it to autoconf I get 
results that are similar to the earlier branches:



 REL_16_STABLE 
Time: 3401.625 ms (00:03.402)
 HEAD 
Time: 3419.088 ms (00:03.419)


It's not clear to me why that should be. I didn't have assertions 
enabled anywhere. It's the same version of bison, same compiler 
throughout. Maybe meson sets a higher level of optimization? It 
shouldn't really matter, ISTM.



OK, with completely vanilla autoconf builds, doing 5000 iterations, here 
are the timings I get, including a test with Amit's latest published 
patches (with a small fixup due to bitrot).


Essentially, with the patches applied it's very slightly slower than 
master, about the same as release 16, faster than everything earlier. 
And we hope to improve the grammar impact of the JSON_TABLE piece before 
we're done.




 REL_11_STABLE 
Time: 10381.814 ms (00:10.382)
 REL_12_STABLE 
Time: 8151.213 ms (00:08.151)
 REL_13_STABLE 
Time: 7774.034 ms (00:07.774)
 REL_14_STABLE 
Time: 7911.005 ms (00:07.911)
 REL_15_STABLE 
Time: 7868.483 ms (00:07.868)
 REL_16_STABLE 
Time: 7729.359 ms (00:07.729)
 master 
Time: 7615.815 ms (00:07.616)
 sqljson 
Time: 7715.652 ms (00:07.716)


Bottom line: I don't think grammar slowdown is a reason to be concerned 
about these patches.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: SSL tests fail on OpenSSL v3.2.0

2023-11-29 Thread Alvaro Herrera
On 2023-Nov-29, Tom Lane wrote:

> Kind of odd that, with that mission statement, they are adding
> BIO_{get,set}_app_data on the justification that OpenSSL has it
> and Postgres is starting to use it.  Nonetheless, that commit
> also seems to prove the point about lack of API/ABI stability.

As I understand it, this simply means that Google is already building
their own fork of Postgres, patching it to use BoringSSL.  (This makes
sense, since they offer Postgres databases in their cloud offerings.)
They don't need PGDG to support BoringSSL, but they do need to make sure
that BoringSSL is able to support being used by Postgres.

> I'm content to take their advice and not try to support BoringSSL.

That seems the right reaction.  It is not our problem.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-29 Thread Tristan Partin

On Wed Nov 29, 2023 at 10:32 AM CST, Tom Lane wrote:

Daniel Gustafsson  writes:
> On 29 Nov 2023, at 16:21, Tristan Partin  wrote:
>> Funnily enough, here[0] is BoringSSL adding the BIO_{get,set}_app_data() 
APIs.

> Still doesn't seem like a good candidate for a postgres TLS library since they
> themselves claim:
>"Although BoringSSL is an open source project, it is not intended for
> general use, as OpenSSL is.  We don't recommend that third parties depend
> upon it.  Doing so is likely to be frustrating because there are no
> guarantees of API or ABI stability."

Kind of odd that, with that mission statement, they are adding
BIO_{get,set}_app_data on the justification that OpenSSL has it
and Postgres is starting to use it.  Nonetheless, that commit
also seems to prove the point about lack of API/ABI stability.

I'm content to take their advice and not try to support BoringSSL.
It's not clear what benefit to us there would be, and we already
have our hands full coping with all the different OpenSSL and LibreSSL
versions.


Yep, I just wanted to point it out in the interest of relevancy to our 
conversation yesterday :).


--
Tristan Partin
Neon (https://neon.tech)




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-29 Thread Tom Lane
Daniel Gustafsson  writes:
> On 29 Nov 2023, at 16:21, Tristan Partin  wrote:
>> Funnily enough, here[0] is BoringSSL adding the BIO_{get,set}_app_data() 
>> APIs.

> Still doesn't seem like a good candidate for a postgres TLS library since they
> themselves claim:
>"Although BoringSSL is an open source project, it is not intended for
> general use, as OpenSSL is.  We don't recommend that third parties depend
> upon it.  Doing so is likely to be frustrating because there are no
> guarantees of API or ABI stability."

Kind of odd that, with that mission statement, they are adding
BIO_{get,set}_app_data on the justification that OpenSSL has it
and Postgres is starting to use it.  Nonetheless, that commit
also seems to prove the point about lack of API/ABI stability.

I'm content to take their advice and not try to support BoringSSL.
It's not clear what benefit to us there would be, and we already
have our hands full coping with all the different OpenSSL and LibreSSL
versions.

regards, tom lane




Re: Change GUC hashtable to use simplehash?

2023-11-29 Thread John Naylor
On Wed, Nov 29, 2023 at 9:59 PM Heikki Linnakangas  wrote:
>
> I didn't understand what you meant by the above. Did you whack around
> fast-hash, or who did?

I turned it into an init/accum/final style (shouldn't affect the
result), and took out the input length from the calculation (will
affect the result and I'll look into putting it back some other way).

> Who switched mixing/final functions; compared to
> what?

Sorry for the confusion. I didn't change those, I was speaking hypothetically.

> In any case, +1 on the implementation you had in the patch at a quick
> glance.
>
> Let's also replace the partial murmurhash implementations we have in
> hashfn.h with this. It's a very similar algorithm, and we don't need two.

Thanks for taking a look! For small fixed-sized values, it's common to
special-case a murmur-style finalizer regardless of the algorithm for
longer inputs. Syscache combines multiple hashes for multiple keys, so
it's probably worth it to avoid adding cycles there.
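For illustration, a murmur-style finalizer of the kind meant here — this is the published murmur3 fmix64 step (the function name is just for this sketch) — is only a handful of instructions, which is why special-casing it for small fixed-size keys pays off:

```c
#include <stdint.h>

/* The standard murmur3 64-bit finalizer ("fmix64"): for a small
 * fixed-size key, this bit mixer can serve as the entire hash,
 * skipping the general variable-length loop.  The multiply constants
 * are the published murmur3 ones. */
static uint64_t
fmix64(uint64_t k)
{
	k ^= k >> 33;
	k *= UINT64_C(0xff51afd7ed558ccd);
	k ^= k >> 33;
	k *= UINT64_C(0xc4ceb9fe1a85ec53);
	k ^= k >> 33;
	return k;
}
```

Since each xorshift-multiply step is invertible, the whole finalizer is a bijection on 64-bit values, so distinct fixed-size keys can never collide before truncation to the table size.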




Re: proposal: possibility to read dumped table's name from file

2023-11-29 Thread Pavel Stehule
Hi

On Wed, 29 Nov 2023 at 15:44, Daniel Gustafsson 
wrote:

> > On 22 Nov 2023, at 05:27, Erik Rijkers  wrote:
> >
> > Op 11/21/23 om 22:10 schreef Daniel Gustafsson:
> >>> On 20 Nov 2023, at 06:20, Pavel Stehule 
> wrote:
> >
> >> The attached is pretty close to a committable patch IMO, review is
> welcome on
> >> both the patch and commit message.  I tried to identify all reviewers
> over the
> >> past 3+ years but I might have missed someone.
>
> I took another look at this, found some more polish that was needed, added
> another testcase and ended up pushing it.
>
> > I've tested this, albeit mostly in the initial iterations  (*shrug* but
> a mention is nice)
>
> As I mentioned above it's easy to miss when reviewing three years worth of
> emails, no-one was intentionally left out.  I went back and looked and
> added
> you as a reviewer. Thanks for letting me know.
>

Thank you very much

Regards

Pavel


> --
> Daniel Gustafsson
>
>


Re: SSL tests fail on OpenSSL v3.2.0

2023-11-29 Thread Daniel Gustafsson
> On 29 Nov 2023, at 16:21, Tristan Partin  wrote:
> 
> On Tue Nov 28, 2023 at 9:42 AM CST, Tom Lane wrote:
>> "Tristan Partin"  writes:
>> > When you say "this" are you referring to the patch I sent or adding
>> > support for BoringSSL?
>> 
>> I have no interest in supporting BoringSSL.
> 
> Funnily enough, here[0] is BoringSSL adding the BIO_{get,set}_app_data() APIs.

Still doesn't seem like a good candidate for a postgres TLS library since they
themselves claim:

   "Although BoringSSL is an open source project, it is not intended for
general use, as OpenSSL is.  We don't recommend that third parties depend
upon it.  Doing so is likely to be frustrating because there are no
guarantees of API or ABI stability."

--
Daniel Gustafsson





Re: SSL tests fail on OpenSSL v3.2.0

2023-11-29 Thread Tristan Partin

On Tue Nov 28, 2023 at 9:42 AM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> When you say "this" are you referring to the patch I sent or adding 
> support for BoringSSL?


I have no interest in supporting BoringSSL.


Funnily enough, here[0] is BoringSSL adding the BIO_{get,set}_app_data() 
APIs.


[0]: 
https://github.com/google/boringssl/commit/2139aba2e3e28cd1cdefbd9b48e2c31a75441203

--
Tristan Partin
Neon (https://neon.tech)




Re: [PATCHES] Post-special page storage TDE support

2023-11-29 Thread David Christensen
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund  wrote:

> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of
>   essentially renaming a lot of existing macro names to *Limit,
>   necessitating modifying a lot of users. I think instead the few places
> that
>   care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do
>   things like defining an on-stack array of Datums or such - even though
> all
>   they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it makes reviewing (and I'm sure
>   maintaining) the patch very tedious.
>

Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering
how to reduce churn, and I think we still need the *Limit variants.  Let's
take a simple example:

Just looking at MaxHeapTuplesPerPage and breaking down instances in the
code, loosely partitioning them into array-index vs. other usage (this
doesn't distinguish code from comments, unfortunately), we get the
following breakdown:

$ git grep -hoE [[]?MaxHeapTuplesPerPage | sort | uniq -c
 18 [MaxHeapTuplesPerPage
 51 MaxHeapTuplesPerPage

This would be 18 places where we would need to adjust things in a fairly
mechanical fashion to use MaxHeapTuplesPerPageLimit instead of
MaxHeapTuplesPerPage, versus some significant fraction of the remaining
51 instances; even if you assumed half of those were in comments, there
would presumably need to be some sort of adjustment in verbiage, since
we are going to be changing some of the interpretation.

I am working on a patch to cleanup some of the assumptions that smgr makes
currently about its space usage and how the individual access methods
consider it, as they should only be calculating things based on how much
space is available after smgr is done with it.  That has traditionally been
BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out
into a single expression that we can now use in access methods, so we can
then reserve additional page space and not need to adjust the access
methods further.

Building on top of this patch, we'd define something like this to handle
the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace \
	(BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage \
	CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit \
	CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
	((int) ((freesize) / \
			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
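A toy, self-contained version of these definitions (with a made-up TUPLE_OVERHEAD constant standing in for MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)) illustrates the intended split: the dynamic macro shrinks as page space is reserved, while the *Limit variant stays a fixed compile-time bound:

```c
#include <stddef.h>

/* Toy stand-ins for the PostgreSQL constants; not the real values. */
#define BLCKSZ 8192
#define SizeOfPageHeaderData 24
#define TUPLE_OVERHEAD 28		/* stand-in for the per-tuple overhead */

/* Runtime-settable reservation, as in the proposal. */
static size_t reserved_page_space = 0;

#define PageUsableSpace \
	(BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define CalcMaxHeapTuplesPerPage(freesize) \
	((int) ((freesize) / TUPLE_OVERHEAD))
/* Dynamic: shrinks when page space is reserved. */
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
/* Compile-time worst case: safe bound for static array sizing. */
#define MaxHeapTuplesPerPageLimit \
	CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
```

Under this scheme an on-stack array would be declared with the constant bound, e.g. `OffsetNumber offsets[MaxHeapTuplesPerPageLimit];`, while loops and page-fitting logic use the dynamic MaxHeapTuplesPerPage.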

In my view, extensions that expect to need no changes when the
interpretation of these macros changes are better off only having to
adjust their static allocations in a mechanical sense than having to
revisit every other use in their code; this seems more likely to
guarantee a correct result than exceeding the page space and overwriting
things you shouldn't, because you're not aware that you need to check
the dynamic limits on your own.

Take another thing which would need adjusting for reserving page space,
MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
  3 [MaxHeapTupleSize
 16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs 16 other
instances. If we kept MaxHeapTupleSize interpretation the same and didn't
adjust an extension it would compile just fine, but with too large of a
length compared to the smaller PageUsableSpace, so you could conceivably
overwrite into the reserved space depending on what you were doing.

(since by definition the reserved_page_space >= 0, so PageUsableSpace will
always be <= BLCKSZ - SizeOfPageHeaderData, so any expression based on it
as a basis will be smaller).

In short, I think the approach I took originally will actually reduce
errors out-of-core, and the churn, while unfortunate, is necessary churn.

I can produce a second patch which implements this calc/limit atop this
first one as well.

Thanks,

David


0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Description: Binary data


Re: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Alexander Lakhin

Hello,

13.11.2023 20:21, Robert Haas wrote:

On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
 wrote:

Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:



I've discovered that that patch introduced a code path leading to an
uninitialized memory access.
With the following addition to hash_index.sql:
 -- Fill overflow pages by "dead" tuples.
 BEGIN;
 INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
 ROLLBACK;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;

make check -C src/test/recovery/ PROVE_TESTS="t/027*"
when executed under Valgrind, triggers:
==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
==00:00:02:30.285 97744==
(in 027_stream_regress_standby_1.log)

That is, when line
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
is reached, writebuf stays uninitialized.

Best regards,
Alexander




Re: Change GUC hashtable to use simplehash?

2023-11-29 Thread Heikki Linnakangas

On 29/11/2023 15:31, John Naylor wrote:

However, I did find a couple hash functions that are much simpler to
adapt to a bytewise interface, pass SMHasher, and are decently fast on
short inputs:

- fast-hash, MIT licensed, and apparently has some use in software [1]
- MX3, CC0 license (looking around, seems controversial for a code
license, so didn't go further). [2] Seems to be a for-fun project, but
the accompanying articles are very informative on how to develop these
things.

After whacking fast-hash around, it doesn't really resemble the
original much, and if for some reason we went as far as switching out
the mixing/final functions, it may as well be called completely
original work. I thought it best to start with something whose mixing
behavior passes SMHasher, and hopefully preserve that property.


I didn't understand what you meant by the above. Did you whack around 
fast-hash, or who did? Who switched mixing/final functions; compared to 
what? The version you have in the patch matches the implementation in 
smhasher, did you mean that the smhasher author changed it compared to 
the original?


In any case, +1 on the implementation you had in the patch at a quick 
glance.


Let's also replace the partial murmurhash implementations we have in 
hashfn.h with this. It's a very similar algorithm, and we don't need two.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Tomas Vondra
On 11/29/23 15:42, Matthias van de Meent wrote:
> On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
>  wrote:
>>
>> On 11/28/23 16:39, Matthias van de Meent wrote:
>>> On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
>>>  wrote:
>>>> On 11/23/23 13:33, Matthias van de Meent wrote:
>>>>> The union operator may leak (lots of) memory, so I think it makes
>>>>> sense to keep a context around that can be reset after we've extracted
>>>>> the merge result.
>>>>>
>>>>
>>>> But does the current code actually achieve that? It does create a "brin
>>>> union" context, but then it only does this:
>>>>
>>>> /* Use our own memory context to avoid retail pfree */
>>>> cxt = AllocSetContextCreate(CurrentMemoryContext,
>>>> "brin union",
>>>> ALLOCSET_DEFAULT_SIZES);
>>>> oldcxt = MemoryContextSwitchTo(cxt);
>>>> db = brin_deform_tuple(bdesc, b, NULL);
>>>> MemoryContextSwitchTo(oldcxt);
>>>>
>>>> Surely that does not limit the amount of memory used by the actual union
>>>> functions in any way?
>>>
>>> Oh, yes, of course. For some reason I thought that covered the calls
>>> to the union operator function too, but it indeed only covers
>>> deserialization. I do think it is still worthwhile to not do the
>>> create/delete cycle, but won't hold the patch back for that.
>>>
>>
>> I think the union_tuples() changes are better left for a separate patch.
>>
>>>>>> However, I don't think the number of union_tuples calls is likely to be
>>>>>> very high, especially for large tables. Because we split the table into
>>>>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
>>>>>> (where this matters) we're likely close to 8192.
>>>>>
>>>>> I agree that the merging part of the index creation is the last part,
>>>>> and usually has no high impact on the total performance of the reindex
>>>>> operation, but in memory-constrained environments releasing and then
>>>>> requesting the same chunk of memory over and over again just isn't
>>>>> great.
>>>>
>>>> OK, I'll take a look at the scratch context you suggested.
>>>>
>>>> My point however was we won't actually do that very often, because on
>>>> large tables the BRIN ranges are likely smaller than the parallel scan
>>>> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
>>>> ranges are large, there'll be few of them.
>>>
>>> That's true, so maybe I'm concerned about something that amounts to
>>> only marginal gains.
>>>
>>
>> However, after thinking about this a bit more, I think we actually do
>> need to do something about the memory management when merging tuples.
>> AFAIK the general assumption was that union_tuple() only runs for a
>> single range, and then the whole context gets freed.
> 
> Correct, but it also is (or should be) assumed that union_tuple
> will be called several times in the same context to fix repeat
> concurrent updates. Presumably, that only happens rarely, but it's
> something that should be kept in mind regardless.
> 

In theory, yes. But union_tuples() is used only in summarize_range(),
and that only processes a single page range.

>> But the way the
>> merging was implemented, it all runs in a single context. And while a
>> single union_tuple() may not need a lot memory, in total it may be
>> annoying. I just added a palloc(1MB) into union_tuples and ended up with
>> ~160MB allocated in the PortalContext on just 2GB table. In practice the
>> memory will grow more slowly, but not great :-/
>>
>> The attached 0003 patch adds a memory context that's reset after
>> producing a merged BRIN tuple for each page range.
> 
> Looks good.
> 
> This also made me think a bit more about how we're working with the
> tuples. With your latest patch, we always deserialize and re-serialize
> the sorted brin tuples, just in case the next tuple will also be a
> BRIN tuple of the same page range. Could we save some of that
> deserialization time by optimistically expecting that we're not going
> to need to merge the tuple and only store a local copy of it locally?
> See attached 0002; this saves some cycles in common cases.
> 

Good idea!

> The v20231128 version of the patchset (as squashed, attached v5-0001)
> looks good to me.
> 

Cool. I'll put this through a bit more stress testing, and then I'll get
it pushed.

Thanks for the reviews!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: proposal: possibility to read dumped table's name from file

2023-11-29 Thread Daniel Gustafsson
> On 22 Nov 2023, at 05:27, Erik Rijkers  wrote:
> 
> Op 11/21/23 om 22:10 schreef Daniel Gustafsson:
>>> On 20 Nov 2023, at 06:20, Pavel Stehule  wrote:
> 
>> The attached is pretty close to a committable patch IMO, review is welcome on
>> both the patch and commit message.  I tried to identify all reviewers over 
>> the
>> past 3+ years but I might have missed someone.

I took another look at this, found some more polish that was needed, added
another testcase and ended up pushing it.

> I've tested this, albeit mostly in the initial iterations  (*shrug* but a 
> mention is nice)

As I mentioned above it's easy to miss when reviewing three years worth of
emails, no-one was intentionally left out.  I went back and looked and added
you as a reviewer. Thanks for letting me know.

--
Daniel Gustafsson





Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
 wrote:
>
> On 11/28/23 16:39, Matthias van de Meent wrote:
> > On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
> >  wrote:
> >> On 11/23/23 13:33, Matthias van de Meent wrote:
> >>> The union operator may leak (lots of) memory, so I think it makes
> >>> sense to keep a context around that can be reset after we've extracted
> >>> the merge result.
> >>>
> >>
> >> But does the current code actually achieve that? It does create a "brin
> >> union" context, but then it only does this:
> >>
> >> /* Use our own memory context to avoid retail pfree */
> >> cxt = AllocSetContextCreate(CurrentMemoryContext,
> >> "brin union",
> >> ALLOCSET_DEFAULT_SIZES);
> >> oldcxt = MemoryContextSwitchTo(cxt);
> >> db = brin_deform_tuple(bdesc, b, NULL);
> >> MemoryContextSwitchTo(oldcxt);
> >>
> >> Surely that does not limit the amount of memory used by the actual union
> >> functions in any way?
> >
> > Oh, yes, of course. For some reason I thought that covered the calls
> > to the union operator function too, but it indeed only covers
> > deserialization. I do think it is still worthwhile to not do the
> > create/delete cycle, but won't hold the patch back for that.
> >
>
> I think the union_tuples() changes are better left for a separate patch.
>
> >>>> However, I don't think the number of union_tuples calls is likely to be
> >>>> very high, especially for large tables. Because we split the table into
> >>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
> >>>> (where this matters) we're likely close to 8192.
> >>>
> >>> I agree that the merging part of the index creation is the last part,
> >>> and usually has no high impact on the total performance of the reindex
> >>> operation, but in memory-constrained environments releasing and then
> >>> requesting the same chunk of memory over and over again just isn't
> >>> great.
> >>
> >> OK, I'll take a look at the scratch context you suggested.
> >>
> >> My point however was we won't actually do that very often, because on
> >> large tables the BRIN ranges are likely smaller than the parallel scan
> >> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> >> ranges are large, there'll be few of them.
> >
> > That's true, so maybe I'm concerned about something that amounts to
> > only marginal gains.
> >
>
> However, after thinking about this a bit more, I think we actually do
> need to do something about the memory management when merging tuples.
> AFAIK the general assumption was that union_tuple() only runs for a
> single range, and then the whole context gets freed.

Correct, but it also is (or should be) assumed that union_tuple
will be called several times in the same context to fix repeat
concurrent updates. Presumably, that only happens rarely, but it's
something that should be kept in mind regardless.

> But the way the
> merging was implemented, it all runs in a single context. And while a
> single union_tuple() may not need a lot memory, in total it may be
> annoying. I just added a palloc(1MB) into union_tuples and ended up with
> ~160MB allocated in the PortalContext on just 2GB table. In practice the
> memory will grow more slowly, but not great :-/
>
> The attached 0003 patch adds a memory context that's reset after
> producing a merged BRIN tuple for each page range.

Looks good.
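The pattern being applied here — a scratch context reset once per page range — can be sketched independently of the PostgreSQL APIs with a toy bump arena (in the actual patch, MemoryContextReset on a dedicated AllocSet context plays this role):

```c
#include <stddef.h>

/* Toy bump arena standing in for a scratch memory context: everything a
 * single page range's merge allocates comes from here, and one reset
 * after the merged tuple is emitted releases it all at once. */
typedef struct
{
	char	   *buf;
	size_t		used;
	size_t		cap;
} Arena;

static void *
arena_alloc(Arena *a, size_t n)
{
	void	   *p;

	if (a->used + n > a->cap)
		return NULL;			/* toy: no growth */
	p = a->buf + a->used;
	a->used += n;
	return p;
}

static void
arena_reset(Arena *a)
{
	a->used = 0;				/* all per-range scratch released in O(1) */
}
```

The point of the per-range reset is exactly what the thread describes: leaky union functions can allocate freely during one range's merge, and total memory stays bounded by the largest single range rather than growing with the number of ranges.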

This also made me think a bit more about how we're working with the
tuples. With your latest patch, we always deserialize and re-serialize
the sorted brin tuples, just in case the next tuple will also be a
BRIN tuple of the same page range. Could we save some of that
deserialization time by optimistically expecting that we're not going
to need to merge the tuple, and only store a copy of it locally?
See attached 0002; this saves some cycles in common cases.

The v20231128 version of the patchset (as squashed, attached v5-0001)
looks good to me.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)


v5-0002-Reduce-de-forming-of-BRIN-tuples-in-parallel-BRIN.patch
Description: Binary data


v5-0001-Allow-BRIN-to-build-its-index-in-parallel.patch
Description: Binary data


Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra
On 11/29/23 14:42, Amit Kapila wrote:
> On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
>  wrote:
>>
>> I have been hacking on the improvements outlined in my
>> preceding e-mail, but I have some bad news - I ran into an issue that I
>> don't know how to solve :-(
>>
>> Consider this transaction:
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART 1000;
>>
>>   SAVEPOINT s1;
>>   ALTER SEQUENCE s RESTART 2000;
>>   ROLLBACK TO s1;
>>
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>>   COMMIT;
>>
>> If you try this with the approach relying on rd_newRelfilelocatorSubid
>> and rd_createSubid, it fails like this on the subscriber:
>>
>>   ERROR:  could not map filenode "base/5/16394" to relation OID
>>
>> This happens because ReorderBufferQueueSequence tries to do this in the
>> non-transactional branch:
>>
>>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>>
>> and the relfilenode is the one created by the first ALTER. But this is
>> obviously wrong - the changes should have been treated as transactional,
>> because they are tied to the first ALTER. So how did we get there?
>>
>> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
>> resets the two fields to InvalidSubTransactionId. Which means the
>> rollback in the above transaction also forgets about the first ALTER.
>> Now that I look at the RelationData comments, it actually describes
>> exactly this situation:
>>
>>   *
>>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>>   * the most-recent relfilenumber change has survived into or zero if
>>   * not changed in the current transaction (or we have forgotten
>>   * changing it).  This field is accurate when non-zero, but it can be
>>   * zero when a relation has multiple new relfilenumbers within a
>>   * single transaction, with one of them occurring in a subsequently
>>   * aborted subtransaction, e.g.
>>   *BEGIN;
>>   *TRUNCATE t;
>>   *SAVEPOINT save;
>>   *TRUNCATE t;
>>   *ROLLBACK TO save;
>>   *-- rd_newRelfilelocatorSubid is now forgotten
>>   *
>>
>> The root of this problem is that we'd need some sort of "history" for
>> the field, so that when a subxact aborts, we can restore the previous
>> value. But we obviously don't have that, and I doubt we want to add that
>> to relcache - for example, it'd either need to impose some limit on the
>> history (and thus a failure when we reach the limit), or it'd need to
>> handle histories of arbitrary length.
>>
> 
> Yeah, I think that would be really tricky and we may not want to go there.
> 
>> At this point I don't see a solution for this, which means the best way
>> forward with the sequence decoding patch seems to be the original
>> approach, on the decoding side.
>>
> 
> One thing that worries me about that approach is that it can suck with
> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
> records. We have previously fixed some such workloads in logical
> decoding where decoding a transaction containing truncation of a table
> with a lot of partitions (1000 or more) used to take a very long time.
> Don't we face performance issues in such scenarios?
> 

I don't think we do, really. We will have to decode the SMGR records and
add the relfilenodes to the hash table(s), but I don't think that affects
the lookup performance much. What I think might be a problem is if we
have many top-level transactions, especially if those transactions do
something that creates a relfilenode. Because then we'll have to do a
hash_search for each of them, and that might be measurable even if each
lookup is O(1). And we do the lookup for every sequence change ...

> How do we see this work w.r.t. some sort of global sequences? There
> is some recent discussion where I have raised a similar point [1].
> 
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com
> 

I think those are very different things, even though called "sequences".
AFAIK solutions like snowflakeID or UUIDs don't require replication of
any shared state (that's kinda the whole point), so I don't see why
it would need any special support in logical decoding.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Table AM Interface Enhancements

2023-11-29 Thread Pavel Borisov
Hi, Nikita!

On Wed, 29 Nov 2023 at 18:27, Nikita Malakhov  wrote:

> Hi,
>
> Pavel, as far as I understand Alexander's idea, an assertion and especially
> an ereport here do not make any sense - this method is not supposed to
> report an error; it silently calls the underlying [free] function if there
> is one and simply falls through otherwise. Also, take into account that it
> could be located in an uninterruptible part of the code.
>
> On the whole topic I have to
>
> On Wed, Nov 29, 2023 at 4:56 PM Pavel Borisov 
> wrote:
>
>> Hi, Alexander!
>>
>>> I'm planning to review some of the other patches from the current
>>> patchset soon.
>>>
>>
>> I've looked into the patch 0003.
>> The patch looks to be in good shape and is uncontroversial to me. Making
>> memory structures dynamically allocated is simple enough, and it allows
>> storing complex data like lists etc. I think the places like this that
>> expect memory structures to be flat and allocated at once are that way
>> because there was no need for more complex ones previously. If there is a
>> need for them, I think they could be added without much doubt, given the
>> simplicity of the change.
>>
>> For the code:
>> +static inline void
>> +table_free_rd_amcache(Relation rel)
>> +{
>> + if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
>> + {
>> + rel->rd_tableam->free_rd_amcache(rel);
>> + }
>> + else
>> + {
>> + if (rel->rd_amcache)
>> + pfree(rel->rd_amcache);
>> + rel->rd_amcache = NULL;
>> + }
>>
>> here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
>> error report) after calling free_rd_amcache to be sure the custom
>> implementation has done what it should do.
>>
>> Also, I think some brief documentation about writing this custom method
>> is quite relevant, maybe based on the already existing comments in the code.
>>
>> Kind regards,
>> Pavel
>>
>
> When we use the default single-chunk routine, we invalidate the rd_amcache pointer:
+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;

If we delegate this to the method, my idea is to check that the method
implementation doesn't leave this pointer valid.
If it's not needed, I'm ok with it, but to me it seems that the check I
proposed makes sense.

Regards,
Pavel


Re: Table AM Interface Enhancements

2023-11-29 Thread Nikita Malakhov
Hi,

Pavel, as far as I understand Alexander's idea, an assertion and especially
an ereport here do not make any sense - this method is not supposed to
report an error; it silently calls the underlying [free] function if there
is one and simply falls through otherwise. Also, take into account that it
could be located in an uninterruptible part of the code.

On the whole topic I have to

On Wed, Nov 29, 2023 at 4:56 PM Pavel Borisov 
wrote:

> Hi, Alexander!
>
>> I'm planning to review some of the other patches from the current
>> patchset soon.
>>
>
> I've looked into the patch 0003.
> The patch looks to be in good shape and is uncontroversial to me. Making
> memory structures dynamically allocated is simple enough, and it allows
> storing complex data like lists etc. I think the places like this that
> expect memory structures to be flat and allocated at once are that way
> because there was no need for more complex ones previously. If there is a
> need for them, I think they could be added without much doubt, given the
> simplicity of the change.
>
> For the code:
> +static inline void
> +table_free_rd_amcache(Relation rel)
> +{
> + if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
> + {
> + rel->rd_tableam->free_rd_amcache(rel);
> + }
> + else
> + {
> + if (rel->rd_amcache)
> + pfree(rel->rd_amcache);
> + rel->rd_amcache = NULL;
> + }
>
> here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
> error report) after calling free_rd_amcache to be sure the custom
> implementation has done what it should do.
>
> Also, I think some brief documentation about writing this custom method is
> quite relevant, maybe based on the already existing comments in the code.
>
> Kind regards,
> Pavel
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: trying again to get incremental backup

2023-11-29 Thread Robert Haas
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
 wrote:
> so I've spent some time playing still with patchset v8 (without the
> 6/6 testing patch related to wal_level=minimal), with the exception of
> - patchset v9 - marked otherwise.

Thanks, as usual, for that.

> 2. Usability thing: I hit the timeout hard: "This backup requires WAL
> to be summarized up to 0/9D8, but summarizer has only reached
> 0/0." with summarize_wal=off (default) but apparently this in TODO.
> Looks like an important usability thing.

All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.

> 5. On v8 i've finally played a little bit with standby(s) and this
> patchset with couple of basic scenarios while mixing source of the
> backups:
>
> a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
> # sometimes i'm getting spurious error like those when doing
> incrementals on standby with -c fast :
> 2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
> at 0/A28
> 2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
> to backend exiting before pg_backup_stop was called
> 2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
> from final timeline 1 ending at 0/AF8, but this backup starts at
> 0/A28
> 2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
> INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
> CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
> # when you retry the same pg_basebackup it goes fine (looks like
> CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
> deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
> against primary just before pg_basebackup --incr on standby
> workarounds it)
>
> b. full on primary, incr1 on standby, full db restore (incl. incr1) on
> standby # WORKS
> c. full on standby, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
> d. full on primary, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
>
> * - needs pg_promote() due to the controlfile having standby bit +
> potential fiddling with postgresql.auto.conf as it is having
> primary_connstring GUC.

Well, "manifest requires WAL from final timeline 1 ending at
0/AF8, but this backup starts at 0/A28" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that I think is
what's resulting in this error, because the end of the backup follows
the start of the backup, so if two consecutive backups have the same
start, then the start of the second one will precede the end of the
first one.

One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.

Another interesting point to consider is that you could also get this
complaint by doing something like take the full backup from the
primary, and then try to take an incremental backup from a standby,
maybe even a time-delayed standby that's far behind the primary. In
that case, you would really be trying to take an incremental backup
before you actually took the full backup, as far as LSN time goes.

I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly 

Re: Table AM Interface Enhancements

2023-11-29 Thread Pavel Borisov
Hi, Alexander!

> I'm planning to review some of the other patches from the current patchset
> soon.
>

I've looked into the patch 0003.
The patch looks to be in good shape and is uncontroversial to me. Making
memory structures dynamically allocated is simple enough, and it allows
storing complex data like lists etc. I think the places like this that
expect memory structures to be flat and allocated at once are that way
because there was no need for more complex ones previously. If there is a
need for them, I think they could be added without much doubt, given the
simplicity of the change.

For the code:
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+ {
+ rel->rd_tableam->free_rd_amcache(rel);
+ }
+ else
+ {
+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }

here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
error report) after calling free_rd_amcache to be sure the custom
implementation has done what it should do.

Also, I think some brief documentation about writing this custom method is
quite relevant, maybe based on the already existing comments in the code.

Kind regards,
Pavel


Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Amit Kapila
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
 wrote:
>
> I have been hacking on implementing the improvements outlined in my
> preceding e-mail, but I have some bad news - I ran into an issue that I
> don't know how to solve :-(
>
> Consider this transaction:
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART 1000;
>
>   SAVEPOINT s1;
>   ALTER SEQUENCE s RESTART 2000;
>   ROLLBACK TO s1;
>
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>   COMMIT;
>
> If you try this with the approach relying on rd_newRelfilelocatorSubid
> and rd_createSubid, it fails like this on the subscriber:
>
>   ERROR:  could not map filenode "base/5/16394" to relation OID
>
> This happens because ReorderBufferQueueSequence tries to do this in the
> non-transactional branch:
>
>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>
> and the relfilenode is the one created by the first ALTER. But this is
> obviously wrong - the changes should have been treated as transactional,
> because they are tied to the first ALTER. So how did we get there?
>
> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
> resets the two fields to InvalidSubTransactionId. Which means the
> rollback in the above transaction also forgets about the first ALTER.
> Now that I look at the RelationData comments, it actually describes
> exactly this situation:
>
>   *
>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>   * the most-recent relfilenumber change has survived into or zero if
>   * not changed in the current transaction (or we have forgotten
>   * changing it).  This field is accurate when non-zero, but it can be
>   * zero when a relation has multiple new relfilenumbers within a
>   * single transaction, with one of them occurring in a subsequently
>   * aborted subtransaction, e.g.
>   *    BEGIN;
>   *    TRUNCATE t;
>   *    SAVEPOINT save;
>   *    TRUNCATE t;
>   *    ROLLBACK TO save;
>   *    -- rd_newRelfilelocatorSubid is now forgotten
>   *
>
> The root of this problem is that we'd need some sort of "history" for
> the field, so that when a subxact aborts, we can restore the previous
> value. But we obviously don't have that, and I doubt we want to add that
> to relcache - for example, it'd either need to impose some limit on the
> history (and thus a failure when we reach the limit), or it'd need to
> handle histories of arbitrary length.
>

Yeah, I think that would be really tricky and we may not want to go there.

> At this point I don't see a solution for this, which means the best way
> forward with the sequence decoding patch seems to be the original
> approach, on the decoding side.
>

One thing that worries me about that approach is that it can suck with
the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
records. We have previously fixed some such workloads in logical
decoding where decoding a transaction containing truncation of a table
with a lot of partitions (1000 or more) used to take a very long time.
Don't we face performance issues in such scenarios?

How do we see this work w.r.t. some sort of global sequences? There
is some recent discussion where I have raised a similar point [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Python installation selection in Meson

2023-11-29 Thread Peter Eisentraut

On 29.11.23 14:23, Andrew Dunstan wrote:


On 2023-11-28 Tu 13:02, Peter Eisentraut wrote:
I noticed that under meson, the selection of the Python installation 
using the 'PYTHON' option doesn't work completely.  The 'PYTHON' 
option determines the Python binary that will be used to call the 
various build support programs.  But it doesn't affect the Python 
installation used for PL/Python.  For that, we need to pass the 
program determined by the 'PYTHON' option back into the 
find_installation() routine of the python module.  (Otherwise, 
find_installation() will just search for an installation on its own.) 
See attached patch.  I ran this through Cirrus, seems to work.


I noticed when working on the meson/windows stuff that meson would try 
to build plpython against its python installation, which failed 
miserably. The workaround was to abandon separate meson/ninja 
installations via chocolatey, and instead install them using pip. Maybe 
this was as a result of the above problem?


That sounds like it could be the case.





Re: Change GUC hashtable to use simplehash?

2023-11-29 Thread John Naylor
Attached is a rough start with Andres's earlier ideas, to get
something concrete out there.

I took a look around at other implementations a bit. Many modern hash
functions use MUM-style hashing, which typically uses 128-bit
arithmetic. Even if they already have an incremental interface and
have a compatible license, it seems a bit too much work to adopt just
for a couple of string use cases. It might be useful elsewhere, but
that's off topic.

However, I did find a couple hash functions that are much simpler to
adapt to a bytewise interface, pass SMHasher, and are decently fast on
short inputs:

- fast-hash, MIT licensed, and apparently has some use in software [1]
- MX3, CC0 license (looking around, seems controversial for a code
license, so didn't go further). [2] Seems to be a for-fun project, but
the accompanying articles are very informative on how to develop these
things.

After whacking fast-hash around, it doesn't really resemble the
original much, and if for some reason we went as far as switching out
the mixing/final functions, it may as well be called completely
original work. I thought it best to start with something whose mixing
behavior passes SMHasher, and hopefully preserve that property.

Note that the combining and final steps share most of their arithmetic
operations. This may have been done on purpose to minimize binary
size, but I didn't check. Also, it incorporates input length into the
calculation. Since we don't know the length of C strings up front, I
threw that out for now. It'd be possible to track the length as we go
and incorporate something into the final step. The hard part is
verifying it hasn't lost any quality.

v5-0001 puts fast-hash as-is into a new header, named in a way to
convey in-memory use e.g. hash tables.

v5-0002 does the minimum needed to allow dynahash to use this for string_hash,
inlined but still calling strlen.

v5-0003 shows one way to do an incremental interface. It might be okay
for simplehash with fixed length keys, but seems awkward for strings.

v5-0004 shows a bytewise incremental interface, with implementations
for dynahash (getting rid of strlen) and guc hash.

[1] https://code.google.com/archive/p/fast-hash/
[2] https://github.com/jonmaiga/mx3
From c2b799dd2418fb68fcfc6ccf006a50f74c9072fe Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 29 Nov 2023 16:28:58 +0700
Subject: [PATCH v5 4/4] Add bytewise interface for incrementing the hash state

This is the nicest interface for guc_name_hash, and seems
good for string_hash too. It's not clear if this is good for
the latter from a performance perspective.
---
 src/backend/utils/hash/dynahash.c| 14 +++
 src/backend/utils/misc/guc.c | 16 ++---
 src/include/common/hashfn_unstable.h | 35 
 3 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1c08dc8942..ab2dbefd12 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -328,19 +328,11 @@ string_hash(const void *key, Size keysize)
 
 	fasthash64_init(, 0);
 
-	while (*buf)
+	while (*buf && s_len < keysize)
 	{
-		int chunk_len = 0;
+		s_len++;
 
-		for (int i = 0;
-			i < 8 && *buf++ && s_len < keysize;
-			i++)
-		{
-			chunk_len++;
-			s_len++;
-		}
-
-		fasthash64_accum(, buf, chunk_len);
+		fasthash64_accum_byte(, *buf);
 	}
 
 	return fasthash64_final32();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 82d8efbc96..2428f2475c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_parameter_acl.h"
+#include "common/hashfn_unstable.h"
 #include "guc_internal.h"
 #include "libpq/pqformat.h"
 #include "parser/scansup.h"
@@ -1324,22 +1325,21 @@ guc_name_compare(const char *namea, const char *nameb)
 static uint32
 guc_name_hash(const void *key, Size keysize)
 {
-	uint32		result = 0;
 	const char *name = *(const char *const *) key;
+	fasthash64_state hs;
+
+	fasthash64_init(, 0);
 
 	while (*name)
 	{
 		char		ch = *name++;
 
-		/* Case-fold in the same way as guc_name_compare */
-		if (ch >= 'A' && ch <= 'Z')
-			ch += 'a' - 'A';
+		/* quick and dirty case-folding suitable for hashing */
+		ch |= 0x20;
 
-		/* Merge into hash ... not very bright, but it needn't be */
-		result = pg_rotate_left32(result, 5);
-		result ^= (uint32) ch;
+		fasthash64_accum_byte(, ch);
 	}
-	return result;
+	return fasthash64_final32();
 }
 
 /*
diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index a95942f7af..1b7db5ac07 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -26,6 +26,7 @@
 typedef struct fasthash64_state
 {
 	uint64 accum;
+	int8	accum_len;
 	uint64 hash;
 } fasthash64_state;
 
@@ -46,6 +47,31 @@ void fasthash64_init(fasthash64_state *hs, uint64_t 

Re: Python installation selection in Meson

2023-11-29 Thread Andrew Dunstan



On 2023-11-28 Tu 13:02, Peter Eisentraut wrote:
I noticed that under meson, the selection of the Python installation 
using the 'PYTHON' option doesn't work completely.  The 'PYTHON' 
option determined the Python binary that will be used to call the 
various build support programs.  But it doesn't affect the Python 
installation used for PL/Python.  For that, we need to pass the 
program determined by the 'PYTHON' option back into the 
find_installation() routine of the python module.  (Otherwise, 
find_installation() will just search for an installation on its own.)  
See attached patch.  I ran this through Cirrus, seems to work.



I noticed when working on the meson/windows stuff that meson would try 
to build plpython against its python installation, which failed 
miserably. The workaround was to abandon separate meson/ninja 
installations via chocolatey, and instead install them using pip. Maybe 
this was as a result of the above problem?



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra



On 11/27/23 23:06, Peter Smith wrote:
> FWIW, here are some more minor review comments for v20231127-3-0001
> 
> ==
> doc/src/sgml/logicaldecoding.sgml
> 
> 1.
> +  The txn parameter contains meta information 
> about
> +  the transaction the sequence change is part of. Note however that for
> +  non-transactional updates, the transaction may be NULL, depending on
> +  if the transaction already has an XID assigned.
> +  The sequence_lsn has the WAL location of the
> +  sequence update. transactional says if the
> +  sequence has to be replayed as part of the transaction or directly.
> 
> /says if/specifies whether/
> 

Will fix.

> ==
> src/backend/commands/sequence.c
> 
> 2. DecodeSeqTuple
> 
> + memcpy(((char *) tuple->tuple.t_data),
> +data + sizeof(xl_seq_rec),
> +SizeofHeapTupleHeader);
> +
> + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
> +data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
> +datalen);
> 
> Maybe I am misreading but isn't this just copying 2 contiguous pieces
> of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
> achieve the same?
> 

You're right, will fix. I think the code looked different before, got
simplified, and I didn't notice this could be a single memcpy().

> ==
> .../replication/logical/reorderbuffer.c
> 
> 3.
> + *   To decide if a sequence change is transactional, we maintain a hash
> + *   table of relfilenodes created in each (sub)transactions, along with
> + *   the XID of the (sub)transaction that created the relfilenode. The
> + *   entries from substransactions are copied to the top-level transaction
> + *   to make checks cheaper. The hash table gets cleaned up when the
> + *   transaction completes (commit/abort).
> 
> /substransactions/subtransactions/
> 

Will fix.

> ~~~
> 
> 4.
> + * A naive approach would be to just loop through all transactions and check
> + * each of them, but there may be (easily thousands) of subtransactions, and
> + * the check happens for each sequence change. So this could be very costly.
> 
> /may be (easily thousands) of/may be (easily thousands of)/
> 
> ~~~

Thanks. I've reworded this to

  ... may be many (easily thousands of) subtransactions ...

> 
> 5. ReorderBufferSequenceCleanup
> 
> + while ((ent = (ReorderBufferSequenceEnt *)
> hash_seq_search(_status)) != NULL)
> + {
> + (void) hash_search(txn->toptxn->sequences_hash,
> +(void *) >rlocator,
> +HASH_REMOVE, NULL);
> + }
> 
> Typically, other HASH_REMOVE code I saw would check result for NULL to
> give elog(ERROR, "hash table corrupted");
> 

Good point, I'll add the error check

> ~~~
> 
> 6. ReorderBufferQueueSequence
> 
> + if (xid != InvalidTransactionId)
> + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> 
> How about using the macro: TransactionIdIsValid
> 

Actually, as I wrote in some other message, I think the check is not
necessary. Or rather, it should be an assert that the XID is valid. And
yeah, the macro is a good idea.

> ~~~
> 
> 7. ReorderBufferQueueSequence
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(rlocator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

I chose to keep this consistent with other places in reorderbuffer, and
all of them use the equality check.

> ~~~
> 
> 8.
> + /*
> + * Calculate the first value of the next batch (at which point we
> + * generate and decode another WAL record.
> + */
> 
> Missing ')'
> 

Will fix.

> ~~~
> 
> 9. ReorderBufferAddRelFileLocator
> 
> + /*
> + * We only care about sequence relfilenodes for now, and those always have
> + * a XID. So if there's no XID, don't bother adding them to the hash.
> + */
> + if (xid == InvalidTransactionId)
> + return;
> 
> How about using the macro: TransactionIdIsValid
> 

Will change.

> ~~~
> 
> 10. ReorderBufferProcessTXN
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(change->data.sequence.locator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

Same as the other Oid check - consistency.

> ~~~
> 
> 11. ReorderBufferChangeSize
> 
> + if (tup)
> + {
> + sz += sizeof(HeapTupleData);
> + len = tup->tuple.t_len;
> + sz += len;
> + }
> 
> Why is the 'sz' increment split into 2 parts?
> 

Because the other branches in ReorderBufferChangeSize do it that way.
You're right, it could be coded on a single line.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extending SMgrRelation lifetimes

2023-11-29 Thread Heikki Linnakangas
I spent some more time digging into this, experimenting with different 
approaches. Came up with pretty significant changes; see below:


On 18/09/2023 18:19, Robert Haas wrote:

I think that if you believe 0001 to be correct you should go ahead and
commit it sooner rather than later. If you're wrong and something
weird starts happening we'll then have a chance to notice that before
too much other stuff gets changed on top of this and confuses the
matter.


+1


On Wed, Aug 23, 2023 at 12:55 AM Thomas Munro  wrote:

Right, let's one find one good place.  I think smgropen() would be best.


I think it would be a good idea to give this comment a bit more oomph.
In lieu of this:

+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transactional backends.

I would leave the existing "This does not attempt to actually open the
underlying files." comment as a separate comment, and add something
like this as a new paragraph:

In versions of PostgreSQL prior to 17, this function returned an
object with no defined lifetime. Now, however, the object remains
valid for the lifetime of the transaction, up to the point where
AtEOXact_SMgr() is called, making it much easier for callers to know
for how long they can hold on to a pointer to the returned object. If
this function is called outside of a transaction, the object remains
valid until smgrdestroy() or smgrdestroyall() is called. Background
processes that use smgr but not transactions typically do this once
per checkpoint cycle.


+1


Apart from that, the main thing that is bothering me is that the
justification for redefining smgrclose() to do what smgrrelease() now
does isn't really spelled out anywhere. You mentioned some reasons and
Heikki mentioned the benefit to extensions, but I feel like somebody
should be able to understand the reasoning clearly from reading the
commit message or the comments in the patch, rather than having to
visit the mailing list discussion, and I'm not sure we're there yet. I
feel like I understood why we were doing this and was convinced it was
a good idea at some point, but now the reasoning has gone out of my
head and I can't recover it. If somebody does smgropen() .. stuff ...
smgrclose() as in heapam_relation_copy_data() or index_copy_data(),
this change has the effect of making the SmgrRelation remain valid
until eoxact whereas before it would have been destroyed instantly. Is
that what we want? Presumably yes, or this patch wouldn't be shaped
like this, but I don't know why we want that...


Fair. I tried to address that by adding an overview comment at the top of 
smgr.c, explaining how this stuff works. I hope that helps.



Another thing that seems a bit puzzling is how this is intended to, or
does, interact with the ownership mechanism. Intuitively, I feel like
a Relation owns an SMgrRelation *because* the SMgrRelation has no
defined lifetime. But I think that's not quite right. I guess the
ownership mechanism doesn't guarantee anything about the lifetime of
the object, just that the pointer in the Relation won't hang around
any longer than the object to which it's pointing. So does that mean
that we're free to redefine the object lifetime to be pretty much
anything we want and that mechanism doesn't really care? Huh,
interesting.


Yeah that owner mechanism is weird. It guarantees that the pointer to 
the SMgrRelation is cleared when the SMgrRelation is destroyed. But it 
also prevents the SMgrRelation from being destroyed at end of 
transaction. That's how it is in 'master' too.


But with this patch, we don't normally call smgrdestroy() on an 
SMgrRelation that came from the relation cache. We do call 
smgrdestroyall() in the aux processes, but they don't have a relcache. 
So the real effect of setting the owner now is to prevent the 
SMgrRelation from being destroyed at end of transaction; the mechanism 
of clearing the pointer is unused.


I found two exceptions to that, though, by adding some extra assertions 
and running the regression tests:


1. The smgrdestroyall() in a single-user backend in RequestCheckpoint(). 
It destroys SMgrRelations belonging to relcache entries, and the owner 
mechanism clears the pointers from the relcache. I think smgrcloseall(), 
or doing nothing, would actually be more appropriate there.


2. A funny case with foreign tables: ANALYZE on a foreign table calls 
visibilitymap_count(). A foreign table has no visibility map so it 
returns 0, but before doing so it calls RelationGetSmgr on the foreign 
table, which has 0/0/0 rellocator. That creates an SMgrRelation for 
0/0/0, and sets the foreign table's relcache entry as its owner. If you 
then call ANALYZE on another foreign table, it also calls 
RelationGetSmgr with 0/0/0 rellocator, returning the same SMgrRelation 
entry, and changes its owner to the new relcache entry. That doesn't 
make much sense and 

Re: remaining sql/json patches

2023-11-29 Thread Andrew Dunstan



On 2023-11-28 Tu 21:10, Andres Freund wrote:

Hi,

On 2023-11-28 20:58:41 -0500, Andrew Dunstan wrote:

On 2023-11-28 Tu 19:32, Tom Lane wrote:

Andrew Dunstan  writes:
So I'm now a bit baffled.  Can you provide more color on what
your test setup is?


*sigh* yes, you're right. I inadvertently used a setup that used meson for
building REL16_STABLE and HEAD. When I switch it to autoconf I get results
that are similar to the earlier branches:


 REL_16_STABLE 
Time: 3401.625 ms (00:03.402)
 HEAD 
Time: 3419.088 ms (00:03.419)


It's not clear to me why that should be. I didn't have assertions enabled
anywhere. It's the same version of bison, same compiler throughout. Maybe
meson sets a higher level of optimization? It shouldn't really matter, ISTM.

Is it possible that you have CFLAGS set in your environment? For reasons that
I find very debatable, configure.ac only adds -O2 when CFLAGS is not set:

# C[XX]FLAGS are selected so:
# If the user specifies something in the environment, that is used.
# else:  If the template file set something, that is used.
# else:  If coverage was enabled, don't set anything.
# else:  If the compiler is GCC, then we use -O2.
# else:  If the compiler is something else, then we use -O, unless debugging.

if test "$ac_env_CFLAGS_set" = set; then
   CFLAGS=$ac_env_CFLAGS_value
elif test "${CFLAGS+set}" = set; then
   : # (keep what template set)
elif test "$enable_coverage" = yes; then
   : # no optimization by default
elif test "$GCC" = yes; then
   CFLAGS="-O2"
else
   # if the user selected debug mode, don't use -O
   if test "$enable_debug" != yes; then
 CFLAGS="-O"
   fi
fi

So if you have CFLAGS set in the environment, we'll not add -O2 to the
compilation flags.

I'd check what the actual flags are when building some .o file.
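To make the quoted branch logic concrete, here is a minimal standalone sketch; pick_cflags and the GCC variable are illustrative stand-ins for what configure.ac does, not its actual code (the template and coverage branches are omitted):

```shell
# Simplified sketch of configure's CFLAGS selection: a CFLAGS set in the
# environment wins outright (no -O2 is added); otherwise GCC builds get -O2.
pick_cflags() {
    if [ "${CFLAGS+set}" = set ]; then
        printf '%s\n' "$CFLAGS"
    elif [ "$GCC" = yes ]; then
        printf '%s\n' "-O2"
    else
        printf '%s\n' "-O"
    fi
}

GCC=yes
CFLAGS="-g -march=native"
with_env=$(pick_cflags)      # environment CFLAGS wins verbatim
unset CFLAGS
without_env=$(pick_cflags)   # falls through to -O2
echo "with CFLAGS set:   $with_env"
echo "with CFLAGS unset: $without_env"
```

This is why an exported CFLAGS quietly produces an unoptimized autoconf build even when nothing was passed to configure explicitly.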



I do have a CFLAGS setting, but for meson I used '-Ddebug=true' and no 
buildtype or optimization setting. However, I see that in meson.build 
we're defaulting to "buildtype=debugoptimized" as opposed to the 
standard meson "buildtype=debug", so I guess that accounts for it.
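For reference, a small sketch of meson's documented buildtype-to-flags mapping (the function name is made up for illustration; the values are meson's stock defaults, and PostgreSQL's meson.build overrides the default buildtype to debugoptimized):

```shell
# Map a meson buildtype to the compiler flags it implies by default.
buildtype_flags() {
    case "$1" in
        plain)          echo "" ;;
        debug)          echo "-g" ;;          # meson's own default
        debugoptimized) echo "-g -O2" ;;      # PostgreSQL's default
        release)        echo "-O3" ;;
        minsize)        echo "-Os -g" ;;
        *)              echo "unknown"; return 1 ;;
    esac
}
buildtype_flags debug
buildtype_flags debugoptimized
```

So a meson build configured with only -Ddebug=true still gets -O2, while the autoconf build with CFLAGS in the environment does not, which accounts for the timing difference.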


Still getting used to this stuff.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-29 Thread Dilip Kumar
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-29, tender wang wrote:
>
> > The v8-0001 patch failed to apply in my local repo as below:
> >
> > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> > error: patch failed: src/backend/access/transam/multixact.c:1851
> > error: src/backend/access/transam/multixact.c: patch does not apply
> > error: patch failed: src/backend/access/transam/subtrans.c:184
> > error: src/backend/access/transam/subtrans.c: patch does not apply
> > error: patch failed: src/backend/commands/async.c:117
> > error: src/backend/commands/async.c: patch does not apply
> > error: patch failed: src/backend/storage/lmgr/predicate.c:808
> > error: src/backend/storage/lmgr/predicate.c: patch does not apply
> > error: patch failed: src/include/commands/async.h:15
> > error: src/include/commands/async.h: patch does not apply
>
> Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.

I will send a rebased version by tomorrow.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: BackgroundPsql's set_query_timer_restart() may not work

2023-11-29 Thread Bharath Rupireddy
On Wed, Nov 29, 2023 at 1:49 PM Masahiko Sawada  wrote:
>
> On Wed, Nov 29, 2023 at 4:30 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Nov 28, 2023 at 12:23 PM Tom Lane  wrote:
> > >
> > > Bharath Rupireddy  writes:
> > > > A nitpick on the patch - how about honoring the passed-in parameter
> > > > with something like $self->{query_timer_restart} = 1 if !defined
> > > > $self->{query_timer_restart}; instead of just setting it to 1 (a value
> > > > other than undef) $self->{query_timer_restart} = 1;?
> > >
> > > I wondered about that too, but the evidence of existing callers is
> > > that nobody cares.  If we did make the code do something like that,
> > > (a) I don't think your fragment is right, and (b) we'd need to rewrite
> > > the function's comment to explain it.  I'm not seeing a reason to
> > > think it's worth spending effort on.
>
> Agreed.
>
> > Hm. I don't mind doing just the $self->{query_timer_restart} = 1; like
> > in Sawada-san's patch.
>
> Okay, I've attached the patch that I'm going to push through v16,
> barring any objections.

How about the commit message summary 'Fix TAP function
set_query_timer_restart() issue without argument'? Also, it would be
good to mention commit 664d7575, which introduced the TAP function, in
the commit message description.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-11-29 Thread Drouvot, Bertrand

Hi,

On 11/29/23 3:58 AM, Amit Kapila wrote:

On Tue, Nov 28, 2023 at 2:17 PM Drouvot, Bertrand
 wrote:


What do you think about also adding a pg_alter_logical_replication_slot() or 
some such function?

That would allow users to alter manually created logical replication slots 
without
the need to make a replication connection.



But then won't that make it inconsistent with the subscription
failover state? 


Do you mean allowing one to use pg_alter_logical_replication_slot() on a slot
linked to a subscription? (even though we say, and document, that such slots 
should not be altered?)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-29 Thread Alvaro Herrera
On 2023-Nov-29, tender wang wrote:

> The v8-0001 patch failed to apply in my local repo as below:
> 
> git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> error: patch failed: src/backend/access/transam/multixact.c:1851
> error: src/backend/access/transam/multixact.c: patch does not apply
> error: patch failed: src/backend/access/transam/subtrans.c:184
> error: src/backend/access/transam/subtrans.c: patch does not apply
> error: patch failed: src/backend/commands/async.c:117
> error: src/backend/commands/async.c: patch does not apply
> error: patch failed: src/backend/storage/lmgr/predicate.c:808
> error: src/backend/storage/lmgr/predicate.c: patch does not apply
> error: patch failed: src/include/commands/async.h:15
> error: src/include/commands/async.h: patch does not apply

Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.



