Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-04-16 Thread Michael Paquier
On Tue, Apr 16, 2024 at 02:19:56PM +0900, Michael Paquier wrote:
> Actually, I've come up with an idea just after hitting the send
> button: let's use an extra ALTER TABLE SET ACCESS METHOD rather than
> rely on the GUC to set the AM of the partitioned table correctly.
> This extra command should be optional, depending on
> --no-table-access-method.  If a partitioned table has 0 as relam,
> let's not add this extra ALTER TABLE at all.

I have explored this idea, and while it is tempting it faces a
couple of challenges:
1) Binary upgrades would fail because the table rewrite created by
ALTER TABLE SET ACCESS METHOD for relkinds with physical storage
expects heap_create_with_catalog to use a fixed OID, but the rewrite
would require extra steps to be able to handle that, and I am not
convinced that adding more binary_upgrade_set_next_heap_relfilenode()
calls is a good idea.
2) We could limit these extra ALTER TABLE commands to be generated for
partitioned tables.  This is kind of confusing as resulting dumps
would mix SET commands for default_table_access_method that would
affect tables with physical storage, while partitioned tables would
have their own extra ALTER TABLE commands.  Another issue that needs
more consideration is that TocEntrys don't hold any relkind
information so pg_backup_archiver.c cannot make a difference with
tables and partitioned tables to select if SET or ALTER TABLE should
be generated.

Several designs are possible, like:
- Mix SET and ALTER TABLE commands in the dumps to set the AM, SET for
tables and matviews, ALTER TABLE for relations without storage.  This
would bypass the binary upgrade problem with the fixed relid.
- Use only SET, requiring a new "default" value for
default_table_access_method that would force a partitioned table's
relam to be 0.  Be stricter with the "current" table AM tracked in
pg_dump's backup archiver.
- Use only ALTER TABLE commands, with extra binary upgrade tweaks to
force relation OIDs for the second heap_create_with_catalog() done
with the rewrite to update a relation's AM.

With all that in mind, it may be better to revert 374c7a229042 and
e2395cdbe83a from HEAD and reconsider how to tackle the dump issues in
v18 or newer versions, as all of the approaches I can think of lead to
complications of their own.

Please find attached an unpolished POC that switches dumps to use
ALTER TABLE, which I've used to detect the upgrade problems.

Thoughts or comments are welcome.
--
Michael
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index c7a6c918a6..93df260986 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -61,7 +61,7 @@ static void _becomeUser(ArchiveHandle *AH, const char *user);
 static void _becomeOwner(ArchiveHandle *AH, TocEntry *te);
 static void _selectOutputSchema(ArchiveHandle *AH, const char *schemaName);
 static void _selectTablespace(ArchiveHandle *AH, const char *tablespace);
-static void _selectTableAccessMethod(ArchiveHandle *AH, const char *tableam);
+static void _selectTableAccessMethod(ArchiveHandle *AH, TocEntry *te);
 static void processEncodingEntry(ArchiveHandle *AH, TocEntry *te);
 static void processStdStringsEntry(ArchiveHandle *AH, TocEntry *te);
 static void processSearchPathEntry(ArchiveHandle *AH, TocEntry *te);
@@ -2373,7 +2373,6 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	AH->currUser = NULL;		/* unknown */
 	AH->currSchema = NULL;		/* ditto */
 	AH->currTablespace = NULL;	/* ditto */
-	AH->currTableAm = NULL;		/* ditto */
 
 	AH->toc = (TocEntry *) pg_malloc0(sizeof(TocEntry));
 
@@ -3356,9 +3355,6 @@ _reconnectToDB(ArchiveHandle *AH, const char *dbname)
 	free(AH->currSchema);
 	AH->currSchema = NULL;
 
-	free(AH->currTableAm);
-	AH->currTableAm = NULL;
-
 	free(AH->currTablespace);
 	AH->currTablespace = NULL;
 
@@ -3519,31 +3515,32 @@ _selectTablespace(ArchiveHandle *AH, const char *tablespace)
 }
 
 /*
- * Set the proper default_table_access_method value for the table.
+ * Set the proper default table access method for the table.
  */
 static void
-_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
+_selectTableAccessMethod(ArchiveHandle *AH, TocEntry *te)
 {
 	RestoreOptions *ropt = AH->public.ropt;
+	const char *tableam = te->tableam;
 	PQExpBuffer cmd;
-	const char *want,
-			   *have;
 
 	/* do nothing in --no-table-access-method mode */
 	if (ropt->noTableAm)
 		return;
 
-	have = AH->currTableAm;
-	want = tableam;
-
-	if (!want)
-		return;
-
-	if (have && strcmp(want, have) == 0)
+	if (!tableam)
 		return;
 
 	cmd = createPQExpBuffer();
-	appendPQExpBuffer(cmd, "SET default_table_access_method = %s;", fmtId(want));
+
+	appendPQExpBufferStr(cmd, "ALTER ");
+	if (strcmp(te->desc, "MATERIALIZED VIEW") == 0)
+		appendPQExpBufferStr(cmd, "MATERIALIZED VIEW ");
+	else
+		appendPQExpBufferStr(cmd, "TABLE ");
+	appendPQExpBuffer(cmd, "%s ", fmtQualifiedId(te->namespace, te->tag));
+	appendPQExpBuffer(cmd, 

Re: Extract numeric field in JSONB more effectively

2024-04-16 Thread Andy Fan

Andy Fan  writes:

> Here is the latest version; nothing changed besides the rebase to the
> latest master. The most recent 3 questions should be addressed:
>
> - The error message compatibility issue [1] and Peter's answer at [2].
> - Peter's new question at [2] and my answer at [3].
>
> Any effort to move this patch ahead is welcome and thanks to all the
> people who provided high quality feedback so far, especially Chapman!
>
> [1] https://www.postgresql.org/message-id/87r0hmvuvr@163.com
> [2]
> https://www.postgresql.org/message-id/8102ff5b-b156-409e-a48f-e53e63a39b36%40eisentraut.org
> [3] https://www.postgresql.org/message-id/8734t6c5rh.fsf%40163.com

rebase to the latest master again.

commit bc990b983136ef658cd3be03cdb07f2eadc4cd5c (HEAD -> jsonb_numeric)
Author: yizhi.fzh 
Date:   Mon Apr 1 09:36:08 2024 +0800

Improve the performance of Jsonb numeric/bool extraction.

JSONB uses a binary-compatible numeric format with the numeric data
type in SQL. However, in the past, extracting a numeric value from a
JSONB field still needed to find the corresponding JsonbValue first,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to a Numeric data type. This approach was very
inefficient in terms of performance.

With the current patch, the JsonbValue is converted to the numeric data
type directly.  This is done by a planner support function which
detects the above case and simplifies the expression.  Because the
boolean type and numeric type share certain similarities in their
attributes, we have implemented the same optimization approach for
both.  In the ideal test case, the performance can be 2x better than
before.

The optimized functions and operators include:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
5. jsonb_path_query_first

-- 
Best Regards
Andy Fan

>From bc990b983136ef658cd3be03cdb07f2eadc4cd5c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 1 Apr 2024 09:36:08 +0800
Subject: [PATCH v18 1/1] Improve the performance of Jsonb numeric/bool
 extraction.

JSONB uses a binary-compatible numeric format with the numeric data
type in SQL. However, in the past, extracting a numeric value from a
JSONB field still needed to find the corresponding JsonbValue first,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to a Numeric data type. This approach was very
inefficient in terms of performance.

With the current patch, the JsonbValue is converted to the numeric data
type directly.  This is done by a planner support function which
detects the above case and simplifies the expression.  Because the
boolean type and numeric type share certain similarities in their
attributes, we have implemented the same optimization approach for
both.  In the ideal test case, the performance can be 2x better than
before.

The optimized functions and operators include:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
5. jsonb_path_query_first
---
 src/backend/utils/adt/jsonb.c | 206 ++
 src/backend/utils/adt/jsonbsubs.c |   4 +-
 src/backend/utils/adt/jsonfuncs.c | 123 ++-
 src/backend/utils/adt/jsonpath_exec.c |  32 +++-
 src/include/catalog/pg_proc.dat   |  46 +-
 src/include/utils/jsonb.h |  11 +-
 src/test/regress/expected/jsonb.out   | 112 +-
 src/test/regress/sql/jsonb.sql|  66 -
 src/tools/pgindent/typedefs.list  |   1 +
 9 files changed, 542 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index e4562b3c6c..e05b5b35f1 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -16,9 +16,15 @@
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "nodes/makefuncs.h"
+#include "nodes/supportnodes.h"
+#include "parser/parse_coerce.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/date.h"
+#include "utils/datetime.h"
+#include "utils/fmgroids.h"
 #include "utils/json.h"
 #include "utils/jsonb.h"
 #include "utils/jsonfuncs.h"
@@ -2034,6 +2040,206 @@ cannotCastJsonbValue(enum jbvType type, const char *sqltype)
 	elog(ERROR, "unknown jsonb type: %d", (int) type);
 }
 
+
+/*
+ * jsonb_cast_support()
+ *
+ * Planner support function for extracting numeric or bool data type more
+ * effectively. After finding out the corresponding JsonbValue, instead of
+ * casting it to Jsonb as an intermediate type, we convert it to the desired
+ * data type directly.
+ */
+Datum
+jsonb_cast_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestSimplify))
+	{
+		SupportRequestSimplify *req = 

Solaris tar issues, or other reason why margay fails 010_pg_basebackup?

2024-04-16 Thread Thomas Munro
Hi,

I noticed that margay (Solaris) has started running more of the tests
lately, but is failing in pg_basebackup/010_pg_basebackup.  It runs
successfully on wrasse (in older branches; Solaris 11.3 is desupported
in 17/master), and also on pollock (illumos, forked from the common
ancestor Solaris 10 while it was open source).

Hmm, wrasse is using "/opt/csw/bin/gtar xf ..." and pollock is using
"/usr/gnu/bin/tar xf ...", while margay is using "/usr/bin/tar xf
...".  The tar command is indicating success (it's run by
system_or_bail and it's not bailing), but the replica doesn't want to
come up:

pg_ctl: directory
"/home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata"
is not a database cluster directory"

So one idea would be that our tar format is incompatible with Sun tar
in some way that corrupts the output, or there is still some difference
in the nesting of the directory structure it creates, or something like
that.  I wonder if this is already common knowledge in the repressed
memories of this list, but I couldn't find anything specific.  I'd be
curious to know why exactly, if so (in terms of POSIX conformance etc.,
who is doing something wrong).




Re: POC: GROUP BY optimization

2024-04-16 Thread Andrei Lepikhov

On 4/12/24 06:44, Tom Lane wrote:

* I'm pretty unconvinced by group_keys_reorder_by_pathkeys (which
I notice has already had one band-aid added to it since commit).
In particular, it seems to believe that the pathkeys and clauses
lists match one-for-one, but I seriously doubt that that invariant
remains guaranteed after the cleanup steps

 /* append the remaining group pathkeys (will be treated as not sorted) */
 *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys,
  *group_pathkeys);
 *group_clauses = list_concat_unique_ptr(new_group_clauses,
 *group_clauses);

For that to be reliable, the SortGroupClauses added to
new_group_clauses in the main loop have to be exactly those
that are associated with the same pathkeys in the old lists.
I doubt that that's necessarily true in the presence of redundant
grouping clauses.  (Maybe we can't get here with any redundant
grouping clauses, but still, we don't really guarantee uniqueness of
SortGroupClauses, much less that they are never copied which is what
you need if you want to believe that pointer equality is sufficient
for de-duping here.  PathKeys are explicitly made to be safe to compare
pointer-wise, but I know of no such guarantee for SortGroupClauses.)
I spent a lot of time inventing situations with SortGroupClause
duplicates. Unfortunately, it looks impossible so far. But because we
really don't guarantee uniqueness, I changed the code to survive in this
case. Also, I added assertion checking to be sure we don't have logical
mistakes here - see attachment.
About the band-aid mentioned above - as I see it, 4169850 introduces the
same trick in planner.c. So it looks like a result of the design of the
current code.


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 5d9597adcd..aa80f6486c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -380,15 +380,18 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 
 	/*
 	 * We're going to search within just the first num_groupby_pathkeys of
-	 * *group_pathkeys.  The thing is that root->group_pathkeys is passed as
+	 * *group_pathkeys. The thing is that root->group_pathkeys is passed as
 	 * *group_pathkeys containing grouping pathkeys altogether with aggregate
-	 * pathkeys.  If we process aggregate pathkeys we could get an invalid
+	 * pathkeys. If we process aggregate pathkeys we could get an invalid
 	 * result of get_sortgroupref_clause_noerr(), because their
-	 * pathkey->pk_eclass->ec_sortref doesn't referece query targetlist.  So,
+	 * pathkey->pk_eclass->ec_sortref doesn't reference query targetlist. So,
 	 * we allocate a separate list of pathkeys for lookups.
 	 */
 	grouping_pathkeys = list_copy_head(*group_pathkeys, num_groupby_pathkeys);
 
+	/* Make a new copy before reordering clauses */
+	*group_clauses = list_copy(*group_clauses);
+
 	/*
 	 * Walk the pathkeys (determining ordering of the input path) and see if
 	 * there's a matching GROUP BY key. If we find one, we append it to the
@@ -400,8 +403,8 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 	 */
 	foreach(lc, pathkeys)
 	{
-		PathKey*pathkey = (PathKey *) lfirst(lc);
-		SortGroupClause *sgc;
+		PathKey			   *pathkey = (PathKey *) lfirst(lc);
+		SortGroupClause	   *sgc;
 
 		/*
 		 * Pathkeys are built in a way that allows simply comparing pointers.
@@ -431,17 +434,25 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 		Assert(OidIsValid(sgc->sortop));
 
 		new_group_pathkeys = lappend(new_group_pathkeys, pathkey);
+
+		/*
+		 * Keeping in mind that the SortGroupClause list doesn't guarantee
+		 * unique pointers, we must explicitly transfer elements one-by-one.
+		 */
 		new_group_clauses = lappend(new_group_clauses, sgc);
+		*group_clauses = list_delete_ptr(*group_clauses, sgc);
 	}
 
 	/* remember the number of pathkeys with a matching GROUP BY key */
 	n = list_length(new_group_pathkeys);
 
-	/* append the remaining group pathkeys (will be treated as not sorted) */
+	/*
+	 * Append the remaining group pathkeys (will be treated as not sorted) and
+	 * grouping clauses.
+	 */
 	*group_pathkeys = list_concat_unique_ptr(new_group_pathkeys,
 			 *group_pathkeys);
-	*group_clauses = list_concat_unique_ptr(new_group_clauses,
-			*group_clauses);
+	*group_clauses = list_concat(new_group_clauses, *group_clauses);
 
 	list_free(grouping_pathkeys);
 	return n;
@@ -484,9 +495,9 @@ pathkeys_are_duplicate(List *infos, List *pathkeys)
 List *
 get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
 {
-	Query	   *parse = root->parse;
-	List	   *infos = NIL;
-	PathKeyInfo *info;
+	Query		   *parse = root->parse;
+	List		   *infos = NIL;
+	PathKeyInfo	   *info;
 
 	List	   *pathkeys = root->group_pathkeys;
 	List	 

Re: An improved README experience for PostgreSQL

2024-04-16 Thread Nathan Bossart
On Wed, Feb 28, 2024 at 02:21:49PM -0600, Nathan Bossart wrote:
> I see many projects have files like SECURITY.md, CODE_OF_CONDUCT.md, and
> CONTRIBUTING.md, and I think it would be relatively easy to add content to
> each of those for PostgreSQL, even if they just link elsewhere.

Here's a first attempt at this.  You can see some of the effects of these
files at [0].  More information about these files is available at [1] [2]
[3].

I figured we'd want to keep these pretty bare-bones to avoid duplicating
information that's already available at postgresql.org, but perhaps it's
worth filling them out a bit more.  Anyway, I just wanted to gauge interest
in this stuff.

[0] https://github.com/nathan-bossart/postgres/tree/special-files
[1] 
https://docs.github.com/en/code-security/getting-started/adding-a-security-policy-to-your-repository
[2] 
https://docs.github.com/en/communities/setting-up-your-project-for-healthy-contributions/adding-a-code-of-conduct-to-your-project
[3] 
https://docs.github.com/en/communities/setting-up-your-project-for-healthy-contributions/setting-guidelines-for-repository-contributors

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d2a7f5ae384dde6305571bd7042e6e03846dd1d5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 16 Apr 2024 21:23:52 -0500
Subject: [PATCH v1 1/1] Add code of conduct, contributing, and security files.

---
 CODE_OF_CONDUCT.md | 2 ++
 CONTRIBUTING.md| 2 ++
 SECURITY.md| 2 ++
 3 files changed, 6 insertions(+)
 create mode 100644 CODE_OF_CONDUCT.md
 create mode 100644 CONTRIBUTING.md
 create mode 100644 SECURITY.md

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 00..99bb1905d6
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,2 @@
+The PostgreSQL code of conduct can be found at
+.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00..ff93024352
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,2 @@
+For information about contributing to PostgreSQL, see
+.
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00..ebdbe609db
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,2 @@
+For information about reporting security issues, see
+.
-- 
2.25.1



Re: Why is parula failing?

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan  wrote:
> The last 25 consecutive runs have passed [1] after switching
> REL_12_STABLE to -O0 ! So I am wondering whether that confirms that
> the compiler version is to blame, and while we're still here,
> is there anything else I could try?

I don't think it's conclusive proof that it's a compiler bug.  -O0
could equally just have changed the timing sufficiently to not trigger
the issue.

It might be a long shot, but I wonder if it might be worth running a
workload such as:

psql -c "create table a (a int primary key, b text not null, c int not
null); insert into a values(1,'',0);" postgres
echo "update a set b = repeat('a', random(1,10)), c=c+1 where a = 1;"
> bench.sql
pgbench -n -t 12500 -c 8 -j 8 -f bench.sql postgres
psql -c "table a;" postgres

On a build with the original CFLAGS.  I expect the value of "c" to be
100k after running that. If it's not then something bad has happened.

That would exercise the locking code heavily and would show us if any
updates were missed due to locks not being correctly respected or seen
by other backends.

David




Re: cpluspluscheck/headerscheck require build in REL_16_STABLE

2024-04-16 Thread John Naylor
On Mon, Apr 15, 2024 at 9:20 PM Marina Polyakova
 wrote:
>
> On 2024-04-13 08:40, John Naylor wrote:
> > so that they build fine on PG16 as well. The problem is, not all the
> > required headers are generated when invoking `make headerscheck`. The
> > attached patch brings in some Makefile rules from master to make this
> > work. Does this fix it for you?
>
> Everything seems to work with this patch, thank you!

Glad to hear it -- I'll push next week when I get back from vacation,
unless there are objections.




Re: pg_combinebackup does not detect missing files

2024-04-16 Thread David Steele

On 4/16/24 23:50, Robert Haas wrote:

On Wed, Apr 10, 2024 at 9:36 PM David Steele  wrote:

I've been playing around with the incremental backup feature trying to
get a sense of how it can be practically used. One of the first things I
always try is to delete random files and see what happens.

You can delete pretty much anything you want from the most recent
incremental backup (not the manifest) and it will not be detected.


Sure, but you can also delete anything you want from the most recent
non-incremental backup and it will also not be detected. There's no
reason at all to hold incremental backup to a higher standard than we
do in general.


Except that we are running pg_combinebackup on the incremental, which 
the user might reasonably expect to check backup integrity. It actually 
does a bunch of integrity checks -- but not this one.



Maybe the answer here is to update the docs to specify that
pg_verifybackup should be run on all backup directories before
pg_combinebackup is run. Right now that is not at all clear.


I don't want to make those kinds of prescriptive statements. If you
want to verify the backups that you use as input to pg_combinebackup,
you can use pg_verifybackup to do that, but it's not a requirement.
I'm not averse to having some kind of statement in the documentation
along the lines of "Note that pg_combinebackup does not attempt to
verify that the individual backups are intact; for that, use
pg_verifybackup." 


I think we should do this at a minimum.


But I think it should be blindingly obvious to
everyone that you can't go whacking around the inputs to a program and
expect to get perfectly good output. I know it isn't blindingly
obvious to everyone, which is why I'm not averse to adding something
like what I just mentioned, and maybe it wouldn't be a bad idea to
document in a few other places that you shouldn't randomly remove
files from the data directory of your live cluster, either, because
people seem to keep doing it, but really, the expectation that you
can't just blow files away and expect good things to happen afterward
should hardly need to be stated.


And yet, we see it all the time.


I think it's very easy to go overboard with warnings of this type.
Weird stuff comes to me all the time because people call me when the
weird stuff happens, and I'm guessing that your experience is similar.
But my actual personal experience, as opposed to the cases reported to
me by others, practically never features files evaporating into the
ether. 


Same -- if it happens at all it is very rare. Virtually every time I am 
able to track down the cause of missing files it is because the user 
deleted them, usually to "save space" or because they "did not seem 
important".


But given that this occurrence is pretty common in my experience, I 
think it is smart to mitigate against it, rather than just take it on 
faith that the user hasn't done anything destructive.


Especially given how pg_combinebackup works, backups are going to 
undergo a lot of user manipulation (pushing to and pulling from storage, 
decompressing, untarring, etc.) and I think that means we should take 
extra care.


Regards,
-David




Re: post-freeze damage control

2024-04-16 Thread Jeff Davis
On Mon, 2024-04-08 at 15:47 -0400, Robert Haas wrote:
> - I couldn't understand why the "Operate
> XLogCtl->log{Write,Flush}Result with atomics" code was correct when I
> read it.

I reviewed ee1cbe806d. It followed a good process of discussion and
review. It was a bit close to the feature freeze for comfort, but did
not feel rushed, to me.

Additional eyes are always welcome. There are also some related commits
in that area like f3ff7bf83b, c9920a9068 and 91f2cae7a4.

> But, a spinlock
> protecting two variables together guarantees more than atomic access
> to each of those variables separately.

We maintain the invariant:

   XLogCtl->logFlushResult <= XLogCtl->logWriteResult

and the non-shared version:

   LogwrtResult.Flush <= LogwrtResult.Write

and that the requests don't fall behind the results:

   XLogCtl->LogwrtRqst.Write >= XLogCtl->logWriteResult &&
   XLogCtl->LogwrtRqst.Flush >= XLogCtl->logFlushResult

Are you concerned that:

  (a) you aren't convinced that we're maintaining the invariants
properly? or
  (b) you aren't convinced that the invariant is strong enough, and
that there are other variables that we aren't considering?

And if you think this is right after all, where could we add some
clarification?

Regards,
Jeff Davis





Re: pg17 issues with not-null contraints

2024-04-16 Thread Justin Pryzby
On Tue, Apr 16, 2024 at 08:11:49PM +0200, Alvaro Herrera wrote:
> On 2024-Apr-15, Alvaro Herrera wrote:
> 
> > - Fourth thought: we do as in the third thought, except we also allow
> > DROP CONSTRAINT a constraint that's marked "local, inherited" to be
> > simply an inherited constraint (remove its "local" marker).
> 
> Here is an initial implementation of what I was thinking.  Can you
> please give it a try and see if it fixes this problem?  At least in my
> run of your original test case, it seems to work as expected.

Yes, this fixes the issue I reported.

BTW, that seems to be the same issue Andrew reported in January.
https://www.postgresql.org/message-id/CAJnzarwkfRu76_yi3dqVF_WL-MpvT54zMwAxFwJceXdHB76bOA%40mail.gmail.com

-- 
Justin




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-16 Thread Tomas Vondra
On 4/15/24 20:35, Tomas Vondra wrote:
> On 4/15/24 10:18, Tomas Vondra wrote:
>> ...
>>
>> I'll try a bit more to make this work without the temp table.
>>
> 
> Considering the earlier discussion in e2933a6e1, I think making the
> table TEMP is the best fix, so I'll do that. Thanks for remembering that
> change, Alexander!
> 

D'oh! I pushed this fix to stabilize the test earlier today, but I just
realized it unfortunately makes the test useless. The idea of the test
was to build BRIN indexes with/without parallelism, and check that the
indexes are exactly the same.

The instability comes from deletes, which I added to get "empty" ranges
in the table, which may not be cleaned up in time for the CREATE INDEX
commands, depending on what else is happening. A TEMPORARY table does
not have this issue (as observed in e2933a6e1), but there's the minor
problem that plan_create_index_workers() does this:

  /*
   * Determine if it's safe to proceed.
   *
   * Currently, parallel workers can't access the leader's temporary
   * tables. Furthermore, any index predicate or index expressions must
   * be parallel safe.
   */
  if (heap->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
!is_parallel_safe(root, (Node *) RelationGetIndexExpressions(index)) ||
!is_parallel_safe(root, (Node *) RelationGetIndexPredicate(index)))
  {
parallel_workers = 0;
goto done;
  }

That is, no parallel index builds on temporary tables. Which means the
test is not actually testing anything :-( Much more stable, but not very
useful for finding issues.

I think the best way to stabilize the test is to just not delete the
rows. That means we won't have any "empty" ranges (runs of pages without
any tuples).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Differential code coverage between 16 and HEAD

2024-04-16 Thread Jeff Davis
On Tue, 2024-04-16 at 11:58 -0700, Andres Freund wrote:
> 
> Hm, that seems annoying, even for update-unicode :/. But I guess it
> won't be very common to have such failures?

Things don't change a lot between Unicode versions (and are subject to
the stability policy), but the tests are exhaustive, so even a single
character's property being changed will cause a failure when compared
against an older version of ICU. The case mapping test succeeds back to
ICU 64 (based on Unicode 12.1), but the category/properties test
succeeds only back to ICU 72 (based on Unicode 15.0).

I agree this is annoying, and I briefly documented it in
src/common/unicode/README. It means whoever updates Unicode for a
Postgres version should probably know how to build ICU from source and
point the Postgres build process at it. Maybe I should add more details
in the README to make that easier for others.

But it's also a really good test. The ICU parsing, interpretation of
data files, and lookup code is entirely independent of ours. Therefore,
if the results agree for all codepoints, we have a high degree of
confidence that the results are correct. That level of confidence seems
worth a bit of annoyance.

This kind of test is possible because the category/property and case
mapping functions accept a single code point, and there are only
0x110000 code points (U+0000 through U+10FFFF).

> > That's not to say that the C code shouldn't be tested, of course.
> > Maybe we can just do some spot checks for the functions that are
> > reachable via SQL and get rid of the functions that aren't yet
> > reachable (and re-add them when they are)?
> 
> Yes, I think that'd be a good start. I don't think we necessarily
> need exhaustive coverage, just a bit more coverage than we have.

OK, I'll submit a test module or something.

Regards,
Jeff Davis





Re: documentation structure

2024-04-16 Thread Andres Freund
Hi,

On 2024-04-16 15:05:32 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think we should work on generating a lot of func.sgml.  Particularly the
> > signature etc. should just come from pg_proc.dat; it's pointlessly painful
> > to generate that by hand. And for a lot of the functions we should probably
> > move the existing func.sgml comments to the description in pg_proc.dat.
>
> Where are you going to get the examples and text descriptions from?

I think there are a few different ways to do that, e.g. having long_desc
and example fields in pg_proc.dat, or having the examples and description
in a separate file and "enriching" that with auto-generated function
signatures.


> (And no, I don't agree that the pg_description string should match
> what's in the docs.  The description string has to be a short
> one-liner in just about every case.)

Definitely shouldn't be the same in all cases, but I think there's a decent
number of cases where they can be the same. The difference between the two
is often minimal today.

Entirely randomly chosen example:

{ oid => '2825',
  descr => 'slope of the least-squares-fit linear equation determined by the (X, Y) pairs',
  proname => 'regr_slope', prokind => 'a', proisstrict => 'f',
  prorettype => 'float8', proargtypes => 'float8 float8',
  prosrc => 'aggregate_dummy' },

and

  <row>
   <entry role="func_table_entry"><para role="func_signature">
    <indexterm>
     <primary>regression slope</primary>
    </indexterm>
    <indexterm>
     <primary>regr_slope</primary>
    </indexterm>
    <function>regr_slope</function> ( <parameter>Y</parameter> <type>double precision</type>, <parameter>X</parameter> <type>double precision</type> )
    <returnvalue>double precision</returnvalue>
   </para>
   <para>
    Computes the slope of the least-squares-fit linear equation determined
    by the (<parameter>X</parameter>, <parameter>Y</parameter>)
    pairs.
   </para></entry>
   <entry>Yes</entry>
  </row>


The description is quite similar, but the pg_proc entry lacks argument names.


> This sounds to me like it would be a painful exercise with not a
> lot of benefit in the end.

I think the manual work for writing signatures in sgml is not insignificant,
nor is the volume of sgml for them. Manually maintaining the signatures makes
it impractical to significantly improve the presentation - which I don't think
is all that great today.

And the lack of argument names in the pg_proc entries is occasionally fairly
annoying, because a \df+ doesn't provide enough information to use functions.

It'd also be quite useful if clients could render more of the documentation
for functions. People are used to language servers providing full
documentation for functions etc...

Greetings,

Andres Freund




Re: Differential code coverage between 16 and HEAD

2024-04-16 Thread Jeff Davis
On Mon, 2024-04-15 at 21:35 -0400, Tom Lane wrote:
> It's definitely not OK for the standard test suite to include
> internet access.

The update-unicode target is not run as part of the standard test
suite.

>   Seems like we need to separate "download new
> source files" from "generate the derived files".

I'm not sure that's the right dividing line. There are three-ish steps:

1. Download the Unicode files
2. Generate the derived .h files
3. Run tests

If we stop after 1, then do we check in the Unicode files? If so, then
there's inconsistency between the Unicode files and the .h files, which
doesn't seem like a good idea. If we don't check in the files, then
nobody can skip to step 2, so I don't see the point in separating the
steps.

If we separate out step 3 that makes more sense: we check in the result
after step 2, and anyone can run step 3 without downloading anything.
The only problem with that is the tests I added depend on a recent-
enough version of ICU, so I'm not sure how many people will run it,
anyway.

Andres's complaints seem mainly about code coverage in the standard
test suite for the thin layer of C code above the generated arrays. I
agree: code coverage is a good goal by itself, and having a few more
platforms exercising that C code can't hurt. I think we should just
address that concern directly by spot-checking the results for a few
code points rather than trying to make the exhaustive ICU tests run on
more hosts.
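The spot-check idea could look something like this sketch (Python's str casing and unicodedata stand in here for the generated C arrays and ICU; the chosen code points are illustrative):

```python
# Instead of exhaustively comparing every code point against ICU,
# verify a handful of known-stable cases directly.
import unicodedata

# case-mapping spot checks: (uppercase input, expected lowercase)
SPOT_CHECKS = [
    ("A", "a"),            # ASCII
    ("\u00c9", "\u00e9"),  # LATIN CAPITAL E WITH ACUTE -> lowercase
    ("\u0391", "\u03b1"),  # GREEK CAPITAL ALPHA -> alpha
]

for upper, lower in SPOT_CHECKS:
    assert upper.lower() == lower, (upper, upper.lower())
    # and check the general category tables agree on "uppercase letter"
    assert unicodedata.category(upper) == "Lu"
```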

Regards,
Jeff Davis





Re: documentation structure

2024-04-16 Thread Bruce Momjian
On Tue, Apr 16, 2024 at 03:05:32PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think we should work on generating a lot of func.sgml.  Particularly the
> > signature etc should just come from pg_proc.dat, it's pointlessly painful to
> > generate that by hand. And for a lot of the functions we should probably 
> > move
> > the existing func.sgml comments to the description in pg_proc.dat.
> 
> Where are you going to get the examples and text descriptions from?
> (And no, I don't agree that the pg_description string should match
> what's in the docs.  The description string has to be a short
> one-liner in just about every case.)
> 
> This sounds to me like it would be a painful exercise with not a
> lot of benefit in the end.

Maybe we could _verify_ the contents of func.sgml against pg_proc.
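A minimal sketch of that verification (hedged: the markup pattern and names here are illustrative, not the real docs tooling, which would need to handle the full DocBook structure):

```python
# Cross-check that every function documented in func.sgml exists in
# pg_proc, and report catalog functions missing from the docs.
import re

def doc_function_names(sgml_text):
    # crude pattern, good enough for the sketch
    return set(re.findall(r"<function>(\w+)</function>", sgml_text))

sgml = ("<function>regr_slope</function> computes ... "
        "<function>regr_r2</function>")
catalog = {"regr_slope", "regr_r2", "regr_count"}

phantom = doc_function_names(sgml) - catalog      # documented but not in catalog
undocumented = sorted(catalog - doc_function_names(sgml))
assert not phantom
```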

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: documentation structure

2024-04-16 Thread Tom Lane
Andres Freund  writes:
> I think we should work on generating a lot of func.sgml.  Particularly the
> signature etc should just come from pg_proc.dat, it's pointlessly painful to
> generate that by hand. And for a lot of the functions we should probably move
> the existing func.sgml comments to the description in pg_proc.dat.

Where are you going to get the examples and text descriptions from?
(And no, I don't agree that the pg_description string should match
what's in the docs.  The description string has to be a short
one-liner in just about every case.)

This sounds to me like it would be a painful exercise with not a
lot of benefit in the end.

I do agree with Andrew that splitting func.sgml into multiple files
would be beneficial.

regards, tom lane




Re: Differential code coverage between 16 and HEAD

2024-04-16 Thread Andres Freund
Hi,

On 2024-04-15 18:23:21 -0700, Jeff Davis wrote:
> On Mon, 2024-04-15 at 17:05 -0700, Andres Freund wrote:
> > Can't we test this as part of the normal testsuite?
> 
> One thing that complicates things a bit is that the test compares the
> results against ICU, so a mismatch in Unicode version between ICU and
> Postgres can cause test failures. The test ignores unassigned code
> points, so normally it just results in less-exhaustive test coverage.
> But sometimes things really do change, and that would cause a failure.

Hm, that seems annoying, even for update-unicode :/. But I guess it won't be
very common to have such failures?


> Stepping back a moment, my top worry is really not to test those C
> functions, but to test the perl code that parses the text files and
> generates those arrays. Imagine a future Unicode version does something
> that the perl scripts didn't anticipate, and they fail to add array
> entries for half the code points, or something like that. By testing
> the arrays generated from freshly-parsed files exhaustively against
> ICU, then we have a good defense against that. That situation really
> only comes up when updating Unicode.

That's a good point.


> That's not to say that the C code shouldn't be tested, of course. Maybe
> we can just do some spot checks for the functions that are reachable
> via SQL and get rid of the functions that aren't yet reachable (and re-
> add them when they are)?

Yes, I think that'd be a good start. I don't think we necessarily need
exhaustive coverage, just a bit more coverage than we have.


> > I don't at all like that the tests depend on downloading new unicode
> > data. What if there was an update but I just want to test the current
> > state?
> 
> I was mostly following the precedent for normalization. Should we
> change that, also?

Yea, I think we should. But I think it's less urgent if we end up testing more
of the code without those test binaries.  I don't immediately know what
dependencies would be best, tbh.

Greetings,

Andres Freund




Re: pg17 issues with not-null contraints

2024-04-16 Thread Alvaro Herrera
On 2024-Apr-15, Justin Pryzby wrote:

> Here's a couple more issues affecting upgrades from v16 to v17.
> 
> postgres=# CREATE TABLE a(i int NOT NULL); CREATE TABLE b(i int PRIMARY KEY) 
> INHERITS (a);
> pg_restore: error: could not execute query: ERROR:  constraint 
> "pgdump_throwaway_notnull_0" of relation "b" does not exist

This one requires a separate pg_dump fix, which should --I hope-- be
pretty simple.

> postgres=# CREATE TABLE a(i int CONSTRAINT a NOT NULL PRIMARY KEY); CREATE 
> TABLE b()INHERITS(a); ALTER TABLE b ADD CONSTRAINT pkb PRIMARY KEY (i);
> pg_restore: error: could not execute query: ERROR:  cannot drop inherited 
> constraint "pgdump_throwaway_notnull_0" of relation "b"

This one seems to be fixed with the patch I just posted.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
  https://postgr.es/m/200606261359.k5qdxe2p004...@auth-smtp.hol.gr




Re: documentation structure

2024-04-16 Thread Andres Freund
Hi,

On 2024-03-19 17:39:39 -0400, Andrew Dunstan wrote:
> My own pet docs peeve is a purely editorial one: func.sgml is a 30k line
> beast, and I think there's a good case for splitting out at least the
> larger chunks of it.

I think we should work on generating a lot of func.sgml.  Particularly the
signature etc should just come from pg_proc.dat, it's pointlessly painful to
generate that by hand. And for a lot of the functions we should probably move
the existing func.sgml comments to the description in pg_proc.dat.

I suspect that we can't just generate all the documentation from pg_proc,
because of xrefs etc.  Although perhaps we could just strip those out for
pg_proc.

We'd need to add some more metadata to pg_proc, for grouping kinds of
functions together. But that seems doable.

Greetings,

Andres Freund




Re: pg17 issues with not-null contraints

2024-04-16 Thread Alvaro Herrera
On 2024-Apr-15, Alvaro Herrera wrote:

> - Fourth thought: we do as in the third thought, except we also allow
> DROP CONSTRAINT a constraint that's marked "local, inherited" to be
> simply an inherited constraint (remove its "local" marker).

Here is an initial implementation of what I was thinking.  Can you
please give it a try and see if it fixes this problem?  At least in my
run of your original test case, it seems to work as expected.

This is still missing some cleanup and additional tests, of course.
Speaking of which, I wonder if I should modify pg16's tests so that they
leave behind tables set up in this way, to immortalize pg_upgrade
testing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
From b40a418dfb3d7ce23ffa568c8a8d03640ce28435 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 16 Apr 2024 13:44:34 +0200
Subject: [PATCH] Fix add/drop of not-null constraints

---
 src/backend/catalog/heap.c  | 36 +
 src/backend/catalog/pg_constraint.c | 36 -
 src/backend/commands/tablecmds.c| 26 -
 src/include/catalog/pg_constraint.h |  2 +-
 4 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index cc31909012..f0278b9c01 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2539,6 +2539,7 @@ AddRelationNewConstraints(Relation rel,
 			CookedConstraint *nncooked;
 			AttrNumber	colnum;
 			char	   *nnname;
+			int			existing;
 
 			/* Determine which column to modify */
 			colnum = get_attnum(RelationGetRelid(rel), strVal(linitial(cdef->keys)));
@@ -2547,12 +2548,39 @@ AddRelationNewConstraints(Relation rel,
 	 strVal(linitial(cdef->keys)), RelationGetRelid(rel));
 
 			/*
-			 * If the column already has a not-null constraint, we need only
-			 * update its catalog status and we're done.
+			 * If the column already has an inheritable not-null constraint,
+			 * we need only adjust its inheritance status and we're done.  If
+			 * the constraint is there but marked NO INHERIT, then it is
+			 * updated in the same way, but we also recurse to the children
+			 * (if any) to add the constraint there as well.
 			 */
-			if (AdjustNotNullInheritance1(RelationGetRelid(rel), colnum,
-		  cdef->inhcount, cdef->is_no_inherit))
+			existing = AdjustNotNullInheritance1(RelationGetRelid(rel), colnum,
+ cdef->inhcount, cdef->is_no_inherit);
+			if (existing == 1)
+continue;		/* all done! */
+			else if (existing == -1)
+			{
+List	   *children;
+
+children = find_inheritance_children(RelationGetRelid(rel), NoLock);
+foreach_oid(childoid, children)
+{
+	Relation	childrel = table_open(childoid, NoLock);
+
+	AddRelationNewConstraints(childrel,
+			  NIL,
+			  list_make1(copyObject(cdef)),
+			  allow_merge,
+			  is_local,
+			  is_internal,
+			  queryString);
+	/* these constraints are not in the return list -- good? */
+
+	table_close(childrel, NoLock);
+}
+
 continue;
+			}
 
 			/*
 			 * If a constraint name is specified, check that it isn't already
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 604280d322..a11483b1b8 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -709,36 +709,50 @@ extractNotNullColumn(HeapTuple constrTup)
  * AdjustNotNullInheritance1
  *		Adjust inheritance count for a single not-null constraint
  *
- * Adjust inheritance count, and possibly islocal status, for the not-null
- * constraint row of the given column, if it exists, and return true.
- * If no not-null constraint is found for the column, return false.
+ * If no not-null constraint is found for the column, return 0
+ * If the constraint does exist and it's inheritable, adjust its
+ * inheritance count (and possibly islocal status) and return 1.
+ * If the constraint exists but is marked NO INHERIT, adjust it as above
+ * and reset connoinherit to false, and return -1.  Caller is
+ * responsible for adding the same constraint to the children, if any.
  */
-bool
+int
 AdjustNotNullInheritance1(Oid relid, AttrNumber attnum, int count,
 		  bool is_no_inherit)
 {
 	HeapTuple	tup;
 
+	Assert(count >= 0);
+
 	tup = findNotNullConstraintAttnum(relid, attnum);
 	if (HeapTupleIsValid(tup))
 	{
 		Relation	pg_constraint;
 		Form_pg_constraint conform;
+		int			retval = 1;
 
 		pg_constraint = table_open(ConstraintRelationId, RowExclusiveLock);
 		conform = (Form_pg_constraint) GETSTRUCT(tup);
 
 		/*
-		 * Don't let the NO INHERIT status change (but don't complain
-		 * unnecessarily.)  In the future it might be useful to let an
-		 * inheritable constraint replace a non-inheritable one, but we'd need
-		 * to recurse to children to get it added there.
+		 * If the constraint already exists in this relation but 

Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-16 Thread Robert Haas
On Tue, Apr 16, 2024 at 12:06 PM Stefan Fercot
 wrote:
> Sure, I can see your point here and how people could be tempted to throw
> away that backup_manifest if they don't know how important it is to keep it.
> Probably in this case we'd need the list to be inside the tar, just like
> backup_label and tablespace_map then.

Yeah, I think anywhere inside the tar is better than anywhere outside
the tar, by a mile. I'm happy to leave the specific question of where
inside the tar as something TBD at time of implementation by fiat of
the person doing the work.

But that said ...

> Do you mean 1 stub-list per pgdata + 1 per tablespaces?
>
> I don't really see how it would be faster to recursively go through each
> sub-directory of the pgdata and tablespaces to gather all the pieces
> together compared to reading 1 main file.
> But I guess, choosing one option or the other, we will only find out how well
> it works once people use it in the field and possibly give some feedback.

The reason why I was suggesting one stub-list per directory is that we
recurse over the directory tree. We reach each directory in turn,
process it, and then move on to the next one. What I imagine that we
want to do is - first iterate over all of the files actually present
in a directory. Then, iterate over the list of stubs for that
directory and do whatever we would have done if there had been a stub
file present for each of those. So, we don't really want a list of
every stub in the whole backup, or even every stub in the whole
tablespace. What we want is to be able to easily get a list of stubs
for a single directory. Which is very easily done if each directory
contains its own stub-list file.
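The per-directory flow described above, including the duplicate sanity check, can be sketched like this (all names are hypothetical; real reconstruction would do actual file copying):

```python
# While walking one directory: first handle the real files present,
# then consult that directory's own stub-list, and verify the two
# sets don't overlap.

def process_directory(real_files, stub_list):
    duplicates = set(real_files) & set(stub_list)
    if duplicates:
        raise ValueError("stubs also present as real files: %r"
                         % sorted(duplicates))
    for f in sorted(real_files):
        pass  # reconstruct/copy the real file here
    for stub in sorted(stub_list):
        pass  # act as if a stub file existed for this name

process_directory({"16384", "16385"}, {"16386"})   # fine
try:
    process_directory({"16384"}, {"16384"})        # overlap -> error
except ValueError:
    pass
```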

If we instead have a centralized stub-list for the whole tablespace,
or the whole backup, it's still quite possible to make it work. We
just read that centralized stub list and we build an in-memory data
structure that is indexed by containing directory, like a hash table
where the key is the directory name and the value is a list of
filenames within that directory. But, a slight disadvantage of this
model is that you have to keep that whole data structure in memory for
the whole time you're reconstructing, and you have to pass around a
pointer to it everywhere so that the code that handles individual
directories can access it. I'm sure this isn't the end of the world.
It's probably unlikely that someone has so many stub files that the
memory used for such a data structure is painfully high, and even if
they did, it's unlikely that they are spread out across multiple
databases and/or tablespaces in such a way that only needing the data
for one directory at a time would save you. But, it's not impossible
that such a scenario could exist.
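The centralized-list alternative amounts to building that in-memory index up front, roughly like this sketch (paths are illustrative):

```python
# Read one whole-backup stub list and index it by containing directory,
# so per-directory code can still look up just its own stubs.  The whole
# mapping then has to stay in memory for the full reconstruction.
import os
from collections import defaultdict

def index_by_directory(stub_paths):
    by_dir = defaultdict(list)
    for path in stub_paths:
        by_dir[os.path.dirname(path)].append(os.path.basename(path))
    return by_dir

stubs = ["base/5/16384", "base/5/16385", "global/6100"]
index = index_by_directory(stubs)
```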

Somebody might say - well, don't go directory by directory. Just
handle all of the stubs at the end. But I don't think that really
fixes anything. I want to be able to verify that none of the stubs
listed in the stub-list are also present in the backup as real files,
for sanity checking purposes. It's quite easy to see how to do that in
the design I proposed above: keep a list of the files for each
directory as you read it, and then when you read the stub-list for
that directory, check those lists against each other for duplicates.
Doing this on the level of a whole tablespace or the whole backup is
clearly also possible, but once again it potentially uses more memory,
and there's no functional gain.

Plus, this kind of approach would also make the reconstruction process
"jump around" more. It might pull a bunch of mostly-unchanged files
from the full backup while handling the non-stub files, and then come
back to that directory a second time, much later, when it's processing
the stub-list. Perhaps that would lead to a less-optimal I/O pattern,
or perhaps it would make it harder for the user to understand how much
progress reconstruction had made. Or perhaps it would make no
difference at all; I don't know. Maybe there's even some advantage in
a two-pass approach like this. I don't see one. But it might prove
otherwise on closer examination.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Stack overflow issue

2024-04-16 Thread Alexander Korotkov
On Tue, Apr 16, 2024 at 6:35 PM Andres Freund  wrote:
> On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > I did minor revision of comments and code blocks order to improve the
> > > > readability.
> > >
> > > After sending
> > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > I looked some more at important areas where changes didn't have code
> > > coverage. One thing I noticed was that the "non-internal" part of
> > > AbortCurrentTransaction() is uncovered:
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > >
> > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > > some parts are handled in 
> > > CommitCurrentTransaction()/AbortCurrentTransaction()
> > > and others are in the *Internal functions.
> > >
> > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't 
> > > understand
> > > why that means having some code in the non-internal and some in the
> > > internal functions?  Wouldn't it be easier to just have all the state 
> > > handling
> > > code in the Internal() function and just break after the
> > > CleanupSubTransaction() calls?
> >
> > I'm not sure I correctly get what you mean.  Do you think the attached
> > patch matches the direction you're pointing?  The patch itself is not
> > final, it requires cleanup and comments revision, just to check the
> > direction.
>
> Something like that, yea. The split does seem less confusing that way to me,
> but also not 100% certain.

Thank you for your feedback.  I'm going to go ahead and polish this patch.

--
Regards,
Alexander Korotkov




soliciting patches to review

2024-04-16 Thread Robert Haas
Hi,

At 2024.pgconf.dev, Andres and I will be hosting a patch review
workshop.[1] Part of the workshop will be a presentation, and part of
it will be a practicum. That is, we're going to actually ask attendees
to review some patches during the workshop. We'll also comment on
those reviews, and the patches themselves, with our own thoughts.
While we could just pick some things from the CommitFest, I believe we
both felt a little uncomfortable with the idea of potentially turning
a spotlight on someone's patch where they might not have been
expecting it. So, instead, I'd like to invite you to email me, and/or
Andres, if you have a patch that isn't committed yet and which you
think would be a good candidate for review during this workshop. If
your patch is selected to be reviewed during the workshop, then you
will very likely get some reviews for your patch posted on
pgsql-hackers. But, there are no guarantees about how positive or
negative those reviews will be, so you do need to be prepared to take
the bad with the good.

Note that this is really an exercise in helping more people in the
community to get better at reviewing patches. So, if Andres and I
think that what your patch really needs is an opinion from Tom Lane
specifically, or even an opinion from Andres Freund or Robert Haas
specifically, we probably won't choose to include it in the workshop.
But there are lots of patches that just need attention from someone,
at least for starters, and perhaps this workshop can help some of
those patches to make progress, in addition to (hopefully) being
educational for the attendees.

Key points:

1. If you have a patch you think would be a good candidate for this
event, please email me and/or Andres.

2. Please only volunteer a patch that you wrote, not one that somebody
else wrote.

3. Please don't suggest a patch that's already committed.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] 
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/40-patch-review-workshop-registration-required/




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-16 Thread Stefan Fercot
On Tuesday, April 16th, 2024 at 3:22 PM, Robert Haas wrote:
> What I fear is that this will turn into another situation like we had
> with pg_xlog, where people saw "log" in the name and just blew it
> away. Matter of fact, I recently encountered one of my few recent
> examples of someone doing that thing since the pg_wal renaming
> happened. Some users don't take much convincing to remove anything
> that looks inessential. And what I'm particularly worried about with
> this feature is tar-format backups. If you have a directory format
> backup and you do an "ls", you're going to see a whole bunch of files
> in there of which backup_manifest will be one. How you treat that file
> is just going to depend on what you know about its purpose. But if you
> have a tar-format backup, possibly compressed, the backup_manifest
> file stands out a lot more. You may have something like this:
> 
> backup_manifest root.tar.gz 16384.tar.gz

Sure, I can see your point here and how people could be tempted to throw away
that backup_manifest if they don't know how important it is to keep it.
Probably in this case we'd need the list to be inside the tar, just like
backup_label and tablespace_map then.

> The kicker for me is that I can't see any reason to do any of this
> stuff. Including the information that we need to elide incremental
> stubs in some other way, say with one stub-list per directory, will be
> easier to implement and probably perform better. Like, I'm not saying
> we can't find a way to jam this into the manifest. But I'm fairly sure
> it's just making life difficult for ourselves.
> 
> I may ultimately lose this argument, as I did the one about whether
> the backup_manifest should be JSON or some bespoke format. And that's
> fine. I respect your opinion, and David's. But I also reserve the
> right to feel differently, and I do.

Do you mean 1 stub-list per pgdata + 1 per tablespaces?

Sure, it is important to respect and value each other's feelings, I never said
otherwise.

I don't really see how it would be faster to recursively go through each
sub-directory of the pgdata and tablespaces to gather all the pieces together
compared to reading 1 main file.
But I guess, choosing one option or the other, we will only find out how well
it works once people use it in the field and possibly give some feedback.

As you mentioned in [1], we're not going to start rewriting the implementation
a week after feature freeze, nor probably start building big new things now
anyway.
So maybe let's start with documenting the possible gotchas/corner cases to make 
our support life easier in the future.

Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaVxr_o3mrDBrBcXm3gowr9Qc4ABW-c73NR_201KkDavw%40mail.gmail.com




Re: Table AM Interface Enhancements

2024-04-16 Thread Andres Freund
Hi,

On 2024-04-16 08:31:24 -0400, Robert Haas wrote:
> On Tue, Apr 16, 2024 at 6:52 AM Alexander Korotkov  
> wrote:
> > Taking a closer look at acquire_sample_rows(), I think it would be
> > good if table AM implementation would care about block-level (or
> > whatever-level) sampling.  So that acquire_sample_rows() just fetches
> > tuples one-by-one from table AM implementation without any care about
> > blocks.  Possible table_beginscan_analyze() could take an argument of
> > target number of tuples, then those tuples are just fetches with
> > table_scan_analyze_next_tuple().  What do you think?
> 
> Andres is the expert here, but FWIW, that plan seems reasonable to me.
> One downside is that every block-based tableam is going to end up with
> a very similar implementation, which is kind of something I don't like
> about the tableam API in general: if you want to make something that
> is basically heap plus a little bit of special sauce, you have to copy
> a mountain of code. Right now we don't really care about that problem,
> because we don't have any other tableams in core, but if we ever do, I
> think we're going to find ourselves very unhappy with that aspect of
> things. But maybe now is not the time to start worrying. That problem
> isn't unique to analyze, and giving out-of-core tableams the
> flexibility to do what they want is better than not.

I think that can partially be addressed by having more "block oriented AM"
helpers in core, like we have for table_block_parallelscan*. Doesn't work for
everything, but should for something like analyze.

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-16 Thread Andres Freund
On 2024-04-16 13:33:53 +0300, Alexander Korotkov wrote:
> Reverted.

Thanks!




Re: Issue with the PRNG used by Postgres

2024-04-16 Thread Andres Freund
Hi,

On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund  wrote:
> > Here's a patch implementing this approach. I confirmed that before we 
> > trigger
> > the stuck spinlock logic very quickly and after we don't. However, if most
> > sleeps are interrupted, it can delay the stuck spinlock detection a good
> > bit. But that seems much better than triggering it too quickly.
>
> +1 for doing something about this. I'm not sure if it goes far enough,
> but it definitely seems much better than doing nothing.

One thing I started to be worried about is whether a patch ought to prevent
the timeout used by perform_spin_delay() from increasing when
interrupted. Otherwise a few signals can trigger quite long waits.

But as I can't quite see a way to make this accurate in the backbranches, I
suspect something like what I posted is still a good first version.
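One hedged sketch of the concern (the constants and shape are illustrative, not PostgreSQL's actual perform_spin_delay() logic): only escalate the backoff when a sleep actually completed, so signal-interrupted sleeps don't inflate later waits.

```python
# Toy backoff: double the delay after an uninterrupted sleep, but leave
# it unchanged when the sleep was cut short by a signal.

MIN_DELAY_US = 1000
MAX_DELAY_US = 1000 * 1000

def next_delay(cur_delay_us, interrupted):
    if interrupted:
        return cur_delay_us                      # don't escalate
    return min(cur_delay_us * 2, MAX_DELAY_US)   # capped exponential backoff

d = MIN_DELAY_US
for interrupted in [False, True, True, False]:
    d = next_delay(d, interrupted)
# only the two uninterrupted sleeps doubled the delay: 1000 -> 2000 -> 4000
```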


> Given your findings, I'm honestly kind of surprised that I haven't seen
> problems of this type more frequently.

Same. I did a bunch of searches for the error, but found surprisingly
little.

I think in practice most spinlocks just aren't contended enough to reach
perform_spin_delay(). And we have improved some on that over time.

Greetings,

Andres Freund




Re: Stack overflow issue

2024-04-16 Thread Andres Freund
Hi,

On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > I did minor revision of comments and code blocks order to improve the
> > > readability.
> >
> > After sending
> > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > I looked some more at important areas where changes didn't have code
> > coverage. One thing I noticed was that the "non-internal" part of
> > AbortCurrentTransaction() is uncovered:
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> >
> > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > some parts are handled in 
> > CommitCurrentTransaction()/AbortCurrentTransaction()
> > and others are in the *Internal functions.
> >
> > I understand that fefd9a3fed2 needed to remove the recursion in
> > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> > why that means having some code in the non-internal and some in the
> > internal functions?  Wouldn't it be easier to just have all the state 
> > handling
> > code in the Internal() function and just break after the
> > CleanupSubTransaction() calls?
> 
> I'm not sure I correctly get what you mean.  Do you think the attached
> patch matches the direction you're pointing?  The patch itself is not
> final, it requires cleanup and comments revision, just to check the
> direction.

Something like that, yea. The split does seem less confusing that way to me,
but also not 100% certain.

Greetings,

Andres Freund




Re: Combine headerscheck and cpluspluscheck scripts

2024-04-16 Thread Anton Voloshin

Hello, hackers,

On 10/03/2024 12:03, Peter Eisentraut wrote:
> Committed, thanks.


This commit (7b8e2ae2f) has turned the cpluspluscheck script into a
--cplusplus option for headerscheck.  I propose updating
src/tools/pginclude/README accordingly; please see the attached patch.


--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

From a50e58f117341e8a9df5f97fa05630f7b8f4bd86 Mon Sep 17 00:00:00 2001
From: Anton Voloshin 
Date: Tue, 16 Apr 2024 17:19:28 +0300
Subject: [PATCH] Update src/tools/pginclude/README to match recent changes to
 cpluspluscheck

7b8e2ae2f has turned cpluspluscheck from a separate script into
a --cplusplus option for headerscheck.  Update README accordingly.
---
 src/tools/pginclude/README | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/tools/pginclude/README b/src/tools/pginclude/README
index 712eca76fb3..65372057dad 100644
--- a/src/tools/pginclude/README
+++ b/src/tools/pginclude/README
@@ -1,10 +1,10 @@
 src/tools/pginclude/README
 
-NOTE: headerscheck and cpluspluscheck are in current use, and any
-problems they find should generally get fixed.  The other scripts
-in this directory have not been used in some time, and have issues.
-pgrminclude in particular has a history of creating more problems
-than it fixes.  Be very wary of applying their results blindly.
+NOTE: headerscheck and headerscheck --cplusplus are in current use, and any
+problems they find should generally get fixed.  The other scripts in this
+directory have not been used in some time, and have issues.  pgrminclude in
+particular has a history of creating more problems than it fixes.  Be very
+wary of applying their results blindly.
 
 
 pginclude
@@ -84,16 +84,16 @@ prerequisite, even if postgres_fe.h or c.h would be more appropriate.
 Also note that the contents of macros are not checked; this is intentional.
 
 
-cpluspluscheck
-==============
+headerscheck --cplusplus
+========================
 
-This script can be run to verify that all Postgres include files meet
-the project convention that they will compile as C++ code.  Although
-the project's coding language is C, some people write extensions in C++,
-so it's helpful for include files to be C++-clean.
+The headerscheck in --cplusplus mode can be run to verify that all Postgres
+include files meet the project convention that they will compile as C++
+code.  Although the project's coding language is C, some people write
+extensions in C++, so it's helpful for include files to be C++-clean.
 
 A small number of header files are exempted from this requirement,
-and are skipped by the cpluspluscheck script.
+and are skipped by the script in the --cplusplus mode.
 
 The easy way to run the script is to say "make -s cpluspluscheck" in
 the top-level build directory after completing a build.  You should
-- 
2.43.2



Re: Use streaming read API in ANALYZE

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for looking into this!
>
> On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
> >
> > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > > Streaming API has been committed but the committed version has a minor
> > > change, the read_stream_begin_relation function takes Relation instead
> > > of BufferManagerRelation now. So, here is a v5 which addresses this
> > > change.
> >
> > I'm getting a repeatable segfault / assertion failure with this:
> >
> > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> > CREATE TABLE
> > postgres=# insert into tengiga select g, repeat('x', 900) from
> > generate_series(1, 140) g;
> > INSERT 0 140
> > postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> > SET
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> > "heapam_handler.c", Line: 1079, PID: 262232
> > postgres: heikki postgres [local]
> > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> > postgres: heikki postgres [local]
> > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> > postgres: heikki postgres [local]
> > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> > postgres: heikki postgres [local]
> > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> > postgres: heikki postgres [local]
> > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> > postgres: heikki postgres [local]
> > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> > postgres: heikki postgres [local]
> > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> > 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> > was terminated by signal 6: Aborted
>
> I ran into the same error while working on Jakub's benchmarking results.
>
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return, as the block sampling
> algorithm decides how many blocks to return on the fly by using some
> random seeds.

I wanted to re-check this problem and I realized that I was wrong. I
tried using nblocks again and this time there was no failure. I looked
at the block sampling logic and I am pretty sure that the
BlockSampler_Init() function correctly returns the number of blocks
that BlockSampler_Next() will return. It seems 158f581923 fixed this
issue as well.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: pg_combinebackup does not detect missing files

2024-04-16 Thread Robert Haas
On Wed, Apr 10, 2024 at 9:36 PM David Steele  wrote:
> I've been playing around with the incremental backup feature trying to
> get a sense of how it can be practically used. One of the first things I
> always try is to delete random files and see what happens.
>
> You can delete pretty much anything you want from the most recent
> incremental backup (not the manifest) and it will not be detected.

Sure, but you can also delete anything you want from the most recent
non-incremental backup and it will also not be detected. There's no
reason at all to hold incremental backup to a higher standard than we
do in general.

> Maybe the answer here is to update the docs to specify that
> pg_verifybackup should be run on all backup directories before
> pg_combinebackup is run. Right now that is not at all clear.

I don't want to make those kinds of prescriptive statements. If you
want to verify the backups that you use as input to pg_combinebackup,
you can use pg_verifybackup to do that, but it's not a requirement.
I'm not averse to having some kind of statement in the documentation
along the lines of "Note that pg_combinebackup does not attempt to
verify that the individual backups are intact; for that, use
pg_verifybackup." But I think it should be blindingly obvious to
everyone that you can't go whacking around the inputs to a program and
expect to get perfectly good output. I know it isn't blindingly
obvious to everyone, which is why I'm not averse to adding something
like what I just mentioned, and maybe it wouldn't be a bad idea to
document in a few other places that you shouldn't randomly remove
files from the data directory of your live cluster, either, because
people seem to keep doing it, but really, the expectation that you
can't just blow files away and expect good things to happen afterward
should hardly need to be stated.

I think it's very easy to go overboard with warnings of this type.
Weird stuff comes to me all the time because people call me when the
weird stuff happens, and I'm guessing that your experience is similar.
But my actual personal experience, as opposed to the cases reported to
me by others, practically never features files evaporating into the
ether. If I read a documentation page for PostgreSQL or any other
piece of software that made it sound like that was a normal
occurrence, I'd question the technical acumen of the authors. And if I
saw such warnings only for one particular feature of a piece of
software and not anywhere else, I'd wonder why the authors of the
software were trying so hard to scare me off the use of that
particular feature. I don't trust at all that incremental backup is
free of bugs -- but I don't trust that all the code anyone else has
written is free of bugs, either.

> Overall I think it is an issue that the combine is being driven from the
> most recent incremental directory rather than from the manifest.

I don't. I considered that design and rejected it for various reasons
that I still believe to be good. Even if I was wrong, we're not going
to start rewriting the implementation a week after feature freeze.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Typos in the code and README

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Mon, 15 Apr 2024 at 15:26, Daniel Gustafsson  wrote:
>
> > On 14 Apr 2024, at 13:19, David Rowley  wrote:
> >
> > On Sat, 13 Apr 2024 at 09:17, Daniel Gustafsson  wrote:
> >>
> >>> On 12 Apr 2024, at 23:15, Heikki Linnakangas  wrote:
> >>> Here's a few more. I've accumulate these over the past couple of months, 
> >>> keeping them stashed in a branch, adding to it whenever I've spotted a 
> >>> minor typo while reading the code.
> >>
> >> Nice, let's lump all these together.
> >
> > Here are a few additional ones to add to that.
>
> Thanks.  Collecting all the ones submitted here, as well as a few submitted
> off-list by Alexander, the patch is now a 3-part patchset of cleanups:
>
> 0001 contains the typos and duplicate words fixups, 0002 fixes a parameter 
> with
> the wrong name in the prototype and 0003 removes a leftover prototype which 
> was
> accidentally left in a refactoring.

I realized two small typos: 'sgmr' -> 'smgr'. You may want to include
them in 0001.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-16 Thread Robert Haas
On Mon, Apr 15, 2024 at 10:12 PM David Steele  wrote:
> Anyway, I think it should be fixed or documented as a caveat since it
> causes a hard failure on restore.

Alright, I'll look into this.

> I know Tomas added some optimizations that work best with --no-manifest
> but if we can eventually read compressed tars (which I expect to be the
> general case) then those optimizations are not very useful.

My belief is that those optimizations work fine with or without
manifests; you only start to lose the benefit in cases where you use
different checksum types for different backups that you then try to
combine. Which should hopefully be a rare case.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-16 Thread Robert Haas
On Tue, Apr 16, 2024 at 3:10 AM Stefan Fercot
 wrote:
> > > But ... I didn't really end up feeling very comfortable with it. Right
> > > now, the backup manifest is something we only use to verify the
> > > integrity of the backup. If we were to do this, it would become a
> > > critical part of the backup.
>
> Isn't it already the case? I mean, you need the manifest of the previous 
> backup to take an incremental one, right?
> And shouldn't we encourage to verify the backup sets before (at least) trying 
> to combine them?
> It's not because a file was only used for one specific purpose until now that
> we can't improve it later.
> Splitting the meaningful information across multiple places would be more 
> error-prone (for both devs and users) imo.

Well, right now, if you just take a full backup, and you throw away
the backup manifest because you don't care, you have a working full
backup. Furthermore, if you took any incremental backups based on that
full backup before discarding the manifest, you can still restore
them. Now, it is possible that nobody in the world cares about those
properties other than me; I have been known to (a) care about weird
stuff and (b) be a pedant. However, we've already shipped a couple of
releases where backup manifests were thoroughly non-critical: you
needed them to run pg_verifybackup, and for nothing else. I think it's
quite likely that there are users out there who are used to things
working in that way, and I'm not sure that those users will adjust
their expectations when a new feature comes out. I also feel that if I
were a user, I would think of something called a "manifest" as just a
table of contents for whatever the thing was. I still remember
downloading tar files from the Internet in the 1990s and there'd be a
file in the tarball sometimes called MANIFEST which was, you know, a
list of what was in the tarball. You didn't need that file for
anything functional; it was just so you could check if anything was
missing.

What I fear is that this will turn into another situation like we had
with pg_xlog, where people saw "log" in the name and just blew it
away. Matter of fact, I recently encountered one of my few recent
examples of someone doing that thing since the pg_wal renaming
happened. Some users don't take much convincing to remove anything
that looks inessential. And what I'm particularly worried about with
this feature is tar-format backups. If you have a directory format
backup and you do an "ls", you're going to see a whole bunch of files
in there of which backup_manifest will be one. How you treat that file
is just going to depend on what you know about its purpose. But if you
have a tar-format backup, possibly compressed, the backup_manifest
file stands out a lot more. You may have something like this:

backup_manifest root.tar.gz 16384.tar.gz

Well, at this point, it becomes much more likely that you're going to
think that there are special rules for the backup_manifest file.

The kicker for me is that I can't see any reason to do any of this
stuff. Including the information that we need to elide incremental
stubs in some other way, say with one stub-list per directory, will be
easier to implement and probably perform better. Like, I'm not saying
we can't find a way to jam this into the manifest. But I'm fairly sure
it's just making life difficult for ourselves.

I may ultimately lose this argument, as I did the one about whether
the backup_manifest should be JSON or some bespoke format. And that's
fine. I respect your opinion, and David's. But I also reserve the
right to feel differently, and I do. And I would also just gently
point out that my level of motivation to work on a particular feature
can depend quite a bit on whether I'm going to be forced to implement
it in a way that I disagree with.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Replace magic constant 3 with NUM_MERGE_MATCH_KINDS

2024-04-16 Thread Dean Rasheed
On Tue, 16 Apr 2024 at 11:35, Richard Guo  wrote:
>
> On Tue, Apr 16, 2024 at 5:48 PM Aleksander Alekseev 
>  wrote:
>>
>> Oversight of 0294df2f1f84 [1].
>>
>> [1]: 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0294df2f1f84
>
> +1.  I think this change improves the code quality.  I searched for
> other arrays indexed by merge match kind, but found none.  So this patch
> seems thorough.
>

Yes, this makes sense, though I note that some other similar code uses
a #define rather than inserting an enum element at the end (e.g.,
NUM_ROWFILTER_PUBACTIONS).

I guess the argument against inserting an enum element at the end is
that a switch statement on the enum value might generate a compiler
warning if it didn't have a default clause.

Looking at how NUM_ROWFILTER_PUBACTIONS is defined as the last element
plus one, it might seem to be barely any better than just defining it
to be 3, since any new enum element would probably be added at the
end, requiring it to be updated in any case. But if the number of
elements were much larger, it would be much more obviously correct,
making it a good general pattern to follow. So in the interests of
code consistency, I think we should do the same here.

Regards,
Dean




Re: Stack overflow issue

2024-04-16 Thread Alexander Korotkov
On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > I did minor revision of comments and code blocks order to improve the
> > readability.
>
> After sending
> https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> I looked some more at important areas where changes didn't have code
> coverage. One thing I noticed was that the "non-internal" part of
> AbortCurrentTransaction() is uncovered:
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
>
> Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> and others are in the *Internal functions.
>
> I understand that fefd9a3fed2 needed to remove the recursion in
> CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> why that means having some code in in the non-internal and some in the
> internal functions?  Wouldn't it be easier to just have all the state handling
> code in the Internal() function and just break after the
> CleanupSubTransaction() calls?

I'm not sure I correctly get what you mean.  Do you think the attached
patch matches the direction you're pointing?  The patch itself is not
final, it requires cleanup and comments revision, just to check the
direction.

--
Regards,
Alexander Korotkov


abort_commit_transaction.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-04-16 Thread Robert Haas
On Tue, Apr 16, 2024 at 6:52 AM Alexander Korotkov  wrote:
> Taking a closer look at acquire_sample_rows(), I think it would be
> good if table AM implementation would care about block-level (or
> whatever-level) sampling.  So that acquire_sample_rows() just fetches
> tuples one-by-one from table AM implementation without any care about
> blocks.  Possibly table_beginscan_analyze() could take an argument of
> target number of tuples, then those tuples are just fetched with
> table_scan_analyze_next_tuple().  What do you think?

Andres is the expert here, but FWIW, that plan seems reasonable to me.
One downside is that every block-based tableam is going to end up with
a very similar implementation, which is kind of something I don't like
about the tableam API in general: if you want to make something that
is basically heap plus a little bit of special sauce, you have to copy
a mountain of code. Right now we don't really care about that problem,
because we don't have any other tableams in core, but if we ever do, I
think we're going to find ourselves very unhappy with that aspect of
things. But maybe now is not the time to start worrying. That problem
isn't unique to analyze, and giving out-of-core tableams the
flexibility to do what they want is better than not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Table AM Interface Enhancements

2024-04-16 Thread Pavel Borisov
On Tue, 16 Apr 2024 at 14:52, Alexander Korotkov 
wrote:

> On Mon, Apr 15, 2024 at 11:17 PM Andres Freund  wrote:
> > On 2024-04-15 16:02:00 -0400, Robert Haas wrote:
> > > Do you want that patch applied, not applied, or applied with some set
> of
> > > modifications?
> >
> > I think we should apply Alexander's proposed revert and then separately
> > discuss what we should do about 041b96802ef.
>
> Taking a closer look at acquire_sample_rows(), I think it would be
> good if table AM implementation would care about block-level (or
> whatever-level) sampling.  So that acquire_sample_rows() just fetches
> tuples one-by-one from table AM implementation without any care about
> blocks.  Possibly table_beginscan_analyze() could take an argument of
> target number of tuples, then those tuples are just fetched with
> table_scan_analyze_next_tuple().  What do you think?
>
Hi, Alexander!

I like the idea of splitting abstraction levels for:
1. acquirefuncs (FDW or physical table)
2. new specific row fetch functions (similar to the existing
_scan_analyze_next_tuple()) that could be AM-specific.

Then scan_analyze_next_block() or another iteration algorithm would be
contained inside the table AM implementation of _scan_analyze_next_tuple().

So, the init of the scan state would be inside the table AM implementation
of _beginscan_analyze(). Scan state (like BlockSamplerData or other state
that could be custom for the AM) could be transferred from
_beginscan_analyze() to _scan_analyze_next_tuple() by some opaque
AM-specific data structure. If so, we'll also need an AM-specific
table_endscan_analyze to clean it up.

Regards,
Pavel


Re: Disallow changing slot's failover option in transaction block

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 1:45 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Hou,
>
> > Kuroda-San reported an issue off-list that:
> >
> > If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> > and rollback, only the subscription option change can be rolled back, while 
> > the
> > replication slot's failover change is preserved.
> >
> > This is because ALTER SUBSCRIPTION SET (failover) command internally
> > executes
> > the replication command ALTER_REPLICATION_SLOT to change the replication
> > slot's
> > failover property, but this replication command execution cannot be
> > rollback.
> >
> > To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> > (failover) inside a txn block, which is also consistent to the ALTER
> > SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> > patch to address this.
>
> Thanks for posting the patch, the fix is same as my expectation.
> Also, should we add the restriction to the doc? I feel [1] can be updated.

+1.

Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
because we call alter_replication_slot in CREATE SUB as well, for the
case when slot_name is provided and create_slot=false. But the tricky
part is that we call alter_replication_slot() when creating a
subscription for both failover=true and false. That means if we want
to fix it along the same lines as ALTER SUB, we have to disallow the
user from executing CREATE SUBSCRIPTION (slot_name = xx) in a txn
block, which seems to break some existing use cases (previously, a
user could execute such a command inside a txn block).

So, we need to think if there are better ways to fix it.  After
discussion with Hou-San offlist, here are some ideas:

1. do not alter the replication slot's failover option when CREATE
SUBSCRIPTION is run WITH failover=false. This means we alter the
replication slot only when failover is set to true. And thus we can
disallow CREATE SUB WITH (slot_name = xx, failover=true,
create_slot=false) inside a txn block.

This option allows the user to run CREATE SUB (create_slot=false) with
failover=false in a txn block as before. But on the downside, it makes
the behavior inconsistent for an otherwise simple option like failover,
i.e. with failover=true, CREATE SUB is not allowed in a txn block while
with failover=false it is allowed. This makes it slightly harder for
users to understand.

2. let's not disallow CREATE SUB in a txn block as earlier; just don't
perform the internal alter-failover during CREATE SUB for existing
slots (create_slot=false, slot_name=xx), i.e. when create_slot is
false, we will ignore the failover parameter of CREATE SUB and it is
the user's responsibility to set it appropriately using the ALTER SUB
command. For create_slot=true, the behaviour of CREATE SUB remains the
same as earlier.

This option does not add a new restriction for CREATE SUB with respect
to txn blocks. In the context of failover with create_slot=false, we
already have a similar (documented) restriction for ALTER SUB, i.e.
with 'ALTER SUBSCRIPTION SET(slot_name = new)', the user needs to
alter the new slot's failover themselves. CREATE SUB can be documented
in a similar way. This seems simpler to understand considering the
existing ALTER SUB behavior as well. Plus, this will make the CREATE
SUB code slightly simpler and thus more easily manageable.

3. add an alter_slot option for CREATE SUBSCRIPTION; we alter the
slot's failover only if alter_slot=true. And so we can disallow CREATE
SUB WITH (slot_name = xx, alter_slot=true) inside a txn block.

This seems a clean way, as everything will be as per the user's
consent based on the alter_slot parameter. But on the downside, it
requires introducing an additional parameter and also adds a new
restriction on running CREATE SUB in a txn block for a specific case.

4. Don't alter the replication slot in subscription DDLs. Instead, try
to alter the replication slot's failover in the apply worker. This
means we need to execute alter_replication_slot each time before
starting streaming in the apply worker.

This does not seem appealing, as it executes alter_replication_slot
every time the apply worker starts. But if others think it is a better
option, it can be analyzed further.


Currently, our preference is option 2, as that looks like a clean
solution and also aligns with the documented ALTER SUB behavior.
Thoughts?


Note that we cannot refer to the design of two_phase here, because
two_phase can be considered a streaming option, so it's fine to change
two_phase along with the START_REPLICATION command (two_phase is not
changed in subscription DDLs, but gets changed in the
START_REPLICATION command). But failover is closely related to the
replication slot itself.



Thanks
Shveta




Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

I am working on using read streams in the CREATE DATABASE command when the
strategy is wal_log. The RelationCopyStorageUsingBuffer() function is used
in this context. This function reads source buffers and then copies them to
the destination buffers. I used read streams only when reading source
buffers, because the destination buffers are read with the
'RBM_ZERO_AND_LOCK' option, so streaming them is not important.

I created a ~6 GB table [1] and created a new database with the wal_log
strategy using the database that table was created in as a template [2]. My
benchmarking results are:

a. Timings:

patched:
12955.027 ms
12917.475 ms
13177.846 ms
12971.308 ms
13059.985 ms

master:
13156.375 ms
13054.071 ms
13151.607 ms
13152.633 ms
13160.538 ms

There is no meaningful difference in timings; the patched version is a
tiny bit better, but the gap is negligible. I actually expected the
patched version to be better
because there was no prefetching before, but the read stream API detects
sequential access and disables prefetching.

b. strace:

patched:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.02    3.749359           2   1285782           pwrite64
 18.54    1.021734          21     46730           preadv
  9.49    0.522889         826       633           fdatasync
  2.55    0.140339          59      2368           pwritev
  1.14    0.062583         409       153           fsync

master:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.71    3.831542           2   1288365           pwrite64
 29.84    1.914540           2    747936           pread64
  7.90    0.506843         837       605           fdatasync
  1.58    0.101575          54      1856           pwritev
  0.75    0.048431         400       121           fsync

There are fewer (~1/16) read system calls in the patched version.

c. perf:

patched:

-   97.83%  1.13%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - 97.83% RelationCopyStorageUsingBuffer
      - 44.28% ReadBufferWithoutRelcache
         + 42.20% GetVictimBuffer
           0.81% ZeroBuffer
      + 31.86% log_newpage_buffer
      - 19.51% read_stream_next_buffer
         - 17.92% WaitReadBuffers
            + 17.61% mdreadv
         - 1.47% read_stream_start_pending_read
            + 1.46% StartReadBuffers

master:

-   97.68%  0.57%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - RelationCopyStorageUsingBuffer
      - 65.48% ReadBufferWithoutRelcache
         + 41.16% GetVictimBuffer
         - 20.42% WaitReadBuffers
            + 19.90% mdreadv
         + 1.85% StartReadBuffer
           0.75% ZeroBuffer
      + 30.82% log_newpage_buffer

Patched version spends less CPU time in read calls and more CPU time in
other calls such as write.

There are three patch files attached. The first two are an optimization and
a way to create a read stream object using an SMgrRelation; these are
already proposed in the streaming I/O thread [3]. The third one is the
actual patch file.

Any kind of feedback would be appreciated.

[1] CREATE TABLE t as select repeat('a', 100) || i || repeat('b', 500) as
filler from generate_series(1, 900) as i;
[2] CREATE DATABASE test_1 STRATEGY 'wal_log' TEMPLATE test;
[3]
https://www.postgresql.org/message-id/CAN55FZ1yGvCzCW_aufu83VimdEYHbG_zuOY3J9JL-nBptyJyKA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From edcfacec40a70a8747c5f18777b6c28b0fccba7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 901b7230fb9..17cb7127f99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc 

Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread Amul Sul
On Tue, Apr 16, 2024 at 3:44 PM David Rowley  wrote:

> On Tue, 16 Apr 2024 at 17:13, Amul Sul  wrote:
> > Attached is a small patch adding the missing BumpContext description to
> the
> > README.
>
> Thanks for noticing and working on the patch.
>
> There were a few things that were not quite accurate or are misleading:
>
> 1.
>
> > +These three memory contexts aim to free memory back to the operating
> system
>
> That's not true for bump. It's the worst of the 4.  Worse than aset.
> It only returns memory when the context is reset/deleted.
>
> 2.
>
> "These memory contexts were initially developed for ReorderBuffer, but
> may be useful elsewhere as long as the allocation patterns match."
>
> The above isn't true for bump.  It was written for tuplesort.  I think
> we can just remove that part now.  Slab and generation are both old
> enough not to care why they were conceived.
>
> Also since adding bump, I think the choice of which memory context to
> use is about 33% harder than it used to be when there were only 3
> context types.  I think this warrants giving more detail on what these
> 3 special-purpose memory allocators are good for.  I've added more
> details in the attached patch.  This includes more details about
> freeing malloc'd blocks
>
> I've tried to detail out enough of the specialities of the context
> type without going into extensive detail. My hope is that there will
> be enough detail for someone to choose the most suitable looking one
> and head over to the corresponding .c file to find out more.
>
> Is that about the right level of detail?
>

Yes, it looks much better now, thank you.

Regards,
Amul


Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread David Rowley
On Mon, 8 Apr 2024 at 00:37, David Rowley  wrote:
> I've now pushed all 3 patches.   Thank you for all the reviews on
> these and for the extra MemoryContextMethodID bit, Matthias.

I realised earlier today when working on [1] that bump makes a pretty
brain-dead move when adding dedicated blocks to the blocks list.  The
problem is that I opted to not have a current block field in
BumpContext and just rely on the head pointer of the blocks list to be
the "current" block.  The head block is the block we look at to see if
we've any space left when new allocations come in.  The problem there
is when adding a dedicated block in BumpAllocLarge(), the code adds
this to the head of the blocks list so that when a new allocation
comes in that's normal-sized, the block at the top of the list is full
and we have to create a new block for the allocation.

The attached fixes this by pushing these large/dedicated blocks to the
*tail* of the blocks list.  This means the partially filled block
remains at the head and is available for any new allocation which will
fit.  This behaviour is evident from the regression test change that I
added earlier today when working on [1].  The 2nd and smaller
allocation in that test goes onto the keeper block rather than a new
block.

I plan to push this tomorrow.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=bea97cd02ebb347ab469b78673c2b33a72109669
From 39a60420a83e5a059de50869c0981be9732909ab Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 16 Apr 2024 22:26:00 +1200
Subject: [PATCH v1] Push dedicated BumpBlocks to the tail of the blocks list

BumpContext relies on using the head block from its 'blocks' field to
use as the current block to allocate new chunks to.  When we receive an
allocation request larger than allocChunkLimit, we place these chunks on
a new dedicated block and push this block onto the head of the 'blocks'
list.  This behavior caused the previous bump block to no longer be
available for new normal-sized (non-large) allocations and would result
in wasting space on blocks when a large request arrived before the
current block was full.

Here adjust the code to push these dedicated blocks onto the *tail* of
the blocks list so that the head block remains intact and available to
be used by normal allocation request sizes.

Discussion: 
https://postgr.es/m/caaphdvp9___r-ayjj0nz6gd3mecgwgz0_6zptwpwj+zqhtm...@mail.gmail.com
---
 src/backend/utils/mmgr/bump.c  | 8 ++--
 src/test/regress/expected/sysviews.out | 2 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c
index 449bd29344..83798e2245 100644
--- a/src/backend/utils/mmgr/bump.c
+++ b/src/backend/utils/mmgr/bump.c
@@ -342,8 +342,12 @@ BumpAllocLarge(MemoryContext context, Size size, int flags)
randomize_mem((char *) MemoryChunkGetPointer(chunk), size);
 #endif
 
-   /* add the block to the list of allocated blocks */
-   dlist_push_head(&set->blocks, &block->node);
+   /*
+    * Add the block to the tail of allocated blocks list.  The current block
+    * is left at the head of the list as it may still have space for
+    * non-large allocations.
+    */
+   dlist_push_tail(&set->blocks, &block->node);
 
 #ifdef MEMORY_CONTEXT_CHECKING
/* Ensure any padding bytes are marked NOACCESS. */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 634dc8d8b8..2f3eb4e7f1 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -47,7 +47,7 @@ select name, parent, total_bytes > 0, total_nblocks, free_bytes > 0, free_chunks
 from pg_backend_memory_contexts where name = 'Caller tuples';
      name      |     parent     | ?column? | total_nblocks | ?column? | free_chunks 
 ---------------+----------------+----------+---------------+----------+-------------
- Caller tuples | TupleSort sort | t        |             3 | t        |           0
+ Caller tuples | TupleSort sort | t        |             2 | t        |           0
 (1 row)
 
 rollback;
-- 
2.40.1



Re: Table AM Interface Enhancements

2024-04-16 Thread Alexander Korotkov
On Mon, Apr 15, 2024 at 11:17 PM Andres Freund  wrote:
> On 2024-04-15 16:02:00 -0400, Robert Haas wrote:
> > Do you want that patch applied, not applied, or applied with some set of
> > modifications?
>
> I think we should apply Alexander's proposed revert and then separately
> discuss what we should do about 041b96802ef.

Taking a closer look at acquire_sample_rows(), I think it would be
good if the table AM implementation took care of block-level (or
whatever-level) sampling, so that acquire_sample_rows() just fetches
tuples one-by-one from the table AM implementation without any care
about blocks.  Possibly table_beginscan_analyze() could take the
target number of tuples as an argument, and those tuples would then
just be fetched with table_scan_analyze_next_tuple().  What do you
think?
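For reference, block-free tuple-level sampling of the kind sketched above
is usually done with reservoir sampling (Algorithm R), which visits each
item once and keeps a uniform sample of unknown-length input.  The
standalone sketch below is not PostgreSQL code and not a concrete API
proposal; the function name and integer "tuples" are invented:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Reservoir sampling (Algorithm R): keep a uniform sample of
 * 'targrows' items from a stream of unknown length, visiting each
 * item exactly once -- no block numbers involved.
 */
static int sample_stream(const int *tuples, int ntuples,
                         int *reservoir, int targrows, unsigned int seed)
{
    int nkept = 0;

    srand(seed);
    for (int i = 0; i < ntuples; i++)
    {
        if (nkept < targrows)
            reservoir[nkept++] = tuples[i];   /* fill the reservoir first */
        else
        {
            int j = rand() % (i + 1);         /* uniform in [0, i] */

            if (j < targrows)
                reservoir[j] = tuples[i];     /* replace a sampled item */
        }
    }
    return nkept;   /* number of sampled rows actually collected */
}
```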

--
Regards,
Alexander Korotkov




Re: [PATCH] Replace magic constant 3 with NUM_MERGE_MATCH_KINDS

2024-04-16 Thread Richard Guo
On Tue, Apr 16, 2024 at 5:48 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Oversight of 0294df2f1f84 [1].
>
> [1]:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0294df2f1f84


+1.  I think this change improves the code quality.  I searched for
other arrays indexed by merge match kind, but found none.  So this patch
seems thorough.
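For readers unfamiliar with the idiom, the commit replaces a hard-coded 3
with a constant derived from the enum's last member, so array sizes track
the enum automatically.  A minimal standalone illustration (the enum values
mirror the commit, but the helper function is invented for testing):

```c
#include <assert.h>

/* Mirrors the idiom from 0294df2f1f84: derive the array size from the
 * enum instead of hard-coding a magic 3. */
typedef enum MergeMatchKind
{
    MERGE_WHEN_MATCHED,
    MERGE_WHEN_NOT_MATCHED_BY_SOURCE,
    MERGE_WHEN_NOT_MATCHED_BY_TARGET
} MergeMatchKind;

#define NUM_MERGE_MATCH_KINDS (MERGE_WHEN_NOT_MATCHED_BY_TARGET + 1)

/* An array indexed by match kind now grows with the enum. */
static int match_counts[NUM_MERGE_MATCH_KINDS];

static int num_kinds(void)
{
    return NUM_MERGE_MATCH_KINDS;
}
```

Adding a fourth match kind to the enum would resize `match_counts`
without touching any other line.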

Thanks
Richard


Re: Table AM Interface Enhancements

2024-04-16 Thread Alexander Korotkov
On Mon, Apr 15, 2024 at 11:11 PM Andres Freund  wrote:
> On 2024-04-12 01:04:03 +0300, Alexander Korotkov wrote:
> > 1) If we just apply my revert patch and leave c6fc50cb4028 and
> > 041b96802ef in the tree, then we get our table AM API narrowed.  As
> > you expressed the current API requires block numbers to be 1:1 with
> > the actual physical on-disk location [2].  Not a secret I think the
> > current API is quite restrictive.  And we're getting the ANALYZE
> > interface narrower than it was since 737a292b5de.  Frankly speaking, I
> > don't think this is acceptable.
>
> As others already pointed out, c6fc50cb4028 was committed quite a while
> ago. I'm fairly unhappy about c6fc50cb4028, fwiw, but didn't realize that
> until it was too late.

+1

> > In token of all of the above, is the in-tree state that bad? (if we
> > abstract the way 27bc1772fc and dd1f6b0c17 were committed).
>
> To me the 27bc1772fc doesn't make much sense on its own. You added calls
> directly to heapam internals to a file in src/backend/commands/, that just
> doesn't make sense.
>
> Leaving that aside, I think the interface isn't good on its own:
> table_relation_analyze() doesn't actually do anything, it just sets callbacks,
> that then later are called from analyze.c, which doesn't at all fit to the
> name of the callback/function.  I realize that this is kinda cribbed from the
> FDW code, but I don't think that is a particularly good excuse.
>
> I don't think dd1f6b0c17 improves the situation, at all. It sets global
> variables to redirect how an individual acquire_sample_rows invocation
> works:
> void
> block_level_table_analyze(Relation relation,
>   AcquireSampleRowsFunc *func,
>   BlockNumber *totalpages,
>   BufferAccessStrategy 
> bstrategy,
>   ScanAnalyzeNextBlockFunc 
> scan_analyze_next_block_cb,
>   ScanAnalyzeNextTupleFunc 
> scan_analyze_next_tuple_cb)
> {
> *func = acquire_sample_rows;
> *totalpages = RelationGetNumberOfBlocks(relation);
> vac_strategy = bstrategy;
> scan_analyze_next_block = scan_analyze_next_block_cb;
> scan_analyze_next_tuple = scan_analyze_next_tuple_cb;
> }
>
> Notably it does so within the ->relation_analyze tableam callback, which does
> *NOT* actually do anything other than returning a callback.  So if
> ->relation_analyze() for another relation is called, the acquire_sample_rows()
> for the earlier relation will do something different.  Note that this isn't a
> theoretical risk, acquire_inherited_sample_rows() actually collects the
> acquirefunc for all the inherited relations before calling acquirefunc.

You're right.  No sense trying to fix this.  Reverted.
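The clobbering Andres describes can be reproduced in miniature: callbacks
stashed in a global by one ->relation_analyze() call are overwritten by the
next one before the collected acquirefunc ever runs.  Everything below is an
illustrative standalone model with invented names, not PostgreSQL code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the dd1f6b0c17 design: ->relation_analyze()
 * stores per-relation callbacks in globals that a shared
 * acquire_sample_rows() reads later. */
typedef int (*NextTupleFn)(void);

static NextTupleFn scan_analyze_next_tuple;     /* the problematic global */

static int heap_next_tuple(void)     { return 1; }
static int other_am_next_tuple(void) { return 2; }

typedef struct { void (*analyze)(void); } Relation;

static void heap_relation_analyze(void)  { scan_analyze_next_tuple = heap_next_tuple; }
static void other_relation_analyze(void) { scan_analyze_next_tuple = other_am_next_tuple; }

/* acquire_inherited_sample_rows() first collects acquirefuncs for all
 * child relations, then calls them -- by then the global belongs to
 * whichever relation was analyzed last. */
static int sample_first_child(void)
{
    Relation heap_child  = { heap_relation_analyze };
    Relation other_child = { other_relation_analyze };

    heap_child.analyze();       /* sets the global for the heap child */
    other_child.analyze();      /* ...and immediately clobbers it */

    return scan_analyze_next_tuple();   /* meant for heap, runs other AM */
}
```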

--
Regards,
Alexander Korotkov




Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 17:13, Amul Sul  wrote:
> Attached is a small patch adding the missing BumpContext description to the
> README.

Thanks for noticing and working on the patch.

There were a few things that were not quite accurate or are misleading:

1.

> +These three memory contexts aim to free memory back to the operating system

That's not true for bump. It's the worst of the 4.  Worse than aset.
It only returns memory when the context is reset/deleted.

2.

"These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match."

The above isn't true for bump.  It was written for tuplesort.  I think
we can just remove that part now.  Slab and generation are both old
enough not to care why they were conceived.

Also since adding bump, I think the choice of which memory context to
use is about 33% harder than it used to be when there were only 3
context types.  I think this warrants giving more detail on what these
3 special-purpose memory allocators are good for.  I've added more
details in the attached patch.  This includes more details about
freeing malloc'd blocks.

I've tried to spell out enough of the specifics of each context
type without going into extensive detail.  My hope is that there will
be enough detail for someone to choose the most suitable-looking one
and head over to the corresponding .c file to find out more.

Is that about the right level of detail?

David
diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
index f484f7d6f5..695088bb66 100644
--- a/src/backend/utils/mmgr/README
+++ b/src/backend/utils/mmgr/README
@@ -471,25 +471,32 @@ thrashing.
 Alternative Memory Context Implementations
 --
 
-aset.c is our default general-purpose implementation, working fine
-in most situations. We also have two implementations optimized for
-special use cases, providing either better performance or lower memory
-usage compared to aset.c (or both).
-
-* slab.c (SlabContext) is designed for allocations of fixed-length
-  chunks, and does not allow allocations of chunks with different size.
-
-* generation.c (GenerationContext) is designed for cases when chunks
-  are allocated in groups with similar lifespan (generations), or
-  roughly in FIFO order.
-
-Both memory contexts aim to free memory back to the operating system
-(unlike aset.c, which keeps the freed chunks in a freelist, and only
-returns the memory when reset/deleted).
-
-These memory contexts were initially developed for ReorderBuffer, but
-may be useful elsewhere as long as the allocation patterns match.
-
+aset.c (AllocSetContext) is our default general-purpose allocator.  Three other
+allocator types also exist which are special-purpose:
+
+* slab.c (SlabContext) is designed for allocations of fixed-sized
+  chunks.  The fixed chunk size must be specified when creating the context.
+  New chunks are allocated to the fullest block, keeping used chunks densely
+  packed together to avoid memory fragmentation.  This also increases the
+  chances that pfree'ing a chunk will result in a block becoming empty of all
+  chunks and allow it to be free'd back to the operating system.
+
+* generation.c (GenerationContext) is best suited for cases when chunks are
+  allocated in groups with similar lifespan (generations), or roughly in FIFO
+  order.  No attempt is made to reuse space left by pfree'd chunks.  Blocks
+  are returned to the operating system when all chunks on them have been
+  pfree'd.
+
+* bump.c (BumpContext) is best suited for use cases that require densely
+  allocated chunks of memory that never need to be individually pfree'd or
+  repalloc'd.  These operations are unsupported due to BumpContext chunks
+  having no chunk header.  No chunk header means more densely packed chunks,
+  which is especially useful for workloads that perform lots of small
+  allocations.  Blocks are only free'd back to the operating system when the
+  context is reset or deleted.
+
+For further details, please read the header comment in the corresponding .c
+file.
 
 Memory Accounting
 -
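To make the README's "no chunk header" point concrete: in a bump allocator
an allocation is nothing more than a pointer advance, so there is no
per-chunk size stored anywhere to consult when freeing or resizing one
chunk, and only a whole-context reset can reclaim space.  A toy sketch,
not the bump.c implementation (sizes and alignment are invented):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy bump allocator: a chunk is just the next 'size' bytes, with no
 * header in front of it.  Without a header there is nowhere to look up
 * a chunk's size, so freeing or resizing an individual chunk is
 * impossible; only resetting the whole block is. */
typedef struct
{
    uint8_t buf[256];
    size_t  used;
} BumpBlock;

static void *bump_alloc(BumpBlock *blk, size_t size)
{
    void *p;

    size = (size + 7) & ~(size_t) 7;        /* 8-byte alignment */
    if (blk->used + size > sizeof(blk->buf))
        return NULL;                        /* block exhausted */
    p = blk->buf + blk->used;
    blk->used += size;                      /* the entire bookkeeping */
    return p;
}

static void bump_reset(BumpBlock *blk)
{
    blk->used = 0;                          /* the only way to reclaim space */
}
```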


[PATCH] Replace magic constant 3 with NUM_MERGE_MATCH_KINDS

2024-04-16 Thread Aleksander Alekseev
Oversight of 0294df2f1f84 [1].

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0294df2f1f84

-- 
Best regards,
Aleksander Alekseev


v1-0001-Replace-constant-3-with-NUM_MERGE_MATCH_KINDS.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
> On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
>  wrote:
> > I personally feel adding the additional check for sync_replication_slots may
> > not improve the situation here. Because the GUC sync_replication_slots can
> > change at any point, the GUC could be false when performing this additional
> > check and set to true immediately after the check, so it would not simplify
> > the logic anyway.
> 
> +1.
> I feel doc and "cannot synchronize replication slots concurrently"
> check should suffice.
> 
> In the scenario which Hou-San pointed out, if after performing the
> GUC check in the SQL function this GUC is enabled immediately, and say
> the worker is started sooner than the function can get a chance to
> sync, then in that case as well the SQL function will ultimately get
> the error "cannot synchronize replication slots concurrently", even
> though the GUC is enabled.  Thus, I feel we should stick with the same
> error in all scenarios.

Okay, fine by me, let's forget about checking sync_replication_slots then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add recovery to pg_control and remove backup_label

2024-04-16 Thread Stefan Fercot
Hi,

On Sunday, March 10th, 2024 at 4:47 AM, David Steele wrote:
> I've had a new idea which may revive this patch. The basic idea is to
> keep backup_label but also return a copy of pg_control from
> pg_stop_backup(). This copy of pg_control would be safe from tears and
> have a backupLabelRequired field set (as Andres suggested) so recovery
> cannot proceed without the backup label.
> 
> So, everything will continue to work as it does now. But, backup
> software can be enhanced to write the improved pg_control that is
> guaranteed not to be torn and has protection against a missing backup label.
> 
> Of course, pg_basebackup will write the new backupLabelRequired field
> into pg_control, but this way third party software can also gain
> advantages from the new field.

Bump on this idea.

Given the discussion in [1], even if it obviously makes sense to improve the
in-core backup capabilities, the more we go in that direction, the more we'll
rely on outside orchestration.
So IMHO it is also worth giving more leverage to such orchestration tools.
In that sense, I really like the idea of extending the backup functions.

More thoughts?

Thanks all,
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)

[1] 
https://www.postgresql.org/message-id/lwXoqQdOT9Nw1tJIx_h7WuqMKrB1YMePQY99RFTZ87H7V52mgUJaSlw2WRbcOgKNUurF1yJqX3nqtZi4hJhtd3e_XlmLsLvnEtGXY-fZPoA%3D%40protonmail.com




Re: post-freeze damage control

2024-04-16 Thread Stefan Fercot
Hi,

On Saturday, April 13th, 2024 at 12:18 PM, Tomas Vondra wrote:
> On 4/13/24 01:03, David Steele wrote:
> > On 4/12/24 22:12, Tomas Vondra wrote:
> > > On 4/11/24 23:48, David Steele wrote:
> > > > On 4/11/24 20:26, Tomas Vondra wrote:
> > > > 
> > > > > FWIW that discussion also mentions stuff that I think the feature
> > > > > should
> > > > > not do. In particular, I don't think the ambition was (or should be) 
> > > > > to
> > > > > make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
> > > > > more as an interface to "backup steps" correctly rather than a 
> > > > > complete
> > > > > backup solution that'd manage backup registry, retention, etc.
> > > > 
> > > > Right -- this is exactly my issue. pg_basebackup was never easy to use
> > > > as a backup solution and this feature makes it significantly more
> > > > complicated. Complicated enough that it would be extremely difficult for
> > > > most users to utilize in a meaningful way.

pg_basebackup has its own use-cases IMHO.  It is very handy for simply taking
a copy of the running cluster, thanks to its ability to carry along the needed
WAL segments without the user even needing to think about
archive_command/pg_receivewal.  And for this kind of task, the new incremental
feature will not really be interesting and won't make things more complicated.

I totally agree that we don't have any complete backup solution in core
though.  And adding more tools to the picture (pg_basebackup, pg_combinebackup,
pg_receivewal, pg_verifybackup,...) will increase the need for on-top
orchestration.  But that's not new.  And for people who already have such
orchestration, the incremental feature will help.

> > > Perhaps, I agree we could/should try to do better job to do backups, no
> > > argument there. But I still don't quite see why introducing such
> > > infrastructure to "manage backups" should be up to the patch adding
> > > incremental backups. I see it as something to build on top of
> > > pg_basebackup/pg_combinebackup, not into those tools.
> > 
> > I'm not saying that managing backups needs to be part of pg_basebackup,
> > but I am saying without that it is not a complete backup solution.
> > Therefore introducing advanced features that the user then has to figure
> > out how to manage puts a large burden on them. Implementing
> > pg_combinebackup inefficiently out of the gate just makes their life
> > harder.
> 
> I agree with this in general, but I fail to see how it'd be the fault of
> this patch. It merely extends what pg_basebackup did before, so if it's
> not a complete solution now, it wasn't a complete solution before.

+1. We can see it as a step towards having a better backup solution in core
in the future, but we definitely shouldn't rule out the fact that lots of
people have already developed such orchestration (home-made or relying on
pgbarman, pgbackrest, wal-g,...).  IMHO, if we're trying to extend the in-core
features, we should also aim at giving more visibility and features to those
tools (like adding more fields to the backup functions,...).

> > > Sure, I'm not against making it clearer pg_combinebackup is not a
> > > complete backup solution, and documenting the existing restrictions.
> > 
> > Let's do that then. I think it would make sense to add caveats to the
> > pg_combinebackup docs including space requirements, being explicit about
> > the format required (e.g. plain), and also possible mitigation with COW
> > filesystems.
> 
> OK. I'll add this as an open item, for me and Robert.

Thanks for this! It's probably not up to the core docs to state all the steps
that would be needed to develop a complete backup solution, but documenting the
links between the tools and, above all, the caveats (the "don't use
INCREMENTAL.* filenames",...) will help users not be caught off guard.  And as
I mentioned in [1], IMO the earlier we can catch a potential issue (wrong
filename, missing file,...), the better for the user.

Thank you all for working on this.
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)

[1] 
https://www.postgresql.org/message-id/vJnnuiaye5rNnCPN8h3xN1Y3lyUDESIgEQnR-Urat9_ld_fozShSJbEk8JDM_3K6BVt5HXT-CatWpSfEZkYVeymlrxKO2_kfKmVZNWyCuJc%3D%40protonmail.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot 
>  wrote:
>
>
> Hi,
>
> > On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> > > Please find v5 addressing above comments.
> >
> > Thanks!
> >
> > @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)  {
> > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback,
> > PointerGetDatum(wrconn));
> > {
> > +   check_flags_and_set_sync_info(InvalidPid);
> > +
> >
> > Given the fact that if the sync worker is running it won't be possible to 
> > trigger a
> > manual sync with pg_sync_replication_slots(), what about also checking the
> > "sync_replication_slots" value at the start of SyncReplicationSlots() and 
> > emmit
> > an error if sync_replication_slots is set to on? (The message could 
> > explicitly
> > states that it's not possible to use the function if sync_replication_slots 
> > is set to
> > on).
>
> I personally feel adding the additional check for sync_replication_slots may
> not improve the situation here. Because the GUC sync_replication_slots can
> change at any point, the GUC could be false when performing this additional
> check and set to true immediately after the check, so it would not simplify
> the logic anyway.

+1.
I feel doc and "cannot synchronize replication slots concurrently"
check should suffice.

In the scenario which Hou-San pointed out, if after performing the
GUC check in the SQL function this GUC is enabled immediately, and say
the worker is started sooner than the function can get a chance to
sync, then in that case as well the SQL function will ultimately get
the error "cannot synchronize replication slots concurrently", even
though the GUC is enabled.  Thus, I feel we should stick with the same
error in all scenarios.


thanks
Shveta




Re: Typos in the code and README

2024-04-16 Thread Richard Guo
On Mon, Apr 15, 2024 at 8:26 PM Daniel Gustafsson  wrote:

> Thanks.  Collecting all the ones submitted here, as well as a few submitted
> off-list by Alexander, the patch is now a 3-part patchset of cleanups:
>
> 0001 contains the typos and duplicate words fixups, 0002 fixes a parameter
> with the wrong name in the prototype and 0003 removes a leftover prototype
> which was accidentally left in a refactoring.


BTW, it seems that 0001 needs a rebase over 9dfcac8e15.

Thanks
Richard


Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread Daniel Gustafsson
> On 16 Apr 2024, at 07:12, Amul Sul  wrote:
> 
> Attached is a small patch adding the missing BumpContext description to the
> README.

Nice catch, we should add it to the README.

+  pfree'd or realloc'd.
I think it's best to avoid mixing APIs; "pfree'd or repalloc'd" keeps it to
functions in our API instead.

--
Daniel Gustafsson





RE: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot 
 wrote:


Hi,

> On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> > Please find v5 addressing above comments.
> 
> Thanks!
> 
> @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)  {
> PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback,
> PointerGetDatum(wrconn));
> {
> +   check_flags_and_set_sync_info(InvalidPid);
> +
> 
> Given the fact that if the sync worker is running it won't be possible to 
> trigger a
> manual sync with pg_sync_replication_slots(), what about also checking the
> "sync_replication_slots" value at the start of SyncReplicationSlots() and 
> emmit
> an error if sync_replication_slots is set to on? (The message could explicitly
> states that it's not possible to use the function if sync_replication_slots 
> is set to
> on).

I personally feel adding the additional check for sync_replication_slots may
not improve the situation here.  Because the GUC sync_replication_slots can
change at any point, the GUC could be false when performing this additional
check and set to true immediately after the check, so it would not simplify
the logic anyway.

Best Regards,
Hou zj




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
 wrote:
>
> > I think for now let's restrict their usage in parallel and make the
> > promotion behavior consistent both for worker and API.
>
> Okay, let's do it that way. Is it worth to add a few words in the doc related 
> to
> pg_sync_replication_slots() though? (to mention it can not be used if the sync
> slot worker is running).

+1. Please find v6 with the suggested doc changes.


thanks
Shveta


v6-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-16 Thread Daniel Gustafsson


> On 16 Apr 2024, at 01:03, Michael Paquier  wrote:
> 
> On Mon, Apr 15, 2024 at 11:07:05AM +0200, Daniel Gustafsson wrote:
>> On 15 Apr 2024, at 07:04, Michael Paquier  wrote:
>>> On Fri, Apr 12, 2024 at 02:42:57PM +0200, Daniel Gustafsson wrote:
 Is the attached split in line with how you were thinking about it?
>>> 
>>> If I may, 0001 looks sensible here.  The bits from 0003 and 0004 could
>>> be applied before 0002, as well.
>> 
>> Agreed, once we are in post-freeze I think those three are mostly
>> ready to go.
> 
> Is there a point to wait for 0001, 0003 and 0004, though, and
> shouldn't these three be backpatched?  0001 is certainly OK as a
> doc-only change to be consistent across the board without waiting 5
> more years.  0003 and 0004 are conditional and depend on if
> SSL_R_VERSION_TOO_LOW and SSL_OP_NO_CLIENT_RENEGOTIATION are defined
> at compile-time.  0003 is much more important, and note that
> 01e6f1a842f4 has been backpatched all the way down.  0004 is nice,
> still not strongly mandatory.

I forgot (and didn't check) that we backpatched 01e6f1a842f4, with that in mind
I agree that we should backpatch 0003 as well to put LibreSSL on par as much as
we can.  0004 is a fix for the LibreSSL support, not adding anything new, so
pushing that to master now makes sense.  Unless objections are raised I'll push
0001, 0003 and 0004 shortly.  0002 and 0005 can hopefully be addressed in the
July commitfest.

>>> Rather than calling always RAND_poll(), this
>>> could use a static flag to call it only once when pg_strong_random is
>>> called for the first time.
>> 
>> Agreed, we can do that. Fixed.
> 
> +#if (OPENSSL_VERSION_NUMBER <= 0x1010L)
> + static bool rand_initialized = false;
> 
> This does not look right.  At the top of upstream's branch
> OpenSSL_1_1_0-stable, OPENSSL_VERSION_NUMBER is 0x101000d0L, so the
> initialization would be missed for any version in the 1.1.0 series
> except the GA one without this code being compiled, no?

Meh, I was clearly not caffeinated as that's a thinko with a typo attached to
it.  The check should be "< 0x10101000" to catch any version prior to 1.1.1.
Fixed.
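Since OPENSSL_VERSION_NUMBER packs the version as 0xMNNFFPPS (major, minor,
fix, patch, status), the corrected threshold is easy to sanity-check
numerically.  The helper name below is invented; only the constants come from
the discussion:

```c
#include <assert.h>

/* OPENSSL_VERSION_NUMBER layout is 0xMNNFFPPS: major, minor, fix,
 * patch letter, status nibble.  The RAND initialization workaround is
 * needed for anything older than 1.1.1, hence "< 0x10101000L" rather
 * than the earlier "<=" comparison against a 1.1.0-series number. */
static int needs_rand_init(long version)
{
    return version < 0x10101000L;
}
```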

--
Daniel Gustafsson



v9-0005-Remove-pg_strong_random-initialization.patch
Description: Binary data


v9-0004-Support-SSL_R_VERSION_TOO_LOW-on-LibreSSL.patch
Description: Binary data


v9-0003-Support-disallowing-SSL-renegotiation-in-LibreSSL.patch
Description: Binary data


v9-0002-Remove-support-for-OpenSSL-1.0.2.patch
Description: Binary data


v9-0001-Doc-Use-past-tense-for-things-which-happened-in-t.patch
Description: Binary data


RE: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.
> 
> This is because ALTER SUBSCRIPTION SET (failover) command internally
> executes
> the replication command ALTER_REPLICATION_SLOT to change the replication
> slot's
> failover property, but this replication command execution cannot be
> rolled back.
> 
> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Thanks for posting the patch, the fix is same as my expectation.
Also, should we add the restriction to the doc? I feel [1] can be updated.


[1]:https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-16 Thread Ajin Cherian
On Tue, Apr 16, 2024 at 1:31 AM Давыдов Виталий 
wrote:

> Dear All,
> Just interested, has anyone tried to reproduce the problem with slow
> catchup of twophase transactions (pgbench should be used with a big number of
> clients)? I haven't seen any messages from anyone other than me that the
> problem takes place.
>
>
>

 Yes, I was able to reproduce the slow catchup of twophase transactions
with pgbench with 20 clients.

regards,
Ajin Cherian
Fujitsu Australia


Re: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 06:32:11AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.

Nice catch, thanks!

> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Agree.  The patch looks pretty straightforward to me.  Worth adding this
case to the doc? (where we already mention that "Commands ALTER SUBSCRIPTION ...
REFRESH PUBLICATION and ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...
with refresh option as true cannot be executed inside a transaction block")
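For context, the usual way such commands are fenced off is the
PreventInTransactionBlock() pattern: a command whose side effects (here,
ALTER_REPLICATION_SLOT on the publisher) cannot be rolled back refuses to
run inside a transaction block.  The sketch below is a standalone model of
that pattern with invented names, not the actual patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Standalone model of the PreventInTransactionBlock() pattern used by
 * ALTER SUBSCRIPTION ... REFRESH and DROP SUBSCRIPTION. */
static char last_error[128];

static int prevent_in_transaction_block(int in_txn_block, const char *stmt)
{
    if (in_txn_block)
    {
        snprintf(last_error, sizeof(last_error),
                 "%s cannot run inside a transaction block", stmt);
        return 0;               /* would ereport(ERROR) in PostgreSQL */
    }
    return 1;
}

static int alter_sub_set_failover(int in_txn_block)
{
    /* Reject the non-rollbackable option change up front. */
    return prevent_in_transaction_block(in_txn_block,
                                        "ALTER SUBSCRIPTION ... SET (failover)");
}
```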

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block

2024-04-16 Thread Quan Zongliang



On 2024/3/11 09:25, Quan Zongliang wrote:



On 2024/3/4 15:48, jian he wrote:


Maybe we can tolerate LOG, first output the query plan then statement.


It is more appropriate to let the extension solve its own problems.
Of course, this change is not easy to implement.
Due to the way XID is assigned, there seems to be no good solution at 
the moment.



According to the discussion with Jian He: use a GUC check hook to detect
whether the xid needs to be output.  If needed, the statement log can be
delayed until the xid is available.

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 76f48b13d2..217910c9de 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1068,11 +1068,14 @@ exec_simple_query(const char *query_string)
/* Log immediately if dictated by log_statement */
if (check_log_statement(parsetree_list))
{
-   ereport(LOG,
-   (errmsg("statement: %s", query_string),
-errhidestmt(true),
-errdetail_execute(parsetree_list)));
-   was_logged = true;
+   if (!Log_has_xid || TransactionIdIsValid(GetTopTransactionIdIfAny()))
+   {
+   ereport(LOG,
+   (errmsg("statement: %s", query_string),
+   errhidestmt(true),
+   errdetail_execute(parsetree_list)));
+   was_logged = true;
+   }
}
 
/*
@@ -1283,6 +1286,16 @@ exec_simple_query(const char *query_string)
 
PortalDrop(portal, false);
 
+   /* Log if dictated by log_statement and has not been logged. */
+   if (!was_logged && check_log_statement(parsetree_list))
+   {
+   ereport(LOG,
+   (errmsg("statement: %s", query_string),
+   errhidestmt(true),
+   errdetail_execute(parsetree_list)));
+   was_logged = true;
+   }
+
if (lnext(parsetree_list, parsetree_item) == NULL)
{
/*
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 605ff3b045..e7dd9f5027 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -79,6 +79,7 @@
 #include "storage/ipc.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
+#include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -114,6 +115,9 @@ char   *Log_destination_string = NULL;
 bool   syslog_sequence_numbers = true;
 bool   syslog_split_messages = true;
 
+/* Whether the transaction id will appear in the log messages. */
+bool   Log_has_xid = false;
+
 /* Processed form of backtrace_functions GUC */
 static char *backtrace_function_list;
 
@@ -286,6 +290,50 @@ message_level_is_interesting(int elevel)
 }
 
 
+/*
+ * check_log_line_prefix: GUC check_hook for Log_line_prefix
+ */
+bool check_log_line_prefix(char **newval, void **extra, GucSource source)
+{
+   const char *p;
+
+   if ((*newval) == NULL)
+   {
+   Log_has_xid = false;
+   return true;
+   }
+
+   Log_has_xid = false;
+
+   for (p = (*newval); *p != '\0'; p++)
+   {
+   if (*p != '%')
+   {
+   /* literal char, just skip */
+   continue;
+   }
+
+   /* must be a '%', so skip to the next char */
+   p++;
+   if (*p == '\0')
+   break;  /* format error - ignore it */
+   else if (*p == '%')
+   {
+   /* string contains %% */
+   continue;
+   }
+
+   /* process the option */
+   if (*p == 'x')
+   {
+   Log_has_xid = true;
+   break;
+   }
+   }
+
+   return true;
+}
+
 /*
  * in_error_recursion_trouble --- are we at risk of infinite error recursion?
  *
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index c68fdc008b..96948e5005 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4103,7 +4103,7 @@ struct config_string ConfigureNamesString[] =
},
&Log_line_prefix,
"%m [%p] ",
-   NULL, NULL, NULL
+   check_log_line_prefix, NULL, NULL
},
 
{
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 054dd2bf62..9536146ce3 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -502,6 +502,8 @@ extern PGDLLIMPORT char 

Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-16 Thread Stefan Fercot
Hi,

> I think it would be reasonable to restrict what can be put in base/ and
> global/ but users generally feel free to create whatever they want in
> the root of PGDATA, despite being strongly encouraged not to.
> 
> Anyway, I think it should be fixed or documented as a caveat since it
> causes a hard failure on restore.

+1. IMHO, no matter how you'd further decide to reference the incremental 
stubs, the earlier we can mention the failure to the user, the better.
Tbh, I'm not very comfortable seeing pg_combinebackup fail when the user needs 
to restore (whatever the reason). I mean, failing to take a backup is less of a 
problem than failing to restore, because then the user still has time to 
investigate and fix the issue, rather than scrambling to restore a down 
production system...

> > But ... I didn't really end up feeling very comfortable with it. Right
> > now, the backup manifest is something we only use to verify the
> > integrity of the backup. If we were to do this, it would become a
> > critical part of the backup.

Isn't that already the case? I mean, you need the manifest of the previous backup 
to take an incremental one, right?
And shouldn't we encourage verifying the backup sets before (at least) trying 
to combine them?
The fact that a file has been used for only one specific purpose until now 
doesn't mean we can't improve it later.
Splitting the meaningful information across multiple places would be more 
error-prone (for both devs and users) imo.

> > I don't think I like the idea that
> > removing the backup_manifest should be allowed to, in effect, corrupt
> > the backup. But I think I dislike even more the idea that the data
> > that is used to verify the backup gets mushed together with the backup
> > data itself. Maybe in practice it's fine, but it doesn't seem very
> > conceptually clean.

Removing pretty much any file in the backup set would probably corrupt it. I 
mean, why would people remove the backup manifest when they already can't 
remove backup_label?
A doc stating "don't touch anything" is IMHO easier than "this part is critical, 
not that one".

> I don't think this is a problem. The manifest can do more than store
> verification info, IMO.

+1. Adding more info to the backup manifest can be handy for other use-cases 
too (e.g. avoiding storing empty files, or storing the checksum state of 
the cluster).

Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)




Re: Things I don't like about \du's "Attributes" column

2024-04-16 Thread Pavel Luzanov

Hi,

On 14.04.2024 05:02, Wen Yi wrote:

I think we can change the output like this:

postgres=# \du
 List of roles
 Role name | Login | Attributes  | Password | Valid until | Connection limit
-----------+-------+-------------+----------+-------------+------------------
 test      |       | Inherit     |          |             |
 test2     | Can   | Inherit     | Has      |             |
 wenyi     | Can   | Superuser  +|          |             |
           |       | Create DB  +|          |             |
           |       | Create role+|          |             |
           |       | Inherit    +|          |             |
           |       | Replication+|          |             |
           |       | Bypass RLS  |          |             |
(3 rows)

And I have submitted my patch, could you have a look?


Thanks for the patch.

As I understand, your patch is based on my previous version.
The main thing I'm wondering about my patch is whether we need to change pg_roles,
and it looks like we don't. So, in the next version of my patch,
the Password column will no longer be there.

As for the Login column and its values.
I'm not sure about using "Can" instead of "yes" to represent true.
In other psql commands, boolean values are always shown as yes/no.
NULL instead of false might be possible, but I'd rather check if this approach
has been used elsewhere. I prefer consistency everywhere.

--
Pavel Luzanov
Postgres Professional:https://postgrespro.com


Re: Why is parula failing?

2024-04-16 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 16:02, Tom Lane  wrote:
> David Rowley  writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay.  But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too.  I'm finding the "compiler bug" theory
> palatable.  Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong?  Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching
REL_12_STABLE to -O0! So I am wondering whether that confirms that
the compiler version is to blame, and while we're still here,
is there anything else I could try?

If not, by Sunday, I am considering switching parula to gcc v12 (or even
v14 experimental - given that massasauga [2] has been pretty stable since
its upgrade a few days back).

Reference:
1.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=parula&br=REL_12_STABLE
2.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE
-
robins


Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> Please find v5 addressing above comments.

Thanks!

@@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 {
PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, 
PointerGetDatum(wrconn));
{
+   check_flags_and_set_sync_info(InvalidPid);
+

Given that a manual sync with pg_sync_replication_slots() won't be possible
while the sync worker is running, what about also checking the
"sync_replication_slots" value at the start of SyncReplicationSlots() and
emitting an error if sync_replication_slots is set to on? (The message could
explicitly state that it's not possible to use the function if
sync_replication_slots is set to on.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Amit Kapila
On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
 wrote:
>
> On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
>
> > There is no clear use case for allowing them in parallel and I feel it
> > would add more confusion when it can work sometimes but not other
> > times. However, if we receive some report from the field where there
> > is a real demand for such a thing, it should be easy to achieve. For
> > example, I can imagine that we can have sync_state that has values
> > 'started', 'in_progress' , and 'finished'. This should allow us to
> > achieve what the current proposed patch is doing along with allowing
> > the API to work in parallel when the sync_state is not 'in_progress'.
> >
> > I think for now let's restrict their usage in parallel and make the
> > promotion behavior consistent both for worker and API.
>
> Okay, let's do it that way. Is it worth adding a few words in the doc related
> to pg_sync_replication_slots() though? (to mention it cannot be used if the
> sync slot worker is running).
>

Yes, this makes sense to me.

-- 
With Regards,
Amit Kapila.




RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-16 Thread Zhijie Hou (Fujitsu)
On Wednesday, March 13, 2024 11:49 AM vignesh C  wrote:
> 
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following 
> > the
> steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to
the fix in this thread.

--
Intermittent failures in 040_standby_failover_slots_sync test
Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue in
PG17, so I am thinking of moving the item to the "Older bugs affecting stable
branches" section if no objections.

Best Regards,
Hou zj



Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
> On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > > > Now both worker and SQL
> > > > > function set 'syncing' when they start and reset it when they exit.
> > > >
> > > > It means that it's not possible anymore to trigger a manual sync if
> > > > sync_replication_slots is on. Indeed that would trigger:
> > > >
> > > > postgres=# select pg_sync_replication_slots();
> > > > ERROR:  cannot synchronize replication slots concurrently
> > > >
> > > > That looks like an issue to me, thoughts?
> > > >
> > >
> > > This is intentional as of now for the sake of keeping
> > > implementation/code simple. It is not difficult to allow them but I am
> > > not sure whether we want to add another set of conditions allowing
> > > them in parallel.
> >
> > I think that the ability to launch a manual sync before a switchover would 
> > be
> > missed. Except for this case I don't think that's an issue to prevent them 
> > to
> > run in parallel.
> >
> 
> I think if the slotsync worker is available, it can do that as well.

Right, but one has no control over when the sync is triggered. 

> There is no clear use case for allowing them in parallel and I feel it
> would add more confusion when it can work sometimes but not other
> times. However, if we receive some report from the field where there
> is a real demand for such a thing, it should be easy to achieve. For
> example, I can imagine that we can have sync_state that has values
> 'started', 'in_progress' , and 'finished'. This should allow us to
> achieve what the current proposed patch is doing along with allowing
> the API to work in parallel when the sync_state is not 'in_progress'.
> 
> I think for now let's restrict their usage in parallel and make the
> promotion behavior consistent both for worker and API.

Okay, let's do it that way. Is it worth adding a few words in the doc related to
pg_sync_replication_slots() though? (to mention it cannot be used if the sync
slot worker is running).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Disallow changing slot's failover option in transaction block

2024-04-16 Thread Zhijie Hou (Fujitsu)
Hi,

Kuroda-San reported an issue off-list that:

If a user executes the ALTER SUBSCRIPTION SET (failover) command inside a txn
block and then rolls it back, only the subscription option change is rolled
back, while the replication slot's failover change is preserved.

This is because the ALTER SUBSCRIPTION SET (failover) command internally
executes the replication command ALTER_REPLICATION_SLOT to change the
replication slot's failover property, and this replication command execution
cannot be rolled back.

To fix it, I think we can prevent users from executing ALTER SUBSCRIPTION SET
(failover) inside a txn block, which is also consistent with the ALTER
SUBSCRIPTION REFRESH and DROP SUBSCRIPTION commands. Attached is a small
patch to address this.

Best Regards,
Hou Zhijie



v1-0001-Disallow-alter-subscription-s-failover-option-ins.patch
Description:  v1-0001-Disallow-alter-subscription-s-failover-option-ins.patch


Re: Typo about the SetDatatabaseHasLoginEventTriggers?

2024-04-16 Thread Michael Paquier
On Tue, Apr 16, 2024 at 02:05:46PM +0800, Japin Li wrote:
> Upon reviewing the login event trigger, I noticed a potential typo about
> the SetDatabaseHasLoginEventTriggers function name.

Indeed, thanks!  Will fix and double-check the surroundings.
--
Michael


signature.asc
Description: PGP signature


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-16 Thread Amit Kapila
On Tue, Apr 16, 2024 at 7:48 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > FYI - We also considered the idea which walsender waits until all prepared
> > transactions
> > > are resolved before decoding and sending changes, but it did not work well
> > > - the restarted walsender sent only COMMIT PREPARED record for
> > transactions which
> > > have been prepared before disabling the subscription. This happened 
> > > because
> > > 1) if the two_phase option of slots is false, the confirmed_flush can be 
> > > ahead of
> > >PREPARE record, and
> > > 2) after the altering and restarting, start_decoding_at becomes same as
> > >confirmed_flush and records behind this won't be decoded.
> > >
> >
> > I don't understand the exact problem you are facing. IIUC, if the
> > commit is after start_decoding_at point and prepare was before it, we
> > expect to send the entire transaction followed by a commit record. The
> > restart_lsn should be before the start of such a transaction and we
> > should have recorded the changes in the reorder buffer.
>
> This behavior is right for the two_phase = false case. But if the parameter is
> altered between PREPARE and COMMIT PREPARED, there is a possibility that only
> COMMIT PREPARED is sent.
>

Can you please consider the idea I shared at [1] ("One naive
idea is that on the publisher ...") to solve this problem?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1K1fSkeK%3Dkc26G5cq87vQG4%3D1qs_b%2Bno4%2Bep654SeBy1w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Things I don't like about \du's "Attributes" column

2024-04-16 Thread Pavel Luzanov

On 16.04.2024 01:06, David G. Johnston wrote:


At this point I'm on board with retaining the \dr charter of simply being an 
easy way to access the detail exposed in pg_roles with some display formatting 
but without any attempt to convey how the system uses said information.  
Without changing pg_roles.  Our level of effort here, and degree of dependence 
on superuser, doesn't seem to be bothering people enough to push more radical 
changes here through and we have good improvements that are being held up in 
the hope of possible perfection.


I have similar thoughts. I decided to wait for the end of feature freeze
and propose a simpler version of the patch for v18, without changes in
pg_roles. I hope to send a new version soon. But about \dr: is it a typo,
and you mean \du & \dg? If we were choosing a name for the command now,
then \dr would be ideal:
\dr - display roles
\drg - display role grants
But the long history of \du & \dg prevents us from doing so, and creating
a third option would be excessive.


--
Pavel Luzanov
Postgres Professional:https://postgrespro.com


Typo about the SetDatatabaseHasLoginEventTriggers?

2024-04-16 Thread Japin Li


Hi,

Upon reviewing the login event trigger, I noticed a potential typo about
the SetDatabaseHasLoginEventTriggers function name.

diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 0d3214df9c..7a5ed6b985 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -111,7 +111,7 @@ static void validate_table_rewrite_tags(const char *filtervar, List *taglist);
 static void EventTriggerInvoke(List *fn_oid_list, EventTriggerData *trigdata);
 static const char *stringify_grant_objtype(ObjectType objtype);
 static const char *stringify_adefprivs_objtype(ObjectType objtype);
-static void SetDatatabaseHasLoginEventTriggers(void);
+static void SetDatabaseHasLoginEventTriggers(void);

 /*
  * Create an event trigger.
@@ -315,7 +315,7 @@ insert_event_trigger_tuple(const char *trigname, const char *eventname, Oid evtO
 * faster lookups in hot codepaths. Set the flag unless already True.
 */
if (strcmp(eventname, "login") == 0)
-   SetDatatabaseHasLoginEventTriggers();
+   SetDatabaseHasLoginEventTriggers();

/* Depend on owner. */
recordDependencyOnOwner(EventTriggerRelationId, trigoid, evtOwner);
@@ -383,7 +383,7 @@ filter_list_to_array(List *filterlist)
  * current database has on login event triggers.
  */
 void
-SetDatatabaseHasLoginEventTriggers(void)
+SetDatabaseHasLoginEventTriggers(void)
 {
/* Set dathasloginevt flag in pg_database */
Form_pg_database db;
@@ -453,7 +453,7 @@ AlterEventTrigger(AlterEventTrigStmt *stmt)
 */
if (namestrcmp(&stmt->evtevent, "login") == 0 &&
tgenabled != TRIGGER_DISABLED)
-   SetDatatabaseHasLoginEventTriggers();
+   SetDatabaseHasLoginEventTriggers();

InvokeObjectPostAlterHook(EventTriggerRelationId,
  trigoid, 0);
@@ -925,7 +925,7 @@ EventTriggerOnLogin(void)
/*
 * There is no active login event trigger, but our
 * pg_database.dathasloginevt is set. Try to unset this flag.  We use the
-* lock to prevent concurrent SetDatatabaseHasLoginEventTriggers(), but we
+* lock to prevent concurrent SetDatabaseHasLoginEventTriggers(), but we
 * don't want to hang the connection waiting on the lock.  Thus, we are
 * just trying to acquire the lock conditionally.
 */
--
Regards,
Japin Li




Re: POC: GROUP BY optimization

2024-04-16 Thread Andrei Lepikhov

On 4/12/24 06:44, Tom Lane wrote:

* The very first hunk causes get_eclass_for_sort_expr to have
side-effects on existing EquivalenceClass data structures.
What is the argument that that's not going to cause problems?
At the very least it introduces asymmetry, in that the first
caller wins the chance to change the EC's ec_sortref and later
callers won't be able to.

Let me resolve the issues bit by bit.
Addressing your first note, the 'asymmetry', I examined this part of the 
code. Before commit 8d83a5d, it could cause some artefacts, but since then, 
a kind of asymmetry has been introduced into the GROUP-BY clause list.
As you can see in the attached patch, GROUP-BY reordering works even 
when the asymmetry of 8d83a5d chooses different keys.


At the same time, I agree that implicitly setting anything in an 
exported function, which should just look up the EC, is a questionable 
coding style. To avoid possible usage issues (in extensions, for 
example), I propose setting it afterwards, explicitly requesting this 
action via an input parameter - see my attempt in the attachment.


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a619ff9177..bc08dfadaf 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -652,18 +652,7 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 
 			if (opcintype == cur_em->em_datatype &&
 equal(expr, cur_em->em_expr))
-			{
-/*
- * Match!
- *
- * Copy the sortref if it wasn't set yet.  That may happen if
- * the ec was constructed from a WHERE clause, i.e. it doesn't
- * have a target reference at all.
- */
-if (cur_ec->ec_sortref == 0 && sortref > 0)
-	cur_ec->ec_sortref = sortref;
-return cur_ec;
-			}
+				return cur_ec;	/* Match! */
 		}
 	}
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 1d61881a6b..5d9597adcd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1355,7 +1355,8 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
	&sortclauses,
 	tlist,
 	false,
-	&sortable);
+	&sortable,
+	false);
 	/* It's caller error if not all clauses were sortable */
 	Assert(sortable);
 	return result;
@@ -1379,13 +1380,16 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
  * to remove any clauses that can be proven redundant via the eclass logic.
  * Even though we'll have to hash in that case, we might as well not hash
  * redundant columns.)
+ * *set_ec_sortref is true if the caller wants to set up the ec_sortref field
+ * in the pathkey Equivalence Class, if it has not been initialized beforehand.
  */
 List *
 make_pathkeys_for_sortclauses_extended(PlannerInfo *root,
 	   List **sortclauses,
 	   List *tlist,
 	   bool remove_redundant,
-	   bool *sortable)
+	   bool *sortable,
+	   bool set_ec_sortref)
 {
 	List	   *pathkeys = NIL;
 	ListCell   *l;
@@ -1409,6 +1413,13 @@ make_pathkeys_for_sortclauses_extended(PlannerInfo *root,
 		   sortcl->nulls_first,
 		   sortcl->tleSortGroupRef,
 		   true);
+		if (pathkey->pk_eclass->ec_sortref == 0 && set_ec_sortref)
+			/*
+			 * Copy the sortref if it wasn't set yet.  That may happen if
+			 * the ec was constructed from a WHERE clause, i.e. it doesn't
+			 * have a target reference at all.
+			 */
+			pathkey->pk_eclass->ec_sortref = sortcl->tleSortGroupRef;
 
 		/* Canonical form eliminates redundant ordering keys */
 		if (!pathkey_is_redundant(pathkey, pathkeys))
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5320da51a0..dee98c9a90 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3391,7 +3391,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 		/*
 		 * With a plain GROUP BY list, we can remove any grouping items that
 		 * are proven redundant by EquivalenceClass processing.  For example,
-		 * we can remove y given "WHERE x = y GROUP BY x, y".  These aren't
+		 * we can remove y given "WHERE x = y GROUP BY x, y".  These aren't
 		 * especially common cases, but they're nearly free to detect.  Note
 		 * that we remove redundant items from processed_groupClause but not
 		 * the original parse->groupClause.
@@ -3403,7 +3403,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
    &root->processed_groupClause,
    tlist,
    true,
-   &sortable);
+   &sortable,
+   true);
 		if (!sortable)
 		{
 			/* Can't sort; no point in considering aggregate ordering either */
@@ -3453,7 +3454,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
    &root->processed_distinctClause,
    tlist,
    true,
-   &sortable);
+   &sortable,
+