Re: CI and test improvements

2023-07-11 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 11:35:09AM -0600, Justin Pryzby wrote:
> However, this finds two real problems and one false-positive with
> missing regress/isolation tests:
> 
> $ for makefile in `find src contrib -name Makefile`; do for testname in `sed 
-r '/^(REGRESS|ISOLATION) =/!d; s///; :l; /\\$/{s///; N; b l}; s/\n//g' 
> "$makefile"`; do meson=${makefile%/Makefile}/meson.build; grep -Fw 
> "$testname" "$meson" >/dev/null || echo "$testname is missing from $meson"; 
> done; done

And, since 681d9e462:

security is missing from contrib/seg/meson.build




Re: 'ERROR: attempted to update invisible tuple' from 'ALTER INDEX ... ATTACH PARTITION' on parent index

2023-07-11 Thread Michael Paquier
On Wed, Jul 12, 2023 at 09:38:41AM +0900, Michael Paquier wrote:
> While working recently on what has led to cfc43ae and fc55c7f, I
> really got the feeling that there could be some command sequences that
> lacked some CCIs (or CommandCounterIncrement calls) to make sure that
> the catalog updates are visible in any follow-up steps in the same
> transaction.

Wait a minute.  The validation of a partitioned index uses a copy of
the pg_index tuple from the relcache, which may be out of date:
   newtup = heap_copytuple(partedIdx->rd_indextuple);
   ((Form_pg_index) GETSTRUCT(newtup))->indisvalid = true;

And it seems to me that we should do the catalog update based on a
copy of a tuple coming from the syscache, no?  Attached is a patch
that fixes your issue with more advanced regression tests that use two
levels of partitioning, looping twice through an update of indisvalid
when attaching the leaf index (the test reproduces the problem on
HEAD, as well).
--
Michael
From 685e63dbbb57a3339926a1e8eb84ffdd010c5a51 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 12 Jul 2023 14:36:07 +0900
Subject: [PATCH] Fix validation update of partitioned indexes

---
 src/backend/commands/tablecmds.c       | 14 ++++++++++----
 src/test/regress/expected/indexing.out | 63 ++++++++++++++++++++++++++++++++++
 src/test/regress/sql/indexing.sql      | 43 +++++++++++++++++++++++
 3 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8fff036b73..f740306578 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -19175,15 +19175,21 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
 	if (tuples == RelationGetPartitionDesc(partedTbl, true)->nparts)
 	{
 		Relation	idxRel;
-		HeapTuple	newtup;
+		HeapTuple	indTup;
+		Form_pg_index indexForm;
 
 		idxRel = table_open(IndexRelationId, RowExclusiveLock);
+		indTup = SearchSysCacheCopy1(INDEXRELID,
+									 ObjectIdGetDatum(RelationGetRelid(partedIdx)));
+		if (!HeapTupleIsValid(indTup))
+			elog(ERROR, "cache lookup failed for index %u",
+				 RelationGetRelid(partedIdx));
+		indexForm = (Form_pg_index) GETSTRUCT(indTup);
 
-		newtup = heap_copytuple(partedIdx->rd_indextuple);
-		((Form_pg_index) GETSTRUCT(newtup))->indisvalid = true;
+		indexForm->indisvalid = true;
 		updated = true;
 
-		CatalogTupleUpdate(idxRel, &partedIdx->rd_indextuple->t_self, newtup);
+		CatalogTupleUpdate(idxRel, &indTup->t_self, indTup);
 
 		table_close(idxRel, RowExclusiveLock);
 	}
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 3e5645c2ab..368e735de2 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1486,3 +1486,66 @@ select indexrelid::regclass, indisvalid,
 (5 rows)
 
 drop table parted_isvalid_tab;
+-- Check state of replica indexes when attaching a partition.
+begin;
+create table parted_replica_tab (id int not null) partition by range (id);
+create table parted_replica_tab_1 partition of parted_replica_tab
+  for values from (1) to (10) partition by range (id);
+create table parted_replica_tab_11 partition of parted_replica_tab_1
+  for values from (1) to (5);
+create unique index parted_replica_idx
+  on only parted_replica_tab using btree (id);
+create unique index parted_replica_idx_1
+  on only parted_replica_tab_1 using btree (id);
+-- This triggers an update of pg_index.indisreplident for parted_replica_idx.
+alter table only parted_replica_tab_1 replica identity
+  using index parted_replica_idx_1;
+create unique index parted_replica_idx_11 on parted_replica_tab_11 USING btree (id);
+select indexrelid::regclass, indisvalid, indisreplident,
+   indrelid::regclass, inhparent::regclass
+  from pg_index idx left join
+   pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+  where indexrelid::regclass::text like 'parted_replica%'
+  order by indexrelid::regclass::text collate "C";
+  indexrelid   | indisvalid | indisreplident |   indrelid| inhparent 
+---+++---+---
+ parted_replica_idx| f  | f  | parted_replica_tab| 
+ parted_replica_idx_1  | f  | t  | parted_replica_tab_1  | 
+ parted_replica_idx_11 | t  | f  | parted_replica_tab_11 | 
+(3 rows)
+
+-- parted_replica_idx is not valid yet here, because parted_replica_idx_1
+-- is not valid.
+alter index parted_replica_idx ATTACH PARTITION parted_replica_idx_1;
+select indexrelid::regclass, indisvalid, indisreplident,
+   indrelid::regclass, inhparent::regclass
+  from pg_index idx left join
+   pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+  where indexrelid::regclass::text like 'parted_replica%'
+  order by indexrelid::regclass::text collate "C";
+  indexrelid   | indisvalid | indisreplident |   indrelid| inhparent  

Re: Cleaning up threading code

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 3:34 PM Thomas Munro  wrote:
> On Wed, Jul 12, 2023 at 3:21 PM Thomas Munro  wrote:
> > Apparently "drongo" didn't like something about this commit, and an
> > ecpg test failed, but I can't immediately see why.  Just "server
> > closed the connection unexpectedly".  drongo is running the new Meson
> > buildfarm variant on Windows+MSVC, so, being still in development, I
> > wonder if some diagnostic clue/log/backtrace etc is not being uploaded
> > yet.  FWIW Meson+Windows+MSVC passes on CI.
>
> Oh, that's probably unrelated to this commit.  It's failed 6 times
> like that in the past ~3 months.

Ah, right, that is a symptom of the old Windows TCP linger vs process
exit thing.  Something on my list to try to investigate again, but not
today.

https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org




Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-11 Thread Peter Smith
Here are my review comments for the patch v4.

==
General

1.
FYI, this patch also needs some minor PG docs updates [1] because
section "31.1 Publication" currently refers to only btree - e.g.
"Candidate indexes must be btree, non-partial, and have..."

(this may also clash with Sawada-san's other thread as previously mentioned)

==
Commit message

2.
Moreover, BRIN and GIN indexes do not implement "amgettuple".
RelationFindReplTupleByIndex()
assumes that tuples will be fetched one by one via
index_getnext_slot(), but this
currently requires the "amgetuple" function.

~

Typo:
/"amgetuple"/"amgettuple"/

==
src/backend/executor/execReplication.c

3.
+ * 2) Moreover, BRIN and GIN indexes do not implement "amgettuple".
+ * RelationFindReplTupleByIndex() assumes that tuples will be fetched one by
+ * one via index_getnext_slot(), but this currently requires the "amgetuple"
+ * function. To make it use index_getbitmap() must be used, which fetches all
+ * the tuples at once.
+ */
+int

3a.
Typo:
/"amgetuple"/"amgettuple"/

~

3b.
I think my previous comment about this comment was misunderstood. I
was questioning why that last sentence ("To make it...") about
"index_getbitmap()" is even needed in this patch. I thought it may be
overreach to describe details how to make some future enhancement that
might never happen.


==
src/backend/replication/logical/relation.c

4. IsIndexUsableForReplicaIdentityFull

 IsIndexUsableForReplicaIdentityFull(IndexInfo *indexInfo)
 {
- bool is_btree = (indexInfo->ii_Am == BTREE_AM_OID);
- bool is_partial = (indexInfo->ii_Predicate != NIL);
- bool is_only_on_expression = IsIndexOnlyOnExpression(indexInfo);
+ bool is_suitable_type;
+ bool is_partial;
+ bool is_only_on_expression;

- return is_btree && !is_partial && !is_only_on_expression;
+ /* Check whether the index is supported or not */
+ is_suitable_type = ((get_equal_strategy_number_for_am(indexInfo->ii_Am)
+ != InvalidStrategy));
+
+ is_partial = (indexInfo->ii_Predicate != NIL);
+ is_only_on_expression = IsIndexOnlyOnExpression(indexInfo);
+
+ return is_suitable_type && !is_partial && !is_only_on_expression;

4a.
The code can be written more optimally, because if it is deemed
already that 'is_suitable_type' is false, then there is no point to
continue to do the other checks and function calls.

~~~

4b.

+ /* Check whether the index is supported or not */
+ is_suitable_type = ((get_equal_strategy_number_for_am(indexInfo->ii_Am)
+ != InvalidStrategy));

The above statement has an extra set of nested parentheses for some reason.

==
src/backend/utils/cache/lsyscache.c

5.
/*
 * get_opclass_method
 *
 * Returns the OID of the index access method operator class is for.
 */
Oid
get_opclass_method(Oid opclass)

IMO the comment should be worded more like the previous one in this source file.

SUGGESTION
Returns the OID of the index access method the opclass belongs to.

--
[1] https://www.postgresql.org/docs/devel/logical-replication-publication.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-11 Thread Amit Kapila
On Wed, Jul 12, 2023 at 8:37 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > 10.
> > + /* Check whether the index is supported or not */
> > + is_suitable_type = ((get_equal_strategy_number_for_am(indexInfo->ii_Am)
> > + != InvalidStrategy));
> > +
> > + is_partial = (indexInfo->ii_Predicate != NIL);
> > + is_only_on_expression = IsIndexOnlyOnExpression(indexInfo);
> > +
> > + return is_suitable_type && !is_partial && !is_only_on_expression;
> >
> > I am not sure if the function IsIndexOnlyExpression() even needed
> > anymore. Isn't it sufficient just to check up-front is the leftmost
> > INDEX field is a column and that covers this condition also? Actually,
> > this same question is also open in the Sawada-san thread [1].
> >
> > ==
> > .../subscription/t/032_subscribe_use_index.pl
> >
> > 11.
> > AFAIK there is no test to verify that the leftmost index field must be
> > a column (e.g. cannot be an expression). Yes, there are some tests for
> > *only* expressions like (expr, expr, expr), but IMO there should be
> > another test for an index like (expr, expr, column) which should also
> > fail because the column is not the leftmost field.
>
> I'm not sure they should be fixed in this patch. You have raised these points
> in Sawada-san's thread[1], and I'm not sure whether he has addressed them yet.
> Furthermore, these points are not related to our initial motivation here;
> that discussion should be forked, or at least handled in another patch.
>

I think it is reasonable to discuss the existing code-related
improvements in a separate thread rather than trying to change this
patch. I have some other comments about your patch:

1.
+ * 1) Other indexes do not have a fixed set of strategy numbers.
+ * In build_replindex_scan_key(), the operator which corresponds to "equality
+ * comparison" is searched by using the strategy number, but it is difficult
+ * for such indexes. For example, GiST index operator classes for
+ * two-dimensional geometric objects have a comparison "same", but its strategy
+ * number is different from the gist_int4_ops, which is a GiST index operator
+ * class that implements the B-tree equivalent.
+ *
...
+ */
+int
+get_equal_strategy_number_for_am(Oid am)
...

I think this comment is slightly difficult to understand. Can we make
it a bit generic by saying something like: "The index access methods
other than BTREE and HASH don't have a fixed strategy for equality
operation. Note that in other index access methods, the support
routines of each operator class interpret the strategy numbers
according to the operator class's definition. So, we return
InvalidStrategy number for index access methods other than BTREE and
HASH."

2.
+ * 2) Moreover, BRIN and GIN indexes do not implement "amgettuple".
+ * RelationFindReplTupleByIndex() assumes that tuples will be fetched one by
+ * one via index_getnext_slot(), but this currently requires the "amgetuple"
+ * function. To make it use index_getbitmap() must be used, which fetches all
+ * the tuples at once.
+ */
+int
+get_equal_strategy_number_for_am(Oid am)
{
..

I don't think this is a good place for such a comment. We can probably
move this atop IsIndexUsableForReplicaIdentityFull(). I think you need
to mention two reasons in IsIndexUsableForReplicaIdentityFull() why we
support only BTREE and HASH index access methods (a) Refer to comments
atop get_equal_strategy_number_for_am(); (b) mention the reason
related to tuples_equal() as discussed in email [1]. Then you can say
that we also need to ensure that the index access methods that we
support must have an implementation "amgettuple" as later while
searching tuple via RelationFindReplTupleByIndex, we need the same. We
can probably have an assert for this as well.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Jv8%2B8rax-bejd3Ej%2BT2fG3tuqP8v89byKCBPVQj9C9pg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Cleaning up threading code

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 3:21 PM Thomas Munro  wrote:
> Apparently "drongo" didn't like something about this commit, and an
> ecpg test failed, but I can't immediately see why.  Just "server
> closed the connection unexpectedly".  drongo is running the new Meson
> buildfarm variant on Windows+MSVC, so, being still in development, I
> wonder if some diagnostic clue/log/backtrace etc is not being uploaded
> yet.  FWIW Meson+Windows+MSVC passes on CI.

Oh, that's probably unrelated to this commit.  It's failed 6 times
like that in the past ~3 months.




Re: Cleaning up threading code

2023-07-11 Thread Thomas Munro
Apparently "drongo" didn't like something about this commit, and an
ecpg test failed, but I can't immediately see why.  Just "server
closed the connection unexpectedly".  drongo is running the new Meson
buildfarm variant on Windows+MSVC, so, being still in development, I
wonder if some diagnostic clue/log/backtrace etc is not being uploaded
yet.  FWIW Meson+Windows+MSVC passes on CI.




RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-11 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version.

> 1.
> 89e46d allowed using indexes other than PRIMARY KEY or REPLICA IDENTITY
> on the
> subscriber, but only the BTree index could be used. This commit extends the
> limitation, now the Hash index can be also used.
> 
> ~
> 
> Before giving details about the problems of the other index types it
> might be good to summarize why these 2 types are OK.
> 
> SUGGESTION
> These 2 types of indexes are the only ones supported because only these
> - use fixed strategy numbers
> - implement the "equality" strategy
> - implement the function amgettuple()

Added.

> 
> 2.
> I'm not sure why the next paragraphs are numbered 1) and 2). Is that
> necessary? It seems maybe a cut/paste hangover from the similar code
> comment.

Yeah, this was just copied from code comments. Numbers were removed.

> 3.
> 1) Other indexes do not have a fixed set of strategy numbers at all. In
> build_replindex_scan_key(), the operator which corresponds to
> "equality comparison"
> is searched by using the strategy number, but it is difficult for such 
> indexes.
> For example, GiST index operator classes for two-dimensional geometric
> objects have
> a comparison "same", but its strategy number is different from the
> gist_int4_ops,
> which is a GiST index operator class that implements the B-tree equivalent.
> 
> ~
> 
> Don't need to say "at all"

Removed.

> 4.
> 2) Some other indexes do not implement "amgettuple".
> RelationFindReplTupleByIndex()
> assumes that tuples could be fetched one by one via
> index_getnext_slot(), but such
> indexes are not supported.
> 
> ~
> 
> 4a.
> "Some other indexes..." --> Maybe give an example here (e.g. XXX, YYY)
> of indexes that do not implement the function.

I clarified it as "BRIN and GIN indexes do not implement...", since those are
the built-in indexes affected. Actually the bloom index cannot be supported
for the same reason, but I did not mention it because it is not in core.

> 4b.
> BEFORE
> RelationFindReplTupleByIndex() assumes that tuples could be fetched
> one by one via index_getnext_slot(), but such indexes are not
> supported.
> 
> AFTER
> RelationFindReplTupleByIndex() assumes that tuples will be fetched one
> by one via index_getnext_slot(), but this currently requires the
> "amgetuple" function.


Changed.

> src/backend/executor/execReplication.c
> 
> 5.
> + * 2) Some other indexes do not implement "amgettuple".
> + * RelationFindReplTupleByIndex() assumes that tuples could be fetched one by
> + * one via index_getnext_slot(), but such indexes are not supported. To make 
> it
> + * use index_getbitmap() must be used, but not done yet because of the above
> + * reason.
> + */
> +int
> +get_equal_strategy_number_for_am(Oid am)
> 
> ~
> 
> Change this text to better match exactly in the commit message (see
> previous review comments above).

Copied from commit message.

> Also I am not sure it is necessary to
> say how it *might* be implemented in the future if somebody wanted to
> enhance it to work also for "amgetbitmap" function. E.g. do we need
> that sentence "To make it..."

Added; what do you think?

> 6. get_equal_strategy_number_for_am
> 
> + case GIST_AM_OID:
> + case SPGIST_AM_OID:
> + case GIN_AM_OID:
> + case BRIN_AM_OID:
> + default:
> 
> I am not sure it is necessary to spell out all these not-supported
> cases in the switch. If seems sufficient just to say 'default:'
> doesn't it?

Yeah, it's sufficient. This is leftover from a previous PoC.

> 7. get_equal_strategy_number
> 
> Two blank lines are following this function

Removed.

> 8. build_replindex_scan_key
> 
> - * This is not generic routine, it expects the idxrel to be a btree,
> non-partial
> - * and have at least one column reference (i.e. cannot consist of only
> - * expressions).
> + * This is not generic routine, it expects the idxrel to be a btree or a 
> hash,
> + * non-partial and have at least one column reference (i.e. cannot consist of
> + * only expressions).
> 
> Take care. AFAIK this change will clash with changes Sawada-san is
> making to the same code comment in another thread here [1].

Thanks for the reminder. I thought this change would no longer be needed once
that patch is pushed, but I kept it anyway because this one may be pushed
first.

> src/backend/replication/logical/relation.c
> 
> 9. FindUsableIndexForReplicaIdentityFull
> 
>   * Returns the oid of an index that can be used by the apply worker to scan
> - * the relation. The index must be btree, non-partial, and have at least
> - * one column reference (i.e. cannot consist of only expressions). These
> + * the relation. The index must be btree or hash, non-partial, and have at
> + * least one column reference (i.e. cannot consist of only expressions). 
> These
>   * limitations help to keep the index scan similar to PK/RI index scans.
> 
> ~
> 
> Take care. AFAIK this change will clash with changes Sawada-san is
> making to the same code comment in another thread here [1].

Thanks for the reminder. I 

Re: COPY table FROM STDIN via SPI

2023-07-11 Thread James Sewell
>
> No.  It'd be a wire protocol break, and even if it weren't I would not
> expect many clients to be able to deal with it.  They're in the middle
> of a query cycle (for the SELECT or CALL that got you into SPI), and
> suddenly the backend asks for COPY data?  What are they supposed to
> send, or where are they supposed to put it for the COPY-out case?
> There's just not provision for nesting protocol operations like that.
>

What about running a COPY directly from C - is that possible?


Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread David Rowley
On Wed, 12 Jul 2023 at 14:23, Tom Lane  wrote:
> I did think about that, but "shallow copy a Path" seems nontrivial
> because the Path structs are all different sizes.  Maybe it is worth
> building some infrastructure to support that?

That seems a reasonable thing to have to do.  It's the minimum we could do
to ensure each Path is mentioned in at most one RelOptInfo.

I see GetExistingLocalJoinPath() in foreign.c might be related to this
problem, per:

> * If the inner or outer subpath of the chosen path is a ForeignScan, we
> * replace it with its outer subpath.  For this reason, and also because the
> * planner might free the original path later, the path returned by this
> * function is a shallow copy of the original.  There's no need to copy
> * the substructure, so we don't.

so that function could probably disappear if we had this.

David




Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Tom Lane
Andres Freund  writes:
> Looks like we could make this easier in core postgres by adding one more
> sd_notify() call, with something like
> STATUS=reload failed due to syntax error in file 
> "/srv/dev/pgdev-dev/postgresql.conf" line 821, near end of line

Seems reasonable to investigate.  The systemd camel's nose is already
inside our tent, so we might as well consider more systemd-specific
hacks to improve the user experience.  But it seems like we'd need
some careful thought about just which messages to report.

regards, tom lane




Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 12:21:46 -0400, Greg Sabino Mullane wrote:
> On Tue, Jul 11, 2023 at 11:04 AM Isaac Morland 
> > Or maybe there could be a "check configuration" subcommand which checks
> > the configuration.
> >
>
> There are things inside of Postgres once it has started, but yeah,
> something akin to visudo would be nice for editing config files.

You can also do it kind-of-reasonably with the server binary, like:
PGDATA=/srv/dev/pgdev-dev /path/to/postgres -C server_version; echo $?


> > I'd be more interested in improvements in visibility of errors. For
> > example, maybe if I try to start the server and there is a config file
> > problem, I could somehow get a straightforward error message right in the
> > terminal window complaining about the line of the configuration which is
> > wrong.
> >
>
> That ship has long since sailed. We already send a detailed error message
> with the line number, but in today's world of "service start", "systemctl
> start", and higher level of control such as Patroni and Kubernetes, getting
> things to show in a terminal window isn't happening. We can't work around
> 2>&1.

At least with debian's infrastructure, both systemctl start and reload show
errors reasonably well:

start with broken config:
Jul 11 19:13:40 awork3 systemd[1]: Starting postgresql@15-test.service - 
PostgreSQL Cluster 15-test...
Jul 11 19:13:40 awork3 postgresql@15-test[3217452]: Error: invalid line 3 in 
/var/lib/postgresql/15/test/postgresql.auto.conf: dd
Jul 11 19:13:40 awork3 systemd[1]: postgresql@15-test.service: Can't open PID 
file /run/postgresql/15-test.pid (yet?) after start: No such file or directory


reload with broken config:
Jul 11 19:10:38 awork3 systemd[1]: Reloading postgresql@15-test.service - 
PostgreSQL Cluster 15-test...
Jul 11 19:10:38 awork3 postgresql@15-test[3217175]: Error: invalid line 3 in 
/var/lib/postgresql/15/test/postgresql.auto.conf: dd
Jul 11 19:10:38 awork3 systemd[1]: postgresql@15-test.service: Control process 
exited, code=exited, status=1/FAILURE
Jul 11 19:10:38 awork3 systemd[1]: Reload failed for postgresql@15-test.service 
- PostgreSQL Cluster 15-test.

However: It looks like that's all implemented in Debian-specific tooling,
rather than PG itself. Oops.


Looks like we could make this easier in core postgres by adding one more
sd_notify() call, with something like
STATUS=reload failed due to syntax error in file 
"/srv/dev/pgdev-dev/postgresql.conf" line 821, near end of line

Greetings,

Andres Freund




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Tom Lane
David Rowley  writes:
> I've not taken the time to fully understand this, but from reading the
> thread, I'm not immediately understanding why we can't just shallow
> copy the Path from the other RelOptInfo and replace the parent before
> using it in the upper RelOptInfo.  Can you explain?

I did think about that, but "shallow copy a Path" seems nontrivial
because the Path structs are all different sizes.  Maybe it is worth
building some infrastructure to support that?

regards, tom lane




Re: DROP DATABASE is interruptible

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote:
> > On 25 Jun 2023, at 19:03, Andres Freund  wrote:
> > On 2023-06-21 12:02:04 -0700, Andres Freund wrote:
> >> I'm hacking on this bugfix again,
>
> This patch LGTM from reading through and testing (manually and with your
> supplied tests in the patch), I think this is a sound approach to deal with
> this.

Thanks!


> >> I haven't yet added checks to pg_upgrade, even though that's clearly
> >> needed. I'm waffling a bit between erroring out and just ignoring the
> >> database? pg_upgrade already fails when datallowconn is set "wrongly", see
> >> check_proper_datallowconn().  Any opinions?
> >
> > There don't need to be explict checks, because pg_upgrade will fail, because
> > it connects to every database. Obviously the error could be nicer, but it
> > seems ok for something hopefully very rare. I did add a test ensuring that 
> > the
> > behaviour is caught.
>
> I don't see any pg_upgrade test in the patch?

Oops, I stashed them alongside some unrelated changes... Included this time.



> > It's somewhat odd that pg_upgrade prints errors on stdout...
>
> There are many odd things about pg_upgrade logging, updating it to use the
> common logging framework of other utils would be nice.

Indeed.


> >> I'm not sure what should be done for psql. It's probably not a good idea to
> >> change tab completion, that'd just make it appear the database is gone. 
> >> But \l
> >> could probably show dropped databases more prominently?
> >
> > I have not done that. I wonder if this is something that should be done in 
> > the
> > back branches?
>
> Possibly, I'm not sure where we usually stand on changing the output format of
> \ commands in psql in minor revisions.

I'd normally be quite careful, people do script psql.

While breaking things when encountering an invalid database doesn't actually
sound like a bad thing, I don't think it fits into any of the existing column
output by psql for \l.


> A few small comments on the patch:
>
> + * Max connections allowed (-1=no limit, -2=invalid database). A database
> + * is set to invalid partway through eing dropped.  Using datconnlimit=-2
> + * for this purpose isn't particularly clean, but is backpatchable.
> Typo: s/eing/being/.

Fixed.


> A limit of -1 makes sense, but the meaning of -2 is less intuitive IMO.
> Would it make sense to add a #define with a more descriptive name to save
> folks reading this having to grep around and figure out what -2 means?

I went back and forth about this one. We don't use defines for such things in
all the frontend code today, so the majority of places won't be improved by
adding this. I added them now, which required touching a few otherwise
untouched places, but not too bad.




> + errhint("Use DROP DATABASE to drop invalid databases"));
> Should end with a period as a complete sentence?

I get confused about this every time. It's not helped by this example in
sources.sgml:


Primary: could not create shared memory segment: %m
Detail: Failed syscall was shmget(key=%d, size=%u, 0%o).
Hint:   the addendum


Which notably does not use punctuation for the hint. But indeed, later we say:
   
Detail and hint messages: Use complete sentences, and end each with
a period.  Capitalize the first word of sentences.  Put two spaces after
the period if another sentence follows (for English text; might be
inappropriate in other languages).
   


> + errmsg("cannot alter invalid database \"%s\"", stmt->dbname),
> + errdetail("Use DROP DATABASE to drop invalid databases"));
> Shouldn't this be an errhint() instead? Also ending with a period.

Yep.


> + if (database_is_invalid_form((Form_pg_database) dbform))
> + continue;
> Would it make sense to stick a DEBUG2 log entry in there to signal that such a
> database exist? (The same would apply for the similar hunk in autovacuum.c.)

I don't really have an opinion on it. Added.

elog(DEBUG2,
	 "skipping invalid database \"%s\" while computing relfrozenxid",
	 NameStr(dbform->datname));
and
elog(DEBUG2,
	 "autovacuum: skipping invalid database \"%s\"",
	 NameStr(pgdatabase->datname));


Updated patches attached.


Not looking forward to fixing all the conflicts.


Does anybody have an opinion about whether we should add a dedicated field to
pg_database for representing invalid databases in HEAD? I'm inclined to think
that it's not really worth the cross-version complexity at this point, and
it's not that bad a misuse to use pg_database.datconnlimit.

Greetings,

Andres Freund
>From 6cdbabf5f243d5227588df5bfd3f83018bcefb9a Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 22 Jun 2023 17:27:54 -0700
Subject: [PATCH v3 1/2] Release lock after encountering bogus row in
 vac_truncate_clog()

When 

Re: MERGE ... RETURNING

2023-07-11 Thread Vik Fearing

On 7/12/23 02:43, Jeff Davis wrote:
> On Sun, 2023-01-22 at 19:58 +0100, Alvaro Herrera wrote:
>> (We do have to keep our fingers
>> crossed that they will decide to use the same RETURNING syntax as we
>> do in this patch, of course.)
>
> Do we have a reason to think that they will accept something similar?

We have reason to think that they won't care at all.

There is no RETURNING clause in Standard SQL, and the way they would do
this is:

SELECT ...
FROM OLD TABLE (
MERGE ...
) AS m

The rules for that for MERGE are well defined.
--
Vik Fearing





Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 5:50 AM Jacob Champion  wrote:
> Oh, whoops, it's just the missed CLOEXEC flag in the final patch. (If
> the write side of the pipe gets copied around, it hangs open and the
> validator never sees the "end" of the token.) I'll switch the logic
> around to set the flag on the write side instead of unsetting it on
> the read side.

Oops, sorry about that.  Glad to hear it's all working!

(FTR my parenthetical note about macOS/XNU sources on Github was a
false alarm: the "apple" account has stopped publishing a redundant
copy of that, but "apple-oss-distributions" is the account I should
have been looking at and it is live.  I guess it migrated at some
point, or something.  Phew.)




Re: MERGE ... RETURNING

2023-07-11 Thread Jeff Davis
On Sun, 2023-01-22 at 19:58 +0100, Alvaro Herrera wrote:
> 
> (We do have to keep our fingers
> crossed that they will decide to use the same RETURNING syntax as we
> do
> in this patch, of course.)

Do we have a reason to think that they will accept something similar?

Regards,
Jeff Davis





Re: 'ERROR: attempted to update invisible tuple' from 'ALTER INDEX ... ATTACH PARTITION' on parent index

2023-07-11 Thread Michael Paquier
On Tue, Jul 11, 2023 at 10:52:16PM +0530, Shruthi Gowda wrote:
> While testing some use cases, I encountered 'ERROR:  attempted to update
> invisible tuple' when a partitioned index is attached to a parent index
> which is also a replica identity index.
> Below is the reproducible test case. The issue is seen only when the
> commands are executed inside a transaction.

Thanks for the report, reproduced here.

> The 'ALTER INDEX pk_foo ATTACH PARTITION foo_2023_id_ts_ix' returns
> "*ERROR:  attempted to update invisible tuple"*

While working recently on what has led to cfc43ae and fc55c7f, I
really got the feeling that there could be some command sequences that
lacked some CCIs (or CommandCounterIncrement calls) to make sure that
the catalog updates are visible in any follow-up steps in the same
transaction.

> The 'indisreplident' is false, the ctid field value is old and it does
> not reflect the ctid changes made by  'ALTER TABLE ONLY foo REPLICA
> IDENTITY USING INDEX pk_foo'.

Your report is telling that we are missing a CCI somewhere in this
sequence.  I would have thought that relation_mark_replica_identity()
is the correct place when the pg_index entry is dirtied, but that does
not seem correct.  Hmm.
--
Michael




Re: Support to define custom wait events for extensions

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 16:45:26 +0900, Michael Paquier wrote:
> +/* --
> + * Wait Events - Extension
> + *
> + * Use this category when the server process is waiting for some condition
> + * defined by an extension module.
> + *
> + * Extensions can define custom wait events.  First, call
> + * WaitEventExtensionNewTranche() just once to obtain a new wait event
> + * tranche.  The ID is allocated from a shared counter.  Next, each
> + * individual process using the tranche should call
> + * WaitEventExtensionRegisterTranche() to associate that wait event with
> + * a name.

What does "tranche" mean here? For LWLocks it makes some sense, it's used for
a set of lwlocks, not an individual one. But here that doesn't really seem to
apply?


> + * It may seem strange that each process using the tranche must register it
> + * separately, but dynamic shared memory segments aren't guaranteed to be
> + * mapped at the same address in all coordinating backends, so storing the
> + * registration in the main shared memory segment wouldn't work for that case.
> + */

I don't really see how this applies to wait events? There's no pointers
here...


> +typedef enum
> +{
> + WAIT_EVENT_EXTENSION = PG_WAIT_EXTENSION,
> + WAIT_EVENT_EXTENSION_FIRST_USER_DEFINED
> +} WaitEventExtension;
> +
> +extern void WaitEventExtensionShmemInit(void);
> +extern Size WaitEventExtensionShmemSize(void);
> +
> +extern uint32 WaitEventExtensionNewTranche(void);
> +extern void WaitEventExtensionRegisterTranche(uint32 wait_event_info,

> -slock_t    *ShmemLock;    /* spinlock for shared memory and LWLock
> +slock_t    *ShmemLock;    /* spinlock for shared memory, LWLock
> +                           * allocation, and named extension wait event
>                             * allocation */

I'm doubtful that it's a good idea to reuse the spinlock further. Given that
the patch adds WaitEventExtensionShmemInit(), why not just have a lock in
there?



> +/*
> + * Allocate a new event ID and return the wait event info.
> + */
> +uint32
> +WaitEventExtensionNewTranche(void)
> +{
> + uint16  eventId;
> +
> + SpinLockAcquire(ShmemLock);
> + eventId = (*WaitEventExtensionCounter)++;
> + SpinLockRelease(ShmemLock);
> +
> + return PG_WAIT_EXTENSION | eventId;
> +}

It seems quite possible to run out space in WaitEventExtensionCounter, so this
should error out in that case.


> +/*
> + * Register a dynamic tranche name in the lookup table of the current process.
> + *
> + * This routine will save a pointer to the wait event tranche name passed
> + * as an argument, so the name should be allocated in a backend-lifetime
> + * context (shared memory, TopMemoryContext, static constant, or similar).
> + *
> + * The "wait_event_name" will be user-visible as a wait event name, so try to
> + * use a name that fits the style for those.
> + */
> +void
> +WaitEventExtensionRegisterTranche(uint32 wait_event_info,
> +                                  const char *wait_event_name)
> +{
> + uint16  eventId;
> +
> + /* Check wait event class. */
> + Assert((wait_event_info & 0xFF000000) == PG_WAIT_EXTENSION);
> +
> + eventId = wait_event_info & 0x0000FFFF;
> +
> + /* This should only be called for user-defined tranches. */
> + if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
> + return;

Why not assert out in that case then?


> +/*
> + * Return the name of an Extension wait event ID.
> + */
> +static const char *
> +GetWaitEventExtensionIdentifier(uint16 eventId)
> +{
> + /* Build-in tranche? */

*built

> + if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
> + return "Extension";
> +
> + /*
> +  * It's an extension tranche, so look in WaitEventExtensionTrancheNames[].
> +  * However, it's possible that the tranche has never been registered by
> +  * calling WaitEventExtensionRegisterTranche() in the current process, in
> +  * which case give up and return "Extension".
> +  */
> + eventId -= NUM_BUILTIN_WAIT_EVENT_EXTENSION;
> +
> + if (eventId >= WaitEventExtensionTrancheNamesAllocated ||
> + WaitEventExtensionTrancheNames[eventId] == NULL)
> + return "Extension";

I'd return something different here, otherwise something that's effectively a
bug is not distinguishable from the built-in case.


> +++ b/src/test/modules/test_custom_wait_events/t/001_basic.pl
> @@ -0,0 +1,34 @@
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
> +
> +my $node = PostgreSQL::Test::Cluster->new('main');
> +
> +$node->init;
> +$node->append_conf(
> + 'postgresql.conf',
> + "shared_preload_libraries = 'test_custom_wait_events'"
> +);
> 

Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread David Rowley
On Wed, 12 Jul 2023 at 08:46, Tom Lane  wrote:
> A low-cost fix perhaps could be to unlink the lower rel's whole
> path list (set input_rel->pathlist = NIL, also zero the related
> fields such as cheapest_path) once we've finished selecting the
> paths we want for the upper rel.  That's not great for debuggability
> either, but maybe it's the most appropriate compromise.

I've not taken the time to fully understand this, but from reading the
thread, I'm not immediately understanding why we can't just shallow
copy the Path from the other RelOptInfo and replace the parent before
using it in the upper RelOptInfo.  Can you explain?

David




Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-11 Thread Peter Smith
Hi, here are some review comments for the v3 patch

==
Commit message

1.
89e46d allowed using indexes other than PRIMARY KEY or REPLICA IDENTITY on the
subscriber, but only the BTree index could be used. This commit extends the
limitation, now the Hash index can be also used.

~

Before giving details about the problems of the other index types it
might be good to summarize why these 2 types are OK.

SUGGESTION
These 2 types of indexes are the only ones supported because only these
- use fixed strategy numbers
- implement the "equality" strategy
- implement the function amgettuple()

~~~

2.
I'm not sure why the next paragraphs are numbered 1) and 2). Is that
necessary? It seems maybe a cut/paste hangover from the similar code
comment.

~~~

3.
1) Other indexes do not have a fixed set of strategy numbers at all. In
build_replindex_scan_key(), the operator which corresponds to
"equality comparison"
is searched by using the strategy number, but it is difficult for such indexes.
For example, GiST index operator classes for two-dimensional geometric
objects have
a comparison "same", but its strategy number is different from the
gist_int4_ops,
which is a GiST index operator class that implements the B-tree equivalent.

~

Don't need to say "at all"

~~~

4.
2) Some other indexes do not implement "amgettuple".
RelationFindReplTupleByIndex()
assumes that tuples could be fetched one by one via
index_getnext_slot(), but such
indexes are not supported.

~

4a.
"Some other indexes..." --> Maybe give an example here (e.g. XXX, YYY)
of indexes that do not implement the function.

~

4b.
BEFORE
RelationFindReplTupleByIndex() assumes that tuples could be fetched
one by one via index_getnext_slot(), but such indexes are not
supported.

AFTER
RelationFindReplTupleByIndex() assumes that tuples will be fetched one
by one via index_getnext_slot(), but this currently requires the
"amgettuple" function.


==
src/backend/executor/execReplication.c

5.
+ * 2) Some other indexes do not implement "amgettuple".
+ * RelationFindReplTupleByIndex() assumes that tuples could be fetched one by
+ * one via index_getnext_slot(), but such indexes are not supported. To make it
+ * use index_getbitmap() must be used, but not done yet because of the above
+ * reason.
+ */
+int
+get_equal_strategy_number_for_am(Oid am)

~

Change this text to better match exactly in the commit message (see
previous review comments above). Also I am not sure it is necessary to
say how it *might* be implemented in the future if somebody wanted to
enhance it to work also for "amgetbitmap" function. E.g. do we need
that sentence "To make it..."

~~~

6. get_equal_strategy_number_for_am

+ case GIST_AM_OID:
+ case SPGIST_AM_OID:
+ case GIN_AM_OID:
+ case BRIN_AM_OID:
+ default:

I am not sure it is necessary to spell out all these not-supported
cases in the switch. If seems sufficient just to say 'default:'
doesn't it?

~~~

7. get_equal_strategy_number

Two blank lines are following this function

~~~

8. build_replindex_scan_key

- * This is not generic routine, it expects the idxrel to be a btree,
non-partial
- * and have at least one column reference (i.e. cannot consist of only
- * expressions).
+ * This is not generic routine, it expects the idxrel to be a btree or a hash,
+ * non-partial and have at least one column reference (i.e. cannot consist of
+ * only expressions).

Take care. AFAIK this change will clash with changes Sawada-san is
making to the same code comment in another thread here [1].

==
src/backend/replication/logical/relation.c

9. FindUsableIndexForReplicaIdentityFull

  * Returns the oid of an index that can be used by the apply worker to scan
- * the relation. The index must be btree, non-partial, and have at least
- * one column reference (i.e. cannot consist of only expressions). These
+ * the relation. The index must be btree or hash, non-partial, and have at
+ * least one column reference (i.e. cannot consist of only expressions). These
  * limitations help to keep the index scan similar to PK/RI index scans.

~

Take care. AFAIK this change will clash with changes Sawada-san is
making to the same code comment in another thread here [1].

~

10.
+ /* Check whether the index is supported or not */
+ is_suitable_type = ((get_equal_strategy_number_for_am(indexInfo->ii_Am)
+ != InvalidStrategy));
+
+ is_partial = (indexInfo->ii_Predicate != NIL);
+ is_only_on_expression = IsIndexOnlyOnExpression(indexInfo);
+
+ return is_suitable_type && !is_partial && !is_only_on_expression;

I am not sure if the function IsIndexOnlyOnExpression() is even needed
anymore. Isn't it sufficient just to check up-front that the leftmost
INDEX field is a column, as that covers this condition also? Actually,
this same question is also open in the Sawada-san thread [1].

==
.../subscription/t/032_subscribe_use_index.pl

11.
AFAIK there is no test to verify that the leftmost index field must be
a column (e.g. cannot be an expression). Yes, there are some 

Re: Reducing connection overhead in pg_upgrade compat check phase

2023-07-11 Thread Nathan Bossart
On Wed, Jul 12, 2023 at 12:43:14AM +0200, Daniel Gustafsson wrote:
> I did have coffee before now, but only found time to actually address this now
> so here is a v7 with just that change and a fresh rebase.

Thanks.  I think the patch is in decent shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Support to define custom wait events for extensions

2023-07-11 Thread Michael Paquier
On Tue, Jul 11, 2023 at 12:39:52PM -0500, Tristan Partin wrote:
> Given the context of our last conversation, I assume this code was
> copied from somewhere else. Since this is new code, I think it would
> make more sense if newalloc was a uint16 or size_t.

This style comes from LWLockRegisterTranche() in lwlock.c.  Do you
think that it would be more adapted to change that to
pg_nextpower2_size_t() with a Size?  We could do that for the existing
code on HEAD as an improvement.

> From what I understand, Neon differs from upstream in some way related
> to this patch. I am trying to ascertain how that is. I hope to provide
> more feedback when I learn more about it.

Hmm, okay, that would nice to hear about to make sure that the
approach taken on this thread is able to cover what you are looking
for.  So you mean that Neon has been using something similar to
register wait events in the backend?  Last time I looked at the Neon
repo, I did not get the impression that there was a custom patch for
Postgres in this area.  All the in-core code paths using
WAIT_EVENT_EXTENSION would gain from the APIs added here, FWIW.
--
Michael




Re: Cleaning up threading code

2023-07-11 Thread Andres Freund
On 2023-07-12 08:58:29 +1200, Thomas Munro wrote:
> On Mon, Jul 10, 2023 at 10:45 AM Thomas Munro  wrote:
> > * defined ENABLE_THREAD_SAFETY 1 in c.h, for the benefit of extensions
> 
> I may lack imagination but I couldn't think of any use for that
> vestigial macro in backend/extension code, and client code doesn't see
> c.h and might not get the right answer anyway if it's dynamically
> linked which is the usual case.  I took it out for now.  Open to
> discussing further if someone can show what kinds of realistic
> external code would be affected.

WFM.




Re: Avoid unused value (src/fe_utils/print.c)

2023-07-11 Thread Ranier Vilela
On Tue, Jul 11, 2023 at 19:34, Marko Tiikkaja wrote:

> On Thu, Jul 6, 2023 at 5:37 PM Karina Litskevich
>  wrote:
> > My point is, technically right now you won't see any difference in output
> > if you remove the line. Because if we get to that line the need_recordsep
> > is already true. However, understanding why it is true is complicated.
> That's
> > why if you remove the line people who read the code will wonder why we
> don't
> > need a separator after "fputs"ing a footer. So keeping that line will
> make
> > the code more readable.
> > Moreover, removing the line will possibly complicate the future
> maintenance.
> > As I wrote in the part you just quoted, if the function changes in the
> way
> > that need_recordsep is not true right before printing footers any more,
> then
> > output will be unexpected.
>
> I agree with Karina here.  Either this patch should keep the
> "need_recordsep = true;" line, thus removing the no-op assignment to
> false and making the code slightly less unreadable; or the entire
> function should be refactored for readability.
>
 As there is consensus to keep the no-op assignment,
I will go ahead and reject the patch.

regards,
Ranier Vilela


Re: Reducing connection overhead in pg_upgrade compat check phase

2023-07-11 Thread Daniel Gustafsson
> On 11 Jul 2023, at 01:26, Daniel Gustafsson  wrote:
>> On 11 Jul 2023, at 01:09, Nathan Bossart  wrote:

>> I think we need to tmp++ somewhere here.
> 
> Yuk, yes, will fix when caffeinated. Thanks.

I did have coffee before now, but only found time to actually address this now
so here is a v7 with just that change and a fresh rebase.

--
Daniel Gustafsson



v7-0001-pg_upgrade-run-all-data-type-checks-per-connectio.patch
Description: Binary data


Re: Avoid unused value (src/fe_utils/print.c)

2023-07-11 Thread Marko Tiikkaja
On Thu, Jul 6, 2023 at 5:37 PM Karina Litskevich
 wrote:
> My point is, technically right now you won't see any difference in output
> if you remove the line. Because if we get to that line the need_recordsep
> is already true. However, understanding why it is true is complicated. That's
> why if you remove the line people who read the code will wonder why we don't
> need a separator after "fputs"ing a footer. So keeping that line will make
> the code more readable.
> Moreover, removing the line will possibly complicate the future maintenance.
> As I wrote in the part you just quoted, if the function changes in the way
> that need_recordsep is not true right before printing footers any more, then
> output will be unexpected.

I agree with Karina here.  Either this patch should keep the
"need_recordsep = true;" line, thus removing the no-op assignment to
false and making the code slightly less unreadable; or the entire
function should be refactored for readability.


.m




Re: Parallel CREATE INDEX for BRIN indexes

2023-07-11 Thread Matthias van de Meent
On Thu, 6 Jul 2023 at 16:13, Tomas Vondra  wrote:
>
> On 7/5/23 16:33, Matthias van de Meent wrote:
> > ...
> >
> >> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> >> do, and the little I knew I managed to forget since I wrote this patch.
> >> Which similar features do you have in mind?
> >
> > I was referring to the feature that is "emitting a single sorted run
> > of tuples at the leader backend based on data gathered in parallel
> > worker backends". It manages the sort state, on-disk runs etc. so that
> > you don't have to manage that yourself.
> >
> > Adding a new storage format for what is effectively a logical tape
> > (logtape.{c,h}) and manually merging it seems like a lot of changes if
> > that functionality is readily available, standardized and optimized in
> > sortsupport; and adds an additional place to manually go through for
> > disk-related changes like TDE.
> >
>
> Here's a new version of the patch, with three main changes:

Thanks! I've done a review on the patch, and most looks good. Some
places need cleanup and polish, some others more documentations, and
there are some other issues, but so far it's looking OK.

> One thing I was wondering about is whether it might be better to allow
> the workers to process overlapping ranges, and then let the leader to
> merge the summaries. That would mean we might not need the tableam.c
> changes at all, but the leader would need to do more work (although the
> BRIN indexes tend to be fairly small). The main reason that got me
> thinking about this is that we have pretty much no tests for the union
> procedures, because triggering that is really difficult. But for
> parallel index builds that'd be much more common.

Hmm, that's a good point. I don't mind either way, but it would add
overhead in the leader to do all of that merging - especially when you
configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
need to merge up to parallel_workers tuples. That could be a
significant overhead.

... thinks a bit.

Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
unlikely to be a serious problem - the per-backend IO saved of such
large ranges on a single backend has presumably much more impact than
the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
with me.

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

---

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

> +BrinShared   *brinshared;

Needs some indentation fixes.

> +    int             bs_reltuples;
> [...]
> +    state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >=2^31
live tuples. Tables > 56GB could begin to get affected by this.

> +    int             bs_worker_id;

This variable seems to be unused.

> +BrinSpool  *bs_spool;
> +BrinSpool  *bs_spool_out;

Are both used? If so, could you add comments why we have two spools
here, instead of only one?

> +/*
> + * A version of the callback, used by parallel index builds. The main
> + * difference is that instead of writing the BRIN tuples into the index,
> + * we write them into a shared tuplestore file, and leave the insertion
> + * up to the leader (which may

+ ... shared tuplesort, and ...

> brinbuildCallbackParallel(...)
> +while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

shouldn't this be an 'if' now?

> +while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> +    state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply like the following?:

+state->bs_currRangeStart += state->bs_pagesPerRange *
((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> [...]
> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
> [...]
> -/* compare phs_syncscan initialization to similar logic in initscan */
> +bpscan->phs_chunk_factor = chunk_factor;
> +/* compare phs_syncscan initialization to similar logic in initscan
> + *
> + * Disable sync scans if the chunk factor is set (valid block number).
> + */

I think this needs some pgindent or other style work, both on comment
style and line lengths

> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> [...]
> +Assert(false); (x3)

I think these can be cleaned up, right?

> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> 

Re: Cleaning up threading code

2023-07-11 Thread Thomas Munro
On Mon, Jul 10, 2023 at 10:45 AM Thomas Munro  wrote:
> * defined ENABLE_THREAD_SAFETY 1 in c.h, for the benefit of extensions

I may lack imagination but I couldn't think of any use for that
vestigial macro in backend/extension code, and client code doesn't see
c.h and might not get the right answer anyway if it's dynamically
linked which is the usual case.  I took it out for now.  Open to
discussing further if someone can show what kinds of realistic
external code would be affected.

> * defined ENABLE_THREAD_SAFETY 1 ecpg_config.h, for the benefit of ECPG 
> clients

I left this one in.  I'm not sure if it could really be needed.
Perhaps at a stretch, perhaps ECPG code that is statically linked
might test that instead of calling PQisthreadsafe().

Pushed.




Re: MERGE ... RETURNING

2023-07-11 Thread Gurjeet Singh
On Tue, Jul 11, 2023 at 1:43 PM Gurjeet Singh  wrote:
>
> In the above hunk, if there's an exception/ERROR, I believe we should
> PG_RE_THROW(). If there's a reason to continue, we should at least set
> rslot = NULL, otherwise we may be returning an uninitialized value to
> the caller.

Excuse the brain-fart on my part. There's not need to PG_RE_THROW(),
since there's no PG_CATCH(). Re-learning the code's infrastructure
slowly :-)

Best regards,
Gurjeet
http://Gurje.et




Re: Extensible storage manager API - SMGR hook Redux

2023-07-11 Thread Tristan Partin
> Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation

From what I can see, all the md* APIs that were exposed in md.h can now
be made static in md.c. The only other references to those APIs were in
smgr.c.

> Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
>  they use

> -typedef uint8 SMgrId;
> +/*
> + * volatile ID of the smgr. Across various configurations IDs may vary,
> + * true identity is the name of each smgr.
> + */
> +typedef int SMgrId;
> 
> -#define MaxSMgrId UINT8_MAX
> +#define MaxSMgrId  INT_MAX

In a future revision of this patch, seems worthwhile to just start as
int instead of a uint8 to avoid this song and dance. Maybe int8 instead
of int?

> +static SMgrId recent_smgrid = -1;

You could use InvalidSmgrId here.

> +void smgrvalidatetspopts(const char *smgrname, List *opts)
> +{
> +   SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +   smgrsw[smgrid].smgr_validate_tspopts(opts);
> +}
> +
> +void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo)
> +{
> +   SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +   smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo);
> +}
> +
> +void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo)
> +{
> +   SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +   smgrsw[smgrid].smgr_drop_tsp(tsp, isredo);
> +}

Do you not need to check if smgrid is the InvalidSmgrId? I didn't see
any other validation anywhere.

> +   char   *smgr;
> +   List   *smgropts; /* list of DefElem nodes */

smgrname would probably work better alongside tablespacename in that
struct.

> @@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
> if (!InRecovery)
> mdclose(reln, forknum);
> 
> -   return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
> +   return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
>  }

Was this a victim of a bad rebase? Seems like it belongs in the previous
patch.

> +void mddroptsp(Oid tsp, bool isredo)
> +{
> +
> +}

Some functions in this file have the return type on the previous line.

This is a pretty slick patchset. Excited to read more discussion and see how
it evolves.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Tom Lane
So what's going on here is that create_ordered_paths() does this:

foreach(lc, input_rel->pathlist)
{
Path   *input_path = (Path *) lfirst(lc);

if (/* input path is suitably sorted already */)
sorted_path = input_path;
else
/* build a sorted path atop input_path */

/* Add projection step if needed */
if (sorted_path->pathtarget != target)
sorted_path = apply_projection_to_path(root, ordered_rel,
   sorted_path, target);

add_path(ordered_rel, sorted_path);
}

Thus, if the input RelOptInfo has a path that already has the correct
ordering and output target, we'll try to add that path directly to
the output RelOptInfo.  This is cheating in at least two ways:

1. The path's parent link isn't right: it still points to the input rel.

2. As per Jeevan's report, we now potentially have two different links
to the path.  add_path could reject and free the path immediately,
or it could do so later while comparing it to some path offered later
for the output RelOptInfo, and either case leads to a dangling pointer
in the input RelOptInfo's pathlist.

Now, the reason we get away with #2 is that nothing looks at the lower
RelOptInfo's pathlist anymore after create_ordered_paths: we will only
be interested in paths that contribute to a surviving Path in the
output RelOptInfo, and those will be linked directly from the upper
Path.  However, that's clearly kind of fragile, plus it's a bit
surprising that nobody has complained about #1.

We could probably fix this by creating a rule that you *must*
wrap a Path for a lower RelOptInfo into some sort of wrapper
Path before offering it as a candidate for an upper RelOptInfo.
(This could be cross-checked by having add_path Assert that
new_path->parent == parent_rel.  The wrapper could be a do-nothing
ProjectionPath, perhaps.)  But I think there are multiple places
taking similar shortcuts, so I'm a bit worried about how much overhead
we'll add for what seems likely to be only a debugging annoyance.

A low-cost fix perhaps could be to unlink the lower rel's whole
path list (set input_rel->pathlist = NIL, also zero the related
fields such as cheapest_path) once we've finished selecting the
paths we want for the upper rel.  That's not great for debuggability
either, but maybe it's the most appropriate compromise.

I don't recall how clearly I understood this while writing the
upper-planner-pathification patch years ago.  I think I did
realize the code was cheating, but if so I failed to document
it, so far as I can see.

regards, tom lane




Re: MERGE ... RETURNING

2023-07-11 Thread Gurjeet Singh
On Fri, Jul 7, 2023 at 3:48 PM Dean Rasheed  wrote:
>
> On Thu, 6 Jul 2023 at 06:13, Gurjeet Singh  wrote:
> >
> > > One minor annoyance is that, due to the way that pg_merge_action() and
> > > pg_merge_when_clause() require access to the MergeActionState, they
> > > only work if they appear directly in the RETURNING list. They can't,
> > > for example, appear in a subquery in the RETURNING list, and I don't
> > > see an easy way round that limitation.
> >
> > I believe that's a serious limitation, and would be a blocker for the 
> > feature.
>
> Yes, that was bugging me for quite a while.
>
> Attached is a new version that removes that restriction, allowing the
> merge support functions to appear anywhere. This requires a bit of
> planner support, to deal with merge support functions in subqueries,
> in a similar way to how aggregates and GROUPING() expressions are
> handled. But a number of the changes from the previous patch are no
> longer necessary, so overall, this version of the patch is slightly
> smaller.

+1

> > I think the name of function pg_merge_when_clause() can be improved.
> > How about pg_merge_when_clause_ordinal().
> >
>
> That's a bit of a mouthful, but I don't have a better idea right now.
> I've left the names alone for now, in case something better occurs to
> anyone.

+1. How do we make sure we don't forget that it needs to be named
better? Perhaps a TODO item within the patch will help.

> > In the doc page of MERGE, you've moved the 'with_query' from the
> > bottom of Parameters section to the top of that section. Any reason
> > for this? Since the Parameters section is quite large, for a moment I
> > thought the patch was _adding_ the description of 'with_query'.
> >
>
> Ah yes, I was just making the order consistent with the
> INSERT/UPDATE/DELETE pages. That could probably be committed
> separately.

I don't think that's necessary, if it improves consistency with related docs.

> > +/*
> > + * Merge support functions should only be called directly from a MERGE
> > + * command, and need access to the parent ModifyTableState.  The parser
> > + * should have checked that such functions only appear in the RETURNING
> > + * list of a MERGE, so this should never fail.
> > + */
> > +if (IsMergeSupportFunction(funcid))
> > +{
> > +if (!state->parent ||
> > +!IsA(state->parent, ModifyTableState) ||
> > +((ModifyTableState *) state->parent)->operation != CMD_MERGE)
> > +elog(ERROR, "merge support function called in non-merge context");
> >
> > As the comment says, this is an unexpected condition, and should have
> > been caught and reported by the parser. So I think it'd be better to
> > use an Assert() here. Moreover, there's ERROR being raised by the
> > pg_merge_action() and pg_merge_when_clause() functions, when they
> > detect the function context is not appropriate.
> >
> > I found it very innovative to allow these functions to be called only
> > in a certain part of certain SQL command. I don't think  there's a
> > precedence for this in  Postgres; I'd be glad to learn if there are
> > other functions that Postgres exposes this way.
> >
> > -EXPR_KIND_RETURNING,/* RETURNING */
> > +EXPR_KIND_RETURNING,/* RETURNING in INSERT/UPDATE/DELETE */
> > +EXPR_KIND_MERGE_RETURNING,  /* RETURNING in MERGE */
> >
> > Having to invent a whole new ParseExprKind enum, feels like a bit of
> > an overkill. My only objection is that this is exactly like
> > EXPR_KIND_RETURNING, hence EXPR_KIND_MERGE_RETURNING needs to be dealt
> > with exactly in as many places. But I don't have any alternative
> > proposals.
> >
>
> That's all gone now from the new patch, since there is no longer any
> restriction on where these functions can appear.

I believe this elog can be safely turned into an Assert.

+switch (mergeActionCmdType)
 {
-CmdType commandType = relaction->mas_action->commandType;

+case CMD_INSERT:

+default:
+elog(ERROR, "unrecognized commandType: %d", (int) mergeActionCmdType);

The same treatment can be applied to the elog(ERROR) in pg_merge_action().

> > +CREATE FUNCTION merge_into_sq_target(sid int, balance int, delta int,
> > + OUT action text, OUT tid int,
> > OUT new_balance int)
> > +LANGUAGE sql AS
> > +$$
> > +MERGE INTO sq_target t
> > +USING (VALUES ($1, $2, $3)) AS v(sid, balance, delta)
> > +ON tid = v.sid
> > +WHEN MATCHED AND tid > 2 THEN
> >
> > Again, for consistency, the comparison operator should be >=. There
> > are a few more occurrences of this comparison in the rest of the file,
> > that need the same treatment.
> >
>
> I changed the new tests to use ">= 2" (and the COPY test now returns 3
> rows, with an action of each type, which is easier to read), but I
> don't think it's really necessary to change all the existing tests
> from "> 2". There's nothing 

Re: Document that server will start even if it's unable to open some TCP/IP ports

2023-07-11 Thread Gurjeet Singh
On Mon, Jun 19, 2023 at 5:48 PM Bruce Momjian  wrote:
>
> On Tue, Jun 13, 2023 at 11:11:04PM -0400, Tom Lane wrote:
> >
> > There is certainly an argument that such a condition indicates that
> > something's very broken in our configuration and we should complain.
> > But I'm not sure how exciting the case is in practice.  The systemd
> > guys would really like us to be willing to come up before any network
> > interfaces are up, and then auto-listen to those interfaces when they
> > do come up.

That sounds like a reasonable expectation, as the network conditions
can change without any explicit changes made by someone.

> > On the other hand, the situation with Unix sockets is
> > much more static: if you can't make a socket in /tmp or /var/run at
> > the instant of postmaster start, it's unlikely you will be able to do
> > so later.

I think you're describing a setup where Postgres startup is automatic,
as part of server/OS startup. That is the most common case.

In cases where someone is performing a Postgres startup manually, they
are very likely to have the permissions to fix the problem preventing
the startup.

> > Maybe we need different rules for TCP versus Unix-domain sockets?
> > I'm not sure what exactly, but lumping those cases together for
> > a discussion like this feels wrong.

+1.

> If we are going to retry for network configuration changes, it seems we
> would also retry Unix domain sockets for cases like when the permissions
> are wrong, and then fixed.

The network managers (systemd, etc.) are expected to respond to
dynamic conditions, and hence they may perform network config changes
in response to things like network outages, and hardware failures,
time of day, etc.

On the other hand, the permissions required to create files for Unix
domain sockets are only expected to change if someone decides to make
that change. I wouldn't expect these permissions to be changed
dynamically.

On those grounds, keeping the treatment of Unix domain sockets out of
this discussion for this patch seems reasonable.

> However, it seems hard to figure out exactly what _is_ working if we take
> the approach of dynamically retrying listen methods.  Do we report
> anything helpful in the server logs when we start and can't listen on
> anything?

Yes. For every host listed in listen_addresses, if Postgres fails to
open the port on that address, we get a WARNING message in the server
log. After the end of processing of a non-empty listen_addresses, if
there are zero successfully opened TCP/IP sockets, the server exits (with
a FATAL message, IIRC).

Best regards,
Gurjeet
http://Gurje.et




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):
>
> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
>
> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the previous
> commit of 00d1e02be2), I got better numbers; it didn't have better
> scalability, though:
>
> PG16: nclients = 1, execution time = 17.444

I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.

1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM

This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?


2) pg_strtoint32_safe() got substantially slower, mainly due to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals

pinned to one cpu, turbo mode disabled, I get the following best-of-three times 
for
  copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)

15:
6281.107 ms

HEAD:
7000.469 ms

backing out 9f8377f7a27:
6433.516 ms

also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms


I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().



For higher concurrency numbers, I now was able to reproduce the regression, to
a smaller degree. Much smaller after fixing the above. The reason we run into
the issue here is basically that the rows in the test are very narrow and reach

#define MAX_BUFFERED_TUPLES 1000

at a small number of pages, so we go back and forth between extending with
fallocate() and not.

I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):

(this is again with turboboost disabled, to get more reproducible numbers)

clients                        1      2      4      8     16     32

15,buffered=1000           25725  13211   9232   5639   4862   4700
15,buffered=5000           26107  14550   8644   6050   4943   4766
HEAD+fixes,buffered=1000   25875  14505   8200   4900   3565   3433
HEAD+fixes,buffered=5000   25830  12975   6527   3594   2739   2642

Greetings,

Andres Freund

[1] 
https://postgr.es/m/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw%2B4hm%3D2EOFXbJQ3tOg%40mail.gmail.com




Re: tablecmds.c/MergeAttributes() cleanup

2023-07-11 Thread Alvaro Herrera
On 2023-Jun-28, Peter Eisentraut wrote:

> The MergeAttributes() and related code in and around tablecmds.c is huge and
> ancient, with many things bolted on over time, and difficult to deal with.
> I took some time to make careful incremental updates and refactorings to
> make the code easier to follow, more compact, and more modern in appearance.
> I also found several pieces of obsolete code along the way.  This resulted
> in the attached long patch series.  Each patch tries to make a single change
> and can be considered incrementally.  At the end, the code is shorter,
> better factored, and I hope easier to understand.  There shouldn't be any
> change in behavior.

I spent a few minutes doing a test merge of this to my branch with NOT
NULL changes.  Here's a quick review.

> Subject: [PATCH 01/17] Remove obsolete comment about OID support

Obvious, trivial.  +1

> Subject: [PATCH 02/17] Remove ancient special case code for adding oid columns

LGTM; deletes dead code.

> Subject: [PATCH 03/17] Remove ancient special case code for dropping oid
>  columns

Hmm, interesting.  Yay for more dead code removal.  Didn't verify it.

> Subject: [PATCH 04/17] Make more use of makeColumnDef()

Good idea, +1.  Some lines (almost all makeColumnDef callsites) end up
too long.  This is the first patch that actually conflicts with the NOT
NULLs one, and the conflicts are easy to solve, so I won't complain if
you want to get it pushed soon.

> Subject: [PATCH 05/17] Clean up MergeAttributesIntoExisting()

I don't disagree with this in principle, but this one has more
conflicts than the previous ones.


> Subject: [PATCH 06/17] Clean up MergeCheckConstraint()

Looks a reasonable change as far as this patch goes.

However, reading it I noticed that CookedConstraint->inhcount is int
and is tested for wraparound, but pg_constraint.coninhcount is int16.
This is probably bogus already.  ColumnDef->inhcount is also int.  These
should be narrowed to match the catalog definitions.


> Subject: [PATCH 07/17] MergeAttributes() and related variable renaming

I think this makes sense, but there's a bunch of conflicts to NOT NULLs.
I guess we can come back to this one later.

> Subject: [PATCH 08/17] Improve some catalog documentation
> 
> Point out that typstorage and attstorage are never '\0', even for
> fixed-length types.  This is different from attcompression.  For this
> reason, some of the handling of these columns in tablecmds.c etc. is
> different.  (catalogs.sgml already contained this information in an
> indirect way.)

I don't understand why we must point out that they're never '\0'.  I
mean, if we're doing that, why not say that they can never be \xFF?
The valid values are already listed.  The other parts of this patch look
okay.

> Subject: [PATCH 09/17] Remove useless if condition
> 
> This is useless because these fields are not set anywhere before, so
> we can assign them unconditionally.  This also makes this more
> consistent with ATExecAddColumn().

Makes sense.

> Subject: [PATCH 10/17] Remove useless if condition
> 
> We can call GetAttributeCompression() with a NULL argument.  It
> handles that internally already.  This change makes all the callers of
> GetAttributeCompression() uniform.

I agree, +1.

> From 2eda6bc9897d0995a5112e2851c51daf0c35656e Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Wed, 14 Jun 2023 17:51:31 +0200
> Subject: [PATCH 11/17] Refactor ATExecAddColumn() to use
>  BuildDescForRelation()
> 
> BuildDescForRelation() has all the knowledge for converting a
> ColumnDef into pg_attribute/tuple descriptor.  ATExecAddColumn() can
> make use of that, instead of duplicating all that logic.  We just pass
> a one-element list of ColumnDef and we get back exactly the data
> structure we need.  Note that we don't even need to touch
> BuildDescForRelation() to make this work.

Hmm, crazy.  I'm not sure I like this, because it seems much too clever.
The number of lines that are deleted is alluring, though.

Maybe it'd be better to create a separate routine that takes a single
ColumnDef and returns the Form_pg_attribute element for it; then use
that in both BuildDescForRelation and ATExecAddColumn.

> Subject: [PATCH 12/17] Push attidentity and attgenerated handling into
>  BuildDescForRelation()
> 
> Previously, this was handled by the callers separately, but it can be
> trivially moved into BuildDescForRelation() so that it is handled in a
> central place.

Looks reasonable.



I think the last few patches are the more innovative (interesting,
useful) of the bunch.  Let's get the first few out of the way.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: POC, WIP: OR-clause support for indexes

2023-07-11 Thread Alena Rybakina

Hi!

On 11.07.2023 11:47, Andrey Lepikhov wrote:
> This patch looks much better than earlier. But it definitely needs
> some covering with tests. As a first simple approximation, here you
> can see the result of regression tests, where the transformation limit
> is set to 0. See in the attachment some test changes induced by these
> diffs.


Yes, I think so too. I also added some tests. I have attached an 
additional diff-5.diff where you can see the changes.

> Also, I see some impact of the transformation to other queries:
> create_view.out:
> (NOT x > z) > (x <= z)
> inherit.out:
> (((a)::text = 'ab'::text) OR ((a)::text = ANY ('{NULL,cd}'::text[])))
> to -
> (((a)::text = ANY ('{NULL,cd}'::text[])) OR ((a)::text = 'ab'::text))
>
> Transformations, mentioned above, are correct, of course. But it can
> be a sign of possible unstable behavior.


I think it can be made more stable if we always add the existing
transformed expressions first, and then the original ones, or vice
versa. To do this, we will need two more lists, I think, and then we can
combine them, where the elements of the second will be written to the
end of the first.


But I suppose that this may not be the only unstable behavior - I
suppose we need to sort the result elements on the left side; what do
you think?


--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index cc229d4dcaf..60e053d2179 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1922,81 +1922,6 @@ SELECT count(*) FROM tenk1
 10
 (1 row)
 
-EXPLAIN (COSTS OFF)
-SELECT count(*) FROM tenk1
-  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
-   QUERY PLAN   
-
- Aggregate
-   ->  Bitmap Heap Scan on tenk1
- Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
- ->  BitmapOr
-   ->  Bitmap Index Scan on tenk1_thous_tenthous
- Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
-   ->  Bitmap Index Scan on tenk1_thous_tenthous
- Index Cond: (thousand = 41)
-(8 rows)
-
-SELECT count(*) FROM tenk1
-  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
- count 

-10
-(1 row)
-
-EXPLAIN (COSTS OFF)
-SELECT count(*) FROM tenk1
-  WHERE hundred = 42 AND (thousand = 42 OR thousand = 99 OR tenthous < 2) OR thousand = 41;
- QUERY PLAN  
--
- Aggregate
-   ->  Bitmap Heap Scan on tenk1
- Recheck Cond: (((hundred = 42) AND ((tenthous < 2) OR (thousand = ANY ('{42,99}'::integer[] OR (thousand = 41))
- ->  BitmapOr
-   ->  BitmapAnd
- ->  Bitmap Index Scan on tenk1_hundred
-   Index Cond: (hundred = 42)
- ->  BitmapOr
-   ->  Bitmap Index Scan on tenk1_thous_tenthous
- Index Cond: (tenthous < 2)
-   ->  Bitmap Index Scan on tenk1_thous_tenthous
- Index Cond: (thousand = ANY ('{42,99}'::integer[]))
-   ->  Bitmap Index Scan on tenk1_thous_tenthous
- Index Cond: (thousand = 41)
-(14 rows)
-
-SELECT count(*) FROM tenk1
-  WHERE hundred = 42 AND (thousand = 42 OR thousand = 99 OR tenthous < 2) OR thousand = 41;
- count 

-20
-(1 row)
-
-EXPLAIN (COSTS OFF)
-SELECT count(*) FROM tenk1
-  WHERE hundred = 42 AND (thousand = 42 OR thousand = 41 OR thousand = 99 AND tenthous = 2);
-  QUERY PLAN  
---
- Aggregate
-   ->  Bitmap Heap Scan on tenk1
- Recheck Cond: ((hundred = 42) AND (((thousand = 99) AND (tenthous = 2)) OR (thousand = ANY ('{42,41}'::integer[]
- ->  BitmapAnd
-   ->  Bitmap Index Scan on tenk1_hundred
- Index Cond: (hundred = 42)
-   ->  BitmapOr
- ->  Bitmap Index Scan on tenk1_thous_tenthous
-   Index Cond: ((thousand = 99) AND (tenthous = 2))
- ->  Bitmap Index Scan on tenk1_thous_tenthous
-   Index Cond: (thousand = ANY ('{42,41}'::integer[]))
-(11 rows)
-
-SELECT count(*) FROM tenk1
-  WHERE 

Re: stats test intermittent failure

2023-07-11 Thread Alexander Lakhin

Hi Melanie,

10.07.2023 21:35, Melanie Plageman wrote:

Hi,

Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).

Failing test diff for those who prefer not to scroll:

+++ 
/home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out
2023-07-07 18:48:25.976313231 +
@@ -1415,7 +1415,7 @@
 :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
   ?column? | ?column?
  --+--
- t| t
+ t| f

My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.


I managed to reproduce that failure with the attached patch applied
(on master) and with the following script (that effectively multiplies
probability of the failure by 360):
CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests  && make  -s -j`nproc` && make -s check 
-C src/test/recovery

mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done

for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag make 
-s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 || 
break; done


Without your patch, I get:
iteration 2
...
recovery5   #   Failed test 'regression tests pass'
recovery5   #   at t/027_stream_regress.pl line 92.
recovery5   #  got: '256'
recovery5   # expected: '0'
...
src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:
--- .../src/test/regress/expected/stats.out  2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out 2023-07-11 
20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
    :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 --+--
- t    | t
+ t    | f
 (1 row)

With your patch applied, 10 iterations performed successfully for me.
So it looks like your theory and your fix are correct.

Best regards,
Alexander

diff --git a/src/test/regress/expected/compression.out b/src/test/regress/expected/compression.out
index 834b7555cb..6a84afd819 100644
--- a/src/test/regress/expected/compression.out
+++ b/src/test/regress/expected/compression.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS cmdata, cmdata1, cmmove1, cmmove3, cmdata2, cmmove2, cmpart, cmpart2 CASCADE;
+RESET client_min_messages;
 \set HIDE_TOAST_COMPRESSION false
 -- ensure we get stable results regardless of installation's default
 SET default_toast_compression = 'pglz';
diff --git a/src/test/regress/expected/compression_1.out b/src/test/regress/expected/compression_1.out
index ddcd137c49..7ff83d2472 100644
--- a/src/test/regress/expected/compression_1.out
+++ b/src/test/regress/expected/compression_1.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS cmdata, cmdata1, cmmove1, cmmove3, cmdata2, cmmove2, cmpart, cmpart2 CASCADE;
+RESET client_min_messages;
 \set HIDE_TOAST_COMPRESSION false
 -- ensure we get stable results regardless of installation's default
 SET default_toast_compression = 'pglz';
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 1aca77491b..d8ad54ec5d 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP FUNCTION IF EXISTS explain_filter, explain_filter_to_json CASCADE;
+RESET client_min_messages;
 --
 -- EXPLAIN
 --
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 3e5645c2ab..cbd5d3affb 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1,3 +1,7 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS idxpart, idxpart_another, covidxpart, covidxpart3, covidxpart4 CASCADE;
+DROP SCHEMA IF EXISTS regress_indexing CASCADE;
+RESET client_min_messages;
 -- Creating an index on a partitioned table makes the partitions
 -- automatically get the index
 create table idxpart (a int, b int, c text) partition by range (a);
diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out
index f5202430f8..8577958685 100644
--- a/src/test/regress/expected/memoize.out
+++ b/src/test/regress/expected/memoize.out
@@ -1,3 +1,6 @@

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-11 Thread Jacob Champion
On Mon, Jul 10, 2023 at 4:50 PM Jacob Champion  wrote:
> I don't understand why the
> server-side tests are failing on FreeBSD, but they shouldn't be using
> the libpq code at all, so I think your kqueue implementation is in the
> clear.

Oh, whoops, it's just the missed CLOEXEC flag in the final patch. (If
the write side of the pipe gets copied around, it hangs open and the
validator never sees the "end" of the token.) I'll switch the logic
around to set the flag on the write side instead of unsetting it on
the read side.

I have a WIP patch that passes tests on FreeBSD, which I'll clean up
and post Sometime Soon. macOS builds now but still fails before it
runs the test; looks like it's having trouble finding OpenSSL during
`pip install` of the test modules...

Thanks!
--Jacob




Re: Support to define custom wait events for extensions

2023-07-11 Thread Tristan Partin
> From bf06b8100cb747031959fe81a2d19baabc4838cf Mon Sep 17 00:00:00 2001
> From: Masahiro Ikeda 
> Date: Fri, 16 Jun 2023 11:53:29 +0900
> Subject: [PATCH 1/2] Support custom wait events for extensions.

> + * This is indexed by event ID minus NUM_BUILTIN_WAIT_EVENT_EXTENSION, and
> + * stores the names of all dynamically-created event ID known to the current
> + * process.  Any unused entries in the array will contain NULL.

The second ID should be plural.

> +   /* If necessary, create or enlarge array. */
> +   if (eventId >= ExtensionWaitEventTrancheNamesAllocated)
> +   {
> +   int newalloc;
> +
> +   newalloc = pg_nextpower2_32(Max(8, eventId + 1));

Given the context of our last conversation, I assume this code was
copied from somewhere else. Since this is new code, I think it would
make more sense if newalloc was a uint16 or size_t.

From what I understand, Neon differs from upstream in some way related
to this patch. I am trying to ascertain how that is. I hope to provide
more feedback when I learn more about it.

-- 
Tristan Partin
Neon (https://neon.tech)




Issue in _bt_getrootheight

2023-07-11 Thread Ahmed Ibrahim
Hi everyone,

We have been working on the pg_adviser extension, whose goal is to
suggest indexes by creating virtual/hypothetical indexes and seeing how
they affect the query cost.

The hypothetical index shouldn't take any space on disk (it allocates 0
pages), so we give it the flag *INDEX_CREATE_SKIP_BUILD*.
But the problem comes when the function *get_relation_info* is
called in the planning stage: it tries to calculate the B-Tree height by
calling the function *_bt_getrootheight*, but the B-Tree is not built at
all, and its metadata page (which is block 0 in our case) doesn't exist,
so this returns an error that it cannot read the page (since it doesn't
exist).

I tried to debug the code and found that this feature was introduced in
version 9.3 under this commit [1]. I think that in the code we need to
check if it's a B-Tree index *AND* the index is built/has some pages;
then we can go and calculate it, otherwise just set it to -1.

I mean instead of this
if (info->relam == BTREE_AM_OID)
{
    /* For btrees, get tree height while we have the index open */
    info->tree_height = _bt_getrootheight(indexRelation);
}
else
{
    /* For other index types, just set it to "unknown" for now */
    info->tree_height = -1;
}

The first line should be
if (info->relam == BTREE_AM_OID && info->pages > 0)
or use the storage manager (smgr) to know if the first block exists.

I would appreciate it if anyone can agree/approve or deny, so that I
know if there's anything I am missing :)

Thanks everyone :)

[1]
https://github.com/postgres/postgres/commit/31f38f28b00cbe2b9267205359e3cf7bafa1cb97


Re: add non-option reordering to in-tree getopt_long

2023-07-11 Thread Nathan Bossart
On Tue, Jul 11, 2023 at 04:16:09PM +0900, Kyotaro Horiguchi wrote:
> I like it. We don't need to overcomplicate things just for the sake of
> speed here. Plus, this version looks the most simple to me. That being
> said, it might be better if the last term is positioned in the second
> place. This is mainly because a lone hyphen is less common than words
> that start with characters other than a hyphen.

Sure.  I did it this way in v7.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7c978c11c8012cbfa67646bfc37849cb061f4e07 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 7 Jul 2023 20:00:47 -0700
Subject: [PATCH v7 1/1] Teach in-tree getopt_long() to move non-options to the
 end of argv.

Unlike the other implementations of getopt_long() I could find, the
in-tree implementation does not reorder non-options to the end of
argv.  Instead, it returns -1 as soon as the first non-option is
found, even if there are other options listed afterwards.  By
moving non-options to the end of argv, getopt_long() can parse all
specified options and return -1 when only non-options remain.
This quirk is periodically missed by hackers (e.g., 869aa40a27,
ffd398021c, and d9ddc50baf).

This commit introduces the aforementioned non-option reordering
behavior to the in-tree getopt_long() implementation.  The only
systems I'm aware of that use it are Windows and AIX, both of which
have been tested.

Special thanks to Noah Misch for his help validating behavior on
AIX.

Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20230609232257.GA121461%40nathanxps13
---
 src/bin/scripts/t/040_createuser.pl | 10 +++---
 src/port/getopt_long.c  | 54 -
 2 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/src/bin/scripts/t/040_createuser.pl b/src/bin/scripts/t/040_createuser.pl
index 9ca282181d..3290e30c88 100644
--- a/src/bin/scripts/t/040_createuser.pl
+++ b/src/bin/scripts/t/040_createuser.pl
@@ -42,9 +42,8 @@ $node->issues_sql_like(
 	'add a role as a member with admin option of the newly created role');
 $node->issues_sql_like(
 	[
-		'createuser', '-m',
-		'regress_user3', '-m',
-		'regress user #4', 'REGRESS_USER5'
+		'createuser', 'REGRESS_USER5', '-m', 'regress_user3',
+		'-m', 'regress user #4'
 	],
 	qr/statement: CREATE ROLE "REGRESS_USER5" NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS ROLE regress_user3,"regress user #4";/,
 	'add a role as a member of the newly created role');
@@ -73,11 +72,14 @@ $node->issues_sql_like(
 	qr/statement: CREATE ROLE regress_user11 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--role');
 $node->issues_sql_like(
-	[ 'createuser', '--member-of', 'regress_user1', 'regress_user12' ],
+	[ 'createuser', 'regress_user12', '--member-of', 'regress_user1' ],
 	qr/statement: CREATE ROLE regress_user12 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--member-of');
 
 $node->command_fails([ 'createuser', 'regress_user1' ],
 	'fails if role already exists');
+$node->command_fails(
+	[ 'createuser', 'regress_user1', '-m', 'regress_user2', 'regress_user3' ],
+	'fails for too many non-options');
 
 done_testing();
diff --git a/src/port/getopt_long.c b/src/port/getopt_long.c
index c989276988..b290a2bab9 100644
--- a/src/port/getopt_long.c
+++ b/src/port/getopt_long.c
@@ -50,8 +50,9 @@
  * This implementation does not use optreset.  Instead, we guarantee that
  * it can be restarted on a new argv array after a previous call returned -1,
  * if the caller resets optind to 1 before the first call of the new series.
- * (Internally, this means we must be sure to reset "place" to EMSG before
- * returning -1.)
+ *
+ * Note that this routine reorders the pointers in argv (despite the const
+ * qualifier) so that all non-options will be at the end when -1 is returned.
  */
 int
 getopt_long(int argc, char *const argv[],
@@ -60,40 +61,59 @@ getopt_long(int argc, char *const argv[],
 {
 	static char *place = EMSG;	/* option letter processing */
 	char	   *oli;			/* option letter list index */
+	static int	nonopt_start = -1;
+	static bool force_nonopt = false;
 
 	if (!*place)
 	{			/* update scanning pointer */
-		if (optind >= argc)
+		char	  **args = (char **) argv;
+
+retry:
+
+		/*
+		 * If we are out of arguments or only non-options remain, return -1.
+		 */
+		if (optind >= argc || optind == nonopt_start)
 		{
 			place = EMSG;
+			nonopt_start = -1;
+			force_nonopt = false;
 			return -1;
 		}
 
 		place = argv[optind];
 
-		if (place[0] != '-')
+		/*
+		 * An argument is a non-option if it meets any of the following
+		 * criteria: it follows an argument that is equivalent to the string
+		 * "--", it does not start with '-', or it is equivalent to the string
+		 * "-".  When we encounter a non-option, we move it to the end of argv
+		 * (after shifting all 

Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Greg Sabino Mullane
On Tue, Jul 11, 2023 at 11:04 AM Isaac Morland wrote:

> Please, no!
>
> There is no end to accepting sloppy syntax. What next, allow "SET
> random_page_cost = 2.5;" (with or without semicolon) in config files?
>

Well yes, there is an end. A single, trailing semicolon. Full stop. It's
not a slippery slope in which we end up asking the AI parser to interpret
our haikus to derive the actual value. The postgresql.conf file is not some
finicky YAML/JSON beast - we already support some looseness in quoting or
not quoting values, optional whitespace, etc. Think of the trailing
semicolon as whitespace, if you like. You can see from the patch that this
does not replace EOL/EOF.


> I'd be more interested in improvements in visibility of errors. For
> example, maybe if I try to start the server and there is a config file
> problem, I could somehow get a straightforward error message right in the
> terminal window complaining about the line of the configuration which is
> wrong.
>

That ship has long since sailed. We already send a detailed error message
with the line number, but in today's world of "service start", "systemctl
start", and higher level of control such as Patroni and Kubernetes, getting
things to show in a terminal window isn't happening. We can't work around
2>&1.


> Or maybe there could be a "check configuration" subcommand which checks
> the configuration.
>

There are things inside of Postgres once it has started, but yeah,
something akin to visudo would be nice for editing config files.


> But I think it would be way more useful than a potentially never-ending
> series of patches to liberalize the config parser.
>

It's a single semicolon, not a sign of the parser apocalypse. I've no plans
for future enhancements, but if they do no harm and make Postgres more user
friendly, I will support them.

Cheers,
Greg


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
> On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > > Out of curiosity I've tried and it is reproducible as you have stated : 
> > > XFS
> > > @ 4.18.0-425.10.1.el8_7.x86_64:
> > >...
> > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > > loop of sync writes (D) by kworker/2:1H-kblockd:
> >
> > That clearly won't go well.  It's not reproducible on newer systems,
> > unfortunately :(. Or well, fortunately maybe.
> >
> >
> > I wonder if a trick to avoid this could be to memorialize the fact that we
> > bulk extended before and extend by that much going forward? That'd avoid the
> > swapping back and forth.
>
> I haven't seen this thread [1] "Question on slow fallocate", from XFS
> mailing list being mentioned here (it was started by Masahiko), but I
> do feel it contains very important hints even challenging the whole
> idea of zeroing out files (or posix_fallocate()). Please especially
> see Dave's reply.

I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.


> He also argues that posix_fallocate() != fallocate().  What's interesting is
> that it's by design and newer kernel versions should not prevent such
> behaviour, see my testing result below.

I think the problem there was that I was not targeting a different file
between the different runs, somehow assuming the test program would be taking
care of that.

I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.



> All I can add is that those kernel versions (4.18.0) seem to be very
> popular across customers (RHEL, Rocky) right now, and that I've tested
> on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
> Masahiko's test.c and still got a 6-7x slower time when using XFS on that
> kernel. After installing kernel-ml (6.4.2) the test.c result seems to
> be the same (note it occurs only when first allocating space; of
> course it doesn't if the same file is rewritten/"reallocated"):

test.c really doesn't reproduce postgres behaviour in any meaningful way,
using it as a benchmark is a bad idea.

Greetings,

Andres Freund




Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Tom Lane
Isaac Morland  writes:
> On Tue, 11 Jul 2023 at 10:43, Greg Sabino Mullane 
>> # Add settings for extensions here
>> random_page_cost = 2.5;
>>
>> Boom! Server will not start. Surely, we can be a little more liberal in
>> what we accept? Attached patch allows a single trailing semicolon to be
>> silently discarded.

> Please, no!

I agree.  Allowing this would create huge confusion about whether it's
EOL or semicolon that ends a config file entry.  If you can write a
semicolon, then why not spread an entry across lines, or write
multiple entries on one line?

It seems possible that someday we might want to convert over to
semicolon-is-end-of-entry precisely to allow such cases.  But
I think that if/when we do that, it should be a flag day where you
*must* change to the new syntax.  (We did exactly that in pgbench
scripts some years ago, and people didn't complain too much.)

> Or maybe there could be a "check configuration" subcommand which checks the
> configuration.

We have such a thing, see the pg_file_settings view.

regards, tom lane




Re: Latches vs lwlock contention

2023-07-11 Thread Aleksander Alekseev
Hi,

> Maybe this is all ok, but it would be good to make the assumptions more
> explicit.

Here are my two cents.

```
static void
SetLatchV(Latch **latches, int nlatches)
{
/* Flush any other changes out to main memory just once. */
pg_memory_barrier();

/* Keep only latches that are not already set, and set them. */
for (int i = 0; i < nlatches; ++i)
{
Latch   *latch = latches[i];

if (!latch->is_set)
latch->is_set = true;
else
latches[i] = NULL;
}

pg_memory_barrier();

[...]

void
SetLatches(LatchGroup *group)
{
if (group->size > 0)
{
SetLatchV(group->latches, group->size);

[...]
```

I suspect this API may be error-prone without some additional
comments. The caller (which may be an extension author too) may rely
on the implementation details of SetLatches() / SetLatchV() and use
the returned group->latches[] values, e.g. to figure out whether they
attempted to change the state of the given latch. Even worse, one could
mistakenly assume that the result says exactly whether the caller was the
one who changed the state of the latch. That being said, I see why this
particular implementation was chosen.
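To make the clobbering behaviour concrete, here is a standalone toy model of
the batched set (plain C11 atomics and a bare is_set flag — not the real
Latch machinery, just the shape of the algorithm):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a Latch: just an is_set flag. */
typedef struct
{
	atomic_bool is_set;
} ToyLatch;

/*
 * Set every latch in the array, issuing only two fences for the whole
 * batch.  Entries that were already set are overwritten with NULL in the
 * caller's array so that a second pass (the "wake" loop in the real
 * patch) can skip them.  Returns how many latches this call actually
 * transitioned from unset to set.
 */
static int
set_latches_batched(ToyLatch **latches, int n)
{
	int			newly_set = 0;

	atomic_thread_fence(memory_order_seq_cst);	/* one barrier for the batch */
	for (int i = 0; i < n; i++)
	{
		if (!atomic_load_explicit(&latches[i]->is_set, memory_order_relaxed))
		{
			atomic_store_explicit(&latches[i]->is_set, true,
								  memory_order_relaxed);
			newly_set++;
		}
		else
			latches[i] = NULL;	/* the caller's array is clobbered */
	}
	atomic_thread_fence(memory_order_seq_cst);	/* and one after */

	return newly_set;
}
```

The point the comments need to make: the caller's array is modified, so
group->latches[] must be treated as garbage after the call.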

I added corresponding comments to SetLatchV() and SetLatches(). Also
the patchset needed a rebase. PFA v4.

It passes `make installcheck-world` on 3 machines of mine: MacOS x64,
Linux x64 and Linux RISC-V.


--
Best regards,
Aleksander Alekseev
From 006499ee28230e8c1b5c8d2f047276c5661f50ad Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH v4 2/6] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* prepare for future IPC mechanisms which might allow us to wake
  multiple backends in a single system call

Individual users of this facility will follow in separate patches.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c  | 221 ---
 src/include/storage/latch.h  |  13 ++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 161 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index db889385b7..5b7113dac8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -589,6 +589,88 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Sets multiple 'latches' at the same time.
+ *
+ * Modifies the input array. The nature of this modification is
+ * an implementation detail of SetLatchV(). The caller shouldn't try
+ * interpreting it and/or using the input array after the call.
+ */
+static void
+SetLatchV(Latch **latches, int nlatches)
+{
+	/* Flush any other changes out to main memory just once. */
+	pg_memory_barrier();
+
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		if (!latch->is_set)
+			latch->is_set = true;
+		else
+			latches[i] = NULL;
+	}
+
+	pg_memory_barrier();
+
+	/* Wake the remaining latches that might be sleeping. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		/*
+		 * See if anyone's waiting for the latch. It can be the current
+		 * process if we're in a signal handler. We use the self-pipe or
+		 * SIGURG to ourselves to wake up WaitEventSetWaitBlock() without
+		 * races in that case. If it's another process, send a signal.
+		 *
+		 * Fetch owner_pid only once, in case the latch is concurrently
+		 * getting owned or disowned. XXX: This assumes that pid_t is atomic,
+		 * which isn't guaranteed to be true! In practice, the effective range
+		 * of pid_t fits in a 32 bit integer, and so should be atomic. In the
+		 * worst case, we might end up signaling the wrong process. Even then,
+		 * you're very unlucky if a process with that bogus pid exists and
+		 * belongs to Postgres; and PG database processes should handle excess
+		 * SIGURG interrupts without a problem anyhow.
+		 *
+		 * Another sort of race condition that's possible here is for a new
+		 * process to own the latch immediately after we look, so we don't
+		 * signal it. This is okay so long as all callers of
+		 * ResetLatch/WaitLatch follow the standard coding convention of
+		 * waiting at the bottom of their loops, not the top, so that they'll
+		 * correctly process latch-setting events that happen before they
+		 

Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Isaac Morland
On Tue, 11 Jul 2023 at 10:43, Greg Sabino Mullane 
wrote:

> This has been a long-standing annoyance of mine. Who hasn't done something
> like this?:
>
> psql> SET random_page_cost = 2.5;
> (do some stuff, realize that rpc was too high)
>
> Let's put that inside of postgresql.conf:
>
>
> #--
> # CUSTOMIZED OPTIONS
>
> #--
>
> # Add settings for extensions here
>
> random_page_cost = 2.5;
>
>
> Boom! Server will not start. Surely, we can be a little more liberal in
> what we accept? Attached patch allows a single trailing semicolon to be
> silently discarded. As this parsing happens before the logging collector
> starts up, the error about the semicolon is often buried somewhere in a
> separate logfile or journald - so let's just allow postgres to start up
> since there is no ambiguity about what random_page_cost (or any other GUC)
> is meant to be set to.
>

Please, no!

There is no end to accepting sloppy syntax. What next, allow "SET
random_page_cost = 2.5;" (with or without semicolon) in config files?

I'd be more interested in improvements in visibility of errors. For
example, maybe if I try to start the server and there is a config file
problem, I could somehow get a straightforward error message right in the
terminal window complaining about the line of the configuration which is
wrong.

Or maybe there could be a "check configuration" subcommand which checks the
configuration. If it's fine, say so and set a flag saying the server is
clear to be started/restarted. If not, give useful error messages and don't
set the flag. Then make the start/restart commands only do their thing if
the "config OK" flag is set. Make sure that editing the configuration
clears the flag (or have 2 copies of the configuration, copied over by the
"check" subcommand: one for editing, one for running with).

This might properly belong outside of Postgres itself, I don't know. But I
think it would be way more useful than a potentially never-ending series of
patches to liberalize the config parser.


Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes

2023-07-11 Thread Ashutosh Bapat
On Tue, Jul 11, 2023 at 5:59 PM Amit Kapila  wrote:

> >
>
> I have pushed this work. But feel free to propose further
> improvements, if you have any better ideas.
>

Thanks. We have fixed the problem. So things are better than they
were. I have been busy with something else so couldn't reply.


-- 
Best Wishes,
Ashutosh Bapat




Re: Make pgbench exit on SIGINT more reliably

2023-07-11 Thread Tristan Partin
On Mon Jul 10, 2023 at 10:29 PM CDT, Michael Paquier wrote:
> On Tue, Jun 27, 2023 at 09:42:05AM -0500, Tristan Partin wrote:
> > I would say there probably isn't much benefit if you think the polling
> > for CancelRequested is fine. Looking at the other patch, I think it
> > might be simple to add an exit code for SIGINT. But I think it might be
> > best to do it after that patch is merged. I added the other patch to the
> > commitfest for July. Holding off on this one.
>
> Okay, I am going to jump on the patch to switch to COPY, then.

After looking at the other patch some more, I don't think there is a
good way to reliably exit from SIGINT. The only time pgbench ever polls
for CancelRequested is in initPopulateTable(). So if you hit CTRL+C at
any time during the execution of the other initialization steps, you're
out of luck. The default initialization steps are dtgvp, so after g we
have vacuuming and primary keys. From what I can tell pgbench will
continue to run these steps even after it has received a SIGINT.

This behavior seems unintended and odd to me. Though, maybe I am missing
something.
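The polling pattern itself is cheap; a self-contained sketch of what "every
initialization step honors CancelRequested" could look like (a toy step loop
with an invented run_init_steps() helper, not pgbench code):

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t cancel_requested = 0;

static void
handle_sigint(int signo)
{
	(void) signo;
	cancel_requested = 1;
}

/*
 * Run up to nsteps "initialization steps", checking the cancel flag
 * before each one, the way every pgbench init step arguably should.
 * cancel_after simulates Ctrl+C arriving during that step.  Returns the
 * number of steps that actually ran.
 */
static int
run_init_steps(int nsteps, int cancel_after)
{
	int			done = 0;

	for (int i = 0; i < nsteps; i++)
	{
		if (cancel_requested)
			break;				/* every step honors the flag */
		done++;
		if (done == cancel_after)
			raise(SIGINT);		/* simulate Ctrl+C mid-initialization */
	}
	return done;
}
```

With today's behavior only the data-generation step breaks out; with a check
at the top of each step, vacuum and primary-key creation would stop too.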

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-07-11 Thread Tristan Partin
On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
> On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
>  static void
> -initGenerateDataClientSide(PGconn *con)
> +initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to NULL */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t0\t\n",
> + curr + 1);
> +}
> +
> +static void
> +initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to NULL */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t" INT64_FORMAT 
> "\t0\t\n",
> + curr + 1, curr / ntellers + 1);
> +}
> +
> +static void
> +initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to blank padded empty string */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t" INT64_FORMAT 
> "\t0\t\n",
> + curr + 1, curr / naccounts + 1);
> +}
>
> These routines don't need a PGconn argument, by the way.

Nice catch!

> /* use COPY with FREEZE on v14 and later without partitioning */
> if (partitions == 0 && PQserverVersion(con) >= 14)
> -   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
> +   copy_statement_fmt = "copy %s from stdin with (freeze on)";
> else
> -   copy_statement = "copy pgbench_accounts from stdin";
> +   copy_statement_fmt = "copy %s from stdin";
>
> This seems a bit incorrect because partitioning only applies to
> pgbench_accounts, no?  This change means that the teller and branch
> tables would not benefit from FREEZE under a pgbench compiled with
> this patch and a backend version of 14 or newer if --partitions is
> used.  So, perhaps "partitions" should be an argument of
> initPopulateTable() specified for each table? 

Whoops, looks like I didn't read the comment for what the partitions
variable means. It only applies to pgbench_accounts as you said. I don't
think passing it in as an argument would make much sense. Let me know
what you think about the change in this new version, which only hits the
first portion of the `if` statement if the table is pgbench_accounts.
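For what it's worth, the per-table decision can live in one place; a
standalone sketch of the logic (the function name and the 140000 version
threshold are illustrative here, not pgbench source — pgbench only needs to
skip FREEZE for a partitioned pgbench_accounts):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Decide the COPY statement for one pgbench table.  FREEZE is used on
 * v14+ servers, except for pgbench_accounts when it is partitioned,
 * since only that table is ever partitioned by pgbench.
 */
static const char *
choose_copy_statement(char *buf, size_t buflen, const char *table,
					  int partitions, int server_version)
{
	const char *fmt;

	if (server_version >= 140000 &&
		!(partitions > 0 && strcmp(table, "pgbench_accounts") == 0))
		fmt = "copy %s from stdin with (freeze on)";
	else
		fmt = "copy %s from stdin";

	snprintf(buf, buflen, fmt, table);
	return buf;
}
```

This way pgbench_tellers and pgbench_branches still get FREEZE even when
--partitions is given, which was Michael's concern.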

-- 
Tristan Partin
Neon (https://neon.tech)
From af242f39aefd4d9c77271ac897b53aaf17601f85 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v3] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 155 ++
 1 file changed, 90 insertions(+), 65 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 320d348a0f..b77b6c3704 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4869,17 +4869,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
 {
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4888,50 +4917,22 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, 

Forgive trailing semicolons inside of config files

2023-07-11 Thread Greg Sabino Mullane
This has been a long-standing annoyance of mine. Who hasn't done something
like this?:

psql> SET random_page_cost = 2.5;
(do some stuff, realize that rpc was too high)

Let's put that inside of postgresql.conf:

#--
# CUSTOMIZED OPTIONS
#--

# Add settings for extensions here

random_page_cost = 2.5;


Boom! Server will not start. Surely, we can be a little more liberal in
what we accept? Attached patch allows a single trailing semicolon to be
silently discarded. As this parsing happens before the logging collector
starts up, the error about the semicolon is often buried somewhere in a
separate logfile or journald - so let's just allow postgres to start up
since there is no ambiguity about what random_page_cost (or any other GUC)
is meant to be set to.

I also considered doing an additional ereport(LOG) when we find one, but
seemed better on reflection to simply ignore it.
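For reference, the leniency being proposed amounts to something like this
standalone helper (an illustration of the intended behavior, not the actual
guc-file.l change in the attached patch):

```c
#include <assert.h>
#include <string.h>

/*
 * Strip a single trailing semicolon (and any trailing whitespace) from a
 * configuration value, in place.  Returns 1 if a semicolon was removed.
 */
static int
strip_trailing_semicolon(char *value)
{
	size_t		len = strlen(value);

	while (len > 0 && (value[len - 1] == ' ' || value[len - 1] == '\t'))
		len--;

	if (len > 0 && value[len - 1] == ';')
	{
		value[len - 1] = '\0';
		return 1;
	}

	value[len] = '\0';			/* trim whitespace even without a ';' */
	return 0;
}
```

Only a single trailing semicolon is forgiven; "random_page_cost = 2.5;;"
would still be rejected, so the grammar stays unambiguous.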

Cheers,
Greg


postgresql.conf_allow_trailing_semicolon.v1.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2023-07-11 Thread Alena Rybakina

On 11.07.2023 16:29, Ranier Vilela wrote:
Em ter., 11 de jul. de 2023 às 09:29, Alena Rybakina 
 escreveu:


Hi!

On 10.07.2023 15:15, Ranier Vilela wrote:

On Mon, Jul 10, 2023 at 09:03, Ranier Vilela
 wrote:

Hi Alena,

On Mon, Jul 10, 2023 at 05:38, Alena Rybakina
 wrote:

I agreed with the changes. Thank you for your work.

I updated patch and added you to the authors.

I specified Ranier Vilela as a reviewer.

It's a good habit, when posting a new version of a patch, to name it
v1, v2, v3, etc.
That makes it easy to follow development and references on the thread.


Sorry, I fixed it.


Regarding the last patch.
1. I think that variable const_is_left is not necessary.
You can stick with:
+ if (IsA(get_leftop(orqual), Const))
+ nconst_expr =get_rightop(orqual);
+ const_expr = get_leftop(orqual) ;
+ else if (IsA(get_rightop(orqual), Const))
+ nconst_expr =get_leftop(orqual);
+ const_expr = get_rightop(orqual) ;
+ else
+ {
+ or_list = lappend(or_list, orqual);
+ continue;
+ }


Agreed.

You missed in removing the declaration
- bool const_is_left = true;

Yes, thank you. I fixed it.

.



2. Test scalar_type != RECORDOID is more cheaper,
mainly if OidIsValid were a function, we knows that is a macro.
+ if (scalar_type != RECORDOID && OidIsValid(scalar_type))


Is it safe? Maybe we should first make sure that it can be checked
on RECORDOID at all?

Yes, it's safe, because of the && connector (short-circuit evaluation).
But you can leave it as is in v5.


Added it.

--
Regards,
Alena Rybakina
Postgres Professional
From d4ca399ec5a1866025e6cea5ed5cf0e231af38e3 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Tue, 11 Jul 2023 17:10:25 +0300
Subject: [PATCH] Replace OR clauses with ANY expressions. Replace (X=N1) OR
 (X=N2) ... with X = ANY(N1, N2) at the optimizer stage, while we are still
 working with the expression tree. Firstly, we do not attempt the
 transformation for "non-OR" expressions or inequalities, and building a
 relation with "or" expressions proceeds as before. Secondly, we do not
 transform if there are fewer than 500 OR expressions. Thirdly, note that
 "or" expressions are considered only at the current level.

---
 src/backend/parser/parse_expr.c  | 231 ++-
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 231 insertions(+), 1 deletion(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 346fd272b6d..82d51035684 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -95,6 +95,235 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 		 A_Expr *distincta, Node *arg);
 
+typedef struct OrClauseGroupEntry
+{
+	Node		   *node;
+	List		   *consts;
+	Oid				scalar_type;
+	Oid				opno;
+	Expr 		   *expr;
+} OrClauseGroupEntry;
+
+static int const_transform_or_limit = 500;
+
+static Node *
+transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
+{
+	List		   *or_list = NIL;
+	List		   *groups_list = NIL;
+	ListCell	   *lc;
+
+	/* If this is not an 'OR' expression, skip the transformation */
+	if (expr_orig->boolop != OR_EXPR ||
+		list_length(expr_orig->args) < const_transform_or_limit)
+		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+
+	foreach(lc, expr_orig->args)
+	{
+		Node			   *arg = lfirst(lc);
+		Node			   *orqual;
+		Node			   *const_expr;
+		Node			   *nconst_expr;
+		ListCell		   *lc_groups;
+		OrClauseGroupEntry *gentry;
+
+		/* At first, transform the arg and evaluate constant expressions. */
+		orqual = transformExprRecurse(pstate, (Node *) arg);
+		orqual = coerce_to_boolean(pstate, orqual, "OR");
+		orqual = eval_const_expressions(NULL, orqual);
+
+		if (!IsA(orqual, OpExpr))
+		{
+			or_list = lappend(or_list, orqual);
+			continue;
+		}
+
+		/*
+		 * Detect the constant side of the clause. Recall non-constant
+		 * expression can be made not only with Vars, but also with Params,
+		 * which is not bonded with any relation. Thus, we detect the const
+		 * side - if another side is constant too, the orqual couldn't be
+		 * an OpExpr.
+		 * Get pointers to constant and expression sides of the qual.
+		 */
+		if (IsA(get_leftop(orqual), Const))
+		{
+			nconst_expr = get_rightop(orqual);
+			const_expr = get_leftop(orqual);
+		}
+		else if (IsA(get_rightop(orqual), Const))
+		{
+			const_expr = get_rightop(orqual);
+			nconst_expr = get_leftop(orqual);
+		}
+		else
+		{
+			or_list = lappend(or_list, orqual);
+			continue;
+		}
+
+		if (!op_mergejoinable(((OpExpr *) orqual)->opno, exprType(nconst_expr)))
+		{
+			or_list = lappend(or_list, orqual);
+			continue;
+		}
+
+		/*
+		* At this point we 

Re: [Question] Can someone provide some links related to the MemoryContext?

2023-07-11 Thread Matthias van de Meent
On Tue, 11 Jul 2023 at 15:11, Wen Yi  wrote:
>
> Hi hackers,
> I am learning the MemoryContext subsystem,  but I really don't know where to 
> find it's document (The PostgreSQL Document just provide some spi function).
> Can someone provide me some solutions?
> Thanks in advance!

You should take a look at the README in the mmgr directory; at
src/backend/utils/mmgr/README. I think that this README provides a lot
of the requested information.
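As a taste of what that README describes: allocations are tied to a context
and released wholesale by resetting it, rather than pfree()'d one by one. A
toy model of that discipline (nothing like the real aset.c internals, which
use block lists, freelists and nested contexts):

```c
#include <assert.h>
#include <stddef.h>

/* Toy bump allocator: every allocation belongs to one context. */
typedef struct
{
	char		buf[4096];
	size_t		used;
} ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
	void	   *p;

	size = (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */
	if (cxt->used + size > sizeof(cxt->buf))
		return NULL;			/* real contexts would grab another block */
	p = cxt->buf + cxt->used;
	cxt->used += size;
	return p;
}

static void
toy_reset(ToyContext *cxt)
{
	cxt->used = 0;				/* everything allocated here is released */
}
```

The key idea is exactly the reset: short-lived allocations go into a
short-lived context, and one reset cleans them all up.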

-- 
Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: [PATCH]Add a tip to the check mode

2023-07-11 Thread Matthias van de Meent
On Tue, 11 Jul 2023 at 15:11, Wen Yi  wrote:
>
> Hi community,
> While learning the PostgreSQL source, I thought it would be better to add a
> tip to the postgres "check mode"; this could help users when they check
> the data directory.
>
> src/backend/bootstrap/bootstrap.c
>
> if (check_only)
> {
> SetProcessingMode(NormalProcessing);
> CheckerModeMain();
> abort();
> }
>
> Instead of
>
> if (check_only)
> {
> SetProcessingMode(NormalProcessing);
> CheckerModeMain();
> printf("PostgreSQL check success, there's no problem\n");
> abort();
> }

I'm afraid I don't understand the point of your suggestion.
CheckerModeMain doesn't return (it unconditionally calls proc_exit(),
which doesn't return) - it shouldn't hit the abort() clause. If it did
hit the abort() clause, that'd probably be a problem on its own,
right?

-- 
Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: POC, WIP: OR-clause support for indexes

2023-07-11 Thread Ranier Vilela
On Tue, Jul 11, 2023 at 09:29, Alena Rybakina <
lena.riback...@yandex.ru> wrote:

> Hi!
> On 10.07.2023 15:15, Ranier Vilela wrote:
>
> On Mon, Jul 10, 2023 at 09:03, Ranier Vilela 
> wrote:
>
>> Hi Alena,
>>
>> On Mon, Jul 10, 2023 at 05:38, Alena Rybakina <
>> lena.riback...@yandex.ru> wrote:
>>
>>> I agreed with the changes. Thank you for your work.
>>>
>>> I updated patch and added you to the authors.
>>>
>>> I specified Ranier Vilela as a reviewer.
>>>
>> It's a good habit, when posting a new version of a patch, to name it v1,
>> v2, v3, etc.
>> That makes it easy to follow development and references on the thread.
>>
>> Sorry, I fixed it.
>
> Regarding the last patch.
>> 1. I think that variable const_is_left is not necessary.
>> You can stick with:
>> + if (IsA(get_leftop(orqual), Const))
>> + nconst_expr =get_rightop(orqual);
>> + const_expr = get_leftop(orqual) ;
>> + else if (IsA(get_rightop(orqual), Const))
>> + nconst_expr =get_leftop(orqual);
>> + const_expr = get_rightop(orqual) ;
>> + else
>> + {
>> + or_list = lappend(or_list, orqual);
>> + continue;
>> + }
>>
> Agreed.
>
You missed in removing the declaration
- bool const_is_left = true;
.

>
>> 2. Test scalar_type != RECORDOID is more cheaper,
>> mainly if OidIsValid were a function, we knows that is a macro.
>> + if (scalar_type != RECORDOID && OidIsValid(scalar_type))
>>
>> Is it safe? Maybe we should first make sure that it can be checked on
> RECORDOID at all?
>
Yes, it's safe, because of the && connector (short-circuit evaluation).
But you can leave it as is in v5.

best regards,
Ranier Vilela


RE: BUG #18016: REINDEX TABLE failure

2023-07-11 Thread Richard Veselý
Hi Gurjeet,

Thank you for the follow-up. I was worried my message got buried in the middle 
of the thread. I also appreciate your work on the patch to fix/improve the 
REINDEX TABLE behavior even though most people would never encounter it in the 
wild.

As a preface I would first like to say that I can appreciate the emphasis on 
general maintainability of the codebase, trying to avoid having some overly 
clever hacks that might impede understanding, having ideally one way of doing 
things like having a common database page structure, etc. The more one keeps to 
this "happy" path, the better the general state of the project ends up, keeping 
it accessible to the rest of the community and attracting more contributions in 
turn.

That being said, PostgreSQL can be extremely conservative in scenarios where it 
might not be warranted while giving a limited opportunity to influence said 
behavior. This often leads to a very low hardware resource utilization. You can 
easily find many instances across StackOverflow, dba.stackexchange.com, 
/r/postgres and pgsql-performance where people run into ingress/egress 
bottlenecks even though their hardware can trivially support much larger 
workload.

In my experience, you can be very hard-pressed in many cases to saturate even a 
modest enterprise HDD while observing the official guidelines 
(https://www.postgresql.org/docs/current/populate.html), e.g. minimal WAL and 
host of other configuration optimizations, having no indexes and constraints 
and creating table and filling it with binary COPY within the same transaction. 
And the situation with pg_dump/pg_restore is often much worse.

Is there an interest in improving the current state of affairs? I will be 
rewriting the indexing first to get the whole picture, but I can already tell 
you that there is a -lot- of performance left on the table even considering the 
effort to improve COPY performance in PostgreSQL 16. Given sufficient hardware, 
you should always be heavily IO-bound without exception and saturate any 
reasonable number of NVMe SSDs.

Best regards,
Richard

-Original Message-
From: Gurjeet Singh  
Sent: Monday, July 10, 2023 6:44 PM
To: Richard Veselý ; Postgres Hackers 

Cc: Tom Lane ; Michael Paquier 
Subject: Re: BUG #18016: REINDEX TABLE failure

On Sun, Jul 9, 2023 at 7:21 AM Richard Veselý  wrote:
>
> ... there's no shortage of people that suffer from sluggish 
> pg_dump/pg_restore cycle and I imagine there are any number of people that 
> would be interested in improving bulk ingestion which is often a bottleneck 
> for analytical workloads as you are well aware. What's the best place to 
> discuss this topic further - pgsql-performance or someplace else?

(moved conversation to -hackers, and moved -bugs to BCC)

> I was dissatisfied with storage layer performance, especially during 
> the initial database population, so I rewrote it for my use case. I'm 
> done with the heap, but for the moment I still rely on PostgreSQL to 
> build indexes,

It sounds like you've developed a method to speed up loading of tables, and 
might have ideas/suggestions for speeding up CREATE INDEX/REINDEX. The -hackers 
list feels like a place to discuss such changes.

Best regards,
Gurjeet
http://Gurje.et


[PATCH]Add a tip to the check mode

2023-07-11 Thread Wen Yi
Hi community,
While learning the PostgreSQL source, I thought it would be better to add a tip 
to the postgres "check mode"; this could help users when they check the data 
directory.



src/backend/bootstrap/bootstrap.c



if (check_only)
 {
  SetProcessingMode(NormalProcessing);
  CheckerModeMain();
  abort();
 }


Instead of


if (check_only)
 {
  SetProcessingMode(NormalProcessing);
  CheckerModeMain();
 printf("PostgreSQL check success, 
there's no problem\n");

  abort();
 }


Yours,
Wen Yi

Add-a-tip-to-the-check-mode.patch
Description: Binary data


Got FATAL in lock_twophase_recover() during recovery

2023-07-11 Thread suyu.cmj
Hi, all.
I want to report a bug in the recovery of two-phase transactions. In the 
current implementation of crash recovery, there are two ways to recover 2pc data:
1. before redo, restoreTwoPhaseData() restores the 2pc data whose xid < 
ShmemVariableCache->nextXid, which is initialized from checkPoint.nextXid;
2. during redo, xact_redo() adds 2pc entries from the WAL.
The following scenario may cause the same 2pc transaction to be added twice; 
I have attached a patch for pg11 that reproduces the error:
1. creation of checkpoint_1 starts; checkPoint_1.redo is set to curInsert;
2. before checkPoint_1.nextXid is set, a new 2pc transaction is prepared; 
suppose its xid is 100, so ShmemVariableCache->nextXid is advanced to 101;
3. checkPoint_1.nextXid is set to 101;
4. in CheckPointTwoPhase() of checkpoint_1, 2pc_100 is not copied to disk 
because its prepare_end_lsn > checkpoint_1.redo;
5. checkpoint_1 finishes; after checkpoint_timeout, creation of checkpoint_2 
starts;
6. during checkpoint_2, the data of 2pc_100 is copied to disk;
7. before UpdateControlFile() of checkpoint_2, a crash happens;
8. during crash recovery, redo starts from checkpoint_1, and 2pc_100 is 
restored first by restoreTwoPhaseData(), because xid_100 < 
checkPoint_1.nextXid (which is 101);
9. because the prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100 is 
added again by xact_redo() during WAL replay, resulting in the same 2pc data 
being added twice;
10. in RecoverPreparedTransactions() -> lock_twophase_recover(), taking the 
same lock for 2pc_100 twice causes a FATAL.
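Conceptually, the fix has to remember which prepared transactions were
already restored and skip the matching WAL record. A standalone toy version
of that guard (just a sketch of the idea — the attached patch is the actual
fix, and its details differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

#define MAX_PREPARED 8

static ToyXid recovered_xids[MAX_PREPARED];
static int	nrecovered = 0;

/*
 * Toy guard: remember which prepared transactions have already been
 * recovered (e.g. from the pg_twophase files) and refuse to add the same
 * xid again when its PREPARE record is replayed from the WAL.  Returns
 * false on a duplicate, where the unguarded path ends in a FATAL inside
 * lock_twophase_recover().
 */
static bool
recover_prepared_xact(ToyXid xid)
{
	for (int i = 0; i < nrecovered; i++)
	{
		if (recovered_xids[i] == xid)
			return false;		/* already restored: skip the WAL copy */
	}
	recovered_xids[nrecovered++] = xid;
	return true;
}
```

In the scenario above, the first call corresponds to restoreTwoPhaseData()
picking up 2pc_100, and the second to xact_redo() replaying its PREPARE
record; only the first should register the transaction.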
After running the patch that reproduced the error, you will get the following 
error during crash recovery:
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 
from shared memory
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 
from shared memory
2023-07-10 13:04:30.670 UTC [11169] FATAL: lock ExclusiveLock on object 569/0/0 
is already held
2023-07-10 13:04:30.670 UTC [11168] LOG: startup process (PID 11169) exited 
with exit code 1
2023-07-10 13:04:30.670 UTC [11168] LOG: aborting startup due to startup 
process failure
I also added a patch for pg11 to fix this problem; I hope you can review it 
when you have time.
Thanks & Best Regards


0001-Reproduce-the-error-in-lock_twophase_recover_11.patch
Description: Binary data


v1-0001-Fix-a-2PC-transaction-maybe-recovered-twice_11.patch
Description: Binary data


Re: BUG #18016: REINDEX TABLE failure

2023-07-11 Thread Gurjeet Singh
On Sun, Jul 9, 2023 at 7:18 AM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > That should be OK, I assume.  However, if this is improved and
> > something we want to support in the long-run I guess that a TAP test
> > may be appropriate.
>
> I do not see the point of a TAP test.  It's not like the code isn't
> covered perfectly well.

Please find attached the patch that makes REINDEX TABLE perform
reindex on toast table before reindexing the main table's indexes.

The code block movement involved slightly more thought and care than I
had previously imagined. As explained in comments in the patch, the
enumeration and suppression of indexes on the main table must happen
before any CommandCounterIncrement() call, hence the
reindex-the-toast-table-if-any code had to be placed after that
enumeration.

In support of the argument above, the patch does not include any TAP
tests. Reliably reproducing the original error message involves
restarting the database, and since that can't be done via SQL
commands, no SQL tests are included, either.

The patch also includes minor wordsmithing, and benign whitespace
changes in neighboring code.

Best regards,
Gurjeet
http://Gurje.et


v1-0001-Reindex-toast-table-s-index-before-main-table-s-i.patch
Description: Binary data


[Question] Can someone provide some links related to the MemoryContext?

2023-07-11 Thread Wen Yi
Hi hackers,
I am learning the MemoryContext subsystem, but I cannot find its 
documentation (the PostgreSQL documentation only covers some SPI functions).
Can someone point me to relevant material?
Thanks in advance!


Yours,
Wen Yi

Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-07-11 Thread Palak Chaturvedi
Could you please review the new patch for the extension, which implements
the force variable?

On Tue, 11 Jul 2023 at 18:08, Palak Chaturvedi
 wrote:
>
> Hey Nitin,
> >Will
> >there be a scenario where the buffer is dirty and its reference count
> >is zero?
> There might be a buffer that has been dirtied but is not pinned or
> being used currently by a process. So checking the refcount and then
> dirty buffers helps.
> >First, The TryInvalidateBuffer() tries to flush the buffer if it is
> dirty and then tries to invalidate it if it meets the requirement.
> Instead of directly doing this can we provide an option to the caller
> to mention whether to invalidate the dirty buffers or not.
> Yes that can be implemented with a default value of force. Will
> implement it in the next patch.
>
> On Wed, 5 Jul 2023 at 17:53, Nitin Jadhav  
> wrote:
> >
> > +1 for the idea. It's going to be more useful to test and understand
> > the buffer management of PostgreSQL and it can be used to explicitly
> > free up the buffers if there are any such requirements.
> >
> > I had a quick look over the patch. Following are the comments.
> >
> > First, The TryInvalidateBuffer() tries to flush the buffer if it is
> > dirty and then tries to invalidate it if it meets the requirement.
> > Instead of directly doing this can we provide an option to the caller
> > to mention whether to invalidate the dirty buffers or not. For
> > example, TryInvalidateBuffer(Buffer bufnum, bool force), if the force
> > is set to FALSE, then ignore invalidating dirty buffers. Otherwise,
> > flush the dirty buffer and try to invalidate.
> >
> > Second, In TryInvalidateBuffer(), it first checks if the reference
> > count is greater than zero and then checks for dirty buffers. Will
> > there be a scenario where the buffer is dirty and its reference count
> > is zero? Can you please provide more information on this or adjust the
> > code accordingly.
> >
> > > +/*
> > > +Try Invalidating a buffer using bufnum.
> > > +If the buffer is invalid, the function returns false.
> > > +The function checks for dirty buffer and flushes the dirty buffer before 
> > > invalidating.
> > > +If the buffer is still dirty it returns false.
> > > +*/
> > > +bool
> >
> > The star(*) and space are missing here. Please refer to the style of
> > function comments and change accordingly.
> >
> > Thanks & Regards,
> > Nitin Jadhav
> >
> > On Fri, Jun 30, 2023 at 4:17 PM Palak Chaturvedi
> >  wrote:
> > >
> > > I hope this email finds you well. I am excited to share that I have
> > > extended the functionality of the `pg_buffercache` extension by
> > > implementing buffer invalidation capability, as requested by some
> > > PostgreSQL contributors for improved testing scenarios.
> > >
> > > This marks my first time submitting a patch to pgsql-hackers, and I am
> > > eager to receive your expert feedback on the changes made. Your
> > > insights are invaluable, and any review or comments you provide will
> > > be greatly appreciated.
> > >
> > > The primary objective of this enhancement is to enable explicit buffer
> > > invalidation within the `pg_buffercache` extension. By doing so, we
> > > can simulate scenarios where buffers are invalidated and observe the
> > > resulting behavior in PostgreSQL.
> > >
> > > As part of this patch, a new function or mechanism has been introduced
> > > to facilitate buffer invalidation. I would like to hear your thoughts
> > > on whether this approach provides a good user interface for this
> > > functionality. Additionally, I seek your evaluation of the buffer
> > > locking protocol employed in the extension to ensure its correctness
> > > and efficiency.
> > >
> > > Please note that I plan to add comprehensive documentation once the
> > > details of this enhancement are agreed upon. This documentation will
> > > serve as a valuable resource for users and contributors alike. I
> > > believe that your expertise will help uncover any potential issues and
> > > opportunities for further improvement.
> > >
> > > I have attached the patch file to this email for your convenience.
> > > Your valuable time and consideration in reviewing this extension are
> > > sincerely appreciated.
> > >
> > > Thank you for your continued support and guidance. I am looking
> > > forward to your feedback and collaboration in enhancing the PostgreSQL
> > > ecosystem.
> > >
> > > The working of the extension:
> > >
> > > 1. Creating the extension pg_buffercache and then call select query on
> > > a table and note the buffer to be cleared.
> > > pgbench=# create extension pg_buffercache;
> > > CREATE EXTENSION
> > > pgbench=# select count(*) from pgbench_accounts;
> > >  count
> > > 
> > >  10
> > > (1 row)
> > >
> > > pgbench=# SELECT *
> > > FROM pg_buffercache
> > > WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
> > >  bufferid | relfilenode | reltablespace | reldatabase | relforknumber
> > > | relblocknumber | isdirty | usagecount | 

Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-07-11 Thread Palak Chaturvedi
Hey Nitin,
>Will
>there be a scenario where the buffer is dirty and its reference count
>is zero?
There might be a buffer that has been dirtied but is not pinned or
being used currently by a process. So checking the refcount and then
dirty buffers helps.
>First, The TryInvalidateBuffer() tries to flush the buffer if it is
dirty and then tries to invalidate it if it meets the requirement.
Instead of directly doing this can we provide an option to the caller
to mention whether to invalidate the dirty buffers or not.
Yes that can be implemented with a default value of force. Will
implement it in the next patch.

On Wed, 5 Jul 2023 at 17:53, Nitin Jadhav  wrote:
>
> +1 for the idea. It's going to be more useful to test and understand
> the buffer management of PostgreSQL and it can be used to explicitly
> free up the buffers if there are any such requirements.
>
> I had a quick look over the patch. Following are the comments.
>
> First, The TryInvalidateBuffer() tries to flush the buffer if it is
> dirty and then tries to invalidate it if it meets the requirement.
> Instead of directly doing this can we provide an option to the caller
> to mention whether to invalidate the dirty buffers or not. For
> example, TryInvalidateBuffer(Buffer bufnum, bool force), if the force
> is set to FALSE, then ignore invalidating dirty buffers. Otherwise,
> flush the dirty buffer and try to invalidate.
>
> Second, In TryInvalidateBuffer(), it first checks if the reference
> count is greater than zero and then checks for dirty buffers. Will
> there be a scenario where the buffer is dirty and its reference count
> is zero? Can you please provide more information on this or adjust the
> code accordingly.
>
> > +/*
> > +Try Invalidating a buffer using bufnum.
> > +If the buffer is invalid, the function returns false.
> > +The function checks for dirty buffer and flushes the dirty buffer before 
> > invalidating.
> > +If the buffer is still dirty it returns false.
> > +*/
> > +bool
>
> The star(*) and space are missing here. Please refer to the style of
> function comments and change accordingly.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Fri, Jun 30, 2023 at 4:17 PM Palak Chaturvedi
>  wrote:
> >
> > I hope this email finds you well. I am excited to share that I have
> > extended the functionality of the `pg_buffercache` extension by
> > implementing buffer invalidation capability, as requested by some
> > PostgreSQL contributors for improved testing scenarios.
> >
> > This marks my first time submitting a patch to pgsql-hackers, and I am
> > eager to receive your expert feedback on the changes made. Your
> > insights are invaluable, and any review or comments you provide will
> > be greatly appreciated.
> >
> > The primary objective of this enhancement is to enable explicit buffer
> > invalidation within the `pg_buffercache` extension. By doing so, we
> > can simulate scenarios where buffers are invalidated and observe the
> > resulting behavior in PostgreSQL.
> >
> > As part of this patch, a new function or mechanism has been introduced
> > to facilitate buffer invalidation. I would like to hear your thoughts
> > on whether this approach provides a good user interface for this
> > functionality. Additionally, I seek your evaluation of the buffer
> > locking protocol employed in the extension to ensure its correctness
> > and efficiency.
> >
> > Please note that I plan to add comprehensive documentation once the
> > details of this enhancement are agreed upon. This documentation will
> > serve as a valuable resource for users and contributors alike. I
> > believe that your expertise will help uncover any potential issues and
> > opportunities for further improvement.
> >
> > I have attached the patch file to this email for your convenience.
> > Your valuable time and consideration in reviewing this extension are
> > sincerely appreciated.
> >
> > Thank you for your continued support and guidance. I am looking
> > forward to your feedback and collaboration in enhancing the PostgreSQL
> > ecosystem.
> >
> > The working of the extension:
> >
> > 1. Creating the extension pg_buffercache and then call select query on
> > a table and note the buffer to be cleared.
> > pgbench=# create extension pg_buffercache;
> > CREATE EXTENSION
> > pgbench=# select count(*) from pgbench_accounts;
> >  count
> > 
> >  10
> > (1 row)
> >
> > pgbench=# SELECT *
> > FROM pg_buffercache
> > WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
> >  bufferid | relfilenode | reltablespace | reldatabase | relforknumber
> > | relblocknumber | isdirty | usagecount | pinning_backends
> > --+-+---+-+---++-++--
> >   233 |   16397 |  1663 |   16384 | 0
> > |  0 | f   |  1 |0
> >   234 |   16397 |  1663 |   16384 | 0
> > |   

Re: POC, WIP: OR-clause support for indexes

2023-07-11 Thread Alena Rybakina

Hi!

On 10.07.2023 15:15, Ranier Vilela wrote:
On Mon, Jul 10, 2023 at 09:03 Ranier Vilela wrote:


Hi Alena,

On Mon, Jul 10, 2023 at 05:38 Alena Rybakina wrote:

I agreed with the changes. Thank you for your work.

I updated patch and added you to the authors.

I specified Ranier Vilela as a reviewer.

It is a good habit, when posting a new version of a patch, to name it
v1, v2, v3, etc.
That makes it easy to follow development and references in the thread.


Sorry, I fixed it.


Regarding the last patch.
1. I think that variable const_is_left is not necessary.
You can stick with:
+ if (IsA(get_leftop(orqual), Const))
+ nconst_expr =get_rightop(orqual);
+ const_expr = get_leftop(orqual) ;
+ else if (IsA(get_rightop(orqual), Const))
+ nconst_expr =get_leftop(orqual);
+ const_expr = get_rightop(orqual) ;
+ else
+ {
+ or_list = lappend(or_list, orqual);
+ continue;
+ }


Agreed.



2. Testing scalar_type != RECORDOID first is cheaper, especially if
OidIsValid were a function; we know it is a macro.
+ if (scalar_type != RECORDOID && OidIsValid(scalar_type))

Is it safe? Maybe we should first make sure that it can be checked against 
RECORDOID at all?


3. Sorry about the wrong tip about array_type; if it is really necessary,
better to use it.
+ newa->element_typeid = scalar_type;
+ newa->array_typeid = array_type;


Agreed.



4. It is a good habit to call free last, to avoid somebody accidentally
using the freed object.
+ or_list = lappend(or_list, gentry->expr);
+ list_free(gentry->consts);
+ continue;

No, this is not necessary: in these places we add the original expression 
to the resulting list, and we never use the list of constants for this group 
afterwards; using it would be an error.


--
Regards,
Alena Rybakina
Postgres Professional
From 55e56f1557ca6879eae6ea62845e180ebfd04662 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Tue, 11 Jul 2023 15:19:22 +0300
Subject: [PATCH] Replace OR clauses with ANY expressions. Replace (X=N1) OR
 (X=N2) ... with X = ANY(N1, N2) at the optimiser stage, while we are still
 working with the expression tree. Firstly, we do not attempt the
 transformation for non-OR expressions or inequalities, and relations with OR
 expressions are created according to the same scenario as before. Secondly,
 we do not transform if there are fewer than 500 OR expressions. Thirdly, note
 that we consider OR expressions only at the current level.

---
 src/backend/parser/parse_expr.c  | 232 ++-
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 232 insertions(+), 1 deletion(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 346fd272b6d..36b50b6441d 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -95,6 +95,236 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 		 A_Expr *distincta, Node *arg);
 
+typedef struct OrClauseGroupEntry
+{
+	Node		   *node;
+	List		   *consts;
+	Oid				scalar_type;
+	Oid				opno;
+	Expr 		   *expr;
+} OrClauseGroupEntry;
+
+static int const_transform_or_limit = 500;
+
+static Node *
+transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
+{
+	List		   *or_list = NIL;
+	List		   *groups_list = NIL;
+	ListCell	   *lc;
+
+	/* If this is not an 'OR' expression, skip the transformation */
+	if (expr_orig->boolop != OR_EXPR ||
+		list_length(expr_orig->args) < const_transform_or_limit)
+		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+
+	foreach(lc, expr_orig->args)
+	{
+		Node			   *arg = lfirst(lc);
+		Node			   *orqual;
+		Node			   *const_expr;
+		Node			   *nconst_expr;
+		ListCell		   *lc_groups;
+		OrClauseGroupEntry *gentry;
+		bool				const_is_left = true;
+
+		/* At first, transform the arg and evaluate constant expressions. */
+		orqual = transformExprRecurse(pstate, (Node *) arg);
+		orqual = coerce_to_boolean(pstate, orqual, "OR");
+		orqual = eval_const_expressions(NULL, orqual);
+
+		if (!IsA(orqual, OpExpr))
+		{
+			or_list = lappend(or_list, orqual);
+			continue;
+		}
+
+		/*
+		 * Detect the constant side of the clause. Recall non-constant
+		 * expression can be made not only with Vars, but also with Params,
+		 * which is not bonded with any relation. Thus, we detect the const
+		 * side - if another side is constant too, the orqual couldn't be
+		 * an OpExpr.
+		 * Get pointers to constant and expression sides of the qual.
+		 */
+		if (IsA(get_leftop(orqual), Const))
+		{
+			nconst_expr = get_rightop(orqual);
+			const_expr = get_leftop(orqual);
+		}
+		else if (IsA(get_rightop(orqual), Const))
+		{
+			const_expr = get_rightop(orqual);
+			nconst_expr = get_leftop(orqual);
+		}
+		else
+		{
+			or_list = lappend(or_list, orqual);
+			

Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes

2023-07-11 Thread Amit Kapila
On Thu, Jul 6, 2023 at 2:06 PM Amit Kapila  wrote:
>
> On Wed, Jul 5, 2023 at 7:20 PM Ashutosh Bapat
>  wrote:
> >
> > On Wed, Jul 5, 2023 at 2:29 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jul 5, 2023 at 2:28 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Jul 3, 2023 at 4:49 PM vignesh C  wrote:
> > > > >
> > > > > +1 for the first version patch, I also felt the first version is
> > > > > easily understandable.
> > > > >
> > > >
> > > > Okay, please find the slightly updated version (changed a comment and
> > > > commit message). Unless there are more comments, I'll push this in a
> > > > day or two.
> > > >
> > >
> > > oops, forgot to attach the patch.
> >
> > I still think that we need to do something so that a new call to
> > pg_output_begin() automatically takes care of the conditions under
> > which it should be called. Otherwise, we will introduce a similar
> > problem in some other place in future.
> >
>
> AFAIU, this problem is because we forget to conditionally call
> pg_output_begin() from pg_decode_message() which can happen with or
> without moving conditions inside pg_output_begin(). Also, it shouldn't
> be done at the expense of complexity. I find the check added by
> Vignesh's v2 patch (+ if (!(last_write ^ data->skip_empty_xacts) ||
> txndata->xact_wrote_changes)) a bit difficult to understand and more
> error-prone. The others like Hou-San also couldn't understand unless
> Vignesh gave an explanation. I also thought of avoiding that check.
> Basically, IIUC, the check is added because the patch also removed
> 'data->skip_empty_xacts' check from
> pg_decode_begin_txn()/pg_decode_begin_prepare_txn(). Now, if retain
> those checks then it is probably okay but again the similar checks are
> still split and that doesn't appear to be better than the v1 or v3
> patch version. I am not against improving code in this area and
> probably we can consider doing it as a separate patch if we have
> better ideas instead of combining it with this patch.
>

I have pushed this work. But feel free to propose further
improvements, if you have any better ideas.

-- 
With Regards,
Amit Kapila.




Re: A minor adjustment to get_cheapest_path_for_pathkeys

2023-07-11 Thread Aleksander Alekseev
Hi,

> The check for parallel_safe should be even cheaper than cost comparison
> so I think it's better to do that first.  The attached patch does this
> and also updates the comment to mention the requirement about being
> parallel-safe.

The patch was marked as "Needs review" so I decided to take a look.

I see the reasoning behind the proposed change, but I'm not convinced
that there will be any measurable performance improvements. Firstly,
compare_path_costs() is rather cheap. Secondly, require_parallel_safe
is `false` in most of the cases. Last but not least, one should prove
that this particular place is a bottleneck under given loads. I doubt
it is. Most of the time it's a network, disk I/O or locks.

So unless the author can provide benchmarks that show measurable
benefits of the change I suggest rejecting it.

-- 
Best regards,
Aleksander Alekseev




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
Hi Tom,

On Tue, Jul 11, 2023 at 4:30 PM Tom Lane  wrote:

> Jeevan Chalke  writes:
> > Attached patch.
>
> I would be astonished if this fixes anything.  The code still doesn't
> know which paths are referenced by which other ones, and so the place
> where we free a previously-added path can't know what to do.
>
> I've speculated about adding some form of reference counting to paths
> (maybe just a "pin" flag rather than a full refcount) so that we could
> be smarter about this.  The existing kluge for "don't free IndexPaths"
> could be replaced by setting the pin mark on any IndexPath that we
> make a bitmap path from.  Up to now it hasn't seemed necessary to
> generalize that hack, but maybe it's time.  Can you show a concrete
> case where we are freeing a still-referenced path?
>

As mentioned earlier, while debugging some issues, we added an elog() call
displaying the foreignrel contents using nodeToString(), like below:

@@ -1238,6 +1238,8 @@ postgresGetForeignPlan(PlannerInfo *root,
boolhas_limit = false;
ListCell   *lc;

+   elog(INFO, "foreignrel: %s", nodeToString(foreignrel));
+

We then ran the postgres_fdw regression tests and observed many warnings saying
"could not dump unrecognized node type". Here are the queries, retrieved and
adjusted from postgres_fdw.sql:

CREATE EXTENSION postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname
'postgres', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE TABLE t1 (c1 int NOT NULL, c2 int NOT NULL, CONSTRAINT t1_pkey
PRIMARY KEY (c1));
INSERT INTO t1 SELECT id, id % 10 FROM generate_series(1, 1000) id;
ANALYZE t1;
CREATE FOREIGN TABLE ft2 (c1 int NOT NULL, c2 int NOT NULL) SERVER loopback
OPTIONS (schema_name 'public', table_name 't1');

explain (verbose, costs off)
select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) <
49800 order by c2;

With the above elog() in place, we can see the warning, and the pathlist's
second path shows up as empty ({}): it was freed but is still referenced by
this foreignrel.

Thanks


>
> regards, tom lane
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager, Product Development*



edbpostgres.com


Re: [PATCH] Infinite loop while acquiring new TOAST Oid

2023-07-11 Thread Nikita Malakhov
Hi!

Aleksander, thank you for reminding me of this patch; I'll try to get to it
in a few days.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: SLRUs in the main buffer pool, redux

2023-07-11 Thread Aleksander Alekseev
Hi,

> > Here's a rebased set of patches.
> >
> > The second patch is failing the pg_upgrade tests. Before I dig into
> > that, I'd love to get some feedback on this general approach.
>
> Forgot to include the new "slrulist.h" file in the previous patch, fixed
> here.

Unfortunately the patchset has rotted quite a bit since February and needs a rebase.

-- 
Best regards,
Aleksander Alekseev




Re: PATCH: Using BRIN indexes for sorted output

2023-07-11 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
 wrote:
> On 7/10/23 18:18, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>>  wrote:
>>> On 7/10/23 14:38, Matthias van de Meent wrote:
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

 Yes, that's why you would sort them by distance, where the distance is
 generated by the opclass as min/max distance between the summary and
 the distance's origin, and then inserted into the tuplesort.

>>>
>>> OK, so the query says "order by distance from point X" and we calculate
>>> the min/max distance of values in a given page range.
>>
>> Yes, and because it's BRIN that's an approximation, which should
>> generally be fine.
>>
>
> Approximation in what sense? My understanding was we'd get a range of
> distances that we know covers all rows in that range. So the results
> should be accurate, no?

The distance is going to be accurate only to the degree that the
summary can produce accurate distances for the datapoints it
represents. That can be quite imprecise due to the nature of the
contained datapoints: a summary of the points (-1, -1) and (1, 1) will
have a minimum distance of 0 to the origin, where the summary (-1, 0)
and (-1, 0.5) would have a much more accurate distance of 1. The point
I was making is that the summary can only approximate the distance,
and that approximation is fine w.r.t. the BRIN sort algorithm.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Tom Lane
Jeevan Chalke  writes:
> Attached patch.

I would be astonished if this fixes anything.  The code still doesn't
know which paths are referenced by which other ones, and so the place
where we free a previously-added path can't know what to do.

I've speculated about adding some form of reference counting to paths
(maybe just a "pin" flag rather than a full refcount) so that we could
be smarter about this.  The existing kluge for "don't free IndexPaths"
could be replaced by setting the pin mark on any IndexPath that we
make a bitmap path from.  Up to now it hasn't seemed necessary to
generalize that hack, but maybe it's time.  Can you show a concrete
case where we are freeing a still-referenced path?

regards, tom lane




Re: COPY table FROM STDIN via SPI

2023-07-11 Thread Tom Lane
James Sewell  writes:
> Is $subject possible?

No.  It'd be a wire protocol break, and even if it weren't I would not
expect many clients to be able to deal with it.  They're in the middle
of a query cycle (for the SELECT or CALL that got you into SPI), and
suddenly the backend asks for COPY data?  What are they supposed to
send, or where are they supposed to put it for the COPY-out case?
There's just not provision for nesting protocol operations like that.

regards, tom lane




COPY table FROM STDIN via SPI

2023-07-11 Thread James Sewell
Is $subject possible?

I feel like maybe the answer is no, but then I can also see some backend
code for similar things in copy.h.

Perhaps it's possible via a function call rather than by sending SQL?

- James


Re: logicalrep_message_type throws an error

2023-07-11 Thread Alvaro Herrera
On 2023-Jul-11, Masahiko Sawada wrote:

> Since the numerical value is important only in invalid message type
> cases, how about using a format like "??? (88)" in unknown message
> type cases, in both error and context messages?

+1

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Las navajas y los monos deben estar siempre distantes"   (Germán Poo)




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
On Tue, Jul 11, 2023 at 2:58 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Tue, Jul 11, 2023 at 1:19 PM Alvaro Herrera 
> wrote:
>
>> On 2023-Jul-11, Jeevan Chalke wrote:
>>
>> > 4. However, 2nd path was already sorted and passed as is to the
>> add_path().
>> > 5. add_path() decides to reject this new path on some metrics. However,
>> in
>> > the end, it pfree() this passed in path. It seems wrong as its
>> references
>> > do present elsewhere. For example, in the first path's parent rels path
>> > list.
>> > 6. So, while displaying the parent's path, we end up with these
>> warnings.
>>
>> In other words, this is use-after-free, with add_path freeing the
>> passed-in Path pointer, but one particular case in which this Path is
>> still used afterwards.
>>
>> > I tried to get a fix for this but no luck so far.
>>
>> I proposed to add an add_path_extended() function that adds 'bool
>> free_input_path' argument, and pass it false in that one place in
>> create_ordered_paths.
>>
>
> Yeah, this can be a way.
>
> However, I am thinking the other way around now. What if we first added
> the unmodified input path as it is to the ordered_rel first?
>
> If we do so, then while adding the next path, add_path() may decide to
> remove the older one as the newer path is the best one. The remove_old
> logic in add_path() will free the path (unsorted one), and we end up with
> the same error.
>
> And if we conditionally remove that path (remove_old logic one), then we
> need to pass false in every add_path() call in create_ordered_paths().
>

Attached patch.


>
> Am I missing something?
>
> Thanks
>
>
>>
>> --
>> Álvaro Herrera   48°01'N 7°57'E  —
>> https://www.EnterpriseDB.com/
>>
>
>
> --
> Jeevan Chalke
>
> *Senior Staff SDE, Database Architect, and Manager, Product Development*
>
>
>
> edbpostgres.com
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager, Product Development*



edbpostgres.com


fix_add_path.patch
Description: Binary data


Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
On Tue, Jul 11, 2023 at 1:19 PM Alvaro Herrera 
wrote:

> On 2023-Jul-11, Jeevan Chalke wrote:
>
> > 4. However, 2nd path was already sorted and passed as is to the
> add_path().
> > 5. add_path() decides to reject this new path on some metrics. However,
> in
> > the end, it pfree() this passed in path. It seems wrong as its references
> > do present elsewhere. For example, in the first path's parent rels path
> > list.
> > 6. So, while displaying the parent's path, we end up with these warnings.
>
> In other words, this is use-after-free, with add_path freeing the
> passed-in Path pointer, but one particular case in which this Path is
> still used afterwards.
>
> > I tried to get a fix for this but no luck so far.
>
> I proposed to add an add_path_extended() function that adds 'bool
> free_input_path' argument, and pass it false in that one place in
> create_ordered_paths.
>

Yeah, this can be a way.

However, I am thinking the other way around now: what if we added the
unmodified input path as is to the ordered_rel first?

If we do so, then while adding the next path, add_path() may decide to
remove the older one as the newer path is the best one. The remove_old
logic in add_path() will free the path (unsorted one), and we end up with
the same error.

And if we conditionally remove that path (remove_old logic one), then we
need to pass false in every add_path() call in create_ordered_paths().

Am I missing something?

Thanks


>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager, Product Development*



edbpostgres.com


Re: POC, WIP: OR-clause support for indexes

2023-07-11 Thread Andrey Lepikhov

On 10/7/2023 15:38, Alena Rybakina wrote:

I agreed with the changes. Thank you for your work.

I updated patch and added you to the authors.

I specified Ranier Vilela as a reviewer.
This patch looks much better than before, but it definitely needs test 
coverage. As a first simple approximation, here you can see the result of the 
regression tests with the transformation limit set to 0. See in the 
attachment some test changes induced by these diffs.


Also, I see some impact of the transformation on other queries:
create_view.out:
(NOT x > z) > (x <= z)
inherit.out:
(((a)::text = 'ab'::text) OR ((a)::text = ANY ('{NULL,cd}'::text[])))
to
(((a)::text = ANY ('{NULL,cd}'::text[])) OR ((a)::text = 'ab'::text))

Transformations, mentioned above, are correct, of course. But it can be 
a sign of possible unstable behavior.


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 0ddcb880ef..3d9b385c1f 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -43,6 +43,7 @@
 
 /* GUC parameters */
 bool   Transform_null_equals = false;
+int			or_transform_limit = 500;
 
 
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
@@ -104,8 +105,6 @@ typedef struct OrClauseGroupEntry
Expr   *expr;
 } OrClauseGroupEntry;
 
-static int const_transform_or_limit = 500;
-
 static Node *
 transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 {
@@ -115,7 +114,7 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 
/* If this is not an 'OR' expression, skip the transformation */
if (expr_orig->boolop != OR_EXPR ||
-   list_length(expr_orig->args) < const_transform_or_limit)
+   list_length(expr_orig->args) < or_transform_limit)
return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
 
foreach(lc, expr_orig->args)
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index c14456060c..c7ac73ebe0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2046,6 +2046,16 @@ struct config_int ConfigureNamesInt[] =
100, 1, MAX_STATISTICS_TARGET,
NULL, NULL, NULL
},
+   {
+   {"or_transform_limit", PGC_USERSET, QUERY_TUNING_OTHER,
+   gettext_noop("Transform a sequence of OR clauses to an 
IN expression."),
+   gettext_noop("The planner will replace clauses like 
'x=c1 OR x=c2 .."
+"to the clause 'x IN 
(c1,c2,...)'")
+   },
+   &or_transform_limit,
+   500, 0, INT_MAX,
+   NULL, NULL, NULL
+   },
{
{"from_collapse_limit", PGC_USERSET, QUERY_TUNING_OTHER,
gettext_noop("Sets the FROM-list size beyond which 
subqueries "
diff --git a/src/include/parser/parse_expr.h b/src/include/parser/parse_expr.h
index 7d38ca75f7..891e6a462b 100644
--- a/src/include/parser/parse_expr.h
+++ b/src/include/parser/parse_expr.h
@@ -17,6 +17,7 @@
 
 /* GUC parameters */
 extern PGDLLIMPORT bool Transform_null_equals;
+extern PGDLLIMPORT int or_transform_limit;
 
 extern Node *transformExpr(ParseState *pstate, Node *expr, ParseExprKind 
exprKind);
 
diff --git a/src/test/regress/expected/create_index.out 
b/src/test/regress/expected/create_index.out
index acfd9d1f4f..60e053d217 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1883,6 +1883,46 @@ SELECT count(*) FROM tenk1
 10
 (1 row)
 
+SET or_transform_limit = 0;
+EXPLAIN (COSTS OFF)
+SELECT * FROM tenk1
+  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+  QUERY PLAN  
+--
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3,42}'::integer[])))
+(2 rows)
+
+SELECT * FROM tenk1
+  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | 
twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+-+-+-+--+-++-+--+-+---+--+-+--+--+--+-
+  42 |5530 |   0 |2 |   2 |  2 |  42 |   42 |  
42 |42 |   42 |  84 |   85 | QB   | SEIAAA   | xx
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1
+  WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ QUERY PLAN
 

Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-11 Thread Sergei Kornilov
Hello

I think it's appropriate to add it on the restrictions page. (But we should 
mention that this restriction is only for the subscriber.)

If the list were larger, then the restrictions page could be divided into 
publisher and subscriber restrictions. But not for one very specific 
restriction.

regards, Sergei




Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-11 Thread Peter Smith
Here are my comments for v4.

==

Docs/Comments:

All the docs and updated comments LGTM, except I felt one sentence
might be written differently to avoid nested parentheses.

BEFORE
...used for REPLICA IDENTITY FULL table (see
FindUsableIndexForReplicaIdentityFull() for details).

AFTER
...used for REPLICA IDENTITY FULL table. See
FindUsableIndexForReplicaIdentityFull() for details.



Logic:

What was the decision about the earlier question [1] of
removing/merging the function IsIndexOnlyOnExpression()?

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuGhGHp9Uq8-Wk7uBiirAHF5quDY_1Z6WDoUKRZqkn%2Brg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: logicalrep_message_type throws an error

2023-07-11 Thread Masahiko Sawada
On Thu, Jul 6, 2023 at 6:28 PM Amit Kapila  wrote:
>
> One point to note is that the user may also get confused if the actual
> ERROR says message type as 'X' and the context says '???'. I feel in
> this case duplicate information is better than different information.

I agree. I think it would be better to show the same string like:

ERROR:  invalid logical replication message type "??? (88)"
CONTEXT:  processing remote data for replication origin "pg_16638"
during message type "??? (88)" in transaction 796, finished at
0/1626698

Since the numerical value is important only in invalid message type
cases, how about using a format like "??? (88)" in unknown message
type cases, in both error and context messages?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Alvaro Herrera
On 2023-Jul-11, Jeevan Chalke wrote:

> 4. However, 2nd path was already sorted and passed as is to the add_path().
> 5. add_path() decides to reject this new path on some metrics. However, in
> the end, it pfree() this passed in path. It seems wrong as its references
> do present elsewhere. For example, in the first path's parent rels path
> list.
> 6. So, while displaying the parent's path, we end up with these warnings.

In other words, this is a use-after-free, with add_path freeing the
passed-in Path pointer, but there is one particular case in which this Path
is still used afterwards.

> I tried to get a fix for this but no luck so far.

I proposed to add an add_path_extended() function that takes a 'bool
free_input_path' argument, and to pass it false in that one place in
create_ordered_paths.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Support to define custom wait events for extensions

2023-07-11 Thread Michael Paquier
On Fri, Jun 23, 2023 at 05:56:26PM +0900, Masahiro Ikeda wrote:
> I updated the patches to handle the warning mentioned
> by PostgreSQL Patch Tester, and removed unnecessary spaces.

I have begun hacking on that, and the API layer inspired by the
LWLocks is sound.  I have been playing with it in my own extensions
and it is nice to be able to plug in custom wait events into
pg_stat_activity, particularly for bgworkers.  Finally.

The patch needed a rebase after the recent commit that introduced the
automatic generation of docs and code for wait events.  It requires
two tweaks in generate-wait_event_types.pl, feel free to double-check
them.

Some of the new structures and routine names don't quite reflect the
fact that we have wait events for extensions, so I have taken a stab
at that.

Note that the test module test_custom_wait_events would crash if
attempting to launch a worker when not loaded in
shared_preload_libraries, so we'd better have some protection in
wait_worker_launch() (this function should be renamed as well).

Attached is a rebased patch that I have begun tweaking here and
there.  For now, the patch is moved as waiting on author.  I have
merged the test module with the main patch for the moment, for
simplicity.  A split is straight-forward as the code paths touched are
different.

Another and *very* important thing is that we are going to require
some documentation in xfunc.sgml to explain how to use these routines
and what to expect from them.  Ikeda-san, could you write some?  You
could look at the part about shmem and LWLock to get some
inspiration.
--
Michael
From 630f78822534171bb8839c79d0fdca3bb8268a95 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 11 Jul 2023 16:41:35 +0900
Subject: [PATCH v5] Support custom wait events for extensions.

To support custom wait events, this adds two APIs to allocate new wait
events dynamically.

The APIs are
* ExtensionWaitEventNewTranche()
* ExtensionWaitEventRegisterTranche()

These are similar to the existing LWLockNewTrancheId() and
LWLockRegisterTranche().

First, an extension should call ExtensionWaitEventNewTranche() to get one
or more wait event "tranches", whose IDs are allocated from a shared
counter.  Next, each individual process can use the tranche numbers with
ExtensionWaitEventRegisterTranche() to associate each wait event with a
name.

Note that this includes a test module, which perhaps should be split
later into its own commit.
---
 src/include/utils/wait_event.h|  31 +++
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/ipc/shmem.c   |   3 +-
 .../activity/generate-wait_event_types.pl |  12 +-
 src/backend/utils/activity/wait_event.c   | 149 +-
 .../utils/activity/wait_event_names.txt   |   2 +-
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_custom_wait_events/.gitignore|   2 +
 .../modules/test_custom_wait_events/Makefile  |  23 +++
 .../test_custom_wait_events/meson.build   |  33 +++
 .../test_custom_wait_events/t/001_basic.pl|  34 
 .../test_custom_wait_events--1.0.sql  |  14 ++
 .../test_custom_wait_events.c | 188 ++
 .../test_custom_wait_events.control   |   3 +
 15 files changed, 484 insertions(+), 15 deletions(-)
 create mode 100644 src/test/modules/test_custom_wait_events/.gitignore
 create mode 100644 src/test/modules/test_custom_wait_events/Makefile
 create mode 100644 src/test/modules/test_custom_wait_events/meson.build
 create mode 100644 src/test/modules/test_custom_wait_events/t/001_basic.pl
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events--1.0.sql
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.c
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.control

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 4517425f84..dfcaf8c506 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -67,6 +67,37 @@ pgstat_report_wait_start(uint32 wait_event_info)
 	*(volatile uint32 *) my_wait_event_info = wait_event_info;
 }
 
+/* --
+ * Wait Events - Extension
+ *
+ * Use this category when the server process is waiting for some condition
+ * defined by an extension module.
+ *
+ * Extensions can define custom wait events.  First, call
+ * WaitEventExtensionNewTranche() just once to obtain a new wait event
+ * tranche.  The ID is allocated from a shared counter.  Next, each
+ * individual process using the tranche should call
+ * WaitEventExtensionRegisterTranche() to associate that wait event with
+ * a name.
+ *
+ * It may seem strange that each process using the tranche must register it
+ * separately, but dynamic shared memory segments aren't guaranteed to be
+ * mapped at the same address in all coordinating 

Re: logical decoding and replication of sequences, take 2

2023-07-11 Thread Amit Kapila
On Tue, Jun 27, 2023 at 11:30 AM Ashutosh Bapat
 wrote:
>
> I have not looked at the DDL replication patch in detail so I may be
> missing something. IIUC, that patch replicates the DDL statement in
> some form: parse tree or statement. But it doesn't replicate the some
> or all WAL records that the DDL execution generates.
>

Yes, the DDL replication patch uses the parse tree and catalog
information to generate a deparsed form of DDL statement which is WAL
logged and used to replicate DDLs.

-- 
With Regards,
Amit Kapila.




Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-11 Thread Amit Kapila
On Tue, Jul 11, 2023 at 12:30 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > Isn't the same true for the hash operator class family as well?
>
> True. I didn't write it on purpose because I didn't know of an operator that
> has a BTree operator class but not a Hash one. But I agreed to clarify it.
>
> > Can we
> > slightly change the line as: "... the table includes an attribute
> > whose datatype doesn't have an equality operator defined for it..".
>
> Hmm, this suggestion seems dubious to me. Regarding the point datatype, it
> has the "same as" operator [1]. E.g., the following SQL returns true.
>
> ```
> postgres=# select point '(1, 1)' ~= point '(1, 1)';
>  ?column?
> --
>  t
> (1 row)
> ```
>
> The reason why they cannot be supported by tuples_equal() is that
> lookup_type_cache() only checks the operator classes for Btree and Hash,
> and ~= is not defined in either class.
>

Fair enough, but the part of the line: ".. whose datatype is not an
operator class of Btree or Hash." doesn't appear very clear to me,
because it sounds like we are checking whether the datatype has any
operator class for the btree or hash access methods, but we are actually
checking if there is an equality operator (function) defined in the
default op class for those access methods. Am I missing something?

-- 
With Regards,
Amit Kapila.




RE: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-11 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> After seeing this, I am thinking about whether we add this restriction
> on the Subscription page [1] or Restrictions page [2] as proposed. Do
> you others have any preference?
> 
> [1] -
> https://www.postgresql.org/docs/devel/logical-replication-subscription.html
> [2] -
> https://www.postgresql.org/docs/devel/logical-replication-restrictions.html

Thanks for the suggestion. But I still think it should be on the "Restrictions" 
page, because all the limitations are listed on that page.
Moreover, the condition of this limitation is not confined to the subscriber - 
the setup on the publisher is also related. I think such a description may
confuse readers.


But anyway, I had not considered such a point of view.
Maybe I should hear Sergei's opinion. Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: add non-option reordering to in-tree getopt_long

2023-07-11 Thread Kyotaro Horiguchi
At Mon, 10 Jul 2023 13:06:58 -0700, Nathan Bossart  
wrote in 
> Here's a new version of the patch with the latest feedback addressed.

Thanks!

+* An argument is a non-option if it meets any of the following
+* criteria: it follows an argument that is equivalent to the 
string
+* "--", it is equivalent to the string "-", or it does not 
start with
+* '-'.  When we encounter a non-option, we move it to the end 
of argv
+* (after shifting all remaining arguments over to make room), 
and
+* then we try again with the next argument.
+*/
+   if (force_nonopt || strcmp("-", place) == 0 || place[0] != '-')
...
+   else if (strcmp("--", place) == 0)

I like it. We don't need to overcomplicate things just for the sake of
speed here. Plus, this version looks the simplest to me. That being
said, it might be better if the last term were positioned in second
place, mainly because a lone hyphen is less common than words
that start with characters other than a hyphen.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Jakub Wartak
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well.  It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.

I haven't seen the thread [1] "Question on slow fallocate" from the XFS
mailing list mentioned here (it was started by Masahiko), but I do
feel it contains very important hints, even challenging the whole
idea of zeroing out files (or posix_fallocate()). Please see Dave's reply
in particular. He also argues that posix_fallocate() !=
fallocate().  What's interesting is that this is by design, and newer
kernel versions do not prevent such behaviour; see my testing
results below.

All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on that
kernel. After installing kernel-ml (6.4.2) the test.c result seems to
be the same (note that it occurs only when first allocating space, but of
course it doesn't if the same file is rewritten/"reallocated"):

[root@rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root@rockyora ~]# time ./test test.0 0
total   20
fallocate   0
filewrite   20

real0m0.405s
user0m0.006s
sys 0m0.391s
[root@rockyora ~]# time ./test test.0 1
total   20
fallocate   20
filewrite   0

real0m0.137s
user0m0.005s
sys 0m0.132s
[root@rockyora ~]# time ./test test.1 1
total   20
fallocate   20
filewrite   0

real0m0.968s
user0m0.020s
sys 0m0.928s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m6.059s
user0m0.000s
sys 0m0.788s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m0.598s
user0m0.003s
sys 0m0.225s
[root@rockyora ~]#

iostat -x reports this during the first "time ./test test.2 2" (as you can see,
w_await is not that high but it accumulates):
Devicer/s w/s rMB/s wMB/s   rrqm/s   wrqm/s
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda  0.00 15394.00  0.00122.02 0.0013.00
0.00   0.080.000.05   0.75 0.00 8.12   0.06 100.00
dm-0 0.00 15407.00  0.00122.02 0.00 0.00
0.00   0.000.000.06   0.98 0.00 8.11   0.06 100.00

So maybe that's just a hint that you should try on slower storage
instead? (I think that on NVMe this issue would be hardly noticeable
due to low IO latency, not like here)

-J.

[1] - https://www.spinics.net/lists/linux-xfs/msg73035.html




RE: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-11 Thread Hayato Kuroda (Fujitsu)
Dear Sergei,

Thank you for the comment!

The restriction is only for the subscriber: the publisher can publish the
changes downstream under this condition, but the subscriber cannot apply them.

> So, I suggest to mention subscriber explicitly:
> 
> + class of Btree, then UPDATE and
> DELETE
> -  operations cannot be replicated.
> + operations cannot be applied on subscriber.

I accepted the comment. Please see [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB58664DB6ECA653A6922B3FE3F531A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-11 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Isn't the same true for the hash operator class family as well?

True. I didn't write it on purpose because I didn't know of an operator that 
has a BTree operator class but not a Hash one. But I agreed to clarify it.

> Can we
> slightly change the line as: "... the table includes an attribute
> whose datatype doesn't have an equality operator defined for it..".

Hmm, this suggestion seems dubious to me. Regarding the point datatype, it has the
"same as" operator [1]. E.g., following SQL returns true.

```
postgres=# select point '(1, 1)' ~= point '(1, 1)';
 ?column? 
--
 t
(1 row)
```

The reason why they cannot be supported by tuples_equal() is that 
lookup_type_cache() only checks the operator classes for Btree and Hash, 
and ~= is not defined in either class.

> Also, I find the proposed wording a bit odd, can we swap the sentence
> to say something like: "The UPDATE and DELETE operations cannot be
> replicated for the published tables that specifies REPLICA IDENTITY
> FULL but the table includes an attribute whose datatype doesn't have
> an equality operator defined for it on the subscriber."?

Swapped. But based on the above reply, I did not use your suggestion completely.

[1]: https://www.postgresql.org/docs/devel/functions-geometry.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v3_add_description.patch
Description: v3_add_description.patch


Re: Generating code for query jumbling through gen_node_support.pl

2023-07-11 Thread Andrey Lepikhov

On 11/7/2023 12:35, Michael Paquier wrote:

On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote:

I vote for only one method based on a query tree structure.


Noted


BTW, did you think about different algorithms of queryId generation?


Not really, except if you are referring to the possibility of being
able to handle differently different portions of the nodes depending
on a context given by the callers willing to do a query jumbling
computation.  (For example, choose to *not* silence the Const nodes,
etc.)

Yes, I have two requests for different queryId algorithms:
1. With Const nodes suppressed.
2. With OIDs replaced by full names - to make it possible to see 
the same queryId on different instances for the same query.


It is quite trivial to implement, but not easy to support.



Auto-generated queryId code can open a way for extensions to
easily support custom queryIds.


Extensions can control that at some extent, already.
--
Michael


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Initial Schema Sync for Logical Replication

2023-07-11 Thread Masahiko Sawada
On Mon, Jul 10, 2023 at 8:06 PM Kumar, Sachin  wrote:
>
>
>
> > From: Amit Kapila 
> > On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith 
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Below are my review comments for the PoC patch 0001.
> > > >
> > > > In addition,  the patch needed rebasing, and, after I rebased it
> > > > locally in my private environment there were still test failures:
> > > > a) The 'make check' tests fail but only in a minor way due to
> > > > changes colname
> > > > b) the subscription TAP test did not work at all for me -- many errors.
> > >
> > > Thank you for reviewing the patch.
> > >
> > > While updating the patch, I realized that the current approach won't
> > > work well or at least has the problem with partition tables. If a
> > > publication has a partitioned table with publish_via_root = false, the
> > > subscriber launches tablesync workers for its partitions so that each
> > > tablesync worker copies data of each partition. Similarly, if it has a
> > > partition table with publish_via_root = true, the subscriber launches
> > > a tablesync worker for the parent table. With the current design,
> > > since the tablesync worker is responsible for both schema and data
> > > synchronization for the target table, it won't be possible to
> > > synchronize both the parent table's schema and partitions' schema.
> > >
> >
> > I think one possibility to make this design work is that when 
> > publish_via_root
> > is false, then we assume that subscriber already has parent table and then
> > the individual tablesync workers can sync the schema of partitions and their
> > data.
>
> Since publish_via_partition_root is false by default, users have to create the 
> parent table by themselves, which I think is not a good user experience.

I have the same concern. I think that users normally use
publish_via_partition_root = false if the partitioned table on the
subscriber consists of the same set of partitions as the publisher's.
And such users would expect both the partitioned table and its
partitions to be synchronized.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com