Re: Minimal logical decoding on standbys

2023-03-02 Thread Jeff Davis
On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
> In this case it looks easier to add the right API than to be sure
> about
> whether it's needed or not.

I attached a sketch of one approach. I'm not very confident that it's
the right API or even that it works as I intended it, but if others
like the approach I can work on it some more.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From ada1c8f373caa971dc0d8ef2144f1e01100d335c Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 1 Mar 2023 20:02:42 -0800
Subject: [PATCH v1] Introduce ConditionVariableEventSleep().

The new API takes a WaitEventSet which can include socket events. The
WaitEventSet must have been created by
ConditionVariableWaitSetCreate(), another new function, so that it
includes the wait events necessary for a condition variable.
---
 src/backend/storage/lmgr/condition_variable.c | 102 --
 src/backend/storage/lmgr/proc.c   |   6 ++
 src/backend/utils/init/miscinit.c |   1 +
 src/include/storage/condition_variable.h  |  10 ++
 4 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 7e2bbf46d9..4dc6109786 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -27,9 +27,29 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#define ConditionVariableWaitSetLatchPos	0
+
 /* Initially, we are not prepared to sleep on any condition variable. */
 static ConditionVariable *cv_sleep_target = NULL;
 
+/* Used by ConditionVariableSleep() and ConditionVariableTimedSleep(). */
+static WaitEventSet *ConditionVariableWaitSet = NULL;
+
+/*
+ * Initialize the process-local condition variable WaitEventSet.
+ *
+ * This must be called once during startup of any process that can wait on
+ * condition variables, before it issues any ConditionVariableInit() calls.
+ */
+void
+InitializeConditionVariableWaitSet(void)
+{
+	Assert(ConditionVariableWaitSet == NULL);
+
+	ConditionVariableWaitSet = ConditionVariableWaitSetCreate(
+		TopMemoryContext, 0);
+}
+
 /*
  * Initialize a condition variable.
  */
@@ -40,6 +60,51 @@ ConditionVariableInit(ConditionVariable *cv)
 	proclist_init(&cv->wakeup);
 }
 
+/*
+ * Create a WaitEventSet for ConditionVariableEventSleep(). This should be
+ * used when the caller of ConditionVariableEventSleep() would like to wake up
+ * on either the condition variable signal or a socket event. For example:
+ *
+ *   ConditionVariableInit(&cv);
+ *   waitset = ConditionVariableWaitSetCreate(mcxt, 1);
+ *   event_pos = AddWaitEventToSet(waitset, 0, sock, NULL, NULL);
+ *   ...
+ *   ConditionVariablePrepareToSleep(&cv);
+ *   while (...condition not met...)
+ *   {
+ *   socket_wait_events = ...
+ *   ModifyWaitEvent(waitset, event_pos, socket_wait_events, NULL);
+ *   ConditionVariableEventSleep(&cv, waitset, ...);
+ *   }
+ *   ConditionVariableCancelSleep();
+ *
+ * The waitset is created with the standard events for a condition variable,
+ * and room for adding n_socket_events additional socket events. The
+ * initially-filled event positions should not be modified, but added socket
+ * events can be modified. The same waitset can be used for multiple condition
+ * variables as long as the callers of ConditionVariableEventSleep() are
+ * interested in the same sockets.
+ */
+WaitEventSet *
+ConditionVariableWaitSetCreate(MemoryContext mcxt, int n_socket_events)
+{
+	int latch_pos   PG_USED_FOR_ASSERTS_ONLY;
+	int n_cv_events = IsUnderPostmaster ? 2 : 1;
+	int nevents	 = n_cv_events + n_socket_events;
+	WaitEventSet*waitset	 = CreateWaitEventSet(mcxt, nevents);
+
+	latch_pos = AddWaitEventToSet(waitset, WL_LATCH_SET, PGINVALID_SOCKET,
+  MyLatch, NULL);
+
+	if (IsUnderPostmaster)
+		AddWaitEventToSet(waitset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+		  NULL, NULL);
+
+	Assert(latch_pos == ConditionVariableWaitSetLatchPos);
+
+	return waitset;
+}
+
 /*
  * Prepare to wait on a given condition variable.
  *
@@ -97,7 +162,8 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 void
 ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 {
-	(void) ConditionVariableTimedSleep(cv, -1 /* no timeout */ ,
+	(void) ConditionVariableEventSleep(cv, ConditionVariableWaitSet,
+	   -1 /* no timeout */ ,
 	   wait_event_info);
 }
 
@@ -111,11 +177,27 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 bool
 ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 			uint32 wait_event_info)
+{
+	return ConditionVariableEventSleep(cv, ConditionVariableWaitSet, timeout,
+	   wait_event_info);
+}
+
+/*
+ * Wait for a condition variable to be signaled, a timeout to be reached, or a
+ * socket event in the given waitset. The waitset must have been created by
+ * ConditionVariableWaitSetCreate().
+ *
+ * Returns true when 

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Önder Kalacı
Hi Peter, all

>
> ==
> Commit Message
>
> 1.
> There is no smart mechanism to pick the index. Instead, we choose
> the first index that fulfils the requirements mentioned above.
>
> ~
>
> 1a.
> I think this paragraph should immediately follow the earlier one
> ("With this patch...") which talked about the index requirements.
>
>
makes sense


>
> 1b.
> Slight rewording
>
> SUGGESTION
> If there is more than one index that satisfies these requirements, we
> just pick the first one.
>
>
applied


> ==
> doc/src/sgml/logical-replication.sgml
>
> 2.
> A published table must have a “replica identity” configured in order
> to be able to replicate UPDATE and DELETE operations, so that
> appropriate rows to update or delete can be identified on the
> subscriber side. By default, this is the primary key, if there is one.
> Another unique index (with certain additional requirements) can also
> be set to be the replica identity. When replica identity FULL is
> specified, indexes can be used on the subscriber side for searching
> the rows. These indexes should be btree, non-partial and have at least
> one column reference (e.g., should not consist of only expressions).
> These restrictions on the non-unique index properties are in essence
> the same restrictions that are enforced for primary keys. Internally,
> we follow the same approach for supporting index scans within logical
> replication scope. If there are no such suitable indexes, the search
> on the subscriber s ide can be very inefficient, therefore replica
> identity FULL should only be used as a fallback if no other solution
> is possible. If a replica identity other than “full” is set on the
> publisher side, a replica identity comprising the same or fewer
> columns must also be set on the subscriber side. See REPLICA IDENTITY
> for details on how to set the replica identity. If a table without a
> replica identity is added to a publication that replicates UPDATE or
> DELETE operations then subsequent UPDATE or DELETE operations will
> cause an error on the publisher. INSERT operations can proceed
> regardless of any replica identity.
>
> ~
>
> 2a.
> IMO the replica identity in the first sentence should
> be changed to be replica identity
>



>
> ~
>
> 2b.
> Typo: "subscriber s ide" --> "subscriber side"
>

fixed


> 2c.
> There is still one remaining "full" in this text. I think it ought to be
> changed to FULL to match the others.
>
>
changed


> ==
> src/backend/executor/execReplication.c
>
> 3. IdxIsRelationIdentityOrPK
>
> +/*
> + * Given a relation and OID of an index, returns true if the
> + * index is relation's primary key's index or relation's
> + * replica identity index.
> + *
> + * Returns false otherwise.
> + */
> +bool
> +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid)
> +{
> + Assert(OidIsValid(idxoid));
> +
> + return RelationGetReplicaIndex(rel) == idxoid ||
> + RelationGetPrimaryKeyIndex(rel) == idxoid;
>  }
>
> ~
>
> Since the function name mentions RI (1st) and then PK (2nd), and since
> the implementation also has the same order, I think the function
> comment should use the same consistent order when describing what it
> does.
>

alright, done


>
> ==
> src/backend/replication/logical/relation.c
>
> 4. FindUsableIndexForReplicaIdentityFull
>
> +/*
> + * Returns an index oid if there is an index that can be used
> + * via the apply worker. The index should be btree, non-partial
> + * and have at least one column reference (e.g., should
> + * not consist of only expressions). The limitations arise from
> + * RelationFindReplTupleByIndex(), which is designed to handle
> + * PK/RI and these limitations are inherent to PK/RI.
> + *
> + * There are no fundamental problems for supporting non-btree
> + * and/or partial indexes. We should mostly relax the limitations
> + * in RelationFindReplTupleByIndex().
> + *
> + * If no suitable index is found, returns InvalidOid.
> + *
> + * Note that this is not a generic function, it expects REPLICA
> + * IDENTITY FULL for the remote relation.
> + */
>
> ~
>
> 4a.
> Minor rewording of 1st sentence.
>
> BEFORE
> Returns an index oid if there is an index that can be used via the apply
> worker.
>
> SUGGESTION
> Returns the oid of an index that can be used via the apply worker.
>
>
looks better, applied


>
> 4b.
> + * There are no fundamental problems for supporting non-btree
> + * and/or partial indexes. We should mostly relax the limitations
> + * in RelationFindReplTupleByIndex().
>
> I think this paragraph should come later in the comment (just before
> the Note) and should also have "XXX" prefix to indicate it is some
> implementation note for future versions.
>
>
>
done


>
> 5. GetRelationIdentityOrPK
>
> +/*
> + * Get replica identity index or if it is not defined a primary key.
> + *
> + * If neither is defined, returns InvalidOid
> + */
> +Oid
> +GetRelationIdentityOrPK(Relation rel)
> +{
> + Oid idxoid;
> +
> + idxoid = RelationGetReplicaIndex(rel);
> +
> + if 

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Önder Kalacı
Hi Hou zj, all


>
> 1.
> +   usableIndexContext = AllocSetContextCreate(CurrentMemoryContext,
> +                                              "usableIndexContext",
> +                                              ALLOCSET_DEFAULT_SIZES);
> +   oldctx = MemoryContextSwitchTo(usableIndexContext);
> +
> +   /* Get index list of the local relation */
> +   indexlist = RelationGetIndexList(localrel);
> +   Assert(indexlist != NIL);
> +
> +   foreach(lc, indexlist)
>
> Is it necessary to create a memory context here? I thought the memory
> would be freed soon after this apply action.
>

Yeah, probably not useful anymore, removed.

In the earlier versions of this patch, this code block was relying on some
planner functions. In that case, it felt safer to use a memory context. Now,
it seems useless.



> 2.
>
> +   /*
> +    * Furthermore, because primary key and unique key indexes can't
> +    * include expressions we also sanity check the index is neither
> +    * of those kinds.
> +    */
> +   Assert(!IdxIsRelationIdentityOrPK(rel, idxrel->rd_id));
>
> It seems you mean "replica identity key" instead of "unique key" in the
> comments.
>

Right, I fixed this comment. Though, are you referring to multiple comment*s*?
I couldn't see any others in the patch. Let me know if you do.


>
>
> 3.
> --- a/src/include/replication/logicalrelation.h
> +++ b/src/include/replication/logicalrelation.h
> ...
> +extern bool IsIndexOnlyOnExpression(IndexInfo *indexInfo);
>
> The function definition seems better placed in execReplication.c
>

Hmm, why do you think so? IsIndexOnlyOnExpression() is used in
logical/relation.c, and used for an assertion in execReplication.c.

I think it is better suited to relation.c, but let me know about
your perspective as well.


>
> 4.
>
> +extern Oid GetRelationIdentityOrPK(Relation rel);
>
> The function is only used in relation.c, so we can make it a static
> function.
>
>
In the recent iteration of the patch (I think v27), we also use this
function in check_relation_updatable() in logical/worker.c.

One could argue that we can move the definition back to worker.c,
but it feels better suited to relation.c, as the parameter of the
function is a *Rel*, and the function is looking for a property of a relation.

Let me know if you think otherwise, I don't have strong opinions
on this.


>
> 5.
>
> +   /*
> +    * If index scans are disabled, use a sequential scan.
> +    *
> +    * Note that we do not use index scans below when enable_indexscan is
> +    * false. Allowing primary key or replica identity even when index scan is
> +    * disabled is the legacy behaviour. So we hesitate to move the below
> +    * enable_indexscan check to be done earlier in this function.
> +    */
> +   if (!enable_indexscan)
> +       return InvalidOid;
>
> Since the documentation of enable_indexscan says "Enables or disables the query
> planner's use of index-scan plan types. The default is on.", and we don't use
> the planner here, I am not sure whether we should allow/disallow index scans
> in the apply worker based on this GUC.
>
>
Given Amit's suggestion in [1], I'm planning to drop this check altogether
and rely on table storage parameters.

(I'll incorporate these changes with a patch that I'm going to reply
to Peter's e-mail).

Thanks,
Onder

[1]:
https://www.postgresql.org/message-id/CAA4eK1KP-sV4aER51J-2mELjNzq_zVSLf1%2BW90Vu0feo-thVNA%40mail.gmail.com


Re: meson: Optionally disable installation of test modules

2023-03-02 Thread Peter Eisentraut

On 02.03.23 08:09, Nazir Bilal Yavuz wrote:

On Wed, 1 Mar 2023 at 22:21, Peter Eisentraut
 wrote:


Looks good to me.  I did a small pass over it to adjust some namings.
For example, I renamed test_install_files to test_install_data, so it's
consistent with the overall meson naming:

-install_data(
+test_install_data += files(

Let me know if you have any concerns about this version.  Otherwise, I'm
happy to commit it.


That makes sense, thanks!


committed





Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread vignesh C
On Thu, 2 Mar 2023 at 20:53, Önder Kalacı  wrote:
>
> Hi,
>
>>
>>  Is it just a matter of testing other
>> index types
>
>
> Yes, there is more to it. build_replindex_scan_key()
> only works for btree indexes, as it uses BTEqualStrategyNumber.
>
> I might expect a few more limitations like that. I added a comment
> in the code (see FindUsableIndexForReplicaIdentityFull)
>
>> or there is something more to it, if so, we should add
>> comments so that they can be supported in the future if it is feasible
>> to do so.
>
>
> I really don't see any fundamental issues with expanding the
> support to more index types; it is just some more coding & testing.
>
> And I can (and am willing to) work on that as a follow-up. I am explicitly
> trying to keep this patch as small as possible.
>
>>
>> >
>> > Attached are both patches: the main patch, and the patch that
>> > optionally disables the index scans.
>> >
>>
>> Both the patches are numbered 0001. It would be better to number them
>> as 0001 and 0002.
>>
>
> Alright, attached v27_0001_use_index_on_subs_when_pub_rep_ident_full.patch and
> v27_0002_use_index_on_subs_when_pub_rep_ident_full.patch.
>
> I also added one more test which Andres asked me on a private chat
> (Testcase start: SUBSCRIPTION USES INDEX WITH PUB/SUB different data).

Thanks for the patch. A few comments:
1) We are currently calling RelationGetIndexList twice, once in the
FindUsableIndexForReplicaIdentityFull function and once in the caller.
We could avoid one of the calls by passing the indexlist to the
function, or by removing the check here, since the index list check
can be handled in FindUsableIndexForReplicaIdentityFull.
+   if (remoterel->replident == REPLICA_IDENTITY_FULL &&
+       RelationGetIndexList(localrel) != NIL)
+   {
+       /*
+        * If we had a primary key or relation identity with a unique index,
+        * we would have already found and returned that oid. At this point,
+        * the remote relation has replica identity full and we have at least
+        * one local index defined.
+        *
+        * We are looking for one more opportunity for using an index. If
+        * there are any indexes defined on the local relation, try to pick
+        * a suitable index.
+        *
+        * The index selection safely assumes that all the columns are going
+        * to be available for the index scan given that remote relation has
+        * replica identity full.
+        */
+       return FindUsableIndexForReplicaIdentityFull(localrel);
+   }
+

2) Copyright year should be mentioned as 2023
diff --git a/src/test/subscription/t/032_subscribe_use_index.pl
b/src/test/subscription/t/032_subscribe_use_index.pl
new file mode 100644
index 00..db0a7ea2a0
--- /dev/null
+++ b/src/test/subscription/t/032_subscribe_use_index.pl
@@ -0,0 +1,861 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test logical replication behavior with subscriber uses available index
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+

3) Many of the tests are using the same tables; we need not
drop/create the publication/subscription for each of them. We could
just drop and create the required indexes and verify the update/delete
statements.
+# 
+# Testcase start: SUBSCRIPTION USES INDEX
+#
+# Basic test where the subscriber uses index
+# and only updates 1 row and deletes
+# 1 other row
+#
+
+# create tables pub and sub
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE test_replica_id_full (x int)");
+$node_publisher->safe_psql('postgres',
+   "ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;");
+$node_subscriber->safe_psql('postgres',
+   "CREATE TABLE test_replica_id_full (x int)");
+$node_subscriber->safe_psql('postgres',
+   "CREATE INDEX test_replica_id_full_idx ON test_replica_id_full(x)");

+# 
+# Testcase start: SUBSCRIPTION CREATE/DROP INDEX WORKS WITHOUT ISSUES
+#
+# This test ensures that after CREATE INDEX, the subscriber can automatically
+# use one of the indexes (provided that it fulfils the requirements).
+# Similarly, after DROP index, the subscriber can automatically switch to
+# sequential scan
+
+# create tables pub and sub
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE test_replica_id_full (x int NOT NULL, y int)");
+$node_publisher->safe_psql('postgres',
+   "ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;");
+$node_subscriber->safe_psql('postgres',
+   "CREATE TABLE test_replica_id_full (x int NOT NULL, y int)");

4) These additional blank lines can be removed to keep it consistent:
4.a)
+# Testcase end: SUBSCRIPTION DOES NOT USE PARTIAL INDEX
+# 

Re: Force testing of query jumbling code in TAP tests

2023-03-02 Thread Michael Paquier
On Tue, Feb 28, 2023 at 02:06:05PM +0900, Michael Paquier wrote:
> Are there any objections to do what's proposed in the patch and
> improve the testing coverage of query jumbling by default?

Well, this one is done as of d28a449.  More validation tests could always
be added later if there are better ideas.  Coverage of this code has
gone up to 94.4% at the end:
https://coverage.postgresql.org/src/backend/nodes/queryjumblefuncs.funcs.c.gcov.html
--
Michael




Re: Add SHELL_EXIT_CODE to psql

2023-03-02 Thread Corey Huinker
On Thu, Mar 2, 2023 at 1:35 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > [ v9-0003-Add-psql-variables-SHELL_ERROR-and-SHELL_EXIT_COD.patch ]
>
> I took a brief look through this.  I'm on board with the general
> concept, but I think you've spent too much time on an ultimately
> futile attempt to get a committable test case, and not enough time
>

I didn't want to give the impression that I was avoiding/dismissing the
test case, and at some point I got curious how far we could push pg_regress.


> on making the behavior correct/usable.  In particular, it bothers
> me that there doesn't appear to be any way to distinguish whether
> a command failed on nonzero exit code versus a signal.  Also,
>

That's an intriguing distinction, and one I hadn't considered, mostly
because I assumed that any command with a set of exit codes rich enough to
warrant inspection would have a separate exit code for i-was-terminated.

It would certainly be possible to have two independent booleans,
SHELL_ERROR and SHELL_SIGNAL.
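
To make the distinction concrete, here is a minimal standalone sketch (my
own illustration, not the patch's code) of decoding a wait status returned
by pclose() with the POSIX macros, which is roughly the information that
two booleans such as SHELL_ERROR and SHELL_SIGNAL would summarize:

#include <stdio.h>
#include <sys/wait.h>

int
main(void)
{
	FILE	   *fp = popen("exit 3", "r");	/* any shell command */
	int			status;

	if (fp == NULL)
		return 1;

	status = pclose(fp);		/* wait status, or -1 on failure */
	if (status == -1)
		printf("pclose() itself failed\n");
	else if (WIFEXITED(status))
		printf("exit code %d\n", WEXITSTATUS(status));
	else if (WIFSIGNALED(status))
		printf("terminated by signal %d\n", WTERMSIG(status));
	else
		printf("unrecognized wait status 0x%x\n", status);

	return 0;
}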


> I see that you're willing to report whatever ferror returns
> (something quite unspecified in relevant standards, beyond
> zero/nonzero) as an equally-non-distinguishable integer.
>

I had imagined users of this feature falling into two camps:

1. Users writing scripts for their specific environment, with the ability
to know what exit codes are possible and different desired courses of
action based on those codes.
2. Users in no specific environment, content with SHELL_ERROR boolean, who
are probably just going to test for failure, and if so either \set a
default or \echo a message and \quit.



>
> I'm tempted to suggest that we ought to be returning a string,
> along the lines of "OK" or "exit code N" or "signal N" or
> "I/O error".  This though brings up the question of exactly
> what you expect scripts to use the variable for.  Maybe such
> a definition would be unhelpful, but if so why?  Maybe we
> should content ourselves with adding the SHELL_ERROR boolean, and
> leave the details to whatever we print in the error message?
>

Having a curated list of responses like "OK" and "exit code N" is also
intriguing, but could be hard to unpack given psql's limited string
manipulation capabilities.

I think the SHELL_ERROR boolean solves most cases, but it was suggested in
https://www.postgresql.org/message-id/20221102115801.gg16...@telsasoft.com
that users might want more detail than that, even if that detail is
unspecified and highly system dependent.


>
> As for the test case, I'm unimpressed with it because it's so
> weak as to be nearly useless; plus it seems like a potential
> security issue (what if "nosuchcommand" exists?).  It's fine
> to have it during development, I guess, but I'm inclined to
> leave it out of the eventual commit.
>
>
I have no objection to leaving it out. I think it proves that writing
OS-specific pg_regress tests is hard, and little else.


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Amit Kapila
On Thu, Mar 2, 2023 at 8:52 PM Önder Kalacı  wrote:
>
> Alright, attached v27_0001_use_index_on_subs_when_pub_rep_ident_full.patch and
> v27_0002_use_index_on_subs_when_pub_rep_ident_full.patch.
>

Few comments on 0001

1.
+   such suitable indexes, the search on the subscriber s ide can be
very inefficient,

unnecessary space in 'side'

2.
-   identity.  If the table does not have any suitable key, then it can be set
-   to replica identity full, which means the entire row becomes
-   the key.  This, however, is very inefficient and should only be used as a
-   fallback if no other solution is possible.  If a replica identity other
+   identity. When replica identity FULL is specified,
+   indexes can be used on the subscriber side for searching the rows.

I think it is better to retain the first sentence (If the table does
not ... entire row becomes the key.) as that says what will be part of
the key.

3.
-   comprising the same or fewer columns must also be set on the subscriber
-   side.  See  for details on
-   how to set the replica identity.  If a table without a replica identity is
-   added to a publication that replicates UPDATE
+   comprising the same or fewer columns must also be set on the
subscriber side.
+   See  for
+   details on how to set the replica identity.  If a table without a replica
+   identity is added to a publication that replicates UPDATE

I don't see any change in this except line length. If so, I don't
think we should change it as part of this patch.

4.
 /*
  * Setup a ScanKey for a search in the relation 'rel' for a tuple 'key' that
  * is setup to match 'rel' (*NOT* idxrel!).
  *
- * Returns whether any column contains NULLs.
+ * Returns how many columns should be used for the index scan.
+ *
+ * This is not generic routine, it expects the idxrel to be
+ * a btree, non-partial and have at least one column
+ * reference (e.g., should not consist of only expressions).
  *
- * This is not generic routine, it expects the idxrel to be replication
- * identity of a rel and meet all limitations associated with that.
+ * By definition, replication identity of a rel meets all
+ * limitations associated with that. Note that any other
+ * index could also meet these limitations.

The comment changes look quite asymmetric to me. Normally, we break
the line if the line length goes beyond 80 cols. Please check and
change other places in the patch if they have a similar symptom.

5.
+ * There are no fundamental problems for supporting non-btree
+ * and/or partial indexes.

Can we mention partial indexes in the above comment? It seems to me
that because the required tuple may not satisfy the expression (in the
case of partial indexes) it may not be easy to support it.

6.
build_replindex_scan_key()
{
...
+ for (index_attoff = 0; index_attoff <
IndexRelationGetNumberOfKeyAttributes(idxrel);
+ index_attoff++)
...
...
+#ifdef USE_ASSERT_CHECKING
+ IndexInfo  *indexInfo = BuildIndexInfo(idxrel);
+
+ Assert(!IsIndexOnlyOnExpression(indexInfo));
+#endif
...
}

We can avoid building the index info multiple times. This can either be
checked at the beginning of the function, outside the attribute offset loop,
or we can probably cache it. I understand this is for assert builds,
but it seems easy to avoid doing it multiple times, and it also looks odd
to do it multiple times for the same index.

7.
- /* Build scankey for every attribute in the index. */
- for (attoff = 0; attoff <
IndexRelationGetNumberOfKeyAttributes(idxrel); attoff++)
+ /* Build scankey for every non-expression attribute in the index. */
+ for (index_attoff = 0; index_attoff <
IndexRelationGetNumberOfKeyAttributes(idxrel);
+ index_attoff++)
  {
  Oid operator;
  Oid opfamily;
+ Oid optype = get_opclass_input_type(opclass->values[index_attoff]);
  RegProcedure regop;
- int pkattno = attoff + 1;
- int mainattno = indkey->values[attoff];
- Oid optype = get_opclass_input_type(opclass->values[attoff]);
+ int table_attno = indkey->values[index_attoff];

I don't think we need to change variable names here; we can retain
mainattno as it is instead of changing it to table_attno. The current
naming doesn't seem bad for the current usage of the patch.

8.
+ TypeCacheEntry **eq = NULL; /* only used when the index is not RI or PK */

Normally, we don't add such comments as the usage is quite obvious by
looking at the code.


-- 
With Regards,
Amit Kapila.




Re: Real config values for bytes needs quotes?

2023-03-02 Thread Kyotaro Horiguchi
At Thu, 2 Mar 2023 11:54:10 +0100, Peter Eisentraut 
 wrote in 
> On 27.02.23 09:32, Kyotaro Horiguchi wrote:
> > I found it frustrating that the line "shared_buffers = 0.1GB" in
> > postgresql.conf was causing an error and that the
> > value required (additional) surrounding single quotes.  The attached
> > patch makes the parser accept the use of non-quoted real values
> > followed by a unit for such variables. I'm not sure if that syntax
> > fully covers the input syntax of strtod, but I believe it is sufficient
> > for most use cases.
> 
> This seems sensible to fix.  If you're not sure about the details,
> write some test cases. :)

Thanks. I initially intended to limit the change to making REAL accept
units following a value. However, I also ended up modifying it to
accept '1e1'.

man strtod says it accepts the following format.

> The  expected  form  of the (initial portion of the) string is optional
> leading white space as recognized by isspace(3), an optional plus ('+')
> or  minus  sign  ('-')  and then either (i) a decimal number, or (ii) a
> hexadecimal number, or (iii) an infinity, or (iv) a NAN (not-a-number).
>
> A decimal number consists of a nonempty sequence of decimal digits pos‐
> sibly  containing  a  radix character (decimal point, locale-dependent,
> usually '.'), optionally followed by a  decimal  exponent.   A  decimal
> exponent  consists  of  an  'E' or 'e', followed by an optional plus or
> minus sign, followed by a nonempty  sequence  of  decimal  digits,  and
> indicates multiplication by a power of 10.

It is written in regexp as
'\s*[-+]?(\.\d+|\d+(\.\d*)?)([Ee][-+]?\d+)?'.  The leading whitespace
is unnecessary in this specific usage, and we also need to exclude
INTEGER from this set of matches. Therefore, it should be modified to
'[-+]?((\.\d+|\d+\.\d*)([Ee][-+]?\d+)?|\d+([Ee][-+]?\d+))'.

It is translated to the following flex rule. UNIT_LETTER is also
needed.

{SIGN}?(("."{DIGIT}+|{DIGIT}+"."{DIGIT}*){EXPONENT}?|{DIGIT}+{EXPONENT}){UNIT_LETTER}*
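
As a small standalone sanity check of the strtod() behaviour quoted above
(my own illustration, not part of the patch): strtod() parses the decimal
and exponent forms but stops in front of a trailing unit letter, which is
why the lexer itself must match {UNIT_LETTER}*, and it rejects '.' and
'.e3' because at least one digit is required:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *values[] = {"0.1GB", "1e-1", ".1e0", "1.e-01", ".", ".e3"};

	for (int i = 0; i < 6; i++)
	{
		char	   *end;
		double		d = strtod(values[i], &end);

		if (end == values[i])
			printf("%-7s -> no conversion\n", values[i]);
		else
			printf("%-7s -> %g, remainder \"%s\"\n", values[i], d, end);
	}
	return 0;
}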

The attached patch applies the aforementioned change to guc-file.l and
includes tests for values in certain patterns.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9a413b791364152d44838d329f0efe3cc7db4d22 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 3 Mar 2023 13:39:26 +0900
Subject: [PATCH v1] Widen unquoted value acceptance in configuration file

Numeric configuration parameters can accept real numbers like '10.3',
but the values cannot be followed by a unit if they are not quoted. In
addition, certain formats like '1e4' are not accepted unless they are
quoted. This commit changes the configuration file parser to
accept all of these values without the need for quotes.
---
 src/backend/utils/misc/guc-file.l |  2 +-
 src/test/modules/test_misc/t/003_check_guc.pl | 41 +++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index 41d62a9f23..ca4add9710 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -76,7 +76,7 @@ UNIT_LETTER		[a-zA-Z]
 INTEGER			{SIGN}?({DIGIT}+|0x{HEXDIGIT}+){UNIT_LETTER}*
 
 EXPONENT		[Ee]{SIGN}?{DIGIT}+
-REAL			{SIGN}?{DIGIT}*"."{DIGIT}*{EXPONENT}?
+REAL			{SIGN}?(("."{DIGIT}+|{DIGIT}+"."{DIGIT}*){EXPONENT}?|{DIGIT}+{EXPONENT}){UNIT_LETTER}*
 
 LETTER			[A-Za-z_\200-\377]
 LETTER_OR_DIGIT [A-Za-z_0-9\200-\377]
diff --git a/src/test/modules/test_misc/t/003_check_guc.pl b/src/test/modules/test_misc/t/003_check_guc.pl
index e9f33f3c77..053fbe336d 100644
--- a/src/test/modules/test_misc/t/003_check_guc.pl
+++ b/src/test/modules/test_misc/t/003_check_guc.pl
@@ -106,4 +106,45 @@ foreach my $param (@sample_intersect)
 	);
 }
 
+# check if GUC parser accepts some unquoted values accepted by strtod()
+my $conffile = $node->data_dir . "/postgresql.conf";
+my $conf = slurp_file($conffile);
+## without units
+my $lines =
+  join("\n", map {"checkpoint_completion_target = $_"}
+	   ('0.1', '1.', '.1', '1e-1', '.1e0', '1.e-01'));
+## with units
+$lines .= "\n" .
+  join("\n", map {"work_mem = ${_}MB"}
+	   ('0.3', '1.', '.3', '+3e-1', '-.3e0', '1.e01'));
+$node->append_conf('postgresql.conf', $lines);
+
+$node->reload();
+## has the last value been read?
+ok ($node->poll_query_until('postgres',
+			"SELECT setting::int = 10240 FROM pg_settings WHERE name = 'work_mem'"),
+	'all variations of real number parameter values passed');
+
+# check if parser correctly rejects some invalid patterns
+my $n = 0;
+foreach my $val ('.', '.e3')
+{
+	$n++ if (test_one_value($node, $val, $conf));
+}
+ok($n == 2, "parser succesfully rejected invalid values");
+
 done_testing();
+
+sub test_one_value
+{
+  my ($node, $value, $base_conf) = @_;
+
+  open(my $fh, '>', $conffile) || die "failed to open $conffile";
+  print $fh $conf . "\nwork_mem=$value\n";
+  close($fh);
+
+  my $logstart = -s $node->logfile;
+  $node->reload();
+  return 

Re: Add pg_walinspect function with block info columns

2023-03-02 Thread Michael Paquier
On Thu, Mar 02, 2023 at 11:17:05AM -0500, Melanie Plageman wrote:
> Thinking about this more, it could make sense to have a function which
> gives you this extended block information and has a parameter like
> with_fpi which would include the information returned by
> pg_get_wal_fpi_info(). It might be nice to have it still include the
> information about the record itself as well.

Hmm.  I am OK if you want to include more information about the
blocks, and it may be nicer to not bloat the interface with more
functions than necessary.

> I don't know if it would be instead of pg_get_wal_fpi_info(), though.
> 
> The way I would use this is when I want to see the record level
> information but with some additional information aggregated across the
> relevant blocks. For example, I could group by the record information
> and relfilenode and using the query in my example above, see all the
> information for the record along with the relname (when possible).

As far as I know, a block reference could have some data or an FPW, so
it is true that pg_get_wal_fpi_info() is not extensive enough if you
want to get more information about the blocks in use for each record,
especially if there is some data, and grouping the information about the
whole set of blocks into a single function call can save some time.

In order to satisfy your case, why not have one function that does
everything, looping over the blocks of a single record as long as
XLogRecHasBlockRef() is satisfied, returning the FPW if the block
includes an image (or NULL if !XLogRecHasBlockImage()), as well as its
data in bytea if XLogRecGetData() gives something (?).

I am not sure that this should return anything about the record itself
except its ReadRecPtr, though, as ReadRecPtr would be enough to
cross-check with the information provided by GetWALRecordInfo() with a
join.  Hence, I guess that we could update the existing FPI function
with:
- the addition of some of the flags of bimg_info, like the compression
type, if they apply, with a text[].
- the addition of bimg_len, if the block has a FPW, or NULL if none.
- the addition of apply_image, if the block has a FPW, or NULL if
none.
- the addition of the block data, if any, or NULL if there is no
data.
- an update for the FPW handling, where we would return NULL if there
is no FPW references in the block, but still return the full,
decompressed 8kB image if it is there.
--
Michael




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Peter Smith
Here are some review comments for v27-0001 (not the tests)

==
Commit Message

1.
There is no smart mechanism to pick the index. Instead, we choose
the first index that fulfils the requirements mentioned above.

~

1a.
I think this paragraph should immediately follow the earlier one
("With this patch...") which talked about the index requirements.

~

1b.
Slight rewording

SUGGESTION
If there is more than one index that satisfies these requirements, we
just pick the first one.

==
doc/src/sgml/logical-replication.sgml

2.
A published table must have a “replica identity” configured in order
to be able to replicate UPDATE and DELETE operations, so that
appropriate rows to update or delete can be identified on the
subscriber side. By default, this is the primary key, if there is one.
Another unique index (with certain additional requirements) can also
be set to be the replica identity. When replica identity FULL is
specified, indexes can be used on the subscriber side for searching
the rows. These indexes should be btree, non-partial and have at least
one column reference (e.g., should not consist of only expressions).
These restrictions on the non-unique index properties are in essence
the same restrictions that are enforced for primary keys. Internally,
we follow the same approach for supporting index scans within logical
replication scope. If there are no such suitable indexes, the search
on the subscriber s ide can be very inefficient, therefore replica
identity FULL should only be used as a fallback if no other solution
is possible. If a replica identity other than “full” is set on the
publisher side, a replica identity comprising the same or fewer
columns must also be set on the subscriber side. See REPLICA IDENTITY
for details on how to set the replica identity. If a table without a
replica identity is added to a publication that replicates UPDATE or
DELETE operations then subsequent UPDATE or DELETE operations will
cause an error on the publisher. INSERT operations can proceed
regardless of any replica identity.

~

2a.
IMO the replica identity in the first sentence should
be changed to be replica identity

~

2b.
Typo: "subscriber s ide" --> "subscriber side"

~

2c.
There is still one remaining "full" in this text. I think it ought to be
changed to FULL to match the others.

==
src/backend/executor/execReplication.c

3. IdxIsRelationIdentityOrPK

+/*
+ * Given a relation and OID of an index, returns true if the
+ * index is relation's primary key's index or relation's
+ * replica identity index.
+ *
+ * Returns false otherwise.
+ */
+bool
+IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid)
+{
+ Assert(OidIsValid(idxoid));
+
+ return RelationGetReplicaIndex(rel) == idxoid ||
+ RelationGetPrimaryKeyIndex(rel) == idxoid;
 }

~

Since the function name mentions RI (1st) and then PK (2nd), and since
the implementation also has the same order, I think the function
comment should use the same consistent order when describing what it
does.

==
src/backend/replication/logical/relation.c

4. FindUsableIndexForReplicaIdentityFull

+/*
+ * Returns an index oid if there is an index that can be used
+ * via the apply worker. The index should be btree, non-partial
+ * and have at least one column reference (e.g., should
+ * not consist of only expressions). The limitations arise from
+ * RelationFindReplTupleByIndex(), which is designed to handle
+ * PK/RI and these limitations are inherent to PK/RI.
+ *
+ * There are no fundamental problems for supporting non-btree
+ * and/or partial indexes. We should mostly relax the limitations
+ * in RelationFindReplTupleByIndex().
+ *
+ * If no suitable index is found, returns InvalidOid.
+ *
+ * Note that this is not a generic function, it expects REPLICA
+ * IDENTITY FULL for the remote relation.
+ */

~

4a.
Minor rewording of 1st sentence.

BEFORE
Returns an index oid if there is an index that can be used via the apply worker.

SUGGESTION
Returns the oid of an index that can be used via the apply worker.

~

4b.
+ * There are no fundamental problems for supporting non-btree
+ * and/or partial indexes. We should mostly relax the limitations
+ * in RelationFindReplTupleByIndex().

I think this paragraph should come later in the comment (just before
the Note) and should also have "XXX" prefix to indicate it is some
implementation note for future versions.

~~~

5. GetRelationIdentityOrPK

+/*
+ * Get replica identity index or if it is not defined a primary key.
+ *
+ * If neither is defined, returns InvalidOid
+ */
+Oid
+GetRelationIdentityOrPK(Relation rel)
+{
+ Oid idxoid;
+
+ idxoid = RelationGetReplicaIndex(rel);
+
+ if (!OidIsValid(idxoid))
+ idxoid = RelationGetPrimaryKeyIndex(rel);
+
+ return idxoid;
+}

This is really very similar code to the other new function called
IdxIsRelationIdentityOrPK. I wondered if such similar functions could
be defined together.

~~~

6. FindLogicalRepUsableIndex

+/*
+ * Returns an index oid if we can use an index for 

Re: [EXTERNAL] Re: Support load balancing in libpq

2023-03-02 Thread Andrey Borodin
On Wed, Mar 1, 2023 at 12:03 PM Jelte Fennema  wrote:
>
> done and updated cf entry
>

Hi Jelte!

I've looked into the patch. Although many improvements could be
suggested, it definitely makes sense as-is too.
These improvements might be, for example, sorting hosts according to
ping latency or something like that, or perhaps some other balancing
policies. Anyway, randomizing is a good start.

I want to note that the Fisher-Yates algorithm is implemented in a
difficult-to-understand manner.
+if (j < i) /* avoid fetching undefined data if j=i */
This check does not make sense when shuffling an array in place. It
is important only when making a new copy of an array, and only in
languages that cannot access uninitialized values. I'd suggest just
removing this line (in both cases).
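
For reference, a minimal standalone sketch of the in-place variant (my own
illustration, not the patch's actual code; plain rand() stands in for
libpq's random number source just to keep the example self-contained).
Swapping an element with itself when j == i is a harmless no-op, so no
guard is needed:

#include <stdio.h>
#include <stdlib.h>

static void
shuffle(int *array, int n)
{
	/* classic in-place Fisher-Yates: walk down, swap with a random j <= i */
	for (int i = n - 1; i > 0; i--)
	{
		int			j = rand() % (i + 1);
		int			tmp = array[i];

		array[i] = array[j];
		array[j] = tmp;
	}
}

int
main(void)
{
	int			hosts[] = {0, 1, 2, 3, 4};

	shuffle(hosts, 5);
	for (int i = 0; i < 5; i++)
		printf("%d ", hosts[i]);
	printf("\n");
	return 0;
}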


Best regards, Andrey Borodin.




Re: Remove source code display from \df+?

2023-03-02 Thread Isaac Morland
On Thu, 2 Mar 2023 at 17:20, Tom Lane  wrote:

> Isaac Morland  writes:
> > [ 0001-Remove-source-code-display-from-df-v6.patch ]
>
> Pushed after some editorialization on the test case.
>

Thanks!

> One thing I noticed while testing is that if you apply \df+ to an
> aggregate function, it will show "Internal name" of "aggregate_dummy".
> While that's an accurate description of what's in prosrc, it seems
> not especially useful and perhaps indeed confusing to novices.
> So I thought about suppressing it.  However, that would require
> a server-version-dependent test and I wasn't quite convinced it'd
> be worth the trouble.  Any thoughts on that?
>

I think it’s OK. Right now \df+ claims that the source code for an
aggregate function is “aggregate_dummy”; that’s probably more untrue than
saying that its internal name is “aggregate_dummy”. There are several
features of aggregate functions that are always defined the same way by the
creation process; who’s to say they don’t all have a shared dummy internal
name?


Re: Date-time extraneous fields with reserved keywords

2023-03-02 Thread Keisuke Kuroda
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi Joseph,

Good catch.
Of the reserved words that are special values of type Date/Time
('now', 'today', 'tomorrow', 'yesterday', and 'allballs'),
I get an error even before applying the patch.

One thing I noticed is that the following SQL 
returns normal results even after applying the patch.

postgres=# select timestamp 'epoch 01:01:01';
      timestamp
---------------------
 1970-01-01 00:00:00
(1 row)

When 'epoch', 'infinity', or '-infinity' and a time are specified together,
the time specified in the SQL is not included in the result.
I think it might be better to treat this pattern as an error as well.
What do you think?

As a side note,
reserved words such as 'today', 'tomorrow', and 'yesterday'
can be used to specify a time.

postgres=# select timestamp 'today 01:01:01';
      timestamp
---------------------
 2023-03-03 01:01:01
(1 row)

Best Regards,
Keisuke Kuroda
NTT Comware

The new status of this patch is: Waiting on Author


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread houzj.f...@fujitsu.com
On Thursday, March 2, 2023 11:23 PM Önder Kalacı   wrote:

> Both the patches are numbered 0001. It would be better to number them
> as 0001 and 0002.
> 
> Alright, attached v27_0001_use_index_on_subs_when_pub_rep_ident_full.patch 
> and 
> v27_0002_use_index_on_subs_when_pub_rep_ident_full.patch.
> 
> I also added one more test which Andres asked me on a private chat
> (Testcase start: SUBSCRIPTION USES INDEX WITH PUB/SUB different data).

Thanks for updating the patch. I think this patch can bring noticeable
performance improvements in some use cases.

And here are few comments after reading the patch.

1.
+   usableIndexContext = AllocSetContextCreate(CurrentMemoryContext,
+                                              "usableIndexContext",
+                                              ALLOCSET_DEFAULT_SIZES);
+   oldctx = MemoryContextSwitchTo(usableIndexContext);
+
+   /* Get index list of the local relation */
+   indexlist = RelationGetIndexList(localrel);
+   Assert(indexlist != NIL);
+
+   foreach(lc, indexlist)

Is it necessary to create a memory context here? I thought the memory
would be freed soon after this apply action.

2.

+   /*
+    * Furthermore, because primary key and unique key indexes can't
+    * include expressions we also sanity check the index is neither
+    * of those kinds.
+    */
+   Assert(!IdxIsRelationIdentityOrPK(rel, idxrel->rd_id));

It seems you mean "replica identity key" instead of "unique key" in the 
comments.


3.
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
...
+extern bool IsIndexOnlyOnExpression(IndexInfo *indexInfo);

The function definition seems better placed in execReplication.c

4.

+extern Oid GetRelationIdentityOrPK(Relation rel);

The function is only used in relation.c, so we can make it a static
function.


5.

+   /*
+    * If index scans are disabled, use a sequential scan.
+    *
+    * Note that we do not use index scans below when enable_indexscan is
+    * false. Allowing primary key or replica identity even when index scan is
+    * disabled is the legacy behaviour. So we hesitate to move the below
+    * enable_indexscan check to be done earlier in this function.
+    */
+   if (!enable_indexscan)
+       return InvalidOid;

Since the documentation of enable_indexscan says "Enables or disables the query
planner's use of index-scan plan types. The default is on.", and we don't use
the planner here, I am not sure whether we should allow/disallow index scans
in the apply worker based on this GUC.

Best Regards,
Hou zj



Re: Making empty Bitmapsets always be NULL

2023-03-02 Thread David Rowley
On Fri, 3 Mar 2023 at 15:17, Tom Lane  wrote:
> (Is it worth carrying both "allocated words" and "nonzero words"
> fields to avoid useless memory-management effort?  Dunno.)

It would have been a more appealing thing to do before Bitmapset
became a node type as we'd have had empty space in the struct to have
another 32-bit word on 64-bit builds.

> Another point here is that I'm pretty sure that just about all
> bitmapsets we deal with are only one or two words, so I'm not
> convinced you're going to get any performance win to justify
> the added management overhead.

It's true that the majority of Bitmapsets are going to be just 1 word,
but it's important to acknowledge that we do suffer in some more
extreme cases when Bitmapsets become large. A partitioned table with a
large number of partitions is one such case.

create table lp(a int) partition by list(a);
select 'create table lp'||x||' partition of lp for values
in('||x||');' from generate_series(0,)x;
\gexec

# cat bench.sql
select * from lp where a > 1 and a < 3;

$ pgbench -n -T 15 -f bench.sql postgres | grep tps

master:
tps = 28055.619289 (without initial connection time)
tps = 27819.235083 (without initial connection time)
tps = 28486.099808 (without initial connection time)

master + bms_no_trailing_zero_words.patch:
tps = 30840.840266 (without initial connection time)
tps = 29491.519705 (without initial connection time)
tps = 29471.083938 (without initial connection time)

(~6.45% faster)

Of course, it's an extreme case; I'm merely trying to show that
trimming the Bitmapsets down can have an impact in some cases.

David




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Amit Kapila
On Thu, Mar 2, 2023 at 6:50 PM Önder Kalacı  wrote:
>
>>
>>
>> In the above profile number of calls to index_fetch_heap(),
>> heapam_index_fetch_tuple() explains the reason for the regression you
>> are seeing with the index scan. Because the update will generate dead
>> tuples in the same transaction and those dead tuples won't be removed,
>> we get those from the index and then need to perform
>> index_fetch_heap() to find out whether the tuple is dead or not. Now,
>> for sequence scan also we need to scan those dead tuples but there we
>> don't need to do back-and-forth between index and heap.
>
>
> Thanks for the insights, I think what you describe makes a lot of sense.
>
...
...
>
> I think we figured out the cause of the performance regression. I think it is
> not small enough for some scenarios like the above. But those scenarios seem
> like synthetic test cases, with not many user-impacting implications. Still,
> I think you are better suited to comment on this.
>
> If you consider this a significant issue, we could consider the second patch
> as well, so that users could disable index scans for this unlikely scenario.
>

I think we can't completely ignore this regression because the key
point of this patch is to pick one of the non-unique indexes to
perform the scan, and it will be difficult to predict how many
duplicates (and/or dead rows) some index has without more planner
support. Personally, I feel it is better to have a table-level option
for this so that users have some knob to avoid regressions in
particular cases. In general, I agree that it will be a win in more
number of cases than it can regress.

-- 
With Regards,
Amit Kapila.




Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-02 Thread Peter Smith
On Fri, Mar 3, 2023 at 1:27 PM houzj.f...@fujitsu.com
 wrote:
>
> On Friday, March 3, 2023 8:18 AM Peter Smith  wrote:
...
> > Anyway, I think this exposes another problem. If you still want the patch 
> > to pass
> > the 'finshed_xact' parameter separately then AFAICT the first parameter 
> > (ctx)
> > now becomes unused/redundant in the WalSndUpdateProgressAndKeepalive
> > function, so it ought to be removed.
> >
>
> I am not sure about this. The first parameter (ctx) has been there since
> the Lag tracking feature. I think this is to make it consistent with other
> LogicalOutputPluginWriter callbacks. In addition, this is a public callback
> function and users can implement their own logic in this callback based on the
> interface, so removing this existing parameter doesn't look great to me. This
> patch also removes the existing skipped_xact, but that's because we decided
> to use another parameter, did_write, which can play a similar role.
>

Oh right, that makes sense. Thanks.

Perhaps it just wants a comment to mention that although the
built-in implementation does not use the 'ctx', users might implement
their own logic which does use it.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-02 Thread houzj.f...@fujitsu.com
On Friday, March 3, 2023 8:18 AM Peter Smith  wrote:
> On Wed, Mar 1, 2023 at 9:16 PM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Tues, Feb 28, 2023 at 9:12 AM Peter Smith 
> wrote:
> > > Here are some comments for the v2-0001 patch.
> > >
> > > (I haven't looked at the v3 that was posted overnight; maybe some of
> > > my comments have already been addressed.)
> >
> > Thanks for your comments.
> >
> > > ==
> > > General
> > >
> > > 1. (Info from the commit message)
> > > Since we can know whether the change is an end of transaction change
> > > in the common code, we removed the
> LogicalDecodingContext->end_xact
> > > introduced in commit f95d53e.
> > >
> > > ~
> > >
> > > TBH, it was not clear to me that this change was an improvement.
> > > IIUC, it removes the "unnecessary" member, but only does that by
> > > replacing it everywhere with a boolean parameter passed to
> > > update_progress_and_keepalive(). So the end result seems no less
> > > code, but it is less readable code now because you need to know what
> > > the true/false parameter means. I wonder if it would have been
> > > better just to leave this how it was.
> >
> > Since I think we can know the meaning of the input based on the
> > parameter name of the function, I think both approaches are fine. But
> > the approach in the current patch can reduce a member of the
> > structure, so I think this modification looks good to me.
> >
> 
...
> 
> Anyway, I think this exposes another problem. If you still want the patch to 
> pass
> the 'finshed_xact' parameter separately then AFAICT the first parameter (ctx)
> now becomes unused/redundant in the WalSndUpdateProgressAndKeepalive
> function, so it ought to be removed.
> 

I am not sure about this. The first parameter (ctx) has been there since
the Lag tracking feature. I think this is to make it consistent with other
LogicalOutputPluginWriter callbacks. In addition, this is a public callback
function and users can implement their own logic in this callback based on the
interface, so removing this existing parameter doesn't look great to me. This
patch also removes the existing skipped_xact, but that's because we decided
to use another parameter, did_write, which can play a similar role.

Best Regards,
Hou zj


Re: Making empty Bitmapsets always be NULL

2023-03-02 Thread Tom Lane
David Rowley  writes:
> I suggest tightening the rule even further so instead of just empty
> sets having to be represented as NULL, the rule should be that sets
> should never contain any trailing zero words, which is effectively a
> superset of the "empty is NULL" rule that you've just changed.

Hmm, I'm not immediately a fan of that, because it'd mean more
interaction with aset.c to change the allocated size of results.
(Is it worth carrying both "allocated words" and "nonzero words"
fields to avoid useless memory-management effort?  Dunno.)

Another point here is that I'm pretty sure that just about all
bitmapsets we deal with are only one or two words, so I'm not
convinced you're going to get any performance win to justify
the added management overhead.

regards, tom lane




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Peter Smith
FYI,

After applying only the 0001 patch I received a TAP test error.

t/032_subscribe_use_index.pl ... 1/? # Tests were run but no plan
was declared and done_testing() was not seen.
t/032_subscribe_use_index.pl ... Dubious, test returned 29 (wstat
7424, 0x1d00)
All 1 subtests passed
t/100_bugs.pl .. ok


More details:

2023-03-03 12:45:45.382 AEDT [9931] 032_subscribe_use_index.pl LOG:
statement: CREATE INDEX test_replica_id_full_idx ON
test_replica_id_full(x)
2023-03-03 12:45:45.423 AEDT [9937] 032_subscribe_use_index.pl LOG:
statement: CREATE SUBSCRIPTION tap_sub_rep_full CONNECTION 'port=56538
host=/tmp/zWyRQnOa1a dbname=postgres application_name=tap_sub'
PUBLICATION tap_pub_rep_full WITH (enable_index_scan = false)
2023-03-03 12:45:45.423 AEDT [9937] 032_subscribe_use_index.pl ERROR:
unrecognized subscription parameter: "enable_index_scan"
2023-03-03 12:45:45.423 AEDT [9937] 032_subscribe_use_index.pl
STATEMENT:  CREATE SUBSCRIPTION tap_sub_rep_full CONNECTION
'port=56538 host=/tmp/zWyRQnOa1a dbname=postgres
application_name=tap_sub' PUBLICATION tap_pub_rep_full WITH
(enable_index_scan = false)
2023-03-03 12:45:45.532 AEDT [9834] LOG:  received immediate shutdown request
2023-03-03 12:45:45.533 AEDT [9834] LOG:  database system is shut down

~~

The patches 0001 and 0002 seem to have accidentally blended together,
because AFAICT the error comes from patch 0001 testing something
that is not available until 0002.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Understanding, testing and improving our Windows filesystem code

2023-03-02 Thread Thomas Munro
On Thu, Jan 5, 2023 at 1:06 AM vignesh C  wrote:
> On Tue, 25 Oct 2022 at 09:42, Thomas Munro  wrote:
> > I pushed the bug fixes from this series, without their accompanying
> > tests.  Here's a rebase of the test suite, with all those tests now
> > squashed into the main test patch, and also the
> > tell-Windows-to-be-more-like-Unix patch.  Registered in the
> > commitfest.
>
> The patch does not apply ...

I think this exercise was (if I say so myself) quite useful, to
understand the Windows file system landscape.  Maybe the things we
figured out by testing are common knowledge to real Windows
programmers, I dunno, but they were certainly all news to me and not
documented anywhere I could find, and the knowledge and tests will
probably help in future battles against Windows.  The most important
things discovered were:

 1.  If you're testing on a Windows VM or laptop running 10 or 11 *you
aren't seeing the same behaviour as Windows Server*.  So the semantics
don't match real production PostgreSQL deployments.

 2.  If we decided to turn on the new POSIX unlink semantics
explicitly as originally proposed by Victor, we'd get the behaviour we
really want on NTFS on all known Windows versions.  But that would
move the traditional behaviour into a blind spot that we have no
testing for: ReFS and SMB.  Our tree would probably gain more stuff
that doesn't work on them, so that would be tantamount to dropping
support.

Therefore, with regret, I'm going to withdraw this for now.  We'd need
to get CI testing for ReFS and/or SMB first, which could be arranged,
but even then, what is the point of POSIX semantics if you don't have
them everywhere?  You can't even remove any code!  Unless we could
reach consensus that "PostgreSQL is not supported on SMB or ReFS until
they gain POSIX semantics" [which may never happen for all we know],
and then commit this patch and forget about non-POSIX unlink semantics
forever.  I don't see us doing that in a hurry.  So there's not much
hope for this idea in this commitfest.

The little C TAP framework could definitely be useful as a starting
point for something else, and the FS semantics test will definitely
come in handy if this topic is reopened by some of those potential
actions or needed to debug existing behaviour, and then I might even
re-propose parts of it, but it's all here in the archives anyway.




Re: Making empty Bitmapsets always be NULL

2023-03-02 Thread David Rowley
On Wed, 1 Mar 2023 at 10:59, Tom Lane  wrote:
> When I designed the Bitmapset module, I set things up so that an empty
> Bitmapset could be represented either by a NULL pointer, or by an
> allocated object all of whose bits are zero.  I've recently come to
> the conclusion that that was a bad idea and we should instead have
> a convention like the longstanding invariant for Lists: an empty
> list is represented by NIL and nothing else.

I know I'm late to the party here, but I just wanted to add that I
agree with this and also that I've never been a fan of having to
decide if it was safe to check for NULL when I needed performance or
if I needed to use bms_is_empty() because the set might have all its
words set to 0.

I suggest tightening the rule even further so instead of just empty
sets having to be represented as NULL, the rule should be that sets
should never contain any trailing zero words, which is effectively a
superset of the "empty is NULL" rule that you've just changed.

If we did this, then various functions could shake loose some crufty
code, and various other functions would become more optimal by not
having to loop needlessly over trailing zero words. For example:

* bms_equal() and bms_compare() can precheck nwords match before
troubling themselves with looping over each member, and;
* bms_is_subset() can return false early if 'a' has more words than 'b', and;
* bms_subset_compare() becomes more simple as it does not have to look
for trailing 0 words, and;
* bms_nonempty_difference() can return true early if 'a' has more
words than 'b', plus no need to check for trailing zero words at the
end.

We can also chop the set down to size in bms_intersect(),
bms_difference(), bms_int_members(), bms_del_members() and
bms_intersect(), which saves looping needlessly over empty words when
various other BMS operations are performed later on the set, for
example, bms_next_member(), bms_prev_member(), bms_copy(), etc.
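
To illustrate the kind of shortcut the rule buys, here is a minimal
sketch of the bms_is_subset() early exit mentioned above (illustrative
only, not code from the attached patch; it assumes both inputs obey the
no-trailing-zero-words rule):

static bool
bms_is_subset_sketch(const Bitmapset *a, const Bitmapset *b)
{
    if (a == NULL)
        return true;            /* empty set is a subset of anything */
    if (b == NULL || a->nwords > b->nwords)
        return false;           /* 'a' has a member beyond the end of 'b' */

    for (int i = 0; i < a->nwords; i++)
    {
        if ((a->words[i] & ~b->words[i]) != 0)
            return false;
    }
    return true;
}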

The only reason I can think of that this would be a bad idea is that
if we want to add members again then we need to do repalloc().  If
we're only increasing the nwords back to what it had been on some
previous occasion then repalloc() is effectively a no-op, so I doubt
this will really be a net negative.  I think the effort we'll waste by
carrying around needless trailing zero words in most cases is likely
to outweigh the overhead of any no-op repallocs.  Take
bms_int_members(), for example: we'll purposefully zero out all the
trailing words, possibly having to read in new cachelines from RAM to
do so.
until we actually need to do something more useful with them, like
adding some new members to the set again. We'll have to dirty those
cachelines then anyway and we may have flushed those cachelines out of
the CPU cache by the time we get around to adding the new members
again.

I've coded this up in the attached and followed the lead in list.c and
added a function named check_bitmapset_invariants() which ensures the
above rule is followed. I think the code as it stands today should
really have something like that anyway.
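
For reference, the rule itself is cheap to assert; a minimal sketch of
what such an invariant check could look like (the function name comes
from the attached patch, but this body is only an illustration of the
stated rule, not the patch itself):

static void
check_bitmapset_invariants(const Bitmapset *a)
{
    /* An empty set must be represented by NULL ... */
    if (a == NULL)
        return;

    /* ... so an allocated set has at least one word ... */
    Assert(a->nwords >= 1);

    /* ... and, per the stricter rule, its last word is never zero. */
    Assert(a->words[a->nwords - 1] != 0);
}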

The patch also optimizes sub-optimal newly added code which calls
bms_is_empty_internal() when we have other more optimal means to
determine if the set is empty or not.

David


bms_no_trailing_zero_words.patch
Description: Binary data


Re: Normalization of utility queries in pg_stat_statements

2023-03-02 Thread Michael Paquier
On Thu, Mar 02, 2023 at 08:12:24AM +0100, Drouvot, Bertrand wrote:
> Applying 0001 produces:
> 
> Applying: Split more regression tests of pg_stat_statements
> .git/rebase-apply/patch:1735: new blank line at EOF.
> +
> .git/rebase-apply/patch:2264: new blank line at EOF.
> +
> warning: 2 lines add whitespace errors.

Indeed, removed.

> What about removing those comments?

Removing these two as well.

> Except from the Nits above, 0001 LGTM.

Thanks for double-checking, applied 0001 to finish this part of the
work.  The remaining bits are attached, combined into a single patch.
I am going to look at it again at the beginning of next week and
potentially apply it, so that the normalization is reflected in the
reports of pg_stat_statements.
--
Michael
From f96f6a65d5318b059527230d074660ffec099129 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 3 Mar 2023 09:28:13 +0900
Subject: [PATCH v6] Remove normalization of A_Const nodes

Doing so leads to weird cases with commands that can define a
transaction isolation (SET TRANSACTION and BEGIN), as the normalization
is not able to cope with the full field, yet.

Applying normalization of Const nodes to DDLs changes the states of the
following commands:
- DECLARE
- EXPLAIN
- CREATE MATERIALIZED VIEW
- CTAS

At the end, this should be merged with the previous patch, but keeping
it separate shows the difference of behavior between the two approaches
in the regression tests of pg_stat_statements.
---
 src/include/nodes/parsenodes.h|  8 +++-
 src/include/nodes/primnodes.h |  9 +++--
 .../pg_stat_statements/expected/cursors.out   | 14 +++
 .../pg_stat_statements/expected/utility.out   | 38 +--
 .../pg_stat_statements/pg_stat_statements.c   |  4 +-
 5 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f7d7f10f7d..259e814253 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3221,14 +3221,18 @@ typedef struct InlineCodeBlock
  * list contains copies of the expressions for all output arguments, in the
  * order of the procedure's declared arguments.  (outargs is never evaluated,
  * but is useful to the caller as a reference for what to assign to.)
+ * The transformed call state is not relevant in the query jumbling, only the
+ * function call is.
  * --
  */
 typedef struct CallStmt
 {
 	NodeTag		type;
 	FuncCall   *funccall;		/* from the parser */
-	FuncExpr   *funcexpr;		/* transformed call, with only input args */
-	List	   *outargs;		/* transformed output-argument expressions */
+	/* transformed call, with only input args */
+	FuncExpr   *funcexpr pg_node_attr(query_jumble_ignore);
+	/* transformed output-argument expressions */
+	List	   *outargs pg_node_attr(query_jumble_ignore);
 } CallStmt;
 
 typedef struct CallContext
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index b4292253cc..4220c63ab7 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -128,8 +128,10 @@ typedef struct TableFunc
  * CREATE MATERIALIZED VIEW
  *
  * For CREATE MATERIALIZED VIEW, viewQuery is the parsed-but-not-rewritten
- * SELECT Query for the view; otherwise it's NULL.  (Although it's actually
- * Query*, we declare it as Node* to avoid a forward reference.)
+ * SELECT Query for the view; otherwise it's NULL.  This is irrelevant in
+ * the query jumbling as CreateTableAsStmt already includes a reference to
+ * its own Query, so ignore it.  (Although it's actually Query*, we declare
+ * it as Node* to avoid a forward reference.)
  */
 typedef struct IntoClause
 {
@@ -141,7 +143,8 @@ typedef struct IntoClause
 	List	   *options;		/* options from WITH clause */
 	OnCommitAction onCommit;	/* what do we do at COMMIT? */
 	char	   *tableSpaceName; /* table space to use, or NULL */
-	Node	   *viewQuery;		/* materialized view's SELECT query */
+	/* materialized view's SELECT query */
+	Node	   *viewQuery pg_node_attr(query_jumble_ignore);
 	bool		skipData;		/* true for WITH NO DATA */
 } IntoClause;
 
diff --git a/contrib/pg_stat_statements/expected/cursors.out b/contrib/pg_stat_statements/expected/cursors.out
index 5d0dc196f9..46375ea905 100644
--- a/contrib/pg_stat_statements/expected/cursors.out
+++ b/contrib/pg_stat_statements/expected/cursors.out
@@ -16,10 +16,10 @@ CLOSE cursor_stats_1;
 DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT 2;
 CLOSE cursor_stats_1;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
- calls | rows |query 
+--+--
+ calls | rows | query 
+---+--+---
  2 |0 | CLOSE cursor_stats_1
- 2 |0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR 

Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-02 Thread Andres Freund
Hi,

On 2023-03-03 11:18:04 +1100, Peter Smith wrote:
> - Why is reducing members of LogicalDecodingContext even a goal? I
> thought the LogicalDecodingContext is intended to be the one-stop
> place to hold *all* things related to the "Context" (including that
> member that was deleted).

There's not really a reason to keep it in LogicalDecodingContext after
this change. It was only needed there because of the broken
architectural model of calling UpdateProgress from within output
plugins. Why set a field in each wrapper that we don't need?

> - How is reducing one member better than introducing one new parameter
> in multiple calls?

Reducing the member isn't important; not needing to set it before each
callback, however, does make sense.

Greetings,

Andres Freund




Re: Allow logical replication to copy tables in binary format

2023-03-02 Thread Peter Smith
On Thu, Mar 2, 2023 at 4:00 PM Amit Kapila  wrote:
>
> On Thu, Mar 2, 2023 at 7:27 AM Peter Smith  wrote:
> >
...
> > IIUC most people seem to be coming down in favour of there being a
> > single unified option (the existing 'binary'=true/false) which would
> > apply to both the COPY and the data replication parts.
> >
> > I also agree
> > - Yes, it is simpler.
> > - Yes, there are various workarounds in case the COPY part failed
> >
> > But, AFAICT the main question remains unanswered -- Are we happy to
> > break existing applications already using binary=true. E.g. I think
> > there might be cases where applications are working *only* because
> > their binary=true is internally (and probably unbeknownst to the user)
> > reverting to text. So if we unified everything under one 'binary'
> > option then binary=true will force COPY binary so now some previously
> > working applications will get COPY errors requiring workarounds. Is
> > that acceptable?
> >
>
> I think one can look at this from another angle also where users would
> be expecting that when binary = true and copy_data = true, all the
> data transferred between publisher and subscriber should be in binary
> format. Users have a workaround to set binary=true only after the
> initial sync. Also, if at all, the behaviour change would be after
> major version upgrade which shouldn't be a problem.
>
> > TBH I am not sure anymore if the complications justify the patch.
> >
> > It seems we have to choose from 2 bad choices:
> > - separate options = this works but would be more confusing for the user
> > - unified option = this would be simpler and faster, but risks
> > breaking existing applications currently using 'binary=true'
> >
>
> I would prefer a unified option as apart from other things you and
> others mentioned that will be less of a maintenance burden in the
> future.

My concern was mostly just about the potential to break the behaviour
of existing binary=true applications in some edge cases.

If you are happy that doing so shouldn't be a problem, then I am also
+1 to use the unified option.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-03-02 Thread Andres Freund
On 2023-03-02 14:41:26 -0500, reid.thomp...@crunchydata.com wrote:
> Patch has been rebased to master.

Quite a few prior review comments seem to not have been
addressed. There's not much point in posting new versions without that.

I think there's zero chance 0002 can make it into 16. If 0001 is cleaned
up, I can see a path.




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-02 Thread Andres Freund
Hi,

On 2023-03-02 13:00:31 -0800, Andres Freund wrote:
> I'm not opposed to EXPR_PARAM_SET, to be clear. I'll send an updated
> version later. I was just thinking about the correctness in the current
> world.

Attached.

I named the step EEOP_PARAM_SET rather than EEOP_PARAM_EXEC_SET or such,
because I was wondering if there are cases where it could also be useful in
conjunction with PARAM_EXTERN, and because nothing really depends on the
kind of param.

Greetings,

Andres
>From f63915ab00d55b5a67d8b6d05863fe69ea4252b5 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 25 Feb 2023 13:39:19 -0800
Subject: [PATCH v2] WIP: Evaluate arguments of correlated SubPlans in the
 referencing ExprState

---
 src/include/executor/execExpr.h   |  6 +-
 src/include/nodes/execnodes.h |  1 -
 src/backend/executor/execExpr.c   | 93 +--
 src/backend/executor/execExprInterp.c | 22 +++
 src/backend/executor/execProcnode.c   |  5 ++
 src/backend/executor/nodeSubplan.c| 30 -
 src/backend/jit/llvm/llvmjit_expr.c   |  6 ++
 src/backend/jit/llvm/llvmjit_types.c  |  1 +
 8 files changed, 112 insertions(+), 52 deletions(-)

diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 06c3adc0a19..ca2b7306cd0 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -158,6 +158,8 @@ typedef enum ExprEvalOp
 	EEOP_PARAM_EXEC,
 	EEOP_PARAM_EXTERN,
 	EEOP_PARAM_CALLBACK,
+	/* set PARAM_EXEC value */
+	EEOP_PARAM_SET,
 
 	/* return CaseTestExpr value */
 	EEOP_CASE_TESTVAL,
@@ -374,7 +376,7 @@ typedef struct ExprEvalStep
 			ExprEvalRowtypeCache rowcache;
 		}			nulltest_row;
 
-		/* for EEOP_PARAM_EXEC/EXTERN */
+		/* for EEOP_PARAM_EXEC/EXTERN and EEOP_PARAM_SET */
 		struct
 		{
 			int			paramid;	/* numeric ID for parameter */
@@ -738,6 +740,8 @@ extern void ExecEvalParamExec(ExprState *state, ExprEvalStep *op,
 			  ExprContext *econtext);
 extern void ExecEvalParamExtern(ExprState *state, ExprEvalStep *op,
 ExprContext *econtext);
+extern void ExecEvalParamSet(ExprState *state, ExprEvalStep *op,
+			 ExprContext *econtext);
 extern void ExecEvalCurrentOfExpr(ExprState *state, ExprEvalStep *op);
 extern void ExecEvalNextValueExpr(ExprState *state, ExprEvalStep *op);
 extern void ExecEvalRowNull(ExprState *state, ExprEvalStep *op,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7eb1e666ff..16e95e4cb48 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -947,7 +947,6 @@ typedef struct SubPlanState
 	struct PlanState *planstate;	/* subselect plan's state tree */
 	struct PlanState *parent;	/* parent plan node's state tree */
 	ExprState  *testexpr;		/* state of combining expression */
-	List	   *args;			/* states of argument expression(s) */
 	HeapTuple	curTuple;		/* copy of most recent tuple from subplan */
 	Datum		curArray;		/* most recent array from ARRAY() subplan */
 	/* these are used when hashing the subselect's output: */
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index c61f23c6c18..002f2a0091f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -87,6 +87,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
   FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
   int transno, int setno, int setoff, bool ishash,
   bool nullcheck);
+static void ExecInitSubPlanExpr(SubPlan *subplan,
+ExprState *state,
+Datum *resv, bool *resnull);
 
 
 /*
@@ -1388,7 +1391,6 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_SubPlan:
 			{
 SubPlan*subplan = (SubPlan *) node;
-SubPlanState *sstate;
 
 /*
  * Real execution of a MULTIEXPR SubPlan has already been
@@ -1405,19 +1407,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
 	break;
 }
 
-if (!state->parent)
-	elog(ERROR, "SubPlan found with no parent plan");
-
-sstate = ExecInitSubPlan(subplan, state->parent);
-
-/* add SubPlanState nodes to state->parent->subPlan */
-state->parent->subPlan = lappend(state->parent->subPlan,
- sstate);
-
-scratch.opcode = EEOP_SUBPLAN;
-scratch.d.subplan.sstate = sstate;
-
-ExprEvalPushStep(state, &scratch);
+ExecInitSubPlanExpr(subplan, state, resv, resnull);
 break;
 			}
 
@@ -2618,29 +2608,12 @@ ExecPushExprSetupSteps(ExprState *state, ExprSetupInfo *info)
 	foreach(lc, info->multiexpr_subplans)
 	{
 		SubPlan*subplan = (SubPlan *) lfirst(lc);
-		SubPlanState *sstate;
 
 		Assert(subplan->subLinkType == MULTIEXPR_SUBLINK);
 
-		/* This should match what ExecInitExprRec does for other SubPlans: */
-
-		if (!state->parent)
-			elog(ERROR, "SubPlan found with no parent plan");
-
-		sstate = ExecInitSubPlan(subplan, state->parent);
-
-		/* add SubPlanState nodes to state->parent->subPlan */
-		state->parent->subPlan = lappend(state->parent->subPlan,
-		 sstate);
-
-	

Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-02 Thread Peter Smith
On Wed, Mar 1, 2023 at 9:16 PM wangw.f...@fujitsu.com
 wrote:
>
> On Tues, Feb 28, 2023 at 9:12 AM Peter Smith  wrote:
> > Here are some comments for the v2-0001 patch.
> >
> > (I haven't looked at the v3 that was posted overnight; maybe some of
> > my comments have already been addressed.)
>
> Thanks for your comments.
>
> > ==
> > General
> >
> > 1. (Info from the commit message)
> > Since we can know whether the change is an end of transaction change in the
> > common code, we removed the LogicalDecodingContext->end_xact introduced
> > in
> > commit f95d53e.
> >
> > ~
> >
> > TBH, it was not clear to me that this change was an improvement. IIUC,
> > it removes the "unnecessary" member, but only does that by replacing
> > it everywhere with a boolean parameter passed to
> > update_progress_and_keepalive(). So the end result seems no less code,
> > but it is less readable code now because you need to know what the
> > true/false parameter means. I wonder if it would have been better just
> > to leave this how it was.
>
> Since I think we can know the meaning of the input based on the parameter name
> of the function, I think both approaches are fine. But the approach in the
> current patch can reduce a member of the structure, so I think this 
> modification
> looks good to me.
>

Hmm, I am not so sure:

- Why is reducing members of LogicalDecodingContext even a goal? I
thought the LogicalDecodingContext is intended to be the one-stop
place to hold *all* things related to the "Context" (including that
member that was deleted).

- How is reducing one member better than introducing one new parameter
in multiple calls?

Anyway, I think this exposes another problem. If you still want the
patch to pass the 'finished_xact' parameter separately, then AFAICT the
first parameter (ctx) now becomes unused/redundant in the
WalSndUpdateProgressAndKeepalive function, so it ought to be removed.

> > ==
> > src/backend/replication/logical/logical.c
> >
> > 3. General - calls to is_skip_threshold_change()
> >
> > + if (is_skip_threshold_change(ctx))
> > + update_progress_and_keepalive(ctx, false);
> >
> > There are multiple calls like this, which are guarding the
> > update_progress_and_keepalive() with the is_skip_threshold_change()
> > - See truncate_cb_wrapper
> > - See message_cb_wrapper
> > - See stream_change_cb_wrapper
> > - See stream_message_cb_wrapper
> > - See stream_truncate_cb_wrapper
> > - See UpdateDecodingProgressAndKeepalive
> >
> > IIUC, then I was thinking all those conditions maybe can be pushed
> > down *into* the wrapper, thereby making every calling code simpler.
> >
> > e.g. make the wrapper function code look similar to the current
> > UpdateDecodingProgressAndKeepalive:
> >
> > BEFORE (update_progress_and_keepalive)
> > {
> > if (!ctx->update_progress_and_keepalive)
> > return;
> >
> > ctx->update_progress_and_keepalive(ctx, ctx->write_location,
> >ctx->write_xid, ctx->did_write,
> >finished_xact);
> > }
> > AFTER
> > {
> > if (!ctx->update_progress_and_keepalive)
> > return;
> >
> > if (finished_xact || is_skip_threshold_change(ctx))
> > {
> > ctx->update_progress_and_keepalive(ctx, ctx->write_location,
> >ctx->write_xid, ctx->did_write,
> >finished_xact);
> > }
> > }
>
> Since I want to keep the function update_progress_and_keepalive simple, I 
> didn't
> change it.

Hmm, the reason given seems like a false economy to me. You are able
to keep this 1 function simpler only by adding more complexity to the
calls in 6 other places. Let's see if other people have opinions about
this.

~~~

1.
+
+static void UpdateProgressAndKeepalive(LogicalDecodingContext *ctx,
+bool finished_xact);
+
+static bool is_keepalive_threshold_exceeded(LogicalDecodingContext *ctx);

1a.
There is an unnecessary extra blank line above the UpdateProgressAndKeepalive.

~

1b.
I did not recognize a reason for the different naming conventions.
Here are two new functions but one is CamelCase and one is snake_case.
What are the rules to decide the naming?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Should vacuum process config file reload more often

2023-03-02 Thread Melanie Plageman
On Thu, Mar 2, 2023 at 2:36 AM Masahiko Sawada  wrote:
>
> On Thu, Mar 2, 2023 at 10:41 AM Melanie Plageman
>  wrote:
> > On another topic, I've just realized that when autovacuuming we only
> > update tab->at_vacuum_cost_delay/limit from
> > autovacuum_vacuum_cost_delay/limit for each table (in
> > table_recheck_autovac()) and then use that to update
> > MyWorkerInfo->wi_cost_delay/limit. MyWorkerInfo->wi_cost_delay/limit is
> > what is used to update VacuumCostDelay/Limit in AutoVacuumUpdateDelay().
> > So, even if we reload the config file in vacuum_delay_point(), if we
> > don't use the new value of autovacuum_vacuum_cost_delay/limit it will
> > have no effect for autovacuum.
>
> Right, but IIUC wi_cost_limit (and VacuumCostDelayLimit) might be
> updated. After the autovacuum launcher reloads the config file, it
> calls autovac_balance_cost() that updates that value of active
> workers. I'm not sure why we don't update workers' wi_cost_delay,
> though.

Ah yes, I didn't realize this. Thanks. I went back and did more code
reading/analysis, and I see no reason why we shouldn't update
worker->wi_cost_delay to the new value of autovacuum_vac_cost_delay in
autovac_balance_cost(). Then, as you said, the autovac launcher will
call autovac_balance_cost() when it reloads the configuration file.
Then, the next time the autovac worker calls AutoVacuumUpdateDelay(), it
will update VacuumCostDelay.
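
To make that concrete, here is a rough sketch of the idea (function and
field names are taken from the existing autovacuum code discussed above;
locking and the existing cost-limit balancing are omitted, so treat this
as an illustration rather than a proposed patch):

static void
autovac_balance_cost(void)
{
    dlist_iter  iter;

    /* ... existing wi_cost_limit balancing elided ... */

    dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
    {
        WorkerInfo  worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

        /*
         * Also refresh the delay, so a reloaded autovacuum_vacuum_cost_delay
         * reaches the worker the next time it calls AutoVacuumUpdateDelay().
         * Workers on tables with per-table cost settings would need to be
         * skipped here, just as they are for the cost limit.
         */
        worker->wi_cost_delay = autovacuum_vac_cost_delay;
    }
}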

> > I started writing a little helper that could be used to update these
> > workerinfo->wi_cost_delay/limit in vacuum_delay_point(),
>
> Since we set vacuum delay parameters for autovacuum workers so that we
> ration out I/O equally, I think we should keep the current mechanism
> that the autovacuum launcher sets workers' delay parameters and they
> update accordingly.

Yes, agreed, it should go in the same place as where we update
wi_cost_limit (autovac_balance_cost()). I think we should potentially
rename autovac_balance_cost() because its name and all its comments
point to its only purpose being to balance the total of the workers'
wi_cost_limits to no more than autovacuum_vacuum_cost_limit. And the
autovacuum_vacuum_cost_delay doesn't need to be balanced in this way.

Though, since this change on its own would make autovacuum pick up new
values of autovacuum_vacuum_cost_limit (without having the worker reload
the config file), I wonder if it makes sense to try and have
vacuum_delay_point() only reload the config file if it is an explicit
vacuum or an analyze not being run in an outer transaction (to avoid
overhead of reloading config file)?

The lifecycle of these different vacuum delay-related GUCs, and how it
differs between autovacuum workers and explicit vacuum, is already quite
tangled, though.

> >  but I notice
> > when they are first set, we consider the autovacuum table options. So,
> > I suppose I would need to consider these when updating
> > wi_cost_delay/limit later as well? (during vacuum_delay_point() or
> > in AutoVacuumUpdateDelay())
> >
> > I wasn't quite sure because I found these chained ternaries rather
> > difficult to interpret, but I think table_recheck_autovac() is saying
> > that the autovacuum table options override all other values for
> > vac_cost_delay?
> >
> > vac_cost_delay = (avopts && avopts->vacuum_cost_delay >= 0)
> > ? avopts->vacuum_cost_delay
> > : (autovacuum_vac_cost_delay >= 0)
> > ? autovacuum_vac_cost_delay
> > : VacuumCostDelay;
> >
> > i.e. this?
> >
> >   if (avopts && avopts->vacuum_cost_delay >= 0)
> > vac_cost_delay = avopts->vacuum_cost_delay;
> >   else if (autovacuum_vac_cost_delay >= 0)
> > vac_cost_delay = autovacuum_vacuum_cost_delay;
> >   else
> > vac_cost_delay = VacuumCostDelay
>
> Yes, if the table has autovacuum table options, we use these values
> and the table is excluded from the balancing algorithm I mentioned
> above. See the code from table_recheck_autovac(),
>
>/*
> * If any of the cost delay parameters has been set individually for
> * this table, disable the balancing algorithm.
> */
>tab->at_dobalance =
>!(avopts && (avopts->vacuum_cost_limit > 0 ||
> avopts->vacuum_cost_delay > 0));
>
> So if the table has autovacuum table options, the vacuum delay
> parameters probably should be updated by ALTER TABLE, not by reloading
> the config file.

Yes, if the table has autovacuum table options, I think the user is
out-of-luck until the relation is done being vacuumed because the ALTER
TABLE will need to get a lock.

- Melanie




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra



On 3/2/23 23:57, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Mar 2023 19:53:14 +0100
> Tomas Vondra  wrote:
>> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
> ...
> 
>>> There were some thoughts about how to make better use of the memory. As
>>> memory is exploding way beyond work_mem, at least avoid wasting it with
>>> too many buffers of BufFile. So you expand either the work_mem or the
>>> number of batches, depending on which move is smarter. This is explained and
>>> tested here:
>>>
>>> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
>>> https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
>>>
>>> And then, another patch to overflow each batch to a dedicated temp file and
>>> stay inside work_mem (v4-per-slice-overflow-file.patch):
>>>
>>> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>>>
>>> Then, nothing more on the discussion about this last patch. So I guess it
>>> just went cold.
>>
>> I think a contributing factor was that the OP did not respond for a
>> couple months, so the thread went cold.
>>
>>> For what it worth, these two patches seems really interesting to me. Do you
>>> need any help to revive it?
>>
>> I think another reason why that thread went nowhere was that we've
>> been exploring a different (and likely better) approach to fix this by
>> falling back to a nested loop for the "problematic" batches.
>>
>> As proposed in this thread:
>>
>>  
>> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> 
> Unless I'm wrong, you are linking to the same «frustrated as heck!» 
> discussion,
> for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
> (balancing between increasing batches *and* work_mem).
> 
> No sign of turning "problematic" batches to nested loop. Did I miss something?
> 
> Do you have a link close to your hand about such algo/patch test by any 
> chance?
> 

Gah! My apologies, I meant to post a link to this thread:

https://www.postgresql.org/message-id/caakru_b6+jc93wp+pwxqk5kazjc5rmxm8uquktef-kq++1l...@mail.gmail.com

which then points to this BNL patch

https://www.postgresql.org/message-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA%40mail.gmail.com

That discussion apparently stalled in August 2020, so maybe that's where
we should pick up and see in what shape that patch is.

regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Introduce "log_connection_stages" setting.

2023-03-02 Thread Jacob Champion
On 3/2/23 14:56, Tom Lane wrote:
> Jacob Champion  writes:
>> If I've understood Tom correctly in [1], both of these guc_mallocs
>> should be using a loglevel less than ERROR, to avoid forcing a
>> postmaster exit when out of memory. (I used WARNING in that thread
>> instead, which seemed to be acceptable.)
> 
> Actually, preferred practice is as seen in e.g. check_datestyle:
> 
>   myextra = (int *) guc_malloc(LOG, 2 * sizeof(int));
>   if (!myextra)
>   return false;
>   myextra[0] = newDateStyle;
>   myextra[1] = newDateOrder;
>   *extra = (void *) myextra;
> 
> which gives the guc.c functions an opportunity to manage the
> failure.

Ah, thanks for the correction. (My guc_strdup(WARNING, ...) calls may
need to be cleaned up too, then.)

--Jacob




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:53:14 +0100
Tomas Vondra  wrote:
> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...

> > There were some thoughts about how to make better use of the memory. As
> > memory is exploding way beyond work_mem, at least avoid wasting it with
> > too many buffers of BufFile. So you expand either the work_mem or the
> > number of batches, depending on which move is smarter. This is explained and
> > tested here:
> > 
> > https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> > https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> > 
> > And then, another patch to overflow each batch to a dedicated temp file and
> > stay inside work_mem (v4-per-slice-overflow-file.patch):
> > 
> > https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> > 
> > Then, nothing more on the discussion about this last patch. So I guess it
> > just went cold.
> 
> I think a contributing factor was that the OP did not respond for a
> couple months, so the thread went cold.
> 
> > For what it worth, these two patches seems really interesting to me. Do you
> > need any help to revive it?
> 
> I think another reason why that thread went nowhere was that we've
> been exploring a different (and likely better) approach to fix this by
> falling back to a nested loop for the "problematic" batches.
> 
> As proposed in this thread:
> 
>  
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development

Unless I'm wrong, you are linking to the same «frustrated as heck!» discussion,
for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
(balancing between increasing batches *and* work_mem).

No sign of turning "problematic" batches to nested loop. Did I miss something?

Do you have a link close to your hand about such algo/patch test by any chance?

> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> years, maybe we shouldn't stall and get at least an imperfect stop-gap
> solution ...

I'll keep searching tomorrow for existing BNL discussions (is it block level
nested loops?).

Regards,




Re: Introduce "log_connection_stages" setting.

2023-03-02 Thread Tom Lane
Jacob Champion  writes:
> This is looking very good. One bigger comment:

>> +myextra = (int *) guc_malloc(ERROR, sizeof(int));
>> +*myextra = newlogconnect;

> If I've understood Tom correctly in [1], both of these guc_mallocs
> should be using a loglevel less than ERROR, to avoid forcing a
> postmaster exit when out of memory. (I used WARNING in that thread
> instead, which seemed to be acceptable.)

Actually, preferred practice is as seen in e.g. check_datestyle:

myextra = (int *) guc_malloc(LOG, 2 * sizeof(int));
if (!myextra)
return false;
myextra[0] = newDateStyle;
myextra[1] = newDateOrder;
*extra = (void *) myextra;

which gives the guc.c functions an opportunity to manage the
failure.

A quick grep shows that there are existing check functions that
did not get that memo, e.g. check_recovery_target_lsn.
We ought to clean them up.

This is, of course, not super important unless you're allocating
something quite large; the odds of going OOM in the postmaster
should be pretty small.
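
Concretely, applied to the hook excerpt quoted upthread, that pattern
would look roughly like this (a sketch only; newlogconnect is the
variable name from the patch under review):

	myextra = (int *) guc_malloc(LOG, sizeof(int));
	if (!myextra)
		return false;			/* let guc.c report and handle the failure */
	*myextra = newlogconnect;
	*extra = (void *) myextra;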

regards, tom lane




Re: buildfarm + meson

2023-03-02 Thread Andrew Dunstan


On 2023-03-02 Th 17:06, Andres Freund wrote:

> Hi
>
> On 2023-03-02 17:00:47 -0500, Andrew Dunstan wrote:
> > On 2023-03-01 We 16:32, Andres Freund wrote:
> > > > This is now working
> > > > on my MSVC test rig (WS2019, VS2019, Strawberry Perl), including TAP tests.
> > > > I do get a whole lot of annoying messages like this:
> > > >
> > > > Unknown TAP version. The first line MUST be `TAP version `. Assuming
> > > > version 12.
> > > The newest minor version has fixed that, it was a misunderstanding about /
> > > imprecision in the tap 14 specification.
> >
> > Unfortunately, meson v 1.0.1 appears to be broken on Windows, I had to
> > downgrade back to 1.0.0.
>
> Is it possible that you're using a PG checkout from a few days ago? A
> hack I used was invalidated by 1.0.1, but I fixed that already.
>
> CI is running with 1.0.1:
> https://cirrus-ci.com/task/5806561726038016?logs=configure#L8



No, running against PG master tip. I'll get some details - it's not too 
hard to switch back and forth.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Introduce "log_connection_stages" setting.

2023-03-02 Thread Jacob Champion
On 2/1/23 11:59, Sergey Dudoladov wrote:
> Justin, thank you for the fast review. The new version is attached.

This is looking very good. One bigger comment:

> + myextra = (int *) guc_malloc(ERROR, sizeof(int));
> + *myextra = newlogconnect;

If I've understood Tom correctly in [1], both of these guc_mallocs
should be using a loglevel less than ERROR, to avoid forcing a
postmaster exit when out of memory. (I used WARNING in that thread
instead, which seemed to be acceptable.)

And a couple of nitpicks:

> +Causes the specified stages of each connection attempt to the server 
> to be logged. Example: authorized,disconnected.

Long line; should be rewrapped.

> +   else {
>
> +   GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);   
>
> +   GUC_check_errmsg("invalid value '%s'", stage);
>
> +   GUC_check_errdetail("Valid values to use in the list are 'all', 
> 'received', 'authenticate
> +   " If 'all' is present, it must be the only value in the list."); 

I think the errmsg here should reuse the standard message format
invalid value for parameter "%s": "%s"
both for consistency and ease of translation.

Thanks!
--Jacob

[1] https://www.postgresql.org/message-id/2012342.1658356951%40sss.pgh.pa.us




Documentation of psql's \df no longer matches reality

2023-03-02 Thread Tom Lane
While preparing 3dfae91f7 I couldn't help noticing that what
psql-ref.sgml has to say about \df's "function type" column:

... and function types, which are classified as agg
(aggregate), normal, procedure, 
trigger, or window.

no longer corresponds very well to what the code actually does,
when dealing with a v11-or-later server:

  " CASE p.prokind\n"
  "  WHEN 'a' THEN '%s'\n"
  "  WHEN 'w' THEN '%s'\n"
  "  WHEN 'p' THEN '%s'\n"
  "  ELSE '%s'\n"
  " END as \"%s\"",
  ...
/* translator: "agg" is short for "aggregate" */
  gettext_noop("agg"),
  gettext_noop("window"),
  gettext_noop("proc"),
  gettext_noop("func"),
  gettext_noop("Type"));

I was going to just fix the docs to match the code, but removing
"trigger" from the list seems very confusing, because the docs
go on to say

To display only functions
of specific type(s), add the corresponding letters a,
n, p, t, or 
w to the command.

and indeed filtering triggers in or out still seems to work.
Moreover, if you are inspecting a pre-v11 server then you do
still get the old classification, which is bizarrely inconsistent.

It seems like we should either restore "trigger" as its own
type classification, or remove it from the list of properties
you can filter on, or adjust the docs to describe "t" as a
special filter condition.  I'm kind of inclined to the second
option, because treating trigger as a different prokind sure
seems like a wart.  But back in 2009 people thought that was
a good idea; what is our opinion now?

regards, tom lane




Re: Remove source code display from \df+?

2023-03-02 Thread Tom Lane
Isaac Morland  writes:
> [ 0001-Remove-source-code-display-from-df-v6.patch ]

Pushed after some editorialization on the test case.

One thing I noticed while testing is that if you apply \df+ to an
aggregate function, it will show "Internal name" of "aggregate_dummy".
While that's an accurate description of what's in prosrc, it seems
not especially useful and perhaps indeed confusing to novices.
So I thought about suppressing it.  However, that would require
a server-version-dependent test and I wasn't quite convinced it'd
be worth the trouble.  Any thoughts on that?

regards, tom lane




Re: buildfarm + meson

2023-03-02 Thread Andres Freund
Hi

On 2023-03-02 17:00:47 -0500, Andrew Dunstan wrote:
> 
> On 2023-03-01 We 16:32, Andres Freund wrote:
> > > This is now working
> > > on my MSVC test rig (WS2019, VS2019, Strawberry Perl), including TAP 
> > > tests.
> > > I do get a whole lot of annoying messages like this:
> > > 
> > > Unknown TAP version. The first line MUST be `TAP version `. Assuming
> > > version 12.
> > The newest minor version has fixed that, it was a misunderstanding about /
> > imprecision in the tap 14 specification.
> > 
> 
> Unfortunately, meson v 1.0.1 appears to be broken on Windows, I had to
> downgrade back to 1.0.0.

Is it possible that you're using a PG checkout from a few days ago? A
hack I used was invalidated by 1.0.1, but I fixed that already.

CI is running with 1.0.1:
https://cirrus-ci.com/task/5806561726038016?logs=configure#L8

Greetings,

Andres Freund




Re: File descriptors in exec'd subprocesses

2023-03-02 Thread Thomas Munro
On Thu, Mar 2, 2023 at 9:57 AM Thomas Munro  wrote:
> On Thu, Mar 2, 2023 at 9:49 AM Gregory Stark (as CFM)
>  wrote:
> > On Mon, 20 Feb 2023 at 23:04, Thomas Munro  wrote:
> > > Done like that in this version.  This is the version I'm thinking of
> > > committing, unless someone wants to argue for another level.
> >
> > FWIW the cfbot doesn't understand this patch series. I'm not sure why
> > but it's only trying to apply the first (the MacOS one) and it's
> > failing to apply even that.
>
> Ah, it's because I committed one patch in the series.  I'll commit one
> more, and then repost the rest, shortly.

I pushed the main patch, "Don't leak descriptors into subprograms.".
Here's a rebase of the POSIX-next stuff, but I'll sit on these for a
bit longer to see if the build farm agrees with my claim about the
ubiquity of O_CLOEXEC, and if anyone has comments on this stuff.
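
For anyone unfamiliar with accept4(), the difference is easy to show in
isolation; here is a standalone sketch (not PostgreSQL code) of the two
paths the patch chooses between:

/* Standalone illustration of accept4() vs. accept() + fcntl(). */
#define _GNU_SOURCE             /* for accept4() on glibc */
#include <fcntl.h>
#include <sys/socket.h>

static int
accept_connection(int listen_fd)
{
#if defined(SOCK_NONBLOCK) && defined(SOCK_CLOEXEC)
    /* One system call: the new socket is already non-blocking and close-on-exec. */
    return accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
#else
    /* Stragglers: accept, then set each flag with extra system calls. */
    int         fd = accept(listen_fd, NULL, NULL);

    if (fd >= 0)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        fcntl(fd, F_SETFD, FD_CLOEXEC);
    }
    return fd;
#endif
}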
From 1d03c0b2811b53a20eeff781918ef99d4c9b9df9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v5 1/3] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On all target Unixes except macOS and AIX, that can be done atomically
with flags to the newer accept4() variant (expected in the next POSIX
revision, but around in the wild for many years now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure  | 12 
 configure.ac   |  1 +
 meson.build|  1 +
 src/backend/libpq/pqcomm.c | 34 +++---
 src/include/pg_config.h.in |  4 
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index e35769ea73..d06f64cdbb 100755
--- a/configure
+++ b/configure
@@ -16219,6 +16219,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index af23c15cb2..070a0b33db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1843,6 +1843,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index 26be83afb6..c91fb05133 100644
--- a/meson.build
+++ b/meson.build
@@ -2097,6 +2097,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index da5bb5fc5d..8b97d1a7e3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,17 +202,14 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 (errmsg("could not set socket to nonblocking mode: %m")));
-#endif
-
-#ifndef WIN32
 
 	/* Don't give the socket to any subprograms we execute. */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = 

Re: buildfarm + meson

2023-03-02 Thread Andrew Dunstan


On 2023-03-01 We 16:32, Andres Freund wrote:

> > This is now working
> > on my MSVC test rig (WS2019, VS2019, Strawberry Perl), including TAP tests.
> > I do get a whole lot of annoying messages like this:
> >
> > Unknown TAP version. The first line MUST be `TAP version `. Assuming
> > version 12.
> The newest minor version has fixed that, it was a misunderstanding about /
> imprecision in the tap 14 specification.



Unfortunately, meson v 1.0.1 appears to be broken on Windows, I had to 
downgrade back to 1.0.0.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Operation log for major operations

2023-03-02 Thread Kirk Wolak
On Thu, Mar 2, 2023 at 4:09 PM Dmitry Koval  wrote:

> I'll try to expand my explanation.
> I fully understand and accept the arguments about "limited sense to go
> into the control file" and "about recording *anything* in the control
> file". This is totally correct for vanilla.
> But vendors have forks of PostgreSQL with custom features and extensions.
> Sometimes (especially at the first releases) these custom components
> have bugs which can causes rare problems in data.
> These problems can migrate with using pg_upgrade and "lazy" upgrade of
> pages to higher versions of PostgreSQL fork.
>
> So in error cases "recording crash information" etc. is not the only
> important information.
> Very important is history of this database (pg_upgrades, promotions,
> pg_resets, pg_rewinds etc.).
> Often this "history" allows you to determine which version of the
> PostgreSQL fork the error came from and which causes of errors we can
> discard immediately.
>
> This "history" is the information that our technical support wants (and
> reason of this patch), but this information is not needed for vanilla...
>
> Another important condition is that the user should not have an easy way
> to delete the "history" information (hence the idea of using the pg_control
> file as "history" storage, writing into it from positions 4kB, 8kB, ...).
>
> --
> With best regards,
> Dmitry Koval
>
> Postgres Professional: http://postgrespro.com
>
Dmitry, this is a great explanation.  Thinking outside the box, it feels
like:
We need some kind of semaphore flag that tells us something awkward
happened.
When it happened, and a little bit of extra information.

You also make the point that if such things have happened, it would
probably be a good idea to NOT
allow pg_upgrade to run.  It might even be a reason to constantly bother
someone until the issue is repaired.

To that point, this feels like a "postgresql_panic.log" file (within the
postgresql files?)...  Something that would prevent pg_upgrade,
etc.  That everyone recognizes is serious.  Especially 3rd party vendors.

I see the need for such a thing.  I have to agree with others about
questioning the proper place to write this.

Are there other places that make sense, that you could use, especially if
knowing it exists means there was a serious issue?

Kirk


Re: Operation log for major operations

2023-03-02 Thread Dmitry Koval

I'll try to expand my explanation.
I fully understand and accept the arguments about "limited sense to go 
into the control file" and "about recording *anything* in the control 
file". This is totally correct for vanilla.

But vendors have forks of PostgreSQL with custom features and extensions.
Sometimes (especially in the first releases) these custom components
have bugs which can cause rare problems in data.
These problems can migrate via pg_upgrade and "lazy" upgrade of
pages to higher versions of the PostgreSQL fork.


So in error cases "recording crash information" etc. is not the only 
important information.
Very important is history of this database (pg_upgrades, promotions, 
pg_resets, pg_rewinds etc.).
Often this "history" allows you to determine which version of the
PostgreSQL fork the error came from and which causes of errors we can
discard immediately.


This "history" is the information that our technical support wants (and 
reason of this patch), but this information is not needed for vanilla...


Another important condition is that the user should not have an easy way
to delete the "history" information (hence the idea of using the pg_control
file as "history" storage, writing into it from positions 4kB, 8kB, ...).


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-02 Thread Andres Freund
Hi,

On 2023-03-02 15:10:31 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-03-02 14:33:35 -0500, Tom Lane wrote:
> >> I looked through this, and there is one point that is making me really
> >> uncomfortable.  This bit is assuming that we can bind the address of
> >> the es_param_exec_vals array right into the compiled expression:
> 
> > Yea, I wasn't super comfortable with that either. I concluded it's ok
> > because we already cache pointers to the array inside each ExprContext.
> 
> ExprContext, sure, but compiled expressions?  Considering what it
> costs to JIT those, I think we ought to be trying to make them
> fairly long-lived.

I'm not opposed to EXPR_PARAM_SET, to be clear. I'll send an updated
version later. I was just thinking about the correctness in the current
world.

I think it's not just JIT that could benefit, fwiw. I think making
expressions longer lived could also help plpgsql tremendously, for
example.

Greetings,

Andres Freund




Re: proposal: psql: show current user in prompt

2023-03-02 Thread Kirk Wolak
On Sat, Feb 4, 2023 at 3:33 PM Pavel Stehule 
wrote:

> Hi
>
> pá 3. 2. 2023 v 21:43 odesílatel Pavel Stehule 
> napsal:
>
>>
>>
>> pá 3. 2. 2023 v 21:21 odesílatel Tom Lane  napsal:
>>
>>> Pavel Stehule  writes:
>>> > Both patches are very simple - and they use almost already prepared
>>> > infrastructure.
>>>
>>> It's not simple at all to make the psql feature depend on marking
>>> "role" as GUC_REPORT when it never has been before.  That will
>>> cause the feature to misbehave when using older servers.  I'm
>>> even less impressed by having it fall back on PQuser(), which
>>> would be misleading at exactly the times when it matters.
>>>
>>
>> It is a good note. This can be disabled for older servers, and maybe it
>> can  introduce its own GUC (and again - it can be disallowed for older
>> servers).
>>
>
> Here is another version. For older servers it shows the string ERR0A000.
> That is  ERR code of "feature is not supported"
>
>
>> My goal at this moment is to get some prototype. We can talk if this
>> feature request is valid or not, and we can talk about implementation.
>>
>> There is another possibility to directly execute "select current_user()"
>> instead of looking at status parameters inside prompt processing. It can
>> work too.
>>
>
> I tested using the query SELECT CURRENT_USER, but I don't think it is
> usable now, because it doesn't work in the broken transaction.
>
> Regards
>
> Pavel
>
>
>
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>
>>> regards, tom lane
>>>
>>
I've tested this w/regards to psql.  Latest commit.
It works as described.  'none' is displayed for the default role. (SET ROLE
DEFAULT), otherwise the specific ROLE is displayed.

I tried this patch on 15.2, but guc_files.c does not exist in that version,
so it did not install.
I also tried applying the %T patch, but since they touch the same file, it
would not install with it, without rebasing, repatching.

The Docs are updated, and it's a relatively contained patch.

Changed status to Ready for Committer. (100% Guessing here...)

Kirk


Re: Add support for unit "B" to pg_size_pretty()

2023-03-02 Thread Dean Rasheed
On Thu, 2 Mar 2023 at 19:58, David Rowley  wrote:
>
> I think I'd prefer to see the size_bytes_unit_alias struct have an
> index into size_pretty_units[] array. i.e:
>
> struct size_bytes_unit_alias
> {
> const char *alias; /* aliased unit name */
> const int unit_index; /* corresponding size_pretty_units element */
> };
>
> then the pg_size_bytes code can be simplified to:
>
> /* If not found, look in the table of aliases */
> if (unit->name == NULL)
> {
> for (const struct size_bytes_unit_alias *a = size_bytes_aliases;
> a->alias != NULL; a++)
> {
> if (pg_strcasecmp(strptr, a->alias) == 0)
> {
> unit = &size_pretty_units[a->unit_index];
> break;
> }
> }
> }
>
> which saves having to have the additional and slower nested loop code.
>

Hmm, I think it would be easier to just have a separate table for
pg_size_bytes(), rather than reusing pg_size_pretty()'s table. I.e.,
size_bytes_units[], which would only need name and multiplier columns
(not round and limit). Done that way, it would be easier to add other
units later (e.g., non-base-2 units).
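
A sketch of that shape of table (illustrative only; the names and exact
contents here are not from the patch):

struct size_bytes_unit
{
	const char *name;			/* unit name, e.g. "B", "kB" */
	int64		multiplier;		/* bytes per unit */
};

static const struct size_bytes_unit size_bytes_units[] = {
	{"B", INT64CONST(1)},
	{"kB", INT64CONST(1024)},
	{"MB", INT64CONST(1024) * 1024},
	{"GB", INT64CONST(1024) * 1024 * 1024},
	{"TB", INT64CONST(1024) * 1024 * 1024 * 1024},
	{"PB", INT64CONST(1024) * 1024 * 1024 * 1024 * 1024},
	{NULL, 0}					/* end marker */
};

pg_size_bytes() would then loop over just this table, and adding other
units later (decimal ones, say) would only mean adding rows.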

Also, it looks to me as though the doc change is for pg_size_pretty()
instead of pg_size_bytes().

Regards,
Dean




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-02 Thread Alexander Korotkov
On Thu, Mar 2, 2023 at 9:17 PM Pavel Borisov  wrote:
> On Thu, 2 Mar 2023 at 18:53, Alexander Korotkov  wrote:
> > On Thu, Mar 2, 2023 at 1:29 PM Pavel Borisov  wrote:
> > > > Let's see the performance results for the patchset.  I'll properly
> > > > revise the comments if results will be good.
> > > >
> > > > Pavel, could you please re-run your tests over revised patchset?
> > >
> > > Since last time I've improved the test to avoid significant series
> > > differences due to AWS storage access variation that is seen in [1].
> > > I.e. each series of tests is run on a tmpfs with newly inited pgbench
> > > tables and vacuum. Also, I've added a test for low-concurrency updates
> > > where the locking optimization isn't expected to improve performance,
> > > just to make sure the patches don't make things worse.
> > >
> > > The tests are as follows:
> > > 1. Heap updates with high tuple concurrency:
> > > Prepare without pkeys (pgbench -d postgres -i -I dtGv -s 10 
> > > --unlogged-tables)
> > > Update tellers 100 rows, 50 conns ( pgbench postgres -f
> > > ./update-only-tellers.sql -s 10 -P10 -M prepared -T 600 -j 5 -c 50 )
> > >
> > > Result: Average of 5 series with patches (0001+0002) is around 5%
> > > faster than both master and patch 0001. Still, there are some
> > > fluctuations between different series of the measurements of the same
> > > patch, but much less than in [1]
> >
> > Thank you for running this that fast!
> >
> > So, it appears that 0001 patch has no effect.  So, we probably should
> > consider to drop 0001 patch and consider just 0002 patch.
> >
> > The attached patch v12 contains v11 0002 patch extracted separately.
> > Please, add it to the performance comparison.  Thanks.
>
> I've done a benchmarking on a full series of four variants: master vs
> v11-0001 vs v11-0001+0002 vs v12 in the same configuration as in the
> previous measurement. The results are as follows:
>
> 1. Heap updates with high tuple concurrency:
> Average of 5 series v11-0001+0002 is around 7% faster than the master.
> I need to note that while v11-0001+0002 shows consistent performance
> improvement over the master, its value can not be determined more
> precisely than a couple of percents even with averaging. So I'd
> suppose we may not conclude from the results if a more subtle
> difference between v11-0001+0002 vs v12 (and master vs v11-0001)
> really exists.
>
> 2. Heap updates with low tuple concurrency:
> All patches and master are still the same within a tolerance of
> less than 0.7%.
>
> Overall patch v11-0001+0002 doesn't show performance degradation so I
> don't see why to apply only patch 0002 skipping 0001.

Thank you, Pavel.  So, it seems that we have substantial benefit only
with two patches.  So, I'll continue working on both of them.

--
Regards,
Alexander Korotkov




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-02 Thread Tom Lane
Andres Freund  writes:
> On 2023-03-02 14:33:35 -0500, Tom Lane wrote:
>> I looked through this, and there is one point that is making me really
>> uncomfortable.  This bit is assuming that we can bind the address of
>> the es_param_exec_vals array right into the compiled expression:

> Yea, I wasn't super comfortable with that either. I concluded it's ok
> because we already cache pointers to the array inside each ExprContext.

ExprContext, sure, but compiled expressions?  Considering what it
costs to JIT those, I think we ought to be trying to make them
fairly long-lived.

regards, tom lane




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-02 Thread Andres Freund
Hi,

On 2023-03-02 14:33:35 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Around
> > https://www.postgresql.org/message-id/20230224015417.75yimxbksejpffh3%40awork3.anarazel.de
> > I suggested that we should evaluate the arguments of correlated SubPlans as
> > part of the expression referencing the subplan.
> 
> > Here's a patch for that.
> 
> I looked through this, and there is one point that is making me really
> uncomfortable.  This bit is assuming that we can bind the address of
> the es_param_exec_vals array right into the compiled expression:
> 
> + ParamExecData *prm = >es_param_exec_vals[paramid];
> +
> + ExecInitExprRec(lfirst(pvar), state, >value, >isnull);
> 
> Even if that works today, it'd kill the ability to use the compiled
> expression across more than one executor instance, which seems like
> a pretty high price.  Also, I think it probably fails already in
> EvalPlanQual contexts, because EvalPlanQualStart allocates a separate
> es_param_exec_vals array for EPQ execution.

Yea, I wasn't super comfortable with that either. I concluded it's ok
because we already cache pointers to the array inside each ExprContext.


> I think we'd be better off inventing an EEOP_SET_PARAM_EXEC step type
> that is essentially the inverse of EEOP_PARAM_EXEC/ExecEvalParamExec,
> and then evaluating each parameter value into the expression's
> scratch Datum/isnull fields and emitting SET_PARAM_EXEC to copy those
> to the correct ParamExecData slot.

Agreed, that'd make sense. If we can build the infrastructure to figure
out what param to use, that'd also provide a nice basis for using params
for CaseTest etc.

Greetings,

Andres Freund




Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]

2023-03-02 Thread Jacob Champion
On Tue, Feb 28, 2023 at 6:53 PM Stephen Frost  wrote:
> * Jacob Champion (jchamp...@timescale.com) wrote:
> > Yes. Interoperable authentication is going to have to solve those
> > sorts of problems somehow, though. And there are a bunch of levers to
> > pull: clients aren't required to run their sent usernames through
> > SASLprep; our existing servers don't mind having it sent, so we could
> > potentially backport a client change; and then after that it's a game
> > of balancing compatibility and compliance. A deeper conversation for
> > sure.
>
> We did solve those problems with SCRAM, by not exactly following the
> unfortunately short-sighted standard.  What we have gets us the benefits
> of SCRAM and generally follows the standard

"Almost standard" in this case is just a different standard, minus
interoperability, plus new edge cases. I understand why it was done,
and I'm trying to provide some evidence in favor of changing it. But
again, that's a separate conversation; I don't have a patchset
proposal ready to go yet, and I'd rather talk about specifics.

> > But not all of the prerequisites you stated are met. I see no evidence
> > that Active Directory supports SCRAM [1], for instance, unless the MS
> > documentation is just omitting it.
>
> That's certainly unfortunate if that's the case.  The question then
> becomes- are a lot of people using SCRAM to connect to OpenLDAP, to the
> extent that it's a sensible thing for us to support and gives us a more
> secure option for 'ldap' auth than what exists today where we're passing
> around cleartext passwords?  That seems like it's at least plausible.

I don't know for sure -- I don't have an LDAP userbase to poll for
that, anymore -- but I somewhat doubt it. (And it's moot unless the
compatibility problem is fixed.)

> Broadly speaking, I feel confident saying that authentication systems,
> rather like encryption libraries, should be standardized, well
> documented, heavily reviewed, and carefully maintained.

I mean, I agree with that statement in isolation; those are all
generally good for user safety, which is the end goal. I disagree with
your premise that those goals are achieved by doing everything
in-house, or that letting other people do it is in opposition to those
goals.

Keep in mind, you just explained why it's good that you chose to
depart from the standard in ways that prevent interoperability with
other (reviewed, standardized, maintained) SCRAM implementations, and
that you're generally opposed to backporting compatibility fixes that
would allow migration to those implementations. That doesn't seem
internally consistent to me.

> If the argument
> is that there are these teams out there who are itching to write amazing
> new authentication methods for PG who are going to spend time
> documenting them, reviewing them, and maintaining them, then we should
> try to get those folks to work with the community to add support for
> these new methods so that everyone can benefit from them-

That's not my argument, and even if it were, I think this is
depressingly Postgres-centric. Why would a team of subject matter
experts want to constantly duplicate their work over every network
protocol, as opposed to banding together in a common place (e.g. IETF)
to design an extensible authentication system (e.g. SASL) and then
contributing their expertise to implementations of that system (e.g.
Cyrus, gsasl)?

If you had made those arguments in favor of your own in-house crypto
over OpenSSL -- "Why don't those cryptographers just come here to work
on PG-RSA, which is better than (and incompatible with) regular RSA
because of the changes we made to address their shortsighted designs?"
-- then wouldn't the response be obvious?

> not
> encouraging them to be independent projects which have to work through
> hooks that limit how those authentication methods are able to be
> integrated.

That's a false choice. I don't think we'd have to encourage them to be
fully independent projects, and I don't really want us to design our
own authentication hooks in the end, as I pointed out in my initial
mail.

> Consider things like pg_stat_ssl and pg_stat_gssapi. What
> about pg_ident maps?  Will each of these end up with their own version
> of that?

I should hope not. For one, I'd imagine a framework which required
that would be vetoed pretty quickly, and I'm not really here to argue
in favor of bad frameworks. If a particular approach causes terrible
maintenance characteristics, we would probably just not take that
approach.

For two, usermaps are an authorization concern that can usually be
separated from the question of "who am I talking to". There's no
reason for, say, a SCRAM implementation to care about them, because
it's not authorizing anything. (And the OAuth work will hopefully be a
decent bellwether of handling authn and authz concerns with the same
primitives.)

> And those are questions beyond just the issue around making
> sure that these external 

Re: Add support for unit "B" to pg_size_pretty()

2023-03-02 Thread David Rowley
On Mon, 27 Feb 2023 at 21:34, Peter Eisentraut
 wrote:
>
> On 22.02.23 03:39, David Rowley wrote:
> > I think you'll need to find another way to make the aliases work.
> > Maybe another array with the name and an int to reference the
> > corresponding index in size_pretty_units.
>
> Ok, here is a new patch with a separate table of aliases.  (Might look
> like overkill, but I think the "PiB" etc. example you had could actually
> be a good use case for this as well.)

I think I'd prefer to see the size_bytes_unit_alias struct have an
index into size_pretty_units[] array. i.e:

struct size_bytes_unit_alias
{
const char *alias; /* aliased unit name */
const int unit_index; /* corresponding size_pretty_units element */
};

then the pg_size_bytes code can be simplified to:

/* If not found, look in the table of aliases */
if (unit->name == NULL)
{
for (const struct size_bytes_unit_alias *a = size_bytes_aliases;
a->alias != NULL; a++)
{
if (pg_strcasecmp(strptr, a->alias) == 0)
{
unit = &size_pretty_units[a->unit_index];
break;
}
}
}

which saves having to have the additional and slower nested loop code.
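
For context, the alias table that loop expects could be as simple as the
following sketch (the unit_index value for "B" is a guess at which
size_pretty_units element "bytes" is; a NULL alias terminates the list):

static const struct size_bytes_unit_alias size_bytes_aliases[] =
{
	{"B", 0},			/* alias for "bytes", assuming it is element 0 */
	{NULL, 0}			/* terminator checked by a->alias != NULL */
};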

Apart from that, the patch looks fine.

David




Re: Minimal logical decoding on standbys

2023-03-02 Thread Jeff Davis
On Thu, 2023-03-02 at 10:20 +0100, Drouvot, Bertrand wrote:
> Right, but in our case, right after the wakeup (the one due to the CV
> broadcast,
> aka the one that will remove it from the wait queue) we'll exit the
> loop due to:
> 
> "
>  /* check whether we're done */
>  if (loc <= RecentFlushPtr)
>  break;
> "
> 
> as the CV broadcast means that a flush/replay occurred.

But does it mean that the flush/replay advanced *enough* to be greater
than or equal to loc?

> - If it is awakened due to the CV broadcast, then we'll right after
> exit the loop (see above)

...

> I think that's not needed as we'd exit the loop right after we are
> awakened by a CV broadcast.

See the comment here:

 * If this process has been taken out of the wait list, then we know  
 * that it has been signaled by ConditionVariableSignal (or   
 * ConditionVariableBroadcast), so we should return to the caller. But
 * that doesn't guarantee that the exit condition is met, only that we
 * ought to check it.

You seem to be arguing that in this case, it doesn't matter; that
walreceiver knows what walsender is waiting for, and will never wake it
up before it's ready. I don't think that's true, and even if it is, it
needs explanation.
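
To make that concrete, the canonical shape of the wait loop would be roughly
the following (a sketch only; the CV name and wait event are assumptions, not
what the patch actually uses):

ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv);	/* assumed CV */
while (loc > RecentFlushPtr)
{
	/* A wakeup only means "recheck"; re-test the condition every time. */
	RecentFlushPtr = GetFlushRecPtr(NULL);	/* or GetXLogReplayRecPtr(NULL) on a standby */
	if (loc <= RecentFlushPtr)
		break;
	ConditionVariableSleep(&WalSndCtl->wal_flush_cv,
						   WAIT_EVENT_WAL_SENDER_WAIT_WAL);
}
ConditionVariableCancelSleep();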

> 
> I agree that's a good idea and that it should/would work too. I just
> wanted to highlight that in this particular
> case that might not be necessary to build this new API.

In this case it looks easier to add the right API than to be sure about
whether it's needed or not.

Regards,
Jeff Davis





Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-03-02 Thread reid . thompson
On Mon, 2023-02-13 at 16:26 -0800, Andres Freund wrote:
> Hi,
> 
> The tests recently started to fail:
> 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3867
> 
> I marked this as waiting on author.
> 
> Greetings,
> 
> Andres Freund

Patch has been rebased to master.

The memory limiting portion (patch 0002-*) has been refactored to utilize a
shared counter for total memory allocation along with backend-local
allowances that are initialized at process startup and refilled from the
central counter upon being used up. Freed memory is accumulated and
returned to the shared counter upon meeting a threshold and/or upon process
exit. At this point I have arbitrarily picked 1MB as the initial allowance
and return threshold.
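
A rough sketch of that allowance scheme, with the shared-state and function
names assumed here rather than taken from the patch (only the
max_total_backend_memory GUC name comes from the patch itself):

#define ALLOWANCE_CHUNK (1024 * 1024)	/* 1MB initial allowance / refill size */

static uint64 local_allowance = 0;		/* bytes this backend may still use */

/* Reserve nbytes for this backend, refilling the local allowance from the
 * shared counter in 1MB chunks and enforcing the global limit. */
static bool
reserve_backend_memory(uint64 nbytes)
{
	uint64		grab;
	uint64		new_total;

	if (local_allowance >= nbytes)
	{
		local_allowance -= nbytes;
		return true;
	}

	grab = Max(nbytes, ALLOWANCE_CHUNK);
	new_total = pg_atomic_add_fetch_u64(&shared_mem_state->total_backend_bytes, grab);
	if (max_total_backend_memory > 0 &&
		new_total > (uint64) max_total_backend_memory * 1024 * 1024)
	{
		/* Would exceed the limit: back the reservation out and fail. */
		pg_atomic_sub_fetch_u64(&shared_mem_state->total_backend_bytes, grab);
		return false;			/* caller reports "out of memory" */
	}

	local_allowance += grab - nbytes;
	return true;
}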

Thanks,
Reid






From e044bacedab503d1cd732146e1b9947406191bb6 Mon Sep 17 00:00:00 2001
From: Reid Thompson 
Date: Sat, 4 Jun 2022 22:23:59 -0400
Subject: [PATCH 2/2] Add the ability to limit the amount of memory that can be
 allocated to backends.

This builds on the work that adds backend memory allocated to pg_stat_activity.

Add GUC variable max_total_backend_memory.

Specifies a limit to the amount of memory (in MB) that may be allocated to
backends in total (i.e. this is not a per-user or per-backend limit). If unset
or set to 0, it is disabled. It is intended as a resource to help avoid the OOM
killer on Linux and to manage resources in general. A backend request that would
push the total over the limit will be denied with an out-of-memory error, causing
that backend's current query/transaction to fail. Due to the dynamic nature of
memory allocations, this limit is not exact. If the total is within 1.5MB of the
limit and two backends request 1MB each at the same time, both may be allocated
and exceed the limit. Further requests will not be allocated until the total
drops below the limit. Keep this in mind when setting this value. This limit does
not affect auxiliary backend processes. Backend memory allocations are displayed
in the pg_stat_activity view.
---
 doc/src/sgml/config.sgml  |  26 ++
 src/backend/postmaster/autovacuum.c   |   8 +-
 src/backend/postmaster/postmaster.c   |  17 +-
 src/backend/postmaster/syslogger.c|   4 +-
 src/backend/storage/ipc/dsm_impl.c|  35 ++-
 src/backend/storage/lmgr/proc.c   |   3 +
 src/backend/utils/activity/backend_status.c   | 223 +-
 src/backend/utils/misc/guc_tables.c   |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   3 +
 src/backend/utils/mmgr/aset.c |  43 +++-
 src/backend/utils/mmgr/generation.c   |  21 +-
 src/backend/utils/mmgr/slab.c |  21 +-
 src/include/storage/proc.h|   7 +
 src/include/utils/backend_status.h| 222 +++--
 14 files changed, 560 insertions(+), 84 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..1bff68b1ec 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2113,6 +2113,32 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_total_backend_memory (integer)
+  
+   max_total_backend_memory configuration parameter
+  
+  
+  
+   
+Specifies a limit to the amount of memory (MB) that may be allocated to
+backends in total (i.e. this is not a per user or per backend limit).
+If unset, or set to 0 it is disabled.  A backend request that would
+push the total over the limit will be denied with an out of memory
+error causing that backend's current query/transaction to fail. Due to
+the dynamic nature of memory allocations, this limit is not exact. If
+within 1.5MB of the limit and two backends request 1MB each at the same
+time both may be allocated, and exceed the limit. Further requests will
+not be allocated until dropping below the limit. Keep this in mind when
+setting this value. This limit does not affect auxiliary backend
+processes  . Backend memory
+allocations (allocated_bytes) are displayed in the
+pg_stat_activity
+view.
+   
+  
+ 
+
  
  
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 59c9bcf8c4..ee03d08dd9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -407,8 +407,8 @@ StartAutoVacLauncher(void)
 
 #ifndef EXEC_BACKEND
 		case 0:
-			/* Zero allocated bytes to avoid double counting parent allocation */
-			pgstat_zero_my_allocated_bytes();
+			/* Init allocated bytes to avoid double counting parent allocation */
+			pgstat_init_allocated_bytes();
 
 			/* in postmaster child ... */
 			InitPostmasterChild();
@@ -1488,8 +1488,8 @@ StartAutoVacWorker(void)
 
 #ifndef EXEC_BACKEND
 		case 0:
-			/* Zero allocated bytes to avoid double counting parent allocation */
-			pgstat_zero_my_allocated_bytes();
+			

Re: libpq: PQgetCopyData() and allocation overhead

2023-03-02 Thread Jeroen Vermeulen
My apologies.  The wiki said to discuss early, even before writing the code
if possible, but I added a link to the PR for those who really wanted to
see the details.

I'm attaching a diff now.  This is not a patch, it's just a discussion
piece.

The problem was that PQgetCopyData loops use a lot of CPU time, and this
alternative reduces that by a lot.
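
For anyone skimming the diff below, the shape of the proposed entry point is
roughly this (the typedef name is made up here; the parameter meanings are
inferred from how the diff calls it):

/* Callback invoked once per COPY data row; buf is not NUL-terminated. */
typedef int (*PQcopyRowHandler) (void *userdata, char *buf, size_t len);

/* Like PQgetCopyData(), but hands each row to the handler instead of
 * allocating and returning a fresh buffer for every row. */
extern int	PQhandleCopyData(PGconn *conn, PQcopyRowHandler handler,
							 void *userdata, int async);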


Jeroen

On Thu, 2 Mar 2023 at 13:38, Daniel Gustafsson  wrote:

> > On 1 Mar 2023, at 15:23, Jeroen Vermeulen  wrote:
>
> > PR for easy discussion: https://github.com/jtv/postgres/pull/1
>
> The process for discussing work on pgsql-hackers is to attach the patch to
> the
> email and discuss it inline in the thread.  That way all versions of the
> patch
> as well as the discussion is archived and searchable.
>
> --
> Daniel Gustafsson
>
>
diff --git a/bench.c b/bench.c
new file mode 100644
index 00..c3206d2927
--- /dev/null
+++ b/bench.c
@@ -0,0 +1,134 @@
+/*
+ * Minimal benchmark for PQgetCopyData alternative.
+ *
+ * Define CALL to 0 (to use the classic PQgetCopyData) or 1 (to use the
+ * proposed new function), then run the binary through "time" to get time and
+ * CPU usage stats.
+ *
+ * DO NOT UPSTREAM THIS FILE.  It's just a demonstration for the prototype
+ * patch.
+ */
+#include 
+#include 
+
+#include 
+
+/* Define CALL to...
+ * 0: Use classic PQgetCopyData()
+ * 1: Use experimental PQhandleCopyData()
+ */
+
+/* Benchmark results (best result per category, out of 4 runs):
+ *
+ * PQgetCopyData:
+ * real - 0m32.972s
+ * user - 0m11.364s
+ * sys - 0m1.255s
+ *
+ * PQhandleCopyData:
+ * real - 0m32.839s
+ * user - 0m3.407s
+ * sys - 0m0.872s
+ */
+
+#if CALL == 1
+/*
+ * Print line, add newline.
+ */
+static int
+print_row_and_newline(void *, char *buf, size_t len)
+{
+	/* Zero-terminate the buffer. */
+	buf[len - 1] = '\0';
+	printf("%s\n", buf);
+	return 0;
+}
+#endif
+
+
+int
+main()
+{
+#if !defined(CALL)
+#error "Set CALL: 0 = PQgetCopyData, 1 = PQhandleCopyData."
+#elif CALL == 0
+	fprintf(stderr, "Testing classic PQgetCopyData().\n");
+#elif CALL == 1
+	fprintf(stderr, "Testing experimental PQhandleCopyData.\n");
+#else
+#error "Unknown CALL value."
+#endif
+
+	PGconn	   *cx = PQconnectdb("");
+
+	if (!cx)
+	{
+		fprintf(stderr, "Could not connect.\n");
+		exit(1);
+	}
+	PGresult   *tx = PQexec(cx, "BEGIN");
+
+	if (!tx)
+	{
+		fprintf(stderr, "No result from BEGIN!\n");
+		exit(1);
+	}
+	int			s = PQresultStatus(tx);
+
+	if (s != PGRES_COMMAND_OK)
+	{
+		fprintf(stderr, "Failed to start transaction: status %d.\n", s);
+		exit(1);
+	}
+
+	PGresult   *r = PQexec(
+		   cx,
+		   "COPY ("
+		   "SELECT generate_series, 'row #' || generate_series "
+		   "FROM generate_series(1, 1)"
+		   ") TO STDOUT"
+	);
+
+	if (!r)
+	{
+		fprintf(stderr, "No result!\n");
+		exit(1);
+	}
+	int			status = PQresultStatus(r);
+
+	if (status != PGRES_COPY_OUT)
+	{
+		fprintf(stderr, "Failed to start COPY: status %d.\n", status);
+		exit(1);
+	}
+
+	int			bytes;
+#if CALL == 0
+	char	   *buffer = NULL;
+
+	for (
		 bytes = PQgetCopyData(cx, &buffer, 0);
		 bytes > 0;
		 bytes = PQgetCopyData(cx, &buffer, 0)
+		)
+	{
+		if (buffer)
+		{
+			printf("%s", buffer);
+			PQfreemem(buffer);
+		}
+	}
+#elif CALL == 1
+	while ((bytes = PQhandleCopyData(cx, print_row_and_newline, NULL, 0)) > 0);
+#else
+#error "Unknown CALL value."
+#endif
+
+	if (bytes != -1)
+	{
+		fprintf(stderr, "Got unexpected result: %d.\n", bytes);
+		exit(1);
+	}
+
+	/* (Don't bother cleaning up.) */
+}
diff --git a/src/interfaces/ecpg/ecpglib/execute.c b/src/interfaces/ecpg/ecpglib/execute.c
index 641851983d..1e4eba58a2 100644
--- a/src/interfaces/ecpg/ecpglib/execute.c
+++ b/src/interfaces/ecpg/ecpglib/execute.c
@@ -32,6 +32,17 @@
 #include "sqlda-compat.h"
 #include "sqlda-native.h"
 
+/*
+ * Print non-zero-terminated line received from COPY.
+ */
+static int
+print_row(void *, char *buf, size_t len)
+{
+	buf[len - 1] = '\0';
+	printf("%s\n", buf);
+	return 0;
+}
+
 /*
  *	This function returns a newly malloced string that has ' and \
  *	escaped.
@@ -1876,16 +1887,10 @@ ecpg_process_output(struct statement *stmt, bool clear_result)
 			break;
 		case PGRES_COPY_OUT:
 			{
-char	   *buffer;
 int			res;
 
 ecpg_log("ecpg_process_output on line %d: COPY OUT data transfer in progress\n", stmt->lineno);
-while ((res = PQgetCopyData(stmt->connection->connection,
-			&buffer, 0)) > 0)
-{
-	printf("%s", buffer);
-	PQfreemem(buffer);
-}
+while ((res = PQhandleCopyData(stmt->connection->connection, print_row, NULL, 0)) > 0);
 if (res == -1)
 {
 	/* COPY done */
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index e8bcc88370..add1ff1591 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -186,3 +186,4 @@ PQpipelineStatus  183
 PQsetTraceFlags   184
 PQmblenBounded185
 PQsendFlushRequest186
+PQhandleCopyData 187
diff --git 

Re: Operation log for major operations

2023-03-02 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Mar 02, 2023 at 08:57:43PM +0300, Dmitry Koval wrote:
>> These changes did not interest the community. It was expected (topic is very
>> specific: vendor's technical support). So no sense to distract developers

> Actually, I think there is interest, but it has to be phrased in a
> limited sense to go into the control file.

> In November, I referenced 2 threads, but I think you misunderstood one
> of them.  If you skim the first couple mails, you'll find a discussion
> about recording crash information in the control file.  

> https://www.postgresql.org/message-id/666c2599a07addea00ae2d0af96192def8441974.camel%40j-davis.com

> It's come up several times now, and there seems to be ample support for
> adding some limited information.

> But a "log" which might exceed a few dozen bytes (now or later), that's
> inconsistent with the pre-existing purpose served by pg_control.

I'm pretty dubious about recording *anything* in the control file.
Every time we write to that, we risk the entire database on completing
the write successfully.  I don't want to do that more often than once
per checkpoint.  If you want to put crash info in some less-critical
place, maybe we could talk.

regards, tom lane




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-02 Thread Tom Lane
Andres Freund  writes:
> Around
> https://www.postgresql.org/message-id/20230224015417.75yimxbksejpffh3%40awork3.anarazel.de
> I suggested that we should evaluate the arguments of correlated SubPlans as
> part of the expression referencing the subplan.

> Here's a patch for that.

I looked through this, and there is one point that is making me really
uncomfortable.  This bit is assuming that we can bind the address of
the es_param_exec_vals array right into the compiled expression:

+   ParamExecData *prm = &estate->es_param_exec_vals[paramid];
+
+   ExecInitExprRec(lfirst(pvar), state, &prm->value, &prm->isnull);

Even if that works today, it'd kill the ability to use the compiled
expression across more than one executor instance, which seems like
a pretty high price.  Also, I think it probably fails already in
EvalPlanQual contexts, because EvalPlanQualStart allocates a separate
es_param_exec_vals array for EPQ execution.

I think we'd be better off inventing an EEOP_SET_PARAM_EXEC step type
that is essentially the inverse of EEOP_PARAM_EXEC/ExecEvalParamExec,
and then evaluating each parameter value into the expression's
scratch Datum/isnull fields and emitting SET_PARAM_EXEC to copy those
to the correct ParamExecData slot.
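
For illustration, the evaluation side of such a step could be as small as the
sketch below; everything except the EEOP_SET_PARAM_EXEC idea itself (the field
names, the helper name, and looking the array up through the ExprContext at
runtime) is an assumption, not part of the proposal:

static void
ExecEvalSetParamExec(ExprEvalStep *op, ExprContext *econtext)
{
	/* Copy the just-evaluated argument from the step's scratch Datum/isnull
	 * into the target ParamExecData slot, found via the ExprContext rather
	 * than through a pointer baked into the compiled expression. */
	ParamExecData *prm = &econtext->ecxt_param_exec_vals[op->d.param.paramid];

	prm->execPlan = NULL;
	prm->value = *op->resvalue;
	prm->isnull = *op->resnull;
}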

regards, tom lane




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra



On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
> Hi!
> 
> On Thu, 2 Mar 2023 13:44:52 +0100
> Tomas Vondra  wrote:
>> Well, yeah and no.
>>
>> In principle we could/should have allocated the BufFiles in a different
>> context (possibly hashCxt). But in practice it probably won't make any
>> difference, because the query will probably run all the hashjoins at the
>> same time. Imagine a sequence of joins - we build all the hashes, and
>> then tuples from the outer side bubble up through the plans. And then
>> you process the last tuple and release all the hashes.
>>
>> This would not fix the issue. It'd be helpful for accounting purposes
>> (we'd know it's the buffiles and perhaps for which hashjoin node). But
>> we'd still have to allocate the memory etc. (so still OOM).
> 
> Well, accounting things in the correct context would already be worth a patch,
> I suppose. At least, it helps to investigate faster. Plus, you already wrote a
> patch about that[1]:
> 
> https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
> 
> Note that I did reference the "Out of Memory errors are frustrating as heck!"
> thread in my first email, pointing to a Tom Lane email explaining that
> ExecutorState was not supposed to be so large[2].
> 
> [2] 
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 

Ah, right, I didn't realize it's the same thread. There are far too many
threads about this sort of thing, and I probably submitted half-baked
patches to most of them :-/

>> There's only one thing I think could help - increase the work_mem enough
>> not to trigger the explosive growth in number of batches. Imagine
>> there's one very common value, accounting for ~65MB of tuples. With
>> work_mem=64MB this leads to exactly the explosive growth you're
>> observing here. With 128MB it'd probably run just fine.
>>
>> The problem is we don't know how large the work_mem would need to be :-(
>> So you'll have to try and experiment a bit.
>>
>> I remembered there was a thread [1] about *exactly* this issue in 2019.
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>>
>> I even posted a couple patches that try to address this by accounting
>> for the BufFile memory, and increasing work_mem a bit instead of just
>> blindly increasing the number of batches (ignoring the fact that means
>> more memory will be used for the BufFile stuff).
>>
>> I don't recall why it went nowhere, TBH. But I recall there were
>> discussions about maybe doing something like "block nestloop" at the
>> time, or something. Or maybe the thread just went cold.
> 
> So I read the full thread now. I'm still not sure why we try to avoid hash
> collisions so hard, and why a similar data subset barely larger than work_mem
> makes the number of batches explode, but I think I have a better understanding
> of the discussion and the proposed solutions.
> 

I don't think this is about hash collisions (i.e. the same hash value
being computed for different values). You can construct cases like this,
of course, particularly if you only look at a subset of the bits (for 1M
batches we only look at the first 20 bits), but I'd say it's fairly
unlikely to happen unless you do that intentionally.

(I'm assuming regular data types with reasonable hash functions. If the
query joins on custom data types with some silly hash function, it may
be more likely to have conflicts.)

IMHO a much more likely explanation is there actually is a very common
value in the data. For example there might be a value repeated 1M times,
and that'd be enough to break this.
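
The reason one common value is enough: the batch is picked from bits of the
hash value, so all duplicates of a single value land in the same batch, and
doubling nbatch can never spread them out. Simplified from
ExecHashGetBucketAndBatch, the selection is roughly:

/* nbuckets and nbatch are powers of two */
*bucketno = hashvalue & (nbuckets - 1);
*batchno = pg_rotate_right32(hashvalue,
							 hashtable->log2_nbuckets) & (nbatch - 1);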

We do build special "skew" buckets for values from an MCV, but maybe
the stats are not updated yet, or maybe there are too many such values
to fit into the MCV list?

I now realize there's probably another way to get into this - oversized
rows. Could there be a huge row (e.g. with a large text/bytea value)?
Imagine a row that's 65MB - that'd be game over with work_mem=64MB. Or
there might be smaller rows, but a couple hash collisions would suffice.

> There were some thoughts about how to make better use of the memory. As
> memory is exploding way beyond work_mem, at least avoid wasting it on too
> many BufFile buffers. So you expand either work_mem or the number of
> batches, depending on which move is smarter. This is explained and tested here:
> 
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> 
> And then, another patch to overflow each batch to a dedicated temp file and
> stay inside work_mem (v4-per-slice-overflow-file.patch):
> 
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> 
> Then, nothing more on the discussion about this last patch. So I guess it just
> went cold.
> 

I think a contributing factor was that the 

Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:15:30 +0100
Jehan-Guillaume de Rorthais  wrote:
[...]
> For what it's worth, these two patches seem really interesting to me. Do you
> need any help to revive them?

To avoid confusion, the two patches I meant were:

* 0001-move-BufFile-stuff-into-separate-context.patch   
* v4-per-slice-overflow-file.patch

Regards,




Re: Operation log for major operations

2023-03-02 Thread Justin Pryzby
On Thu, Mar 02, 2023 at 08:57:43PM +0300, Dmitry Koval wrote:
> These changes did not interest the community. It was expected (topic is very
> specific: vendor's technical support). So no sense to distract developers

Actually, I think there is interest, but it has to be phrased in a
limited sense to go into the control file.

In November, I referenced 2 threads, but I think you misunderstood one
of them.  If you skim the first couple mails, you'll find a discussion
about recording crash information in the control file.  

https://www.postgresql.org/message-id/666c2599a07addea00ae2d0af96192def8441974.camel%40j-davis.com

It's come up several times now, and there seems to be ample support for
adding some limited information.

But a "log" which might exceed a few dozen bytes (now or later), that's
inconsistent with the pre-existing purpose served by pg_control.

-- 
Justin




Re: Add SHELL_EXIT_CODE to psql

2023-03-02 Thread Tom Lane
Corey Huinker  writes:
> [ v9-0003-Add-psql-variables-SHELL_ERROR-and-SHELL_EXIT_COD.patch ]

I took a brief look through this.  I'm on board with the general
concept, but I think you've spent too much time on an ultimately
futile attempt to get a committable test case, and not enough time
on making the behavior correct/usable.  In particular, it bothers
me that there doesn't appear to be any way to distinguish whether
a command failed on nonzero exit code versus a signal.  Also,
I see that you're willing to report whatever ferror returns
(something quite unspecified in relevant standards, beyond
zero/nonzero) as an equally-non-distinguishable integer.

I'm tempted to suggest that we ought to be returning a string,
along the lines of "OK" or "exit code N" or "signal N" or
"I/O error".  This though brings up the question of exactly
what you expect scripts to use the variable for.  Maybe such
a definition would be unhelpful, but if so why?  Maybe we
should content ourselves with adding the SHELL_ERROR boolean, and
leave the details to whatever we print in the error message?
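
For the string form, something along the lines of this sketch would do (the
function name is invented here and the psql plumbing is omitted):

#include <stdio.h>
#include <sys/wait.h>

/* Turn a wait(2)-style status (plus an ferror indication) into the kind of
 * human-readable string suggested above. */
static const char *
describe_shell_result(int wait_status, int ferror_result)
{
	static char buf[32];

	if (ferror_result != 0)
		return "I/O error";
	if (wait_status == 0)
		return "OK";
	if (WIFEXITED(wait_status))
		snprintf(buf, sizeof(buf), "exit code %d", WEXITSTATUS(wait_status));
	else if (WIFSIGNALED(wait_status))
		snprintf(buf, sizeof(buf), "signal %d", WTERMSIG(wait_status));
	else
		snprintf(buf, sizeof(buf), "status %d", wait_status);
	return buf;
}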

As for the test case, I'm unimpressed with it because it's so
weak as to be nearly useless; plus it seems like a potential
security issue (what if "nosuchcommand" exists?).  It's fine
to have it during development, I guess, but I'm inclined to
leave it out of the eventual commit.

regards, tom lane




Re: Sort optimizations: Making in-memory sort cache-aware

2023-03-02 Thread Ankit Kumar Pandey

Hi Andres,


I took a stab at a naive version of this but have been stuck for some time
now. I have added logic to sort on the first column in a first pass, then
realloc all tuples and do a full sort in a second pass, but I am not seeing
any benefit (it is actually regressing) at all. I tried doing the above both
in bulk and in chunks of data.

> You effectively end up with a bounded number of pre-sorted blocks, so the
> most obvious thing to try is to build a heap of those blocks and
> effectively do a heapsort between the presorted blocks.

I am not very clear about the implementation for this. How can we do a
heapsort between the presorted blocks? Do we keep changing state->bound = i,
i+n, i+2n or something like this and keep calling
make_bounded_heap/sort_bounded_heap?

> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead.

This is still pending from my side.

I have attached some benchmarking results, along with the script and POC
patches (which include a GUC switch to enable the optimization for ease of
testing).

Tested on WORK_MEM=3 GB for 1 and 10 million rows of data.

Please let me know things which I can fix and re-attempt.


Thanks,

Ankit


#!/bin/bash

psql -c "create table if not exists abcd (a int, b int, c int, d int );" postgres

for rows in 100 1000
do
	for selectrows in "select random()*x,random()*x,random()*x,random()*x from generate_Series(1,$rows) x;" "select 1,random()*x,random()*x,random()*x from generate_Series(1,$rows) x;" "select 1,1,random()*x,random()*x from generate_Series(1,$rows) x;"
	do
		psql -c "truncate table abcd;" postgres > /dev/null
		psql -c "insert into abcd $selectrows" postgres > /dev/null
		psql -c "vacuum freeze abcd;" postgres > /dev/null
		psql -c "select pg_prewarm('abcd');" postgres > /dev/null
		echo "select a,b,c,d from abcd order by a,b,c,d" > bench.sql

		psql -c "alter system set enable_sort_optimization = off;" postgres > /dev/null
		psql -c "select pg_reload_conf();" postgres > /dev/null
		echo -n "$rows Master: "
		pgbench -n -f bench.sql -T 10 -M prepared postgres | grep latency

		psql -c "alter system set enable_sort_optimization = on;" postgres > /dev/null
		psql -c "select pg_reload_conf();" postgres > /dev/null
		echo -n "$rows Patched: "
		pgbench -n -f bench.sql -T 10 -M prepared postgres | grep latency
	done
done
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..6585164ca4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -846,6 +846,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_sort_optimization", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the sort optimizations."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		_sort_optimization,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_incremental_sort", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of incremental sort steps."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..1b49f48457 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -389,6 +389,7 @@
 #enable_presorted_aggregate = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_sort_optimization = on
 #enable_tidscan = on
 
 # - Planner Cost Constants -
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9ca9835aab..cbaf54437c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,8 @@
 bool		trace_sort = false;
 #endif
 
+bool		enable_sort_optimization = true;
+
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
 #endif
@@ -404,6 +406,7 @@ struct Sharedsort
 #define SERIAL(state)		((state)->shared == NULL)
 #define WORKER(state)		((state)->shared && (state)->worker != -1)
 #define LEADER(state)		((state)->shared && (state)->worker == -1)
+#define CACHESIZE 2*1024*1024L
 
 /*
  * NOTES about on-tape representation of tuples:
@@ -2725,6 +2728,8 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 		{
 			if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
 			{
+SortSupport oldonlyKey = state->base.onlyKey;
+state->base.onlyKey = state->base.sortKeys;
 qsort_tuple_unsigned(state->memtuples,
 	 state->memtupcount,
 	 state);
@@ -2740,11 +2745,63 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 			}
 #endif
 			else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
-			{
-

Re: Add LZ4 compression in pg_dump

2023-03-02 Thread Tomas Vondra



On 3/2/23 18:18, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 05:20:05PM +0100, Tomas Vondra wrote:
>> On 2/25/23 15:05, Justin Pryzby wrote:
>>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
 I have some fixes (attached) and questions while polishing the patch for
 zstd compression.  The fixes are small and could be integrated with the
 patch for zstd, but could be applied independently.
>>>
>>> One more - WriteDataToArchiveGzip() says:
>>>
>>> +   if (cs->compression_spec.level == 0)
>>> +   pg_fatal("requested to compress the archive yet no level was 
>>> specified");
>>>
>>> That was added at e9960732a.  
>>>
>>> But if you specify gzip:0, the compression level is already enforced by
>>> validate_compress_specification(), before hitting gzip.c:
>>>
>>> | pg_dump: error: invalid compression specification: compression algorithm 
>>> "gzip" expects a compression level between 1 and 9 (default at -1)
>>>
>>> 5e73a6048 intended that to work as before, and you *can* specify -Z0:
>>>
>>> The change is backward-compatible, hence specifying only an integer
>>> leads to no compression for a level of 0 and gzip compression when the
>>> level is greater than 0.
>>>
>>> $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp 
>>> --compress 0 |file -
>>> /dev/stdin: ASCII text
>>
>> FWIW I agree we should make this backwards-compatible - accept "0" and
>> treat it as no compression.
>>
>> Georgios, can you prepare a patch doing that?
> 
> I think maybe Tomas misunderstood.  What I was trying to say is that -Z
> 0 *is* accepted to mean no compression.  This part wasn't quoted, but I
> said:
> 

Ah, I see. Well, I also tried but with "-Z gzip:0" (and not -Z 0), and
that does fail:

  error: invalid compression specification: compression algorithm "gzip"
  expects a compression level between 1 and 9 (default at -1)

It's a bit weird that these two cases behave differently, when both translate
to the same default compression method (gzip).

>> Right now, I think that pg_fatal in gzip.c is dead code - that was first
>> added in the patch version sent on 21 Dec 2022.
> 
> If you run the diff command that I've been talking about, you'll see
> that InitCompressorZlib was almost unchanged - e9960732 is essentially a
> refactoring.  I don't think it's desirable to add a pg_fatal() in a
> function that's otherwise nearly-unchanged.  The fact that it's
> nearly-unchanged is a good thing: it simplifies reading of what changed.
> If someone wants to add a pg_fatal() in that code path, it'd be better
> done in its own commit, with a separate message explaining the change.
> 
> If you insist on changing anything here, you might add an assertion (as
> you said earlier) along with a comment like
> /* -Z 0 uses the "None" compressor rather than zlib with no compression */
> 

Yeah, a comment would be helpful.

Also, after thinking about it a bit more, maybe having the unreachable
pg_fatal() is not a good thing, as it will just confuse people (I'd
certainly assume that having such a check means there's a way it might
trigger). Maybe an assert would be better?
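
Something like the following, say, matching the field used by the pg_fatal()
quoted above:

	/* -Z 0 uses the "None" compressor rather than zlib with no compression */
	Assert(cs->compression_spec.level != 0);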


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
Hi!

On Thu, 2 Mar 2023 13:44:52 +0100
Tomas Vondra  wrote:
> Well, yeah and no.
> 
> In principle we could/should have allocated the BufFiles in a different
> context (possibly hashCxt). But in practice it probably won't make any
> difference, because the query will probably run all the hashjoins at the
> same time. Imagine a sequence of joins - we build all the hashes, and
> then tuples from the outer side bubble up through the plans. And then
> you process the last tuple and release all the hashes.
> 
> This would not fix the issue. It'd be helpful for accounting purposes
> (we'd know it's the buffiles and perhaps for which hashjoin node). But
> we'd still have to allocate the memory etc. (so still OOM).

Well, accounting things in the correct context would already be worth a patch,
I suppose. At least, it helps to investigate faster. Plus, you already wrote a
patch about that[1]:

https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development

Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing to a Tom Lane email explaining that
ExecutorState was not supposed to be so large[2].

[2] 
https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b

> There's only one thing I think could help - increase the work_mem enough
> not to trigger the explosive growth in number of batches. Imagine
> there's one very common value, accounting for ~65MB of tuples. With
> work_mem=64MB this leads to exactly the explosive growth you're
> observing here. With 128MB it'd probably run just fine.
> 
> The problem is we don't know how large the work_mem would need to be :-(
> So you'll have to try and experiment a bit.
> 
> I remembered there was a thread [1] about *exactly* this issue in 2019.
> 
> [1]
> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>
> I even posted a couple patches that try to address this by accounting
> for the BufFile memory, and increasing work_mem a bit instead of just
> blindly increasing the number of batches (ignoring the fact that means
> more memory will be used for the BufFile stuff).
> 
> I don't recall why it went nowhere, TBH. But I recall there were
> discussions about maybe doing something like "block nestloop" at the
> time, or something. Or maybe the thread just went cold.

So I read the full thread now. I'm still not sure why we try to avoid hash
collisions so hard, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding of
the discussion and the proposed solutions.

There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it on too
many BufFile buffers. So you expand either work_mem or the number of
batches, depending on which move is smarter. This is explained and tested here:

https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

Then, nothing more on the discussion about this last patch. So I guess it just
went cold.

For what it's worth, these two patches seem really interesting to me. Do you need
any help to revive them?

Regards,




Re: Operation log for major operations

2023-03-02 Thread Dmitry Koval

Hi!

These changes did not interest the community. It was expected (topic is 
very specific: vendor's technical support). So no sense to distract 
developers ...


I'll move patch to Withdrawn.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Add LZ4 compression in pg_dump

2023-03-02 Thread Justin Pryzby
On Wed, Mar 01, 2023 at 05:20:05PM +0100, Tomas Vondra wrote:
> On 2/25/23 15:05, Justin Pryzby wrote:
> > On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >> I have some fixes (attached) and questions while polishing the patch for
> >> zstd compression.  The fixes are small and could be integrated with the
> >> patch for zstd, but could be applied independently.
> > 
> > One more - WriteDataToArchiveGzip() says:
> > 
> > +   if (cs->compression_spec.level == 0)
> > +   pg_fatal("requested to compress the archive yet no level was 
> > specified");
> > 
> > That was added at e9960732a.  
> > 
> > But if you specify gzip:0, the compression level is already enforced by
> > validate_compress_specification(), before hitting gzip.c:
> > 
> > | pg_dump: error: invalid compression specification: compression algorithm 
> > "gzip" expects a compression level between 1 and 9 (default at -1)
> > 
> > 5e73a6048 intended that to work as before, and you *can* specify -Z0:
> > 
> > The change is backward-compatible, hence specifying only an integer
> > leads to no compression for a level of 0 and gzip compression when the
> > level is greater than 0.
> > 
> > $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp 
> > --compress 0 |file -
> > /dev/stdin: ASCII text
> 
> FWIW I agree we should make this backwards-compatible - accept "0" and
> treat it as no compression.
> 
> Georgios, can you prepare a patch doing that?

I think maybe Tomas misunderstood.  What I was trying to say is that -Z
0 *is* accepted to mean no compression.  This part wasn't quoted, but I
said:

> Right now, I think that pg_fatal in gzip.c is dead code - that was first
> added in the patch version sent on 21 Dec 2022.

If you run the diff command that I've been talking about, you'll see
that InitCompressorZlib was almost unchanged - e9960732 is essentially a
refactoring.  I don't think it's desirable to add a pg_fatal() in a
function that's otherwise nearly-unchanged.  The fact that it's
nearly-unchanged is a good thing: it simplifies reading of what changed.
If someone wants to add a pg_fatal() in that code path, it'd be better
done in its own commit, with a separate message explaining the change.

If you insist on changing anything here, you might add an assertion (as
you said earlier) along with a comment like
/* -Z 0 uses the "None" compressor rather than zlib with no compression */

-- 
Justin




Re: Making empty Bitmapsets always be NULL

2023-03-02 Thread Tom Lane
Richard Guo  writes:
> It seems that the Bitmapset checked by bms_is_empty_internal cannot be
> NULL from how it is computed by a function.  So I wonder if we can
> remove the check of 'a' being NULL in that function, or reduce it to an
> Assert.

Yeah, I think just removing it is sufficient.  The subsequent attempts
to dereference the pointer will crash just fine if it's NULL; we don't
need an Assert to help things along.

> It seems that in create_lateral_join_info around line 689, the
> bms_is_empty check of lateral_relids is not necessary, since we've
> checked that lateral_relids cannot be NULL several lines earlier.

Good catch, I missed that one.

Pushed, thanks for reviewing.

regards, tom lane




Re: is_superuser is not documented

2023-03-02 Thread Joseph Koshakow
On Thu, Mar 2, 2023 at 11:53 AM Fujii Masao 
wrote:
>
>On 2022/09/14 14:27, bt22kawamotok wrote:
>> I update patch to reflect master update.
>
>Thanks for updating the patch!
>
>+   
>+Shows whether the current user is a superuser or not.
>+   
>
>How about adding the note about when this parameter can change,
>like we do for in_hot_standby docs?  I applied this change to the
patch.
>Attached is the updated version of the patch.
>

I just came across this thread and noticed that the patch was never
merged. There are some brief docs for is_superuser in the SHOW docs:
https://www.postgresql.org/docs/current/sql-show.html, but the GUC
fields were never updated.

Is there a reason that it never got merged or was it just forgotten
about?

- Joe Koshakow


Re: Add LZ4 compression in pg_dump

2023-03-02 Thread gkokolatos





--- Original Message ---
On Wednesday, March 1st, 2023 at 5:20 PM, Tomas Vondra 
 wrote:


> 
> 
> 
> 
> On 2/25/23 15:05, Justin Pryzby wrote:
> 
> > On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> > 
> > > I have some fixes (attached) and questions while polishing the patch for
> > > zstd compression. The fixes are small and could be integrated with the
> > > patch for zstd, but could be applied independently.
> > 
> > One more - WriteDataToArchiveGzip() says:
> > 
> > + if (cs->compression_spec.level == 0)
> > + pg_fatal("requested to compress the archive yet no level was specified");
> > 
> > That was added at e9960732a.
> > 
> > But if you specify gzip:0, the compression level is already enforced by
> > validate_compress_specification(), before hitting gzip.c:
> > 
> > | pg_dump: error: invalid compression specification: compression algorithm 
> > "gzip" expects a compression level between 1 and 9 (default at -1)
> > 
> > 5e73a6048 intended that to work as before, and you can specify -Z0:
> > 
> > The change is backward-compatible, hence specifying only an integer
> > leads to no compression for a level of 0 and gzip compression when the
> > level is greater than 0.
> > 
> > $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp 
> > --compress 0 |file -
> > /dev/stdin: ASCII text
> 
> 
> FWIW I agree we should make this backwards-compatible - accept "0" and
> treat it as no compression.
> 
> Georgios, can you prepare a patch doing that?

Please find a patch attached. However, I am a bit at a loss: the
backwards-compatible behaviour has not changed. Passing -Z0/--compress=0
does produce non-compressed output. So I am not really certain as to what
broke and needs fixing.

What commit 5e73a6048 did fail to do is test the backwards-compatible
behaviour. The attached amends that.

Cheers,
//Georgios

> 
> 
> regards
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL CompanyFrom 99c2da94ecbeacf997270dd26fc5c0a63ffcedd4 Mon Sep 17 00:00:00 2001
From: Georgios Kokolatos 
Date: Thu, 2 Mar 2023 16:10:03 +
Subject: [PATCH vX] Add test for backwards compatible -Z0 option in pg_dump

Commit 5e73a6048 expanded pg_dump with the ability to use compression
specifications, a commonly shared piece of code which lets the user control
the method, level, and other details of the desired compression in an
extended way.

Prior to that commit, pg_dump could only accept an integer for the
-Z/--compress option. An integer value of 0 had the special meaning of no
compression. Commit 5e73a6048 respected and maintained this behaviour for
backwards compatibility.

However, no tests covered this scenario. The current commit adds coverage for
this case.
---
 src/bin/pg_dump/t/002_pg_dump.pl | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 7b5a6e190c..ec7aaab884 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -76,6 +76,19 @@ my %pgdump_runs = (
 		],
 	},
 
+	# Verify that the backwards compatible option -Z0 produces
+	# non compressed output
+	compression_none_plain => {
+		test_key   => 'compression',
+		# Enforce this test when compile option is available
+		compile_option => 'gzip',
+		dump_cmd   => [
+			'pg_dump',  '--format=plain',
+			'-Z0', "--file=$tempdir/compression_none_plain.sql",
+			'postgres',
+		],
+	},
+
 	# Do not use --no-sync to give test coverage for data sync.
 	compression_gzip_custom => {
 		test_key   => 'compression',
-- 
2.34.1



Re: Add pg_walinspect function with block info columns

2023-03-02 Thread Melanie Plageman
On Wed, Mar 1, 2023 at 12:51 PM Melanie Plageman
 wrote:
> When using pg_walinspect, and calling functions like
> pg_get_wal_records_info(), I often wish that the various information in
> the block_ref column was separated out into columns so that I could
> easily access them and pass them to various other functions to add
> information -- like getting the relname from pg_class like this:
>
> SELECT n.nspname, c.relname, wal_info.*
>   FROM pg_get_wal_records_extended_info(:start_lsn, :end_lsn) wal_info
>   JOIN pg_class c
> ON wal_info.relfilenode = pg_relation_filenode(c.oid) AND
>   wal_info.reldatabase IN (0, (SELECT oid FROM pg_database
> WHERE datname = current_database()))
>   JOIN pg_namespace n ON n.oid = c.relnamespace;
>
>
> This has been mentioned in [1] amongst other places.
>
> So, attached is a patch with pg_get_wal_records_extended_info(). I
> suspect the name is not very good. Also, it is nearly a direct copy of
> pg_get_wal_fpi_info() except for the helper called to fill in the
> tuplestore, so it might be worth doing something about that.
>
> However, I am mainly looking for feedback about whether or not others
> would find this useful, and, if so, what columns they would like to see
> in the returned tuplestore.
>
> Note that I didn't include the cumulative fpi_len for all the pages
> since pg_get_wal_fpi_info() now exists. I noticed that
> pg_get_wal_fpi_info() doesn't list compression information (which is in
> the block_ref column of pg_get_wal_records_info()). I don't know if this
> is worth including in my proposed function
> pg_get_wal_records_extended_info().

Thinking about this more, it could make sense to have a function which
gives you this extended block information and has a parameter like
with_fpi which would include the information returned by
pg_get_wal_fpi_info(). It might be nice to have it still include the
information about the record itself as well.

I don't know if it would be instead of pg_get_wal_fpi_info(), though.

The way I would use this is when I want to see the record level
information but with some additional information aggregated across the
relevant blocks. For example, I could group by the record information
and relfilenode and using the query in my example above, see all the
information for the record along with the relname (when possible).

- Melanie




Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-03-02 Thread Kirk Wolak
On Thu, Mar 2, 2023 at 9:56 AM Laurenz Albe 
wrote:

> On Wed, 2023-03-01 at 11:13 -0500, Kirk Wolak wrote:
> > Thanks, corrected, and confirmed Unix line endings.
>
> The patch builds fine and works as intended.
>
> I leave it to the committers to decide whether the patch is worth the
> effort or not, given that you can get a similar effect with %`date`.
> It adds some value by being simpler and uniform across all platforms.
>
> I'll mark the patch as "ready for committer".
>
> Yours,
> Laurenz Albe
>

Thanks Laurenz.

To be clear, I use Windows AND Linux, and I share my file between them.

In Linux, `date +%H:%M:%S` is used;
in Windows, `ECHO %time%`.

So I wrote a ts.cmd and a ts.sh so I could share one prompt: `ts`.
But now, every time I connect a new account to this file, I have to go
find/copy my ts file.
Same when I share it with other developers.

This was the pain that started the quest.
Thanks to everyone for their support!


Re: Allow tailoring of ICU locales with custom rules

2023-03-02 Thread Laurenz Albe
On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:
> > - there doesn't seem to be a way to add rules to template1.
> > If someone wants to have icu rules and initial contents to their new
> > databases, I think they need to create a custom template database
> > (from template0) for that purpose, in addition to template1.
> >   From a usability standpoint, this is a bit cumbersome, as it's
> > normally the role of template1.
> > To improve on that, shouldn't initdb be able to create template0 with
> > rules too?
> 
> Right, that would be an initdb option.  Is that too many initdb options 
> then?  It would be easy to add, if we think it's worth it.

An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.

But I'd prefer an "initdb" option.

Yours,
Laurenz Albe




Re: Allow logical replication to copy tables in binary format

2023-03-02 Thread Kuntal Ghosh
On Thu, Mar 2, 2023 at 10:30 AM Amit Kapila  wrote:
>
> > TBH I am not sure anymore if the complications justify the patch.
> >
> > It seems we have to choose from 2 bad choices:
> > - separate options = this works but would be more confusing for the user
> > - unified option = this would be simpler and faster, but risks
> > breaking existing applications currently using 'binary=true'
> >
>
> I would prefer a unified option as apart from other things you and
> others mentioned that will be less of a maintenance burden in the
> future.
>
+1
When someone sets binary=true while creating a subscription, the
expectation would be that the data transfer will happen in binary mode
if binary in/out functions are available. As per the current
implementation, that's not happening in the table-sync phase. So, it
makes sense to fix that behaviour in a major version release.
For the existing applications that are using (or unknowingly misusing)
the feature, as Amit mentioned, they have a workaround.


-- 
Thanks & Regards,
Kuntal Ghosh




Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-03-02 Thread Laurenz Albe
On Wed, 2023-03-01 at 11:13 -0500, Kirk Wolak wrote:
> Thanks, corrected, and confirmed Unix line endings.

The patch builds fine and works as intended.

I leave it to the committers to decide whether the patch is worth the
effort or not, given that you can get a similar effect with %`date`.
It adds some value by being simpler and uniform across all platforms.

I'll mark the patch as "ready for committer".

Yours,
Laurenz Albe




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-02 Thread Alexander Korotkov
Hi, Pavel!

On Thu, Mar 2, 2023 at 1:29 PM Pavel Borisov  wrote:
> > Let's see the performance results for the patchset.  I'll properly
> > revise the comments if results will be good.
> >
> > Pavel, could you please re-run your tests over revised patchset?
>
> Since last time I've improved the test to avoid significant series
> differences due to AWS storage access variation that is seen in [1].
> I.e. each series of tests is run on a tmpfs with newly inited pgbench
> tables and vacuum. Also, I've added a test for low-concurrency updates
> where the locking optimization isn't expected to improve performance,
> just to make sure the patches don't make things worse.
>
> The tests are as follows:
> 1. Heap updates with high tuple concurrency:
> Prepare without pkeys (pgbench -d postgres -i -I dtGv -s 10 --unlogged-tables)
> Update tellers 100 rows, 50 conns ( pgbench postgres -f
> ./update-only-tellers.sql -s 10 -P10 -M prepared -T 600 -j 5 -c 50 )
>
> Result: Average of 5 series with patches (0001+0002) is around 5%
> faster than both master and patch 0001. Still, there are some
> fluctuations between different series of the measurements of the same
> patch, but much less than in [1]

Thank you for running this that fast!

So, it appears that the 0001 patch has no effect.  So, we should probably
consider dropping patch 0001 and keeping just patch 0002.

The attached patch v12 contains v11 0002 patch extracted separately.
Please, add it to the performance comparison.  Thanks.

--
Regards,
Alexander Korotkov


0001-Allow-locking-updated-tuples-in-tuple_update-and-v12.patch
Description: Binary data


Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-02 Thread Tom Lane
Peter Eisentraut  writes:
> I think an error message like
>  "unexpected null value in system cache %d column %d"
> is sufficient.  Since these are "can't happen" errors, we don't need to 
> spend too much extra effort to make it prettier.

I'd at least like to see it give the catalog's OID.  That's easily
convertible to a name, and it doesn't tend to move around across PG
versions, neither of which are true for syscache IDs.

Also, I'm fairly unconvinced that it's a "can't happen" --- this
would be very likely to fire as a result of catalog corruption,
so it would be good if it's at least minimally interpretable
by a non-expert.  Given that we'll now have just one copy of the
code, ISTM there's a good case for doing the small extra work
to report catalog and column by name.

regards, tom lane




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-02 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 1 Mar 2023, at 00:20, Tom Lane  wrote:
>> Also ... at least in assert-enabled builds, maybe we could check that
>> the column being fetched this way is actually marked attnotnull?

> We could, but that would limit the API to attnotnull columns, rather than when
> the caller knows that the attr cannot be NULL either due to attnotnull or due
> to intrinsic knowledge based on what is being extracted.
> An example of the latter is build_function_result_tupdesc_t() which knows that
> proallargtypes cannot be NULL when calling SysCacheGetAttr.

OK, if there are counterexamples then never mind that.  I don't think
we want to discourage call sites from using this function.

regards, tom lane




Request for comment on setting binary format output per session

2023-03-02 Thread Dave Cramer
Greetings,

In [1] I proposed a patch that used a GUC to request a list of OIDs to be
returned in binary format.
In [2] Peter Eisentraut proposed a very similar solution to the problem.

In [2] there was some discussion regarding whether this should be set via
GUC or a new protocol message.

I'd like to open up this discussion again so that we can move forward. I
prefer the GUC as it is relatively simple and as Peter mentioned it works,
but I'm not married to the idea.

Regards,
Dave

[1] PostgreSQL: Proposal to provide the facility to set binary format
output for specific OID's per session

[2] PostgreSQL: default result formats setting

Dave Cramer


Re: Improving inferred query column names

2023-03-02 Thread Peter Eisentraut

On 22.02.23 21:38, Andres Freund wrote:

On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:

On 11.02.23 20:24, Andres Freund wrote:
I think we should just do it and not care about what breaks.  There has
never been any guarantee about these.

FWIW, "most" other SQL implementations appear to generate column names like

SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"

Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns.  sum_reads imo is more usable
than "SUM(reads)".


If you want something without special characters, the example you gave 
is manageable, but what are you going to do with


SELECT a, b, a * b, a / b FROM ...

or

SELECT a, b, SUM(a * b) FROM ...

and so on.  What would be the actual rule to produce the output you want?

I think a question here is what "usable" means in this context.

If you want a name that you can refer to (in a client API, for example), 
you should give it a name explicitly.


I think the uses for the automatic names are that they look pretty and 
meaningful in visual output (psql, pgadmin, etc.).  In that context, I 
think it is ok to use special characters without limitation, since you 
are just going to look at the thing, not type it back in.
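
(For reference, a quick illustration of what the automatic names look like
today versus the styles being discussed; the pg_stat_io query is just a
convenient example:)

  -- current behaviour: aggregates are named after the function, bare expressions get "?column?"
  SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
  -- column names today: "sum", "sum"
  SELECT reads, writes, reads * writes FROM pg_stat_io;
  -- column names today: "reads", "writes", "?column?"
  -- the styles discussed above would instead give e.g. "SUM(reads)" or "sum_reads"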






Re: generic plans and "initial" pruning

2023-03-02 Thread Amit Langote
On Wed, Feb 8, 2023 at 7:31 PM Amit Langote  wrote:
> On Tue, Feb 7, 2023 at 23:38 Andres Freund  wrote:
>> The tests seem to frequently hang on freebsd:
>> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478
>
> Thanks for the heads up.  I’ve noticed this one too, though couldn’t find the 
> testrun artifacts like I could get for some other failures (on other cirrus 
> machines).  Has anyone else been in a similar situation?

I think I have figured out what might be going wrong on that cfbot
animal after building with the same CPPFLAGS as that animal locally.
I had forgotten to update _out/_readRangeTblEntry() to account for the
patch's change that a view's RTE_SUBQUERY now also preserves relkind
in addition to relid and rellockmode for the locking consideration.

Also, I noticed that a multi-query Portal execution with rules was
failing (thanks to a regression test added in a7d71c41db) because the
snapshot used for the 2nd query onward was not being updated for the
command ID change under the patched model of multi-query Portal
execution.  To wit, under the patched model, all queries in the
multi-query Portal case undergo ExecutorStart() before any of them is
run with ExecutorRun().  The patch hadn't, however, changed things to
update the snapshot's command ID for the 2nd query onwards, which
caused the aforementioned test case to fail.

This new model does however mean that the 2nd query onwards must use
PushCopiedSnapshot() given the current requirement of
UpdateActiveSnapshotCommandId() that the snapshot passed to it must
not be referenced anywhere else.  The new model basically requires
that each query's QueryDesc points to its own copy of the
ActiveSnapshot.  That may not be a thing in favor of the patched model
though.  For now, I haven't been able to come up with a better
alternative.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v34-0001-Move-AcquireExecutorLocks-s-responsibility-into-.patch
Description: Binary data


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
> Yeah, min_send_delay and max_slot_wal_keep_size should be easily tunable
> because the appropriate value depends on the environment and workload.
> However, pg_replication_slots cannot show the exact amount of WAL, so it
> may not be suitable for tuning. I think users can compare
> pg_replication_slots.restart_lsn (or pg_stat_replication.sent_lsn) with
> pg_current_wal_lsn() to calculate the amount of WAL to be delayed, like
> 
> ```
> postgres=# select pg_current_wal_lsn() - pg_replication_slots.restart_lsn as
> delayed from pg_replication_slots;
>   delayed
> 
>  1689153760
> (1 row)
> ```
> 
> > I think it would be better to tell about this in the docs along with
> > the 'min_send_delay' description. The key point is whether this would
> > be an acceptable trade-off for users who want to use this feature. I
> > think it can harm only if users use this without understanding the
> > corresponding trade-off. As we kept the default to no delay, it is
> > expected from users using this have an understanding of the trade-off.
> 
> Yes, the trade-off should be emphasized.

Based on this understanding, I added them to the docs in the new version of
the patch.  Please see [1].

[1]: 
https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Fair point but I think the current comment should explain why we are
> doing something different here. How about extending the existing
> comments to something like: "If we've requested to shut down, exit the
> process. This is unlike handling at other places where we allow
> complete WAL to be sent before shutdown because we don't want the
> delayed transactions to be applied downstream. This will allow one to
> use the data from downstream in case of some unwanted operations on
> the current node."

Thank you for the suggestion. I think it is better, so I changed it.
Please see the new patch at [1].

[1]: 
https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thank you for reviewing! A new version is available at [1].

> 1) Currently we have added the delay during the decode of commit,
> while decoding the commit walsender process will stop decoding any
> further transaction until delay is completed. There might be a
> possibility that a lot of transactions will happen in parallel and
> there will be a lot of transactions to be decoded after the delay is
> completed.
> Will it be possible to decode the WAL if any WAL is generated instead
> of staying idle in the meantime, I'm not sure if this is feasible just
> throwing my thought to see if it might be possible.
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -676,6 +676,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> 
> buf->origptr, buf->endptr);
> }
> 
> +   /*
> +* Delay sending the changes if required. For streaming transactions,
> +* this means a delay in sending the last stream but that is OK
> +* because on the downstream the changes will be applied only after
> +* receiving the last stream.
> +*/
> +   if (ctx->min_send_delay > 0 && ctx->delay_send)
> +   ctx->delay_send(ctx, xid, commit_time);
> +

I see your point, but I think that extension can be done in a future version
if needed, because it would require changing several parts and would introduce
some complexity.

If we have decoded changes but do not want to send them yet, we must store
them in memory and skip sending. To do that we would have to add a new data
structure, add another path in DecodeCommit() and DecodePrepare() so that
changes are not sent, and change WalSndLoop() and other functions to send the
pending changes later. Even that may not be sufficient.

For now I think none of this is needed; we can revisit it later if the
overhead of decoding turns out to be large and we must handle it more
efficiently.

> 2) Generally single line comments are not terminated by ".", The
> comment "/* Sleep until appropriate time. */" can be changed
> appropriately:
> +
> +   /* Sleep until appropriate time. */
> +   timeout_sleeptime_ms = WalSndComputeSleeptime(now);
> +
> +   elog(DEBUG2, "time-delayed replication for txid %u,
> delay_time = %d ms, remaining wait time: %ld ms",
> +xid, (int) ctx->min_send_delay,
> remaining_wait_time_ms);
> +
> +   /* Sleep until we get reply from worker or we time out */
> +   WalSndWait(WL_SOCKET_READABLE,

Right, removed.

> 3) In some places we mention as min_send_delay and in some places we
> mention it as time-delayed replication, we can keep the comment
> consistent by using the similar wordings.
> +-- fail - specifying streaming = parallel with time-delayed replication is 
> not
> +-- supported
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = parallel, min_send_delay = 123);
> 
> +-- fail - alter subscription with streaming = parallel should fail when
> +-- time-delayed replication is set
> +ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
> 
> +-- fail - alter subscription with min_send_delay should fail when
> +-- streaming = parallel is set

"time-delayed replication" was removed.

> 4) Since the value is stored in ms, we need not add ms again as the
> default value is in ms:
> @@ -4686,6 +4694,9 @@ dumpSubscription(Archive *fout, const
> SubscriptionInfo *subinfo)
> if (strcmp(subinfo->subsynccommit, "off") != 0)
> appendPQExpBuffer(query, ", synchronous_commit = %s",
> fmtId(subinfo->subsynccommit));
> 
> +   if (subinfo->subminsenddelay > 0)
> +   appendPQExpBuffer(query, ", min_send_delay = '%d ms'",
> subinfo->subminsenddelay);

Right, fixed.

> 5) we can use the new error reporting style:
> 5.a) brackets around errcode can be removed
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("invalid value for parameter
> \"%s\": \"%s\"",
> +   "min_send_delay",
> input_string),
> +hintmsg ? errhint("%s", _(hintmsg)) : 0));
> 
> 5.b) Similarly here too;
> +   if (result < 0 || result > PG_INT32_MAX)
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("%d ms is outside the valid
> range for parameter \"%s\" (%d .. %d)",
> +   result,
> +   "min_send_delay",
> +   0, PG_INT32_MAX)));
> 
> 5.c) Similarly here too;
> +   delay_val = strtoul(strVal(defel->arg), , 10);
> +   if (errno != 0 || *endptr != '\0')
> +   ereport(ERROR,
> +
> 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version.

> 1.
> Nitpick. The new text is jagged-looking. It should wrap at ~80 chars.

Addressed.

> 
> 2.
> 2. Another reason is for that parallel streaming, the transaction will be 
> opened
> immediately by the parallel apply worker. Therefore, if the walsender
> is delayed in sending the final record of the transaction, the
> parallel apply worker must wait to receive it with an open
> transaction. This would result in the locks acquired during the
> transaction not being released until the min_send_delay has elapsed.
> 
> ~
> 
> The text already said there are "two reasons", and already this is
> numbered as reason 2. So it doesn't need to keep saying "Another
> reason" here.
> 
> "Another reason is for that parallel streaming" --> "For parallel 
> streaming..."

Changed.

> ==
> src/backend/replication/walsender.c
> 
> 3. WalSndDelay
> 
> + /* die if timeout was reached */
> + WalSndCheckTimeOut();
> 
> Other nearby comments start uppercase, so this should too.

I just copied it from another part of the code where they were lowercase, but fixed.

> ==
> src/include/replication/walreceiver.h
> 
> 4. WalRcvStreamOptions
> 
> @@ -187,6 +187,7 @@ typedef struct
>   * prepare time */
>   char*origin; /* Only publish data originating from the
>   * specified origin */
> + int32 min_send_delay; /* The minimum send delay */
>   } logical;
>   } proto;
>  } WalRcvStreamOptions;
> 
> ~
> 
> Should that comment mention the units are "(ms)"

Added.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v10-0001-Time-delayed-logical-replication-on-publisher-si.patch
Description:  v10-0001-Time-delayed-logical-replication-on-publisher-si.patch


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Önder Kalacı
Hi Amit, Shi Yu


> >
> > b. Executed SQL.
> > I executed TRUNCATE and INSERT before each UPDATE. I am not sure if you
> did the
> > same, or just executed 50 consecutive UPDATEs. If the latter one, there
> would be
> > lots of old tuples and this might have a bigger impact on sequential
> scan. I
> > tried this case (which executes 50 consecutive UPDATEs) and also saw
> that the
> > overhead is smaller than before.
>

Alright, I'll do the same: execute truncate/insert before each update.


> In the above profile number of calls to index_fetch_heap(),
> heapam_index_fetch_tuple() explains the reason for the regression you
> are seeing with the index scan. Because the update will generate dead
> tuples in the same transaction and those dead tuples won't be removed,
> we get those from the index and then need to perform
> index_fetch_heap() to find out whether the tuple is dead or not. Now,
> for sequence scan also we need to scan those dead tuples but there we
> don't need to do back-and-forth between index and heap.


Thanks for the insights, I think what you describe makes a lot of sense.



> I think we can
> once check with more number of tuples (say with 2, 5, etc.)
> for case-1.
>
>
As we'd expect, this test made the performance regression more visible.

I quickly ran case-1 50 times with 5 tuples, as Shi Yu did, and got
the following results. I'm measuring end-to-end time for running the
whole set of commands:

seq_scan: 00 hr 24 minutes, 42 seconds
index_scan:  01 hr 04 minutes 54 seconds


But, I'm still not sure whether we should focus on this regression too
much. In the end, what we are talking about is a case (e.g., all or many
rows are duplicated) where using an index is not a good idea anyway. So,
I doubt users would have such indexes.


>  The quadratic apply performance the sequential scans cause is a much
> bigger hazard for users than some apply performance regression.

Quoting Andres' note, I personally think that the regression for this case
is not a big concern.

> I'd prefer not having an option, because we figure out the cause of the
> performance regression (reducing it to be small enough to not care). After
> that an option defaulting to using indexes. I don't think an option
defaulting
> to false makes sense.

I think we figured out the cause of the performance regression. It is not
small enough to ignore in some scenarios like the above, but those scenarios
look like synthetic test cases with little real-world impact on users. Still,
I think you are better suited to comment on this.

If you consider this a significant issue, we could also consider the second
patch, so that users could disable index scans for this unlikely scenario.

Thanks,
Onder


Re: Fix comments in gistxlogDelete, xl_heap_freeze_page and xl_btree_delete

2023-03-02 Thread Drouvot, Bertrand

Hi,

On 1/6/23 11:05 AM, Drouvot, Bertrand wrote:

Hi hackers,

Please find attached a patch to $SUBJECT.

The wrong comments have been discovered by Robert in [1].

Submitting this here as a separate thread so it does not get lost in the 
logical decoding
on standby thread.

[1]: 
https://www.postgresql.org/message-id/CA%2BTgmoYTTsxP8y6uknZvCBNCRq%2B1FJ4zGbX8Px1TGW459fGsaQ%40mail.gmail.com

Looking forward to your feedback,

Regards,



It looks like I did not create a CF entry for this one: fixed with [1].

Also, while at it, I added a commit message in the attached v2.

[1]: https://commitfest.postgresql.org/43/4235/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 075a8d1c3dbc1ef339c35f719faef758f855614e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 2 Mar 2023 12:46:01 +
Subject: [PATCH v2] Fix comments in gistxlogDelete, xl_heap_freeze_page and
 xl_btree_delete

gistxlogDelete claims that the OffsetNumbers are in the payload of blk 0,
which is not the case: they follow the main payload.

xl_heap_freeze_page claims that freeze plans and offset numbers follow the
main payload, but they don't: they're in the data for block 0.

xl_btree_delete claims that the data follows the main payload, but it is
attached to block 0.

This commit fixes the related comments.
---
 src/include/access/gistxlog.h| 4 +---
 src/include/access/heapam_xlog.h | 5 +++--
 src/include/access/nbtxlog.h | 9 ++---
 3 files changed, 10 insertions(+), 8 deletions(-)
 100.0% src/include/access/

diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 09f9b0f8c6..2ce9366277 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -52,9 +52,7 @@ typedef struct gistxlogDelete
TransactionId snapshotConflictHorizon;
uint16  ntodelete;  /* number of deleted offsets */
 
-   /*
-* In payload of blk 0 : todelete OffsetNumbers
-*/
+   /* TODELETE OFFSET NUMBER ARRAY FOLLOWS */
 } gistxlogDelete;
 
 #define SizeOfGistxlogDelete   (offsetof(gistxlogDelete, ntodelete) + 
sizeof(uint16))
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 8cb0d8da19..a2c67d1cd3 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,8 +345,9 @@ typedef struct xl_heap_freeze_page
TransactionId snapshotConflictHorizon;
uint16  nplans;
 
-   /* FREEZE PLANS FOLLOW */
-   /* OFFSET NUMBER ARRAY FOLLOWS */
+   /*
+* In payload of blk 0 : FREEZE PLANS and OFFSET NUMBER ARRAY
+*/
 } xl_heap_freeze_page;
 
 #define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + 
sizeof(uint16))
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index edd1333d9b..e7a9711767 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -236,9 +236,12 @@ typedef struct xl_btree_delete
uint16  ndeleted;
uint16  nupdated;
 
-   /* DELETED TARGET OFFSET NUMBERS FOLLOW */
-   /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
-   /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
+   /*
+* In payload of blk 0 :
+* - DELETED TARGET OFFSET NUMBERS
+* - UPDATED TARGET OFFSET NUMBERS
+* - UPDATED TUPLES METADATA (xl_btree_update) ARRAY
+*/
 } xl_btree_delete;
 
 #define SizeOfBtreeDelete  (offsetof(xl_btree_delete, nupdated) + 
sizeof(uint16))
-- 
2.34.1



Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra


On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
> ...
> [...]
>> But I have another idea - put a breakpoint on makeBufFile() which is the
>> bit that allocates the temp files including the 8kB buffer, and print in
>> what context we allocate that. I have a hunch we may be allocating it in
>> the ExecutorState. That'd explain all the symptoms.
> 
> That what I was wondering as well yesterday night.
> 
> So, on your advice, I set a breakpoint on makeBufFile:
> 
>   (gdb) info br
>   Num Type   Disp Enb AddressWhat
>   1   breakpoint keep y   0x007229df in makeBufFile
>   bt 10
>   p CurrentMemoryContext.name
> 
> 
> Then, I disabled it and ran the query up to this mem usage:
> 
>VIRTRESSHR S  %CPU %MEM
>   20.1g   7.0g  88504 t   0.0 22.5
> 
> Then, I enabled the breakpoint and look at around 600 bt and context name
> before getting bored. They **all** looked like that:
> 
>   Breakpoint 1, BufFileCreateTemp (...)at buffile.c:201
>   201 in buffile.c
>   #0  BufFileCreateTemp (...) buffile.c:201
>   #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)   nodeHashjoin.c:1238
>   #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
>   #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
>   #4  ExecProcNodeInstr (node=)execProcnode.c:462
>   #5  ExecProcNode (node=0x31a6418)
>   #6  ExecSort (pstate=0x31a6308)
>   #7  ExecProcNodeInstr (node=)
>   #8  ExecProcNode (node=0x31a6308)
>   #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
>   
>   $421643 = 0x99d7f7 "ExecutorState"
> 
> These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
> probably log much more of them if more checks/stats need to be collected, but
> it really slows down the query a lot, granting it only 1-5% of CPU time instead
> of the usual 100%.
> 

Bingo!

> So it's not exactly a leak, as the memory would be released at the end of the
> query, but I suppose these buffers should be allocated in a shorter-lived
> context, to avoid this memory bloat, am I right?
> 

Well, yeah and no.

In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then
you process the last tuple and release all the hashes.

This would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles and perhaps for which hashjoin node). But
we'd still have to allocate the memory etc. (so still OOM).

There's only one thing I think could help - increase the work_mem enough
not to trigger the explosive growth in number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.

The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.
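
(As a minimal sketch of that experiment, with a made-up query and setting
standing in for the real ones; watch the "Batches:" line in the Hash node
of the EXPLAIN output to see whether the growth stops:)

  -- try progressively larger values until the Hash node stops exploding into batches
  SET work_mem = '128MB';          -- was 64MB in this example
  EXPLAIN (ANALYZE, BUFFERS)
      SELECT count(*)
      FROM orders o
      JOIN order_lines l ON l.order_id = o.id;   -- stand-in for the real query
  RESET work_mem;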

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1]
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

I even posted a couple patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (ignoring the fact that means
more memory will be used for the BufFile stuff).

I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.

>> BTW with how many batches does the hash join start?
> 
> * batches went from 32 to 1048576 before being growEnabled=false as suspected
> * original and current nbuckets were set to 1048576 immediately
> * allowed space is set to the work_mem, but current space usage is 1.3GB, as
>   measured previously close before system refuse more memory allocation.
> 

Yeah, I think this is pretty expected. We start with multiple batches,
so we pick optimal buckets for the whole work_mem (so no growth here).

But then batches explode, in the futile hope to keep this in work_mem.
Once that growth gets disabled, we end up with 1.3GB hash table.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-02 Thread Daniel Gustafsson
> On 1 Mar 2023, at 15:23, Jeroen Vermeulen  wrote:

> PR for easy discussion: https://github.com/jtv/postgres/pull/1

The process for discussing work on pgsql-hackers is to attach the patch to the
email and discuss it inline in the thread.  That way all versions of the patch
as well as the discussion are archived and searchable.

--
Daniel Gustafsson





Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-02 Thread Amit Kapila
On Thu, Mar 2, 2023 at 3:00 PM Önder Kalacı  wrote:
>>
>> Few comments:
>> ===
>> 1.
>> +   identity.  When replica identity FULL is specified,
>> +   indexes can be used on the subscriber side for searching the rows. These
>> +   indexes should be btree,
>>
>> Why only btree and not others like a hash index? Also, there should be
>> some comments in FindUsableIndexForReplicaIdentityFull() to explain
>> the choices.
>
>
> I updated the comment(s).
>
> For a more technical reference, we have these restrictions, because we rely on
> RelationFindReplTupleByIndex() which is designed to handle PK/RI. And,
> RelationFindReplTupleByIndex() is written in a way that only expects
> indexes with these limitations.
>
> In order to keep the changes as small as possible, I refrained from relaxing 
> this
> limitation for now. I'm definitely up to working on this for relaxing these
> limitations, and practically allowing more cases for non-unique indexes.
>

I think I understand why partial/expression indexes can't be supported: the
required tuple may not satisfy the expression, so that won't work for our
case. But what other limitations do you see that prevent us from supporting
other index types for non-unique indexes? Is it just a matter of testing other
index types, or is there something more to it? If so, we should add comments
so that they can be supported in the future if it is feasible to do so.

>
> Attached are both patches: the main patch, and the patch that
> optionally disables the index scans.
>

Both the patches are numbered 0001. It would be better to number them
as 0001 and 0002.



-- 
With Regards,
Amit Kapila.




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 01:30:27 +0100
Tomas Vondra  wrote:
> On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
> >>> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
> >>> of calls, always short-cut'ed to 1048576, I guess because of the
> >>> conditional block «/* safety check to avoid overflow */» appearing early
> >>> in this function.   
> >[...] But what about this other short-cut then?
> > 
> > /* do nothing if we've decided to shut off growth */
> > if (!hashtable->growEnabled)
> > return;
> > 
> > [...]
> > 
> > /*
> >  * If we dumped out either all or none of the tuples in the table,
> >  * disable
> >  * further expansion of nbatch.  This situation implies that we have
> >  * enough tuples of identical hashvalues to overflow spaceAllowed.
> >  * Increasing nbatch will not fix it since there's no way to subdivide
> >  * the
> >  * group any more finely. We have to just gut it out and hope the server
> >  * has enough RAM.
> >  */
> > if (nfreed == 0 || nfreed == ninmemory)
> > {
> > hashtable->growEnabled = false;
> >   #ifdef HJDEBUG
> > printf("Hashjoin %p: disabling further increase of nbatch\n",
> >hashtable);
> >   #endif
> > }
> > 
> > If I guess correctly, the function is not able to split the current batch,
> > so it sits and hopes. This is a much better suspect and I can surely track
> > this from gdb.
> 
> Yes, this would make much more sense - it'd be consistent with the
> hypothesis that this is due to number of batches exploding (it's a
> protection exactly against that).
> 
> You specifically mentioned the other check earlier, but now I realize
> you've been just speculating it might be that.

Yes, sorry about that, I jumped on this speculation without actually digging
into it much...

[...]
> But I have another idea - put a breakpoint on makeBufFile() which is the
> bit that allocates the temp files including the 8kB buffer, and print in
> what context we allocate that. I have a hunch we may be allocating it in
> the ExecutorState. That'd explain all the symptoms.

That what I was wondering as well yesterday night.

So, on your advice, I set a breakpoint on makeBufFile:

  (gdb) info br
  Num Type   Disp Enb AddressWhat
  1   breakpoint keep y   0x007229df in makeBufFile
  bt 10
  p CurrentMemoryContext.name


Then, I disabled it and ran the query up to this mem usage:

   VIRTRESSHR S  %CPU %MEM
  20.1g   7.0g  88504 t   0.0 22.5

Then, I enabled the breakpoint and look at around 600 bt and context name
before getting bored. They **all** looked like that:

  Breakpoint 1, BufFileCreateTemp (...)at buffile.c:201
  201 in buffile.c
  #0  BufFileCreateTemp (...) buffile.c:201
  #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)   nodeHashjoin.c:1238
  #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
  #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
  #4  ExecProcNodeInstr (node=)execProcnode.c:462
  #5  ExecProcNode (node=0x31a6418)
  #6  ExecSort (pstate=0x31a6308)
  #7  ExecProcNodeInstr (node=)
  #8  ExecProcNode (node=0x31a6308)
  #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
  
  $421643 = 0x99d7f7 "ExecutorState"

These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log much more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time instead
of the usual 100%.

So it's not exactly a leak, as the memory would be released at the end of the
query, but I suppose these buffers should be allocated in a shorter-lived
context, to avoid this memory bloat, am I right?

> BTW with how many batches does the hash join start?

* batches went from 32 to 1048576 before being growEnabled=false as suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to work_mem, but current space usage is 1.3GB, as
  measured previously, just before the system refused further memory allocations.

Here are the full details about the hash associated with the previous backtrace:

  (gdb) up
  (gdb) up
  (gdb) p *((HashJoinState*)pstate)->hj_HashTable
  $421652 = {
nbuckets = 1048576,
log2_nbuckets = 20,
nbuckets_original = 1048576,
nbuckets_optimal = 1048576,
log2_nbuckets_optimal = 20,
buckets = {unshared = 0x68f12e8, shared = 0x68f12e8},
keepNulls = true,
skewEnabled = false,
skewBucket = 0x0,
skewBucketLen = 0,
nSkewBuckets = 0,
skewBucketNums = 0x0,
nbatch = 1048576,
curbatch = 0,
nbatch_original = 32,
nbatch_outstart = 1048576,
growEnabled = false,
totalTuples = 19541735,
partialTuples = 19541735,
skewTuples = 0,
innerBatchFile = 0xdfcd168,
outerBatchFile = 0xe7cd1a8,

Re: Make some xlogreader messages more accurate

2023-03-02 Thread Bharath Rupireddy
On Thu, Mar 2, 2023 at 1:51 PM Peter Eisentraut
 wrote:
>
> On 28.02.23 07:15, Bharath Rupireddy wrote:
> >> Going through the remaining report_invalid_record() calls I then
> >> adjusted the use of "invalid" vs. "incorrect" in one case.  The message
> >> "record with invalid length" makes it sound like the length was
> >> something like -5, but really we know what the length should be and what
> >> we got wasn't it, so "incorrect" sounded better and is also used in
> >> other error messages in that file.
> > I have no strong opinion about this change. We seem to be using
> > "invalid length" and "incorrect length" interchangeably [1] without
> > distinguishing between "invalid" if length is < 0 and "incorrect" if
> > length >= 0 and not something we're expecting.
>
> Right, this isn't handled very consistently.  I did a pass across all
> "{invalid|incorrect|wrong} {length|size}" messages and tried to make
> them more precise by adding more detail and using the appropriate word.
> What do you think about the attached patch?

Thanks. IMO, when any of the incorrect/invalid/wrong errors occurs, the
wording matters much less than the error itself and why it occurred. While
this kind of uniformity helps, I think it's hard to enforce the same or
similar wording in the future. I prefer leaving the code as-is, therefore
-1 for these changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-02 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 2:39 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v5 patch set for further review.

I largely simplified the code by moving the logic for reading the WAL
buffer page from a separate function into the main while loop. This
enabled me to get rid of the XLogReadFromBuffersGuts() that v5 and other
previous patches had.

Please find the attached v6 patch set for further review. Meanwhile,
I'll continue to work on the review comment raised upthread -
https://www.postgresql.org/message-id/20230301041523.GA1453450%40nathanxps13.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From afb57cd61955f7fc0dba315f3f7c81604a0b47c9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 2 Mar 2023 07:54:06 +
Subject: [PATCH v6] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 144 
 src/backend/access/transam/xlogreader.c |  45 +++-
 src/include/access/xlog.h   |   6 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..b29dc67c38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,150 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and set the read bytes to 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+	TimeLineID tli,
+	Size count,
+	char *buf,
+	Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char*dst;
+	Sizenbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr origptr;
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+
+		origptr = ptr;
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Holding WALBufMappingLock ensures inserters don't overwrite this
+		 * value while we are reading it. We try to acquire it in shared mode
+		 * so that the concurrent WAL readers are also allowed. We try to do as
+		 * less work as possible while holding the lock to not impact
+		 * concurrent WAL writers much. We quickly exit to not cause any
+		 * contention, if the lock isn't immediately available.
+		 */
+		if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+			return;
+
+		endptr = XLogCtl->xlblocks[idx];
+
+		if (expectedEndPtr == endptr)
+		{
+			char	*page;
+			char*data;
+			XLogPageHeader	phdr;
+
+			/*
+			 * We found WAL buffer page containing given XLogRecPtr. Get
+			 * starting address of the page and a pointer to the right location
+			 * of given XLogRecPtr in that page.
+			 */
+			page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+			data = page + ptr % XLOG_BLCKSZ;
+
+			/* Read what is wanted, not the whole page. */
+			if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+			{
+/* All the bytes are in one page. */
+memcpy(dst, data, nbytes);
+*read_bytes += nbytes;
+nbytes = 0;
+			}
+			else
+			{
+Size	nread;
+
+/*
+ * All the bytes are not in one page. Read available bytes on
+ * the current page, copy them over to output buffer and
+ * continue to read remaining bytes.
+ */
+nread = XLOG_BLCKSZ - (data - page);
+Assert(nread > 0 && nread <= nbytes);
+memcpy(dst, data, nread);
+ptr += nread;
+nbytes -= nread;
+dst += nread;
+*read_bytes += nread;
+			}
+
+			/*
+			 * Release the lock as early as possible to avoid creating any
+			 * possible contention.
+			 */
+			LWLockRelease(WALBufMappingLock);
+
+			/*
+			 * The fact that we acquire WALBufMappingLock while reading the WAL
+			 * buffer page itself guarantees that no one else initializes it or
+			 * makes it ready for next use in AdvanceXLInsertBuffer().
+			 *
+			 * However, we perform basic page header checks for ensuring that
+			 * we are not reading a page that just got initialized. Callers
+			 * will anyway perform extensive page-level and record-level
+			 * checks.
+			 */
+			phdr = (XLogPageHeader) page;
+
+			if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+  phdr->xlp_pageaddr == (origptr - (origptr % XLOG_BLCKSZ)) &&
+  phdr->xlp_tli == tli))
+			{
+/*
+ * WAL buffer page doesn't look valid, so return with what we
+ * have read so far.
+ 

Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-02 Thread Daniel Gustafsson
> On 2 Mar 2023, at 10:59, Peter Eisentraut  
> wrote:
> 
> On 28.02.23 21:14, Daniel Gustafsson wrote:
>> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
>> SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
>> boilerplate error handling which IMO leads to increased readability as the
>> error handling *in these cases* don't add much (there are other cases where
>> checking isnull does a lot of valuable work of course).  Personally I much
>> prefer the error-out automatically style of APIs like how palloc saves a ton 
>> of
>> checking the returned allocation for null, this aims at providing a similar
>> abstraction.
> 
> I looked through the patch.  

Thanks!

> The changes look ok to me.  In some cases, more line breaks could be removed 
> (that is, the whole call could be put on one line now).

I've tried to find those that would fit on a single line in the attached v2.

>> This will reduce granularity of error messages, and as the patch sits now it
>> does so a lot since the message is left to work on - I wanted to see if this
>> was at all seen as a net positive before spending time on that part.  I chose
>> an elog since I as a user would prefer to hit an elog instead of a silent 
>> keep
>> going with an assert, this is of course debateable.
> 
> I think an error message like
> 
>"unexpected null value in system cache %d column %d"
> 
> is sufficient.  Since these are "can't happen" errors, we don't need to spend 
> too much extra effort to make it prettier.

They really should never happen, but since we have all the information we need
it seems reasonable to ease debugging.  I've made a slightly extended elog in
the attached patch.

Call sites which had a detailed error message have been left passing isnull,
like for example statext_expressions_load().

> I don't think the unlikely() is going to buy much.  If you are worried on 
> that level, SysCacheGetAttrNotNull() ought to be made inline. Looking through 
> the sites of the changes, I didn't find any callers where I'd be worried on 
> that level.

Fair enough, removed.

--
Daniel Gustafsson



v2-0001-Add-SysCacheGetAttrNotNull-for-guarnteed-not-null.patch
Description: Binary data


Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-02 Thread Daniel Gustafsson
> On 1 Mar 2023, at 00:20, Tom Lane  wrote:

> Also ... at least in assert-enabled builds, maybe we could check that
> the column being fetched this way is actually marked attnotnull?
> That would help to catch misuse.

We could, but that would limit the API to attnotnull columns, rather than when
the caller knows that the attr cannot be NULL either due to attnotnull or due
to intrinsic knowledge based on what is being extracted.

An example of the latter is build_function_result_tupdesc_t() which knows that
proallargtypes cannot be NULL when calling SysCacheGetAttr.

I think I prefer to allow those cases rather than the strict mode where
attnotnull has to be true.  Do you think it's preferable to align the API with
attnotnull and keep the current coding for cases like the above?

--
Daniel Gustafsson





Re: Real config values for bytes needs quotes?

2023-03-02 Thread Peter Eisentraut

On 27.02.23 09:32, Kyotaro Horiguchi wrote:

I found it frustrating that the line "shared_buffers = 0.1GB" in
postgresql.conf was causing an error and that the value required
(additional) surrounding single quotes.  The attached patch makes the
parser accept non-quoted real values followed by a unit for such
variables.  I'm not sure if that syntax fully covers the input syntax
of strtod, but I believe it is sufficient for most use cases.


This seems sensible to fix.  If you're not sure about the details, write 
some test cases. :)





Re: pg_upgrade and logical replication

2023-03-02 Thread Julien Rouhaud
On Thu, Mar 02, 2023 at 03:47:53PM +0530, Amit Kapila wrote:
> On Wed, Mar 1, 2023 at 12:25 PM Julien Rouhaud  wrote:
> >
> >  1) setup a normal physical replication cluster (pg_basebackup, restoring 
> > PITR,
> > whatever), let's call the primary node "A" and replica node "B"
> >  2) ensure WAL level is "logical" on the primary node A
> >  3) create a logical replication slot on every (connectable) database (or 
> > just
> > the one you're interested in if you don't want to preserve everything) 
> > on A
> >  4) create a FOR ALL TABLE publication (again for every databases or just 
> > the
> > one you're interested in)
> >  5) wait for replication to be reasonably if not entirely up to date
> >  6) promote the standby node B
> >  7) retrieve the promotion LSN (from the .history file,
> > pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()...)
> >  8) call pg_replication_slot_advance() with that LSN for all previously 
> > created
> > logical replication slots on A
> >
>
> How are these slots used? Do subscriptions use these slots?

Yes, as this is the only way to make sure that you replicate everything since
the promotion, and only once.  To be more precise, something like this:

CREATE SUBSCRIPTION db_xxx_subscription
   CONNECTION 'dbname=db_xxx user=...'
   PUBLICATION sub_for_db_xxx
   WITH (create_slot = false,
 slot_name = 'slot_for_db_xxx',
 copy_data = false);
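
(And, for completeness, a minimal sketch of steps 3, 7 and 8 above, with the
same placeholder names; the LSN value is of course just illustrative:)

   -- step 3, on A, in each database you want to preserve
   SELECT pg_create_logical_replication_slot('slot_for_db_xxx', 'pgoutput');

   -- step 7, on the promoted node B
   SELECT pg_last_wal_replay_lsn();   -- say it returns '2/5F000110'

   -- step 8, back on A, for each slot created in step 3
   SELECT pg_replication_slot_advance('slot_for_db_xxx', '2/5F000110');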

> >  9) create a normal subscription on all wanted databases on the promoted 
> > node
> > 10) wait for it to catchup if needed on B
> > 12) stop the node B
> > 13) run pg_upgrade on B, creating the new node C
> > 14) start C, run the global ANALYZE and any sanity check needed (hopefully 
> > you
> > would have validated that your application is compatible with that new
> > version before this point)
> > 15) re-enable the subscription on C.  This is currently not possible without
> > losing data, the patch fixes that
> > 16) wait for it to catchup if needed
> > 17) create any missing relation and do the ALTER SUBSCRIPTION ... REFRESH if
> > needed
> > 18) trash B
> > 19) create new nodes D, E... as physical replica from C if needed, possibly
> > using cheaper approach like pg_start_backup() / rsync / pg_stop_backup if
> > needed
> > 20) switchover to C and trash A (or convert it to another replica if you 
> > want)
> > 21) trash the publications on C on all databases
> >
> > As noted the step 15 is currently problematic, and is also problematic in 
> > any
> > variation of that scenario that doesn't require you to entirely recreate the
> > node C from scratch using logical replication, which is what I want to 
> > avoid.
> >
> > This isn't terribly complicated but requires to be really careful if you 
> > don't
> > want to end up with an incorrect node C.  This approach is also currently 
> > not
> > entirely ideal, but hopefully logical replication of sequences and DDL will
> > remove the main sources of downtime when upgrading using logical 
> > replication.
> >
>
> I think there are good chances that one can make mistakes following
> all the above steps unless she is an expert.

Assuming we do fix pg_upgrade behavior with subscriptions, there isn't much
room for error compared to other scenarios:

- pg_upgrade has been there for ages and contains a lot of sanity checks.
  People already use it and AFAIK it's not a major pain point, apart from the
  cases where it can be slow
- ALTER SUBSCRIPTION ... REFRESH will complain if tables are missing locally
- similarly, the logical replica will complain if you're missing some other DDL
  locally
- you only create replicas if you had some in the first place, so it's something
  you should already know how to do.  If not, you didn't have any before the
  upgrade and you still won't have any afterwards

> > My ultimate goal is to provide some tooling to do that in a much simpler 
> > way.
> > Maybe a new "promote to logical" action that would take care of steps 2 to 
> > 9.
> > Users would therefore only have to do this "promotion to logical", and then 
> > run
> > pg_upgrade and create a new physical replication cluster if they want.
> >
>
> Why don't we try to support the direct upgrade of logical replication
> nodes? Have you tried to analyze what are the obstacles and whether we
> can have solutions for those? For example, one of the challenges is to
> support the upgrade of slots, can we copy (from the old cluster) and
> recreate them in the new cluster by resetting LSNs? We can also reset
> origins during the upgrade of subscribers and recommend to first
> upgrade the subscriber node.

I'm not sure I get your question.  This whole thread is about direct upgrade of
logical replication nodes, at least the subscribers, and what is currently
preventing it.

For the publisher nodes, that may be something nice to support (I'm assuming it
could be useful for more complex replication setups) but I'm not interested in
that at the moment as my goal is 

Re: typedef struct LogicalDecodingContext

2023-03-02 Thread Peter Eisentraut

On 02.03.23 03:46, Tom Lane wrote:

Peter Smith  writes:

Apparently, not all C99 compilers can be assumed to work using the
strict C99 rules.


While googling this issue I came across a statement that clang currently
defaults to C17 rules.  Even relatively old compilers might default to
C11.  But considering how long we held on to C89, I doubt we'll want
to move the project minimum to C11 for some years yet.


We need to wait until we de-support Visual Studio versions older than 2019.
(The current minimum is 2015, changed from 2013 for PG16.)







Re: typedef struct LogicalDecodingContext

2023-03-02 Thread Peter Eisentraut

On 02.03.23 04:00, Tom Lane wrote:

I wrote:

I'm a little inclined to see if I can turn on -std=gnu99 on my
clang-based buildfarm animals.  I use that with gcc for my
normal development activities, but now that I see that clang
catches some things gcc doesn't ...


FTR: done on sifaka and longfin.


mylodon already does something similar.





Re: meson: Non-feature feature options

2023-03-02 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review.

On Wed, 1 Mar 2023 at 18:52, Peter Eisentraut
 wrote:
>
> Maybe we can make some of the logic less nested.  Right now there is
>
>  if sslopt != 'none'
>
>if not ssl.found() and sslopt in ['auto', 'openssl']
>
> I think at that point, ssl.found() is never true, so it can be removed.

I agree, ssl.found() can be removed.

> And the two checks for sslopt are nearly redundant.
>
> At the end of the block, there is
>
># At least one SSL library must be found, otherwise throw an error
>if sslopt == 'auto' and auto_features.enabled()
>  error('SSL Library could not be found')
>endif
>  endif
>
> which also implies sslopt != 'none'.  So I think the whole thing could be
>
>  if sslopt in ['auto', 'openssl']
>
>...
>
>  endif
>
>  if sslopt == 'auto' and auto_features.enabled()
>error('SSL Library could not be found')
>  endif
>
> both at the top level.
>

I am a bit confused. I added these checks to allow for other SSL
implementations in the future, which is why I have two nested if checks:
the top one checks whether we need to search for an SSL library at all,
and the nested one checks whether we need to search for this specific SSL
library. What do you think?

The other thing (which I forgot before) is that I need to add an "and not
ssl.found()" condition to the "if sslopt == 'auto' and
auto_features.enabled()" check.

> Another issue, I think this is incorrect:
>
> +  openssl_required ? error('openssl function @0@ is
> required'.format(func)) : \
> + message('openssl function @0@ is
> required'.format(func))
>
> We don't want to issue a message like this when a non-required function
> is missing.

I agree, the message part can be removed.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-03-02 Thread Peter Eisentraut

On 28.02.23 11:10, Kartyshov Ivan wrote:

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==
advantages: no new words in the grammar (like it's done in
pg_last_wal_replay_lsn), no snapshots needed

disadvantages: a little harder to remember names
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);


Of the presented options, I prefer this one.  (Maybe with a "_" between 
"wait" and "lsn".)


But I wonder how a client is going to get the LSN.  How would all of 
this be used by a client?  I can think of a scenario where you have an 
application that issues a bunch of SQL commands and you have some kind 
of pooler in the middle that redirects those commands to different 
hosts, and what you really want is to have it transparently behave as if 
it's just a single host.  Do we want to inject a bunch of "SELECT 
pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?


I'm tempted to think this could be a protocol-layer facility.  Every 
query automatically returns the current LSN, and every query can also 
send along an LSN to wait for, and the client library would just keep 
track of the LSN for (what it thinks of as) the connection.  So you get 
some automatic serialization without having to modify your client code.


That said, exposing this functionality using functions could be a valid 
step in that direction, so that you can at least build out the actual 
internals of the functionality and test it out.
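
(For illustration, a rough sketch of how the function-style API from the
quoted proposal might look from the client side; pg_waitlsn() is the proposed
function, not something that exists today, and the LSN and timeout values are
made up:)

   -- on the primary, once the client's writes have committed
   SELECT pg_current_wal_lsn();           -- e.g. returns '0/3000158'

   -- on the standby, before issuing the dependent read
   SELECT pg_waitlsn('0/3000158', 1000);  -- proposed: wait up to 1s for replay
   -- ... then run the query that must see the primary's writes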





Re: pg_upgrade and logical replication

2023-03-02 Thread Amit Kapila
On Wed, Mar 1, 2023 at 12:25 PM Julien Rouhaud  wrote:
>
> On Wed, Mar 01, 2023 at 11:51:49AM +0530, Amit Kapila wrote:
> > On Tue, Feb 28, 2023 at 10:18 AM Julien Rouhaud  wrote:
> > >
> >
> > Okay, but it would be better if you list out your detailed steps. It
> > would be useful to support the new mechanism in this area if others
> > also find your steps to upgrade useful.
>
> Sure.  Here are the overly detailed steps:
>
>  1) setup a normal physical replication cluster (pg_basebackup, restoring 
> PITR,
> whatever), let's call the primary node "A" and replica node "B"
>  2) ensure WAL level is "logical" on the primary node A
>  3) create a logical replication slot on every (connectable) database (or just
> the one you're interested in if you don't want to preserve everything) on 
> A
>  4) create a FOR ALL TABLE publication (again for every databases or just the
> one you're interested in)
>  5) wait for replication to be reasonably if not entirely up to date
>  6) promote the standby node B
>  7) retrieve the promotion LSN (from the .history file,
> pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()...)
>  8) call pg_replication_slot_advance() with that LSN for all previously 
> created
> logical replication slots on A
>

How are these slots used? Do subscriptions use these slots?

>  9) create a normal subscription on all wanted databases on the promoted node
> 10) wait for it to catchup if needed on B
> 12) stop the node B
> 13) run pg_upgrade on B, creating the new node C
> 14) start C, run the global ANALYZE and any sanity check needed (hopefully you
> would have validated that your application is compatible with that new
> version before this point)
> 15) re-enable the subscription on C.  This is currently not possible without
> losing data, the patch fixes that
> 16) wait for it to catchup if needed
> 17) create any missing relation and do the ALTER SUBSCRIPTION ... REFRESH if
> needed
> 18) trash B
> 19) create new nodes D, E... as physical replica from C if needed, possibly
> using cheaper approach like pg_start_backup() / rsync / pg_stop_backup if
> needed
> 20) switchover to C and trash A (or convert it to another replica if you want)
> 21) trash the publications on C on all databases
>
> As noted the step 15 is currently problematic, and is also problematic in any
> variation of that scenario that doesn't require you to entirely recreate the
> node C from scratch using logical replication, which is what I want to avoid.
>
> This isn't terribly complicated but requires to be really careful if you don't
> want to end up with an incorrect node C.  This approach is also currently not
> entirely ideal, but hopefully logical replication of sequences and DDL will
> remove the main sources of downtime when upgrading using logical replication.
>

I think there are good chances that one can make mistakes following
all the above steps unless she is an expert.

> My ultimate goal is to provide some tooling to do that in a much simpler way.
> Maybe a new "promote to logical" action that would take care of steps 2 to 9.
> Users would therefore only have to do this "promotion to logical", and then 
> run
> pg_upgrade and create a new physical replication cluster if they want.
>

Why don't we try to support the direct upgrade of logical replication
nodes? Have you tried to analyze what are the obstacles and whether we
can have solutions for those? For example, one of the challenges is to
support the upgrade of slots, can we copy (from the old cluster) and
recreate them in the new cluster by resetting LSNs? We can also reset
origins during the upgrade of subscribers and recommend to first
upgrade the subscriber node.

-- 
With Regards,
Amit Kapila.




  1   2   >