Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf

2023-01-18 Thread Michael Paquier
On Wed, Jan 18, 2023 at 10:35:29AM +0100, Jelte Fennema wrote:
> Anything I can do to help with this? Or will you do that yourself?

So, I have taken a second look, and tweaked a few things:
- Addition of a macro for pg_strcasecmp(), to match with
token_matches().
- Fixed the documentation a bit.
- Tweaked some comments and descriptions in the tests; I was rather
fine with the role and group names.

Jelte, do you like this version?
--
Michael
From 0e42a5e9c2a532355df49346e47cc5612da1889d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 19 Jan 2023 16:54:59 +0900
Subject: [PATCH v6] Support same user patterns in pg_ident.conf as in
 pg_hba.conf

While pg_hba.conf has support for non-literal username matches,
pg_ident.conf doesn't have this same functionality. This changes
permission checking in pg_ident.conf to handle all the special cases for
username checking:
1. The "all" keyword
2. Membership checks using the + prefix
3. Using a regex to match against multiple roles

This allows matching certain system users against many different
postgres users with a single line in pg_ident.conf. Without this,
you would need a line for each of the postgres users that a system user
can log in as.

There is some risk of breaking existing pg_ident.conf configs that
contained such special strings and which were treated as literal user
names. But the advantages brought by the features added in this change
far outweigh the risk of breakage. And we only risk breaking when there
exist roles that are called "all", start with a literal + or start with
a literal /. Only "all" seems like a somewhat reasonable role name, but
such a role existing seems unlikely to me given its special meaning
in pg_hba.conf.

**This compatibility change should be mentioned in the release notes.**
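For illustration, a pg_ident.conf map using the new user-field forms could
look like this (map, system-user, and role names here are hypothetical):

# MAPNAME   SYSTEM-USERNAME   PG-USERNAME
mymap       admin             all             # admin may connect as any role
mymap       ops               +operators      # ... as any member of role "operators"
mymap       deploy            /^app_[0-9]+$   # ... as any role matching the regex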
---
 src/backend/libpq/hba.c   |  98 +++-
 src/test/authentication/t/003_peer.pl | 160 --
 doc/src/sgml/client-auth.sgml |  27 -
 3 files changed, 246 insertions(+), 39 deletions(-)

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 029b8e4483..69f87667be 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -73,8 +73,10 @@ typedef struct
 } tokenize_error_callback_arg;
 
 #define token_has_regexp(t)	(t->regex != NULL)
+#define token_is_member_check(t)	(!t->quoted && t->string[0] == '+')
 #define token_is_keyword(t, k)	(!t->quoted && strcmp(t->string, k) == 0)
 #define token_matches(t, k)  (strcmp(t->string, k) == 0)
+#define token_matches_insensitive(t,k) (pg_strcasecmp(t->string, k) == 0)
 
 /*
  * Memory context holding the list of TokenizedAuthLines when parsing
@@ -995,10 +997,11 @@ is_member(Oid userid, const char *role)
  *
  * Each AuthToken listed is checked one-by-one.  Keywords are processed
  * first (these cannot have regular expressions), followed by regular
- * expressions (if any) and the exact match.
+ * expressions (if any), the case-insensitive match (if requested) and
+ * the exact match.
  */
 static bool
-check_role(const char *role, Oid roleid, List *tokens)
+check_role(const char *role, Oid roleid, List *tokens, bool case_insensitive)
 {
 	ListCell   *cell;
 	AuthToken  *tok;
@@ -1006,7 +1009,7 @@ check_role(const char *role, Oid roleid, List *tokens)
 	foreach(cell, tokens)
 	{
 		tok = lfirst(cell);
-		if (!tok->quoted && tok->string[0] == '+')
+		if (token_is_member_check(tok))
 		{
 			if (is_member(roleid, tok->string + 1))
 				return true;
@@ -1018,6 +1021,11 @@ check_role(const char *role, Oid roleid, List *tokens)
 			if (regexec_auth_token(role, tok, 0, NULL) == REG_OKAY)
 				return true;
 		}
+		else if (case_insensitive)
+		{
+			if (token_matches_insensitive(tok, role))
+				return true;
+		}
 		else if (token_matches(tok, role))
 			return true;
 	}
@@ -2614,7 +2622,7 @@ check_hba(hbaPort *port)
 	  hba->databases))
 			continue;
 
-		if (!check_role(port->user_name, roleid, hba->roles))
+		if (!check_role(port->user_name, roleid, hba->roles, false))
 			continue;
 
 		/* Found a record that matched! */
@@ -2804,7 +2812,7 @@ parse_ident_line(TokenizedAuthLine *tok_line, int elevel)
 
 	/*
 	 * Now that the field validation is done, compile a regex from the user
-	 * token, if necessary.
+	 * tokens, if necessary.
 	 */
 	if (regcomp_auth_token(parsedline->system_user, file_name, line_num,
 		   err_msg, elevel))
@@ -2813,6 +2821,13 @@ parse_ident_line(TokenizedAuthLine *tok_line, int elevel)
 		return NULL;
 	}
 
+	if (regcomp_auth_token(parsedline->pg_user, file_name, line_num,
+		   err_msg, elevel))
+	{
+		/* err_msg includes the error to report */
+		return NULL;
+	}
+
 	return parsedline;
 }
 
@@ -2827,6 +2842,8 @@ check_ident_usermap(IdentLine *identLine, const char *usermap_name,
 	const char *pg_user, const char *system_user,
 	bool case_insensitive, bool *found_p, bool *error_p)
 {
+	Oid			roleid;
+
 	*found_p = false;
 	*error_p = false;
 
@@ -2834,6 +2851,9 @@ 

Re: Inconsistency in vacuum behavior

2023-01-18 Thread Alexander Pyhalov

Justin Pryzby wrote on 2023-01-19 04:49:
> On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
>> Hi,
>>
>> Currently there is no error in this case, so additional thrown error would
>> require a new test.
>> Besides, throwing an error here does not make sense - it is just a check
>> for a vacuum permission, I think the right way is to just skip a relation
>> that is not suitable for vacuum.
>> Any thoughts or objections?
>
> Could you check if this is consistent between the behavior of VACUUM
> FULL and CLUSTER ?  See also Nathan's patches.


Hi.

Cluster behaves in a different way - it errors out immediately if the
relation is not owned by the user. For a partitioned rel it would raise an
error later anyway.
VACUUM and VACUUM FULL behave consistently after applying Nikita's patch
(for partitioned and regular tables) - they issue the warning "skipping
TABLE_NAME --- only table or database owner can vacuum it" and return
success status.
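For illustration, with Nikita's patch applied, running this as a role that
does not own the (hypothetical) partitioned table part_tab would look
roughly like:

VACUUM part_tab;
-- WARNING:  skipping "part_tab" --- only table or database owner can vacuum it
-- (the command itself still completes successfully)

CLUSTER part_tab USING part_tab_idx;
-- errors out immediately on the ownership check, before doing anything else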


--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Inconsistency in vacuum behavior

2023-01-18 Thread Nikita Malakhov
Hi!

I've found the discussion you'd mentioned before, checking now.

On Thu, Jan 19, 2023 at 4:49 AM Justin Pryzby  wrote:

> On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
> > Hi,
> >
> > Currently there is no error in this case, so additional thrown error
> would
> > require a new test.
> > Besides, throwing an error here does not make sense - it is just a check
> > for a vacuum
> > permission, I think the right way is to just skip a relation that is not
> > suitable for vacuum.
> > Any thoughts or objections?
>
> Could you check if this is consistent between the behavior of VACUUM
> FULL and CLUSTER ?  See also Nathan's patches.
>
> --
> Justin
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-18 Thread Takamichi Osumi (Fujitsu)
On Wednesday, January 18, 2023 4:06 PM Peter Smith  
wrote:
>  Here are my review comments for the latest patch v16-0001. (excluding the
> test code)
Hi, thank you for your review !

> ==
> 
> General
> 
> 1.
> 
> Since the value of min_apply_delay cannot be < 0,  I was thinking probably it
> should have been declared everywhere in this patch as a
> uint64 instead of an int64, right?
No, we won't be able to adopt this idea.

It seems that we are not able to use unsigned integer types for catalog
columns, so we can't apply this to the pg_subscription.h definitions, and
likewise not to Int64GetDatum(), which is used to store the catalog value,
nor to its argument variable.

Plus, there is a possibility that the Interval value is negative, in which
case we cannot change the int64 variable that receives
the return value of interval2ms().
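For reference, interval values can indeed be negative, which is why the
variable receiving the millisecond count has to stay signed, e.g.:

SELECT interval '-1 second';   -- a perfectly valid, negative interval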

> ==
> 
> Commit message
> 
> 2.
> 
> If the subscription sets min_apply_delay parameter, the logical replication
> worker will delay the transaction commit for min_apply_delay milliseconds.
> 
> ~
> 
> IMO there should be another sentence before this just to say that a new
> parameter is being added:
> 
> e.g.
> This patch implements a new subscription parameter called
> 'min_apply_delay'.
Added.


> ==
> 
> doc/src/sgml/config.sgml
> 
> 3.
> 
> +  
> +   For time-delayed logical replication, the apply worker sends a Standby
> +   Status Update message to the corresponding publisher per the
> indicated
> +   time of this parameter. Therefore, if this parameter is longer than
> +   wal_sender_timeout on the publisher, then the
> +   walsender doesn't get any update message during the delay and
> repeatedly
> +   terminates due to the timeout errors. Hence, make sure this parameter
> is
> +   shorter than the wal_sender_timeout of the
> publisher.
> +   If this parameter is set to zero with time-delayed replication, the
> +   apply worker doesn't send any feedback messages during the
> +   min_apply_delay.
> +  
> 
> 
> This paragraph seemed confusing. I think it needs to be reworded to change all
> of the "this parameter" references because there are at least 3 different
> parameters mentioned in this paragraph. e.g. maybe just change them to
> explicitly name the parameter you are talking about.
> 
> I also think it needs to mention the ‘min_apply_delay’ subscription parameter
> up-front and then refer to it appropriately.
> 
> The end result might be something like I wrote below (this is just my guess ?
> probably you can word it better).
> 
> SUGGESTION
> For time-delayed logical replication (i.e. when the subscription is created 
> with
> parameter min_apply_delay > 0), the apply worker sends a Standby Status
> Update message to the publisher with a period of wal_receiver_status_interval 
> .
> Make sure to set wal_receiver_status_interval less than the
> wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly
> terminate due to the timeout errors. If wal_receiver_status_interval is set 
> to zero,
> the apply worker doesn't send any feedback messages during the subscriber’s
> min_apply_delay period.
Applied. Also, I added one reference for min_apply_delay parameter
at the end of this description.
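For reference, a subscription using the new parameter would be created
roughly like this (a sketch; the connection string and object names are
placeholders):

CREATE SUBSCRIPTION delayed_sub
    CONNECTION 'host=publisher.example dbname=postgres'
    PUBLICATION pub
    WITH (min_apply_delay = '3s');  -- a value without units is taken as milliseconds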


> ==
> 
> doc/src/sgml/ref/create_subscription.sgml
> 
> 4.
> 
> + 
> +  By default, the subscriber applies changes as soon as possible. As
> +  with the physical replication feature
> +  (), it can be
> useful to
> +  have a time-delayed logical replica. This parameter lets the user 
> to
> +  delay the application of changes by a specified amount of
> time. If this
> +  value is specified without units, it is taken as milliseconds. The
> +  default is zero(no delay).
> + 
> 
> 4a.
> As with the physical replication feature (recovery_min_apply_delay), it can be
> useful to have a time-delayed logical replica.
> 
> IMO not sure that the above sentence is necessary. It seems only to be saying
> that this parameter can be useful. Why do we need to say that?
Removed the sentence.


> ~
> 
> 4b.
> "This parameter lets the user to delay" -> "This parameter lets the user 
> delay"
> OR
> "This parameter lets the user to delay" -> "This parameter allows the user to
> delay"
Fixed.

 
> ~
> 
> 4c.
> "If this value is specified without units" -> "If the value is specified 
> without
> units"
Fixed.
 
> ~
> 
> 4d.
> "zero(no delay)." -> "zero (no delay)."
Fixed.

> 
> 
> 5.
> 
> + 
> +  The delay occurs only on WAL records for transaction begins and
> after
> +  the initial table synchronization. It is possible that the
> +  replication delay between publisher and subscriber exceeds the
> value
> +  of this parameter, in which case no delay is added. Note that the
> +  delay is calculated between the WAL time stamp as written on
> +  publisher and the current time on the subscriber. Time
> spent in 

Re: Support logical replication of DDLs

2023-01-18 Thread Amit Kapila
On Thu, Jan 19, 2023 at 8:39 AM Zheng Li  wrote:
>
> On Wed, Jan 18, 2023 at 6:27 AM Amit Kapila  wrote:
> >
> > On Sat, Jan 7, 2023 at 8:58 PM Zheng Li  wrote:
> > >
>
> Foreign Tables can also be considered replicated with DDL replication because 
> we
> don't even need to replicate the data as it resides on the external
> server. Users
> need to configure the external server to allow connection from the
> subscriber for
> foreign tables to work on the subscriber.
>

So, this would mean that we expect the subscriber will also have the
same foreign server as the publisher because we will replicate the
entire connection/user information of the foreign server from the
publisher. But what about data inserted by the publisher on the
foreign server?

> > We should also think
> > about initial sync for all those objects as well.
>
> Agree, we're starting an investigation on initial sync. But I think
> initial sync depends on
> DDL replication to work reliably, not the other way around. DDL replication 
> can
> work on its own without the initial sync of schema, users just need to
> setup the initial
> schema just like they would today.
>

The difference is that today users need to take care of all schema
setup on both and follow changes in the same on the publisher. But
with DDL replication, there has to be a point prior to which both the
nodes have the same setup. For that, before setting up DDL
replication, users need to ensure that both nodes have the same
schema, and then during setup, the user doesn't perform any DDL on the
publisher.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] random_normal function

2023-01-18 Thread Andrey Lepikhov

On 1/19/23 11:01, Tom Lane wrote:
> Andrey Lepikhov  writes:
>> On 1/9/23 23:52, Tom Lane wrote:
>>> BTW, if this does bring the probability of failure down to the
>>> one-in-a-billion range, I think we could also nuke the whole
>>> "ignore:" business, simplifying pg_regress and allowing the
>>> random test to be run in parallel with others.
>
>> We have used the pg_sleep() function to interrupt a query at a certain
>> execution phase. But on some platforms, especially in containers, the
>> query execution time can vary so widely that the pg_sleep() timeout,
>> required to get rid of the dependency on query execution time, has become
>> unacceptable. So, the "ignore" option was the best choice.
>
> But does such a test have any actual value?  If your test infrastructure
> ignores the result, what makes you think you'd notice if the test did
> indeed detect a problem?

Yes, it is good to catch SEGFAULTs and assertions, which may be frequent
because of the logic complexity in the case of timeouts.

> I think "ignore:" was a kluge we put in twenty-plus years ago when our
> testing standards were a lot lower, and it's way past time we got
> rid of it.

Ok, I will try to invent an alternative way for deep (and stable) testing
of timeouts. Thank you for the answer.


--
Regards
Andrey Lepikhov
Postgres Professional





RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-18 Thread Takamichi Osumi (Fujitsu)
On Thursday, January 19, 2023 10:49 AM Peter Smith  
wrote:
> On Wed, Jan 18, 2023 at 6:06 PM Peter Smith 
> wrote:
> >
> >  Here are my review comments for the latest patch v16-0001. (excluding
> > the test code)
> >
> 
> And here are some review comments for the v16-0001 test code.
Hi, thanks for your review !


> ==
> 
> src/test/regress/sql/subscription.sql
> 
> 1. General
> For all comments
> 
> "time delayed replication" -> "time-delayed replication" maybe is better?
Fixed.

> ~~~
> 
> 2.
> -- fail - utilizing streaming = parallel with time delayed replication is not
> supported.
> 
> For readability please put a blank line before this test.
Fixed.

> ~~~
> 
> 3.
> -- success -- value without unit is taken as milliseconds
> 
> "value" -> "min_apply_delay value"
Fixed.


> ~~~
> 
> 4.
> -- success -- interval is converted into ms and stored as integer
> 
> "interval" -> "min_apply_delay interval"
> 
> "integer" -> "an integer"
Both are fixed.


> ~~~
> 
> 5.
> You could also add another test where min_apply_delay is 0
> 
> Then the following combination can be confirmed OK -- success create
> subscription with (streaming=parallel, min_apply_delay=0)
This combination is added with the modification for #6.

> ~~
> 
> 6.
> -- fail - alter subscription with min_apply_delay should fail when streaming =
> parallel is set.
> CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
> streaming = parallel);
> 
> There is another way to do this test without creating a brand-new 
> subscription.
> You could just alter the existing subscription like:
> ALTER ... SET (min_apply_delay = 0)
> then ALTER ... SET (parallel = streaming) then ALTER ... SET (min_apply_delay
> = 123)
Fixed.
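So the test now follows roughly this sequence on the existing subscription
(a sketch of the steps suggested above):

ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0);
ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);  -- expected to fail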

> ==
> 
> src/test/subscription/t/032_apply_delay.pl
> 
> 7. sub check_apply_delay_log
> 
> my ($node_subscriber, $message, $expected) = @_;
> 
> Why pass in the message text? It is always the same so can be hardwired in this
> function, right?
Fixed.

> ~~~
> 
> 8.
> # Get the delay time in the server log
> 
> "int the server log" -> "from the server log" (?)
Fixed.

> ~~~
> 
> 9.
> qr/$message: (\d+) ms/
> or die "could not get delayed time";
> my $logged_delay = $1;
> 
> # Is it larger than expected?
> cmp_ok($logged_delay, '>', $expected,
> "The wait time of the apply worker is long enough expectedly"
> );
> 
> 9a.
> "could not get delayed time" -> "could not get the apply worker wait time"
> 
> 9b.
> "The wait time of the apply worker is long enough expectedly" -> "The apply
> worker wait time has expected duration"
Both are fixed.


> ~~~
> 
> 10.
> sub check_apply_delay_time
> 
> 
> Maybe a brief explanatory comment for this function is needed to explain the
> unreplicated column c.
Added.

> ~~~
> 
> 11.
> $node_subscriber->safe_psql('postgres',
> "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> application_name=$appname' PUBLICATION tap_pub WITH (streaming = on,
> min_apply_delay = '3s')"
> 
> 
> I think there should be a comment here highlighting that you are setting up a
> subscriber time delay of 3 seconds, and then later you can better describe the
> parameters for the checking functions...
Added a comment for CREATE SUBSCRIPTION command.

> e.g. (add this comment)
> # verifies that the subscriber lags the publisher by at least 3 seconds
> check_apply_delay_time($node_publisher, $node_subscriber, '5', '3');
> 
> e.g.
> # verifies that the subscriber lags the publisher by at least 3 seconds
> check_apply_delay_time($node_publisher, $node_subscriber, '8', '3');
Added.


> ~~~
> 
> 12.
> # Test whether ALTER SUBSCRIPTION changes the delayed time of the apply
> worker # (1 day 1 minute).
> $node_subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 8646)"
> );
> 
> Update the comment with another note.
> # Note - The extra 1 min is to account for any decoding/network overhead.
Okay, added the comment. In general, TAP tests fail if we wait for more
than 3 minutes, so a margin of more than 3 minutes can be considered safe.
For example, if (which should not usually happen, but) we consumed more
than 1 minute between this ALTER SUBSCRIPTION SET and the
check_apply_delay_log() below, the test would fail.

So I made the extra time bigger.
> ~~~
> 
> 13.
> # Make sure we have long enough min_apply_delay after the ALTER command
> check_apply_delay_log($node_subscriber, "logical replication apply delay",
> "8000");
> 
> IMO the expectation of 1 day (8646 ms) wait time might be a better number
> for your "expected" value.
> 
> So update the comment/call like this:
> 
> # Make sure the apply worker knows to wait for more than 1 day (8640 ms)
> check_apply_delay_log($node_subscriber, "logical replication apply delay",
> "8640");
Updated the comment and the function call.

Kindly have a look at the updated 

Re: Modify the document of Logical Replication configuration settings

2023-01-18 Thread Michael Paquier
On Wed, Jan 18, 2023 at 02:04:16PM +0530, Bharath Rupireddy wrote:
> [1]
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 89d53f2a64..6f9509267c 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -4326,7 +4326,8 @@ restore_command = 'copy
> "C:\\server\\archivedir\\%f" "%p"'  # Windows
> 
>  Terminate replication connections that are inactive for longer
>  than this amount of time. This is useful for
> -the sending server to detect a standby crash or network outage.
> +the sending server to detect a standby crash or logical replication
> +subscriber crash or network outage.
>  If this value is specified without units, it is taken as 
> milliseconds.
>  The default value is 60 seconds.
>  A value of zero disables the timeout mechanism.

Perhaps we could do that; I am not sure whether this brings much to
this section, though.
--
Michael




Re: [PATCH] random_normal function

2023-01-18 Thread Tom Lane
Andrey Lepikhov  writes:
> On 1/9/23 23:52, Tom Lane wrote:
>> BTW, if this does bring the probability of failure down to the
>> one-in-a-billion range, I think we could also nuke the whole
>> "ignore:" business, simplifying pg_regress and allowing the
>> random test to be run in parallel with others.

> We have used the pg_sleep() function to interrupt a query at a certain
> execution phase. But on some platforms, especially in containers, the
> query execution time can vary so widely that the pg_sleep() timeout,
> required to get rid of the dependency on query execution time, has become
> unacceptable. So, the "ignore" option was the best choice.

But does such a test have any actual value?  If your test infrastructure
ignores the result, what makes you think you'd notice if the test did
indeed detect a problem?

I think "ignore:" was a kluge we put in twenty-plus years ago when our
testing standards were a lot lower, and it's way past time we got
rid of it.

regards, tom lane




bug: copy progress reporting of backends which run multiple COPYs

2023-01-18 Thread Justin Pryzby
pg_stat_progress_copy was added in v14 (8a4f618e7, 9d2d45700).

But if a command JOINs file_fdw tables, the progress report gets bungled
up.  This will warn/assert during file_fdw tests.
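For example, a query of roughly this shape triggers it, because both sides
of the join are file_fdw foreign tables and so two COPY progress reports
run in the same backend (table names here are hypothetical):

SELECT count(*)
FROM ft_words_a a
JOIN ft_words_b b ON a.word = b.word;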

diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 6743e68cef6..7abcb4f60db 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -10,6 +10,7 @@
  */
 #include "postgres.h"
 
+#include "commands/progress.h"
 #include "port/atomics.h"  /* for memory barriers */
 #include "utils/backend_progress.h"
 #include "utils/backend_status.h"
@@ -105,6 +106,20 @@ pgstat_progress_end_command(void)
 	if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
 		return;
 
+// This currently fails file_fdw tests, since pgstat_progress evidently fails
+// to support simultaneous copy commands, as happens during JOIN.
+	/* bytes progress is not available in all cases */
+	if (beentry->st_progress_command == PROGRESS_COMMAND_COPY &&
+		beentry->st_progress_param[PROGRESS_COPY_BYTES_TOTAL] > 0)
+	{
+		volatile int64 *a = beentry->st_progress_param;
+		if (a[PROGRESS_COPY_BYTES_PROCESSED] > a[PROGRESS_COPY_BYTES_TOTAL])
+			elog(WARNING, "PROGRESS_COPY_BYTES_PROCESSED %ld %ld",
+				a[PROGRESS_COPY_BYTES_PROCESSED],
+				a[PROGRESS_COPY_BYTES_TOTAL]);
+		// Assert(a[PROGRESS_COPY_BYTES_PROCESSED] <= a[PROGRESS_COPY_BYTES_TOTAL]);
+	}
+
 	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
 	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
 	beentry->st_progress_command_target = InvalidOid;




Re: Experiments with Postgres and SSL

2023-01-18 Thread Andrey Borodin
On Wed, Jan 18, 2023 at 7:16 PM Greg Stark  wrote:
>
> So I took a look into what it would take to do and I think it would
> actually be quite feasible. The first byte of a standard TLS
> connection can't look anything like the first byte of any flavour of
> Postgres startup packet because it would be the high order bits of the
> length so unless we start having multi-megabyte startup packets
>

This is a fascinating idea! I like it a lot.
But... do we have to treat any unknown start sequence of bytes as a TLS
connection? Or is there some definite subset of possible first bytes
that clearly indicates that this is a TLS connection or not?

Best regards, Andrey Borodin.




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-18 Thread Amit Kapila
On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila  wrote:
>
> On Fri, Jan 13, 2023 at 11:50 AM Peter Smith  wrote:
> >
> > Here are some review comments for patch v79-0002.
> >
>
> So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.
>
> >
> > I feel this patch just adds more complexity for almost no gain:
> > - reducing the 'max_apply_workers_per_subscription' seems not very
> > common in the first place.
> > - even when the GUC is reduced, at that point in time all the workers
> > might be in use so there may be nothing that can be immediately done.
> > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > naturally anyway over time as more transactions are completed so the
> > pool size will reduce accordingly.
> >
>
> I am still not sure if it is worth pursuing this patch because of the
> above reasons. I don't think it would be difficult to add this even at
> a later point in time if we really see a use case for this.
> Sawada-San, IIRC, you raised this point. What do you think?
>
> The other point I am wondering is whether we can have a different way
> to test partial serialization apart from introducing another developer
> GUC (stream_serialize_threshold). One possibility could be that we can
> have a subscription option (parallel_send_timeout or something like
> that) with some default value (current_timeout used in the patch)
> which will be used only when streaming = parallel. Users may want to
> wait for more time before serialization starts depending on the
> workload (say when resource usage is high on a subscriber-side
> machine, or there are concurrent long-running transactions that can
> block parallel apply for a bit longer time). I know with this as well
> it may not be straightforward to test the functionality because we
> can't be sure how many changes would be required for a timeout to
> occur. This is just for brainstorming other options to test the
> partial serialization functionality.
>

Apart from the above, we can also have a subscription option to
specify parallel_shm_queue_size (queue_size used to determine the
queue between the leader and parallel worker) and that can be used for
this purpose. Basically, configuring it to a smaller value can help in
reducing the test time but still, it will not eliminate the need for
dependency on timing we have to wait before switching to partial
serialize mode. I think this can be used in production as well to tune
the performance depending on workload.

Yet another way is to use the existing parameter logical_decoding_mode
[1]. If the value of logical_decoding_mode is 'immediate', then we can
immediately switch to partial serialize mode. This will eliminate the
dependency on timing. The one argument against using this is that it
won't be as clear as a separate parameter like
'stream_serialize_threshold' proposed by the patch but OTOH we already
have a few parameters that serve a different purpose when used on the
subscriber. For example, 'max_replication_slots' is used to define the
maximum number of replication slots on the publisher and the maximum
number of origins on subscribers. Similarly,
'wal_retrieve_retry_interval' is used for different purposes on
subscriber and standby nodes.

[1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html

-- 
With Regards,
Amit Kapila.




Re: [PATCH] random_normal function

2023-01-18 Thread Andrey Lepikhov

On 1/9/23 23:52, Tom Lane wrote:
> BTW, if this does bring the probability of failure down to the
> one-in-a-billion range, I think we could also nuke the whole
> "ignore:" business, simplifying pg_regress and allowing the
> random test to be run in parallel with others.

With the 'ignore' option we have been able to cover with tests some of the
time-dependent features, such as "statement_timeout",
"idle_in_transaction_session_timeout", usage of user timeouts in
extensions, and so on.

We have used the pg_sleep() function to interrupt a query at a certain
execution phase. But on some platforms, especially in containers, the
query execution time can vary so widely that the pg_sleep() timeout,
required to get rid of the dependency on query execution time, has become
unacceptable. So, the "ignore" option was the best choice.

For now, do we only have the "isolation tests" option to create stable
execution-time-dependent tests? Or am I not aware of some test
machinery?


--
Regards
Andrey Lepikhov
Postgres Professional





Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-18 Thread Ankit Kumar Pandey




On 19/01/23 08:58, David Rowley wrote:

> The problem is that the next item looked at is 1 and the value 2 is skipped.

Okay, I see the issue.

> I think you are misinterpreting the results, but the main point
> remains - it's slower.  The explain analyze timing shows the time
> between outputting the first row and the last row. For sort, there's a
> lot of work that needs to be done before you output the first row.

Okay, got it.

> I looked into this a bit further and using the same table as you, and
> the attached set of hacks that adjust the ORDER BY path generation to
> split a Sort into a Sort and Incremental Sort when the number of
> pathkeys to sort by is > 2.

Okay, so this is an issue with the strict sort itself.

I will play around with the current patch, but it should be fine for the
current work and would account for the issue in the strict sort.

> I suspect this might be to do with having to perform more swaps in the
> array elements that we're sorting by with the full sort.  When work_mem is
> large this array no longer fits in CPU cache and I suspect those swaps become
> more costly when there are more cache lines having to be fetched from RAM.

Looks like a possible reason.

> I think we really should fix tuplesort.c so that it batches sorts into
> about L3 CPU cache-sized chunks in RAM rather than trying to sort much
> larger arrays.

I assume we do the same thing for incremental sort, and that's why it
performs better here? Or is it due to the shorter tuple width and thus
being able to fit into memory easily?

> On the
> other hand, we already sort WindowClauses with the most strict sort
> order first which results in a full Sort and no additional sort for
> subsequent WindowClauses that can make use of that sort order.  It
> would be bizarre to reverse the final few lines of common_prefix_cmp
> so that we sort the least strict order first so we end up injecting
> Incremental Sorts into the plan to make it faster!

I would see this as beneficial when the tuple size is too huge for a strict
sort.

Basically:

cost(sort(a,b,c,d,e)) > cost(sort(a) + sort(b) + sort(c) + sort(d,e))

I don't know whether a cost-based decision is realistic or not in the
current state of things.

> I think more benchmarking is required
> so we can figure out if this is a corner case or a common case

I will do a few more benchmarks. I assume all the scaling issues with the
strict sort remain as they are, but no further issue comes up due to this
change itself.

> I'm just unsure if we should write this off as the expected behaviour
> of Sort and continue with the patch or delay the whole thing until we
> make some improvements to sort.

Unless we find more issues, we might be good with the former, but you can
make the better call on this. As much as I would love my first patch to get
merged, I would want it to be right.


Thanks,

Ankit






Re: Make use of assign_checkpoint_completion_target() to calculate CheckPointSegments correctly

2023-01-18 Thread Michael Paquier
On Tue, Jan 17, 2023 at 07:55:53PM +0530, Bharath Rupireddy wrote:
> On Tue, Jan 17, 2023 at 12:31 PM Michael Paquier  wrote:
>> Oops.  It looks like you are right here.  This would impact the
>> calculation of CheckPointSegments on reload when
>> checkpoint_completion_target is updated.  This is wrong since we've
>> switched to max_wal_size as of 88e9823, so this had better be
>> backpatched all the way down.
>>
>> Thoughts?
> 
> +1 to backpatch as setting checkpoint_completion_target will not take
> effect immediately.

Okay, applied down to 11.  I have double-checked the surroundings to
see if there was a similar mistake or something hiding around
CheckpointSegments but noticed nothing (also did some primary/standby
pgbench-ing with periodic reloads of checkpoint_completion_target,
while on it).
--
Michael




Re: doc: add missing "id" attributes to extension packaging page

2023-01-18 Thread Karl O. Pinc
On Tue, 17 Jan 2023 16:43:13 -0600
"Karl O. Pinc"  wrote:

> It might be useful to add --nonet to the xsltproc invocation(s)
> in the Makefile(s).  Just in case; to keep from retrieving
> stylesheets from the net.  (If the option is not already there.
> I didn't look.)
> 
> If this is the first time that PG uses the XSLT import mechanism
> I imagine that "things could go wrong" depending on what sort
> of system is being used to build the docs.  I'm not worried,
> but it is something to note for the committer.

Looks like doc/src/sgml/stylesheet-fo.xsl already uses
xsl:import, although it is unclear to me whether the import
is applied.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Remove source code display from \df+?

2023-01-18 Thread Pavel Stehule
On Wed 18. 1. 2023 at 16:27, Isaac Morland 
wrote:

> On Wed, 18 Jan 2023 at 00:00, Pavel Stehule 
> wrote:
>
>>
>> On Tue 17. 1. 2023 at 20:29, Isaac Morland 
>> wrote:
>>
>>>
>>> I welcome comments and feedback. Now to try to find something manageable
>>> to review.
>>>
>>
>> looks well
>>
>> you miss update psql documentation
>>
>> https://www.postgresql.org/docs/current/app-psql.html
>>
>> If the form \df+ is used, additional information about each function is
>> shown, including volatility, parallel safety, owner, security
>> classification, access privileges, language, source code and description.
>>
>
> Thanks, and sorry about that, it just completely slipped my mind. I’ve
> attached a revised patch with a slight update of the psql documentation.
>
> you should to assign your patch to commitfest app
>>
>> https://commitfest.postgresql.org/
>>
>
> I thought I had: https://commitfest.postgresql.org/42/4133/
>

ok


>
> Did I miss something?
>

it looks well

regards

Pavel


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-18 Thread David Rowley
On Thu, 19 Jan 2023 at 06:27, Ankit Kumar Pandey  wrote:
> Hmm, not really sure why did I miss that. I tried this again (added
> following in postgres.c above
>
> PortalStart)
>
> List* l = NIL;
> l = lappend(l, 1);
> l = lappend(l, 2);
> l = lappend(l, 3);
> l = lappend(l, 4);
>
> ListCell *lc;
> foreach_reverse(lc, l)
> {
> if (foreach_current_index(lc) == 2) // delete 3
> {
> foreach_delete_current(l, lc);
> }
> }

The problem is that the next item looked at is 1 and the value 2 is skipped.

> I also tried a bit unrealistic case.
>
> create table abcdefgh(a int, b int, c int, d int, e int, f int, g int, h int);
>
> insert into abcdefgh select a,b,c,d,e,f,g,h from
> generate_series(1,7) a,
> generate_series(1,7) b,
> generate_series(1,7) c,
> generate_series(1,7) d,
> generate_series(1,7) e,
> generate_series(1,7) f,
> generate_series(1,7) g,
> generate_series(1,7) h;
>
> explain analyze select count(*) over (order by a),
> row_number() over (partition by a order by b) from abcdefgh order by 
> a,b,c,d,e,f,g,h;
>
> In patch version

>   Execution Time: 82748.789 ms

> In Master

>   Execution Time: 71172.586 ms

> Patch version sort took around 15 sec, whereas master sort took ~2 + 46
> = around 48 sec


> Maybe I am interpreting it wrong but still wanted to bring this to notice.

I think you are misinterpreting the results, but the main point
remains - it's slower.  The explain analyze timing shows the time
between outputting the first row and the last row. For sort, there's a
lot of work that needs to be done before you output the first row.

I looked into this a bit further and using the same table as you, and
the attached set of hacks that adjust the ORDER BY path generation to
split a Sort into a Sort and Incremental Sort when the number of
pathkeys to sort by is > 2.

work_mem = '1GB' in all cases below:

I turned off the timing in EXPLAIN so that wasn't a factor in the
results. Generally having more nodes means more timing requests and
that's got > 0 overhead.

explain (analyze,timing off) select * from abcdefgh order by a,b,c,d,e,f,g,h;

7^8 rows
Master: Execution Time: 7444.479 ms
master + sort_hacks.diff: Execution Time: 5147.937 ms

So I'm also seeing Incremental Sort - > Sort faster than a single Sort
for 7^8 rows.

With 5^8 rows:
master: Execution Time: 299.949 ms
master + sort_hacks: Execution Time: 239.447 ms

and 4^8 rows:
master: Execution Time: 62.596 ms
master + sort_hacks: Execution Time: 67.900 ms

So at 4^8 sort_hacks becomes slower.  I suspect this might be to do
with having to perform more swaps in the array elements that we're
sorting by with the full sort.  When work_mem is large this array no
longer fits in CPU cache and I suspect those swaps become more costly
when there are more cache lines having to be fetched from RAM.

I think we really should fix tuplesort.c so that it batches sorts into
about L3 CPU cache-sized chunks in RAM rather than trying to sort much
larger arrays.

I'm just unsure if we should write this off as the expected behaviour
of Sort and continue with the patch or delay the whole thing until we
make some improvements to sort.  I think more benchmarking is required
so we can figure out if this is a corner case or a common case. On the
other hand, we already sort WindowClauses with the most strict sort
order first which results in a full Sort and no additional sort for
subsequent WindowClauses that can make use of that sort order.  It
would be bizarre to reverse the final few lines of common_prefix_cmp
so that we sort the least strict order first so we end up injecting
Incremental Sorts into the plan to make it faster!

David
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 093ace0864..a714e83a69 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5085,11 +5085,30 @@ create_ordered_paths(PlannerInfo *root,
 		 * incremental sort when there are presorted keys.
 		 */
 		if (presorted_keys == 0 || !enable_incremental_sort)
-			sorted_path = (Path *) create_sort_path(root,
-													ordered_rel,
-													input_path,
-													root->sort_pathkeys,
-													limit_tuples);
+		{
+			if (list_length(root->sort_pathkeys) > 2)
+			{
+				sorted_path = (Path *) create_sort_path(root,
+

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund  wrote:
> > You seem to be saying that it's a problem if we don't update reltuples
> > -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> > But why? Why can't we just do nothing sometimes? I mean in general,
> > leaving aside the heuristics I came up with for a moment?
>
> The problem isn't that we might apply the heuristic once, that'd be fine. But
> that there's nothing preventing it from applying until there basically are no
> tuples left, as long as the vacuum is frequent enough.
>
> As a demo: The attached sql script ends up with a table containing 10k rows,
> but reltuples being set to 1 million.

I saw that too. But then I checked again a few seconds later, and
autoanalyze had run, so reltuples was 10k. Just like it would have if
there was no VACUUM statements in your script.

-- 
Peter Geoghegan




Experiments with Postgres and SSL

2023-01-18 Thread Greg Stark
I had a conversation a while back with Heikki where he expressed that
it was annoying that we negotiate SSL/TLS the way we do since it
introduces an extra round trip. Aside from the performance
optimization I think accepting standard TLS connections would open the
door to a number of other opportunities that would be worth it on
their own.

So I took a look into what it would take to do and I think it would
actually be quite feasible. The first byte of a standard TLS
connection can't look anything like the first byte of any flavour of
Postgres startup packet because it would be the high order bits of the
length so unless we start having multi-megabyte startup packets

So I put together a POC patch and it's working quite well and didn't
require very much kludgery. Well, it required some but it's really not
bad. I do have a bug I'm still trying to work out and the code isn't
quite in committable form but I can send the POC patch.

Other things it would open the door to in order from least
controversial to most

* Hiding Postgres behind a standard SSL proxy terminating SSL without
implementing the Postgres protocol.

* "Service Mesh" type tools that hide multiple services behind a
single host/port ("Service Mesh" is just a new buzzword for "proxy").

* Browser-based protocol implementations using websockets for things
like pgadmin or other tools to connect directly to postgres using
Postgres wire protocol but using native SSL implementations.

* Postgres could even implement an HTTP based version of its protocol
and enable things like queries or browser based tools using straight
up HTTP requests so they don't need to use websockets.

* Postgres could implement other protocols to serve up data like
status queries or monitoring metrics, using HTTP based standard
protocols instead of using our own protocol.

Incidentally I find the logic in ProcessStartupPacket incredibly
confusing. It took me a while before I realized it's using tail
recursion to implement the startup logic. I think it would be way more
straightforward and extensible if it used a much more common iterative
style. I think it would make it possible to keep more state than just
ssl_done and gss_done without changing the function signature every
time for example.

--
greg




Re: Support logical replication of DDLs

2023-01-18 Thread Zheng Li
On Wed, Jan 18, 2023 at 6:27 AM Amit Kapila  wrote:
>
> On Sat, Jan 7, 2023 at 8:58 PM Zheng Li  wrote:
> >
> > I added documentation and changed user interface design in the
> > attached v60 patch set.
> > The patch set addressed comments from Peter in [1].
> >
> > The motivation for the user interface change is that we want to manage
> > DDL replication feature in stages with fine grained replication
> > levels.
> > For example, we can focus on reviewing and testing table commands
> > first, then other commands. It also make sense to introduce different
> > DDL replication levels
> > from the user perspective as pointed out in [1]. We can add more
> > replication levels along the way.
> >
> > In this patch DDL replication is disabled by default and it can be
> > enabled at different levels
> > using the new PUBLICATION option 'ddl'. This option currently has two
> > levels and are
> > only allowed to be set if the PUBLICATION is FOR ALL TABLES.
> >
> >   all: this option enables replication of all supported DDL commands.
> >   table: this option enables replication of Table DDL commands which 
> > include:
> >   -CREATE/ALTER/DROP TABLE
> >   -CREATE TABLE AS
> >
>
> I think this point needs some thought. When you say 'all', how do you
> think it will help to support DDL replication for foreign tables,
> materialized views, views, etc where changes to such relations are
> currently not supported by logical replication?

I think DDL replication naturally provides support for views and
materialized views,
if the publication is FOR ALL TABLES since all the tables in the
view/MV definition
are replicated.

Foreign Tables can also be considered replicated with DDL replication because we
don't even need to replicate the data as it resides on the external
server. Users
need to configure the external server to allow connection from the
subscriber for
foreign tables to work on the subscriber.
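For reference, with the interface described above, enabling table-level DDL
replication would be along these lines (a sketch based on the quoted
description; the exact option syntax is whatever the v60 patch set
implements):

CREATE PUBLICATION ddl_pub FOR ALL TABLES WITH (ddl = 'table');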

> We should also think
> about initial sync for all those objects as well.

Agree, we're starting an investigation on initial sync. But I think
initial sync depends on
DDL replication to work reliably, not the other way around. DDL replication can
work on its own without the initial sync of schema, users just need to
setup the initial
schema just like they would today.

Regards,
Znae




Re: Deduplicate logicalrep_read_tuple()

2023-01-18 Thread Peter Smith
On Wed, Jan 18, 2023 at 6:26 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> logicalrep_read_tuple() duplicates code for LOGICALREP_COLUMN_TEXT and
> LOGICALREP_COLUMN_BINARY introduced by commit 9de77b5. While it
> doesn't hurt anyone, deduplication makes code a bit leaner by 57 bytes
> [1]. I've attached a patch for $SUBJECT.
>
> Thoughts?
>

The code looks the same but there is a subtle comment difference where
previously only LOGICALREP_COLUMN_BINARY case said:
 /* not strictly necessary but per StringInfo practice */

So if you de-duplicate the code then should that comment be modified to say
/* not strictly necessary for LOGICALREP_COLUMN_BINARY but per
StringInfo practice */

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 18:21:33 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 5:49 PM Andres Freund  wrote:
> > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> > > Perhaps we should make vac_estimate_reltuples focus on the pages that
> > > VACUUM newly set all-visible each time (not including all-visible
> > > pages that got scanned despite being all-visible) -- only that subset
> > > of scanned_pages seems to be truly relevant. That way you wouldn't be
> > > able to do stuff like this. We'd have to start explicitly tracking the
> > > number of pages that were newly set in the VM in vacuumlazy.c to be
> > > able to do that, but that seems like a good idea anyway.
> >
> > Can you explain a bit more what you mean with "focus on the pages" means?
>
> We don't say anything about pages we didn't scan right now -- only
> scanned_pages have new information, so we just extrapolate. Why not go
> even further than that, by only saying something about the pages that
> were both scanned and newly set all-visible?
>
> Under this scheme, the pages that were scanned but couldn't be newly
> set all-visible are treated just like the pages that weren't scanned
> at all -- they get a generic estimate from the existing reltuples.

I don't think that'd work well either. If we actually removed a large chunk of
the tuples in the table it should be reflected in reltuples, otherwise costing
and autovac scheduling suffers. And you might not be able to set all those
page to all-visible, because of more recent rows.


> > I don't think it's hard to see this causing problems. Set
> > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
> > frequently vacuum manually. Incrementally delete old data.  Unless analyze
> > saves you - which might not be run or might have a different scale factor or
> > not be run manually - reltuples will stay exactly the same, despite data
> > changing substantially.
>
> You seem to be saying that it's a problem if we don't update reltuples
> -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> But why? Why can't we just do nothing sometimes? I mean in general,
> leaving aside the heuristics I came up with for a moment?

The problem isn't that we might apply the heuristic once, that'd be fine. But
that there's nothing preventing it from applying until there basically are no
tuples left, as long as the vacuum is frequent enough.

As a demo: The attached sql script ends up with a table containing 10k rows,
but reltuples being set to 1 million.
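The attached script is not reproduced here, but its shape is roughly the
following sketch (not the actual attachment): each DELETE touches only a
small contiguous slice of pages, so the VACUUM that follows scans well under
2% of the table and the reltuples estimate is never updated even though the
live row count keeps shrinking.

CREATE TABLE foo AS SELECT generate_series(1, 1000000) AS i;
VACUUM ANALYZE foo;                              -- reltuples is now ~1 million

DELETE FROM foo WHERE i > 10000 AND i <= 20000;  -- a ~1% slice of the table
VACUUM foo;                                      -- scans well under 2% of foo
DELETE FROM foo WHERE i > 20000 AND i <= 30000;
VACUUM foo;
-- ... and so on, until only the first 10000 rows remain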

VACUUM foo;
EXPLAIN (ANALYZE) SELECT * FROM foo;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=3.251..4.693 rows=10000 loops=1)   │
│ Planning Time: 0.056 ms                                                                                      │
│ Execution Time: 5.491 ms                                                                                     │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)


> It will cause problems if we remove the heuristics. Much less
> theoretical problems. What about those?

I don't think what I describe is a theoretical problem.


> How often does VACUUM scan so few pages, anyway? We've been talking
> about how ineffective autovacuum_vacuum_scale_factor is, at great
> length, but now you're saying that it *can* meaningfully trigger not
> just one VACUUM, but many VACUUMs, where no more than 2% of  rel_pages
> are not all-visible (pages, not tuples)? Not just once, mind you, but
> many times?

I've seen autovacuum_vacuum_scale_factor set to 0.01 repeatedly. But that's
not even needed - you just need a long-running transaction preventing at least
one dead row from getting removed and hit autovacuum_freeze_max_age. There'll
be continuous VACUUMs of the table, all only processing a small fraction of
the table.

And I have many times seen bulk loading / deletion scripts that do VACUUM on a
regular schedule, which also could easily trigger this.


> And in the presence of some kind of highly variable tuple
> size, where it actually could matter to the planner at some point?

I don't see how a variable tuple size needs to be involved? As the EXPLAIN
ANALYZE above shows, we'll end up with wrong row count estimates etc.


> I would be willing to just avoid even these theoretical problems if
> there was some way to do so, that didn't also create new problems. I
> have my doubts that that is possible, within the constraints of
> updating pg_class. Or the present constraints, at least. I am not a
> miracle worker -- I can only do so much with the information that's
> available to 

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund  wrote:
> > This creates an awkward but logical question, though: what if
> > dead_tuples doesn't go down at all? What if VACUUM actually has to
> > increase it, because VACUUM runs so slowly relative to the workload?
>
> Sure, that can happen - but it's not made better by having wrong stats :)

Maybe it's that simple. It's reasonable to wonder how far we want to
go with letting dead tuples grow and grow, even when VACUUM is running
constantly. It's not a question that has an immediate and obvious
answer IMV.

Maybe the real question is: is this an opportunity to signal to the
user (say via a LOG message) that VACUUM cannot keep up? That might be
very useful, in a general sort of way (not just to avoid new
problems).

> We have reasonably sophisticated accounting in pgstats what newly live/dead
> rows a transaction "creates". So an obvious (and wrong) idea is just decrement
> reltuples by the number of tuples removed by autovacuum.

Did you mean dead_tuples?

> But we can't do that,
> because inserted/deleted tuples reported by backends can be removed by
> on-access pruning and vacuumlazy doesn't know about all changes made by its
> call to heap_page_prune().

I'm not following here. Perhaps this is a good sign that I should stop
working for the day.  :-)

> But I think that if we add a
>   pgstat_count_heap_prune(nredirected, ndead, nunused)
> around heap_page_prune() and a
>   pgstat_count_heap_vacuum(nunused)
> in lazy_vacuum_heap_page(), we'd likely end up with a better approximation
> than what vac_estimate_reltuples() does, in the "partially scanned" case.

What does vac_estimate_reltuples() have to do with dead tuples?

--
Peter Geoghegan




Re: Minimal logical decoding on standbys

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
> On 1/6/23 4:40 AM, Andres Freund wrote:
> > Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
> > for us instead? This will likely also be needed in real applications after 
> > all.
> > 
> 
> Not sure I got it. What the C helper would be supposed to do?

Call LogStandbySnapshot().


> With a reload in place in my testing, now I notice that the catalog_xmin
> is updated on the primary physical slot after logical slots invalidation
> when reloading hot_standby_feedback from "off" to "on".
> 
> This is not the case after a re-start (aka catalog_xmin is NULL).
> 
> I think a re-start and reload should produce identical behavior on
> the primary physical slot. If so, I'm tempted to think that the catalog_xmin
> should be updated in case of a re-start too (even if all the logical slots 
> are invalidated)
> because the slots are not dropped yet. What do you think?

I can't quite follow the steps leading up to the difference. Could you list
them in a bit more detail?



> > Can we do something cheaper than rewriting the entire database? Seems
> > rewriting a single table ought to be sufficient?
> > 
> 
> While implementing the test at the table level I discovered that It looks 
> like there is no guarantee that say a "vacuum full pg_class;" would
> produce a conflict.

I assume that's mostly when there weren't any removals.


> Indeed, from what I can see in my testing it could generate a 
> XLOG_HEAP2_PRUNE with snapshotConflictHorizon to 0:
> 
> "rmgr: Heap2   len (rec/tot):107/   107, tx:848, lsn: 
> 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"
> 
> 
> Having a snapshotConflictHorizon to zero leads to 
> ResolveRecoveryConflictWithSnapshot() simply returning
> without any conflict handling.

That doesn't have to mean anything bad. Some row versions can be removed without
creating a conflict. See HeapTupleHeaderAdvanceConflictHorizon(), specifically

 * Ignore tuples inserted by an aborted transaction or if the tuple was
 * updated/deleted by the inserting transaction.



> It does look like that in the standby decoding case that's not the right 
> behavior (and that the xid that generated the PRUNING should be used instead)
> , what do you think?

That'd not work, because that'll be typically newer than the catalog_xmin. So
we'd start invalidating things left and right, despite not needing to.


Did you see anything else around this making you suspicious?


> > > +##
> > > +# Test standby promotion and logical decoding behavior
> > > +# after the standby gets promoted.
> > > +##
> > > +
> > 
> > I think this also should test the streaming / walsender case.
> > 
> 
> Do you mean cascading standby?

I mean a logical walsender that starts on a standby and continues across
promotion of the standby.

Greetings,

Andres Freund




Re: Parallelize correlated subqueries that execute within each worker

2023-01-18 Thread James Coleman
On Wed, Jan 18, 2023 at 2:09 PM Tomas Vondra
 wrote:
>
> Hi,
>
> This patch hasn't been updated since September, and it got broken by
> 4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental sort
> stuff a little bit. But the breakage was rather limited, so I took a
> stab at fixing it - attached is the result, hopefully correct.

Thanks for fixing this up; the changes look correct to me.

> I also added a couple minor comments about stuff I noticed while
> rebasing and skimming the patch, I kept those in separate commits.
> There's also a couple pre-existing TODOs.

I started work on some of these, but wasn't able to finish this
evening, so I don't have an updated series yet.

> James, what's your plan with this patch. Do you intend to work on it for
> PG16, or are there some issues I missed in the thread?

I'd love to see it get into PG16. I don't have any known issues, but
reviewing activity has been light. Originally Robert had had some
concerns about my original approach; I think my updated approach
resolves those issues, but it'd be good to have that sign-off.

Beyond that I'm mostly looking for review and evaluation of the
approach I've taken; of note is my description of that in [1].

> One of the queries in in incremental_sort changed plans a little bit:
>
> explain (costs off) select distinct
>   unique1,
>   (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
> from tenk1 t, generate_series(1, 1000);
>
> switched from
>
>  Unique  (cost=18582710.41..18747375.21 rows=1 width=8)
>->  Gather Merge  (cost=18582710.41..18697375.21 rows=1000 ...)
>  Workers Planned: 2
>  ->  Sort  (cost=18582710.39..18593127.06 rows=417 ...)
>Sort Key: t.unique1, ((SubPlan 1))
>  ...
>
> to
>
>  Unique  (cost=18582710.41..18614268.91 rows=1 ...)
>->  Gather Merge  (cost=18582710.41..18614168.91 rows=2 ...)
>  Workers Planned: 2
>  ->  Unique  (cost=18582710.39..18613960.39 rows=1 ...)
>->  Sort  (cost=18582710.39..18593127.06 ...)
>  Sort Key: t.unique1, ((SubPlan 1))
>...
>
> which probably makes sense, as the cost estimate decreases a bit.

Off the cuff that seems fine. I'll read it over again when I send the
updated series.

James Coleman

1: 
https://www.postgresql.org/message-id/CAAaqYe8m0DHUWk7gLKb_C4abTD4nMkU26ErE%3Dahow4zNMZbzPQ%40mail.gmail.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund  wrote:
> On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> > Perhaps we should make vac_estimate_reltuples focus on the pages that
> > VACUUM newly set all-visible each time (not including all-visible
> > pages that got scanned despite being all-visible) -- only that subset
> > of scanned_pages seems to be truly relevant. That way you wouldn't be
> > able to do stuff like this. We'd have to start explicitly tracking the
> > number of pages that were newly set in the VM in vacuumlazy.c to be
> > able to do that, but that seems like a good idea anyway.
>
> Can you explain a bit more what you mean by "focus on the pages"?

We don't say anything about pages we didn't scan right now -- only
scanned_pages have new information, so we just extrapolate. Why not go
even further than that, by only saying something about the pages that
were both scanned and newly set all-visible?

Under this scheme, the pages that were scanned but couldn't be newly
set all-visible are treated just like the pages that weren't scanned
at all -- they get a generic estimate from the existing reltuples.
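
Spelled out against the existing calculation quoted elsewhere in the thread
(the names on the right-hand side are invented here just to state the idea):

    current:   new_reltuples = old_density * (total_pages - scanned_pages) + scanned_tuples
    proposed:  new_reltuples = old_density * (total_pages - pages_newly_all_visible)
                               + tuples_on_pages_newly_all_visible

i.e. scanned pages that could not be newly set all-visible keep the old
density, exactly like the pages that were skipped.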

> I don't think it's hard to see this causing problems. Set
> autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
> frequently vacuum manually. Incrementally delete old data.  Unless analyze
> saves you - which might not be run or might have a different scale factor or
> not be run manually - reltuples will stay exactly the same, despite data
> changing substantially.

You seem to be saying that it's a problem if we don't update reltuples
-- an estimate -- when less than 2% of the table is scanned by VACUUM.
But why? Why can't we just do nothing sometimes? I mean in general,
leaving aside the heuristics I came up with for a moment?

It will cause problems if we remove the heuristics. Much less
theoretical problems. What about those?

How often does VACUUM scan so few pages, anyway? We've been talking
about how ineffective autovacuum_vacuum_scale_factor is, at great
length, but now you're saying that it *can* meaningfully trigger not
just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages
are not all-visible (pages, not tuples)? Not just once, mind you, but
many times? And in the presence of some kind of highly variable tuple
size, where it actually could matter to the planner at some point?

I would be willing to just avoid even these theoretical problems if
there was some way to do so, that didn't also create new problems. I
have my doubts that that is possible, within the constraints of
updating pg_class. Or the present constraints, at least. I am not a
miracle worker -- I can only do so much with the information that's
available to vac_update_relstats (and/or the information that can
easily be made available).

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 17:00:48 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 4:37 PM Andres Freund  wrote:
> > I can, it should be just about trivial code-wise. A bit queasy about trying
> > to foresee the potential consequences.
> 
> That's always going to be true, though.
> 
> > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
> > what VACUUM itself observed, ignoring any concurrently reported dead
> > tuples. As far as I can tell, when vacuum takes a long time, that can lead
> > to severely under-accounting dead tuples.
> 
> Did I not mention that one? There are so many that it can be hard to
> keep track! That's why I catalog them.

I don't recall you doing so, but there's a lot of emails and holes in my head.


> This creates an awkward but logical question, though: what if
> dead_tuples doesn't go down at all? What if VACUUM actually has to
> increase it, because VACUUM runs so slowly relative to the workload?

Sure, that can happen - but it's not made better by having wrong stats :)


> > I do think this is an argument for splitting up dead_tuples into separate
> > "components" that we track differently. I.e. tracking the number of dead
> > items, not-yet-removable rows, and the number of dead tuples reported from
> > DML statements via pgstats.
> 
> Is it? Why?

We have reasonably sophisticated accounting in pgstats of what newly live/dead
rows a transaction "creates". So an obvious (and wrong) idea is to just decrement
reltuples by the number of tuples removed by autovacuum. But we can't do that,
because inserted/deleted tuples reported by backends can be removed by
on-access pruning and vacuumlazy doesn't know about all changes made by its
call to heap_page_prune().

But I think that if we add a
  pgstat_count_heap_prune(nredirected, ndead, nunused)
around heap_page_prune() and a
  pgstat_count_heap_vacuum(nunused)
in lazy_vacuum_heap_page(), we'd likely end up with a better approximation
than what vac_estimate_reltuples() does, in the "partially scanned" case.



> I'm all in favor of doing that, of course. I just don't particularly
> think that it's related to this other problem. One problem is that we
> count dead tuples incorrectly because we don't account for the fact
> that things change while VACUUM runs. The other problem is that the
> thing that is counted isn't broken down into distinct subcategories of
> things -- things are bunched together that shouldn't be.

If we only adjust the counters incrementally, as we go, we'd not update them
at the end of vacuum. I think it'd be a lot easier to only update the counters
incrementally if we split ->dead_tuples into sub-counters.

So I don't think it's entirely unrelated.

You probably could get close without splitting the counters, by just pushing
down the counting, and only counting redirected and unused during heap
pruning. But I think it's likely to be more accurate with the split counter.



> Oh wait, you were thinking of what I said before -- my "awkward but
> logical question". Is that it?

I'm not quite following? The "awkward but logical" bit is in the email I'm
just replying to, right?

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> How about a float autovacuum_no_auto_cancel_age where positive values are
> treated as absolute values, and negative values are a multiple of
> autovacuum_freeze_max_age? And where the "computed" age is capped at
> vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.
>
> And maybe a default setting of -1.8 or so?

Attached is a new revision, v5. I'm not happy with this, but thought
it would be useful to show you where I am with it.

It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age)
that can sometimes work as a cutoff that works similarly to both
freeze_max_age and multixact_freeze_max_age, but usually works as a
multiplier. It's both an XID age value, an MXID age value, and a
multiplier on XID/MXID age values.

What if it was just a simple multiplier on
freeze_max_age/multixact_freeze_max_age, without changing any other
detail?

-- 
Peter Geoghegan


v5-0002-Add-table-age-trigger-concept-to-autovacuum.patch
Description: Binary data


v5-0001-Add-autovacuum-trigger-instrumentation.patch
Description: Binary data


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-18 Thread Peter Smith
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith  wrote:
>
>  Here are my review comments for the latest patch v16-0001. (excluding
> the test code)
>

And here are some review comments for the v16-0001 test code.

==

src/test/regress/sql/subscription.sql

1. General
For all comments

"time delayed replication" -> "time-delayed replication" maybe is better?

~~~

2.
-- fail - utilizing streaming = parallel with time delayed replication
is not supported.

For readability please put a blank line before this test.

~~~

3.
-- success -- value without unit is taken as milliseconds

"value" -> "min_apply_delay value"

~~~

4.
-- success -- interval is converted into ms and stored as integer

"interval" -> "min_apply_delay interval"

"integer" -> "an integer"

~~~

5.
You could also add another test where min_apply_delay is 0

Then the following combination can be confirmed OK -- success create
subscription with (streaming=parallel, min_apply_delay=0)
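
For instance (subscription name invented; min_apply_delay is the option added
by this patch):

-- success - min_apply_delay = 0 is allowed together with streaming = parallel
CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist'
    PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 0);
ALTER SUBSCRIPTION regress_testsub2 SET (slot_name = NONE);
DROP SUBSCRIPTION regress_testsub2;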

~~

6.
-- fail - alter subscription with min_apply_delay should fail when
streaming = parallel is set.
CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, streaming = parallel);

There is another way to do this test without creating a brand-new
subscription. You could just alter the existing subscription like:
ALTER ... SET (min_apply_delay = 0)
then ALTER ... SET (streaming = parallel)
then ALTER ... SET (min_apply_delay = 123)
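
Concretely, something along these lines, where only the last statement is
expected to fail:

ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0);
ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);  -- should fail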

==

src/test/subscription/t/032_apply_delay.pl

7. sub check_apply_delay_log

my ($node_subscriber, $message, $expected) = @_;

Why pass in the message text? It is always the same so can be hardwired
in this function, right?

~~~

8.
# Get the delay time in the server log

"int the server log" -> "from the server log" (?)

~~~

9.
qr/$message: (\d+) ms/
or die "could not get delayed time";
my $logged_delay = $1;

# Is it larger than expected?
cmp_ok($logged_delay, '>', $expected,
"The wait time of the apply worker is long enough expectedly"
);

9a.
"could not get delayed time" -> "could not get the apply worker wait time"

9b.
"The wait time of the apply worker is long enough expectedly" -> "The
apply worker wait time has expected duration"

~~~

10.
sub check_apply_delay_time


Maybe a brief explanatory comment for this function is needed to
explain the unreplicated column c.

~~~

11.
$node_subscriber->safe_psql('postgres',
"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub WITH (streaming = on,
min_apply_delay = '3s')"


I think there should be a comment here highlighting that you are
setting up a subscriber time delay of 3 seconds, and then later you
can better describe the parameters for the checking functions...

e.g. (add this comment)
# verifies that the subscriber lags the publisher by at least 3 seconds
check_apply_delay_time($node_publisher, $node_subscriber, '5', '3');

e.g.
# verifies that the subscriber lags the publisher by at least 3 seconds
check_apply_delay_time($node_publisher, $node_subscriber, '8', '3');

~~~

12.
# Test whether ALTER SUBSCRIPTION changes the delayed time of the apply worker
# (1 day 1 minute).
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 8646)"
);

Update the comment with another note.
# Note - The extra 1 min is to account for any decoding/network overhead.

~~~

13.
# Make sure we have long enough min_apply_delay after the ALTER command
check_apply_delay_log($node_subscriber, "logical replication apply delay", "8000");

IMO the expectation of 1 day (8646 ms) wait time might be a better
number for your "expected" value.

So update the comment/call like this:

# Make sure the apply worker knows to wait for more than 1 day (86400000 ms)
check_apply_delay_log($node_subscriber, "logical replication apply delay", "86400000");

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Inconsistency in vacuum behavior

2023-01-18 Thread Justin Pryzby
On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
> Hi,
> 
> Currently there is no error in this case, so an additional thrown error would
> require a new test.
> Besides, throwing an error here does not make sense - it is just a check for
> a vacuum permission; I think the right way is to just skip a relation that is
> not suitable for vacuum.
> Any thoughts or objections?

Could you check if this is consistent between the behavior of VACUUM
FULL and CLUSTER ?  See also Nathan's patches.
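
For example, something along these lines (object and role names are invented;
the point is only to compare whether the three commands skip, warn, or error
the same way for a non-owner):

CREATE ROLE regress_nonowner;
CREATE TABLE vactest (i int);
CREATE INDEX vactest_i_idx ON vactest (i);

SET ROLE regress_nonowner;
VACUUM vactest;                       -- plain VACUUM of a table we don't own
VACUUM FULL vactest;                  -- rewrite path
CLUSTER vactest USING vactest_i_idx;  -- the other rewrite path
RESET ROLE;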

-- 
Justin




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 16:19:02 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 4:02 PM Andres Freund  wrote:
> > vacuum-no   reltuples/n_live_tup   n_dead_tup
> > 1           476                    500
> > 2           2500077                500
> > 3           1250184                500
> > 4           625266                 500
> > 5           312821                 500
> > 10          10165                  500
> >
> > Each vacuum halves reltuples.  That's going to screw badly with all kinds of
> > things. Planner costs completely out of whack etc.
>
> I get that that could be a big problem, even relative to the more
> immediate problem of VACUUM just spinning like it does in your test
> case. What do you think we should do about it?

The change made in 7c91a0364fc imo isn't right. We need to fix it. I think
it's correct that now pgstat_report_vacuum() doesn't include recently dead
tuples in livetuples - they're still tracked via deadtuples.  But it's wrong
for vacuum's call to vac_update_relstats() to not include recently dead
tuples, at least when we only scanned part of the relation.

I think the right thing would be to not undo the semantic change of
7c91a0364fc as a whole, but instead take recently-dead tuples into account
only in the "Okay, we've covered the corner cases." part, to avoid the
spiraling seen above.

Not super clean, but it seems somewhat fundamental that we'll re-scan pages
full of recently-dead tuples in the near future. If we, in a way, subtract
the recently dead tuples from reltuples in this cycle, we shouldn't do so
again in the next - but not taking recently-dead tuples into account does exactly that.


It's a bit complicated because of the APIs involved. vac_estimate_reltuples()
computes vacrel->new_live_tuples and contains the logic for how to compute the
new reltuples. But we use ->new_live_tuples both in vac_update_relstats(),
where we, imo, should take recently-dead tuples into account for partial scans,
and in pgstat_report_vacuum(), where we shouldn't.  I guess we would need to
add output parameters for both "reltuples" and "new_live_tuples".







> What do you think about my idea of focussing on the subset of pages newly
> set all-visible in the VM?

I don't understand it yet.

On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> Perhaps we should make vac_estimate_reltuples focus on the pages that
> VACUUM newly set all-visible each time (not including all-visible
> pages that got scanned despite being all-visible) -- only that subset
> of scanned_pages seems to be truly relevant. That way you wouldn't be
> able to do stuff like this. We'd have to start explicitly tracking the
> number of pages that were newly set in the VM in vacuumlazy.c to be
> able to do that, but that seems like a good idea anyway.

Can you explain a bit more what you mean by "focus on the pages"?



> > I wonder if this is part of the reason for the distortion you addressed with
> > 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> > large relation 2% of blocks is a significant number of rows, and simply
> > never adjusting reltuples seems quite problematic. At the very least we
> > ought to account for dead tids we removed or such, instead of just freezing
> > reltuples.
>
> As I mentioned, it only kicks in when relpages is *precisely* the same
> as last time (not one block more or one block less), *and* we only
> scanned less than 2% of rel_pages. It's quite possible that that's
> insufficient, but I can't imagine it causing any new problems.

In OLTP workloads relpages will often not change, even if there's lots of
write activity, because there's plenty of free space in the relation, and there's
something not-removable on the last page. relpages also won't change if data
is deleted anywhere but the end.

I don't think it's hard to see this causing problems. Set
autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
frequently vacuum manually. Incrementally delete old data.  Unless analyze
saves you - which might not be run or might have a different scale factor or
not be run manually - reltuples will stay exactly the same, despite data
changing substantially.
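
A rough way to watch for that effect (table name invented; whether reltuples
really stays pinned depends on relpages staying identical and the scan
touching less than 2% of the table):

CREATE TABLE t AS SELECT g AS i FROM generate_series(1, 1000000) g;
VACUUM t;
DELETE FROM t WHERE i <= 10000;   -- incrementally delete ~1% of the rows
VACUUM t;                         -- only the affected pages get scanned
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';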


Greetings,

Andres Freund




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-18 Thread Peter Smith
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith  wrote:
>
>  Here are my review comments for the latest patch v16-0001. (excluding
> the test code)
>
...
>
> 8. AlterSubscription (general)
>
> I observed during testing there are 3 different errors….
>
> At subscription CREATE time you can get this error:
> ERROR:  min_apply_delay > 0 and streaming = parallel are mutually
> exclusive options
>
> If you try to ALTER the min_apply_delay when already streaming =
> parallel you can get this error:
> ERROR:  cannot enable min_apply_delay for subscription in streaming =
> parallel mode
>
> If you try to ALTER the streaming to be parallel if there is already a
> min_apply_delay > 0 then you can get this error:
> ERROR:  cannot enable streaming = parallel mode for subscription with
> min_apply_delay
>
> ~
>
> IMO there is no need to have 3 different error message texts.  I think
> all these cases are explained by just the first text (ERROR:
> min_apply_delay > 0 and streaming = parallel are mutually exclusive
> options)
>
>

After checking the regression test output I can see the merit of your
separate error messages like this, even if they are maybe not strictly
necessary. So feel free to ignore my previous review comment.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf

2023-01-18 Thread Michael Paquier
On Wed, Jan 18, 2023 at 10:35:29AM +0100, Jelte Fennema wrote:
> Anything I can do to help with this? Or will you do that yourself?

Nope.  I just need some time to finish wrapping it, that's all.
--
Michael


signature.asc
Description: PGP signature


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund  wrote:
> I can, it should be just about trivial code-wise. A bit queasy about trying to
> foresee the potential consequences.

That's always going to be true, though.

> A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
> what VACUUM itself observed, ignoring any concurrently reported dead
> tuples. As far as I can tell, when vacuum takes a long time, that can lead to
> severely under-accounting dead tuples.

Did I not mention that one? There are so many that it can be hard to
keep track! That's why I catalog them.

As you point out, it's the dead tuples equivalent of my
ins_since_vacuum complaint. The problem is exactly analogous to my
recent complaint about insert-driven autovacuums.

> I wonder if we ought to remember the dead_tuples value at the start of the
> heap scan and use that to adjust the final dead_tuples value. I'd lean towards
> over-counting rather than under-counting and thus probably would go for
> something like
>
>   tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);
>
> i.e. assuming we might have missed all concurrently reported dead tuples.

This is exactly what I was thinking of doing for both issues (the
ins_since_vacuum one and this similar dead tuples one). It's
completely logical.

This creates an awkward but logical question, though: what if
dead_tuples doesn't go down at all? What if VACUUM actually has to
increase it, because VACUUM runs so slowly relative to the workload?
Of course the whole point is to make it more likely that VACUUM will
keep up with the workload. I'm just not quite sure that the
consequences of doing it that way are strictly a good thing. Bearing
in mind that we don't differentiate between recently dead and dead
here.
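
The classic way to accumulate recently-dead-but-not-yet-removable tuples is an
old open snapshot (names are made up here); VACUUM VERBOSE then reports those
rows as dead but not yet removable:

-- session 1: hold an old snapshot open
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM foo;

-- session 2: churn the table and vacuum it
DELETE FROM foo WHERE i < 100000;
VACUUM (VERBOSE) foo;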

Fun fact: autovacuum can spin with pgbench because of recently dead
tuples, even absent an old snapshot/long running xact, if you set
things aggressively enough:

https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

I think that we probably need to do something like always make sure
that dead_items goes down by a small amount at the end of each VACUUM,
even when that's a lie. Maybe we also have a log message about
autovacuum not keeping up, so as to not feel too guilty about it. You
know, to give the user a chance to reconfigure autovacuum so that it
stops happening.

> Of course we could instead move to something like ins_since_vacuum and reset
> it at the *start* of the vacuum. But that'd make the error case harder,
> without giving us more accuracy, I think?

It would. It seems illogical to me.

> I do think this is an argument for splitting up dead_tuples into separate
> "components" that we track differently. I.e. tracking the number of dead
> items, not-yet-removable rows, and the number of dead tuples reported from DML
> statements via pgstats.

Is it? Why?

I'm all in favor of doing that, of course. I just don't particularly
think that it's related to this other problem. One problem is that we
count dead tuples incorrectly because we don't account for the fact
that things change while VACUUM runs. The other problem is that the
thing that is counted isn't broken down into distinct subcategories of
things -- things are bunched together that shouldn't be.

Oh wait, you were thinking of what I said before -- my "awkward but
logical question". Is that it?

-- 
Peter Geoghegan




Re: Switching XLog source from archive to streaming when primary available

2023-01-18 Thread Nathan Bossart
On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
> On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart  
> wrote:
>> With your patch, we might replay one of these "old" files in pg_wal instead
>> of the complete version of the file from the archives,
> 
> That's true even today, without the patch, no? We're not changing the
> existing behaviour of the state machine. Can you explain how it
> happens with the patch?

My point is that on HEAD, we will always prefer a complete archive file.
With your patch, we might instead choose to replay an old file in pg_wal
because we are artificially advancing the state machine.  IOW even if
there's a complete archive available, we might not use it.  This is a
behavior change, but I think it is okay.

>> Would you mind testing this scenario?
> 
> How about something like below for testing the above scenario? If it
> looks okay, I can add it as a new TAP test file.
> 
> 1. Generate WAL files f1 and f2 and archive them.
> 2. Check the replay lsn and WAL file name on the standby, when it
> replays upto f2, stop the standby.
> 3. Set recovery to fail on the standby, and stop the standby.
> 4. Generate f3, f4 (partially filled) on the primary.
> 5. Manually copy f3, f4 to the standby's pg_wal.
> 6. Start the standby, since recovery is set to fail, and there're new
> WAL files (f3, f4) under its pg_wal, it must replay those WAL files
> (check the replay lsn and WAL file name, it must be f4) before
> switching to streaming.
> 7. Generate f5 on the primary.
> 8. The standby should receive f5 and replay it (check the replay lsn
> and WAL file name, it must be f5).
> 9. Set streaming to fail on the standby and set recovery to succeed.
> 10. Generate f6 on the primary.
> 11. The standby should receive f6 via archive and replay it (check the
> replay lsn and WAL file name, it must be f6).

I meant testing the scenario where there's an old file in pg_wal, a
complete file in the archives, and your new GUC forces replay of the
former.  This might be difficult to do in a TAP test.  Ultimately, I just
want to validate the assumptions discussed above.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 13:45:19 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 1:08 PM Andres Freund  wrote:
> > I suggested nearby to only have ANALYZE dead_tuples it if there's been no
> > [auto]vacuum since the stats entry was created. That allows recovering from
> > stats resets, be it via crashes or explicitly. What do you think?
>
> I like that idea. It's far simpler than the kind of stuff I was
> thinking about, and probably just as effective. Even if it introduces
> some unforeseen problem (which seems unlikely), we can still rely on
> pgstat_report_vacuum() to set things straight before too long.
>
> Are you planning on writing a patch for this? I'd be very interested
> in seeing this through. Could definitely review it.

I can, it should be just about trivial code-wise. A bit queasy about trying to
foresee the potential consequences.


A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
what VACUUM itself observed, ignoring any concurrently reported dead
tuples. As far as I can tell, when vacuum takes a long time, that can lead to
severely under-accounting dead tuples.

We probably lose track of a bit more than 50% of the dead tuples reported
since vacuum started. During the heap scan phase we don't notice all the
tuples reported before the current scan point, and then we don't notice them
at all during the index/heap vacuuming.

The only saving grace is that it'll be corrected at the next VACUUM. But the
next vacuum might very well be delayed noticeably due to this.
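
For reference, the next autovacuum of a table fires roughly once n_dead_tup
exceeds the threshold below (ignoring per-table reloptions and the
insert-driven path), so an under-counted n_dead_tup pushes that point out
correspondingly:

SELECT s.relname, s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
           AS dead_tuples_needed_to_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.relname = 'foo';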


This is similar to the issue around ins_since_vacuum that Peter pointed out.


I wonder if we ought to remember the dead_tuples value at the start of the
heap scan and use that to adjust the final dead_tuples value. I'd lean towards
over-counting rather than under-counting and thus probably would go for
something like

  tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);

i.e. assuming we might have missed all concurrently reported dead tuples.




Of course we could instead move to something like ins_since_vacuum and reset
it at the *start* of the vacuum. But that'd make the error case harder,
without giving us more accuracy, I think?


I do think this is an argument for splitting up dead_tuples into separate
"components" that we track differently. I.e. tracking the number of dead
items, not-yet-removable rows, and the number of dead tuples reported from DML
statements via pgstats.

Greetings,

Andres Freund




Re: Issue with psql's create_help.pl under perlcritic

2023-01-18 Thread Michael Paquier
On Wed, Jan 18, 2023 at 08:43:01AM -0500, Andrew Dunstan wrote:
> Looks fine.

Thanks for double-checking, will apply shortly.

> Interesting it's not caught by perlcritic on my Fedora 35
> instance, nor my Ubuntu 22.04 instance.

Perhaps that just shows up on the latest version of perlcritic?  I use
1.148 in the box where this issue shows up, for all the stable
branches of Postgres.
--
Michael


signature.asc
Description: PGP signature


Unicode grapheme clusters

2023-01-18 Thread Bruce Momjian
Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

‍⚕️喙

Unicode  UTF8 bytes     libc len  description
U+1F469  f0 9f 91 a9       2      woman
U+1F3FC  f0 9f 8f bc       2      emoji modifier fitzpatrick type-3 (skin tone)
U+200D   e2 80 8d          0      zero width joiner (ZWJ)
U+2695   e2 9a 95          1      staff with snake
U+FE0F   ef b8 8f          0      variation selector-16 (VS16) (previous character as emoji)
U+1FA7A  f0 9f a9 ba       2      stethoscope

Now, in Debian 11 character apps like vi, I see:

  a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses.  I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope.  Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:


https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

This is by design. wcwidth() is utterly broken. Any terminal or terminal
application that uses it is also utterly broken. Forget about emoji
wcwidth() doesn't even work with combining characters, zero width
joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

show lc_ctype;
  lc_ctype
-
 en_US.UTF-8

select octet_length('‍⚕️喙');
 octet_length
--
   21

select character_length('‍⚕️喙');
 character_length
--
6

The octet_length() is verified as correct by counting the UTF8 bytes
above.  I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.

I then started looking at how Postgres computes and uses _display_
width.  The display width, when properly processed, as by Firefox, is 4
(two double-wide displayed characters).  Based on the libc display
lengths above and the incorrect displayed character widths in Debian 11, it
would be 7.

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but the PQdsplen() logic can be
exercised on a whole string by calling pg_wcswidth() from the gdb debugger:

pg_wcswidth(const char *pwcs, size_t len, int encoding)
UTF8 encoding == 6

(gdb) print (int)pg_wcswidth("abcd", 4, 6)
$8 = 4
(gdb) print (int)pg_wcswidth("‍⚕️喙", 21, 6))
$9 = 7

Here is the psql output:

SELECT octet_length('‍⚕️喙'), '‍⚕️喙', character_length('‍⚕️喙');
 octet_length | ?column? | character_length
--+--+--
   21 | ‍⚕️喙  |6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.
 
tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:


https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund  wrote:
> vacuum-no   reltuples/n_live_tup   n_dead_tup
> 1           476                    500
> 2           2500077                500
> 3           1250184                500
> 4           625266                 500
> 5           312821                 500
> 10          10165                  500
>
> Each vacuum halves reltuples.  That's going to screw badly with all kinds of
> things. Planner costs completely out of whack etc.

I get that that could be a big problem, even relative to the more
immediate problem of VACUUM just spinning like it does in your test
case. What do you think we should do about it? What do you think about
my idea of focussing on the subset of pages newly set all-visible in
the VM?

> I wonder if this is part of the reason for the distortion you addressed with
> 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> large relation 2% of blocks is a significant number of rows, and simply never
> adjusting reltuples seems quite problematic. At the very least we ought to
> account for dead tids we removed or such, instead of just freezing reltuples.

As I mentioned, it only kicks in when relpages is *precisely* the same
as last time (not one block more or one block less), *and* we only
scanned less than 2% of rel_pages. It's quite possible that that's
insufficient, but I can't imagine it causing any new problems.

I think that we need to be realistic about what's possible while
storing a small, fixed amount of information. There is always going to
be some distortion of this kind. We can do something about the
obviously pathological cases, where errors can grow without bound. But
you have to draw the line somewhere, unless you're willing to replace
the whole approach with something that stores historic metadata.

What kind of tradeoff do you want to make here? I think that you have
to make one.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 14:37:20 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 2:22 PM Andres Freund  wrote:
> > The problem with the change is here:
> >
> > /*
> >  * Okay, we've covered the corner cases.  The normal calculation is to
> >  * convert the old measurement to a density (tuples per page), then
> >  * estimate the number of tuples in the unscanned pages using that figure,
> >  * and finally add on the number of tuples in the scanned pages.
> >  */
> > old_density = old_rel_tuples / old_rel_pages;
> > unscanned_pages = (double) total_pages - (double) scanned_pages;
> > total_tuples = old_density * unscanned_pages + scanned_tuples;
> > return floor(total_tuples + 0.5);
> 
> My assumption has always been that vac_estimate_reltuples() is prone
> to issues like this because it just doesn't have access to very much
> information each time it runs. It can only see the delta between what
> VACUUM just saw, and what the last VACUUM (or possibly the last
> ANALYZE) saw according to pg_class. You're always going to find
> weaknesses in such a model if you go looking for them. You're always
> going to find a way to salami slice your way from good information to
> total nonsense, if you pick the right/wrong test case, which runs
> VACUUM in a way that allows whatever bias there may be to accumulate.
> It's sort of like the way floating point values can become very
> inaccurate through a process that allows many small inaccuracies to
> accumulate over time.

Sure. To start with, there's always going to be some inaccuracies when you
assume an even distribution across a table. But I think this goes beyond
that.

This problem occurs with a completely even distribution, exactly the same
inputs to the estimation function every time.  My example under-sold the
severity, because I had only 5% non-deletable tuples. Here it is with 50%
non-removable tuples (I've seen way worse than 50% in many real-world cases),
and a bunch of complexity removed (attached).

vacuum-no   reltuples/n_live_tup   n_dead_tup
1           476                    500
2           2500077                500
3           1250184                500
4           625266                 500
5           312821                 500
10          10165                  500

Each vacuum halves reltuples.  That's going to screw badly with all kinds of
things. Planner costs completely out of whack etc.



I wonder if this is part of the reason for the distortion you addressed with
74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
large relation 2% of blocks is a significant number of rows, and simply never
adjusting reltuples seems quite problematic. At the very least we ought to
account for dead tids we removed or such, instead of just freezing reltuples.

Greetings,

Andres Freund


vactest_more_extreme.sql
Description: application/sql


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan  wrote:
> The problems in this area tend to be that vac_estimate_reltuples()
> behaves as if it sees a random sample, when in fact it's far from
> random -- it's the same scanned_pages as last time, and the ten other
> times before that. That's a feature of your test case, and other
> similar vac_estimate_reltuples test cases that I came up with in the
> past. Both scenarios involved skipping using the visibility map in
> multiple successive VACUUM operations.

FWIW, the problem in your test case just goes away if you change this line:

DELETE FROM foo WHERE i < (1000 * 0.1)

To this:

DELETE FROM foo WHERE i < (1000 * 0.065)

Which is not a huge difference, overall. This effect is a consequence
of the heuristics I added in commit 74388a1a, so it's only present on
Postgres 15+.

Whether or not this is sufficient protection is of course open to
interpretation. One problem with those heuristics (as far as your test
case is concerned) is that they either work, or they don't work. For
example they're conditioned on "old_rel_pages == total_page".

-- 
Peter Geoghegan




Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread Peter Smith
On Thu, Jan 19, 2023 at 2:55 AM David G. Johnston
 wrote:
>
> On Wed, Jan 18, 2023 at 8:38 AM Tom Lane  wrote:
>>
>> "David G. Johnston"  writes:
>> > ...  I was going for the html effect
>> > of having these views chunked into their own pages, any other changes being
>> > non-detrimental.
>>
>> But is that a result we want?  It will for example break any bookmarks
>> that people might have for these documentation entries.  It will also
>> pretty thoroughly break the cross-version navigation links in this
>> part of the docs.
>>
>>
>> Maybe the benefit is worth those costs, but I'm entirely not convinced
>> of that.  I think we need to tread pretty lightly when rearranging
>> longstanding documentation-layout decisions.
>>
>

David already gave a good summary [1], but since I was the OP here is
the background of v10-0001 from my PoV.

~

The original $SUBJECT requirements evolved to also try to make each
view appear on a separate page after that was suggested by DavidJ [2].
I was unable to achieve per-page views "without radically changing the
document structure." [3], but DavidJ found a way [4] to do it using
refentry. I then wrote the patch v8-0003 using that strategy, which
after more rebasing became the v10-0001 you see today.

I did prefer the view-per-page results (although I also only use HTML
docs). But my worry is that there seem still to be a few unknowns
about how this might affect other (not the HTML) renderings of the
docs. If you think that risk is too great, or if you feel this patch
will cause unwarranted link/bookmark grief, then I am happy to just
drop it.

--
[1] DJ overview -
https://www.postgresql.org/message-id/CAKFQuwaVm%3D6d_sw9Wrp4cdSm5_k%3D8ZVx0--v2v4BH4KnJtqXqg%40mail.gmail.com
[2] DJ suggested view-per-page -
https://www.postgresql.org/message-id/CAKFQuwa9JtoCBVc6CJb7NC5FqMeEAy_A8X4H8t6kVaw7fz9LTw%40mail.gmail.com
[3] PS don't know how to do it -
https://www.postgresql.org/message-id/CAHut%2BPv5Efz1TLWOLSoFvoyC0mq%2Bs92yFSd534ctWSdjEFtKCw%40mail.gmail.com
[4] DJ how to do it using refentry -
https://www.postgresql.org/message-id/CAKFQuwYkM5UZT%2B6tG%2BNgZvDcd5VavS%2BxNHsGsWC8jS-KJsxh7w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan  wrote:
> Maybe you're right to be concerned to the degree that you're concerned
> -- I'm not sure. I'm just adding what I see as important context.

The problems in this area tend to be that vac_estimate_reltuples()
behaves as if it sees a random sample, when in fact it's far from
random -- it's the same scanned_pages as last time, and the ten other
times before that. That's a feature of your test case, and other
similar vac_estimate_reltuples test cases that I came up with in the
past. Both scenarios involved skipping using the visibility map in
multiple successive VACUUM operations.

Perhaps we should make vac_estimate_reltuples focus on the pages that
VACUUM newly set all-visible each time (not including all-visible
pages that got scanned despite being all-visible) -- only that subset
of scanned_pages seems to be truly relevant. That way you wouldn't be
able to do stuff like this. We'd have to start explicitly tracking the
number of pages that were newly set in the VM in vacuumlazy.c to be
able to do that, but that seems like a good idea anyway.

This probably has consequences elsewhere, but maybe that's okay. We
know when the existing pg_class has no information, since that is
explicitly encoded by a reltuples of -1. Obviously we'd need to be
careful about stuff like that. Overall, the danger from assuming that
"unsettled" pages (pages that couldn't be newly set all-visible by
VACUUM) have a generic tuple density seems less than the danger of
assuming that they're representative. We know that we're bound to scan
these same pages in the next VACUUM anyway, so they'll have another
chance to impact our view of the table's tuple density at that time.

--
Peter Geoghegan




Re: CREATEROLE users vs. role properties

2023-01-18 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 12:15:33PM -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 2:29 PM Robert Haas  wrote:
>> 1. It's still possible for a CREATEROLE user to hand out role
>> attributes that they don't possess. The new prohibitions in
>> cf5eb37c5ee0cc54c80d95c1695d7fca1f7c68cb prevent a CREATEROLE user
>> from handing out membership in a role on which they lack sufficient
>> permissions, but they don't prevent a CREATEROLE user who lacks
>> CREATEDB from creating a new user who does have CREATEDB. I think we
>> should subject the CREATEDB, REPLICATION, and BYPASSRLS attributes to
>> the same rule that we now use for role memberships: you've got to have
>> the property in order to give it to someone else. In the case of
>> CREATEDB, this would tighten the current rules, which allow you to
>> give out CREATEDB without having it. In the case of REPLICATION and
>> BYPASSRLS, this would liberalize the current rules: right now, a
>> CREATEROLE user cannot give REPLICATION or BYPASSRLS to another user
>> even if they possess those attributes.
>>
>> This proposal doesn't address the CREATEROLE or CONNECTION LIMIT
>> properties. It seems possible to me that someone might want to set up
>> a CREATEROLE user who can't make more such users, and this proposal
>> doesn't manufacture any way of doing that. It also doesn't let you
>> constraint the ability of a CREATEROLE user to set a CONNECTION LIMIT
>> for some other user. I think that's OK. It might be nice to have ways
>> of imposing such restrictions at some point in the future, but it is
>> not very obvious what to do about such cases and, importantly, I don't
>> think there's any security impact from failing to address those cases.
>> If a CREATEROLE user without CREATEDB can create a new role that does
>> have CREATEDB, that's a privilege escalation. If they can hand out
>> CREATEROLE, that isn't: they already have it.
> 
> Here is a patch implementing the above proposal. Since this is fairly
> closely related to already-committed work, I would like to get this
> into v16. That way, all the changes to how CREATEROLE works will go
> into a single release, which seems less confusing for users. It is
> also fairly clear to me that this is an improvement over the status
> quo. Sometimes things that seem clear to me turn out to be false, so
> if this change seems like a problem to you, please let me know.

This seems like a clear improvement to me.  However, as the attribute
system becomes more sophisticated, I think we ought to improve the error
messages in user.c.  IMHO messages like "permission denied" could be
greatly improved with some added context.

For example, if I want to change a role's password, I need both CREATEROLE
and ADMIN OPTION on the role, but the error message only mentions
CREATEROLE.

postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole password 'test';
ERROR:  must have CREATEROLE privilege to change another user's password

Similarly, if I want to allow a role to grant REPLICATION to another role,
I have to give it CREATEROLE, REPLICATION, and membership with ADMIN
OPTION.  If the role is missing CREATEROLE or membership with ADMIN OPTION,
it'll only ever see a "permission denied" error.

postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# alter role createrole with replication;
ALTER ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# grant otherrole to createrole;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# grant otherrole to createrole with admin option;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ALTER ROLE

If it has both CREATEROLE and membership with ADMIN OPTION (but not
REPLICATION), it'll see a more helpful message.

postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# grant otherrole to createrole with admin option;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  must have replication privilege to change 

Re: Rethinking the implementation of ts_headline()

2023-01-18 Thread Tom Lane
Alvaro Herrera  writes:
> I tried this other test, based on looking at the new regression tests
> you added,

> SELECT ts_headline('english', '
> Day after day, day after day,
>   We stuck, nor breath nor motion,
> As idle as a painted Ship
>   Upon a painted Ocean.
> Water, water, every where
>   And all the boards did shrink;
> Water, water, every where,
>   Nor any drop to drink.
> S. T. Coleridge (1772-1834)
> ', to_tsquery('english', '(day & drink) | (idle & painted)'), 
> 'MaxFragments=5, MaxWords=9, MinWords=4');
>ts_headline   
> ─
>  motion,↵
>  As idle as a painted Ship↵
>Upon
> (1 fila)

> and was surprised that the match for the 'day & drink' arm of the OR
> disappears from the reported headline.

I'd argue that that's exactly what should happen.  It's supposed to
find as-short-as-possible cover strings that satisfy the query.
In this case, satisfying the 'day & drink' condition would require
nearly the entire input text, whereas 'idle & painted' can be
satisfied in just the third line.  So what you get is the third line,
slightly expanded because of some later rules that like to add
context if the cover is shorter than MaxWords.  I don't find anything
particularly principled about the old behavior:

>  Day after day, day after day,↵
>We stuck ... motion,   ↵
>  As idle as a painted Ship  ↵
>Upon

It's including hits for "day" into the cover despite the lack of any
nearby match to "drink".

> Another thing I think might be a regression is the way fragments are
> selected.  Consider what happens if I change the "idle & painted" in the
> earlier query to "idle <-> painted", and MaxWords is kept low:

Of course, "idle <-> painted" is satisfied nowhere in this text
(the words are there, but not adjacent).  So the cover has to
run from the last 'day' to the 'drink'.  I think v15 is deciding
that it runs from the first 'day' to the 'drink', which while not
exactly wrong is not the shortest cover.  The rest of this is just
minor variations in what mark_hl_fragments() decides to do based
on the precise cover string it's given.  I don't dispute that
mark_hl_fragments() could be improved, but this patch doesn't touch
its algorithm and I think that doing so should be material for a
different patch.  (I have no immediate ideas about what would be
a better algorithm for it, anyway.)

> (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> verse, which perhaps they shouldn't do, since it's not preceded by
> 'idle'.)

Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
highlight all the query words within that string, whether they
constitute part of the match or not.  I can see arguments on both
sides of doing it that way; it was probably more sensible before
we had phrase match than it is now.  But again, changing that phase
of the processing is outside the scope of this patch.  I'm just
trying to undo the damage I did to the cover-string-selection phase.

regards, tom lane




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund  wrote:
> The problem with the change is here:
>
> /*
>  * Okay, we've covered the corner cases.  The normal calculation is to
>  * convert the old measurement to a density (tuples per page), then
>  * estimate the number of tuples in the unscanned pages using that figure,
>  * and finally add on the number of tuples in the scanned pages.
>  */
> old_density = old_rel_tuples / old_rel_pages;
> unscanned_pages = (double) total_pages - (double) scanned_pages;
> total_tuples = old_density * unscanned_pages + scanned_tuples;
> return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone
to issues like this because it just doesn't have access to very much
information each time it runs. It can only see the delta between what
VACUUM just saw, and what the last VACUUM (or possibly the last
ANALYZE) saw according to pg_class. You're always going to find
weaknesses in such a model if you go looking for them. You're always
going to find a way to salami slice your way from good information to
total nonsense, if you pick the right/wrong test case, which runs
VACUUM in a way that allows whatever bias there may be to accumulate.
It's sort of like the way floating point values can become very
inaccurate through a process that allows many small inaccuracies to
accumulate over time.

Maybe you're right to be concerned to the degree that you're concerned
-- I'm not sure. I'm just adding what I see as important context.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 13:42:40 -0800, Andres Freund wrote:
> The real point of change appears to be 10->11.
>
> There's a relevant looking difference in the vac_estimate_reltuples call:
> 10:
>   /* now we can compute the new value for pg_class.reltuples */
>   vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
>                                                         nblocks,
>                                                         vacrelstats->tupcount_pages,
>                                                         num_tuples);
>
> 11:
>   /* now we can compute the new value for pg_class.reltuples */
>   vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
>                                                          nblocks,
>                                                          vacrelstats->tupcount_pages,
>                                                          live_tuples);
> which points to:
>
> commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
> Author: Tom Lane 
> Date:   2018-03-22 15:47:29 -0400
>
> Sync up our various ways of estimating pg_class.reltuples.

The problem with the change is here:

/*
 * Okay, we've covered the corner cases.  The normal calculation is to
 * convert the old measurement to a density (tuples per page), then
 * estimate the number of tuples in the unscanned pages using that figure,
 * and finally add on the number of tuples in the scanned pages.
 */
old_density = old_rel_tuples / old_rel_pages;
unscanned_pages = (double) total_pages - (double) scanned_pages;
total_tuples = old_density * unscanned_pages + scanned_tuples;
return floor(total_tuples + 0.5);


Because we'll re-scan the pages for not-yet-removable rows in subsequent
vacuums, the next vacuum will process the same pages again. By using
scanned_tuples = live_tuples, we basically remove not-yet-removable tuples
from reltuples, each time.

The commit *did* try to account for that to some degree:

+/* also compute total number of surviving heap entries */
+vacrelstats->new_rel_tuples =
+vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples;


but new_rel_tuples isn't used for pg_class.reltuples or pgstat.


This is pretty nasty. We use reltuples for a lot of things. And while analyze
might fix it sometimes, that won't reliably be the case, particularly when
there are repeated autovacuums due to a longrunning transaction - there's no
cause for auto-analyze to trigger again soon, while autovacuum will go at it
again and again.

Greetings,

Andres Freund




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
Andrew Dunstan  writes:
> I think we can do what you want but it's a bit harder than what you've
> done. If we're not going to save the current run's product then we need
> to run the upgrade test from a different directory (probably directly in
> "$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to
> the saved product of a previous run of this branch.

Hmm, maybe that explains some inconsistent results I remember getting.

> I'll take a stab at it tomorrow if you like.

Please do.

regards, tom lane




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Andrew Dunstan


On 2023-01-18 We 16:14, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 2023-01-18 We 14:32, Tom Lane wrote:
>>> I suppose that the reason for not running under $from_source is to
>>> avoid corrupting the saved installations with unofficial versions.
>>> However, couldn't we skip the "save" step and still run the upgrade
>>> tests against whatever we have saved?  (Maybe skip the same-version
>>> test, as it's not quite reflecting any real case then.)
>> Something like this should do it:
>> my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";
> Ah, I didn't understand that $from_source is a path not just a bool.
>
> What do you think about the above questions?  Is this $from_source
> exclusion for the reason I guessed, or some other one?
>
>   

Yes, the reason is that, unlike almost everything else in the buildfarm,
cross version upgrade testing requires saved state (binaries and data
directory), and we don't want from-source builds corrupting that state.

I think we can do what you want but it's a bit harder than what you've
done. If we're not going to save the current run's product then we need
to run the upgrade test from a different directory (probably directly in
"$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to
the saved product of a previous run of this branch. I'll take a stab at
it tomorrow if you like.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





RE: Ability to reference other extensions by schema in extension scripts

2023-01-18 Thread Regina Obe
> On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:
> 
> > What would be more bullet-proof is having an extra column in
> > pg_extension or adding an extra array element to
> > pg_extension.extcondition[] that allows you to say "Hey, don't allow
> > this to be relocatable cause other extensions depend on it that have
> > explicitly referenced the schema."
> 
> I've given this some more thought and I think a good compromise could be to
> add the safety net in ALTER EXTENSION SET SCHEMA so that it not only
> checks "extrelocatable" but also the presence of any extension effectively
> depending on it, in which case the operation could be prevented with a more
> useful message than "extension does not support SET SCHEMA" (which is
> currently output).
> 
> Example query to determine those cases:
> 
>   SELECT e.extname, array_agg(v.name)
>  FROM pg_extension e, pg_available_extension_versions v
>  WHERE e.extname = ANY( v.requires )
>  AND e.extrelocatable
>  AND v.installed group by e.extname;
> 
>   extname|array_agg
>   ---+--
>fuzzystrmatch | {postgis_tiger_geocoder}
> 
> --strk;

The only problem with the above is that it then bars an extension from being
relocated even if no extensions reference its schema.  Note you wouldn't
be able to tell if an extension references a schema without analyzing the
install script, which is why I was thinking another property would be
better, because that could be checked during the install/upgrade of the
dependent extensions.

I personally would be okay with this and is easier to code I think and
doesn't require structural changes, but not sure others would be as it's
taking away permissions they had before when it wasn't necessary to do so.

Thanks,
Regina





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund  wrote:
> I suggested nearby to only have ANALYZE update dead_tuples if there's been no
> [auto]vacuum since the stats entry was created. That allows recovering from
> stats resets, be it via crashes or explicitly. What do you think?

I like that idea. It's far simpler than the kind of stuff I was
thinking about, and probably just as effective. Even if it introduces
some unforeseen problem (which seems unlikely), we can still rely on
pgstat_report_vacuum() to set things straight before too long.

Are you planning on writing a patch for this? I'd be very interested
in seeing this through. Could definitely review it.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 13:08:44 -0800, Andres Freund wrote:
> One complicating factor is that VACUUM sometimes computes an incrementally
> more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> computes something sane. I unintentionally encountered one when I was trying
> something while writing this email, reproducer attached.
> 
> 
> VACUUM (DISABLE_PAGE_SKIPPING) foo;
> SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │        901 │         50 │
> └────────────┴────────────┘
> 
> after one VACUUM:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    8549905 │         50 │
> └────────────┴────────────┘
> 
> after 9 more VACUUMs:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    5388421 │         50 │
> └────────────┴────────────┘
> (1 row)
> 
> 
> I briefly tried it out, and it does *not* reproduce in 11, but does in
> 12. Haven't dug into what the cause is, but we probably use the wrong
> denominator somewhere...

Oh, it does actually reproduce in 11 too - my script just didn't see it
because it was "too fast". For some reason, on versions before 12 it takes
longer for the new pgstat snapshot to become available. If I add a few
sleeps, it shows in 11.

The real point of change appears to be 10->11.

There's a relevant looking difference in the vac_estimate_reltuples call:
10:
/* now we can compute the new value for pg_class.reltuples */
vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
                                                     nblocks,
                                                     vacrelstats->tupcount_pages,
                                                     num_tuples);

11:
/* now we can compute the new value for pg_class.reltuples */
vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
                                                      nblocks,
                                                      vacrelstats->tupcount_pages,
                                                      live_tuples);
which points to:

commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
Author: Tom Lane 
Date:   2018-03-22 15:47:29 -0400

Sync up our various ways of estimating pg_class.reltuples.

Greetings,

Andres Freund




Re: Ability to reference other extensions by schema in extension scripts

2023-01-18 Thread Sandro Santilli
On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:

> What would be more bullet-proof is having an extra column in pg_extension or
> adding an extra array element to pg_extension.extcondition[] that allows you
> to say "Hey, don't allow this to be relocatable cause other extensions
> depend on it that have explicitly referenced the schema."

I've given this some more thought and I think a good
compromise could be to add the safety net in ALTER EXTENSION SET SCHEMA
so that it does not only check "extrelocatable" but also the presence
of any extension effectively depending on it, in which case the
operation could be prevented with a more useful message than
"extension does not support SET SCHEMA" (which is what is currently output).

Example query to determine those cases:

  SELECT e.extname, array_agg(v.name)
 FROM pg_extension e, pg_available_extension_versions v
 WHERE e.extname = ANY( v.requires )
 AND e.extrelocatable
 AND v.installed group by e.extname;

     extname    |        array_agg
  ---------------+--------------------------
   fuzzystrmatch | {postgis_tiger_geocoder}

--strk;




Re: document the need to analyze partitioned tables

2023-01-18 Thread Bruce Momjian
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> > 
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.
> > 
> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables?  It sounds bad to me.
> 
> Andrey Lepikhov had an example earlier in this thread[1].  It doesn't take
> an exotic query. 
> 
> Attached is a new version of my patch that tries to improve the wording.

Ah, yes, that is the example I saw but could not re-find.  Here is the
output:

CREATE TABLE test (id integer, val integer) PARTITION BY hash (id);

CREATE TABLE test_0 PARTITION OF test
   FOR VALUES WITH (modulus 2, remainder 0);
CREATE TABLE test_1 PARTITION OF test
   FOR VALUES WITH (modulus 2, remainder 1);

INSERT INTO test (SELECT q, q FROM generate_series(1,10) AS q);

VACUUM ANALYZE test;

INSERT INTO test (SELECT q, q%2 FROM generate_series(11,200) AS q);

VACUUM ANALYZE test_0,test_1;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t1.id = t2.val)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

VACUUM ANALYZE test;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t2.val = t1.id)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

I see the inner side uses 'val' in the first EXPLAIN and 'id' in the
second, and you are right that 'val' has mostly 0/1.

Is it possible to document when partitioned table statistics help?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-18 We 14:32, Tom Lane wrote:
>> I suppose that the reason for not running under $from_source is to
>> avoid corrupting the saved installations with unofficial versions.
>> However, couldn't we skip the "save" step and still run the upgrade
>> tests against whatever we have saved?  (Maybe skip the same-version
>> test, as it's not quite reflecting any real case then.)

> Something like this should do it:
> my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";

Ah, I didn't understand that $from_source is a path not just a bool.

What do you think about the above questions?  Is this $from_source
exclusion for the reason I guessed, or some other one?

regards, tom lane




Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 02:51:38PM -0500, Robert Haas wrote:
> Should (nfree < SuperuserReservedBackends) be using <=, or am I confused?

I believe < is correct.  At this point, the new backend will have already
claimed a proc struct, so if the number of remaining free slots equals the
number of reserved slots, it is okay.
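
To spell that out, here is a minimal sketch of the comparison under
discussion (not the committed code; "nfree" stands for the count of free
PGPROC slots observed after this backend has claimed its own):

	/*
	 * Sketch only: nfree == SuperuserReservedBackends still leaves the
	 * reserved pool intact, so only strictly-less-than must be rejected.
	 */
	if (!am_superuser && nfree < SuperuserReservedBackends)
		ereport(FATAL,
				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
				 errmsg("remaining connection slots are reserved for superusers")));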

> What's the deal with removing "and no new replication connections will
> be accepted" from the documentation? Is the existing documentation
> just wrong? If so, should we fix that first? And maybe delete
> "non-replication" from the error message that says "remaining
> connection slots are reserved for non-replication superuser
> connections"? It seems like right now the comments say that
> replication connections are a completely separate pool of connections,
> but the documentation and the error message make it sound otherwise.
> If that's true, then one of them is wrong, and I think it's the
> docs/error message. Or am I just misreading it?

I think you are right.  This seems to have been missed in ea92368.  I moved
this part to a new patch that should probably be back-patched to v12.

On that note, I wonder if it's worth changing the "sorry, too many clients
already" message to make it clear that max_connections has been reached.
IME some users are confused by this error, and I think it would be less
confusing if it pointed to the parameter that governs the number of
connection slots.  I'll create a new thread for this.
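
For instance, one hypothetical shape for such a message (illustrative
only, to be bikeshedded on that new thread):

	ereport(FATAL,
			(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
			 errmsg("sorry, too many clients already"),
			 errhint("Consider raising max_connections (currently %d).",
					 MaxConnections)));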

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 8f0aa2fa54ae01149cffe9a69265f98e76a08a23 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 18 Jan 2023 12:43:41 -0800
Subject: [PATCH v3 1/3] Code review for ea92368.

This commit missed an error message and a line in the docs.

Back-patch to v12.
---
 doc/src/sgml/config.sgml  | 3 +--
 src/backend/utils/init/postinit.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89d53f2a64..e019a1aac9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -725,8 +725,7 @@ include_dir 'conf.d'
 number of active concurrent connections is at least
 max_connections minus
 superuser_reserved_connections, new
-connections will be accepted only for superusers, and no
-new replication connections will be accepted.
+connections will be accepted only for superusers.

 

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..9145d96b38 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -931,7 +931,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 		!HaveNFreeProcs(ReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
- errmsg("remaining connection slots are reserved for non-replication superuser connections")));
+ errmsg("remaining connection slots are reserved for superusers")));
 
 	/* Check replication permissions needed for walsender processes. */
 	if (am_walsender)
-- 
2.25.1

>From bc110461e4f0a73c2b76ba0a6e821349c2cbe3df Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v3 2/3] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan  wrote:
> Some of what I'm proposing arguably amounts to deliberately adding a
> bias. But that's not an unreasonable thing in itself. I think of it as
> related to the bias-variance tradeoff, which is a concept that comes
> up a lot in machine learning and statistical inference.

To be clear, I was thinking of unreservedly trusting what
pgstat_report_analyze() says about dead_tuples in the event of its
estimate increasing our dead_tuples estimate, while being skeptical
(to a varying degree) when it's the other way around.

But now I need to go think about what Andres just brought up...

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 12:15:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 11:02 AM Robert Haas  wrote:
> > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> > > pgstat_report_analyze() will totally override the
> > > tabentry->dead_tuples information that drives autovacuum.c, based on
> > > an estimate derived from a random sample -- which seems to me to be an
> > > approach that just doesn't have any sound theoretical basis.
> >
> > Yikes. I think we don't have a choice but to have a method to correct
> > the information somehow, because AFAIK the statistics system is not
> > crash-safe. But that approach does seem to carry significant risk of
> > overwriting correct information with wrong information.

I suggested nearby to only have ANALYZE update dead_tuples if there's been no
[auto]vacuum since the stats entry was created. That allows recovering from
stats resets, be it via crashes or explicitly. What do you think?
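
A minimal sketch of what I mean, with illustrative PgStat_StatTabEntry
field names (not necessarily the real ones):

	/*
	 * Sketch only: let ANALYZE overwrite dead_tuples solely when the entry
	 * has never been updated by a [auto]vacuum, i.e. when we are likely
	 * recovering from a stats reset rather than discarding better
	 * information gathered by VACUUM.
	 */
	static void
	analyze_report_dead_tuples(PgStat_StatTabEntry *tabentry,
							   PgStat_Counter dead_tuples_estimate)
	{
		if (tabentry->vacuum_count == 0 && tabentry->autovac_vacuum_count == 0)
			tabentry->dead_tuples = dead_tuples_estimate;
	}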

To add insult to injury, we overwrite accurate information gathered by VACUUM
with bad information gathered by ANALYZE if you do VACUUM ANALYZE.



One complicating factor is that VACUUM sometimes computes an incrementally
more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
computes something sane. I unintentionally encountered one when I was trying
something while writing this email, reproducer attached.


VACUUM (DISABLE_PAGE_SKIPPING) foo;
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│        901 │         50 │
└────────────┴────────────┘

after one VACUUM:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    8549905 │         50 │
└────────────┴────────────┘

after 9 more VACUUMs:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    5388421 │         50 │
└────────────┴────────────┘
(1 row)


I briefly tried it out, and it does *not* reproduce in 11, but does in
12. Haven't dug into what the cause is, but we probably use the wrong
denominator somewhere...

Greetings,

Andres Freund


vactest.sql
Description: application/sql


Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Andrew Dunstan


On 2023-01-18 We 14:32, Tom Lane wrote:
> One more thing before we move on from this topic.  I'd been testing
> modified versions of the AdjustUpgrade.pm logic by building from a
> --from-source source tree, which seemed way easier than dealing
> with a private git repo.  As it stands, TestUpgradeXversion.pm
> refuses to run under $from_source, but I just diked that check out
> and it seemed to work fine for my purposes.  Now, that's going to be
> a regular need going forward, so I'd like to not need a hacked version
> of the BF client code to do it.
>
> Also, your committed version of TestUpgradeXversion.pm breaks that
> use-case because you did
>
> -   unshift(@INC, "$self->{pgsql}/src/test/perl");
> +   unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");
>
> which AFAICS is an empty directory in a $from_source run.
>
> I suppose that the reason for not running under $from_source is to
> avoid corrupting the saved installations with unofficial versions.
> However, couldn't we skip the "save" step and still run the upgrade
> tests against whatever we have saved?  (Maybe skip the same-version
> test, as it's not quite reflecting any real case then.)
>
> Here's a quick draft patch showing what I have in mind.  There may
> well be a better way to deal with the wheres-the-source issue than
> what is in hunk 2.  Also, I didn't reindent the unchanged code in
> sub installcheck, and I didn't add anything about skipping
> same-version tests.


No, that won't work if we're using vpath builds (which was why I changed
it from what you had). $self->{pgsql} is always the build directory.

Something like this should do it:


my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas  wrote:
> I don't know enough about the specifics of how this works to have an
> intelligent opinion about how likely these particular ideas are to
> work out. However, I think it's risky to look at estimates and try to
> infer whether they are reliable. It's too easy to be wrong. What we
> really want to do is anchor our estimates to some data source that we
> know we can trust absolutely. If you trust possibly-bad data less, it
> screws up your estimates more slowly, but it still screws them up.

Some of what I'm proposing arguably amounts to deliberately adding a
bias. But that's not an unreasonable thing in itself. I think of it as
related to the bias-variance tradeoff, which is a concept that comes
up a lot in machine learning and statistical inference.

We can afford to be quite imprecise at times, especially if we choose
a bias that we know has much less potential to do us harm -- some
mistakes hurt much more than others. We cannot afford to ever be
dramatically wrong, though -- especially in the direction of vacuuming
less often.

Besides, there is something that we *can* place a relatively high
degree of trust in that will still be in the loop here: VACUUM itself.
If VACUUM runs then it'll call pgstat_report_vacuum(), which will set
the record straight in the event of over estimating dead tuples. To
some degree the problem of over estimating dead tuples is
self-limiting.

> If Andres is correct that what really matters is the number of pages
> we're going to have to dirty, we could abandon counting dead tuples
> altogether and just count not-all-visible pages in the VM map.

That's what matters most from a cost point of view IMV. So it's a big
part of the overall picture, but not everything. It tells us
relatively little about the benefits, except perhaps when most pages
are all-visible.

-- 
Peter Geoghegan




Re: Non-superuser subscription owners

2023-01-18 Thread Mark Dilger



> On Jan 18, 2023, at 12:51 PM, Robert Haas  wrote:
> 
> Unless I'm missing something, it seems like this could be a quite small patch.

I didn't like the idea of the create/alter subscription commands needing to 
parse the connection string and think about what it might do, because at some 
point in the future we might extend what things are allowed in that string, and 
we have to keep everything that contemplates that string in sync.  I may have 
been overly hesitant to tackle that problem.  Or maybe I just ran short of 
round tuits.
 
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Non-superuser subscription owners

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 3:26 PM Mark Dilger
 wrote:
> Prior to the patch, if a superuser created a subscription, then later was 
> demoted to non-superuser, the subscription apply workers still applied the 
> changes with superuser force.  So creating a superuser Alice, letting Alice 
> create a subscription, then revoking superuser from Alice didn't accomplish 
> anything interesting.  But after the patch, it does.  The superuser can now 
> create non-superuser subscriptions.  (I'm not sure this ability is well 
> advertised.)  But the problem of non-superuser roles creating non-superuser 
> subscriptions is not solved.

Ah, OK, thanks for the clarification!

> There are different ways of solving (1), and Jeff and I discussed them in Dec 
> 2021.  My recollection was that idea (3) was the cleanest.  Other ideas might 
> be simpler than (3), or they may just appear simpler but in truth turn into a 
> can of worms.  I don't know, since I never went as far as trying to implement 
> either approach.
>
> Idea (2) seems to contemplate non-superuser subscription owners as a 
> theoretical thing, but it's quite real already.  Again, see 
> 027_nosuperuser.pl.

I think the solution to the problem of a connection string trying to
access local files is to just look at the connection string, decide
whether it does that, and if yes, require the owner to have
pg_read_server_files as well as pg_create_subscription. (3) is about
creating some more sophisticated and powerful solution to that
problem, but that seems like a nice-to-have, not something essential,
and a lot more complicated to implement.
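
To make the first part concrete, here is a hedged sketch of such a check
on the connection string, using libpq's existing PQconninfoParse(); the
set of keywords treated as touching local files is illustrative rather
than exhaustive, and the function name is made up:

	#include <stdbool.h>
	#include <string.h>
	#include "libpq-fe.h"

	/*
	 * Sketch only: return true if the connection string uses parameters
	 * that would read local files on the server running the subscription
	 * worker, in which case the owner could additionally be required to
	 * hold pg_read_server_files.
	 */
	static bool
	connstring_touches_local_files(const char *conninfo)
	{
		char	   *errmsg = NULL;
		PQconninfoOption *opts = PQconninfoParse(conninfo, &errmsg);
		bool		result = false;

		if (opts == NULL)
		{
			PQfreemem(errmsg);
			return true;		/* be conservative on parse failure */
		}

		for (PQconninfoOption *opt = opts; opt->keyword != NULL; opt++)
		{
			if (opt->val == NULL || opt->val[0] == '\0')
				continue;

			if (strcmp(opt->keyword, "passfile") == 0 ||
				strcmp(opt->keyword, "sslcert") == 0 ||
				strcmp(opt->keyword, "sslkey") == 0 ||
				(strcmp(opt->keyword, "host") == 0 && opt->val[0] == '/'))
				result = true;
		}

		PQconninfoFree(opts);
		return result;
	}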

I guess what I listed as (2) is not relevant since I didn't understand
correctly what the current state of things is.

Unless I'm missing something, it seems like this could be a quite small patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 3:15 PM Peter Geoghegan  wrote:
> Suppose that we notice that its new
> estimate for live_tuples approximately matches what the stats
> subsystem already thought about live_tuples, while dead_tuples is far
> far lower. We shouldn't be so credulous as to believe the new
> dead_tuples estimate at that point.
>
> Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
> acquire_sample_rows() could pay attention to -- especially in larger
> tables, where the difference between all pages and just the
> all-visible subset of pages is most likely to matter. The more sampled
> pages that had PD_ALL_VISIBLE set, the less credible the new
> dead_tuples estimate will be (relative to existing information), and
> so pgstat_report_analyze() should prefer the new estimate over the old
> one in proportion to that.

I don't know enough about the specifics of how this works to have an
intelligent opinion about how likely these particular ideas are to
work out. However, I think it's risky to look at estimates and try to
infer whether they are reliable. It's too easy to be wrong. What we
really want to do is anchor our estimates to some data source that we
know we can trust absolutely. If you trust possibly-bad data less, it
screws up your estimates more slowly, but it still screws them up.

If Andres is correct that what really matters is the number of pages
we're going to have to dirty, we could abandon counting dead tuples
altogether and just count not-all-visible pages in the VM map. That
would be cheap enough to recompute periodically. However, it would
also be a big behavior change from the way we do things now, so I'm
not sure it's a good idea. Still, a quantity that we can be certain
we're measuring accurately is better than one we can't measure
accurately even if it's a somewhat worse proxy for the thing we really
care about. There's a ton of value in not being completely and totally
wrong.
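
For what it's worth, that quantity is already cheap to derive from
existing infrastructure; a hedged sketch (not an existing proposal in
code form) built on visibilitymap_count():

	#include "postgres.h"
	#include "access/visibilitymap.h"
	#include "storage/bufmgr.h"
	#include "utils/rel.h"

	/* Sketch only: pages not currently marked all-visible in the VM. */
	static BlockNumber
	count_not_all_visible_pages(Relation rel)
	{
		BlockNumber all_visible;
		BlockNumber all_frozen;	/* computed but unused in this sketch */

		visibilitymap_count(rel, &all_visible, &all_frozen);
		return RelationGetNumberOfBlocks(rel) - all_visible;
	}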

> FWIW, part of my mental model with VACUUM is that the rules kind of
> change in the case of a big table. We're far more vulnerable to issues
> such as (say) waiting for cleanup locks because the overall cadence
> used by autovacuum is so infrequent relative to everything else.
> There are more opportunities for things to go wrong, worse
> consequences when they do go wrong, and greater potential for the
> problems to compound.

Yes. A lot of parts of PostgreSQL, including this one, were developed
a long time ago when PostgreSQL databases were a lot smaller than they
often are today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-01-18 Thread Mark Dilger



> On Jan 18, 2023, at 11:38 AM, Robert Haas  wrote:
> 
> I was just noticing that what was committed here didn't actually fix
> the problem implied by the subject line. That is, non-superuser still
> can't own subscriptions.

Not so.  They can.  See src/test/subscription/027_nosuperuser.pl

> To put that another way, there's no way for
> the superuser to delegate the setup and administration of logical
> replication to a non-superuser.

True.

> That's a bummer.

Also true.

> Reading the thread, I'm not quite sure why we seemingly did all the
> preparatory work and then didn't actually fix the problem.

Prior to the patch, if a superuser created a subscription, then later was 
demoted to non-superuser, the subscription apply workers still applied the 
changes with superuser force.  So creating a superuser Alice, letting Alice 
create a subscription, then revoking superuser from Alice didn't accomplish 
anything interesting.  But after the patch, it does.  The superuser can now 
create non-superuser subscriptions.  (I'm not sure this ability is well 
advertised.)  But the problem of non-superuser roles creating non-superuser 
subscriptions is not solved.

From a security perspective, the bit that was solved may be the more important 
part; from a usability perspective, perhaps not.

> It was
> previously proposed that we introduce a new predefined role
> pg_create_subscriptions and allow users who have the privileges of
> that predefined role to create and alter subscriptions. There are a
> few issues with that which, however, seem fairly solvable to me:
> 
> 1. Jeff pointed out that if you supply a connection string that is
> going to try to access local files, you'd better have
> pg_read_server_files, or else we should not let you use that
> connection string. I guess that's mostly a function of which
> parameters you specify, e.g. passfile, sslcert, sslkey, though maybe
> for host it depends on whether the value starts with a slash. We might
> need to think a bit here to make sure we get the rules right but it
> seems like a pretty solvable problem.
> 
> 2. There was also quite a bit of discussion of what to do if a user
> who was previously eligible to own a subscription ceases to be
> eligible, in particular around a superuser who is made into a
> non-superuser, but the same problem would apply if you had
> pg_create_subscriptions or pg_read_server_files and then lost it. My
> vote is to not worry about it too much. Specifically, I think we
> should certainly check whether the user has permission to create a
> subscription before letting them do so, but we could handle the case
> where the user already owns a subscription and tries to modify it by
> either allowing or denying the operation and I think either of those
> would be fine. I even think we could do one of those in some cases and
> the other in other cases and as long as there is some principle to the
> thing, it's fine. I argue that it's not a normal configuration and
> therefore it doesn't have to work in a particularly useful way. It
> shouldn't break the world in some horrendous way, but that's about as
> good as it needs to be. I'd argue for example that DROP SUBSCRIPTION
> could just check whether you own the object, and that ALTER
> SUBSCRIPTION could check whether you own the object and, if you're
> changing the connection string, also whether you would have privileges
> to set that new connection string on a new subscription.
> 
> 3. There was a bit of discussion of maybe wanting to allow users to
> create subscriptions with some connection strings but not others,
> perhaps by having some kind of intermediate object that owns the
> connection string and is owned by a superuser or someone with lots of
> privileges, and then letting a less-privileged user point a
> subscription at that object. I agree that might be useful to somebody,
> but I don't see why it's a hard requirement to get anything at all
> done here. Right now, a subscription contains a connection string
> directly. If in the future someone wants to introduce a CREATE
> REPLICATION DESTINATION command (or whatever) and have a way to point
> a subscription at a replication destination rather than a connection
> string directly, cool. Or if someone wants to wire this into CREATE
> SERVER somehow, also cool. But if you don't care about restricting
> which IPs somebody can try to access by providing a connection string
> of their choice, then you would be happy if we just did something
> simple here and left this problem for another day.
> 
> I am very curious to know (a) why work on this was abandoned (perhaps
> the answer is just lack of round tuits, in which case there is no more
> to be said)

Mostly, it was a lack of round-tuits.  After the patch was committed, I quickly 
switched my focus elsewhere.

> , and (b) what people think of (1)-(3) above

There are different ways of solving (1), and Jeff and I discussed them in Dec 
2021.  My recollection was that idea (3) 

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-17 12:08:01 -0800, Peter Geoghegan wrote:
> > I think that's not the fault of relfrozenxid as a trigger, but that we 
> > simply
> > don't keep enough other stats. We should imo at least keep track of:
> 
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), then of course anything that triggers
> vacuuming is going to help with that problem -- it might be totally
> inadequate, but still make the critical difference by not allowing the
> system to become completely destabilized. I absolutely accept that
> users that are relying on that exist, and that those users ought to
> not have things get even worse -- I'm pragmatic. But overall, what we
> should be doing is fixing the real problem, which is that the dead
> tuples accounting is deeply flawed. Actually, it's not just that the
> statistics are flat out wrong; the whole model is flat-out wrong.

I think that depends on what "whole model" encompasses...


> The assumptions that work well for optimizer statistics quite simply
> do not apply here. Random sampling for this is just wrong, because
> we're not dealing with something that follows a distribution that can
> be characterized with a sufficiently large sample. With optimizer
> statistics, the entire contents of the table is itself a sample taken
> from the wider world -- so even very stale statistics can work quite
> well (assuming that the schema is well normalized). Whereas the
> autovacuum dead tuples stuff is characterized by constant change. I
> mean of course it is -- that's the whole point! The central limit
> theorem obviously just doesn't work for something like this  -- we
> cannot generalize from a sample, at all.

If we were to stop dropping stats after crashes, I think we likely could
afford to stop messing with dead tuple stats during analyze. Right now it's
valuable to some degree because it's a way to reasonably quickly recover from
lost stats.

The main way to collect inserted / dead tuple info for autovacuum's benefit is
via the stats collected when making changes.

We probably ought to simply not update dead tuples after analyze if the stats
entry has information about a prior [auto]vacuum. Or at least split the
fields.


> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself. And I have my doubts that there are *any* good
> answers.

Hence my suggestion to track several of these via page level stats. In the big
picture it doesn't really matter that much whether there's 10 or 100 (dead
tuples|items) on a page that needs to be removed. It matters that the page
needs to be processed.


> Even these questions are the wrong questions (they're just less wrong).

I don't agree. Nothing is going to be perfect, but you're not going to be able
to do sensible vacuum scheduling without some stats, and it's fine if those
are an approximation, as long as the approximation makes some sense.


> I'd like to use the visibility map more for stuff here, too. It is
> totally accurate about all-visible/all-frozen pages, so many of my
> complaints about statistics don't really apply. Or need not apply, at
> least. If 95% of a table's pages are all-frozen in the VM, then of
> course it's pretty unlikely to be the right time to VACUUM the table
> if it's to clean up bloat -- this is just about the most reliable
> information we have access to.

I think querying that from stats is too expensive for most things. I suggested
tracking all-frozen in pg_class. Perhaps we should also track when pages are
*removed* from the VM in pgstats, I don't think we do today. That should give
a decent picture?



> > > This sounds like a great argument in favor of suspend-and-resume as a
> > > way of handling autocancellation -- no useful work needs to be thrown
> > > away for AV to yield for a minute or two.
> 
> > Hm, that seems a lot of work. Without having held a lock you don't even know
> > whether your old dead items still apply. Of course it'd improve the
> > situation substantially, if we could get it.
> 
> I don't think it's all that much work, once the visibility map
> snapshot infrastructure is there.
> 
> Why wouldn't your old dead items still apply?

Well, for one the table could have been rewritten. Of course we can add the
code to deal with that, but it is definitely something to be aware of. There
might also be some oddities around indexes getting added / removed.



> > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > > special cases? Maybe they should always be able to auto cancel an
> > > autovacuum?
> >
> > Yea, I think so. It's not obvious how to best pass down that knowledge into
> > ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
> > way would be to change LockAcquireExtended() to get a flags argument instead
> > of 

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas  wrote:
> On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> > pgstat_report_analyze() will totally override the
> > tabentry->dead_tuples information that drives autovacuum.c, based on
> > an estimate derived from a random sample -- which seems to me to be an
> > approach that just doesn't have any sound theoretical basis.
>
> Yikes. I think we don't have a choice but to have a method to correct
> the information somehow, because AFAIK the statistics system is not
> crash-safe. But that approach does seem to carry significant risk of
> overwriting correct information with wrong information.

This situation is really quite awful, so maybe we should do something
about it soon, in the scope of the Postgres 16 work on autovacuum that
is already underway. In fact I think that the problem here is so bad
that even something slightly less naive would be far more effective.

You're right to point out that pgstat_report_analyze needs to update
the stats in case there is a hard crash, of course. But there is
plenty of context with which to make better decisions close at hand.
For example, I bet that pgstat_report_analyze already does a pretty
good job of estimating live_tuples -- my spiel about statistics mostly
doesn't apply to live_tuples. Suppose that we notice that its new
estimate for live_tuples approximately matches what the stats
subsystem already thought about live_tuples, while dead_tuples is far
far lower. We shouldn't be so credulous as to believe the new
dead_tuples estimate at that point.

Perhaps we can change nothing about dead_tuples at all when this
happens. Or perhaps we can set dead_tuples to a value that is scaled
from the old estimate. The new dead_tuples value could be derived by
taking the ratio of the old live_tuples to the old dead_tuples, and
then using that to scale from the new live_tuples. This is just a
first pass, to see what you and others think. Even very simple
heuristics seem like they could make things much better.
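
To pin down the arithmetic of that first pass, a minimal sketch of the
scaling step (illustrative only, not a patch):

	/*
	 * Sketch: keep the previously stored dead/live ratio and apply it to
	 * the fresh live_tuples estimate, rather than overwriting dead_tuples
	 * with the raw sampled value.
	 */
	static double
	scaled_dead_tuples(double old_live, double old_dead, double new_live)
	{
		if (old_live <= 0.0)
			return old_dead;	/* nothing usable to scale from */
		return new_live * (old_dead / old_live);
	}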

Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
acquire_sample_rows() could pay attention to -- especially in larger
tables, where the difference between all pages and just the
all-visible subset of pages is most likely to matter. The more sampled
pages that had PD_ALL_VISIBLE set, the less credible the new
dead_tuples estimate will be (relative to existing information), and
so pgstat_report_analyze() should prefer the new estimate over the old
one in proportion to that.

We probably shouldn't change anything about pgstat_report_vacuum as
part of this effort to make pgstat_report_analyze less terrible in the
near term. It certainly has its problems (what is true for pages that
VACUUM scanned at the end of VACUUM is far from representative for new
pages!), but it's probably much less of a contributor to issues like those
that Andres reports seeing.

BTW, one of the nice things about the insert-driven autovacuum stats
is that pgstat_report_analyze doesn't have an opinion about how many
tuples were inserted since the last VACUUM ran. It does have other
problems, but they seem less serious to me.

> Yep. I think what we should try to evaluate is which number is
> furthest from the truth. My guess is that the threshold is so high
> relative to what a reasonable value would be that you can't get any
> benefit out of making the dead tuple count more accurate. Like, if the
> threshold is 100x too high, or something, then who cares how accurate
> the dead tuples number is?

Right. Or if we don't make any reasonable distinction between LP_DEAD
items and dead heap-only tuples, then the total number of both things
together may matter very little. Better to be approximately correct
than exactly wrong. Deliberately introducing a bias to lower the
variance is a perfectly valid approach.

> Maybe that's even true in some scenarios, but I bet that
> it's never the issue when people have really big tables. The fact that
> I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
> 1TB of bloat in my 10TB table. Among other problems, I can't even
> vacuum away that much bloat in one index pass, because autovacuum
> can't use enough work memory for that. Also, the absolute space
> wastage matters.

I certainly agree with all that.

FWIW, part of my mental model with VACUUM is that the rules kind of
change in the case of a big table. We're far more vulnerable to issues
such as (say) waiting for cleanup locks because the overall cadence
used by autovacuum is so infrequent relative to everything else.
There are more opportunities for things to go wrong, worse
consequences when they do go wrong, and greater potential for the
problems to compound.

-- 
Peter Geoghegan




Re: minor bug

2023-01-18 Thread Tom Lane
Laurenz Albe  writes:
> On Tue, 2023-01-17 at 10:32 -0500, Tom Lane wrote:
>> I seem to recall that the original idea was to report the timestamp
>> of the commit/abort record we are stopping at.  Maybe my memory is
>> faulty, but I think that'd be significantly more useful than the
>> current behavior, *especially* when the replay stopping point is
>> defined by something other than time.
>> (Also, the wording of the log message suggests that that's what
>> the reported time is supposed to be.  I wonder if somebody messed
>> this up somewhere along the way.)

> recoveryStopTime is set to recordXtime (the time of the xlog record)
> a few lines above that patch, so this is useful information if it is
> present.

Ah, but that only happens if recoveryTarget == RECOVERY_TARGET_TIME.
Digging in the git history, I see that this did use to work as
I remember: we always extracted the record time before printing it.
That was accidentally broken in refactoring in c945af80c.  I think
the correct fix is more like the attached.

regards, tom lane

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5e65785306..c14d1f3ef6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2548,8 +2548,13 @@ recoveryStopsBefore(XLogReaderState *record)
 		stopsHere = (recordXid == recoveryTargetXid);
 	}
 
-	if (recoveryTarget == RECOVERY_TARGET_TIME &&
-		getRecordTimestamp(record, ))
+	/*
+	 * Note: we must fetch recordXtime regardless of recoveryTarget setting.
+	 * We don't expect getRecordTimestamp ever to fail, since we already know
+	 * this is a commit or abort record; but test its result anyway.
+	 */
+	if (getRecordTimestamp(record, ) &&
+		recoveryTarget == RECOVERY_TARGET_TIME)
 	{
 		/*
 		 * There can be many transactions that share the same commit time, so


Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 2:00 PM Nathan Bossart  wrote:
> On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote:
> > In general, looks good. I think this will often call HaveNFreeProcs
> > twice, though, and that would be better to avoid, e.g.
>
> I should have thought of this.  This is fixed in v2.

Should (nfree < SuperuserReservedBackends) be using <=, or am I confused?

> > In the common case where we hit neither limit, this only counts free
> > connection slots once. We could do even better by making
> > HaveNFreeProcs have an out parameter for the number of free procs
> > actually found when it returns false, but that's probably not
> > important.
>
> Actually, I think it might be important.  IIUC the separate calls to
> HaveNFreeProcs might return different values for the same input, which
> could result in incorrect error messages (e.g., you might get the
> reserved_connections message despite setting reserved_connections to 0).
> So, I made this change in v2, too.

I thought of that briefly and it didn't seem that important, but the
way you did it seems fine, so let's go with that.

What's the deal with removing "and no new replication connections will
be accepted" from the documentation? Is the existing documentation
just wrong? If so, should we fix that first? And maybe delete
"non-replication" from the error message that says "remaining
connection slots are reserved for non-replication superuser
connections"? It seems like right now the comments say that
replication connections are a completely separate pool of connections,
but the documentation and the error message make it sound otherwise.
If that's true, then one of them is wrong, and I think it's the
docs/error message. Or am I just misreading it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Small omission in type_sanity.sql

2023-01-18 Thread Melanie Plageman
Hi,

I was playing around with splitting up the tablespace test in regress so
that I could use the tablespaces it creates in another test and happened
to notice that the pg_class validity checks in type_sanity.sql are
incomplete.

It seems that 8b08f7d4820fd did not update the pg_class tests in
type_sanity to include partitioned indexes and tables.

patch attached.
I only changed these few lines in type_sanity to be more correct; I
didn't change anything else in regress to actually exercise them (e.g.
ensuring a partitioned table is around when running type_sanity). It
might be worth moving type_sanity down in the parallel schedule?

It does seem a bit hard to remember to update these tests in
type_sanity.sql when adding some new value for a pg_class field. I
wonder if there is a better way of testing this.

- Melanie
From 1531cdf257af92d31836e50e1a0b865131f1a018 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 18 Jan 2023 14:26:36 -0500
Subject: [PATCH v1] Update pg_class validity test

Partitioned tables and indexes were not considered in the pg_class
validity checks in type_sanity.sql. This is not currently exercised
because all partitioned tables and indexes have been dropped by the time
type_sanity is run.
---
 src/test/regress/expected/type_sanity.out | 44 ---
 src/test/regress/sql/type_sanity.sql  | 36 ++-
 2 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/src/test/regress/expected/type_sanity.out b/src/test/regress/expected/type_sanity.out
index a640cfc476..95435e3de6 100644
--- a/src/test/regress/expected/type_sanity.out
+++ b/src/test/regress/expected/type_sanity.out
@@ -498,34 +498,36 @@ ORDER BY 1;
 
 --  pg_class 
 -- Look for illegal values in pg_class fields
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
+SELECT oid, relname, relkind, relpersistence, relreplident
+FROM pg_class
+WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
 relpersistence NOT IN ('p', 'u', 't') OR
 relreplident NOT IN ('d', 'n', 'f', 'i');
- oid | relname 
------+---------
+ oid | relname | relkind | relpersistence | relreplident 
+-----+---------+---------+----------------+--------------
 (0 rows)
 
--- All tables and indexes should have an access method.
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
-c1.relam = 0;
- oid | relname 
------+---------
+-- All tables and indexes except partitioned tables should have an access
+-- method.
+SELECT oid, relname, relkind, relam
+FROM pg_class
+WHERE relkind NOT IN ('S', 'v', 'f', 'c', 'p') and
+relam = 0;
+ oid | relname | relkind | relam 
+-----+---------+---------+-------
 (0 rows)
 
--- Conversely, sequences, views, types shouldn't have them
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
-c1.relam != 0;
- oid | relname 
------+---------
+-- Conversely, sequences, views, types, and partitioned tables shouldn't have
+-- them
+SELECT oid, relname, relkind, relam
+FROM pg_class
+WHERE relkind IN ('S', 'v', 'f', 'c', 'p') and
+relam != 0;
+ oid | relname | relkind | relam 
+-----+---------+---------+-------
 (0 rows)
 
--- Indexes should have AMs of type 'i'
+-- Indexes and partitioned indexes should have AMs of type 'i'
 SELECT pc.oid, pc.relname, pa.amname, pa.amtype
 FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid)
 WHERE pc.relkind IN ('i') and
@@ -534,7 +536,7 @@ WHERE pc.relkind IN ('i') and
 -----+---------+--------+--------
 (0 rows)
 
--- Tables, matviews etc should have AMs of type 't'
+-- Tables, matviews etc should have AMs of type 't' (except partitioned tables)
 SELECT pc.oid, pc.relname, pa.amname, pa.amtype
 FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid)
 WHERE pc.relkind IN ('r', 't', 'm') and
diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql
index 79ec410a6c..22a2dba94d 100644
--- a/src/test/regress/sql/type_sanity.sql
+++ b/src/test/regress/sql/type_sanity.sql
@@ -358,31 +358,33 @@ ORDER BY 1;
 
 -- Look for illegal values in pg_class fields
 
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
+SELECT oid, relname, relkind, relpersistence, relreplident
+FROM pg_class
+WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
 relpersistence NOT IN ('p', 'u', 't') OR
 relreplident NOT IN ('d', 'n', 'f', 'i');
 
--- All tables and indexes should have an access method.
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
-c1.relam = 0;
-
--- Conversely, sequences, views, types shouldn't have them
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
-c1.relam != 0;
-
--- Indexes should have AMs of type 'i'

Re: Non-superuser subscription owners

2023-01-18 Thread Robert Haas
On Sat, Jan 8, 2022 at 2:38 AM Jeff Davis  wrote:
> Committed.

I was just noticing that what was committed here didn't actually fix
the problem implied by the subject line. That is, non-superuser still
can't own subscriptions. To put that another way, there's no way for
the superuser to delegate the setup and administration of logical
replication to a non-superuser. That's a bummer.

Reading the thread, I'm not quite sure why we seemingly did all the
preparatory work and then didn't actually fix the problem. It was
previously proposed that we introduce a new predefined role
pg_create_subscriptions and allow users who have the privileges of
that predefined role to create and alter subscriptions. There are a
few issues with that which, however, seem fairly solvable to me:

1. Jeff pointed out that if you supply a connection string that is
going to try to access local files, you'd better have
pg_read_server_files, or else we should not let you use that
connection string. I guess that's mostly a function of which
parameters you specify, e.g. passfile, sslcert, sslkey, though maybe
for host it depends on whether the value starts with a slash. We might
need to think a bit here to make sure we get the rules right but it
seems like a pretty solvable problem.

2. There was also quite a bit of discussion of what to do if a user
who was previously eligible to own a subscription ceases to be
eligible, in particular around a superuser who is made into a
non-superuser, but the same problem would apply if you had
pg_create_subscriptions or pg_read_server_files and then lost it. My
vote is to not worry about it too much. Specifically, I think we
should certainly check whether the user has permission to create a
subscription before letting them do so, but we could handle the case
where the user already owns a subscription and tries to modify it by
either allowing or denying the operation and I think either of those
would be fine. I even think we could do one of those in some cases and
the other in other cases and as long as there is some principle to the
thing, it's fine. I argue that it's not a normal configuration and
therefore it doesn't have to work in a particularly useful way. It
shouldn't break the world in some horrendous way, but that's about as
good as it needs to be. I'd argue for example that DROP SUBSCRIPTION
could just check whether you own the object, and that ALTER
SUBSCRIPTION could check whether you own the object and, if you're
changing the connection string, also whether you would have privileges
to set that new connection string on a new subscription.

3. There was a bit of discussion of maybe wanting to allow users to
create subscriptions with some connection strings but not others,
perhaps by having some kind of intermediate object that owns the
connection string and is owned by a superuser or someone with lots of
privileges, and then letting a less-privileged user point a
subscription at that object. I agree that might be useful to somebody,
but I don't see why it's a hard requirement to get anything at all
done here. Right now, a subscription contains a connection string
directly. If in the future someone wants to introduce a CREATE
REPLICATION DESTINATION command (or whatever) and have a way to point
a subscription at a replication destination rather than a connection
string directly, cool. Or if someone wants to wire this into CREATE
SERVER somehow, also cool. But if you don't care about restricting
which IPs somebody can try to access by providing a connection string
of their choice, then you would be happy if we just did something
simple here and left this problem for another day.

I am very curious to know (a) why work on this was abandoned (perhaps
the answer is just lack of round tuits, in which case there is no more
to be said), and (b) what people think of (1)-(3) above, and (c)
whether anyone knows of further problems that need to be considered
here.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
One more thing before we move on from this topic.  I'd been testing
modified versions of the AdjustUpgrade.pm logic by building from a
--from-source source tree, which seemed way easier than dealing
with a private git repo.  As it stands, TestUpgradeXversion.pm
refuses to run under $from_source, but I just diked that check out
and it seemed to work fine for my purposes.  Now, that's going to be
a regular need going forward, so I'd like to not need a hacked version
of the BF client code to do it.

Also, your committed version of TestUpgradeXversion.pm breaks that
use-case because you did

-   unshift(@INC, "$self->{pgsql}/src/test/perl");
+   unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");

which AFAICS is an empty directory in a $from_source run.

I suppose that the reason for not running under $from_source is to
avoid corrupting the saved installations with unofficial versions.
However, couldn't we skip the "save" step and still run the upgrade
tests against whatever we have saved?  (Maybe skip the same-version
test, as it's not quite reflecting any real case then.)

Here's a quick draft patch showing what I have in mind.  There may
well be a better way to deal with the wheres-the-source issue than
what is in hunk 2.  Also, I didn't reindent the unchanged code in
sub installcheck, and I didn't add anything about skipping
same-version tests.

regards, tom lane

diff --git a/PGBuild/Modules/TestUpgradeXversion.pm b/PGBuild/Modules/TestUpgradeXversion.pm
index a784686..408432d 100644
--- a/PGBuild/Modules/TestUpgradeXversion.pm
+++ b/PGBuild/Modules/TestUpgradeXversion.pm
@@ -51,8 +51,6 @@ sub setup
 	my $conf  = shift;# ref to the whole config object
 	my $pgsql = shift;# postgres build dir
 
-	return if $from_source;
-
 	return if $branch !~ /^(?:HEAD|REL_?\d+(?:_\d+)?_STABLE)$/;
 
 	my $animal = $conf->{animal};
@@ -351,7 +349,16 @@ sub test_upgrade## no critic (Subroutines::ProhibitManyArgs)
 	  if $verbose;
 
 	# load helper module from source tree
-	unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");
+	my $source_tree;
+	if ($from_source)
+	{
+		$source_tree = $self->{pgsql};
+	}
+	else
+	{
+		$source_tree = "$self->{buildroot}/$this_branch/pgsql";
+	}
+	unshift(@INC, "$source_tree/src/test/perl");
 	require PostgreSQL::Test::AdjustUpgrade;
 	PostgreSQL::Test::AdjustUpgrade->import;
 	shift(@INC);
@@ -694,6 +701,11 @@ sub installcheck
 	my $upgrade_loc  = "$upgrade_install_root/$this_branch";
 	my $installdir   = "$upgrade_loc/inst";
 
+	# Don't save in a $from_source build: we want the saved trees always
+	# to correspond to branch tips of the animal's standard repo.  We can
+	# perform upgrade tests against previously-saved trees, though.
+	if (!$from_source)
+	{
 	# for saving we need an exclusive lock.
 	get_lock($self, $this_branch, 1);
 
@@ -716,6 +728,7 @@ sub installcheck
 	  if ($verbose > 1);
 	send_result('XversionUpgradeSave', $status, \@saveout) if $status;
 	$steps_completed .= " XVersionUpgradeSave";
+	}
 
 	# in saveonly mode our work is done
 	return if $ENV{PG_UPGRADE_SAVE_ONLY};
@@ -744,7 +757,7 @@ sub installcheck
 		# other branch from being removed or changed under us.
 		get_lock($self, $oversion, 0);
 
-		$status =
+		my $status =
 		  test_upgrade($self, $save_env, $this_branch, $upgrade_install_root,
 			$dport, $install_loc, $other_branch) ? 0 : 1;
 


Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Mahmoud Sakr
Hi Tomas,
Thanks for picking up the patch and for the interesting discussions that
you bring!

> Interesting. Are there any particular differences compared to how we
> estimate for example range clauses on regular columns?
The theory is the same for scalar types. However, the statistics currently
collected for scalar types include synopses other than the histogram, such
as the MCV list, which should also be incorporated into the estimation. The
theory for using these additional statistics is worked out in the paper, but
we have not implemented it yet. We thought of sharing the part that is ready
until time allows us to implement the rest, or other developers continue it.

> Right. I think 0.5% is roughly expected for the default statistics
> target, which creates 100 histogram bins, each representing ~1% of the
> values. Which on average means ~0.5% error.
Since this patch deals with join selectivity, we are effectively crossing
100*100 bins. The ~0.5% error estimate comes from our experiments rather
than from a mathematical analysis.

> independence means we take (P1*P2). But maybe we should be very
> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>
> Another option is to use the length histogram, right? I mean, we know
> what the average length is, and it should be possible to use that to
> calculate how "far" ranges in a histogram can overlap.
The independence assumption exists if we use the lower and upper
histograms. It equally exists if we use the lower and length histograms.
In both cases, the link between the two histograms is lost during their
construction.
Your discussion brings up an interesting trade-off between optimistic and
pessimistic estimations. A typical way to deal with such a trade-off is to
average the two, as is done for example in model validation in machine
learning. Do you think we should implement something like
average( (P1*P2), Min(P1,P2) )?
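
For illustration, such a blended estimator could be shaped like the sketch
below (not code from the patch; it only assumes the Min() and
CLAMP_PROBABILITY macros that the selectivity code already uses):

/*
 * Sketch: blend the optimistic independence estimate (p1 * p2) with the
 * pessimistic bound Min(p1, p2) by averaging the two.  p1 and p2 are the
 * per-bound probabilities discussed above.
 */
static double
blended_join_selectivity(double p1, double p2)
{
	double		optimistic = p1 * p2;
	double		pessimistic = Min(p1, p2);
	double		result = 0.5 * (optimistic + pessimistic);

	CLAMP_PROBABILITY(result);	/* keep the result within [0, 1] */

	return result;
}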

> OK. But ideally we'd cross-check elements of the two multiranges, no?

> So if the column(s) contain a couple very common (multi)ranges that make
> it into an MCV, we'll ignore those? That's a bit unfortunate, because
> those MCV elements are potentially the main contributors to selectivity.
Both ideas would require collecting more detailed statistics, for example
similar to what is done for arrays. In this patch, we restricted ourselves
to the existing statistics.


> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
Right, the comment should explain in more detail that the collected statistics
are currently the same as for range types but may potentially diverge.

> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
Indeed, it is possible that the two functions will diverge if the statistics
for multirange types are refined.

--
Best regards
Mahmoud SAKR

On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra
 wrote:
>
> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: document the need to analyze partitioned tables

2023-01-18 Thread Laurenz Albe
On Wed, 2023-01-18 at 11:49 -0600, Justin Pryzby wrote:
> I tweaked this a bit to end up with:
> 
> > -    Partitioned tables are not processed by autovacuum.  Statistics
> > -    should be collected by running a manual ANALYZE 
> > when it is
> > +    The leaf partitions of a partitioned table are normal tables and are 
> > processed
> > +    by autovacuum; however, autovacuum does not process the partitioned 
> > table itself.
> > +    This is no problem as far as VACUUM is concerned, 
> > since
> > +    there's no need to vacuum the empty, partitioned table.  But, as 
> > mentioned in
> > +    , it also means that autovacuum 
> > won't
> > +    run ANALYZE on the partitioned table.
> > +    Although statistics are automatically gathered on its leaf partitions, 
> > some queries also need
> > +    statistics on the partitioned table to run optimally.  You should 
> > collect statistics by
> > +    running a manual ANALYZE when the partitioned table 
> > is
> >  first populated, and again whenever the distribution of data in its
> >  partitions changes significantly.
> >     
> 
> "partitions are normal tables" was techically wrong, as partitions can
> also be partitioned.

I am fine with your tweaks.  I think this is good to go.

Yours,
Laurenz Albe




Re: Parallelize correlated subqueries that execute within each worker

2023-01-18 Thread Tomas Vondra
Hi,

This patch hasn't been updated since September, and it got broken by
4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental
sort stuff a little bit. But the breakage was rather limited, so I took a
stab at fixing it - attached is the result, hopefully correct.

I also added a couple minor comments about stuff I noticed while
rebasing and skimming the patch, I kept those in separate commits.
There's also a couple pre-existing TODOs.

James, what's your plan with this patch? Do you intend to work on it for
PG16, or are there some issues I missed in the thread?


One of the queries in incremental_sort changed plans a little bit:

explain (costs off) select distinct
  unique1,
  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
from tenk1 t, generate_series(1, 1000);

switched from

 Unique  (cost=18582710.41..18747375.21 rows=1 width=8)
   ->  Gather Merge  (cost=18582710.41..18697375.21 rows=1000 ...)
 Workers Planned: 2
 ->  Sort  (cost=18582710.39..18593127.06 rows=417 ...)
   Sort Key: t.unique1, ((SubPlan 1))
 ...

to

 Unique  (cost=18582710.41..18614268.91 rows=1 ...)
   ->  Gather Merge  (cost=18582710.41..18614168.91 rows=2 ...)
 Workers Planned: 2
 ->  Unique  (cost=18582710.39..18613960.39 rows=1 ...)
   ->  Sort  (cost=18582710.39..18593127.06 ...)
 Sort Key: t.unique1, ((SubPlan 1))
   ...

which probably makes sense, as the cost estimate decreases a bit.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 95db15fe16303ed3f4fdea52af3b8d6d05a8d7a6 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Mon, 26 Sep 2022 20:30:23 -0400
Subject: [PATCH 1/5] Add tests before change

---
 src/test/regress/expected/select_parallel.out | 108 ++
 src/test/regress/sql/select_parallel.sql  |  25 
 2 files changed, 133 insertions(+)

diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 91f74fe47a3..9b4d7dd44a4 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -311,6 +311,114 @@ select count(*) from tenk1 where (two, four) not in
  1
 (1 row)
 
+-- test parallel plans for queries containing correlated subplans
+-- where the subplan only needs params available from the current
+-- worker's scan.
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t, generate_series(1, 10);
+ QUERY PLAN 
+
+ Gather
+   Output: (SubPlan 1)
+   Workers Planned: 4
+   ->  Nested Loop
+ Output: t.unique1
+ ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
+   Output: t.unique1
+ ->  Function Scan on pg_catalog.generate_series
+   Output: generate_series.generate_series
+   Function Call: generate_series(1, 10)
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)
+(14 rows)
+
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t;
+  QUERY PLAN  
+--
+ Gather
+   Output: (SubPlan 1)
+   Workers Planned: 4
+   ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
+ Output: t.unique1
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)
+(9 rows)
+
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t
+  limit 1;
+QUERY PLAN 
+---
+ Limit
+   Output: ((SubPlan 1))
+   ->  Seq Scan on public.tenk1 t
+ Output: (SubPlan 1)
+ SubPlan 1
+   ->  Index Only Scan using tenk1_unique1 on public.tenk1
+ Output: t.unique1
+ Index Cond: (tenk1.unique1 = t.unique1)
+(8 rows)
+
+explain (costs off, verbose) select t.unique1
+  from tenk1 t
+  where t.unique1 = (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1);
+ QUERY PLAN  
+-
+ Seq Scan on public.tenk1 t
+   Output: t.unique1
+   Filter: (t.unique1 = (SubPlan 1))
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)

Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Karl O. Pinc
On Wed, 18 Jan 2023 18:34:47 +0100
Alvaro Herrera  wrote:

> On 2023-Jan-18, Karl O. Pinc wrote:
> 
> > On Wed, 18 Jan 2023 13:30:45 +0100
> > Alvaro Herrera  wrote:
> >   
> > > Not related to this patch: it's very annoying that in the PDF
> > > output, each section in the appendix doesn't start on a blank
> > > page -- which means that the doc page for many modules starts in
> > > the middle of a page were the previous one ends.  
> >   
> > > I wonder if we can tweak something in the stylesheet to include a
> > > page break.  
> > 
> > Would this be something to be included in this patch?
> > (If I can figure it out.)  
> 
> No, I think we should do that change separately.  I just didn't think
> a parenthetical complaint was worth a separate thread for it; but if you
> do create a patch, please do create a new thread (unless the current
> patch in this one is committed already.)

Oops.  Already sent a revised patch that includes starting each
module on a new page, for PDF output.  I'll wait to rip that
out after review and start a new thread if necessary.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-01-18 Thread Andrei Zubkov
Hi Tomas,

On Wed, 2023-01-18 at 17:29 +0100, Tomas Vondra wrote:
> I took a quick look at this patch, to see if there's something we
> want/can get into v16. The last version was submitted about 9 months
> ago, and it doesn't apply cleanly anymore, but the bitrot is fairly
> minor. Not sure there's still interest, though.

Thank you for your attention to this patch!

I'm still very interested in this patch, and I think I'll try to rebase
it within a week or two if it seems possible to get it into 16.
> 
> I'd probably call it "created_at" or something like
> that, but that's a minor detail. Or maybe stats_reset, which is what
> we
> use in pgstat?

Yes, there is a naming issue. My thinking was the following:
 - "stats_reset" is not quite correct here, because the moment a statement
entry is created is definitely not a reset. The field is named for what it
means: the time from which statistics have been collected for this
particular entry.
 - "created_at" perfectly matches the purpose of the field, but seems less
self-explanatory to me.

> 
> But is the second timestamp for the min/max fields really useful?
> AFAIK
> to perform analysis, people take regular pg_stat_statements
> snapshots,
> which works fine for counters (calculating deltas) but not for gauges
> (which need a reset, to track fresh values). But people analyzing
> this
> are already resetting the whole entry, and so the snapshots already
> are
> tracking deltas.
>
> So I'm not convinced actually need the second timestamp.

The main purpose of the patch is to give collecting solutions a way to
avoid resetting pgss at all, just as it already works for the pg_stat_
views. The only real need for a reset is that we can't be sure an observed
statement was not evicted and re-entered since the last sample. Right now
we can only do a whole reset on each sample and then see how many entries
are in the pgss hashtable on the next sample - how close that value is to
the max. If there is plenty of space in the hashtable, we can hope that
there were no evictions since the last sample. However, a reset could be
performed by someone else and we would know nothing about it.
With a timestamp in the stats_since field we know exactly how long the
statistics for a statement have been tracked, so a sampling solution can
avoid pgss resets entirely. Avoiding such resets means avoiding
interference between monitoring solutions.
But if no more full resets are done, we can't track min/max values, because
those still need a reset, and there is nothing we can do about such resets -
they are necessary. However, I still want to know when a min/max reset was
performed. That will help to detect possible interference caused by such
resets.
> 
> 
> A couple more comments:
> 
> 1) I'm not sure why the patch is adding tests of permissions on the
> pg_stat_statements_reset function?
> 
> 2) If we want the second timestamp, shouldn't it also cover resets of
> the mean values, not just min/max?

I think the mean values shouldn't be a target for a partial reset, because
they can easily be reconstructed by the sampling solution without a reset.

> 
> 3) I don't understand why the patch is adding "IS NOT NULL AS t" to
> various places in the regression tests.

Most of the tests were copied from the previous version. I'll recheck
them.

> 
> 4) I rather dislike the "minmax" naming, because that's often used in
> other contexts (for BRIN indexes), and as I mentioned maybe it should
> also cover the "mean" fields.

Agreed, but I couldn't make it better. Other versions seemed worse to
me...
> 
> 
Regards, Andrei Zubkov





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> pgstat_report_analyze() will totally override the
> tabentry->dead_tuples information that drives autovacuum.c, based on
> an estimate derived from a random sample -- which seems to me to be an
> approach that just doesn't have any sound theoretical basis.

Yikes. I think we don't have a choice but to have a method to correct
the information somehow, because AFAIK the statistics system is not
crash-safe. But that approach does seem to carry significant risk of
overwriting correct information with wrong information.

> On reflection, maybe you're right here. Maybe it's true that the
> bigger problem is just that the implementation is bad, even on its own
> terms -- since it's pretty bad! Hard to say at this point.
>
> Depends on how you define it, too. Statistical sampling is just not
> fit for purpose here. But is that a problem with
> autovacuum_vacuum_scale_factor? I may have said words that could
> reasonably be interpreted that way, but I'm not prepared to blame it
> on the underlying autovacuum_vacuum_scale_factor model now. It's
> fuzzy.

Yep. I think what we should try to evaluate is which number is
furthest from the truth. My guess is that the threshold is so high
relative to what a reasonable value would be that you can't get any
benefit out of making the dead tuple count more accurate. Like, if the
threshold is 100x too high, or something, then who cares how accurate
the dead tuples number is? It's going to be insufficient to trigger
vacuuming whether it's right or wrong. We should try substituting a
less-bogus threshold calculation and see what happens then. An
alternative theory is that the threshold is fine and we're only
failing to reach it because the dead tuple calculation is so
inaccurate. Maybe that's even true in some scenarios, but I bet that
it's never the issue when people have really big tables. The fact that
I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
1TB of bloat in my 10TB table. Among other problems, I can't even
vacuum away that much bloat in one index pass, because autovacuum
can't use enough work memory for that. Also, the absolute space
wastage matters.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Karl O. Pinc
On Wed, 18 Jan 2023 13:25:57 +0100
Alvaro Herrera  wrote:

> On 2023-Jan-02, Karl O. Pinc wrote:

> > Attached is a patch: contrib_v1.patch
> > 
> > It modifies Appendix F, the contrib directory.
> > 
> > It adds brief text into the titles shown in the
> > table of contents so it's easier to tell what
> > each module does.  It also suffixes [trusted] or [obsolete]
> > on the relevant titles.  

> 
> I'm not 100% sold on having the
> "trusted" or "obsolete" marker on the titles themselves, though.  Not
> sure what alternative do we have, though, other than leave them out
> completely.

The alternative would be to have a separate table with modules
for rows and "trusted" and "obsolete" columns.  It seems like
more of a maintenance hassle than having the markers in the titles.

Let me know if you want a table.  I do like having a place
to look to over all the modules to see what is "trusted" or "obsolete".

I suppose there could just be a table, with module names, descriptions,
and trusted and obsolete flags.  Instead of a table of contents
for the modules, the module names in the table could be links.  But
that'd involve suppressing the table of contents showing all the
module names.  And it has the problem of a possible mismatch between
the modules listed in the table and the modules that exist.

> There's a typo "equalivent" in two places.

Fixed.

> In passwordcheck, I would say just "check for weak passwords" or maybe
> "verify password strength".

I used "verify password strength".

> 
> pg_buffercache is missing.  Maybe "-- inspect state of the Postgres
> buffer cache".

I used "inspect Postgres buffer cache state"

> For pg_stat_statements I suggest "track statistics of planning and
> execution of SQL queries"

I had written "track SQL query planning and execution statistics".
Changed to: "track statistics of SQL planning and execution"

I don't really care.  If you want your version I'll submit another
patch.

> For sepgsql, as I understand it is strictly SELinux based, not just
> "-like".  So this needs rewording: "label-based, SELinux-like,
> mandatory access control".  Maybe "SELinux-based implementation of
> mandatory access control for row-level security".

Changed to: "SELinux-based row-level security mandatory access control"

> xml -- typo "qeurying"

Fixed.

I have also made the patch put each module on a separate
page when producing PDF documents.  This did produce one warning,
which seems unrelated to me.  The pdf seems right. I also tried 
just "make", to be sure I didn't break anything unrelated.  Seemed 
to work.  So..., works for me.

New patch attached: contrib_v2.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/adminpack.sgml b/doc/src/sgml/adminpack.sgml
index 184e96d7a0..04f3b52379 100644
--- a/doc/src/sgml/adminpack.sgml
+++ b/doc/src/sgml/adminpack.sgml
@@ -1,7 +1,7 @@
 
 
 
- adminpack
+ adminpack  pgAdmin support toolpack
 
  
   adminpack
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 923cbde9dd..4006c75cdf 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -1,7 +1,7 @@
 
 
 
- amcheck
+ amcheck  tools to verify index consistency
 
  
   amcheck
diff --git a/doc/src/sgml/auth-delay.sgml b/doc/src/sgml/auth-delay.sgml
index 40629311b1..0571f2a99d 100644
--- a/doc/src/sgml/auth-delay.sgml
+++ b/doc/src/sgml/auth-delay.sgml
@@ -1,7 +1,7 @@
 
 
 
- auth_delay
+ auth_delay  pause on authentication failure
 
  
   auth_delay
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index bb7342b120..0c4656ee30 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -1,7 +1,7 @@
 
 
 
- auto_explain
+ auto_explain  log execution plans of slow queries
 
  
   auto_explain
diff --git a/doc/src/sgml/basebackup-to-shell.sgml b/doc/src/sgml/basebackup-to-shell.sgml
index 491368eb8f..b6a3b39541 100644
--- a/doc/src/sgml/basebackup-to-shell.sgml
+++ b/doc/src/sgml/basebackup-to-shell.sgml
@@ -1,7 +1,7 @@
 
 
 
- basebackup_to_shell
+ basebackup_to_shell  example "shell" pg_basebackup module
 
  
   basebackup_to_shell
diff --git a/doc/src/sgml/basic-archive.sgml b/doc/src/sgml/basic-archive.sgml
index 60f23d2855..b4d43ced20 100644
--- a/doc/src/sgml/basic-archive.sgml
+++ b/doc/src/sgml/basic-archive.sgml
@@ -1,7 +1,7 @@
 
 
 
- basic_archive
+ basic_archive  an example WAL archive module
 
  
   basic_archive
diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 98d0316175..672ac2ed19 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -1,7 +1,7 @@
 
 
 
- bloom
+ bloom  bloom filter index access
 
  
   bloom
diff --git a/doc/src/sgml/btree-gin.sgml b/doc/src/sgml/btree-gin.sgml
index 870c25559e..0eaea0dbcd 100644
--- a/doc/src/sgml/btree-gin.sgml
+++ b/doc/src/sgml/btree-gin.sgml
@@ -1,7 +1,8 @@
 
 
 
- btree_gin
+ btree_gin 
+   sample GIN B-tree equivalent operator 

Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote:
> In general, looks good. I think this will often call HaveNFreeProcs
> twice, though, and that would be better to avoid, e.g.

I should have thought of this.  This is fixed in v2.

> In the common case where we hit neither limit, this only counts free
> connection slots once. We could do even better by making
> HaveNFreeProcs have an out parameter for the number of free procs
> actually found when it returns false, but that's probably not
> important.

Actually, I think it might be important.  IIUC the separate calls to
HaveNFreeProcs might return different values for the same input, which
could result in incorrect error messages (e.g., you might get the
reserved_connections message despite setting reserved_connections to 0).
So, I made this change in v2, too.

> I don't think that we should default both the existing GUC and the new
> one to 3, because that raises the default limit in the case where the
> new feature is not used from 3 to 6. I think we should default one of
> them to 0 and the other one to 3. Not sure which one should get which
> value.

I chose to set reserved_connections to 0 since it is new and doesn't have a
pre-existing default value.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 94ee89548fd1080447b784993a1418480f407b49 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v2 1/2] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 #define MAXLISTEN	64
@@ -908,11 +908,11 @@ PostmasterMain(int argc, char *argv[])
 	/*
 	 * Check for invalid combinations of GUC settings.
 	 */
-	if (ReservedBackends >= MaxConnections)
+	if (SuperuserReservedBackends >= MaxConnections)
 	{
 		write_stderr("%s: superuser_reserved_connections (%d) must be less than max_connections (%d)\n",
 	 progname,
-	 ReservedBackends, MaxConnections);
+	 SuperuserReservedBackends, MaxConnections);
 		ExitPostmaster(1);
 	}
 	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..6fa696fe8d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -927,8 +927,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	 * limited by max_connections or superuser_reserved_connections.
 	 */
 	if (!am_superuser && !am_walsender &&
-		ReservedBackends > 0 &&
-		!HaveNFreeProcs(ReservedBackends))
+		SuperuserReservedBackends > 0 &&
+		!HaveNFreeProcs(SuperuserReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  errmsg("remaining connection slots are reserved for non-replication superuser connections")));
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..5aa2cda8f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2163,7 +2163,7 @@ struct config_int ConfigureNamesInt[] =
 			gettext_noop("Sets the number of connection slots reserved for superusers."),
 			NULL
 		},
-		&ReservedBackends,
+		&SuperuserReservedBackends,
 		3, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 203177e1ff..168d85a3d1 100644
--- 

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan

On Wed, Jan 18, 2023 at 7:54 AM Robert Haas  wrote:
> > It just fits: the dead tuples approach can sometimes be so
> > completely wrong that even an alternative triggering condition based
> > on something that is virtually unrelated to the thing we actually care
> > about can do much better in practice. Consistently, reliably, for a
> > given table/workload.
>
> Hmm, I don't know. I have no intuition one way or the other for
> whether we're undercounting dead tuples, and I don't understand what
> would cause us to do that. I thought that we tracked that accurately,
> as part of the statistics system, not by sampling
> (pg_stat_all_tables.n_dead_tup).

It's both, kind of.

pgstat_report_analyze() will totally override the
tabentry->dead_tuples information that drives autovacuum.c, based on
an estimate derived from a random sample -- which seems to me to be an
approach that just doesn't have any sound theoretical basis. So while
there is a sense in which we track dead tuples incrementally and
accurately using the statistics system, we occasionally call
pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT
we might as well not even bother tracking things reliably the rest of
the time.

Random sampling works because the things that you don't sample are
very well represented by the things that you do sample. That's why
even very stale optimizer statistics can work quite well (and why the
EAV anti-pattern makes query optimization impossible) -- the
distribution is often fixed, more or less. The statistics generalize
very well because the data meets certain underlying assumptions that
all data stored in a relational database is theoretically supposed to
meet. Whereas with dead tuples, the whole point is to observe and
count dead tuples so that autovacuum can then go remove the dead
tuples -- which then utterly changes the situation! That's a huge
difference.

ISTM that you need a *totally* different approach for something that's
fundamentally dynamic, which is what this really is. Think about how
the random sampling will work in a very large table with concentrated
updates. The modified pages need to outweigh the large majority of
pages in the table that can be skipped by VACUUM anyway.

I wonder how workable it would be to just teach pgstat_report_analyze
and pgstat_report_vacuum to keep out of this, or to not update the
stats unless it's to increase the number of dead_tuples...
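
Purely as an illustration of that idea, the change could be as small as the
following sketch (hypothetical; only the tabentry->dead_tuples field comes
from the existing code):

/*
 * Hypothetical sketch: let an ANALYZE-derived estimate only ratchet the
 * tracked dead-tuple count upward, rather than overwriting whatever was
 * accumulated incrementally since the last vacuum.
 */
static void
merge_sampled_dead_tuples(PgStat_StatTabEntry *tabentry,
						  PgStat_Counter sampled_dead_tuples)
{
	if (sampled_dead_tuples > tabentry->dead_tuples)
		tabentry->dead_tuples = sampled_dead_tuples;
}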

> I think we ought to fire autovacuum_vacuum_scale_factor out of an
> airlock.

Couldn't agree more. I think that this and the underlying statistics
are the really big problem as far as under-vacuuming is concerned.

> I think we also ought to invent some sort of better cost limit system
> that doesn't shoot you in the foot automatically as the database
> grows. Nobody actually wants to limit the rate at which the database
> vacuums stuff to a constant. What they really want to do is limit it
> to a rate that is somewhat faster than the minimum rate needed to
> avoid disaster. We should try to develop metrics for whether vacuum is
> keeping up.

Definitely agree that doing some kind of dynamic updating is
promising. What we thought at the start versus what actually happened.
Something cyclic, just like autovacuum itself.

> I don't actually see any reason why dead tuples, even counted in a
> relatively stupid way, isn't fundamentally good enough to get all
> tables vacuumed before we hit the XID age cutoff. It doesn't actually
> do that right now, but I feel like that must be because we're doing
> other stupid things, not because there's anything that terrible about
> the metric as such. Maybe that's wrong, but I find it hard to imagine.

On reflection, maybe you're right here. Maybe it's true that the
bigger problem is just that the implementation is bad, even on its own
terms -- since it's pretty bad! Hard to say at this point.

Depends on how you define it, too. Statistical sampling is just not
fit for purpose here. But is that a problem with
autovacuum_vacuum_scale_factor? I may have said words that could
reasonably be interpreted that way, but I'm not prepared to blame it
on the underlying autovacuum_vacuum_scale_factor model now. It's
fuzzy.

> > We're
> > still subject to the laws of physics. VACUUM would still be something
> > that more or less works at the level of the whole table, or not at
> > all. So being omniscient seems kinda overrated to me. Adding more
> > information does not in general lead to better outcomes.
>
> Yeah, I think that's true. In particular, it's not much use being
> omniscient but stupid. It would be better to have limited information
> and be smart about what you did with it.

I would put it like this: autovacuum shouldn't ever be a sucker. It
should pay attention to disconfirmatory signals. The information that
drives its decision making process should be treated as provisional.

Even if the information was correct at one point, the contents of the
table are constantly changing 

Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 10:22:14 -0800, Andres Freund wrote:
> On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote:
> > On 07.01.23 08:21, Peter Eisentraut wrote:
> > > This patch version looks correct to me.  It is almost the same as the
> > > one that Andres had posted in his thread, except that yours also
> > > modifies slist_delete() and dlist_member_check().  Both of these changes
> > > also look correct to me.
> > 
> > committed
> 
> Unfortunately this causes a build failure with ILIST_DEBUG
> enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work
> with const :(.  I'll push a quick workaround.

Pushed.

Greetings,

Andres Freund




Re: [PATCH] Add <> support to sepgsql_restorecon

2023-01-18 Thread Joe Conway
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

This needs regression test support for the feature and some minimal 
documentation that shows how to make use of it.

The new status of this patch is: Waiting on Author


Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote:
> On 07.01.23 08:21, Peter Eisentraut wrote:
> > On 23.11.22 14:57, Aleksander Alekseev wrote:
> > > Hi Andres,
> > > 
> > > Thanks for the review!
> > > 
> > > > I don't think it is correct for any of these to add const. The
> > > > only reason it
> > > > works is because of casting etc.
> > > 
> > > Fair enough. PFA the corrected patch v2.
> > 
> > This patch version looks correct to me.  It is almost the same as the
> > one that Andres had posted in his thread, except that yours also
> > modifies slist_delete() and dlist_member_check().  Both of these changes
> > also look correct to me.
> 
> committed

Unfortunately this causes a build failure with ILIST_DEBUG
enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work
with const :(.  I'll push a quick workaround.

Greetings,

Andres Freund




Re: document the need to analyze partitioned tables

2023-01-18 Thread Bruce Momjian
On Wed, Jan 18, 2023 at 11:49:19AM -0600, Justin Pryzby wrote:
> On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> > On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > > Maybe (all?) the clarification the docs need is to say:
> > > > "Partitioned tables are not *themselves* processed by autovacuum."
> > > 
> > > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > > since most people assume autovacuum handles _all_ statistics updating.
> 
> That's what 61fa6ca79 aimed to do.  Laurenz is suggesting further
> clarification.

Ah, makes sense, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Tomas Vondra
Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".

However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: document the need to analyze partitioned tables

2023-01-18 Thread Justin Pryzby
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> > 
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.

That's what 61fa6ca79 aimed to do.  Laurenz is suggesting further
clarification.

> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables?  It sounds bad to me.
> 
> Andrey Lepikhov had an example earlier in this thread[1].  It doesn't take
> an exotic query. 
> 
> Attached is a new version of my patch that tries to improve the wording.

I tweaked this a bit to end up with:

> -Partitioned tables are not processed by autovacuum.  Statistics
> -should be collected by running a manual ANALYZE when 
> it is
> +The leaf partitions of a partitioned table are normal tables and are 
> processed
> +by autovacuum; however, autovacuum does not process the partitioned 
> table itself.
> +This is no problem as far as VACUUM is concerned, 
> since
> +there's no need to vacuum the empty, partitioned table.  But, as 
> mentioned in
> +, it also means that autovacuum 
> won't
> +run ANALYZE on the partitioned table.
> +Although statistics are automatically gathered on its leaf partitions, 
> some queries also need
> +statistics on the partitioned table to run optimally.  You should 
> collect statistics by
> +running a manual ANALYZE when the partitioned table is
>  first populated, and again whenever the distribution of data in its
>  partitions changes significantly.
> 

"partitions are normal tables" was techically wrong, as partitions can
also be partitioned.

-- 
Justin




Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Alvaro Herrera
On 2023-Jan-18, Karl O. Pinc wrote:

> On Wed, 18 Jan 2023 13:30:45 +0100
> Alvaro Herrera  wrote:
> 
> > Not related to this patch: it's very annoying that in the PDF output,
> > each section in the appendix doesn't start on a blank page -- which
> > means that the doc page for many modules starts in the middle of a
> > page were the previous one ends.
> 
> > I wonder if we can tweak something in the stylesheet to include a page
> > break.
> 
> Would this be something to be included in this patch?
> (If I can figure it out.)

No, I think we should do that change separately.  I just didn't think a
parenthetical complaint was worth a separate thread for it; but if you do
create a patch, please do create a new thread (unless the current patch
in this one is committed already.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)




Re: Removing redundant grouping columns

2023-01-18 Thread Tom Lane
David Rowley  writes:
> No objections from me.

Pushed, thanks for looking at it.

regards, tom lane




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-18 Thread Ankit Kumar Pandey




On 18/01/23 15:12, David Rowley wrote:



I also thought I'd better test that foreach_delete_current() works
with foreach_reverse(). I can confirm that it *does not* work
correctly.  I guess maybe you only tested the fact that it deleted the
current item and not that the subsequent loop correctly went to the
item directly before the deleted item. There's a problem with that. We
skip an item.


Hmm, not really sure why I missed that. I tried this again (added the
following in postgres.c, just above PortalStart):

List	   *l = NIL;
ListCell   *lc;

l = lappend_int(l, 1);
l = lappend_int(l, 2);
l = lappend_int(l, 3);
l = lappend_int(l, 4);

foreach_reverse(lc, l)
{
	if (foreach_current_index(lc) == 2)		/* delete 3 */
		foreach_delete_current(l, lc);
}

foreach(lc, l)
{
	ereport(LOG, (errmsg("%d", lfirst_int(lc))));
}

Got result:
2023-01-18 20:23:28.115 IST [51007] LOG:  1
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG:  2
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG:  4
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();

I had expected list_delete_cell to take care of the rest.


Instead of fixing that, I think it's likely better just to loop
backwards manually with a for() loop, so I've adjusted the patch to
work that way.  It's quite likely that the additional code in
foreach() and what was in foreach_reverse() is slower than looping
manually due to the additional work those macros do to set the cell to
NULL when we run out of cells to loop over.


Okay, the current version looks fine as well.
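
For reference, the manual reverse loop being described could look roughly
like this (a sketch; should_remove() stands in for whatever condition the
patch actually checks):

/*
 * Sketch: walk the list from the tail towards the head with an
 * index-based for() loop.  Deleting the cell at position i does not
 * disturb the positions of the cells still to be visited (index < i).
 */
for (int i = list_length(l) - 1; i >= 0; i--)
{
	ListCell   *lc = list_nth_cell(l, i);

	if (should_remove(lfirst(lc)))
		l = list_delete_cell(l, lc);
}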


I made another pass over the v7 patch and fixed a bug that was
disabling the optimization when the deepest WindowAgg had a
runCondition.  This should have been using llast_node instead of
linitial_node.  The top-level WindowAgg is the last in the list.



I also made a pass over the tests and added a test case for the
runCondition check to make sure we disable the optimization when the
top-level WindowAgg has one of those.  


I wasn't sure what your test comments case numbers were meant to represent.
They were not aligned with the code comments that document when the
optimisation is disabled, they started out aligned, but seemed to go off the
rails at #3. I've now made them follow the comments in
create_one_window_path() and made it more clear what we expect the test
outcome to be in each case.


Those were just numbering for the exceptional cases; making them in sync
with the comments wasn't really on my mind, but now they look better.


I've attached the v9 patch. I feel like this patch is quite
self-contained and I'm quite happy with it now.  If there are no
objections soon, I'm planning on pushing it.


The patch is already rebased on the latest master, and the tests are all green.

I tried some basic profiling and it looked good.

I also tried a somewhat unrealistic case.

create table abcdefgh(a int, b int, c int, d int, e int, f int, g int, h int);

insert into abcdefgh select a,b,c,d,e,f,g,h from
generate_series(1,7) a,
generate_series(1,7) b,
generate_series(1,7) c,
generate_series(1,7) d,
generate_series(1,7) e,
generate_series(1,7) f,
generate_series(1,7) g,
generate_series(1,7) h;

explain analyze select count(*) over (order by a),
row_number() over (partition by a order by b) from abcdefgh order by 
a,b,c,d,e,f,g,h;

In patch version

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=64957.894..81950.352 rows=5764801 loops=1)
   ->  WindowAgg  (cost=1023241.14..1138536.30 rows=5764758 width=40) (actual time=37959.055..60391.799 rows=5764801 loops=1)
         ->  Sort  (cost=1023241.14..1037653.03 rows=5764758 width=32) (actual time=37959.045..52968.791 rows=5764801 loops=1)
               Sort Key: a, b, c, d, e, f, g, h
               Sort Method: external merge  Disk: 237016kB
               ->  Seq Scan on abcdefgh  (cost=0.00..100036.58 rows=5764758 width=32) (actual time=0.857..1341.107 rows=5764801 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 82748.789 ms
(8 rows)

In Master

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=1040889.72..1960081.97 rows=5764758 width=48) (actual time=23461.815..69654.700 rows=5764801 loops=1)
   Sort Key: a, b, c, d, e, f, g, h
   Presorted Key: a, b
   Full-sort Groups: 49  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
   Pre-sorted Groups: 49  Sort Method: external merge  Average Disk: 6688kB  Peak Disk: 6688kB
   ->  WindowAgg  (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=22729.171..40189.407 rows=5764801 loops=1)
 ->  

Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Tomas Vondra
Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
> SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the << 
> predicate,
> which is equal to 0.005, not utilizing the statistics that the optimizer
> collects for range attributes.
> 
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range
> types in this patch.
> 

Interesting. Are there any particular differences compared to how we
estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics for
> range types, which is the bounds histogram. It works fairly accurate for the
> operations <<, >>, &&, &<, &>, <=, >= with estimation error of about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics
target, which creates 100 histogram bins, each representing ~1% of the
values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the
> operations @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the range
> bounds. A point to discuss is whether or not to keep these last two 
> operations.

That's a good question. I think the independence assumption is rather
foolish in this case, so I wonder if we could "stabilize" this by making
some different - less optimistic - assumption. Essentially, we have an
estimates for lower/upper boundaries:

  P1 = P(lower(var1) <= lower(var2))
  P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very
pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?

Another option is to use the length histogram, right? I mean, we know
what the average length is, and it should be possible to use that to
calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange types,
> treating a multirange as a single range which is its bounding box.
> 

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of scalar
> types. We, however, don't implement it for scalars, since more work is needed
> to make use of the other statistics available for scalars, such as the MCV.
> This is left as a future work.
> 

So if the column(s) contain a couple very common (multi)ranges that make
it into an MCV, we'll ignore those? That's a bit unfortunate, because
those MCV elements are potentially the main contributors to selectivity.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




CREATEROLE users vs. role properties

2023-01-18 Thread Robert Haas
On Mon, Jan 16, 2023 at 2:29 PM Robert Haas  wrote:
> 1. It's still possible for a CREATEROLE user to hand out role
> attributes that they don't possess. The new prohibitions in
> cf5eb37c5ee0cc54c80d95c1695d7fca1f7c68cb prevent a CREATEROLE user
> from handing out membership in a role on which they lack sufficient
> permissions, but they don't prevent a CREATEROLE user who lacks
> CREATEDB from creating a new user who does have CREATEDB. I think we
> should subject the CREATEDB, REPLICATION, and BYPASSRLS attributes to
> the same rule that we now use for role memberships: you've got to have
> the property in order to give it to someone else. In the case of
> CREATEDB, this would tighten the current rules, which allow you to
> give out CREATEDB without having it. In the case of REPLICATION and
> BYPASSRLS, this would liberalize the current rules: right now, a
> CREATEROLE user cannot give REPLICATION or BYPASSRLS to another user
> even if they possess those attributes.
>
> This proposal doesn't address the CREATEROLE or CONNECTION LIMIT
> properties. It seems possible to me that someone might want to set up
> a CREATEROLE user who can't make more such users, and this proposal
> doesn't manufacture any way of doing that. It also doesn't let you
> constraint the ability of a CREATEROLE user to set a CONNECTION LIMIT
> for some other user. I think that's OK. It might be nice to have ways
> of imposing such restrictions at some point in the future, but it is
> not very obvious what to do about such cases and, importantly, I don't
> think there's any security impact from failing to address those cases.
> If a CREATEROLE user without CREATEDB can create a new role that does
> have CREATEDB, that's a privilege escalation. If they can hand out
> CREATEROLE, that isn't: they already have it.

Here is a patch implementing the above proposal. Since this is fairly
closely related to already-committed work, I would like to get this
into v16. That way, all the changes to how CREATEROLE works will go
into a single release, which seems less confusing for users. It is
also fairly clear to me that this is an improvement over the status
quo. Sometimes things that seem clear to me turn out to be false, so
if this change seems like a problem to you, please let me know.
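
For concreteness, the check for one of these attributes could be shaped
roughly like the sketch below (hypothetical, not an excerpt from the
attached patch; it assumes a have_createdb_privilege() helper is available
to the role-creation code):

/*
 * Hypothetical sketch: a CREATEROLE user may only hand out an attribute
 * it possesses itself.  Analogous checks would apply to REPLICATION and
 * BYPASSRLS.
 */
if (createdb && !superuser() && !have_createdb_privilege())
	ereport(ERROR,
			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
			 errmsg("permission denied to create role"),
			 errdetail("Only roles with the CREATEDB attribute may create roles with the CREATEDB attribute.")));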

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Adjust-interaction-of-CREATEROLE-with-role-proper.patch
Description: Binary data


Re: ANY_VALUE aggregate

2023-01-18 Thread Vik Fearing

On 1/18/23 16:06, Peter Eisentraut wrote:

On 05.12.22 21:18, Vik Fearing wrote:

On 12/5/22 15:57, Vik Fearing wrote:
The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It 
returns an implementation-dependent (i.e. non-deterministic) value 
from the rows in its group.


PFA an implementation of this aggregate.


Here is v2 of this patch.  I had forgotten to update sql_features.txt.


In your patch, the documentation says the definition is any_value("any") 
but the catalog definitions are any_value(anyelement).  Please sort that 
out.


Since the transition function is declared strict, null values don't need 
to be checked.


Thank you for the review.  Attached is a new version rebased to d540a02a72.
--
Vik Fearing
From 9cf2c5b56ea38d3080c0cb9f8ef9e6229d8696b4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 9 Apr 2022 00:07:38 +0200
Subject: [PATCH] Implement ANY_VALUE aggregate

SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an
implementation-dependent (i.e. non-deterministic) value from the
aggregated rows.
---
 doc/src/sgml/func.sgml   | 14 ++
 src/backend/catalog/sql_features.txt |  1 +
 src/backend/utils/adt/misc.c | 13 +
 src/include/catalog/pg_aggregate.dat |  4 
 src/include/catalog/pg_proc.dat  |  8 
 src/test/regress/expected/aggregates.out | 24 
 src/test/regress/sql/aggregates.sql  |  6 ++
 7 files changed, 70 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b8dac9ef46..8ff9decfec 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19735,6 +19735,20 @@ SELECT NULLIF(value, '(none)') ...
  
 
  
+  
+   
+
+ any_value
+
+any_value ( anyelement )
+same as input type
+   
+   
+Chooses a non-deterministic value from the non-null input values.
+   
+   Yes
+  
+
   

 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index abad216b7e..dfd3882801 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -520,6 +520,7 @@ T622	Trigonometric functions			YES
 T623	General logarithm functions			YES	
 T624	Common logarithm functions			YES	
 T625	LISTAGG			NO	
+T626	ANY_VALUE			YES	
 T631	IN predicate with one list element			YES	
 T641	Multiple column assignment			NO	only some syntax variants supported
 T651	SQL-schema statements in SQL routines			YES	
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 220ddb8c01..a9251f977e 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -1041,3 +1041,16 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS)
 	else
 		PG_RETURN_NULL();
 }
+
+/*
+ * Transition function for the ANY_VALUE aggregate
+ *
+ * Currently this just returns the first value, but in the future it might be
+ * able to signal to the aggregate that it does not need to be called anymore.
+ */
+Datum
+any_value_trans(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+}
+
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 8c957437ea..aac60dee58 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -625,4 +625,8 @@
   aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w',
   aggmfinalmodify => 'w', aggtranstype => 'internal' },
 
+# any_value
+{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans',
+  aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' },
+
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 86eb8e8c58..95e760440e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11891,4 +11891,12 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '8981', descr => 'arbitrary value from among input values',
+  proname => 'any_value', prokind => 'a', proisstrict => 'f',
+  prorettype => 'anyelement', proargtypes => 'anyelement',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8982', descr => 'any_value transition function',
+  proname => 'any_value_trans', prorettype => 'anyelement', proargtypes => 'anyelement anyelement',
+  prosrc => 'any_value_trans' },
+
 ]
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index ae3b905331..b240ef522b 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -25,6 +25,24 @@ SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
  32.6667
 (1 row)
 
+SELECT any_value(v) FROM (VALUES (1)) AS v (v);
+ any_value 
+---
+ 1
+(1 row)
+
+SELECT any_value(v) FROM (VALUES (NULL)) AS v (v);
+ any_value 
+---
+ 
+(1 row)
+

Re: Improve GetConfigOptionValues function

2023-01-18 Thread Bharath Rupireddy
On Wed, Jan 18, 2023 at 9:44 PM Tom Lane  wrote:
>
> Possibly a better answer is to refactor into separate functions,
> along the lines of
>
> static bool
> ConfigOptionIsShowable(struct config_generic *conf)
>
> static void
> GetConfigOptionValues(struct config_generic *conf, const char **values)

+1, and a ConfigOptionIsShowable() function could replace the explicit
showable checks in two other places too.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-01-18 Thread Tomas Vondra
Hi,

I took a quick look at this patch, to see if there's something we
want/can get into v16. The last version was submitted about 9 months
ago, and it doesn't apply cleanly anymore, but the bitrot is fairly
minor. Not sure there's still interest, though.

As for the patch, I wonder if it's unnecessarily complex. It adds *two*
timestamps for each pg_stat_statements entry - one for reset of the
whole entry, one for reset of "min/max" times only.

I can see why the first timestamp (essentially tracking creating of the
entry) is useful. I'd probably call it "created_at" or something like
that, but that's a minor detail. Or maybe stats_reset, which is what we
use in pgstat?

But is the second timestamp for the min/max fields really useful? AFAIK
to perform analysis, people take regular pg_stat_statements snapshots,
which works fine for counters (calculating deltas) but not for gauges
(which need a reset, to track fresh values). But people analyzing this
are already resetting the whole entry, and so the snapshots already are
tracking deltas.

So I'm not convinced we actually need the second timestamp.

A couple more comments:

1) I'm not sure why the patch is adding tests of permissions on the
pg_stat_statements_reset function?

2) If we want the second timestamp, shouldn't it also cover resets of
the mean values, not just min/max?

3) I don't understand why the patch is adding "IS NOT NULL AS t" to
various places in the regression tests.

4) I rather dislike the "minmax" naming, because that's often used in
other contexts (for BRIN indexes), and as I mentioned maybe it should
also cover the "mean" fields.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Robert Haas
On Tue, Jan 17, 2023 at 7:15 PM Nathan Bossart  wrote:
> Great.  Here is a first attempt at the patch.

In general, looks good. I think this will often call HaveNFreeProcs
twice, though, and that would be better to avoid, e.g.

if (!am_superuser && !am_walsender &&
    (SuperuserReservedBackends + ReservedBackends) > 0 &&
    !HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends))
{
    if (!HaveNFreeProcs(SuperuserReservedBackends))
        remaining connection slots are reserved for non-replication
            superuser connections;
    if (!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS))
        remaining connection slots are reserved for roles with
            privileges of pg_use_reserved_backends;
}

In the common case where we hit neither limit, this only counts free
connection slots once. We could do even better by making
HaveNFreeProcs have an out parameter for the number of free procs
actually found when it returns false, but that's probably not
important.
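
For illustration, the out-parameter variant could be used roughly like this
(a sketch only: names follow the pseudocode above, HaveNFreeProcs() does not
actually take an out parameter today, and the error texts are placeholders):

int         nfree;

if (!am_superuser && !am_walsender &&
    (SuperuserReservedBackends + ReservedBackends) > 0 &&
    !HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends, &nfree))
{
    /* nfree was filled in by the single scan above, so no second scan. */
    if (nfree < SuperuserReservedBackends)
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("remaining connection slots are reserved for "
                        "non-replication superuser connections")));

    if (!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS))
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("remaining connection slots are reserved for "
                        "roles with privileges of pg_use_reserved_backends")));
}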

I don't think we should default both the existing GUC and the new one
to 3, because that raises the default number of reserved slots from 3
to 6 even when the new feature is not used. I think we should default
one of them to 0 and the other one to 3; I'm not sure which one should
get which value.
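
Concretely, the guc_tables.c entries might then end up looking something
like this (purely illustrative; the new GUC's name and description, the
variable names from the pseudocode above, and which of the two keeps the
default of 3 are all assumptions):

{
    {"superuser_reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
        gettext_noop("Sets the number of connection slots reserved for superusers."),
        NULL
    },
    &SuperuserReservedBackends,
    3, 0, MAX_BACKENDS,             /* keeps the existing default of 3 */
    NULL, NULL, NULL
},
{
    {"reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
        gettext_noop("Sets the number of connection slots reserved for roles "
                     "with privileges of pg_use_reserved_connections."),
        NULL
    },
    &ReservedBackends,
    0, 0, MAX_BACKENDS,             /* new GUC defaults to 0 */
    NULL, NULL, NULL
},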

> > I think the documentation will need some careful wordsmithing,
> > including adjustments to superuser_reserved_connections. We want to
> > recast superuser_reserved_connections as a final reserve to be touched
> > after even reserved_connections has been exhausted.
>
> I tried to do this, but there is probably still room for improvement,
> especially for the parts that discuss the relationship between
> max_connections, superuser_reserved_connections, and reserved_connections.

I think it's pretty good the way you have it. I agree that there might
be a way to make it even better, but I don't think I know what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Improve GetConfigOptionValues function

2023-01-18 Thread Tom Lane
Nitin Jadhav  writes:
> GetConfigOptionValues function extracts the config parameters for the
> given variable irrespective of whether it results in noshow or not.
> But the parent function show_all_settings ignores the values parameter
> if it results in noshow. It's unnecessary to fetch all the values
> during noshow. So a return statement in GetConfigOptionValues() when
> noshow is set to true is needed. Attached the patch for the same.
> Please share your thoughts.

I do not think this is an improvement: it causes GetConfigOptionValues
to be making assumptions about how its results will be used.  If
show_all_settings() were a big performance bottleneck, and there were
a lot of no-show values that we could optimize, then maybe the extra
coupling would be worthwhile.  But I don't believe either of those
things.

Possibly a better answer is to refactor into separate functions,
along the lines of

static bool
ConfigOptionIsShowable(struct config_generic *conf)

static void
GetConfigOptionValues(struct config_generic *conf, const char **values)
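
For instance, a minimal sketch of that split (assuming the showability test
keeps using the GUC_NO_SHOW_ALL and GUC_SUPERUSER_ONLY flags plus the
pg_read_all_settings check; not meant as the final code):

static bool
ConfigOptionIsShowable(struct config_generic *conf)
{
    /* Entries flagged as no-show never appear in pg_settings output. */
    if (conf->flags & GUC_NO_SHOW_ALL)
        return false;

    /* Superuser-only entries are visible only to privileged roles. */
    if ((conf->flags & GUC_SUPERUSER_ONLY) &&
        !has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_SETTINGS))
        return false;

    return true;
}

/* ... and show_all_settings() would then do, roughly: */
if (!ConfigOptionIsShowable(conf))
    continue;           /* skip the entry without fetching its values */
GetConfigOptionValues(conf, values);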

regards, tom lane




Re: ANY_VALUE aggregate

2023-01-18 Thread Vik Fearing

On 1/18/23 16:55, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 05.12.22 21:18, Vik Fearing wrote:
>>> On 12/5/22 15:57, Vik Fearing wrote:
>>>> The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It
>>>> returns an implementation-dependent (i.e. non-deterministic) value
>>>> from the rows in its group.
>>
>> Since the transition function is declared strict, null values don't need
>> to be checked.
>
> Hmm, but should it be strict?  That means that what it's returning
> is *not* "any value" but "any non-null value".  What does the draft
> spec have to say about that?

It falls into the same category as AVG() etc.  That is, nulls are
removed before calculation.
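
To make the strictness point concrete, a transition function along these
lines is all that is needed (a sketch only, not necessarily the exact code
in the patch):

/*
 * Sketch of a strict transition function for any_value().  Because the
 * function is strict and the initial state is null, the aggregate
 * machinery skips null inputs and installs the first non-null input as
 * the state without calling us; on every later row we simply return the
 * existing state (the first argument), so the first non-null value wins.
 */
Datum
any_value_transfn(PG_FUNCTION_ARGS)
{
    PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}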

--
Vik Fearing





Re: ANY_VALUE aggregate

2023-01-18 Thread Tom Lane
Peter Eisentraut  writes:
> On 05.12.22 21:18, Vik Fearing wrote:
>> On 12/5/22 15:57, Vik Fearing wrote:
>>> The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It 
>>> returns an implementation-dependent (i.e. non-deterministic) value 
>>> from the rows in its group.

> Since the transition function is declared strict, null values don't need 
> to be checked.

Hmm, but should it be strict?  That means that what it's returning
is *not* "any value" but "any non-null value".  What does the draft
spec have to say about that?

regards, tom lane




Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread David G. Johnston
On Wed, Jan 18, 2023 at 8:38 AM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > ...  I was going for the html effect
> > of having these views chunked into their own pages, any other changes
> > being non-detrimental.
>
> But is that a result we want?  It will for example break any bookmarks
> that people might have for these documentation entries.  It will also
> pretty thoroughly break the cross-version navigation links in this
> part of the docs.
>
> Maybe the benefit is worth those costs, but I'm entirely not convinced
> of that.  I think we need to tread pretty lightly when rearranging
> longstanding documentation-layout decisions.

Fair points.

The external linking can be solved with redirect rules, as I believe we've
done before, and fairly recently.  Even if not, I think that once people see
why the break happened they will be happy with the improved user experience.

I do think the change is important enough to warrant breaking the
cross-version navigation links.  I can imagine a linking scheme that would
still work, but I'm doubtful it is important enough to justify the
development effort.

David J.

