Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-17 Thread Drouvot, Bertrand

Hi,

On 1/17/23 12:23 PM, Alvaro Herrera wrote:

On 2023-Jan-17, Drouvot, Bertrand wrote:


The idea has been raised in [1], where we are adding more calls to
wait_for_catchup() in 'replay' mode.


This seems mostly useless as presented.  Maybe if you're able to reduce
the noise on the second argument it would be worth something -- namely,
if the wrapper function receives a node instead of an LSN: perhaps
wait_for_replay_catchup() would use the flush LSN from the given node,
wait_for_write_catchup() would use the write LSN, and
wait_for_sent_catchup() would use the insert LSN.  (I didn't check in
your patch if there are callsites that do something else).  This would
in several cases let you also remove the line with the assignment of
appropriate LSN to a separate variable.  If you did it that way, maybe
the code would become a tiny bit smaller overall.



Thanks for looking at it!

The current calls are done that way:

wait_for_replay_catchup called:
- 8 times with write LSN as an argument
- 1 time with insert LSN as an argument
- 16 times with flush LSN as an argument

wait_for_write_catchup called:
- 5 times with write LSN as an argument

So it looks like providing a node as the second argument
would not help for the wait_for_replay_catchup() case.

Is it worth using the node as an argument for wait_for_write_catchup()? (Though it
would be odd to have different argument types between wait_for_replay_catchup()
and wait_for_write_catchup().)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve GetConfigOptionValues function

2023-01-17 Thread Nitin Jadhav
Attaching the patch.

On Wed, Jan 18, 2023 at 1:21 PM Nitin Jadhav
 wrote:
>
> Hi,
>
> The GetConfigOptionValues function extracts the config values for the
> given variable regardless of whether it ends up marked noshow.
> But the parent function show_all_settings ignores the values parameter
> when noshow is set, so fetching all the values in that case is wasted
> work. GetConfigOptionValues() should therefore return early when noshow
> is true. I have attached a patch doing that.
> Please share your thoughts.
>
> Thanks & Regards,
> Nitin Jadhav


v1-0001-Improve-GetConfigOptionValues-function.patch
Description: Binary data


Improve GetConfigOptionValues function

2023-01-17 Thread Nitin Jadhav
Hi,

The GetConfigOptionValues function extracts the config values for the
given variable regardless of whether it ends up marked noshow.
But the parent function show_all_settings ignores the values parameter
when noshow is set, so fetching all the values in that case is wasted
work. GetConfigOptionValues() should therefore return early when noshow
is true. I have attached a patch doing that.
Please share your thoughts.
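
To illustrate the idea, here is a minimal sketch (not the attached patch
itself; the code that actually fills the values array is elided, and the
flags and helper functions are the existing ones in guc.c):

static void
GetConfigOptionValues(struct config_generic *conf, const char **values,
					  bool *noshow)
{
	if (noshow)
	{
		if ((conf->flags & GUC_NO_SHOW_ALL) ||
			((conf->flags & GUC_SUPERUSER_ONLY) &&
			 !has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_SETTINGS)))
		{
			*noshow = true;
			return;		/* no point in building values the caller ignores */
		}
		*noshow = false;
	}

	/* ... fill values[] from conf as today ... */
}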

Thanks & Regards,
Nitin Jadhav




Deduplicate logicalrep_read_tuple()

2023-01-17 Thread Bharath Rupireddy
Hi,

logicalrep_read_tuple() duplicates code for LOGICALREP_COLUMN_TEXT and
LOGICALREP_COLUMN_BINARY introduced by commit 9de77b5. While the
duplication doesn't hurt anyone, removing it makes the code a bit
leaner, by 57 bytes [1]. I've attached a patch for $SUBJECT.

Thoughts?

[1] size ./src/backend/replication/logical/proto.o
PATCHED:
   text    data     bss     dec     hex filename
  15558       0       0   15558    3cc6 ./src/backend/replication/logical/proto.o

HEAD:
   text    data     bss     dec     hex filename
  15615       0       0   15615    3cff ./src/backend/replication/logical/proto.o
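
To visualize the change: the patch simply lets the LOGICALREP_COLUMN_TEXT
case fall through to the LOGICALREP_COLUMN_BINARY one, so the resulting
switch in logicalrep_read_tuple() looks roughly like this (a sketch, not
the exact hunk):

			case LOGICALREP_COLUMN_TEXT:
			case LOGICALREP_COLUMN_BINARY:
				len = pq_getmsgint(in, 4);	/* read length */

				/* and data */
				value->data = palloc(len + 1);
				pq_copymsgbytes(in, value->data, len);
				/* not strictly necessary for binary data, but harmless */
				value->data[len] = '\0';
				/* make StringInfo fully valid */
				value->len = len;
				value->cursor = 0;
				value->maxlen = len;
				break;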

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 9c645a11a56b3729b4f9184e6bfb7a6fcf834692 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 18 Jan 2023 05:14:47 +
Subject: [PATCH v1] Deduplicate logicalrep_read_tuple()

---
 src/backend/replication/logical/proto.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3a9d53a61e..86e0750fde 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -895,17 +895,6 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 /* we don't receive the value of an unchanged column */
 break;
 			case LOGICALREP_COLUMN_TEXT:
-len = pq_getmsgint(in, 4);	/* read length */
-
-/* and data */
-value->data = palloc(len + 1);
-pq_copymsgbytes(in, value->data, len);
-value->data[len] = '\0';
-/* make StringInfo fully valid */
-value->len = len;
-value->cursor = 0;
-value->maxlen = len;
-break;
 			case LOGICALREP_COLUMN_BINARY:
 len = pq_getmsgint(in, 4);	/* read length */
 
-- 
2.34.1



Re: Allow logical replication to copy tables in binary format

2023-01-17 Thread Bharath Rupireddy
On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu  wrote:
>
> Hi,
>
> Thanks for your reviews.

Thanks. I have some comments:

1. The performance numbers posted upthread [1] look impressive for the
use-case tried, that's a 2.25X improvement or 55.6% reduction in
execution times. However, it'll be great to run a few more varied
tests to confirm the benefit.

2. It'll be great to capture the perf report to see if the time spent
in copy_table() is reduced with the patch.

3. I think blending the initial table sync's binary copy option with
data-type level binary send/receive is not good. Moreover, data-type
level binary send/receive has its own restrictions per 9de77b5453.
IMO, it'll be good to have a new option, say copy_data_format, synonymous
with the COPY command's FORMAT option but accepting only text (default)
and binary values.

4. Why add tests to the existing 002_types.pl? Can we add a new file
with all the data types covered?

5. It's not clear to me why these rows were removed in the patch:
-(1, '{1, 2, 3}'),
-(2, '{2, 3, 1}'),
 (3, '{3, 2, 1}'),
 (4, '{4, 3, 2}'),
 (5, '{5, NULL, 3}');

 -- test_tbl_arrays
 INSERT INTO tst_arrays (a, b, c, d) VALUES
-('{1, 2, 3}', '{"a", "b", "c"}', '{1.1, 2.2, 3.3}', '{"1
day", "2 days", "3 days"}'),
-('{2, 3, 1}', '{"b", "c", "a"}', '{2.2, 3.3, 1.1}', '{"2
minutes", "3 minutes", "1 minute"}'),

6. BTW, the subbinary description is missing in pg_subscription docs
https://www.postgresql.org/docs/devel/catalog-pg-subscription.html?
-  Specifies whether the subscription will request the publisher to
-  send the data in binary format (as opposed to text).
+  Specifies whether the subscription will copy the initial data to
+  synchronize relations in binary format and also request the publisher
+  to send the data in binary format too (as opposed to text).

7. A nitpick - space is needed after >= before 16.
+if (walrcv_server_version(LogRepWorkerWalRcvConn) >=16 &&

8. Note that the COPY binary format isn't portable across platforms
(Windows to Linux, for instance) or major versions
(https://www.postgresql.org/docs/devel/sql-copy.html), whereas logical
replication is (https://www.postgresql.org/docs/devel/logical-replication.html).
I don't see any handling of such cases in copy_table, only a check for
the publisher version. I think we need to account for all the cases -
allow binary COPY only when publisher and subscriber are of the same
version, architecture, and platform. The trick here is how we identify
whether the subscriber is of the same type and taste (architecture and
platform) as the publisher. Looking at the PG_VERSION_STR of publisher
and subscriber might be naive, but I'm not sure if there's a better way
to do it.

Also, the COPY binary format is very data type specific - per the docs,
"for example it will not work to output binary data from a smallint
column and read it into an integer column, even though that would work
fine in text format". I did a small experiment [2]: logical replication
works with compatible data types (int -> smallint, even int -> text),
whereas the COPY binary format doesn't.

I think it'll complicate things a bit to account for the above cases
and allow COPY with binary format for logical replication.

[1] 
https://www.postgresql.org/message-id/CAGPVpCQEKDVKQPf6OFQ-9WiRYB1YRejm--YJTuwgzuvj1LEJ2A%40mail.gmail.com
[2]
DROP TABLE foo;
DROP PUBLICATION mypub;
CREATE TABLE foo(c1 bigint, c2 int, c3 smallint);
INSERT INTO foo SELECT i , i+1, i+2 FROM generate_series(1, 5) i;
CREATE PUBLICATION mypub FOR TABLE foo;
SELECT COUNT(*) FROM foo;
SELECT * FROM foo;

DROP SUBSCRIPTION mysub;
DROP TABLE foo;
CREATE TABLE foo(c1 smallint, c2 smallint, c3 smallint); -- works
without any problem
-- OR
CREATE TABLE foo(c1 smallint, c2 text, c3 smallint); -- works without
any problem
CREATE SUBSCRIPTION mysub CONNECTION 'port=5432 dbname=postgres
user=ubuntu' PUBLICATION mypub;
SELECT COUNT(*) FROM foo;
SELECT * FROM foo;

drop table foo;
create table foo(c1 bigint, c2 int, c3 smallint);
insert into foo select i, i+1, i+2 from generate_series(1, 10) i;
copy foo(c1, c2, c3) to '/home/ubuntu/postgres/inst/bin/data/foo.text'
with (format 'text');
copy foo(c1, c2, c3) to
'/home/ubuntu/postgres/inst/bin/data/foo.binary' with (format
'binary');
drop table bar;
create table bar(c1 smallint, c2 smallint, c3 smallint);
-- or
create table bar(c1 smallint, c2 text, c3 smallint);
copy bar(c1, c2, c3) from
'/home/ubuntu/postgres/inst/bin/data/foo.text' with (format 'text');
copy bar(c1, c2, c3) from
'/home/ubuntu/postgres/inst/bin/data/foo.binary' with (format
'binary'); -- produces "ERROR:  incorrect binary data format"

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Modify the document of Logical Replication configuration settings

2023-01-17 Thread Michael Paquier
On Wed, Jan 18, 2023 at 07:00:43AM +, Takamichi Osumi (Fujitsu) wrote:
> The attached patch includes the following two changes to the description of
> Logical Replication "Configuration Settings".
> 
> 1. Add a brief description of wal_sender_timeout.
>    I made it similar to the corresponding sentence for the subscriber.
> 2. Fix a wrong GUC name, "wal_receiver_retry_interval".
>    This GUC doesn't exist and was presumably meant to be
>    "wal_retrieve_retry_interval".
> 
> Kindly have a look at it.

Looks right to me, thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Peter Smith
 Here are my review comments for the latest patch v16-0001. (excluding
the test code)

==

General

1.

Since the value of min_apply_delay cannot be < 0, I was thinking it
probably should have been declared everywhere in this patch as a
uint64 instead of an int64, right?

==

Commit message

2.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction commit for min_apply_delay
milliseconds.

~

IMO there should be another sentence before this just to say that a
new parameter is being added:

e.g.
This patch implements a new subscription parameter called 'min_apply_delay'.

==

doc/src/sgml/config.sgml

3.

+  
+   For time-delayed logical replication, the apply worker sends a Standby
+   Status Update message to the corresponding publisher per the indicated
+   time of this parameter. Therefore, if this parameter is longer than
+   wal_sender_timeout on the publisher, then the
+   walsender doesn't get any update message during the delay and repeatedly
+   terminates due to the timeout errors. Hence, make sure this parameter is
+   shorter than the wal_sender_timeout of the publisher.
+   If this parameter is set to zero with time-delayed replication, the
+   apply worker doesn't send any feedback messages during the
+   min_apply_delay.
+  


This paragraph seemed confusing. I think it needs to be reworded to
change all of the "this parameter" references because there are at
least 3 different parameters mentioned in this paragraph. e.g. maybe
just change them to explicitly name the parameter you are talking
about.

I also think it needs to mention the ‘min_apply_delay’ subscription
parameter up-front and then refer to it appropriately.

The end result might be something like I wrote below (this is just my
guess – probably you can word it better).

SUGGESTION
For time-delayed logical replication (i.e. when the subscription is
created with parameter min_apply_delay > 0), the apply worker sends a
Standby Status Update message to the publisher with a period of
wal_receiver_status_interval . Make sure to set
wal_receiver_status_interval less than the wal_sender_timeout on the
publisher, otherwise, the walsender will repeatedly terminate due to
the timeout errors. If wal_receiver_status_interval is set to zero,
the apply worker doesn't send any feedback messages during the
subscriber’s min_apply_delay period.

==

doc/src/sgml/ref/create_subscription.sgml

4.

+ 
+  By default, the subscriber applies changes as soon as possible. As
+  with the physical replication feature
+  (), it can be useful to
+  have a time-delayed logical replica. This parameter lets the user to
+  delay the application of changes by a specified amount of
time. If this
+  value is specified without units, it is taken as milliseconds. The
+  default is zero(no delay).
+ 

4a.
As with the physical replication feature (recovery_min_apply_delay),
it can be useful to have a time-delayed logical replica.

IMO not sure that the above sentence is necessary. It seems only to be
saying that this parameter can be useful. Why do we need to say that?

~

4b.
"This parameter lets the user to delay" -> "This parameter lets the user delay"
OR
"This parameter lets the user to delay" -> "This parameter allows the
user to delay"

~

4c.
"If this value is specified without units" -> "If the value is
specified without units"

~

4d.
"zero(no delay)." -> "zero (no delay)."



5.

+ 
+  The delay occurs only on WAL records for transaction begins and after
+  the initial table synchronization. It is possible that the
+  replication delay between publisher and subscriber exceeds the value
+  of this parameter, in which case no delay is added. Note that the
+  delay is calculated between the WAL time stamp as written on
+  publisher and the current time on the subscriber. Time
spent in logical
+  decoding and in transferring the transaction may reduce the
actual wait
+  time. If the system clocks on publisher and subscriber are not
+  synchronized, this may lead to apply changes earlier than expected,
+  but this is not a major issue because this parameter is
typically much
+  larger than the time deviations between servers. Note that if this
+  parameter is set to a long delay, the replication will stop if the
+  replication slot falls behind the current LSN by more than
+  max_slot_wal_keep_size.
+ 

I think the first part can be reworded slightly. See what you think
about the suggestion below.

SUGGESTION
Any delay occurs only on WAL records for transaction begins after all
initial table synchronization has finished.  The delay is calculated
between the WAL timestamp as written on the publisher and the current
time on the subscriber. Any 

Re: Generating code for query jumbling through gen_node_support.pl

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote:
> On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:
>> Ok, I understand now, and I agree with this approach over the opposite. I
>> was confused because the snippet you showed above used "jumble_ignore", but
>> your patch is correct as it uses "jumble_location".
> 
> Okay.  I'll refresh the patch set so as we have only "jumble_ignore",
> then, like v1, with preparatory patches for what you mentioned and
> anything that comes into mind.

This is done as of the patch series v3 attached:
- 0001 reformats all the comments of the nodes.
- 0002 moves the current files for query jumble as of queryjumble.c ->
queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h.
- 0003 is the core feature, where I have done a second pass over the
nodes to make sure that things map with HEAD, incorporating the extra
docs coming from v2, adding a bit more.

>> That said, the term "jumble" is really weird, because in the sense that we
>> are using it here it means, approximately, "to mix together", "to unify".
>> So what we are doing with the Const nodes is really to *not* jumble the
>> location, but for all other node types we are jumbling the location.  At
>> least that is my understanding.
> 
> I am quite familiar with this term, FWIW.  That's what we've inherited
> from the days where this has been introduced in pg_stat_statements.

I have renamed the node attributes to query_jumble_ignore and
no_query_jumble at the end, after considering Peter's point that only
"jumble" could be fuzzy here.  The file names are changed in
consequence.

While doing all that, I have done some micro-benchmarking of
JumbleQuery(), making it loop 50M times on my laptop each time a query
ID is computed (hideous hack with a loop in queryjumble.c):
- For non-utility queries, aka queries that go through
JumbleQueryInternal(), I am measuring a repeatable ~10% improvement
with the generated code over HEAD, which is kind of nice.  I have
tested a few DMLs and simple SELECTs, still it looks like a trend.
- For utility queries, the new code is competing against
hash_any_extended() on the query string, which is going to be hard to
beat.  I am measuring at least a 5x slowdown, and more depending on the
depth of the query tree.  That's not surprising.  On a 50M loop, this
comes down to comparing a computation of 100ns to 5ns, a 20x slowdown
in one example.  Still, this could justify the addition of a GUC to
control whether utility queries have their query ID computed from
their nodes or from their string hash, as this could become noticeable
in OLTP workloads with lots of short statements and BEGIN/COMMIT
queries?
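
To make that idea a bit more concrete, a very rough sketch of the switch
(the GUC name is invented for illustration, the node-based branch is
elided, and hash_any_extended() is what the existing string-hash path
already relies on):

	/* hypothetical GUC, name invented for illustration */
	bool		query_id_jumble_utility = true;

	if (query->utilityStmt && !query_id_jumble_utility)
	{
		/* cheap path: hash the statement text, close to what HEAD does today */
		query->queryId =
			DatumGetUInt64(hash_any_extended((const unsigned char *) querytext,
											 (int) strlen(querytext), 0));
	}
	else
	{
		/* node-based jumbling with the generated per-node functions, as in 0003 */
	}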

Thoughts or comments?
--
Michael
From 43a5bdffdb9dbe3ecb40e9bf8a5f0e9f12ceff32 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 18 Jan 2023 14:26:12 +0900
Subject: [PATCH v3 1/3] Rework format of comments for nodes

This is similar to 835d476, except that this one is to add node
properties related to query jumbling.
---
 src/include/nodes/parsenodes.h | 261 --
 src/include/nodes/plannodes.h  |   3 +-
 src/include/nodes/primnodes.h  | 469 ++---
 3 files changed, 487 insertions(+), 246 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index cfeca96d53..eec51e3ee2 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -123,7 +123,8 @@ typedef struct Query
 
 	CmdType		commandType;	/* select|insert|update|delete|merge|utility */
 
-	QuerySource querySource;	/* where did I come from? */
+	/* where did I come from? */
+	QuerySource querySource;
 
 	/*
 	 * query identifier (can be set by plugins); ignored for equal, as it
@@ -131,39 +132,58 @@ typedef struct Query
 	 */
 	uint64		queryId pg_node_attr(equal_ignore, read_write_ignore, read_as(0));
 
-	bool		canSetTag;		/* do I set the command result tag? */
+	/* do I set the command result tag? */
+	bool		canSetTag;
 
 	Node	   *utilityStmt;	/* non-null if commandType == CMD_UTILITY */
 
-	int			resultRelation; /* rtable index of target relation for
- * INSERT/UPDATE/DELETE/MERGE; 0 for SELECT */
+	/*
+	 * rtable index of target relation for INSERT/UPDATE/DELETE/MERGE; 0 for
+	 * SELECT.
+	 */
+	int			resultRelation;
 
-	bool		hasAggs;		/* has aggregates in tlist or havingQual */
-	bool		hasWindowFuncs; /* has window functions in tlist */
-	bool		hasTargetSRFs;	/* has set-returning functions in tlist */
-	bool		hasSubLinks;	/* has subquery SubLink */
-	bool		hasDistinctOn;	/* distinctClause is from DISTINCT ON */
-	bool		hasRecursive;	/* WITH RECURSIVE was specified */
-	bool		hasModifyingCTE;	/* has INSERT/UPDATE/DELETE in WITH */
-	bool		hasForUpdate;	/* FOR [KEY] UPDATE/SHARE was specified */
-	bool		hasRowSecurity; /* rewriter has applied some RLS policy */
-
-	bool		isReturn;		/* is a RETURN statement */
+	/* has aggregates in tlist or havingQual */
+	

Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-01-17 Thread Nitin Jadhav
> We would miss the names of the parameters that are marked as NO_SHOW,
> missing from pg_settings, making debugging harder.
>
> This would make the test more costly by forcing one SQL for each
> GUC..

I agree.


> We could extend pg_show_all_settings() with a boolean parameter to
> enforce listing all the parameters, even the ones that are marked as
> NOSHOW, but this does not count on GetConfigOptionValues() that could
> force a parameter to become noshow on a superuser-only GUC depending
> on the role that's running the function.  At the end, we'd better rely
> on a separate superuser-only function to do this job, aka option 1.

I did not completely follow that. To understand it better, I considered
adding a boolean parameter to pg_show_all_settings() and modifying
GetConfigOptionValues() as below [1]. When we call
pg_show_all_settings(false), it behaves like the existing behaviour
(with and without superuser privileges). When we call
pg_show_all_settings(true) with superuser privileges, it returns all
parameters, including GUC_NO_SHOW_ALL as well as GUC_SUPERUSER_ONLY
ones. If we call pg_show_all_settings(true) without superuser
privileges, then it returns all parameters except GUC_NO_SHOW_ALL and
GUC_SUPERUSER_ONLY ones. Can't we do it this way? Please share your
thoughts.


> How much do we need to care with the duplication this would involve
> with show_all_settings(), actually?  If you don't use the SRF macros,
> the code would just be a couple of lines with InitMaterializedSRF()
> doing a loop on GetConfigOptionValues().  Even if that means listing
> twice the parameters in pg_proc.dat, the chances of adding new
> parameters in pg_settings is rather low so that would be a one-time
>  change?

How about just fetching the parameter name instead of all of its
details? That would meet our objective while keeping the code
duplication under control.

[1]:
static void
GetConfigOptionValues(struct config_generic *conf, const char **values,
					  bool *noshow, bool is_show_all)
{
	char		buffer[256];

	if (noshow)
	{
		if (((conf->flags & GUC_NO_SHOW_ALL) && !is_show_all) ||
			((conf->flags & GUC_NO_SHOW_ALL) &&
			 !has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_SETTINGS)) ||
			((conf->flags & GUC_SUPERUSER_ONLY) &&
			 !has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_SETTINGS)))
			*noshow = true;
		else
			*noshow = false;
	}

	/* ... rest of the function (filling values[]) unchanged ... */
}

On Mon, Jan 16, 2023 at 7:58 AM Michael Paquier  wrote:
>
> On Sat, Jan 14, 2023 at 07:10:55PM +0530, Nitin Jadhav wrote:
> > Option-1 is, expose a function like pg_settings_get_no_show_all()
> > which just returns the parameters which are just listed as
> > GUC_NO_SHOW_ALL (Not in combination with NOT_IN_SAMPLE). We can then
> > use this function in the test file and verify whether there are config
> > entries for these.
> >
> > Option-2 is, if exposing new function and that too to expose
> > parameters which are listed as GUC_NO_SHOW_ALL is not recommended,
> > then how about exposing a function like pg_settings_get_count() which
> > returns the count of all parameters including GUC_NO_SHOW_ALL. We can
> > then use this number to verify whether these many are present in the
> > sample config file. But we cannot show the name of the parameters if
> > it is not matching. We can just display an error saying "Parameter
> > with GUC_NO_SHOW_ALL is missing from postgresql.conf.sample".
>
> We would miss the names of the parameters that are marked as NO_SHOW,
> missing from pg_settings, making debugging harder.
>
> > Option-3 is, if exposing both of the above functions is not
> > recommended, then we can use the existing function
> > pg_settings_get_flags() for each of the parameters while reading the
> > sample config file in 003_check_guc.pl. This validates the
> > GUC_NO_SHOW_ALL parameter if that is present in the sample config
> > file. It does not validate if it is present in guc.c and missing in
> > the sample config file.
>
> This would make the test more costly by forcing one SQL for each
> GUC..
>
> > Option-4 is, how about manually adding the parameter name to
> > 'all_params_array' in 003_check_guc.pl whenever we add such GUCs.
> >
> > I am not able to choose any of the above options as each has some
> > disadvantages but if no other options exist, then I would like to go
> > with option-3 as it validates more than the one currently doing.
> > Please share if any other better ideas.
>
> We could extend pg_show_all_settings() with a boolean parameter to
> enforce listing all the parameters, even the ones that are marked as
> NOSHOW, but this does not count on GetConfigOptionValues() that could
> force a parameter to become noshow on a superuser-only GUC depending
> on the role that's running the function.  At the end, we'd better rely
> on a separate superuser-only function to do this job, aka option 1.
>
> How much do we need to care with 

Modify the document of Logical Replication configuration settings

2023-01-17 Thread Takamichi Osumi (Fujitsu)
Hi, hackers


The attached patch includes the following two changes to the description of
Logical Replication "Configuration Settings".

1. Add a brief description of wal_sender_timeout.
   I made it similar to the corresponding sentence for the subscriber.
2. Fix a wrong GUC name, "wal_receiver_retry_interval".
   This GUC doesn't exist and was presumably meant to be
   "wal_retrieve_retry_interval".

Kindly have a look at it.


Best Regards,
Takamichi Osumi



v1-0001-Make-the-description-of-the-LR-configuration-sett.patch
Description:  v1-0001-Make-the-description-of-the-LR-configuration-sett.patch


Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread Amit Kapila
On Fri, Jan 13, 2023 at 11:50 AM Peter Smith  wrote:
>
> Here are some review comments for patch v79-0002.
>

So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.

> ==
>
> General
>
> 1.
>
> I saw that earlier in this thread Hou-san [1] and Amit [2] also seemed
> to say there is not much point for this patch.
>
> So I wanted to +1 that same opinion.
>
> I feel this patch just adds more complexity for almost no gain:
> - reducing the 'max_apply_workers_per_subscription' seems not very
> common in the first place.
> - even when the GUC is reduced, at that point in time all the workers
> might be in use so there may be nothing that can be immediately done.
> - IIUC the excess workers (for a reduced GUC) are going to get freed
> naturally anyway over time as more transactions are completed so the
> pool size will reduce accordingly.
>

I am still not sure if it is worth pursuing this patch because of the
above reasons. I don't think it would be difficult to add this even at
a later point in time if we really see a use case for this.
Sawada-San, IIRC, you raised this point. What do you think?

The other point I am wondering about is whether we can have a different
way to test partial serialization apart from introducing another
developer GUC (stream_serialize_threshold). One possibility could be a
subscription option (parallel_send_timeout or something like that) with
some default value (the current timeout used in the patch) which would
be used only when streaming = parallel. Users may want to wait longer
before serialization starts depending on the workload (say when
resource usage is high on the subscriber-side machine, or there are
concurrent long-running transactions that can block parallel apply for
a bit longer). I know that with this as well it may not be
straightforward to test the functionality because we can't be sure how
many changes would be required for a timeout to occur. This is just
brainstorming other options to test the partial serialization
functionality.

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-17 Thread Maciek Sakrejda
On Tue, Jan 17, 2023 at 9:22 AM Melanie Plageman
 wrote:
> On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda  wrote:
> > I missed a couple of versions, but I think the docs are clearer now.
> > I'm torn on losing some of the detail, but overall I do think it's a
> > good trade-off. Moving some details out to after the table does keep
> > the bulk of the view documentation more readable, and the "inform
> > database tuning" part is great. I really like the idea of a separate
> > Interpreting Statistics section, but for now this works.
> >
> > >+  vacuum: I/O operations performed outside of 
> > >shared
> > >+  buffers while vacuuming and analyzing permanent relations.
> >
> > Why only permanent relations? Are temporary relations treated
> > differently? I imagine if someone has a temp-table-heavy workload that
> > requires regularly vacuuming and analyzing those relations, this point
> > may be confusing without some additional explanation.
>
> Ah, yes. This is a bit confusing. We don't use buffer access strategies
> when operating on temp relations, so vacuuming them is counted in IO
> Context normal. I've added this information to the docs but now that
> definition is a bit long. Perhaps it should be a note? That seems like
> it would draw too much attention to this detail, though...

Thanks for clarifying. I think the updated definition still works:
it's still shorter than the `normal` context definition.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Takamichi Osumi (Fujitsu)
Hi,


On Wednesday, January 18, 2023 2:19 PM Amit Kapila  
wrote:
> On Wed, Jan 18, 2023 at 6:37 AM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> > >
> > > Can you please explain a bit more as asked above to understand the
> > > difference?
> > So, the current difference is that the time-delayed apply worker of
> > logical replication doesn't apply the delayed transaction on the
> > subscriber when the subscription has been disabled during the delay,
> > while (in one example of a promotion) the physical replication does the 
> > apply
> of the delayed transaction.
> >
> 
> I don't see any particular reason here to allow the transaction apply to 
> complete
> if the subscription is disabled. Note, that here we are waiting at the 
> beginning
> of the transaction and for large transactions, it might cause a significant 
> delay if
> we allow applying the xact. OTOH, if someone comes up with a valid use case
> to allow the transaction apply to get completed after the subscription is
> disabled then we can anyway do it later as well.
This makes sense. I agree with you. So, I'll keep the current behavior of
the patch.


Best Regards,
Takamichi Osumi





Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Brar Piening

On 17.01.2023 at 23:43, Karl O. Pinc wrote:

It's good you asked.  After poking about the XSLT 1.0 spec to refresh
my memory I abandoned the "mode" approach entirely.  The real "right
way" is to use the import mechanism.

I've attached a patch that "wraps" the section.heading template
and adds extra stuff to the HTML output generated by the stock
template.  (example_section_heading_override.patch)


Thanks!

I'll give it a proper look this weekend.

Regards,

Brar





Re: Removing redundant grouping columns

2023-01-17 Thread David Rowley
On Wed, 18 Jan 2023 at 11:51, Tom Lane  wrote:
> If nobody has any comments on this, I'm going to go ahead and push
> it.  The value of the improvement is rapidly paling in comparison
> to the patch's maintenance effort.

No objections from me.

David




Re: Logical replication timeout problem

2023-01-17 Thread Amit Kapila
On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
 wrote:
>
> On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  wrote:
>
> > >
> > > I am a bit worried about the indirections that the wrappers and hooks
> > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > call OutputPluginUpdateProgress.
> > >
> >
> > Yeah, I think we can do it as we are doing the previous approach but
> > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > current patch has so that we can add error context information. This
> > is similar to why we have a wrapper for all other callbacks like
> > change_cb_wrapper.
> >
>
> Ultimately OutputPluginUpdateProgress() will be called - which in turn
> will call ctx->update_progress.
>

No, update_progress_cb_wrapper() should directly call
ctx->update_progress(). The key reason to have an
update_progress_cb_wrapper() is that it allows us to add error context
information (see the usage of output_plugin_error_callback).
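
Roughly, the shape being suggested, modeled on the existing *_cb_wrapper()
functions in logical.c (the exact arguments passed to ctx->update_progress
below are a guess; the point is the error-context push/pop around the
direct call):

static void
update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
						   XLogRecPtr lsn)
{
	LogicalDecodingContext *ctx = cache->private_data;
	LogicalErrorCallbackState state;
	ErrorContextCallback errcallback;

	/* Push callback + info on the error context stack */
	state.ctx = ctx;
	state.callback_name = "update_progress";
	state.report_location = lsn;
	errcallback.callback = output_plugin_error_callback;
	errcallback.arg = (void *) &state;
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* call the plugin's callback directly, no OutputPluginUpdateProgress() */
	ctx->update_progress(ctx, lsn, txn->xid, false);

	/* Pop the error context stack */
	error_context_stack = errcallback.previous;
}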

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Amit Kapila
On Wed, Jan 18, 2023 at 6:37 AM Takamichi Osumi (Fujitsu)
 wrote:
>
> >
> > Can you please explain a bit more as asked above to understand the
> > difference?
> So, the current difference is that the time-delayed apply
> worker of logical replication doesn't apply the delayed transaction on the 
> subscriber
> when the subscription has been disabled during the delay, while (in one 
> example
> of a promotion) the physical replication does the apply of the delayed 
> transaction.
>

I don't see any particular reason here to allow the transaction apply
to complete if the subscription is disabled. Note, that here we are
waiting at the beginning of the transaction and for large
transactions, it might cause a significant delay if we allow applying
the xact. OTOH, if someone comes up with a valid use case to allow the
transaction apply to get completed after the subscription is disabled
then we can anyway do it later as well.

-- 
With Regards,
Amit Kapila.




Re: Remove source code display from \df+?

2023-01-17 Thread Pavel Stehule
Hi


út 17. 1. 2023 v 20:29 odesílatel Isaac Morland 
napsal:

> On Thu, 12 Jan 2023 at 12:06, Isaac Morland 
> wrote:
>
> Thanks everybody. So based on the latest discussion I will:
>>
>> 1) rename the column from “Source code” to “Internal name”; and
>> 2) change the contents to NULL except when the language (identified by
>> oid) is INTERNAL or C.
>>
>> Patch forthcoming, I hope.
>>
>
> I've attached a patch for this. It turns out to simplify the existing code
> in one way because the recently added call to pg_get_function_sqlbody() is
> no longer needed since it applies only to SQL functions, which will now
> display as a blank column.
>
> I implemented the change and was surprised to see that no tests failed.
> Turns out that while there are several tests for \df, there were none for
> \df+. I added a couple, just using \df+ on some functions that appear to me
> to be present on plain vanilla Postgres.
>
> I was initially concerned about translation support for the column
> heading, but it turns out that \dT+ already has a column with the exact
> same name so it appears we don’t need any new translations.
>
> I welcome comments and feedback. Now to try to find something manageable
> to review.
>

Looks good.

You missed updating the psql documentation:

https://www.postgresql.org/docs/current/app-psql.html

If the form \df+ is used, additional information about each function is
shown, including volatility, parallel safety, owner, security
classification, access privileges, language, source code and description.

You should assign your patch to the commitfest app:

https://commitfest.postgresql.org/

Regards

Pavel


RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread wangw.f...@fujitsu.com
On Wed, Jan 18, 2023 12:36 PM Amit Kapila  wrote:
> On Tue, Jan 17, 2023 at 8:07 PM Masahiko Sawada 
> wrote:
> >
> > On Tue, Jan 17, 2023 at 6:14 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada
>  wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > I'm slightly concerned that there could be overhead of executing
> > > > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > > > query workers. The number of such backends could be large and
> > > > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it
> make
> > > > sense to check (st_backendType == B_BG_WORKER) before calling
> > > > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > > > LogicalRepWorkerLock which is not likely to be contended.
> > >
> > > Thanks for the comment and I think your suggestion makes sense.
> > > I have added the check before getting the leader pid. Here is the new
> version patch.
> >
> > Thank you for updating the patch. Looks good to me.
> >
> 
> Pushed.

Rebased and attach remaining patches for reviewing.

Regards,
Wang Wei


v84-0001-Stop-extra-worker-if-GUC-was-changed.patch
Description: v84-0001-Stop-extra-worker-if-GUC-was-changed.patch


v84-0002-Add-GUC-stream_serialize_threshold-and-test-seri.patch
Description:  v84-0002-Add-GUC-stream_serialize_threshold-and-test-seri.patch


v84-0003-Retry-to-apply-streaming-xact-only-in-apply-work.patch
Description:  v84-0003-Retry-to-apply-streaming-xact-only-in-apply-work.patch


Re: recovery modules

2023-01-17 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 11:29:20AM +0900, Michael Paquier wrote:
> And 0002 is applied.

Thanks.  Here's a rebased version of the last patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c262ed4701fdb417c1af3a6730499cebb9e25030 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 9 Dec 2022 19:40:54 -0800
Subject: [PATCH v9 1/1] Allow recovery via loadable modules.

This adds the restore_library parameter to allow archive recovery
via a loadable module, rather than running shell commands.
---
 contrib/basic_archive/Makefile|   4 +-
 contrib/basic_archive/basic_archive.c |  67 ++-
 contrib/basic_archive/meson.build |   7 +-
 contrib/basic_archive/t/001_restore.pl|  44 +
 doc/src/sgml/archive-modules.sgml | 168 --
 doc/src/sgml/backup.sgml  |  43 -
 doc/src/sgml/basic-archive.sgml   |  33 ++--
 doc/src/sgml/config.sgml  |  54 +-
 doc/src/sgml/high-availability.sgml   |  23 ++-
 src/backend/access/transam/shell_restore.c|  21 ++-
 src/backend/access/transam/xlog.c |  13 +-
 src/backend/access/transam/xlogarchive.c  |  70 +++-
 src/backend/access/transam/xlogrecovery.c |  26 ++-
 src/backend/postmaster/checkpointer.c |  26 +++
 src/backend/postmaster/pgarch.c   |   7 +-
 src/backend/postmaster/startup.c  |  23 ++-
 src/backend/utils/misc/guc.c  |  14 ++
 src/backend/utils/misc/guc_tables.c   |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog_internal.h|   1 +
 src/include/access/xlogarchive.h  |  44 -
 src/include/access/xlogrecovery.h |   1 +
 src/include/utils/guc.h   |   2 +
 23 files changed, 618 insertions(+), 84 deletions(-)
 create mode 100644 contrib/basic_archive/t/001_restore.pl

diff --git a/contrib/basic_archive/Makefile b/contrib/basic_archive/Makefile
index 55d299d650..487dc563f3 100644
--- a/contrib/basic_archive/Makefile
+++ b/contrib/basic_archive/Makefile
@@ -1,7 +1,7 @@
 # contrib/basic_archive/Makefile
 
 MODULES = basic_archive
-PGFILEDESC = "basic_archive - basic archive module"
+PGFILEDESC = "basic_archive - basic archive and recovery module"
 
 REGRESS = basic_archive
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/basic_archive/basic_archive.conf
@@ -9,6 +9,8 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/basic_archive/basic_archive.c
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..225a9bd3d4 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -17,6 +17,11 @@
  * a file is successfully archived and then the system crashes before
  * a durable record of the success has been made.
  *
+ * This file also demonstrates a basic restore library implementation that
+ * is roughly equivalent to the following shell command:
+ *
+ *		cp /path/to/archivedir/%f %p
+ *
  * Copyright (c) 2022-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
@@ -30,6 +35,7 @@
 #include 
 #include 
 
+#include "access/xlogarchive.h"
 #include "common/int.h"
 #include "miscadmin.h"
 #include "postmaster/pgarch.h"
@@ -48,6 +54,8 @@ static bool basic_archive_file(const char *file, const char *path);
 static void basic_archive_file_internal(const char *file, const char *path);
 static bool check_archive_directory(char **newval, void **extra, GucSource source);
 static bool compare_files(const char *file1, const char *file2);
+static bool basic_restore_file(const char *file, const char *path,
+			   const char *lastRestartPointFileName);
 
 /*
  * _PG_init
@@ -87,6 +95,19 @@ _PG_archive_module_init(ArchiveModuleCallbacks *cb)
 	cb->archive_file_cb = basic_archive_file;
 }
 
+/*
+ * _PG_recovery_module_init
+ *
+ * Returns the module's restore callback.
+ */
+void
+_PG_recovery_module_init(RecoveryModuleCallbacks *cb)
+{
+	AssertVariableIsOfType(&_PG_recovery_module_init, RecoveryModuleInit);
+
+	cb->restore_cb = basic_restore_file;
+}
+
 /*
  * check_archive_directory
  *
@@ -99,8 +120,8 @@ check_archive_directory(char **newval, void **extra, GucSource source)
 
 	/*
 	 * The default value is an empty string, so we have to accept that value.
-	 * Our check_configured callback also checks for this and prevents
-	 * archiving from proceeding if it is still empty.
+	 * Our check_configured and restore callbacks also check for this and
+	 * prevent archiving or recovery from proceeding if it is still empty.
 	 */
 	if (*newval == NULL || *newval[0] == '\0')
 		return true;
@@ -368,3 +389,45 @@ compare_files(const char *file1, const char *file2)
 
 	

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-17 Thread John Naylor
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada 
wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
>  wrote:

> > + * Add Tids on a block to TidStore. The caller must ensure the offset
numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?

Now that I've studied it some more, I see what's happening: We need all
bits set in the "value" before we insert it, since it would be too
expensive to retrieve the current value, add one bit, and put it back.
Also, as a consequence of the encoding, part of the tid is in the key, and
part in the value. It makes more sense now, but it needs more than zero
comments.

As for the order, I don't think it's the responsibility of the caller to
guess if it needs sorting -- if unordered offsets lead to data loss, this
function needs to take care of it.
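
To spell out the constraint with a simplified example (this is not the
patch's exact encoding, and the radix-tree type and insert function named
below are placeholders): all offset bits for a block have to be collected
into the value before the single insert, because a second insert for the
same key would overwrite rather than merge:

typedef uint64 OffsetBitmap;	/* pretend every offset fits in one word */

static void
tidstore_add_block(rt_radix_tree *tree, BlockNumber blkno,
				   const OffsetNumber *offsets, int noffsets)
{
	OffsetBitmap val = 0;

	/* collect every offset bit for this block first ... */
	for (int i = 0; i < noffsets; i++)
		val |= UINT64CONST(1) << offsets[i];

	/* ... then insert exactly once per key (placeholder insert function) */
	rt_set(tree, (uint64) blkno, val);
}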

> > + uint64 last_key = PG_UINT64_MAX;
> >
> > I'm having some difficulty understanding this sentinel and how it's
used.
>
> Will improve the logic.

Part of the problem is the English language: "last" can mean "previous" or
"at the end", so maybe some name changes would help.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread Amit Kapila
On Tue, Jan 17, 2023 at 8:07 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 17, 2023 at 6:14 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada 
> >  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > I'm slightly concerned that there could be overhead of executing
> > > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > > query workers. The number of such backends could be large and
> > > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > > sense to check (st_backendType == B_BG_WORKER) before calling
> > > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > > LogicalRepWorkerLock which is not likely to be contended.
> >
> > Thanks for the comment and I think your suggestion makes sense.
> > I have added the check before getting the leader pid. Here is the new 
> > version patch.
>
> Thank you for updating the patch. Looks good to me.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: recovery modules

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 10:23:56AM -0800, Nathan Bossart wrote:
> Yeah, this seems cleaner.  I removed BuildRestoreCommand() in v8.

if (*sp == *lp)
{
-   if (val)
-   {
-   appendStringInfoString(, val);
-   found = true;
-   }
-   /* If val is NULL, we will report an error. */
+   appendStringInfoString(, val);
+   found = true;

In 0002, this code block has been removed as an effect of the removal
of BuildRestoreCommand(), because RestoreArchivedFile() needs to
handle two flags with two values.  The current design has the
advantage of warning extension developers about an unexpected
manipulation as well, so I have kept the logic in percentrepl.c
as-is.

I was wondering also if ExecuteRecoveryCommand() should use a bits32
for its two boolean flags, but did not bother as it is static in
shell_restore.c so ABI does not matter, even if there are three
callers of it with 75% of the combinations possible (only false/true
is not used).

And 0002 is applied.
--
Michael


signature.asc
Description: PGP signature


Re: Removing redundant grouping columns

2023-01-17 Thread David Rowley
On Wed, 18 Jan 2023 at 14:55, Richard Guo  wrote:
> Yeah, I noticed this too yesterday.  I reviewed through these two
> patches yesterday and I think they are in good shape now.

I'm currently reviewing the two patches.

David




Re: Removing redundant grouping columns

2023-01-17 Thread Richard Guo
On Wed, Jan 18, 2023 at 6:51 AM Tom Lane  wrote:

> I wrote:
> > Yeah, sideswiped by 3c6fc5820 apparently.  No substantive change needed.
>
> And immediately sideswiped by da5800d5f.


Yeah, I noticed this too yesterday.  I reviewed through these two
patches yesterday and I think they are in good shape now.

Thanks
Richard


Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-01-17 Thread Andrey Chudnovsky
> All of this points at a bigger question to the community: if we choose
> not to provide a flow implementation in libpq, is adding OAUTHBEARER
> worth the additional maintenance cost?

> My personal vote would be "no". I think the hook-only approach proposed
> here would ensure that only larger providers would implement it in
> practice

Flow implementations in libpq are definitely a long term plan, and I
agree that it would democratise the adoption.
In the previous posts in this conversation I outlined the ones I think
we should support.

However, I don't see why it's strictly necessary to couple those.
As long as the SASL exchange for the OAUTHBEARER mechanism is supported
by the protocol, the client side can evolve at its own pace.

At the same time, the current implementation allows clients to start
building provider-agnostic OAUTH support. By using iddawc or OAUTH
client implementations in the respective platforms.
So I wouldn't refer to "larger providers", but rather "more motivated
clients" here. Which definitely overlaps, but keeps the system open.

> I'm not understanding the concern in the final point -- providers
> generally require you to opt into device authorization, at least as far
> as I can tell. So if you decide that it's not appropriate for your use
> case... don't enable it. (And I haven't seen any claims that opting into
> device authorization weakens the other flows in any way. So if we're
> going to implement a flow in libpq, I still think device authorization
> is the best choice, since it works on headless machines as well as those
> with browsers.)
I agree with the statement that Device code is the best first choice
if we absolutely have to pick one.
Though I don't think we have to.

While device flow can be used for all kinds of user-facing
applications, it's specifically designed for input-constrained
scenarios. As clearly stated in the Abstract here -
https://www.rfc-editor.org/rfc/rfc8628
The authorization code with PKCE flow is recommended by the RFCs and
major providers for cases where it's feasible.
The long term goal is to provide both, though I don't see why the
backbone protocol implementation first wouldn't add value.

Another point: user authentication is one side of the whole story, and
the other critical one is system-to-system authentication, where we
have Client Credentials and Certificates.
The latter is much harder to implement generically, as
provider-specific tokens need to be signed.

Adding the other reasoning, I think libpq support for specific flows
can get in the further iterations, after the protocol support.

> in that case I'd rather spend cycles on generic SASL.
I see 2 approaches to generic SASL:
(a). Generic SASL is a framework used in the protocol, with the
mechanisms implemented on top and exposed to the DBAs as auth types to
configure in hba.
This is the direction we're going here, which is well aligned with the
existing hba-based auth configuration.
(b). Generic SASL exposed to developers on the server- and client-
side to extend on. That seems to be a much longer shot.
The main points of ambiguity are the libpq distribution model
(which you pointed to) and the potential pluggability of insecure
mechanisms.

I do see (a) as a sweet spot with a lot of value for various
participants with much less ambiguity.

> Additionally, the "issuer" field added here is not part of the RFC. I've
> written my thoughts about unofficial extensions upthread but haven't
> received a response, so I'm going to start being more strident: Please,
> for the sake of reviewers, call out changes you've made to the spec, and
> why they're justified.
Thanks for your feedback on this. We had this discussion as well, and
added that as a convenience for the client to identify the provider.
I don't see a reason why an issuer would be absolutely necessary, so
we take your point that sticking to the RFCs is a safer choice.

> The patches seem to be out of order now (and the documentation in the
> commit messages has been removed).
Feedback taken. Work in progress.

On Tue, Jan 17, 2023 at 2:44 PM Jacob Champion  wrote:
>
> On Sun, Jan 15, 2023 at 12:03 PM Andrey Chudnovsky
>  wrote:
> > 2. Removed Device Code implementation in libpq. Several reasons:
> >- Reduce scope and focus on the protocol first.
> >- Device code implementation uses iddawc dependency. Taking this
> > dependency is a controversial step which requires broader discussion.
> >- Device code implementation without iddaws would significantly
> > increase the scope of the patch, as libpq needs to poll the token
> > endpoint, setup different API calls, e.t.c.
> >- That flow should canonically only be used for clients which can't
> > invoke browsers. If it is the only flow to be implemented, it can be
> > used in the context when it's not expected by the OAUTH protocol.
>
> I'm not understanding the concern in the final point -- providers
> generally require you to opt into device 

Re: Backends stalled in 'startup' state

2023-01-17 Thread Ashwin Agrawal
On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal  wrote:

>
> We recently saw many backends (close to the max_connections limit) get
> stalled in 'startup' in one of the production environments for Greenplum
> (a fork of PostgreSQL). Tracing the reason, it was found that all the
> tuples created by bootstrap (xmin=1) in pg_attribute were at very high
> block numbers (for example beyond 30,000). The reason for the backend
> startup stall exactly matched Tom's reasoning in [1]. Stalls became much
> longer in the presence of sub-transaction overflow or long-running
> transactions, as tuple visibility took longer. That thread ruled out the
> possibility of the system catalog rows being present in higher block
> numbers instead of at the front of pg_attribute.
>
> This thread provides simple reproduction on the latest version of
> PostgreSQL and RCA for how bootstrap catalog entries can move to higher
> blocks and as a result cause stalls for backend starts. Simple fix to avoid
> the issue provided at the end.
>
> The cause is a synchronized scan (syncscan) triggering during VACUUM
> FULL. VACUUM FULL rewrites the table by performing a seqscan as well,
> and heapam_relation_copy_for_cluster() says it is fine to use a
> syncscan. Hence the scan may not start from block 0 but from some other
> block already in cache, which opens the possibility of moving the
> bootstrap tuples anywhere else in the table.
>
> --
> Repro
> --
> -- create database to play
> drop database if exists test;
> create database test;
> \c test
>
> -- function just to create many tables to increase pg_attribute size
> -- (ideally many column table might do the job more easily)
> CREATE OR REPLACE FUNCTION public.f(id integer)
>  RETURNS void
>  LANGUAGE plpgsql
>  STRICT
> AS $function$
> declare
>   sql text;
>   i int;
> begin
>   for i in id..id+ loop
> sql='create table if not exists tbl'||i||' (id int)';
> execute sql;
>   end loop;
> end;
> $function$;
>
> select f(1);
> select f(2);
> select f(3);
> select f(4);
>
> -- validate pg_attribute size is greater than 1/4 of shared_buffers
> -- for syncscan to trigger
> show shared_buffers;
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>
> -- perform seq scan of pg_attribute to page past bootstrapped tuples
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
>
> -- this will internally use syncscan starting with block after bootstrap
> tuples
> -- and hence move bootstrap tuples last to higher block numbers
> vacuum full pg_attribute;
>
> --
> Sample run
> --
> show shared_buffers;
>  shared_buffers
> ----------------
>  128MB
> (1 row)
>
> select pg_size_pretty(pg_relation_size('pg_attribute'));
>  pg_size_pretty
> ----------------
>  40 MB
> (1 row)
>
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>  ctid  | xmin | attrelid |   attname
> -------+------+----------+--------------
>  (0,1) |    1 |     1255 | oid
>  (0,2) |    1 |     1255 | proname
>  (0,3) |    1 |     1255 | pronamespace
>  (0,4) |    1 |     1255 | proowner
>  (0,5) |    1 |     1255 | prolang
> (5 rows)
>
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
> COPY 2000
> vacuum full pg_attribute;
> VACUUM
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>    ctid    | xmin | attrelid |   attname
> -----------+------+----------+--------------
>  (5115,14) |    1 |     1255 | oid
>  (5115,15) |    1 |     1255 | proname
>  (5115,16) |    1 |     1255 | pronamespace
>  (5115,17) |    1 |     1255 | proowner
>  (5115,18) |    1 |     1255 | prolang
> (5 rows)
>
>
> Note:
> -- used logic causing the problem to fix it as well on the system :-)
> -- scan till block where bootstrap tuples are located
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> -- now due to syncscan triggering it will pick the blocks with bootstrap
> tuples first and help to bring them back to front
> vacuum full pg_attribute;
>
> --
> Patch to avoid the problem:
> --
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index a3414a76e8..4c031914a3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap,
> Relation NewHeap,
> pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
>
>  PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
>
> -   tableScan = 

Re: CI and test improvements

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 11:56:42AM -0800, Andres Freund wrote:
> > $(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build
> 
> Assume that's the false positive?

Yes

> > I also tried but failed to write something to warn if "meson test" was
> > run with a list of tests but without tmp_install.  Help wanted.
> 
> That doesn't even catch the worst case - when there's tmp_install, but it's
> too old.

I don't understand what you mean by "too old" ?

> > I propose to put something like this into "SanityCheck".
> 
> Perhaps we instead could add it as a separate "meson-only" test? Then it'd
> fail on developer's machines, instead of later in CI. We could pass the test
> information from the 'tests' array, or it could look at the metadata in
> meson-info/intro-tests.json

I guess you mean that it should be *able* to fail on developer machines
*in addition* to cirrusci.

But, a meson-only test might not be so helpful, as it assumes that the
developer is using meson, in which case the problem would tend not to
have occurred in the first place.

BTW I also noticed that:

meson.build:meson_binpath_r = run_command(python, 'src/tools/find_meson', 
check: true)
meson.build-
meson.build-if meson_binpath_r.returncode() != 0 or meson_binpath_r.stdout() == 
''
meson.build-  error('huh, could not run find_meson.\nerrcode: @0@\nstdout: 
@1@\nstderr: @2@'.format(

The return code will never be nonzero since check==true, right ?

-- 
Justin




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Takamichi Osumi (Fujitsu)
Hi,

On Tuesday, January 17, 2023 9:54 PM Amit Kapila  
wrote:
> On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> > On Saturday, January 14, 2023 3:27 PM vignesh C 
> wrote:
> >
> > > 2) I'm not sure if this will add any extra coverage as the altering
> > > value of min_apply_delay is already tested in the regression, if so
> > > this test can be
> > > removed:
> > > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> > > +$node_subscriber->safe_psql('postgres',
> > > +   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay =
> > > 8646)"
> > > +);
> > > +
> > > +# New row to trigger apply delay.
> > > +$node_publisher->safe_psql('postgres',
> > > +   "INSERT INTO test_tab VALUES (0, 'foobar')");
> > > +
> > > +check_apply_delay_log("logical replication apply delay",
> > > +"8000");
> > While addressing this point, I've noticed that there is a behavior
> > difference between physical replication's recovery_min_apply_delay and
> > this feature when stopping the replication during delays.
> >
> > At present, in the latter case,
> > the apply worker exits without applying the suspended transaction
> > after ALTER SUBSCRIPTION DISABLE command for the subscription.
> >
> 
> In the previous paragraph, you said the behavior difference while stopping the
> replication but it is not clear from where this DISABLE command comes in that
> scenario.
Sorry for my unclear description. By "stopping the replication" I meant
disabling the subscription during the "min_apply_delay" wait time in a logical
replication setup.

I proposed and mentioned this discussion point to define
how the time-delayed apply worker should behave when there is a transaction
delayed by "min_apply_delay" parameter and additionally the user issues
ALTER SUBSCRIPTION ... DISABLE during the delay. When it comes to physical
replication, it's hard to find a perfect correspondent for LR's ALTER 
SUBSCRIPTION
DISABLE command, but I chose a scenario to promote a secondary during
"recovery_min_apply_delay" for comparison this time. After the promotion of
the secondary in the physical replication, the transaction
committed on the publisher but delayed on the secondary can be seen.
This would be because CheckForStandbyTrigger in recoveryApplyDelay returns true
and we apply the record by breaking the wait.
I checked and got the LOG message "received promote request" in the secondary 
log
when I tested this case.

> > Meanwhile, there is no "disabling" command for physical replication,
> > but I checked the behavior about what happens for promoting a
> > secondary during the delay of recovery_min_apply_delay for physical
> replication as one example.
> > The transaction has become visible even in the promoting in the middle of
> delay.
> >
> 
> What causes such a transaction to be visible after promotion? Ideally, if the
> commit doesn't succeed, the transaction shouldn't be visible.
> Do, we allow the transaction waiting due to delay to get committed on
> promotion?
The commit succeeded on the primary and then I promoted the secondary
during the transaction's "recovery_min_apply_delay" wait. The result is that
the transaction became visible on the promoted secondary.


> > I'm not sure if I should make the time-delayed LR aligned with this 
> > behavior.
> > Does someone has an opinion for this ?
> >
> 
> Can you please explain a bit more as asked above to understand the
> difference?
So, the current difference is that the time-delayed apply worker of logical
replication doesn't apply the delayed transaction on the subscriber when the
subscription has been disabled during the delay, while physical replication
(in the promotion example above) does apply the delayed transaction.



Best Regards,
Takamichi Osumi





Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Nathan Bossart
On Tue, Jan 17, 2023 at 02:59:31PM -0500, Robert Haas wrote:
> On Tue, Jan 17, 2023 at 1:42 PM Nathan Bossart  
> wrote:
>> If we create a new batch of reserved connections, only roles with
>> privileges of pg_use_reserved_connections would be able to connect if the
>> number of remaining slots is greater than superuser_reserved_connections
>> but less than or equal to superuser_reserved_connections +
>> reserved_connections.  Only superusers would be able to connect if the
>> number of remaining slots is less than or equal to
>> superuser_reserved_connections.  This helps avoid blocking new superuser
>> connections even if you've reserved some connections for non-superusers.
> 
> This is precisely what I had in mind.

Great.  Here is a first attempt at the patch.
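For readers skimming the thread, the intended slot accounting would look
roughly like this (a sketch only, not the attached patch; the predefined-role
constant and the exact boundary handling are assumptions):

/*
 * Sketch of the two-tier reservation:
 *   free slots >  superuser_reserved + reserved  -> anyone may connect
 *   free slots >  superuser_reserved             -> needs pg_use_reserved_connections
 *   free slots <= superuser_reserved             -> superusers only
 */
if (!am_superuser && !am_walsender &&
	!HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends))
{
	if (!HaveNFreeProcs(SuperuserReservedBackends + 1) ||
		!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS))
		ereport(FATAL,
				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
				 errmsg("remaining connection slots are reserved")));
}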

> I think the documentation will need some careful wordsmithing,
> including adjustments to superuser_reserved_connections. We want to
> recast superuser_reserved_connections as a final reserve to be touched
> after even reserved_connections has been exhausted.

I tried to do this, but there is probably still room for improvement,
especially for the parts that discuss the relationship between
max_connections, superuser_reserved_connections, and reserved_connections.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e153d2f22d4303d0bb5e8134391ebf1fa446172c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v1 1/2] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 #define MAXLISTEN	64
@@ -908,11 +908,11 @@ PostmasterMain(int argc, char *argv[])
 	/*
 	 * Check for invalid combinations of GUC settings.
 	 */
-	if (ReservedBackends >= MaxConnections)
+	if (SuperuserReservedBackends >= MaxConnections)
 	{
 		write_stderr("%s: superuser_reserved_connections (%d) must be less than max_connections (%d)\n",
 	 progname,
-	 ReservedBackends, MaxConnections);
+	 SuperuserReservedBackends, MaxConnections);
 		ExitPostmaster(1);
 	}
 	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..6fa696fe8d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -927,8 +927,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	 * limited by max_connections or superuser_reserved_connections.
 	 */
 	if (!am_superuser && !am_walsender &&
-		ReservedBackends > 0 &&
-		!HaveNFreeProcs(ReservedBackends))
+		SuperuserReservedBackends > 0 &&
+		!HaveNFreeProcs(SuperuserReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  errmsg("remaining connection slots are reserved for non-replication superuser connections")));
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..5aa2cda8f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2163,7 +2163,7 @@ struct config_int ConfigureNamesInt[] =
 			gettext_noop("Sets the number of connection slots reserved for superusers."),
 			NULL
 		},
-		&ReservedBackends,
+		&SuperuserReservedBackends,
 		3, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 203177e1ff..168d85a3d1 100644
--- 

Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 09:05:52AM +0900, Michael Paquier wrote:
> Thanks, done.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 08:23:49PM +0100, Peter Eisentraut wrote:
> Looks good to me.

Thanks, done.
--
Michael




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Tomas Vondra



On 1/17/23 21:59, Justin Pryzby wrote:
> On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote:
>> On 1/9/23 09:44, Ilya Gladyshev wrote:
>>> On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
 On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> ...
>
> @committers: Is it okay to add nparts_done to IndexStmt ?

 Any hint about this ?

>>
>> AFAIK fields added at the end of a struct are seen as acceptable from the
>> ABI point of view. It's not risk-free, but we did that multiple times
>> when fixing bugs, IIRC.
> 
> My question isn't whether it's okay to add a field at the end of a
> struct in general, but rather whether it's acceptable to add this field
> at the end of this struct, and not because it's unsafe to do in a minor
> release, but whether someone is going to say that it's an abuse of the
> data structure.
> 

Ah, you mean whether it's the right place for the parameter?

I don't think it is, really. IndexStmt is meant to be a description of
the CREATE INDEX statement, not something that includes info about how
it's processed. But it's the only struct we pass to the DefineIndex for
child indexes, so the only alternatives I can think of are a global
variable and the (existing) param array field.

Nevertheless, ABI compatibility is still relevant for backbranches.


>> Do we actually need the new parts_done field? I mean, we already do
>> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
>> st_progress_param array. Can't we simply read it from there? Then we
>> would not have ABI issues with the new field added to IndexStmt.
> 
> Good idea to try.
> 

OK

>> As for the earlier discussion about the "correct" behavior for leaf vs.
>> non-leaf partitions and whether to calculate partitions in advance:
>>
>> * I agree it's desirable to count partitions in advance, instead of
>> adding incrementally. The view is meant to provide "overview" of the
>> CREATE INDEX progress, and imagine you get
>>
>>   partitions_total partitions_done
>> 10   9
>>
>> so that you believe you're ~90% done. But then it jumps to the next
>> child and now you get
>>
>>   partitions_total partitions_done
>> 20  10
>>
>> which makes the view a bit useless for its primary purpose, IMHO.
> 
> To be clear, that's the current, buggy behavior, and this thread is
> about fixing it.  The proposed patches all ought to avoid that.
> 
> But the bug isn't caused by not "calculating partitions in advance".
> Rather, the issue is that currently, the "total" is overwritten while
> recursing.
> 

You're right the issue is about overwriting the total - not sure what I
was thinking about when writing this. I guess I got distracted by the
discussion about "preliminary planning" etc. Sorry for the confusion.

> That's a separate question from whether indirect partitions are counted
> or not.
> 
>> I wonder if we could improve this to track the size of partitions for
>> total/done? That'd make leaf/non-leaf distinction unnecessary, because
>> non-leaf partitions have size 0.
> 
> Maybe, but it's out of scope for this patch.
> 

+1, it was just an idea for future.


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Can we let extensions change their dumped catalog schemas?

2023-01-17 Thread Jacob Champion
On 1/12/23 11:04, Jacob Champion wrote:
> On Wed, Jan 11, 2023 at 1:03 PM Tom Lane  wrote:
>> Jacob Champion  writes:
>>> Right, I think it would have to be opt-in. Say, a new control file
>>> option dump_version or some such.
>>
>> That would require all the installed extensions to cope with this
>> the same way, which does not seem like a great assumption.
> 
> How so? Most installed extensions would not opt into a version dump,
> I'd imagine.

As a concrete example, Timescale's extension control file could look
like this:

default_version = '2.x.y'
module_pathname = '$libdir/timescaledb-2.x.y'
...
dump_version = true

which would then cause pg_dump to issue a VERSION for its CREATE
EXTENSION line. Other extensions would remain with the default
(dump_version = false), so they'd be dumped without an explicit VERSION.

(And in the case of option 3, the name of the control file option
changes -- dump_internals, maybe? -- but it still doesn't affect other
installed extensions.)

--Jacob




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas  wrote:
> On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan  wrote:
> > If you assume that there is chronic undercounting of dead tuples
> > (which I think is very common), ...
>
> Why do you think that?

For the reasons I gave about statistics, random sampling, the central
limit theorem. All that stuff. This matches the experience of Andres.
And is obviously the only explanation behind the reliance on
antiwraparound autovacuums for cleaning up bloat in larger OLTP
databases. It just fits: the dead tuples approach can sometimes be so
completely wrong that even an alternative triggering condition based
on something that is virtually unrelated to the thing we actually care
about can do much better in practice. Consistently, reliably, for a
given table/workload.

> > How many dead heap-only tuples are equivalent to one LP_DEAD item?
> > What about page-level concentrations, and the implication for
> > line-pointer bloat? I don't have a good answer to any of these
> > questions myself.
>
> Seems a bit pessimistic. If we had unlimited resources and all
> operations were infinitely fast, the optimal strategy would be to
> vacuum after every insert, update, or delete. But in reality, that
> would be prohibitively expensive, so we're making a trade-off.

To a large degree, that's my point. I don't know how to apply this
information, so having detailed information doesn't seem like the main
problem.

> If we had an oracle that could provide us with perfect information,
> we'd ask it, among other things, how much work will be required to
> vacuum right now, and how much benefit would we get out of doing so.

And then what would we do? What about costs?

Even if we were omniscient, we still wouldn't be omnipotent. We're
still subject to the laws of physics. VACUUM would still be something
that more or less works at the level of the whole table, or not at
all. So being omniscient seems kinda overrated to me. Adding more
information does not in general lead to better outcomes.

> The dead tuple count is related to the first question. It's not a
> direct, linear relationship, but it's not completely unrelated,
> either. Maybe we could refine the estimates by gathering more or
> different statistics than we do now, but ultimately it's always going
> to be a trade-off between getting the work done sooner (and thus maybe
> preventing table growth or a wraparound shutdown) and being able to do
> more work at once (and thus being more efficient). The current system
> set of counters predates HOT and the visibility map, so it's not
> surprising if it needs updating, but if you're arguing that the whole
> concept is just garbage, I think that's an overreaction.

What I'm arguing is that principally relying on any one thing is
garbage. If you have only one thing that creates pressure to VACUUM
then there can be a big impact whenever it turns out to be completely
wrong. Whereas if VACUUM can run because of (say) 3 moderate signals
taken together, then it's much less likely that we'll be completely
wrong. In general my emphasis is on avoiding disaster in all its
forms. Vacuuming somewhat early more often is perhaps suboptimal, but
far from a disaster. It's the kind of thing that we can manage.

By all means, let's make the dead tuples/dead items stuff less naive
(e.g. make it distinguish between LP_DEAD items and dead heap-only
tuples). But even then, we shouldn't continue to completely rely on it
in the way that we do right now. In other words, I'm fine with adding
more information that is more accurate as long as we don't continue to
make the mistake of not treating it kinda suspect, and certainly not
something to completely rely on if at all possible. In particular, we
need to think about both costs and benefits at all times.

-- 
Peter Geoghegan




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-17 Thread Jacob Champion
On Fri, Jan 13, 2023 at 10:44 AM Jacob Champion  wrote:
> And my thought was that the one-time
> initialization could be moved to a place that doesn't need to know the
> connection options at all, to make it easier to reason about the
> architecture. Say, next to the WSAStartup machinery.

(And after marinating on this over the weekend, it occurred to me that
keeping the per-connection option while making the PRNG global
introduces an additional hazard, because two concurrent connections
can now fight over the seed value.)
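
One way to avoid that race is to seed a single process-global PRNG exactly
once under a lock, roughly like this (a sketch only; function and variable
names are assumptions, not the patch):

#include <pthread.h>
#include <time.h>
#include "common/pg_prng.h"

static pg_prng_state load_balance_prng;
static bool load_balance_prng_seeded = false;
static pthread_mutex_t load_balance_prng_lock = PTHREAD_MUTEX_INITIALIZER;

static void
libpq_prng_init(void)
{
	pthread_mutex_lock(&load_balance_prng_lock);
	if (!load_balance_prng_seeded)
	{
		/* prefer a strong seed, fall back to the clock */
		if (!pg_prng_strong_seed(&load_balance_prng))
			pg_prng_seed(&load_balance_prng, (uint64) time(NULL));
		load_balance_prng_seeded = true;
	}
	pthread_mutex_unlock(&load_balance_prng_lock);
}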

--Jacob




Re: [PATCH] Add native windows on arm64 support

2023-01-17 Thread Andres Freund
Hi,

On 2022-12-16 10:52:23 +, Niyas Sait wrote:
> Subject: [PATCH v7] Enable postgres native build for windows-arm64 platform

>  elif host_cpu == 'arm' or host_cpu == 'aarch64'
>  
> -  prog = '''
> +  if cc.get_id() == 'msvc'
> +cdata.set('USE_ARMV8_CRC32C', false)
> +cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> +have_optimized_crc = true

I dimly recall that windows might actually require the relevant extension on
arm?


> +  else
> +prog = '''
>  #include <arm_acle.h>

I'd just make this include #ifdef _MSV_VER (or whatever it is).
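
Roughly like this (a sketch of the suggestion; the MSVC-side header name is
an assumption):

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <arm_acle.h>
#endif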


>  int main(void)
> @@ -1960,18 +1966,19 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> without -march=armv8-a+crc',
> -  args: test_c_args)
> -# Use ARM CRC Extension unconditionally
> -cdata.set('USE_ARMV8_CRC32C', 1)
> -have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> with -march=armv8-a+crc',
> -  args: test_c_args + ['-march=armv8-a+crc'])
> -# Use ARM CRC Extension, with runtime check
> -cflags_crc += '-march=armv8-a+crc'
> -cdata.set('USE_ARMV8_CRC32C', false)
> -cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -have_optimized_crc = true
> +if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> without -march=armv8-a+crc',
> +args: test_c_args)

Seems like it'd be easier to read if you don't re-indent this, but just have
the cc.get_id() == 'msvc' part of this if/else-if.


Greetings,

Andres Freund




Re: Removing redundant grouping columns

2023-01-17 Thread Tom Lane
I wrote:
> Yeah, sideswiped by 3c6fc5820 apparently.  No substantive change needed.

And immediately sideswiped by da5800d5f.

If nobody has any comments on this, I'm going to go ahead and push
it.  The value of the improvement is rapidly paling in comparison
to the patch's maintenance effort.

regards, tom lane

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 21237d18ef..473fa45bd4 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -3736,6 +3736,13 @@ appendGroupByClause(List *tlist, deparse_expr_cxt *context)
 	 */
 	Assert(!query->groupingSets);
 
+	/*
+	 * We intentionally print query->groupClause not processed_groupClause,
+	 * leaving it to the remote planner to get rid of any redundant GROUP BY
+	 * items again.  This is necessary in case processed_groupClause reduced
+	 * to empty, and in any case the redundancy situation on the remote might
+	 * be different than what we think here.
+	 */
 	foreach(lc, query->groupClause)
 	{
 		SortGroupClause *grp = (SortGroupClause *) lfirst(lc);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c0267a99d2..385b76a8cb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2864,16 +2864,13 @@ select c2 * (random() <= 1)::int as c2 from ft2 group by c2 * (random() <= 1)::i
 -- GROUP BY clause in various forms, cardinal, alias and constant expression
 explain (verbose, costs off)
 select count(c2) w, c2 x, 5 y, 7.0 z from ft1 group by 2, y, 9.0::int order by 2;
-  QUERY PLAN   

- Sort
+ QUERY PLAN 
+
+ Foreign Scan
Output: (count(c2)), c2, 5, 7.0, 9
-   Sort Key: ft1.c2
-   ->  Foreign Scan
- Output: (count(c2)), c2, 5, 7.0, 9
- Relations: Aggregate on (public.ft1)
- Remote SQL: SELECT count(c2), c2, 5, 7.0, 9 FROM "S 1"."T 1" GROUP BY 2, 3, 5
-(7 rows)
+   Relations: Aggregate on (public.ft1)
+   Remote SQL: SELECT count(c2), c2, 5, 7.0, 9 FROM "S 1"."T 1" GROUP BY 2, 3, 5 ORDER BY c2 ASC NULLS LAST
+(4 rows)
 
 select count(c2) w, c2 x, 5 y, 7.0 z from ft1 group by 2, y, 9.0::int order by 2;
   w  | x | y |  z  
@@ -2990,14 +2987,13 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: ($0), sum(ft1.c1)
-   Group Key: $0
+   Output: $0, sum(ft1.c1)
InitPlan 1 (returns $0)
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
- Output: $0, ft1.c1
+ Output: ft1.c1
  Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(8 rows)
+(7 rows)
 
 select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
  exists |  sum   
@@ -3382,14 +3378,13 @@ select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 --
  GroupAggregate
Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
-   Group Key: ft2.c2
->  Sort
- Output: c2, c1
+ Output: c1, c2
  Sort Key: ft2.c1 USING <^
  ->  Foreign Scan on public.ft2
-   Output: c2, c1
+   Output: c1, c2
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
-(9 rows)
+(8 rows)
 
 -- This should not be pushed either.
 explain (verbose, costs off)
@@ -3458,14 +3453,13 @@ select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 --
  GroupAggregate
Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
-   Group Key: ft2.c2
->  Sort
- Output: c2, c1
+ Output: c1, c2
  Sort Key: ft2.c1 USING <^
  ->  Foreign Scan on public.ft2
-   Output: c2, c1
+   Output: c1, c2
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
-(9 rows)
+(8 rows)
 
 -- Cleanup
 drop operator class my_op_class using btree;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 53f890bb48..50d23f922c 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3345,9 +3345,9 @@ estimate_path_cost_size(PlannerInfo *root,
 			}
 
 			/* Get number of grouping columns and possible number of groups */
-			numGroupCols = 

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-01-17 Thread Jacob Champion
On Sun, Jan 15, 2023 at 12:03 PM Andrey Chudnovsky
 wrote:
> 2. Removed Device Code implementation in libpq. Several reasons:
>- Reduce scope and focus on the protocol first.
>- Device code implementation uses iddawc dependency. Taking this
> dependency is a controversial step which requires broader discussion.
>- Device code implementation without iddaws would significantly
> increase the scope of the patch, as libpq needs to poll the token
> endpoint, setup different API calls, e.t.c.
>- That flow should canonically only be used for clients which can't
> invoke browsers. If it is the only flow to be implemented, it can be
> used in the context when it's not expected by the OAUTH protocol.

I'm not understanding the concern in the final point -- providers
generally require you to opt into device authorization, at least as far
as I can tell. So if you decide that it's not appropriate for your use
case... don't enable it. (And I haven't seen any claims that opting into
device authorization weakens the other flows in any way. So if we're
going to implement a flow in libpq, I still think device authorization
is the best choice, since it works on headless machines as well as those
with browsers.)

All of this points at a bigger question to the community: if we choose
not to provide a flow implementation in libpq, is adding OAUTHBEARER
worth the additional maintenance cost?

My personal vote would be "no". I think the hook-only approach proposed
here would ensure that only larger providers would implement it in
practice, and in that case I'd rather spend cycles on generic SASL.

> 3. Temporarily removed test suite. We are actively working on aligning
> the tests with the latest changes. Will add a patch with tests soon.

Okay. Case in point, the following change to the patch appears to be
invalid JSON:

> +   appendStringInfo(,
> +   "{ "
> +   "\"status\": \"invalid_token\", "
> +   "\"openid-configuration\": \"%s\","
> +   "\"scope\": \"%s\" ",
> +   "\"issuer\": \"%s\" ",
> +   "}",
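
For comparison, well-formed output of that shape would look something like
this (a sketch; "buf" and the trailing arguments are placeholders, and
whether "issuer" belongs at all is questioned below):

appendStringInfo(&buf,
				 "{ "
				 "\"status\": \"invalid_token\", "
				 "\"openid-configuration\": \"%s\", "
				 "\"scope\": \"%s\" "
				 "}",
				 openid_configuration, scope);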

Additionally, the "issuer" field added here is not part of the RFC. I've
written my thoughts about unofficial extensions upthread but haven't
received a response, so I'm going to start being more strident: Please,
for the sake of reviewers, call out changes you've made to the spec, and
why they're justified.

The patches seem to be out of order now (and the documentation in the
commit messages has been removed).

Thanks,
--Jacob




Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Karl O. Pinc
On Tue, 17 Jan 2023 19:13:38 +0100
Brar Piening  wrote:

> On 17.01.2023 at 14:12, Karl O. Pinc wrote:
> > If you're not willing to try I am willing to see if I can
> > produce an example to work from.  My XSLT is starting to
> > come back.  
> 
> I'm certainly willing to try but I'd appreciate an example in any
> case.

It's good you asked.  After poking about the XSLT 1.0 spec to refresh
my memory I abandoned the "mode" approach entirely.  The real "right
way" is to use the import mechanism.

I've attached a patch that "wraps" the section.heading template
and adds extra stuff to the HTML output generated by the stock
template.  (example_section_heading_override.patch)

There's a bug.  All that goes into the html is a comment, not
a hoverable link.  But the technique is clear.

On my system (Debian 11, bullseye) I found the URI to import
by looking at:
/usr/share/xml/docbook/stylesheet/docbook-xsl/catalog.xml
(Which is probably the right place to look.)
Ultimately, that file is findable via: /etc/xml/catalog
The "best way" on
Debian is: /usr/share/doc/docbook-xsl/README.gz
In other words, the README that comes with the style sheets.

Supposedly, the href=URLs are really URIs and will be good
no matter what/when.  The XSLT processor should know to look
at the system catalogs and read the imported style sheet
from the local file system.

It might be useful to add --nonet to the xsltproc invocation(s)
in the Makefile(s).  Just in case; to keep from retrieving
stylesheets from the net.  (If the option is not already there.
I didn't look.)

If this is the first time that PG uses the XSLT import mechanism
I imagine that "things could go wrong" depending on what sort
of system is being used to build the docs.  I'm not worried,
but it is something to note for the committer.

> My XSLT skills are mostly learning by doing combined with trial and
> error.

I think of XSLT as a functional programming language.  Recursion is
a big deal, and data directed programming can be a powerful technique
because XSLT is so good with data structures.
(https://mitp-content-server.mit.edu/books/content/sectbyfn/books_pres_0/6515/sicp.zip/full-text/book/book-Z-H-17.html#%_sec_2.4.3)

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/stylesheet-html-common.xsl b/doc/src/sgml/stylesheet-html-common.xsl
index 9df2782ce4..88e084e190 100644
--- a/doc/src/sgml/stylesheet-html-common.xsl
+++ b/doc/src/sgml/stylesheet-html-common.xsl
@@ -12,6 +12,10 @@
   all HTML output variants (chunked and single-page).
   -->
 
+
+<xsl:import href="http://cdn.docbook.org/release/xsl/current/html/sections.xsl"/>
+
 
 
 
@@ -301,4 +305,53 @@ set   toc,title
   
 
 
+
+
+  
+  
+  
+  
+  
+  
+  
+
+  
+
+
+
+
+  
+  
+
+  
+
+  #
+  
+
+  
+
+
+  id_link
+
+ #
+  
+
+
+  
+  
+
+  
+   element without id. Please add an id to make it usable
+   as stable anchor in the public HTML documentation.
+
+
+  Please add an id to the 
+  
+   element in the sgml source code.
+
+  
+
+  
+
+
 
diff --git a/doc/src/sgml/stylesheet.css b/doc/src/sgml/stylesheet.css
index 6410a47958..e4174e0613 100644
--- a/doc/src/sgml/stylesheet.css
+++ b/doc/src/sgml/stylesheet.css
@@ -163,3 +163,13 @@ acronym		{ font-style: inherit; }
 width: 75%;
   }
 }
+
+/* Links to ids of headers and definition terms */
+a.id_link {
+color: inherit;
+visibility: hidden;
+}
+
+*:hover > a.id_link {
+visibility: visible;
+}


PGDOCS - sgml linkend using single-quotes

2023-01-17 Thread Peter Smith
Hi,

I happened to notice some examples of SGML linkends that were using
single quotes instead of double quotes.

It didn't seem to be the conventional style because grepping (from
doc/src/sgml folder) showed only a tiny fraction using single quotes.

(single-quotes)
$ grep --include=*.sgml -rn . -e "linkend='" | wc -l
12

(double-quotes)
$ grep --include=*.sgml -rn . -e 'linkend="' | wc -l
5915

~~

PSA patch that makes them all use double quotes.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Replace-linkend-single-quotes-with-double-quotes.patch
Description: Binary data


Re: Rework of collation code, extensibility

2023-01-17 Thread Peter Geoghegan
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis  wrote:
> The first goal I had was simply that the code was really hard to
> understand and work on, and refactoring was justified to improve the
> situation.
>
> The second goal, which is somewhat dependent on the first goal, is that
> we really need an ability to support multiple ICU libraries, and I
> wanted to do some common groundwork that would be needed for any
> approach we choose there, and provide some hooks to get us there. You
> are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

> ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
> basic testing and it doesn't seem like it's slower than using the
> length. If passing the length is faster for some reason, it would
> complicate the API because we'd need an entry point that's expecting
> nul-termination and lengths, which is awkward (as Peter E. pointed
> out).

That's good. I'm happy to leave it at that. I was only enquiring.
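
For reference, the nul-terminated form of the call looks like this (a sketch
using the documented ICU API; variable names are placeholders):

/* -1 tells ICU both strings are nul-terminated */
UErrorCode	status = U_ZERO_ERROR;
UCollationResult result;

result = ucol_strcollUTF8(collator, arg1, -1, arg2, -1, &status);
if (U_FAILURE(status))
	ereport(ERROR,
			(errmsg("collation failed: %s", u_errorName(status))));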

> I felt it was a little clearer amongst the other code, to a casual
> reader, but I suppose it's a style thing. I will change it if you
> insist.

I certainly won't insist.

> I'd have to expose the pg_locale_t struct, which didn't seem desirable
> to me. Do you think it's enough of a performance concern to be worth
> some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

-- 
Peter Geoghegan




Re: pgsql: Doc: add XML ID attributes to and tags.

2023-01-17 Thread Tom Lane
Peter Eisentraut  writes:
> On 12.01.23 00:05, Tom Lane wrote:
>> That reminds me that I was going to suggest fixing the few existing
>> variances from the "use '-' not '_'" policy:
>> 
>> $ grep 'id="[a-zA-Z0-9-]*_' *sgml ref/*sgml
>> config.sgml: > xreflabel="plan_cache_mode">

> should be fixed

>> libpq.sgml:
>> libpq.sgml:
>> libpq.sgml:
>> libpq.sgml:

> I think we can leave these.  They are internally consistent.

>> pgbuffercache.sgml:  

> should be fixed

>> ref/pg_checksums.sgml: 

> pretty bogus

OK, done like that.

regards, tom lane




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan  wrote:
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), ...

Why do you think that?

> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself.

Seems a bit pessimistic. If we had unlimited resources and all
operations were infinitely fast, the optimal strategy would be to
vacuum after every insert, update, or delete. But in reality, that
would be prohibitively expensive, so we're making a trade-off.
Batching together cleanup for many table modifications reduces the
amortized cost of cleaning up after one such operation very
considerably. That's critical. But if we batch too much together, then
the actual cleanup doesn't happen soon enough to keep us out of
trouble.

If we had an oracle that could provide us with perfect information,
we'd ask it, among other things, how much work will be required to
vacuum right now, and how much benefit would we get out of doing so.
The dead tuple count is related to the first question. It's not a
direct, linear relationship, but it's not completely unrelated,
either. Maybe we could refine the estimates by gathering more or
different statistics than we do now, but ultimately it's always going
to be a trade-off between getting the work done sooner (and thus maybe
preventing table growth or a wraparound shutdown) and being able to do
more work at once (and thus being more efficient). The current system
set of counters predates HOT and the visibility map, so it's not
surprising if it needs updating, but if you're arguing that the whole
concept is just garbage, I think that's an overreaction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Update comments in multixact.c

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 1:33 AM shiy.f...@fujitsu.com
 wrote:
> I noticed that commit 5212d447fa updated some comments in multixact.c because
> SLRU truncation for multixacts is performed during VACUUM, instead of
> checkpoint. Should the following comments which mentioned checkpointer be
> changed, too?

Yes, I think so.

-- 
Peter Geoghegan




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-17 Thread Melanie Plageman
v49 attached

On Tue, Jan 17, 2023 at 2:12 PM Andres Freund  wrote:
> On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
>
> > > > +typedef struct PgStat_BackendIO
> > > > +{
> > > > + PgStat_Counter 
> > > > data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +} PgStat_BackendIO;
> > >
> > > Would it bother you if we swapped the order of iocontext and iobject here 
> > > and
> > > related places? It makes more sense to me semantically, and should now be
> > > pretty easy, code wise.
> >
> > So, thinking about this I started noticing inconsistencies in other
> > areas around this order:
> > For example: ordering of objects mentioned in commit messages and comments,
> > ordering of parameters (like in pgstat_count_io_op() [currently in
> > reverse order]).
> >
> > I think we should make a final decision about this ordering and then
> > make everywhere consistent (including ordering in the view).
> >
> > Currently the order is:
> > BackendType
> >   IOContext
> > IOObject
> >   IOOp
> >
> > You are suggesting this order:
> > BackendType
> >   IOObject
> > IOContext
> >   IOOp
> >
> > Could you explain what you find more natural about this ordering (as I
> > find the other more natural)?
>
> The object we're performing IO on determines more things than the context. So
> it just seems like the natural hierarchical fit. The context is a sub-category
> of the object. Consider how it'll look like if we also have objects for 'wal',
> 'temp files'. It'll make sense to group by just the object, but it won't make
> sense to group by just the context.
>
> If it were trivial to do I'd use a different IOContext for each IOObject. But
> it'd make it much harder. So there'll just be a bunch of values of IOContext
> that'll only be used for one or a subset of the IOObjects.
>
>
> The reason to put BackendType at the top is pragmatic - one backend is of a
> single type, but can do IO for all kinds of objects/contexts. So any other
> hierarchy would make the locking etc much harder.
>
>
> > This is one possible natural sentence with these objects:
> >
> > During COPY, a client backend may read in data from a permanent
> > relation.
> > This order is:
> > IOContext
> >   BackendType
> > IOOp
> >   IOObject
> >
> > I think English sentences are often structured subject, verb, object --
> > but in our case, we have an extra thing that doesn't fit neatly
> > (IOContext).
>
> "..., to avoid polluting the buffer cache it uses the bulk (read|write)
> strategy".
>
>
> > Also, IOOp in a sentence would be in the middle (as the
> > verb). I made it last because a) it feels like the smallest unit b) it
> > would make the code a lot more annoying if it wasn't last.
>
> Yea, I think pragmatically that is the right choice.

I have changed the order and updated all the places using
PgStat_BktypeIO as well as in all locations in which it should be
ordered for consistency (that I could find in the pass I did) -- e.g.
the view definition, function signatures, comments, commit messages,
etc.
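
With the swap, the struct and a typical update look roughly like this (a
sketch; the enum spellings and the "io" pointer are assumptions based on the
names used in this thread):

typedef struct PgStat_BktypeIO
{
	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

/* e.g. a shared-buffer relation read done with a bulkread strategy */
io->data[IOOBJECT_RELATION][IOCONTEXT_BULKREAD][IOOP_READ]++;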

> > > > +-- Change the tablespace so that the table is rewritten directly, then 
> > > > SELECT
> > > > +-- from it to cause it to be read back into shared buffers.
> > > > +SET allow_in_place_tablespaces = true;
> > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> > >
> > > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > > checkpoints done as part of CREATE/DROP TABLESPACE?
> > >
> > > Or, at least combine this with the CHECKPOINTs above?
> >
> > I see a checkpoint is requested when dropping the tablespace if not all
> > the files in it are deleted. It seems like if the DROP TABLE for the
> > permanent table is before the explicit checkpoints in the test, then the
> > DROP TABLESPACE will not cause an additional checkpoint.
>
> Unfortunately, that's not how it works :(. See the comment above mdunlink():
>
> > * For regular relations, we don't unlink the first segment file of the rel,
> > * but just truncate it to zero length, and record a request to unlink it 
> > after
> > * the next checkpoint.  Additional segments can be unlinked immediately,
> > * however.  Leaving the empty file in place prevents that relfilenumber
> > * from being reused.  The scenario this protects us from is:
> > ...
>
>
> > Is this what you are suggesting? Dropping the temporary table should not
> > have an effect on this.
>
> I was wondering about simply moving that portion of the test to
> tablespace.sql, where we already created a tablespace.
>
>
> An alternative would be to propose splitting tablespace.sql into one portion
> running at the start of parallel_schedule, and one at the end. Historically,
> we needed tablespace.sql to be optional due to causing problems when
> replicating to another instance on the same machine, but now we have
> allow_in_place_tablespaces.

It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the 

Re: document the need to analyze partitioned tables

2023-01-17 Thread Bruce Momjian
On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> On Tue, Jan 17, 2023 at 03:53:24PM -0500, Bruce Momjian wrote:
> > On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> > > Here is my take on the wording:
> > > 
> > >   Since all the data for a partitioned table is stored in its partitions,
> > >   autovacuum does not process partitioned tables.  Instead, autovacuum
> > >   processes the individual partitions that are regular tables.  This
> > >   means that autovacuum only gathers statistics for the regular tables
> > >   that serve as partitions and not for the partitioned tables.  Since
> > >   queries may rely on a partitioned table's statistics, you should
> > >   collect statistics via the ANALYZE command when it is first populated,
> > >   and again whenever the distribution of data in its partitions changes
> > >   significantly.
> > 
> > Uh, what about autovacuum's handling of partitioned tables?  This makes
> > it sound like it ignores them because it talks about manual ANALYZE.
> 
> If we're referring to the *partitioned* table, then it does ignore them.
> See:
> 
> |commit 6f8127b7390119c21479f5ce495b7d2168930e82
> |Author: Alvaro Herrera 
> |Date:   Mon Aug 16 17:27:52 2021 -0400
> |
> |Revert analyze support for partitioned tables

Yes, I see that patch was trying to combine the statistics of individual
partitions into a partitioned table summary.

> Maybe (all?) the clarification the docs need is to say:
> "Partitioned tables are not *themselves* processed by autovacuum."

Yes, I think the lack of autovacuum needs to be specifically mentioned
since most people assume autovacuum handles _all_ statistics updating.

Can someone summarize how bad it is we have no statistics on partitioned
tables?  It sounds bad to me. 

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>> I dunno if we want to stretch buildfarm owners' patience with yet
>> another BF client release right now.  On the other hand, I'm antsy
>> to see if we can un-revert 1b4d280ea after doing a little more
>> work in AdjustUpgrade.pm.

> I think the next step is to push the buildfarm client changes, and
> update those three animals to use it, and make sure nothing breaks. I'll
> go and do those things now. Then you should be able to try your unrevert.

It looks like unrevert will require ~130 lines in AdjustUpgrade.pm,
which is not great but not awful either.  I think this is ready to
go once you've vetted your remaining buildfarm animals.

regards, tom lane

diff --git a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
index 7cf4ced392..5bed1d6839 100644
--- a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
+++ b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
@@ -268,6 +268,12 @@ sub adjust_old_dumpfile
 	# Version comments will certainly not match.
 	$dump =~ s/^-- Dumped from database version.*\n//mg;
 
+	if ($old_version < 16)
+	{
+		# Fix up some view queries that no longer require table-qualification.
+		$dump = _mash_view_qualifiers($dump);
+	}
+
 	if ($old_version >= 14 && $old_version < 16)
 	{
 		# Fix up some privilege-set discrepancies.
@@ -396,6 +402,133 @@ sub adjust_old_dumpfile
 	return $dump;
 }
 
+
+# Data for _mash_view_qualifiers
+my @_unused_view_qualifiers = (
+	# Present at least since 9.2
+	{ obj => 'VIEW public.trigger_test_view',  qual => 'trigger_test' },
+	{ obj => 'VIEW public.domview',qual => 'domtab' },
+	{ obj => 'VIEW public.my_property_normal', qual => 'customer' },
+	{ obj => 'VIEW public.my_property_secure', qual => 'customer' },
+	{ obj => 'VIEW public.pfield_v1',  qual => 'pf' },
+	{ obj => 'VIEW public.rtest_v1',   qual => 'rtest_t1' },
+	{ obj => 'VIEW public.rtest_vview1',   qual => 'x' },
+	{ obj => 'VIEW public.rtest_vview2',   qual => 'rtest_view1' },
+	{ obj => 'VIEW public.rtest_vview3',   qual => 'x' },
+	{ obj => 'VIEW public.rtest_vview5',   qual => 'rtest_view1' },
+	{ obj => 'VIEW public.shoelace_obsolete',  qual => 'shoelace' },
+	{ obj => 'VIEW public.shoelace_candelete', qual => 'shoelace_obsolete' },
+	{ obj => 'VIEW public.toyemp', qual => 'emp' },
+	{ obj => 'VIEW public.xmlview4',   qual => 'emp' },
+	# Since 9.3 (some of these were removed in 9.6)
+	{ obj => 'VIEW public.tv', qual => 't' },
+	{ obj => 'MATERIALIZED VIEW mvschema.tvm', qual => 'tv' },
+	{ obj => 'VIEW public.tvv',qual => 'tv' },
+	{ obj => 'MATERIALIZED VIEW public.tvvm',  qual => 'tvv' },
+	{ obj => 'VIEW public.tvvmv',  qual => 'tvvm' },
+	{ obj => 'MATERIALIZED VIEW public.bb',qual => 'tvvmv' },
+	{ obj => 'VIEW public.nums',   qual => 'nums' },
+	{ obj => 'VIEW public.sums_1_100', qual => 't' },
+	{ obj => 'MATERIALIZED VIEW public.tm',qual => 't' },
+	{ obj => 'MATERIALIZED VIEW public.tmm',   qual => 'tm' },
+	{ obj => 'MATERIALIZED VIEW public.tvmm',  qual => 'tvm' },
+	# Since 9.4
+	{
+		obj  => 'MATERIALIZED VIEW public.citext_matview',
+		qual => 'citext_table'
+	},
+	{
+		obj  => 'OR REPLACE VIEW public.key_dependent_view',
+		qual => 'view_base_table'
+	},
+	{
+		obj  => 'OR REPLACE VIEW public.key_dependent_view_no_cols',
+		qual => 'view_base_table'
+	},
+	# Since 9.5
+	{
+		obj  => 'VIEW public.dummy_seclabel_view1',
+		qual => 'dummy_seclabel_tbl2'
+	},
+	{ obj => 'VIEW public.vv',  qual => 'test_tablesample' },
+	{ obj => 'VIEW public.test_tablesample_v1', qual => 'test_tablesample' },
+	{ obj => 'VIEW public.test_tablesample_v2', qual => 'test_tablesample' },
+	# Since 9.6
+	{
+		obj  => 'MATERIALIZED VIEW public.test_pg_dump_mv1',
+		qual => 'test_pg_dump_t1'
+	},
+	{ obj => 'VIEW public.test_pg_dump_v1', qual => 'test_pg_dump_t1' },
+	{ obj => 'VIEW public.mvtest_tv',   qual => 'mvtest_t' },
+	{
+		obj  => 'MATERIALIZED VIEW mvtest_mvschema.mvtest_tvm',
+		qual => 'mvtest_tv'
+	},
+	{ obj => 'VIEW public.mvtest_tvv',   qual => 'mvtest_tv' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tvvm', qual => 'mvtest_tvv' },
+	{ obj => 'VIEW public.mvtest_tvvmv', qual => 'mvtest_tvvm' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_bb',   qual => 'mvtest_tvvmv' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tm',   qual => 'mvtest_t' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tmm',  qual => 'mvtest_tm' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tvmm', qual => 'mvtest_tvm' },
+	# Since 10 (some removed in 12)
+	{ obj => 'VIEW public.itestv10',  qual => 'itest10' },
+	{ obj => 'VIEW public.itestv11',  qual => 'itest11' },
+	{ obj => 'VIEW public.xmltableview2', qual => '"xmltable"' },
+	# Since 12
+	{
+		obj  => 'MATERIALIZED VIEW 

Re: document the need to analyze partitioned tables

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 03:53:24PM -0500, Bruce Momjian wrote:
> On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> > On Wed, Oct 05, 2022 at 10:37:01AM +0200, Laurenz Albe wrote:
> > > On Mon, 2022-03-28 at 15:05 +0200, Tomas Vondra wrote:
> > >> I've pushed the last version, and backpatched it to 10 (not sure I'd
> > >> call it a bugfix, but I certainly agree with Justin it's worth
> > >> mentioning in the docs, even on older branches).
> > > 
> > > I'd like to suggest an improvement to this.  The current wording could
> > > be read to mean that dead tuples won't get cleaned up in partitioned 
> > > tables.
> > 
> > Well, dead tuples won't get cleaned up in partitioned tables, as
> > partitioned tables do not have storage.  But I see what you mean.  Readers
> > might misinterpret this to mean that autovacuum will not process the
> > partitions.  There's a good definition of what the docs mean by
> > "partitioned table" [0], but FWIW it took me some time before I
> > consistently read "partitioned table" to mean "only the thing with relkind
> > set to 'p'" and not "both the partitioned table and its partitions."  So,
> > while the current wording is technically correct, I think it'd be
> > reasonable to expand it to help avoid confusion.
> > 
> > Here is my take on the wording:
> > 
> > Since all the data for a partitioned table is stored in its partitions,
> > autovacuum does not process partitioned tables.  Instead, autovacuum
> > processes the individual partitions that are regular tables.  This
> > means that autovacuum only gathers statistics for the regular tables
> > that serve as partitions and not for the partitioned tables.  Since
> > queries may rely on a partitioned table's statistics, you should
> > collect statistics via the ANALYZE command when it is first populated,
> > and again whenever the distribution of data in its partitions changes
> > significantly.
> 
> Uh, what about autovacuum's handling of partitioned tables?  This makes
> it sound like it ignores them because it talks about manual ANALYZE.

If we're referring to the *partitioned* table, then it does ignore them.
See:

|commit 6f8127b7390119c21479f5ce495b7d2168930e82
|Author: Alvaro Herrera 
|Date:   Mon Aug 16 17:27:52 2021 -0400
|
|Revert analyze support for partitioned tables

Maybe (all?) the clarification the docs need is to say:
"Partitioned tables are not *themselves* processed by autovacuum."

-- 
Justin




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote:
> On 1/9/23 09:44, Ilya Gladyshev wrote:
> > On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
> >> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> >>> ...
> >>>
> >>> @committers: Is it okay to add nparts_done to IndexStmt ?
> >>
> >> Any hint about this ?
> >>
> 
> AFAIK fields added at the end of a struct are seen as acceptable from the
> ABI point of view. It's not risk-free, but we did that multiple times
> when fixing bugs, IIRC.

My question isn't whether it's okay to add a field at the end of a
struct in general, but rather whether it's acceptable to add this field
at the end of this struct, and not because it's unsafe to do in a minor
release, but whether someone is going to say that it's an abuse of the
data structure.

> Do we actually need the new parts_done field? I mean, we already do
> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
> st_progress_param array. Can't we simply read it from there? Then we
> would not have ABI issues with the new field added to IndexStmt.

Good idea to try.
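
One shape that could take (a sketch, not the committed fix): let each leaf
partition bump the existing progress slot in place, so nothing needs to ride
along in IndexStmt:

/*
 * Sketch: after building the index on one leaf partition, increment the
 * shared PROGRESS_CREATEIDX_PARTITIONS_DONE counter directly instead of
 * tracking parts_done in the statement node.
 */
pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1);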

> As for the earlier discussion about the "correct" behavior for leaf vs.
> non-leaf partitions and whether to calculate partitions in advance:
> 
> * I agree it's desirable to count partitions in advance, instead of
> adding incrementally. The view is meant to provide "overview" of the
> CREATE INDEX progress, and imagine you get
> 
>   partitions_total partitions_done
> 10   9
> 
> so that you believe you're ~90% done. But then it jumps to the next
> child and now you get
> 
>   partitions_total partitions_done
> 20  10
> 
> which makes the view a bit useless for its primary purpose, IMHO.

To be clear, that's the current, buggy behavior, and this thread is
about fixing it.  The proposed patches all ought to avoid that.

But the bug isn't caused by not "calculating partitions in advance".
Rather, the issue is that currently, the "total" is overwritten while
recursing.

That's a separate question from whether indirect partitions are counted
or not.

> I wonder if we could improve this to track the size of partitions for
> total/done? That'd make leaf/non-leaf distinction unnecessary, because
> non-leaf partitions have size 0.

Maybe, but it's out of scope for this patch.

Thanks for looking.

-- 
Justin




Re: document the need to analyze partitioned tables

2023-01-17 Thread Bruce Momjian
On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> On Wed, Oct 05, 2022 at 10:37:01AM +0200, Laurenz Albe wrote:
> > On Mon, 2022-03-28 at 15:05 +0200, Tomas Vondra wrote:
> >> I've pushed the last version, and backpatched it to 10 (not sure I'd
> >> call it a bugfix, but I certainly agree with Justin it's worth
> >> mentioning in the docs, even on older branches).
> > 
> > I'd like to suggest an improvement to this.  The current wording could
> > be read to mean that dead tuples won't get cleaned up in partitioned tables.
> 
> Well, dead tuples won't get cleaned up in partitioned tables, as
> partitioned tables do not have storage.  But I see what you mean.  Readers
> might misinterpret this to mean that autovacuum will not process the
> partitions.  There's a good definition of what the docs mean by
> "partitioned table" [0], but FWIW it took me some time before I
> consistently read "partitioned table" to mean "only the thing with relkind
> set to 'p'" and not "both the partitioned table and its partitions."  So,
> while the current wording is technically correct, I think it'd be
> reasonable to expand it to help avoid confusion.
> 
> Here is my take on the wording:
> 
>   Since all the data for a partitioned table is stored in its partitions,
>   autovacuum does not process partitioned tables.  Instead, autovacuum
>   processes the individual partitions that are regular tables.  This
>   means that autovacuum only gathers statistics for the regular tables
>   that serve as partitions and not for the partitioned tables.  Since
>   queries may rely on a partitioned table's statistics, you should
>   collect statistics via the ANALYZE command when it is first populated,
>   and again whenever the distribution of data in its partitions changes
>   significantly.

Uh, what about autovacuum's handling of partitioned tables?  This makes
it sound like it ignores them because it talks about manual ANALYZE.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 14:57:27 -0500, Robert Haas wrote:
> > In pg_class:
> >
> > - The number of all frozen pages, like we do for all-visible
> >
> >   That'd give us a decent upper bound for the amount of work we need to do 
> > to
> >   increase relfrozenxid. It's important that this is crash safe (thus no
> >   pg_stats), and it only needs to be written when we'd likely make other
> >   changes to the pg_class row anyway.
> 
> I'm not sure how useful this is because a lot of the work is from
> scanning the indexes.

We can increase relfrozenxid before the index scans, unless we ran out of dead
tuple space. We already have code for that in failsafe mode, in some way. But
we really also should increase


> > In pgstats:
> >
> > - The number of dead items, incremented both by the heap scan and
> >   opportunistic pruning
> >
> >   This would let us estimate how urgently we need to clean up indexes.
> 
> I don't think this is true because btree indexes are self-cleaning in
> some scenarios and not in others.

I mainly meant it from the angle of whether we need to clean up dead items in the
heap to keep the table from bloating because we stop using those pages -
which requires index scans. But even for the index scan portion, it'd give us
a better bound than we have today.

We probably should track the number of killed tuples in indexes.


> > - The xid/mxid horizons during the last vacuum
> >
> - The number of pages with tuples that couldn't be removed due to the horizon
> >   during the last vacuum
> >
> > - The number of pages with tuples that couldn't be frozen
> 
> Not bad to know, but if the horizon we could use advances by 1, we
> can't tell whether that allows pruning nothing additional or another
> billion tuples.

Sure. But it'd be a lot better than scanning it again and again when nothing
has changed because thing still holds back the horizon. We could improve upon
it later by tracking the average or even bins of ages.


> However, where XID age really falls down as a metric is that it
> doesn't tell you what it's going to take to solve the problem. The
> answer, at ten thousand feet, is always vacuum. But how long will that
> vacuum run? We don't know. Do we need the XID horizon to advance
> first, and if so, how far? We don't know. Do we need auto-cancellation
> to be disabled? We don't know. That's where we get into a lot of
> trouble here.

Agreed. I think the metrics I proposed would help some, by at least providing
sensible upper boundaries (for work) and minimal requirements (horizon during
last vacuum).

Greetings,

Andres Freund




Re: cutting down the TODO list thread

2023-01-17 Thread Bruce Momjian
On Mon, Jan 16, 2023 at 05:17:23PM +0700, John Naylor wrote:
> 
> I wrote:
> 
> > We could also revise the developer FAQ:
> > - remove phrase "Outstanding features are detailed in Todo."
> > - add suggestion to to check the Todo or Not_worth_doing pages to see if the
> desired feature is undesirable or problematic
> > - rephrase "Working in isolation is not advisable because others might be
> working on the same TODO item, or you might have misunderstood the TODO item."
> so it doesn't mention 'TODO' at all.
> 
> There is also
> 
> https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs
> 
> Changing the description of what links to the todo list will probably do more
> to reduce confusion than language in the todo list itself.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: cutting down the TODO list thread

2023-01-17 Thread Bruce Momjian
On Wed, Jan 11, 2023 at 02:09:56PM +0700, John Naylor wrote:
> I've come up with some revised language, including s/15/16/ and removing the
> category of "[E]" (easier to implement), since it wouldn't be here if it were
> actually easy:

I think it is still possible for a simple item to be identified as
wanted and easy, but not completed and put on the TODO list.

> WARNING for Developers: This list contains some known PostgreSQL bugs, some
> feature requests, and some things we are not even sure we want. This is not
> meant to be a resource for beginning developers to get ideas for things to 
> work
> on. 
> 
> All of these items are hard, and some are perhaps impossible. Some of these
> items might have become unnecessary since they were added. Others might be
> desirable but:
> 
> - a large amount of work is required
> - the problems are subtler than they might appear
> - the desirable semantics aren't clear
> - there are tradeoffs that there's not consensus about
> - some combinations of the above
> 
> If you really need a feature that is listed below, it will be worth reading 
> the
> linked email thread if there is one, since it will often show the 
> difficulties,
> or perhaps contain previous failed attempts to get a patch committed. If after
> that you still want to work on it, be prepared to first discuss the value of
> the feature. Do not assume that you can start coding and expect it to be
> committed. Always discuss design on the Hackers list before starting to code.
> 
> Over time, it may become clear that a TODO item has become outdated or
> otherwise determined to be either too controversial or not worth the
> development effort. Such items should be retired to the Not Worth Doing page.
> 
> [D] marks changes that are done, and will appear in the PostgreSQL 16 release.

I think we risk overloading people with too many words above, and they
will not read it fully.  Can it be simplified?  I wonder if some of this
belongs in the developer's FAQ and could be linked to from the TODO list.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund  wrote:
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

That's what I meant, yes.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:

If you assume that there is chronic undercounting of dead tuples
(which I think is very common), then of course anything that triggers
vacuuming is going to help with that problem -- it might be totally
inadequate, but still make the critical difference by not allowing the
system to become completely destabilized. I absolutely accept that
users that are relying on that exist, and that those users ought to
not have things get even worse -- I'm pragmatic. But overall, what we
should be doing is fixing the real problem, which is that the dead
tuples accounting is deeply flawed. Actually, it's not just that the
statistics are flat out wrong; the whole model is flat-out wrong.

The assumptions that work well for optimizer statistics quite simply
do not apply here. Random sampling for this is just wrong, because
we're not dealing with something that follows a distribution that can
be characterized with a sufficiently large sample. With optimizer
statistics, the entire contents of the table is itself a sample taken
from the wider world -- so even very stale statistics can work quite
well (assuming that the schema is well normalized). Whereas the
autovacuum dead tuples stuff is characterized by constant change. I
mean of course it is -- that's the whole point! The central limit
theorem obviously just doesn't work for something like this  -- we
cannot generalize from a sample, at all.

I strongly suspect that it still wouldn't be a good model even if the
information was magically always correct. It might actually be worse
in some ways! Most of my arguments against the model are not arguments
against the accuracy of the statistics as such. They're actually
arguments against the fundamental relevance of the information itself,
to the actual problem at hand. We are not interested in information
for its own sake; we're interested in making better decisions about
autovacuum scheduling. Those may only have a very loose relationship.

How many dead heap-only tuples are equivalent to one LP_DEAD item?
What about page-level concentrations, and the implication for
line-pointer bloat? I don't have a good answer to any of these
questions myself. And I have my doubts that there are *any* good
answers. Even these questions are the wrong questions (they're just
less wrong). Fundamentally, we're deciding when the next autovacuum
should run against each table. Presumably it's going to have to happen
some time, and when it does happen it happens to the table as a whole.
And with a larger table it probably doesn't matter if it happens +/- a
few hours from some theoretical optimal time. Doesn't it boil down to
that?

If we taught the system to do the autovacuum work early because it's a
relatively good time for it from a system level point of view (e.g.
it's going to be less disruptive right now), that would be useful and
easy to justify on its own terms. But it would also tend to make the
system much less vulnerable to undercounting dead tuples, since in
practice there'd be a decent chance of getting to them early enough
that it at least wasn't extremely painful any one time. It's much
easier to understand that the system is quiescent than it is to
understand bloat.

BTW, I think that the insert-driven autovacuum stuff added to 13 has
made the situation with bloat significantly better. Of course it
wasn't really designed to do that at all, but it still has, kind of by
accident, in roughly the same way that antiwraparound autovacuums help
with bloat by accident. So maybe we should embrace "happy accidents"
like that a bit more. It doesn't necessarily matter if we do the right
thing for a reason that turns out to have not been the best reason.
I'm certainly not opposed to it, despite my complaints about relying
on age(relfrozenxid).

> In pgstats:
> (Various stats)

Overall, what I like about your ideas here is the emphasis on bounding
the worst case, and the emphasis on the picture at the page level over
the tuple level.

I'd like to use the visibility map more for stuff here, too. It is
totally accurate about all-visible/all-frozen pages, so many of my
complaints about statistics don't really apply. Or need not apply, at
least. If 95% of a table's pages are all-frozen in the VM, then of
course it's pretty unlikely to be the right time to VACUUM the table
if it's to clean up 

Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 1:42 PM Nathan Bossart  wrote:
> Alright.  The one design question I have is whether this should be a new
> set of reserved connections or replace superuser_reserved_connections
> entirely.

I think it should definitely be something new, not a replacement.

> If we create a new batch of reserved connections, only roles with
> privileges of pg_use_reserved_connections would be able to connect if the
> number of remaining slots is greater than superuser_reserved_connections
> but less than or equal to superuser_reserved_connections +
> reserved_connections.  Only superusers would be able to connect if the
> number of remaining slots is less than or equal to
> superuser_reserved_connections.  This helps avoid blocking new superuser
> connections even if you've reserved some connections for non-superusers.

This is precisely what I had in mind.

I think the documentation will need some careful wordsmithing,
including adjustments to superuser_reserved_connections. We want to
recast superuser_reserved_connections as a final reserve to be touched
after even reserved_connections has been exhausted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 1:02 PM Andres Freund  wrote:
> On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> > On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> > When I express skepticism of very high autovacuum_freeze_max_age
> > settings, it's mostly just that I don't think that age(relfrozenxid)
> > is at all informative in the way that something that triggers
> > autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> > less informative it becomes. I'm sure that it can be safe to use very
> > high autovacuum_freeze_max_age values, but it seems like a case of
> > using a very complicated and unreliable thing to decide when to
> > VACUUM.
>
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

I think so, at least in my case. The reason why the default
autovacuum_freeze_max_age is 300m is not because going over 300m is a
problem, but because we don't know how long it's going to take to run
all of the vacuums, and we're still going to be consuming XIDs in the
meantime, and we have no idea how fast we're consuming XIDs either.
For all we know it might take days, and we might be burning through
XIDs really quickly, so we might need a ton of headroom.

That's not to say that raising the setting might not be a sensible
thing to do in a context where you know what the workload is. If you
know that the vacuums will complete quickly OR that the XID burn
rate is low, you can raise it drastically and be fine. We just can't
assume that will be true everywhere.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:
>
> In pg_class:
>
> - The number of all frozen pages, like we do for all-visible
>
>   That'd give us a decent upper bound for the amount of work we need to do to
>   increase relfrozenxid. It's important that this is crash safe (thus no
>   pg_stats), and it only needs to be written when we'd likely make other
>   changes to the pg_class row anyway.

I'm not sure how useful this is because a lot of the work is from
scanning the indexes.

> In pgstats:
>
> - The number of dead items, incremented both by the heap scan and
>   opportunistic pruning
>
>   This would let us estimate how urgently we need to clean up indexes.

I don't think this is true because btree indexes are self-cleaning in
some scenarios and not in others.

> - The xid/mxid horizons during the last vacuum
>
> - The number of pages with tuples that couldn't be removed due to the horizon
>   during the last vacuum
>
> - The number of pages with tuples that couldn't be frozen

Not bad to know, but if the horizon we could use advances by 1, we
can't tell whether that allows pruning nothing additional or another
billion tuples.

I'm not trying to take the position that XID age is a totally useless
metric. I don't think it is. If XID age is high, you know you have a
problem, and the higher it is, the more urgent that problem is.
Furthermore, if XID age is low, you know that you don't have that
particular problem. You might have some other one, but that's OK: this
one metric doesn't have to answer every question to be useful.
However, where XID age really falls down as a metric is that it
doesn't tell you what it's going to take to solve the problem. The
answer, at ten thousand feet, is always vacuum. But how long will that
vacuum run? We don't know. Do we need the XID horizon to advance
first, and if so, how far? We don't know. Do we need auto-cancellation
to be disabled? We don't know. That's where we get into a lot of
trouble here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CI and test improvements

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 11:35:09 -0600, Justin Pryzby wrote:
> The autoconf system runs all tap tests in t/*.pl, but meson requires
> enumerating them in ./meson.build.

Yes. It was a mistake that we ever used t/*.pl for make. For one, it means
that make can't control concurrency meaningfully, due to the varying number of
tests run with one prove instance. It's also the only thing that tied us to
prove, which is one hell of a buggy mess.


> This checks for and finds no missing tests in the current tree:
>
> $ for pl in `find src contrib -path '*/t/*.pl'`; do base=${pl##*/}; 
> dir=${pl%/*}; meson=${dir%/*}/meson.build; grep "$base" "$meson" >/dev/null 
> || echo "$base is missing from $meson"; done

Likely because I do something similar locally.

# prep
m test --list > /tmp/tests.txt

# Check if all tap tests are known to meson
for f in $(git ls-files|grep -E '(t|test)/.*.pl$'|sort);do t=$(echo $f|sed -E -e 's/^.*\/([^/]*)\/(t|test)\/(.*)\.pl$/\1\/\3/');grep -q -L $t /tmp/tests.txt || echo $f;done


# Check if all regression / isolation tests are known to meson
#
# Expected to find plpgsql due to extra 'src' directory level, src/test/mb
# because it's not run anywhere and sepgsql, because that's not tested yet
for d in $(find ~/src/postgresql -type d \( -name sql -or -name specs \) );do 
t=$(basename $(dirname $d)); grep -q -L $t /tmp/tests.txt || echo $d; done



> However, this finds two real problems and one false-positive with
> missing regress/isolation tests:

Which the above does *not* test for. Good catch.

I'll push the fix for those as soon as tests passed on my personal repo.


> $ for makefile in `find src contrib -name Makefile`; do for testname in `sed 
> -r '/^(REGRESS|ISOLATION) =/!d; s///; :l; /$/{s///; N; b l}; s/\n//g' 
> "$makefile"`; do meson=${makefile%/Makefile}/meson.build; grep -Fw 
> "$testname" "$meson" >/dev/null || echo "$testname is missing from $meson"; 
> done; done
> guc_privs is missing from src/test/modules/unsafe_tests/meson.build

Yep. That got added during the development of the meson port, so it's not too 
surprising.


> oldextversions is missing from contrib/pg_stat_statements/meson.build

This one however, is odd. Not sure how that happened.


> $(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build

Assume that's the false positive?


> I also tried but failed to write something to warn if "meson test" was
> run with a list of tests but without tmp_install.  Help wanted.

That doesn't even catch the worst case - when there's tmp_install, but it's
too old.

The proper solution would be to make the creation of tmp_install a dependency
of the relevant tests. Unfortunately meson still auto-propagates those to
dependencies of the 'all' target (for historical reasons), and creating the
temp install is too slow on some machines to make that tolerable. I think
there's an open PR to change that. Once that's in a released meson version
that's in somewhat widespread use, we should change that.

The other path forward is to allow running the tests without
tmp_install. There's not that much we'd need to allow running directly from
the source tree - the biggest thing is a way to load extensions from a list of
paths. This option is especially attractive because it'd allow running
individual tests without a fully built source tree. No need to build other
binaries when you just want to test psql, or more extremely, pg_test_timing.


> I propose to put something like this into "SanityCheck".

Perhaps we instead could add it as a separate "meson-only" test? Then it'd
fail on developers' machines, instead of later in CI. We could pass the test
information from the 'tests' array, or it could look at the metadata in
meson-info/intro-tests.json

Greetings,

Andres Freund




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra
On 1/17/23 19:46, Andres Freund wrote:
> Hi,
> 
> On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
>> On 1/17/23 18:02, Andres Freund wrote:
>>> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
 That also does not have issues with timestamp "rounding" - considering
 e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
 possible the node completes within 1ms, in which case

(now - last_sample)

 ends up being 0 (assuming I correctly understand the code).
>>>
>>> That part should be counting in nanoseconds, I think? Unless I misunderstand
>>> something?
>>>
>>
>> The higher precision does not help, because both values come from the
>> *sampled* timestamp (i.e. the one updated from the signal handler). So
>> if the node happens to execute between two signals, the values are going
>> to be the same, and the difference is 0.
> 
> In that case there simply wasn't any sample for the node, and a non-timestamp
> based sample counter wouldn't do anything different?
> 

Yeah, you're right.

> If you're worried about the case where a timer does fire during execution of
> the node, but exactly once, that should provide a difference between the last
> sampled timestamp and the current time. It'll attribute a bit too much to the
> in-progress nodes, but well, that's sampling for you.
> 
> 
> I think a "hybrid" explain mode might be worth thinking about. Use the
> "current" sampling method for the first execution of a node, and for the first
> few milliseconds of a query (or perhaps the first few timestamp
> acquisitions). That provides an accurate explain analyze for short queries,
> without a significant slowdown. Then switch to sampling, which provides decent
> attribution for a bit longer running queries.
> 

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

> 
> 
> I've been thinking that we should consider making more of the 
> instrumentation
> code work like that. The amount of work we're doing in 
> InstrStart/StopNode()
> has steadily risen. When buffer usage and WAL usage are enabled, we're
> executing over 1000 instructions! And each single Instrumentation node is 
> ~450
> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> embedded.
>
> If we instead have InstrStartNode() set up a global pointer to the
> Instrumentation node, we can make the instrumentation code modify both the
> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> changes - right now nodes "automatically" include the IO/WAL incurred in 
> child
> nodes, but that's just a small bit of additional summin-up to be done 
> during
> EXPLAIN.
>

 That's certainly one way to implement that. I wonder if we could make
 that work without the global pointer, but I can't think of any.
>>>
>>> I don't see a realistic way at least. We could pass down an
>>> "InstrumentationContext" through everything that needs to do IO and WAL. But
>>> that seems infeasible at this point.
> 
>> Why infeasible?
> 
> Primarily the scale of the change. We'd have to pass down the context into all
> table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
> which'd require their callers to have access to the context.  That's hundreds
> if not thousands places.
> 
> Adding that many function parameters might turn out to be noticeable
> runtime-wise, due to increased register movement. I think for a number of the places
> where we currently don't, we ought to use by-reference struct for the
> not-always-used parameters, that then also could contain this context.
> 

OK, I hadn't realized we'd have to pass it to that many places.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Tomas Vondra
On 1/9/23 09:44, Ilya Gladyshev wrote:
> On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
>> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
>>> ...
>>>
>>> @committers: Is it okay to add nparts_done to IndexStmt ?
>>
>> Any hint about this ?
>>

AFAIK fields added at the end of a struct are seen as acceptable from the
ABI point of view. It's not risk-free, but we did that multiple times
when fixing bugs, IIRC.

The primary risk is old extensions (built against an older minor version)
running on a new server, getting confused by new fields (and implied
shifts in the structs). But fields at the end should be safe - the
extension simply ignores the stuff at the end. The one problem would be
arrays of structs, because even a field at the end changes the array
stride. But I don't think we do that with IndexStmt ...

Of course, if the "old" extension itself allocates the struct and passes
it to core code, that might still be an issue, because it'll allocate a
smaller struct, and core might see bogus data at the end.

On the other hand, a new extension on an old server may get confused too,
because it may try setting a field that does not exist.

So ultimately it's about weighing risks vs. benefits - evaluating
whether fixing the issue is actually worth it.

The question is if/how many such extensions messing with IndexStmt in
this way actually exist. That is, allocate IndexStmt (or array of it). I
haven't found any, but maybe some extensions for index or partition
management do it? Not sure.

But ...

Do we actually need the new parts_done field? I mean, we already do
track the value - at the PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
st_progress_param array. Can't we simply read it from there? Then we
would not have ABI issues with the new field added to IndexStmt.
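
Something like this minimal sketch is what I have in mind (the function name is
made up, and it cheats a bit by reading our own backend-status entry directly,
without the changecount protocol):

#include "postgres.h"
#include "commands/progress.h"
#include "utils/backend_progress.h"
#include "utils/backend_status.h"

/*
 * Sketch only: bump the counter we already report for
 * pg_stat_progress_create_index, instead of tracking it again in IndexStmt.
 */
static void
increment_partitions_done(void)
{
	int64		parts_done;

	parts_done = MyBEEntry->st_progress_param[PROGRESS_CREATEIDX_PARTITIONS_DONE];
	pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
								 parts_done + 1);
}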

>> This should be resolved before the "CIC on partitioned tables" patch,
>> which I think is otherwise done.
> 
> I suggest that we move on with the IndexStmt patch and see what the
> committers have to say about it. I have brushed the patch up a bit,
> fixing TODOs and adding docs as per our discussion above.
> 

I did take a look at the patch, so here are my 2c:

1) num_leaf_partitions says it's "excluding foreign tables" but then it
uses RELKIND_HAS_STORAGE() which excludes various other relkinds, e.g.
partitioned tables etc. Minor, but perhaps a bit confusing.

2) I'd probably say count_leaf_partitions() instead.

3) The new part in DefineIndex counting leaf partitions should have a
comment before

if (!OidIsValid(parentIndexId))
{ ... }

4) It's a bit confusing that one of the branches in DefineIndex just
sets stmt->parts_done without calling pgstat_progress_update_param
(while the other one does both). AFAICS the call is not needed because
we already updated it during the recursive DefineIndex call, but maybe
the comment should mention that?


As for the earlier discussion about the "correct" behavior for leaf vs.
non-leaf partitions and whether to calculate partitions in advance:

* I agree it's desirable to count partitions in advance, instead of
adding incrementally. The view is meant to provide an "overview" of the
CREATE INDEX progress, and imagine you get

  partitions_total partitions_done
10   9

so that you believe you're ~90% done. But then it jumps to the next
child and now you get

  partitions_total partitions_done
20  10

which makes the view a bit useless for its primary purpose, IMHO.


* I don't care very much about leaf vs. non-leaf partitions. If we
exclude non-leaf ones, fine with me. But the number of non-leaf ones
should be much smaller than that of leaf ones, and if a partition already has
a matching index, that distorts the tracking too. Furthermore the
partitions may have different size etc. so the progress is only
approximate anyway.

I wonder if we could improve this to track the size of partitions for
total/done? That'd make leaf/non-leaf distinction unnecessary, because
non-leaf partitions have size 0.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Remove source code display from \df+?

2023-01-17 Thread Isaac Morland
On Thu, 12 Jan 2023 at 12:06, Isaac Morland  wrote:

Thanks everybody. So based on the latest discussion I will:
>
> 1) rename the column from “Source code” to “Internal name”; and
> 2) change the contents to NULL except when the language (identified by
> oid) is INTERNAL or C.
>
> Patch forthcoming, I hope.
>

I've attached a patch for this. It turns out to simplify the existing code
in one way because the recently added call to pg_get_function_sqlbody() is
no longer needed since it applies only to SQL functions, which will now
display as a blank column.

I implemented the change and was surprised to see that no tests failed.
Turns out that while there are several tests for \df, there were none for
\df+. I added a couple, just using \df+ on some functions that appear to me
to be present on plain vanilla Postgres.

I was initially concerned about translation support for the column heading,
but it turns out that \dT+ already has a column with the exact same name so
it appears we don’t need any new translations.

I welcome comments and feedback. Now to try to find something manageable to
review.


0001-Remove-source-code-display-from-df.patch
Description: Binary data


[PATCH] Teach planner to further optimize sort in distinct

2023-01-17 Thread Ankit Kumar Pandey
Hi, this is an extension of the `teach planner to evaluate multiple windows in
the optimal order` work, applied to the distinct operation.


Based on discussions before 
(https://www.postgresql.org/message-id/flat/CAApHDvr7rSCVXzGfVa1L9pLpkKj6-s8NynK8o%2B98X9sKjejnQQ%40mail.gmail.com#e01327a3053d9281c40f281ef7105b42) 
,


> All I imagine you need to do for it
> is to invent a function in pathkeys.c which is along the lines of what
> pathkeys_count_contained_in() does, but returns a List of pathkeys
> which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
> that does not exist as a pathkey in keys1. In
> create_final_distinct_paths(), you can then perform an incremental
> sort on any input_path which has a non-empty return list and in
> create_incremental_sort_path(), you'll pass presorted_keys as the
> number of pathkeys in the path, and the required pathkeys the
> input_path->pathkeys + the pathkeys returned from the new function.


There is a bit of confusion in the wording here:

"returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1."

You mean extract the common keys without regard to ordering, right?

Example: keys1 = (a,b,c), keys2 = (b,a)

returns (a,b)

and

keys1 = (a,b,c), keys2 = (d)

returns = ()

which translates to

needed_pathkeys = (a,b,c) = key2

input_pathkeys = (b,a) key1

returns (b,a) = common_keys

new needed_pathkeys = unique(common_keys + old needed_pathkeys)

=> new needed_pathkeys = (b,a,c)

The new needed_pathkeys matches input_pathkeys.

This is what I implemented in the patch.


The patched version yields the following plans:

set enable_hashagg=0;
set enable_seqscan=0;

explain (costs off) select distinct relname,relkind,count(*) over (partition by relkind) from pg_Class;
   QUERY PLAN
-
 Unique
   ->  Incremental Sort
 Sort Key: relkind, relname, (count(*) OVER (?))
 Presorted Key: relkind
 ->  WindowAgg
   ->  Sort
 Sort Key: relkind
 ->  Seq Scan on pg_class
(8 rows)

explain (costs off) select distinct a, b, count(*) over (partition by b, 
a) from abcd;

   QUERY PLAN

 Unique
   ->  Incremental Sort
 Sort Key: b, a, (count(*) OVER (?))
 Presorted Key: b, a
 ->  WindowAgg
   ->  Incremental Sort
 Sort Key: b, a
 Presorted Key: b
 ->  Index Scan using b_idx on abcd
(9 rows)

explain (costs off) select distinct a, b, count(*) over (partition by c, 
d) from abcd;

   QUERY PLAN

 Unique
   ->  Sort
 Sort Key: a, b, (count(*) OVER (?))
 ->  WindowAgg
   ->  Incremental Sort
 Sort Key: c, d
 Presorted Key: c
 ->  Index Scan using c_idx on abcd
(8 rows)


The issue with index paths still remains: pathkeys get purged by
truncate_useless_pathkeys and hence are not available in
create_final_distinct_paths for the above optimizations.



I have attached a patch for reference.


Thanks,

Ankit

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 609df93dc9..13f6006577 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1968,3 +1968,32 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel)
 		return true;			/* might be able to use them for ordering */
 	return false;/* definitely useless */
 }
+
+/*
+ * extract_common_pathkeys
+ *		returns a List of the pathkeys in keys1 that also appear in keys2
+ *		(i.e. the common pathkeys, regardless of their order in keys2),
+ *		preserving the order of keys1
+ */
+List *
+extract_common_pathkeys(List* keys1, List *keys2)
+{
+	List *new_pk = NIL;
+	ListCell	*l1;
+	ListCell	*l2;
+	foreach(l1, keys1)
+	{
+		PathKey*pathkey1 = (PathKey *) lfirst(l1);
+		foreach(l2, keys2)
+		{
+			PathKey*pathkey2 = (PathKey *) lfirst(l2);
+			if (pathkey1 == pathkey2)
+			{
+new_pk = lappend(new_pk, pathkey1);
+break;
+			}
+		}
+	}
+	return new_pk;
+
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 044fb24666..1802d28e75 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4844,11 +4844,28 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 			Path	   *sorted_path;
 			bool		is_sorted;
 			int			presorted_keys;
+			List		*common_keys;
 
 			is_sorted = pathkeys_count_contained_in(needed_pathkeys,
 	input_path->pathkeys,
 	&presorted_keys);
 
+			/*
+			 * Check if there are common pathkeys (regardless of ordering)
+			 */
+			common_keys = 

Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Peter Eisentraut

On 17.01.23 07:24, Michael Paquier wrote:

On Mon, Jan 16, 2023 at 10:53:40AM +0900, Michael Paquier wrote:

+1.


While I don't forget about this thread..  Any objections if I were to
apply that?


Looks good to me.





Re: Refactor recordExtObjInitPriv()

2023-01-17 Thread Peter Eisentraut

On 16.01.23 23:43, Nathan Bossart wrote:

On Mon, Jan 16, 2023 at 12:01:47PM +0100, Peter Eisentraut wrote:

I have updated the patch as you suggested and split out the aggregate issue
into a separate patch for clarity.


LGTM


committed





Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo 
> > > pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > >   .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > >   },
> > >
> > > + [PGSTAT_KIND_IO] = {
> > > + .name = "io_ops",
> >
> > That should be "io" now I think?
> >
> 
> Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
> to the IO stats tests now.
> 
> It would be nice if pgstat_get_kind_from_str() could be used in
> pg_stat_reset_shared() to avoid having to remember to change both.

It's hard to make that work, because of the historical behaviour of that
function :(


> Also:
> - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
>   well into this paradigm

I think that needs a rework anyway - it went in at about the same time as the
shared mem stats patch, so it doesn't quite cohere.


> On a separate note, should we be setting have_[io/slru/etc]stats to
> false in the reset all functions?

That'd not work reliably, because other backends won't do the same. I don't
see a benefit in doing it differently in the local connection than the other
connections.


> > > +typedef struct PgStat_BackendIO
> > > +{
> > > + PgStat_Counter 
> > > data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +} PgStat_BackendIO;
> >
> > Would it bother you if we swapped the order of iocontext and iobject here 
> > and
> > related places? It makes more sense to me semantically, and should now be
> > pretty easy, code wise.
> 
> So, thinking about this I started noticing inconsistencies in other
> areas around this order:
> For example: ordering of objects mentioned in commit messages and comments,
> ordering of parameters (like in pgstat_count_io_op() [currently in
> reverse order]).
> 
> I think we should make a final decision about this ordering and then
> make everywhere consistent (including ordering in the view).
> 
> Currently the order is:
> BackendType
>   IOContext
> IOObject
>   IOOp
> 
> You are suggesting this order:
> BackendType
>   IOObject
> IOContext
>   IOOp
> 
> Could you explain what you find more natural about this ordering (as I
> find the other more natural)?

The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look like if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.
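
Concretely, I think the swap amounts to little more than flipping the two inner
dimensions of the struct quoted above, i.e. roughly:

typedef struct PgStat_BackendIO
{
	PgStat_Counter	data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BackendIO;

plus the corresponding swap of loop nesting / parameter order in the places
that iterate over it.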


The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc much harder.


> This is one possible natural sentence with these objects:
> 
> During COPY, a client backend may read in data from a permanent
> relation.
> This order is:
> IOContext
>   BackendType
> IOOp
>   IOObject
> 
> I think English sentences are often structured subject, verb, object --
> but in our case, we have an extra thing that doesn't fit neatly
> (IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".


> Also, IOOp in a sentence would be in the middle (as the
> verb). I made it last because a) it feels like the smallest unit b) it
> would make the code a lot more annoying if it wasn't last.

Yea, I think pragmatically that is the right choice.



> > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
> >
> > Nearly happy with this now. See one minor nit below.
> >
> > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), 
> > but
> > I don't have a better idea, and it doesn't seem too horrible.
> 
> You don't like it because such things shouldn't be in md.c -- since we
> went to the trouble of having function pointers and making it general?

It's more of a gut feeling than well reasoned ;)



> > > +-- Change the tablespace so that the table is rewritten directly, then 
> > > SELECT
> > > +-- from it to cause it to be read back into shared buffers.
> > > +SET allow_in_place_tablespaces = true;
> > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> >
> > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > checkpoints done as part of CREATE/DROP TABLESPACE?
> >
> > Or, at least combine this with the CHECKPOINTs above?
> 
> I see a checkpoint is requested when dropping the tablespace if not all
> the files in it are deleted. It seems like if the DROP TABLE for the
> permanent table is before the explicit checkpoints in the test, then the
> DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, 

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Here's an updated version of the move to representing instr_time as
> > nanoseconds. It's now split into a few patches:
> 
> I took a quick look through this.

Thanks!


> > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
> >   warn
> >   Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
> >   just allow to assign 0.
> 
> I think it's probably wise to keep the macro.  If we ever rethink this
> again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
> even if it would work with just compare-to-zero.

Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?


> I'm almost tempted to suggest you define instr_time as a struct with a
> uint64 field, just to help keep us honest about that.

I can see that making sense. Unless somebody pipes up with opposition to that
plan soon, I'll see how it goes.
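
To sketch the shape of both ideas (not something I'd commit as-is):

/* instr_time as a struct wrapping a plain nanosecond counter (uint64 from c.h) */
typedef struct instr_time
{
	uint64		ticks;
} instr_time;

#define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
#define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)

/* usable directly in a variable definition, via a C99 compound literal */
#define INSTR_TIME_ZERO()		((instr_time) {0})

/* e.g.: instr_time duration = INSTR_TIME_ZERO(); */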


> > 0003) Add INSTR_TIME_SET_SECOND()
> >   This is used in 0004. Just allows setting an instr_time to a time in
> >   seconds, allowing for a cheaper loop exit condition in 0004.
> 
> Code and comments are inconsistent about whether it's SET_SECOND or
> SET_SECONDS.  I think I prefer the latter, but don't care that much.

That's probably because I couldn't decide... So I'll go with your preference.


> > 0004) report nanoseconds in pg_test_timing
> 
> Didn't examine 0004 in any detail, but the others look good to go
> other than these nits.

Thanks for looking!

Greetings,

Andres Freund




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
> On 1/17/23 18:02, Andres Freund wrote:
> > On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
> >> That also does not have issues with timestamp "rounding" - considering
> >> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
> >> possible the node completes within 1ms, in which case
> >>
> >>(now - last_sample)
> >>
> >> ends up being 0 (assuming I correctly understand the code).
> > 
> > That part should be counting in nanoseconds, I think? Unless I misunderstand
> > something?
> > 
> 
> The higher precision does not help, because both values come from the
> *sampled* timestamp (i.e. the one updated from the signal handler). So
> if the node happens to execute between two signals, the values are going
> to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.


I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.
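
Very roughly, with entirely made-up names for the switch-over threshold and the
timestamp maintained by the timer:

#include "postgres.h"
#include "executor/instrument.h"
#include "portability/instr_time.h"

static bool			sampling_active;	/* made up: enabled once the initial period is over */
static instr_time	last_sample_time;	/* made up: written by the timer signal handler */

#define HYBRID_EXACT_LOOPS	4			/* made up: exact timing for the first few executions */

static void
instr_start_hybrid(Instrumentation *instr)
{
	if (!sampling_active || instr->nloops < HYBRID_EXACT_LOOPS)
		INSTR_TIME_SET_CURRENT(instr->starttime);	/* today's exact path */
	else
		instr->starttime = last_sample_time;		/* sampled path */
}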



> >>> I've been thinking that we should consider making more of the 
> >>> instrumentation
> >>> code work like that. The amount of work we're doing in 
> >>> InstrStart/StopNode()
> >>> has steadily risen. When buffer usage and WAL usage are enabled, we're
> >>> executing over 1000 instructions! And each single Instrumentation node is 
> >>> ~450
> >>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> >>> embedded.
> >>>
> >>> If we instead have InstrStartNode() set up a global pointer to the
> >>> Instrumentation node, we can make the instrumentation code modify both the
> >>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> >>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> >>> changes - right now nodes "automatically" include the IO/WAL incurred in 
> >>> child
> >>> nodes, but that's just a small bit of additional summin-up to be done 
> >>> during
> >>> EXPLAIN.
> >>>
> >>
> >> That's certainly one way to implement that. I wonder if we could make
> >> that work without the global pointer, but I can't think of any.
> > 
> > I don't see a realistic way at least. We could pass down an
> > "InstrumentationContext" through everything that needs to do IO and WAL. But
> > that seems infeasible at this point.

> Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context.  That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticeable
runtime-wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.
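
For example, something vaguely like this - ReadBufferExtra and
InstrumentationContext are invented names, just to show the shape:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/relcache.h"

/* sketch: bundle the rarely-used parameters into one by-reference struct */
typedef struct BufferAccessExtra
{
	BufferAccessStrategy strategy;				/* today a separate parameter */
	struct InstrumentationContext *instr_ctx;	/* hypothetical, per above */
} BufferAccessExtra;

extern Buffer ReadBufferExtra(Relation reln, ForkNumber forkNum,
							  BlockNumber blockNum, ReadBufferMode mode,
							  const BufferAccessExtra *extra);

Callers that don't care could then just pass NULL, and adding the context later
wouldn't mean touching every call site again.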

Greetings,

Andres Freund




Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Nathan Bossart
On Mon, Jan 16, 2023 at 09:06:10PM -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 5:37 PM Nathan Bossart  
> wrote:
>> On Mon, Jan 16, 2023 at 02:29:56PM -0500, Robert Haas wrote:
>> > 4. You can reserve a small number of connections for the superuser
>> > with superuser_reserved_connections, but there's no way to do a
>> > similar thing for any other user. As mentioned above, a CREATEROLE
>> > user could set connection limits for every created role such that the
>> > sum of those limits is less than max_connections by some margin, but
>> > that restricts each of those roles individually, not all of them in
>> > the aggregate. Maybe we could address this by inventing a new GUC
>> > reserved_connections and a predefined role
>> > pg_use_reserved_connections.
>>
>> I've written something like this before, and I'd be happy to put together a
>> patch if there is interest.
> 
> Cool. I had been thinking of coding it up myself, but you doing it works, too.

Alright.  The one design question I have is whether this should be a new
set of reserved connections or replace superuser_reserved_connections
entirely.

If we create a new batch of reserved connections, only roles with
privileges of pg_use_reserved_connections would be able to connect if the
number of remaining slots is greater than superuser_reserved_connections
but less than or equal to superuser_reserved_connections +
reserved_connections.  Only superusers would be able to connect if the
number of remaining slots is less than or equal to
superuser_reserved_connections.  This helps avoid blocking new superuser
connections even if you've reserved some connections for non-superusers.
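
Concretely, for the first approach I'm imagining a check along these lines
(pseudo-C; SuperuserReservedConnections, ReservedConnections and
ROLE_PG_USE_RESERVED_CONNECTIONS are invented names, and the real thing would
live next to the existing reserved-slot check in postinit.c):

	/* sketch only: two tiers of reserved slots */
	if (!am_superuser &&
		(SuperuserReservedConnections + ReservedConnections) > 0 &&
		!HaveNFreeProcs(SuperuserReservedConnections + ReservedConnections))
	{
		if (!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS) ||
			!HaveNFreeProcs(SuperuserReservedConnections))
			ereport(FATAL,
					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
					 errmsg("remaining connection slots are reserved")));
	}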

If we replace superuser_reserved_connections, we're basically opening up
the existing functionality to non-superusers, which is simpler and probably
more in the spirit of this thread, but it doesn't provide a way to prevent
blocking new superuser connections.

My preference is the former approach.  This is closest to what I've written
before, and if I read your words carefully, it seems to be what you are
proposing.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 10:26:52 -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan  wrote:
> > On Mon, Jan 16, 2023 at 8:25 AM Robert Haas  wrote:
> > > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > > That looks completely magical from a user perspective. Some users
> > > aren't going to understand autovacuum behavior at all. Some will, and
> > > will be able to compare age(relfrozenxid) against
> > > autovacuum_freeze_max_age. Very few people are going to think to
> > > compare age(relfrozenxid) against some formula based on
> > > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
> >
> > What do you think of Andres' autovacuum_no_auto_cancel_age proposal?
> 
> I like it better than your proposal. I don't think it's a fundamental
> improvement and I would rather see a fundamental improvement, but I
> can see it being better than nothing.

That's similar to my feelings about it.

I do think it'll be operationally nice to have at least some window where an
autovacuum is triggered due to age and where it won't prevent cancels. In many
situations it'll likely suffice if that window is autovacuum_naptime *
xids_per_sec large, but of course that's easily enough exceeded.


> > > I do like the idea of driving the auto-cancel behavior off of the
> > > results of previous attempts to vacuum the table. That could be done
> > > independently of the XID age of the table.
> >
> > Even when the XID age of the table has already significantly surpassed
> > autovacuum_freeze_max_age, say due to autovacuum worker starvation?
> >
> > > If we've failed to vacuum
> > > the table, say, 10 times, because we kept auto-cancelling, it's
> > > probably appropriate to force the issue.
> >
> > I suggested 1000 times upthread. 10 times seems very low, at least if
> > "number of times cancelled" is the sole criterion, without any
> > attention paid to relfrozenxid age or some other tiebreaker.
> 
> Hmm, I think that a threshold of 1000 is far too high to do much good.

Agreed.


> By the time we've tried to vacuum a table 1000 times and failed every
> time, I anticipate that the situation will be pretty dire, regardless
> of why we thought the table needed to be vacuumed in the first place.

Agreed.


> In the best case, with autovacum_naptime=1minute, failing 1000 times
> means that we've delayed vacuuming the table for at least 16 hours.

Perhaps it'd make sense for an auto-cancelled worker to signal the launcher to
do a cycle of vacuuming? Or even to just try to vacuum the table again
immediately? After all, we know that the table is going to be on the schedule
of the next worker immediately. Of course we shouldn't retry indefinitely, but
...


> In fact I think there's a decent argument that a threshold of ten is
> possibly too high here, too. If you wait until the tenth try before
> you try not auto-cancelling, then a table with a workload that makes
> auto-cancelling 100% probable will get vacuumed 10% as often as it
> would otherwise. I think there are cases where that would be OK, but
> probably on the whole it's not going to go very well.

That's already kind of the case - we'll only block auto-cancelling when
exceeding autovacuum_freeze_max_age, all the other autovacuums will be
cancelable.


> The only problem I see with lowering the threshold below ~10 is that the
> signal starts to get weak. If something fails for the same reason ten times
> in a row you can be pretty sure it's a chronic problem. If you made the
> thereshold say three you'd probably start making bad decisions sometimes --
> you'd think that you had a chronic problem when really you just got a bit
> unlucky.

Yea. Schema migrations in prod databases typically have to run in
single-statement or very small transactions, for obvious reasons. Needing to
lock the same table exclusively a few times during a schema migration is
pretty normal, particularly when foreign keys are involved. Getting blocked by
autovacuum in the middle of a schema migration is NASTY.

This is why I'm a bit worried that 10 might be too low... It's not absurd for
a schema migration to create 10 new tables referencing an existing table in
need of vacuuming.

Perhaps we should track when the first failure was, and take that into
account? Clearly having all 10 autovacuums on the same table cancelled is
different when those 10 cancellations happened in the last 10 *
autovacuum_naptime minutes, than when the last successful autovacuum was hours
ago.


> To get back to the earlier question above, I think that if the
> retries-before-not-auto-cancelling threshold were low enough to be
> effective, you wouldn't necessarily need to consider XID age as a
> second reason for not auto-cancelling. You would want to force the
> behavior anyway when you hit emergency mode, because that should force
> all the mitigations we have, but I don't know that you need to do
> anything before that.

Hm, without further restrictions, that has me 

Re: recovery modules

2023-01-17 Thread Nathan Bossart
On Tue, Jan 17, 2023 at 02:32:03PM +0900, Michael Paquier wrote:
> Could it be cleaner in the long term to remove entirely
> BuildRestoreCommand() and move the conversion of the xlogpath with
> make_native_path() one level higher in the stack?

Yeah, this seems cleaner.  I removed BuildRestoreCommand() in v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c710f5a9e294b198ce6bb2e8d9404cb26a76b913 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 23 Dec 2022 16:53:38 -0800
Subject: [PATCH v8 1/2] Refactor code for restoring files via shell.

Presently, restore_command uses a different code path than
archive_cleanup_command and recovery_end_command.  These code paths
are similar and can be easily combined.
---
 src/backend/access/transam/shell_restore.c | 99 ++
 src/backend/access/transam/xlogarchive.c   |  1 -
 src/common/Makefile|  1 -
 src/common/archive.c   | 60 -
 src/common/meson.build |  1 -
 src/common/percentrepl.c   | 13 +--
 src/fe_utils/archive.c | 11 ++-
 src/include/common/archive.h   | 21 -
 src/tools/msvc/Mkvcbuild.pm|  2 +-
 9 files changed, 56 insertions(+), 153 deletions(-)
 delete mode 100644 src/common/archive.c
 delete mode 100644 src/include/common/archive.h

diff --git a/src/backend/access/transam/shell_restore.c b/src/backend/access/transam/shell_restore.c
index 7753a7d667..4752be0b9f 100644
--- a/src/backend/access/transam/shell_restore.c
+++ b/src/backend/access/transam/shell_restore.c
@@ -20,16 +20,14 @@
 
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
-#include "common/archive.h"
 #include "common/percentrepl.h"
 #include "storage/ipc.h"
 #include "utils/wait_event.h"
 
-static void ExecuteRecoveryCommand(const char *command,
+static bool ExecuteRecoveryCommand(const char *command,
    const char *commandName,
-   bool failOnSignal,
-   uint32 wait_event_info,
-   const char *lastRestartPointFileName);
+   bool failOnSignal, bool exitOnSigterm,
+   uint32 wait_event_info, int fail_elevel);
 
 /*
  * Attempt to execute a shell-based restore command.
@@ -40,25 +38,16 @@ bool
 shell_restore(const char *file, const char *path,
 			  const char *lastRestartPointFileName)
 {
+	char	   *nativePath = pstrdup(path);
 	char	   *cmd;
-	int			rc;
+	bool		ret;
 
 	/* Build the restore command to execute */
-	cmd = BuildRestoreCommand(recoveryRestoreCommand, path, file,
-			  lastRestartPointFileName);
-
-	ereport(DEBUG3,
-			(errmsg_internal("executing restore command \"%s\"", cmd)));
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-	rc = system(cmd);
-	pgstat_report_wait_end();
-
-	pfree(cmd);
+	make_native_path(nativePath);
+	cmd = replace_percent_placeholders(recoveryRestoreCommand,
+	   "restore_command", "frp", file,
+	   lastRestartPointFileName, nativePath);
+	pfree(nativePath);
 
 	/*
 	 * Remember, we rollforward UNTIL the restore fails so failure here is
@@ -84,17 +73,11 @@ shell_restore(const char *file, const char *path,
 	 *
 	 * We treat hard shell errors such as "command not found" as fatal, too.
 	 */
-	if (rc != 0)
-	{
-		if (wait_result_is_signal(rc, SIGTERM))
-			proc_exit(1);
-
-		ereport(wait_result_is_any_signal(rc, true) ? FATAL : DEBUG2,
-(errmsg("could not restore file \"%s\" from archive: %s",
-		file, wait_result_to_str(rc))));
-	}
+	ret = ExecuteRecoveryCommand(cmd, "restore_command", true, true,
+ WAIT_EVENT_RESTORE_COMMAND, DEBUG2);
+	pfree(cmd);
 
-	return (rc == 0);
+	return ret;
 }
 
 /*
@@ -103,9 +86,14 @@ shell_restore(const char *file, const char *path,
 void
 shell_archive_cleanup(const char *lastRestartPointFileName)
 {
-	ExecuteRecoveryCommand(archiveCleanupCommand, "archive_cleanup_command",
-		   false, WAIT_EVENT_ARCHIVE_CLEANUP_COMMAND,
-		   lastRestartPointFileName);
+	char	   *cmd;
+
+	cmd = replace_percent_placeholders(archiveCleanupCommand,
+	   "archive_cleanup_command",
+	   "r", lastRestartPointFileName);
+	(void) ExecuteRecoveryCommand(cmd, "archive_cleanup_command", false, false,
+  WAIT_EVENT_ARCHIVE_CLEANUP_COMMAND, WARNING);
+	pfree(cmd);
 }
 
 /*
@@ -114,9 +102,14 @@ shell_archive_cleanup(const char *lastRestartPointFileName)
 void
 shell_recovery_end(const char *lastRestartPointFileName)
 {
-	ExecuteRecoveryCommand(recoveryEndCommand, "recovery_end_command", true,
-		   WAIT_EVENT_RECOVERY_END_COMMAND,
-		   lastRestartPointFileName);
+	char	   *cmd;
+
+	cmd = replace_percent_placeholders(recoveryEndCommand,
+	   "recovery_end_command",
+	   "r", lastRestartPointFileName);
+	(void) ExecuteRecoveryCommand(cmd, "recovery_end_command", true, false,
+  WAIT_EVENT_RECOVERY_END_COMMAND, WARNING);
+	

Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Brar Piening

On 17.01.2023 at 14:12, Karl O. Pinc wrote:

If you're not willing to try I am willing to see if I can
produce an example to work from.  My XSLT is starting to
come back.


I'm certainly willing to try but I'd appreciate an example in any case.

My XSLT skills are mostly learning by doing combined with trial and error.

Regards,

Brar





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> When I express skepticism of very high autovacuum_freeze_max_age
> settings, it's mostly just that I don't think that age(relfrozenxid)
> is at all informative in the way that something that triggers
> autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> less informative it becomes. I'm sure that it can be safe to use very
> high autovacuum_freeze_max_age values, but it seems like a case of
> using a very complicated and unreliable thing to decide when to
> VACUUM.

Both you and Robert said this, and I have seen it be true, but typically not
for large high-throughput OLTP databases, where I found increasing
relfrozenxid to be important. Sure, there's probably some up/down through the
day / week, but it's likely to be pretty predictable.

I think the problem is that an old relfrozenxid doesn't tell you how much
outstanding work there is. Perhaps that's what both of you meant...


I think that's not the fault of relfrozenxid as a trigger, but that we simply
don't keep enough other stats. We should imo at least keep track of:

In pg_class:

- The number of all frozen pages, like we do for all-visible

  That'd give us a decent upper bound for the amount of work we need to do to
  increase relfrozenxid. It's important that this is crash safe (thus no
  pg_stats), and it only needs to be written when we'd likely make other
  changes to the pg_class row anyway.


In pgstats:

- The number of dead items, incremented both by the heap scan and
  opportunistic pruning

  This would let us estimate how urgently we need to clean up indexes.

- The xid/mxid horizons during the last vacuum

- The number of pages with tuples that couldn't be removed due to the horizon
  during the last vacuum

  Together with the horizon, this would let us avoid repeated vacuums that
  won't help. Tracking the number of pages instead of tuples allows a lot
  better cost/benefit estimation of another vacuum.

- The number of pages with tuples that couldn't be frozen

  Similar to the dead tuple one, except that it'd help avoid repeated vacuums
  to increase relfrozenxid, when it won't be able to help.
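
To make that a bit more concrete, roughly the following -- every struct and
field name here is invented purely for illustration, not a concrete proposal
(the first counter would live in pg_class next to relallvisible, the rest in
the per-table pgstats entry):

#include "postgres.h"

typedef struct VacuumDebtCounters
{
	int32		relallfrozen;	/* pages marked all-frozen in the VM */

	int64		dead_items;		/* dead line pointers, from heap scan +
								 * opportunistic pruning */
	TransactionId last_vacuum_xid_horizon;	/* OldestXmin of the last vacuum */
	MultiXactId last_vacuum_mxid_horizon;
	int64		pages_blocked_by_horizon;	/* pages with tuples we couldn't
											 * remove last time */
	int64		pages_not_freezable;	/* pages with tuples we couldn't
										 * freeze last time */
} VacuumDebtCounters;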


> > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
> > server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
> > even DDL, they just needed to make the table readonly, for a brief moment, a
> > few times a day. The problem started once the indexes grew to be large 
> > enough
> > that the time for (re-)finding dead tuples + an index scan phase got 
> > large
> > enough that they were unlucky to be killed a few times in a row. After that
> > autovac never got through the index scan phase. Not doing any "new" work, 
> > just
> > collecting the same dead tids over and over, scanning the indexes for those
> > tids, never quite finishing.
>
> That makes sense to me, though I wonder if there would have been
> another kind of outage (not the one you actually saw) had the
> autocancellation behavior somehow been disabled after the first few
> cancellations.

I suspect so, but it's hard to know.


> This sounds like a great argument in favor of suspend-and-resume as a
> way of handling autocancellation -- no useful work needs to be thrown
> away for AV to yield for a minute or two. One ambition of mine for the
> visibility map snapshot infrastructure was to be able to support
> suspend-and-resume. It wouldn't be that hard for autovacuum to
> serialize everything, and use that to pick up right where an
> autovacuum worker left off when it was autocancelled. Same
> OldestXmin starting point as before, same dead_items array, same
> number of scanned_pages (and pages left to scan).

Hm, that seems a lot of work. Without having held a lock you don't even know
whether your old dead items still apply. Of course it'd improve the situation
substantially, if we could get it.



> > If you know postgres internals, it's easy. You know that autovac stops
> > auto-cancelling after freeze_max_age and you know that the lock queue is
> > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> > afterwards cancel the autovac worker. But if you aren't, it'll take a while 
> > to
> > figure out that the DROP TABLE isn't progressing due to a lock conflict, at
> > which point you'll cancel the statement (possibly having wreaked havoc with
> > all other accesses). Then you figure out that you need to cancel
> > autovacuum. After you try the DROP TABLE again - but the next worker has
> > gotten to work on the table.
>
> Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> special cases? Maybe they should always be able to auto cancel an
> autovacuum?

Yea, I think so. It's not obvious how to best pass down that knowledge into
ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
way would be to change LockAcquireExtended() to get a 

Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra
On 1/17/23 18:02, Andres Freund wrote:
> Hi,
> 
> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
>> I don't understand why we would even use timestamps, in this case? AFAIK
>> "sampling profilers" simply increment a counter for the executing node,
>> and then approximate the time as proportional to the count.
> 
> The timer interrupt distances aren't all that evenly spaced, particularly
> under load, and are easily distorted by having to wait for IO, an lwlock ...
> 

OK, so the difference is that these events (I/O, lwlocks) may block
signals, and after signals get unblocked we only get a single event for
each signal. Yeah, the timestamp handles that case better.

> 
>> That also does not have issues with timestamp "rounding" - considering
>> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
>> possible the node completes within 1ms, in which case
>>
>>(now - last_sample)
>>
>> ends up being 0 (assuming I correctly understand the code).
> 
> That part should be counting in nanoseconds, I think? Unless I misunderstand
> something?
> 

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

Perhaps for many executions it works out, because some executions will
cross the boundary, and the average will converge to the right value.
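
For reference, the pure counter-based variant I had in mind is something like
this -- a simplified sketch, deliberately detached from the real executor and
Instrumentation structs:

#include "postgres.h"

typedef struct SampledNode
{
	uint64		ticks;			/* timer ticks seen while this node ran */
} SampledNode;

static SampledNode *volatile current_node = NULL;	/* set by the executor */
static uint64 total_ticks = 0;

/* called from the sampling timer / signal handler */
static void
sampling_tick(void)
{
	SampledNode *node = current_node;

	if (node != NULL)
		node->ticks++;
	total_ticks++;
}

/* at EXPLAIN time, attribute total runtime proportionally to tick counts */
static double
approximate_seconds(const SampledNode *node, double total_runtime_seconds)
{
	if (total_ticks == 0)
		return 0.0;
	return total_runtime_seconds * (double) node->ticks / (double) total_ticks;
}

That sidesteps the rounding problem entirely, at the cost of only ever getting
a proportional estimate rather than per-node wall clock time.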

> We already compute the timestamp inside timeout.c, but don't yet pass that to
> timeout handlers. I think there's others re-computing timestamps.
> 
> 
>> And I don't think there's any particularly good way to correct this.
>>
>> It seems ExplainNode() attempts to do some correction, but I doubt
>> that's very reliable, as these fast nodes will have sampled_total=0, so
>> no matter what you multiply this with, it'll still be 0.
> 
> That's just the scaling to the "actual time" that you're talking about above,
> no?
> 

Maybe, not sure.

> 
>>> I've been thinking that we should consider making more of the 
>>> instrumentation
>>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>>> executing over 1000 instructions! And each single Instrumentation node is 
>>> ~450
>>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>>> embedded.
>>>
>>> If we instead have InstrStartNode() set up a global pointer to the
>>> Instrumentation node, we can make the instrumentation code modify both the
>>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>>> changes - right now nodes "automatically" include the IO/WAL incurred in 
>>> child
>>> nodes, but that's just a small bit of additional summing-up to be done during
>>> EXPLAIN.
>>>
>>
>> That's certainly one way to implement that. I wonder if we could make
>> that work without the global pointer, but I can't think of any.
> 
> I don't see a realistic way at least. We could pass down an
> "InstrumentationContext" through everything that needs to do IO and WAL. But
> that seems infeasible at this point.
> 

Why infeasible?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: CI and test improvements

2023-01-17 Thread Justin Pryzby
The autoconf system runs all tap tests in t/*.pl, but meson requires
enumerating them in ./meson.build.

This checks for and finds no missing tests in the current tree:

$ for pl in `find src contrib -path '*/t/*.pl'`; do base=${pl##*/}; 
dir=${pl%/*}; meson=${dir%/*}/meson.build; grep "$base" "$meson" >/dev/null || 
echo "$base is missing from $meson"; done

However, this finds two real problems and one false-positive with
missing regress/isolation tests:

$ for makefile in `find src contrib -name Makefile`; do for testname in `sed -r 
'/^(REGRESS|ISOLATION) =/!d; s///; :l; /$/{s///; N; b l}; s/\n//g' 
"$makefile"`; do meson=${makefile%/Makefile}/meson.build; grep -Fw "$testname" 
"$meson" >/dev/null || echo "$testname is missing from $meson"; done; done
guc_privs is missing from src/test/modules/unsafe_tests/meson.build
oldextversions is missing from contrib/pg_stat_statements/meson.build
$(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build

I also tried but failed to write something to warn if "meson test" was
run with a list of tests but without tmp_install.  Help wanted.

I propose to put something like this into "SanityCheck".

-- 
Justin




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Tom Lane
Andres Freund  writes:
> Here's an updated version of the move to representing instr_time as
> nanoseconds. It's now split into a few patches:

I took a quick look through this.

> 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
>   warn
>   Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
>   just allow to assign 0.

I think it's probably wise to keep the macro.  If we ever rethink this
again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero.  I'm almost tempted
to suggest you define instr_time as a struct with a uint64 field,
just to help keep us honest about that.
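
I.e. something like this (sketch only):

typedef struct instr_time
{
	int64		ticks;			/* nanoseconds, or rdtsc ticks later on */
} instr_time;

#define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
#define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)

That keeps plain assignments of 0 from compiling, so all call sites have to
keep going through the macros.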

> 0003) Add INSTR_TIME_SET_SECOND()
>   This is used in 0004. Just allows setting an instr_time to a time in
>   seconds, allowing for a cheaper loop exit condition in 0004.

Code and comments are inconsistent about whether it's SET_SECOND or
SET_SECONDS.  I think I prefer the latter, but don't care that much.

> 0004) report nanoseconds in pg_test_timing

Didn't examine 0004 in any detail, but the others look good to go
other than these nits.

regards, tom lane




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
> I don't understand why we would even use timestamps, in this case? AFAIK
> "sampling profilers" simply increment a counter for the executing node,
> and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...


> That also does not have issues with timestamp "rounding" - considering
> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
> possible the node completes within 1ms, in which case
> 
>(now - last_sample)
> 
> ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.


> And I don't think there's any particularly good way to correct this.
> 
> It seems ExplainNode() attempts to do some correction, but I doubt
> that's very reliable, as these fast nodes will have sampled_total=0, so
> no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?


> > I've been thinking that we should consider making more of the 
> > instrumentation
> > code work like that. The amount of work we're doing in InstrStart/StopNode()
> > has steadily risen. When buffer usage and WAL usage are enabled, we're
> > executing over 1000 instructions! And each single Instrumentation node is 
> > ~450
> > bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> > embedded.
> > 
> > If we instead have InstrStartNode() set up a global pointer to the
> > Instrumentation node, we can make the instrumentation code modify both the
> > "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> > current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> > changes - right now nodes "automatically" include the IO/WAL incurred in 
> > child
> > nodes, but that's just a small bit of additional summing-up to be done during
> > EXPLAIN.
> > 
> 
> That's certainly one way to implement that. I wonder if we could make
> that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.


Greetings,

Andres Freund




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 08:46:12 -0500, Robert Haas wrote:
> On Fri, Jan 13, 2023 at 2:56 PM Andres Freund  wrote:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds. We'd also save memory in BufferUsage (144-122 
> > bytes),
> > Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> > BufferUsage).

Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:

0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
  warn

  Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
  just allow to assign 0.

0002) Convert instr_time to uint64

  This is the cleaned up version of the prior patch. The main change is
  that it deduplicated a lot of the code between the architectures.

0003) Add INSTR_TIME_SET_SECOND()

  This is used in 0004. Just allows setting an instr_time to a time in
  seconds, allowing for a cheaper loop exit condition in 0004.

0004) report nanoseconds in pg_test_timing


I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticeable speedup. But that's for another
day.
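
For the curious: the trick is ordinary fixed-point scaling -- compute a scale
factor once at startup, then each conversion is one wide multiply plus a
shift, with no division on the hot path. Rough sketch, not the actual code,
assuming a compiler with __int128:

#include "postgres.h"

#define CYC2NS_SHIFT	32

static uint64 cyc2ns_scale;		/* (ns per TSC cycle) << CYC2NS_SHIFT */

/* done once, after measuring/querying the TSC frequency */
static void
cyc2ns_init(double tsc_hz)
{
	cyc2ns_scale = (uint64) ((1e9 / tsc_hz) * ((uint64) 1 << CYC2NS_SHIFT));
}

static inline uint64
cyc2ns(uint64 cycles)
{
	/* 64x64->128 bit multiply avoids overflow for large cycle counts */
	return (uint64) (((unsigned __int128) cycles * cyc2ns_scale) >> CYC2NS_SHIFT);
}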

I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?


> I read through 0001 and it seems basically fine to me. Comments:
>
> 1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

Fixed.


> 2. I'm not entirely sure that the new .?S_PER_.?S macros are
> worthwhile but maybe they are, and in any case I don't care very much.

There's now fewer. But those I'd like to keep. I just end up counting digits
manually way too many times.


> 3. I've always found 'struct timespec' to be pretty annoying
> notationally, so I like the fact that this patch would reduce use of
> it.

Same.

Greetings,

Andres Freund
>From c1024a9dfa7f5645200b7fa68e8bce5561c9cee0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 16 Jan 2023 10:04:42 -0800
Subject: [PATCH v7 1/4] Zero initialize instr_time uses causing compiler
 warnings

These are all not necessary from a correctness POV. However, in a subsequent
patch instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlog.c   | 4 
 src/backend/storage/buffer/bufmgr.c | 4 
 src/backend/storage/file/buffile.c  | 4 
 src/bin/psql/common.c   | 6 ++
 4 files changed, 18 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f47fb75700..7d65b1d4159 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 /* Measure I/O timing to write WAL data */
 if (track_wal_io_timing)
 	INSTR_TIME_SET_CURRENT(start);
+else
+	INSTR_TIME_SET_ZERO(start);
 
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -8150,6 +8152,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(start);
+	else
+		INSTR_TIME_SET_ZERO(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a6..800a4248c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			if (track_io_timing)
 INSTR_TIME_SET_CURRENT(io_start);
+			else
+INSTR_TIME_SET_ZERO(io_start);
 
 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
 
@@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c5464b6aa62..0a51624df3b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file)
 
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
 
 	/*
 	 * Read whatever we can get, up to a full bufferload.
@@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile 

Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> FYI crake has just passed the test with flying colours.

Cool.  I await the Windows machines' results with interest.

regards, tom lane




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Andrew Dunstan


On 2023-01-17 Tu 10:18, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>>> I dunno if we want to stretch buildfarm owners' patience with yet
>>> another BF client release right now.  On the other hand, I'm antsy
>>> to see if we can un-revert 1b4d280ea after doing a little more
>>> work in AdjustUpgrade.pm.
>> It looks like the only animals doing the cross version tests are crake,
>> drongo and fairywren. These are all mine, so I don't think we need to do
>> a new release for this.
> copperhead, kittiwake, snapper, and tadarida were running them
> until fairly recently.
>
>   


Ah, yes, true, I didn't look far enough back.

The new file can be downloaded from

- it's a dropin replacement.

FYI crake has just passed the test with flying colours.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Issue attaching a table to a partitioned table with an auto-referenced foreign key

2023-01-17 Thread Guillaume Lelarge
Quick ping, just to make sure someone can get a look at this issue :)
Thanks.


Le ven. 6 janv. 2023 à 11:07, Guillaume Lelarge  a
écrit :

> Hello,
>
> One of our customers has an issue with partitions and foreign keys. He
> works on a v13, but the issue is also present on v15.
>
> I attach a SQL script showing the issue, and the results on 13.7, 13.9,
> and 15.1. But I'll explain the script here, and its behaviour on 13.9.
>
> There is one partitioned table, two partitions and a foreign key. The
> foreign key references the same table:
>
> create table t1 (
>   c1 bigint not null,
>   c1_old bigint null,
>   c2 bigint not null,
>   c2_old bigint null,
>   primary key (c1, c2)
>   )
>   partition by list (c1);
> create table t1_a   partition of t1 for values in (1);
> create table t1_def partition of t1 default;
> alter table t1 add foreign key (c1_old, c2_old) references t1 (c1, c2) on
> delete restrict on update restrict;
>
> I've a SQL function that shows me some information from pg_constraints
> (code of the function in the SQL script attached). Here is the result of
> this function after creating the table, its partitions, and its foreign key:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (5 rows)
>
> The constraint works great :
>
> insert into t1 values(1, NULL, 2, NULL);
> insert into t1 values(2, 1,2, 2);
> delete from t1 where c1 = 1;
> psql:ticket15010_v3.sql:34: ERROR:  update or delete on table "t1_a"
> violates foreign key constraint "t1_c1_old_c2_old_fkey1" on table "t1"
> DETAIL:  Key (c1, c2)=(1, 2) is still referenced from table "t1".
>
> This error is normal since the line I want to delete is referenced on the
> other line.
>
> If I try to detach the partition, it also gives me an error.
>
> alter table t1 detach partition t1_a;
> psql:ticket15010_v3.sql:36: ERROR:  removing partition "t1_a" violates
> foreign key constraint "t1_c1_old_c2_old_fkey1"
> DETAIL:  Key (c1_old, c2_old)=(1, 2) is still referenced from table "t1".
>
> Sounds good to me too (well, I'd like it to be smarter and find that the
> constraint is still good after the detach, but I can understand why it
> won't allow it).
>
> The pg_constraint didn't change of course:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (5 rows)
>
> Now, I'll delete the whole table contents, and I'll detach the partition:
>
> delete from t1;
> alter table t1 detach partition t1_a;
>
> It seems to be working, but the content of pg_constraints is weird:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 |
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (4 rows)
>
> I understand why the ('t1_c1_old_c2_old_fkey1', 't1', 't1_a',
> 't1_c1_old_c2_old_fkey') tuple has gone but I don't understand why the
> ('t1_c1_old_c2_old_fkey', 't1_a', 't1', NULL) tuple is still there.
>
> Anyway, I attach the partition:
>
> alter table t1 attach partition t1_a for values in (1);
>
> But pg_constraint has not changed:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (4 rows)
>
> I was expecting to see the fifth tuple coming back, but alas, no.
>
> And as a result, the foreign key doesn't work anymore:
>
> insert into t1 values(1, NULL, 2, NULL);
> insert into t1 values(2, 1,2, 2);
> delete from t1 where c1 = 1;
>
> Well, let's truncate the partitioned table, and drop the partition:
>
> truncate t1;
> drop table t1_a;
>
> The content of pg_constraint looks good to me:
>
> select * from show_constraints();
> conname |   

Re: minor bug

2023-01-17 Thread Tom Lane
Laurenz Albe  writes:
> On Mon, 2023-01-16 at 19:59 +0100, Torsten Förtsch wrote:
>> So, the timestamp displayed in the log message is certainly wrong.

> If recovery stops at a WAL record that has no timestamp, you get this
> bogus recovery stop time.  I think we should show the recovery stop time
> only if time was the target, as in the attached patch.

I don't think that is a tremendously useful definition: the user
already knows what recoveryStopTime is, or can find it out from
their settings easily enough.

I seem to recall that the original idea was to report the timestamp
of the commit/abort record we are stopping at.  Maybe my memory is
faulty, but I think that'd be significantly more useful than the
current behavior, *especially* when the replay stopping point is
defined by something other than time.
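
Concretely, in recoveryStopsBefore()/recoveryStopsAfter() I'd have expected
something like this (untested sketch):

	TimestampTz recordXtime;

	if (getRecordTimestamp(record, &recordXtime))
		recoveryStopTime = recordXtime;
	else
		recoveryStopTime = 0;	/* this record type carries no timestamp */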

(Also, the wording of the log message suggests that that's what
the reported time is supposed to be.  I wonder if somebody messed
this up somewhere along the way.)

regards, tom lane




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan  wrote:
> On Mon, Jan 16, 2023 at 8:25 AM Robert Haas  wrote:
> > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > That looks completely magical from a user perspective. Some users
> > aren't going to understand autovacuum behavior at all. Some will, and
> > will be able to compare age(relfrozenxid) against
> > autovacuum_freeze_max_age. Very few people are going to think to
> > compare age(relfrozenxid) against some formula based on
> > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
>
> What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

I like it better than your proposal. I don't think it's a fundamental
improvement and I would rather see a fundamental improvement, but I
can see it being better than nothing.

> > I do like the idea of driving the auto-cancel behavior off of the
> > results of previous attempts to vacuum the table. That could be done
> > independently of the XID age of the table.
>
> Even when the XID age of the table has already significantly surpassed
> autovacuum_freeze_max_age, say due to autovacuum worker starvation?
>
> > If we've failed to vacuum
> > the table, say, 10 times, because we kept auto-cancelling, it's
> > probably appropriate to force the issue.
>
> I suggested 1000 times upthread. 10 times seems very low, at least if
> "number of times cancelled" is the sole criterion, without any
> attention paid to relfrozenxid age or some other tiebreaker.

Hmm, I think that a threshold of 1000 is far too high to do much good.
By the time we've tried to vacuum a table 1000 times and failed every
time, I anticipate that the situation will be pretty dire, regardless
of why we thought the table needed to be vacuumed in the first place.
In the best case, with autovacum_naptime=1minute, failing 1000 times
means that we've delayed vacuuming the table for at least 16 hours.
That's assuming that there's a worker available to retry every minute
and that we fail quickly. If it's a 2 hour vacuum operation and we
typically fail about halfway through, it could take us over a month to
hit 1000 failures. There are many tables out there that get enough
inserts, updates, and deletes that a 16-hour delay will result in
irreversible bloat, never mind a 41-day delay. After even a few days,
wraparound could become critical even if bloat isn't.

I'm not sure why a threshold of 10 would be too low. It seems to me
that if we fail ten times in a row to vacuum a table and fail for the
same reason every time, we're probably going to keep failing for that
reason. If that is true, we will be better off if we force the issue
sooner rather than later. There's no value in letting the table bloat
out the wazoo and the cluster approach a wraparound shutdown before we
insist. Consider a more mundane example. If I try to start my car or
my dishwasher or my computer or my toaster oven ten times and it fails
ten times in a row, and the failure mode appears to be the same each
time, I am not going to sit there and try 990 more times hoping things
get better, because that seems very unlikely to help. Honestly,
depending on the situation, I might not even get to ten times before I
switch to doing some form of troubleshooting and/or calling someone
who could repair the device.

In fact I think there's a decent argument that a threshold of ten is
possibly too high here, too. If you wait until the tenth try before
you try not auto-cancelling, then a table with a workload that makes
auto-cancelling 100% probable will get vacuumed 10% as often as it
would otherwise. I think there are cases where that would be OK, but
probably on the whole it's not going to go very well. The only problem
I see with lowering the threshold below ~10 is that the signal starts
to get weak. If something fails for the same reason ten times in a row
you can be pretty sure it's a chronic problem. If you made the
threshold say three you'd probably start making bad decisions
sometimes -- you'd think that you had a chronic problem when really
you just got a bit unlucky.

To get back to the earlier question above, I think that if the
retries-before-not-auto-cancelling threshold were low enough to be
effective, you wouldn't necessarily need to consider XID age as a
second reason for not auto-cancelling. You would want to force the
behavior anyway when you hit emergency mode, because that should force
all the mitigations we have, but I don't know that you need to do
anything before that.

> > It doesn't really matter
> > whether the autovacuum triggered because of bloat or because of XID
> > age. Letting either of those things get out of control is bad.
>
> While inventing a new no-auto-cancel behavior that prevents bloat from
> getting completely out of control may well have merit, I don't see why
> it needs to be attached to this other effort.

It doesn't, but I think it would be a lot more beneficial than just
adding a new GUC. A lot of the 

Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>> I dunno if we want to stretch buildfarm owners' patience with yet
>> another BF client release right now.  On the other hand, I'm antsy
>> to see if we can un-revert 1b4d280ea after doing a little more
>> work in AdjustUpgrade.pm.

> It looks like the only animals doing the cross version tests are crake,
> drongo and fairywren. These are all mine, so I don't think we need to do
> a new release for this.

copperhead, kittiwake, snapper, and tadarida were running them
until fairly recently.

regards, tom lane




Re: [PATCH] Add <> support to sepgsql_restorecon

2023-01-17 Thread Ted X Toth
The intent of this patch is not to stop all relabeling, only to stop 
sepgsql_restorecon from doing a bulk relabel. I believe sepgsql_object_relabel 
is called by the 'SECURITY LABEL' statement, which I'm using to set the label 
of db objects to a specific context that I would not want altered later by a 
restorecon. This is particularly important in an MLS (multi-level security) 
environment where, for example, if a row were labeled at the 'secret' level I 
would not want restorecon to relabel it, possibly causing a downgrade.

The new status of this patch is: Ready for Committer


Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra




On 1/15/23 21:22, Andres Freund wrote:
> Hi,
> 
> On 2023-01-13 09:11:06 +0100, David Geier wrote:
>> Mostly I'm wondering if the sampling based approach gains us enough to be
>> worth it, once the patch to use RDTSC hopefully landed (see [1]).
> 
> Well, I'm not sure we have a path forward on it. There's portability and
> accuracy concerns. But more importantly:
> 
>> I believe that with the RDTSC patch the overhead of TIMING ON is lower than
>> the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
>> be really useful, it would be great if we could on top of TIMING SAMPLING
>> also lower the overhead of ANALYZE itself further (e.g. by using a fast path
>> for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
>> InstrStartNode() and InstrStopNode() have a ton of branches and without all
>> the typically deactivated code the implementation would be very small and
>> could be placed in an inlinable function.
> 
> Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
> at the cost of a bit less detail in the explain, of course.  Which could make
> it a lot more feasible to enable something like auto_explain.log_timing in
> busy workloads.
> 
> For the sampling mode we don't really need something like
> InstrStart/StopNode. We just need a pointer to node currently executing - not
> free to set, but still a heck of a lot cheaper than InstrStopNode(), even
> without ->need_timer etc. Then the timer just needs to do
> instr->sampled_total += (now - last_sample)
> last_sample = now
> 

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

   (now - last_sample)

ends up being 0 (assuming I correctly understand the code).

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

> 
> I've been thinking that we should consider making more of the instrumentation
> code work like that. The amount of work we're doing in InstrStart/StopNode()
> has steadily risen. When buffer usage and WAL usage are enabled, we're
> executing over 1000 instructions! And each single Instrumentation node is ~450
> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> embedded.
> 
> If we instead have InstrStartNode() set up a global pointer to the
> Instrumentation node, we can make the instrumentation code modify both the
> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> changes - right now nodes "automatically" include the IO/WAL incurred in child
> nodes, but that's just a small bit of additional summing-up to be done during
> EXPLAIN.
> 

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.
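
To spell the idea out with made-up names (sketch only, nothing here is the
actual proposal):

#include "postgres.h"

typedef struct BufUsageCounters
{
	int64		shared_blks_hit;
	int64		shared_blks_read;
	/* ... */
} BufUsageCounters;

typedef struct NodeInstrumentation
{
	BufUsageCounters bufusage;	/* accumulated directly, no start/stop diffing */
} NodeInstrumentation;

static BufUsageCounters global_counters;			/* ~ pgBufferUsage */
static NodeInstrumentation *current_instr = NULL;	/* set by InstrStartNode(),
													 * cleared by InstrStopNode() */

/* every site that currently bumps pgBufferUsage would instead do: */
static inline void
count_shared_blk_hit(void)
{
	global_counters.shared_blks_hit++;
	if (current_instr)
		current_instr->bufusage.shared_blks_hit++;
}

Summing children into their parents would then have to happen once, at
EXPLAIN time, as you say.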

> 
> Separately, I think we should consider re-ordering Instrumentation so that
> bufusage_start, walusage_start are after the much more commonly used
> elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
> though they're accounted for unconditionally.
> 

+1 to that


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread Masahiko Sawada
On Tue, Jan 17, 2023 at 6:14 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
> >  wrote:
> > > Attach the new version 0001 patch which addressed all other comments.
> > >
> >
> > Thank you for updating the patch. Here is one comment:
> >
> > @@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> >
> > 				/*
> > 				 * Show the leader only for active parallel workers.  This
> > -				 * leaves the field as NULL for the leader of a parallel
> > -				 * group.
> > +				 * leaves the field as NULL for the leader of a parallel group
> > +				 * or the leader of parallel apply workers.
> > 				 */
> > 				if (leader && leader->pid != beentry->st_procpid)
> > 				{
> > 					values[28] = Int32GetDatum(leader->pid);
> > 					nulls[28] = false;
> > 				}
> > +				else
> > +				{
> > +					int			leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> > +
> > +					if (leader_pid != InvalidPid)
> > +					{
> > +						values[28] = Int32GetDatum(leader_pid);
> > +						nulls[28] = false;
> > +					}
> > +				}
> > 			}
> >
> > I'm slightly concerned that there could be overhead of executing
> > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > query workers. The number of such backends could be large and
> > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > sense to check (st_backendType == B_BG_WORKER) before calling
> > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > LogicalRepWorkerLock which is not likely to be contended.
>
> Thanks for the comment and I think your suggestion makes sense.
> I have added the check before getting the leader pid. Here is the new version 
> patch.

Thank you for updating the patch. Looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Make use of assign_checkpoint_completion_target() to calculate CheckPointSegments correctly

2023-01-17 Thread Bharath Rupireddy
On Tue, Jan 17, 2023 at 12:31 PM Michael Paquier  wrote:
>
> On Mon, Dec 26, 2022 at 06:12:34PM +0530, Bharath Rupireddy wrote:
> > It looks like assign_checkpoint_completion_target() is defined [1],
> > but never used, because of which CheckPointSegments may fail to
> > account for changed checkpoint_completion_target. I'm attaching a tiny
> > patch to fix this.
> >
> > Thoughts?
>
> Oops.  It looks like you are right here.  This would impact the
> calculation of CheckPointSegments on reload when
> checkpoint_completion_target is updated.  This is wrong since we've
> switched to max_wal_size as of 88e9823, so this had better be
> backpatched all the way down.
>
> Thoughts?

+1 to backpatch as setting checkpoint_completion_target will not take
effect immediately.
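
For readers of the archive: the gist of the fix is just to wire the
already-defined assign hook into the GUC table, so that a reload re-runs the
CheckPointSegments calculation. Roughly (see the attached patch upthread for
the real hunk):

/* xlog.c already defines something along these lines: */
void
assign_checkpoint_completion_target(double newval, void *extra)
{
	CheckPointCompletionTarget = newval;
	CalculateCheckpointSegments();
}

/*
 * guc_tables.c: the checkpoint_completion_target entry currently passes a
 * NULL assign_hook; pointing it at the function above is the whole fix.
 */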

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: daitch_mokotoff module

2023-01-17 Thread Dag Lem
Paul Ramsey  writes:

>> On Jan 12, 2023, at 7:30 AM, Dag Lem  wrote:
>> 

[...]

>> 
>> Sure, I can do that. You don't think this much example text will be
>> TL;DR?
>
> I can only speak for myself, but examples are the meat of
> documentation learning, so as long as they come with enough
> explanatory context to be legible it's worth having them, IMO.
>

I have updated the documentation, hopefully it is more accessible now.

I also corrected documentation for the other functions in fuzzystrmatch
(function name and argtype in the wrong order).

Crossing fingers that someone will eventually change the status to
"Ready for Committer" :-)

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * 

Re: Switching XLog source from archive to streaming when primary available

2023-01-17 Thread Bharath Rupireddy
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart  wrote:
>
> On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
> > In summary, the standby state machine in WaitForWALToBecomeAvailable()
> > exhausts all the WAL in pg_wal before switching to streaming after
> > failing to fetch from archive. The v8 patch proposed upthread deviates
> > from this behaviour. Hence, attaching v9 patch that keeps the
> > behaviour as-is, that means, the standby exhausts all the WAL in
> > pg_wal before switching to streaming after fetching WAL from archive
> > for at least streaming_replication_retry_interval milliseconds.
>
> I think this is okay.  The following comment explains why archives are
> preferred over existing files in pg_wal:
>
>  * When doing archive recovery, we always prefer an archived log file 
> even
>  * if a file of the same name exists in XLOGDIR.  The reason is that 
> the
>  * file in XLOGDIR could be an old, un-filled or partly-filled version
>  * that was copied and restored as part of backing up $PGDATA.
>
> With your patch, we might replay one of these "old" files in pg_wal instead
> of the complete version of the file from the archives,

That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?

On HEAD, after failing to read from the archive, exhaust all wal from
pg_wal and then switch to streaming mode. With the patch, after
reading from the archive for at least
streaming_replication_retry_interval milliseconds, exhaust all wal
from pg_wal and then switch to streaming mode.

> but I think that is
> still correct.  We'll just replay whatever exists in pg_wal (which may be
> un-filled or partly-filled) before attempting streaming.  If that fails,
> we'll go back to trying the archives again.
>
> Would you mind testing this scenario?

How about something like below for testing the above scenario? If it
looks okay, I can add it as a new TAP test file.

1. Generate WAL files f1 and f2 and archive them.
2. Check the replay lsn and WAL file name on the standby, when it
replays upto f2, stop the standby.
3. Set recovery to fail on the standby, and stop the standby.
4. Generate f3, f4 (partially filled) on the primary.
5. Manually copy f3, f4 to the standby's pg_wal.
6. Start the standby, since recovery is set to fail, and there're new
WAL files (f3, f4) under its pg_wal, it must replay those WAL files
(check the replay lsn and WAL file name, it must be f4) before
switching to streaming.
7. Generate f5 on the primary.
8. The standby should receive f5 and replay it (check the replay lsn
and WAL file name, it must be f5).
9. Set streaming to fail on the standby and set recovery to succeed.
10. Generate f6 on the primary.
11. The standby should receive f6 via archive and replay it (check the
replay lsn and WAL file name, it must be f6).

If needed, we can look out for these messages to confirm it works as expected:
elog(DEBUG2, "switched WAL source from %s to %s after %s",
 xlogSourceNames[oldSource], xlogSourceNames[currentSource],
 lastSourceFailed ? "failure" : "success");
ereport(LOG,
(errmsg("restored log file \"%s\" from archive",
xlogfname)));

Essentially, it covers what the documentation
https://www.postgresql.org/docs/devel/warm-standby.html says:

"In standby mode, the server continuously applies WAL received from
the primary server. The standby server can read WAL from a WAL archive
(see restore_command) or directly from the primary over a TCP
connection (streaming replication). The standby server will also
attempt to restore any WAL found in the standby cluster's pg_wal
directory. That typically happens after a server restart, when the
standby replays again WAL that was streamed from the primary before
the restart, but you can also manually copy files to pg_wal at any
time to have them replayed."

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




GSoC 2023

2023-01-17 Thread Ilaria Battiston

Greetings -hackers,

Our beloved Google Summer of Code is back for 2023, with a format 
similar to 2022: both medium and large sized projects can be proposed, 
with more flexibility on end dates. The program will be open to students 
and open source beginners, as stated in this blog post: 
https://opensource.googleblog.com/2022/11/get-ready-for-google-summer-of-code-2023.html


Now is the time to work on getting together a set of projects we'd like 
to have GSoC students work on over the summer. Similar to last year, we 
need to have a good set of projects for students to choose from in 
advance of the deadline for mentoring organizations.


However, as noted in the blog post above, project length expectations 
may vary. Please decide accordingly based on your requirements and 
availability! Also, there is going to be only one intermediate 
evaluation, similarly to last year.


GSoC timeline: https://developers.google.com/open-source/gsoc/timeline

The deadline for Mentoring organizations to apply is: February 7. The 
list of accepted organizations will be published around February 22.


Unsurprisingly, we'll need to have an Ideas page again, so I've gone 
ahead and created one (copying last year's):

https://wiki.postgresql.org/wiki/GSoC_2023

Google discusses what makes a good "Ideas" list here:
https://google.github.io/gsocguides/mentor/defining-a-project-ideas-list.html

All the entries are marked with '2022' to indicate they were pulled from 
last year. If the project from last year is still relevant, please 
update it to be '2023' and make sure to update all of the information 
(in particular, make sure to list yourself as a mentor and remove the 
other mentors, as appropriate). Please also be sure to update the 
project's scope to be appropriate for the new guidelines.


New entries are certainly welcome and encouraged, just be sure to note 
them as '2023' when you add them. Projects from last year which were 
worked on but have significant follow-on work to be completed are 
absolutely welcome as well - simply update the description appropriately 
and mark it as being for '2023'.


When we get closer to actually submitting our application, I'll clean 
out the '2022' entries that didn't get any updates. Also - if there are 
any projects that are no longer appropriate (maybe they were completed, 
for example and no longer need work), please feel free to remove them. 
The page is still work in progress, so it's entirely possible I missed 
some updates where a GSoC project was completed independently of GSoC 
(and if I removed any that shouldn't have been - feel free to add them 
back by copying from the 2022 page).


As a reminder, each idea on the page should be in the format that the 
other entries are in and should include:

- Project title/one-line description
- Brief, 2-5 sentence, description of the project
- Description of programming skills needed and estimation of the 
difficulty level

- Project size
- List of potential mentors
- Expected Outcomes

As with last year, please consider PostgreSQL to be an "Umbrella" 
project and that anything which would be considered "PostgreSQL Family" 
per the News/Announce policy [1] is likely to be acceptable as a 
PostgreSQL GSoC project.


In other words, if you're a contributor or developer on WAL-G, barman, 
pgBackRest, the PostgreSQL website (pgweb), the PgEU/PgUS website code 
(pgeu-system), pgAdmin4, pgbouncer, pldebugger, the PG RPMs (pgrpms), 
the JDBC driver, the ODBC driver, or any of the many other PG Family 
projects, please feel free to add a project for consideration! If we get 
quite a few, we can organize the page further based on which project or 
maybe what skills are needed or similar.


Let's have another great year of GSoC with PostgreSQL!

Thanks!

Ilaria & Stephen

[1]: https://www.postgresql.org/about/policies/news-and-events/





Re: Record queryid when auto_explain.log_verbose is on

2023-01-17 Thread torikoshia

On 2023-01-16 22:07, Julien Rouhaud wrote:

Hi,

On Mon, Jan 16, 2023 at 09:36:59PM +0900, torikoshia wrote:


As far as I read the manual below, auto_explain.log_verbose should record
logs equivalent to VERBOSE option of EXPLAIN.


Ah good catch, that's clearly an oversight!


Attached patch makes auto_explain also print query identifiers.

What do you think?


@@ -407,6 +408,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintTriggers(es, queryDesc);
 			if (es->costs)
 				ExplainPrintJITSummary(es, queryDesc);
+			if (es->verbose && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+				ExplainPropertyInteger("Query Identifier", NULL, (int64)
+									   queryDesc->plannedstmt->queryId, es);

For interactive EXPLAIN the query identifier is printed just after the 
plan,
before the triggers and the JIT summary so auto_explain should do the 
same.

Thanks for the comment!
Agreed and updated the patch.


On 2023-01-17 03:53, Justin Pryzby wrote:

On Mon, Jan 16, 2023 at 09:36:59PM +0900, torikoshia wrote:

Attached patch makes auto_explain also print query identifiers.


This was asked during the initial patch; does your patch address the
issues here ?

https://www.postgresql.org/message-id/20200308142644.vlihk7djpwqjkp7w%40nol


Thanks!
I may be misunderstanding something, but it seems that the issue occurred 
because, at that time, the queryid was calculated in pgss_post_parse_analyze().


```
--- queryid_exposure-v6.diff, which is the patch just before the discussion
@@ -792,16 +801,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)

..snip..

 	if (query->utilityStmt)
 	{
-		query->queryId = UINT64CONST(0);
+		if (pgss_track_utility && PGSS_HANDLED_UTILITY(query->utilityStmt)
+			&& pstate->p_sourcetext)
+		{
+			const char *querytext = pstate->p_sourcetext;
+			int			query_location = query->stmt_location;
+			int			query_len = query->stmt_len;
+
+			/*
+			 * Confine our attention to the relevant part of the string, if the
+			 * query is a portion of a multi-statement source string.
+			 */
+			querytext = pgss_clean_querytext(pstate->p_sourcetext,
+											 &query_location,
+											 &query_len);
+
+			query->queryId = pgss_compute_utility_queryid(querytext, query_len);
query_len);

```

Currently queryId is not calculated in pgss_post_parse_analyze() and the 
issue does not occur, doesn't it?
I confirmed the attached patch logged queryid for some utility 
statements.


```
LOG:  0: duration: 0.017 ms  plan:
Query Text: prepare p1 as select 1;
Result  (cost=0.00..0.01 rows=1 width=4)
  Output: 1
Query Identifier: -1801652217649936326
```

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATIONFrom c113e00fc681289e33f72855d86b93b0d44360d2 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 17 Jan 2023 22:42:19 +0900
Subject: [PATCH v2] Make auto_explain record query identifier

When auto_explain.log_verbose is on, auto_explain should record logs
equivalent to VERBOSE option of EXPLAIN. Howerver, when
compute_query_id is on, query identifiers are only printed in
VERBOSE option of EXPLAIN.
This patch makes auto_explain also print query identifier.
---
 contrib/auto_explain/auto_explain.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c3ac27ae99..ee6917e605 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -21,6 +21,7 @@
 #include "jit/jit.h"
 #include "nodes/params.h"
 #include "utils/guc.h"
+#include "utils/queryjumble.h"
 
 PG_MODULE_MAGIC;
 
@@ -403,6 +404,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			ExplainQueryText(es, queryDesc);
 			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
 			ExplainPrintPlan(es, queryDesc);
+			if (es->verbose && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+ExplainPropertyInteger("Query Identifier", NULL, (int64)
+	   queryDesc->plannedstmt->queryId, es);
 			if (es->analyze && auto_explain_log_triggers)
 ExplainPrintTriggers(es, queryDesc);
 			if (es->costs)

base-commit: 15eb1228800a35042ef318f941173aa8b89002d1
-- 
2.25.1



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Robert Haas
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund  wrote:
> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> BufferUsage).

I read through 0001 and it seems basically fine to me. Comments:

1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.

3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.
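
To spell out why the accumulate step gets cheaper with plain nanoseconds,
here is a rough sketch of the two shapes involved (purely illustrative --
these are not the patch's actual macros or function names):

```c
#include <stdint.h>
#include <time.h>

/* struct timespec flavour: every accumulate needs a carry/borrow step */
static inline void
accum_diff_timespec(struct timespec *acc,
					const struct timespec *end, const struct timespec *start)
{
	acc->tv_sec += end->tv_sec - start->tv_sec;
	acc->tv_nsec += end->tv_nsec - start->tv_nsec;

	/* renormalize so that tv_nsec stays within [0, 1e9) */
	if (acc->tv_nsec < 0)
	{
		acc->tv_nsec += 1000000000;
		acc->tv_sec--;
	}
	else if (acc->tv_nsec >= 1000000000)
	{
		acc->tv_nsec -= 1000000000;
		acc->tv_sec++;
	}
}

/* int64-nanosecond flavour: plain integer arithmetic, half the storage */
static inline void
accum_diff_ns(int64_t *acc, int64_t end_ns, int64_t start_ns)
{
	*acc += end_ns - start_ns;
}
```

The int64 version is a single subtraction and addition on the hot path, and
the representation drops from 16 bytes to 8 on typical 64-bit platforms,
which is where the BufferUsage/Instrumentation savings come from.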

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Andrew Dunstan


On 2023-01-16 Mo 21:58, Tom Lane wrote:
> I've pushed the per-branch AdjustUpgrade.pm files and tested by performing
> a fresh round of buildfarm runs with the patched TestUpgradeXversion.pm
> file.  I think we're in good shape with this project.
>
> I dunno if we want to stretch buildfarm owners' patience with yet
> another BF client release right now.  On the other hand, I'm antsy
> to see if we can un-revert 1b4d280ea after doing a little more
> work in AdjustUpgrade.pm.
>
>   


It looks like the only animals doing the cross version tests are crake,
drongo and fairywren. These are all mine, so I don't think we need to do
a new release for this.

I think the next step is to push the buildfarm client changes, and
update those three animals to use it, and make sure nothing breaks. I'll
go and do those things now. Then you should be able to try your unrevert.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-01-17 Thread Aleksander Alekseev
Hi hackers,

> Yeah, pg_upgrade will briefly start and stop the old server to make
> sure all the WAL is replayed, and won't transfer any of the files
> over. AFAIK, major-version WAL changes are fine; it was the previous
> claim that we could do it in a minor version that I was unsure about.

OK, here is patchset v53, where I mostly modified the commit
messages. They now explicitly say that 0001 modifies the WAL records and
why we decided to do it in this patch. Additionally, any mention of
64-bit XIDs is removed, since it is not guaranteed that the rest of the
patches are going to be accepted; 64-bit SLRU page numbering is a
valuable change per se.

Changing the status of the CF entry to RfC apparently was a bit
premature. It looks like the patchset can use a few more rounds of
review.

In 0002:

```
-#define TransactionIdToCTsPage(xid) \
-    ((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+static inline int64
+TransactionIdToCTsPageInternal(TransactionId xid, bool lock)
+{
+    FullTransactionId fxid,
+                      nextXid;
+    uint32            epoch;
+
+    if (lock)
+        LWLockAcquire(XidGenLock, LW_SHARED);
+
+    /* make a local copy */
+    nextXid = ShmemVariableCache->nextXid;
+
+    if (lock)
+        LWLockRelease(XidGenLock);
+
+    epoch = EpochFromFullTransactionId(nextXid);
+    if (xid > XidFromFullTransactionId(nextXid))
+        --epoch;
+
+    fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+    return fxid.value / (uint64) COMMIT_TS_XACTS_PER_PAGE;
+}
```

I'm pretty confident that shared memory can't be accessed like this,
without taking a lock. Although it may work on x64, in general we can
get garbage unless nextXid is accessed atomically and has a
corresponding atomic type. On top of that, I'm pretty sure
TransactionIds can't be compared with the regular comparison
operators. All in all, so far I don't understand why this piece of
code needs to be so complicated.

The same applies to:

```
-#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)
+static inline int64
+TransactionIdToPageInternal(TransactionId xid, bool lock)
```

... in subtrans.c

Maxim, perhaps you could share with us what your reasoning was here?
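
To illustrate the level of complexity I was expecting, here is a sketch --
not a drop-in replacement -- that leans on ReadNextFullTransactionId() from
varsup.c for the locked read of nextXid, so no bool lock parameter is needed
at all. (It still contains the raw xid comparison, which is my other
question above.)

```c
/* Sketch for commit_ts.c; COMMIT_TS_XACTS_PER_PAGE is file-local there. */
static inline int64
TransactionIdToCTsPage(TransactionId xid)
{
	FullTransactionId nextXid = ReadNextFullTransactionId();
	uint32		epoch = EpochFromFullTransactionId(nextXid);
	FullTransactionId fxid;

	/* xids numerically above nextXid's low 32 bits belong to the previous epoch */
	if (xid > XidFromFullTransactionId(nextXid))
		epoch--;

	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

	return (int64) (U64FromFullTransactionId(fxid) /
					(uint64) COMMIT_TS_XACTS_PER_PAGE);
}
```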

-- 
Best regards,
Aleksander Alekseev


v53-0002-Utilize-64-bit-SLRU-page-numbers-in-SLRU-callers.patch
Description: Binary data


v53-0001-Use-64-bit-numbering-of-SLRU-pages.patch
Description: Binary data


v53-0003-pg_upgrade-from-32-bit-to-64-bit-SLRU-page-numbe.patch
Description: Binary data


Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Karl O. Pinc
On Tue, 17 Jan 2023 06:57:23 +0100
Brar Piening  wrote:

> On 17.01.2023 at 02:05, Karl O. Pinc wrote:
> > Or maybe the right way is to set a mode at the very top,
> > the first apply-templates call, and not mess with the
> > built-in templates at all.  (You'd write your own
> > "postgres-mode" templates the same way, to "wrap"
> > and call the default templates.)
> >
> > Think of the mode as an implicit argument that's preserved and
> > passed down through each template invocation without having to
> > be explicitly specified by the calling code.  
> 
> I think the document you're missing is [1].
> 
> There are multiple ways to customize DocBook XSL output and it sounds
> like you want me to write a customization layer which I didn't do
> because there is precedent that the typical "way to do it" (TM) in the
> PostgreSQL project is [2].
> 
> Regards,
> 
> Brar
> 
> [1] http://www.sagehill.net/docbookxsl/CustomizingPart.html
> [2] http://www.sagehill.net/docbookxsl/ReplaceTemplate.html
> 

Sagehill is normally very good.  But in this case [2] does not
apply.  Or rather it applies but it is overkill because you
do not want to replace what a template is producing.  You
want to add to what a template is producing.  So you want to
wrap the template, with your new code adding output before/
after what the original produces.

[1] does not contain this technique.

If you're not willing to try, I am willing to see if I can
produce an example to work from.  My XSLT is starting to
come back.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Logical replication timeout problem

2023-01-17 Thread Ashutosh Bapat
On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  wrote:

> >
> > I am a bit worried about the indirections that the wrappers and hooks
> > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > call OutputPluginUpdateProgress.
> >
>
> Yeah, I think we can do it as we are doing the previous approach but
> we need an additional wrapper (update_progress_cb_wrapper()) as the
> current patch has so that we can add error context information. This
> is similar to why we have a wrapper for all other callbacks like
> change_cb_wrapper.
>

Ultimately OutputPluginUpdateProgress() will be called, which in turn
will call ctx->update_progress. I don't see wrappers around
OutputPluginWrite or OutputPluginPrepareWrite, but I see that those
two are always called from the output plugin, so indirectly they are
called through a wrapper. I also see that update_progress_cb_wrapper()
is similar, as far as the wrapping is concerned, to
ReorderBufferUpdateProgress() in the earlier patch.
ReorderBufferUpdateProgress() looks more readable than the wrapper.

If we want to keep the wrapper, we should at least use a different
variable name. update_progress also exists in LogicalDecodingContext
and will be indirectly called from ReorderBuffer::update_progress.
Somebody might think that there's some recursion involved there, which
would be mighty confusing.

-- 
Best Wishes,
Ashutosh Bapat




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Amit Kapila
On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> On Saturday, January 14, 2023 3:27 PM vignesh C  wrote:
>
> > 2) I'm not sure if this will add any extra coverage as the altering value of
> > min_apply_delay is already tested in the regression, if so this test can be
> > removed:
> > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> > +$node_subscriber->safe_psql('postgres',
> > +   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay =
> > 8646)"
> > +);
> > +
> > +# New row to trigger apply delay.
> > +$node_publisher->safe_psql('postgres',
> > +   "INSERT INTO test_tab VALUES (0, 'foobar')");
> > +
> > +check_apply_delay_log("logical replication apply delay", "8000");
> While addressing this point, I've noticed that there is a
> behavior difference between physical replication's recovery_min_apply_delay
> and this feature when stopping the replication during delays.
>
> At present, in the latter case,
> the apply worker exits without applying the suspended transaction
> after ALTER SUBSCRIPTION DISABLE command for the subscription.
>

In the previous paragraph, you mentioned a behavior difference when
stopping replication, but it is not clear where this DISABLE command
fits into that scenario.

> Meanwhile, there is no "disabling" command for physical replication,
> but I checked the behavior about what happens for promoting a secondary
> during the delay of recovery_min_apply_delay for physical replication as one 
> example.
> The transaction has become visible even in the promoting in the middle of 
> delay.
>

What causes such a transaction to be visible after promotion? Ideally,
if the commit doesn't succeed, the transaction shouldn't be visible.
Do we allow a transaction that is waiting due to the delay to get
committed on promotion?

> I'm not sure if I should make the time-delayed LR aligned with this behavior.
> Does someone has an opinion for this ?
>

Can you please explain a bit more, as asked above, so we can understand the difference?

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> At present, in the latter case,
> the apply worker exits without applying the suspended transaction
> after ALTER SUBSCRIPTION DISABLE command for the subscription.
> Meanwhile, there is no "disabling" command for physical replication,
> but I checked the behavior about what happens for promoting a secondary
> during the delay of recovery_min_apply_delay for physical replication as one
> example.
> The transaction has become visible even in the promoting in the middle of 
> delay.
> 
> I'm not sure if I should make the time-delayed LR aligned with this behavior.
> Does someone has an opinion for this ?

I put my opinion here. The current specification is correct; we should not
follow the physical replication behavior.
One motivation for this feature is to offer an opportunity to correct data
loss errors. When accidental delete events occur, a DBA can stop propagation
to subscribers by disabling the subscription, as the patch stands now.
IIUC, when the subscription is disabled before a transaction is started, the
workers exit and stop applying changes. This feature delays the start of
transactions, so we should regard such an alteration as being executed
before the transaction.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





[PATCH] Constify proclist.h

2023-01-17 Thread Aleksander Alekseev
Hi hackers,

This is a follow-up to [1] and c8ad4d81.

> Additionally Bharath pointed out that there are other pieces of code
> that we may want to change in a similar fashion,
> proclist.h/proclist_types.h as one example. I didn't do this yet
> because I would like to know the community opinion first on whether we
> should do this at all.

Since the consensus seems to be to constify everything possible, here
is the patch for proclist.h. There is nothing to change in
proclist_types.h.
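
For anyone not following the earlier thread, the change is mechanical:
helpers that only read a list take a const-qualified pointer, so the
read-only contract is visible in the signature. Roughly this flavor
(a sketch from memory -- the attached patch is the authoritative version):

```c
/*
 * In storage/proclist.h; proclist_head and INVALID_PGPROCNO come from the
 * headers it already includes.
 */
static inline bool
proclist_is_empty(const proclist_head *list)
{
	return list->head == INVALID_PGPROCNO;
}
```

Callers that only ever inspect a list can then hold const pointers
themselves without casting.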

[1]: 
https://postgr.es/m/CAJ7c6TM2%3D08mNKD9aJg8vEY9hd%2BG4L7%2BNvh30UiNT3kShgRgNg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev


v1-0001-Constify-proclist.h.patch
Description: Binary data


Re: Fix database creation during installchecks for ICU cluster

2023-01-17 Thread vignesh C
On Tue, 29 Nov 2022 at 20:24, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thanks for the patch!
>
>
> On 10/29/22 12:54, Marina Polyakova wrote:
> >
> > 1) The ECPG tests fail because they use the SQL_ASCII encoding [2],
> > the database template0 uses the ICU locale provider and SQL_ASCII is
> > not supported by ICU:
> >
> > $ make -C src/interfaces/ecpg/ installcheck
> > ...
> > == creating database "ecpg1_regression" ==
> > ERROR:  encoding "SQL_ASCII" is not supported with ICU provider
> > ERROR:  database "ecpg1_regression" does not exist
> > command failed: "/home/marina/postgresql/master/my/inst/bin/psql" -X
> > -c "CREATE DATABASE \"ecpg1_regression\" TEMPLATE=template0
> > ENCODING='SQL_ASCII'" -c "ALTER DATABASE \"ecpg1_regression\" SET
> > lc_messages TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_monetary
> > TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_numeric TO 'C';ALTER
> > DATABASE \"ecpg1_regression\" SET lc_time TO 'C';ALTER DATABASE
> > \"ecpg1_regression\" SET bytea_output TO 'hex';ALTER DATABASE
> > \"ecpg1_regression\" SET timezone_abbreviations TO 'Default';" "postgres"
> >
>
> I can confirm that same error happens on my end and your patch fixes the
> issue. But, do ECPG tests really require SQL_ASCII encoding? I removed
> ECPG tests' encoding line [1], rebuilt it and 'make -C
> src/interfaces/ecpg/ installcheck' passed without applying your patch.
>
>
> >
> > 2) The option --no-locale in pg_regress is described as "use C locale"
> > [3]. But in this case the created databases actually use the ICU
> > locale provider with the ICU cluster locale from template0 (see
> > diff_check_backend_used_provider.txt):
> >
> > $ make NO_LOCALE=1 installcheck
>
>
> This works on my end without applying your patch. Commands I used are:
>
> $ ./configure --with-icu --prefix=$PWD/build_dir
> $ make && make install && export PATH=$PWD/build_dir/bin:$PATH
> $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D
> data -l logfile start
> $ make NO_LOCALE=1 installcheck

Hi Marina Polyakova,

Since it is working without your patch, is this patch required for any
other scenarios?

Regards,
Vignesh



