date:20220401

Thomas Munro  writes:
> cfbot found another source of nondeterminism in the regression tests,
> due to the smaller shared_buffers used in this TAP test:

This failure seems related but not identical:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-04-02%2004%3A00%3A26

portals.out is expecting that the "foo25ns" cursor will read
starting at the beginning of tenk1, but it's starting somewhere
else, which presumably is a syncscan effect.

I think the fundamental instability here is that this TAP test is
setting shared_buffers small enough to allow the syncscan logic
to kick in where it does not in normal testing.  Maybe we should
just disable syncscan in this test script?

regards, tom lane

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-04-01 Thread Thomas Munro

On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro  wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

Here's a sketch of that idea.
From 58262253e167e25d4bbb6777574a98fb12850c29 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 2 Apr 2022 17:05:39 +1300
Subject: [PATCH 1/2] WIP: Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

With sufficiently bad luck, it was possible for IssuePendingWritebacks()
to reopen relation files just after we'd processed
PROCSIGNAL_BARRIER_SMGRRELEASE but before the file was unlinked by some
other backend.

While commit 4eb21763 initially appeared to have fixed the spate of test
failures we'd seen on Windows, testing with extra instrumentation to
look out for writes to unlinked files revealed the dangerous nature of
these deferred file references.

This second attempt introduces smgrrelease(reln), smgrreleaseall(), and
smgrreleaseallcount().  That last is used by IssuedPendingWritebacks()
to notice when pending writebacks have to be consider potentially
invalid, by watching out for level changes during the time writeback
instructions were accumulated.

XXX Experimental

Reported-by: Andres Freund 
Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 14 +++
 src/backend/storage/smgr/md.c   |  6 ---
 src/backend/storage/smgr/smgr.c | 58 +
 src/include/storage/buf_internals.h |  3 ++
 src/include/storage/md.h|  1 -
 src/include/storage/smgr.h  |  3 ++
 6 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc..43652b757a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4873,6 +4873,7 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
 
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
+	context->smgr_release_count = smgrreleaseallcount();
 }
 
 /*
@@ -4924,6 +4925,18 @@ IssuePendingWritebacks(WritebackContext *context)
 {
 	int			i;
 
+	/*
+	 * Throw away the contents if smgrreleaseall() has been invoked since
+	 * "context" was initialized or cleared.  Since the main loop below doesn't
+	 * check for interrupts or otherwise handle barrier requests, we don't have
+	 * to worry about invalidation if we get past this check.
+	 */
+	if (context->smgr_release_count != smgrreleaseallcount())
+	{
+		context->nr_pending = 0;
+		context->smgr_release_count = smgrreleaseallcount();
+	}
+
 	if (context->nr_pending == 0)
 		return;
 
@@ -4983,6 +4996,7 @@ IssuePendingWritebacks(WritebackContext *context)
 	}
 
 	context->nr_pending = 0;
+	context->smgr_release_count = smgrreleaseallcount();
 }
 
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..d26c915f90 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -549,12 +549,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
-void
-mdrelease(void)
-{
-	closeAllVfds();
-}
-
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..087c8e76d1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -41,7 +41,6 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_release) (void); /* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = {
 	{
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
-		.smgr_release = mdrelease,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
 		.smgr_create = mdcreate,
@@ -100,6 +98,8 @@ static dlist_head unowned_relns;
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
 
+static uint64 releaseall_count = 0;
+
 
 /*
  *	smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -281,6 +281,44 @@ smgrclose(SMgrRelation reln)
 		*owner = NULL;
 }
 
+/*
+ *	smgrrelease() -- Release all resources used by this object.
+ *
+ *	The object remains v

Re: Skipping logical replication transactions on subscriber side

2022-04-01 Thread Amit Kapila

On Sat, Apr 2, 2022 at 7:29 AM Noah Misch  wrote:
>
> On Sat, Apr 02, 2022 at 06:49:20AM +0530, Amit Kapila wrote:
>
> After applying datum_to_lsn_skiplsn_1.patch, I get another failure.  Logs
> attached.
>

The failure is for the same reason. I noticed that even when skip lsn
value should be 0/0, it is some invalid value, see: "LOG:  not started
skipping changes: my_skiplsn 0/B0706F72 finish_lsn 0/14EB7D8". Here,
my_skiplsn should be 0/0 instead of 0/B0706F72. Now, I am not sure why
the LSN's 4 bytes are correct and the other 4 bytes have some random
value. A similar problem is there when we have set the valid value of
skip lsn, see: "LOG:  not started skipping changes: my_skiplsn
14EB7D8/B0706F72 finish_lsn 0/14EB7D8". Here the value of my_skiplsn
should be 0/14EB7D8 instead of 14EB7D8/B0706F72.

I am sure that if you create a subscription with the below test and
check the skip lsn value, it will be correct, otherwise, you would
have seen failure in subscription.sql as well. If possible, can you
please check the following example to rule out the possibility:

For example,
Publisher:
Create table t1(c1 int);
Create Publication pub1 for table t1;

Subscriber:
Create table t1(c1 int);
Create Subscription sub1 connection 'dbname = postgres' Publication pub1;
Select subname, subskiplsn from pg_subsription; -- subskiplsn should be 0/0

Alter Subscription sub1 SKIP (LSN = '0/14EB7D8');
Select subname, subskiplsn from pg_subsription; -- subskiplsn should
be 0/14EB7D8

Assuming the above is correct and we are still getting the wrong value
in apply worker, the only remaining suspect is the following code in
GetSubscription:
sub->skiplsn = DatumGetLSN(subform->subskiplsn);

I don't know what is wrong with this because subskiplsn is stored as
pg_lsn which is a fixed value and we should be able to access it by
struct. Do you see any problem with this?

-- 
With Regards,
Amit Kapila.

Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2022-04-01 Thread Michael Paquier

On Fri, Apr 01, 2022 at 08:34:34AM -0500, Justin Pryzby wrote:
> If you do that, should also remove upgradecheck from .cirrus.yaml, which
> currently runs the upgradecheck target.

Indeed.  It makes no sense to keep that.  I have removed this part and
applied the patch, after one extra run through the CI.

> An alternative to your patch to have the buildfarm client avoid calling
> upgradecheck if tap tests are disabled.  Under your patch, upgrade check is a
> NOP, so it should stop calling upgradecheck anyway.  So maybe this is a better
> option ?

Yeah, there is an extra issue with the buildfarm client here.  The
animals that have TAP enabled are now running the tests of pg_upgrade
twice: once per the optional module TestUpgrade and once in
run_bin_tests()@run_build.pl.  This is something that needs to be
changed in the client code itself, and maybe the best fix is to
disable TestUpgrade.pm when running with v15~ or a newer version.  A
fix with this approach would become much easier once REL_15_STABLE is
created, though.  I am pretty sure that it should also be possible to
change the list of optional modules depending on the branch running,
but I have not dug into that..
--
Michael


signature.asc
Description: PGP signature

Re: PostgreSQL shutdown modes

2022-04-01 Thread Michael Paquier

On Fri, Apr 01, 2022 at 01:22:05PM -0400, Robert Haas wrote:
> I attach herewith a modest patch to rename these shutdown modes to
> more accurately correspond to their actual characteristics.
>
> Date: Fri, 1 Apr 2022 12:50:05 -0400

I love the idea.  Just in time, before the feature freeze deadline.
--
Michael


signature.asc
Description: PGP signature

Re: Skipping logical replication transactions on subscriber side

2022-04-01 Thread Noah Misch

On Sat, Apr 02, 2022 at 06:49:20AM +0530, Amit Kapila wrote:
> On Sat, Apr 2, 2022 at 5:41 AM Noah Misch  wrote:
> >
> > On Fri, Apr 01, 2022 at 09:25:52PM +0900, Masahiko Sawada wrote:
> > > > On Fri, Apr 1, 2022 at 4:44 PM Noah Misch  wrote:
> > > > > src/test/subscription/t/029_on_error.pl has been failing reliably on 
> > > > > the five
> > > > > AIX buildfarm members:
> > > > >
> > > > > # poll_query_until timed out executing this query:
> > > > > # SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'
> > > > > # expecting this output:
> > > > > # t
> > > > > # last actual query output:
> > > > > # f
> > > > > # with stderr:
> > > > > timed out waiting for match: (?^:LOG:  done skipping logical 
> > > > > replication transaction finished at 0/1D30788) at t/029_on_error.pl 
> > > > > line 50.
> > > > >
> > > > > I've posted five sets of logs (2.7 MiB compressed) here:
> > > > > https://drive.google.com/file/d/16NkyNIV07o0o8WM7GwcaAYFQDPTkULkR/view?usp=sharing
> >
> > > Given that "SELECT subskiplsn = '0/0'
> > > FROM pg_subscription WHERE subname = 'sub’” didn't return true, some
> > > value was set to subskiplsn even after the unique key error.
> > >
> > > So I'm guessing that the apply worker could not get the updated value
> > > of the subskiplsn or its MySubscription->skiplsn could not match with
> > > the transaction's finish LSN. Also, given that the test is failing on
> > > all AIX buildfarm members, there might be something specific to AIX.
> > >
> > > Noah, to investigate this issue further, is it possible for you to
> > > apply the attached patch and run the 029_on_error.pl test? The patch
> > > adds some logs to get additional information.
> >
> > Logs attached.
> 
> Thank you.
> 
> By seeing the below Logs:
> 
> 
> 2022-04-01 18:19:34.710 CUT [58327402] LOG:  not started skipping
> changes: my_skiplsn 14EB7D8/B0706F72 finish_lsn 0/14EB7D8
> ...
> 
> 
> It seems that the value of skiplsn read in GetSubscription is wrong
> which makes the apply worker think it doesn't need to skip the
> transaction. Now, in Alter/Create Subscription, we are using
> LSNGetDatum() to store skiplsn value in pg_subscription but while
> reading it in GetSubscription(), we are not converting back the datum
> to LSN by using DatumGetLSN(). Is it possible that on this machine it
> might be leading to not getting the right value for skiplsn? I think
> it is worth trying to see if this fixes the problem.

After applying datum_to_lsn_skiplsn_1.patch, I get another failure.  Logs
attached.


log-subscription-20220401b.tar.xz
Description: Binary data

Re: Skipping logical replication transactions on subscriber side

2022-04-01 Thread Amit Kapila

On Sat, Apr 2, 2022 at 5:41 AM Noah Misch  wrote:
>
> On Fri, Apr 01, 2022 at 09:25:52PM +0900, Masahiko Sawada wrote:
> > > On Fri, Apr 1, 2022 at 4:44 PM Noah Misch  wrote:
> > > > src/test/subscription/t/029_on_error.pl has been failing reliably on 
> > > > the five
> > > > AIX buildfarm members:
> > > >
> > > > # poll_query_until timed out executing this query:
> > > > # SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'
> > > > # expecting this output:
> > > > # t
> > > > # last actual query output:
> > > > # f
> > > > # with stderr:
> > > > timed out waiting for match: (?^:LOG:  done skipping logical 
> > > > replication transaction finished at 0/1D30788) at t/029_on_error.pl 
> > > > line 50.
> > > >
> > > > I've posted five sets of logs (2.7 MiB compressed) here:
> > > > https://drive.google.com/file/d/16NkyNIV07o0o8WM7GwcaAYFQDPTkULkR/view?usp=sharing
>
> > Given that "SELECT subskiplsn = '0/0'
> > FROM pg_subscription WHERE subname = 'sub’” didn't return true, some
> > value was set to subskiplsn even after the unique key error.
> >
> > So I'm guessing that the apply worker could not get the updated value
> > of the subskiplsn or its MySubscription->skiplsn could not match with
> > the transaction's finish LSN. Also, given that the test is failing on
> > all AIX buildfarm members, there might be something specific to AIX.
> >
> > Noah, to investigate this issue further, is it possible for you to
> > apply the attached patch and run the 029_on_error.pl test? The patch
> > adds some logs to get additional information.
>
> Logs attached.
>

Thank you.

By seeing the below Logs:


2022-04-01 18:19:34.710 CUT [58327402] LOG:  not started skipping
changes: my_skiplsn 14EB7D8/B0706F72 finish_lsn 0/14EB7D8
...


It seems that the value of skiplsn read in GetSubscription is wrong
which makes the apply worker think it doesn't need to skip the
transaction. Now, in Alter/Create Subscription, we are using
LSNGetDatum() to store skiplsn value in pg_subscription but while
reading it in GetSubscription(), we are not converting back the datum
to LSN by using DatumGetLSN(). Is it possible that on this machine it
might be leading to not getting the right value for skiplsn? I think
it is worth trying to see if this fixes the problem.

Any other thoughts?

-- 
With Regards,
Amit Kapila.


datum_to_lsn_skiplsn_1.patch
Description: Binary data

Re: logical decoding and replication of sequences

2022-04-01 Thread Tomas Vondra

On 4/1/22 17:02, Tomas Vondra wrote:
> 
> 
> On 3/28/22 07:29, Amit Kapila wrote:
>> ...
>>
>> While thinking about this, I think I see a problem with the
>> non-transactional handling of sequences. It seems that we will skip
>> sending non-transactional sequence change if it occurs before the
>> decoding has reached a consistent point but the surrounding commit
>> occurs after a consistent point is reached. In such cases, the
>> corresponding DMLs like inserts will be sent but sequence changes
>> won't be sent. For example (this scenario is based on
>> twophase_snapshot.spec),
>>
>> Initial setup:
>> ==
>> Create table t1_seq(c1 int);
>> Create Sequence seq1;
>>
>> Test Execution via multiple sessions (this test allows insert in
>> session-2 to happen before we reach a consistent point and commit
>> happens after a consistent point):
>> ===
>>
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit 'foo'
>>
>> Session-1:
>> SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
>> 'include-xids', 'false', 'skip-empty-xacts', '1');
>>
>>  data
>> --
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>>
>> Now, if we normally try to decode such an insert, the result would be
>> something like:
>>  data
>> --
>>  sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>> This will create an inconsistent replica as sequence changes won't be
>> replicated.
> 
> Hmm, that's interesting. I wonder if it can actually happen, though.
> Have you been able to reproduce that, somehow?
> 
>> I thought about changing snapshot dealing of
>> non-transactional sequence changes similar to transactional ones but
>> that also won't work because it is only at commit we decide whether we
>> can send the changes.
>>
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
> 
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
> 
>> For the transactional case, as we are considering the create sequence
>> operation as transactional, we would unnecessarily queue them even
>> though that is not required. Basically, they don't need to be
>> considered transactional and we can simply ignore such messages like
>> other DDLs. But for that probably we need to distinguish Alter/Create
>> case which may or may not be straightforward. Now, queuing them is
>> probably harmless unless it causes the transaction to spill/stream.
>>
> 
> I'm not sure I follow. Why would we queue them unnecessarily?
> 
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean can't be any CREATE case,
> just ALTER?
> 

So, I investigated this a bit more, and I wrote a couple test_decoding
isolation tests (patch attached) demonstrating the issue. Actually, I
should say "issues" because it's a bit worse than you described ...

The whole problem is in this chunk of code in sequence_decode():


  /* Skip the change if already processed (per the snapshot). */
  if (transactional &&
  !SnapBuildProcessChange(builder, xid, buf->origptr))
  return;
  else if (!transactional &&
   (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
SnapBuildXactN

Re: Fix overflow in DecodeInterval

Joseph Koshakow  writes:
> * The existing code for rounding had a lot of int to double
> casting and vice versa. I *think* that doubles are able to completely
> represent the range of ints. However doubles are not able to represent
> the full range of int64. After making the change I started noticing
> a lot of lossy behavior. One thought I had was to change the doubles
> to long doubles, but I wasn't able to figure out if long doubles could
> completely represent the range of int64. Especially since their size
> varies depending on the architecture. Does anyone know the answer to
> this?

I agree that relying on long double is not a great plan.  However,
I'm not seeing where there's a problem.  AFAICS the revised code
only uses doubles to represent fractions from the input, ie if you
write "123.456 hours" then the ".456" is carried around for awhile
as a float.  This does not seem likely to pose any real-world
problem; do you have a counterexample?

Anyway, I've spent today reviewing the code and cleaning up things
I didn't like, and attached is a v10.  I almost feel that this is
committable, but there is one thing that is bothering me.  The
part of DecodeInterval that does strange things with signs in the
INTSTYLE_SQL_STANDARD case (starting about line 3400 in datetime.c
before this patch, or line 3600 after) used to separately force the
hour, minute, second, and microsecond fields to negative.
Now it forces the merged tm_usec field to negative.  It seems to
me that this could give a different answer than before, if the
h/m/s/us values had been of different signs before they got merged.
However, I don't think that that situation is possible in SQL-spec-
compliant input, so it may not be a problem.  Again, a counterexample
would be interesting.

regards, tom lane

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index ba0ec35ac5..dae90e4a9e 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
+#include "common/int.h"
 #include "common/string.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -37,17 +38,31 @@ static int	DecodeNumber(int flen, char *field, bool haveTextMonth,
 static int	DecodeNumberField(int len, char *str,
 			  int fmask, int *tmask,
 			  struct pg_tm *tm, fsec_t *fsec, bool *is2digits);
+static int	DecodeTimeCommon(char *str, int fmask, int range,
+			 int *tmask, struct pg_itm *itm);
 static int	DecodeTime(char *str, int fmask, int range,
 	   int *tmask, struct pg_tm *tm, fsec_t *fsec);
+static int	DecodeTimeForInterval(char *str, int fmask, int range,
+  int *tmask, struct pg_itm_in *itm_in);
 static const datetkn *datebsearch(const char *key, const datetkn *base, int nel);
 static int	DecodeDate(char *str, int fmask, int *tmask, bool *is2digits,
 	   struct pg_tm *tm);
 static char *AppendSeconds(char *cp, int sec, fsec_t fsec,
 		   int precision, bool fillzeros);
-static void AdjustFractSeconds(double frac, struct pg_tm *tm, fsec_t *fsec,
-			   int scale);
-static void AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec,
-			int scale);
+static bool int64_multiply_add(int64 val, int64 multiplier, int64 *sum);
+static bool AdjustFractMicroseconds(double frac, int64 scale,
+	struct pg_itm_in *itm_in);
+static bool AdjustFractDays(double frac, int scale,
+			struct pg_itm_in *itm_in);
+static bool AdjustFractYears(double frac, int scale,
+			 struct pg_itm_in *itm_in);
+static bool AdjustMicroseconds(int64 val, double fval, int64 scale,
+			   struct pg_itm_in *itm_in);
+static bool AdjustDays(int64 val, int scale,
+	   struct pg_itm_in *itm_in);
+static bool AdjustMonths(int64 val, struct pg_itm_in *itm_in);
+static bool AdjustYears(int64 val, int scale,
+		struct pg_itm_in *itm_in);
 static int	DetermineTimeZoneOffsetInternal(struct pg_tm *tm, pg_tz *tzp,
 			pg_time_t *tp);
 static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t,
@@ -425,7 +440,7 @@ GetCurrentTimeUsec(struct pg_tm *tm, fsec_t *fsec, int *tzp)
  * Returns a pointer to the new end of string.  No NUL terminator is put
  * there; callers are responsible for NUL terminating str themselves.
  *
- * Note that any sign is stripped from the input seconds values.
+ * Note that any sign is stripped from the input sec and fsec values.
  */
 static char *
 AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
@@ -471,7 +486,7 @@ AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
 
 		/*
 		 * If we still have a non-zero value then precision must have not been
-		 * enough to print the number.  We punt the problem to pg_ltostr(),
+		 * enough to print the number.  We punt the problem to pg_ultostr(),
 		 * which will generate a correct answer in the minimum valid width.
 		 */
 		if (value)
@@ -

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

Hi,

On 2022-04-01 16:35:38 -0700, Jeff Davis wrote:
> * I don't think we need the stats at all. We can run GROUP BY queries
> on the results of pg_get_wal_records_info().

It's well over an order of magnitude slower. And I don't see how that can be
avoided. That makes it practically useless.

See numbers at the bottom of
https://postgr.es/m/CALj2ACUvU2fGLw7keEpxZhGAoMQ9vrCPX-13hexQPoR%2BQRbuOw%40mail.gmail.com

Greetings,

Andres Freund

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-04-01 Thread Jeff Davis

On Sat, 2022-03-26 at 10:31 +0530, Bharath Rupireddy wrote:
> Attaching v16 patch-set, only change in v16-0002-pg_walinspect.patch,
> others remain the same.

I looked more closely at this patch.

* It seems that pg_get_wal_record() is not returning the correct raw
data for the record. I tested with pg_logical_emit_message, and the
message isn't there. pg_get_wal_record_info() uses XLogRecordGetData(),
which seems closer to what I expect.

* I'm a little unclear on the purpose of pg_get_wal_record(). What does
it offer that the other functions don't?

* I don't think we need the stats at all. We can run GROUP BY queries
on the results of pg_get_wal_records_info().

* Include the xlinfo portion of the wal record in addition to the rmgr,
like pg_waldump --stats=record shows. That way we can GROUP BY that as
well.

* I don't think we need the changes to xlogutils.c. You calculate the
end pointer based on the flush pointer, anyway, so we should never need
to wait (and when I take it out, the tests still pass).

I think we can radically simplify it without losing functionality,
unless I'm missing something.

1. Eliminate pg_get_wal_record(),
pg_get_wal_records_info_till_end_of_wal(), pg_get_wal_stats(),
pg_get_wal_stats_till_end_of_wal().

2. Rename pg_get_wal_record_info -> pg_get_wal_record

3. Rename pg_get_wal_records_info -> pg_get_wal_records

4. For pg_get_wal_records, if end_lsn is NULL, read until the end of
WAL.

5. For pg_get_wal_record and pg_get_wal_records, also return the xlinfo
using rm_identify() if available.

6. Remove changes to xlogutils.

7. Remove the refactor to pull the stats out to a separate file,
because stats aren't needed. 

8. With only two functions in the API, it may even make sense to just
make it a part of postgres rather than a separate module.

However, I'm a little behind on this discussion thread, so perhaps I'm
missing some important context. I'll try to catch up soon.

Regards,
Jeff Davis

Re: Higher level questions around shared memory stats

Hi,

On 2022-03-29 12:17:27 -0700, Andres Freund wrote:
> Separate from the minutia in [1] I'd like to discuss a few questions of more
> general interest. I'll post another question or two later.

5) What does track_counts = off mean?

I just was trying to go through the shared memory stats patch to ensure
pgstat_track_counts has the same effect as before.  It's kinda hard to discern
what exactly it supposed to be doing because it's quite inconsistently
applied.

- all "global" stats ignore it (archiver, bgwriter, checkpointer, slru wal)
- subscription, replication slot stats ignore it
- *some* database level stats pay heed
  - pgstat_report_autovac, pgstat_report_connect don't
  - pgstat_report_recovery_conflict, pgstat_report_deadlock,
pgstat_report_checksum_failures_in_db do respect it
- function stats have their own setting

Unless we conclude something else I'll make sure the patch is "bug
compatible".

Greetings,

Andres Freund

Re: [Proposal] vacuumdb --schema only

On Fri, Apr 01, 2022 at 10:01:28AM -0500, Justin Pryzby wrote:
> On Wed, Mar 30, 2022 at 02:22:58PM -0700, Nathan Bossart wrote:
>> I'm personally -1 for the --exclude-schema option.  I don't see any
>> existing "exclude" options in vacuumdb, and the uses for such an option
>> seem rather limited.  If we can point to specific use-cases for this
>> option, I might be willing to change my vote.
> 
> I suggested it because I would consider using it, even though I don't 
> currently
> use the vacuumdb script at all.  I think this would allow partially
> retiring/simplifying our existing vacuum script.
> 
> We 1) put all our partitions in a separate "child" schema (so \d is more
> usable), and also 2) put some short-lived tables into their own schemas.  Some
> of those tables may only exist for ~1 day so I'd perfer to neither vacuum nor
> analyze them (they're only used for SELECT *).  But there can be a lot of 
> them,
> so a nightly job could do something like vacuumdb --schema public or vacuumdb
> --exclude-schema ephemeral.
> 
> Everything would be processed nightly using vacuumdb --min-xid (to keep the
> monitoring system happy).
> 
> The non-partitioned tables could be vacuumed nightly (without min-xid), with
> --exclude ephemeral.
> 
> The partitioned tables could be processed monthly with vacuumdb --analyze.
> 
> I'd also want to be able to run vacuumdb --analyze nightly, but I'd want to
> exclude the schema with short-lived tables.   I'd also need a way to exclude
> our partitioned tables from nightly analyze (they should run monthly only).

Thanks for elaborating.  I retract my -1 vote.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

use has_privs_of_role() for pg_hba.conf

Hi hackers,

6198420 ensured that has_privs_of_role() is used for predefined roles,
which means that the role inheritance hierarchy is checked instead of mere
role membership.  However, inheritance is still not respected for
pg_hba.conf.  Specifically, "samerole", "samegroup", and "+" still use
is_member_of_role_nosuper().

The attached patch introduces has_privs_of_role_nosuper() and uses it for
the aforementioned pg_hba.conf functionality.  I think this is desirable
for consistency.  If a role_a has membership in role_b but none of its
privileges (i.e., NOINHERIT), does it make sense that role_a should match
+role_b in pg_hba.conf?  It is true that role_a could always "SET ROLE
role_b", and with this change, the user won't even have the ability to log
in to run SET ROLE.  But I'm not sure if that's a strong enough argument
for deviating from the standard role privilege checks.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5c1a5aaaff949fc25afaa6856ca2a85a54c8efdc Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 1 Apr 2022 11:40:51 -0700
Subject: [PATCH v1 1/1] Use has_privs_of_role for samerole, samegroup, and +
 in pg_hba.conf.

6198420 ensured that has_privs_of_role() is used for predefined
roles, which means that the role privilege inheritance hierarchy is
checked instead of mere role membership.  However, inheritance is
still not respected for pg_hba.conf.  This change alters the
authentication logic to consider role privileges instead of just
role membership.

Do not back-patch.

Author: Nathan Bossart
---
 doc/src/sgml/client-auth.sgml | 18 +-
 src/backend/libpq/hba.c   | 19 ++-
 src/backend/utils/adt/acl.c   | 23 +++
 src/include/utils/acl.h   |  1 +
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 142b0affcb..5c63842e52 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -221,12 +221,12 @@ hostnogssenc  database  usersameuser specifies that the record
matches if the requested database has the same name as the
requested user.  The value samerole specifies that
-   the requested user must be a member of the role with the same
+   the requested user must have privileges of the role with the same
name as the requested database.  (samegroup is an
obsolete but still accepted spelling of samerole.)
-   Superusers are not considered to be members of a role for the
-   purposes of samerole unless they are explicitly
-   members of the role, directly or indirectly, and not just by
+   Superusers are not considered to have privileges of a role for the
+   purposes of samerole unless they explicitly have
+   privileges of the role, directly or indirectly, and not just by
virtue of being a superuser.
The value replication specifies that the record
matches if a physical replication connection is requested, however, it
@@ -252,10 +252,10 @@ hostnogssenc  database  user+.
(Recall that there is no real distinction between users and groups
in PostgreSQL; a + mark really means
-   match any of the roles that are directly or indirectly members
+   match any of the roles that directly or indirectly have privileges
of this role, while a name without a + mark matches
only that specific role.) For this purpose, a superuser is only
-   considered to be a member of a role if they are explicitly a member
+   considered to have privileges of a role if they explicitly have privileges
of the role, directly or indirectly, and not just by virtue of
being a superuser.
Multiple user names can be supplied by separating them with commas.
@@ -788,9 +788,9 @@ hostall all 192.168.0.0/16  ident map=omicro
 # If these are the only three lines for local connections, they will
 # allow local users to connect only to their own databases (databases
 # with the same name as their database user name) except for administrators
-# and members of role "support", who can connect to all databases.  The file
-# $PGDATA/admins contains a list of names of administrators.  Passwords
-# are required in all cases.
+# and roles with privileges of role "support", who can connect to all
+# databases.  The file $PGDATA/admins contains a list of names of
+# administrators.  Passwords are required in all cases.
 #
 # TYPE  DATABASEUSERADDRESS METHOD
 local   sameuserall md5
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index f8393ca8ed..247118fa30 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -546,13 +546,13 @@ tokenize_auth_file(const char *filename, FILE *file, List **tok_lines,
 
 
 /*
- * Does user belong to role?
+ * Does user ha

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-04-01 Thread Thomas Munro

On Sat, Apr 2, 2022 at 2:52 AM Robert Haas  wrote:
> On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro  wrote:
> > The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> > IssuePendingWritebacks(), which does desynchronised smgropen() calls
> > and could open files after the barrier but just before they are
> > unlinked.  Makes sense, but...
> >
> > 1.  For that to actually work, we'd better call smgrcloseall() when
> > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> > closeAllVds().  Here's a patch for that.  But...
>
> What if we instead changed our approach to fixing
> IssuePendingWritebacks()?  Design sketch:
>
> 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's 
> false.
> 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
> 3. If you want, you can set it back to false any time you access the
> smgr while holding a relation lock.
> 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
> function now refuses either to create a new smgr or to return one
> where smgr_zonk is true.
>
> I think this would be sufficient to guarantee that we never try to
> write back to a relfilenode if we haven't had a lock on it since the
> last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
> need here?
>
> My thinking here is that it's a lot easier to get away with marking
> the content of a data structure invalid than it is to get away with
> invalidating pointers to that data structure. If we can tinker with
> the design so that the SMgrRelationData doesn't actually go away but
> just gets a little bit more thoroughly invalidated, we could avoid
> having to audit the entire code base for code that keeps smgr handles
> around longer than would be possible with the design you demonstrate
> here.

Another idea would be to call a new function DropPendingWritebacks(),
and also tell all the SMgrRelation objects to close all their internal
state (ie the fds + per-segment objects) but not free the main
SMgrRelationData object, and for good measure also invalidate the
small amount of cached state (which hasn't been mentioned in this
thread, but it kinda bothers me that that state currently survives, so
it was one unspoken reason I liked the smgrcloseall() idea).  Since
DropPendingWritebacks() is in a rather different module, perhaps if we
were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
something else, because it's not really a SMGR-only thing anymore.

merge documentation fix

2022-04-01 Thread Euler Taveira

Hi,

While inspecting the MERGE documentation, I noticed that there is an extra
semicolon in one of the examples that shouldn't be there.

diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
index c547122c9b..ac1c0a83dd 100644
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -596,7 +596,7 @@ ON s.winename = w.winename
WHEN NOT MATCHED AND s.stock_delta > 0 THEN
   INSERT VALUES(s.winename, s.stock_delta)
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
-  UPDATE SET stock = w.stock + s.stock_delta;
+  UPDATE SET stock = w.stock + s.stock_delta
WHEN MATCHED THEN
   DELETE;



--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: [PATCH] Add extra statistics to explain for Nested Loop

2022-04-01 Thread Ekaterina Sokolova


Hi, hackers. Thank you for your attention to this topic.

Julien Rouhaud wrote:

+static void show_loop_info(Instrumentation *instrument, bool isworker,
+   ExplainState *es);

I think this should be done as a separate refactoring commit.
Sure. I divided the patch. Now Justin's refactor commit is separated. 
Also I actualized it a bit.


Most of the comments I have are easy to fix.  But I think that the real 
problem
is the significant overhead shown by Ekaterina that for now would apply 
even if
you don't consume the new stats, for instance if you have 
pg_stat_statements.

And I'm still not sure of what is the best way to avoid that.

I took your advice about InstrumentOption. Now INSTRUMENT_EXTRA exists.
So currently it's no overheads during basic load. Operations using 
INSTRUMENT_ALL contain overheads (because of INSTRUMENT_EXTRA is a part 
of INSTRUMENT_ALL), but they are much less significant than before. I 
apply new overhead statistics collected by pgbench with auto _explain 
enabled.



Why do you need to initialize min_t and min_tuples but not max_t and
max_tuples while both will initially be 0 and possibly updated 
afterwards?
We need this initialization for min values so comment about it located 
above the block of code with initialization.


I am convinced that the latest changes have affected the patch in a 
positive way. I'll be pleased to hear your thoughts on this.


--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres CompanyFrom: Justin Pryzby 
Subject: [PATCH 2/2] explain.c: refactor ExplainNode()

---
 src/backend/commands/explain.c | 110 +--
 1 file changed, 48 insertions(+), 62 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7dcffa13f2..61a8a67e2ad 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -118,6 +118,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 	   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
+static void show_loop_info(Instrumentation *instrument, bool isworker,
+		   ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 			  bool planning);
@@ -1618,36 +1620,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	if (es->analyze &&
 		planstate->instrument && planstate->instrument->nloops > 0)
-	{
-		double		nloops = planstate->instrument->nloops;
-		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
-		double		rows = planstate->instrument->ntuples / nloops;
-
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-		{
-			if (es->timing)
-appendStringInfo(es->str,
- " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
- startup_ms, total_ms, rows, nloops);
-			else
-appendStringInfo(es->str,
- " (actual rows=%.0f loops=%.0f)",
- rows, nloops);
-		}
-		else
-		{
-			if (es->timing)
-			{
-ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
-	 3, es);
-ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
-	 3, es);
-			}
-			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-		}
-	}
+		show_loop_info(planstate->instrument, false, es);
 	else if (es->analyze)
 	{
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1677,44 +1650,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		for (int n = 0; n < w->num_workers; n++)
 		{
 			Instrumentation *instrument = &w->instrument[n];
-			double		nloops = instrument->nloops;
-			double		startup_ms;
-			double		total_ms;
-			double		rows;
 
-			if (nloops <= 0)
+			if (instrument->nloops <= 0)
 continue;
-			startup_ms = 1000.0 * instrument->startup / nloops;
-			total_ms = 1000.0 * instrument->total / nloops;
-			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-ExplainIndentText(es);
-if (es->timing)
-	appendStringInfo(es->str,
-	 "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-	 startup_ms, total_ms, rows, nloops);
-else
-	appendStringInfo(es->str,
-	 "actual rows=%.0f loops=%.0f\n",
-	 rows, nloops);
-			}
-			else
-			{
-if (es->timing)
-{
-	ExplainPropertyFloat("Actual Startup Time", "ms",
-		 startup_ms, 3, es);
-	ExplainPropertyFloat("Actual Total Time", "ms",
-		 total_ms, 3, es);
-}
-ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-			}
+			show_loop_info(instrument, true, es);
 
+			if

Re: Can we automatically add elapsed times to tap test log?

2022-04-01 Thread Andrew Dunstan


On 4/1/22 15:16, Andrew Dunstan wrote:
> On 4/1/22 13:44, Nathan Bossart wrote:
>> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
>>> right now I am looking at a test added in the shmstats patch that's slow on
>>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
>>> to figure out where things hang.
>>>
>>> I've hit this several times before. Of course it's not too hard to hack up
>>> something printing elapsed time. But ISTM that it'd be better if we were 
>>> able
>>> to prefix the logging into regress_log_* with something like
>>> [timestamp + time since start of test]
>>> or
>>> [timestamp + time since start of test + time since last log message]
>>>
>>>
>>> This isn't just useful to figure out what parts of test are slow, but also
>>> helps correlate server logs with the regress_log_* output. Which right now 
>>> is
>>> hard and inaccurate, requiring manually correlating statements between 
>>> server
>>> log and the tap test (often there's no logging for statements in the
>>> regress_log_*).
>> +1
>>
>
> Maybe one way would be to make a change in
> src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
> just be to add a timestamp, the other things would involve a bit more
> bookkeeping. It should also be checked to make sure it doesn't add too
> much overhead, although I would be surprised if it did.
>


Along these lines. Untested, it clearly needs a bit of polish (e.g. a
way to turn it on or off for a filehandle). We could use Time::Hires if
you want higher resolution times.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Test/SimpleTee.pm b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
index bb9d79a755..2e037ab89d 100644
--- a/src/test/perl/PostgreSQL/Test/SimpleTee.pm
+++ b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
@@ -14,6 +14,18 @@ package PostgreSQL::Test::SimpleTee;
 use strict;
 use warnings;
 
+my $start_time;
+
+BEGIN { $start_time = time; }
+
+sub _time_str
+{
+my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
+  localtime(time);
+return sprintf("[%.2d:%.2d:%.2d](%ds) ",
+   hour, $min, $sec,  time-start_time);
+}
+
 sub TIEHANDLE
 {
 	my $self = shift;
@@ -26,7 +38,7 @@ sub PRINT
 	my $ok   = 1;
 	for my $fh (@$self)
 	{
-		print $fh @_ or $ok = 0;
+		print $fh _time_str, @_ or $ok = 0;
 		$fh->flush   or $ok = 0;
 	}
 	return $ok;

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

Hi,

On 2022-04-01 22:47:02 +0300, Andrei Zubkov wrote:
> + entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, 
> NULL);
> +
> + if (entry) {
> + /* Found */
> + if (minmax_only) {
> + /* When requested reset only min/max statistics 
> of an entry */
> + entry_counters = &entry->counters;
> + for (int kind = 0; kind < PGSS_NUMKIND; kind++)
> + {
> + entry_counters->max_time[kind] = 0;
> + entry_counters->min_time[kind] = 0;
> + }
> + entry->minmax_stats_since = stats_reset;
> + }
> + else
> + {
> + /* Remove the key otherwise  */
> + hash_search(pgss_hash, &entry->key, 
> HASH_REMOVE, NULL);
> + num_remove++;
> + }
> + }

It seems decidedly not great to have four copies of this code. It was already
not great before, but this patch makes the duplicated section go from four
lines to 20 or so.

Greetings,

Andres Freund

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-01 Thread Andrei Zubkov

Hi,

Thank you, Greg

On Fri, 2022-04-01 at 11:38 -0400, Greg Stark wrote:
> [13:19:51.544] pg_stat_statements.c: In function ‘entry_reset’:
> [13:19:51.544] pg_stat_statements.c:2598:32: error:
> ‘minmax_stats_reset’ may be used uninitialized in this function
> [-Werror=maybe-uninitialized]
> [13:19:51.544] 2598 | entry->minmax_stats_since = minmax_stats_reset;
> [13:19:51.544] | ~~^~~~
>

I was afraid of such warning can appear..

On Sat, 2022-04-02 at 00:13 +0800, Julien Rouhaud wrote:
> I guess that pg_stat_statement_reset() is so
> expensive that an extra gettimeofday() wouldn't change much. 

Agreed

> Otherwise
> initializing to NULL should be enough.

Julien, I would prefer an extra GetCurrentTimestamp(). So, I've opted
to use the common unconditional

stats_reset = GetCurrentTimestamp();

for an entire entry_reset function due to the following:

1. It will be uniform for stats_reset and minmax_stats_reset
2. As you mentioned, it wouldn't change a much
3. The most common way to use this function is to reset all statements
meaning that GetCurrentTimestamp() will be called anyway to update the
value of stats_reset field in pg_stat_statements_info view
4. Actually I would like that pg_stat_statements_reset function was
able to return the value of stats_reset as its result. This could give
to the sampling solutions the ability to check if the last reset (of
any type) was performed by this solution or any other reset was
performed by someone else. It seems valuable to me, but it changes the
result type of the pg_stat_statements_reset() function, so I don't know
if we can do that right now.

v8 attached
--
regards, Andrei
From a905dcbd8ca891a1aaf9652324650b87cdcae001 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Fri, 1 Apr 2022 22:09:49 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds stats_since column to the pg_stat_statements view. This column
is populated with the current timestamp when a new statement is added to the
pg_stat_statements hashtable. It provides clean information about statistics
collection time interval for each statement. Besides it can be used
by sampling solutions to detect situations when a statement was evicted and
returned back between samples.
Such sampling solution could derive any pg_stat_statements statistic value for
an interval between two samples with except of all min/max statistics. To
address this issue this patch adds the ability to reset min/max
statistics independently of statement reset using the new minmax_only parameter
of the pg_stat_statements_reset(userid oid, dbid oid, queryid bigint,
minmax_only boolean) function.
Timestamp of such reset is stored in the minmax_stats_since field for
each statement.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |   3 +-
 .../expected/oldextversions.out   |  61 ++
 .../expected/pg_stat_statements.out   | 140 
 .../pg_stat_statements--1.9--1.10.sql | 108 ++
 .../pg_stat_statements/pg_stat_statements.c   | 199 +++---
 .../pg_stat_statements.control|   2 +-
 .../pg_stat_statements/sql/oldextversions.sql |   8 +
 .../sql/pg_stat_statements.sql|  92 
 doc/src/sgml/pgstatstatements.sgml|  52 -
 9 files changed, 623 insertions(+), 42 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38d..edc40c8bbfb 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o

 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index f18c08838f5..70877948491 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -136,4 +136,65 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);

 (1 row)

+ALTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+\d pg_stat_statements
+View "public.pg_stat_statements"
+   Column|   Type   | Collation | Nullable | Default 
+-+--+---+--+-
+ userid  | oid  |   |  |

Re: Can we automatically add elapsed times to tap test log?

Hi,

On 2022-04-01 10:21:50 -0700, Andres Freund wrote:
> right now I am looking at a test added in the shmstats patch that's slow on
> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
> to figure out where things hang.

This turns out to be a problem somewhere in the tap testing infrastructure,
rather than the test itself. The slow thing wasn't anything the test did. All
the time is spent in an is().  To verify that, I added a bunch of

ok(1, "this is some long output to test a theory");
print_time();

a few tests before the slow test. And:

ok 7 - this is some long output to test a theory
# test theory 1: 0.000 sec
ok 8 - this is some long output to test a theory
# test theory 2: 0.000 sec
ok 9 - this is some long output to test a theory
# test theory 3: 40.484 sec
ok 10 - this is some long output to test a theory
# test theory 4: 0.001 sec
ok 11 - this is some long output to test a theory
# test theory 5: 0.000 sec

The problem also vanishes when running tests without PROVE_FLAGS=-j$something

What this looks like to me is that when running tests concurrently, the buffer
of the file descriptor used to report tap test output fills up. The blocked
test can only progress once prove gets around to reading from that fd,
presumably when another test finishes.

Gah. I want my time back.

I can't reproduce a similar issue on linux. But of course I'm using a newer
perl, and it's likely a timing dependent issue, so it's not guaranteed to be a
windows problem. But ...

Greetings,

Andres Freund

Re: Can we automatically add elapsed times to tap test log?

2022-04-01 Thread Andrew Dunstan



On 4/1/22 13:44, Nathan Bossart wrote:
> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
>> right now I am looking at a test added in the shmstats patch that's slow on
>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
>> to figure out where things hang.
>>
>> I've hit this several times before. Of course it's not too hard to hack up
>> something printing elapsed time. But ISTM that it'd be better if we were able
>> to prefix the logging into regress_log_* with something like
>> [timestamp + time since start of test]
>> or
>> [timestamp + time since start of test + time since last log message]
>>
>>
>> This isn't just useful to figure out what parts of test are slow, but also
>> helps correlate server logs with the regress_log_* output. Which right now is
>> hard and inaccurate, requiring manually correlating statements between server
>> log and the tap test (often there's no logging for statements in the
>> regress_log_*).
> +1
>


Maybe one way would be to make a change in
src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
just be to add a timestamp, the other things would involve a bit more
bookkeeping. It should also be checked to make sure it doesn't add too
much overhead, although I would be surprised if it did.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: should vacuum's first heap pass be read-only?

2022-04-01 Thread Peter Geoghegan

On Fri, Apr 1, 2022 at 11:39 AM Robert Haas  wrote:
> So I'm completely confused here. If we always start a vacuum with
> lazy_scan_heap(), as you said you wanted, then we will not save any
> heap scanning.

The term "start a VACUUM" becomes ambiguous with the conveyor belt.

What I was addressed in a nearby email back in February [1] was the
idea of doing heap vacuuming of the last run (or several runs) of dead
TIDs on top of heap pruning to create the next run/runs of dead TIDs.

> What am I missing?

There is a certain sense in which we are bound to always "start a
vacuum" in lazy_scan_prune(), with any design based on the current
one. How else are we ever going to make a basic initial determination
about which heap LP_DEAD items need their TIDs deleted from indexes,
sooner or later? Obviously that information must always have
originated in lazy_scan_prune (or in lazy_scan_noprune).

With the conveyor belt, and a non-HOT-update heavy workload, we'll
eventually need to exhaustively do index vacuuming of all indexes
(even those that don't need it for their own sake) to make it safe to
remove heap line pointer bloat (to set heap LP_DEAD items to
LP_UNUSED). This will happen least often of all, and is the one
dependency conveyor belt can't help with.

To answer your question: when heap vacuuming does finally happen, we
at least don't need to call lazy_scan_prune for any pages first
(neither the pages we're vacuuming, nor any other heap pages). Plus
the decision to finally clean up line pointer bloat can be made based
on known facts about line pointer bloat, without tying that to other
processing done by lazy_scan_prune() -- so there's greater separation
of concerns.

That having been said...maybe it would make sense to also call
lazy_scan_prune() right after these relatively rare calls to
lazy_vacuum_heap_page(), opportunistically (since we already dirtied
the page once). But that would be an additional optimization, at best; it
wouldn't be the main way that we call lazy_scan_prune().

[1] 
https://www.postgresql.org/message-id/CAH2-WzmG%3D_vYv0p4bhV8L73_u%2BBkd0JMWe2zHH333oEujhig1g%40mail.gmail.com
--
Peter Geoghegan

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Apr 1, 2022 at 12:22 AM Kyotaro Horiguchi
 wrote:
> By the way, may I ask how do we fix this?  The existing recovery code
> already generates just-to-be-delete files in a real directory in
> pg_tblspc sometimes, and elsewise skip applying WAL records on
> nonexistent heap pages.  It is the "mixed" way.

Can you be more specific about where we have each behavior now?

> 1. stop XLogReadBufferForRedo creating a file in nonexistent
>   directories then remember the failure (I'm not sure how big the
>   impact is.)
>
> 2. unconditionally create all objects required for recovery to proceed..
>   2.1 and igore the failures.
>   2.2 and remember the failures.
>
> 3. Any other?
>
> 2 needs to create a real directory in pg_tblspc. So 1?

I think we could either do 1 or 2. My intuition is that getting 2
working would be less scary and more likely to be something we would
feel comfortable back-patching, but 1 is probably a better design in
the long term. However, I might be wrong -- that's just a guess.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 2:27 PM Peter Geoghegan  wrote:
> I'm not following. It seems like you're saying that the ability to
> vacuum indexes on their own schedule (based on their own needs) is not
> sufficiently compelling. I think it's very compelling, with enough
> indexes (and maybe not very many).
>
> The conveyor belt doesn't just save I/O from repeated scanning of the
> heap. It may also save on repeated pruning (or just dirtying) of the
> same heap pages again and again, for very little benefit.

I'm also not following. In order to get that benefit, we would have to
sometimes decide to not perform lazy_scan_heap() at the startup of a
vacuum. And in this email I asked you whether it was your idea that we
should always start a vacuum operation with lazy_scan_heap(), and you
said "yes":

http://postgr.es/m/ca+tgmoa6kveeurtyeoi3a+ra2xuynwqmj_s-h4kun6-bkmm...@mail.gmail.com

So I'm completely confused here. If we always start a vacuum with
lazy_scan_heap(), as you said you wanted, then we will not save any
heap scanning.

What am I missing?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: PostgreSQL shutdown modes

Isn't this missing support in pg_dumb ?

Re: should vacuum's first heap pass be read-only?

2022-04-01 Thread Peter Geoghegan

On Fri, Apr 1, 2022 at 11:04 AM Robert Haas  wrote:
> I guess you're right, and it's actually a little bit better than that,
> because even if the data does fit into shared memory, we'll have to
> pass fewer TIDs to the worker to be removed from the heap, which might
> save a few CPU cycles. But I think both of those are very small
> benefits.

I'm not following. It seems like you're saying that the ability to
vacuum indexes on their own schedule (based on their own needs) is not
sufficiently compelling. I think it's very compelling, with enough
indexes (and maybe not very many).

The conveyor belt doesn't just save I/O from repeated scanning of the
heap. It may also save on repeated pruning (or just dirtying) of the
same heap pages again and again, for very little benefit.

Imagine an append-only table where 1% of transactions that insert are
aborts. You really want to be able to constantly VACUUM such a table,
so that its pages are proactively frozen and set all-visible in the
visibility map -- it's not that different to a perfectly append-only
table, without any garbage tuples. And so it would be very useful if
we could delay index vacuuming for much longer than the current 2% of
rel_pages heuristics seems to allow.

That heuristic has to conservatively assume that it might be some time
before the next vacuum is launched, and has the opportunity to
reconsider index vacuuming. What if it was a more or less independent
question instead? To put it another way, it would be great if the
scheduling code for autovacuum could make inferences about what
general strategy works best for a given table over time. In order to
be able to do that sensibly, the algorithm needs more context, so that
it can course correct without paying much of a cost for being wrong.

-- 
Peter Geoghegan

Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 12:08 AM Dilip Kumar  wrote:
> After thinking more about this I see there is some value of
> remembering the dead tids in the conveyor belt.  Basically, the point
> is if there are multiple indexes and we do the index vacuum for some
> of the indexes and skip for others.  And now when we again do the
> complete vacuum cycle that time we will again get all the old dead
> tids + the new dead tids but without conveyor belt we might need to
> perform the multiple cycle of the index vacuum even for the indexes
> for which we had done the vacuum in previous vacuum cycle (if all tids
> are not fitting in maintenance work mem). But with the conveyor belt
> we remember the conveyor belt pageno upto which we have done the index
> vacuum and then we only need to do vacuum for the remaining tids which
> will definitely reduce the index vacuuming passes, right?

I guess you're right, and it's actually a little bit better than that,
because even if the data does fit into shared memory, we'll have to
pass fewer TIDs to the worker to be removed from the heap, which might
save a few CPU cycles. But I think both of those are very small
benefits. If that's all we're going to do with the conveyor belt
infrastructure, I don't think it's worth the effort. I am completely
in agreement with Peter's comments to the effect that the goal here is
to create flexibility, but it feels to me like the particular
development plan we discussed back in late February isn't going to
create enough flexibility to make it worth the effort it takes. It
seems to me we need to find a way to do better than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-01 Thread Peter Geoghegan

On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan  wrote:
> The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".

Attached is v15. I plan to commit the first two patches (the most
substantial two patches by far) in the next couple of days, barring
objections.

v15 removes this "Assert(diff > 0)" assertion from 0001. It's not
adding any value, now that the underlying issue that it accidentally
brought to light is well understood (there are still more robust
assertions to the relfrozenxid/relminmxid invariants). "Assert(diff >
0)" is liable to fail until the underlying bug on HEAD is fixed, which
can be treated as separate work.

I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14). Also fixed a compiler warning in this patch, based on
a complaint from CFBot's CompilerWarnings task. I can delay commiting
this WARNING patch until right before feature freeze. Seems best to
give others more opportunity for comments.

-- 
Peter Geoghegan


v15-0003-Have-VACUUM-warn-on-relfrozenxid-from-the-future.patch
Description: Binary data


v15-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data


v15-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


v15-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data

Re: Can we automatically add elapsed times to tap test log?

On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
> right now I am looking at a test added in the shmstats patch that's slow on
> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
> to figure out where things hang.
> 
> I've hit this several times before. Of course it's not too hard to hack up
> something printing elapsed time. But ISTM that it'd be better if we were able
> to prefix the logging into regress_log_* with something like
> [timestamp + time since start of test]
> or
> [timestamp + time since start of test + time since last log message]
> 
> 
> This isn't just useful to figure out what parts of test are slow, but also
> helps correlate server logs with the regress_log_* output. Which right now is
> hard and inaccurate, requiring manually correlating statements between server
> log and the tap test (often there's no logging for statements in the
> regress_log_*).

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: unlogged sequences

On Fri, Apr 1, 2022 at 12:31 PM Peter Eisentraut
 wrote:
> > An "internal" dependency means that the internal object shouldn't really
> > be operated on directly.  (In some cases it's allowed for convenience.)
> > So I think in that case the sequence must follow the table's persistence
> > in all cases.  This is accomplished by setting the initial persistence
> > to the table's, making ALTER TABLE propagate persistence changes, and
> > prohibiting direct ALTER SEQUENCE SET.
>
> But to make pg_upgrade work for identity sequences of unlogged tables,
> we need to allow ALTER SEQUENCE ... SET LOGGED on such sequences.  Which
> I guess is not a real problem in the end.

And I think also SET UNLOGGED, since it would be weird IMHO to make
such a change irreversible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

PostgreSQL shutdown modes

Hi,

I think it's pretty evident that the names we've chosen for the
various PostgreSQL shutdown modes are pretty terrible, and maybe we
should try to do something about that. There is nothing "smart" about
a smart shutdown. The usual result of attempting a smart shutdown is
that the server never shuts down at all, because typically there are
going to be some applications using connections that are kept open
more or less permanently. What ends up happening when you attempt a
"smart" shutdown is that you've basically put the server into a mode
where you're irreversibly committed to accepting no new connections,
but because you have a connection pooler or something that keeps
connections open forever, you never shut down either. It is in effect
a denial-of-service attack on the database you're supposed to be
administering.

Similarly, "fast" shutdowns are not in any way fast. It is pretty
common for a fast shutdown to take many minutes or even tens of
minutes to complete. This doesn't require some kind of extreme
workload to hit; I've run into it during casual benchmarking runs.
It's very easy to have enough dirty data in shared buffers, or enough
dirty in the operating system cache that will have to be fsync'd in
order to complete the shutdown checkpoint, to make things take an
extremely long time. In some ways, this is an even more effective
denial-of-service attack than a smart shutdown. True, the database
will at some point actually finish shutting down, but in the meantime
not only will we not accept new connections but we'll evict all of the
existing ones. Good luck maintaining five nines of availability if
waiting for a clean shutdown to complete is any part of the process.
It might be smarter to initiate a regular (non-shutdown) checkpoint
first, without cutting off connections, and then when that finishes,
proceed as we do now. The second checkpoint will complete a lot
faster, so while the overall operation still won't be fast, at least
we'd be refusing connections for a shorter period of time before the
system is actually shut down and you can do whatever maintenance you
need to do.

"immediate" shutdowns aren't as bad as the other two, but they're
still bad. One of the big problems is that I encounter in this area is
that Oracle uses the name "immediate" shutdown to mean a normal
shutdown with a checkpoint allowing for a clean restart. Users coming
from Oracle are sometimes extremely surprised to discover that an
immediate shutdown is actually a server crash that will require
recovery. Even if you don't come from Oracle, there's really nothing
about the name of this shutdown mode that intrinsically makes you
understand that it's something you should do only as a last resort.
Who doesn't like things that are immediate? The problem with this
theory is that you make the shutdown quicker at the price of startup
becoming much, much slower, because the crash recovery is very likely
going to take a whole lot longer than the shutdown checkpoint would
have done.

I attach herewith a modest patch to rename these shutdown modes to
more accurately correspond to their actual characteristics.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Give-our-various-shutdown-types-more-appropriate-.patch
Description: Binary data

Can we automatically add elapsed times to tap test log?

Hi,

right now I am looking at a test added in the shmstats patch that's slow on
CI, on windows only. Unfortunately the regress_log_* output is useless as-is
to figure out where things hang.

I've hit this several times before. Of course it's not too hard to hack up
something printing elapsed time. But ISTM that it'd be better if we were able
to prefix the logging into regress_log_* with something like
[timestamp + time since start of test]
or
[timestamp + time since start of test + time since last log message]


This isn't just useful to figure out what parts of test are slow, but also
helps correlate server logs with the regress_log_* output. Which right now is
hard and inaccurate, requiring manually correlating statements between server
log and the tap test (often there's no logging for statements in the
regress_log_*).

Greetings,

Andres Freund

Re: GSoC: pgmoneta: Write-Ahead Log (WAL) infrastructure (2022)

2022-04-01 Thread Jesper Pedersen


Hi Jinlang,

On 4/1/22 12:39, Jinlang Wang wrote:

To whom it may concern:

Here is my proposal 
(https://docs.google.com/document/d/1KKKDU6iP0GOkAMSdGRyxFJRgW964JFVROnpKkbzWyNw/edit?usp=sharing
 
)
 for GSoC. Please check!



Thanks for your proposal to Google Summer of Code 2022 !


We'll follow up off-list to get this finalized.


Best regards,

 Jesper

GSoC: pgmoneta: Write-Ahead Log (WAL) infrastructure (2022)

2022-04-01 Thread Jinlang Wang

To whom it may concern:

Here is my proposal 
(https://docs.google.com/document/d/1KKKDU6iP0GOkAMSdGRyxFJRgW964JFVROnpKkbzWyNw/edit?usp=sharing
 
)
 for GSoC. Please check! 

Thanks,
Jinlang Wang

Re: unlogged sequences

2022-04-01 Thread David G. Johnston

On Fri, Apr 1, 2022 at 9:31 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 01.04.22 18:22, Peter Eisentraut wrote:
> >
> > On 01.04.22 00:43, Tomas Vondra wrote:
> >> Hmm, so what about doing a little bit different thing:
> >>
> >> 1) owned sequences inherit persistence of the table by default
> >>
> >> 2) allow ALTER SEQUENCE to change persistence for all sequences (no
> >> restriction for owned sequences)
> >>
> >> 3) ALTER TABLE ... SET [UN]LOGGED changes persistence for sequences
> >> matching the initial table persistence
> >
> > Consider that an identity sequence creates an "internal" dependency and
> > a serial sequence creates an "auto" dependency.
> >
> > An "internal" dependency means that the internal object shouldn't really
> > be operated on directly.  (In some cases it's allowed for convenience.)
> > So I think in that case the sequence must follow the table's persistence
> > in all cases.  This is accomplished by setting the initial persistence
> > to the table's, making ALTER TABLE propagate persistence changes, and
> > prohibiting direct ALTER SEQUENCE SET.
>
> But to make pg_upgrade work for identity sequences of unlogged tables,
> we need to allow ALTER SEQUENCE ... SET LOGGED on such sequences.  Which
> I guess is not a real problem in the end.
>

Indeed, we need the syntax anyway.  We can constrain it though, and error
when trying to make them different but allow making them the same.  To
change a table's persistence you have to then change the table first -
putting them back into different states - then sync up the sequence again.

David J.

Re: unlogged sequences

2022-04-01 Thread David G. Johnston

On Fri, Apr 1, 2022 at 9:22 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

>
> On 01.04.22 00:43, Tomas Vondra wrote:
> > Hmm, so what about doing a little bit different thing:
> >
> > 1) owned sequences inherit persistence of the table by default
> >
> > 2) allow ALTER SEQUENCE to change persistence for all sequences (no
> > restriction for owned sequences)
> >
> > 3) ALTER TABLE ... SET [UN]LOGGED changes persistence for sequences
> > matching the initial table persistence
>
> Consider that an identity sequence creates an "internal" dependency and
> a serial sequence creates an "auto" dependency.
>
> An "internal" dependency means that the internal object shouldn't really
> be operated on directly.  (In some cases it's allowed for convenience.)
> So I think in that case the sequence must follow the table's persistence
> in all cases.  This is accomplished by setting the initial persistence
> to the table's, making ALTER TABLE propagate persistence changes, and
> prohibiting direct ALTER SEQUENCE SET.
>
> An "auto" dependency is looser, so manipulating both objects
> independently can be allowed.  In that case, I would do (1), (2), and (3).
>
> (I think your (3) is already the behavior in the patch, since there are
> only two persistence levels in play at that point.)
>

I would support having a serial sequence be allowed to be changed
independently while an identity sequence is made to match the table it is
owned by.  Older version restores would produce a logged serial sequence
(since the sequence is independently created and then attached to the
table) on unlogged tables but since identity sequences are only even
implicitly created they would become unlogged as part of the restore.
Though I suspect that pg_upgrade will need to change them explicitly.

I would support all owned sequences as well, but that seems unreachable at
the moment.

David J.

Re: unlogged sequences


On 01.04.22 18:22, Peter Eisentraut wrote:


On 01.04.22 00:43, Tomas Vondra wrote:

Hmm, so what about doing a little bit different thing:

1) owned sequences inherit persistence of the table by default

2) allow ALTER SEQUENCE to change persistence for all sequences (no
restriction for owned sequences)

3) ALTER TABLE ... SET [UN]LOGGED changes persistence for sequences
matching the initial table persistence


Consider that an identity sequence creates an "internal" dependency and 
a serial sequence creates an "auto" dependency.


An "internal" dependency means that the internal object shouldn't really 
be operated on directly.  (In some cases it's allowed for convenience.) 
So I think in that case the sequence must follow the table's persistence 
in all cases.  This is accomplished by setting the initial persistence 
to the table's, making ALTER TABLE propagate persistence changes, and 
prohibiting direct ALTER SEQUENCE SET.


But to make pg_upgrade work for identity sequences of unlogged tables, 
we need to allow ALTER SEQUENCE ... SET LOGGED on such sequences.  Which 
I guess is not a real problem in the end.

Re: unlogged sequences




On 01.04.22 00:43, Tomas Vondra wrote:

Hmm, so what about doing a little bit different thing:

1) owned sequences inherit persistence of the table by default

2) allow ALTER SEQUENCE to change persistence for all sequences (no
restriction for owned sequences)

3) ALTER TABLE ... SET [UN]LOGGED changes persistence for sequences
matching the initial table persistence


Consider that an identity sequence creates an "internal" dependency and 
a serial sequence creates an "auto" dependency.


An "internal" dependency means that the internal object shouldn't really 
be operated on directly.  (In some cases it's allowed for convenience.) 
So I think in that case the sequence must follow the table's persistence 
in all cases.  This is accomplished by setting the initial persistence 
to the table's, making ALTER TABLE propagate persistence changes, and 
prohibiting direct ALTER SEQUENCE SET.


An "auto" dependency is looser, so manipulating both objects 
independently can be allowed.  In that case, I would do (1), (2), and (3).


(I think your (3) is already the behavior in the patch, since there are 
only two persistence levels in play at that point.)


I wanted to check if you can have a persistent sequence owned by a temp 
table, but that is rejected because both sequence and table must be in 
the same schema.  So the sequence owned-by schema does insist on some 
tight coupling.  So for example, once a sequence is owned by a table, 
you can't move it around or change the ownership.

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-01 Thread Julien Rouhaud

Hi,

On Fri, Apr 01, 2022 at 11:38:52AM -0400, Greg Stark wrote:
> FYI this has a compiler warning showing up on the cfbot:
>
> [13:19:51.544] pg_stat_statements.c: In function ‘entry_reset’:
> [13:19:51.544] pg_stat_statements.c:2598:32: error:
> ‘minmax_stats_reset’ may be used uninitialized in this function
> [-Werror=maybe-uninitialized]
> [13:19:51.544] 2598 | entry->minmax_stats_since = minmax_stats_reset;
> [13:19:51.544] | ~~^~~~
>
> If the patch is otherwise ready to commit then this is an issue that
> should be fixed before marking it ready to commit.
>
> Given that this is the last week before feature freeze it'll probably
> get moved to a future commitfest unless it's ready to commit.

As I mentioned in my last review I think feature wise the patch is ok, it just
needed a few minor changes.  It's a small patch but can help *a lot* tools on
top of pg_stat_statements and give users a better overview of their workload so
it would be nice to commit it in v15.

I was busy looking at the prefetch patch today (not done yet), but I plan to
review the last version over the weekend.  After a quick look at the patch it
seems like a compiler bug.  I'm not sure which clang version is used, but can't
reproduce it locally using clang 13.  I already saw similar false positive,
when a variable is initialized in a branch (here minmax_only == true), and only
then used in similar branches.  I guess that pg_stat_statement_reset() is so
expensive that an extra gettimeofday() wouldn't change much.  Otherwise
initializing to NULL should be enough.

Re: Add non-blocking version of PQcancel

2022-04-01 Thread Jelte Fennema

Attached is the latest version of this patch, which I think is now in a state
in which it could be merged. The changes are:

1. Don't do host and address discovery for cancel connections. It now
reuses raddr and whichhost from the original connection. This makes
sure the cancel always goes to the right server, even when DNS records
change or another server would be chosen now in case of connnection
strings containing multiple hosts.
2. Fix the windows CI failure. This is done by both using the threadsafe code
in the the dblink cancellation code, and also by not erroring a cancellation
connection on windows in case of any errors. This last one is to work around
the issue described in this thread:

https://www.postgresql.org/message-id/flat/90b34057-4176-7bb0-0dbb-9822a5f6425b%40greiz-reinsdorf.de

I also went over most of the functions that take a PGconn, to see if they needed
extra checks to guard against being executed on cancel. So far all seemed fine,
either they should be okay to execute against a cancellation connection, or
they failed already anyway because a cancellation connection never reaches
the CONNECTION_OK state. So I didn't add any checks specifically for cancel
connections. I'll do this again next week with a fresh head, to see if I
haven't
missed any cases.

I'll try to find some time early next week to implement non-blocking
cancellation
usage in postgres_fdw, i.e. the bonus task I mentioned in my previous email.
But
I don't think it's necessary to have that implemented before merging.

0002-Add-non-blocking-version-of-PQcancel.patch
Description: 0002-Add-non-blocking-version-of-PQcancel.patch

Re: Restructure ALTER TABLE notes to clarify table rewrites and verification scans

On Fri, 1 Apr 2022 at 16:10, James Coleman  wrote:
>
> On Thu, Mar 31, 2022 at 10:58 AM Matthias van de Meent
>  wrote:
> >
> > On Tue, 29 Mar 2022 at 16:20, James Coleman  wrote:
> > >
> > > Over in the "Document atthasmissing default optimization avoids
> > > verification table scan" thread David Johnston (who I've cc'd)
> > > suggested that my goals might be better implemented with a simple
> > > restructuring of the Notes section of the ALTER TABLE docs. I think
> > > this is also along the lines of Tom Lane's suggestion of a "unified
> > > discussion", but I've chosen for now (and simplicity's sake) not to
> > > break this into an entirely new page. If reviewers feel that is
> > > warranted at this stage, I can do that, but it seems to me that for
> > > now this improves the structure and sets us up for such a future page
> > > but falls short of sufficient content to move into its own page.
> > >
> > > One question on the changes: the current docs say "when attaching a
> > > new partition it may be scanned to verify that existing rows meet the
> > > partition constraint". The word "may" there seems to suggest there may
> > > also be occasions where scans are not needed, but no examples of such
> > > cases are present. I'm not immediately aware of such a case. Are these
> > > constraints always validated? If not, in which cases can such a scan
> > > be skipped?
> > >
> > > I've also incorporated the slight correction in "Correct docs re:
> > > rewriting indexes when table rewrite is skipped" [2] here, and will
> > > rebase this patch if that gets committed.
> >
> > See comments in that thread.
>
> Rebased since that thread has now resulted in a committed patch.
>
> > > +Changing the type of an existing column will require the entire 
> > > table and its
> > > +indexes to be rewritten. As an exception, if the 
> > > USING clause
> > > +does not change the column contents and the old type is either 
> > > binary coercible
> > > +to the new type or an unconstrained domain over the new type, a 
> > > table rewrite is
> > > +not needed.
> >
> > This implies "If the old type is [...] an unconstrained domain over
> > the new type, a table rewrite is not needed.", which is the wrong way
> > around.
> >
> > I'd go with something along the lines of:
> >
> > +Changing the type of an existing column will require the entire table 
> > to be
> > +rewritten, unless the USING clause is only a
> > binary coercible
> > +cast, or if the new type is an unconstrained
> > DOMAIN over the
> > +old type.
>
> That language is actually unchanged from the existing docs; is there
> an error in the existing docs you're seeing? I'm actually imagining
> that it can probably got either way -- from unconstrained domain over
> new type to new type or from old type to unconstrained domain over old
> type.

CREATE DOMAIN constrained AS text NOT NULL;
CREATE DOMAIN unconstrained_on_constrained AS constrained;

CREATE TABLE tst (col unconstrained_on_constrained);
ALTER TABLE tst ALTER COLUMN col TYPE constrained; -- table scan

Moving from an unconstrained domain over a constrained domain means
that we still do the table scan. Domain nesting is weird in that way.

> > That would drop the reference to index rebuilding; but that should be
> > covered in other parts of the docs.
>
> Part of the whole point of this restructuring is to make both of those
> clear; I think we should retain the comments about indexes.

OK; I mentioned it because table rewrite also implies index rewrite;
assuming this is correctly referenced in other parts of the docs.

> > > +The following alterations of the table require the entire table, and 
> > > in some
> > > +cases its indexes as well, to be rewritten.
> >
> > It is impossible to rewrite the table without at the same time also
> > rewriting the indexes; as the location of tuples changes and thus
> > previously generated indexes will become invalid. At the same time;
> > changes to columns might not require a table rewrite, while still
> > requiring the indexes to be rewritten. I suggest changing the order of
> > "table" and "index", or dropping the clause.
>
> Ah, that's a good point. I've rewritten that part.
>
> > > +[...] For a large table such a rewrite
> > > +may take a significant amount of time and will temporarily require 
> > > as much as
> > > +double the disk space.
> >
> > I'd replace the will with could. Technically, this "double the disk
> > space" could be even higher than that; due to index rebuilds taking up
> > to 3x normal space (one original index which is only dropped at the
> > end, one sorted tuple store for the rebuild, and one new index).
>
> That's also the existing language, but I agree it seems a bit overly
> precise (and in the process probably incorrect). There's a lot of
> complexity here: depending on the type change (and USING clause!) and
> table width it could be even more than 3x. I've reworded to try to
> capture wh

Re: Temporary tables versus wraparound... again

On Thu, 31 Mar 2022 at 16:05, Greg Stark  wrote:
>
> I haven't wrapped my head around multixacts yet. It's complicated by
> this same codepath being used for truncates of regular tables that
> were created in the same transaction.

So my best idea so far is to actually special-case the temp table case
in this code path. I think that's easy enough since I have the heap
tuple I'm about to replace.

In the temp table case I would just use the value Andres proposes.

In the "truncating table in same transaction it was created" case then
I would go ahead and use the expensive GetOldestMultiXactId() which
should be ok for that case. At least I think the "much higher rate"
comment was motivated by the idea that every transaction commit (when
temp tables are being used) is more frequent than any specific user
ddl.

It's not brilliant since it seems to be embedding knowledge of the
cases where this optimization applies in a lower level function. If we
think of some other case where it could apply it wouldn't be obvious
that it will have a cost to it. But it doesn't seem too terrible to
me.

An alternative would be to simply not adjust relminmxid for non-temp
tables at all. I guess that's not too bad either since these are
non-temp tables that autovacuum will be able to do anti-wraparound
vacuums on. And I get the impression mxids don't wraparound nearly as
often as xids?

-- 
greg

Patches with failing tests in Commitfest

So far I've been moving patches with failing tests to Waiting on
Author but there are a number with "minor" failures or failures which
look unrelated to the patch. There are 20 patches with at least one
failing test in Ready for Comitter (2) and Needs Review (18).

Here's a summary of the reasons for these failures. Mostly they're
bitrot due to other patches but there are a few test failures and one
where Cirrus is just serving up a blank page...


Ready for Committer:

37/3496 Avoid erroring out when unable to remove or parse logical
rewrite files to save ...

[14:12:40.557] t/013_crash_restart.pl (Wstat: 256 Tests: 18 Failed: 1)
[14:12:40.557] Failed test: 5

37/3574 Frontend error logging style

Patch bitrot: plan is to let this sit till the end of the commitfest,
then rebase and push

Needs Review:

37/3374 Allows database-specific role memberships

Patch has bitrotted. It does seem a bit unfair to ask authors to keep
rebasing their patches and then not review them until they bitrot
again  I guess this is a downside of concentrating a lot of
commits in the CF.

37/3506 CREATEROLE and role ownership hierarchies

Thread digression was committed but main patch is still pending. cfbot
is testing the digression.

37/3500 Collecting statistics about contents of JSONB columns

Patch bitrotted, presumably due to the other JSON patchset. I moved it
to Waiting on Author but I think it's mainly in need of design review
so it could still be useful to look at it.

37/2433 Erase the distinctClause if the result is unique by definition

Patch has bitrotted. But there are multiple versions of this patchset
with different approaches and there is a lot of divergence of opinions
on which way to go. I'm inclined to mark this "returned with feedback"
and suggest to start with fresh threads on individual parts of this
patch.

37/2266 Fix up partitionwise join on how equi-join conditions between
the partition keys...

[08:10:46.390] # poll_query_until timed out executing this query:
[08:10:46.390] # SELECT '0/13AC5E80' <= replay_lsn AND state = 'streaming'
[08:10:46.390] # FROM pg_catalog.pg_stat_replication
[08:10:46.390] # WHERE application_name IN ('standby_1', 'walreceiver')

37/2957 Identify missing publications from publisher while
create/alter subscription.

It looks like this was committed, yay! Please remember to update the
CF app when committing patches. It's sometimes hard for me to tell if
it was only part of a patch or a related fix discovered in discussing
a patch that was committed or if the commit resolves the entry.

37/2218 Implement INSERT SET syntax

Tom: "This patch has been basically ignored for a full two years now
(Remarkably, it's still passing in the cfbot.)"

Well it has bitrotted in gram.y now :(

37/2138 Incremental Materialized View Maintenance

bitrotted in trigger.c. This is a huge patchset and probably hard to
keep rebased and I think still looking for more fundamental design
review.

37/3071 Lazy JIT IR code generation to increase JIT speed with partitions

[00:01:20.167] t/013_partition.pl (Wstat: 7424 Tests: 31 Failed: 0)
[00:01:20.167] Non-zero exit status: 29
[00:01:20.167] Parse errors: No plan found in TAP output

37/3181 Map WAL segment files on PMEM as WAL buffers

[19:38:40.880] configure: error: library 'libpmem' (version >= 1.5) is
required for PMEM support

Not sure if this is just the expected output if a platform lacks this
library or if the tests need to be adjusted to add a configure option.

37/3052 Merging statistics from children instead of re-sampling everything

Uh, Cirrus is just giving me a blank page on this one. No idea what's
going on here.

37/3490 Pluggable toaster

An updated patch has been posted since the failure

37/1712 Remove self join on a unique column

Patch bitrotted in a regression test expected output

37/3433 Removing more vacuumlazy.c special cases, relfrozenxid optimizations

[06:50:33.148] vacuum.c: In function ‘vac_update_relstats’:
[06:50:33.148] vacuum.c:1481:6: error: ‘oldrelminmxid’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
[06:50:33.148] 1481 | errmsg_internal("overwrote invalid
pg_class.relminmxid value %u with new value %u in table \"%s\"",
[06:50:33.148] | ^~~
[06:50:33.148] vacuum.c:1475:6: error: ‘oldrelfrozenxid’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
[06:50:33.148] 1475 | errmsg_internal("overwrote invalid
pg_class.relfrozenxid value %u with new value %u in table \"%s\"",
[06:50:33.148] | ^~~


37/2901 SQL/JSON: functions

Patchset is mostly applied so the cfbot is seeing patch conflicts.

37/3589 Shared memory based stats collector

cfbot is trying to apply a digression patch rather than the main patch
(which is up to version 67! and has 17 parts)

37/3546 Support custom authentication methods using hooks

Patch bitrot in hba.c. But the discussion is mainly about the design
and whether it's "the right way to go" anyways.

37/3048 pg_stat_state

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

FYI this has a compiler warning showing up on the cfbot:


[13:19:51.544] pg_stat_statements.c: In function ‘entry_reset’:
[13:19:51.544] pg_stat_statements.c:2598:32: error:
‘minmax_stats_reset’ may be used uninitialized in this function
[-Werror=maybe-uninitialized]
[13:19:51.544] 2598 | entry->minmax_stats_since = minmax_stats_reset;
[13:19:51.544] | ~~^~~~

If the patch is otherwise ready to commit then this is an issue that
should be fixed before marking it ready to commit.

Given that this is the last week before feature freeze it'll probably
get moved to a future commitfest unless it's ready to commit.

Re: Removing unneeded self joins

Sigh. And now there's a patch conflict in a regression test expected
output: sysviews.out

Please rebase. Incidentally, make sure to check the expected output is
actually correct. It's easy to "fix" an expected output to
accidentally just memorialize an incorrect output.

Btw, it's the last week before feature freeze so time is of the essence.

Re: [PATCH] Accept IP addresses in server certificate SANs

2022-04-01 Thread Jacob Champion

On Fri, 2022-04-01 at 16:07 +0200, Peter Eisentraut wrote:
> I have committed this.
> 
> I have removed the inet header refactoring that you had.  That wasn't
> necessary, since pg_inet_net_ntop() can use the normal AF_INET*
> constants.  The PGSQL_AF_INET* constants are only for the internal
> storage of the inet/cidr types.
> 
> I have added a configure test for inet_pton().  We can check in the
> build farm if it turns out to be necessary.

Thanks!

--Jacob

Re: Per-table storage parameters for TableAM/IndexAM extensions

This patch still has code warnings on the cfbot and I don't think
they're platform-specific:

[00:28:28.236] gram.y: In function ‘base_yyparse’:
[00:28:28.236] gram.y:2851:58: error: passing argument 2 of
‘makeDefElemExtended’ from incompatible pointer type
[-Werror=incompatible-pointer-types]
[00:28:28.236] 2851 | $$ = makeDefElemExtended(NULL, $2, NULL,
DEFELEM_DROP, @2);
[00:28:28.236] | ~^~~
[00:28:28.236] | |
[00:28:28.236] | DefElem *
[00:28:28.236] In file included from gram.y:58:
[00:28:28.236] ../../../src/include/nodes/makefuncs.h:102:60: note:
expected ‘char *’ but argument is of type ‘DefElem *’
[00:28:28.236] 102 | extern DefElem *makeDefElemExtended(char
*nameSpace, char *name, Node *arg,
[00:28:28.236] | ~~^~~~

I gather the patch is still a WIP and ideally we would want to give
feedback on patches in CFs when the author's are looking for it but
this is the last week before feature freeze and the main focus is on
committable patches. I'll move it to next CF.

Re: [PATCH] src/interfaces/libpq/Makefile: fix pkg-config without openssl


On 31.03.22 18:37, Fabrice Fontaine wrote:

Do not add openssl dependencies to libpq pkg-config file if openssl is
disabled to avoid the following build failure with libdbi-drivers raised
since commit beff361bc1edc24ee5f8b2073a1e5e4c92ea66eb:

configure: error: Package requirements (libpq) were not met:

Package 'libssl', required by 'libpq', not found
Package 'libcrypto', required by 'libpq', not found

Fixes:
  
-http://autobuild.buildroot.org/results/415cb61a58b928a42623ed90b0b60c59032f0a4e

Signed-off-by: Fabrice Fontaine


Fixed, thanks.

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

On Fri, 1 Apr 2022 at 10:50, Matthias van de Meent
 wrote:
>
> On Fri, 1 Apr 2022 at 07:38, Michael Paquier  wrote:
> >
> > On Thu, Mar 31, 2022 at 12:09:35PM +0200, Matthias van de Meent wrote:
> > > PageInit MAXALIGNs the size of the special area that it receives as an
> > > argument; so any changes to the page header that would misalign the
> > > value would be AM-specific; in which case it is quite unlikely that
> > > this is the right accessor for your page's special area.
> >
> > Right.  I'd still be tempted to keep that per-AM rather than making
> > the checks deeper with one extra macro layer in the page header or
> > with a different macro that would depend on the opaque type, though.
> > Like in the attached, for example.

I noticed that you committed the equivalent of 0002 and 0003, thank you.

Here's a new 0001 to keep CFBot happy.

> I see. I still would like it better if the access could use this
> statically determined offset: your opaque-macros.patch doesn't fix the
> out-of-bound read/write scenariofor non-assert builds, nor does it
> remove the unneeded indirection through the page header that I was
> trying to remove.
>
> Even in assert-enabled builds; with my proposed changes the code paths
> for checking the header value and the use of the special area can be
> executed independently, which allows for parallel (pre-)fetching of
> the page header and special area, as opposed to the current sequential
> load order due to the required use of pd_special.

As an extra argument for this; it removes the need for a temporary
variable on the stack; as now all accesses into the special area can
be based on offsets off the base page pointer (which is also used
often). This would allow the compiler to utilize the available
registers more effectively.

-Matthias

PS. I noticed that between PageGetSpecialPointer and
PageGetSpecialOpaque the binary size was bigger by ~1kb for the
*Opaque variant when compiled with `CFLAGS="-o2" ./configure
--enable-debug`; but that varied a lot between builds and likely
related to binary layouts. For similar builds with `--enable-cassert`,
the *Opaque build was slimmer by ~ 50kB, which is a suprisingly large
amount.
From d9e6afdfbc021a699e68529cf37d66b8ff999dbc Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Mon, 28 Mar 2022 14:53:43 +0200
Subject: [PATCH v3] Add known-size pre-aligned special area pointer macro

This removes 1 layer of indirection for special areas of which we know the
type (=size) and location.

Special area access through the page header is an extra cache line that needs
to be fetched. If might only want to look at the special area, it is much
less effort to calculate the offset to the special area directly instead of
first checking the page header - saving one cache line to fetch.

Assertions are added to check that the page has a correctly sized special
area, and that the page is of the expected size. This detects data corruption,
instead of doing random reads/writes into the page or data allocated next to
the page being accessed.

Additionally, updates the GIN, [SP-]GIST, Hash, and BTree
[Index]PageGetOpaque macros with the new pre-aligned special area accessor.
---
 src/include/access/ginblock.h   |  3 ++-
 src/include/access/gist.h   |  3 ++-
 src/include/access/hash.h   |  3 ++-
 src/include/access/nbtree.h |  3 ++-
 src/include/access/spgist_private.h |  3 ++-
 src/include/storage/bufpage.h   | 14 ++
 6 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 9347f464f3..3680098c98 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -108,7 +108,8 @@ typedef struct GinMetaPageData
 /*
  * Macros for accessing a GIN index page's opaque data
  */
-#define GinPageGetOpaque(page) ( (GinPageOpaque) PageGetSpecialPointer(page) )
+#define GinPageGetOpaque(page) \
+	((GinPageOpaque) PageGetSpecialOpaque(page, GinPageOpaqueData))
 
 #define GinPageIsLeaf(page)( (GinPageGetOpaque(page)->flags & GIN_LEAF) != 0 )
 #define GinPageSetLeaf(page)   ( GinPageGetOpaque(page)->flags |= GIN_LEAF )
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index a3337627b8..51223e9051 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -164,7 +164,8 @@ typedef struct GISTENTRY
 	bool		leafkey;
 } GISTENTRY;
 
-#define GistPageGetOpaque(page) ( (GISTPageOpaque) PageGetSpecialPointer(page) )
+#define GistPageGetOpaque(page) \
+	((GISTPageOpaque) PageGetSpecialOpaque(page, GISTPageOpaqueData))
 
 #define GistPageIsLeaf(page)	( GistPageGetOpaque(page)->flags & F_LEAF)
 #define GIST_LEAF(entry) (GistPageIsLeaf((entry)->page))
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index da372841c4..0ed6ccc40b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -85,7 +85,8 @@ typedef struct HashPageOpaqueData
 
 typedef HashPageOpaqueData *HashPa

Re: Implementing Incremental View Maintenance

This patch has bitrotted due to some other patch affecting trigger.c.

Could you post a rebase?

This is the last week of the CF before feature freeze so time is of the essence.

Re: Add LZ4 compression in pg_dump

2022-04-01 Thread gkokolatos

On Thursday, March 31st, 2022 at 4:34 AM, Michael Paquier  
wrote:
> On Wed, Mar 30, 2022 at 03:32:55PM +, gkokola...@pm.me wrote:
> > On Wednesday, March 30th, 2022 at 7:54 AM, Michael Paquier 
> > mich...@paquier.xyz wrote:
>
> Okay. 0002 looks fine as-is, and I don't mind the extra fatal()
> calls. These could be asserts but that's not a big deal one way or
> the other. And the cleanup is now applied.

Thank you very much.

> > > + my $compress_program = $ENV{GZIP_PROGRAM};
> > > It seems to me that it is enough to rely on {compress_cmd}, hence
> > > there should be no need for $compress_program, no?
> >
> > Maybe not. We don't want to the tests to fail if the utility is not
> > installed. That becomes even more evident as more methods are added.
> > However I realized that the presence of the environmental variable does
> > not guarrantee that the utility is actually installed. In the attached,
> > the existance of the utility is based on the return value of system_log().
>
> Hmm. [.. thinks ..] The thing that's itching me here is that you
> align the concept of compression with gzip, but that's not going to be
> true once more compression options are added to pg_dump, and that
> would make $supports_compression and $compress_program_exists
> incorrect. Perhaps the right answer would be to rename all that with
> a suffix like "_gzip" to make a difference? Or would there be enough
> control with a value of "compression_gzip" instead of "compression" in
> test_key?

I understand the itch. Indeed when LZ4 is added as compression method, this
block changes slightly. I went with the minimum amount changed. Please find
in 0001 of the attached this variable renamed as $gzip_program_exist. I thought
that as prefix it will match better the already used $ENV{GZIP_PROGRAM}.

> +my $compress_program_exists = (system_log("$ENV{GZIP_PROGRAM}", '-h',
> + '>', '/dev/null') == 0);
>
> Do we need this command execution at all? In all the other tests, we
> rely on a simple "if (!defined $gzip || $gzip eq '');", so we could do
> the same here.

You are very correct that we are using the simple version, and that is what
it was included in the previous versions of the current patch. However, I
did notice that the variable is hard-coded in Makefile.global.in and it does
not go through configure. By now, gzip is considered an essential package
in most installations, and this hard-code makes sense. Though I did remove
the utility from my system, (apt remove gzip) and tried the test with the
simple "if (!defined $gzip || $gzip eq '');", which predictably failed. For
this, I went with the system call, it is not too expensive and is rather
reliable.

It is true that the rest of the TAP tests that use this, e.g. in pg_basebackup,
also failed. There is an argument to go simple and I will be happy to revert
to the previous version.

> A last thing is that we should perhaps make a clear difference between
> the check that looks at if the code has been built with zlib and the
> check for the presence of GZIP_PROGRAM, as it can be useful in some
> environments to be able to run pg_dump built with zlib, even if the
> GZIP_PROGRAM command does not exist (I don't think this can be the
> case, but other tests are flexible).

You are very correct. We do that already in the current patch. Note that we skip
the test only when we specifically have to execute a compression command. Not
all compression tests define such command, exactly so that we can test those
cases as well. The point of using an external utility program is in order to
extend the coverage in previously untested yet supported scenarios, e.g. manual
compression of the *.toc files.

Also in the case where it will actually skip the compression command because the
gzip program is not present, it will execute the pg_dump command first.

> As of now, the patch relies on
> pg_dump enforcing uncompression if building under --without-zlib even
> if --compress/-Z is used, but that also means that those compression
> tests overlap with the existing tests in this case. Wouldn't it be
> more consistent to check after $supports_compression when executing
> the dump command for test_key = "compression[_gzip]"? This would mean
> keeping GZIP_PROGRAM as sole check when executing the compression
> command.

I can see the overlap case. Yet, I understand the test_key as serving different
purpose, as it is a key of %tests and %full_runs. I do not expect the database
content of the generated dump to change based on which compression method is 
used.

In the next round, I can see one explitcly requesting --compress=none to 
override
defaults. There is a benefit to group the tests for this scenario under the same
test_key, i.e. compression.

Also there will be cases where if the program exists, yet the codebase is 
compiled
without support for the method. Then compress_cmd or the restore_cmd that 
follows
will fail. For example, in the plain output, if we try to uncompress the 
generated
the test

Re: head fails to build on SLES 12 (wal_compression=zstd)

I wrote:
> That seems like the appropriate answer to me.  I verified that we
> build cleanly and pass check-world against 1.4.0, and later I'm
> going to set up BF member longfin to use that.

Done ...

> In short, I think we should push Justin's version-check patch,
> and also fix the SGML docs to say that we require zstd >= 1.4.0.

... and done.

regards, tom lane

Re: logical decoding and replication of sequences

2022-04-01 Thread Tomas Vondra




On 3/28/22 07:29, Amit Kapila wrote:
> ...
>
> While thinking about this, I think I see a problem with the
> non-transactional handling of sequences. It seems that we will skip
> sending non-transactional sequence change if it occurs before the
> decoding has reached a consistent point but the surrounding commit
> occurs after a consistent point is reached. In such cases, the
> corresponding DMLs like inserts will be sent but sequence changes
> won't be sent. For example (this scenario is based on
> twophase_snapshot.spec),
> 
> Initial setup:
> ==
> Create table t1_seq(c1 int);
> Create Sequence seq1;
> 
> Test Execution via multiple sessions (this test allows insert in
> session-2 to happen before we reach a consistent point and commit
> happens after a consistent point):
> ===
> 
> Session-2:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-1:
> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
> 'test_decoding', false, true);
> 
> Session-3:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-2:
> Commit;
> Begin;
> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
> 
> Session-3:
> Commit;
> 
> Session-2:
> Commit 'foo'
> 
> Session-1:
> SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
> 'include-xids', 'false', 'skip-empty-xacts', '1');
> 
>  data
> --
>  BEGIN
>  table public.t1_seq: INSERT: c1[integer]:1
>  table public.t1_seq: INSERT: c1[integer]:2
>  table public.t1_seq: INSERT: c1[integer]:3
>  table public.t1_seq: INSERT: c1[integer]:4
>  table public.t1_seq: INSERT: c1[integer]:5
>  table public.t1_seq: INSERT: c1[integer]:6
> 
> 
> Now, if we normally try to decode such an insert, the result would be
> something like:
>  data
> --
>  sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>  BEGIN
>  table public.t1_seq: INSERT: c1[integer]:1
>  table public.t1_seq: INSERT: c1[integer]:2
>  table public.t1_seq: INSERT: c1[integer]:3
>  table public.t1_seq: INSERT: c1[integer]:4
>  table public.t1_seq: INSERT: c1[integer]:5
>  table public.t1_seq: INSERT: c1[integer]:6
> 
> This will create an inconsistent replica as sequence changes won't be
> replicated.

Hmm, that's interesting. I wonder if it can actually happen, though.
Have you been able to reproduce that, somehow?

> I thought about changing snapshot dealing of
> non-transactional sequence changes similar to transactional ones but
> that also won't work because it is only at commit we decide whether we
> can send the changes.
> 
I wonder if there's some earlier LSN (similar to the consistent point)
which might be useful for this.

Or maybe we should queue even the non-transactional changes, not
per-transaction but in a global list, and then at each commit either
discard inspect them (at that point we know the lowest LSN for all
transactions and the consistent point). Seems complex, though.

> For the transactional case, as we are considering the create sequence
> operation as transactional, we would unnecessarily queue them even
> though that is not required. Basically, they don't need to be
> considered transactional and we can simply ignore such messages like
> other DDLs. But for that probably we need to distinguish Alter/Create
> case which may or may not be straightforward. Now, queuing them is
> probably harmless unless it causes the transaction to spill/stream.
> 

I'm not sure I follow. Why would we queue them unnecessarily?

Also, there's the bug with decoding changes in transactions that create
the sequence and add it to a publication. I think the agreement was that
this behavior was incorrect, we should not decode changes until the
subscription is refreshed. Doesn't that mean can't be any CREATE case,
just ALTER?

> I still couldn't think completely about cases where a mix of
> transactional and non-transactional changes occur in the same
> transaction as I think it somewhat depends on what we want to do about
> the above cases.
> 

Understood. I need to think about this too.

>>> At all places, it is mentioned
>>> as creating a sequence for transactional cases which at the very least
>>> need some tweak.
>>>
>>
>> Which places?
>>
> 
> In comments like:
> a. When decoding sequences, we differentiate between sequences created
> in a (running) transaction and sequences created in other (already
> committed) transactions.
> b. ... But for new sequences, we need to handle them in a transactional way, 
> ..
> c. ... Change need

Re: [Proposal] vacuumdb --schema only

On Wed, Mar 30, 2022 at 02:22:58PM -0700, Nathan Bossart wrote:
> I'm personally -1 for the --exclude-schema option.  I don't see any
> existing "exclude" options in vacuumdb, and the uses for such an option
> seem rather limited.  If we can point to specific use-cases for this
> option, I might be willing to change my vote.

I suggested it because I would consider using it, even though I don't currently
use the vacuumdb script at all.  I think this would allow partially
retiring/simplifying our existing vacuum script.

We 1) put all our partitions in a separate "child" schema (so \d is more
usable), and also 2) put some short-lived tables into their own schemas.  Some
of those tables may only exist for ~1 day so I'd perfer to neither vacuum nor
analyze them (they're only used for SELECT *).  But there can be a lot of them,
so a nightly job could do something like vacuumdb --schema public or vacuumdb
--exclude-schema ephemeral.

Everything would be processed nightly using vacuumdb --min-xid (to keep the
monitoring system happy).

The non-partitioned tables could be vacuumed nightly (without min-xid), with
--exclude ephemeral.

The partitioned tables could be processed monthly with vacuumdb --analyze.

I'd also want to be able to run vacuumdb --analyze nightly, but I'd want to
exclude the schema with short-lived tables.   I'd also need a way to exclude
our partitioned tables from nightly analyze (they should run monthly only).

Maybe this could share something with this patch:
https://commitfest.postgresql.org/37/2573/
pg_dump - read data for some options from external file

The goal of that patch was to put it in a file, which isn't really needed here.
But if there were common infrastructure for matching tables, it could be
shared.  The interesting part for this patch is to avoid adding separate
commandline arguments for --include-table, --exclude-table, --include-schema,
--exclude-schema (and anything else?)

Re: CREATEROLE and role ownership hierarchies

On Fri, Apr 1, 2022 at 10:46 AM Greg Stark  wrote:
> The cfbot is testing the last patch posted to this thread which is the
> remove-self-own patch which was already committed. I gather that
> there's still (at least one) patch under discussion.
>
> Could I suggest reposting the last version of the main patch, perhaps
> rebasing it. That way the cfbot would at least continue to test for
> conflicts.

We should move this patch to the next CF or maybe even mark it
returned with feedback. We're not going to get anything else done here
for v15, and I'm not sure whether what we do beyond that will take
this form or not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collecting statistics about contents of JSONB columns

This patch has bitrotted, presumably after the other JSON patchset was
applied. It looks like it's failing in the json header file so it may
be as simple as additional functions added on nearby lines.

Please rebase. Reminder, it's the last week of the commitfest so time
is of the essence

Re: CREATEROLE and role ownership hierarchies

The cfbot is testing the last patch posted to this thread which is the
remove-self-own patch which was already committed. I gather that
there's still (at least one) patch under discussion.

Could I suggest reposting the last version of the main patch, perhaps
rebasing it. That way the cfbot would at least continue to test for
conflicts.

Re: Proposal: allow database-specific role memberships

Patch doesn't apply again...

[image: 1jfj7m.jpg]

Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages

On Fri, Apr 01, 2022 at 10:31:10AM +0900, Kyotaro Horiguchi wrote:
>   you should at least save the contents of the cluster's 
> pg_wal
> - subdirectory, as it might contain logs which
> + subdirectory, as it might contain WAL files which
>   were not archived before the system went down.
> 
> The "logs" means acutally "WAL segment (files)" but the concept of
> "segment" is out of focus in the context.  So just "file" is used
> there.  The same change is applied on dezon of places.

This change seems reasonable to me.

> -   disk-space requirements for the WAL logs are met,
> +   disk-space requirements for the WAL are met,
> 
> This might be better be "WAL files" instead of just "WAL".

+1 for "WAL files"

> -   WAL logs are stored in the directory
> +   WAL is stored in the directory
> pg_wal under the data directory, as a set of
> 
> I'm not sure which is better, use "WAL" as a collective noun, or "WAL
> files" as the cocrete objects.

My vote is for "WAL files" because it was previously "WAL logs."

> -   The aim of WAL is to ensure that the log is
> +   The aim of WAL is to ensure that the WAL record is
> written before database records are altered, but this can be subverted by
> 
> This is not a mechanical change.  But I think this is correct.

IMO the original wording is fine.  I think it is sufficiently clear that
"log" refers to "write-ahead log," and this sentence seems intended to
convey the basic rule of "log before data."  However, the rest of the
sentence is a little weird.  It's basically saying "the aim of the log is
to ensure that the log is written..."  Isn't the aim of the log to record
the database activity?  Perhaps we should rewrite it to something like the
following:

A basic rule of WAL is that the log must be written before the database
files are altered, but this can be...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Restructure ALTER TABLE notes to clarify table rewrites and verification scans

2022-04-01 Thread James Coleman

On Thu, Mar 31, 2022 at 10:58 AM Matthias van de Meent
 wrote:
>
> On Tue, 29 Mar 2022 at 16:20, James Coleman  wrote:
> >
> > Over in the "Document atthasmissing default optimization avoids
> > verification table scan" thread David Johnston (who I've cc'd)
> > suggested that my goals might be better implemented with a simple
> > restructuring of the Notes section of the ALTER TABLE docs. I think
> > this is also along the lines of Tom Lane's suggestion of a "unified
> > discussion", but I've chosen for now (and simplicity's sake) not to
> > break this into an entirely new page. If reviewers feel that is
> > warranted at this stage, I can do that, but it seems to me that for
> > now this improves the structure and sets us up for such a future page
> > but falls short of sufficient content to move into its own page.
> >
> > One question on the changes: the current docs say "when attaching a
> > new partition it may be scanned to verify that existing rows meet the
> > partition constraint". The word "may" there seems to suggest there may
> > also be occasions where scans are not needed, but no examples of such
> > cases are present. I'm not immediately aware of such a case. Are these
> > constraints always validated? If not, in which cases can such a scan
> > be skipped?
> >
> > I've also incorporated the slight correction in "Correct docs re:
> > rewriting indexes when table rewrite is skipped" [2] here, and will
> > rebase this patch if that gets committed.
>
> See comments in that thread.

Rebased since that thread has now resulted in a committed patch.

> > +Changing the type of an existing column will require the entire table 
> > and its
> > +indexes to be rewritten. As an exception, if the 
> > USING clause
> > +does not change the column contents and the old type is either binary 
> > coercible
> > +to the new type or an unconstrained domain over the new type, a table 
> > rewrite is
> > +not needed.
>
> This implies "If the old type is [...] an unconstrained domain over
> the new type, a table rewrite is not needed.", which is the wrong way
> around.
>
> I'd go with something along the lines of:
>
> +Changing the type of an existing column will require the entire table to 
> be
> +rewritten, unless the USING clause is only a
> binary coercible
> +cast, or if the new type is an unconstrained
> DOMAIN over the
> +old type.

That language is actually unchanged from the existing docs; is there
an error in the existing docs you're seeing? I'm actually imagining
that it can probably got either way -- from unconstrained domain over
new type to new type or from old type to unconstrained domain over old
type.

> That would drop the reference to index rebuilding; but that should be
> covered in other parts of the docs.

Part of the whole point of this restructuring is to make both of those
clear; I think we should retain the comments about indexes.

> > +The following alterations of the table require the entire table, and 
> > in some
> > +cases its indexes as well, to be rewritten.
>
> It is impossible to rewrite the table without at the same time also
> rewriting the indexes; as the location of tuples changes and thus
> previously generated indexes will become invalid. At the same time;
> changes to columns might not require a table rewrite, while still
> requiring the indexes to be rewritten. I suggest changing the order of
> "table" and "index", or dropping the clause.

Ah, that's a good point. I've rewritten that part.

> > +[...] For a large table such a rewrite
> > +may take a significant amount of time and will temporarily require as 
> > much as
> > +double the disk space.
>
> I'd replace the will with could. Technically, this "double the disk
> space" could be even higher than that; due to index rebuilds taking up
> to 3x normal space (one original index which is only dropped at the
> end, one sorted tuple store for the rebuild, and one new index).

That's also the existing language, but I agree it seems a bit overly
precise (and in the process probably incorrect). There's a lot of
complexity here: depending on the type change (and USING clause!) and
table width it could be even more than 3x. I've reworded to try to
capture what's really going on here.

Why "could" instead of "will"? All table rewrites will always require
a extra disk space, right?

> > -Similarly, when attaching a new partition it may be scanned to verify 
> > that
> > -existing rows meet the partition constraint.
> > +Attaching a new partition requires scanning the table to verify that 
> > existing
> > +rows meet the partition constraint.
>
> This is also (and better!) documented under section
> sql-altertable-attach-partition: we will skip full table scan if the
> table partition's existing constraints already imply the new partition
> constraints. The previous wording is better in that regard ("may
> need", instead of "requires"), though it could be improve

Re: psql - add SHOW_ALL_RESULTS option


On 01.04.22 07:46, Fabien COELHO wrote:

Attached a rebase.


Again, after the SendQuery refactoring extraction.


I'm doing this locally, so don't feel obliged to send more of these. ;-)

I've started committing this now, in pieces.

Re: [PATCH] Accept IP addresses in server certificate SANs


On 31.03.22 20:15, Jacob Champion wrote:

On Thu, 2022-03-31 at 16:32 +0200, Peter Eisentraut wrote:

Why add a (failry complicated) pg_inet_pton() when a perfectly
reasonable inet_pton() exists?


I think it was mostly just that inet_aton() and pg_inet_net_ntop() both
had ports, and I figured I might as well port the other one since we
already had the implementation. (I don't have a good intuition yet for
the community's preference for port vs dependency.)


I would get rid of all that refactoring and just have your code call
inet_pton()/inet_ntop() directly.

If you're worried about portability, and you don't want to go through
the effort of proving libpgport substitutes, just have your code raise
an error in the "#else" code paths.  We can fill that in later if there
is demand.


Switched to inet_pton() in v12, with no #if/else for now. I think this
should work with Winsock as-is; let's see if the bot agrees...


I have committed this.

I have removed the inet header refactoring that you had.  That wasn't 
necessary, since pg_inet_net_ntop() can use the normal AF_INET* 
constants.  The PGSQL_AF_INET* constants are only for the internal 
storage of the inet/cidr types.


I have added a configure test for inet_pton().  We can check in the 
build farm if it turns out to be necessary.

Re: [PATCH] src/interfaces/libpq/Makefile: fix pkg-config without openssl

Daniel Gustafsson  writes:
>> On 31 Mar 2022, at 18:37, Fabrice Fontaine  
>> wrote:
>> +ifeq ($(with_ssl),openssl)
>> PKG_CONFIG_REQUIRES_PRIVATE = libssl libcrypto
>> +endif

> That seems reasonable, is there any reason why the referenced commit didn't do
> that?

Looks like a clear oversight to me, but maybe Peter will
think differently.

regards, tom lane

Re: Reword docs of feature "Remove temporary files after backend crash"

On Fri, Apr 1, 2022 at 9:42 AM Daniel Gustafsson  wrote:
> I think "useless" is a bit too strong and subjective given that it's 
> describing
> an unknown situation out of the ordinary.  How about "outdated" or "redundant"
> (or something else entirely which is even better)?

It's the existing wording, though, and unrelated to the changes the
patch is trying to make. It also seems accurate to me.

> > I've also made a CF entry - https://commitfest.postgresql.org/35/3356/
>
> This has been sitting the CF for quite some time, time to make a decision on
> it.  I think it makes sense, having detailed docs around debugging is rarely a
> bad thing. Does anyone else have opinions?

I don't like it. It seems to me that it will result in a lot of
duplication in the docs, because every time we talk about something
that happens in connection with a crash, we'll need to talk about this
same list of exceptions. It would be reasonable to document which
conditions trigger a crash-and-restart cycle and which do not in some
centralized place, but not in every place where we mention crashing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro  wrote:
> The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> IssuePendingWritebacks(), which does desynchronised smgropen() calls
> and could open files after the barrier but just before they are
> unlinked.  Makes sense, but...
>
> 1.  For that to actually work, we'd better call smgrcloseall() when
> PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> closeAllVds().  Here's a patch for that.  But...

What if we instead changed our approach to fixing
IssuePendingWritebacks()?  Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the
smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
function now refuses either to create a new smgr or to return one
where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to
write back to a relfilenode if we haven't had a lock on it since the
last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
need here?

My thinking here is that it's a lot easier to get away with marking
the content of a data structure invalid than it is to get away with
invalidating pointers to that data structure. If we can tinker with
the design so that the SMgrRelationData doesn't actually go away but
just gets a little bit more thoroughly invalidated, we could avoid
having to audit the entire code base for code that keeps smgr handles
around longer than would be possible with the design you demonstrate
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Reword docs of feature "Remove temporary files after backend crash"

> On 12 Oct 2021, at 08:40, Bharath Rupireddy 
>  wrote:

> Here's a v2 patch that I could come up with. Please review it further.

+debugging, for example. Repeated crashes may however result in
+accumulation of useless files. This parameter can only be set in the

I think "useless" is a bit too strong and subjective given that it's describing
an unknown situation out of the ordinary.  How about "outdated" or "redundant"
(or something else entirely which is even better)?

> I've also made a CF entry - https://commitfest.postgresql.org/35/3356/

This has been sitting the CF for quite some time, time to make a decision on
it.  I think it makes sense, having detailed docs around debugging is rarely a
bad thing. Does anyone else have opinions?

--
Daniel Gustafsson   https://vmware.com/

Re: [PATCH] src/interfaces/libpq/Makefile: fix pkg-config without openssl

> On 31 Mar 2022, at 18:37, Fabrice Fontaine  wrote:

> +ifeq ($(with_ssl),openssl)
> PKG_CONFIG_REQUIRES_PRIVATE = libssl libcrypto
> +endif

That seems reasonable, is there any reason why the referenced commit didn't do
that?

--
Daniel Gustafsson   https://vmware.com/

Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

On Fri, Apr 01, 2022 at 08:53:10PM +0900, Michael Paquier wrote:
> On Fri, Apr 01, 2022 at 03:01:38PM +0900, Michael Paquier wrote:
> > On Thu, Mar 31, 2022 at 10:51:59PM -0500, Justin Pryzby wrote:
> >> Is diff -q defined somewhere ?  I can't find it in postgres sources nor in
> >> sources for bf client.
> > 
> > 322becb has added such a call, at the end of 002_pg_upgrade.pl.
> > vcregress.pl also has one before this commit.
> 
> The Windows animals seem to be in good shape, except hamerkop that
> dies on "vcregress upgradecheck" when the TAP tests are disabled
> causing the buildfarm client to stop.  My idea to use upgradecheck
> leads to more code than just moving the test to bincheck so let's
> reuse the suggestion from Andres upthread and disable completely
> upgradecheck, keeping the target around only for compatibility.  The
> attached does that, and the test of pg_upgrade would go through
> bincheck instead.

If you do that, should also remove upgradecheck from .cirrus.yaml, which
currently runs the upgradecheck target.

I suspect this'll cause windows CI a bit slower.
https://cirrus-ci.com/task/4703731324289024

An alternative to your patch to have the buildfarm client avoid calling
upgradecheck if tap tests are disabled.  Under your patch, upgrade check is a
NOP, so it should stop calling upgradecheck anyway.  So maybe this is a better
option ?

-- 
Justin

[PATCH] src/interfaces/libpq/Makefile: fix pkg-config without openssl

2022-04-01 Thread Fabrice Fontaine

Do not add openssl dependencies to libpq pkg-config file if openssl is
disabled to avoid the following build failure with libdbi-drivers raised
since commit beff361bc1edc24ee5f8b2073a1e5e4c92ea66eb:

configure: error: Package requirements (libpq) were not met:

Package 'libssl', required by 'libpq', not found
Package 'libcrypto', required by 'libpq', not found

Fixes:
 - 
http://autobuild.buildroot.org/results/415cb61a58b928a42623ed90b0b60c59032f0a4e

Signed-off-by: Fabrice Fontaine 
---
 src/interfaces/libpq/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 89bf5e0126..b5fd72a4ac 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -95,7 +95,9 @@ SHLIB_PREREQS = submake-libpgport
 
 SHLIB_EXPORTS = exports.txt
 
+ifeq ($(with_ssl),openssl)
 PKG_CONFIG_REQUIRES_PRIVATE = libssl libcrypto
+endif
 
 all: all-lib libpq-refs-stamp
 
-- 
2.35.1

Re: Use generation context to speed up tuplesorts

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag,

Per cirrus, this missed updating one line for dtrace.

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> I feel this is a pretty simple patch and if nobody has any objections
> then I plan to push all 3 of them on my New Zealand Monday morning.

I'll mark the CF as such.

-- 
Justin
>From a6271ac626b561cdafe60152eb4fb208e7fe7240 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH 1/4] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have a minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c  |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c   | 383 ++
 src/include/utils/memutils.h  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8ff..f190decdff2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context.  Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 	  "GiST VACUUM page set context",
+	  16 * 1024,
+	  16 * 1024,
 	  16 * 1024);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fae..4702750a2e7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,8 +370,15 @@ ReorderBufferAllocate(void)
 			SLAB_DEFAULT_BLOCK_SIZE,
 			sizeof(ReorderBufferTXN));
 
+	/*
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
+	 */
 	buffer->tup_context = GenerationContextCreate(new_ctx,
   "Tuples",
+  SLAB_LARGE_BLOCK_SIZE,
+  SLAB_LARGE_BLOCK_SIZE,
   SLAB_LARGE_BLOCK_SIZE);
 
 	hash_ctl.keysize = sizeof(TransactionId);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 95c006f689f..29065201556 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -20,18 +20,15 @@
  *
  *	The memory context uses a very simple approach to free space management.
  *	Instead of a complex global freelist, each block tracks a number
- *	of allocated and freed chunks. Freed chunks are not reused, and once all
- *	chunks in a block are freed, the whole block is thrown away. When the
- *	chunks allocated in the same block have similar lifespan, this works
- *	very well and is very cheap.
+ *	of allocated and freed chunks.  The block is classed as empty when the
+ *	number of free chunks is equal to the number of allocated chunks.  When
+ *	this occurs, instead of freeing the block, we try to "recycle" it, i.e.
+ *	reuse it for new allocations.  This is done by setting the block in the

Re: Correct docs re: rewriting indexes when table rewrite is skipped

2022-04-01 Thread James Coleman

On Fri, Apr 1, 2022 at 8:58 AM Robert Haas  wrote:
>
> On Thu, Mar 31, 2022 at 4:19 PM James Coleman  wrote:
> > On Thu, Mar 31, 2022 at 3:25 PM Robert Haas  wrote:
> > > On Thu, Mar 31, 2022 at 10:51 AM James Coleman  wrote:
> > > > Updated.
> > >
> > > This version looks fine to me. If nobody objects I will commit it and
> > > credit myself as a co-author.
> >
> > Sounds great; thanks again for the review.
>
> Done.

Thanks. I marked the CF entry as committed.

James Coleman

Re: Correct docs re: rewriting indexes when table rewrite is skipped

On Thu, Mar 31, 2022 at 4:19 PM James Coleman  wrote:
> On Thu, Mar 31, 2022 at 3:25 PM Robert Haas  wrote:
> > On Thu, Mar 31, 2022 at 10:51 AM James Coleman  wrote:
> > > Updated.
> >
> > This version looks fine to me. If nobody objects I will commit it and
> > credit myself as a co-author.
>
> Sounds great; thanks again for the review.

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Is monotonous xid8 is a right way to do?

2022-04-01 Thread Pavel Borisov

>
> A different and more important issue (IMO) is that the xlog record
>
header currently only supports 32-bit xids -- long-running
> transactions can reasonably see a xid4 wraparound in their lifetime.

You're completely right. This is a first of making xid's 64-bit proposed
[1] i.e, making SLRU 64-bit [2]

[1]
https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/flat/CALT9ZEEf1uywYN%2BVaRuSwNMGE5%3DeFOy7ZTwtP2g%2BW9oJDszqQw%40mail.gmail.com#bd4f64b73cb3b969e119da7e5a7b1f30

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Is monotonous xid8 is a right way to do?

On Fri, 1 Apr 2022 at 14:13, Pavel Borisov  wrote:
>
> Hi hackers!
>
> Now we have two data types xid and xid8. The first one (xid) makes a numeric 
> ring, and xid8 are monotonous.
>
> As per [1] "Unlike xid values, xid8 values increase strictly monotonically 
> and cannot be reused in the lifetime of a database cluster."
>
> As a consequence of [1] xid8 can have min/max functions (committed in [2]), 
> which xid can not have.
>
> When working on 64xid patch [3] we assume that even 64xid's technically can 
> be wraparound-ed, although it's very much unlikely. I wonder what is expected 
> to be with xid8 values at this (unlikely) 64xid wraparound?
>
> What do you think about this? Wouldn't it be better to change xid8 to form a 
> numeric ring like xid? I think it is necessary for any 64-wraparound-enabled 
> implementation of 64xids.
>
> Please feel free to share your thoughts.

Assuming that each Xid is WAL-logged (or at least one in 8) we won't
see xid8 wraparound, as our WAL is byte-addressable with only 64 bits
used as the identifier. As such, we can only fit a maximum of 2^61
xid8s in our WAL; which is less than what would be needed to wrap
around.

Addressed another way: If we'd have a system that consumed one xid
every CPU clock; then the best available x86 hardware right now would
currently consume ~ 5.5B xids every second. This would still leave
around 100 years of this system running non-stop before we'd be
hitting xid8 wraparound (= 2^64 / 5.5e9 (xid8 /sec) / 3600 (min /hour)
/ 730 (hour / month)/ 12 (month /year)).

I don't think we'll have to consider that an issue for now. Maybe,
eventually, if we start doing distributed transactions where
transaction IDs are reasonably consumed at a rate higher than 5B /sec
(and not logged at that rate) we can start considering this to be a
problem.

A different and more important issue (IMO) is that the xlog record
header currently only supports 32-bit xids -- long-running
transactions can reasonably see a xid4 wraparound in their lifetime.

Enjoy,

Matthias van de Meent

Re: Is monotonous xid8 is a right way to do?

2022-04-01 Thread Dagfinn Ilmari Mannsåker

Dagfinn Ilmari Mannsåker  writes:

> Even if a cluster was consuming a million XIDs per second, it would take
> over half a million years to wrap around the 64bit range. Is that really
> something we should worry about?
>
> ilmari@[local]:5432 ~=# select 2::numeric^64/10^9/3600/24/365;

Oops, that should be 10^6, not 10^9. I was dithering over whether to do
it as a million or a billion per second.  For a billion XIDs per second
it would last a mere half millennium.

> ┌──┐
> │ ?column? │
> ├──┤
> │ 584942.417355072 │
> └──┘
>
> - ilmari

Re: Is monotonous xid8 is a right way to do?

2022-04-01 Thread Maxim Orlov

Hi!

In my view, FullTransactionId type was implemented without considering 64
bit wraparound. Which seems to be unlikely to happen. Then on that basis
xid8 type was created. Details of that particular implementation
infiltrated into documentation and became sort of normal. In my opinion,
semantically, both of these types should be treated as similar
types although with different sizes. Thus, again, xid and xid8 types should
be a ring and have no min and max functions. At least, in a sort of
"conventional" way when minimal value is minimal in a mathematical way and
so for maximum.

For example, max may be implemented as max(0, 42, 18446744073709551615) =
42, which is a bit weird.

-- 
Best regards,
Maxim Orlov.

Re: Is monotonous xid8 is a right way to do?

2022-04-01 Thread Dagfinn Ilmari Mannsåker

Pavel Borisov  writes:

> Hi hackers!
>
> Now we have two data types xid and xid8. The first one (xid) makes a
> numeric ring, and xid8 are monotonous.
>
> As per [1] "Unlike xid values, xid8 values increase strictly monotonically
> and cannot be reused in the lifetime of a database cluster."
>
> As a consequence of [1] xid8 can have min/max functions (committed in [2]),
> which xid can not have.
>
> When working on 64xid patch [3] we assume that even 64xid's technically can
> be wraparound-ed, although it's very much unlikely. I wonder what is
> expected to be with xid8 values at this (unlikely) 64xid wraparound?

Even if a cluster was consuming a million XIDs per second, it would take
over half a million years to wrap around the 64bit range. Is that really
something we should worry about?

ilmari@[local]:5432 ~=# select 2::numeric^64/10^9/3600/24/365;
┌──┐
│ ?column? │
├──┤
│ 584942.417355072 │
└──┘

- ilmari

Re: Skipping logical replication transactions on subscriber side

2022-04-01 Thread Masahiko Sawada

On Fri, Apr 1, 2022 at 5:10 PM Masahiko Sawada  wrote:
>
> On Fri, Apr 1, 2022 at 4:44 PM Noah Misch  wrote:
> >
> > On Tue, Mar 29, 2022 at 10:43:00AM +0530, Amit Kapila wrote:
> > > On Mon, Mar 21, 2022 at 5:51 PM Euler Taveira  wrote:
> > > > On Mon, Mar 21, 2022, at 12:25 AM, Amit Kapila wrote:
> > > > I have fixed all the above comments as per your suggestion in the
> > > > attached. Do let me know if something is missed?
> > > >
> > > > Looks good to me.
> > >
> > > This patch is committed
> > > (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=208c5d65bbd60e33e272964578cb74182ac726a8).
> >
> > src/test/subscription/t/029_on_error.pl has been failing reliably on the 
> > five
> > AIX buildfarm members:
> >
> > # poll_query_until timed out executing this query:
> > # SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > timed out waiting for match: (?^:LOG:  done skipping logical replication 
> > transaction finished at 0/1D30788) at t/029_on_error.pl line 50.
> >
> > I've posted five sets of logs (2.7 MiB compressed) here:
> > https://drive.google.com/file/d/16NkyNIV07o0o8WM7GwcaAYFQDPTkULkR/view?usp=sharing
>
> Thank you for the report. I'm investigating this issue.

Looking at the subscriber logs, it successfully fetched the correct
error-LSN from the server logs and set it to ALTER SUBSCRIPTION …
SKIP:

2022-03-30 09:48:36.617 UTC [17039636:4] CONTEXT:  processing remote
data for replication origin "pg_16391" during "INSERT" for replication
target relation "public.tbl" in transaction 725 finished at 0/1D30788
2022-03-30 09:48:36.617 UTC [17039636:5] LOG:  logical replication
subscription "sub" has been disabled due to an error
:
2022-03-30 09:48:36.670 UTC [17039640:1] [unknown] LOG:  connection
received: host=[local]
2022-03-30 09:48:36.672 UTC [17039640:2] [unknown] LOG:  connection
authorized: user=nm database=postgres application_name=029_on_error.pl
2022-03-30 09:48:36.675 UTC [17039640:3] 029_on_error.pl LOG:
statement: ALTER SUBSCRIPTION sub SKIP (lsn = '0/1D30788')
2022-03-30 09:48:36.676 UTC [17039640:4] 029_on_error.pl LOG:
disconnection: session time: 0:00:00.006 user=nm database=postgres
host=[local]
:
2022-03-30 09:48:36.762 UTC [28246036:2] ERROR:  duplicate key value
violates unique constraint "tbl_pkey"
2022-03-30 09:48:36.762 UTC [28246036:3] DETAIL:  Key (i)=(1) already exists.
2022-03-30 09:48:36.762 UTC [28246036:4] CONTEXT:  processing remote
data for replication origin "pg_16391" during "INSERT" for replication
target relation "public.tbl" in transaction 725 finished at 0/1D30788

However, the worker could not start skipping changes of the error
transaction for some reason. Given that "SELECT subskiplsn = '0/0'
FROM pg_subscription WHERE subname = 'sub’” didn't return true, some
value was set to subskiplsn even after the unique key error.

So I'm guessing that the apply worker could not get the updated value
of the subskiplsn or its MySubscription->skiplsn could not match with
the transaction's finish LSN. Also, given that the test is failing on
all AIX buildfarm members, there might be something specific to AIX.

Noah, to investigate this issue further, is it possible for you to
apply the attached patch and run the 029_on_error.pl test? The patch
adds some logs to get additional information.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

add_logs.patch
Description: Binary data

Is monotonous xid8 is a right way to do?

2022-04-01 Thread Pavel Borisov

Hi hackers!

Now we have two data types xid and xid8. The first one (xid) makes a
numeric ring, and xid8 are monotonous.

As per [1] "Unlike xid values, xid8 values increase strictly monotonically
and cannot be reused in the lifetime of a database cluster."

As a consequence of [1] xid8 can have min/max functions (committed in [2]),
which xid can not have.

When working on 64xid patch [3] we assume that even 64xid's technically can
be wraparound-ed, although it's very much unlikely. I wonder what is
expected to be with xid8 values at this (unlikely) 64xid wraparound?

What do you think about this? Wouldn't it be better to change xid8 to form
a numeric ring like xid? I think it is necessary for any
64-wraparound-enabled implementation of 64xids.

Please feel free to share your thoughts.

[1] https://www.postgresql.org/docs/current/datatype-oid.html
[2]
https://www.postgresql.org/message-id/flat/47d77b18c44f87f8222c4c7a3e2dee6b%40oss.nttdata.com
[3]
https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com

Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2022-04-01 Thread Michael Paquier

On Fri, Apr 01, 2022 at 03:01:38PM +0900, Michael Paquier wrote:
> On Thu, Mar 31, 2022 at 10:51:59PM -0500, Justin Pryzby wrote:
>> Is diff -q defined somewhere ?  I can't find it in postgres sources nor in
>> sources for bf client.
> 
> 322becb has added such a call, at the end of 002_pg_upgrade.pl.
> vcregress.pl also has one before this commit.

The Windows animals seem to be in good shape, except hamerkop that
dies on "vcregress upgradecheck" when the TAP tests are disabled
causing the buildfarm client to stop.  My idea to use upgradecheck
leads to more code than just moving the test to bincheck so let's
reuse the suggestion from Andres upthread and disable completely
upgradecheck, keeping the target around only for compatibility.  The
attached does that, and the test of pg_upgrade would go through
bincheck instead.

It is late here, I'll try to get that patched up tomorrow.
--
Michael
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index 43d05bde4e..18101e7a70 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -470,7 +470,6 @@ $ENV{CONFIG}="Debug";
 vcregress isolationcheck
 vcregress bincheck
 vcregress recoverycheck
-vcregress upgradecheck
 
 
To change the schedule used (default is parallel), append it to the
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index e896820ac5..fbfc781508 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -106,7 +106,7 @@ my %command = (
 	ISOLATIONCHECK => \&isolationcheck,
 	BINCHECK   => \&bincheck,
 	RECOVERYCHECK  => \&recoverycheck,
-	UPGRADECHECK   => \&upgradecheck,
+	UPGRADECHECK   => \&upgradecheck, # no-op
 	TAPTEST=> \&taptest,);
 
 my $proc = $command{$what};
@@ -286,9 +286,6 @@ sub bincheck
 	foreach my $dir (@bin_dirs)
 	{
 		next unless -d "$dir/t";
-		# Do not consider pg_upgrade, as it is handled by
-		# upgradecheck.
-		next if ($dir =~ "/pg_upgrade/");
 
 		my $status = tap_check($dir);
 		$mstat ||= $status;
@@ -520,9 +517,9 @@ sub generate_db
 
 sub upgradecheck
 {
-	InstallTemp();
-	my $mstat = tap_check("$topdir/src/bin/pg_upgrade");
-	exit $mstat if $mstat;
+	# pg_upgrade is now handled by bincheck, but keep this
+	# target for backward compatibility.
+	print "upgradecheck disabled\n";
 	return;
 }
 


signature.asc
Description: PGP signature

Re: pg_rewind copies

> On 1 Apr 2022, at 12:46, Heikki Linnakangas  wrote:

>> +if (len >= sizeof(dstpath))
>> +pg_fatal("filepath buffer too small");  /* shouldn't happen */
> 
> Makes sense. I would remove the "shouldn't happen"; it's not very hard to 
> make it happen, you just need a very long target datadir path. And rephrase 
> the error message as "datadir path too long".

Right, good point.

> One typo in the commit message: s/update/updates/.

Will fix.

--
Daniel Gustafsson   https://vmware.com/

Re: pg_rewind copies

2022-04-01 Thread Heikki Linnakangas


On 01/04/2022 12:00, Daniel Gustafsson wrote:

I took another look at this patch, and I think it's ready to go in, it clearly
fixes a bug that isn't too hard to hit in production settings.  To ensure we
don't break this I've added a testcase which pipes the pg_rewind --verbose
output to a file it's asked to copy, which then guarantees that the file is
growing in size during the operation without need for synchronizing two
processes with IPC::Run (it also passes on Windows in the CI setup).

One small comment on the patch:

+   snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);

This should IMO check the returnvalue of snprintf to ensure it wasn't
truncated.  While the risk is exceedingly small, a truncated filename might
match another existing filename and the error not getting caught.  There is
another instance just like this one in open_target_file() to which I think we
should apply the same belts-and-suspenders treatment.  I've fixed this in the
attached version which also have had a pg_indent run on top of a fresh rebase.



+   if (len >= sizeof(dstpath))
+   pg_fatal("filepath buffer too small");/* shouldn't 
happen */


Makes sense. I would remove the "shouldn't happen"; it's not very hard 
to make it happen, you just need a very long target datadir path. And 
rephrase the error message as "datadir path too long".


One typo in the commit message: s/update/updates/.

Thanks!

- Heikki

Re: Use generation context to speed up tuplesorts

2022-04-01 Thread David Rowley

On Wed, 23 Mar 2022 at 04:08, David Rowley  wrote:
> If nobody has any objections to the 0001 patch then I'd like to move
> ahead with it in the next few days.  For the 0002 patch, I'm currently
> feeling like we shouldn't be using the Generation context for bounded
> sorts. The only way I can think to do that is with an API change to
> tuplesort. I feel 0001 is useful even without 0002.

0001:
I've made some small revisions to the 0001 patch so that the keeper
and freeblock are only used again when they're entirely empty.  The
situation I want to avoid here is that when the current block does not
have enough free space to store some new allocation, that we'll then
try the freeblock and then keeper block.  The problem I saw there is
that we may previously have partially filled the keeper or freeblock
block and have been unable to store some medium sized allocation which
caused us to move to a new block.  If we didn't check that the keeper
and freeblock blocks were empty first then we could end up being able
to store some small allocation in there where some previous medium
sized allocation couldn't fit.  While that's not necessarily an actual
problem, what it does mean is that consecutive allocations might not
end up in the same block or the next block.  Over time in a FIFO type
workload it would be possible to get fragmentation, which could result
in being unable to free blocks.  For longer lived contexts I imagine
that could end up fairly bad. The updated patch should avoid that
problem.

0002:
This modifies the tuplesort API so that instead of having a
randomAccess bool flag, this is changed to a bitwise flag that we can
add further options in the future.  It's slightly annoying to break
the API, but it's not exactly going to be hard for people to code
around that.  It might also mean we don't have to break the API in the
future if we're doing some change where we can just add a new bitwise
flag.

0003:
This adds a new flag for TUPLESORT_ALLOWBOUNDED and modifies the
places where we set a sort bound to pass the flag.  The patch only
uses the generation context when the flag is not passed.

I feel this is a pretty simple patch and if nobody has any objections
then I plan to push all 3 of them on my New Zealand Monday morning.

David
From 053066f8276809af1b3f2c26e1e86b8e1db85bac Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v5 1/3] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have a minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: 
https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c  |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c   | 383 ++
 src/include/utils/memutils.h  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c 
b/src/backend/access/gist/gistvacuum.c
index aac4afab8f..f190decdff 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, 
IndexBulkDeleteResult *stats,
 * pages in page_set_context.  Internally, the integer set will remember
 * this context so that the subsequent allocations for these integer 
sets
 * will be done from the same context.
+*
+* XXX the allocation sizes used below pre-date generation context's 
block
+* growing code.  These values should likely be benchmarked and set to
+* more suitable values.
 */
vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,

  "GiST

Re: pg_rewind copies

I took another look at this patch, and I think it's ready to go in, it clearly
fixes a bug that isn't too hard to hit in production settings.  To ensure we
don't break this I've added a testcase which pipes the pg_rewind --verbose
output to a file it's asked to copy, which then guarantees that the file is
growing in size during the operation without need for synchronizing two
processes with IPC::Run (it also passes on Windows in the CI setup).

One small comment on the patch:

+   snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);

This should IMO check the returnvalue of snprintf to ensure it wasn't
truncated.  While the risk is exceedingly small, a truncated filename might
match another existing filename and the error not getting caught.  There is
another instance just like this one in open_target_file() to which I think we
should apply the same belts-and-suspenders treatment.  I've fixed this in the
attached version which also have had a pg_indent run on top of a fresh rebase.

Thoughts on this version?

--
Daniel Gustafsson   https://vmware.com/



v3-0001-pg_rewind-Fetch-small-files-according-to-new-size.patch
Description: Binary data

Re: Unit tests for SLRU

2022-04-01 Thread Pavel Borisov

пт, 1 апр. 2022 г. в 07:47, Noah Misch :

> On Thu, Mar 31, 2022 at 05:30:41PM +0300, Aleksander Alekseev wrote:
> > I used src/test/modules/test_* modules as an example.
>
> > If time permits, please take a quick look at the patch and let me know
> > if I'm moving the right direction. There will be more tests in the
> > final version, but I would appreciate an early feedback.
>
> The default place for this kind of test is regress.c, with plain "make
> check"
> calling the regress.c function.  src/test/modules is for things requiring
> an
> extension module or otherwise unable to run through regress.c.
>
+1 for placement c functions into regress.c if it's possible for the aim of
simplification.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

On Fri, 1 Apr 2022 at 07:38, Michael Paquier  wrote:
>
> On Thu, Mar 31, 2022 at 12:09:35PM +0200, Matthias van de Meent wrote:
> > PageInit MAXALIGNs the size of the special area that it receives as an
> > argument; so any changes to the page header that would misalign the
> > value would be AM-specific; in which case it is quite unlikely that
> > this is the right accessor for your page's special area.
>
> Right.  I'd still be tempted to keep that per-AM rather than making
> the checks deeper with one extra macro layer in the page header or
> with a different macro that would depend on the opaque type, though.
> Like in the attached, for example.

I see. I still would like it better if the access could use this
statically determined offset: your opaque-macros.patch doesn't fix the
out-of-bound read/write scenariofor non-assert builds, nor does it
remove the unneeded indirection through the page header that I was
trying to remove.

Even in assert-enabled builds; with my proposed changes the code paths
for checking the header value and the use of the special area can be
executed independently, which allows for parallel (pre-)fetching of
the page header and special area, as opposed to the current sequential
load order due to the required use of pd_special.

-Matthias

Re: generic plans and "initial" pruning

2022-04-01 Thread Amit Langote

On Fri, Apr 1, 2022 at 5:20 PM David Rowley  wrote:
> On Fri, 1 Apr 2022 at 19:58, Amit Langote  wrote:
> > Yes, the ExecLockRelsInfo node in the current patch, that first gets
> > added to the QueryDesc and subsequently to the EState of the query,
> > serves as that stashing place.  Not sure if you've looked at
> > ExecLockRelInfo in detail in your review of the patch so far, but it
> > carries the initial pruning result in what are called
> > PlanInitPruningOutput nodes, which are stored in a list in
> > ExecLockRelsInfo and their offsets in the list are in turn stored in
> > an adjacent array that contains an element for every plan node in the
> > tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
> > don't need to have that array, because the Append/MergeAppend nodes
> > would be carrying those offsets by themselves.
>
> I saw it, just not in great detail. I saw that you had an array that
> was indexed by the plan node's ID.  I thought that wouldn't be so good
> with large complex plans that we often get with partitioning
> workloads.  That's why I mentioned using another index that you store
> in Append/MergeAppend that starts at 0 and increments by 1 for each
> node that has a PartitionPruneInfo made for it during create_plan.
>
> > Maybe a different name for ExecLockRelsInfo would be better?
> >
> > Also, given Tom's apparent dislike for carrying that in PlannedStmt,
> > maybe the way I have it now is fine?
>
> I think if you change how it's indexed and the other stuff then we can
> have another look.  I think the patch will be much easier to review
> once the ParitionPruneInfos are moved into PlannedStmt.

Will do, thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

2022-04-01 Thread Kyotaro Horiguchi

At Wed, 30 Mar 2022 17:58:24 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > But, in the first place the *fix* has been found to be wrong.  I'm
> > going to search for the right fix..
> 
> FETCH uses the snapshot at DECLARE. So anyhow I needed to set the
> queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's
> holdSnapshot.  What I did in this version is:

By the way, this is, given that the added check in init_toast_snapshot
is valid, a long-standing "bug", which at least back to 12.

I'm not sure what to do for this.

1. ignore the check for certain cases?

2. apply any fix only to master and call it a day.  14 and earlier
  doesn't have the assertion check so they don't complain.

3. apply a fix to all affected versions.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Tablesync early exit

2022-04-01 Thread Peter Smith

On Wed, Mar 16, 2022 at 4:07 PM Amit Kapila  wrote:
>
> On Mon, Aug 30, 2021 at 8:50 AM Peter Smith  wrote:
> >
> > Patch v2 is the same; it only needed re-basing to the latest HEAD.
> >
>
> Why do you think it is correct to exit before trying to receive any
> message?

I think the STATE_CATCHUP state guarantees the apply worker must have
received (or tried to receive) a message. See the next answer.

> How will we ensure whether the apply worker has processed any
> message?

All this patch code does is call process_syncing_tables, which
delegates to process_syncing_tables_for_sync (because the call is from
a tablesync worker). This function code can’t do anything unless the
tablesync worker is in STATE_CATCHUP state, and that cannot happen
unless it was explicitly set to that state by the apply worker.

On the other side of the coin, the apply worker can only set that
syncworker->relstate = SUBREL_STATE_CATCHUP from within function
process_syncing_tables_for_apply, and AFAIK that function is only
called when the apply worker has either handled a message, (or the
walrcv_receive in the  LogicalRepApplyLoop received nothing).

So I think the STATE_CATCHUP mechanism itself ensures the apply worker
*must* have already processed a message (or there was no message to
process).

> At the beginning of function LogicalRepApplyLoop(),
> last_received is the LSN where the copy has finished in the case of
> tablesync worker. I think we need to receive the message before trying
> to ensure whether we have synced with the apply worker or not.
>

I think the STATE_CATCHUP guarantees the apply worker must have
received (or tried to receive) a message. See the previous answer.

~~~

AFAIK this patch is OK, but since it is not particularly urgent I've
bumped this to the next CommitFest [1] instead of trying to jam it
into PG15 at the last minute.

BTW - There were some useful logfiles I captured a very long time ago
[2]. They show the behaviour without/with this patch.

--
[1] https://commitfest.postgresql.org/37/3062/
[2] 
https://www.postgresql.org/message-id/cahut+ptjk-qgd3r1a1_tr62cmiswcyphuv0plmva-+2s8r0...@mail.gmail.com

Kind Regards,
Peter Smith
Fujitsu Australia

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-04-01 Thread Dilip Kumar

On Thu, Mar 31, 2022 at 9:52 PM Andres Freund  wrote:

> > + "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> > +ok($result == 0, 'move database to tablespace 2');
>
> This just tests the command doesn't fail, but not whether it actually did
> something useful. Seem we should at least insert a row or two into the the
> table, and verify they can be accessed?

Now, added some tuples and verified them.


> >  # Check that only READ-only queries can run on standbys
> >  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
> >   3, 'read-only queries on standby 1');
>
> I'd probably add a function for creating database / table and then testing it,
> with a strategy parameter. That way we can afterwards add more tests verifying
> that everything worked.

I have created a function to create a database and table and verify
the content in it.  Another option is we can just keep the database
and table creation inside the function and the verification part
outside it so that if some future test case wants to create some extra
content and verify it then they can do so.   But with the current
tests in mind the way I got it in the attached patch has less
duplicate code so I preferred it this way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From a672a27a9e502331a7c4ca2a16f3f660c8ed3fcd Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 1 Apr 2022 13:15:57 +0530
Subject: [PATCH v2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
 src/test/modules/test_misc/t/002_tablespace.pl | 14 ++
 src/test/recovery/t/001_stream_rep.pl  | 25 +
 2 files changed, 39 insertions(+)

diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..6ac6f79 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,21 @@ $result = $node->psql('postgres',
 	"ALTER TABLE t SET tablespace regress_ts1");
 ok($result == 0, 'move table in-place->abs');
 
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
+$result = $node->safe_psql('testdb', "SELECT count(*) FROM tab_int");
+is($result, qq(10), 'check contents after moving to a different tablespace');
+
 # Drop everything
+$result = $node->psql('postgres', "DROP DATABASE testdb");
 $result = $node->psql('postgres',
 	"DROP TABLE t");
 ok($result == 0, 'create table in tablespace 1');
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 583ee87..cf4041a 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -78,6 +78,31 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Create database with some contents for a given strategy and check on standby
+sub test_createdb_strategy_and_check
+{
+	my $node1   = shift;
+	my $node2   = shift;
+	my $dbname  = shift;
+	my $strategy= shift;
+
+	$node1->safe_psql('postgres',
+		"CREATE DATABASE $dbname STRATEGY = $strategy; ");
+	$node1->safe_psql($dbname,
+		"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+	my $lsn = $node1->lsn('write');
+	$node1->wait_for_catchup($node2, 'replay', $lsn);
+
+	my $result = $node2->safe_psql($dbname, "SELECT count(*) FROM tab_int");
+	is($result, qq(10), 'check streamed content on standby');
+}
+# Test replication of file_copy strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb1",
+	"file_copy");
+# Test replication of wal_log strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb2",
+	"wal_log");
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
1.8.3.1

Re: generic plans and "initial" pruning

2022-04-01 Thread David Rowley

On Fri, 1 Apr 2022 at 19:58, Amit Langote  wrote:
> Yes, the ExecLockRelsInfo node in the current patch, that first gets
> added to the QueryDesc and subsequently to the EState of the query,
> serves as that stashing place.  Not sure if you've looked at
> ExecLockRelInfo in detail in your review of the patch so far, but it
> carries the initial pruning result in what are called
> PlanInitPruningOutput nodes, which are stored in a list in
> ExecLockRelsInfo and their offsets in the list are in turn stored in
> an adjacent array that contains an element for every plan node in the
> tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
> don't need to have that array, because the Append/MergeAppend nodes
> would be carrying those offsets by themselves.

I saw it, just not in great detail. I saw that you had an array that
was indexed by the plan node's ID.  I thought that wouldn't be so good
with large complex plans that we often get with partitioning
workloads.  That's why I mentioned using another index that you store
in Append/MergeAppend that starts at 0 and increments by 1 for each
node that has a PartitionPruneInfo made for it during create_plan.

> Maybe a different name for ExecLockRelsInfo would be better?
>
> Also, given Tom's apparent dislike for carrying that in PlannedStmt,
> maybe the way I have it now is fine?

I think if you change how it's indexed and the other stuff then we can
have another look.  I think the patch will be much easier to review
once the ParitionPruneInfos are moved into PlannedStmt.

David

Re: [WIP] ALTER COLUMN IF EXISTS