Re: Assertion failure with replication origins and PREPARE TRANSACTION

2021-12-12 Thread Masahiko Sawada
On Mon, Dec 13, 2021 at 12:44 PM Michael Paquier  wrote:
>
> Hi all,
> (CCing some folks who worked on this area lately)
>
> The following sequence of commands generates an assertion failure, as
> of $subject:
> select pg_replication_origin_create('popo');
> select pg_replication_origin_session_setup('popo');
> begin;
> select txid_current();
> prepare transaction 'popo'; -- assertion fails
>
> The problem originates from 1eb6d65, down to 11, where we finish by
> triggering this assertion when replorigin_session_origin_lsn is not
> valid:
> +   if (replorigin)
> +   {
> +   Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
> +   hdr->origin_lsn = replorigin_session_origin_lsn;
> +   hdr->origin_timestamp = replorigin_session_origin_timestamp;
> +   }
>
> As far as I understand this code, and based on the docs,
> pg_replication_origin_xact_setup(), which would set the session
> origin LSN and timestamp, is an optional choice.
> replorigin_session_advance() would be a no-op for remote_lsn, and
> local_lsn requires an update.  Now please note that I am not really
> fluent with this stuff, so feel free to correct me.  The intention of
> the code also seems to be that XACT_XINFO_HAS_ORIGIN should still be
> set, but with no data.
>
> At the end, it seems to me that the assertion could just be dropped as
> per the attached, as we'd finish by setting up origin_lsn and
> origin_timestamp in the 2PC file header with some invalid data.

Why do we check that replorigin_session_origin_lsn is valid only for
PREPARE TRANSACTION? Looking at the commit and rollback code, we don't
have assertions or checks that verify that
replorigin_session_origin_lsn/timestamp hold valid data. So it looks
like we also accept invalid data in those cases, since
replorigin_advance doesn't move the LSN backward while applying a
commit or rollback record even if the LSN is invalid. The same is true
for PREPARE records, i.e., even if the replication origin LSN is
invalid, it doesn't go backward. If the replication origin LSN and
timestamp in commit and rollback records must be valid data too, I
think we should add similar checks to the commit and rollback code,
and I think those assertions would fail in the case I reported
before[1].
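
For reference, the corresponding commit path in
RecordTransactionCommit() looks roughly like this (quoted from memory
of xact.c, so treat the exact code as an assumption); note that there
is no equivalent assertion there:

    if (replorigin)
        /* Move LSNs forward for this replication origin */
        replorigin_session_advance(replorigin_session_origin_lsn,
                                   XactLastRecEnd);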

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoAMuTrXezGqaV1Q5H-Hf%2BzKqGbU8XuCZk9iQMYueF3%2B8w%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: parallel vacuum comments

2021-12-12 Thread Masahiko Sawada
On Mon, Dec 13, 2021 at 2:09 PM Amit Kapila  wrote:
>
> On Mon, Dec 13, 2021 at 10:33 AM Masahiko Sawada  
> wrote:
> >
> > On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila  wrote:
> > >
> > > On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > Agreed with the above two points.
> > > >
> > > > I've attached updated patches that incorporated the above comments
> > > > too. Please review them.
> > > >
> > >
> > > I have made the following minor changes to the 0001 patch: (a) An
> > > assert was removed from dead_items_max_items() which I added back. (b)
> > > Removed an unnecessary semicolon from one of the statements in
> > > compute_parallel_vacuum_workers(). (c) Changed comments at a few
> > > places. (d) moved all parallel_vacuum_* related functions together.
> > > (e) ran pgindent and slightly modify the commit message.
> > >
> > > Let me know what you think of the attached?
> >
> > Thank you for updating the patch!
> >
> > The patch also moves some functions, e.g., update_index_statistics()
> > is moved without code changes. I agree to move functions for
> > consistency but that makes the review hard and the patch complicated.
> > I think it's better to improve the parallel vacuum code and move
> > the functions in separate patches.
> >
>
> Okay, I thought it might be better to keep all parallel_vacuum_*
> related functions together, but we can keep that in a separate patch.
> Feel free to submit without those changes.

I've attached the patch. I've just moved some functions back but not
done other changes.

> In fact, if we go for your
> current 0002, then that might not even be required, as we move all those
> functions to a new file.

Right. So it seems unnecessary.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v9-0001-Improve-parallel-vacuum-implementation.patch
Description: Binary data


Re: Add client connection check during the execution of the query

2021-12-12 Thread Thomas Munro
On Mon, Dec 13, 2021 at 5:51 PM Thomas Munro  wrote:
> [...] Everywhere else calls
> with nevents == 1, so that's hypothetical.

Erm, I forgot about ExecAppendAsyncEventWait(), so I'd have to update
the commit message on that point, but it's hard to worry too much
about that case -- it's creating and destroying a WES every time.  I'd
rather worry about fixing that problem.




Re: parallel vacuum comments

2021-12-12 Thread Amit Kapila
On Mon, Dec 13, 2021 at 10:33 AM Masahiko Sawada  wrote:
>
> On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila  wrote:
> >
> > On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila  
> > > wrote:
> > >
> > > Agreed with the above two points.
> > >
> > > I've attached updated patches that incorporated the above comments
> > > too. Please review them.
> > >
> >
> > I have made the following minor changes to the 0001 patch: (a) An
> > assert was removed from dead_items_max_items() which I added back. (b)
> > Removed an unnecessary semicolon from one of the statements in
> > compute_parallel_vacuum_workers(). (c) Changed comments at a few
> > places. (d) moved all parallel_vacuum_* related functions together.
> > (e) ran pgindent and slightly modify the commit message.
> >
> > Let me know what you think of the attached?
>
> Thank you for updating the patch!
>
> The patch also moves some functions, e.g., update_index_statistics()
> is moved without code changes. I agree to move functions for
> consistency but that makes the review hard and the patch complicated.
> I think it's better to improve the parallel vacuum code and move the
> functions in separate patches.
>

Okay, I thought it might be better to keep all parallel_vacuum_*
related functions together, but we can keep that in a separate patch.
Feel free to submit without those changes. In fact, if we go for your
current 0002, then that might not even be required, as we move all
those functions to a new file.

-- 
With Regards,
Amit Kapila.




Re: parallel vacuum comments

2021-12-12 Thread Masahiko Sawada
On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila  wrote:
>
> On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada  wrote:
> >
> > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila  wrote:
> >
> > Agreed with the above two points.
> >
> > I've attached updated patches that incorporated the above comments
> > too. Please review them.
> >
>
> I have made the following minor changes to the 0001 patch: (a) An
> assert was removed from dead_items_max_items() which I added back. (b)
> Removed an unnecessary semicolon from one of the statements in
> compute_parallel_vacuum_workers(). (c) Changed comments at a few
> places. (d) moved all parallel_vacuum_* related functions together.
> (e) ran pgindent and slightly modify the commit message.
>
> Let me know what you think of the attached?

Thank you for updating the patch!

The patch also moves some functions, e.g., update_index_statistics()
is moved without code changes. I agree to move functions for
consistency but that makes the review hard and the patch complicated.
I think it's better to improve the parallel vacuum code and move the
functions in separate patches.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: parallel vacuum comments

2021-12-12 Thread Masahiko Sawada
On Mon, Dec 13, 2021 at 12:03 PM Amit Kapila  wrote:
>
> On Sat, Dec 11, 2021 at 8:30 PM Masahiko Sawada  wrote:
> >
> > On Sat, Dec 11, 2021 at 2:32 PM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2021-10-30 14:21:01 -0700, Andres Freund wrote:
> > > > Due to bug #17245: [1] I spent a considerable amount of time looking
> > > > at vacuum related code. And I found a few things that I think could
> > > > stand improvement:
> >
> > Thank you for the comments.
> >
> > >
> > > While working on the fix for #17255 (more specifically some cleanup
> > > that Peter suggested in the context), I noticed another thing:
> > > Initializing parallelism as part of dead_items_alloc() is a bad idea.
> > > Even if there are comments noting that oddity.
> > >
> > > I don't really see why we should do it this way? There's no
> > > "no-parallelism" path in begin_parallel_vacuum() besides
> > > compute_parallel_vacuum_workers(). So it's not like we might just
> > > discover the inability to do parallelism during parallel
> > > initialization?
> >
> > Right. Also, in parallel vacuum case, it allocates the space not only
> > for dead items but also other data required to do parallelism like
> > shared bulkdeletion results etc. Originally, in PG13,
> > begin_parallel_vacuum() was called by lazy_scan_heap() but in PG14 it
> > became part of dead_items_alloc() (see b4af70cb2). I agree to change
> > this part so that lazy_scan_heap() calls begin_parallel_vacuum()
> > (whatever we rename it). I'll incorporate this change in the
> > refactoring patch barring any objections.
> >
> > >
> > > It's also not particularly helpful to have a begin_parallel_vacuum() that
> > > might not actually begin a parallel vacuum...
> >
> > During the development, I found that we have some begin_* functions
> > that don't start the actual parallel job but prepare state data for
> > starting it, and referred to _bt_begin_parallel(), so I named this one
> > begin_parallel_vacuum(). But I admit that considering what the
> > function actually does, something like
> > create_parallel_vacuum_context() would be clearer.
> >
>
> How about if we name it parallel_vacuum_init(), which will be
> similar to InitializeParallelDSM() and ExecInitParallelPlan()?

parallel_vacuum_init() sounds better.

> Now, I see
> there is some reasoning to keep it in dead_items_alloc() as both
> primarily allocate memory for vacuum, but maybe we should name the
> function vacuum_space_alloc instead of dead_items_alloc and similarly
> rename dead_items_cleanup to vacuum_space_free. The other idea could
> be to bring begin_parallel_vacuum() back into lazy_scan_heap(), but I
> personally prefer to keep it where it is and improve the function
> names. Would it be better to do this as a separate patch, as 0002,
> because this might require some change in the vacuum code path?

Yeah, if we are just renaming functions, I think we can do that in the
0001 patch. On the other hand, if we need to change the logic, it's
better to do that in a separate patch.

>
> > >
> > > Minor nit:
> > >
> > > begin_parallel_vacuum()'s comment says:
> > >  * On success (when we can launch one or more workers), will set 
> > > dead_items and
> > >  * lps in vacrel for caller.
> > >
> > > But it actually doesn't know whether we can start workers. It just checks
> > > max_parallel_maintenance_workers, no?
> >
> > Yes, we cannot know whether we can actually start workers when
> > starting parallel index vacuuming. It returns non-NULL if we request
> > one or more workers.
> >
>
> So can we adjust the comments? I think the part of the sentence "when
> we can launch one or more workers" seems to be the cause of confusion;
> can we remove it?

Yes, we can remove it. Or replace "can launch" with "request".

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add client connection check during the execution of the query

2021-12-12 Thread Thomas Munro
On Sat, Dec 11, 2021 at 7:09 PM Thomas Munro  wrote:
> On Sat, Dec 11, 2021 at 6:11 PM Andres Freund  wrote:
> > Yuck. Is there really no better way to deal with this? What kind of
> > errors is this trying to handle transparently? Afaics this still
> > changes when we'd e.g. detect postmaster death.
>
> The problem is that WaitEventSetWait() only reports the latch, if it's
> set, so I removed it from the set (setting it to NULL), and then undo
> that afterwards.  Perhaps we could fix that root problem instead.
> That is, we could make it so that latches aren't higher priority in
> that way, ie don't hide other events[1].  Then I wouldn't have to
> modify the WES here, I could just ignore it in the output event list
> (and make sure that's big enough for all possible events, as I had it
> in the last version).  I'll think about that.

I tried that.  It seems OK, and gets rid of the PG_FINALLY(), which is
nice.  Latches still have higher priority, and still have the fast
return if already set and you asked for only one event, but now if you
ask for nevents > 1 we'll make the syscall too so we'll see the
WL_SOCKET_CLOSED.
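
To illustrate, a hypothetical caller (not part of the patch; "sock"
stands for whatever socket is being watched) could poll for peer
shutdown along these lines:

    WaitEvent     ev;
    WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext, 2);

    AddWaitEventToSet(wes, WL_SOCKET_CLOSED, sock, NULL, NULL);
    AddWaitEventToSet(wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
                      NULL, NULL);

    /* a timeout of 0 means just poll */
    if (WaitEventSetWait(wes, 0, &ev, 1, 0) == 1 &&
        (ev.events & WL_SOCKET_CLOSED))
    {
        /* the remote peer has shut down its socket */
    }
    FreeWaitEventSet(wes);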

It's possible that someone might want the old behaviour one day (fast
return even for nevents > 1, hiding other events), and then we'd have
to come up with some way to request that, which is the type of tricky
policy question that had put me off "fixing" this before.  But I guess
we could cross that bridge if we come to it.  Everywhere else calls
with nevents == 1, so that's hypothetical.

Better?
From 6b994807c29157ac7053ec347a51268d60f2b3b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 30 Apr 2021 10:38:40 +1200
Subject: [PATCH v4 1/3] Add WL_SOCKET_CLOSED for socket shutdown events.

Provide a way for WaitEventSet to report that the remote peer has shut
down its socket, independently of whether there is any buffered data
remaining to be read.  This works only on systems where the kernel
exposes that information, namely:

* WAIT_USE_POLL builds using POLLRDHUP, if available
* WAIT_USE_EPOLL builds using EPOLLRDHUP
* WAIT_USE_KQUEUE builds using EV_EOF

Reviewed-by: Zhihong Yu 
Reviewed-by: Maksim Milyutin 
Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
---
 src/backend/storage/ipc/latch.c | 79 +
 src/include/storage/latch.h |  6 ++-
 2 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 1d893cf863..54e928c564 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -841,6 +841,7 @@ FreeWaitEventSet(WaitEventSet *set)
  * - WL_SOCKET_CONNECTED: Wait for socket connection to be established,
  *	 can be combined with other WL_SOCKET_* events (on non-Windows
  *	 platforms, this is the same as WL_SOCKET_WRITEABLE)
+ * - WL_SOCKET_CLOSED: Wait for socket to be closed by remote peer.
  * - WL_EXIT_ON_PM_DEATH: Exit immediately if the postmaster dies
  *
  * Returns the offset in WaitEventSet->events (starting from 0), which can be
@@ -1043,12 +1044,16 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
 	else
 	{
 		Assert(event->fd != PGINVALID_SOCKET);
-		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		Assert(event->events & (WL_SOCKET_READABLE |
+WL_SOCKET_WRITEABLE |
+WL_SOCKET_CLOSED));
 
 		if (event->events & WL_SOCKET_READABLE)
 			epoll_ev.events |= EPOLLIN;
 		if (event->events & WL_SOCKET_WRITEABLE)
 			epoll_ev.events |= EPOLLOUT;
+		if (event->events & WL_SOCKET_CLOSED)
+			epoll_ev.events |= EPOLLRDHUP;
 	}
 
 	/*
@@ -1087,12 +1092,18 @@ WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
 	}
 	else
 	{
-		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		Assert(event->events & (WL_SOCKET_READABLE |
+WL_SOCKET_WRITEABLE |
+WL_SOCKET_CLOSED));
 		pollfd->events = 0;
 		if (event->events & WL_SOCKET_READABLE)
 			pollfd->events |= POLLIN;
 		if (event->events & WL_SOCKET_WRITEABLE)
 			pollfd->events |= POLLOUT;
+#ifdef POLLRDHUP
+		if (event->events & WL_SOCKET_CLOSED)
+			pollfd->events |= POLLRDHUP;
+#endif
 	}
 
 	Assert(event->fd != PGINVALID_SOCKET);
@@ -1165,7 +1176,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events)
 	Assert(event->events != WL_LATCH_SET || set->latch != NULL);
 	Assert(event->events == WL_LATCH_SET ||
 		   event->events == WL_POSTMASTER_DEATH ||
-		   (event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)));
+		   (event->events & (WL_SOCKET_READABLE |
+			 WL_SOCKET_WRITEABLE |
+			 WL_SOCKET_CLOSED)));
 
 	if (event->events == WL_POSTMASTER_DEATH)
 	{
@@ -1188,9 +1201,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events)
 		 * old event mask to the new event mask, since kevent treats readable
 		 * and writable as separate events.
 		 */
-		if (old_events & WL_SOCKET_READABLE)
+		if (old_event

Re: Skipping logical replication transactions on subscriber side

2021-12-12 Thread Amit Kapila
On Mon, Dec 13, 2021 at 8:28 AM Masahiko Sawada  wrote:
>
> On Sat, Dec 11, 2021 at 3:29 PM Amit Kapila  wrote:
> >
> > 3.
> > + * Also, we don't skip receiving the changes in streaming cases,
> > since we decide
> > + * whether or not to skip applying the changes when starting to apply 
> > changes.
> >
> > But why so? Can't we even skip streaming (and writing to file all such
> > messages)? If we can do this then we can avoid even collecting all
> > messages in a file.
>
> IIUC in streaming cases, a transaction can be sent to the subscriber
> while being split into multiple chunks of changes. In the meantime,
> skip_xid can be changed. If the user changed or cleared skip_xid after
> the subscriber skipped some streamed changes, the subscriber won't be
> able to have the complete changes of the transaction.
>

Yeah, if we want, I think we can handle this by writing into the
stream xid file whether the changes need to be skipped, so that the
subsequent streams can check that in the file; or maybe we could
somehow disallow skip_xid from being changed in the worker if it is
already skipping some xact. If we don't want to do anything for this,
then it is better to at least reflect this reasoning in the comments.

> >
> > 4.
> > + * Also, one might think that we can skip preparing the skipped 
> > transaction.
> > + * But if we do that, PREPARE WAL record won’t be sent to its physical
> > + * standbys, resulting in that users won’t be able to find the prepared
> > + * transaction entry after a fail-over.
> > + *
> > ..
> > + */
> > + if (skipping_changes)
> > + stop_skipping_changes(false);
> >
> > Why do we need such a Prepare's entry either at current subscriber or
> > on its physical standby? I think it is to allow Commit-prepared. If
> > so, how about if we skip even commit prepared as well? Even on
> > physical standby, we would be having the value of skip_xid which can
> > help us to skip there as well after failover.
>
> It's true that skip_xid would be set also on the physical standby.
> When it comes to preparing the skipped transaction on the current
> subscriber, if we want to skip commit-prepared I think we need
> protocol changes in order for subscribers to know prepare_lsn and
> prepare_timestamp so that it can look up the prepared transaction
> when doing commit-prepared. I proposed this idea before. This change
> would be beneficial as of now since the publisher sends even empty
> transactions. But considering the proposed patch[1] that makes the
> publisher not send empty transactions, this protocol change would be
> an optimization only for this feature.
>

I was thinking to compare the xid received as part of the
commit_prepared message with the value of skip_xid to skip the
commit_prepared, but I guess the user could change it between prepare
and commit-prepared, and then we won't be able to detect it, right? I
think we can handle this and the streaming case if we disallow users
from changing the value of skip_xid when we are already skipping
changes, or don't let the new skip_xid take effect in the apply worker
if we are already skipping some other transaction. What do you think?

-- 
With Regards,
Amit Kapila.




Assertion failure with replication origins and PREPARE TRANSACTION

2021-12-12 Thread Michael Paquier
Hi all,
(CCing some folks who worked on this area lately)

The following sequence of commands generates an assertion failure, as
of $subject:
select pg_replication_origin_create('popo');
select pg_replication_origin_session_setup('popo');
begin;
select txid_current();
prepare transaction 'popo'; -- assertion fails

The problem originates from 1eb6d65, down to 11, where we finish by
triggering this assertion when replorigin_session_origin_lsn is not
valid:
+   if (replorigin)
+   {
+   Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+   hdr->origin_lsn = replorigin_session_origin_lsn;
+   hdr->origin_timestamp = replorigin_session_origin_timestamp;
+   }

As far as I understand this code, and based on the docs,
pg_replication_origin_xact_setup(), which would set the session
origin LSN and timestamp, is an optional choice.
replorigin_session_advance() would be a no-op for remote_lsn, and
local_lsn requires an update.  Now please note that I am not really
fluent with this stuff, so feel free to correct me.  The intention of
the code also seems to be that XACT_XINFO_HAS_ORIGIN should still be
set, but with no data.

At the end, it seems to me that the assertion could just be dropped as
per the attached, as we'd finish by setting up origin_lsn and
origin_timestamp in the 2PC file header with some invalid data.  The
else block could just be removed, but then one would need to guess
from origin.c that both replorigin_session_* may not be set.

I am not completely sure that this is the right move, though, as
pg_replication_origin_status has *very* limited test coverage.
Adding a test to replorigin.sql with 2PC transactions able to trigger
the problem is easy once we rely on a different slot to not influence
the existing tests, as the expected contents of
pg_replication_origin_status could be messed up in the follow-up
tests.

Thoughts?
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 28b153abc3..7518f83925 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1127,9 +1127,9 @@ EndPrepare(GlobalTransaction gxact)
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
   replorigin_session_origin != DoNotReplicateId);
 
-	if (replorigin)
+	if (replorigin &&
+		replorigin_session_origin_lsn != InvalidXLogRecPtr)
 	{
-		Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
 		hdr->origin_lsn = replorigin_session_origin_lsn;
 		hdr->origin_timestamp = replorigin_session_origin_timestamp;
 	}
diff --git a/contrib/test_decoding/expected/replorigin.out b/contrib/test_decoding/expected/replorigin.out
index 8077318755..895783d917 100644
--- a/contrib/test_decoding/expected/replorigin.out
+++ b/contrib/test_decoding/expected/replorigin.out
@@ -181,3 +181,65 @@ SELECT pg_replication_origin_drop('regress_test_decoding: regression_slot');
  
 (1 row)
 
+-- Transactions with no origin LSN and commit timestamps set for this
+-- session yet.
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_no_lsn', 'test_decoding');
+ ?column? 
+--
+ init
+(1 row)
+
+SELECT pg_replication_origin_create('regress_test_decoding: regression_slot_no_lsn');
+ pg_replication_origin_create 
+--
+1
+(1 row)
+
+-- mark session as replaying
+SELECT pg_replication_origin_session_setup('regress_test_decoding: regression_slot_no_lsn');
+ pg_replication_origin_session_setup 
+-
+ 
+(1 row)
+
+-- Normal transaction
+BEGIN;
+INSERT INTO target_tbl(data)
+  SELECT data FROM
+pg_logical_slot_get_changes('regression_slot_no_lsn', NULL, NULL, 'include-xids', '0');
+COMMIT;
+-- 2PC transaction
+BEGIN;
+INSERT INTO target_tbl(data)
+  SELECT data FROM
+pg_logical_slot_get_changes('regression_slot_no_lsn', NULL, NULL, 'include-xids', '0');
+PREPARE TRANSACTION 'replorigin_prepared';
+COMMIT PREPARED 'replorigin_prepared';
+SELECT local_id, external_id,
+   remote_lsn <> '0/0' AS valid_remote_lsn,
+   local_lsn <> '0/0' AS valid_local_lsn
+   FROM pg_replication_origin_status;
+ local_id |  external_id  | valid_remote_lsn | valid_local_lsn 
+--+---+--+-
+1 | regress_test_decoding: regression_slot_no_lsn | f| t
+(1 row)
+
+-- This ensures that everything can be dropped.
+SELECT pg_replication_origin_session_reset();
+ pg_replication_origin_session_reset 
+-
+ 
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot_no_lsn');
+ pg_drop_replication_slot 
+--
+ 
+(1 row)
+
+SELECT pg_replication_origin_drop('regress_test_decoding: regression_slot_no_lsn');
+ pg_replication_origin_drop 
+
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/replorigin.sql b/contri

Re: Skipping logical replication transactions on subscriber side

2021-12-12 Thread Greg Nancarrow
On Fri, Dec 10, 2021 at 4:44 PM Masahiko Sawada  wrote:
>
> I've attached an updated patch. The new syntax is like "ALTER
> SUBSCRIPTION testsub SKIP (xid = '123')".
>

I have some review comments:

(1) Patch comment - some suggested wording improvements

BEFORE:
If incoming change violates any constraint, logical replication stops
AFTER:
If an incoming change violates any constraint, logical replication stops

BEFORE:
The user can specify XID by ALTER SUBSCRIPTION ... SKIP (xid = XXX),
updating pg_subscription.subskipxid field, telling the apply worker to
skip the transaction.
AFTER:
The user can specify the XID of the transaction to skip using
ALTER SUBSCRIPTION ... SKIP (xid = XXX), updating the pg_subscription.subskipxid
field, telling the apply worker to skip the transaction.

src/sgml/logical-replication.sgml
(2) Some suggested wording improvements

(i) Missing "the"
BEFORE:
+   the existing data.  When a conflict produce an error, it is shown in
AFTER:
+   the existing data.  When a conflict produce an error, it is shown in the

(ii) Suggest starting a new sentence
BEFORE:
+   and it is also shown in subscriber's server log as follows:
AFTER:
+   The error is also shown in the subscriber's server log as follows:


(iii) Context message should say "at ..." instead of "with commit
timestamp ...", to match the actual output from the current code
BEFORE:
+CONTEXT:  processing remote data during "INSERT" for replication
target relation "public.test" in transaction 716 with commit timestamp
2021-09-29 15:52:45.165754+00
AFTER:
+CONTEXT:  processing remote data during "INSERT" for replication
target relation "public.test" in transaction 716 at 2021-09-29
15:52:45.165754+00


(iv) The following paragraph seems out of place, with the information
presented in the wrong order:

+  
+   In this case, you need to consider changing the data on the
subscriber so that it
+   doesn't conflict with incoming changes, or dropping the
conflicting constraint or
+   unique index, or writing a trigger on the subscriber to suppress or redirect
+   conflicting incoming changes, or as a last resort, by skipping the
whole transaction.
+   They skip the whole transaction, including changes that may not violate any
+   constraint.  They may easily make the subscriber inconsistent, especially if
+   a user specifies the wrong transaction ID or the position of origin.
+  


How about rearranging it as follows:

+  
+   These methods skip the whole transaction, including changes that
may not violate
+   any constraint. They may easily make the subscriber inconsistent,
especially if
+   a user specifies the wrong transaction ID or the position of
origin, and should
+   be used as a last resort.
+   Alternatively, you might consider changing the data on the
subscriber so that it
+   doesn't conflict with incoming changes, or dropping the
conflicting constraint or
+   unique index, or writing a trigger on the subscriber to suppress or redirect
+   conflicting incoming changes.
+  


doc/src/sgml/ref/alter_subscription.sgml
(3)

(i) Doc needs clarification
BEFORE:
+  the whole transaction.  The logical replication worker skips all data
AFTER:
+  the whole transaction.  For the latter case, the logical
replication worker skips all data


(ii) "Setting -1 means to reset the transaction ID"

Shouldn't it be explained what resetting actually does and when it can
be, or is needed to be, done? Isn't it automatically reset?
I notice that negative values (other than -1) seem to be regarded as
valid - is that right?
Also, what happens if this option is set multiple times? Does it just
override and use the latest setting? (other option handling errors out
with errorConflictingDefElem()).
e.g. alter subscription sub skip (xid = 721, xid = 722);


src/backend/replication/logical/worker.c
(4) Shouldn't the "done skipping logical replication transaction"
message also include the skipped XID value at the end?


src/test/subscription/t/027_skip_xact.pl
(5) Some suggested wording improvements

(i)
BEFORE:
+# Test skipping the transaction. This function must be called after the caller
+# inserting data that conflict with the subscriber.  After waiting for the
+# subscription worker stats are updated, we skip the transaction in question
+# by ALTER SUBSCRIPTION ... SKIP. Then, check if logical replication
can continue
+# working by inserting $nonconflict_data on the publisher.
AFTER:
+# Test skipping the transaction. This function must be called after the caller
+# inserts data that conflicts with the subscriber.  After waiting for the
+# subscription worker stats to be updated, we skip the transaction in question
+# by ALTER SUBSCRIPTION ... SKIP. Then, check if logical replication
can continue
+# working by inserting $nonconflict_data on the publisher.

(ii)
BEFORE:
+# will conflict with the data replicated from publisher later.
AFTER:
+# will conflict with the data replicated later from the publisher.


Regards,
Greg Nancarrow
Fujitsu Australia

Re: Replication slot drop message is sent after pgstats shutdown.

2021-12-12 Thread Kyotaro Horiguchi
At Fri, 10 Dec 2021 18:13:31 +0900, Masahiko Sawada  
wrote in 
> I agreed with Andres and Horiguchi-san and attached an updated patch.

Thanks for the new version.

It seems fine, but I have some comments from a cosmetic viewpoint.

+   /*
+* Register callback to make sure cleanup and releasing the replication
+* slot on exit.
+*/
+   ReplicationSlotInitialize();

Actually, that is all the function does, but the comment looks
slightly inconsistent with the function name. I think we can just call
it "initialization" here. I don't think we need to change the function
comment the same way, but either will do for me.

+ReplicationSlotBeforeShmemExit(int code, Datum arg)

The name looks a bit too verbose. Doesn't just "ReplicationSlotShmemExit" work?
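
For completeness, with that name the registration would read roughly
as below (a sketch only; the exact shape is my assumption):

    void
    ReplicationSlotInitialize(void)
    {
        /* arrange to clean up and release any held slot at exit */
        before_shmem_exit(ReplicationSlotShmemExit, 0);
    }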

-* so releasing here is fine. There's another cleanup in 
ProcKill()
-* ensuring we'll correctly cleanup on FATAL errors as well.
+* so releasing here is fine. There's another cleanup in
+* ReplicationSlotBeforeShmemExit() callback ensuring we'll 
correctly
+* cleanup on FATAL errors as well.

I'd prefer "during before_shmem_exit()" than "in
ReplicationSlotBeforeShmemExit() callback" here. (But the current
wording is also fine by me.)


The attached detects that bug, but I'm not sure it's worth expending
test time on, or whether this should instead be in the server test
suite.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl 
b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 90da1662e3..0fb4e67697 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,8 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::Cluster;
-use Test::More tests => 20;
+use Test::More tests => 25;
+use IPC::Run qw(pump finish timer);
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -106,3 +107,44 @@ $node->command_ok(
'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
],
'replayed a two-phase transaction');
+
+## Check for a crash bug caused by replication-slot cleanup after
+## pgstat shutdown.
+#fire up an interactive psql session
+my $in  = '';
+my $out = '';
+my $timer = timer(5);
+my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+like($out, qr/psql/, "print startup banner");
+
+# open a transaction
+$out = "";
+$in .= "BEGIN;\nCREATE TABLE a (a int);\n";
+pump $h until ($out =~ /CREATE TABLE/ || $timer->is_expired);
+ok(!$timer->is_expired, 'background CREATE TABLE passed');
+
+# this recvlogical waits until the transaction ends
+ok(open(my $recvlogical, '-|',
+   'pg_recvlogical', '--create-slot', '-S', 'test2',
+   '-d', $node->connstr('postgres')),
+   'launch background pg_recvlogical');
+
+$node->poll_query_until('postgres',
+   qq{SELECT count(*) > 0 FROM pg_stat_activity 
+   WHERE backend_type='walsender'
+   AND query like 
'CREATE_REPLICATION_SLOT %';});
+# stop server while it hangs. This shouldn't crash server.
+$node->stop;
+ok(open(my $cont, '-|', 'pg_controldata', $node->data_dir),
+   'run pg_controldata');
+my $stop_result = '';
+while (<$cont>)
+{
+   if (/^Database cluster state: *([^ ].*)$/)
+   {
+   $stop_result = $1;
+   last;
+   }
+}
+
+is($stop_result, 'shut down', 'server is properly shut down');


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-12 Thread Ashutosh Sharma
+   /*
+* If the relation is from the default tablespace then we need to
+* create it in the destinations db's default tablespace.
Otherwise,
+* we need to create in the same tablespace as it is in the source
+* database.
+*/

This comment looks a bit confusing to me, especially because when we
say "destination db's default tablespace" people may think of the
pg_default tablespace (at least I do). Basically what you are trying
to say here is: "If the relation exists in the same tablespace as the
src database, then in the destination db also it should be the same,
or something like that..." So why not put it that way instead of
referring to it as the default tablespace? It's just my view. If you
disagree you can ignore it.
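
To illustrate the mapping I have in mind (a sketch only; the
*_deftablespace variable names are hypothetical):

    if (srcrnode.spcNode == src_deftablespace)
        /* relation is in the source db's default tablespace */
        dstrnode.spcNode = dst_deftablespace;
    else
        /* keep the explicitly chosen tablespace */
        dstrnode.spcNode = srcrnode.spcNode;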

--

+   else if (src_dboid == dst_dboid)
+   continue;
+   else
+   dstrnode.spcNode = srcrnode.spcNode;;

There is an extra semicolon here.

--
With Regards,
Ashutosh Sharma.


On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar  wrote:

> On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma 
> wrote:
> >>
> >> Logfile Snippet:
> >> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file
> "base/116398/116400": No such file or directory
> >> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID
> 18151) was terminated by signal 6: Aborted
> >> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active
> server processes
> >
> >
> > This is different from the issue you raised earlier. As Dilip said, we
> need to unregister sync requests for files that got successfully copied to
> the target database, but the overall alter database statement failed. We
> are doing this when the database is created successfully, but not when it
> fails.
> > Probably doing the same inside the cleanup function
> movedb_failure_callback() should fix the problem.
>
> Correct, I have done this cleanup. Apart from this, we have dropped
> the fsync request in the create database failure case as well, and we
> also need to drop buffers in the error case of createdb as well as
> movedb.  I have also fixed the other issue for which you gave the
> patch (a bit differently): basically, in the case of movedb the source
> and destination dboid are the same, so we don't need an additional
> parameter, and I also readjusted the conditions to avoid nested ifs.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: parallel vacuum comments

2021-12-12 Thread Amit Kapila
On Sat, Dec 11, 2021 at 8:30 PM Masahiko Sawada  wrote:
>
> On Sat, Dec 11, 2021 at 2:32 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2021-10-30 14:21:01 -0700, Andres Freund wrote:
> > > Due to bug #17245: [1] I spent a considerable amount of time looking
> > > at vacuum related code. And I found a few things that I think could
> > > stand improvement:
>
> Thank you for the comments.
>
> >
> > While working on the fix for #17255 (more specifically some cleanup
> > that Peter suggested in the context), I noticed another thing:
> > Initializing parallelism as part of dead_items_alloc() is a bad idea.
> > Even if there are comments noting that oddity.
> >
> > I don't really see why we should do it this way? There's no
> > "no-parallelism" path in begin_parallel_vacuum() besides
> > compute_parallel_vacuum_workers(). So it's not like we might just
> > discover the inability to do parallelism during parallel
> > initialization?
>
> Right. Also, in parallel vacuum case, it allocates the space not only
> for dead items but also other data required to do parallelism like
> shared bulkdeletion results etc. Originally, in PG13,
> begin_parallel_vacuum() was called by lazy_scan_heap() but in PG14 it
> became part of dead_items_alloc() (see b4af70cb2). I agree to change
> this part so that lazy_scan_heap() calls begin_parallel_vacuum()
> (whatever we rename it). I'll incorporate this change in the
> refactoring patch barring any objections.
>
> >
> > It's also not particularly helpful to have a begin_parallel_vacuum() that
> > might not actually begin a parallel vacuum...
>
> During the development, I found that we have some begin_* functions
> that don't start the actual parallel job but prepare state data for
> starting it, and referred to _bt_begin_parallel(), so I named this one
> begin_parallel_vacuum(). But I admit that considering what the
> function actually does, something like
> create_parallel_vacuum_context() would be clearer.
>

How about if we name it parallel_vacuum_init(), which will be similar
to InitializeParallelDSM() and ExecInitParallelPlan()? Now, I see
there is some reasoning to keep it in dead_items_alloc() as both
primarily allocate memory for vacuum, but maybe we should name the
function vacuum_space_alloc instead of dead_items_alloc and similarly
rename dead_items_cleanup to vacuum_space_free. The other idea could
be to bring begin_parallel_vacuum() back into lazy_scan_heap(), but I
personally prefer to keep it where it is and improve the function
names. Would it be better to do this as a separate patch, as 0002,
because this might require some change in the vacuum code path?

> >
> > Minor nit:
> >
> > begin_parallel_vacuum()'s comment says:
> >  * On success (when we can launch one or more workers), will set dead_items 
> > and
> >  * lps in vacrel for caller.
> >
> > But it actually doesn't know whether we can start workers. It just checks
> > max_parallel_maintenance_workers, no?
>
> Yes, we cannot know whether we can actually start workers when
> starting parallel index vacuuming. It returns non-NULL if we request
> one or more workers.
>

So can we adjust the comments? I think the part of the sentence "when
we can launch one or more workers" seems to be the cause of confusion;
can we remove it?

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-12-12 Thread Masahiko Sawada
On Sat, Dec 11, 2021 at 3:29 PM Amit Kapila  wrote:
>
> On Fri, Dec 10, 2021 at 11:14 AM Masahiko Sawada  
> wrote:
> >
> > On Thu, Dec 9, 2021 at 6:16 PM Amit Kapila  wrote:
> > >
> > > On Thu, Dec 9, 2021 at 2:24 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Dec 9, 2021 at 11:47 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > I am thinking that we can start a transaction, update the catalog,
> > > > > commit that transaction. Then start a new one to update
> > > > > origin_lsn/timestamp, finishprepared, and commit it. Now, if it
> > > > > crashes after the first transaction, only commit prepared will be
> > > > > resent again and this time we don't need to update the catalog as that
> > > > > entry would be already cleared.
> > > >
> > > > Sounds good. In the crash case, it should be fine since we will just
> > > > commit an empty transaction. The same is true for the case where
> > > > skip_xid has been changed after skipping and preparing the transaction
> > > > and before handling commit_prepared.
> > > >
> > > > Regarding the case where the user specifies XID of the transaction
> > > > after it is prepared on the subscriber (i.g., the transaction is not
> > > > empty), we won’t skip committing the prepared transaction. But I think
> > > > that we don't need to support skipping already-prepared transaction
> > > > since such transaction doesn't conflict with anything regardless of
> > > > having changed or not.
> > > >
> > >
> > > Yeah, this makes sense to me.
> > >
> >
> > I've attached an updated patch. The new syntax is like "ALTER
> > SUBSCRIPTION testsub SKIP (xid = '123')".
> >
> > I've been thinking we could add some safeguard for the case where
> > the user specifies the wrong xid. For example, can we somehow use the
> > stats in pg_stat_subscription_workers? An idea is that the logical
> > replication worker fetches the xid from the stats when reading the
> > subscription and skips the transaction if the xid matches
> > subskipxid. That is, the worker checks the error reported by the
> > worker previously working on the same subscription. The error might
> > not be a conflict error (e.g., a connection error, etc.) or might
> > have been cleared by the reset function. But given the worker is in
> > an error loop, it can eventually get the xid in question. We can
> > prevent an unrelated transaction from being skipped unexpectedly. It
> > doesn't seem a stable solution, though. Or it might be enough to warn
> > users when they specify an XID that doesn't match last_error_xid.
> >
>
> I think the idea is good, but because it is not predictable, as
> pointed out by you, we might want to just issue a LOG/WARNING. If not
> already mentioned, then please do mention in the docs the possibility
> of skipping non-errored transactions.
>
> Few comments/questions:
> =
> 1.
> +  Specifies the ID of the transaction whose application is to
> be skipped
> +  by the logical replication worker. Setting -1 means to reset the
> +  transaction ID.
>
> Can we change it to something like: "Specifies the ID of the
> transaction whose changes are to be skipped by the logical replication
> worker. "
>

Agreed.

> 2.
> @@ -104,6 +104,16 @@ GetSubscription(Oid subid, bool missing_ok)
>   Assert(!isnull);
>   sub->publications = textarray_to_stringlist(DatumGetArrayTypeP(datum));
>
> + /* Get skip XID */
> + datum = SysCacheGetAttr(SUBSCRIPTIONOID,
> + tup,
> + Anum_pg_subscription_subskipxid,
> + &isnull);
> + if (!isnull)
> + sub->skipxid = DatumGetTransactionId(datum);
> + else
> + sub->skipxid = InvalidTransactionId;
>
> Can't we assign it as we do for other fixed columns like subdbid,
> subowner, etc.?
>

Yeah, I think we can use InvalidTransactionId as the initial value
instead of setting NULL. Then, we can change this code.

> 3.
> + * Also, we don't skip receiving the changes in streaming cases,
> since we decide
> + * whether or not to skip applying the changes when starting to apply 
> changes.
>
> But why so? Can't we even skip streaming (and writing to file all such
> messages)? If we can do this then we can avoid even collecting all
> messages in a file.

IIUC in streaming cases, a transaction can be sent to the subscriber
while being split into multiple chunks of changes. In the meantime,
skip_xid can be changed. If the user changed or cleared skip_xid after
the subscriber skipped some streamed changes, the subscriber won't be
able to have the complete changes of the transaction.

>
> 4.
> + * Also, one might think that we can skip preparing the skipped transaction.
> + * But if we do that, PREPARE WAL record won’t be sent to its physical
> + * standbys, resulting in that users won’t be able to find the prepared
> + * transaction entry after a fail-over.
> + *
> ..
> + */
> + if (skipping_changes)
> + stop_skipping_changes(false);
>
> Why do we need such a Prepare's entry either at current subscriber or
> on its physical standby? I think it is to allow Commi

Re: Use extended statistics to estimate (Var op Var) clauses

2021-12-12 Thread Zhihong Yu
On Sun, Dec 12, 2021 at 6:04 PM Tomas Vondra 
wrote:

> On 8/31/21 00:14, Zhihong Yu wrote:
> > Hi,
> > For patch 0002,
> >
> > +   s1 = statext_clauselist_selectivity(root, clauses, varRelid,
> > +   jointype, sjinfo, rel,
> > +   &estimatedclauses, false);
> > +
> > +   estimated = (bms_num_members(estimatedclauses) == 1);
> >
> > I took a look at clauselist_apply_dependencies() (called by
> > statext_clauselist_selectivity) where estimatedclauses is modified.
> > Since the caller would not use the returned Selectivity if number of
> > elements in estimatedclauses is greater than 1, I wonder
> > if a parameter can be added to clauselist_apply_dependencies() which
> > indicates early return if the second element is added to
> > estimatedclauses.
> >
>
> Hmmm, I'm not sure I understand your point. Are you suggesting there's a
> bug in not updating the bitmap, or would this be an optimization? Can
> you give an example of a query affected by this?
>
Hi,
My previous comment was from 3 months ago - let me see if I can come up
with an example.

Cheers

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Use extended statistics to estimate (Var op Var) clauses

2021-12-12 Thread Tomas Vondra

Hi,

I finally got around to this patch again, focusing mostly on the first 
part that simply returns either 1.0 or 0.0 for Var op Var conditions 
(i.e. the part not really using extended statistics).


I have been unhappy about using examine_variable, which does various 
expensive things like searching for statistics (which only got worse 
because now we're also looking for expression stats). But we don't 
really need the stats - we just need to check the Vars match (same 
relation, same attribute). So 0002 fixes this.


Which got me thinking that maybe we don't need to restrict this to Var 
nodes only. We can just as easily compare arbitrary expressions, 
provided it's for the same relation and there are no volatile functions. 
So 0003 does this. Conditions with the same complex expression on each 
side of an operator are probably fairly rare, but it's cheap so why not.


0004 and 0005 parts are unchanged.


The next step is adding some tests to the first parts, and extending
the tests in the main patch (to also use more complex expressions, if 
0003 gets included).
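
As a quick illustration of the clauses the first part targets (using a
hypothetical table t with a column a):

    -- estimated as selectivity 1.0 (all rows expected to match)
    EXPLAIN SELECT * FROM t WHERE a = a;
    EXPLAIN SELECT * FROM t WHERE a <= a;

    -- estimated as selectivity 0.0 (no rows expected to match)
    EXPLAIN SELECT * FROM t WHERE a < a;
    EXPLAIN SELECT * FROM t WHERE a <> a;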



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 91371c3e4252580a30530d5b625f72d5525c5c73 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 26 Aug 2021 23:01:04 +0200
Subject: [PATCH 1/5] Improve estimates for Var op Var with the same Var

When estimating (Var op Var) conditions, we can treat the case with the
same Var on both sides as a special case, and we can provide better
selectivity estimate than for the generic case.

For example for (a = a) we know it's 1.0, because all rows are expected
to match. Similarly for (a != a), which has selectivity 0.0. And the
same logic can be applied to inequality comparisons, like (a < a) etc.

In principle, those clauses are a bit strange and queries are unlikely
to use them. But query generators sometimes do silly things, and these
checks are quite cheap so it's likely a win.
---
 src/backend/utils/adt/selfuncs.c | 77 +++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10895fb287..59038895ec 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,8 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 		 MemoryContext outercontext,
 		 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static bool matching_restriction_variables(PlannerInfo *root, List *args,
+		   int varRelid);
 
 /*
  *		eqsel			- Selectivity of "=" for any data types.
@@ -256,6 +257,14 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate)
 		}
 	}
 
+	/*
+	 * If it's (variable = variable) with the same variable on both sides, it's
+	 * a special case and we know it's not expected to filter anything, so we
+	 * estimate the selectivity as 1.0 (or 0.0 if it's negated).
+	 */
+	if (matching_restriction_variables(root, args, varRelid))
+		return (negate) ? 0.0 : 1.0;
+
 	/*
 	 * If expression is not variable = something or something = variable, then
 	 * punt and return a default estimate.
@@ -1408,6 +1417,22 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	Oid			consttype;
 	double		selec;
 
+	/*
+	 * Handle (variable < variable) and (variable <= variable) with the same
+	 * variable on both sides as a special case. The strict inequality should
+	 * not match any rows, hence selectivity is 0.0. The other case is about
+	 * the same as equality, so selectivity is 1.0.
+	 */
+	if (matching_restriction_variables(root, args, varRelid))
+	{
+		/* The case with equality matches all rows, so estimate it as 1.0. */
+		if (iseq)
+			PG_RETURN_FLOAT8(1.0);
+
+		/* Strict inequality matches nothing, so selectivity is 0.0. */
+		PG_RETURN_FLOAT8(0.0);
+	}
+
 	/*
 	 * If expression is not variable op something or something op variable,
 	 * then punt and return a default estimate.
@@ -4871,6 +4896,56 @@ get_restriction_variable(PlannerInfo *root, List *args, int varRelid,
 	return false;
 }
 
+
+/*
+ * matching_restriction_variables
+ *		Examine the args of a restriction clause to see if it's of the
+ *		form (variable op variable) with the same variable on both sides.
+ *
+ * Inputs:
+ *	root: the planner info
+ *	args: clause argument list
+ *	varRelid: see specs for restriction selectivity functions
+ *
+ * Returns true if the same variable is on both sides, otherwise false.
+ */
+static bool
+matching_restriction_variables(PlannerInfo *root, List *args, int varRelid)
+{
+	Node	   *left,
+			   *right;
+	VariableStatData ldata;
+	VariableStatData rdata;
+	bool		res = false;
+
+	/* Fail if not a binary opclause (probably shouldn't happen) */
+	if (list_length(args) != 2)
+		return false;
+
+	left = (Node *) linitial(args);
+	right = (Node *) lsecond(args);
+
+	/*
+	 * Examine both sides.  Note that when varRelid is nonzero, Va

Re: Make pg_waldump report replication origin ID, LSN, and timestamp.

2021-12-12 Thread Michael Paquier
On Thu, Dec 09, 2021 at 05:42:56PM +0900, Masahiko Sawada wrote:
> Thank you for updating the patch. The patch looks good to me.

Done this way.  Please note that while testing this patch, I have
found a completely different issue.  I'll spawn a thread about that in
a minute.
--
Michael


signature.asc
Description: PGP signature


Re: Use extended statistics to estimate (Var op Var) clauses

2021-12-12 Thread Tomas Vondra

On 8/31/21 00:14, Zhihong Yu wrote:

Hi,
For patch 0002,

+                   s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+                                                       jointype, sjinfo, rel,
+                                                       &estimatedclauses, false);
+
+                   estimated = (bms_num_members(estimatedclauses) == 1);

I took a look at clauselist_apply_dependencies() (called by 
statext_clauselist_selectivity) where estimatedclauses is modified.
Since the caller would not use the returned Selectivity if number of 
elements in estimatedclauses is greater than 1, I wonder
if a parameter can be added to clauselist_apply_dependencies() which 
indicates early return if the second element is added to estimatedclauses.




Hmmm, I'm not sure I understand your point. Are you suggesting there's a 
bug in not updating the bitmap, or would this be an optimization? Can 
you give an example of a query affected by this?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Defining (and possibly skipping) useless VACUUM operations

2021-12-12 Thread Peter Geoghegan
Robert Haas has written on the subject of useless vacuuming, here:

http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

I'm sure at least a few of us have thought about the problem at some
point. I would like to discuss how we can actually avoid useless
vacuuming, and what our goals should be.

I am currently working on decoupling advancing relfrozenxid from tuple
freezing [1]. That is, I'm teaching VACUUM to keep track of
information that it uses to generate an "optimal value" for the
table's final relfrozenxid: the most recent XID value that might still
be in the table. This patch is based on the observation that we don't
actually have to use the FreezeLimit cutoff for our new
pg_class.relfrozenxid. We need only obey the basic relfrozenxid
invariant, which is that the final value must be <= any extant XID in
the table.  Using FreezeLimit is needlessly conservative.

My draft patch to implement the optimization (which builds on the
patches already posted to [1]) will reliably set pg_class.relfrozenxid
to the same VACUUM's precise original OldestXmin once certain
conditions are met -- reasonably common conditions. For example, the
same precise OldestXmin XID is used for relfrozenxid in the event of a
manual VACUUM (without FREEZE) on a table that was just bulk-loaded,
assuming the system is otherwise idle. Setting relfrozenxid to the
precise lowest safe value happens on a best-effort basis, without
needlessly tying that to things like when or how we freeze tuples.

It now occurs to me to push this patch in another direction, on top of
all that: the OldestXmin behavior hints at a precise, robust way of
defining "useless vacuuming". We can condition skipping a VACUUM (i.e.
whether a VACUUM is considered "definitely won't be useful if allowed
to execute") on whether or not our preexisting pg_class.relfrozenxid
precisely equals our newly-acquired OldestXmin for an about-to-begin
VACUUM operation.  (We'd also want to add an "unchangeable
pg_class.relminmxid" test, I think.)
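
In pseudo-code, the skip test would boil down to something like this
(a sketch of the idea only, not actual patch code; the function name
is made up):

    static bool
    vacuum_is_provably_useless(Relation rel,
                               TransactionId OldestXmin,
                               MultiXactId OldestMxact)
    {
        Form_pg_class classForm = RelationGetForm(rel);

        /*
         * If the preexisting relfrozenxid/relminmxid already match the
         * values we just acquired, nothing can have become removable
         * or freezable since the last VACUUM.
         */
        return TransactionIdEquals(classForm->relfrozenxid, OldestXmin) &&
               classForm->relminmxid == OldestMxact;
    }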

This definition does seem to be close to ideal: We're virtually
assured that there will be no more useful work for us, in a way that
is grounded in theory but still quite practical. But it's not a slam
dunk. A person could still argue that we shouldn't cancel the VACUUM
before it has begun, even when all these conditions have been met.
This would not be a particularly strong argument, mind you, but it's
still worth taking seriously. We need an exact problem statement that
justifies whatever definition of "useless VACUUM" we settle on.

Here are arguments *against* the skipping behavior I sketched out:

* An aborted transaction might need to be cleaned up, which should be
able to go ahead despite the unchanged OldestXmin. (I think that this
is the argument with the most merit, by quite a bit.)

* In general index AMs may want to do deferred cleanup, say to place
previously deleted pages in the FSM. Although in practice the criteria
for recycling safety used by nbtree and GiST will make that
impossible, there is no fundamental reason why they need to work that
way (XIDs are used, but only because they provide a conveniently
available notion of "logical time" that is sufficient to implement
what Lanin & Shasha call "the drain technique"). Plus GIN really could
do real work in amvacuumcleanup, for the pending list. There are bound
to be a handful of marginal things like this.

* Who are we to intervene like this, anyway? (Makes much more sense if
we don't limit ourselves to autovacuum worker operations.)

Offhand, I suspect that we should only consider skipping "useless"
anti-wraparound autovacuums (not other kinds of autovacuums, not
manual VACUUMs). The arguments against skipping are weakest for the
anti-wraparound case. And the arguments in favor are particularly
strong: we should specifically avoid starting a useless (and possibly
time-consuming) anti-wraparound autovacuum, because that could easily
block an actually-useful autovacuum launched some time later. We
should aim to be in a position to launch an anti-wraparound autovacuum
that can actually advance relfrozenxid as soon as that becomes
possible (e.g. when the DBA drops an old replication slot that was
holding back each VACUUM's OldestXmin). And so "skipping" makes us
much more responsive, which seems like it might matter a lot in
practice. It minimizes the risk of wraparound failure.

There is also a strong argument for logging our failure to clean up
anything in any autovacuum -- we don't do nearly enough alerting when
stuff like this happens (possibly because "useless" is such a squishy
concept right now?). Even just logging still requires defining
"useless VACUUM operation" in a way that is both reliable and
proportionate, so it necessitates solving the same hard problem.

[1] https://commitfest.postgresql.org/36/3433/
-- 
Peter Geoghegan




Re: Atomic rename feature for Windows.

2021-12-12 Thread Michael Paquier
On Thu, Dec 09, 2021 at 11:33:17PM -0500, Tom Lane wrote:
> My general approach to platform compatibility is that when we
> break compatibility with old versions of something, we should do so
> because it will bring concrete benefits.  If we can plausibly
> drop support for Windows versions that don't have POSIX rename
> semantics, I'm 100% for that.  I'm not for dropping support for
> some platform just because it's old.

I'd agree with that.  Now, I would also say that if we need something
depending on a newer version of _WIN32_WINNT that proves to be trickier
or even impossible on older versions, there could be an argument for
dropping those older versions, even in the back-branches, if the problem
to be fixed is bad enough.  In recent history we've never had to go that
far, AFAIK.
--
Michael


signature.asc
Description: PGP signature


Re: Logical replication error "no record found" /* shouldn't happen */

2021-12-12 Thread Alvaro Herrera
On 2021-Jul-23, Andrey Borodin wrote:

> Hi!
> 
> From time to time I observe $subj on clusters using logical replication.
> In most cases there are a lot of other errors. The $subj condition
> should probably be all but impossible without other problems.
> I propose to enhance error logging of XLogReadRecord() in ReadPageInternal().

Hmm.

A small problem in this patch is that XLogReaderValidatePageHeader
already sets errormsg_buf; you're overwriting that.  I suggest to leave
that untouched.  There are two other cases where the problem occurs in
the page_read() callback; ReadPageInternal explicitly documents that it
doesn't set the error in those cases.  We have two options to deal with
that:

1. change all existing callbacks to set the errormsg_buf depending on
what actually fails, and then if they return failure without an error
message, add something like your proposed message.
2. throw an error directly in the callback rather than returning.  I don't
think this strategy actually works.

I attach a cut-down patch that doesn't deal with the page_read callbacks
issue, just added stub comments in xlog.c where something should be
done.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..83976cb014 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12467,6 +12467,7 @@ retry:
 		 private->replayTLI,
 		 xlogreader->EndRecPtr))
 		{
+			/* XXX should this path set errormsg_buf? */
 			if (readFile >= 0)
 close(readFile);
 			readFile = -1;
@@ -12598,7 +12599,10 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
+	{
+		/* XXX should set errormsg_buf here */
 		return -1;
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a7de02565..5b61593820 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -650,14 +650,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		   state->currRecPtr,
 		   state->readBuf);
 		if (readLen < 0)
+		{
+			report_invalid_record(state,
+  "attempt to read page of next segment failed at %X/%X",
+  LSN_FORMAT_ARGS(targetSegmentPtr));
 			goto err;
+		}
 
 		/* we can be sure to have enough WAL available, we scrolled back */
 		Assert(readLen == XLOG_BLCKSZ);
 
 		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
 		  state->readBuf))
+		{
+			/* XLogReaderValidatePageHeader sets errormsg_buf */
 			goto err;
+		}
 	}
 
 	/*
@@ -668,13 +676,18 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	   state->currRecPtr,
 	   state->readBuf);
 	if (readLen < 0)
-		goto err;
+		goto err;	/* XXX errmsg? */
 
 	Assert(readLen <= XLOG_BLCKSZ);
 
 	/* Do we have enough data to check the header length? */
 	if (readLen <= SizeOfXLogShortPHD)
+	{
+		report_invalid_record(state,
+			  "unable to read short header of %d bytes at %X/%X",
+			  readLen, LSN_FORMAT_ARGS(pageptr));
 		goto err;
+	}
 
 	Assert(readLen >= reqLen);
 
@@ -687,14 +700,17 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		   state->currRecPtr,
 		   state->readBuf);
 		if (readLen < 0)
-			goto err;
+			goto err;	/* XXX errmsg */
 	}
 
 	/*
 	 * Now that we know we have the full header, validate it.
 	 */
 	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
+	{
+		/* XLogReaderValidatePageHeader sets errormsg_buf */
 		goto err;
+	}
 
 	/* update read state information */
 	state->seg.ws_segno = targetSegNo;


Re: [PATCH] pg_stat_toast

2021-12-12 Thread Andres Freund
Hi,

On 2021-12-13 00:00:23 +0100, Gunnar "Nick" Bluth wrote:
> Regarding stats size: it adds one PgStat_BackendToastEntry
> (PgStat_BackendAttrIdentifier + PgStat_ToastCounts, should be 56-64 bytes or
> something in that ballpark) per TOASTable attribute; I can't see that making
> any system break a sweat ;-)

That's actually a lot. The problem is that all the stats data for a database
is loaded into private memory for each connection to that database, and that
the stats collector regularly writes out all the stats data for a database.


> A quick run comparing 1.000.000 INSERTs (2 TOASTable columns each) with and
> without "pgstat_track_toast" resulted in 12792.882 ms vs. 12810.557 ms. So
> at least the call overhead seems to be negligible.

Yea, you'd probably need a few more tables and a few more connections for it
to have a chance of mattering meaningfully.

Greetings,

Andres Freund




Re: can we add subscription TAP test option "vcregress subscriptioncheck" for MSVC builds?

2021-12-12 Thread Juan José Santamaría Flecha
On Thu, Oct 21, 2021 at 7:21 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

>
> Here's the documentation v1 patch that I've come up with. Please review it.
>
> There's a typo:
+   To run an arbitrary TAP test set, run vcregress taptest
+   comamnd. For example, use the following command for running subcription TAP
+   tests:
s/comamnd/command/

Regarding the wording, I like better what vcregress prints as help, so
something like:
+   You can use vcregress taptest TEST_DIR to run an
+   arbitrary TAP test set, where TEST_DIR is a required argument pointing to
+   the directory where the tests reside. For example, use the following
+   command for running the subscription TAP tests:
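
(Presumably the command in question would be something like
"vcregress taptest src\test\subscription"; I am inferring the test
directory, since the quoted hunk ends before showing it.)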

Regards,

Juan José Santamaría Flecha


Re: [PATCH] pg_stat_toast

2021-12-12 Thread Gunnar "Nick" Bluth

Am 12.12.21 um 22:52 schrieb Andres Freund:

Hi,

On 2021-12-12 17:20:58 +0100, Gunnar "Nick" Bluth wrote:

Please have a look at the attached patch, which implements some statistics
for TOAST.

The idea (and patch) have been lurking here for quite a while now, so I
decided to dust it off, rebase it to HEAD and send it out for review today.

A big shoutout to Georgios Kokolatos, who gave me a crash course in PG
hacking, some very useful hints and valuable feedback early this year.

I'd like to get some feedback about the general idea, approach, naming etc.
before refining this further.


I'm worried about the additional overhead this might impose. For some workloads
it'll substantially increase the amount of stats traffic. Have you tried to
quantify the overheads? Both in stats size and in stats management overhead?


I'd lie if I claimed so...

Regarding stats size: it adds one PgStat_BackendToastEntry
(PgStat_BackendAttrIdentifier + PgStat_ToastCounts, should be 56-64
bytes or something in that ballpark) per TOASTable attribute; I can't
see that making any system break a sweat ;-)


A quick run comparing 1.000.000 INSERTs (2 TOASTable columns each) with 
and without "pgstat_track_toast" resulted in 12792.882 ms vs. 12810.557 
ms. So at least the call overhead seems to be negligible.


Obviously, this was really a quick run and doesn't reflect real life.
I'll have the machine run some reasonable tests asap, also looking at 
stat size, of course!


Best regards,
--
Gunnar "Nick" Bluth

Eimermacherweg 106
D-48159 Münster

Mobil +49 172 8853339
Email: gunnar.bl...@pro-open.de
__
"Ceterum censeo SystemD esse delendam" - Cato




Re: extended stats on partitioned tables

2021-12-12 Thread Tomas Vondra




On 12/12/21 22:32, Justin Pryzby wrote:

On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote:

The one thing bugging me a bit is that the regression test checks only a
GROUP BY query. It'd be nice to add queries testing MCV/dependencies
too, but that seems tricky because most queries will use per-partitions
stats.


You mean because the quals are pushed down to the scan node.

Does that indicate a deficiency ?

If extended stats are collected for a parent table, selectivity estimates based
from the parent would be better; but instead we use uncorrected column
estimates from the child tables.

From what I see, we could come up with a way to avoid the pushdown, involving
volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc.

But would it be better if extended stats objects on partitioned tables were to
collect stats for both parent AND CHILD ?  I'm not sure.  Maybe that's the
wrong solution, but maybe we should still document that extended stats on
(empty) parent tables are often themselves not used/useful for selectivity
estimates, and the user should instead (or in addition) create stats on child
tables.

Or, maybe if there's no extended stats on the child tables, stats on the parent
table should be consulted ?



Maybe, but that seems like a mostly separate improvement. At this point 
I'm interested only in testing the behavior implemented in the current 
patches.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] pg_stat_toast

2021-12-12 Thread Andres Freund
Hi,

On 2021-12-12 17:20:58 +0100, Gunnar "Nick" Bluth wrote:
> Please have a look at the attached patch, which implements some statistics
> for TOAST.
> 
> The idea (and patch) have been lurking here for quite a while now, so I
> decided to dust it off, rebase it to HEAD and send it out for review today.
> 
> A big shoutout to Georgios Kokolatos, who gave me a crash course in PG
> hacking, some very useful hints and valuable feedback early this year.
> 
> I'd like to get some feedback about the general idea, approach, naming etc.
> before refining this further.

I'm worried about the additional overhead this might impose. For some workloads
it'll substantially increase the amount of stats traffic. Have you tried to
quantify the overheads? Both in stats size and in stats management overhead?

Greetings,

Andres Freund




Re: extended stats on partitioned tables

2021-12-12 Thread Justin Pryzby
On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote:
> On 12/12/21 18:52, Justin Pryzby wrote:
> That's mostly a conscious choice, so that I don't have to include
> parsetree.h. But maybe that'd be better ...
> 
> > The regression tests changed as a result of not populating stx_data; I think
> > it may be better to update like this:
> > 
> > SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL
> >   FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d
> >   ON d.stxoid = s.oid
> >   WHERE s.stxname = 'ab1_a_b_stats';
> > 
> 
> Not sure I understand. Why would this be better than inner join?

It shows that there's an entry in pg_statistic_ext but not one in ext_data,
rather than merely that it's missing from at least one of the catalogs.  Which
is nice to show since, as you say, they're no longer 1:1.

> Thanks, fixed. Can you read through the commit messages and check the
> attribution is correct for all the patches?

Seems fine.

-- 
Justin




Re: extended stats on partitioned tables

2021-12-12 Thread Justin Pryzby
On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote:
> The one thing bugging me a bit is that the regression test checks only a
> GROUP BY query. It'd be nice to add queries testing MCV/dependencies
> too, but that seems tricky because most queries will use per-partitions
> stats.

You mean because the quals are pushed down to the scan node.

Does that indicate a deficiency ?

If extended stats are collected for a parent table, selectivity estimates based
from the parent would be better; but instead we use uncorrected column
estimates from the child tables.

>From what I see, we could come up with a way to avoid the pushdown, involving
volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc.

But would it be better if extended stats objects on partitioned tables were to
collect stats for both parent AND CHILD ?  I'm not sure.  Maybe that's the
wrong solution, but maybe we should still document that extended stats on
(empty) parent tables are often themselves not used/useful for selectivity
estimates, and the user should instead (or in addition) create stats on child
tables.

Or, maybe if there's no extended stats on the child tables, stats on the parent
table should be consulted ?

-- 
Justin




Re: Windows now has fdatasync()

2021-12-12 Thread Thomas Munro
On Sun, Dec 12, 2021 at 3:48 PM Thomas Munro  wrote:
> [...] I
> tried out a quick POC patch and it runs a bit faster than fsync(), as
> expected.  I'm not sure if it's worth bothering with or not given the
> other options, but figured it was worth sharing.

One reason to consider developing this further is the problem
discussed in the aptly named thread "Loaded footgun open_datasync on
Windows"[1] (not the problem that was fixed in pg_test_fsync, but the
problem with cache control, or lack thereof).  I saw 10x more TPS with
open_datasync than with this experimental fdatasync on my little test
VM, which was more than a little fishy, so I turned off the device
write caching in "Device Manager" and got about the same number from
open_datasync and fdatasync.  Clearly you can lose committed
transactions on power loss[2] with the default OS settings and default
PostgreSQL settings, though perhaps only if you're on SATA storage due
to lack of FUA pass-through[3] (?).  I didn't try an NVMe stack.

That suggests that fdatasync would actually be a better default ...
except for the problems already mentioned with versions and not
working on non-NTFS (not sure how it fails on non-NTFS, though, maybe
it does a full flush, [4] doesn't say).

(The patch is a little rough; I couldn't figure out the headers to get
that macro.  Insert my usual disclaimer that I'm not a Windows guy,
this is stuff I'm just figuring out, all clues welcome...)
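
For reference, a rough sketch of what such an fdatasync() shim could
look like; the flag value and the dynamic lookup below are assumptions
based on the ntifs.h documentation in [4], not code from the attached
patch, and real error handling would need to map NTSTATUS to errno:

#include <windows.h>
#include <winternl.h>
#include <io.h>
#include <string.h>

#ifndef FLUSH_FLAGS_FILE_DATA_SYNC_ONLY
#define FLUSH_FLAGS_FILE_DATA_SYNC_ONLY 0x00000004	/* per ntifs.h docs */
#endif

#ifndef NT_SUCCESS
#define NT_SUCCESS(status) (((NTSTATUS) (status)) >= 0)
#endif

typedef NTSTATUS (NTAPI *pg_NtFlushBuffersFileEx_t) (HANDLE, ULONG, PVOID,
													 ULONG, PIO_STATUS_BLOCK);

static int
win32_fdatasync(int fd)
{
	static pg_NtFlushBuffersFileEx_t pNtFlushBuffersFileEx = NULL;
	IO_STATUS_BLOCK iosb;
	NTSTATUS	status;

	/* Resolve the NT function once; it lives in ntdll.dll. */
	if (pNtFlushBuffersFileEx == NULL)
		pNtFlushBuffersFileEx = (pg_NtFlushBuffersFileEx_t)
			GetProcAddress(GetModuleHandle(TEXT("ntdll.dll")),
						   "NtFlushBuffersFileEx");
	if (pNtFlushBuffersFileEx == NULL)
		return -1;				/* caller falls back to plain fsync/_commit */

	memset(&iosb, 0, sizeof(iosb));
	status = pNtFlushBuffersFileEx((HANDLE) _get_osfhandle(fd),
								   FLUSH_FLAGS_FILE_DATA_SYNC_ONLY,
								   NULL, 0, &iosb);
	return NT_SUCCESS(status) ? 0 : -1;
}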

[1] 
https://www.postgresql.org/message-id/flat/1527846213.2475.31.camel%40cybertec.at
[2] https://github.com/MicrosoftDocs/feedback/issues/2747
[3] https://devblogs.microsoft.com/oldnewthing/20170510-00/?p=95505
[4] 
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntflushbuffersfileex




Re: extended stats on partitioned tables

2021-12-12 Thread Tomas Vondra
On 12/12/21 18:52, Justin Pryzby wrote:
> +  * XXX Can't we simply look at rte->inh?
> +  */
> + inh = root->append_rel_array == NULL ? false :
> + root->append_rel_array[onerel->relid]->parent_relid != 0;
> 
> I think so.  That's what I came up with while trying to figure this out, and
> it's no great surprise that it needed to be cleaned up - thanks.
> 

OK, fixed.

> In your 0003 patch, the "if inh: break" isn't removed from examine_variable(),
> but the corresponding thing is removed everywhere else.
> 

Ah, you're right. And it wasn't updated in the 0002 patch either - it
should do the relkind check too, to allow partitioned tables. Fixed.

> In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than
> rt_fetch.
> 

That's mostly a conscious choice, so that I don't have to include
parsetree.h. But maybe that'd be better ...

> The regression tests changed as a result of not populating stx_data; I think
> it may be better to update like this:
> 
> SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL
>   FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d
>   ON d.stxoid = s.oid
>   WHERE s.stxname = 'ab1_a_b_stats';
> 

Not sure I understand. Why would this be better than inner join?

> There's this part about documentation for the changes in backbranches:
> 
> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
>> Also, I think in backbranches we should document what's being stored in
>> pg_statistic_ext, since it's pretty unintuitive:
>>  - noninherted stats (FROM ONLY) for inheritence parents;
>>  - inherted stats (FROM *) for partitioned tables;
> 
> spellcheck: inheritence should be inheritance.
> 

Thanks, fixed. Can you read through the commit messages and check the
attribution is correct for all the patches?

> All for now.  I'm going to update the regression tests for dependencies and 
> the
> other code paths.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 77bd34660d15d6ba03538c6fe45a196a162c196c Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 12 Dec 2021 02:28:41 +0100
Subject: [PATCH 4/4] Refactor parent ACL check

selfuncs.c is 8k lines long, and this makes it 30 LOC shorter.
---
 src/backend/utils/adt/selfuncs.c | 140 ---
 1 file changed, 52 insertions(+), 88 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf4525392d..e4be9fc95d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -187,6 +187,8 @@ static char *convert_string_datum(Datum value, Oid typid, Oid collid,
   bool *failure);
 static double convert_timevalue_to_scalar(Datum value, Oid typid,
 		  bool *failure);
+static void recheck_parent_acl(PlannerInfo *root, VariableStatData *vardata,
+Oid relid);
 static void examine_simple_variable(PlannerInfo *root, Var *var,
 	VariableStatData *vardata);
 static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
@@ -5153,51 +5155,7 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid,
 	(pg_class_aclcheck(rte->relid, userid,
 	   ACL_SELECT) == ACLCHECK_OK);
 
-/*
- * If the user doesn't have permissions to
- * access an inheritance child relation, check
- * the permissions of the table actually
- * mentioned in the query, since most likely
- * the user does have that permission.  Note
- * that whole-table select privilege on the
- * parent doesn't quite guarantee that the
- * user could read all columns of the child.
- * But in practice it's unlikely that any
- * interesting security violation could result
- * from allowing access to the expression
- * index's stats, so we allow it anyway.  See
- * similar code in examine_simple_variable()
- * for additional comments.
- */
-if (!vardata->acl_ok &&
-	root->append_rel_array != NULL)
-{
-	AppendRelInfo *appinfo;
-	Index		varno = index->rel->relid;
-
-	appinfo = root->append_rel_array[varno];
-	while (appinfo &&
-		   planner_rt_fetch(appinfo->parent_relid,
-			root)->rtekind == RTE_RELATION)
-	{
-		varno = appinfo->parent_relid;
-		appinfo = root->append_rel_array[varno];
-	}
-	if (varno != index->rel->relid)
-	{
-		/* Repeat access check on this rel */
-		rte = planner_rt_fetch(varno, root);
-		Assert(rte->rtekind == RTE_RELATION);
-
-		userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
-
-		vardata->acl_ok =
-			rte->securityQuals == NIL &&
-			(pg_class_aclcheck(rte->relid,
-			   userid,
-			   ACL_SELECT) == ACLCHECK_OK);
-	}
-			

Re: extended stats on partitioned tables

2021-12-12 Thread Tom Lane
Tomas Vondra  writes:
> On 12/12/21 16:37, Zhihong Yu wrote:
>> Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking 
>> inh, I think it would be better if the above style of checking is used 
>> throughout the patch (without introducing rte variable).

> It's mostly a matter of personal taste, but I always found this style of 
> condition (i.e. dereferencing a pointer returned by a function) much 
> less readable. It's hard to parse what exactly is happening, what struct 
> type are we dealing with, etc. YMMV but the separate variable makes it 
> much clearer for me. And I'd expect the compilers to produce pretty much 
> the same code too for those cases.

FWIW, I agree.  Also, it's possible that future patches would create a
need to touch the RTE again nearby, in which case having the variable
makes it easier to write non-crummy code for that.

regards, tom lane




Re: extended stats on partitioned tables

2021-12-12 Thread Justin Pryzby
+  * XXX Can't we simply look at rte->inh?
+  */
+ inh = root->append_rel_array == NULL ? false :
+ root->append_rel_array[onerel->relid]->parent_relid != 0;

I think so.  That's what I came up with while trying to figure this out, and
it's no great surprise that it needed to be cleaned up - thanks.

In your 0003 patch, the "if inh: break" isn't removed from examine_variable(),
but the corresponding thing is removed everywhere else.

In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than
rt_fetch.

The regression tests changed as a result of not populating stx_data; I think
it may be better to update like this:

SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL
  FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d
  ON d.stxoid = s.oid
  WHERE s.stxname = 'ab1_a_b_stats';

There's this part about documentation for the changes in backbranches:

On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
> Also, I think in backbranches we should document what's being stored in
> pg_statistic_ext, since it's pretty unintuitive:
>  - noninherted stats (FROM ONLY) for inheritence parents;
>  - inherted stats (FROM *) for partitioned tables;

spellcheck: inheritence should be inheritance.

All for now.  I'm going to update the regression tests for dependencies and the
other code paths.

-- 
Justin




Re: extended stats on partitioned tables

2021-12-12 Thread Tomas Vondra

On 12/12/21 16:37, Zhihong Yu wrote:


Hi,
For patch 1, minor comment:

+           if (planner_rt_fetch(onerel->relid, root)->inh)

Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking 
inh, I think it would be better if the above style of checking is used 
throughout the patch (without introducing rte variable).




It's mostly a matter of personal taste, but I always found this style of 
condition (i.e. dereferencing a pointer returned by a function) much 
less readable. It's hard to parse what exactly is happening, what struct 
type are we dealing with, etc. YMMV but the separate variable makes it 
much clearer for me. And I'd expect the compilers to produce pretty much 
the same code too for those cases.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: extended stats on partitioned tables

2021-12-12 Thread Tomas Vondra

On 12/12/21 14:47, Zhihong Yu wrote:



On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:

>

...

 > +       /* skip statistics with mismatching stxdinherit value */
 > +       if (stat->inherit != rte->inh)
 >
 > Should a log be added for the above case ?
 >

Why should we log this? It's an entirely expected case - there's a
mismatch between inheritance for the relation and statistics, simply
skipping it is the right thing to do.


Hi,
I agree that skipping should be fine (to avoid too much logging).



I'm not sure it's related to the amount of logging, really. It'd be just 
noise without any practical use, even for debugging purposes. If you 
have an inheritance tree, it'll automatically have one set of statistics 
for inh=true and one for inh=false. And this condition will always skip 
one of those, depending on what query is being estimated.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[PATCH] pg_stat_toast

2021-12-12 Thread Gunnar "Nick" Bluth

Hello -hackers!

Please have a look at the attached patch, which implements some 
statistics for TOAST.


The idea (and patch) have been lurking here for quite a while now, so I 
decided to dust it off, rebase it to HEAD and send it out for review today.


A big shoutout to Georgios Kokolatos, who gave me a crash course in PG 
hacking, some very useful hints and valuable feedback early this year.


I'd like to get some feedback about the general idea, approach, naming 
etc. before refining this further.


I'm not a C person and I s**k at git, so please be kind with me! ;-)
Also, I'm not subscribed here, so a CC would be much appreciated!


Why gather TOAST statistics?

TOAST is transparent and opaque at the same time.
Whilst we know that it's there and we know _that_ it works, we cannot 
generally tell _how well_ it works.


What we can't answer (easily) are questions like e.g.
- how many datums have been externalized?
- how many datums have been compressed?
- how often has a compression failed (resulted in no space saving)?
- how effective is the compression algorithm used on a column?
- how much time did the DB spend compressing/decompressing TOAST values?

The patch adds some functionality that will eventually be able to answer 
these (and probably more) questions.


Currently, #1 - #4 can be answered based on the view contained in
"pg_stat_toast.sql":


postgres=# CREATE TABLE test (i int, lz4 text COMPRESSION lz4, std text);
postgres=# INSERT INTO test  SELECT 
i,repeat(md5(i::text),100),repeat(md5(i::text),100) FROM 
generate_series(0,100000) x(i);

postgres=# SELECT * FROM pg_stat_toast WHERE schemaname = 'public';
-[ RECORD 1 ]+--
schemaname   | public
reloid   | 16829
attnum   | 2
relname  | test
attname  | lz4
externalizations | 0
compressions | 100001
compressionsuccesses | 100001
compressionsizesum   | 6299710
originalsizesum  | 320403204
-[ RECORD 2 ]+--
schemaname   | public
reloid   | 16829
attnum   | 3
relname  | test
attname  | std
externalizations | 0
compressions | 100001
compressionsuccesses | 100001
compressionsizesum   | 8198819
originalsizesum  | 320403204


Implementation
==
I added some callbacks in backend/access/table/toast_helper.c that call
"pgstat_report_toast_activity" in backend/postmaster/pgstat.c.


The latter (and the other additions there) are essentially 1:1 copies of
the function-statistics code.


Those were the perfect template, as IMHO the TOAST activities (well, 
what we're interested in at least) are very much comparable to function 
calls:
a) It doesn't really matter if the TOASTed data was committed, as "the 
damage is done" (i.e. CPU cycles were used) anyway
b) The information can (thus/best) be stored on DB level, no need to 
touch the relation or attribute statistics


I didn't find anything that could have been used as a hash key, so the
PgStat_StatToastEntry uses the shiny new PgStat_BackendAttrIdentifier
(containing relid Oid, attr int).

For persisting in the statsfile, I chose the identifier 'O' (as 'T' was 
taken).
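
A rough sketch of the entry layout described above; only relid/attr are
named in this mail, and the counter fields are inferred from the
pg_stat_toast view columns shown earlier, so they may not match the
patch exactly:

#include "postgres.h"
#include "pgstat.h"

typedef struct PgStat_BackendAttrIdentifier
{
	Oid			relid;			/* relation with the TOASTable column */
	int			attr;			/* attribute number of the column */
} PgStat_BackendAttrIdentifier;

typedef struct PgStat_ToastCounts
{
	PgStat_Counter externalizations;	/* datums stored out of line */
	PgStat_Counter compressions;		/* compression attempts */
	PgStat_Counter compressionsuccesses;	/* attempts that saved space */
	PgStat_Counter compressionsizesum;	/* total bytes after compression */
	PgStat_Counter originalsizesum;		/* total bytes before compression */
} PgStat_ToastCounts;

typedef struct PgStat_StatToastEntry
{
	PgStat_BackendAttrIdentifier attr;	/* hash key */
	PgStat_ToastCounts counts;
} PgStat_StatToastEntry;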



What's working?
===
- Gathering of TOAST externalization and compression events
- collecting the sizes before and after compression
- persisting in statsfile
- not breaking "make check"
- not crashing anything (afaict)

What's missing (yet)?
===
- proper definition of the "pgstat_track_toast" GUC
- Gathering of times (for compression [and decompression?])
- improve "pg_stat_toast" view and include it in the catalog
- documentation (obviously)
- proper naming (of e.g. the hash key type, functions, view columns etc.)
- would it be necessary to implement overflow protection for the size & 
time sums?


Thanks in advance & best regards,
--
Gunnar "Nick" Bluth

Eimermacherweg 106
D-48159 Münster

Mobil +49 172 8853339
Email: gunnar.bl...@pro-open.de
__
"Ceterum censeo SystemD esse delendam" - CatoFrom 05f81229fd2e81b8674649c8fd1e3857e2406fcd Mon Sep 17 00:00:00 2001
From: "Gunnar \"Nick\" Bluth" 
Date: Sun, 12 Dec 2021 15:35:35 +0100
Subject: [PATCH] initial patch of pg_stat_toast for -hackers

---
 pg_stat_toast.sql   |  19 ++
 src/backend/access/table/toast_helper.c |  19 ++
 src/backend/postmaster/pgstat.c | 311 +++-
 src/backend/utils/adt/pgstatfuncs.c |  60 +
 src/include/catalog/pg_proc.dat |  21 ++
 src/include/pgstat.h| 109 +
 6 files changed, 533 insertions(+), 6 deletions(-)
 create mode 100644 pg_stat_toast.sql

diff --git a/pg_stat_toast.sql b/pg_stat_toast.sql
new file mode 100644
index 00..1c653254ab
--- /dev/null
+++ b/pg_stat_toast.sql
@@ -0,0 +1,19 @@
+-- This creates a useable view, but the offset of 1 is annoying.
+-- T

Re: extended stats on partitioned tables

2021-12-12 Thread Zhihong Yu
On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra 
wrote:

> Hi,
>
> Attached is a rebased and cleaned-up version of these patches, with more
> comments, refactorings etc. Justin and Zhihong, can you take a look?
>
>
> 0001 - Ignore extended statistics for inheritance trees
>
> 0002 - Build inherited extended stats on partitioned tables
>
> Those are mostly just Justin's patches, with more detailed comments and
> updated commit message. I've considered moving the rel->inh check to
> statext_clauselist_selectivity, and then removing the check from
> dependencies and MCV. But I decided no to do that, because someone might
> be calling those functions directly (even if that's very unlikely).
>
> The one thing bugging me a bit is that the regression test checks only a
> GROUP BY query. It'd be nice to add queries testing MCV/dependencies
> too, but that seems tricky because most queries will use per-partitions
> stats.
>
>
> 0003 - Add stxdinherit flag to pg_statistic_ext_data
>
> This is the patch for master, allowing to build stats for both inherits
> flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
> how we deal with iterating both flags etc. I've adopted most of the
> Justin's fixup patches, except that in plancat.c I've refactored how we
> load the stats to process keys/expressions just once.
>
> It has the same issue with regression test using just a GROUP BY query,
> but if we add a test to 0001/0002, that'll fix this too.
>
>
> 0004 - Refactor parent ACL check
>
> Not sure about this - I doubt saving 30 rows in an 8kB file is really
> worth it. Maybe it is, but then maybe we should try cleaning up the
> other ACL checks in this file too? Seems mostly orthogonal to this
> thread, though.
>
Hi,
For patch 1, minor comment:

+   if (planner_rt_fetch(onerel->relid, root)->inh)

Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh,
I think it would be better if the above style of checking is used
throughout the patch (without introducing rte variable).

Cheers

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


Re: extended stats on partitioned tables

2021-12-12 Thread Zhihong Yu
On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra 
wrote:

>
>
> On 12/12/21 05:38, Zhihong Yu wrote:
> >
> >
> > On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra
> > <tomas.von...@enterprisedb.com> wrote:
> >
> > Hi,
> >
> > Attached is a rebased and cleaned-up version of these patches, with
> more
> > comments, refactorings etc. Justin and Zhihong, can you take a look?
> >
> >
> > 0001 - Ignore extended statistics for inheritance trees
> >
> > 0002 - Build inherited extended stats on partitioned tables
> >
> > Those are mostly just Justin's patches, with more detailed comments
> and
> > updated commit message. I've considered moving the rel->inh check to
> > statext_clauselist_selectivity, and then removing the check from
> > dependencies and MCV. But I decided no to do that, because someone
> might
> > be calling those functions directly (even if that's very unlikely).
> >
> > The one thing bugging me a bit is that the regression test checks
> only a
> > GROUP BY query. It'd be nice to add queries testing MCV/dependencies
> > too, but that seems tricky because most queries will use
> per-partitions
> > stats.
> >
> >
> > 0003 - Add stxdinherit flag to pg_statistic_ext_data
> >
> > This is the patch for master, allowing to build stats for both
> inherits
> > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
> > how we deal with iterating both flags etc. I've adopted most of the
> > Justin's fixup patches, except that in plancat.c I've refactored how
> we
> > load the stats to process keys/expressions just once.
> >
> > It has the same issue with regression test using just a GROUP BY
> query,
> > but if we add a test to 0001/0002, that'll fix this too.
> >
> >
> > 0004 - Refactor parent ACL check
> >
> > Not sure about this - I doubt saving 30 rows in an 8kB file is really
> > worth it. Maybe it is, but then maybe we should try cleaning up the
> > other ACL checks in this file too? Seems mostly orthogonal to this
> > thread, though.
> >
> >
> > Hi,
> > For patch 3, in commit message:
> >
> > and there no clear winner. -> and there is no clear winner.
> >
> > and it seem wasteful -> and it seems wasteful
> >
> > The there may be -> There may be
> >
>
> Thanks, will fix.
>
> > +   /* skip statistics with mismatching stxdinherit value */
> > +   if (stat->inherit != rte->inh)
> >
> > Should a log be added for the above case ?
> >
>
> Why should we log this? It's an entirely expected case - there's a
> mismatch between inheritance for the relation and statistics, simply
> skipping it is the right thing to do.
>

Hi,
I agree that skipping should be fine (to avoid too much logging).

Thanks


> > +* Determine if we'redealing with inheritance tree.
> >
> > There should be a space between re and dealing.
> >
>
> Thanks, will fix.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Re: Re: Re: Re: Is it worth pushing conditions to sublink/subplan?

2021-12-12 Thread 曾文旌(义从)



 -- Original Message --
From: Zhihong Yu
Sent: Sun Dec 12 01:13:11 2021
To: 曾文旌(义从)
Cc: Tomas Vondra, wjzeng, PostgreSQL Hackers, shawn wang, ggys...@gmail.com
Subject: Re: Re: Re: Re: Is it worth pushing conditions to sublink/subplan?



On Sat, Dec 11, 2021 at 7:31 AM 曾文旌(义从)  wrote:




 -- Original Message --
From: Tomas Vondra
Sent: Wed Dec 8 11:26:35 2021
To: 曾文旌(义从), shawn wang, ggys...@gmail.com, PostgreSQL Hackers
Cc: wjzeng
Subject: Re: Re: Re: Is it worth pushing conditions to sublink/subplan?

Hi,

On 12/7/21 10:44, 曾文旌(义从) wrote:
> Hi Hackers
> 
> For my previous proposal, I developed a prototype and passed
> regression testing. It works similarly to subquery's qual pushdown.
> We know that sublink expands at the beginning of each level of
> query. At this stage, The query's conditions and equivalence classes
> are not processed. But after generate_base_implied_equalities the
> conditions are processed,  which is why qual can push down to 
> subquery but sublink not.
> 
> My POC implementation chose to delay the sublink expansion in the
> SELECT clause (targetList) and where clause. Specifically, it is
> delayed after generate_base_implied_equalities. Thus, the equivalent
> conditions already established in the Up level query can be easily
> obtained in the sublink expansion process (make_subplan).
> 
> For example, if the up level query has a.id = 10 and the sublink
> query has a.id = b.id, then we get b.id = 10 and push it down to the
> sublink query. If b is a partitioned table and is partitioned by id,
> then a large number of unrelated subpartitions are pruned out. This
> optimizes a significant amount of Planner and SQL execution time, 
> especially if the partitioned table has a large number of
> subpartitions and is what I want.
> 
> Currently, There were two SQL failures in the regression test,
> because the expansion order of sublink was changed, which did not
> affect the execution result of SQL.
> 
> Look forward to your suggestions on this proposal.
> 

I took a quick look, and while I don't see / can't think of any problems
with delaying it until after generating implied equalities, there seems
to be a number of gaps.

Thank you for your attention.

1) Are there any regression tests exercising this modified behavior?
Maybe there are, but if the only changes are due to change in order of
targetlist entries, that doesn't seem like a clear proof.

It'd be good to add a couple tests exercising both the positive and
negative case (i.e. when we can and can't pushdown a qual).

I added several samples to the regression tests (qual_pushdown_to_sublink.sql),
and I used a partitioned table to show the plan when a qual is pushed
down into a sublink.
Hopefully this will help you understand the details of this patch. Later, I
will add more cases.
2) apparently, contrib/postgres_fdw does crash like this:

  #3  0x0077b412 in adjust_appendrel_attrs_mutator
  (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470
  470  Assert(!IsA(node, SubLink));
  (gdb) p node
  $1 = (Node *) 0x13f7ea0
  (gdb) p *node
  $2 = {type = T_SubLink}

  Backtrace attached.

For the patch attached in the last email, I passed all the tests under
src/test/regress.
As you pointed out, there was a problem with the regression tests under
contrib (in contrib/postgres_fdw).
This time I fixed it, and the current patch (V2) passes the check-world.


3) various parts of the patch really need at least some comments, like:

  - try_push_outer_qual_to_sublink_query really needs some docs

  - new stuff at the end of initsplan.c

Ok, I added some comments and will add more. If you have questions about any 
details,
please point them out directly.

4) generate_base_implied_equalities

   shouldn't this

if (ec->ec_processed)
;

   really be?

if (ec->ec_processed)
continue;

You are right. I fixed it.

5) I'm not sure why we need the new ec_processed flag.

I did this to eliminate duplicate equalities from the two
generate_base_implied_equalities calls:
1) I need the base implied equalities generated by
generate_base_implied_equalities,
which are used to push quals down to sublinks (lazy_process_sublinks).
2) The expansion of a sublink may result in an equivalent expression with
parameters, such as a = $1,
which requires dealing with the equivalence classes again.
3) So I added ec_processed, which requests another pass of
generate_base_implied_equalities
after the equivalence classes have changed (add_eq_member/process_equivalence).

Maybe you have a better suggestion, please let me know.

6) So we now have lazy_process_sublink callback? Does that mean we
expand sublinks in two places - sometimes lazily, sometimes not?

Yes, not all sublinks are delayed. Let me explain this:
1) I added a GUC switch, enable_lazy_process_sublink. If it is turned off,
lazy sublink processing will not happen at all;
qual pushdown to sublinks depends on lazy sublink processing, which m

Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-12 Thread Bharath Rupireddy
On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera  wrote:
>
> On 2021-Dec-09, Bharath Rupireddy wrote:
>
> > On Thu, Dec 9, 2021 at 6:00 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2021-Dec-09, Bharath Rupireddy wrote:
> > >
> > > > I just want to call this out: an insertion of 10 million rows in the
> > > > primary generates a bunch of messages in the standby [1] within 40 sec
> > > > (size of the standby server log grows by 5K).
> > >
> > > Hmm, that does sound excessive to me in terms of log bloat increase.
> > > Remember the discussion about turning log_checkpoints on by default?
> >
> > The number of LOG messages generated when the log_checkpoints GUC is
> > set to on is quite small, hardly 4 messages per checkpoint (5 min). I
> > think enabling log_checkpoints is still acceptable, as most of the
> > hackers agreed in another thread [1] that its advantages outweigh the
> > cost, and it doesn't seem to bloat the server logs (at most 1152
> > messages in a day).
>
> My point is that it was argued, in that thread, that setting
> log_checkpoints to on by default would cause too much additional log
> volume.  That argument was shot down, but in this thread we're arguing
> that we want five kilobytes of messages in forty seconds.  That sounds
> much less acceptable to me, so making it default behavior or
> unconditional behavior is not going to fly very high.
>
> > I'm not sure if it is worth having a GUC log_recovery; if enabled, the
> > recovery messages can be emitted at LOG level, otherwise at DEBUG1 level.
>
> Maybe some tunable like
> log_wal_traffic = { none, medium, high }
> where "none" is current behavior of no noise, "medium" gets (say) once
> every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets
> you one message per segment.

Yeah, that can be an option.
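
A minimal sketch of how such a tunable could gate the volume; all names
here are hypothetical, just mirroring the none/medium/high description
above:

#include <stdbool.h>
#include <stdint.h>

/* hypothetical GUC values, per the description above */
typedef enum LogWalTraffic
{
	LOG_WAL_TRAFFIC_NONE,		/* current behavior: no messages */
	LOG_WAL_TRAFFIC_MEDIUM,		/* one message every 256 segments */
	LOG_WAL_TRAFFIC_HIGH		/* one message per segment */
} LogWalTraffic;

/* decide whether recovery should log that it switched to segment "segno" */
static bool
should_log_wal_segment(LogWalTraffic level, uint64_t segno)
{
	switch (level)
	{
		case LOG_WAL_TRAFFIC_HIGH:
			return true;
		case LOG_WAL_TRAFFIC_MEDIUM:
			/* e.g. when going from segment XFF to (X+1)00 */
			return (segno % 256) == 0;
		default:
			return false;
	}
}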

Another idea could be to use the infrastructure laid out by commit
9ce346e [1]. With ereport_startup_progress, we can emit LOG messages
(naming the WAL file currently being recovered) every
log_startup_progress_interval seconds/milliseconds.

One problem is that ereport_startup_progress doesn't work in
StandbyMode; maybe we can remove this restriction, unless we have a
major reason for not allowing it on the standby.
/* Prepare to report progress of the redo phase. */
if (!StandbyMode)
begin_startup_progress_phase();
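
If that restriction were lifted, the redo loop could report progress on
a standby too; a rough sketch (paraphrased, with a message format
modeled on what commit 9ce346e does elsewhere, so not a verbatim diff):

/* Prepare to report progress of the redo phase, even in standby mode. */
begin_startup_progress_phase();

/* ... and later, inside the main redo apply loop, something like: */
ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, current LSN: %X/%X",
						 LSN_FORMAT_ARGS(xlogreader->ReadRecPtr));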

Thoughts?

[1]
commit 9ce346eabf350a130bba46be3f8c50ba28506969
Author: Robert Haas 
Date:   Mon Oct 25 11:51:57 2021 -0400

Report progress of startup operations that take a long time.

Discussion:
https://postgr.es/m/ca+tgmoahqrgdfobwgy16xcomtxxsrvgfb2jncvb7-ubuee1...@mail.gmail.com
Discussion:
https://postgr.es/m/camm1awahf7ve69572_olq+mgpt5ruiudgf1x5rrtkjbldpr...@mail.gmail.com

Regards,
Bharath Rupireddy.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-12 Thread Dilip Kumar
On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma  wrote:
>>
>> Logfile Snippet:
>> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file 
>> "base/116398/116400": No such file or directory
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) 
>> was terminated by signal 6: Aborted
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active 
>> server processes
>
>
> This is different from the issue you raised earlier. As Dilip said, we need 
> to unregister sync requests for files that got successfully copied to the 
> target database, but the overall alter database statement failed. We are 
> doing this when the database is created successfully, but not when it fails.
> Probably doing the same inside the cleanup function movedb_failure_callback() 
> should fix the problem.

Correct, I have done this cleanup.  Apart from this, we now drop the
fsync requests in the CREATE DATABASE failure case as well, and also
drop the buffers in the error paths of both createdb and movedb.  I have
also fixed the other issue for which you gave the patch (a bit
differently): basically, in the movedb case the source and destination
dboid are the same, so we don't need an additional parameter, and I also
readjusted the conditions to avoid nested ifs.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From eb27d159bc2aa41011b6352a514c7981a19a64dc Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v8 1/7] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 ++--
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..bb39632 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 			 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 			  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+			 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+	   bool write_wal, bool send_sinval,
+	   bool preserve_files, Oid dbid, Oid tsid,
+	   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 			  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
- RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
- DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath t