Re: Fix gin index cost estimation

2022-09-19 Thread Ronan Dunklau
Le vendredi 16 septembre 2022, 22:04:59 CEST Tom Lane a écrit :
> Ronan Dunklau  writes:
> > The attached does that and is much simpler. I only took into account
> > entryPagesFetched, not sure if we should also charge something for data
> > pages.
> 
> I think this is wrong, because there is already a CPU charge based on
> the number of tuples visited, down at the very end of the routine:
> 
>   *indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
> 
> It's certainly possible to argue that that's incorrectly computed,
> but just adding another cost charge for the same topic can't be right.

I don't think it's the same thing. The entryPagesFetched is computed 
independently of the selectivity and the number of tuples. As such, I think it 
makes sense to use it to compute the cost of descending the entry tree.

As mentioned earlier, I don't really understand the formula for computing 
entryPagesFetched. If we were to estimate the tree height to compute the 
descent cost as I first proposed, I feel like we would use two different 
metrics 
for what is essentially the same cost: something proportional to the size of 
the entry tree.

> 
> I do suspect that that calculation is bogus, because it looks like it's
> based on the concept of "apply the quals at each index entry", which we
> know is not how GIN operates.  So maybe we should drop that bit in favor
> of a per-page-ish cost like you have here.  Not sure.  In any case it
> seems orthogonal to the question of startup/descent costs.  Maybe we'd
> better tackle just one thing at a time.

Hum, good point. Maybe that should be revisited too.

> 
> (BTW, given that that charge does exist and is not affected by
> repeated-scan amortization, why do we have a problem in the first place?
> Is it just too small?  I guess that when we're only expecting one tuple
> to be retrieved, it would only add about cpu_index_tuple_cost.)

Because with a very low selectivity, we end up under-charging for the cost of 
walking the entry tree by a significant amount. As said above, I don't see how 
those two things are the same: that charge estimates the cost of applying 
index quals to the visited tuples, which is not the same as charging per entry 
page visited.

> 
> > Is the power(0.15) used an approximation for a log ? If so why ? Also
> > shouldn't we round that up ?
> 
> No idea, but I'm pretty hesitant to just randomly fool with that equation
> when (a) neither of us know where it came from and (b) exactly no evidence
> has been provided that it's wrong.
> 
> I note for instance that the existing logic switches from charging 1 page
> per search to 2 pages per search at numEntryPages = 15 (1.5 ^ (1/0.15)).
> Your version would switch at 2 pages, as soon as the pow() result is even
> fractionally above 1.0.  Maybe the existing logic is too optimistic there,
> but ceil() makes it too pessimistic I think.  I'd sooner tweak the power
> constant than change from rint() to ceil().

You're right, I was too eager to try to raise the CPU cost proportionally to 
the number of array scans (scalararrayop). I'd really like to understand where 
this equation comes from though... 
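
(A purely standalone illustration, not PostgreSQL code: the snippet below compares rint() and ceil() applied to pow(numEntryPages, 0.15). The crossover points match the ones discussed above: rint() first charges 2 entry pages per search around numEntryPages = 15, while ceil() already does so at 2 pages.)

```c
#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* numEntryPages values around the two crossover points mentioned above */
	double		pages[] = {1, 2, 3, 14, 15, 16, 100};
	int			npages = (int) (sizeof(pages) / sizeof(pages[0]));

	for (int i = 0; i < npages; i++)
	{
		double		p = pow(pages[i], 0.15);

		/*
		 * rint(): existing behavior, crosses from 1 to 2 near
		 * numEntryPages = 15 (1.5 ^ (1/0.15)).
		 * ceil(): proposed behavior, crosses to 2 as soon as the pow()
		 * result is even fractionally above 1.0, i.e. already at 2 pages.
		 */
		printf("numEntryPages=%6.0f  pow=%.3f  rint=%.0f  ceil=%.0f\n",
			   pages[i], p, rint(p), ceil(p));
	}
	return 0;
}
```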

Best regards,

-- 
Ronan Dunklau






Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Thomas Munro
On Mon, Sep 19, 2022 at 4:53 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Mon, Sep 19, 2022 at 3:39 PM Tom Lane  wrote:
> >> I also note that our existing code in this area would break pretty
> >> thoroughly on such a machine, so this isn't making it worse.
>
> > Yeah, I don't expect it to be a practical problem on any real system
> > (that is, I don't expect any real calling convention to transfer a
> > struct T * argument in a different place than void *).  I just wanted
> > to mention that it's a new liberty.
>
> No, it's not, because the existing coding here is already assuming that.
> The walker callbacks are generally declared as taking a "struct *"
> second parameter, but expression_tree_walker et al think they are
> passing a "void *" to them.  Even if a platform ABI had some weird
> special rule about how to call functions that you don't know the
> argument list for, it wouldn't fix this because the walkers sure do know
> what their arguments are.  The only reason this code works today is that
> in practice, "void *" *is* ABI-compatible with "struct *".

True.
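
(A minimal standalone sketch of the mismatch under discussion, not the actual PostgreSQL declarations: a callback written with a typed context parameter, invoked through a function pointer whose context parameter is void *.)

```c
#include <stdbool.h>
#include <stdio.h>

struct my_context
{
	int			count;
};

/* The way walker callbacks are usually written: a typed second argument. */
static bool
count_walker(int *node, struct my_context *context)
{
	context->count++;
	return false;
}

/* The walker machinery, however, calls them through a void * signature. */
typedef bool (*walker_fn) (int *node, void *context);

int
main(void)
{
	struct my_context ctx = {0};
	int			node = 42;

	/*
	 * Strictly speaking, calling through an incompatible function pointer
	 * type is undefined behavior; it only works because "struct my_context *"
	 * and "void *" share a representation and calling convention on the
	 * platforms we care about, which is exactly the assumption spelled out
	 * in the quoted message.
	 */
	walker_fn	walk = (walker_fn) count_walker;

	walk(&node, &ctx);
	printf("count = %d\n", ctx.count);
	return 0;
}
```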

> I'm not excited about creating a demonstrable opportunity for bugs
> in order to make the code hypothetically more compatible with
> hardware designs that are thirty years obsolete.  (Hypothetical
> in the sense that there's little reason to believe there would
> be no other problems.)

Fair enough.




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-19 Thread Drouvot, Bertrand

Hi,

On 9/17/22 8:53 AM, Michael Paquier wrote:

On Fri, Sep 16, 2022 at 06:24:07PM +0200, Drouvot, Bertrand wrote:

The CF bot is failing for Windows (all other tests are green) and only for
the new tap test related to the regular expression on the host name (the
ones on database and role are fine).

The issue is not related to the patch. The issue is that the Windows Cirrus
test does not like when a host name is provided for a "host" entry in
pg_hba.conf (while it works fine when a CIDR is provided).

You can see an example in [1] where the only change is to replace the CIDR
by "localhost" in 002_scram.pl. As you can see the Cirrus tests are failing
on Windows only (its log file is here [2]).

I'll look at this "Windows" related issue but would appreciate any
guidance/help if someone has experience in this area on windows.


I recall that being able to do a reverse lookup of a hostname on
Windows for localhost requires a few extra setup steps as that's not
guaranteed to be set in all environments by default, which is why we
go at great length to use 127.0.0.1 in the TAP test setup for example
(see Cluster.pm).  Looking at your patch, the goal is to test the
mapping of regular expression for host names, user names and database
names.  If the first case is not guaranteed, my guess is that it is
fine to skip this portion of the tests on Windows.


Thanks for looking at it!

That sounds reasonable, v3 attached is skipping the regular expression 
tests for the hostname on Windows.





While reading the patch, I am a bit confused about token_regcomp() and
token_regexec().  It would help the review a lot if these were
documented with proper comments, even if these act roughly as wrappers
for pg_regexec() and pg_regcomp().


Fully agree, comments were missing. They've been added in v3 attached.
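
For readers who have not opened the patch, here is a hypothetical example of the kind of entry the new syntax allows (illustrative only, not taken from the patch; the leading slash marks a regular expression, as in the documentation diff below):

```
# TYPE  DATABASE        USER             ADDRESS                 METHOD
host    /^db_[0-9]+$    /^.*helpdesk$    192.168.0.0/16          scram-sha-256
host    all             all              /^.*\.example\.com$     scram-sha-256
```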

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index c6f1b70fd3..28343445be 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -235,8 +235,9 @@ hostnogssenc  database  
userPostgreSQL database.
-   Multiple database names can be supplied by separating them with
-   commas.  A separate file containing database names can be specified by
+   Multiple database names and/or regular expressions preceded by 
/
+   can be supplied by separating them with commas.
+   A separate file containing database names can be specified by
preceding the file name with @.
   
  
@@ -249,7 +250,8 @@ hostnogssenc  database  
userall specifies that it
matches all users.  Otherwise, this is either the name of a specific
-   database user, or a group name preceded by +.
+   database user, a regular expression preceded by /
+   or a group name preceded by +.
(Recall that there is no real distinction between users and groups
in PostgreSQL; a + mark 
really means
match any of the roles that are directly or indirectly members
@@ -258,7 +260,8 @@ hostnogssenc  database  
user/
+   can be supplied by separating them with commas.
A separate file containing user names can be specified by preceding the
file name with @.
   
@@ -270,8 +273,9 @@ hostnogssenc  database  
user
   
Specifies the client machine address(es) that this record
-   matches.  This field can contain either a host name, an IP
-   address range, or one of the special key words mentioned below.
+   matches.  This field can contain either a host name, a regular 
expression
+   preceded by / representing host names, an IP address 
range,
+   or one of the special key words mentioned below.
   
 
   
@@ -785,16 +789,18 @@ hostall all 192.168.12.10/32  
  gss
 # TYPE  DATABASEUSERADDRESS METHOD
 hostall all 192.168.0.0/16  ident 
map=omicron
 
-# If these are the only three lines for local connections, they will
+# If these are the only four lines for local connections, they will
 # allow local users to connect only to their own databases (databases
-# with the same name as their database user name) except for administrators
-# and members of role "support", who can connect to all databases.  The file
-# $PGDATA/admins contains a list of names of administrators.  Passwords
+# with the same name as their database user name) except for administrators,
+# users finishing with "helpdesk" and members of role "support",
+# who can connect to all databases.
+# The file$PGDATA/admins contains a list of names of administrators.  Passwords
 # are required in all cases.
 #
 # TYPE  DATABASEUSERADDRESS METHOD
 local   sameuserall md5
 local   all @admins

Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Thomas Munro
On Mon, Sep 19, 2022 at 10:16 AM Thomas Munro  wrote:
> On Mon, Sep 19, 2022 at 8:57 AM Tom Lane  wrote:
> > BTW, I was distressed to discover that someone decided they could
> > use ExecShutdownNode as a planstate_tree_walker() walker even though
> > its argument list is not even the right length.  I'm a bit flabbergasted
> > that we seem to have gotten away with that so far, because I'd have
> > thought for sure that it'd break some platform's convention for which
> > argument gets passed where.  I think we need to fix that, independently
> > of what we do about the larger scope of these problems.  To avoid an
> > API break, I propose making ExecShutdownNode just be a one-liner that
> > calls an internal ExecShutdownNode_walker() function.  (I've not done
> > it that way in the attached, though.)
>
> Huh... wouldn't systems that pass arguments right-to-left on the stack
> receive NULL for node?  That'd include the SysV i386 convention used
> on Linux, *BSD etc.  But that can't be right or we'd know about it...

I take that back after looking up some long forgotten details; it
happily ignores extra arguments.




Code clean for pre-9.0 binary upgrades in HeapTupleSatisfiesXXX()

2022-09-19 Thread Japin Li


Hi hackers,

While reading heapam_visibility.c, I found that the HeapTupleSatisfiesXXX
functions contain similar code for pre-9.0 binary upgrades.

HeapTupleSatisfiesSelf, HeapTupleSatisfiesToast and HeapTupleSatisfiesDirty
have the same code for pre-9.0 binary upgrades.  HeapTupleSatisfiesUpdate
has similar code, except it returns TM_Result rather than a boolean.

HeapTupleSatisfiesMVCC also has code similar to HeapTupleSatisfiesSelf,
except it uses XidInMVCCSnapshot rather than TransactionIdIsInProgress.

The most different one is HeapTupleSatisfiesVacuumHorizon.

Could we encapsulate the code for pre-9.0 binary upgrades?  This idea comes 
from [1].

Any thoughts?


[1] 
https://www.postgresql.org/message-id/8a855f33-2581-66bf-85f7-0b99239edbda%40postgrespro.ru

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Sharma
On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
 wrote:
>
> On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  wrote:
> >
> > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma  
> > > wrote:
> Can you please point to the documentation.
>

AFAIU there is just one documentation. Here is the link for it:

https://www.postgresql.org/docs/current/view-pg-replication-slots.html

> It's true that it needs to be clarified. But what you are saying may
> not be entirely true in case of streamed transaction. In that case we
> might send logically decoded changes of an ongoing transaction as
> well. They may even get applied but not necessarily committed. It's a
> bit complicated. :)
>

This can happen in case of big transactions. That's the reason I
mentioned that the transaction has a small set of data which is not
yet committed but the confirmed_flush_lsn says it has already reached
the logical subscriber.

And.. lastly sorry for the delayed response. I was not well and
couldn't access email for quite some days. The poor dengue had almost
killed me :(

--
With Regards,
Ashutosh Sharma.




Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Himanshu Upadhyaya
On Wed, Sep 7, 2022 at 2:49 AM Robert Haas  wrote:

>
> But here's one random idea: add a successor[] array and an lp_valid[]
> array. In the first loop, set lp_valid[offset] = true if it passes the
> check_lp() checks, and set successor[A] = B if A redirects to B or has
> a CTID link to B, without matching xmin/xmax. Then, in a second loop,
> iterate over the successor[] array. If successor[A] = B && lp_valid[A]
> && lp_valid[B], then check whether A.xmax = B.xmin; if so, then
> complain if predecessor[B] is already set, else set predecessor[B] =
> A. Then, in the third loop, iterate over the predecessor array just as
> you're doing now. Then it's clear that we do the lp_valid checks
> exactly once for every offset that might need them, and in order. And
> it's also clear that the predecessor-based checks can never happen
> unless the lp_valid checks passed for both of the offsets involved.
>
>
ok, I have introduced a new approach to first construct a successor array
and then loop over the successor array to construct a predecessor array.

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
From 4a32d7cbdec0d090e99a9a58cb4deb5cb4c03aef Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya 
Date: Mon, 19 Sep 2022 13:50:47 +0530
Subject: [PATCH v3] HOT chain validation in verify_heapam

---
 contrib/amcheck/verify_heapam.c | 209 
 1 file changed, 209 insertions(+)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index c875f3e5a2..3da4c09665 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber] = {0};
+		OffsetNumber successor[MaxOffsetNumber] = {0};
+		bool		lp_valid[MaxOffsetNumber] = {false};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +474,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	report_corruption(&ctx,
 	  psprintf("line pointer redirection to unused item at offset %u",
 			   (unsigned) rdoffnum));
+
+/*
+ * make entry in successor array, redirected tuple will be
+ * validated at the time when we loop over successor array
+ */
+successor[ctx.offnum] = rdoffnum;
 continue;
 			}
 
@@ -504,9 +515,200 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple(&ctx);
+
+			/*
+			 * Now add data to successor array, it will later be used to
+			 * generate the predecessor array. We need to access the tuple's
+			 * header to populate the predecessor array but tuple is not
+			 * necessarily sanity checked yet so delaying construction of
+			 * predecessor array until all tuples are sanity checked.
+			 */
+			nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
+			if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno &&
+nextoffnum != ctx.offnum)
+			{
+successor[ctx.offnum] = nextoffnum;
+			}
+		}
+
+		/*
+		 * Loop over offset and populate predecessor array from all entries
+		 * that are present in successor array.
+		 */
+		ctx.attnum = -1;
+		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
+			 ctx.offnum = OffsetNumberNext(ctx.offnum))
+		{
+			ItemId		curr_lp;
+			ItemId		next_lp;
+			HeapTupleHeader curr_htup;
+			HeapTupleHeader next_htup;
+			TransactionId curr_xmax;
+			TransactionId next_xmin;
+			OffsetNumber nextoffnum = successor[ctx.offnum];
+
+			curr_lp = PageGetItemId(ctx.page, ctx.offnum);
+			if (nextoffnum == 0)
+			{
+/*
+ * Either last updated tuple in chain or corruption raised for
+ * this tuple.
+ */
+continue;
+			}
+			if (ItemIdIsRedirected(curr_lp) && lp_valid[nextoffnum])
+			{
+next_lp = PageGetItemId(ctx.page, nextoffnum);
+next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);;
+if (!HeapTupleHeaderIsHeapOnly(next_htup))
+{
+	report_corruption(&ctx,
+	  psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
+			   (unsigned) nextoffnum));
+}
+if ((next_htup->t_infomask & HEAP_UPDATED) == 0)
+{
+	report_corruption(&ctx,
+	  psprintf("redirected tuple at line pointer offset %u is not heap updated tuple",
+			   (unsigned) nextoffnum));
+}
+continue;
+			}
+
+			/*
+			 * Add line pointer offset to predecessor array. 1

Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Himanshu Upadhyaya
On Wed, Sep 7, 2022 at 12:11 AM Robert Haas  wrote:

>
> I think the check should be written like this:
>
> !TransactionIdEquals(pred_xmin, curr_xmin) &&
> !TransactionIdDidCommit(pred_xmin)
>
> The TransactionIdEquals check should be done first for the reason you
> state: it's cheaper.
>
> I think that we shouldn't be using TransactionIdDidAbort() at all,
> because it can sometimes return false even when the transaction
> actually did abort. See test_lockmode_for_conflict() and
> TransactionIdIsInProgress() for examples of logic that copes with
> this. What's happening here is that TransactionIdDidAbort doesn't ever
> get called if the system crashes while a transaction is running. So we
> can use TransactionIdDidAbort() only as an optimization: if it returns
> true, the transaction is definitely aborted, but if it returns false,
> we have to check whether it's still running. If not, it aborted
> anyway.
>
> TransactionIdDidCommit() does not have the same issue. A transaction
> can abort without updating CLOG if the system crashes, but it can
> never commit without updating CLOG. If the transaction didn't commit,
> then it is either aborted or still in progress (and we don't care
> which, because neither is an error here).
>
> As to whether the existing formulation of the test has an error
> condition, you're generally right that we should test
> TransactionIdIsInProgress() before TransactionIdDidCommit/Abort,
> because during commit or abort, we first set the status in CLOG -
> which is queried by TransactionIdDidCommit/Abort - and only afterward
> update the procarray - which is queried by TransactionIdIsInProgress.
> So normally TransactionIdIsInProgress should be checked first, and
> TransactionIdDidCommit/Abort should only be checked if it returns
> false, at which point we know that the return values of the latter
> calls can't ever change. Possibly there is an argument for including
> the TransactionIdInProgress check here too:
>
> !TransactionIdEquals(pred_xmin, curr_xmin) &&
> (TransactionIdIsInProgress(pred_xmin) ||
> !TransactionIdDidCommit(pred_xmin))
>
> ...but I don't think it could change the answer. Most places that
> check TransactionIdIsInProgress() first are concerned with MVCC
> semantics, and here we are not. I think the only effects of including
> or excluding the TransactionIdIsInProgress() test are (1) performance,
> in that searching the procarray might avoid expense if it's cheaper
> than searching clog, or add expense if the reverse is true and (2)
> slightly changing the time at which we're first able to detect this
> form of corruption. I am inclined to prefer the simpler form of the
> test without TransactionIdIsInProgress() unless someone can say why we
> shouldn't go that route.
>
Done, updated in the v3 patch.


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: RFC: Logging plan of the running query

2022-09-19 Thread Алена Рыбакина
Hi,

I'm sorry if this message duplicates the previous one, but I'm not sure that the previous message was sent correctly. I sent it from the email address a.rybak...@postgrespro.ru and I couldn't send this one from that address.

I like the idea of creating a patch for logging the query plan. After reviewing this code I noticed some points and would like to ask you some questions.

Firstly, I suggest some editing of the commit message. I think it turns out more laconic and just as clear. I wrote it below since I can't think of any other way to add it.

```
Currently, we have to wait for the query execution to finish to check its plan.
This is not so convenient when investigating long-running queries on production
environments where we cannot use debuggers.

To improve this situation, the proposed patch adds a pg_log_query_plan()
function which requests logging the plan of the specified backend process.

By default, only superusers are allowed to request logging of the plan;
otherwise, allowing any user to issue this request could cause lots of log
messages and could lead to denial of service.

At the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at
LOG_SERVER_ONLY level, so the plan will appear in the server log only and
is not sent to the client.
```

Secondly, I have a question about deleting USE_ASSERT_CHECKING in lock.h. It is supposed to be checked elsewhere in the code by matching values. I am worried about missing errors, due to not testing with asserts enabled, in the places where GetLockMethodLocalHash participates, and that we won't be able to get a core file in segfault cases. I might not understand something; if so, can you please explain it to me?

Thirdly, I don't understand why save_ActiveQueryDesc is declared in pquery.h. save_ActiveQueryDesc seems to be used only once, in the ExecutorRun function, so its declaration there looks superfluous. I addressed this in the attached patch.

Fourthly, it seems to me there are not enough explanatory comments in the code. I also added them in the attached patch.

Lastly, I don't fully understand the signal handling, since this signal has not been used before. Could another signal being disabled prevent this signal from logging the query plan? I noticed that this signal is checked last in the procsignal_sigusr1_handler function.

-- 
Regards,
Alena Rybakina
Postgres Professional

19.09.2022, 11:01, "torikoshia" :

On 2022-05-16 17:02, torikoshia wrote:
 On 2022-03-09 19:04, torikoshia wrote:
 On 2022-02-08 01:13, Fujii Masao wrote:
 AbortSubTransaction() should reset ActiveQueryDesc to save_ActiveQueryDesc
 that ExecutorRun() set, instead of NULL? Otherwise ActiveQueryDesc of
 top-level statement will be unavailable after subtransaction is aborted
 in the nested statements.

 I once agreed above suggestion and made v20 patch making
 save_ActiveQueryDesc a global variable, but it caused segfault when
 calling pg_log_query_plan() after FreeQueryDesc().

 OTOH, doing some kind of reset of ActiveQueryDesc seems necessary since
 it also caused segfault when running pg_log_query_plan() during
 installcheck.

 There may be a better way, but resetting ActiveQueryDesc to NULL seems
 safe and simple. Of course it makes pg_log_query_plan() useless after a
 subtransaction is aborted. However, if it does not often happen that
 people want to know the running query's plan whose subtransaction is
 aborted, resetting ActiveQueryDesc to NULL would be acceptable.

 Attached is a patch that sets ActiveQueryDesc to NULL when a
 subtransaction is aborted.

 How do you think?

Attached new patch to fix patch apply failures again.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

Re: Add common function ReplicationOriginName.

2022-09-19 Thread Aleksander Alekseev
Hi Peter,

> PSA a patch to add a common function ReplicationOriginName

The patch looks good to me.

One nitpick I have is that the 2nd argument of snprintf is size_t
while we are passing ints. Your patch is consistent with the current
implementation of ReplicationOriginNameForTablesync() and similar
functions in tablesync.c. However, I would like to mention this in case
the committer is interested in replacing the ints with Size while at it.


-- 
Best regards,
Aleksander Alekseev




Re: Patch to address creation of PgStat* contexts with null parent context

2022-09-19 Thread Drouvot, Bertrand

Hi,

On 9/17/22 6:10 PM, Andres Freund wrote:

Hi,

On 2022-09-07 11:11:11 +0200, Drouvot, Bertrand wrote:

On 9/6/22 7:53 AM, Kyotaro Horiguchi wrote:

At Mon, 5 Sep 2022 14:46:55 +0200, "Drouvot, Bertrand"  
wrote in

Looks like that both approaches have their pros and cons. I'm tempted
to vote +1 on "just changing" the parent context to TopMemoryContext
and not changing the allocations locations.

Yeah. It is safe more than anything and we don't have a problem there.

So, I'm fine with just replacing the parent context at the three places.


Attached a patch proposal to do so.


Pushed. Thanks for the report and the fix!


Thanks! I just marked the corresponding CF entry as Committed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-19 Thread Fujii Masao




On 2022/09/18 23:09, Bharath Rupireddy wrote:

On Sun, Sep 18, 2022 at 1:23 PM Bharath Rupireddy
 wrote:

cfbot fails [1] with v6 patch. I made a silly mistake by not checking
the output of "make check-world -j  16" fully, I just saw the end
message "All tests successful." before posting the v6 patch.

The failure is due to perform_base_backup() accessing BackupState's
tablespace_map without a null check, so I fixed it.

Sorry for the noise. Please review the attached v7 patch further.


Thanks for updating the patch!

=# SELECT * FROM pg_backup_start('test', true);
=# SELECT * FROM pg_backup_stop();
LOG:  server process (PID 15651) was terminated by signal 11: Segmentation 
fault: 11
DETAIL:  Failed process was running: SELECT * FROM pg_backup_stop();

With v7 patch, pg_backup_stop() caused the segmentation fault.


=# SELECT * FROM pg_backup_start(repeat('test', 1024));
ERROR:  backup label too long (max 1024 bytes)
STATEMENT:  SELECT * FROM pg_backup_start(repeat('test', 1024));

=# SELECT * FROM pg_backup_start(repeat('testtest', 1024));
LOG:  server process (PID 15844) was terminated by signal 11: Segmentation 
fault: 11
DETAIL:  Failed process was running: SELECT * FROM 
pg_backup_start(repeat('testtest', 1024));

When I specified longer backup label in the second run of pg_backup_start()
after the first run failed, it caused the segmentation fault.


+   state = (BackupState *) palloc0(sizeof(BackupState));
+   state->name = pstrdup(name);

pg_backup_start() calls allocate_backup_state() and allocates the memory for
the input backup label before checking its length in do_pg_backup_start().
This can cause too much memory to be allocated for the backup label
unnecessarily. I think that the maximum length of BackupState.name should
be MAXPGPATH (i.e., the maximum allowed length for a backup label).



The definition of SessionBackupState enum type also should be in xlogbackup.h?


Correct. Basically, all the backup related code from xlog.c,
xlogfuncs.c and elsewhere can go to xlogbackup.c/.h. I will focus on
that refactoring patch once this gets in.


Understood.



Yeah, but they have to be carried from do_pg_backup_stop() to
pg_backup_stop() or callers and also instead of keeping tablespace_map
in BackupState and others elsewhere don't seem to be a good idea to
me. IMO, BackupState is a good place to contain all the information
that's carried across various functions.


In v7 patch, since pg_backup_stop() calls build_backup_content(),
backup_label and history_file seem not to be carried from do_pg_backup_stop()
to pg_backup_stop(). This makes me still think that it's better not to include
them in BackupState...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [RFC] building postgres with meson - v13

2022-09-19 Thread Peter Eisentraut

On 19.09.22 02:29, Andres Freund wrote:

Hi,

On September 18, 2022 5:24:06 PM PDT, Peter Eisentraut 
 wrote:

On 15.09.22 04:26, Andres Freund wrote:

Attached is v13 of the meson patchset. The biggest changes are:


Did something about warning flags change from the previous patch set?  I see 
it's building with -Wextra now, which combined with -Werror causes the build to 
fail for me.  I have never encountered that with any of the previous patch sets.


In older versions of the patch the default warning level was set to include 
Wextra, and I had added my local flags to suppress uninteresting warnings. 
Comparing the warning flags I reduced the warning level and removed the 
suppressing flags - but changing default options only affects new build trees. 
To change existing ones do meson configure -Dwarning_level=1


Ok that was the reason.  It works now.

IMO, the following commits are ready to be pushed now:

b7d7fe009731 Remove DLLTOOL, DLLWRAP from configure / Makefile.global.in
979f26889544 Don't hardcode tmp_check/ as test directory for tap tests
9fc657fbb7e2 Split TESTDIR into TESTLOGDIR and TESTDATADIR
6de8f1de0ffa meson: prereq: Extend gendef.pl in preparation for meson
7054861f0fef meson: prereq: Add src/tools/gen_export.pl
1aa586f2921c meson: prereq: Refactor PG_TEST_EXTRA logic in autoconf build
5a9731dcc2e6 meson: prereq: port: Include c.h instead of postgres.h in 
*p{read,write}*.c
1939bdcfbfea meson: Add meson based buildsystem





Re: remove more archiving overhead

2022-09-19 Thread Peter Eisentraut

On 18.09.22 09:13, Noah Misch wrote:

This documentation change only covers archive_library.  How are users of
archive_command supposed to handle this?


I believe users of archive_command need to do something similar to what is
described here.  However, it might be more reasonable to expect
archive_command users to simply return false when there is a pre-existing
file, as the deleted text notes.  IIRC that is why I added that sentence
originally.


What makes the answer for archive_command diverge from the answer for
archive_library?


I suspect what we are really trying to say here is

===
Archiving setups (using either archive_command or archive_library) 
should be prepared for the rare case that an identical archive file is 
being archived a second time.  In such a case, they should compare that 
the source and the target file are identical and proceed without error 
if so.


In some cases, it is difficult or impossible to configure 
archive_command or archive_library to do this.  In such cases, the 
archiving command or library should error like in the case for any 
pre-existing target file, and operators need to be prepared to resolve 
such cases manually.

===

Is that correct?
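
(Not part of the proposed wording, just a sketch of what such an archive_command could look like, assuming a POSIX shell, a local archive directory, and that cmp(1) is available: it succeeds when the target file is absent or byte-identical to the source, and fails when a different file already exists under that name.)

```
archive_command = 'cmp -s %p /mnt/server/archivedir/%f || (test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f)'
```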




Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli
Hi,


In preprocess_aggref(), list same_input_transnos is used to track compatible 
transnos.

Free it if we don’t need it anymore.

```

/*
 * 2. See if this aggregate can share transition state with another
 * aggregate that we've initialized already.
 */
 transno = find_compatible_trans(root, aggref, shareable,
 aggtransfn, aggtranstype,
 transtypeLen, transtypeByVal,
 aggcombinefn,
 aggserialfn, aggdeserialfn,
 initValue, initValueIsNull,
 same_input_transnos);
 list_free(same_input_transnos);

```

Not sure if it is worth it, as the list will be freed sooner or later when the
current memory context ends.

But as in find_compatible_agg(), the list is freed once we find a compatible Agg.

This patch helps a little when there are lots of incompatible aggs, because we
will try to find the compatible transnos again and again.

Each iteration would otherwise leave an unused list in memory.

Regards,
Zhang Mingli


vn-0001-free-list-same_input_transnos-in-preprocess_aggref.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2022-09-19 Thread a.rybakina

Hi,

I'm sorry if this message duplicates the previous one, but the 
previous message was sent incorrectly. I sent it from the email address 
lena.riback...@yandex.ru.


I liked this idea, and after reviewing the code I noticed some points and 
would like to ask you some questions.


Firstly, I suggest some editing of the commit message. I think it turns 
out more laconic and just as clear. I wrote it below since I 
can't think of any other way to add it.


```
Currently, we have to wait for the query execution to finish to check 
its plan.
This is not so convenient when investigating long-running queries on 
production environments where we cannot use debuggers.

To improve this situation, the proposed patch adds a 
pg_log_query_plan() function which requests logging the plan of the 
specified backend process.

By default, only superusers are allowed to request logging of the plan; 
otherwise, allowing any user to issue this request could cause lots of 
log messages and could lead to denial of service.

At the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at 
LOG_SERVER_ONLY level, so the plan will appear in the server log only 
and is not sent to the client.
```

Secondly, I have a question about deleting USE_ASSERT_CHECKING in lock.h.
It is supposed to be checked elsewhere in the code by 
matching values. I am worried about missing errors, due to not testing 
with asserts enabled, in the places where GetLockMethodLocalHash 
participates, and that we won't be able to get a core file in segfault 
cases. I might not understand something; if so, can you please explain it 
to me?


Thirdly, I don't understand why save_ActiveQueryDesc is declared in 
pquery.h. save_ActiveQueryDesc seems to be used only once, in the 
ExecutorRun function, so its declaration there looks superfluous. I 
addressed this in the attached patch.


Fourthly, it seems to me there are not enough explanatory comments in 
the code. I also added them in the attached patch.


Lastly, I don't fully understand the signal handling, since this signal 
has not been used before. Could another signal being disabled prevent 
this signal from logging the query plan? I noticed that this signal is 
checked last in the procsignal_sigusr1_handler function.


Regards,

--
Alena Rybakina
Postgres Professional

19.09.2022, 11:01, "torikoshia" :

On 2022-05-16 17:02, torikoshia wrote:

 On 2022-03-09 19:04, torikoshia wrote:

 On 2022-02-08 01:13, Fujii Masao wrote:

 AbortSubTransaction() should reset ActiveQueryDesc to
 save_ActiveQueryDesc that ExecutorRun() set, instead
of NULL?
 Otherwise ActiveQueryDesc of top-level statement will
be unavailable
 after subtransaction is aborted in the nested statements.


 I once agreed above suggestion and made v20 patch making
 save_ActiveQueryDesc a global variable, but it caused
segfault when
 calling pg_log_query_plan() after FreeQueryDesc().

 OTOH, doing some kind of reset of ActiveQueryDesc seems
necessary
 since it also caused segfault when running
pg_log_query_plan() during
 installcheck.

 There may be a better way, but resetting ActiveQueryDesc
to NULL seems
 safe and simple.
 Of course it makes pg_log_query_plan() useless after a
subtransaction
 is aborted.
 However, if it does not often happen that people want to
know the
 running query's plan whose subtransaction is aborted,
resetting
 ActiveQueryDesc to NULL would be acceptable.

 Attached is a patch that sets ActiveQueryDesc to NULL when a
 subtransaction is aborted.

 How do you think?

Attached new patch to fix patch apply failures again.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a82ac87457e..65b692b0ddf 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -306,6 +306,12 @@ ExecutorRun(QueryDesc *queryDesc,
 {
 	QueryDesc *save_ActiveQueryDesc;
 
+	/*
+	 * Save the current value of ActiveQueryDesc before calling the
+	 * ExecutorRun_hook function, because it may be reset by
+	 * AbortSubTransaction.
+	 */
+
 	save_ActiveQueryDesc = ActiveQueryDesc;
 	ActiveQueryDesc = queryDesc;
 
@@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc,
 	else
 		standard_ExecutorRun(queryDesc, direction, count, execute_once);
 
+	/* Restore the previous value of ActiveQueryDesc */
 	ActiveQueryDesc = save_ActiveQueryDesc;
 }
 
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index fc9f9f8e3f0..8e7ce3c976f 100644
--- a/src/include/commands/explain.h
+++ b/src/include/command

Re: Fix typos in code comments

2022-09-19 Thread Justin Pryzby
On Mon, Sep 19, 2022 at 02:44:12AM +, houzj.f...@fujitsu.com wrote:
> While working on some other patches, I found several typos (duplicate words and
> incorrect function name reference) in the code comments. Here is a small patch
> to fix them.

Thanks.

On Mon, Sep 19, 2022 at 11:05:24AM +0800, Zhang Mingli wrote:
> Good catch. There is a similar typo in doc, runtime.sgml.
> ```using TLS protocols enabled by by setting the parameter```

That one should be backpatched to v15.

Find below some others.

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 901dd435efd..160296e1daf 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -2155,7 +2155,7 @@ InitPartitionPruneContext(PartitionPruneContext *context,
  * Current values of the indexes present in PartitionPruneState count all the
  * subplans that would be present before initial pruning was done.  If initial
  * pruning got rid of some of the subplans, any subsequent pruning passes will
- * will be looking at a different set of target subplans to choose from than
+ * be looking at a different set of target subplans to choose from than
  * those in the pre-initial-pruning set, so the maps in PartitionPruneState
  * containing those indexes must be updated to reflect the new indexes of
  * subplans in the post-initial-pruning set.
diff --git a/src/backend/utils/activity/pgstat.c 
b/src/backend/utils/activity/pgstat.c
index 6224c498c21..5b0f26e3b07 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -556,7 +556,7 @@ pgstat_initialize(void)
  * suggested idle timeout is returned. Currently this is always
  * PGSTAT_IDLE_INTERVAL (1ms). Callers can use the returned time to set up
  * a timeout after which to call pgstat_report_stat(true), but are not
- * required to to do so.
+ * required to do so.
  *
  * Note that this is called only when not within a transaction, so it is fair
  * to use transaction stop time as an approximation of current time.
diff --git a/src/backend/utils/activity/pgstat_replslot.c 
b/src/backend/utils/activity/pgstat_replslot.c
index b77c05ab5fa..9a59012a855 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -8,7 +8,7 @@
  * storage implementation and the details about individual types of
  * statistics.
  *
- * Replication slot stats work a bit different than other other
+ * Replication slot stats work a bit different than other
  * variable-numbered stats. Slots do not have oids (so they can be created on
  * physical replicas). Use the slot index as object id while running. However,
  * the slot index can change when restarting. That is addressed by using the
diff --git a/src/test/ssl/t/SSL/Server.pm b/src/test/ssl/t/SSL/Server.pm
index 62f54dcbf16..0a9e5da01e4 100644
--- a/src/test/ssl/t/SSL/Server.pm
+++ b/src/test/ssl/t/SSL/Server.pm
@@ -257,7 +257,7 @@ The certificate file to use. Implementation is SSL backend 
specific.
 
 =item keyfile => B
 
-The private key to to use. Implementation is SSL backend specific.
+The private key file to use. Implementation is SSL backend specific.
 
 =item crlfile => B
 




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Alvaro Herrera
On 2022-Sep-14, Peter Eisentraut wrote:

> Reading through the SQL standard again, I think this patch goes a bit too
> far in folding NOT NULL and CHECK constraints together.  The spec says that
> you need to remember whether a column was defined as NOT NULL, and that the
> commands DROP NOT NULL and SET NOT NULL only affect constraints defined in
> that way.  In this implementation, a constraint defined as NOT NULL is
> converted to a CHECK (x IS NOT NULL) constraint and the original definition
> is forgotten.

Hmm, I don't read it the same way.  Reading SQL:2016, they talk about a
nullability characteristic (/known not nullable/ or /possibly
nullable/):

: 4.13 Columns, fields, and attributes
: [...]
: Every column has a nullability characteristic that indicates whether the
: value from that column can be the null value. A nullability characteristic
: is either known not nullable or possibly nullable.
: Let C be a column of a base table T. C is known not nullable if and only
: if at least one of the following is true:
: — There exists at least one constraint NNC that is enforced and not
: deferrable and that simply contains a  that is a
:  that is a readily-known-not-null condition for C.
: [other possibilities]

then in the same section they explain that this is derived from a
table constraint:

: A column C is described by a column descriptor. A column descriptor
: includes:
: [...]
: — If C is a column of a base table, then an indication of whether it is
: defined as NOT NULL and, if so, the constraint name of the associated
: table constraint definition.

   [aside: note that elsewhere (), they define
   "readily-known-not-null" in Syntax Rule 3), of 6.39 :

   : 3) Let X denote either a column C or the  VALUE. Given a
   :  BVE and X, the notion “BVE is a
   : readily-known-not-null condition for X” is defined as follows.
   : Case:
   :  a) If BVE is a  of the form “RVE IS NOT NULL”, where RVE is a
   :  that is a  that
   : simply contains a , , or
   :  that is a  that
   : references C, then BVE is a readily-known-not-null condition for C.
   :  b) If BVE is the  “VALUE IS NOT NULL”, then BVE is a
   : readily-known-not-null condition for VALUE.
   :  c) Otherwise, BVE is not a readily-known-not-null condition for X.
   edisa]

Later,  says literally that specifying NOT NULL in a
column is equivalent to the CHECK (.. IS NOT NULL) table constraint:

: 11.4 
: 
: Syntax Rules,
: 17) If a  is specified, then let CND be
: the  if one is specified and let CND be the
: zero-length character character string otherwise; let CA be the
:  if specified and let CA be the zero-length
: character string otherwise. The  is
: equivalent to a  as follows.
: 
: Case:
: 
: a) If a  is specified that contains the
:  NOT NULL, then it is equivalent to the following
: :
:CND CHECK ( C IS NOT NULL ) CA

In my reading of it, it doesn't follow that you have to remember whether
the table constraint was created explicitly with CHECK (c IS NOT NULL) or
as a plain NOT NULL column constraint.  The idea of
being able to do DROP NOT NULL when only a constraint defined as CHECK
(c IS NOT NULL) exists seems to follow from there; and also that you can
use DROP CONSTRAINT to remove one added via plain NOT NULL; and that
both these operations change the nullability characteristic of the
column.  This is made more explicit by the fact that they do state that
the nullability characteristic can *not* be "destroyed" for other types
of constraints, in 11.26 , Syntax Rule
11)

: 11) Destruction of TC shall not cause the nullability characteristic of
: any of the following columns of T to change from known not nullable to
: possibly nullable:
: 
: a) A column that is a constituent of the primary key of T, if any.
: b) The system-time period start column, if any.
: c) The system-time period end column, if any.
: d) The identity column, if any.

then General Rule 7) explains that this does indeed happen for columns
declared to have some sort of NOT NULL constraint, without saying
exactly how was that constraint defined:

: 7) If TC causes some column COL to be known not nullable and no other
: constraint causes COL to be known not nullable, then the nullability
: characteristic of the column descriptor of COL is changed to possibly
: nullable.

> Besides that, I think that users are not going to like that pg_dump rewrites
> their NOT NULL constraints into CHECK table constraints.

This is a good point, but we could get around it by decreeing that
pg_dump dumps the NOT NULL in the old way if the name is not changed
from whatever would be generated normally.  This would require some
games to remove the CHECK one; and it would also mean that partitions
would not use the same constraint as the parent, but rather it'd have to
generate a new constraint name that uses its own table name, rather than
the parent's.

(This makes me wonder what should happen if you rename a table: should
we go around and rename al

Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Bapat
On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  wrote:
>
> On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  
> > wrote:
> > >
> > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma  
> > > > wrote:
> > Can you please point to the documentation.
> >
>
> AFAIU there is just one documentation. Here is the link for it:
>
> https://www.postgresql.org/docs/current/view-pg-replication-slots.html

Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
which the logical slot's consumer has confirmed receiving data. Data
older than this is not available anymore. NULL for physical slots."
The second sentence is misleading. AFAIU, it really should be "Data
corresponding to the transactions committed before this LSN is not
available anymore". WAL before restart_lsn is likely to be removed but
WAL with LSN higher than restart_lsn is preserved. This correction
makes more sense because of the third sentence.

>
> And.. lastly sorry for the delayed response. I was not well and
> couldn't access email for quite some days. The poor dengue had almost
> killed me :(

Dengue had almost killed me also. Take care.

-- 
Best Wishes,
Ashutosh Bapat




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-19 Thread Bharath Rupireddy
On Mon, Sep 19, 2022 at 2:38 PM Fujii Masao  wrote:
>
> Thanks for updating the patch!
>
> =# SELECT * FROM pg_backup_start('test', true);
> =# SELECT * FROM pg_backup_stop();
> LOG:  server process (PID 15651) was terminated by signal 11: Segmentation 
> fault: 11
> DETAIL:  Failed process was running: SELECT * FROM pg_backup_stop();
>
> With v7 patch, pg_backup_stop() caused the segmentation fault.

Fixed. I believed that the regression tests cover pg_backup_start()
and pg_backup_stop(), and relied on make check-world; surprisingly,
there's no test that covers these functions. Would it be a good idea to
add tests for these functions in misc_functions.sql or backup.sql or
somewhere so that they get run as part of make check? Thoughts?

> =# SELECT * FROM pg_backup_start(repeat('test', 1024));
> ERROR:  backup label too long (max 1024 bytes)
> STATEMENT:  SELECT * FROM pg_backup_start(repeat('test', 1024));
>
> =# SELECT * FROM pg_backup_start(repeat('testtest', 1024));
> LOG:  server process (PID 15844) was terminated by signal 11: Segmentation 
> fault: 11
> DETAIL:  Failed process was running: SELECT * FROM 
> pg_backup_start(repeat('testtest', 1024));
>
> When I specified longer backup label in the second run of pg_backup_start()
> after the first run failed, it caused the segmentation fault.
>
>
> +   state = (BackupState *) palloc0(sizeof(BackupState));
> +   state->name = pstrdup(name);
>
> pg_backup_start() calls allocate_backup_state() and allocates the memory for
> the input backup label before checking its length in do_pg_backup_start().
> This can cause the memory for backup label to be allocated too much
> unnecessary. I think that the maximum length of BackupState.name should
> be MAXPGPATH (i.e., maximum allowed length for backup label).

That's a good idea. I'm marking a flag if the label name overflows (>
MAXPGPATH), later using it in do_pg_backup_start() to error out. We
could've thrown error directly, but that changes the behaviour - right
now, first "
wal_level must be set to \"replica\" or \"logical\" at server start."
gets emitted and then label name overflow error - I don't want to do
that.

> > Yeah, but they have to be carried from do_pg_backup_stop() to
> > pg_backup_stop() or callers and also instead of keeping tablespace_map
> > in BackupState and others elsewhere don't seem to be a good idea to
> > me. IMO, BackupState is a good place to contain all the information
> > that's carried across various functions.
>
> In v7 patch, since pg_backup_stop() calls build_backup_content(),
> backup_label and history_file seem not to be carried from do_pg_backup_stop()
> to pg_backup_stop(). This makes me still think that it's better not to include
> them in BackupState...

I'm a bit worried about the backup state being spread across if we
separate out backup_label and history_file from BackupState and keep
tablespace_map contents there. As I said upthread, we are not
allocating memory for them at the beginning, we allocate only when
needed. IMO, this code is readable and more extensible.
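
(For readers skimming the thread, a purely illustrative sketch of the kind of struct being debated; this is not the actual v8 patch. The stand-in typedefs and most field names are assumptions; only name, tablespace_map, backup_label and history_file come from the discussion above.)

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the PostgreSQL typedef */
typedef uint32_t TimeLineID;	/* stand-in for the PostgreSQL typedef */

/*
 * Illustrative only: one container for everything carried across
 * pg_backup_start()/do_pg_backup_start()/do_pg_backup_stop()/pg_backup_stop().
 */
typedef struct BackupState
{
	char	   *name;			/* backup label name given by the caller */
	XLogRecPtr	startpoint;		/* WAL location at backup start */
	TimeLineID	starttli;		/* timeline at backup start */
	XLogRecPtr	stoppoint;		/* WAL location at backup stop */
	char	   *tablespace_map;	/* tablespace map contents, if requested */

	/*
	 * The contested part of the design: generated file contents could also
	 * live here, or be built on demand by build_backup_content() so they
	 * never need to be carried around.
	 */
	char	   *backup_label;
	char	   *history_file;
} BackupState;
```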

I've also taken help of the error callback mechanism to clean up the
allocated memory in case of a failure. For do_pg_abort_backup() cases,
I think we don't need to clean the memory because that gets called on
proc exit (before_shmem_exit()).

Please review the v8 patch further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Refactor-backup-related-code.patch
Description: Binary data


pgstat: stats added in ReadPageInternal() aren't getting reported via pg_stat_wal

2022-09-19 Thread Bharath Rupireddy
Hi,

I added xlogreader cache stats (hits/misses) to pg_stat_wal in
ReadPageInternal() for some of my work and ran some tests with logical
replication subscribers. I had expected that the new stats generated
by walsenders serving the subscribers would be accumulated and shown
via pg_stat_wal view in another session, but that didn't happen. Upon
looking around, I thought adding
pgstat_flush_wal()/pgstat_report_wal() around proc_exit() in
walsener.c might work, but that didn't help either. Can anyone please
let me know why the stats that I added aren't shown via pg_stat_wal
view despite the walsender doing pgstat_initialize() ->
pgstat_init_wal()? Am I missing something?

I'm attaching here with the patch that I was using.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 6bee81b6284d1619d6828e521d1d210ae5505034 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 19 Sep 2022 13:11:31 +
Subject: [PATCH v1] Add xlogreader cache stats

---
 src/backend/access/transam/xlogreader.c | 9 +
 src/backend/catalog/system_views.sql| 4 +++-
 src/backend/utils/activity/pgstat_wal.c | 2 ++
 src/backend/utils/adt/pgstatfuncs.c | 9 +++--
 src/include/catalog/pg_proc.dat | 6 +++---
 src/include/pgstat.h| 2 ++
 src/test/regress/expected/rules.out | 6 --
 7 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 050d2f424e..9a5b3d7893 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1000,7 +1000,16 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	/* check whether we have all the requested data already */
 	if (targetSegNo == state->seg.ws_segno &&
 		targetPageOff == state->segoff && reqLen <= state->readLen)
+	{
+#ifndef FRONTEND
+		PendingWalStats.xlogreader_cache_hits++;
+#endif
 		return state->readLen;
+	}
+
+#ifndef FRONTEND
+	PendingWalStats.xlogreader_cache_misses++;
+#endif
 
 	/*
 	 * Invalidate contents of internal buffer before read attempt.  Just set
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..0baab55e6b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1124,7 +1124,9 @@ CREATE VIEW pg_stat_wal AS
 w.wal_sync,
 w.wal_write_time,
 w.wal_sync_time,
-w.stats_reset
+w.stats_reset,
+w.xlogreader_cache_hits,
+w.xlogreader_cache_misses
 FROM pg_stat_get_wal() w;
 
 CREATE VIEW pg_stat_progress_analyze AS
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..ba97d71471 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -105,6 +105,8 @@ pgstat_flush_wal(bool nowait)
 	WALSTAT_ACC(wal_sync);
 	WALSTAT_ACC(wal_write_time);
 	WALSTAT_ACC(wal_sync_time);
+	WALSTAT_ACC(xlogreader_cache_hits);
+	WALSTAT_ACC(xlogreader_cache_misses);
 #undef WALSTAT_ACC
 
 	LWLockRelease(&stats_shmem->lock);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index be15b4b2e5..c65c7c08a8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1732,7 +1732,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	11
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1759,7 +1759,10 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	   FLOAT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 	   TIMESTAMPTZOID, -1, 0);
-
+	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "xlogreader_cache_hits",
+	   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "xlogreader_cache_misses",
+	   INT8OID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
 	/* Get statistics about WAL activity */
@@ -1785,6 +1788,8 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
 
 	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[9] = Int64GetDatum(wal_stats->xlogreader_cache_hits);
+	values[10] = Int64GetDatum(wal_stats->xlogreader_cache_misses);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..f728406061 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5649,9 +5649,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => '

Re: cataloguing NOT NULL constraints

2022-09-19 Thread Robert Haas
On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera  wrote:
> If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> printed by psql: (this is a bit more noisy than previously and it
> changes a lot of regression tests output).
>
> 55489 16devel 1776237=# create table tab (a int not null);
> CREATE TABLE
> 55489 16devel 1776237=# \d tab
> Tabla «public.tab»
>  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> ─┼─┼──┼──┼─
>  a   │ integer │  │ not null │
> Restricciones CHECK:
> "tab_a_not_null" CHECK (a IS NOT NULL)

In a table with many columns, most of which are NOT NULL, this is
going to produce a ton of clutter. I don't like that.

I'm not sure what a good alternative would be, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Re: cataloguing NOT NULL constraints

2022-09-19 Thread Isaac Morland
On Mon, 19 Sept 2022 at 09:32, Robert Haas  wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera 
> wrote:
> > If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> > printed by psql: (this is a bit more noisy that previously and it
> > changes a lot of regression tests output).
> >
> > 55489 16devel 1776237=# create table tab (a int not null);
> > CREATE TABLE
> > 55489 16devel 1776237=# \d tab
> > Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─┼─┼──┼──┼─
> >  a   │ integer │  │ not null │
> > Restricciones CHECK:
> > "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.
>

I thought I saw some discussion about the SQL standard saying that there is
a difference between putting NOT NULL in a column definition, and CHECK
(column_name IS NOT NULL). So if we're going to take this seriously, I think
that means there needs to be a field in pg_constraint which identifies
whether a constraint is a "real" one created explicitly as a constraint, or
if it is just one created because a field is marked NOT NULL.

If this is correct, the answer is easy: don't show constraints that are
there only because of a NOT NULL in the \d or \d+ listings. I certainly
don't want to see that clutter and I'm having trouble seeing why anybody
else would want to see it either; the information is already there in the
"Nullable" column of the field listing.

The error message for a duplicate constraint name when creating a
constraint needs, however, to be very clear that the conflict is with a
NOT NULL constraint, and which one, since I'm proposing leaving those
off the visible listing; it would be very bad for somebody to get
"duplicate name" and then be unable to see the conflicting entry.


Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Bharath Rupireddy
Hi,

I was recently asked about converting an offset reported in WAL read
error messages[1] to an LSN, so that pg_waldump can be used to verify
the records or the WAL file around that LSN (basically, one can filter
its output by LSN). AFAICS, there's no function that takes an offset as
input and produces an LSN, and I ended up figuring out the LSN manually.
For some of my own work too, I was hitting errors in
XLogReaderValidatePageHeader(), and adding recptr to those error
messages helped me debug issues faster.

We have a bunch of messages [1] that report an offset, but not an LSN.
Firstly, is there an easy way to figure out the LSN from the offset
reported in these error messages? If not, is adding the LSN to these
messages along with the offset a good idea? Of course, we can't just
convert the offset to an LSN using XLogSegNoOffsetToRecPtr() and report
that; we'd want something meaningful, like reporting the LSN of the page
we are reading in or writing out.

Thoughts?

[1]
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not write to log file %s "
   "at offset %u, length %zu: %m",
errmsg("unexpected timeline ID %u in WAL segment %s, offset %u",
errmsg("could not read from WAL segment %s, offset %u: read %d of %zu",
pg_log_error("received write-ahead log record for offset %u with no file open",
"invalid magic number %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"unexpected pageaddr %X/%X in WAL segment %s, offset %u",
"out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Sep 19, 2022 at 10:16 AM Thomas Munro  wrote:
>> Huh... wouldn't systems that pass arguments right-to-left on the stack
>> receive NULL for node?  That'd include the SysV i386 convention used
>> on Linux, *BSD etc.  But that can't be right or we'd know about it...

> I take that back after looking up some long forgotten details; it
> happily ignores extra arguments.

Yeah; the fact that no one has complained in several years seems to
indicate that there's not a problem on supported platforms.  Still,
unlike the quibbles over whether char and struct pointers are the
same, it seems clear that this is the sort of inconsistency that
C2x wants to forbid, presumably in the name of making the world
safe for more-efficient function calling code.  So I think we'd
better go fix ExecShutdownNode before somebody breaks it.

Whichever way we jump on the tree-walker API changes, those won't
be back-patchable.  I think the best we can do for the back branches
is add a configure test to use -Wno-deprecated-non-prototype
if available.  But the ExecShutdownNode change could be back-patched,
and I'm leaning to doing so even though that breakage is just
hypothetical today.

regards, tom lane




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Tom Lane
Isaac Morland  writes:
> I thought I saw some discussion about the SQL standard saying that there is
> a difference between putting NOT NULL in a column definition, and CHECK
> (column_name IS NOT NULL). So if we're going to take this seriously, I think
> that means there needs to be a field in pg_constraint which identifies
> whether a constraint is a "real" one created explicitly as a constraint, or
> if it is just one created because a field is marked NOT NULL.

If we're going to go that way, I think that we should take the further
step of making not-null constraints be their own contype rather than
an artificially generated CHECK.  The bloat in pg_constraint from CHECK
expressions made this way seems like an additional reason not to like
doing it like that.
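
For illustration, against the patch's behavior as described upthread
(not current PostgreSQL), the catalog rows under discussion could be
inspected like this; a dedicated contype code for NOT NULL is
hypothetical at this point:

```sql
-- With the patch, the NOT NULL appears as a regular CHECK row
-- (contype 'c') carrying an "a IS NOT NULL" expression; the suggestion
-- is to give such constraints their own contype instead.
SELECT conname, contype, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'tab'::regclass;
```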

regards, tom lane




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Matthias van de Meent
On Mon, 19 Sept 2022 at 15:32, Robert Haas  wrote:
>
> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera  
> wrote:
> > If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> > printed by psql: (this is a bit more noisy that previously and it
> > changes a lot of regression tests output).
> >
> > 55489 16devel 1776237=# create table tab (a int not null);
> > CREATE TABLE
> > 55489 16devel 1776237=# \d tab
> > Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─┼─┼──┼──┼─
> >  a   │ integer │  │ not null │
> > Restricciones CHECK:
> > "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.

I'm not sure about the 'good' part of this alternative, but we could go
with a single row-based IS NOT NULL check to reduce such clutter, relying
on the fact that `ROW() IS NOT NULL` only matches when all of the row's
attributes are also IS NOT NULL:

Check constraints:
"tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL)

instead of:

Check constraints:
"tab_a_not_null" CHECK (a IS NOT NULL)
"tab_b_not_null" CHECK (b IS NOT NULL)
"tab_c_not_null" CHECK (c IS NOT NULL)
"tab_d_not_null" CHECK (d IS NOT NULL)
"tab_e_not_null" CHECK (e IS NOT NULL)

But the performance of repeated row-casting would probably not be as
good as our current NULL checks if we'd use the current row
infrastructure, and constraint failure reports wouldn't be as helpful
as the current attribute NOT NULL failures.
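
As a quick illustration of the row-composite behavior being relied on
here (plain SQL, nothing specific to the patch):

```sql
-- ROW(...) IS NOT NULL is true only when every attribute is non-null:
SELECT ROW(1, 2) IS NOT NULL;     -- true
SELECT ROW(1, NULL) IS NOT NULL;  -- false
SELECT ROW(1, NULL) IS NULL;      -- also false: a row can be neither
                                  -- NULL nor NOT NULL
```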

Kind regards,

Matthias van de Meent




Re: Switching XLog source from archive to streaming when primary available

2022-09-19 Thread Bharath Rupireddy
On Fri, Sep 16, 2022 at 4:58 PM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
>  wrote:
> >
> > In other words, it seems to me that the macro name doesn't manifest
> > the condition correctly.
> >
> > I don't think we don't particularly want to do that unconditionally.
> > I wanted just to get rid of the macro from the usage site.  Even if
> > the same condition is used elsewhere, I see it better to write out the
> > condition directly there..
>
> I wanted to avoid a bit of duplicate code there. How about naming that
> macro IsXLOGSourceSwitchToStreamEnabled() or
> SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
> or any other better name?

SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
attaching the v7 patch with that change. Please review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b94327d9af60a44f252251be97ee1efabc964f97 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 19 Sep 2022 14:04:49 +
Subject: [PATCH v7] Allow standby to switch WAL source from archive to
 streaming replication

A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.

This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.

Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  45 +++
 src/backend/access/transam/xlogrecovery.c | 124 +++--
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/t/034_wal_source_switch.pl  | 126 ++
 6 files changed, 301 insertions(+), 11 deletions(-)
 create mode 100644 src/test/recovery/t/034_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 700914684d..892442a053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from WAL archive to streaming replication (get WAL from
+primary). If the standby fails to switch to stream mode, it falls back
+to archive mode.
+If this value is specified without units, it is taken as milliseconds.
+The default is five minutes (5min).
+With a lower setting of this parameter, the standby makes frequent
+WAL source switch attempts when the primary is lost for quite longer.
+To avoid this, set a reasonable value.
+A setting of 0 disables the feature. When disabled,
+the standby typically switches to stream mode, only when receive from
+WAL archive finishes (no more WAL left there) or fails for any reason.
+This parameter can only be set in
+the postgresql.conf file or on the server
+command line.
+   
+   
+Reading WAL from archive may not always be efficient and cheaper
+because network latencies, disk IO cost might differ on the archive as
+compared to primary and often the archive may sit far from standby
+impacting recovery performance on the standby. Hence reading WAL
+from the primary, by setting this parameter, as opposed to the archive
+enables the standby to catch up with the primary sooner thus reducing
+replication lag and avoiding WAL files accumulation on the primary.
+  
+   
+Note that the standby may not always attempt to switch source from
+WAL archive to streaming replication at exact
+streaming_replication_retry_interval intervals.
+For example, if the parameter is set to 1min and
+fetching from WAL archive takes 5min, then the
+source swi

Re: remove more archiving overhead

2022-09-19 Thread Noah Misch
On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
> On 18.09.22 09:13, Noah Misch wrote:
> >>>This documentation change only covers archive_library.  How are users of
> >>>archive_command supposed to handle this?
> >>
> >>I believe users of archive_command need to do something similar to what is
> >>described here.  However, it might be more reasonable to expect
> >>archive_command users to simply return false when there is a pre-existing
> >>file, as the deleted text notes.  IIRC that is why I added that sentence
> >>originally.
> >
> >What makes the answer for archive_command diverge from the answer for
> >archive_library?
> 
> I suspect what we are really trying to say here is
> 
> ===
> Archiving setups (using either archive_command or archive_library) should be
> prepared for the rare case that an identical archive file is being archived
> a second time.  In such a case, they should compare that the source and the
> target file are identical and proceed without error if so.
> 
> In some cases, it is difficult or impossible to configure archive_command or
> archive_library to do this.  In such cases, the archiving command or library
> should error like in the case for any pre-existing target file, and
> operators need to be prepared to resolve such cases manually.
> ===
> 
> Is that correct?

I wanted it to stop saying anything like the second paragraph, hence commit
d263ced.  Implementing a proper archiving setup is not especially difficult,
and inviting the operator to work around a wrong implementation invites
damaging mistakes under time pressure.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Sharma
On Mon, Sep 19, 2022 at 5:24 PM Ashutosh Bapat
 wrote:
>
> On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  wrote:
> >
> > On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > > >  wrote:
> > > > >
> > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma 
> > > > >  wrote:
> > > Can you please point to the documentation.
> > >
> >
> > AFAIU there is just one documentation. Here is the link for it:
> >
> > https://www.postgresql.org/docs/current/view-pg-replication-slots.html
>
> Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
> which the logical slot's consumer has confirmed receiving data. Data
> older than this is not available anymore. NULL for physical slots."
> The second sentence is misleading. AFAIU, it really should be "Data
> corresponding to the transactions committed before this LSN is not
> available anymore". WAL before restart_lsn is likely to be removed but
> WAL with LSN higher than restart_lsn is preserved. This correction
> makes more sense because of the third sentence.
>

Thanks for the clarification. Attached is the patch with the changes.
Please have a look.
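
For reference, the fields under discussion can be inspected with
something like:

```sql
-- restart_lsn bounds WAL retention; confirmed_flush_lsn is what the
-- logical slot's consumer has confirmed receiving.
SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```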

--
With Regards,
Ashutosh Sharma.
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f573a4..0d7285d 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2392,8 +2392,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   
The address (LSN) up to which the logical
-   slot's consumer has confirmed receiving data. Data older than this is
-   not available anymore. NULL for physical slots.
+   slot's consumer has confirmed receiving data. Data corresponding to the
+   transactions committed before this LSN is not
+   available anymore. NULL for physical slots.
   
  
 


Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Larry Rosenman



All of a sudden I'm getting repo not found for
Ubuntu 16.04 LTS on the APT repo.  Why?

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-19 Thread Ashutosh Sharma
Hi All,

Currently, we have the pg_current_wal_insert_lsn and pg_walfile_name SQL
functions, which give us the next WAL insert location and the WAL file
that the next insert location belongs to. Can we have a binary version
of these SQL functions? It would be like the other binaries we have,
e.g. pg_waldump, to which we can pass the location of the pg_wal
directory. This binary would scan through the directory and return the
next WAL insert location and the WAL file that location belongs to.
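
For context, the existing SQL-level interface (which requires a running
server) looks like this; the idea is an offline equivalent that inspects
pg_wal directly:

```sql
SELECT pg_current_wal_insert_lsn();
SELECT pg_walfile_name(pg_current_wal_insert_lsn());
```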

The binary version of these SQL functions could be used when the server
is offline. It would help us know the overall amount of WAL that needs
to be replayed when the server is in recovery. The control file does
give us the redo pointer; knowing the end pointer as well would
definitely be helpful.

If you are OK with this, I will prepare a patch for it and share it.
Please let me know your thoughts/comments. Thank you!

--
With Regards,
Ashutosh Sharma.




Re: Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Magnus Hagander
Hello!

Because it has been removed and moved to the archives, as per the warning
from early July.

See
https://www.postgresql.org/message-id/flat/YsV8fmomNNC%2BGpIR%40msg.credativ.de

//Magnus


On Mon, Sep 19, 2022 at 4:46 PM Larry Rosenman  wrote:

>
> All of a sudden I'm getting repo not found for
> Ubuntu 16.04 LTS on the APT repo.  Why?
>
> --
> Larry Rosenman http://www.lerctr.org/~ler
> Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
> US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106
>
>


Re:Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Sergei Kornilov
Hi
Ubuntu 16.04 has been EOL since April 2021, over a year ago.

https://wiki.postgresql.org/wiki/Apt/FAQ#Where_are_older_versions_of_the_packages.3F

regards, Sergei




Re: remove more archiving overhead

2022-09-19 Thread David Steele




On 9/19/22 07:39, Noah Misch wrote:
> On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
>> On 18.09.22 09:13, Noah Misch wrote:
>>>>> This documentation change only covers archive_library.  How are users of
>>>>> archive_command supposed to handle this?
>>>>
>>>> I believe users of archive_command need to do something similar to what is
>>>> described here.  However, it might be more reasonable to expect
>>>> archive_command users to simply return false when there is a pre-existing
>>>> file, as the deleted text notes.  IIRC that is why I added that sentence
>>>> originally.
>>>
>>> What makes the answer for archive_command diverge from the answer for
>>> archive_library?
>>
>> I suspect what we are really trying to say here is
>>
>> ===
>> Archiving setups (using either archive_command or archive_library) should be
>> prepared for the rare case that an identical archive file is being archived
>> a second time.  In such a case, they should compare that the source and the
>> target file are identical and proceed without error if so.
>>
>> In some cases, it is difficult or impossible to configure archive_command or
>> archive_library to do this.  In such cases, the archiving command or library
>> should error like in the case for any pre-existing target file, and
>> operators need to be prepared to resolve such cases manually.
>> ===
>>
>> Is that correct?
>
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced.  Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.


I would also not want to state that duplicate WAL is rare. In practice 
it is pretty common when things are going wrong. Also, implying it is 
rare might lead a user to decide they don't need to worry about it.


Regards,
-David




Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Aleksander Alekseev
Hi Himanshu,

> Done, updated in the v3 patch.

Thanks for the updated patch.

Here is v4 with fixed compiler warnings and some minor tweaks from me.

I didn't put too much thought into the algorithm but I already see
something strange. At verify_heapam.c:553 you declared curr_xmax and
next_xmin. However the variables are not used/initialized until you
do:

```
if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
TransactionIdIsValid(curr_xmax) &&
TransactionIdEquals(curr_xmax, next_xmin)) {
/* ... */
```

In v4 I elected to initialize both curr_xmax and next_xmin with
InvalidTransactionId for safety and in order to silence the compiler
but still there is no way this condition can succeed.

Please make sure there is no logic missing.

-- 
Best regards,
Aleksander Alekseev


v4-0001-Implement-HOT-chain-validation-in-verify_heapam.patch
Description: Binary data


Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Tom Lane
Zhang Mingli  writes:
> In preprocess_aggref(), list same_input_transnos is used to track compatible 
> transnos.
> Free it if we don’t need it anymore.

Very little of the planner bothers with freeing small allocations
like that.  Can you demonstrate a case where this would actually
make a meaningful difference?

regards, tom lane




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Alvaro Herrera


> diff --git a/doc/src/sgml/logical-replication.sgml 
> b/doc/src/sgml/logical-replication.sgml
> index 1ae3287..0ab768d 100644
> --- a/doc/src/sgml/logical-replication.sgml
> +++ b/doc/src/sgml/logical-replication.sgml
> @@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;
>
>  
>
> +   Specifying a column list when the publication also publishes
> +   FOR ALL TABLES IN SCHEMA is not supported.
> +  
> +
> +  
> For partitioned tables, the publication parameter
> publish_via_partition_root determines which column list
> is used. If publish_via_partition_root is
> 
> diff --git a/doc/src/sgml/ref/create_publication.sgml 
> b/doc/src/sgml/ref/create_publication.sgml
> index 0a68c4b..0ced7da 100644
> --- a/doc/src/sgml/ref/create_publication.sgml
> +++ b/doc/src/sgml/ref/create_publication.sgml
> @@ -103,17 +103,17 @@ CREATE PUBLICATION  class="parameter">name
>   
>  
>   
> +  Specifying a column list when the publication also publishes
> +  FOR ALL TABLES IN SCHEMA is not supported.
> + 
> 
> @@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char 
> *queryString,
>   continue;
>  
>   /*
> +  * Disallow specifying column list if any schema is in the
> +  * publication.
> +  *
> +  * XXX We could instead just forbid the case when the 
> publication
> +  * tries to publish the table with a column list and a schema 
> for that
> +  * table. However, if we do that then we need a restriction 
> during
> +  * ALTER TABLE ... SET SCHEMA to prevent such a case which 
> doesn't
> +  * seem to be a good idea.
> +  */
> + if (publish_schema)
> + ereport(ERROR,
> + 
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("cannot use publication column 
> list for relation \"%s.%s\"",
> +
> get_namespace_name(RelationGetNamespace(pri->relation)),
> +
> RelationGetRelationName(pri->relation)),
> + errdetail("Column list cannot be 
> specified if any schema is part of the publication or specified in the 
> list."));
> +

This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one does not seem a good idea.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)




Re: Pruning never visible changes

2022-09-19 Thread Matthias van de Meent
On Mon, 19 Sept 2022 at 01:16, Greg Stark  wrote:
>
> On Fri, 16 Sept 2022 at 10:27, Tom Lane  wrote:
> >
> > Simon Riggs  writes:
> > > A user asked me whether we prune never visible changes, such as
> > > BEGIN;
> > > INSERT...
> > > UPDATE.. (same row)
> > > COMMIT;
> >
> > Didn't we just have this discussion in another thread?
>
> Well.  not "just" :)

This recent thread [0] mentioned the same, and I mentioned it in [1]
too last year.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/2031521.1663076724%40sss.pgh.pa.us#01542683a64a312e5c21541fecd50e63
Subject: Re: Tuples inserted and deleted by the same transaction
Date: 2022-09-13 14:13:44

[1] 
https://www.postgresql.org/message-id/CAEze2Whjnhg96Wt2-DxtBydhmMDmVm2WfWOX3aGcB2C2Hbry0Q%40mail.gmail.com
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-14 09:53:47
(in a thread about a PS comment)




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Önder Kalacı
Hi Hayato Kuroda,

Thanks for the review, please see my reply below:


> ===
> For execRelation.c
>
> 01. RelationFindReplTupleByIndex()
>
> ```
> /* Start an index scan. */
> InitDirtySnapshot(snap);
> -   scan = index_beginscan(rel, idxrel, &snap,
> -
> IndexRelationGetNumberOfKeyAttributes(idxrel),
> -  0);
>
> /* Build scan key. */
> -   build_replindex_scan_key(skey, rel, idxrel, searchslot);
> +   scankey_attoff = build_replindex_scan_key(skey, rel, idxrel,
> searchslot);
>
> +   scan = index_beginscan(rel, idxrel, &snap, scankey_attoff, 0);
> ```
>
> I think "/* Start an index scan. */" should be just above
> index_beginscan().
>

moved there


>
> ===
> For worker.c
>
> 02. sable_indexoid_internal()
>
> ```
> + * Note that if the corresponding relmapentry has InvalidOid
> usableIndexOid,
> + * the function returns InvalidOid.
> + */
> +static Oid
> +usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo
> *relinfo)
> ```
>
> "InvalidOid usableIndexOid" should be "invalid usableIndexOid,"
>

makes sense, updated


>
> 03. check_relation_updatable()
>
> ```
>  * We are in error mode so it's fine this is somewhat slow. It's
> better to
>  * give user correct error.
>  */
> -   if (OidIsValid(GetRelationIdentityOrPK(rel->localrel)))
> +   if (OidIsValid(rel->usableIndexOid))
> {
> ```
>
> Shouldn't we change the above comment to? The check is no longer slow.
>

Hmm, I hadn't noticed this comment earlier. So you're suggesting that
"slow" here refers to the additional function call to
GetRelationIdentityOrPK()? If so, yes, I'll update that.


>
> ===
> For relation.c
>
> 04. GetCheapestReplicaIdentityFullPath()
>
> ```
> +static Path *
> +GetCheapestReplicaIdentityFullPath(Relation localrel)
> +{
> +   PlannerInfo *root;
> +   Query  *query;
> +   PlannerGlobal *glob;
> +   RangeTblEntry *rte;
> +   RelOptInfo *rel;
> +   int attno;
> +   RangeTblRef *rt;
> +   List *joinList;
> +   Path *seqScanPath;
> ```
>
> I think the part that constructs dummy-planner state can be move to
> another function
> because that part is not meaningful for this.
> Especially line 824-846 can.
>
>
Makes sense, simplified the function. Though, it is always hard to pick
good names for these kinds of helper functions. I
picked GenerateDummySelectPlannerInfoForRelation(), does that sound good to
you as well?


>
> ===
> For 032_subscribe_use_index.pl
>
> 05. general
>
> ```
> +# insert some initial data within the range 0-1000
> +$node_publisher->safe_psql('postgres',
> +   "INSERT INTO test_replica_id_full SELECT i%20 FROM
> generate_series(0,1000)i;"
> +);
> ```
>
> It seems that the range of initial data seems [0, 19].
> Same mistake-candidates are found many place.
>

Ah, several copy & paste errors. Fixed (hopefully) all.


>
> 06. general
>
> ```
> +# updates 1000 rows
> +$node_publisher->safe_psql('postgres',
> +   "UPDATE test_replica_id_full SET x = x + 1 WHERE x = 15;");
> ```
>
> Only 50 tuples are modified here.
> Same mistake-candidates are found many place.
>

Alright, yes there were several wrong comments in the tests. I went over
the tests once more to fix those and improve comments.


>
> 07. general
>
> ```
> +# we check if the index is used or not
> +$node_subscriber->poll_query_until(
> +   'postgres', q{select (idx_scan = 200) from pg_stat_all_indexes
> where indexrelname = 'test_replica_id_full_idx';}
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full_3
> updates 200 rows via index";
> ```
> The query will be executed until the index scan is finished, but it may be
> not commented.
> How about changing it to "we wait until the index used on the
> subscriber-side." or something?
> Same comments are found in many place.
>

Makes sense, updated


>
> 08. test related with ANALYZE
>
> ```
> +# Testcase start: SUBSCRIPTION CAN UPDATE THE INDEX IT USES AFTER ANALYZE
> - PARTITIONED TABLE
> +# 
> ```
>
> "Testcase start:" should be "Testcase end:" here.
>

thanks, fixed


>
> 09. general
>
> In some tests results are confirmed but in other test they are not.
> I think you can make sure results are expected in any case if there are no
> particular reasons.
>
>
Alright, yes I also don't see a reason not to do that. Added to all cases.


I'll attach the patch with the next email as I also want to incorporate the
other comments. Hope this is not going to be confusing.

Thanks,
Onder


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Önder Kalacı
Hi Peter,

Thanks again for the review, see my comments below:


>
> ==
>
> 1. Commit message
>
> It is often not feasible to use `REPLICA IDENTITY FULL` on the publication
> because it leads to full table scan per tuple change on the subscription.
> This makes `REPLICA IDENTITY FULL` impracticable -- probably other than
> some small number of use cases.
>
> ~
>
> The "often not feasible" part seems repeated by the "impracticable" part.
>


> SUGGESTION
> Using `REPLICA IDENTITY FULL` on the publication leads to a full table
> scan per tuple change on the subscription. This makes `REPLICA
> IDENTITY FULL` impracticable -- probably other than some small number
> of use cases.
>
> ~~~
>

Sure, this is easier to follow, updated.


>
> 2.
>
> The Majority of the logic on the subscriber side already exists in
> the code.
>
> "Majority" -> "majority"
>
>
fixed


> ~~~
>
> 3.
>
> The ones familiar
> with this part of the code could realize that the sequential scan
> code on the subscriber already implements the `tuples_equal()`
> function.
>
> SUGGESTION
> Anyone familiar with this part of the code might recognize that...
>
> ~~~
>

Yes, this is better, applied


>
> 4.
>
> In short, the changes on the subscriber is mostly
> combining parts of (unique) index scan and sequential scan codes.
>
> "is mostly" -> "are mostly"
>
> ~~~
>
>
applied


> 5.
>
> From the performance point of view, there are few things to note.
>
> "are few" -> "are a few"
>
>
applied


> ==
>
> 6. src/backend/executor/execReplication.c - build_replindex_scan_key
>
> +static int
>  build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
>   TupleTableSlot *searchslot)
>  {
> - int attoff;
> + int index_attoff;
> + int scankey_attoff = 0;
>
> Should it be called 'skey_attoff' for consistency with the param 'skey'?
>
>
That looks better, updated


> ~~~
>
> 7.
>
>   Oid operator;
>   Oid opfamily;
>   RegProcedure regop;
> - int pkattno = attoff + 1;
> - int mainattno = indkey->values[attoff];
> - Oid optype = get_opclass_input_type(opclass->values[attoff]);
> + int table_attno = indkey->values[index_attoff];
> + Oid optype = get_opclass_input_type(opclass->values[index_attoff]);
>
> Maybe the 'optype' should be adjacent to the other Oid opXXX
> declarations just to keep them all together?
>

I do not have any preference on this. Although I do not see such a strong
pattern in the code, I have no objection to doing so; changed.

~~~
>
> 8.
>
> + if (!AttributeNumberIsValid(table_attno))
> + {
> + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY;
> +
> + /*
> + * There are two cases to consider. First, if the index is a primary or
> + * unique key, we cannot have any indexes with expressions. So, at this
> + * point we are sure that the index we are dealing with is not these.
> + */
> + Assert(RelationGetReplicaIndex(rel) != RelationGetRelid(idxrel) &&
> +RelationGetPrimaryKeyIndex(rel) != RelationGetRelid(idxrel));
> +
> + /*
> + * At this point, we are also sure that the index is not consisting
> + * of only expressions.
> + */
> +#ifdef USE_ASSERT_CHECKING
> + indexInfo = BuildIndexInfo(idxrel);
> + Assert(!IndexOnlyOnExpression(indexInfo));
> +#endif
>
> I was a bit confused by the comment. IIUC the code has already called
> the FilterOutNotSuitablePathsForReplIdentFull some point prior so all
> the unwanted indexes are already filtered out. Therefore these
> assertions are just for no reason, other than sanity checking that
> fact, right? If my understand is correct perhaps a simpler single
> comment is possible:
>

Yes, these are for sanity checking.


>
> SUGGESTION (or something like this)
> This attribute is an expression, however
> FilterOutNotSuitablePathsForReplIdentFull was called earlier during
> [...] and the indexes comprising only expressions have already been
> eliminated. We sanity check this now. Furthermore, because primary key
> and unique key indexes can't include expressions we also sanity check
> the index is neither of those kinds.
>
> ~~~
>

I agree that we can improve comments here. I incorporated your suggestion
as well.


>
> 9.
> - return hasnulls;
> + /* We should always use at least one attribute for the index scan */
> + Assert (scankey_attoff > 0);
>
> SUGGESTION
> There should always be at least one attribute for the index scan.
>

applied


>
> ~~~
>
> 10. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex
>
> ScanKeyData skey[INDEX_MAX_KEYS];
> IndexScanDesc scan;
> SnapshotData snap;
> TransactionId xwait;
> Relation idxrel;
> bool found;
> TypeCacheEntry **eq = NULL; /* only used when the index is not unique */
> bool indisunique;
> int scankey_attoff;
>
> 10a.
> Should 'scankey_attoff' be called 'skey_attoff' for consistency with
> the 'skey' array?
>

Yes, it makes sense as you suggested on build_replindex_scan_key

>
> ~
>
> 10b.
> Also, it might be tidier to declare the 'skey_attoff' adjacent to the
> 'skey'.
>

moved

>
> ==
>
> 11. src/backend/replication/logi

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-09-19 Thread Jaime Casanova
On Fri, Sep 16, 2022 at 02:37:17PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 11:15:12 -0500, Jaime Casanova 
>  wrote in 
> > It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside
> 
> Thanks for the info.  I reproduced by make check.. stupid..
> 
> It's the thinko about the base address of reset_off.
> 
> So the attached doesn't crash..
> 

Hi,

Just confirming there have been no crash since this last patch.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Himanshu Upadhyaya
On Mon, Sep 19, 2022 at 8:27 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi Himanshu,
>
> > Done, updated in the v3 patch.
>
> Thanks for the updated patch.
>
> Here is v4 with fixed compiler warnings and some minor tweaks from me.
>
> I didn't put too much thought into the algorithm but I already see
> something strange. At verify_heapam.c:553 you declared curr_xmax and
> next_xmin. However the variables are not used/initialized until you
> do:
>
> ```
> if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
> TransactionIdIsValid(curr_xmax) &&
> TransactionIdEquals(curr_xmax, next_xmin)) {
> /* ... */
> ```
>
> In v4 I elected to initialize both curr_xmax and next_xmin with
> InvalidTransactionId for safety and in order to silence the compiler
> but still there is no way this condition can succeed.
>
> Please make sure there is no logic missing.
>
>
Hi Aleksander,

Thanks for sharing the feedback.
It's my mistake, sorry about that: I was trying to merge two if conditions
and forgot to move the initialization of xmin and xmax. Now I think it
would be better to have a nested if, with an inner condition that tests
whether xmax and xmin match. This way we retrieve and populate xmin and
xmax only when they are actually required for the inner condition.
I have changed this in the attached patch.

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
From 1a51c544c5c9a14441831f5299b9d6fe275fbf53 Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya 
Date: Mon, 19 Sep 2022 21:44:34 +0530
Subject: [PATCH v4] HOT chain validation in verify_heapam

---
 contrib/amcheck/verify_heapam.c | 211 
 1 file changed, 211 insertions(+)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index c875f3e5a2..2b9244ca69 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber] = {0};
+		OffsetNumber successor[MaxOffsetNumber] = {0};
+		bool		lp_valid[MaxOffsetNumber] = {false};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +474,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	report_corruption(&ctx,
 	  psprintf("line pointer redirection to unused item at offset %u",
 			   (unsigned) rdoffnum));
+
+/*
+ * make entry in successor array, redirected tuple will be
+ * validated at the time when we loop over successor array
+ */
+successor[ctx.offnum] = rdoffnum;
 continue;
 			}
 
@@ -504,9 +515,202 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple(&ctx);
+
+			/*
+			 * Now add data to successor array, it will later be used to
+			 * generate the predecessor array. We need to access the tuple's
+			 * header to populate the predecessor array but tuple is not
+			 * necessarily sanity checked yet so delaying construction of
+			 * predecessor array until all tuples are sanity checked.
+			 */
+			nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
+			if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno &&
+nextoffnum != ctx.offnum)
+			{
+successor[ctx.offnum] = nextoffnum;
+			}
+		}
+
+		/*
+		 * Loop over offset and populate predecessor array from all entries
+		 * that are present in successor array.
+		 */
+		ctx.attnum = -1;
+		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
+			 ctx.offnum = OffsetNumberNext(ctx.offnum))
+		{
+			ItemId		curr_lp;
+			ItemId		next_lp;
+			HeapTupleHeader curr_htup;
+			HeapTupleHeader next_htup;
+			TransactionId curr_xmax;
+			TransactionId next_xmin;
+			OffsetNumber nextoffnum = successor[ctx.offnum];
+
+			curr_lp = PageGetItemId(ctx.page, ctx.offnum);
+			if (nextoffnum == 0)
+			{
+/*
+ * Either last updated tuple in chain or corruption raised for
+ * this tuple.
+ */
+continue;
+			}
+			if (ItemIdIsRedirected(curr_lp) && lp_valid[nextoffnum])
+			{
+next_lp = PageGetItemId(ctx.page, nextoffnum);
+next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);;
+if (!HeapTupleHeaderIsHeapOnly(next_htup))
+{
+	report_corruption(&ctx,
+	  psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
+			   (unsigned) nextoffnum));
+		

Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli

Regards,
Zhang Mingli
On Sep 19, 2022, 23:14 +0800, Tom Lane , wrote:
> Very little of the planner bothers with freeing small allocations
> like that.
I think so too; as said, I'm not sure if it's worth it.
> Can you demonstrate a case where this would actually
> make a meaningful difference?
Offhand, an example may help a little:

create table t1(id int);
explain select max(id), min(id), sum(id), count(id), avg(id) from t1;

Modified code to test:

@@ -139,6 +139,7 @@ preprocess_aggref(Aggref *aggref, PlannerInfo *root)
 int16 transtypeLen;
 Oid inputTypes[FUNC_MAX_ARGS];
 int numArguments;
+ static size_t accumulate_list_size = 0;

 Assert(aggref->agglevelsup == 0);

@@ -265,7 +266,7 @@ preprocess_aggref(Aggref *aggref, PlannerInfo *root)
 aggserialfn, aggdeserialfn,
 initValue, initValueIsNull,
 same_input_transnos);
- list_free(same_input_transnos);
+ accumulate_list_size += sizeof(int) * list_length(same_input_transnos);

Using gdb, print accumulate_list_size at each iteration:

SaveBytes = the sum of accumulate_list_size over the iterations: 32 (4+4+8+8),
as we have 5 aggs in the SQL.

If there were N sets of such aggregates (more columns like id, with the above
aggregates), the savings would be N*SaveBytes.

It seems we don't have that many aggregate functions that could share the same
transition function, so is it worth doing?






Tab complete for CREATE SUBSCRIPTION ... CONECTION does not work

2022-09-19 Thread Japin Li


Hi hacker,

As $subject describes, tab-completion does not work for statements such as:

   CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=6543' \t

It seems that get_previous_words() cannot parse the single-quoted string.

OTOH, it works for CREATE SUBSCRIPTION sub CONNECTION xx \t. Should we fix it?

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-19 Thread Jacob Champion
On Wed, Sep 7, 2022 at 10:03 AM Jacob Champion  wrote:
> Yeah, that should be fine. Requiring newer OpenSSLs for stronger
> crypto will probably be uncontroversial.

While looking into this I noticed that I left the following code in place:

> #ifdef HAVE_BE_TLS_GET_CERTIFICATE_HASH
> if (strcmp(selected_mech, SCRAM_SHA_256_PLUS_NAME) == 0 && 
> port->ssl_in_use)

In other words, we're still deciding whether to advertise -PLUS based
only on whether we support tls-server-end-point. Maybe all the
necessary features landed in OpenSSL in the same version, but I
haven't double-checked that, and in any case I think I need to make
this code more correct in the next version of this patch.

--Jacob




Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli

Regards,
Zhang Mingli
On Sep 20, 2022, 00:27 +0800, Zhang Mingli , wrote:
>
> SaveBytes = Sum results of accumulate_list_size: 32(4+4+8+8), as we have 5 
> aggs in sql
Correction: SaveBytes = Sum results of accumulate_list_size: 24(4+4+8+8),


Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Aleksander Alekseev
Hi Himanshu,

> I have changed this in the attached patch.

If it's not too much trouble could you please base your changes on v4
that I submitted? I put some effort into writing a proper commit
message, editing the comments, etc. The easiest way of doing this is
using `git am` and `git format-patch`.

-- 
Best regards,
Aleksander Alekseev




Re: remove more archiving overhead

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 6:08 AM Peter Eisentraut
 wrote:
> I suspect what we are really trying to say here is
>
> ===
> Archiving setups (using either archive_command or archive_library)
> should be prepared for the rare case that an identical archive file is
> being archived a second time.  In such a case, they should compare that
> the source and the target file are identical and proceed without error
> if so.
>
> In some cases, it is difficult or impossible to configure
> archive_command or archive_library to do this.  In such cases, the
> archiving command or library should error like in the case for any
> pre-existing target file, and operators need to be prepared to resolve
> such cases manually.
> ===
>
> Is that correct?

I think it is. I think commit d263ced225bffe2c340175125b0270d1869138fe
made the documentation in this area substantially clearer than what we
had before, and I think that we could adjust that text to apply to
both archive libraries and archive commands relatively easily, so I'm
not really sure that we need to add text like what you propose here.

However, I also don't think it would be particularly bad if we did. An
advantage of that language is that it makes it clear what the
consequences are if you fail to follow the recommendation about
handling identical files. That may be helpful to users in
understanding the behavior of the system, and might even make it more
likely that they will include proper handling of that case in their
archive_command or archive_library.

"impossible" does seem like a bit of a strong word, though. I'd
recommend leaving that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remove more archiving overhead

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 10:39 AM Noah Misch  wrote:
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced.  Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I agree wholeheartedly with the second sentence. I do think that we
need to be a little careful not to be too prescriptive. Telling people
what they have to do when they don't really have to do it often leads
to non-compliance. It can be more productive to spell out the precise
consequences of non-compliance, so I think Peter's proposal has some
merit. On the other hand, I don't see any real problem with the
current language, either.

Unfortunately, no matter what we do here, it is not likely that we'll
soon be able to eliminate the phenomenon where people use a buggy
home-grown archive_command, both because we've been encouraging that
in our documentation for a very long time, and also because the people
who are most likely to do something half-baked here are probably the
least likely to actually read the updated documentation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




pg_create_logical_replication_slot argument incongruency

2022-09-19 Thread Florin Irion
Hello,

The function `pg_create_logical_replication_slot()` is documented to have
a `two_phase` argument (note the underscore), but the function instead
requires `twophase`.

```
\df pg_catalog.pg_create_logical_replication_slot
List of functions
-[ RECORD 1 ]-------+---------------------------------------------------------
Schema              | pg_catalog
Name                | pg_create_logical_replication_slot
Result data type    | record
Argument data types | slot_name name, plugin name, temporary boolean DEFAULT false, twophase boolean DEFAULT false, OUT slot_name name, OUT lsn pg_lsn
Type                | func
```
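
For illustration, a call that follows the documentation fails against
the current function definition, while the undocumented spelling works
(sketch; the slot name is arbitrary):

```sql
-- Fails: no function matches, because the parameter is actually named
-- "twophase" rather than the documented "two_phase".
SELECT pg_create_logical_replication_slot('s1', 'test_decoding',
                                          two_phase => true);

-- Works with the undocumented spelling:
SELECT pg_create_logical_replication_slot('s1', 'test_decoding',
                                          twophase => true);
```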

This was introduced in commit 19890a06.

IMHO we should use the documented argument name `two_phase` and change the
function to accept it.

What do you think?

Please, check the attached patch.


Cheers,
Florin
--
*www.enterprisedb.com *


two_phase_slot_v1.patch
Description: Binary data


Re: Kerberos delegation support in libpq and postgres_fdw

2022-09-19 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On Thu, Jul 7, 2022 at 4:24 PM Jacob Champion  wrote:
> > So my question is this: does substituting my credentials for the admin's
> > credentials let me weaken or break the transport encryption on the
> > backend connection, and grab the password that I'm not supposed to have
> > access to as a front-end client?
> 
> With some further research: yes, it does.
> 
> If a DBA is using a GSS encrypted tunnel to communicate to a foreign
> server, accepting delegation by default means that clients will be
> able to break that backend encryption at will, because the keys in use
> will be under their control.

This is coming across as if it's a surprise of some kind when it
certainly isn't..  If the delegated credentials are being used to
authenticate and establish the connection from that backend to another
system then, yes, naturally that means that the keys provided are coming
from the client and the client knows them.  The idea of arranging to
have an admin's credentials used to authenticate to another system where
the backend is actually controlled by a non-admin user is, in fact, the
issue in what is being outlined above as that's clearly a situation
where the user's connection is being elevated to an admin level.  That's
also something that we try to avoid having happen because it's not
really a good idea, which is why we require a password today for the
connection to be established (postgres_fdw/connection.c:

Non-superuser cannot connect if the server does not request a password.

).

Consider that, in general, the user could also simply directly connect
to the other system themselves instead of having a PG backend make that
connection for them- the point in doing it from PG would be to avoid
having to pass all the data back through the client's system.

Consider SSH instead of PG.  What you're pointing out, accurately, is
that if an admin were to install their keys into a user's .ssh directory
unencrypted and then the user logged into the system, they'd then be
able to SSH to another system with the admin's credentials and then
they'd need the admin's credentials to decrypt the traffic, but that if,
instead, the user brings their own credentials then they could
potentially decrypt the connection between the systems.  Is that really
the issue here?  Doesn't seem like that's where the concern should be in
this scenario.

> > Maybe there's some ephemeral exchange going on that makes it too hard to
> > attack in practice, or some other mitigations.
> 
> There is no forward secrecy, ephemeral exchange, etc. to mitigate this [1]:
> 
>The Kerberos protocol in its basic form does not provide perfect
>forward secrecy for communications.  If traffic has been recorded by
>an eavesdropper, then messages encrypted using the KRB_PRIV message,
>or messages encrypted using application-specific encryption under
>keys exchanged using Kerberos can be decrypted if the user's,
>application server's, or KDC's key is subsequently discovered.
> 
> So the client can decrypt backend communications that make use of its
> delegated key material. (This also means that gssencmode is a lot
> weaker than I expected.)

The backend wouldn't be able to establish the connection in the first
place without those delegated credentials.

> > I'm trying to avoid writing Wireshark dissector
> > code, but maybe that'd be useful either way...
> 
> I did end up filling out the existing PGSQL dissector so that it could
> decrypt GSSAPI exchanges (with the use of a keytab, that is). If you'd
> like to give it a try, the patch, based on Wireshark 3.7.1, is
> attached. Note the GPLv2 license. It isn't correct code yet, because I
> didn't understand how packet reassembly worked in Wireshark when I
> started writing the code, so really large GSSAPI messages that are
> split across multiple TCP packets will confuse the dissector. But it's
> enough to prove the concept.
> 
> To see this in action, set up an FDW connection that uses gssencmode
> (so the server in the middle will need its own Kerberos credentials).

The server in the middle should *not* be using its own Kerberos
credentials to establish the connection to the other system- that's
elevating the credentials used for that connection and is something that
should be prevented for non-superusers already (see above).  We do allow
that when a superuser is involved because they are considered to
essentially have OS-level privileges and therefore could see those
credentials anyway, but that's not the case for non-superusers.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: walmethods.c/h are doing some strange things

2022-09-19 Thread Robert Haas
On Thu, Sep 15, 2022 at 9:39 PM Kyotaro Horiguchi
 wrote:
> At Fri, 2 Sep 2022 11:52:38 -0400, Robert Haas  wrote 
> in
> > that type that can ever exist, and the pointer to that object is
> > stored in a global variable managed by walmethods.c. So whereas in
> > other cases we give you the object and then a way to get the
> > corresponding set of callbacks, here we only give you the callbacks,
> > and we therefore have to impose the artificial restriction that there
> > can only ever be one object.
>
> Makes sense to me.

OK, I have committed the patches. Before doing that, I fixed a couple
of bugs in the first one, and did a little bit of rebasing of the
second one.

> I agree to the view. That part seems to be a part of
> open_for_write()'s body functions. But, I'm not sure how we untangle
> them at a glance, too.  In the first place, I'm not sure why we need
> to do that despite the file going to be overwritten from the
> beginning, though..

I suspect that we're going to have to untangle things bit by bit.

One place where I think we might be able to improve things is with the
handling of compression suffixes (e.g. .gz, .lz4) and temp suffixes
(e.g. .partial, .tmp). At present, responsibility for adding these
suffixes to pathnames is spread across receivelog.c and walmethods.c
in a way that, to me, looks pretty random. It's not exactly clear to
me what the best design is here right now, but I think either (A)
receivelog.c should take full responsibility for computing the exact
filename and walmethods.c should just blindly write the data into the
exact filename it's given, or else (B) receivelog.c should take no
responsibility for pathname construction and the fact that there is
pathname munging happening should be hidden inside walmethods.c. Right
now, walmethods.c is doing pathname munging, but the munging is
visible from receivelog.c, so the responsibility is spread across the
two files rather than being the sole responsibility of either.

What's also a little bit aggravating about this is that it doesn't
feel like we're accomplishing all that much code reuse here.
walmethods.c and receivelog.c are shared only between pg_basebackup
and pg_receivewal, but the tar method is used only by pg_basebackup.
The directory mode is shared, but actually the two programs need a
bunch of different things. pg_recievelog needs a bunch of logic to
deal with the possibility that it is receiving data into an existing
directory and that this directory might at any time start to be used
to feed a standby, or used for PITR. That's not an issue for
pg_basebackup: if it fails, the whole directory will be removed, so
there is no need to worry about fsyncs or padding or overwriting
existing files. On the other hand, pg_basebackup needs a bunch of
logic to create a .done file for each WAL segment which is not
required in the case of pg_receivewal. It feels like we've got as much
conditional logic as we do common logic...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Tom Lane
Here's a second-generation patch that fixes the warnings by inserting
casts into a layer of macro wrappers.  I had supposed that this would
cause us to lose all detection of wrongly-chosen walker functions,
so I was very pleased to see this when applying it to yesterday's HEAD:

execProcnode.c:792:2: warning: cast from 'bool (*)(PlanState *)' (aka 'bool 
(*)(struct PlanState *)') to 'planstate_tree_walker_callback' (aka 'bool 
(*)(struct PlanState *, void *)') converts to incompatible function type 
[-Wcast-function-type]
planstate_tree_walker(node, ExecShutdownNode, NULL);
^~~
../../../src/include/nodes/nodeFuncs.h:180:33: note: expanded from macro 
'planstate_tree_walker'
planstate_tree_walker_impl(ps, (planstate_tree_walker_callback) (w), c)
   ^~~~

So we've successfully suppressed the pedantic -Wdeprecated-non-prototype
warnings, and we have activated the actually-useful -Wcast-function-type
warnings, which seem to do exactly what we want in this context:

'-Wcast-function-type'
 Warn when a function pointer is cast to an incompatible function
 pointer.  In a cast involving function types with a variable
 argument list only the types of initial arguments that are provided
 are considered.  Any parameter of pointer-type matches any other
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 pointer-type.  Any benign differences in integral types are
 ^^^^^^^^^^^^
 ignored, like 'int' vs.  'long' on ILP32 targets.  Likewise type
 qualifiers are ignored.  The function type 'void (*) (void)' is
 special and matches everything, which can be used to suppress this
 warning.  In a cast involving pointer to member types this warning
 warns whenever the type cast is changing the pointer to member
 type.  This warning is enabled by '-Wextra'.

(That verbiage is from the gcc manual; clang seems to act the same
except that -Wcast-function-type is selected by -Wall, or perhaps is
even on by default.)

So I'm pretty pleased with this formulation: no caller changes are
needed, and it does exactly what we want warning-wise.

regards, tom lane

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 3bac350bf5..724d076674 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -27,10 +27,12 @@
 static bool expression_returns_set_walker(Node *node, void *context);
 static int	leftmostLoc(int loc1, int loc2);
 static bool fix_opfuncids_walker(Node *node, void *context);
-static bool planstate_walk_subplans(List *plans, bool (*walker) (),
+static bool planstate_walk_subplans(List *plans,
+	planstate_tree_walker_callback walker,
 	void *context);
 static bool planstate_walk_members(PlanState **planstates, int nplans,
-   bool (*walker) (), void *context);
+   planstate_tree_walker_callback walker,
+   void *context);
 
 
 /*
@@ -1909,9 +1911,9 @@ check_functions_in_node(Node *node, check_function_callback checker,
  */
 
 bool
-expression_tree_walker(Node *node,
-	   bool (*walker) (),
-	   void *context)
+expression_tree_walker_impl(Node *node,
+			tree_walker_callback walker,
+			void *context)
 {
 	ListCell   *temp;
 
@@ -1923,6 +1925,10 @@ expression_tree_walker(Node *node,
 	 * when we expect a List we just recurse directly to self without
 	 * bothering to call the walker.
 	 */
+#define WALK(n) walker((Node *) (n), context)
+
+#define LIST_WALK(l) expression_tree_walker_impl((Node *) (l), walker, context)
+
 	if (node == NULL)
 		return false;
 
@@ -1946,25 +1952,21 @@ expression_tree_walker(Node *node,
 			/* primitive node types with no expression subnodes */
 			break;
 		case T_WithCheckOption:
-			return walker(((WithCheckOption *) node)->qual, context);
+			return WALK(((WithCheckOption *) node)->qual);
 		case T_Aggref:
 			{
 Aggref	   *expr = (Aggref *) node;
 
-/* recurse directly on List */
-if (expression_tree_walker((Node *) expr->aggdirectargs,
-		   walker, context))
+/* recurse directly on Lists */
+if (LIST_WALK(expr->aggdirectargs))
 	return true;
-if (expression_tree_walker((Node *) expr->args,
-		   walker, context))
+if (LIST_WALK(expr->args))
 	return true;
-if (expression_tree_walker((Node *) expr->aggorder,
-		   walker, context))
+if (LIST_WALK(expr->aggorder))
 	return true;
-if (expression_tree_walker((Node *) expr->aggdistinct,
-		   walker, context))
+if (LIST_WALK(expr->aggdistinct))
 	return true;
-if (walker((Node *) expr->aggfilter, context))
+if (WALK(expr->aggfilter))
 	return true;
 			}
 			break;
@@ -1972,8 +1974,7 @@ expression_tree_walker(Node *node,
 			{
 GroupingFunc *grouping = (GroupingFunc *) node;
 
-if (expression_tree_walker((No

Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Robert Haas
On Sun, Sep 18, 2022 at 4:58 PM Tom Lane  wrote:
> BTW, I was distressed to discover that someone decided they could
> use ExecShutdownNode as a planstate_tree_walker() walker even though
> its argument list is not even the right length.  I'm a bit flabbergasted
> that we seem to have gotten away with that so far, because I'd have
> thought for sure that it'd break some platform's convention for which
> argument gets passed where.  I think we need to fix that, independently
> of what we do about the larger scope of these problems.  To avoid an
> API break, I propose making ExecShutdownNode just be a one-liner that
> calls an internal ExecShutdownNode_walker() function.  (I've not done
> it that way in the attached, though.)

I think this was brain fade on my part ... or possibly on Amit
Kapila's part, but I believe it was probably me. I agree that it's
impressive that it actually seemed to work that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: has_privs_of_role vs. is_member_of_role, redux

2022-09-19 Thread Robert Haas
On Fri, Aug 26, 2022 at 10:11 AM Robert Haas  wrote:
> Here's a small patch. Despite the small size of the patch, there are a
> couple of debatable points here:

Nobody's commented on this patch specifically, but it seemed like we
had consensus that ALTER DEFAULT PRIVILEGES was doing The Wrong Thing,
so I've pushed the patch I posted previously for that issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Silencing the remaining clang 15 warnings

2022-09-19 Thread Tom Lane
While working on the -Wdeprecated-non-prototype fixups discussed
nearby, I saw that clang 15.0 produces a few other new warnings
(which are also visible in the buildfarm).  Pursuant to our
usual policy that we should suppress warnings on compilers likely
to be used for development, here's a patch to silence them.

There are three groups of these:

* With %pure-parser, Bison makes the "yynerrs" variable local
instead of static, and then if you don't use it clang notices
that it's set but never read.  There doesn't seem to be a way
to persuade Bison not to emit the variable at all, so here I've
just added "(void) yynerrs;" to the topmost production of each
affected grammar.  If anyone has a nicer idea, let's hear it.

* xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
that is only read in the "#ifdef WAL_DEBUG" stanza at the
bottom.  Here I've done the rather ugly and brute-force thing
of wrapping all the variable's references in "#ifdef WAL_DEBUG".
(I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
did not silence the warning.)  I kind of wonder how useful this
function's WAL_DEBUG output is --- maybe just dropping that
altogether would be better?

* array_typanalyze.c's compute_array_stats counts the number
of null arrays in the column, but then does nothing with the
result.  AFAICS this is redundant with what std_compute_stats
will do, so I just removed the variable.

Any thoughts?

regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81d339d57d..2651bcaa76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1785,7 +1785,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
+#ifdef WAL_DEBUG
 	int			npages = 0;
+#endif
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -1928,7 +1930,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
+#ifdef WAL_DEBUG
 		npages++;
+#endif
 	}
 	LWLockRelease(WALBufMappingLock);
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 82f03fc9c9..c6ec694ceb 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -864,6 +864,7 @@ parse_toplevel:
 			stmtmulti
 			{
 pg_yyget_extra(yyscanner)->parsetree = $1;
+(void) yynerrs;		/* suppress compiler warning */
 			}
 			| MODE_TYPE_NAME Typename
 			{
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 1964cedd93..2360c680ac 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -218,7 +218,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 {
 	ArrayAnalyzeExtraData *extra_data;
 	int			num_mcelem;
-	int			null_cnt = 0;
 	int			null_elem_cnt = 0;
 	int			analyzed_rows = 0;
 
@@ -320,8 +319,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		value = fetchfunc(stats, array_no, &isnull);
 		if (isnull)
 		{
-			/* array is null, just count that */
-			null_cnt++;
+			/* ignore arrays that are null overall */
 			continue;
 		}
 
diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y
index 7e28853a57..2a56629cc3 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -113,6 +113,7 @@ result:
 		*result = palloc(sizeof(JsonPathParseResult));
 		(*result)->expr = $2;
 		(*result)->lax = $1;
+		(void) yynerrs;
 	}
 	| /* EMPTY */	{ *result = NULL; }
 	;
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index ade2ecdaab..f6387d4a3c 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -80,7 +80,10 @@ static PgBenchExpr *make_case(yyscan_t yyscanner, PgBenchExprList *when_then_lis
 
 %%
 
-result: expr{ expr_parse_result = $1; }
+result: expr{
+expr_parse_result = $1;
+(void) yynerrs; /* suppress compiler warning */
+			}
 
 elist:		{ $$ = NULL; }
 	| expr	{ $$ = make_elist($1, NULL); }


Re: has_privs_of_role vs. is_member_of_role, redux

2022-09-19 Thread Robert Haas
On Thu, Sep 8, 2022 at 1:06 PM  wrote:
> A different line of thought (compared to the "USAGE" privilege I
> discussed earlier), would be:
> To transfer ownership of an object, you need two sets of privileges:
> - You need to have the privilege to initiate a request to transfer
> ownership.
> - You need to have the privilege to accept a request to transfer ownership.
>
> Let's imagine there'd be such a request created temporarily, then when I
> start the process of changing ownership, I would have to change to the
> other role and then accept that request.
>
> In theory, I could also inherit that privilege, but that's not how the
> system works today. By using is_member_of_role, the decision was already
> made that this should not depend on inheritance. What is left, is the
> ability to do it via SET ROLE only.

I do not accept the argument that we've already made the decision that
this should not depend on inheritance. It's pretty clear that we
haven't thought carefully enough about which checks should depend only
on membership, and which ones should depend on inheritance. The patch
I committed just now to fix ALTER DEFAULT PRIVILEGES is one clear
example of where we've gotten that wrong. We also changed the way
predefined roles worked with inheritance not too long ago, so that
they started using has_privs_of_role() rather than
is_member_of_role(). Our past thinking on this topic has been fuzzy
enough that we can't really conclude that because something uses
is_member_of_role() now that's what it should continue to do in the
future. We are working to get from a messy situation where the rules
aren't consistent or understandable to one where they are, and that
may mean changing some things.

> So it should not be has_privs_of_role() nor has_privs_of_role() ||
> member_can_set_role(), as you suggested above, but rather just
> member_can_set_role() only. Of course, only in the context of the SET
> ROLE patch.

Now, having said that, this choice of behavior might have some
advantages. It would mean that you could GRANT pg_read_all_settings TO
someone WITH INHERIT TRUE, SET FALSE and that user would be able to
read all settings but would not be able to create objects owned by
pg_read_all_settings. It would also be upward-compatible with the
existing behavior, which is nice.

Well, maybe. Suppose that role A has been granted pg_read_all_settings
WITH INHERIT TRUE, SET TRUE and role B has been granted
pg_read_all_settings WITH INHERIT TRUE, SET FALSE. A can create a
table owned by pg_read_all_settings. If A does that, then B can now
create a trigger on that table and usurp the privileges of
pg_read_all_settings, after which B can now create any number of
objects owned by pg_read_all_settings. If A does not do that, though,
I think that with the proposed rule, B would have no way to create
objects owned by A. This is a bit unsatisfying. It seems like B should
either have the right to usurp pg_read_all_settings's privileges or
not, rather than maybe having that right depending on what some other
user chooses to do.
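
Spelled out as grants, that scenario is roughly the following (the role
names are placeholders, and the SET option here is the one from the
proposal under discussion rather than released syntax):

    GRANT pg_read_all_settings TO role_a WITH INHERIT TRUE, SET TRUE;
    GRANT pg_read_all_settings TO role_b WITH INHERIT TRUE, SET FALSE;

    -- role_a can SET ROLE pg_read_all_settings and create objects
    -- owned by it; role_b can only use the privileges it inherits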

But maybe it's OK. It's hard to come up with perfect solutions here.
One could take the view that the issue here is that
pg_read_all_settings shouldn't have the right to create objects in the
first place, and that this INHERIT vs. SET ROLE distinction is just a
distraction. However, that would require accepting the idea that it's
possible for a role to lack privileges granted to PUBLIC, which also
sounds pretty unsatisfying. On the whole, I'm inclined to think it's
reasonable to suppose that if you want to grant a role to someone
without letting them create objects owned by that role, it should be a
role that doesn't own any existing objects either. Essentially, that's
legislating that predefined roles should be minimally privileged: they
should hold the ability to do whatever it is that they are there to do
(like read all settings) but not have any other privileges (like the
ability to do stuff to objects they own).

But maybe there's a better answer. Ideas/opinions welcome.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-09-19 Thread Robert Haas
On Tue, May 31, 2022 at 5:33 AM Dmitry Koval  wrote:
> There are not many commands in PostgreSQL for working with partitioned
> tables. This is an obstacle to their widespread use.
> Adding SPLIT PARTITION/MERGE PARTITIONS operations can make easier to
> use partitioned tables in PostgreSQL.
> (This is especially important when migrating projects from ORACLE DBMS.)
>
> SPLIT PARTITION/MERGE PARTITIONS commands are supported for range
> partitioning (BY RANGE) and for list partitioning (BY LIST).
> For hash partitioning (BY HASH) these operations are not supported.

This may be a good idea, but I would like to point out one
disadvantage of this approach.

If you know that a certain partition is not changing, and you would
like to split it, you can create two or more new standalone tables and
populate them from the original partition using INSERT .. SELECT. Then
you can BEGIN a transaction, DETACH the existing partitions, and
ATTACH the replacement ones. By doing this, you take an ACCESS
EXCLUSIVE lock on the partitioned table only for a brief period. The
same kind of idea can be used to merge partitions.
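
To spell that out, a rough sketch of the manual approach might look
like this (the table and partition names are made up, assuming a table
range-partitioned on a date column):

    CREATE TABLE measurement_y2021h1 (LIKE measurement INCLUDING ALL);
    CREATE TABLE measurement_y2021h2 (LIKE measurement INCLUDING ALL);

    -- populate the replacements while the old partition stays attached
    INSERT INTO measurement_y2021h1
        SELECT * FROM measurement_y2021 WHERE logdate < '2021-07-01';
    INSERT INTO measurement_y2021h2
        SELECT * FROM measurement_y2021 WHERE logdate >= '2021-07-01';

    -- ACCESS EXCLUSIVE on the parent is only held for this short transaction
    BEGIN;
    ALTER TABLE measurement DETACH PARTITION measurement_y2021;
    ALTER TABLE measurement ATTACH PARTITION measurement_y2021h1
        FOR VALUES FROM ('2021-01-01') TO ('2021-07-01');
    ALTER TABLE measurement ATTACH PARTITION measurement_y2021h2
        FOR VALUES FROM ('2021-07-01') TO ('2022-01-01');
    COMMIT;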

It seems hard to do something comparable with built-in DDL for SPLIT
PARTITION and MERGE PARTITION. You could start by taking e.g. SHARE
lock on the existing partition(s) and then wait until the end to take
ACCESS EXCLUSIVE lock on the partitions, but we typically avoid such
coding patterns, because the lock upgrade might deadlock and then a
lot of work would be wasted. So most likely with the approach you
propose here you will end up acquiring ACCESS EXCLUSIVE lock at the
beginning of the operation and then shuffle a lot of data around while
still holding it, which is pretty painful.

Because of this problem, I find it hard to believe that these commands
would get much use, except perhaps on small tables or in
non-production environments, unless people just didn't know about the
alternatives. That's not to say that something like this has no value.
As a convenience feature, it's fine. It's just hard for me to see it
as any more than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Consider parallel for lateral subqueries with limit

2022-09-19 Thread James Coleman
On Tue, Mar 1, 2022 at 5:35 PM Tom Lane  wrote:
>
> But more generally, I don't think you've addressed the fundamental
> concern, which is that a query involving Limit is potentially
> nondeterministic (if it lacks a fully-deterministic ORDER BY),
> so that different workers could get different answers from it if
> they're using a plan type that permits that to happen.  (See the
> original discussion that led to 75f9c4ca5, at [1].)  I do not see
> how a lateral dependency removes that hazard.
> ...
> > In this case we are already conceivably getting different
> > results for each execution of the subquery "Limit(Y)" even if we're
> > not running those executions across multiple workers.
>
> That seems to be about the same argument Andres made initially
> in the old thread, but we soon shot that down as not being the
> level of guarantee we want to provide.  There's nothing in the
> SQL standard that says that
> select * from events where account in
> (select account from events
>  where data->>'page' = 'success.html' limit 3);
> (the original problem query) shall execute the sub-query
> only once, but people expect it to act that way.

I understand that particular case being a bit of a gotcha (i.e., what
we would naturally expect exceeds what the spec says).

But in the case where there's correlation via LATERAL we already don't
guarantee unique executions for a given set of params into the lateral
subquery execution, right? For example, suppose we have:

  select *
  from foo
  left join lateral (
select n
from bar
where bar.a = foo.a
limit 1
  ) on true

and suppose that foo.a (in the table scan) returns these values in
order: 1, 2, 1. In that case we'll execute the lateral subquery 3
separate times rather than attempting to order the results of foo.a
such that we can re-execute the subquery only when the param changes
to a new unique value (and we definitely don't cache the results to
guarantee unique subquery executions).

So, assuming that we can guarantee we're talking about the proper
lateral reference (not something unrelated as you pointed out earlier)
then I don't believe we're actually changing even implicit guarantees
about how the query executes. I think that's probably true both in
theory (I claim that once someone declares a lateral join they're
definitionally expecting a subquery execution per outer row) and in
practice (e.g., your hypothesis about synchronized seq scans would
apply here also to subsequent executions of the subquery).

Is there still something I'm missing about the concern you have?

> If you want to improve this area, my feeling is that it'd be
> better to look into what was speculated about in the old
> thread: LIMIT doesn't create nondeterminism if the query has
> an ORDER BY that imposes a unique row ordering, ie
> order-by-primary-key.  We didn't have planner infrastructure
> that would allow checking that cheaply in 2018, but maybe
> there is some now?

Assuming what I argued earlier holds true for lateral [at minimum when
correlated] subquery execution, then I believe your suggestion here is
orthogonal and would expand the use cases even more. For example, if
we were able to guarantee a unique result set (including order), then
we could allow parallelizing subqueries even if they're not lateral
and correlated.

James Coleman




Re: Consider parallel for lateral subqueries with limit

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 3:58 PM James Coleman  wrote:
> But in the case where there's correlation via LATERAL we already don't
> guarantee unique executions for a given set of params into the lateral
> subquery execution, right? For example, suppose we have:
>
>   select *
>   from foo
>   left join lateral (
> select n
> from bar
> where bar.a = foo.a
> limit 1
>   ) on true
>
> and suppose that foo.a (in the table scan) returns these values in
> order: 1, 2, 1. In that case we'll execute the lateral subquery 3
> separate times rather than attempting to order the results of foo.a
> such that we can re-execute the subquery only when the param changes
> to a new unique value (and we definitely don't cache the results to
> guarantee unique subquery executions).

I think this is true, but I don't really understand why we should
focus on LATERAL here. What we really need, and I feel like we've
talked about this before, is a way to reason about where parameters
are set and used. Your sample query gets a plan like this:

 Nested Loop Left Join  (cost=0.00..1700245.00 rows=1 width=8)
   ->  Seq Scan on foo  (cost=0.00..145.00 rows=1 width=4)
   ->  Limit  (cost=0.00..170.00 rows=1 width=4)
 ->  Seq Scan on bar  (cost=0.00..170.00 rows=1 width=4)
   Filter: (foo.a = a)

If this were to occur inside a larger plan tree someplace, it would be
OK to insert a Gather node above the Nested Loop node without doing
anything further, because then the parameter that stores foo.a would
be both set and used in the worker. If you wanted to insert a Gather
at any other place in this plan, things get more complicated. But just
because you have LATERAL doesn't mean that you have this problem,
because if you delete the "limit 1" then the subqueries get flattened
together and the parameter disappears, and if you delete the lateral
reference (i.e. WHERE foo.a = bar.a) then there's still a subquery but
it no longer refers to an outer parameter. And on the flip side just
because you don't have LATERAL doesn't mean that you don't have this
problem. e.g. the query could instead be:

select *, (select n from bar where bar.a = foo.a limit 1) from foo;

...which I think is pretty much equivalent to your formulation and has
the same problem as far as parallel query is concerned, but does not
involve the LATERAL keyword.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fix typos in code comments

2022-09-19 Thread David Rowley
On Mon, 19 Sept 2022 at 23:10, Justin Pryzby  wrote:
> Find below some others.

Thanks. Pushed.

David




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-09-19 Thread Dmitry Koval

Thanks for the comments and advice!
I thought about this problem and discussed it with colleagues.
Unfortunately, I don't know of a good general solution.

19.09.2022 22:56, Robert Haas wrote:

If you know that a certain partition is not changing, and you would
like to split it, you can create two or more new standalone tables and
populate them from the original partition using INSERT .. SELECT. Then
you can BEGIN a transaction, DETACH the existing partitions, and
ATTACH the replacement ones. By doing this, you take an ACCESS
EXCLUSIVE lock on the partitioned table only for a brief period. The
same kind of idea can be used to merge partitions.


But for a specific situation like this (a certain partition that is not
changing) we could add a CONCURRENTLY modifier.

Our DDL query could look like

ALTER TABLE...SPLIT PARTITION [CONCURRENTLY];

With the CONCURRENTLY modifier we could lock the partitioned table in
ShareUpdateExclusiveLock mode and the partition being split in
AccessExclusiveLock mode. That way we would not lock the partitioned
table in AccessExclusiveLock mode, and other partitions could be
modified during the SPLIT operation (only the partition being split
could not). If somebody tried to modify the partition being split, they
would get a "relation does not exist" error at the end of the operation
(because the split partition will be dropped).



--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Jonathan S. Katz

On 9/19/22 11:16 AM, Alvaro Herrera wrote:



diff --git a/doc/src/sgml/logical-replication.sgml 
b/doc/src/sgml/logical-replication.sgml
index 1ae3287..0ab768d 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;

  


+   Specifying a column list when the publication also publishes
+   FOR ALL TABLES IN SCHEMA is not supported.
+  
+
+  
 For partitioned tables, the publication parameter
 publish_via_partition_root determines which column list
 is used. If publish_via_partition_root is

diff --git a/doc/src/sgml/ref/create_publication.sgml 
b/doc/src/sgml/ref/create_publication.sgml
index 0a68c4b..0ced7da 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -103,17 +103,17 @@ CREATE PUBLICATION name
   
  
   

+  Specifying a column list when the publication also publishes
+  FOR ALL TABLES IN SCHEMA is not supported.
+ 

@@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char 
*queryString,
continue;
  
  		/*

+* Disallow specifying column list if any schema is in the
+* publication.
+*
+* XXX We could instead just forbid the case when the 
publication
+* tries to publish the table with a column list and a schema 
for that
+* table. However, if we do that then we need a restriction 
during
+* ALTER TABLE ... SET SCHEMA to prevent such a case which 
doesn't
+* seem to be a good idea.
+*/
+   if (publish_schema)
+   ereport(ERROR,
+   
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+   errmsg("cannot use publication column list for 
relation \"%s.%s\"",
+  
get_namespace_name(RelationGetNamespace(pri->relation)),
+  
RelationGetRelationName(pri->relation)),
+   errdetail("Column list cannot be specified 
if any schema is part of the publication or specified in the list."));
+


This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one would not be a good idea.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.


FYI, I've added this to the PG15 open items as there are some open 
questions to resolve in this thread.


Jonathan





Re: Kerberos delegation support in libpq and postgres_fdw

2022-09-19 Thread Jacob Champion
On 9/19/22 10:05, Stephen Frost wrote:
> This is coming across as if it's a surprise of some kind when it
> certainly isn't..  If the delegated credentials are being used to
> authenticate and establish the connection from that backend to another
> system then, yes, naturally that means that the keys provided are coming
> from the client and the client knows them.

I think it may be surprising to end users that credential delegation
lets them trivially break transport encryption. Like I said before, it
was a surprise to me, because the cryptosystems I'm familiar with
prevent that.

If it wasn't surprising to you, I could really have used a heads up back
when I asked you about it.

> The idea of arranging to
> have an admin's credentials used to authenticate to another system where
> the backend is actually controlled by a non-admin user is, in fact, the
> issue in what is being outlined above as that's clearly a situation
> where the user's connection is being elevated to an admin level.

Yes, controlled elevation is the goal in the scenario I'm describing.

> That's
> also something that we try to avoid having happen because it's not
> really a good idea, which is why we require a password today for the
> connection to be established (postgres_fdw/connection.c:
> 
> Non-superuser cannot connect if the server does not request a password.

A password is being used in this scenario. The password is the secret
being stolen.

The rest of your email describes a scenario different from what I'm
attacking here. Here's my sample HBA line for the backend again:

hostgssenc all all ... password

I'm using password authentication with a Kerberos-encrypted channel.
It's similar to protecting password authentication with TLS and a client
cert:

hostssl all all ... password clientcert=verify-*

> Consider that, in general, the user could also simply directly connect
> to the other system themselves

No, because they don't have the password. They don't have USAGE on the
foreign table, so they can't see the password in the USER MAPPING.

With the new default introduced in this patch, they can now steal the
password by delegating their credentials and cracking the transport
encryption. This bypasses the protections that are documented for the
pg_user_mappings view.

> Consider SSH instead of PG.  What you're pointing out, accurately, is
> that if an admin were to install their keys into a user's .ssh directory
> unencrypted and then the user logged into the system, they'd then be
> able to SSH to another system with the admin's credentials and then
> they'd need the admin's credentials to decrypt the traffic, but that if,
> instead, the user brings their own credentials then they could
> potentially decrypt the connection between the systems.  Is that really
> the issue here?

No, it's not the issue here. This is more like setting up a restricted
shell that provides limited access to a resource on another machine
(analogous to the foreign table). The user SSHs into this restricted
shell, and then invokes an admin-blessed command whose implementation
uses some credentials (which they cannot read, analogous to the USER
MAPPING) over an encrypted channel to access the backend resource. In
this situation an admin would want to ensure that the encrypted tunnel
couldn't be weakened by the client, so that they can't learn how to
bypass the blessed command and connect to the backend directly.

Unlike SSH, we've never supported credential delegation, and now we're
introducing it. So my claim is, it's possible for someone who was
previously in a secure situation to be broken by the new default.

>> So the client can decrypt backend communications that make use of its
>> delegated key material. (This also means that gssencmode is a lot
>> weaker than I expected.)
> 
> The backend wouldn't be able to establish the connection in the first
> place without those delegated credentials.

That's not true in the situation I'm describing; hopefully my comments
above help clarify.

> The server in the middle should *not* be using its own Kerberos
> credentials to establish the connection to the other system- that's
> elevating the credentials used for that connection and is something that
> should be prevented for non-superusers already (see above).

It's not prevented, because a password is being used. In my tests I'm
connecting as an unprivileged user.

You're claiming that the middlebox shouldn't be doing this. If this new
default behavior were the historical behavior, then I would have agreed.
But the cat's already out of the bag on that, right? It's safe today.
And if it's not safe today for some other reason, please share why, and
maybe I can work on a patch to try to prevent people from doing it.

Thanks,
--Jacob




Re: POC: GROUP BY optimization

2022-09-19 Thread David Rowley
On Wed, 13 Jul 2022 at 15:37, David Rowley  wrote:
> I'm just in this general area of the code again today and wondered
> about the header comment for the preprocess_groupclause() function.
>
> It says:
>
>  * In principle it might be interesting to consider other orderings of the
>  * GROUP BY elements, which could match the sort ordering of other
>  * possible plans (eg an indexscan) and thereby reduce cost.  We don't
>  * bother with that, though.  Hashed grouping will frequently win anyway.
>
> I'd say this commit makes that paragraph mostly obsolete.  It's only
> true now in the sense that we don't try orders that suit some index
> that would provide pre-sorted results for a GroupAggregate path.  The
> comment leads me to believe that we don't do anything at all to find a
> better order, and that's not true now.

I've just pushed a fix for this.

David




pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] - 8< 
-
[20:23:04.171] stderr:
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
pg_upgrade --check success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
 Directory not empty
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
 Directory not empty
...


I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

On 2022-09-13 01:39:59 +, Michael Paquier wrote:
> Move any remaining files generated by pg_upgrade into an internal subdir
>
> This change concerns a couple of .txt files (for internal state checks)
> that were still written in the path where the binary is executed, and
> not in the subdirectory located in the target cluster.  Like the other
> .txt files doing already so (like loadable_libraries.txt), these are
> saved in the base output directory.  Note that on failure, the logs
> report the full path to the .txt file generated, so these are easy to
> find.
>
> Oversight in 38bfae3.
>
> Author: Daniel Gustafsson
> Reviewed-by: Michael Paquier, Justin Prysby
> Discussion: https://postgr.es/m/181a6da8-3b7f-4b71-82d5-363ff0146...@yesql.se
> Backpatch-through: 15


Greetings,

Andres Freund




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Nathan Bossart
On Mon, Sep 19, 2022 at 07:26:57PM +0530, Bharath Rupireddy wrote:
> We have a bunch of messages [1] that have an offset, but not LSN in
> the error message. Firstly, is there an easy way to figure out the LSN
> from the offset reported in the error messages? If not, is adding the
> LSN to these messages along with the offset a good idea? Of course, we
> can't just convert the offset to an LSN using XLogSegNoOffsetToRecPtr()
> and report that, but something meaningful like reporting the LSN of the
> page that we are reading in or writing out, etc.

It seems like you want the opposite of pg_walfile_name_offset().  Perhaps
we could add a function like pg_walfile_offset_lsn() that accepts a WAL
file name and byte offset and returns the LSN.
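
Usage might look something like this; to be clear, the function name,
signature, and result below are purely hypothetical at this point, as
only pg_walfile_name_offset() exists today:

    -- existing: LSN -> (WAL file name, byte offset)
    SELECT * FROM pg_walfile_name_offset('0/1715E30');
    --         file_name         | file_offset
    -- --------------------------+-------------
    --  000000010000000000000001 |     7429680

    -- proposed inverse: (WAL file name, byte offset) -> LSN
    SELECT pg_walfile_offset_lsn('000000010000000000000001', 7429680);
    -- would return 0/1715E30 with the default 16MB segment size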

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread James Coleman
Over in the "Add last commit LSN to pg_last_committed_xact()" [1]
thread this patch had been added as a precursor, but Michael Paquier
suggested it be broken out separately, so I'm doing that here.

It turns out that MSVC supports both noreturn [2] [3] and alignment
[4] [5] attributes, so this patch adds support for those. MSVC also
supports a form of packing, but the implementation as far as I can tell
requires wrapping the entire struct (with a push/pop declaration set)
[6], which doesn't seem to match the style of macros we're using for
packing in other compilers, so I opted not to implement that
attribute.

James Coleman

1: https://www.postgresql.org/message-id/Yk6UgCGlZKuxRr4n%40paquier.xyz
2: 2008+ 
https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2008/k6ktzx3s(v=vs.90)
3. 2015+ https://learn.microsoft.com/en-us/cpp/c-language/noreturn?view=msvc-140
4. 2008+ 
https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2008/dabb5z75(v=vs.90)
5. 2015+ https://learn.microsoft.com/en-us/cpp/cpp/align-cpp?view=msvc-170
6. https://learn.microsoft.com/en-us/cpp/preprocessor/pack?view=msvc-170


v1-0001-Support-pg_attribute_aligned-and-noretur-in-MSVC.patch
Description: Binary data


Re: Silencing the remaining clang 15 warnings

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 7:20 AM Tom Lane  wrote:
> * With %pure-parser, Bison makes the "yynerrs" variable local
> instead of static, and then if you don't use it clang notices
> that it's set but never read.  There doesn't seem to be a way
> to persuade Bison not to emit the variable at all, so here I've
> just added "(void) yynerrs;" to the topmost production of each
> affected grammar.  If anyone has a nicer idea, let's hear it.

+1.  FTR they know about this:

https://www.mail-archive.com/bison-patches@gnu.org/msg07836.html
https://github.com/akimd/bison/commit/a166d5450e3f47587b98f6005f9f5627dbe21a5b

... but that YY_ATTRIBUTE_UNUSED hasn't landed in my systems'
/usr[/local]/share/bison/skeletons/yacc.c yet and it seems harmless
and also legitimate to reference yynerrs from an action.

> * xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
> that is only read in the "#ifdef WAL_DEBUG" stanza at the
> bottom.  Here I've done the rather ugly and brute-force thing
> of wrapping all the variable's references in "#ifdef WAL_DEBUG".
> (I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
> did not silence the warning.)  I kind of wonder how useful this
> function's WAL_DEBUG output is --- maybe just dropping that
> altogether would be better?

No opinion on the value of the message, but maybe
pg_attribute_unused() would be better?

> * array_typanalyze.c's compute_array_stats counts the number
> of null arrays in the column, but then does nothing with the
> result.  AFAICS this is redundant with what std_compute_stats
> will do, so I just removed the variable.

+1




Re: Silencing the remaining clang 15 warnings

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Sep 20, 2022 at 7:20 AM Tom Lane  wrote:
>> * xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
>> that is only read in the "#ifdef WAL_DEBUG" stanza at the
>> bottom.  Here I've done the rather ugly and brute-force thing
>> of wrapping all the variable's references in "#ifdef WAL_DEBUG".
>> (I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
>> did not silence the warning.)  I kind of wonder how useful this
>> function's WAL_DEBUG output is --- maybe just dropping that
>> altogether would be better?

> No opinion on the value of the message, but maybe
> pg_attribute_unused() would be better?

I realized that the reason PG_USED_FOR_ASSERTS_ONLY didn't help
is that it expands to empty in an assert-enabled build, which is
what I was testing.  So yeah, using pg_attribute_unused() directly
would probably work better.

(Also, I tried an assert-disabled build and found one additional new
warning; same deal where clang doesn't believe "foo++;" is reason to
consider foo to be used.  That one, PG_USED_FOR_ASSERTS_ONLY can fix.)

regards, tom lane




Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Michel Pelletier
Hello hackers,

As noted in the source:

https://github.com/postgres/postgres/blob/master/src/include/nodes/pg_list.h#L6-L11

 * Once upon a time, parts of Postgres were written in Lisp and used real
 * cons-cell lists for major data structures.  When that code was rewritten
 * in C, we initially had a faithful emulation of cons-cell lists, which
 * unsurprisingly was a performance bottleneck.  A couple of major rewrites
 * later, these data structures are actually simple expansible arrays;
 * but the "List" name and a lot of the notation survives.

The Postgres parser format as described in the wiki page:

https://wiki.postgresql.org/wiki/Query_Parsing

looks almost, but not quite, entirely like JSON:

SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
   (
  {SELECT
  :distinctClause <>
  :intoClause <>
  :targetList (
 {RESTARGET
 :name <>
 :indirection <>
 :val
{COLUMNREF
:fields (
   {A_STAR
   }
)
:location 7
}
 :location 7
 }
  )
  :fromClause (
 {RANGEVAR
 :schemaname <>
 :relname foo
 :inhOpt 2
 :relpersistence p
 :alias <>
 :location 14
 }
  )
  ... and so on
   )

This non-standard format is useful for visual inspection and perhaps
simple parsing.  The parsers that do exist for it are generally tied to
particular languages.  If there were a standard way to parse queries,
tools like code generators and analysis tools could work with any of
the many libraries that already handle JSON quite well.  Future
potential would include exposing this data to ddl_command_start event
triggers.  Providing a JSON Schema would also aid tools that want to
validate or transform the JSON with rule-based systems.
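
As a rough sketch of the event trigger angle: the trigger machinery
below already exists today; the hypothetical part is that such a
function could some day be handed the statement's parse tree as JSON
for inspection:

    CREATE FUNCTION log_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- today only the command tag is readily available here; the
        -- proposal would make the parsed statement itself inspectable
        RAISE NOTICE 'DDL command starting: %', tg_tag;
    END;
    $$;

    CREATE EVENT TRIGGER log_ddl_start
        ON ddl_command_start
        EXECUTE FUNCTION log_ddl();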

I would like to propose a discussion about having Postgres switch from
this custom format to JSON in a future major release.  The format in
question is generated by macros and functions found in
`src/backend/nodes/readfuncs.c` and `src/backend/nodes/outfuncs.c`, and
converting them to emit valid JSON would be relatively
straightforward.

One downside would be that this would not be a forward-compatible
change across releases.  Since it is unlikely that very much code
relies on this custom format, this would not be a huge problem for most.

Thoughts?

-Michel


Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:21:58PM -0400, James Coleman wrote:
> It turns out that MSVC supports both noreturn [2] [3] and alignment
> [4] [5] attributes, so this patch adds support for those. MSVC also
> supports a form of packing, but the implementation as I can tell
> requires wrapping the entire struct (with a push/pop declaration set)
> [6], which doesn't seem to match the style of macros we're using for
> packing in other compilers, so I opted not to implement that
> attribute.

Interesting.  Thanks for the investigation.

+/*
+ * MSVC supports aligned and noreturn
+ * Packing is also possible but only by wrapping the entire struct definition
+ * which doesn't fit into our current macro declarations.
+ */
+#elif defined(_MSC_VER)
+#define pg_attribute_aligned(a) __declspec(align(a))
+#define pg_attribute_noreturn() __declspec(noreturn)
 #else
Nit: I think that the comment should be in the elif block for Visual.

pg_attribute_aligned() could be used in generic-msvc.h's
pg_atomic_uint64, which uses align now.

Shouldn't HAVE_PG_ATTRIBUTE_NORETURN be set for the MSVC case as well?
--
Michael




Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread James Coleman
On Mon, Sep 19, 2022 at 8:21 PM Michael Paquier  wrote:
>
> On Mon, Sep 19, 2022 at 06:21:58PM -0400, James Coleman wrote:
> > It turns out that MSVC supports both noreturn [2] [3] and alignment
> > [4] [5] attributes, so this patch adds support for those. MSVC also
> > supports a form of packing, but the implementation as I can tell
> > requires wrapping the entire struct (with a push/pop declaration set)
> > [6], which doesn't seem to match the style of macros we're using for
> > packing in other compilers, so I opted not to implement that
> > attribute.
>
> Interesting.  Thanks for the investigation.
>
> +/*
> + * MSVC supports aligned and noreturn
> + * Packing is also possible but only by wrapping the entire struct definition
> + * which doesn't fit into our current macro declarations.
> + */
> +#elif defined(_MSC_VER)
> +#define pg_attribute_aligned(a) __declspec(align(a))
> +#define pg_attribute_noreturn() __declspec(noreturn)
>  #else
> Nit: I think that the comment should be in the elif block for Visual.

I was following the style of the comment outside the "if", but I'm not
attached to that style, so changed in this version.

> pg_attribute_aligned() could be used in generic-msvc.h's
> pg_atomic_uint64 as it uses now align.

Added.

> Shouldn't HAVE_PG_ATTRIBUTE_NORETURN be set for the MSVC case as well?

Yes, fixed.

James Coleman


v2-0001-Support-pg_attribute_aligned-and-noreturn-in-MSVC.patch
Description: Binary data


Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 09:27:41AM -0700, Jacob Champion wrote:
> While looking into this I noticed that I left the following code in place:
> 
>> #ifdef HAVE_BE_TLS_GET_CERTIFICATE_HASH
>> if (strcmp(selected_mech, SCRAM_SHA_256_PLUS_NAME) == 0 && 
>> port->ssl_in_use)
> 
> In other words, we're still deciding whether to advertise -PLUS based
> only on whether we support tls-server-end-point.

X509_get_signature_nid() has been introduced in 1.0.2.
SSL_export_keying_material() is older than that, being present since
1.0.1.  Considering the fact that we want to always have
tls-server-end-point as default, it seems to me that we should always
publish SCRAM-SHA-256-PLUS and support channel binding when building
with OpenSSL >= 1.0.2.  The same can be said about the areas where we
have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH.

There could be a point in supporting tls-exporter as default in 1.0.1,
or just allow it if the caller gives it in the connection string, but
as that's the next version we are going to drop support for (sooner
than later would be better IMO), I don't really want to add more
maintenance burden in this area as 1.0.1 is not that used anyway as
far as I recall.

> Maybe all the necessary features landed in OpenSSL in the same
> version, but I haven't double-checked that, and in any case I think
> I need to make this code more correct in the next version of this
> patch.

I was planning to continue working on this patch based on your latest
review.  Anyway, as that's originally your code, I am fine to let you
take the lead here.  Just let me know which way you prefer, and I'll
stick to it :)
--
Michael




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> I don't know if actually related to the commit below, but there've been a
> lot of runs of the pg_upgrade tests in the meson branch, and this is the first
> failure of this kind. Unfortunately the error seems to be transient -
> rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl
for SERVER_LOG_FILE) where the code still has handles open on some of
the files in the log/ subdirectory, preventing its removal from
completing.  If this proves to be rather easy to reproduce, a list of
the files still present in this path would give an easy hint to follow.
Does this reproduce with any frequency?
--
Michael




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread David Rowley
On Tue, 13 Sept 2022 at 20:27, David Rowley  wrote:
> I see that one of the drawbacks from not using MemoryContextContains()
> is that window functions such as lead(), lag(), first_value(),
> last_value() and nth_value() may now do the datumCopy() when it's not
> needed. For example, with a window function call such as
> lead(byref_col ), the expression evaluation code in
> WinGetFuncArgInPartition() will just return the address in the
> tuplestore tuple for "byref_col".  The datumCopy() is needed for that.
> However, if the function call was lead(byref_col || 'something') then
> we'd have ended up with a new allocation in CurrentMemoryContext to
> concatenate the two values.  We'll now do a datumCopy() where we
> previously wouldn't have. I don't really see any way around that
> without doing some highly invasive surgery to the expression
> evaluation code.

It feels like a terrible idea, but I wondered if we could look at the
WindowFunc->args and decide whether we should do the datumCopy() based
on the type of the argument. Vars would need to be copied as they will
point into the tuple's memory, but an OpExpr likely would not need to
be copied.

Aside from that, I don't have any ideas on how to get rid of the
possible additional datumCopy() from non-Var arguments to these window
functions.  Should we just suffer it? It's quite likely that most
arguments to these functions are plain Vars anyway.
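
For concreteness, the two cases look roughly like this in SQL terms
(table and column names are invented for the example):

    -- argument is a plain Var: the value points into the tuplestore
    -- tuple, so the datumCopy() really is required
    SELECT lead(t) OVER (ORDER BY id) FROM tab;

    -- argument is an expression: its result is freshly allocated in
    -- CurrentMemoryContext, so the datumCopy() is the new extra cost
    SELECT lead(t || '-suffix') OVER (ORDER BY id) FROM tab;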

David




Re: pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > I don't know if actually related to the commit below, but there've been a
> > lot of runs of the pg_upgrade tests in the meson branch, and this is the 
> > first
> > failure of this kind. Unfortunately the error seems to be transient -
> > rerunning the tests succeeded.
> 
> This smells to me like a race condition in pg_upgrade (or even pg_ctl
> for SERVER_LOG_FILE) where the code still has handles on some of the
> files in the log/ subdirectory, causing its removal to not be able to
> finish happen.

I don't really see what'd race with what here? pg_upgrade has precise control
over what's happening here, no?


> If this proves to be rather easy to reproduce, giving
> a list of the files still present in this path would give a hint easy
> to follow.  Does this reproduce with a good frequency?

I've only seen it once so far, but there haven't been many CI runs of the
meson branch since rebasing ontop of the last changes to pg_upgrade.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread Tom Lane
David Rowley  writes:
> Aside from that, I don't have any ideas on how to get rid of the
> possible additional datumCopy() from non-Var arguments to these window
> functions.  Should we just suffer it? It's quite likely that most
> arguments to these functions are plain Vars anyway.

No, we shouldn't.  I'm pretty sure that we have various window
functions that are deliberately designed to take advantage of the
no-copy behavior, and that they have taken a significant speed
hit from your having disabled that optimization.  I don't say
that this is enough to justify reverting the chunk header changes
altogether ... but I'm completely not satisfied with the current
situation in HEAD.

regards, tom lane




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:
> I don't really see what'd race with what here? pg_upgrade has precise control
> over what's happening here, no?

A code path could have forgotten an fclose(), for example, but this code
is rather old and close-proof as far as I know.  Most of the log files
are used with redirections for external calls, though I don't see
how these could still be held after pg_upgrade finishes :/
Could the use of meson somehow influence things when running tests on
Windows?

> I've only seen it once so far, but there haven't been many CI runs of the
> meson branch since rebasing ontop of the last changes to pg_upgrade.

Hmm, okay.  Is that a specific branch in one of your public repos?
--
Michael




Re: pg_upgrade test failure

2022-09-19 Thread Justin Pryzby
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> Hi,
> 
> After my last rebase of the meson tree I encountered the following test
> failure:
> 
> https://cirrus-ci.com/task/5532444261613568
> 
> [20:23:04.171] - 8< 
> -
> [20:23:04.171] stderr:
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
> pg_upgrade --check success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 249.
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
> success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 261.
> [20:23:04.171] # Looks like you failed 2 tests of 13.
> 
> regress_log:
> https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> 
> The pg_upgrade output contains these potentially relevant warnings:
> 
> ...
> *Clusters are compatible*
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
>  Directory not empty
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
>  Directory not empty
> ...

It looks like it failed to remove a *.log file under Windows, which
caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK   
"pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE 
"pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE
"pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE   
"pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors, so it can't be the cause,
but the original commit 38bfae3 might be related.

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.

-- 
Justin




Re: pg_create_logical_replication_slot argument incongruency

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 07:02:16PM +0200, Florin Irion wrote:
> This was introduced in commit 19890a06.
> 
> IMHO we should use the documented argument name `two_phase` and change the
> function to accept it.
> 
> What do you think?

19890a0 is included in REL_14_STABLE, and changing an argument name is
not acceptable in a stable branch as it would imply a catversion
bump.  Let's change the docs so that we document the parameter as
"twophase" instead.
--
Michael




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:26:34PM +0530, Bharath Rupireddy wrote:
> On Mon, Sep 19, 2022 at 2:38 PM Fujii Masao  
> wrote:
> Fixed. I believed that the regression tests cover pg_backup_start()
> and pg_backup_stop() and relied on make check-world; surprisingly,
> there's no test that covers these functions. Is it a good idea to add
> tests for these functions in misc_functions.sql or backup.sql or
> somewhere so that they get run as part of make check? Thoughts?

The main regression test suite should not include direct calls to
pg_backup_start() or pg_backup_stop() as these depend on wal_level,
and we spend a certain amount of resources in keeping the tests a
maximum portable across such configurations, wal_level = minimal being
one of them.  One portable example is in 001_stream_rep.pl.

> That's a good idea. I'm marking a flag if the label name overflows (>
> MAXPGPATH), and later using it in do_pg_backup_start() to error out. We
> could've thrown the error directly, but that changes the behaviour -
> right now, "wal_level must be set to \"replica\" or \"logical\" at
> server start." gets emitted first and then the label name overflow
> error - I don't want to do that.

-   if (strlen(backupidstr) > MAXPGPATH)
+   if (state->name_overflowed == true)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("backup label too long (max %d bytes)",
It does not strike me as a huge issue to force a truncation of such
backup label names.  1024 is large enough for basically all users,
in my opinion.  Or you could just truncate in the internal logic, but
still complain at MAXPGPATH - 1 as the last byte would be for the zero
termination.  In short, there is no need to complicate things with
name_overflowed.

>> In v7 patch, since pg_backup_stop() calls build_backup_content(),
>> backup_label and history_file seem not to be carried from do_pg_backup_stop()
>> to pg_backup_stop(). This makes me still think that it's better not to 
>> include
>> them in BackupState...
> 
> I'm a bit worried about the backup state being spread across if we
> separate out backup_label and history_file from BackupState and keep
> tablespace_map contents there. As I said upthread, we are not
> allocating memory for them at the beginning, we allocate only when
> needed. IMO, this code is readable and more extensible.

+   StringInfo  backup_label;   /* backup_label file contents */
+   StringInfo  tablespace_map; /* tablespace_map file contents */
+   StringInfo  history_file;   /* history file contents */
IMV, repeating a point I already made once upthread, BackupState
should hold none of these.  Let's just generate the contents of these
files in the contexts where they are needed, making BackupState
something to rely on to build them in the code paths where they are
necessary.  This is going to make the reasoning around the memory
contexts where each one of them is stored much easier and reduce the
chances of bugs in the long term.

> I've also taken help of the error callback mechanism to clean up the
> allocated memory in case of a failure. For do_pg_abort_backup() cases,
> I think we don't need to clean the memory because that gets called on
> proc exit (before_shmem_exit()).

Memory could still bloat while the process running the SQL functions
is running depending on the error code path, anyway.
--
Michael


signature.asc
Description: PGP signature


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread kuroda.hay...@fujitsu.com
Dear Önder,

Thanks for updating the patch! I will check it later; for now I am just
replying to your comments.

> Hmm, I couldn't realize this comment earlier. So you suggest "slow" here 
> refers to the additional function call "GetRelationIdentityOrPK"? If so, yes 
> I'll update that.

Yes I meant to say that, because functions will be called like:

GetRelationIdentityOrPK() -> RelationGetPrimaryKeyIndex() -> 
RelationGetIndexList() -> ..

and according to the comments, the last one seems to do the heavy lifting.


> Makes sense, simplified the function. Though, it is always hard to pick good 
> names for these kinds of helper functions. I picked 
> GenerateDummySelectPlannerInfoForRelation(), does that sound good to you as 
> well?

I could not find any better naming than yours. 

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Amit Kapila
On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera  wrote:
>
> > diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
> > index 1ae3287..0ab768d 100644
> > --- a/doc/src/sgml/logical-replication.sgml
> > +++ b/doc/src/sgml/logical-replication.sgml
> > @@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;
> >
> >
> >
> > +   Specifying a column list when the publication also publishes
> > +   FOR ALL TABLES IN SCHEMA is not supported.
> > +  
> > +
> > +  
> > For partitioned tables, the publication parameter
> > publish_via_partition_root determines which column list
> > is used. If publish_via_partition_root is
> >
> > diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml
> > index 0a68c4b..0ced7da 100644
> > --- a/doc/src/sgml/ref/create_publication.sgml
> > +++ b/doc/src/sgml/ref/create_publication.sgml
> > @@ -103,17 +103,17 @@ CREATE PUBLICATION name
> >   
> >   
> >   
> > +  Specifying a column list when the publication also publishes
> > +  FOR ALL TABLES IN SCHEMA is not supported.
> > + 
> >
> > @@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char *queryString,
> >   continue;
> >
> >   /*
> > +  * Disallow specifying column list if any schema is in the
> > +  * publication.
> > +  *
> > +  * XXX We could instead just forbid the case when the publication
> > +  * tries to publish the table with a column list and a schema for that
> > +  * table. However, if we do that then we need a restriction during
> > +  * ALTER TABLE ... SET SCHEMA to prevent such a case which doesn't
> > +  * seem to be a good idea.
> > +  */
> > + if (publish_schema)
> > + ereport(ERROR,
> > + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("cannot use publication column list for relation \"%s.%s\"",
> > + get_namespace_name(RelationGetNamespace(pri->relation)),
> > + RelationGetRelationName(pri->relation)),
> > + errdetail("Column list cannot be specified if any schema is part of the publication or specified in the list."));
> > +
>
> This seems a pretty arbitrary restriction.  It feels like you're adding
> this restriction precisely so that you don't have to write the code to
> reject the ALTER .. SET SCHEMA if an incompatible configuration is
> detected.  But we already have such checks in several cases, so I don't
> see why this one does not seem a good idea.
>

I agree that we have such checks at other places as well and one
somewhat similar is in ATPrepChangePersistence().

ATPrepChangePersistence()
{
...
...
/*
 * Check that the table is not part of any publication when changing to
 * UNLOGGED, as UNLOGGED tables can't be published.
 */

However, another angle to look at it is that we try to avoid adding
restrictions in other DDL commands for defined publications. I am not
sure, but it appears to me that Peter E. is not in favor of restrictions
in other DDLs. I think we don't have a strict rule in this regard, so we
are trying to see what makes the most sense based on feedback and act
accordingly.
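
For reference, a hedged sketch of the combination the proposed check
would reject, using made-up names and the PG15 syntax:

CREATE SCHEMA sales;
CREATE TABLE sales.orders (id int PRIMARY KEY, note text);
CREATE PUBLICATION pub_sales
    FOR TABLE sales.orders (id),
        ALL TABLES IN SCHEMA sales;
-- With the patch:
-- ERROR:  cannot use publication column list for relation "sales.orders"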

> The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
> aspects.  Others have already commented about the syntax, which is
> unlike what GRANT uses; I'm also surprised that we've gotten away with
> it being superuser-only.  Why are we building more superuser-only
> features in this day and age?  I think not even FOR ALL TABLES should
> require superuser.
>

The intention was to be in sync with FOR ALL TABLES.

-- 
With Regards,
Amit Kapila.




Re: [RFC] building postgres with meson - v13

2022-09-19 Thread Andres Freund
Hi,

On 2022-09-19 05:25:59 -0400, Peter Eisentraut wrote:
> IMO, the following commits are ready to be pushed now:

Slowly working through them.


To have some initial "translation" for other developers I've started a wiki
page with a translation table. Still very WIP:
https://wiki.postgresql.org/wiki/Meson

For now, a bit of polishing aside, I'm just planning to add a minimal
explanation of what's happening, and a reference to this thread.


> b7d7fe009731 Remove DLLTOOL, DLLWRAP from configure / Makefile.global.in
> 979f26889544 Don't hardcode tmp_check/ as test directory for tap tests
> 9fc657fbb7e2 Split TESTDIR into TESTLOGDIR and TESTDATADIR
> 6de8f1de0ffa meson: prereq: Extend gendef.pl in preparation for meson
> 5a9731dcc2e6 meson: prereq: port: Include c.h instead of postgres.h in 
> *p{read,write}*.c

Done


> 7054861f0fef meson: prereq: Add src/tools/gen_export.pl

This one I'm planning to merge with the "main" commit, given there's no other 
user.

Greetings,

Andres Freund




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Peter Smith
I had a quick look at how the latest v11-0001 patch differs from v10.

Here are some initial comments:

==

1. Commit message

It looks like some small mistake happened. You wrote [1] that my
previous review comments about the commit message were fixed, but it
seems the v11 commit message is unchanged since v10.

==

2. src/backend/replication/logical/relation.c -
GenerateDummySelectPlannerInfoForRelation

+/*
+ * This is not a generic function, helper function for
+ * GetCheapestReplicaIdentityFullPath. The function creates
+ * a dummy PlannerInfo for the given relationId as if the
+ * relation is queried with SELECT command.
+ */
+static PlannerInfo *
+GenerateDummySelectPlannerInfoForRelation(Oid relationId)

"generic function, helper function" -> "generic function. It is a
helper function"


--
[1] 
https://www.postgresql.org/message-id/CACawEhXnTcXBOTofptkgSBOyD81Pohd7MSfFaW0SKo-0oKrCJg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Michel Pelletier  writes:
> I would like to propose a discussion that in a future major release
> Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose.  JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point.  It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent.  That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.
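
For instance, something along these lines, where pg_node_tree_to_json()
is purely a hypothetical name for such a conversion function, not an
existing API:

-- Inspect a view's stored query tree as JSON (sketch only):
SELECT pg_node_tree_to_json(ev_action)
FROM pg_rewrite
WHERE ev_class = 'myview'::regclass AND rulename = '_RETURN';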

regards, tom lane




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Jonathan S. Katz

On 9/19/22 4:52 PM, Jonathan S. Katz wrote:

On 9/19/22 11:16 AM, Alvaro Herrera wrote:



This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one does not seem a good idea.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.


FYI, I've added this to the PG15 open items as there are some open 
questions to resolve in this thread.


(Replying personally, not RMT).

I wanted to enumerate the concerns raised in this thread in the context 
of the open item to understand what needs to be addressed, and also give 
an opinion. I did read up on the original thread to better understand 
context around decisions.


I believe the concerns are these 3 things:

1. Allowing calls that have "ALL TABLES IN SCHEMA" that include calls to 
specific tables in schema
2. The syntax of the "ALL TABLES IN SCHEMA" and comparing it to similar 
behaviors in PostgreSQL

3. Adding on an additional "superuser-only" feature

For #1 (allowing calls that have schema/table overlap...), there appears 
to be both a patch that allows this (reversing[8]), and a suggestion for 
dealing with a corner-case that is reasonable, i.e. disallowing adding 
schemas to a publication when specifying column-lists. Do we think we 
can have consensus on this prior to the RC1 freeze?


For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on 
the original thread[1][3][4][5][7]. I thought Tom's proposal on the 
syntax[3] was reasonable as it "future proofs" for when we allow other 
schema-scoped objects to be published and give control over which ones 
can be published.


The bigger issue seems to be around the behavior in regards to the 
syntax. The current behavior is that when one specifies "ALL TABLES IN 
SCHEMA", any future tables created in that schema are added to the 
publication. While folks tried to find parallels to GRANT[6], I think 
this actually resembles how we handle partitions that are 
published[9][10], i.e.:


"When a partitioned table is added to a publication, all of its existing 
and future partitions are implicitly considered to be part of the 
publication."[10]


Additionally, this is the behavior that is already present in "FOR ALL 
TABLES":


"Marks the publication as one that replicates changes for all tables in 
the database, including tables created in the future."[10]


I don't think we should change this behavior that's already in logical 
replication. While I understand the reasons why "GRANT ... ALL TABLES IN 
SCHEMA" has a different behavior (i.e. it's not applied to future 
objects) and do not advocate to change it, I have personally been 
affected where I thought a permission would be applied to all future 
objects, only to discover otherwise. I believe it's more intuitive to 
think that "ALL" applies to "everything, always."


For #3 (more superuser-only), in general I do agree that we shouldn't be 
adding more of these. However, we have in this release, and not just to 
this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I 
think it's easier for us to "relax" privileges (e.g. w/predefined roles) 
than to make something "superuser-only" in the future, so I don't see 
this being a blocker for v15. The feature will continue to work for 
users even if we remove "superuser-only" in the future.


To summarize:

1. I do think we should fix the issue that Peter originally brought up 
in this thread before v15. That is an open item.
2. I don't see why we need to change the syntax/behavior, I think that 
will make this feature much harder to use.
3. I don't think we need to change the superuser requirement right now, 
but we should do that for a future release.


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/CAFiTN-u_m0cq7Rm5Bcu9EW4gSHG94WaLuxLfibwE-o7%2BLea2GQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/C4D04B90-AC4D-42A7-B93C-4799CE96%40enterprisedb.com

[3] https://www.postgresql.org/message-id/155565.1628954580%40sss.pgh.pa.us
[4] 
https://www.postgresql.org/message-id/CAHut%2BPvNwzp-EdtsDNazwrNrV4ziqCovNdLywzOJKSy52LvRjw%40mail.gmail.com
[5] 
https://www.postgresql.org/message-id/CAHut%2BPt6Czj0KsE0ip6nMsPf4FatHgNDni-wSu2KkYNYF9mDAw%40mail.gmail.com
[6] 
https://www.postgresql.org/message-id/CAA4eK1Lwtea0St1MV5nfSg9FrFeU04YKpHvhQ0i4W-tOBw%3D9Qw%40mail.gmail.com
[7] 
https://www.postgresql.org/message-id/202109241325.eag5g6mpvoup@alvherre.pgsql
[8] 
https://www.postgresql.org/message-id/CALDaNm1BEXtvg%3Dfq8FzM-FoYvETTEu

Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote:
> Yes, fixed.

The CF bot is failing compilation on Windows:
http://commitfest.cputube.org/james-coleman.html
https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log

There is something going on with noreturn() after applying it to
elog.h:
[01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085:
'ThrowErrorData': not in formal parameter list (compiling source file
src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
[01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error
C2085: 'pgwin32_message_to_UTF16': not in formal parameter list
(compiling source file src/common/encnames.c)
[c:\cirrus\libpgcommon.vcxproj] 
[01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error
C2085: 'pg_re_throw': not in formal parameter list (compiling source
file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] 

align() seems to look fine, at least.  I'd be tempted to apply the
align part first, as that's independently useful for itemptr.h.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 7:29 PM Tom Lane  wrote:
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.

That seems like a perfectly good solution, as long as it can be done
in a way that doesn't leave consumers of the JSON output at any kind
of disadvantage.

I find the current node tree format ludicrously verbose, and generally
hard to work with. But it's not the format itself, really -- that's
not the problem. The underlying data structures are typically very
information dense. So having an output format that's a known quantity
sounds very valuable to me.

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement.  It will make the output format even more verbose,
but that might not matter in the same way as it does right now.
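
For example, assuming some hypothetical JSON rendering of the node tree
were available (node_tree_as_jsonb() and the field name below are
invented for illustration, not an existing API), a query could look
like:

-- Find rewrite rules whose stored queries use aggregates:
SELECT ev_class::regclass
FROM pg_rewrite
WHERE node_tree_as_jsonb(ev_action) @> '[{"hasAggs": true}]';

Plain @> only matches from the top of the structure; digging at
arbitrary depth would need something like jsonb_path_exists() with a
recursive path, but the idea is the same.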

-- 
Peter Geoghegan




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Nathan Bossart
On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote:
> It seems like you want the opposite of pg_walfile_name_offset().  Perhaps
> we could add a function like pg_walfile_offset_lsn() that accepts a WAL
> file name and byte offset and returns the LSN.

Like so...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f8ed45d9fa59690d8b3375d658c1baedb8510195 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 19 Sep 2022 20:24:10 -0700
Subject: [PATCH v1 1/1] introduce pg_walfile_offset_lsn()

---
 doc/src/sgml/func.sgml   | 14 +++
 src/backend/access/transam/xlogfuncs.c   | 39 
 src/include/catalog/pg_proc.dat  |  5 +++
 src/test/regress/expected/misc_functions.out | 19 ++
 src/test/regress/sql/misc_functions.sql  |  9 +
 5 files changed, 86 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 67eb380632..318ac22769 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25891,6 +25891,20 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560

   
 
+  
+   
+
+ pg_walfile_offset_lsn
+
+pg_walfile_offset_lsn ( file_name text, file_offset integer )
+pg_lsn
+   
+   
+Converts a WAL file name and byte offset within that file to a
+write-ahead log location.
+   
+  
+
   

 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 27aeb6e281..75f58cb118 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -313,6 +313,45 @@ pg_last_wal_replay_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Compute an LSN given a WAL file name and decimal byte offset.
+ */
+Datum
+pg_walfile_offset_lsn(PG_FUNCTION_ARGS)
+{
+	char	   *filename = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	int			offset = PG_GETARG_INT32(1);
+	TimeLineID	tli;
+	XLogSegNo	segno;
+	XLogRecPtr	result;
+	uint32		log;
+	uint32		seg;
+
+	if (!IsXLogFileName(filename))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", filename)));
+
+	sscanf(filename, "%08X%08X%08X", &tli, &log, &seg);
+	if (seg >= XLogSegmentsPerXLogId(wal_segment_size) ||
+		(log == 0 && seg == 0) ||
+		tli == 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", filename)));
+
+	if (offset < 0 || offset >= wal_segment_size)
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"offset\" must not be negative or greater than or "
+		"equal to WAL segment size")));
+
+	XLogFromFileName(filename, &tli, &segno, wal_segment_size);
+	XLogSegNoOffsetToRecPtr(segno, offset, wal_segment_size, result);
+
+	PG_RETURN_LSN(result);
+}
+
 /*
  * Compute an xlog file name and decimal byte offset given a WAL location,
  * such as is returned by pg_backup_stop() or pg_switch_wal().
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..224fe590ac 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6328,6 +6328,11 @@
   proargtypes => 'pg_lsn', proallargtypes => '{pg_lsn,text,int4}',
   proargmodes => '{i,o,o}', proargnames => '{lsn,file_name,file_offset}',
   prosrc => 'pg_walfile_name_offset' },
+{ oid => '8205',
+  descr => 'wal location, given a wal filename and byte offset',
+  proname => 'pg_walfile_offset_lsn', prorettype => 'pg_lsn',
+  proargtypes => 'text int4', proargnames => '{file_name,file_offset}',
+  prosrc => 'pg_walfile_offset_lsn' },
 { oid => '2851', descr => 'wal filename, given a wal location',
   proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn',
   prosrc => 'pg_walfile_name' },
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..b2d6f23ac1 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,22 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+-- pg_walfile_offset_lsn
+SELECT pg_walfile_offset_lsn('invalid', 15);
+ERROR:  invalid WAL file name "invalid"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('00010001', -1);
+ERROR:  "offset" must not be negative or greater than or equal to WAL segment size
+SELECT pg_walfile_offset_lsn('00010001', 20);
+ERROR:  "of

Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 12:16 PM Michel Pelletier
 wrote:
> This non-standard format

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).  At some point it was probably exchanging data between the
Lisp and C parts of POSTGRES, and you could maybe sorta claim it's
based on an ANSI standard (Lisp!), but not with a straight face :-)




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Peter Geoghegan  writes:
> Writing declarative @> containment queries against (say) a JSON
> variant of node tree format seems like it could be a huge quality of
> life improvement.

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else.  For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.
Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

regards, tom lane




RE: Perform streaming logical transactions by background workers and parallel apply

2022-09-19 Thread shiy.f...@fujitsu.com
On Mon, Sept 19, 2022 11:26 AM Wang, Wei/王 威  wrote:
> 
> 
> Improved as suggested.
> 

Thanks for updating the patch. Here are some comments on 0001 patch.

1.
+   case TRANS_LEADER_SERIALIZE:
 
-   oldctx = MemoryContextSwitchTo(ApplyContext);
+   /*
+* Notify handle methods we're processing a remote 
in-progress
+* transaction.
+*/
+   in_streamed_transaction = true;
 
-   MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
-   FileSetInit(MyLogicalRepWorker->stream_fileset);
+   /*
+* Since no parallel apply worker is used for the first 
stream
+* start, serialize all the changes of the transaction.
+*
+* Start a transaction on stream start, this 
transaction will be


It seems that the following comment can be removed after using switch case.
+* Since no parallel apply worker is used for the first 
stream
+* start, serialize all the changes of the transaction.

2.
+   switch (apply_action)
+   {
+   case TRANS_LEADER_SERIALIZE:
+   if (!in_streamed_transaction)
+   ereport(ERROR,
+   
(errcode(ERRCODE_PROTOCOL_VIOLATION),
+errmsg_internal("STREAM STOP 
message without STREAM START")));

In apply_handle_stream_stop(), I think we can move this check to the
beginning of this function, to be consistent with other functions.

3. I think some of the changes in the 0005 patch can be merged into the 0001
patch; the 0005 patch can then contain only the changes for the new column
'apply_leader_pid'.

4.
+ * ParallelApplyWorkersList. After successfully, launching a new worker it's
+ * information is added to the ParallelApplyWorkersList. Once the worker

Should `it's` be `its` ?

Regards
Shi yu


Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Amit Kapila
On Tue, Sep 20, 2022 at 7:59 AM Tom Lane  wrote:
>
> Michel Pelletier  writes:
> > I would like to propose a discussion that in a future major release
> > Postgres switch from this custom format to JSON.
>
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.  JSON doesn't fit our needs all that
> closely, so some things like bitmapsets would become a lot longer;
> and even where the semantics are pretty-much-the-same, JSON's
> insistence on details like quoting field names will add bytes.
> Perhaps making the physical storage be JSONB not JSON would help that
> pain point.  It's still far from ideal though.
>
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.
>

I think the DDL deparsing stuff that is being discussed as a base for
DDL logical replication provides something like what you are saying
[1][2].

[1] - 
https://www.postgresql.org/message-id/CAFPTHDaqqGxqncAP42Z%3Dw9GVXDR92HN-57O%3D2Zy6tmayV2_eZw%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-19 Thread Nathan Bossart
On Thu, Sep 08, 2022 at 09:41:20AM -0400, Robert Haas wrote:
> Now on the other hand, I also do think we need more privilege bits.
> You're not alone in making the case that this is a problem which needs
> to be solved, and the set of other people who are also making that
> argument includes me. At the same time, there is certainly a double
> standard here. When Andrew and Tom committed
> d11e84ea466b4e3855d7bd5142fb68f51c273567 and
> a0ffa885e478f5eeacc4e250e35ce25a4740c487 respectively, we used up 2 of
> the remaining 4 bits, bits which other people would have liked to have
> used up years ago and they were told "no you can't." I don't believe I
> would have been willing to commit those patches without doing
> something to solve this problem, because I would have been worried
> about getting yelled at by Tom. But now here we are with only 2 bits
> left instead of 4, and we're telling the next patch author - who is
> not Tom - that he's on the hook to solve the problem.
> 
> Well, we do need to solve the problem. But we're not necessarily being
> fair about how the work involved gets distributed. It's a heck of a
> lot easier for a committer to get something committed to address this
> issue than a non-committer, and it's a heck of a lot easier for a
> committer to ignore the fact that the problem hasn't been solved and
> press ahead anyway, and yet somehow we're trying to dump a problem
> that's a decade in the making on Nathan. I'm not exactly sure what to
> propose as an alternative, but that doesn't seem quite fair.

Are there any concerns with simply expanding AclMode to 64 bits, as done in
v5 [0]?

[0] https://postgr.es/m/20220908055035.GA2100193%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-19 Thread Michael Paquier
On Fri, Sep 09, 2022 at 03:05:18PM -0700, Jacob Champion wrote:
> On Thu, Sep 8, 2022 at 5:46 PM Tom Lane  wrote:
>> Jacob Champion  writes:
>> > I think you're going to have to address backwards compatibility
>> > concerns. Today, I can create a role named "/a", and I can put that
>> > into the HBA without quoting it. I'd be unamused if, after an upgrade,
>> > my rule suddenly matched any role name containing an 'a'.
>>
>> Meh ... that concern seems overblown to me.  I guess it's possible
>> that somebody has an HBA entry that looks like that, but it doesn't
>> seem very plausible.  Note that we made this exact same change in
>> pg_ident.conf years ago, and AFAIR we got zero complaints.
> 
> What percentage of users actually use pg_ident maps? My assumption
> would be that a change to pg_hba would affect many more people, but
> then I don't have any proof that there are users with role names that
> look like that to begin with. I won't pound the table with it.

This concern does not sound overblown to me.  A change in pg_hba.conf
impacts everybody.  I was just looking at this patch, and the logic
with usernames and databases is changed so that we would *always* treat
*any* entry beginning with a slash as a regular expression, skipping
them if they don't match, with an error fed to the logs and
pg_hba_file_rules.  This could lead to silent security issues, as
stricter HBA policies need to be located first in pg_hba.conf and
these could suddenly fail to load.  It would be much safer to me if we
had some restrictions in place to prevent such problems from happening,
meaning some extra checks in the DDL code paths for such object names
and perhaps even something in pg_upgrade with a scan of pg_database
and pg_authid.
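
To make the hazard concrete with a contrived example: a pre-existing
pg_hba.conf entry written for a role literally named "/a", e.g.

# TYPE  DATABASE  USER  ADDRESS      METHOD
host    all       /a    10.0.0.0/8   trust

would, with the patch, be read as the regular expression /a/ and
suddenly match any role whose name contains an "a".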

On the bright side, I have been looking at some of the RFCs covering
the set of characters allowed in DNS names and slashes are not
authorized in hostnames, making this change rather safe AFAIK.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> default reader/printer behaviour in small ways, including being case
> sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> structs for reasons that are lost AFAIK (there's a dark age between
> the commit history of the old Berkeley repo and our current repo, and
> it looks like plan nodes were still printed as #(NAME ...) at
> Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

regards, tom lane




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 8:39 PM Tom Lane  wrote:
> There are certainly use-cases for something like that, but let's
> be clear about it: that's a niche case of interest to developers
> and pretty much nobody else.  For ordinary users, what matters about
> the node tree storage format is compactness and speed of loading.

Of course. But is there any reason to think that there has to be even
a tiny cost imposed on users?

> Our existing format is certainly not great on those metrics, but
> I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections. This
is certainly not a new consideration.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

-- 
Peter Geoghegan



