Re: POC: GROUP BY optimization

2022-07-14 Thread David Rowley
On Thu, 31 Mar 2022 at 12:19, Tomas Vondra
 wrote:
> Pushed, after going through the patch once more, running check-world
> under valgrind, and updating the commit message.

I'm still working in this area and I noticed that db0d67db2 updated
some regression tests in partition_aggregate.out without any care as
to what the test was testing.

The comment above the test reads:

-- Without ORDER BY clause, to test Gather at top-most path

and you've changed the expected plan from being a parallel plan with a
Gather to being a serial plan. So it looks like the test might have
become useless.

I see that the original plan appears to come back with some
adjustments to parallel_setup_cost and parallel_tuple_cost. It seems a
bit strange to me that the changes with this patch would cause a
change of plan for this. There is only 1 GROUP BY column in the query
in question. There's no rearrangement to do with a single column GROUP
BY.
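
For reference, the kind of cost adjustment that brings the parallel plan back
looks roughly like this (hypothetical table name, not the actual
regression-test query):

    SET parallel_setup_cost = 0;
    SET parallel_tuple_cost = 0;
    EXPLAIN (COSTS OFF)
    SELECT a, count(*) FROM pagg_test GROUP BY a;  -- check whether a Gather reappears on top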

David




RE: Support load balancing in libpq

2022-07-14 Thread kuroda.hay...@fujitsu.com
Dear Jelte,

I like your idea. But do we have to sort randomly even if target_session_attrs
is set to 'primary' or 'read-write'?

I think this parameter can be used when all listed servers have the same data,
and we can assume that one of the members is a primary and the others are secondaries.

In this case the user may add the primary host at the top of the list,
so random sorting may increase the time needed to establish a connection.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: EINTR in ftruncate()

2022-07-14 Thread Thomas Munro
On Fri, Jul 15, 2022 at 9:34 AM Tom Lane  wrote:
> (Someday we oughta go ahead and make our Windows signal API look more
> like POSIX, as I suggested back in 2015.  I'm still not taking
> point on that, though.)

For the sigprocmask() part, here's a patch that passes CI.  Only the
SIG_SETMASK case is actually exercised by our current code, though.

One weird thing about our PG_SETMASK() macro is that you couldn't have
used its return value portably: on Windows we were returning the old
mask (like sigsetmask(), which has no way to report errors), and on
Unix we were returning 0/-1 (from sigprocmask(), i.e. the error we never
checked).
From b21140e9b3f593b46c3bb4782c6bf84ca248dc15 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 15 Jul 2022 14:35:01 +1200
Subject: [PATCH] Emulate sigprocmask(), not sigsetmask(), on Windows.

Since commit a65e0864, we've required Unix systems to have
sigprocmask().  For Windows we still emulated the old deprecated 4.2BSD
function sigsetmask().  Emulate sigprocmask() instead, for consistency.

Discussion: https://postgr.es/m/3153247.1657834482%40sss.pgh.pa.us
---
 src/backend/port/win32/signal.c | 29 +++--
 src/include/libpq/pqsignal.h| 11 +++
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index b71164d8db..53b93a50b2 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -112,7 +112,7 @@ pgwin32_signal_initialize(void)
 /*
  * Dispatch all signals currently queued and not blocked
  * Blocked signals are ignored, and will be fired at the time of
- * the pqsigsetmask() call.
+ * the pqsigprocmask() call.
  */
 void
 pgwin32_dispatch_queued_signals(void)
@@ -154,12 +154,29 @@ pgwin32_dispatch_queued_signals(void)
 
 /* signal masking. Only called on main thread, no sync required */
 int
-pqsigsetmask(int mask)
+pqsigprocmask(int how, const sigset_t *set, sigset_t *oset)
 {
-	int			prevmask;
+	if (oset)
+		*oset = pg_signal_mask;
 
-	prevmask = pg_signal_mask;
-	pg_signal_mask = mask;
+	if (!set)
+		return 0;
+
+	switch (how)
+	{
+		case SIG_BLOCK:
+			pg_signal_mask |= *set;
+			break;
+		case SIG_UNBLOCK:
+			pg_signal_mask &= ~*set;
+			break;
+		case SIG_SETMASK:
+			pg_signal_mask = *set;
+			break;
+		default:
+			errno = EINVAL;
+			return -1;
+	}
 
 	/*
 	 * Dispatch any signals queued up right away, in case we have unblocked
@@ -167,7 +184,7 @@ pqsigsetmask(int mask)
 	 */
 	pgwin32_dispatch_queued_signals();
 
-	return prevmask;
+	return 0;
 }
 
 
diff --git a/src/include/libpq/pqsignal.h b/src/include/libpq/pqsignal.h
index 41227a30e2..d17ddb787e 100644
--- a/src/include/libpq/pqsignal.h
+++ b/src/include/libpq/pqsignal.h
@@ -15,15 +15,18 @@
 
 #include 
 
-#ifndef WIN32
 #define PG_SETMASK(mask)	sigprocmask(SIG_SETMASK, mask, NULL)
-#else
+
+#ifdef WIN32
 /* Emulate POSIX sigset_t APIs on Windows */
 typedef int sigset_t;
 
-extern int	pqsigsetmask(int mask);
+extern int	pqsigprocmask(int how, const sigset_t *set, sigset_t *oset);
 
-#define PG_SETMASK(mask)		pqsigsetmask(*(mask))
+#define SIG_BLOCK			1
+#define SIG_UNBLOCK			2
+#define SIG_SETMASK			3
+#define sigprocmask(how, set, oset) pqsigprocmask((how), (set), (oset))
 #define sigemptyset(set)		(*(set) = 0)
 #define sigfillset(set)			(*(set) = ~0)
 #define sigaddset(set, signum)	(*(set) |= (sigmask(signum)))
-- 
2.36.1



Re: Allowing REINDEX to have an optional name

2022-07-14 Thread Michael Paquier
On Mon, Jul 04, 2022 at 03:59:55PM +0900, Michael Paquier wrote:
> Please note that patch authors should not switch a patch as RfC by
> themselves.  This is something that a reviewer should do.

This patch has been marked as waiting for a review, however the CF bot
is completely red:
http://commitfest.cputube.org/simon-riggs.html

Could you take care of those issues first?
--
Michael




Re: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-14 Thread Michael Paquier
On Fri, Jul 15, 2022 at 03:21:30AM +, kuroda.hay...@fujitsu.com wrote:
> Sounds good. I grepped the ATExecXXX() functions called in ATExecCmd(),
> and I confirmed that all returned values have been collected except these.
> 
> While checking the test code for EVENT TRIGGER,
> I found there were no tests related to partitions there.
> How about adding some?

Agreed.  It would be good to track down what changes once those
ObjectAddresses are collected.
--
Michael




RE: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-14 Thread kuroda.hay...@fujitsu.com
Hi,

> > I noticed that we didn't collect the ObjectAddress returned by
> > ATExec[Attach|Detach]Partition. I think collecting this information can 
> > make it
> > easier for users to get the partition OID of the attached or detached table 
> > in
> > the event trigger. So how about collecting it like the attached patch ?
> 
> Added to next CF.

Sounds good. I grepped the ATExecXXX() functions called in ATExecCmd(),
and I confirmed that all returned values have been collected except these.

While checking the test code for EVENT TRIGGER,
I found there were no tests related to partitions there.
How about adding some?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-14 Thread houzj.f...@fujitsu.com
On Wednesday, July 13, 2022 5:58 PM Hou, Zhijie wrote:
> Hi hackers,
> 
> I noticed that we didn't collect the ObjectAddress returned by
> ATExec[Attach|Detach]Partition. I think collecting this information can make 
> it
> easier for users to get the partition OID of the attached or detached table in
> the event trigger. So how about collecting it like the attached patch ?

Added to next CF.

Best regards,
Hou zhijie





Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread David Rowley
On Fri, 15 Jul 2022 at 10:31, Bruce Momjian  wrote:
> for non-Assert builds, ItemPointerGetOffsetNumberNoCheck() and
> ItemPointerGetOffsetNumber() are the same, so I don't see the point to
> making this change.  Frankly, I don't know why we even have two
> functions for this.  I am guessing ItemPointerGetOffsetNumberNoCheck is
> for cases where you have an Assert build and do not want the check.

We'll want to use ItemPointerGetOffsetNumberNoCheck() where the TID
comes from sources we can't verify, e.g. user input ... '(2,0)'::tid.
We want to use ItemPointerGetOffsetNumber() for item pointers that
come from locations that we want to ensure are correct, e.g. TIDs
we're storing in an index.
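
As a concrete example of the "user input" case (hypothetical table name, just
to show the shape of it):

    -- a TID typed by a user; offset 0 never identifies a real tuple
    SELECT * FROM some_table WHERE ctid = '(2,0)'::tid;

Nothing has verified that '(2,0)' points at anything sensible, which is why
the NoCheck variant is the right choice for values arriving from that
direction.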

David




Re: doc: Clarify Routines and Extension Membership

2022-07-14 Thread David G. Johnston
On Thu, Jul 14, 2022 at 2:41 PM Bruce Momjian  wrote:

> On Fri, Jul  8, 2022 at 10:55:55PM -0400, Bruce Momjian wrote:
> > > The/that inconsistency ... choose one.  Or actually, the "an ... the"
> > > combination you used elsewhere doesn't grate on the ear either.
> > >
> > > +  For each extension, refuse to drop anything if any objects
> (other than the
> > > +  extensions listed) depend on it.  However, its own member
> objects, and routines
> > > +  that are explicitly dependent on this extension, are skipped.
> > > +  This is the default.
> > >
> > > "skipped" seems like a horrible choice of word; it could easily be
> read as
> > > "they don't get dropped".  I am not convinced that mentioning the
> member
> > > objects here is an improvement either.  In the first sentence you are
> > > treating each extension as a monolithic object; why not in the second?
> >
> > I created a modified patch based on this feedback;  patch attached.  I
> > rewrote the last change.
>
> Patch applied to PG 13 and later, where extension dependency was added.
> Thank you for the patch.
>

Thank you and apologies for being quiet here and on a few of the
other threads. I've been on vacation and flagged as ToDo some of the
non-simple feedback items that have come this way.

The change to restrict and description in drop extension needs to be fixed
up (the other pages look good).

"This option prevents the specified extensions from being dropped if there
exists non-extension-member objects that depends on any the extensions.
This is the default."

At minimum: "...that depend on any of the extensions."

I did just now confirm that if any of the named extensions failed to be
dropped the entire command fails.  There is no partial success mode.

I'd like to avoid non-extension-member, and one of the main points is that
the routine dependency is member-like, not actual membership.  Hence the
separate wording.

I thus propose to replace the drop extension / restrict paragraph and
replace it with the following:

"This option prevents the specified extensions from being dropped if other
objects - besides these extensions, their members, and their explicitly
dependent routines - depend on them.  This is the default."
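
For example (using cube purely as a stand-in for any extension), the behaviour
being described is roughly:

    CREATE EXTENSION cube;
    CREATE TABLE t (c cube);        -- an outside object depending on a member of the extension
    DROP EXTENSION cube RESTRICT;   -- refused: column t.c depends on type cube
    DROP EXTENSION cube CASCADE;    -- drops the extension and column t.c along with it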

Also, I'm thinking to change, on the same page (description):

"Dropping an extension causes its component objects,"

to be:

"Dropping an extension causes its member objects,"

I'm not sure why I originally chose component over member...

David J.


Re: Doc about how to set max_wal_senders when setting minimal wal_level

2022-07-14 Thread Bruce Momjian
On Tue, Jul  5, 2022 at 08:02:33PM -0400, Tom Lane wrote:
> "Precondition" is an overly fancy word that makes things less clear
> not more so.  Does it mean that setting wal_level = minimal will fail
> if you don't do these other things, or does it just mean that you
> won't be getting the absolute minimum WAL volume?  If the former,
> I think it'd be better to say something like "To set wal_level to minimal,
> you must also set [these variables], which has the effect of disabling
> both WAL archiving and streaming replication."

I have created the attached patch to try to improve this text.
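
For anyone trying this out, the combination being described amounts to roughly
the following (shown as ALTER SYSTEM only for illustration; a restart is
required either way):

    ALTER SYSTEM SET wal_level = minimal;
    ALTER SYSTEM SET max_wal_senders = 0;   -- otherwise the server refuses to start
    ALTER SYSTEM SET archive_mode = off;    -- WAL archiving is unavailable at this level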

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37fd80388c..964f9fb379 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2764,9 +2764,10 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, no information is logged for
-permanent relations for the remainder of a transaction that creates or
-rewrites them.  This can make operations much faster (see
+The minimal level generates the least WAL
+volume.  It logs no row information for permanent relations
+in transactions that create or
+rewrite them.  This can make operations much faster (see
 ).  Operations that initiate this
 optimization include:
 
@@ -2778,19 +2779,18 @@ include_dir 'conf.d'
  REINDEX
  TRUNCATE
 
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+However, minimal WAL does not contain sufficient information for
+point-in-time recovery, so replica or
+higher must be used to enable continuous archiving
+() and streaming binary replication.
 Note that changing wal_level to
-minimal makes any base backups taken before
-unavailable for archive recovery and standby server, which may
-lead to data loss.
+minimal makes any base backups taken before this
+unusable for point-in-time recovery and standby servers.


 In logical level, the same information is logged as
-with replica, plus information needed to allow
-extracting logical change sets from the WAL. Using a level of
+with replica, plus information needed to
+extract logical change sets from the WAL. Using a level of
 logical will increase the WAL volume, particularly if many
 tables are configured for REPLICA IDENTITY FULL and
 many UPDATE and DELETE statements are


Re: Fix typo in progress reporting doc

2022-07-14 Thread Bruce Momjian
On Fri, Mar 11, 2022 at 04:07:32PM +0530, Nitin Jadhav wrote:
> Hi,
> 
> I have observed that the table naming conventions used in
> 'progress-reporting.html' are not consistent across different
> sections. For some cases "Phases" (Table 28.37. CREATE INDEX Phases)
> is used and for some cases "phases" (Table 28.35. ANALYZE phases) is
> used. I have attached a patch to correct this.

Patch applied to PG 13, where it first appeared, and all later releases.
Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: doc: New cumulative stats subsystem obsoletes comment in maintenance.sgml

2022-07-14 Thread Andres Freund
Hi,

I had missed David's original email on this topic...

On 2022-07-14 18:58:09 -0400, Bruce Momjian wrote:
> On Wed, Apr 20, 2022 at 04:40:44PM -0700, David G. Johnston wrote:
> > The new cumulative stats subsystem no longer has a "lost under heavy load"
> > problem so that parenthetical should go (or at least be modified).
> > 
> > These stats can be reset so some discussion about how the system uses them
> > given that possibility seems like it would be good to add here.  I'm not 
> > sure
> > what that should look like though.
> > 
> > diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
> > index 04a04e0e5f..360807c8f9 100644
> > --- a/doc/src/sgml/maintenance.sgml
> > +++ b/doc/src/sgml/maintenance.sgml
> > @@ -652,9 +652,8 @@ vacuum insert threshold = vacuum base insert threshold +
> > vacuum insert scale fac
> >      tuples to be frozen by earlier vacuums.  The number of obsolete tuples 
> > and
> >      the number of inserted tuples are obtained from the cumulative 
> > statistics
> > system;
> >      it is a semi-accurate count updated by each UPDATE,
> > -    DELETE and INSERT operation.  
> > (It is
> > -    only semi-accurate because some information might be lost under heavy
> > -    load.)  If the relfrozenxid value of the 
> > table
> > +    DELETE and INSERT operation.
> > +    If the relfrozenxid value of the table
> >      is more than vacuum_freeze_table_age transactions 
> > old,
> >      an aggressive vacuum is performed to freeze old tuples and advance
> >      relfrozenxid; otherwise, only pages that 
> > have
> > been modified
> 
> Yes, I agree and plan to apply this patch soon.

It might make sense to still say semi-accurate, but adjust the explanation to
say that stats reporting is not instantaneous?

Greetings,

Andres Freund




Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Peter Geoghegan
On Thu, Jul 14, 2022 at 3:59 PM Tom Lane  wrote:
> Even in an assert-enabled build, wouldn't you expect the compiler to
> optimize away the second assertion as unreachable code?

I think that it probably would, even at -O0 (GCC doesn't really allow
you to opt out of all optimizations). I did think of that myself, but
it seemed rather beside the point.

There have been individual cases where individual assertions were
deemed a bit too heavyweight. But those have been few and far between.
I myself tend to use *lots* of technically-redundant assertions like
this for preconditions and postconditions. At worst they're code
comments that are all but guaranteed to stay current.

-- 
Peter Geoghegan




Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Tom Lane
Peter Geoghegan  writes:
> The proposal doesn't seem like an improvement. Technically the
> assertion cannot possibly fail here because the earlier assertion
> would always fail instead, so strictly speaking it is redundant -- at
> least right now. That is true. But it seems much more important to be
> consistent about which variant to use. Especially because there is
> obviously no overhead in builds without assertions enabled.

Even in an assert-enabled build, wouldn't you expect the compiler to
optimize away the second assertion as unreachable code?

regards, tom lane




Re: doc: New cumulative stats subsystem obsoletes comment in maintenance.sgml

2022-07-14 Thread Bruce Momjian
On Wed, Apr 20, 2022 at 04:40:44PM -0700, David G. Johnston wrote:
> Hackers,
> 
> The new cumulative stats subsystem no longer has a "lost under heavy load"
> problem so that parenthetical should go (or at least be modified).
> 
> These stats can be reset so some discussion about how the system uses them
> given that possibility seems like it would be good to add here.  I'm not sure
> what that should look like though.
> 
> diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
> index 04a04e0e5f..360807c8f9 100644
> --- a/doc/src/sgml/maintenance.sgml
> +++ b/doc/src/sgml/maintenance.sgml
> @@ -652,9 +652,8 @@ vacuum insert threshold = vacuum base insert threshold +
> vacuum insert scale fac
>      tuples to be frozen by earlier vacuums.  The number of obsolete tuples 
> and
>      the number of inserted tuples are obtained from the cumulative statistics
> system;
>      it is a semi-accurate count updated by each UPDATE,
> -    DELETE and INSERT operation.  (It 
> is
> -    only semi-accurate because some information might be lost under heavy
> -    load.)  If the relfrozenxid value of the table
> +    DELETE and INSERT operation.
> +    If the relfrozenxid value of the table
>      is more than vacuum_freeze_table_age transactions old,
>      an aggressive vacuum is performed to freeze old tuples and advance
>      relfrozenxid; otherwise, only pages that have
> been modified

Yes, I agree and plan to apply this patch soon.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: pg_parameter_aclcheck() and trusted extensions

2022-07-14 Thread Nathan Bossart
On Thu, Jul 14, 2022 at 06:03:45PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> On Thu, Jul 14, 2022 at 04:02:30PM -0400, Tom Lane wrote:
>>> I noted something that ought to be looked at separately:
>>> validate_option_array_item() seems like it needs to be taught about
>>> grantable permissions on GUCs.  I think that right now it may report
>>> permissions failures in some cases where it should succeed.
> 
>> Which cases do you think might be inappropriately reporting permissions
>> failures?  It looked to me like this stuff was mostly used for
>> pg_db_role_setting, which wouldn't be impacted by the current set of
>> grantable GUC permissions.  Is the idea that you should be able to do ALTER
>> ROLE SET for GUCs that you have SET permissions for?
> 
> Well, that's what I'm wondering.  Obviously that wouldn't *alone* be
> enough permissions, but it seems like it could be a component of it.
> Specifically, this bit:
> 
>   /* manual permissions check so we can avoid an error being thrown */
>   if (gconf->context == PGC_USERSET)
>/* ok */ ;
>   else if (gconf->context == PGC_SUSET && superuser())
>/* ok */ ;
>   else if (skipIfNoPermissions)
>   return false;
> 
> seems like it's trying to duplicate what set_config_option would do,
> and it's now missing a component of that.  If it shouldn't check
> per-GUC permissions along with superuser(), we should add a comment
> explaining why not.

I looked into this a bit closer.  I found that having SET permissions on a
GUC seems to allow you to ALTER ROLE SET it to others.

postgres=# CREATE ROLE test CREATEROLE;
CREATE ROLE
postgres=# CREATE ROLE other;
CREATE ROLE
postgres=# GRANT SET ON PARAMETER zero_damaged_pages TO test;
GRANT
postgres=# SET ROLE test;
SET
postgres=> ALTER ROLE other SET zero_damaged_pages = 'on';
ALTER ROLE
postgres=> SELECT * FROM pg_db_role_setting;
 setdatabase | setrole |        setconfig
-------------+---------+-------------------------
           0 |   16385 | {zero_damaged_pages=on}
(1 row)

However, ALTER ROLE RESET ALL will be blocked, while resetting only the
individual GUC will go through.

postgres=> ALTER ROLE other RESET ALL;
ALTER ROLE
postgres=> SELECT * FROM pg_db_role_setting;
 setdatabase | setrole |        setconfig
-------------+---------+-------------------------
           0 |   16385 | {zero_damaged_pages=on}
(1 row)

postgres=> ALTER ROLE other RESET zero_damaged_pages;
ALTER ROLE
postgres=> SELECT * FROM pg_db_role_setting;
 setdatabase | setrole | setconfig 
-------------+---------+-----------
(0 rows)

I think this is because GUCArrayReset() is the only caller of
validate_option_array_item() that sets skipIfNoPermissions to true.  The
others fall through to set_config_option(), which does a
pg_parameter_aclcheck().  So, you are right.

Regarding whether SET privileges should be enough to allow ALTER ROLE SET,
I'm not sure I have an opinion yet.  You would need WITH GRANT OPTION to be
able to grant SET to that role, but that's a bit different than altering
the setting for the role.  You'll already have privileges to alter the role
(e.g., CREATEROLE), so requiring extra permissions to set GUCs on roles
feels like it might be excessive.  But there might be a good argument for
it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Move Section 9.27.7 (Data Object Management Functions) to System Information Chapter

2022-07-14 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Apr 25, 2022 at 08:33:47AM -0700, David G. Johnston wrote:
>> Both the location and name of the linked to section make no sense to me:
>> https://www.postgresql.org/docs/current/functions-admin.html#
>> FUNCTIONS-ADMIN-DBOBJECT
>> Neither of the tables listed there manage (cause to change) anything.  They 
>> are
>> pure informational functions - size and path of objects respectively.  It
>> belongs in the previous chapter "System Information Functions and Operators"
>> with a different name.

> So, the section title is:
>   9.27.7. Database Object Management Functions
> I think the idea is that they _help_ to manage database objects by
> reporting their size or location.  I do think it is in the right
> chapter, but maybe needs a better title?  I can't think of one.

I'm hesitant to move functions to a different documentation page
without a really solid reason.  Just a couple days ago I fielded a
complaint from somebody who couldn't find string_to_array anymore
because we'd moved it from "array functions" to "string functions".

I'd be the first to say that the division between 9.26 and 9.27 is
pretty arbitrary ... but without a clearer demarcation rule,
moving functions between the two pages seems more likely to
add confusion than subtract it.

regards, tom lane




Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Peter Geoghegan
On Thu, Jul 14, 2022 at 3:31 PM Bruce Momjian  wrote:
> On Wed, Apr 27, 2022 at 08:04:00PM +0800, Junwang Zhao wrote:
> for non-Assert builds, ItemPointerGetOffsetNumberNoCheck() and
> ItemPointerGetOffsetNumber() are the same, so I don't see the point to
> making this change.  Frankly, I don't know why we even have two
> functions for this.  I am guessing ItemPointerGetOffsetNumberNoCheck is
> for cases where you have an Assert build and do not want the check.

Sometimes we use ItemPointerData for things that aren't actually TIDs.
For example, both GIN and B-Tree type-pun the ItemPointerData field
from the Indextuple struct. Plus we do something like that with
UPDATEs that affect a partitioning key in a partitioned table.

The proposal doesn't seem like an improvement. Technically the
assertion cannot possibly fail here because the earlier assertion
would always fail instead, so strictly speaking it is redundant -- at
least right now. That is true. But it seems much more important to be
consistent about which variant to use. Especially because there is
obviously no overhead in builds without assertions enabled.

-- 
Peter Geoghegan




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-14 Thread Melanie Plageman
In addition to adding several new tests, the attached version 26 fixes a
major bug in constructing the view.

The only valid combination of IOPATH/IOOP that is not tested now is
IOPATH_STRATEGY + IOOP_WRITE. In most cases when I ran this in regress,
the checkpointer wrote out the dirty strategy buffer before VACUUM got
around to reusing and writing it out in my tests.

I've also changed the BACKEND_NUM_TYPES definition. Now arrays will have
that dead spot for B_INVALID, but I feel like it is much easier to
understand without trying to skip that spot and use those special helper
functions.

I also started skipping adding rows to the view for WAL_RECEIVER and
WAL_WRITER and for BackendTypes except B_BACKEND and WAL_SENDER for
IOPATH_LOCAL.

On Tue, Jul 12, 2022 at 1:18 PM Andres Freund  wrote:

> On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> > Yes, per an off list suggestion by you, I have changed the tests to use a
> > sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> > the missing calls to count IO Operations for IOPATH_LOCAL and
> > IOPATH_STRATEGY.
> >
> > I struggled to come up with a way to test writes for a particular
> > type of backend are counted correctly since a dirty buffer could be
> > written out by another type of backend before the target BackendType has
> > a chance to write it out.
>
> I guess temp file writes would be reliably done by one backend... Don't
> have a good idea otherwise.
>
>
This was mainly an issue for IOPATH_STRATEGY writes as I mentioned. I
still have not solved this.


>
> > I'm not sure how to cause a strategy "extend" for testing.
>
> COPY into a table should work. But might be unattractive due to the size of
> the COPY ringbuffer.
>

Did it with a CTAS as Horiguchi-san suggested.


>
> > > Would be nice to have something testing that the ringbuffer stats stuff
> > > does something sensible - that feels not entirely trivial.
> > >
> > >
> > I've added a test to test that reused strategy buffers are counted as
> > allocs. I would like to add a test which checks that if a buffer in the
> > ring is pinned and thus not reused, that it is not counted as a strategy
> > alloc, but I found it challenging without a way to pause vacuuming, pin
> > a buffer, then resume vacuuming.
>
> Yea, that's probably too hard to make reliable to be worth it.
>
>
Yes, I have skipped this.

- Melanie
From f7772e4d19821e0aeb19e906ba6f5e4bb046cfdb Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v26 3/4] Track IO operation statistics

Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp "write"
on an IOPath IOPATH_SHARED by BackendType "checkpointer".

Each IOOp (alloc, extend, fsync, read, write) is counted per IOPath
(local, shared, or strategy) through a call to pgstat_count_io_op().

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.

IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.

The IOPATH_STRATEGY IOPath concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.

IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOPATH_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOPATH_STRATEGY is counted whenever
the reused dirty buffer is written out.

Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.

Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live. The accumulated stats in shared
memory could be extended in the future with per-backend stats -- useful
for per connection IO statistics and monitoring.

Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO Operation statistics
to shared memory in a timely manner.

Author: Melanie Plageman 
Reviewed-by: Justin Pryzby , Kyotaro Horiguchi 
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 src/backend/postmaster/checkpointer.c |   1 +
 src/backend/storage/buffer/bufmgr.c   |  53 -
 src/backend/storage/buffer/freelist.c |  25 ++-
 src/backend/storage/buffer/localbuf.c |   5 +
 src/backend/storage/sync/sync.c   |   2 +
 src/backend/utils/activity/Makefile   |   1 +
 src/backend/utils/activity/pgstat.c   |  36 
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpointer.c  

Re: Move Section 9.27.7 (Data Object Management Functions) to System Information Chapter

2022-07-14 Thread Bruce Momjian
On Mon, Apr 25, 2022 at 08:33:47AM -0700, David G. Johnston wrote:
> Hi,
> 
> Both the location and name of the linked to section make no sense to me:
> 
> https://www.postgresql.org/docs/current/functions-admin.html#
> FUNCTIONS-ADMIN-DBOBJECT
> 
> Neither of the tables listed there manage (cause to change) anything.  They 
> are
> pure informational functions - size and path of objects respectively.  It
> belongs in the previous chapter "System Information Functions and Operators"
> with a different name.

So, the section title is:

9.27.7. Database Object Management Functions

I think the idea is that they _help_ to manage database objects by
reporting their size or location.  I do think it is in the right
chapter, but maybe needs a better title?  I can't think of one.
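
Concretely, the functions in question are things like:

    SELECT pg_total_relation_size('pg_class'),   -- from the "size" table
           pg_relation_filepath('pg_class');     -- from the "location" table

They only report information, they don't change anything, which is what
prompted the question about the title.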

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Bruce Momjian
On Wed, Apr 27, 2022 at 08:04:00PM +0800, Junwang Zhao wrote:
> In function ItemPointerEquals, the ItemPointerGetBlockNumber
> already checked the ItemPointer if valid, there is no need
> to check it again in ItemPointerGetOffset, so use
> ItemPointerGetOffsetNumberNoCheck instead.
> 
> Signed-off-by: Junwang Zhao 
> ---
>  src/backend/storage/page/itemptr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/src/backend/storage/page/itemptr.c 
> b/src/backend/storage/page/itemptr.c
> index 9011337aa8..61ad727b1d 100644
> --- a/src/backend/storage/page/itemptr.c
> +++ b/src/backend/storage/page/itemptr.c
> @@ -37,8 +37,8 @@ ItemPointerEquals(ItemPointer pointer1, ItemPointer 
> pointer2)
>  
>   if (ItemPointerGetBlockNumber(pointer1) ==
>   ItemPointerGetBlockNumber(pointer2) &&
> - ItemPointerGetOffsetNumber(pointer1) ==
> - ItemPointerGetOffsetNumber(pointer2))
> + ItemPointerGetOffsetNumberNoCheck(pointer1) ==
> + ItemPointerGetOffsetNumberNoCheck(pointer2))
>   return true;
>   else
>   return false;

Looking at the code:

/*
 * ItemPointerGetOffsetNumberNoCheck
 *  Returns the offset number of a disk item pointer.
 */
static inline OffsetNumber
ItemPointerGetOffsetNumberNoCheck(const ItemPointerData *pointer)
{
return pointer->ip_posid;
}

/*
 * ItemPointerGetOffsetNumber
 *  As above, but verifies that the item pointer looks valid.
 */
static inline OffsetNumber
ItemPointerGetOffsetNumber(const ItemPointerData *pointer)
{
Assert(ItemPointerIsValid(pointer));
return ItemPointerGetOffsetNumberNoCheck(pointer);
}

for non-Assert builds, ItemPointerGetOffsetNumberNoCheck() and
ItemPointerGetOffsetNumber() are the same, so I don't see the point to
making this change.  Frankly, I don't know why we even have two
functions for this.  I am guessing ItemPointerGetOffsetNumberNoCheck is
for cases where you have an Assert build and do not want the check.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: pg_parameter_aclcheck() and trusted extensions

2022-07-14 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jul 14, 2022 at 04:02:30PM -0400, Tom Lane wrote:
>> I noted something that ought to be looked at separately:
>> validate_option_array_item() seems like it needs to be taught about
>> grantable permissions on GUCs.  I think that right now it may report
>> permissions failures in some cases where it should succeed.

> Which cases do you think might be inappropriately reporting permissions
> failures?  It looked to me like this stuff was mostly used for
> pg_db_role_setting, which wouldn't be impacted by the current set of
> grantable GUC permissions.  Is the idea that you should be able to do ALTER
> ROLE SET for GUCs that you have SET permissions for?

Well, that's what I'm wondering.  Obviously that wouldn't *alone* be
enough permissions, but it seems like it could be a component of it.
Specifically, this bit:

    /* manual permissions check so we can avoid an error being thrown */
    if (gconf->context == PGC_USERSET)
         /* ok */ ;
    else if (gconf->context == PGC_SUSET && superuser())
         /* ok */ ;
    else if (skipIfNoPermissions)
        return false;

seems like it's trying to duplicate what set_config_option would do,
and it's now missing a component of that.  If it shouldn't check
per-GUC permissions along with superuser(), we should add a comment
explaining why not.

regards, tom lane




Re: pg_parameter_aclcheck() and trusted extensions

2022-07-14 Thread Nathan Bossart
On Thu, Jul 14, 2022 at 04:02:30PM -0400, Tom Lane wrote:
> Here's a draft patch for that.  I initially ran around and changed all
> the set_config_option callers as I threatened before, but as I did it
> I could not help observing that they were all changing in exactly the
> same way: basically, they were passing GetUserId() if the GucContext
> is PGC_S_SESSION and BOOTSTRAP_SUPERUSERID otherwise.  Not counting
> guc.c internal call sites, there is a grand total of one caller that
> fails to fit the pattern.  So that brought me around to liking the idea
> of keeping set_config_option's API stable by making it a thin wrapper
> around another function with an explicit role argument.  The result,
> attached, poses far less of an API/ABI hazard than I was anticipating.
> If you're not poking into the GUC tables you have little to fear.
> 
> Most of the bulk of this is mechanical changes to pass the source
> role around properly in guc.c's data structures.  That's all basically
> copy-and-paste from the code to track the source context (scontext).

At first glance, this looks pretty reasonable to me.  

> I noted something that ought to be looked at separately:
> validate_option_array_item() seems like it needs to be taught about
> grantable permissions on GUCs.  I think that right now it may report
> permissions failures in some cases where it should succeed.

Which cases do you think might be inappropriately reporting permissions
failures?  It looked to me like this stuff was mostly used for
pg_db_role_setting, which wouldn't be impacted by the current set of
grantable GUC permissions.  Is the idea that you should be able to do ALTER
ROLE SET for GUCs that you have SET permissions for?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-07-14 Thread Alvaro Herrera
Here's a couple of fixups.  0001 is the same as before.  In 0002 I think
CheckTablespaceDirectory ends up easier to read if we split out the test
for validity of the link.  Looking at that again, I think we don't need
to piggyback on ignore_invalid_pages, which is already a stretch, so
let's not -- instead we can use allow_in_place_tablespaces if users need
a workaround.  So that's 0003 (this bit needs more than zero docs,
however.)

0004 is straightforward: let's check for bad directories before logging
about consistent state.

After all this, I'm not sure what to think of dbase_redo.  At line 3102,
is the directory supposed to exist or not?  I'm confused as to what is
the expected state at that point.  I rewrote this, but now I think my
rewrite continues to be confusing, so I'll have to think more about it.

Another aspect are the tests.  Robert described a scenario where the
previously committed version of this patch created trouble.  Do we have
a test case to cover that problematic case?  I think we should strive to
cover it, if possible.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)
>From bdb691c2a86301e5369b3ae7f735fa5f7c29304d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 13 Jul 2022 18:14:18 +0200
Subject: [PATCH v25 1/4] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

CREATE DATABASE
DROP DATABASE
DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories that should not exist, and they must be gone again by the
time consistency is reached; CheckRecoveryConsistency is responsible
for making sure they have disappeared.

Similar to the log_invalid_page mechanism, the GUC ignore_invalid_pages
turns the PANIC errors detected by this patch into WARNINGs, which
allows recovery to continue.

Diagnosed-by: Paul Guo 
Author: Paul Guo 
Author: Kyotaro Horiguchi 
Author: Asim R Praveen 
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27dn5xwj2p9-roh16e4...@mail.gmail.com
---
 doc/src/sgml/config.sgml|   5 +-
 src/backend/access/transam/xlogrecovery.c   |  59 
 src/backend/commands/dbcommands.c   |  71 +
 src/backend/commands/tablespace.c   |  28 +---
 src/backend/utils/misc/guc.c|   8 +-
 src/include/access/xlogutils.h  |   2 +
 src/test/recovery/t/029_replay_tsp_drops.pl | 155 
 7 files changed, 296 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37fd80388c..1e1c8c1cb7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11363,11 +11363,12 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
   

 If set to off (the default), detection of
-WAL records having references to invalid pages during
+WAL records having references to invalid pages or
+WAL records resulting in invalid directory operations during
 recovery causes PostgreSQL to
 raise a PANIC-level error, aborting the recovery. Setting
 ignore_invalid_pages to on
-causes the system to ignore invalid page references in WAL records
+causes the system to ignore invalid actions caused by such WAL records
 (but still report a warning), and continue the recovery.
 This behavior may cause crashes, data loss,
 propagate or hide corruption, or other serious problems.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..ae81244e06 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2008,6 +2008,57 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
+ *
+ * This is intended to be called after reaching consistency.
+ * ignore_invalid_pages=on turns the PANIC error into a WARNING so that
+ * recovery can continue.
+ *
+ * This can't be checked in allow_in_place_tablespaces mode, so skip it in
+ * that case.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	char *tblspc_path = "./pg_tblspc";
+	DIR		   *dir;
+	struct dirent *de;
+
+	/* Do not check for this 

Re: doc: Clarify Routines and Extension Membership

2022-07-14 Thread Bruce Momjian
On Fri, Jul  8, 2022 at 10:55:55PM -0400, Bruce Momjian wrote:
> > The/that inconsistency ... choose one.  Or actually, the "an ... the"
> > combination you used elsewhere doesn't grate on the ear either.
> > 
> > +  For each extension, refuse to drop anything if any objects (other 
> > than the
> > +  extensions listed) depend on it.  However, its own member objects, 
> > and routines
> > +  that are explicitly dependent on this extension, are skipped.
> > +  This is the default.
> > 
> > "skipped" seems like a horrible choice of word; it could easily be read as
> > "they don't get dropped".  I am not convinced that mentioning the member
> > objects here is an improvement either.  In the first sentence you are
> > treating each extension as a monolithic object; why not in the second?
> 
> I created a modified patch based on this feedback;  patch attached.  I
> rewrote the last change.

Patch applied to PG 13 and later, where extension dependency was added. 
Thank you for the patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Introduce "log_connection_stages" setting.

2022-07-14 Thread Euler Taveira
On Thu, Jul 14, 2022, at 8:20 AM, Sergey Dudoladov wrote:
> I've taken connection stages and terminology from the existing log messages.
> The reason I have separated "authorized" and "authenticated" are [1]
> and [2] usages of "log_connections";
> "received" is mentioned at [3].
After checking the commit 9afffcb833d, I agree that you need 4 stages:
connected, authorized, authenticated, and disconnected.

> I have thought about enums too, but we need to cover arbitrary
> combinations of message types, for example log only "received" and
> "disconnected".
> Hence the proposed type "string" with individual values within the
> string really drawn from an enum.
Ooops. I said enum but I meant string list.
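
So a setting for the example given above would look something like this
(the parameter is only proposed in this thread and does not exist in any
release; name and values are as sketched here):

    ALTER SYSTEM SET log_connection_stages = 'received, disconnected';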

> Are there any specific deprecation guidelines ? I have not found any
> after a quick search for GUC deprecation in Google and commit history.
> A deprecation scheme could look like that:
> 1. Mention in the docs "log_(dis)connections" are deprecated in favor
> of "log_connection_stages"
> 2. Map "log_(dis)connections" to relevant values of
> "log_connection_stages" in code if the latter is unset.
> 3. Complain in the logs if a conflict arises between the old params
> and "log_connection_stages", with "log_connection_stages"
> taking the precedence.
No. AFAICS in this case, you just remove log_connections and log_disconnections
and create the new one (see for example the commit 88e98230268 that replaced
checkpoint_segments with min_wal_size and max_wal_size). We don't generally
keep ConfigureNames* entries for deprecated GUCs. Unless you are renaming a GUC
-- see map_old_guc_names; that's not the case. When we remove a GUC, we are
introducing an incompatibility so the only place it will be mentioned is the
release notes (there is a section called "Migration to Version X" that lists
all incompatibilities). From the developer's point of view, you only need to
mention in the commit message that this commit is introducing an
incompatibility. Hence, when it is time to write the release notes, the
information about the removal and the new replacement will be added.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: EINTR in ftruncate()

2022-07-14 Thread Tom Lane
Thomas Munro  writes:
> So why would I add another wrapper like PG_SETMASK and leave it
> unimplemented for now on Windows, when I could just use sigprocmask()
> directly and leave it unimplemented for now on Windows?

Fair enough, I guess.  No objection to this patch.

(Someday we oughta go ahead and make our Windows signal API look more
like POSIX, as I suggested back in 2015.  I'm still not taking
point on that, though.)

regards, tom lane




Re: EINTR in ftruncate()

2022-07-14 Thread Thomas Munro
On Fri, Jul 15, 2022 at 3:27 AM Tom Lane  wrote:
> Alvaro Herrera  writes:
> > ISTM it would be cleaner to patch PG_SETMASK to have a second argument
> > and to return the original mask if that's not NULL.  This is more
> > invasive, but there aren't that many callsites of that macro.
>
> [ shoulda read your message before replying ]
>
> Given that this needs back-patched, I think changing PG_SETMASK
> is a bad idea: there might be outside callers.  However, we could
> add another macro with the additional argument.  PG_GET_AND_SET_MASK?

It's funny though, the reason we had PG_SETMASK in the first place is
not for Windows.  Ancient commit 47937403676 added that for long gone
pre-POSIX systems like NeXTSTEP which only had single-argument
sigsetmask(), not sigprocmask().  In general on Windows we're
emulating POSIX signal interfaces with normal names like sigemptyset()
etc, it just so happens that we chose to emulate that pre-standard
sigsetmask() interface (as you complained about in the commit message
for a65e0864).

So why would I add another wrapper like PG_SETMASK and leave it
unimplemented for now on Windows, when I could just use sigprocmask()
directly and leave it unimplemented for now on Windows?

The only reason I can think of for a wrapper is to provide a place to
check the return code and ereport (panic?).  That seems to be of
limited value (how can it fail ... bad "how" value, or a sandbox
denying some system calls, ...?).  I did make sure to preserve the
errno though; even though we're assuming these calls can't fail by
long standing tradition, I didn't feel like additionally assuming that
successful calls don't clobber errno.

I guess, coded like this, it should also be safe to do it in the
postmaster, but I think maybe we should leave it conditional, rather
than relying on BlockSig being initialised and sane during early
postmaster initialisation.
From 17eb588af303873dbd76d0f24b536f74747cb3bf Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 15 Jul 2022 09:00:12 +1200
Subject: [PATCH] Make dsm_impl_resize more future-proof.

Commit 4518c798 blocks signals for a short region of code, but it
assumed that whatever calls it had the mask set to UnBlockSig on entry.
That may be true today, but it would be more resilient to save and
restore the caller's signal mask.

Previously we couldn't do this because our macro for abstracting signal
mask changes was based on the old pre-POSIX sigsetmask(), due to 1990s
portability concerns.  Since this is a POSIX-only code path, and since
a65e0864 established that supported Unix systems must have
sigprocmask(), we can just use sigprocmask() directly.  It would also be
implementable for Windows, but that's not needed for this.

Back-patch to all releases, like 4518c798 and 80845b7c.

Reviewed-by: Alvaro Herrera 
Reviewed-by: Tom Lane 
Discussion: https://postgr.es/m/CA%2BhUKGKx6Biq7_UuV0kn9DW%2B8QWcpJC1qwhizdtD9tN-fn0H0g%40mail.gmail.com
---
 src/backend/storage/ipc/dsm_impl.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 0247d13a91..69c6df75b4 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -49,6 +49,7 @@
 #include "postgres.h"
 
 #include 
+#include 
 #include 
 #ifndef WIN32
 #include 
@@ -62,7 +63,7 @@
 #endif
 
 #include "common/file_perm.h"
-#include "libpq/pqsignal.h"		/* for PG_SETMASK macro */
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "portability/mem.h"
@@ -355,6 +356,7 @@ dsm_impl_posix_resize(int fd, off_t size)
 {
 	int			rc;
 	int			save_errno;
+	sigset_t	save_sigmask;
 
 	/*
 	 * Block all blockable signals, except SIGQUIT.  posix_fallocate() can run
@@ -363,7 +365,7 @@ dsm_impl_posix_resize(int fd, off_t size)
 	 * conflicts), the retry loop might never succeed.
 	 */
 	if (IsUnderPostmaster)
-		PG_SETMASK(&BlockSig);
+		sigprocmask(SIG_SETMASK, &BlockSig, &save_sigmask);
 
 	pgstat_report_wait_start(WAIT_EVENT_DSM_ALLOCATE);
 #if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
@@ -402,7 +404,7 @@ dsm_impl_posix_resize(int fd, off_t size)
 	if (IsUnderPostmaster)
 	{
 		save_errno = errno;
-		PG_SETMASK(&UnBlockSig);
+		sigprocmask(SIG_SETMASK, &save_sigmask, NULL);
 		errno = save_errno;
 	}
 
-- 
2.30.2



Re: Issue with recovery_target = 'immediate'

2022-07-14 Thread David Steele

On 7/14/22 04:26, Kyotaro Horiguchi wrote:

> At Wed, 13 Jul 2022 14:41:40 -0400, David Steele  wrote in
>> While it is certainly true that timeline 2 cannot be replayed to from
>> timeline 1, it should not matter for an immediate recovery that stops
>> at consistency. No timeline switch will occur until promotion. Of
>> course the cluster could be shut down before promotion and the target
>> changed, but in that case couldn't timeline be adjusted at that point?
>>
>> This works by default for PostgreSQL < 12 because the default timeline
>> is current. Since PostgreSQL 12 the default has been latest, which
>> does not work by default.
>>
>> When a user does a number of recoveries it is pretty common for the
>> timelines to get in a state that will make most subsequent recoveries
>> fail. We think it makes sense for recovery_target = 'immediate' to be
>> a fail safe that always works no matter the state of the latest
>> timeline.
>>
>> Our solution has been to force recovery_target_timeline = 'current'
>> when recovery_target = 'immediate', but it seems like this is
>> something that should be done in PostgreSQL instead.
>>
>> Thoughts?


> I think it is natural that recovery defaultly targets the most recent
> update.  In that sense, at the time the admin decided to recover the
> server from the first backup, the second backup is kind of dead, at
> least which should be forgotten in the future operation.


Well, I dislike the idea of a dead backup. Certainly no backup can be 
followed along all timelines but it should still be recoverable.



> Even if we want "any" backup usable, just re-targeting to the current
> timeline after the timeline error looks kind of inconsistent to the
> behavior mentioned above. To make "dead" backups behave like the
> "live" ones, we would need to check if the backup is in the history of
> each "future" timelines, then choose the latest timeline from them.


I think this makes sense for non-immediate targets. The idea is that
recovering to the "latest" timeline would actually recover to the latest
timeline that is valid for the backup. Is that what you had in mind?


However, for immediate targets, only the current timeline makes sense so 
I feel like it would be better to simply force the current timeline.
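
Spelled out, the workaround we apply today is essentially the following
(normally written straight into postgresql.auto.conf before starting
recovery; shown here as ALTER SYSTEM only for brevity):

    ALTER SYSTEM SET recovery_target = 'immediate';
    ALTER SYSTEM SET recovery_target_timeline = 'current';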



> # Mmm. I remember about a recent patch for pg_rewind to do the same...


Do you have a link for this?

Regards,
-David




Re: Making the subquery alias optional in the FROM clause

2022-07-14 Thread Tom Lane
Dean Rasheed  writes:
> Attached is an update with the following changes:

> * Docs updated as suggested.
> * transformLockingClause() updated to skip subquery and values rtes
> without aliases.
> * eref->aliasname changed to "unnamed_subquery" for subqueries without 
> aliases.

This looks good to me.  Marked RFC.

regards, tom lane




Re: [PATCH] Log details for client certificate failures

2022-07-14 Thread Jacob Champion
On Thu, Jul 14, 2022 at 1:12 PM Peter Eisentraut
 wrote:
> Concretely, I was thinking like the attached top-up patch.
>
> The other way can surely be made to work somehow, but this seems much
> simpler and with fewer questions about the details.

Ah, seeing it side-by-side helps. That's much easier, I agree.

Thanks,
--Jacob




Re: [PATCH] Log details for client certificate failures

2022-07-14 Thread Peter Eisentraut

On 13.07.22 01:06, Jacob Champion wrote:

>> I had to read up on this "ex_data" API.  Interesting.  But I'm wondering
>> a bit about how the life cycle of these objects is managed.  What
>> happens if the allocated error string is deallocated before its
>> containing object?  Or vice versa?


> Yeah, I'm currently leaning heavily on the lack of any memory context
> switches here. And I end up leaking out a pointer to the stale stack
> of be_tls_open_server(), which is gross -- it works since there are no
> other clients, but that could probably come back to bite us.
>
> The ex_data API exposes optional callbacks for new/dup/free (I'm
> currently setting those to NULL), so we can run custom code whenever
> the SSL* is destroyed. If you'd rather the data have the same lifetime
> of the SSL* object, we can switch to malloc/strdup/free (or even
> OPENSSL_strdup() in later versions). But since we don't have any use
> for the ex_data outside of this function, maybe we should just clear
> it before we return, rather than carrying it around.


Concretely, I was thinking like the attached top-up patch.

The other way can surely be made to work somehow, but this seems much 
simpler and with fewer questions about the details.From 662bc1d19101865f91c6ed4a98c0f13f3757e1c7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 14 Jul 2022 22:08:53 +0200
Subject: [PATCH v5 2/2] squash! Log details for client certificate failures

---
 src/backend/libpq/be-secure-openssl.c | 29 +--
 1 file changed, 5 insertions(+), 24 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 8d5eee0d16..2147524ffc 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -81,13 +81,7 @@ static bool ssl_is_server_start;
 static int ssl_protocol_version_to_openssl(int v);
 static const char *ssl_protocol_version_to_string(int v);
 
-/* Helpers for storing application data in the SSL handle */
-struct pg_ex_data
-{
-   /* to bubble errors out of callbacks */
-   char   *errdetail;
-};
-static int ssl_ex_index;
+static const char *cert_errdetail;
 
 /*  */
 /*  Public interface   
*/
@@ -110,7 +104,6 @@ be_tls_init(bool isServerStart)
SSL_library_init();
SSL_load_error_strings();
 #endif
-   ssl_ex_index = SSL_get_ex_new_index(0, NULL, NULL, NULL, NULL);
SSL_initialized = true;
}
 
@@ -422,7 +415,6 @@ be_tls_open_server(Port *port)
int err;
int waitfor;
unsigned long ecode;
-   struct pg_ex_data exdata = {0};
boolgive_proto_hint;
 
Assert(!port->ssl);
@@ -455,14 +447,6 @@ be_tls_open_server(Port *port)

SSLerrmessage(ERR_get_error();
return -1;
}
-   if (!SSL_set_ex_data(port->ssl, ssl_ex_index, ))
-   {
-   ereport(COMMERROR,
-   (errcode(ERRCODE_PROTOCOL_VIOLATION),
-errmsg("could not attach application data to 
SSL handle: %s",
-   
SSLerrmessage(ERR_get_error();
-   return -1;
-   }
 
port->ssl_in_use = true;
 
@@ -561,7 +545,7 @@ be_tls_open_server(Port *port)

(errcode(ERRCODE_PROTOCOL_VIOLATION),
 errmsg("could not accept SSL 
connection: %s",

SSLerrmessage(ecode)),
-exdata.errdetail ? 
errdetail_internal("%s", exdata.errdetail) : 0,
+cert_errdetail ? 
errdetail_internal("%s", cert_errdetail) : 0,
 give_proto_hint ?
 errhint("This may indicate 
that the client does not support any SSL protocol version between %s and %s.",
 
ssl_min_protocol_version ?
@@ -570,6 +554,7 @@ be_tls_open_server(Port *port)
 
ssl_max_protocol_version ?
 
ssl_protocol_version_to_string(ssl_max_protocol_version) :
 
MAX_OPENSSL_TLS_VERSION) : 0));
+   cert_errdetail = NULL;
break;
case SSL_ERROR_ZERO_RETURN:
ereport(COMMERROR,
@@ -1145,12 +1130,10 @@ truncate_cert_name(char *name)
 static int
 verify_cb(int ok, 

Re: System catalog documentation chapter

2022-07-14 Thread Bruce Momjian
On Tue, Jul 12, 2022 at 05:16:41PM -0400, Bruce Momjian wrote:
> On Tue, Jul 12, 2022 at 08:56:01PM +0200, Peter Eisentraut wrote:
> > On 08.07.22 18:07, Bruce Momjian wrote:
> > > so I guess we can backpatch this with no issues.
> > 
> > It inserts a new chapter, which would renumber all other chapters. That's a
> > pretty big change to backpatch.  I'm against that.
> 
> Okay, I can see the renumbering as being confusing so I will do PG 15
> and head only.

Patch applied to PG 15 and master.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: [RFC] building postgres with meson -v9

2022-07-14 Thread Andres Freund
Hi,

On 2022-07-07 12:09:32 +0200, Peter Eisentraut wrote:
> On 06.07.22 15:21, Andres Freund wrote:
> > > Here is my rough assessment of where we are with this patch set:
> > > 
> > > 08b4330ded prereq: deal with \ paths in basebackup_to_shell tests.
> > > 
> > > This still needs clarification, per my previous review.
> > Hm. I thought I had explained that bit, but apparently not. Well, it's 
> > pretty
> > simple - without this, the test fail on windows for me, as soon as one of 
> > the
> > binaries is in a directory with spaces (which is common on windows). 
> > Imagine
> > what happens with e.g.
> >qq{$gzip --fast > "$escaped_backup_path%f.gz"}
> > if $gzip contains spaces.
> > 
> > 
> > This doesn't happen currently on CI because nothing runs these tests on
> > windows yet.
> 
> Hmm, maybe this patch looked different the last time I saw it.

Don't think it had changed since I wrote it first.


> The quoting of "$gzip" is clearly necessary.
> 
> What about the backslash replacements s{\\}{/}g ?  Is that also required for
> passing the path through the shell?

I don't recall the details, but it's definitely needed for embedding it into
postgresql.conf. We could double the escapes instead, but that doesn't seem an
improvement (and wouldn't work when passing it to the shell anymore).


> If so, the treatment of $tar in that
> way doesn't seem necessary, since that doesn't get called through an
> intermediate shell.  (That would then also explain why $gzip in the
> pg_basebackup tests doesn't require that treatment, which had previously
> confused me.)

Yea, it doesn't immediately look like it's needed, and the test passes without
it. I guess I might just have tried to be complete...

Greetings,

Andres Freund




Re: pg_parameter_aclcheck() and trusted extensions

2022-07-14 Thread Tom Lane
I wrote:
> Michael Paquier  writes:
>> Looks like a bug to me, so I have added an open item assigned to Tom.

> Yeah.  So the fix here seems pretty obvious: rather than applying the
> permissions check using bare GetUserId(), we need to remember the role
> OID that originally applied the setting, and use that.

Here's a draft patch for that.  I initially ran around and changed all
the set_config_option callers as I threatened before, but as I did it
I could not help observing that they were all changing in exactly the
same way: basically, they were passing GetUserId() if the GucContext
is PGC_S_SESSION and BOOTSTRAP_SUPERUSERID otherwise.  Not counting
guc.c internal call sites, there is a grand total of one caller that
fails to fit the pattern.  So that brought me around to liking the idea
of keeping set_config_option's API stable by making it a thin wrapper
around another function with an explicit role argument.  The result,
attached, poses far less of an API/ABI hazard than I was anticipating.
If you're not poking into the GUC tables you have little to fear.
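
To illustrate, keyed off the PGC_S_SESSION source as described above, the
thin wrapper could look roughly like this (a sketch based on that
description, not necessarily the exact code in the attached patch):

int
set_config_option(const char *name, const char *value,
				  GucContext context, GucSource source,
				  GucAction action, bool changeVal, int elevel,
				  bool is_reload)
{
	/*
	 * Session-sourced settings are checked against the current user;
	 * everything else is treated as set by the bootstrap superuser.
	 */
	Oid			srole = (source == PGC_S_SESSION) ?
		GetUserId() : BOOTSTRAP_SUPERUSERID;

	return set_config_option_ext(name, value, context, source, srole,
								 action, changeVal, elevel, is_reload);
}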

Most of the bulk of this is mechanical changes to pass the source
role around properly in guc.c's data structures.  That's all basically
copy-and-paste from the code to track the source context (scontext).

I noted something that ought to be looked at separately:
validate_option_array_item() seems like it needs to be taught about
grantable permissions on GUCs.  I think that right now it may report
permissions failures in some cases where it should succeed.

regards, tom lane

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 3db859c3ea..6b6720c690 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -912,6 +912,9 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	 * We use the equivalent of a function SET option to allow the setting to
 	 * persist for exactly the duration of the script execution.  guc.c also
 	 * takes care of undoing the setting on error.
+	 *
+	 * log_min_messages can't be set by ordinary users, so for that one we
+	 * pretend to be superuser.
 	 */
 	save_nestlevel = NewGUCNestLevel();
 
@@ -920,9 +923,10 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
  PGC_USERSET, PGC_S_SESSION,
  GUC_ACTION_SAVE, true, 0, false);
 	if (log_min_messages < WARNING)
-		(void) set_config_option("log_min_messages", "warning",
- PGC_SUSET, PGC_S_SESSION,
- GUC_ACTION_SAVE, true, 0, false);
+		(void) set_config_option_ext("log_min_messages", "warning",
+	 PGC_SUSET, PGC_S_SESSION,
+	 BOOTSTRAP_SUPERUSERID,
+	 GUC_ACTION_SAVE, true, 0, false);
 
 	/*
 	 * Similarly disable check_function_bodies, to ensure that SQL functions
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0328029d43..9e6cb590c7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5172,7 +5172,8 @@ static void reapply_stacked_values(struct config_generic *variable,
    struct config_string *pHolder,
    GucStack *stack,
    const char *curvalue,
-   GucContext curscontext, GucSource cursource);
+   GucContext curscontext, GucSource cursource,
+   Oid cursrole);
 static void ShowGUCConfigOption(const char *name, DestReceiver *dest);
 static void ShowAllGUCConfig(DestReceiver *dest);
 static char *_ShowOption(struct config_generic *record, bool use_units);
@@ -5897,10 +5898,10 @@ InitializeWalConsistencyChecking(void)
 
 		check_wal_consistency_checking_deferred = false;
 
-		set_config_option("wal_consistency_checking",
-		  wal_consistency_checking_string,
-		  PGC_POSTMASTER, guc->source,
-		  GUC_ACTION_SET, true, ERROR, false);
+		set_config_option_ext("wal_consistency_checking",
+			  wal_consistency_checking_string,
+			  guc->scontext, guc->source, guc->srole,
+			  GUC_ACTION_SET, true, ERROR, false);
 
 		/* checking should not be deferred again */
 		Assert(!check_wal_consistency_checking_deferred);
@@ -5979,6 +5980,8 @@ InitializeOneGUCOption(struct config_generic *gconf)
 	gconf->reset_source = PGC_S_DEFAULT;
 	gconf->scontext = PGC_INTERNAL;
 	gconf->reset_scontext = PGC_INTERNAL;
+	gconf->srole = BOOTSTRAP_SUPERUSERID;
+	gconf->reset_srole = BOOTSTRAP_SUPERUSERID;
 	gconf->stack = NULL;
 	gconf->extra = NULL;
 	gconf->last_reported = NULL;
@@ -6356,6 +6359,7 @@ ResetAllOptions(void)
 
 		gconf->source = gconf->reset_source;
 		gconf->scontext = gconf->reset_scontext;
+		gconf->srole = gconf->reset_srole;
 
 		if (gconf->flags & GUC_REPORT)
 		{
@@ -6401,6 +6405,7 @@ push_old_value(struct config_generic *gconf, GucAction action)
 {
 	/* SET followed by SET LOCAL, remember SET's value */
 	stack->masked_scontext = gconf->scontext;
+	stack->masked_srole = gconf->srole;
 	set_stack_value(gconf, &stack->masked);
 	

Re: doc: Clarify Savepoint Behavior

2022-07-14 Thread Bruce Momjian
On Sat, Jul  9, 2022 at 12:59:23PM -0400, Bruce Momjian wrote:
> On Sun, Jun 26, 2022 at 09:14:56AM -0700, David G. Johnston wrote:
> > So leave the "release" behavior implied from the rollback behavior?
> > 
> > On the whole I'm slightly in favor of your proposed wording (mostly due to 
> > the
> > better fitting of the ROLLBACK command, though at the omission of 
> > RELEASE...)
> > but are you seeing anything beyond personal style as to why you feel one is
> > better than the other?  Is there some existing wording in the docs that I
> > should be conforming to here?
> 
> I have developed the attached patch based on the discussion here.  I
> tried to simplify the language and example to clarify the intent.
> 
> I was confused why the first part of the patch added a mention of
> releasing savepoints to the ROLLBACK TO manual page --- I have removed
> that and improved the text in RELEASE SAVEPOINT.

Patch applied to all supported versions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-07-14 Thread Bruce Momjian
On Fri, Jul  8, 2022 at 11:18:43PM -0400, Bruce Momjian wrote:
> On Fri, Jul  1, 2022 at 08:11:36AM -0700, David G. Johnston wrote:
> > That said, I still think that the current wording should be tweak with 
> > respect
> > to row vs. rows (especially if we continue to call it a table):
> > 
> > Current:
> > "The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the
> > existing row using the table's name (or an alias), and to [rows] proposed
> > for insertion using the special excluded table."
> > 
> > Change [rows] to:
> > 
> > "the row"
> > 
> > 
> > I'm undecided whether "FROM excluded" should be something that works - but I
> > also don't think it would actually be used in any case.
> 
> I found two places where a singular "row" would be better, doc patch
> attached.

Patch applied to all supported versions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: doc: Bring mention of unique index forced transaction wait behavior outside of the internal section

2022-07-14 Thread Bruce Momjian
On Mon, Jul 11, 2022 at 05:22:41PM +0300, Aleksander Alekseev wrote:
> Hi Bruce,
> 
> > I was not happy with putting this in the Transaction Isolation section.
> > I rewrote it and put it in the INSERT secion, right before ON CONFLICT;
> > patch attached.
> 
> Looks good.

Applied to all supported PG versions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Pre-allocating WAL files

2022-07-14 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 01:30:03PM -0700, Nathan Bossart wrote:
> On Thu, Mar 17, 2022 at 04:12:12PM -0700, Nathan Bossart wrote:
>> It seems unlikely that this will be committed for v15, so I've adjusted the
>> commitfest entry to v16 and moved it to the next commitfest.
> 
> rebased

It's now been over a year since I first posted a patch in this thread, and
I still sense very little interest for this feature.  I intend to mark it
as Withdrawn at the end of this commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-14 Thread Nathan Bossart
Hi Bharath,

Thanks for taking a look.

On Thu, Jul 14, 2022 at 03:10:56PM +0530, Bharath Rupireddy wrote:
> Aren't these snapshot arrays always sorted? I see the following code:
> 
> /* sort so we can bsearch() */
> qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
> 
> /* sort so we can bsearch() later */
> qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);

AFAICT these arrays are sorted in limited cases, such as
pg_current_snapshot() and logical replication.  GetSnapshotData() does not
appear to sort them, so I don't think we can always assume they are sorted.
In the previous thread, Tomas analyzed simply sorting the arrays [0] and
found that it provided much less improvement compared to the hash table
approach, so I have not seriously considered it here.

> If the ordering isn't an invariant of these snapshot arrays, can we
> also use the hash table mechanism for all of the snapshot arrays
> infrastructure rather than qsort+bsearch in a few places and hash
> table for others?

Unless there is demonstrable benefit in doing so for the few places that
sort the arrays, I'm skeptical it's worth the complexity.  This patch is
targeted to XidInMVCCSnapshot(), which we can demonstrate has clear impact
on TPS for some workloads.

> + * The current value worked well in testing, but it's still mostly a 
> guessed-at
> + * number that might need updating in the future.
> + */
> +#define XIP_HASH_MIN_ELEMENTS (128)
> +
> 
> Do you see a regression with a hash table for all the cases? Why can't
> we just build a hash table irrespective of these limits and use it for
> all the purposes instead of making it complex with different
> approaches if we don't have measurable differences in the performance
> or throughput?

I performed the same tests as before with a variety of values.  Here are
the results:

 writers  HEAD  1   16  32  64  128

 16   659   698 678 659 665 664
 32   645   661 688 657 649 663
 64   659   656 653 649 663 692
 128  641   636 639 679 643 716
 256  619   641 619 643 653 610
 512  530   609 582 602 605 702
 768  469   610 608 551 571 582
 1000 367   610 538 557 556 577

I was surprised to see that there really wasn't a regression at the low end,
but keep in mind that this is a rather large machine and a specialized
workload for generating snapshots with long [sub]xip arrays.  That being
said, there really wasn't any improvement at the low end, either.  If we
always built a hash table, we'd be introducing more overhead and memory
usage in return for approximately zero benefit.  My intent was to only take
on the overhead in cases where we believe it might have a positive impact,
which is why I picked the somewhat conservative value of 128.  If the
overhead isn't a concern, it might be feasible to always make [sub]xip a
hash table.

> +static inline bool
> +XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
> + xip_hash_hash **xiph)
> 
> + /* Make sure the hash table is built. */
> + if (*xiph == NULL)
> + {
> + *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
> +
> + for (int i = 0; i < xcnt; i++)
> 
> Why create a hash table on the first search? Why can't it be built
> while inserting or creating these snapshots? Basically, instead of the
> array, can these snapshot structures be hash tables by themselves? I
> know this requires a good amount of code refactoring, but worth
> considering IMO as it removes bsearch thus might improve the
> performance further.

The idea is to avoid the overhead unless something actually needs to
inspect these arrays.
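
To make that shape concrete, here is a minimal, self-contained sketch of
the idea: linear search below a threshold, with a hash table built lazily
on the first lookup above it. It is illustrative only (plain malloc, a toy
open-addressing set, a multiplicative hash picked arbitrarily), not the
simplehash-based code in the patch; xids are assumed nonzero so 0 can mark
an empty slot.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define XIP_HASH_MIN_ELEMENTS 128	/* threshold, as discussed above */

typedef struct XipSet
{
	uint32_t   *slots;			/* open-addressed; 0 marks an empty slot */
	uint32_t	mask;			/* nslots - 1, nslots a power of two */
} XipSet;

static bool
xip_contains(uint32_t xid, const uint32_t *xip, uint32_t xcnt, XipSet **cache)
{
	XipSet	   *set;
	uint32_t	h;

	if (xcnt < XIP_HASH_MIN_ELEMENTS)
	{
		for (uint32_t i = 0; i < xcnt; i++)
			if (xip[i] == xid)
				return true;
		return false;
	}

	if (*cache == NULL)			/* build the hash set lazily, on first use */
	{
		uint32_t	nslots = 1;

		while (nslots < xcnt * 2)
			nslots <<= 1;
		set = malloc(sizeof(XipSet));
		set->slots = calloc(nslots, sizeof(uint32_t));
		set->mask = nslots - 1;
		for (uint32_t i = 0; i < xcnt; i++)
		{
			h = (xip[i] * 2654435761u) & set->mask;
			while (set->slots[h] != 0)
				h = (h + 1) & set->mask;
			set->slots[h] = xip[i];
		}
		*cache = set;
	}

	set = *cache;
	for (h = (xid * 2654435761u) & set->mask; set->slots[h] != 0;
		 h = (h + 1) & set->mask)
	{
		if (set->slots[h] == xid)
			return true;
	}
	return false;
}

The load factor stays at or below 50% because nslots is at least twice
xcnt, so both probe loops are guaranteed to terminate.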

[0] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-14 Thread Masahiko Sawada
On Thu, Jul 14, 2022 at 12:06 PM Masahiko Sawada  wrote:
>
> On Thu, Jul 14, 2022 at 11:16 AM shiy.f...@fujitsu.com
>  wrote:
> >
> > On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada  wrote:
> > >
> > > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > It happened when executing the following code because it tried to free a
> > > NULL
> > > > pointer (catchange_xip).
> > > >
> > > > /* be tidy */
> > > > if (ondisk)
> > > > pfree(ondisk);
> > > > +   if (catchange_xip)
> > > > +   pfree(catchange_xip);
> > > >  }
> > > >
> > > > It seems to be related to configure option. I could reproduce it when 
> > > > using
> > > > `./configure --enable-debug`.
> > > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -
> > > ggdb"`.
> > >
> > > Hmm, I could not reproduce this problem even if I use ./configure
> > > --enable-debug. And it's weird that we checked if catchange_xip is not
> > > null but we did pfree for it:
> > >
> > > #1  pfree (pointer=0x0) at mcxt.c:1177
> > > #2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78,
> > > lsn=25719712) at snapbuild.c:1792
> > >
> > > Is it reproducible in your environment?
> >
> > Thanks for your reply! Yes, it is reproducible. And I also reproduced it on 
> > the
> > v4 patch you posted [1].
>
> Thank you for testing!

I've found out the exact cause of this problem and how to fix it. I'll
submit an updated patch next week with my analysis.

Thank you for testing and providing additional information off-list, Shi yu.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.

2022-07-14 Thread vignesh C
On Thu, Jul 14, 2022 at 4:34 AM Peter Smith  wrote:
>
> On Wed, Jul 13, 2022 at 7:55 PM vignesh C  wrote:
> >
> > On Wed, Jul 13, 2022 at 1:13 PM Michael Paquier  wrote:
> > >
> > > On Wed, Jul 13, 2022 at 12:22:06PM +0530, vignesh C wrote:
> > > > Most of the code is common between GetSubscriptionRelations and
> > > > GetSubscriptionNotReadyRelations. Added a parameter to
> > > > GetSubscriptionRelations which could provide the same functionality as
> > > > the existing GetSubscriptionRelations and
> > > > GetSubscriptionNotReadyRelations. Attached patch has the changes for
> > > > the same. Thoughts?
> > >
> > > Right.  Using all_rels to mean that we'd filter relations that are not
> > > ready is a bit confusing, though.  Perhaps this could use a bitmask as
> > > argument.
> >
> > The attached v2 patch has the modified version which includes the
> > changes to make the argument as bitmask.
> >
>
> By using a bitmask I think there is an implication that the flags can
> be combined...
>
> Perhaps it is not a problem today, but later you may want more flags. e.g.
> #define SUBSCRIPTION_REL_STATE_READY 0x02 /* READY relations */
>
> then the bitmask idea falls apart because IIUC you have no intentions
> to permit things like:
> (SUBSCRIPTION_REL_STATE_NOT_READY | SUBSCRIPTION_REL_STATE_READY)
>
> IMO using an enum might be a better choice for that parameter.

Changed it to enum so that it can be extended to support other
subscription relations like ready state subscription relations later
easily. The attached v3 patch has the changes for the same.
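
For readers skimming the thread, the enum-based interface boils down to the
following; the value and type names match the patch below, but the layout
shown here is a sketch inferred from it rather than copied from the header
change:

typedef enum SubscriptionRelationState
{
	SUBSCRIPTION_REL_ALL,		/* return all relations of the subscription */
	SUBSCRIPTION_REL_NOT_READY	/* only relations not yet in READY state */
} SubscriptionRelationState;

extern List *GetSubscriptionRelations(Oid subid,
									  SubscriptionRelationState subrel_state);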

Regards,
Vignesh
From e0186ae685acb6334b711ea821b287be61d6cbd3 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Wed, 13 Jul 2022 11:54:59 +0530
Subject: [PATCH v3] Refactor to make use of a common function for
 GetSubscriptionRelations and GetSubscriptionNotReadyRelations.

Most of the code is common between GetSubscriptionRelations and
GetSubscriptionNotReadyRelations. Added a parameter to GetSubscriptionRelations
which could provide the same functionality for GetSubscriptionRelations
and GetSubscriptionNotReadyRelations.
---
 src/backend/catalog/pg_subscription.c   | 71 -
 src/backend/commands/subscriptioncmds.c |  5 +-
 src/backend/replication/logical/tablesync.c |  3 +-
 src/include/catalog/pg_subscription_rel.h   | 19 +-
 src/tools/pgindent/typedefs.list|  1 +
 5 files changed, 34 insertions(+), 65 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 8856ce3b50..46eedbc062 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -525,65 +525,15 @@ HasSubscriptionRelations(Oid subid)
 }
 
 /*
- * Get all relations for subscription.
+ * Get the relations for subscription.
  *
- * Returned list is palloc'ed in current memory context.
+ * If subrel_options is SUBSCRIPTION_REL_ALL, return all the relations for
+ * subscription. If SUBSCRIPTION_REL_NOT_READY, return all the relations for
+ * subscription that are not in a ready state. Returned list is palloc'ed in
+ * current memory context.
  */
 List *
-GetSubscriptionRelations(Oid subid)
-{
-	List	   *res = NIL;
-	Relation	rel;
-	HeapTuple	tup;
-	ScanKeyData skey[1];
-	SysScanDesc scan;
-
-	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
-
-	ScanKeyInit(&skey[0],
-Anum_pg_subscription_rel_srsubid,
-BTEqualStrategyNumber, F_OIDEQ,
-ObjectIdGetDatum(subid));
-
-	scan = systable_beginscan(rel, InvalidOid, false,
-			  NULL, 1, skey);
-
-	while (HeapTupleIsValid(tup = systable_getnext(scan)))
-	{
-		Form_pg_subscription_rel subrel;
-		SubscriptionRelState *relstate;
-		Datum		d;
-		bool		isnull;
-
-		subrel = (Form_pg_subscription_rel) GETSTRUCT(tup);
-
-		relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState));
-		relstate->relid = subrel->srrelid;
-		relstate->state = subrel->srsubstate;
-		d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup,
-			Anum_pg_subscription_rel_srsublsn, &isnull);
-		if (isnull)
-			relstate->lsn = InvalidXLogRecPtr;
-		else
-			relstate->lsn = DatumGetLSN(d);
-
-		res = lappend(res, relstate);
-	}
-
-	/* Cleanup */
-	systable_endscan(scan);
-	table_close(rel, AccessShareLock);
-
-	return res;
-}
-
-/*
- * Get all relations for subscription that are not in a ready state.
- *
- * Returned list is palloc'ed in current memory context.
- */
-List *
-GetSubscriptionNotReadyRelations(Oid subid)
+GetSubscriptionRelations(Oid subid, SubscriptionRelationState subrel_state)
 {
 	List	   *res = NIL;
 	Relation	rel;
@@ -599,10 +549,11 @@ GetSubscriptionNotReadyRelations(Oid subid)
 BTEqualStrategyNumber, F_OIDEQ,
 ObjectIdGetDatum(subid));
 
-	ScanKeyInit(&skey[nkeys++],
-Anum_pg_subscription_rel_srsubstate,
-BTEqualStrategyNumber, F_CHARNE,
-CharGetDatum(SUBREL_STATE_READY));
+	if (subrel_state == SUBSCRIPTION_REL_NOT_READY)
+		ScanKeyInit(&skey[nkeys++],
+	Anum_pg_subscription_rel_srsubstate,
+	

MERGE and parsing with prepared statements

2022-07-14 Thread Justin Pryzby
We've used INSERT ON CONFLICT for a few years (with partitions as the target).
That's also combined with prepared statements, for bulk loading.

I was looking to see if we should use MERGE (probably not, but looking anyway).
And came across this behavior.  I'm not sure if it's any issue.

CREATE TABLE CustomerAccount (CustomerId int, Balance float);

PREPARE p AS
MERGE INTO CustomerAccount CA
USING (SELECT $1 AS CustomerId, $2 AS TransactionValue) AS T
ON CA.CustomerId = T.CustomerId
WHEN NOT MATCHED THEN
  INSERT (CustomerId, Balance)
  VALUES (T.CustomerId, T.TransactionValue)
WHEN MATCHED THEN
  UPDATE SET Balance = Balance + TransactionValue;

ERROR:  operator does not exist: integer = text
LINE 3: ON CA.CustomerId = T.CustomerId

postgres: pryzbyj postgres [local] PREPARE(+0x2337be) [0x56108322e7be]
postgres: pryzbyj postgres [local] PREPARE(oper+0x198) [0x56108322f1fb]
postgres: pryzbyj postgres [local] PREPARE(make_op+0x7e) 
[0x56108322f55a]
postgres: pryzbyj postgres [local] PREPARE(+0x228f2b) [0x561083223f2b]
postgres: pryzbyj postgres [local] PREPARE(+0x227aa9) [0x561083222aa9]
postgres: pryzbyj postgres [local] PREPARE(transformExpr+0x1c) 
[0x5610832227f9]
postgres: pryzbyj postgres [local] PREPARE(transformMergeStmt+0x339) 
[0x56108322d988]
postgres: pryzbyj postgres [local] PREPARE(transformStmt+0x70) 
[0x5610831f4071]
postgres: pryzbyj postgres [local] PREPARE(+0x1fa350) [0x5610831f5350]
postgres: pryzbyj postgres [local] PREPARE(transformTopLevelStmt+0x11) 
[0x5610831f5385]
postgres: pryzbyj postgres [local] 
PREPARE(parse_analyze_varparams+0x5b) [0x5610831f54f4]
postgres: pryzbyj postgres [local] 
PREPARE(pg_analyze_and_rewrite_varparams+0x38) [0x5610834bcdfe]
postgres: pryzbyj postgres [local] PREPARE(PrepareQuery+0xcc) 
[0x561083292155]
postgres: pryzbyj postgres [local] 
PREPARE(standard_ProcessUtility+0x4ea) [0x5610834c31a0]

Why is $1 construed to be of type text ?




Re: SQL/JSON documentation JSON_TABLE

2022-07-14 Thread Andrew Dunstan

On 2022-07-08 Fr 16:20, Andrew Dunstan wrote:
> On 2022-07-08 Fr 16:03, Erik Rijkers wrote:
>> Hi,
>>
>> Attached are a few small changes to the JSON_TABLE section in func.sgml.
>>
>> The first two changes are simple typos.
>>
>> Then there was this line:
>>
>> 
>> context_item, path_expression [ AS json_path_name ] [ PASSING { value
>> AS varname } [, ...]]
>> 
>>
>> those are the parameters to JSON_TABLE() so I changed that line to:
>>
>> 
>> JSON_TABLE(context_item, path_expression [ AS json_path_name ] [
>> PASSING { value AS varname } [, ...]])
>> 
>>
>> Some parts of the JSON_TABLE text strike me as opaque.  For instance,
>> there are paragraphs that more than once use the term:
>>    json_api_common_syntax
>>
>> 'json_api_common_syntax' is not explained.  It turns out it's a relic
>> from Nikita's original docs. I dug up a 2018 patch where the term is
>> used as:
>>
>>  2018:
>> JSON_TABLE (
>>  json_api_common_syntax [ AS path_name ]
>>  COLUMNS ( json_table_column [, ...] )
>>  (etc...)
>> 
>>
>> with explanation:
>>
>>  2018:
>> json_api_common_syntax:
>>    The input data to query, the JSON path expression defining the
>> query, and an optional PASSING clause.
>> 
>>
>> So that made sense then (input+jsonpath+params=api), but it doesn't
>> now fit as such in the current docs.
>>
>> I think it would be best to remove all uses of that compound term, and
>> rewrite the explanations using only the current parameter names
>> (context_item, path_expression, etc).
>>
>> But I wasn't sure and I haven't done any such changes in the attached.
>>
>> Perhaps I'll give it a try during the weekend.
>>
>>
>>
>
> Thanks for this. If you want to follow up that last sentence I will try
> to commit a single fix early next week.
>
>

Here's a patch that deals with most of this. There's one change you
wanted that I don't think is correct, which I omitted.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..478d6eccd8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18030,9 +18030,9 @@ FROM
 or array, but if it is CONDITIONAL it will not be
 applied to a single array or object. UNCONDITIONAL
 is the default.
-If the result is a a scalar string, by default the value returned will have
-surrounding quotes making it a valid JSON value. However, this behavior
-is reversed if OMIT QUOTES is specified.
+If the result is a scalar string, by default the value returned will
+have surrounding quotes making it a valid JSON value. However, this
+behavior is reversed if OMIT QUOTES is specified.
 The ON ERROR and ON EMPTY
 clauses have similar semantics to those clauses for
 json_value.
@@ -18101,7 +18101,7 @@ FROM
columns. Columns produced by NESTED PATHs at the
same level are considered to be siblings,
while a column produced by a NESTED PATH is
-   considered to be a child of the column produced by and
+   considered to be a child of the column produced by a
NESTED PATH or row expression at a higher level.
Sibling columns are always joined first. Once they are processed,
the resulting rows are joined to the parent row.
@@ -18151,9 +18151,9 @@ FROM
  the specified column.
 
 
- The provided PATH expression parses the
- row pattern defined by json_api_common_syntax
- and fills the column with produced SQL/JSON items, one for each row.
+ The provided PATH expression is evaluated and
+ and the column is filled with the produced SQL/JSON items, one for each
+ row.
  If the PATH expression is omitted,
  JSON_TABLE uses the
  $.name path expression,
@@ -18185,9 +18185,8 @@ FROM
  item into each row of this column.
 
 
- The provided PATH expression parses the
- row pattern defined by json_api_common_syntax
- and fills the column with produced SQL/JSON items, one for each row.
+ The provided PATH expression is evaluated and
+ the column is filled with the produced SQL/JSON items, one for each row.
  If the PATH expression is omitted,
  JSON_TABLE uses the
  $.name path expression,
@@ -18216,11 +18215,10 @@ FROM
  Generates a column and inserts a boolean item into each row of this column.
 
 
- The provided PATH expression parses the
- row pattern defined by json_api_common_syntax,
- checks whether any SQL/JSON items were returned, and fills the column with
- resulting boolean value, one for each row.
- The specified type should have cast from
+ The provided PATH expression is evaluated,
+ a check whether any SQL/JSON items were returned is done, and
+ the column is filled with the resulting boolean value, one for each row.
+ The specified type should have a cast from the
  boolean.
  If the PATH expression is omitted,

Re: EINTR in ftruncate()

2022-07-14 Thread Tom Lane
Thomas Munro  writes:
> ... but now I'm wondering if we should be more defensive and possibly
> even save/restore the mask.

+1, sounds like a more future-proof solution.

> Originally I discounted that because I
> thought I had to go through PG_SETMASK for portability reasons, but on
> closer inspection, I don't see any reason not to use sigprocmask
> directly in Unix-only code.

Seems like the thing to do is to add a suitable operation to the
pqsignal.h API.  We could leave it unimplemented for now on Windows,
and then worry what to do if we ever need it in that context.

regards, tom lane




Re: EINTR in ftruncate()

2022-07-14 Thread Tom Lane
Alvaro Herrera  writes:
> ISTM it would be cleaner to patch PG_SETMASK to have a second argument
> and to return the original mask if that's not NULL.  This is more
> invasive, but there aren't that many callsites of that macro.

[ shoulda read your message before replying ]

Given that this needs back-patched, I think changing PG_SETMASK
is a bad idea: there might be outside callers.  However, we could
add another macro with the additional argument.  PG_GET_AND_SET_MASK?

regards, tom lane




Re: First draft of the PG 15 release notes

2022-07-14 Thread Bruce Momjian
On Thu, Jul 14, 2022 at 01:24:41PM +0700, John Naylor wrote:
> Regarding this item:
> 
> "Allow hash lookup for NOT IN clauses with many constants (David Rowley, James
> Coleman)
> Previously the code always sequentially scanned the list of values."
> 
> The todo list has an entry titled "Planning large IN lists", which links to 
> 
> https://www.postgresql.org/message-id/1178821226.6034.63.camel@goldbach
> 
> Did we already have a hash lookup for IN clauses with constants and the above
> commit adds NOT IN? If so, maybe we have enough to remove this todo item.

Agreed, I have removed it now.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: replacing role-level NOINHERIT with a grant-level option

2022-07-14 Thread tushar

On 7/11/22 11:01 PM, Robert Haas wrote:

Oops. Here is a rebased version of v3 which aims to fix this bug.

I found one issue where pg_upgrade is failing

On PG v14.4, create the objects below:

create user u1 with superuser;
create user u3;
create group g2 with user  u1;

Now try to perform pg_upgrade with v16 (w/ patch); it fails with
these messages:


[edb@centos7tushar bin]$ tail -10 
dc2/pg_upgrade_output.d/20220714T195919.494/log/pg_upgrade_utility.log

--
--
CREATE ROLE "u3";
CREATE ROLE
ALTER ROLE "u3" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN 
NOREPLICATION NOBYPASSRLS;

ALTER ROLE
GRANT "g2" TO "u1" WITH  GRANTED BY "edb";
psql:dc2/pg_upgrade_output.d/20220714T195919.494/dump/pg_upgrade_dump_globals.sql:47: 
ERROR:  syntax error at or near "BY"

LINE 1: GRANT "g2" TO "u1" WITH  GRANTED BY "edb";
 ^
This issue is not reproducible on PG v16 (without patch).

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: EINTR in ftruncate()

2022-07-14 Thread Alvaro Herrera
On 2022-Jul-15, Thomas Munro wrote:

> I checked that this throw-away assertion doesn't fail currently:
> 
> if (IsUnderPostmaster)
> +   {
> +   sigset_t old;
> +   sigprocmask(SIG_SETMASK, NULL, &old);
> +   Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0);
> PG_SETMASK(&UnBlockSig);
> +   }
> 
> ... but now I'm wondering if we should be more defensive and possibly
> even save/restore the mask.

Yeah, that sounds better to me.

> Originally I discounted that because I thought I had to go through
> PG_SETMASK for portability reasons, but on closer inspection, I don't
> see any reason not to use sigprocmask directly in Unix-only code.

ISTM it would be cleaner to patch PG_SETMASK to have a second argument
and to return the original mask if that's not NULL.  This is more
invasive, but there aren't that many callsites of that macro.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: EINTR in ftruncate()

2022-07-14 Thread Thomas Munro
On Fri, Jul 15, 2022 at 1:02 AM Thomas Munro  wrote:
> On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro  wrote:
> > Yeah.  Done, and pushed.  0002 not back-patched.
>
> Hmm, there were a couple of hard to understand build farm failures.
> My first thought is that the signal mask stuff should only be done if
> IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask
> when reached from dsm_postmaster_startup().  Looking into that.

I pushed that change, and I hope that clears up the problems seen on
eg curculio.  It does raise the more general question of why it's safe
to assume the signal mask is UnBlockSig on entry in regular backends.
I expect it to be in released branches, but things get more
complicated as we use DSM in more ways and it's not ideal to bet on
that staying true.  I checked that this throw-away assertion doesn't
fail currently:

if (IsUnderPostmaster)
+   {
+   sigset_t old;
+   sigprocmask(SIG_SETMASK, NULL, &old);
+   Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0);
PG_SETMASK(&UnBlockSig);
+   }

... but now I'm wondering if we should be more defensive and possibly
even save/restore the mask.  Originally I discounted that because I
thought I had to go through PG_SETMASK for portability reasons, but on
closer inspection, I don't see any reason not to use sigprocmask
directly in Unix-only code.
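
For the record, the save/restore variant only takes a few lines when using
sigprocmask() directly. A self-contained sketch of the pattern (in the real
code the mask to install would be UnBlockSig rather than an empty set, and
failures would be reported with elog()):

#include <signal.h>

static void
run_with_signals_unblocked(void (*work) (void))
{
	sigset_t	unblock_all;
	sigset_t	save_mask;

	sigemptyset(&unblock_all);

	/* remember the caller's mask, install the permissive one */
	if (sigprocmask(SIG_SETMASK, &unblock_all, &save_mask) == 0)
	{
		work();

		/* put the original mask back, whatever it was */
		(void) sigprocmask(SIG_SETMASK, &save_mask, NULL);
	}
}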




Re: Problem about postponing gathering partial paths for topmost scan/join rel

2022-07-14 Thread Antonin Houska
Richard Guo  wrote:

> On Wed, Jul 28, 2021 at 3:42 PM Richard Guo  wrote:
> 
>  To fix this problem, I'm thinking we can leverage 'root->all_baserels'
>  to tell if we are at the topmost scan/join rel, something like:
> 
>  --- a/src/backend/optimizer/path/allpaths.c
>  +++ b/src/backend/optimizer/path/allpaths.c
>  @@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int 
> levels_needed, List *initial_rels)
>   * partial paths.  We'll do the same for the topmost 
> scan/join rel
>   * once we know the final targetlist (see 
> grouping_planner).
>   */
>  -   if (lev < levels_needed)
>  +   if (!bms_equal(rel->relids, root->all_baserels))
>  generate_useful_gather_paths(root, rel, 
> false);
> 
>  Any thoughts?
> 
> Attach a patch to include the fix described upthread. Would appreciate
> any comments on this topic.


I think I understand the idea but I'm not sure about the regression test. I
suspect that in the plan

EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 a JOIN tenk1 b ON a.two = b.two
  FULL JOIN tenk1 c ON b.two = c.two;
 QUERY PLAN 


 Aggregate
   ->  Hash Full Join
 Hash Cond: (b.two = c.two)
 ->  Gather
   Workers Planned: 4
   ->  Parallel Hash Join
 Hash Cond: (a.two = b.two)
 ->  Parallel Seq Scan on tenk1 a
 ->  Parallel Hash
   ->  Parallel Seq Scan on tenk1 b
 ->  Hash
   ->  Gather
 Workers Planned: 4
 ->  Parallel Seq Scan on tenk1 c


the Gather node is located below the "Hash Full Join" node only because that
kind of join currently cannot be executed by parallel workers. If the parallel
"Hash Full Join" gets implemented (I've noticed but not checked in detail
[1]), it might break this test.

I'd prefer a test that demonstrates that the Gather node at the top of the
"subproblem plan" is useful purely from the *cost* perspective, rather than
due to executor limitation.


[1] https://commitfest.postgresql.org/38/2903/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Handle infinite recursion in logical replication setup

2022-07-14 Thread Dilip Kumar
On Thu, 14 Jul 2022 at 6:34 PM, vignesh C  wrote:

> On Thu, Jul 14, 2022 at 11:26 AM Dilip Kumar 
> wrote:
> >
> > On Wed, Jul 13, 2022 at 4:49 PM Dilip Kumar 
> wrote:
> > >
> > > On Tue, Jul 12, 2022 at 2:58 PM vignesh C  wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 9:51 AM Amit Kapila 
> wrote:
> > >
> > > I find one thing  confusing about this patch.  Basically, this has two
> > > option 'local' and 'any', so I would assume that all the local server
> > > changes should be covered under the 'local' but now if we set some
> > > origin using 'select pg_replication_origin_session_setup('aa');' then
> > > changes from that session will be ignored because it has an origin id.
> > > I think actually the name is creating confusion, because by local it
> > > seems like a change which originated locally and the document is also
> > > specifying the same.
> > >
> > > +   If local, the subscription will request the
> publisher
> > > +   to only send changes that originated locally. If
> any,
> > >
> > > I think if we want to keep the option local then we should look up all
> > > the origin in the replication origin catalog and identify whether it
> > > is a local origin id or remote origin id and based on that filter out
> > > the changes.
> >
> > On the other hand if we are interested in receiving the changes which
> > are generated without any origin then I think we should change 'local'
> > to 'none' and then in future we can provide a new option which can
> > send the changes generated by all the local origin?  I think other
> > than this the patch LGTM.
>
> Thanks for the comment. The attached v33 patch has the changes to
> specify origin as 'none' instead of 'local' which will not publish the
> data having any origin.


I think ‘none’ might be a problem from an extensibility point of view. What
if in the future we support specifying an actual origin name? Then what
would none mean: no origin, or an origin named none? Should we instead use
the empty origin name ‘’? Or is there some other issue?

>
—
Dilip

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: EINTR in ftruncate()

2022-07-14 Thread Thomas Munro
On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro  wrote:
> Yeah.  Done, and pushed.  0002 not back-patched.

Hmm, there were a couple of hard to understand build farm failures.
My first thought is that the signal mask stuff should only be done if
IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask
when reached from dsm_postmaster_startup().  Looking into that.




Re: Lazy JIT IR code generation to increase JIT speed with partitions

2022-07-14 Thread David Geier
On Mon, Jul 4, 2022 at 10:32 PM Andres Freund  wrote:

> Hi,
>
> On 2022-06-27 16:55:55 +0200, David Geier wrote:
> > Indeed, the total JIT time increases the more modules are used. The
> reason
> > for this to happen is that the inlining pass loads and deserializes all
> to
> > be inlined modules (.bc files) from disk prior to inlining them via
> > llvm::IRMover. There's already a cache for such modules in the code, but
> it
> > is currently unused. This is because llvm::IRMover takes the module to be
> > inlined as std::unique_ptr. The by-value argument requires
> > the source module to be moved, which means it cannot be reused
> afterwards.
> > The code is accounting for that by erasing the module from the cache
> after
> > inlining it, which in turns requires reloading the module next time a
> > reference to it is encountered.
> >
> > Instead of each time loading and deserializing all to be inlined modules
> > from disk, they can reside in the cache and instead be cloned via
> > llvm::CloneModule() before they get inlined. Key to calling
> > llvm::CloneModule() is fully deserializing the module upfront, instead of
> > loading the module lazily. That is why I changed the call from
> > LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
> > llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2()
> (which
> > fully loads the module via llvm::parseBitcodeFile()). Beyond that it
> seems
> > like that prior to LLVM 13, cloning modules could fail with an assertion
> > (not sure though if that would cause problems in a release build without
> > assertions). Andres reported this problem back in the days here [1]. In
> the
> > meanwhile the issue got discussed in [2] and finally fixed for LLVM 13,
> see
> > [3].
>
> Unfortunately that doesn't work right now - that's where I had started. The
> problem is that IRMover renames types. Which, in the case of cloned modules
> unfortunately means that types used cloned modules are also renamed in the
> "origin" module. Which then causes problems down the line, because parts of
> the LLVM code match types by type names.
>
> That can then have the effect of drastically decreasing code generation
> quality over time, because e.g. inlining never manages to find signatures
> compatible.
>
> Looking at the LLVM patches these issues seem to be gone.
The same suggests dumping the bitcode and running all tests with JIT
force-enabled.
See below.

>
> > However, curiously the time spent on optimizing is also reduced (95ms
> > instead of 164ms). Could this be because some of the applied
> optimizations
> > are ending up in the cached module?
>
> I suspect it's more that optimization stops being able to do a lot, due to
> the
> type renaming issue.
>
>
> > @Andres: could you provide me with the queries that caused the assertion
> > failure in LLVM?
>
> I don't think I have the concrete query. What I tend to do is to run the
> whole
> regression tests with forced JITing. I'm fairly certain this triggered the
> bug
> at the time.
>
>
> > Have you ever observed a segfault with a non-assert-enabled build?
>
> I think I observed bogus code generation that then could lead to segfaults
> or
> such.
>
OK, then using a non-assert-enabled LLVM build seems good enough.
Especially, given the fixes they merged in [3].

>
>
> > I just want to make sure this is truly fixed in LLVM 13. Running 'make
> > check-world' all tests passed.
>
> With jit-ing forced for everything?
>
> One more thing to try is to jit-compile twice and ensure the code is the
> same. It certainly wasn't in the past due to the above issue.
>
Yes, with jitting force-enabled everything passes (I removed all checks
for the various 'jit_above' costs from planner.c).

I also dumped the bitcode via 'SET jit_dump_bitcode = ON' and compared the
bitcode output after inlining and after optimization for two successive
executions of the very same query. The .bc files after inlining only
differ in two bytes. The .bc files after optimization differ in a lot more
bytes. However, the same is true for the master branch. How exactly do you
conclude that the compilation output is identical?

--
David Geier
(ServiceNow)


Re: EINTR in ftruncate()

2022-07-14 Thread Thomas Munro
On Tue, Jul 12, 2022 at 5:45 AM Andres Freund  wrote:
> Hm - given that we've observed ftruncate failing with strace, and that
> stracing to find problems isn't insane, shouldn't we retry the ftruncate too?
> It's kind of obsoleted by your next patch, but not really, because it's not
> unconceivable that other OSs behave similarly? And IIUC you're planning to not
> backpatch 0002?

Yeah.  Done, and pushed.  0002 not back-patched.
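
The retry itself is just the usual EINTR loop; a minimal sketch of the
pattern being suggested in the quoted text (illustrative only, not the
exact dsm_impl.c change):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

static int
truncate_retry(int fd, off_t size)
{
	int			rc;

	do
	{
		rc = ftruncate(fd, size);
	} while (rc < 0 && errno == EINTR);

	return rc;
}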

> > + pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
>
> (not new in this patch, just moved around) - FILL_ZERO_WRITE is imo an odd
> description of what's happening... In my understanding this isn't doing any
> writing, just allocating. But whatever...

True.  Fixed in master.




Re: Fix gcc warning in sync.c (usr/src/backend/storage/sync/sync.c)

2022-07-14 Thread Thomas Munro
On Mon, Jul 11, 2022 at 9:45 PM Kyotaro Horiguchi
 wrote:
> At Mon, 11 Jul 2022 01:45:16 -0400, Tom Lane  wrote in
> > Kyotaro Horiguchi  writes:
> > > At Sat, 9 Jul 2022 21:53:31 -0300, Ranier Vilela  
> > > wrote in
> > >> 528 |entry = (PendingUnlinkEntry *) lfirst(cell);
> >
> > > Actually, I already see the following line (maybe) at the place instead.
> > >> PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
> >
> > Yeah, I see no line matching that in HEAD either.

Confusing report :-)

> > However, I do not much like the code at line 528, because its
> > "PendingUnlinkEntry *entry" is masking an outer variable
> > "PendingFsyncEntry *entry" from line 513.  We should rename
> > one or both variables to avoid that masking.

Fair point.
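
As a self-contained illustration of the kind of masking being discussed
(simplified; not the actual sync.c code), gcc -Wshadow=local flags the
inner declaration here:

#include <stdio.h>

struct fsync_entry { int segno; };
struct unlink_entry { int segno; };

int
main(void)
{
	struct fsync_entry fsyncs[2] = {{1}, {2}};
	struct unlink_entry unlinks[2] = {{3}, {4}};

	for (int i = 0; i < 2; i++)
	{
		struct fsync_entry *entry = &fsyncs[i];

		for (int j = 0; j < 2; j++)
		{
			/* warning: declaration of 'entry' shadows a previous local */
			struct unlink_entry *entry = &unlinks[j];

			printf("unlink %d\n", entry->segno);
		}
		printf("fsync %d\n", entry->segno);
	}
	return 0;
}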

> I thought the same at the moment looking this.  In this case, changing
> entry->syncent, unl(del)lent works. But at the same time I don't think
> that can be strictly applied.

Yeah, let's rename both of them.  Done.

> So, for starters, I compiled the whole tree with -Wshadow=local. and I
> saw many warnings with it.  At a glance all of them are reasonably
> "fixed" but I don't think it is what we want...

Wow, yeah.




Re: enable/disable broken for statement triggers on partitioned tables

2022-07-14 Thread Amit Langote
On Thu, Jul 14, 2022 at 8:20 PM Dmitry Koval  wrote:
> > I agree that the patch shouldn't have changed that behavior, so I've
> > fixed the patch so that EnableDisableTrigger() recurses even if the
> > parent trigger is unchanged.
>
> Thanks, I think this patch is ready for committer.

Great, thanks.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Introduce "log_connection_stages" setting.

2022-07-14 Thread Sergey Dudoladov
Hello,

Thank you for the constructive feedback.

> Your proposal will add more confusion to the already-confused logging-related 
> GUCs.
> I wouldn't introduce a new GUC that depends on the stage of other GUC as you 
> proposed.

Agreed, coupling a new GUC with "log_connections" is likely to lead to
extra confusion.

> There are 3 stages: connect (received), authorized (authenticated), 
> disconnect.

I've taken connection stages and terminology from the existing log messages.
The reason I have separated "authorized" and "authenticated" are [1]
and [2] usages of "log_connections";
"received" is mentioned at [3].

>> Example: log_connection_stages=’authorized,disconnected’.
> would turn the boolean value into an enum value

I have thought about enums too, but we need to cover arbitrary
combinations of message types, for example log only "received" and
"disconnected".
Hence the proposed type "string" with individual values within the
string really drawn from an enum.

> merge log_connections and log_disconnections into a new GUC (?) and deprecate 
> them.

Are there any specific deprecation guidelines? I have not found any
after a quick search for GUC deprecation in Google and the commit history.
A deprecation scheme could look like this:
1. Mention in the docs "log_(dis)connections" are deprecated in favor
of "log_connection_stages"
2. Map "log_(dis)connections" to relevant values of
"log_connection_stages" in code if the latter is unset.
3. Complain in the logs if a conflict arises between the old params
and "log_connection_stages", with "log_connection_stages"
taking the precedence.

Regards,
Sergey

[1] 
https://github.com/postgres/postgres/blob/3f8148c256e067dc2e8929ed174671ba7dc3339c/src/backend/utils/init/postinit.c#L257-L262
[2] 
https://github.com/postgres/postgres/blob/02c408e21a6e78ff246ea7a1beb4669634fa9c4c/src/backend/libpq/auth.c#L372
[3] 
https://github.com/postgres/postgres/blob/master/src/backend/postmaster/postmaster.c#L4393




Re: enable/disable broken for statement triggers on partitioned tables

2022-07-14 Thread Dmitry Koval

I agree that the patch shouldn't have changed that behavior, so I've
fixed the patch so that EnableDisableTrigger() recurses even if the
parent trigger is unchanged.


Thanks, I think this patch is ready for committer.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Bug: Reading from single byte character column type may cause out of bounds memory reads.

2022-07-14 Thread Nikolay Shaplov
In a message dated Wednesday, 13 July 2022 16:14:39 MSK, Aleksander
Alekseev wrote:

Hi! Let me join the review process. Postgres data types are a field of
expertise I am interested in.

0. Though it looks like a genuine bug, I can't reproduce it, neither with
valgrind nor with ASan (the address sanitizer should catch out-of-bounds
reads too). I am running Debian Bullseye, and tried to build both
PostgreSQL 14.4 and current master.

Nevertheless, I would like to dig into this issue, starting with the parts
that are not covered by the patch but seem important to me.

1. The type name "char" (with quotes) is very, very confusing. It is
described in the documentation, but you need to be a Postgres expert or a
careful documentation reader to notice the important difference between
"char" and char. What is the history of the "char" type? Is it specified by
some standard? Maybe it would be worth creating a more understandable alias,
like byte_char, ascii_char or something, for use in practice, and keeping
"char" for backward compatibility only.

2. I totally agree with Tom Lane and Isaac Morland that the problem should
also be fixed on the type conversion side. There is a whole big thread about
it; I guess we should come to some conclusion there.

3. Fixing out-of-bounds reading for broken Unicode is also important, though
for now I am not quite sure it is possible.


> - p += pg_mblen(p);
> + {
> + int t = pg_mblen(p);
> + p += t;
> + max_copy_bytes -= t;
> + }

Minor issue: here I would change the variable name from "t" to "char_len" or
something similar, to make the code easier to understand.

Major issue: is the pg_mblen function safe to call with broken encoding at
the end of the buffer? What if the last byte of the buffer is 0xF0 and you
call pg_mblen on it?


>+  copy_bytes = p - s;
>+  if(copy_bytes > max_copy_bytes)
>+  copy_bytes = max_copy_bytes;

Here I would suggest adding a comment about the broken UTF-8 encoding case.
That would explain why we might end up in a situation where we try to copy
more than we have.

I would also suggest issuing a warning here. I guess a person using Postgres
would prefer to know that they managed to store a string with broken UTF-8
encoding before it leads to some terrible consequences.

> Hi Spyridon,
> 
> > The column "single_byte_col" is supposed to store only 1 byte.
> > Nevertheless, the INSERT command implicitly casts the '' text into
> > "char". This means that only the first byte of '' ends up stored in the
> > column. gdb reports that "pg_mblen(p) = 4" (line 1046), which is expected
> > since the pg_mblen('') is indeed 4. Later at line 1050, the memcpy will
> > copy 4 bytes instead of 1, hence an out of bounds memory read happens for
> > pointer 's', which effectively copies random bytes.
> Many thanks for reporting this!
> 
> > - OS: Ubuntu 20.04
> > - PSQL version 14.4
> 
> I can confirm the bug exists in the `master` branch as well and
> doesn't depend on the platform.
> 
> Although the bug is easy to fix for this particular case (see the
> patch) I'm not sure if this solution is general enough. E.g. is there
> something that generally prevents pg_mblen() from doing out of bound
> reading in cases similar to this one? Should we prevent such an INSERT
> from happening instead?


-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su




Re: [PATCH] Implement INSERT SET syntax

2022-07-14 Thread Gareth Palmer
Hello,

Here is a new version of the patch that applies to HEAD.

It also adds some regression tests for overriding {system,user} values
based on Wenjing Zeng's work.

Gareth

On Thu, 14 Jul 2022 at 22:40, Jacob Champion  wrote:
>
> On Thu, Apr 7, 2022 at 11:29 AM Marko Tiikkaja  wrote:
> > I can help with review and/or other work here.  Please give me a
> > couple of weeks.
>
> Hi Marko, did you get a chance to pick up this patchset? If not, no
> worries; I can mark this RwF and we can try again in a future
> commitfest.
>
> Thanks,
> --Jacob
>
>
>
>
From f1ab7edadac846883b9c12f02ecc6c64dd293060 Mon Sep 17 00:00:00 2001
From: Gareth Palmer 
Date: Thu, 14 Jul 2022 01:47:17 +
Subject: [PATCH] Implement INSERT SET syntax

Allow the target column and values of an INSERT statement to be specified
using a SET clause in the same manner as that of an UPDATE statement.

The advantage of using the INSERT SET style is that the columns and values
are kept together, which can make changing or removing a column or value
from a large list easier.

A simple example that uses SET instead of a VALUES() clause:

INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';

Values can also be sourced from other tables similar to the INSERT INTO
SELECT FROM syntax:

INSERT INTO t SET c1 = x.c1, c2 = x.c2 FROM x WHERE x.c2 > 10 LIMIT 10;

INSERT SET is not part of any SQL standard, however this syntax is also
implemented by MySQL. Their implementation does not support specifying a
FROM clause.
---
 doc/src/sgml/ref/insert.sgml  | 56 +-
 src/backend/parser/gram.y | 77 ++-
 src/test/regress/expected/identity.out| 42 ++
 src/test/regress/expected/insert.out  | 13 +++-
 src/test/regress/expected/insert_conflict.out |  2 +
 src/test/regress/expected/with.out| 20 +
 src/test/regress/sql/identity.sql |  9 +++
 src/test/regress/sql/insert.sql   |  1 +
 src/test/regress/sql/insert_conflict.sql  |  3 +
 src/test/regress/sql/with.sql |  9 +++
 10 files changed, 210 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
index a9af9959c0..b774b6ce74 100644
--- a/doc/src/sgml/ref/insert.sgml
+++ b/doc/src/sgml/ref/insert.sgml
@@ -28,6 +28,16 @@ INSERT INTO table_name [ AS conflict_target ] conflict_action ]
 [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
 
+
+[ WITH [ RECURSIVE ] with_query [, ...] ]
+INSERT INTO table_name [ AS alias ]
+[ OVERRIDING { SYSTEM | USER} VALUE ]
+SET { column_name = { expression | DEFAULT } } [, ...]
+[ FROM from_clause ]
+[ ON CONFLICT [ conflict_target ] conflict_action ]
+[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
+
+
 where conflict_target can be one of:
 
 ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
@@ -263,6 +273,18 @@ INSERT INTO table_name [ AS 
  
 
+ 
+  from_clause
+  
+   
+A list of table expressions, allowing columns from other tables
+to be used as values in the expression.
+Refer to the  statement for a
+description of the syntax.
+   
+  
+ 
+
  
   DEFAULT
   
@@ -650,6 +672,15 @@ INSERT INTO films (code, title, did, date_prod, kind) VALUES
 
   
 
+  
+Insert a row using SET syntax:
+
+
+INSERT INTO films SET code = 'MH832', title = 'Blade Runner',
+did = 201, date_prod = DEFAULT, kind = 'SciFi';
+
+  
+
   
This example inserts some rows into table
films from a table tmp_films
@@ -696,6 +727,16 @@ WITH upd AS (
 INSERT INTO employees_log SELECT *, current_timestamp FROM upd;
 
   
+  
+   Insert multiple rows into employees_log containing
+the hours worked by each employee from time_sheets.
+
+INSERT INTO employees_log SET id = time_sheets.employee,
+total_hours = sum(time_sheets.hours) FROM time_sheets
+WHERE time_sheets.date  '2019-11-15' GROUP BY time_sheets.employee;
+
+  
+
   
Insert or update new distributors as appropriate.  Assumes a unique
index has been defined that constrains values appearing in the
@@ -752,6 +793,18 @@ INSERT INTO distributors (did, dname) VALUES (9, 'Antwerp Design')
 INSERT INTO distributors (did, dname) VALUES (10, 'Conrad International')
 ON CONFLICT (did) WHERE is_active DO NOTHING;
 
+  
+   Insert a new film into watched_films or increment the
+   number of times seen. Returns the new seen count, example assumes a
+   unique index has been defined that constrains the values appearing in
+   the title and year columns and
+   that seen_count defaults to 1.
+
+INSERT INTO watched_films SET title = 'Akira', year = 1988
+   ON CONFLICT (title, year) DO UPDATE SET seen_count = watched_films.seen_count + 1
+   RETURNING watched_films.seen_count;
+
+  
  
 
  
@@ -762,7 +815,8 @@ INSERT INTO distributors (did, dname) 

Improving scalability of Parallel Bitmap Heap/Index Scan

2022-07-14 Thread David Geier
Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:

   1. The BIS always only runs in a single process, also when the parent
   BHS is parallel. The first process arriving in the BHS serves as leader and
   executes the BIS.
   2. As long as execution is "exact" (TIDs are stored instead of page
   bits), the parallel BHS sorts all TIDs to ensure pages are accessed
   sequentially. The sort is also performed just by a single worker. Already
   with a few tens of thousands of pages to scan, the sort time can make up a
   significant portion of the total runtime. Large page counts and the need
   for parallelism are not uncommon for BHS, as one use case is closing the
   gap between index and sequential scans. The BHS costing seems to not
   account for that.
   3. The BHS does not scale well with an increasing number of parallel
   workers, even when accounting for the sequential parts of execution. A perf
   profile shows that the TID list / bitmap iteration code heavily contends on
   a mutex taken for every single TID / page bit (see
   LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067).
   4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the
   statistics of the parallel workers. For example the number of heap pages
   processed is what just the leader did. Similarly to other parallel plan
   nodes we should aggregate statistics across workers.

The EXPLAIN ANALYZE output below shows (1) to (3) happening in action for
different numbers of workers. I had to obfuscate the query slightly. The
difference between the startup time of the BHS and the BIS is the time it
takes to sort the TID list. The self time of the BHS is just the time spent
on processing the shared TID list and processing the pages. That part runs
in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms

None of this is really new and some of it is even documented. So, what I am
more wondering about is why things are the way they are and how hard it
would be to change them. I am especially curious about:

   - What stops us from extending the BIS to run in parallel? Parallel
   Bitmap Index Scans are also supported.
   - What about reducing the sort time by, e.g.
  - dividing TIDs across workers, ending up with N independently sorted
  streams,
  - cooperatively sorting the TIDs with multiple workers using barriers
  for synchronization,
  - optimizing the PagetableEntry data structure for size and using a
  faster sorting algorithm like e.g. radix sort
  - a combination of the first three options
   - With separate TID lists per worker process the iteration problem would
   be solved. Otherwise, we could
  - optimize the iteration code and thereby minimize the duration of
  the critical section,
  - have worker processes acquire chunks of TIDs / page bits to reduce
  locking.

Is there interest in patches improving on the above mentioned shortcomings?
If so, which options do you deem best?
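
For anyone wanting to reproduce the plan shape without the original
(obfuscated) query, a minimal sketch along the following lines forces the
same node types; the table, index, and predicates below are made up and only
serve to produce a Parallel Bitmap Heap Scan fed by a single Bitmap Index
Scan:

CREATE TABLE foo AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 10000000) AS g;
CREATE INDEX foo_idx ON foo (id);
ANALYZE foo;

SET max_parallel_workers_per_gather = 4;
SET enable_seqscan = off;    -- nudge the planner toward the bitmap plan
SET enable_indexscan = off;

EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*)
FROM foo
WHERE id BETWEEN 500000 AND 9500000  -- index condition -> Bitmap Index Scan
  AND payload LIKE '%abc%';          -- residual filter in the Bitmap Heap Scan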

--
David Geier
(ServiceNow)



-- 2 workers

 Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
               Output: PARTIAL count(*)
               Worker 0:  actual time=15181.364..15181.366 rows=1 loops=1
               Worker 1:  actual time=15181.608..15181.610 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
                     Filter: ...
                     Rows Removed by Filter: 4020149
                     Worker 0:  actual time=5912.498..15166.936 rows=269305 loops=1
                     Worker 1:  actual time=5912.963..15167.048 rows=270121 loops=1
                     ->  Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 1:  actual time=3107.947..3107.948 rows=8579724 loops=1
 Planning Time: 0.167 ms
 Execution Time: 15322.081 ms


-- 4 workers

 Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate (actual time=13130.344..13130.346

Re: remove_useless_groupby_columns is too enthusiastic

2022-07-14 Thread Richard Guo
On Wed, Jul 13, 2022 at 1:31 AM Tom Lane  wrote:

> I tried the attached quick-hack patch that just prevents
> remove_useless_groupby_columns from removing anything that
> appears in ORDER BY.  That successfully fixes the complained-of
> case, and it doesn't change any existing regression test results.
> I'm not sure whether there are any cases that it makes significantly
> worse.


If there happens to be an index on t2.othercol, t2.x, the patch would
definitely win since we can perform a GroupAggregate with no explicit
sort. If there is no such index, considering the redundant sorting work
due to the excess columns, do we still win?

I'm testing with the query below:

create table t (a int primary key, b int, c int);
insert into t select i, i%1000, i from generate_series(1,100)i;
analyze t;
create index t_b_a_idx on t (b, a);

select sum(c) from t group by a, b order by b limit 10;

If we have index 't_b_a_idx', there would be a big performance
improvement. Without the index, I can see some performance drop (I'm not
using the parallel query mechanism).
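
For what it's worth, the removal itself is easy to observe on current
master: since a is the primary key, EXPLAIN VERBOSE should list only a as
the group key even when the query spells out both columns, for example:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT sum(c) FROM t GROUP BY a, b;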


> (I also kind of wonder if the fundamental problem here is that
> remove_useless_groupby_columns is being done at the wrong time,
> and we ought to do it later when we have more semantic info.


Concur with that.

Thanks
Richard


Re: Patch proposal: New hooks in the connection path

2022-07-14 Thread Bharath Rupireddy
On Fri, Jul 8, 2022 at 5:54 PM Bharath Rupireddy
 wrote:
>
> Looking at the v2-0003 patch and emit_log_hook, how about we filter
> those connectivity errors based on error codes or, if those aren't
> unique, by passing special flags to the ereport API indicating that
> it's a connectivity error; then in emit_log_hook we can look for those
> connectivity error codes or flags to collect stats about the failed
> connections (with MyProcPort being present in emit_log_hook)? This
> way, we don't need a new hook. Thoughts?

Bertrand and Other Hackers, above comment may have been lost in the
wild - any thoughts on it?

Regards,
Bharath Rupireddy.




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-14 Thread Bharath Rupireddy
On Wed, Jul 13, 2022 at 10:40 PM Nathan Bossart
 wrote:
>
> Hi hackers,
>
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60₋second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
>
>  writers  HEAD  patch  diff
> 
>  16       659   664    +1%
>  32       645   663    +3%
>  64       659   692    +5%
>  128      641   716    +12%
>  256      619   610    -1%
>  512      530   702    +32%
>  768      469   582    +24%
>  1000     367   577    +57%

Impressive.

> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
>
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
>
> Thoughts?
>
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com

Aren't these snapshot arrays always sorted? I see the following code:

/* sort so we can bsearch() */
qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);

/* sort so we can bsearch() later */
qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);

If the ordering isn't an invariant of these snapshot arrays, can we
also use the hash table mechanism for all of the snapshot arrays
infrastructure rather than qsort+bsearch in a few places and hash
table for others?

+ * The current value worked well in testing, but it's still mostly a guessed-at
+ * number that might need updating in the future.
+ */
+#define XIP_HASH_MIN_ELEMENTS (128)
+

Do you see a regression with a hash table for all the cases? Why can't
we just build a hash table irrespective of these limits and use it for
all the purposes instead of making it complex with different
approaches if we don't have measurable differences in the performance
or throughput?

+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
+ xip_hash_hash **xiph)

+ /* Make sure the hash table is built. */
+ if (*xiph == NULL)
+ {
+ *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
+
+ for (int i = 0; i < xcnt; i++)

Why create a hash table on the first search? Why can't it be built
while inserting or creating these snapshots? Basically, instead of the
array, can these snapshot structures be hash tables by themselves? I
know this requires a good amount of code refactoring, but worth
considering IMO as it removes bsearch and thus might improve the
performance further.

Regards,
Bharath Rupireddy.




Re: [PATCH] Add native windows on arm64 support

2022-07-14 Thread Niyas Sait
Hello,

Please find a new patch (no further changes) rebased on top of the master.

On Tue, 10 May 2022 at 02:02, Michael Paquier  wrote:

> On Mon, May 09, 2022 at 12:44:22PM +0100, Niyas Sait wrote:
> > Microsoft updated documentation [1] and clarified that ASLR cannot be
> > disabled for Arm64 targets.
> >
> > Because ASLR can't be disabled on ARM, ARM64, or ARM64EC architectures,
> > > /DYNAMICBASE:NO isn't supported for these targets.
>
> Better than nothing, I guess, though this does not stand as a reason
> explaining why this choice has been made.  And it looks like we will
> never know.  We are going to need more advanced testing to check that
> DYNAMICBASE is stable enough if we begin to enable it.  Perhaps we
> should really look at that for all the build targets we currently
> support rather than just ARM, for the most modern Win32 environments
> if we are going to cut support for most of the oldest releases..
> --
> Michael
>


-- 
Niyas Sait
Linaro Limited


v2-0001-Enable-postgres-native-build-for-windows-arm64-pl.patch
Description: Binary data


Re: Issue with recovery_target = 'immediate'

2022-07-14 Thread Kyotaro Horiguchi
At Wed, 13 Jul 2022 14:41:40 -0400, David Steele  wrote in 
> While it is certainly true that timeline 2 cannot be replayed to from
> timeline 1, it should not matter for an immediate recovery that stops
> at consistency. No timeline switch will occur until promotion. Of
> course the cluster could be shut down before promotion and the target
> changed, but in that case couldn't timeline be adjusted at that point?
> 
> This works by default for PostgreSQL < 12 because the default timeline
> is current. Since PostgreSQL 12 the default has been latest, which
> does not work by default.
> 
> When a user does a number of recoveries it is pretty common for the
> timelines to get in a state that will make most subsequent recoveries
> fail. We think it makes sense for recovery_target = 'immediate' to be
> a fail safe that always works no matter the state of the latest
> timeline.
> 
> Our solution has been to force recovery_target_timeline = 'current'
> when recovery_target = 'immediate', but it seems like this is
> something that should be done in PostgreSQL instead.
> 
> Thoughts?

I think it is natural that recovery targets the most recent update by
default.  In that sense, once the admin has decided to recover the
server from the first backup, the second backup is effectively dead, or
at least should be forgotten in future operations.

Even if we want "any" backup to be usable, just re-targeting the current
timeline after the timeline error looks somewhat inconsistent with the
behavior mentioned above. To make "dead" backups behave like "live"
ones, we would need to check whether the backup is in the history of
each "future" timeline, then choose the latest timeline among them.

# Mmm. I remember about a recent patch for pg_rewind to do the same...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-07-14 Thread Alvaro Herrera
This is not a review, but I think the isolation tests should be
expanded.  At least, include the case of serializable transactions being
involved.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)




Re: Backup command and functions can cause assertion failure and segmentation fault

2022-07-14 Thread Michael Paquier
On Fri, Jul 08, 2022 at 08:56:14AM -0400, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 10:58 AM Fujii Masao  
> wrote:
>> But if many think that it's worth adding the test, I will give a
>> try. But even in that case, I think it's better to commit the
>> proposed patch at first to fix the bug, and then to write the patch
>> adding the test.

I have looked at that in detail, and it is possible to rely on
pg_stat_activity.wait_event being BaseBackupThrottle, which makes sure
that the checkpoint triggered at the beginning of the backup has
finished and that we are in the middle of the base backup.  The
command for the test should be a psql command with two -c switches and
without ON_ERROR_STOP, so that the second pg_backup_stop() starts after
BASE_BACKUP is cancelled on the same connection, something like this:
psql -c "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);" \
 -c "select pg_backup_stop()" "replication=database"

The last part of the test should do a pump_until() and capture "backup
is not in progress" from the stderr output of the command run.
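
For reference, that error is simply what pg_backup_stop() raises when it is
called on a session with no backup in progress:

    SELECT pg_backup_stop();
    ERROR:  backup is not in progress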

This leads me to the attached test, which crashes quickly without the
fix and passes with it.

> It's true that we don't really have good test coverage of write-ahead
> logging and recovery, but this doesn't seem like the most important
> thing to be testing in that area, either, and developing stable tests
> for stuff like this can be a lot of work.

Well, stability does not seem like a problem to me here.

> I do kind of feel like the patch is fixing two separate bugs. The
> change to SendBaseBackup() is fixing the problem that, because there's
> SQL access on replication connections, we could try to start a backup
> in the middle of another backup by mixing and matching the two
> different methods of doing backups. The change to do_pg_abort_backup()
> is fixing the fact that, after aborting a base backup, we don't reset
> the session state properly so that another backup can be tried
> afterwards.
> 
> I don't know if it's worth committing them separately - they are very
> small fixes. But it would probably at least be good to highlight in
> the commit message that there are two different issues.

Grouping both fixes in the same commit sounds fine by me.  No
objections from here.
--
Michael
From 062bf8a06f68b7d543288260dd0cdf50bb5e9a72 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 14 Jul 2022 16:56:17 +0900
Subject: [PATCH] Add TAP test for BASE_BACKUP cancellation with
 pg_stop_backup()

---
 src/test/recovery/t/033_basebackup_cancel.pl | 60 
 1 file changed, 60 insertions(+)
 create mode 100644 src/test/recovery/t/033_basebackup_cancel.pl

diff --git a/src/test/recovery/t/033_basebackup_cancel.pl b/src/test/recovery/t/033_basebackup_cancel.pl
new file mode 100644
index 00..56cbaf83d0
--- /dev/null
+++ b/src/test/recovery/t/033_basebackup_cancel.pl
@@ -0,0 +1,60 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# BASE_BACKUP cancellation with replication database connection.
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(allows_streaming => 1);
+$node->append_conf('postgresql.conf', 'log_replication_commands = on');
+$node->start;
+
+note "testing BASE_BACKUP cancellation";
+
+my $sigchld_bb_timeout =
+  IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+
+# This test requires a replication connection with a database, as it mixes
+# a replication command and a SQL command.  The first BASE_BACKUP is throttled
+# to give enough room for the cancellation running below.  The second command
+# for pg_backup_stop() should fail.
+my $connstr =
+  $node->connstr('postgres') . " replication=database dbname=postgres";
+my ($sigchld_bb_stdin, $sigchld_bb_stdout, $sigchld_bb_stderr) = ('', '', '');
+my $sigchld_bb = IPC::Run::start(
+	[
+		'psql', '-X', '-c', "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);",
+		'-c',   'SELECT pg_backup_stop()',
+		'-d',   $connstr
+	],
+	'<',
+	\$sigchld_bb_stdin,
+	'>',
+	\$sigchld_bb_stdout,
+	'2>',
+	\$sigchld_bb_stderr,
+	$sigchld_bb_timeout);
+
+# Waiting on the wait event BaseBackupThrottle ensures that the checkpoint
+# issued at backup start completes, making the cancellation happen in the
+# middle of the base backup sent.
+is( $node->poll_query_until(
+		'postgres',
+		"SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE "
+		  . "wait_event = 'BaseBackupThrottle' "
+		  . "AND backend_type = 'walsender' AND query ~ 'BASE_BACKUP'"),
+	"1",
+	"WAL sender sending base backup killed");
+
+# The psql command should fail on pg_backup_stop().
+ok( pump_until(
+		$sigchld_bb, $sigchld_bb_timeout,
+		\$sigchld_bb_stderr, qr/backup is not in progress/),
+	'base backup cleanly cancelled');
+$sigchld_bb->finish();
+
+done_testing();
-- 
2.36.1



signature.asc
Description: PGP signature


Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity

2022-07-14 Thread Julien Rouhaud
Hi,

On Thu, Jul 14, 2022 at 08:51:24AM +0200, Antonin Houska wrote:
> Shouldn't the patch status be set to "Waiting on Author"?
>
> (I was curious if this is a patch that I can review.)

Ah indeed, I just updated the CF entry!




Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity

2022-07-14 Thread Antonin Houska
Shouldn't the patch status be set to "Waiting on Author"?

(I was curious if this is a patch that I can review.)

Julien Rouhaud  wrote:

> On Wed, Jun 22, 2022 at 11:05:54PM +, Imseih (AWS), Sami wrote:
> > > Can you describe how it's kept in sync, and how it makes sure that the
> > > property is maintained over restart / gc?  I don't see any change in the
> > > code for the 2nd part so I don't see how it could work (and after a quick
> > > test it indeed doesn't).
> >
> > There is a routine called qtext_hash_sync which removes all entries from
> > the qtext_hash and reloads it with all the query ids from pgss_hash.
> > [...]
> > At all the points where it's called, an exclusive lock is held. This allows
> > syncing all the present queryids in pgss_hash with qtext_hash.
> 
> So your approach is to let the current gc / file loading behavior happen as
> before and construct your mapping hash using the resulting query text / offset
> info.  That can't work.
> 
> > > 2nd part so I don't see how it could work (and after a quick test it
> > > indeed doesn't).
> >
> > Can you tell me what test you used to determine it is not in sync?
> 
> What test did you use to determine it is in sync?  Have you checked how the 
> gc/
> file loading actually work?
> 
> In my case I just checked the size of the query text file after running some
> script that makes sure that there are the same few queries ran by multiple
> different roles, then:
> 
> Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 559B
> pg_ctl restart
> Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 2383B
> 
> > > Can you share more details on the benchmarks you did?  Did you also run
> > > benchmark on workloads that induce entry eviction, with and without need
> > > for gc?  Eviction in pgss is already very expensive, and making things
> > > worse just to save a bit of disk space doesn't seem like a good compromise.
> >
> > Sorry, this was poorly explained by me. I went back and did some benchmarks.
> > Attached are the script and results. But here is a summary, on an EC2
> > r5.2xlarge. The benchmark I performed is:
> > 1. create 10k tables
> > 2. create 5 users
> > 3. run a pgbench script that performs, per transaction, a select on a
> > randomly chosen table for each of the 5 users.
> > 4. 2 variants of the test executed: 1 variant with the default
> > pg_stat_statements.max = 5000 and one test with a larger
> > pg_stat_statements.max = 1.
> 
> But you wrote:
> 
> > ##
> > ## pg_stat_statements.max = 15000
> > ##
> 
> So which one is it?
> 
> >
> > So 10-15% is not accurate. I originally tested on a less powerful machine.
> > For this benchmark I see a 6% increase in TPS (732k vs 683k) when a larger
> > pg_stat_statements.max is used and there are fewer gc/deallocations.
> > Both tests show a drop in gc/deallocations and a net increase in tps.
> > Less GC makes sense since the external file has fewer duplicate SQLs.
> 
> On the other hand you're rebuilding the new query_offset hashtable every time
> there's an entry eviction, which seems quite expensive.
> 
> Also, as I mentioned you didn't change any of the heuristic for
> need_gc_qtexts().  So if the query texts are indeed deduplicated, doesn't it
> mean that gc will artificially be called less often?  The wanted target of
> "50% bloat" will become "50% bloat assuming no deduplication is done" and the
> average query text file size will stay the same whether the query texts are
> deduplicated or not.
> 
> I'm wondering whether the improvements you see are due to the patch or simply
> due to artificially calling gc less often?  What are the results if, instead
> of using vanilla pg_stat_statements, you patch it to perform roughly the same
> number of gc runs as your version does?
> 
> Also your benchmark workload is very friendly with your feature, what are the
> results with other workloads?  Having the results with query texts that aren't
> artificially long would be interesting for instance, after fixing the problems
> mentioned previously.
> 
> Also, you said that if you run that benchmark with a single user you don't see
> any regression.  I don't see how rebuilding an extra hashtable in
> entry_dealloc(), i.e. while holding an exclusive lwlock, without saving any
> other work elsewhere, could be free?
> 
> Looking at the script, you have:
> echo "log_min_messages=debug1" >> $PGDATA/postgresql.conf; \
> 
> Is that really necessary?  Couldn't you upgrade the gc message to a higher
> level for your benchmark need, or expose some new counter in
> pg_stat_statements_info maybe?  Have you done the benchmark using a debug
> build or normal build?
> 
> 

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: i.e. and e.g.

2022-07-14 Thread Kyotaro Horiguchi
At Thu, 14 Jul 2022 09:40:25 +0700, John Naylor  
wrote in 
> On Wed, Jul 13, 2022 at 4:13 PM Kyotaro Horiguchi 
> wrote:
> >
> > At Wed, 13 Jul 2022 18:09:43 +0900 (JST), Kyotaro Horiguchi <
> horikyota@gmail.com> wrote in
> > > So, "e.g." (for example) in the message sounds like "that is", which I
> > > think is "i.e.".  It should be fixed if this is correct.  I'm not sure
> > > whether to keep using Latin-origin acronyms like this, but in the
> > > attached I used "i.e.".
> 
> I did my own quick scan and found one use of i.e. that doesn't really fit,
> in a sentence that has other grammatical issues:
> 
> -Due to the differences how ECPG works compared to Informix's
> ESQL/C (i.e., which steps
> +Due to differences in how ECPG works compared to Informix's ESQL/C
> (namely, which steps
>  are purely grammar transformations and which steps rely on the

Oh!

> I've pushed that in addition to your changes, thanks!

Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: First draft of the PG 15 release notes

2022-07-14 Thread John Naylor
Regarding this item:

"Allow hash lookup for NOT IN clauses with many constants (David Rowley,
James Coleman)
Previously the code always sequentially scanned the list of values."

The todo list has an entry titled "Planning large IN lists", which links to

https://www.postgresql.org/message-id/1178821226.6034.63.camel@goldbach

Did we already have a hash lookup for IN clauses with constants and the
above commit adds NOT IN? If so, maybe we have enough to remove this todo
item.
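
For context, this item is about queries of the following shape, where the
list of constants is long enough for the executor to build a hash table over
it instead of comparing every row against each value in turn (the list is
abbreviated here):

SELECT count(*)
FROM generate_series(1, 1000000) AS g(x)
WHERE x NOT IN (1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233 /* ... */);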

-- 
John Naylor
EDB: http://www.enterprisedb.com