Re: Synchronizing slots from primary to standby

2023-10-27 Thread Amit Kapila
On Fri, Oct 27, 2023 at 9:00 PM Drouvot, Bertrand
 wrote:
>
> On 10/27/23 10:35 AM, Amit Kapila wrote:
> > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
> >
> > I think even if we provide such an API, we need to have logic to get
> > the slots from the primary and create them.
>
> Yeah, my idea was to add an API (in addition to what is already in place).
>
> > Say, even if the user used
> > the APIs, there may still be some new slots that the sync worker needs
> > to create.
>
> Right.
>
> > I think it might be better to provide a view for users to
> > view the current state of sync. For example, in the above case, we can
> > say "waiting for the primary to advance remote LSN" or something like
> > that.
>
> We are already displaying the wait event "ReplSlotsyncPrimaryCatchup" in 
> pg_stat_activity
> so that might already be enough?
>

I am fine if the wait is already displayed in some form.

> My main idea was to be able to manually create/sync logical_slot2 in the test 
> case described in [1]
> without waiting for activity on logical_slot1.
>
> But another (better?) option might be to change our current algorithm during 
> slot creation on the
standby? (to avoid an "active" slot having to wait on an "inactive" one, like 
> described in [1]).
>

Yeah, I guess it would be better to tweak the algorithm in this case
such that slots that can't be created immediately are noted in a
separate list and we continue with the other remaining slots. Once we
are finished with all the other slots, this special list can be
traversed and we can attempt to create the remaining slots. OTOH, the
scenario you described doesn't sound like a frequent case to worry
about, but if we can deal with it without adding much
complexity then it would be good.
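
Roughly, I am thinking of something like the below (a sketch only;
RemoteSlot, remote_slot_ready(), sync_one_slot() and remote_slots are
made-up names here, not the actual worker code):

/*
 * Sketch: two-pass sync loop.  Slots whose remote LSN hasn't advanced far
 * enough yet are parked on a pending list instead of being waited on, and
 * are only retried after all the other slots have been processed.
 */
List       *pending = NIL;
ListCell   *lc;

foreach(lc, remote_slots)
{
    RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);

    if (remote_slot_ready(remote_slot))
        sync_one_slot(remote_slot);     /* create/advance it right away */
    else
        pending = lappend(pending, remote_slot);    /* park for second pass */
}

/* Second pass: attempt the slots we had to skip earlier. */
foreach(lc, pending)
{
    RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);

    sync_one_slot(remote_slot);         /* may still need to wait */
}

list_free(pending);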

-- 
With Regards,
Amit Kapila.




Re: PATCH: Add REINDEX tag to event triggers

2023-10-27 Thread jian he
On Fri, Oct 27, 2023 at 3:15 PM Michael Paquier  wrote:
>
> On Mon, Sep 04, 2023 at 08:00:52PM +0200, Jim Jones wrote:
> > LGTM. It applies and builds cleanly, all tests pass and documentation also
> > builds ok. The CFbot seems also much happier now :)
>
> +   /*
> +* Open and lock the relation.  ShareLock is sufficient since we only 
> need
> +* to prevent schema and data changes in it.  The lock level used here
> +* should match catalog's reindex_relation().
> +*/
> +   rel = try_table_open(relid, ShareLock);
>
> I was eyeing at 0003, and this strikes me as incorrect.  Sure, this
> matches what reindex_relation() does, but you've missed that
> CONCURRENTLY takes a lighter ShareUpdateExclusiveLock, and ShareLock
> conflicts with it.  See:
> https://www.postgresql.org/docs/devel/explicit-locking.html
>
> So, doesn't this disrupt a concurrent REINDEX?
> --
> Michael

ReindexPartitions calls ReindexMultipleInternal.
ReindexRelationConcurrently adds a reindex_event_trigger_collect call to
cover it (line 3869).
ReindexIndex calls reindex_event_trigger_collect (line 2853).

reindex_event_trigger_collect_relation is called in
ReindexMultipleInternal and in ReindexTable (line 2979).
Both of those calls are in the "concurrent is false" branches.

So it should be fine.




Re: maybe a type_sanity. sql bug

2023-10-27 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Oct 27, 2023 at 11:45:44AM +0800, jian he wrote:
>> The test seems to assume the following sql query should return zero row.
>> but it does not. I don't know much about the "relreplident" column.

> The problem is about relkind, as 'I' refers to a partitioned index.
> That is a legal value in pg_class.relkind, but we forgot to list it in
> this test.

Yeah, in principle this check should allow any permissible relkind
value.  In practice, because it runs so early in the regression tests,
there's not many values present.  I added a quick check and found that
type_sanity only sees these values:
 
 --  pg_class 
 -- Look for illegal values in pg_class fields
+select distinct relkind from pg_class order by 1;
+ relkind 
+-
+ i
+ r
+ t
+ v
+(4 rows)
+
 SELECT c1.oid, c1.relname
 FROM pg_class as c1
 WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR

We've had some prior discussions about moving type_sanity, opr_sanity
etc to run later when there's a wider variety of objects present.
I'm not sure about that being a great idea though --- for example,
there's a test that creates an intentionally incomplete opclass
and even leaves it around for pg_dump stress testing.  That'd
probably annoy opr_sanity if it ran after that one.

The original motivation for type_sanity and friends was mostly
to detect mistakes in the hand-rolled initial catalog contents,
and for that purpose it's fine if they run early.  Some of what
they check is now redundant with genbki.pl I think.

Anyway, we should fix this if only for clarity's sake.
I do not feel a need to back-patch though.

regards, tom lane




Re: doc: New cumulative stats subsystem obsoletes comment in maintenance.sgml

2023-10-27 Thread Bruce Momjian
On Fri, Aug 12, 2022 at 12:58:28PM -0700, David G. Johnston wrote:
> I dislike using the word accurate here now, it will be accurate, but we don't
> promise perfect timeliness.  So it needs to change:
> 
>  diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
> index 04a04e0e5f..360807c8f9 100644
> --- a/doc/src/sgml/maintenance.sgml
> +++ b/doc/src/sgml/maintenance.sgml
> @@ -652,9 +652,8 @@ vacuum insert threshold = vacuum base insert threshold +
> vacuum insert scale fac
>      tuples to be frozen by earlier vacuums.  The number of obsolete tuples 
> and
>      the number of inserted tuples are obtained from the cumulative statistics
> system;
>      it is a semi-accurate count updated by each UPDATE,
> -    DELETE and INSERT operation.  (It 
> is
> -    only semi-accurate because some information might be lost under heavy
> -    load.)  If the relfrozenxid value of the table
> +    DELETE and INSERT operation.
> +    If the relfrozenxid value of the table
>      is more than vacuum_freeze_table_age transactions old,
>      an aggressive vacuum is performed to freeze old tuples and advance
>      relfrozenxid; otherwise, only pages that have
> been modified

Done in master.

> However, also replace the remaining instance of "a semi-accurate count" with
> "an eventually-consistent count".

> ...it is an eventually-consistent count updated by each UPDATE, DELETE, and
> INSERT operation.

Also done.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.
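
Something along these lines is what I have in mind (a sketch only;
holds_acceptable_inplace_lock() is a placeholder for whatever lock set we
settle on, e.g. the proposed LOCKTAG_*_DEFINITION locks or an existing
conflicting lock, not proposed code):

/*
 * Sketch: at the top of heap_inplace_update(), complain if the backend
 * doesn't hold a lock that conflicts with transactional updaters such as
 * GRANT.  holds_acceptable_inplace_lock() is hypothetical; the real check
 * would test for the agreed-on lock(s).
 */
if (!holds_acceptable_inplace_lock(relation, tuple))
    elog(ERROR, "inplace update of \"%s\" attempted without an acceptable lock",
         RelationGetRelationName(relation));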

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.




Re: unnest multirange, returned order

2023-10-27 Thread Jeff Davis
On Fri, 2023-10-27 at 08:48 +0200, Laurenz Albe wrote:
> On Fri, 2023-10-13 at 15:33 -0400, Daniel Fredouille wrote:
> > sorry it took me some time to reply. Yes, the patch is perfect if
> > this is indeed the behavior.
> 
> I'm sending a reply to the hackers list so that I can add the patch
> to the commitfest.
> 
> Tiny as the patch is, I don't want it to fall between the cracks.

Committed with adjusted wording. Thank you!


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: [PATCH v1] [doc] polish the comments of reloptions

2023-10-27 Thread Bruce Momjian


Patch applied.

---

On Tue, Aug 30, 2022 at 11:56:30AM +0800, Junwang Zhao wrote:
> When adding an option, we have 5 choices (bool, integer, real, enum, string),
> so the comments seem stale.
> 
> There are some sentences missing *at ShareUpdateExclusiveLock*, this
> patch adds them to make the sentences complete.
> 
> One thing I'm not sure about is whether we should use *at ShareUpdateExclusiveLock*
> or *with ShareUpdateExclusiveLock*; please take a look.
> 
>  src/backend/access/common/reloptions.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/src/backend/access/common/reloptions.c
> b/src/backend/access/common/reloptions.c
> index 609329bb21..9e99868faa 100644
> --- a/src/backend/access/common/reloptions.c
> +++ b/src/backend/access/common/reloptions.c
> @@ -42,9 +42,9 @@
>   *
>   * To add an option:
>   *
> - * (i) decide on a type (integer, real, bool, string), name, default value,
> - * upper and lower bounds (if applicable); for strings, consider a validation
> - * routine.
> + * (i) decide on a type (bool, integer, real, enum, string), name, default
> + * value, upper and lower bounds (if applicable); for strings, consider a
> + * validation routine.
>   * (ii) add a record below (or use add__reloption).
>   * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
>   * (iv) add it to the appropriate handling routine (perhaps
> @@ -68,24 +68,24 @@
>   * since they are only used by the AV procs and don't change anything
>   * currently executing.
>   *
> - * Fillfactor can be set because it applies only to subsequent changes made 
> to
> - * data blocks, as documented in hio.c
> + * Fillfactor can be set at ShareUpdateExclusiveLock because it applies only 
> to
> + * subsequent changes made to data blocks, as documented in hio.c
>   *
>   * n_distinct options can be set at ShareUpdateExclusiveLock because they
>   * are only used during ANALYZE, which uses a ShareUpdateExclusiveLock,
>   * so the ANALYZE will not be affected by in-flight changes. Changing those
>   * values has no effect until the next ANALYZE, so no need for stronger lock.
>   *
> - * Planner-related parameters can be set with ShareUpdateExclusiveLock 
> because
> + * Planner-related parameters can be set at ShareUpdateExclusiveLock because
>   * they only affect planning and not the correctness of the execution. Plans
>   * cannot be changed in mid-flight, so changes here could not easily result 
> in
>   * new improved plans in any case. So we allow existing queries to continue
>   * and existing plans to survive, a small price to pay for allowing better
>   * plans to be introduced concurrently without interfering with users.
>   *
> - * Setting parallel_workers is safe, since it acts the same as
> - * max_parallel_workers_per_gather which is a USERSET parameter that doesn't
> - * affect existing plans or queries.
> + * Setting parallel_workers at ShareUpdateExclusiveLock is safe, since it 
> acts
> + * the same as max_parallel_workers_per_gather which is a USERSET parameter
> + * that doesn't affect existing plans or queries.
>   *
>   * vacuum_truncate can be set at ShareUpdateExclusiveLock because it
>   * is only used during VACUUM, which uses a ShareUpdateExclusiveLock,
> -- 
> 2.33.0
> 
> 
> -- 
> Regards
> Junwang Zhao

> From 4cef73a6f7ef4b59a6cc1cd9720b4e545bc36861 Mon Sep 17 00:00:00 2001
> From: Junwang Zhao 
> Date: Tue, 30 Aug 2022 11:33:14 +0800
> Subject: [PATCH v1] [doc] polish the comments of reloptions
> 
> 1. add the missing enum type and change the order to consistent
>with relopt_type
> 2. add some missing ShareUpdateExclusiveLock
> ---
>  src/backend/access/common/reloptions.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/src/backend/access/common/reloptions.c 
> b/src/backend/access/common/reloptions.c
> index 609329bb21..9e99868faa 100644
> --- a/src/backend/access/common/reloptions.c
> +++ b/src/backend/access/common/reloptions.c
> @@ -42,9 +42,9 @@
>   *
>   * To add an option:
>   *
> - * (i) decide on a type (integer, real, bool, string), name, default value,
> - * upper and lower bounds (if applicable); for strings, consider a validation
> - * routine.
> + * (i) decide on a type (bool, integer, real, enum, string), name, default
> + * value, upper and lower bounds (if applicable); for strings, consider a
> + * validation routine.
>   * (ii) add a record below (or use add__reloption).
>   * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
>   * (iv) add it to the appropriate handling routine (perhaps
> @@ -68,24 +68,24 @@
>   * since they are only used by the AV procs and don't change anything
>   * currently executing.
>   *
> - * Fillfactor can be set because it applies only to subsequent changes made 
> to
> - * data blocks, as documented in hio.c
> + * Fillfactor can be set at ShareUpdateExclusiveLock 

Re: Fix search_path for all maintenance commands

2023-10-27 Thread Jeff Davis
On Fri, 2023-07-21 at 15:32 -0700, Jeff Davis wrote:
> Attached is a new version.

Do we still want to do this?

Right now, the MAINTAIN privilege is blocking on some way to prevent
malicious users from abusing the MAINTAIN privilege and search_path to
acquire the table owner's privileges.

The approach of locking down search_path during maintenance commands
would solve the problem, but it means that we are enforcing search_path
in some contexts and not others. That's not great, but it's similar to
what we are doing when we ignore SECURITY INVOKER and run the function
as the table owner during a maintenance command, or (by default) for
subscriptions.

My attempts to more generally try to lock down search_path for
functions attached to tables didn't seem to get much consensus, so if
we do make an exception to lock down search_path for maintenance
commands only, it would stay an exception for the foreseeable future.

Thoughts?

Regards,
Jeff Davis





Re: [PATCH] minor bug fix for pg_dump --clean

2023-10-27 Thread Bruce Momjian


FYI, this was improved in a recent commit:

commit 75af0f401f
Author: Tom Lane 
Date:   Fri Sep 29 13:13:54 2023 -0400

Doc: improve description of dump/restore's --clean and --if-exists.

Try to make these option descriptions a little clearer for novices.
Per gripe from Attila Gulyás.

Discussion: 
https://postgr.es/m/169590536647.3727336.11070254203649648...@wrigleys.postgresql.org


---

On Mon, Oct 24, 2022 at 09:02:46AM +0200, Frédéric Yhuel wrote:
> On 10/24/22 03:01, Tom Lane wrote:
> > =?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Yhuel?=  writes:
> > > When using pg_dump (or pg_restore) with option "--clean", there is some
> > > SQL code to drop every objects at the beginning.
> > 
> > Yup ...
> > 
> > > The DROP statement for a view involving circular dependencies is :
> > > CREATE OR REPLACE VIEW [...]
> > > (see commit message of d8c05aff for a much better explanation)
> > > If the view is not in the "public" schema, and the target database is
> > > empty, this statement fails, because the schema hasn't been created yet.
> > > The attached patches are a TAP test which can be used to reproduce the
> > > bug, and a proposed fix. They apply to the master branch.
> > 
> > I am disinclined to accept this as a valid bug, because there's never
> > been any guarantee that a --clean script would execute error-free in
> > a database that doesn't match what the source database contains.
> > 
> > (The pg_dump documentation used to say that in so many words.
> > I see that whoever added the --if-exists option was under the
> > fond illusion that that fixes all cases, which it surely does not.
> > We need to back off the promises a bit there.)
> > 
> > An example of a case that won't execute error-free is if the view
> > having a circular dependency includes a column of a non-built-in
> > data type.  If you try to run that in an empty database, you'll
> > get an error from the CREATE OR REPLACE VIEW's reference to that
> > data type.  For instance, if I adjust your test case to make
> > the "payload" column be of type hstore, I get something like
> > 
> > psql:dumpresult.sql:22: ERROR:  type "public.hstore" does not exist
> > LINE 4: NULL::public.hstore AS payload;
> >^
> > 
> > The same type of failure occurs for user-defined functions and
> > operators that use a non-built-in type, and I'm sure there are
> > more cases in the same vein.  But it gets *really* messy if
> > the target database isn't completely empty, but contains objects
> > with different properties than the dump script expects; for example,
> > if the view being discussed here exists with a different column set
> > than the script thinks, or if the dependency chains aren't all the
> > same.
> > 
> > If this fix were cleaner I might be inclined to accept it anyway,
> > but it's not very clean at all --- for example, it's far from
> > obvious to me what are the side-effects of changing the filter
> > in RestoreArchive like that.  Nor am I sure that the schema
> > you want to create is guaranteed to get dropped again later in
> > every use-case.
> > 
> 
> Hi Tom, Viktoria,
> 
> Thank you for your review Viktoria!
> 
> Thank you for this detailed explanation, Tom! I didn't have great hope for
> this patch. I thought that the TAP test could be accepted, but now I can see
> that it is clearly useless.
> 
> 
> > So I think mainly what we ought to do here is to adjust the
> > documentation to make it clearer that --clean is not guaranteed
> > to work without errors unless the target database has the same
> > set of objects as the source.  --if-exists can reduce the set
> > of error cases, but not eliminate it.  Possibly we should be
> > more enthusiastic about recommending --create --clean (ie,
> > drop and recreate the whole database) instead.
> > 
> 
> I believe a documentation patch would be useful, indeed.
> 
> Best regards,
> Frédéric
> 
> 

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: request clarification on pg_restore documentation

2023-10-27 Thread Tom Lane
Bruce Momjian  writes:
> Does anyone have a suggestion on how to handle this issue?

It might be that the later decision to change the representation
of sequence dumps would make it possible to undo 4317e0246c
and go back to having --schema-only/--data-only be true aliases
for --section.  But it'd take some research and probably end up
causing some behavioral changes (e.g., trigger handling, as you note).

Much the same research would be needed if you just wanted to
document the current state of affairs more clearly.

The real issue here is that --schema-only/--data-only do a few
things that are not within --section's remit, such as trigger
adjustments.  Do we want to cause --section to have those effects
too?  I dunno.  Do we want to give up those extra behaviors?
Almost certainly not.

Either way, I'm not personally planning to put effort into that
anytime soon.

regards, tom lane




Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not.  I think this is what your B3 option is, but maybe
>> I misinterpreted.  It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock
infrastructure, rather than inventing new locktags (a choice that
spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making
heap_inplace_update protect itself, without high-level cooperation
from the possibly-interfering updater.  This is basically because
I'm still afraid that we're defining the problem too narrowly.
For one thing, I have nearly zero confidence that GRANT et al are
the only problematic source of conflicting transactional updates.
For another, I'm worried that some extension may be using
heap_inplace_update against a catalog we're not considering here.
I'd also like to find a solution that fixes the case of a conflicting
manual UPDATE (although certainly that's a stretch goal we may not be
able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple
header as just-updated in a way that regular heap_update could
recognize.  (For standard catalog updates, we'd then end up erroring
in simple_heap_update, which I think is fine.)  We can't update xmin,
because the VACUUM callers don't have an XID; but maybe there's some
other way?  I'm speculating about putting a funny value into xmax,
or something like that, and having heap_update check that what it
sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like
a mighty big lift.  All of the existing callers, except maybe
the johnny-come-lately dropdb usage, have solid documented reasons
to do it that way.

regards, tom lane




Re: request clarification on pg_restore documentation

2023-10-27 Thread Bruce Momjian


Does anyone have a suggestion on how to handle this issue?  The report
is a year old.

---

On Thu, Oct  6, 2022 at 10:28:15AM -0400, Bruce Momjian wrote:
> On Thu, Sep 29, 2022 at 02:30:13PM +, PG Doc comments form wrote:
> > The following documentation comment has been logged on the website:
> > 
> > Page: https://www.postgresql.org/docs/14/app-pgrestore.html
> > Description:
> > 
> > pg_restore seems to have two ways to restore data:
> > 
> > --section=data 
> > 
> > or
> > 
> >  --data-only
> > 
> > There is this cryptic warning that --data-only is "similar to but for
> > historical reasons different from" --section=data
> > 
> > But there is no further explanation of what those differences are or what
> > might be missed or different in your restore if you pick one option or the
> > other. Maybe one or the other option is the "preferred current way" and one
> > is the "historical way" or they are aimed at different types of use cases,
> > but that's not clear.
> 
> [Thread moved from docs to hackers because there are behavioral issues.]
> 
> Very good question.  I dug into this and found this commit which says
> --data-only and --section=data were equivalent:
> 
>   commit a4cd6abcc9
>   Author: Andrew Dunstan 
>   Date:   Fri Dec 16 19:09:38 2011 -0500
>   
>   Add --section option to pg_dump and pg_restore.
>   
>   Valid values are --pre-data, data and post-data. The option can be
>   given more than once. --schema-only is equivalent to
> -->   --section=pre-data --section=post-data. --data-only is equivalent
> -->   to --section=data.
>   
> and then this commit which says they are not:
> 
>   commit 4317e0246c
>   Author: Tom Lane 
>   Date:   Tue May 29 23:22:14 2012 -0400
>   
>   Rewrite --section option to decouple it from 
> --schema-only/--data-only.
>   
> -->   The initial implementation of pg_dump's --section option supposed 
> that the
> -->   existing --schema-only and --data-only options could be made 
> equivalent to
> -->   --section settings.  This is wrong, though, due to dubious but long 
> since
>   set-in-stone decisions about where to dump SEQUENCE SET items, as 
> seen in
>   bug report from Martin Pitt.  (And I'm not totally convinced there 
> weren't
>   other bugs, either.)  Undo that coupling and instead drive --section
>   filtering off current-section state tracked as we scan through the 
> TOC
>   list to call _tocEntryRequired().
>   
>   To make sure those decisions don't shift around and hopefully save 
> a few
>   cycles, run _tocEntryRequired() only once per TOC entry and save 
> the result
>   in a new TOC field.  This required minor rejiggering of ACL 
> handling but
>   also allows a far cleaner implementation of 
> inhibit_data_for_failed_table.
>   
>   Also, to ensure that pg_dump and pg_restore have the same behavior 
> with
>   respect to the --section switches, add _tocEntryRequired() 
> filtering to
>   WriteToc() and WriteDataChunks(), rather than trying to implement 
> section
>   filtering in an entirely orthogonal way in dumpDumpableObject().  
> This
>   required adjusting the handling of the special ENCODING and 
> STDSTRINGS
>   items, but they were pretty weird before anyway.
> 
> and this commit which made them closer:
> 
>   commit 5a39114fe7
>   Author: Tom Lane 
>   Date:   Fri Oct 26 12:12:42 2012 -0400
>   
>   In pg_dump, dump SEQUENCE SET items in the data not pre-data 
> section.
>   
>   Represent a sequence's current value as a separate TableDataInfo 
> dumpable
>   object, so that it can be dumped within the data section of the 
> archive
> -->   rather than in pre-data.  This fixes an undesirable inconsistency 
> between
> -->   the meanings of "--data-only" and "--section=data", and also fixes 
> dumping
>   of sequences that are marked as extension configuration tables, as 
> per a
>   report from Marko Kreen back in July.  The main cost is that we do 
> one more
>   SQL query per sequence, but that's probably not very meaningful in 
> most
>   databases.
> 
> Looking at the restore code, I see --data-only disabling triggers, while
> --section=data doesn't.  I also tested --data-only vs. --section=data in
> pg_dump for the regression database and saw the only differences as the
> creation and comments on large objects, e.g.,
> 
>   -- Name: 2121; Type: BLOB; Schema: -; Owner: postgres
>   --
>   
>   SELECT pg_catalog.lo_create('2121');
> 
> 
>   ALTER LARGE OBJECT 2121 OWNER TO postgres;
> 
>   --
>   -- Name: LARGE OBJECT 2121; Type: COMMENT; Schema: -; Owner: postgres
>   --
> 
>   COMMENT ON LARGE OBJECT 2121 IS 

Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a 
> >> lock
> >> in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-effort thing to catch conflicts with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.
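
In code form, the vac_update_datfrozenxid() side could look roughly like
this (a sketch only; LockDatabaseDefinition()/UnlockDatabaseDefinition()
are hypothetical wrappers around the proposed LOCKTAG_DATABASE_DEFINITION,
and fetch_latest_pg_database_tuple() stands in for a fresh catalog scan of
the already-opened pg_database relation):

/* Block until concurrent GRANT/REVOKE on this database commit or abort. */
LockDatabaseDefinition(MyDatabaseId, ExclusiveLock);        /* hypothetical */

/*
 * Fetch the latest pg_database tuple with a fresh catalog scan (not the
 * syscache copy, because of the detoasting issue mentioned upthread).
 */
dbtuple = fetch_latest_pg_database_tuple(relation, MyDatabaseId);   /* hypothetical */

/* ... recompute datfrozenxid/datminmxid and store them into dbtuple ... */

heap_inplace_update(relation, dbtuple);

/* Safe to release before end of transaction. */
UnlockDatabaseDefinition(MyDatabaseId, ExclusiveLock);      /* hypothetical */

/* ... continue with other steps, e.g. vac_truncate_clog() ... */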




Re: Fwd: Annoying corruption in PostgreSQL.

2023-10-27 Thread Tomas Vondra
On 10/27/23 23:10, Kirill Reshke wrote:
> 
> Sorry, it seems that I replied only to Tomas, so I'm forwarding the message.
> -- Forwarded message -
> From: Kirill Reshke 
> Date: Sat, 28 Oct 2023 at 02:06
> Subject: Re: Annoying corruption in PostgreSQL.
> To: Tomas Vondra 
> 
> 
> Hi Tomas!
> 
> Thanks for the explanation!
> 
> 1) 11 to 15. This week there were 14.9 and 12.16 reproductions. Two
> weeks ago there were 15.4 and 11.21 repros. Unfortunately, there is no
> info about repros which are a month old or more, but I found in our work
> chats that there was a repro on PostgreSQL 13 in April, minor version
> unknown. Overall, we have observed this issue for over a year on all pgdg
> supported versions.
> 
> 2) Searching our bug tracker, I have found:
> 
> 1. missing chunk number 0 for toast value 592966012 in
> pg_toast_563953150 (some user relation)
> 2. missing chunk number 0 for toast value 18019714 in
> pg_toast_17706963 (some user relation)
> 3. missing chunk number 0 for toast value 52677740 in pg_toast_247794
> 
> So, this is not always pg_catalog. These toast tables belonged to some
> user relations.
> 

OK.

> 3) It is always about VACUUM FULL (FREEZE/VERBOSE/ANALYZE) / autovacuum.
> 

Hmm, so it's always one of these VACUUM processes complaining?

> We have physical backups and we can PITR. But restoring a cluster to
> some point in the past is a bit of a different task: we need our
> client's approval for these operations, since we are a Managed DBs Cloud
> Provider. Will try to ask someone.
> 

That's what I'd try, to get some sense of what state the vacuum saw,
what were the transactions modifying the TOAST + parent table doing,
etc, how much stuff the transactions did, if maybe there are some
aborts, that sort of thing. Hard to try reproducing this without any
knowledge of the workload. The WAL might tell us if

How often do you actually see this issue? Once or twice a week?

Are you using some extensions that might interfere with this?

And you mentioned you're running a large number of clusters - are those
running similar workloads, or are they unrelated?

Actually, can you elaborate on why you are running VACUUM FULL etc.? That
generally should not be necessary, so maybe we can learn something
about your workload.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pre-proposal: unicode normalized text

2023-10-27 Thread Jeff Davis
On Mon, 2023-10-16 at 20:32 -0700, Jeff Davis wrote:
> On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> > We need to be careful about precise terminology.  "Valid" has a
> > defined 
> > meaning for Unicode.  A byte sequence can be valid or not as UTF-
> > 8. 
> > But 
> > a string containing unassigned code points is not not-"valid" as
> > Unicode.
> 
> New patch attached, function name is "unicode_assigned".

I plan to commit something like v3 early next week unless someone else
has additional comments or I missed a concern.

Regards,
Jeff Davis





Fwd: Annoying corruption in PostgreSQL.

2023-10-27 Thread Kirill Reshke
Sorry, it seems that I replied only to Tomas, so I'm forwarding the message.
-- Forwarded message -
From: Kirill Reshke 
Date: Sat, 28 Oct 2023 at 02:06
Subject: Re: Annoying corruption in PostgreSQL.
To: Tomas Vondra 


Hi Tomas!

Thanks for the explanation!

1) 11 to 15. This week there were 14.9 and 12.16 reproductions. Two weeks
ago there were 15.4 and 11.21 repros. Unfortunately, there is no info about
repros which are a month old or more, but I found in our work chats that
there was a repro on PostgreSQL 13 in April, minor version unknown.
Overall, we have observed this issue for over a year on all pgdg-supported
versions.

2) Searching our bug tracker, I have found:

1. missing chunk number 0 for toast value 592966012 in pg_toast_563953150
(some user relation)
2. missing chunk number 0 for toast value 18019714 in pg_toast_17706963
(some user relation)
3. missing chunk number 0 for toast value 52677740 in pg_toast_247794

So, this is not always pg_catalog. These toast tables belonged to some
user relations.

3) It is always about VACUUM FULL (FREEZE/VERBOSE/ANALYZE) / autovacuum.

We have physical backups and we can PITR. But restoring a cluster to some
point in the past is a bit of a different task: we need our client's
approval for these operations, since we are a Managed DBs Cloud Provider.
Will try to ask someone.

Best regards


On Fri, 27 Oct 2023 at 23:28, Tomas Vondra 
wrote:

>
>
> On 10/27/23 14:19, Kirill Reshke wrote:
> > Hi hackers!
> >
> > We run a large amount of PostgreSQL clusters in our production. They
> > differ by versions (we have 11-16 pg), load, amount of data, schema,
> > etc. From time to time, postgresql corruption happens. It says
> > ERROR,XX001,"missing chunk number 0 for toast value 18767319 in
> > pg_toast_2619",,"vacuum full ;"
> >
> > in logs. The missing chunk number is almost always equal to zero, while
> > other values vary. There are no known patterns that trigger this
> > issue. Moreover, if we try to rerun the VACUUM statement against the
> > relations from a log message, it succeeds all the time.  So, we just
> > ignore these errors. Maybe it is just some weird data race?
> >
> > We don't know how to trigger this problem, or why it occurs. I'm not
> > asking you to resolve this issue, but to help with debugging. What can
> > we do to deduce failure reasons? Maybe we can add more logging somewhere
> > (we can deploy a special patched PostgreSQL version everywhere), to have
> > more information about the issue, when it happens next time?
> >
>
> For starters, it'd be good to know something about the environment, and
> stuff that'd tell us if there's some possible pattern:
>
> 1) Which exact PG versions are you observing these errors on?
>
> 2) In the error example you shared it's pg_toast_2619, which is the
> TOAST table for pg_statistic (probably). Is it always this relation? Or
> what relations you noticed this for?
>
> 3) What kind of commands are triggering this? In the example it seems to
> be vacuum full. Did you see it for other commands too? People generally
> don't do VACUUM FULL very often, particularly not in environments with
> concurrent activity.
>
> Considering you don't know what's causing this, or what to look for, I
> think it might be interesting to use pg_waldump, and investigate what
> happened to the page containing the TOAST chunk and to the page
> referencing it. Do you have physical backups and ability to do PITR?
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-10-27 Thread Jeff Davis
On Thu, 2023-09-21 at 14:33 -0700, Jeff Davis wrote:
> I have attached an updated patch. Changes:

Withdrawing this from CF due to lack of consensus.

I'm happy to resume this discussion if someone sees a path forward to
make it easier to secure the search_path; or at least help warn users
when a function without a secured search_path is being used unsafely.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-27 Thread Jeff Davis
On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote:
> 
> Almost all the available backend
> page_read callbacks read_local_xlog_page_no_wait,
> read_local_xlog_page, logical_read_xlog_page except XLogPageRead
> (which is used for recovery when WAL buffers aren't used at all) have
> one thing in common, that is, WALRead(). Therefore, it seemed a
> natural choice for me to call XLogReadFromBuffers. In other words,
> I'd
> say it's the responsibility of page_read callback implementers to
> decide if they want to read from WAL buffers or not and hence I don't
> think we need a separate XLogReaderRoutine.

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now. I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.
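
To illustrate what I mean by keeping it a separate, explicit step, a
page_read callback could do something like the following (a sketch only;
I'm assuming an XLogReadFromBuffers() signature here, which may not match
your patch, and the variable names just follow the usual callback
conventions):

/*
 * Sketch of an explicit two-step read: try the WAL buffers first, then
 * fall back to WALRead() if the page wasn't (fully) available there.
 */
Size        nread;

nread = XLogReadFromBuffers(cur_page, targetPagePtr, XLOG_BLCKSZ, tli);

if (nread < XLOG_BLCKSZ)
{
    WALReadError errinfo;

    if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &errinfo))
        WALReadRaiseError(&errinfo);
}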

> If someone wants to read unflushed WAL, the typical way to implement
> it is to write a new page_read callback
> read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> similar without WALRead() but just the new function
> XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

> 
> I prefer to keep the partial hit handling as-is just
> in case:
> 

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)? In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

> Yes, I proposed that idea in another thread -
> https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> .
> If that looks okay, I can add it to the next version of this patch
> set.

The code in the email above still shows a call to:

  /*
   * Requested WAL is available in WAL buffers, so recheck the existence
   * under the WALBufMappingLock and read if the page still exists,
   * otherwise return.
   */
  LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

  endptr1 = XLogCtl->xlblocks[idx];
  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr1)
  break;

  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_read_this_loop);
  ...
  pg_read_barrier();
  endptr2 = XLogCtl->xlblocks[idx];
  if (expectedEndPtr != endptr2)
  break;

  ntotal += bytes_read_this_loop;
  /* success; move on to next page */

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)
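
For example (a sketch only, assuming xlblocks were changed to an array of
pg_atomic_uint64, which would also settle the alignment question; otherwise
this is the same recheck pattern as above):

  endptr1 = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
  /* Requested WAL isn't (or is no longer) in WAL buffers. */
  if (expectedEndPtr != endptr1)
  break;

  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_read_this_loop);
  ...
  pg_read_barrier();

  endptr2 = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
  if (expectedEndPtr != endptr2)
  break;

  ntotal += bytes_read_this_loop;
  /* success; move on to next page */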
> 

> 
> FWIW, I found heapam.c using unlikely() extensively for safety
> checks.

OK, I won't object to the use of unlikely(), though I typically don't
use it without a fairly strong reason to think I should override what
the compiler thinks and/or what branch predictors can handle.

In this case, I think some of those errors are not really necessary
anyway, though:

  * XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.
  * Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).
  * An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

Regards,
Jeff Davis








Re: [17] Special search_path names "!pg_temp" and "!pg_catalog"

2023-10-27 Thread Jeff Davis
On Thu, 2023-10-26 at 16:28 -0500, Nathan Bossart wrote:
> On Fri, Aug 18, 2023 at 02:44:31PM -0700, Jeff Davis wrote:
> > +    SET search_path = admin, "!pg_temp";
> 
> I think it's unfortunate that these new identifiers must be quoted.  I
> wonder if we could call these something like "no_pg_temp".  *shrug*

Do you, overall, find this feature useful?

Most functions don't need pg_temp, so it feels cleaner to exclude it.
But pg_temp is ignored for function/op lookup anyway, so functions
won't be exposed to search_path risks related to pg_temp unless they
are accessing tables.

If my proposal for the SEARCH clause got more support, I'd be more
excited about this feature because it could be set implicitly as part
of a safe search_path. Without the SEARCH clause, the only way to set
"!pg_temp" is by typing it out, and I'm not sure a lot of people will
actually do that.

Regards,
Jeff Davis





Re: race condition in pg_class

2023-10-27 Thread Tom Lane
Noah Misch  writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a 
>> lock
>> in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient
> relcache builds.  Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it
seems to me that any call of heap_inplace_update brings some
risk of this kind.  Excluding the bootstrap-mode-only usage in
create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's
  relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's
  relpages, reltuples, relallvisible, relhasindex, relhasrules,
  relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's
  datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases
as GRANTs on relations.  Also, it looks like we can lose
knowledge of the presence of rules and triggers, which seems
nearly as bad as forgetting about indexes.  The rest of these
updates might not be correctness-critical, although I wonder
how bollixed things could get if we forget an advancement of
relfrozenxid or datfrozenxid (especially if the calling
transaction goes on to make other changes that assume that
the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that
it cannot use the syscache copy of a tuple as the basis for in-place
update, because syscache will have detoasted any toastable fields.
These other callers are ignoring that, which seems like it should
result in heap_inplace_update failing with "wrong tuple length".
I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to
make sure that it's operating on the latest version of the tuple
(including, I guess, waiting for an uncommitted update?) and throw
error if not.  I think this is what your B3 option is, but maybe
I misinterpreted.  It might be better to throw error immediately
instead of waiting to see if the other updater commits.

regards, tom lane




Bug fix: update-po for PGXS does not work

2023-10-27 Thread Ryo Matsumura (Fujitsu)
Hi hackers,

I found that 'make update-po' for PGXS does not work.
Even if I execute 'make update-po', xx.po.new is not generated.
I have not tested or checked the meson build system, but I am posting this tentatively.

I attached a patch and a test set.
'update-po' tries to find *.po files under $top_srcdir, but there are no po
files in a PGXS build because $top_srcdir is the install directory.

Please check it.

Thank you.

Best Regards
Ryo Matsumura


update-po-test.tar.gz
Description: update-po-test.tar.gz


fix-update-pgxs-po.patch
Description: fix-update-pgxs-po.patch


Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating index on relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
transactional pg_class updates without holding some stronger lock.  New
asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
Anything performing a transactional update of a pg_class row would acquire
the lock in exclusive mode before fetching the old tuple and hold it till
end of transaction.  relhasindex=true in-place updates would acquire it
the same way, but they would release it after the inplace update.  I
expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
the same key as LOCKTAG_RELATION.  This has less blocking than the
previous option, but it's more complicated to explain to both users and
developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
tuple versions.  Like the previous option, this would require new locking,
but the new lock would not need to persist till end of xact.  It would be
even more complicated to explain to users and developers.  (If this is
promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
commands on the same table would block each other.  If we did it the way
most DDL does today, they'd get "tuple concurrently updated" failures
after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
transactional updates of pg_class set relhasindex=true pessimistically.
(VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
nontransactional columns would help.  I'm ruling that out as too hard to
back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?

Thanks,
nm




Re: Annoying corruption in PostgreSQL.

2023-10-27 Thread Tomas Vondra



On 10/27/23 14:19, Kirill Reshke wrote:
> Hi hackers!
> 
> We run a large amount of PostgreSQL clusters in our production. They
> differ by versions (we have 11-16 pg), load, amount of data, schema,
> etc. From time to time, postgresql corruption happens. It says
> ERROR,XX001,"missing chunk number 0 for toast value 18767319 in
> pg_toast_2619",,"vacuum full ;"
> 
> in logs. The missing chunk number is almost always equal to zero, while
> other values vary. There are no known patterns that trigger this
> issue. Moreover, if we try to rerun the VACUUM statement against the
> relations from a log message, it succeeds all the time.  So, we just
> ignore these errors. Maybe it is just some weird data race?
> 
> We don't know how to trigger this problem, or why it occurs. I'm not
> asking you to resolve this issue, but to help with debugging. What can
> we do to deduce failure reasons? Maybe we can add more logging somewhere
> (we can deploy a special patched PostgreSQL version everywhere), to have
> more information about the issue, when it happens next time?
> 

For starters, it'd be good to know something about the environment, and
stuff that'd tell us if there's some possible pattern:

1) Which exact PG versions are you observing these errors on?

2) In the error example you shared it's pg_toast_2619, which is the
TOAST table for pg_statistic (probably). Is it always this relation? Or
what relations you noticed this for?

3) What kind of commands are triggering this? In the example it seems to
be vacuum full. Did you see it for other commands too? People generally
don't do VACUUM FULL very often, particularly not in environments with
concurrent activity.

Considering you don't know what's causing this, or what to look for, I
think it might be interesting to use pg_waldump, and investigate what
happened to the page containing the TOAST chunk and to the page
referencing it. Do you have physical backups and ability to do PITR?
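
For example, something along these lines -- a rough sketch only, where the
segment names and the tablespace/database/relfilenode numbers are
placeholders you would have to look up for the affected TOAST table, and
--relation requires pg_waldump from v15 or later (on older releases you
would grep the full output instead):

  # find the file node of the TOAST table, in the affected database
  psql -c "SELECT pg_relation_filepath('pg_toast.pg_toast_2619')"

  # dump the WAL records touching that relation around the failure
  pg_waldump -p "$PGDATA/pg_wal" \
      --relation=1663/16384/2840 \
      000000010000000A00000020 000000010000000A0000002F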


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Enderbury Island disappeared from timezone database

2023-10-27 Thread Tom Lane
Victor Wagner  writes:
> Tom Lane  writes:
>> Fun.  I bet that breaks more than just Pacific/Enderbury.
>> Can you try changing that entry to Pacific/Kanton, and repeat?

> I did. No more problems. 
> I.e. I've invoked
> sed -i 's/Enderbury/Kanton/' $prefix/share/timezonesets/*
> and rerun tests. No failures.

I was concerned about the non-Default timezonesets too, but
having now spun up a copy of Ubuntu 23.10 I see that those
work fine once Default is fixed.  So indeed this is the only
zone causing us problems.  That's probably because only a
relatively small fraction of the timezonesets entries depend
explicitly on named zones --- most of them are just numeric
UTC offsets.

Anyway, looking into the tzdata NEWS file I found

Release 2021b - 2021-09-24 16:23:00 -0700

Rename Pacific/Enderbury to Pacific/Kanton.  When we added
Enderbury in 1993, we did not know that it is uninhabited and that
Kanton (population two dozen) is the only inhabited location in
that timezone.  The old name is now a backward-compatibility link.

This means that if we substitute Kanton for Enderbury, things
will work fine against tzdata 2021b or later, but will fail in
the reverse way against older tzdata sets.  Do we want to
bet that everybody in the world has up-to-date tzdata installed?
I guess the contract for using --with-system-tzdata is that it's
up to you to maintain that, but still I don't like the odds.

The alternative I'm wondering about is whether to just summarily
remove the PHOT entry from timezonesets/Default.  It's a made-up
zone abbreviation in the first place, and per the above NEWS entry,
there's only a couple dozen people in the world who might even
be candidates to consider using it.  It seems highly likely that
nobody would care if we just dropped it from the Default list.
(We could keep the Pacific.txt entry, although re-pointing it
to Pacific/Kanton seems advisable.)

regards, tom lane




Re: Add recovery to pg_control and remove backup_label

2023-10-27 Thread David G. Johnston
On Fri, Oct 27, 2023 at 7:10 AM David Steele  wrote:

> On 10/26/23 17:27, David G. Johnston wrote:
>
> > Can we not figure out some way to place the relevant files onto the
> > server somewhere so that a simple "cp" command would work?  Have
> > pg_backup_stop return paths instead of contents, those paths being
> > "$TEMP_DIR"//pg_control.conf (and
> > tablespace_map)
>
> Nobody has been able to figure this out, and some of us have been
> thinking about it for years. It just doesn't seem possible to reliably
> tell the difference between a cluster that was copied and one that
> simply crashed.
>
> If cp is really the backup tool being employed, I would recommend using
> pg_basebackup. cp has flaws that could lead to corruption, and of course
> does not at all take into account the archive required to make a backup
> consistent, directories to be excluded, the order of copying pg_control
> on backup from standby, etc., etc.
>
>
Let me modify that to make it a bit clearer: I actually wouldn't care if
pg_backup_end outputs an entire binary pg_control file as part of the SQL
resultset.

My proposal would be to, in addition, place in the temporary directory on
the server, Postgres-written versions of pg_control and tablespace_map as
part of the pg_backup_end processing.  The client software would then have
a choice.  Write the contents of the SQL resultset to newly created binary
mode files in the destination, or, copy the server-written files from the
temporary directory to the destination.

That said, I'm starting to dislike that idea myself.  It only really makes
sense if the files could be placed in the data directory but that isn't
doable given concurrent backups and not wanting to place the source server
into an inconsistent state.

David J.


Re: BRIN minmax multi - incorrect distance for infinite timestamp/date

2023-10-27 Thread Tomas Vondra
FWIW I've cleaned up and pushed all the patches we came up with this
thread. And I've backpatched all of them to 14+.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Enderbury Island disappeared from timezone database

2023-10-27 Thread Victor Wagner
On Fri, 27 Oct 2023 11:17:03 -0400
Tom Lane  writes:

> Victor Wagner  writes:
> > Tom Lane  writes:
> >> Did Ubuntu decide to remove *all* backzone links from their data?
> >> Or just that one?  Either way, I think they're going to get a
> >> tsunami of pushback pretty quickly.  People like their obsolete
> >> zone names.  
> 
> > They split tzdata packages into tzdata and tzdata-legacy (just for
> > those who like obsolete zone names), and into latter one gone 121
> > links, not counting "right" subdirectory.  
> 
> Fun.  I bet that breaks more than just Pacific/Enderbury.
> Can you try changing that entry to Pacific/Kanton, and repeat?

I did. No more problems. 

I.e. I've invoked

sed -i 's/Enderbury/Kanton/' $prefix/share/timezonesets/*

and rerun tests. No failures.

It seems that Pacific/Enderbury was the only obsolete name which got
its way into the abbreviations list.


> And then check the non-Default timezonesets lists too?

Enderbury appears in two files in the timezonesets - Default
and Pacific.txt.

-- 
   Victor Wagner 




Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-10-27 Thread Laurenz Albe
On Fri, 2023-10-27 at 11:34 +0200, Michael Banck wrote:
> On Fri, Oct 27, 2023 at 09:03:04AM +0200, Laurenz Albe wrote:
> > On Fri, 2022-11-04 at 10:49 +0100, Laurenz Albe wrote:
> > > On Thu, 2022-11-03 at 11:32 +0100, Laurenz Albe wrote:
> > > > On Wed, 2022-11-02 at 19:29 +, David Burns wrote:
> > > > > Some additional clarity in the versions 14/15 documentation
> > > > > would be helpful specifically surrounding the "target_role"
> > > > > clause for the ALTER DEFAULT PRIVILEGES command.  To the
> > > > > uninitiated, the current description seems vague.  Maybe
> > > > > something like the following would help:
> > > 
> > > After some more thinking, I came up with the attached patch.
> > 
> I think something like this is highly useful because I have also seen
> people very confused why default privileges are not applied.
> 
> However, maybe it could be made even clearer if also the main
> description is amended, like
> 
> "You can change default privileges only for objects that will be created
> by yourself or by roles that you are a member of (via target_role)."
> 
> or something.

True.  I have done that in the attached patch.
In this patch, it is mentioned *twice* that ALTER DEFAULT PRIVILEGES
only affects objects created by the current user.  I thought that
would not harm, but if it is too redundant, I can remove the second
mention.
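
For readers following the thread, a minimal example of the behaviour being
documented (the roles and tables here are made up, and alice/carol are
assumed to be allowed to create objects in schema public):

  SET ROLE alice;
  ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO bob;
  CREATE TABLE alice_t (x int);  -- bob can SELECT from this one
  RESET ROLE;
  SET ROLE carol;
  CREATE TABLE carol_t (x int);  -- bob gets nothing; carol's defaults were
                                 -- never touched by alice's statement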

Yours,
Laurenz Albe
From 6c618553cc21639e774f6fd108423134139bfc0a Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 27 Oct 2023 17:44:19 +0200
Subject: [PATCH] Improve ALTER DEFAULT PRIVILEGES documentation

Clarify that default privileges are only applied to objects
created by the target role.  This has been a frequent source
of misunderstandings.

Per request from David Burns.

Author: Laurenz Albe
Reviewed-by: Michael Banck
Discussion: https://postgr.es/m/LV2PR12MB5725F7C1B8EB2FC38829F276E7399%40LV2PR12MB5725.namprd12.prod.outlook.com
---
 doc/src/sgml/ref/alter_default_privileges.sgml | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/alter_default_privileges.sgml b/doc/src/sgml/ref/alter_default_privileges.sgml
index f1d54f5aa3..cf0ffa9c49 100644
--- a/doc/src/sgml/ref/alter_default_privileges.sgml
+++ b/doc/src/sgml/ref/alter_default_privileges.sgml
@@ -90,7 +90,10 @@ REVOKE [ GRANT OPTION FOR ]
   
ALTER DEFAULT PRIVILEGES allows you to set the privileges
that will be applied to objects created in the future.  (It does not
-   affect privileges assigned to already-existing objects.)  Currently,
+   affect privileges assigned to already-existing objects.)  ALTER
+   DEFAULT PRIVILEGES changes default privileges only for objects
+   that will be created by the user that executed the statement (or by
+   target_role, if specified).  Currently,
only the privileges for schemas, tables (including views and foreign
tables), sequences, functions, and types (including domains) can be
altered.  For this command, functions include aggregates and procedures.
@@ -138,6 +141,11 @@ REVOKE [ GRANT OPTION FOR ]
  
   The name of an existing role of which the current role is a member.
   If FOR ROLE is omitted, the current role is assumed.
+  Default privileges are only changed for new objects created by the
+  target_role.  There is no way to set default
+  privileges for objects created by arbitrary roles; for that, you'd have
+  to run ALTER DEFAULT PRIVILEGES for each role that can
+  create objects.
  
 

-- 
2.41.0



Re: Synchronizing slots from primary to standby

2023-10-27 Thread Drouvot, Bertrand

Hi,

On 10/27/23 10:35 AM, Amit Kapila wrote:

On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand

I think even if we provide such an API, we need to have logic to get
the slots from the primary and create them.


Yeah, my idea was to add an API (in addition to what is already in place).


Say, even if the user used
the APIs, there may still be some new slots that the sync worker needs
to create. 


Right.


I think it might be better to provide a view for users to
view the current state of sync. For example, in the above case, we can
say "waiting for the primary to advance remote LSN" or something like
that.


We are already displaying the wait event "ReplSlotsyncPrimaryCatchup" in 
pg_stat_activity
so that might already be enough?

My main idea was to be able to manually create/sync logical_slot2 in the test 
case described in [1]
without waiting for activity on logical_slot1.

But another (better?) option might be to change our current algorithm during 
slot creation on the
standby? (to avoid an "active" slot having to wait on a "inactive" one, like 
described in [1]).

[1]: 
https://www.postgresql.org/message-id/afe4ab6c-dde3-48ea-acd8-6f6052c7b8fd%40gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-10-27 Thread Drouvot, Bertrand

Hi,

On 10/27/23 10:51 AM, shveta malik wrote:

On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
 wrote:

I discussed this with my colleague Hou-San and we think that one
possibility could be to somehow accelerate the increment of
restart_lsn on primary.  This can be achieved by connecting to the
remote and executing pg_log_standby_snapshot() at reasonable intervals
while waiting on standby during slot creation. This may increase speed
to a reasonable extent w/o having to wait for the user or bgwriter to
do the same for us. The current logical decoding uses a similar
approach to speed up the slot creation.  I refer to usage of
LogStandbySnapshot in SnapBuildWaitSnapshot() and
ReplicationSlotReserveWal()).
Thoughts?



I think these are 2 distinct areas.

My concern was more when there is no activity at all on a newly
created slot on the primary. The slot is created on the standby,
but then we loop until there is activity on this slot on the
primary.

That's the test case I described in [1]

[1]: 
https://www.postgresql.org/message-id/afe4ab6c-dde3-48ea-acd8-6f6052c7b8fd%40gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Enderbury Island disappeared from timezone database

2023-10-27 Thread Tom Lane
Victor Wagner  writes:
> Tom Lane  writes:
>> Did Ubuntu decide to remove *all* backzone links from their data?
>> Or just that one?  Either way, I think they're going to get a tsunami
>> of pushback pretty quickly.  People like their obsolete zone names.

> They split tzdata packages into tzdata and tzdata-legacy (just for
> those who like obsolete zone names), and into latter one gone 121 links,
> not counting "right" subdirectory.

Fun.  I bet that breaks more than just Pacific/Enderbury.
Can you try changing that entry to Pacific/Kanton, and repeat?
And then check the non-Default timezonesets lists too?

regards, tom lane




Re: Synchronizing slots from primary to standby

2023-10-27 Thread Drouvot, Bertrand

Hi,

On 10/27/23 11:56 AM, shveta malik wrote:

On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
 wrote:


Hi,

On 10/25/23 5:00 AM, shveta malik wrote:

On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
 wrote:


Hi,

On 10/23/23 2:56 PM, shveta malik wrote:

On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
 wrote:



We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if 
there
is new synced slot(s) to be created on the standby. Do we want to keep this 
behavior
for V1?



I think for the slotsync workers case, we should reduce the naptime in
the launcher to say 30sec and retain the default one of 3mins for
subscription apply workers. Thoughts?



Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
API on the standby that would refresh the list of sync slot at wish, thoughts?



Do you mean API to refresh list of DBIDs rather than sync-slots?
As per current design, launcher gets DBID lists for all the failover
slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.


I mean an API to get a newly created slot on the primary being created/synced on
the standby at wish.

Also let's imagine this scenario:

- create logical_slot1 on the primary (and don't start using it)

Then on the standby we'll get things like:

2023-10-25 08:33:36.897 UTC [740298] LOG:  waiting for remote slot 
"logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot LSN 
(0/C0049530) and and catalog xmin (754)

That's expected and due to the fact that ReplicationSlotReserveWal() does set 
the slot
restart_lsn to a value < at the corresponding restart_lsn slot on the primary.

- create logical_slot2 on the primary (and start using it)

Then logical_slot2 won't be created/synced on the standby until there is 
activity on logical_slot1 on the primary
that would produce things like:
2023-10-25 08:41:35.508 UTC [740298] LOG:  wait over for remote slot 
"logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed 
local slot LSN (0/C0049530) and catalog xmin (754)



Slight correction to above. As soon as we start activity on
logical_slot2, it will impact all the slots on primary, as the WALs
are consumed by all the slots. So even if there is activity on
logical_slot2, logical_slot1 creation on standby will be unblocked and
it will then move to logical_slot2 creation. eg:

--on standby:
2023-10-27 15:15:46.069 IST [696884] LOG:  waiting for remote slot
"mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local
slot LSN (0/3C979A8) and and catalog xmin (756)

on primary:
newdb1=# select now();
now
--
  2023-10-27 15:15:51.504835+05:30
(1 row)

--activity on mysubnew1_3
newdb1=# insert into tab1_3 values(1);
INSERT 0 1
newdb1=# select now();
now
--
  2023-10-27 15:15:54.651406+05:30


--on standby, mysubnew1_1 is unblocked.
2023-10-27 15:15:56.223 IST [696884] LOG:  wait over for remote slot
"mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now
passed local slot LSN (0/3C979A8) and catalog xmin (756)

My Setup:
mysubnew1_1 -->mypubnew1_1 -->tab1_1
mysubnew1_3 -->mypubnew1_3-->tab1_3



Agree with your test case, but in my case I was not using pub/sub.

I was not clear, so when I said:


- create logical_slot1 on the primary (and don't start using it)


I meant don't start decoding from it (like using pg_recvlogical() or
pg_logical_slot_get_changes()).

By using pub/sub the "don't start using it" is not satisfied.

My test case is:

"
SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 
'test_decoding', false, true, true);
SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 
'test_decoding', false, true, true);
pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_stat_statements and "IN" conditions

2023-10-27 Thread Dmitry Dolgov
> On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
>  typedef struct ArrayExpr
>  {
> + pg_node_attr(custom_query_jumble)
> +
>
> Hmm.  I am not sure that this is the best approach
> implementation-wise.  Wouldn't it be better to invent a new
> pg_node_attr (these can include parameters as well!), say
> query_jumble_merge or query_jumble_agg_location that aggregates all
> the parameters of a list to be considered as a single element.  To put
> it short, we could also apply the same property to other parts of a
> parsed tree, and not only an ArrayExpr's list.

Sounds like an interesting idea, something like:

typedef struct ArrayExpr
{
...
List   *elements pg_node_attr(query_jumble_merge);

to replace simple JUMBLE_NODE(elements) with more elaborated logic.

>  /* GUC parameters */
>  extern PGDLLIMPORT int compute_query_id;
> -
> +extern PGDLLIMPORT bool query_id_const_merge;
>
> Not much a fan of this addition as well for an in-core GUC.  I would
> suggest pushing the GUC layer to pg_stat_statements, maintaining the
> computation method to use as a field of JumbleState as I suspect that
> this is something we should not enforce system-wide, but at
> extension-level instead.

I also do not particularly like an extra GUC here, but as far as I can
tell, to make it a pg_stat_statements-only GUC it has to be something
similar to EnableQueryId (e.g. EnableQueryConstMerging) that will be
called from pgss. Does this sound better?

> + /*
> +  * Indicates the constant represents the beginning or the end of a 
> merged
> +  * constants interval.
> +  */
> + boolmerged;
>
> Not sure that this is the best thing to do either.  Instead of this
> extra boolean flag, could it be simpler if we switch LocationLen so as
> we track the start position and the end position of a constant in a
> query string, so as we'd use one LocationLen for a whole set of Const
> nodes in an ArrayExpr?  Perhaps this could just be a refactoring piece
> of its own?

Sounds interesting as well, but it seems to me there is a catch. I'll
try to elaborate, bear with me:

* if the start and the end positions of a constant mean the first and the
last character representing it, we need the textual length of the
constant in the query to be able to construct such a LocationLen.  The
lengths are calculated in pg_stat_statements later, not in JumbleQuery,
and it uses parser for that. Doing all of this in JumbleQuery doesn't
sound reasonable to me.

* if instead we talk about the start and the end positions in a
set of constants, that would mean locations of the first and the last
constants in the set, and everything seems fine. But for such
LocationLen to represent a single constant (not a set of constants), it
means that only the start position would be meaningful, the end position
will not be used.

The second approach comes somewhat close to being simpler than the merge flag,
but assumes the ugliness for a single constant. What do you think about
this?

> + /*
> +  * If the first expression is a constant, verify if the following 
> elements
> +  * are constants as well. If yes, the list is eligible for merging, and 
> the
> +  * order of magnitude need to be calculated.
> +  */
> + if (IsA(firstExpr, Const))
> + {
> + foreach(temp, elements)
> + if (!IsA(lfirst(temp), Const))
> + return false;
>
> This path should be benchmarked, IMO.

I can do some benchmarking here, but of course it's going to be slower
than the baseline. The main idea behind the patch is to trade this
overhead for the benefits in the future while processing pgss records,
hoping that it's going to be worth it (and in those extreme cases I'm
trying to address it's definitely worth it).
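
For anyone skimming the thread, the user-visible goal is roughly the
following (illustrative only; the exact placeholder text used for a merged
list is whatever the patch ends up emitting):

  -- today, each distinct IN-list length produces its own entry:
  --   SELECT * FROM t WHERE id IN ($1, $2, $3)
  --   SELECT * FROM t WHERE id IN ($1, $2, $3, $4, $5)
  -- with constant merging enabled, both would collapse into one
  -- normalized entry whose IN list is shown as a single merged range.
  SELECT query, calls FROM pg_stat_statements WHERE query LIKE '%IN (%';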




Re: Enderbury Island disappeared from timezone database

2023-10-27 Thread Victor Wagner
On Fri, 27 Oct 2023 10:25:57 -0400
Tom Lane  writes:

> Victor Wagner  writes:
> > I've encountered following problem compiling PostgreSQL 15.4 with
> > just released Ubuntu 23.10.  
> 
> > I'm compiling postgres with --with-system-tzdata and then regression
> > test sysviews fails with following diff:
> > +ERROR:  time zone "Pacific/Enderbury" not recognized
> > +DETAIL:  This time zone name appears in the configuration file for
> > time zone abbreviation "phot".  
> 
> Hmph.  Pacific/Enderbury is still defined according to tzdata 2023c,
> which is the latest release:
> 
> $ grep Enderbury src/timezone/data/tzdata.zi
> L Pacific/Kanton Pacific/Enderbury
>
> Did Ubuntu decide to remove *all* backzone links from their data?
> Or just that one?  Either way, I think they're going to get a tsunami
> of pushback pretty quickly.  People like their obsolete zone names.

They split tzdata packages into tzdata and tzdata-legacy (just for
those who like obsolete zone names), and into latter one gone 121 links,
not counting the "right" subdirectory. It is actually a Debian unstable
feature that got imported into Ubuntu. But my
test machines with Debian testing do not use --with-system-tzdata, so
I had not noticed this earlier.

It has following entry in changelog:

tzdata (2023c-8) unstable; urgency=medium

  * Update Dutch debconf translation.
Thanks to Frans Spiesschaert 
(Closes: #1041278)
  * Ship only timezones in tzdata that follow the current rules of
    geographical region (continent or ocean) and city name. Move all
    legacy timezone symlinks (that are upgraded during package update)
    to tzdata-legacy. This includes dropping the special handling for
    US/* timezones. (Closes: #1040997)

 -- Benjamin Drung   Mon, 07 Aug 2023 15:02:14 +0200

I.e. they move obsolete timezones into a separate package just for people
who like them.

Description of that package ends with:

 This package also contains legacy timezone symlinks that are not following
 the current rule of using the geographical region (continent or ocean) and
 city name.
 .
 You do not need this package if you are unsure.

Really, I think that if at least some distributions don't like these
names, it is better to have postgres pass its regression tests without
these names as well as with them.







-- 
   Victor Wagner 




Re: MERGE ... RETURNING

2023-10-27 Thread Dean Rasheed
On Wed, 25 Oct 2023 at 02:07, Merlin Moncure  wrote:
>
> On Tue, Oct 24, 2023 at 2:11 PM Jeff Davis  wrote:
>>
>> Can we revisit the idea of a per-WHEN RETURNING clause? The returning
>> clauses could be treated kind of like a UNION, which makes sense
>> because it really is a union of different results (the returned tuples
>> from an INSERT are different than the returned tuples from a DELETE).
>> You can just add constants to the target lists to distinguish which
>> WHEN clause they came from.
>>
>  Yeah.  Side benefit, the 'action_number' felt really out of place, and that 
> neatly might solve it.  It doesn't match tg_op, for example.  With the 
> current approach, return a text, or an enum? Why doesn't it match concepts 
> that are pretty well established elsewhere?  SQL has a pretty good track 
> record for not inventing weird numbers with no real meaning (sadly, not so 
> much the developers).   Having said that, pg_merge_action() doesn't feel too 
> bad if the syntax issues can be worked out.
>

I've been playing around a little with per-action RETURNING lists, and
attached is a working proof-of-concept (no docs yet).

The implementation is simplified a little by not needing special merge
support functions, but overall this approach introduces a little more
complexity, which is perhaps not surprising.

One fiddly part is resolving the shift/reduce conflicts in the
grammar. Specifically, on seeing "RETURNING expr when ...", there is
ambiguity over whether the "when" is a column alias or the start of
the next merge action. I've resolved that by assigning a slightly
higher precedence to an expression without an alias, so WHEN is
assumed to not be an alias. It seems pretty ugly though (in terms of
having to duplicate so much code), and I'd be interested to know if
there's a neater way to do it.

From a usability perspective, I'm still somewhat sceptical about this
approach. It's a much more verbose syntax, and it gets quite tedious
having to repeat the RETURNING list for every action, and keep them in
sync. I also note that other database vendors seem to have opted for
the single RETURNING list approach (not that we necessarily need to
copy them).
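
To make that concrete, the per-action form would presumably read roughly
like this (a sketch only; the authoritative grammar is whatever the
attached patch implements):

  MERGE INTO target t
  USING source s ON t.id = s.id
  WHEN MATCHED THEN
      UPDATE SET val = s.val
      RETURNING 'UPDATE', t.*
  WHEN NOT MATCHED THEN
      INSERT (id, val) VALUES (s.id, s.val)
      RETURNING 'INSERT', t.*;

with the string constants playing the role that pg_merge_action() / the
action number played in the single-RETURNING-list approach.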

The patch enforces the rule that if any action has a RETURNING list,
they all must have a RETURNING list. Not doing that leads to the
number of rows returned not matching the command tag, or the number of
rows modified, which I think would just lead to confusion. Also, it
would likely be a source of easy-to-overlook mistakes. One such
mistake would be assuming that a RETURNING list at the very end
applied to all actions, though it would also be easy to accidentally
omit a RETURNING list in the middle of the command.

Having said that, I wonder if it would make sense to also support
having a RETURNING list at the very end, if there are no other
RETURNING lists. If we see that, we could automatically apply it to
all actions, which I think would be much more convenient in situations
where you don't care about the action executed, and just want the
results from the table. That would go a long way towards addressing my
usability concerns.

Regards,
Dean
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
new file mode 100644
index c5d7d78..7977873
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -283,12 +283,6 @@ DoCopy(ParseState *pstate, const CopyStm
 	{
 		Assert(stmt->query);
 
-		/* MERGE is allowed by parser, but unimplemented. Reject for now */
-		if (IsA(stmt->query, MergeStmt))
-			ereport(ERROR,
-	errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	errmsg("MERGE not supported in COPY"));
-
 		query = makeNode(RawStmt);
 		query->stmt = stmt->query;
 		query->stmt_location = stmt_location;
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
new file mode 100644
index c66a047..1145aaf
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -510,7 +510,8 @@ BeginCopyTo(ParseState *pstate,
 		{
 			Assert(query->commandType == CMD_INSERT ||
    query->commandType == CMD_UPDATE ||
-   query->commandType == CMD_DELETE);
+   query->commandType == CMD_DELETE ||
+   query->commandType == CMD_MERGE);
 
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
new file mode 100644
index f6c3432..f0e75c3
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -613,16 +613,16 @@ ExecInitPartitionInfo(ModifyTableState *
 	 * build the returningList for partitions within the planner, but simple
 	 * translation of varattnos will suffice.  This only occurs for the INSERT
 	 * case or in the case of UPDATE tuple routing where we didn't find a
-	 * result rel to reuse.
+	 * result rel to reuse.  We skip this for MERGE, since it uses per-action
+	 * RETURNING lists, which are handled below.
 	 */
-	if (node && node->returningLists != NIL)
+	if 

Re: btree_gin: Incorrect leftmost interval value

2023-10-27 Thread Tom Lane
Dean Rasheed  writes:
> On Fri, 27 Oct 2023 at 13:56, Ashutosh Bapat
>  wrote:
>> Should we change this to call INTERVAL_NOBEGIN() to be added by
>> infinite interval patches?

> Given that this is a bug that can lead to incorrect query results, I
> plan to back-patch it, and INTERVAL_NOBEGIN() wouldn't make sense in
> back-branches that don't have infinite intervals.

Agreed.  When/if the infinite interval patch lands, it could update
this function to use the macro.

regards, tom lane




Re: Enderbury Island disappeared from timezone database

2023-10-27 Thread Tom Lane
Victor Wagner  writes:
> I've encountered following problem compiling PostgreSQL 15.4 with just
> released Ubuntu 23.10.

> I'm compiling postgres with --with-system-tzdata and then regression
> test sysviews fails with following diff:
> +ERROR:  time zone "Pacific/Enderbury" not recognized
> +DETAIL:  This time zone name appears in the configuration file for time zone 
> abbreviation "phot".

Hmph.  Pacific/Enderbury is still defined according to tzdata 2023c,
which is the latest release:

$ grep Enderbury src/timezone/data/tzdata.zi
L Pacific/Kanton Pacific/Enderbury

Did Ubuntu decide to remove *all* backzone links from their data?
Or just that one?  Either way, I think they're going to get a tsunami
of pushback pretty quickly.  People like their obsolete zone names.

regards, tom lane




Re: Add recovery to pg_control and remove backup_label

2023-10-27 Thread David Steele

On 10/26/23 17:27, David G. Johnston wrote:
On Thu, Oct 26, 2023 at 2:02 PM David Steele  wrote:


Are we planning on dealing with torn writes in the back branches in some 
way or are we just throwing in the towel and saying the old method is 
too error-prone to exist/retain 


We are still planning to address this issue in the back branches.

and therefore the goal of the v17 
changes is to not only provide a better way but also to ensure the old 
way no longer works?  It seems sufficient to change the output signature 
of pg_backup_stop to accomplish that goal though I am pondering whether 
an explicit check and error for seeing the backup_label file would be 
warranted.


Well, if the backup tool is just copying the second column of output to 
the backup_label, then it won't break. Of course in that case, restores 
won't work correctly but you would not get an error. Testing would show 
that it is not working properly and backup tools should certainly be tested.


Even so, I'm OK with an explicit check for backup_label. Let's see what 
others think.


If we are going to solve the torn writes problem completely then while I 
agree the new way is superior, implementing it doesn't have to mean 
existing tools built to produce backup_label and rely upon the 
pg_control in the data directory need to be forcibly broken.


It is a pretty easy update to any backup software that supports 
non-exclusive backup. I was able to make the changes to pgBackRest in 
less than an hour. We've made major changes to backup and restore in 
almost every major version of PostgreSQL for a while: non-exclusive 
backup in 9.6, dir renames in 10, variable WAL size in 11, new recovery 
location in 12, hard recovery target errors in 13, and changes to 
non-exclusive backup and removal of exclusive backup in 15. In 17 we are 
already looking at new page and segment sizes.



I know that outputting pg_control as bytea is going to be a bit
controversial. Software that is using psql to run pg_backup_stop()
could use encode() to get pg_control as text and then decode it later.
Alternately, we could update ReadControlFile() to recognize a
base64-encoded pg_control file. I'm not sure dealing with binary data is
that much of a problem, though, and if the backup software gets it wrong
then recovery will fail on an invalid pg_control file.

Can we not figure out some way to place the relevant files onto the 
server somewhere so that a simple "cp" command would work?  Have 
pg_backup_stop return paths instead of contents, those paths being 
"$TEMP_DIR"//pg_control.conf (and 
tablespace_map)


Nobody has been able to figure this out, and some of us have been 
thinking about it for years. It just doesn't seem possible to reliably 
tell the difference between a cluster that was copied and one that 
simply crashed.


If cp is really the backup tool being employed, I would recommend using 
pg_basebackup. cp has flaws that could lead to corruption, and of course 
does not at all take into account the archive required to make a backup 
consistent, directories to be excluded, the order of copying pg_control 
on backup from standby, etc., etc.


Backup/restore is not a simple endeavor and we don't do anyone favors 
pretending that it is.


Regards,
-David




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-27 Thread Nazir Bilal Yavuz
Hi,

On Thu, 19 Oct 2023 at 08:26, Michael Paquier  wrote:
>
> On Wed, Oct 18, 2023 at 02:56:42PM +0900, Michael Paquier wrote:
> > Thanks for the new versions.  I have applied 0001 and backpatched it
> > for now.  0002 and 0003 look in much cleaner shape than previously.
>
> 0002 and 0003 have now been applied.  I have split 0003 into two parts
> at the end, mainly on clarity grounds: one for the counters with
> EXPLAIN and a second for pg_stat_statements.
>
> There were a few things in the patch set.  Per my notes:
> - Some incorrect indentation.
> - The additions of show_buffer_usage() did not handle correctly the
> addition of a comma before/after the local timing block.  The code
> area for has_local_timing needs to check for has_temp_timing, while
> the area of has_shared_timing needs to check for (has_local_timing ||
> has_temp_timing).
> - explain.sgml was missing an update for the information related to
> the read/write timings of the local blocks.

Thanks for the changes, push and feedback!

>
> Remains what we should do about the "shared/local" string in
> show_buffer_usage() for v16 and v15, as "local" is unrelated to that.
> Perhaps we should just switch to "shared" in this case or just remove
> the string entirely?  Still that implies changing the output of
> EXPLAIN on a stable branch in this case, so there could be an argument
> for leaving this stuff alone.

I think switching it to 'shared' makes sense. That shouldn't confuse
existing monitoring queries much as the numbers won't change, right?
Also, if we keep 'shared/local' there could be similar complaints to
this thread in the future; so, at least adding comments can be
helpful.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-27 Thread David Steele

On 10/27/23 03:22, Michael Paquier wrote:

On Mon, Oct 16, 2023 at 02:54:35PM +0900, Michael Paquier wrote:

On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:

On 9/28/23 19:59, Michael Paquier wrote:

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used.  All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.


This change makes it more obvious if configuration is missing (since you'll
get an error), however +1 for adding this to pg_basebackup.


Looking at the streaming APIs of pg_basebackup, it looks like this
would be a matter of using bbstreamer_inject_file() to inject an empty
file into the stream.  Still something seems to be off once
compression methods are involved...  Hmm.  I am not sure.  Well, this
could always be done as a patch independent of this one, under a
separate discussion.  There are extra arguments about whether it would
be a good idea to add a recovery.signal even when taking a backup from
a standby, and do that only in 17~.


Hmm.  On this specific point, it would actually be much simpler to
force recovery.signal to be in the contents streamed to a BASE_BACKUP.


That sounds like the right plan to me. Nice and simple.


This does not step on your proposal at [1], though, because you'd
still require a .signal file for recovery as far as I understand :/

[1]: 
https://www.postgresql.org/message-id/2daf8adc-8db7-4204-a7f2-a7e94e2bf...@pgmasters.net


Yes.


Would folks be OK to move on with the patch of this thread at the end?
I am attempting a last-call kind of thing.


I'm still +1 for the patch as it stands.

Regards,
-David




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-10-27 Thread Daniel Gustafsson
Attached is a v10 rebase of this patch which had undergone significant bitrot
due to recent changes in the pg_upgrade check phase.  This brings the
changes into the proposed structure without changes to queries, with no
additional changes to the proposed functionality.

Testing with a completely empty v11 cluster fresh from initdb as the old
cluster shows a significant speedup (averaged over multiple runs, adjusted for
outliers):

patched:  53.59ms (52.78ms, 52.49ms, 55.49ms)
master : 125.87ms (125.23 ms, 125.67ms, 126.67ms)

Using a similarly empty cluster from master as the old cluster shows a smaller
speedup, which is expected since many checks only run for older versions:

patched: 33.36ms (32.82ms, 33.78ms, 33.47ms)
master : 44.87ms (44.73ms, 44.90ms 44.99ms)

The latter case is still pretty interesting IMO since it can speed up testing
where every millisecond gained matters.

--
Daniel Gustafsson



v10-0001-pg_upgrade-run-all-data-type-checks-per-connecti.patch
Description: Binary data


Re: btree_gin: Incorrect leftmost interval value

2023-10-27 Thread Dean Rasheed
On Fri, 27 Oct 2023 at 13:56, Ashutosh Bapat
 wrote:
>
> Should we change this to call INTERVAL_NOBEGIN() to be added by
> infinite interval patches?
>

Given that this is a bug that can lead to incorrect query results, I
plan to back-patch it, and INTERVAL_NOBEGIN() wouldn't make sense in
back-branches that don't have infinite intervals.

Regards,
Dean




Re: btree_gin: Incorrect leftmost interval value

2023-10-27 Thread Ashutosh Bapat
On Fri, Oct 27, 2023 at 2:57 PM Dean Rasheed  wrote:
>
> In contrib/btree_gin, leftmostvalue_interval() does this:
>
> leftmostvalue_interval(void)
> {
> Interval   *v = palloc(sizeof(Interval));
>
> v->time = DT_NOBEGIN;
> v->day = 0;
> v->month = 0;
> return IntervalPGetDatum(v);
> }
>
> which is a long way short of the minimum possible interval value.
>
> As a result, a < or <= query using a GIN index on an interval column
> may miss values. For example:
>
> CREATE EXTENSION btree_gin;
> CREATE TABLE foo (a interval);
> INSERT INTO foo VALUES ('-100 years');
> CREATE INDEX foo_idx ON foo USING gin (a);
>
> SET enable_seqscan = off;
> SELECT * FROM foo WHERE a < '1 year';
>  a
> ---
> (0 rows)
>
> Attached is a patch fixing this by setting all the fields to their
> minimum values, which is guaranteed to be less than any other
> interval.

Should we change this to call INTERVAL_NOBEGIN() to be added by
infinite interval patches? It's the same effect but looks similar to
leftmostvalue_timestamp/float8 etc. It will need to wait for the
infinite interval patches to commit but I guess, the wait won't be too
long and the outcome will be better. I can include this change in
those patches.
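
For reference, the back-patchable fix being discussed is presumably along
these lines (a sketch; the authoritative version is the patch attached
upthread):

static Datum
leftmostvalue_interval(void)
{
    Interval   *v = palloc(sizeof(Interval));

    /* every field at its minimum, which sorts before any real interval */
    v->time = PG_INT64_MIN;
    v->day = PG_INT32_MIN;
    v->month = PG_INT32_MIN;
    return IntervalPGetDatum(v);
}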

-- 
Best Wishes,
Ashutosh Bapat




Re: RFC: Logging plan of the running query

2023-10-27 Thread Étienne BERSAC
> Hi,
> 

> If we use client log message, that message is shown on the target 
> process whose pid is specified by the parameter of
> pg_log_query_plan():
> 
>    (pid:1000)=# select pg_sleep(60);
>    (pid:1001)=# select pg_log_query_plan(1000);
>    (pid:1000)=# LOG:  query plan running on backend with PID 1000 is:
>     Query Text: select pg_sleep(1000);
>     Result  (cost=0.00..0.01 rows=1 width=4)
>   Output: pg_sleep('1000'::double precision)
> 
> I think this is not an expected behavior and we set elevel to 
> LOG_SERVER_ONLY.


Makes sense. Thanks.




Re: btree_gin: Incorrect leftmost interval value

2023-10-27 Thread Heikki Linnakangas

On 27/10/2023 12:26, Dean Rasheed wrote:

In contrib/btree_gin, leftmostvalue_interval() does this:

leftmostvalue_interval(void)
{
 Interval   *v = palloc(sizeof(Interval));

 v->time = DT_NOBEGIN;
 v->day = 0;
 v->month = 0;
 return IntervalPGetDatum(v);
}

which is a long way short of the minimum possible interval value.


Good catch!


Attached is a patch fixing this by setting all the fields to their
minimum values, which is guaranteed to be less than any other
interval.


LGTM. I wish extractQuery could return "leftmost" more explicitly, so 
that we didn't need to construct these leftmost values. But I don't 
think that's supported by the current extractQuery interface.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: RFC: Logging plan of the running query

2023-10-27 Thread torikoshia

On 2023-10-27 16:06, Étienne BERSAC wrote:

Hi Torikoshia,


If so, we once tried to implement such a function for getting memory
contexts.
However, this attempt didn't succeed because it could lead to a deadlock
situation[1].


Thanks for the pointer. Why not use a client log message to allow the
client to get the plan without locking backend memory context and without
access to the server log? I missed the rationale for not sending the plan
to the client.


If we use client log message, that message is shown on the target 
process whose pid is specified by the parameter of pg_log_query_plan():


  (pid:1000)=# select pg_sleep(60);
  (pid:1001)=# select pg_log_query_plan(1000);
  (pid:1000)=# LOG:  query plan running on backend with PID 1000 is:
   Query Text: select pg_sleep(1000);
   Result  (cost=0.00..0.01 rows=1 width=4)
 Output: pg_sleep('1000'::double precision)

I think this is not an expected behavior and we set elevel to 
LOG_SERVER_ONLY.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Enderbury Island disappeared from timezone database

2023-10-27 Thread Victor Wagner
Collegues,

I've encountered following problem compiling PostgreSQL 15.4 with just
released Ubuntu 23.10.

I'm compiling postgres with --with-system-tzdata and then regression
test sysviews fails with following diff:


--- /home/test/pg-tests/postgresql-15.4/src/test/regress/expected/sysviews.out  
2023-10-26 19:06:02.0 +
+++ /home/test/pg-tests/postgresql-15.4/src/test/regress/results/sysviews.out   
2023-10-27 07:10:22.214698986 +
@@ -147,23 +147,14 @@
 (1 row)
 
 select count(distinct utc_offset) >= 24 as ok from pg_timezone_abbrevs;
- ok 
-
- t
-(1 row)
-
+ERROR:  time zone "Pacific/Enderbury" not recognized
+DETAIL:  This time zone name appears in the configuration file for time zone 
abbreviation "phot".


with more such errors follows.

Investigation shows that this timezone was long ago declared
deprecated, and eventually disappeared from the tzdata package in Ubuntu,
even as a symlink to Pacific/Kanton (which is equivalent).

But this timezone is present in src/timezone/tznames/Default, so this
error message appears any time one accesses pg_timezone_abbrevs,
regardless of whether the Pacific region is included in the results or not.

Maybe Enderbury should be replaced by Kanton in
src/timezone/tznames/Default and src/timezone/tznames/Pacific.txt?

-- 
   Victor Wagner 




Annoying corruption in PostgreSQL.

2023-10-27 Thread Kirill Reshke
Hi hackers!

We run a large amount of PostgreSQL clusters in our production. They differ
by versions (we have 11-16 pg), load, amount of data, schema, etc. From
time to time, postgresql corruption happens. It says
ERROR,XX001,"missing chunk number 0 for toast value 18767319 in
pg_toast_2619",,"vacuum full ;"

in logs. The missing chunk number is almost always equal to zero, while
other values vary. There are no known patterns, which triggers this issue.
Moreover, if trying to rerun the VACUUM statement against relations from a
log message, it succeeds all the time.  So, we just ignore these errors.
Maybe it is just some weird data race?

We don't know how to trigger this problem, or why it occurs. I'm not asking
you to resolve this issue, but to help with debugging. What can we do to
deduce failure reasons? Maybe we can add more logging somewhere (we can
deploy a special patched PostgreSQL version everywhere), to have more
information about the issue, when it happens next time?


Re: run pgindent on a regular basis / scripted manner

2023-10-27 Thread Andrew Dunstan


On 2023-10-27 Fr 03:14, Étienne BERSAC wrote:

Hello,



Yes, there's a lot to look out for, and you're a damn sight better at it
than I am. But we should try to automate the things that can be
automated, even if that leaves many tasks that can't be. I have three
things in my pre-commit hook: a check for catalog updates, a check for
new typedefs, and an indent check.

Could you share your configuration? Could we provide more helpers and
integration to help produce consistent code?



Sure. pre-commit hook file attached. I'm sure this could be improved on.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
#!/bin/sh

set -u

: ${PGAUTOINDENT:=no}
: ${PGTYPEDEFCHECK:=no}

branch=$(git rev-parse --abbrev-ref HEAD)
files=$(git diff --cached --name-only --diff-filter=ACMR)

check_typedef() {

# only do this on master
test  "$branch" = "master" || return 0

test $PGTYPEDEFCHECK = yes || return 0

tdmod=`git status --porcelain src/tools/pgindent/typedefs.list`

test -z "$tdmod" && return 0

tdfiles=`git diff --cached --name-only -Stypedef $files | grep '[.][ch]$'`

test -z "$tdfiles" && return 0

# changes include typedef but list not changed
{
echo 'Commit on master contains a typedef but typedefs.list not changed'
echo 'It can be forced with git commit --no-verify'
echo 'or PGTYPEDEFCHECK=no'
} >&2

exit 1

}

check_catalog_version () {

# only do this on master
test  "$branch" = "master" || return 0

case "$files" in
*src/include/catalog/catversion.h*)
return 0;
;;
*src/include/catalog/*)
;;
*)
return 0;
;;
esac

# changes include catalog but not catversion.h, so warn about it
{
echo 'Commit on master alters catalog but catversion not bumped'
echo 'It can be forced with git commit --no-verify'
} >&2

exit 1
}

check_indent () {

# only do this on master
test  "$branch" = "master" || return 0

# no need to filter files - pgindent ignores everything that isn't a
# .c or .h file

# but make sure we have some

perl -e 'foreach (@ARGV) { exit 0 if /\.[ch]$/; }; exit 1;' $files || \
return 0;

echo Indenting $files

src/tools/pgindent/pgindent --silent-diff $files && return 0

exec 2>&1

if [ "$PGAUTOINDENT" = yes ] ; then
echo "Running pgindent on changed files"
src/tools/pgindent/pgindent $files
echo "You need to rerun git commit to pick up pgindent changes"
else
echo 'Need a pgindent run, e.g:'
echo -n 'src/tools/pgindent/pgindent '
echo '`git diff --name-only --diff-filter=ACMR`'
fi

exit 1
}

# nothing to do if there are no files
test -z "$files" && exit 0

check_catalog_version
check_indent
check_typedef


Re: Infinite Interval

2023-10-27 Thread Ashutosh Bapat
On Fri, Oct 27, 2023 at 2:07 PM Dean Rasheed  wrote:
>
> On Wed, 4 Oct 2023 at 14:29, Ashutosh Bapat
>  wrote:
> >
> > I think we should introduce interval_out_of_range_error() function on
> > the lines of float_overflow_error(). Later patches introduce more
> > places where we raise that error. We can introduce the function as
> > part of those patches.
> >
>
> I'm not convinced that it is really worth it. Note also that even with
> this patch, there are still more places that throw "timestamp out of
> range" errors than "interval out of range" errors.

Fine with me.

>
> > > 4. I tested pg_upgrade on a table with an interval with INT_MAX
> > > months, and it was silently converted to infinity. I think that's
> > > probably the best outcome (better than failing). However, this means
> > > that we really should require all 3 fields of an interval to be
> > > INT_MIN/MAX for it to be considered infinite, otherwise it would be
> > > possible to have multiple internal representations of infinity that do
> > > not compare as equal.
> > >
> > My first patch was comparing all the three fields to determine whether
> > a given Interval value represents infinity. [3] changed that to use
> > only the month field. I guess that was based on the discussion at [4].
> > You may want to review that discussion if not already done. I am fine
> > either way. We should be able to change the comparison code later if
> > we see performance getting impacted.
> >
>
> Before looking at the details more closely, I might have agreed with
> that earlier discussion. However, given that things like pg_upgrade
> have the possibility of turning formerly allowed, finite intervals
> into infinity, we really need to ensure that there is only one value
> equal to infinity, otherwise the results are likely to be very
> confusing and counter-intuitive. That means that we have to continue
> to regard intervals like INT32_MAX months + 10 days as finite.
>
> While I haven't done any performance testing, I wouldn't expect this
> to have much impact. In a 64-bit build, this actually generates 2
> comparisons rather than 3 -- one comparing the combined month and day
> fields against a 64-bit value containing 2 copies of INT32_MAX, and
> one testing the time field. In practice, only the first test will be
> executed in the vast majority of cases.
>

Thanks for the analysis.

>
> Something that perhaps does need discussing is the fact that
> '2147483647 months 2147483647 days 9223372036854775807 usecs' is now
> accepted by interval_in() and gives infinity. That's a bit ugly, but I
> think it's defensible as a measure to prevent dump/restore errors from
> older databases, and in any case, such an interval is outside the
> documented range of supported intervals, and is a highly contrived
> example, vanishingly improbable in practice.

Agreed.

>
> Alternatively, we could have interval_in() reject this, which would
> open up the possibility of dump/restore errors. It could be argued
> that that's OK, for similar reasons -- the failing value is highly
> unlikely/contrived, and out of the documented range. I don't like that
> though. I don't think dump/restore should fail under any
> circumstances, however unlikely.

I agree that dump/restore shouldn't fail, especially when restore on
one major version succeeds and fails on another.

>
> Another alternative is to accept this input, but emit a WARNING. I
> don't particularly like that either, since it's forcing a check on
> every input value, just to cater for this one specific highly unlikely
> input. In fact, both these alternative approaches (rejecting the
> value, or emitting a warning), would impose a small performance
> penalty on every interval input, which I don't think is really worth
> it.

Agreed.

>
> So overall, my preference is to just accept it. Anything else is more
> work, for no practical benefit.
>

Ok.

-- 
Best Wishes,
Ashutosh Bapat




Re: race condition in pg_class

2023-10-27 Thread Smolkin Grigory
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a
> lock in GRANT.

Oh, that explains it. Thank you very much.

> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.

Sure, but it will take some time; it's a large db with lots of WAL segments
to apply.
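
Presumably something along these lines at each PITR stop point (a sketch;
adjust the column list as needed):

# just before the COMMIT record:
psql -Atc "SELECT oid, relname, relhasindex, xmin FROM pg_class ORDER BY oid" \
    > pg_class_before.tsv
# then recover just past the COMMIT record and repeat:
psql -Atc "SELECT oid, relname, relhasindex, xmin FROM pg_class ORDER BY oid" \
    > pg_class_after.tsv
diff pg_class_before.tsv pg_class_after.tsv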

> extensions

       extname       | extversion
--------------------+------------
 plpgsql            | 1.0
 pg_stat_statements | 1.8
 pg_buffercache     | 1.3
 pgstattuple        | 1.5


Re: pg_upgrade and logical replication

2023-10-27 Thread Amit Kapila
On Fri, Oct 27, 2023 at 12:09 PM vignesh C  wrote:
>
> Apart from this I'm still checking that the old cluster's subscription
> relation states are still READY, but there is a possibility
> that SYNCDONE or FINISHEDCOPY could work; this needs more thought
> before concluding which is the correct state to check. Let's handle
> this in the upcoming version.
>

I was analyzing this part and it seems it could be tricky to upgrade
in FINISHEDCOPY state, because the system would expect that the subscriber
would know the old slot name from the old cluster, which it can drop at
SYNCDONE state. Now, as sync_slot_name is generated based on subid,
relid which could be different in the new cluster, the generated
slotname would be different after the upgrade. OTOH, if the relstate
is INIT, then I think the sync could be performed even after the
upgrade.

Shouldn't we at least ensure that replication origins do exist in the
old cluster corresponding to each of the subscriptions? Otherwise,
later the query to get remote_lsn for origin in getSubscriptions()
would fail.
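
Something like this, presumably (a sketch, assuming the usual
"pg_<subscription oid>" origin naming used by logical replication):

SELECT s.subname
  FROM pg_subscription s
  LEFT JOIN pg_replication_origin o ON o.roname = 'pg_' || s.oid
 WHERE o.roname IS NULL;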

-- 
With Regards,
Amit Kapila.




partitioning and identity column

2023-10-27 Thread Ashutosh Bapat
Hi All,
Reading [1], I have been experimenting with the behaviour of identity
columns and serial columns in the case of partitioned tables. My
observations related to serial columns can be found at [2]. This email
is about identity column behaviour with partitioned tables. Serial and
identity columns have sufficiently different behaviour and
implementation to have separate discussions on their behaviour. I
don't want to mix this with [1] since that thread is about replacing
serial with identity. The discussion in this and [2] will be useful to
drive [1] forward.

Behaviour 1
=
If a partitioned table has an identity column, the partitions do not
inherit identity property.
#create table tpart (a int generated always as identity primary key,
src varchar) partition by range(a);
#create table t_p1 partition of tpart for values from (1) to (3);
#\d tpart
 Partitioned table "public.tpart"
 Column |   Type| Collation | Nullable |   Default
+---+---+--+--
 a  | integer   |   | not null | generated always
as identity
 src| character varying |   |  |
Partition key: RANGE (a)
Indexes:
"tpart_pkey" PRIMARY KEY, btree (a)
Number of partitions: 2 (Use \d+ to list them.)

#\d t_p1
 Table "public.t_p1"
 Column |   Type| Collation | Nullable | Default
+---+---+--+-
 a  | integer   |   | not null |
 src| character varying |   |  |
Partition of: tpart FOR VALUES FROM (1) TO (3)
Indexes:
"t_p1_pkey" PRIMARY KEY, btree (a)

Notice that the Default column of t_p1 is empty. This means that a direct
INSERT into the partition will fail if it does not specify a value for the
identity column. As a consequence such a value may conflict with an
existing value or a future value of the identity column. In the
example, the identity column is a primary key and also a partition
key, thus the conflict would result in an error. But when it's not a
partition key (and hence a primary key), it will just allow those
conflicting values.
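
For example, continuing the session above (a sketch):

#insert into t_p1 (src) values ('direct');
ERROR:  null value in column "a" of relation "t_p1" violates not-null constraint
#insert into t_p1 (a, src) values (1, 'direct');   -- bypasses tpart's identity sequence
INSERT 0 1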

Behaviour 2
=
If a table is being attached as a partition to a partitioned table and
both of them have an identity column with the same name, the ATTACH
succeeds and allows both tables to use different sequences.
#create table t_p5 (a int primary key, b int generated always as
identity, src varchar);
#alter table tpart attach partition t_p5 for values from (7) to (9);
#\d t_p5
   Table "public.t_p5"
 Column |   Type| Collation | Nullable |   Default
+---+---+--+--
 a  | integer   |   | not null |
 b  | integer   |   | not null | generated always
as identity
 src| character varying |   |  |
Partition of: tpart FOR VALUES FROM (7) TO (9)
Indexes:
"t_p5_pkey" PRIMARY KEY, btree (a)

As a consequence a direct INSERT into the partition will result in a
value for identity column which conflicts with an existing value or a
future value in the partitioned table. Again, if the identity column
is a primary key (and partition key), the conflicting INSERT will
fail. But otherwise, the conflicting values will go unnoticed.

I consulted Vik Fearing, offline, about SQL standard's take on
identity columns in partitioned table. SQL standard does not specify
partitioned tables as a separate entity. Thus a partitioned table is
on par with a regular table. Hence an identity column in a partitioned
table should share the same identity space across all the partitions.

Behaviour 3
=
We allow identity column to be added to a partitioned table which
doesn't have partitions but we disallow adding identity column to a
partitioned table which has partitions.
#create table tpart (a int primary key,
src varchar) partition by range(a);
#create table t_p1 partition of tpart for values from (1) to (3);
#alter table tpart add column b int generated always as identity;
ERROR:  cannot recursively add identity column to table that has child tables

I don't understand why that is prohibited. If we allow partitions to
be added to a partitioned table with an identity column, we should allow
an identity column to be added to a partitioned table with partitions.

Behaviour 4
===========
Even though we don't allow an identity column to be added to a
partitioned table with partitions, we allow an existing column to be
converted to an identity column in such a table.

#create table tpart (a int primary key,
src varchar) partition by range(a);
#create table t_p1 partition of tpart for values from (1) to (3);
#create table t_p2 partition of tpart for values from (3) to (5);
#alter table tpart alter column a add generated always as identity;

#\d tpart
   

Re: gcc 12.1.0 warning

2023-10-27 Thread Nazir Bilal Yavuz
Hi,

On Fri, 27 Oct 2023 at 12:34, Erik Rijkers  wrote:
>
> Hi,
>
> Not sure if these compiler-mutterings are worth reporting but I guess
> we're trying to get a silent compile.
>
> System: Debian 4.9.303-1 (2022-03-07) x86_64 GNU/Linux
> Compiling with gcc 12.1.0 causes the below 'warning' and 'note'.
> Compiling with --enable-cassert --enable-debug  is silent, no warnings)
>
> In function ‘guc_var_compare’,
>  inlined from ‘bsearch’ at
> /usr/include/x86_64-linux-gnu/bits/stdlib-bsearch.h:33:23,
>  inlined from ‘find_option’ at guc.c:5649:35:
> guc.c:5736:38: warning: array subscript ‘const struct config_generic[0]’
> is partly outside array bounds of ‘const char[8]’ [-Warray-bounds]
>   5736 | return guc_name_compare(confa->name, confb->name);
>| ~^~
>
> guc.c: In function ‘find_option’:
> guc.c:5636:25: note: object ‘name’ of size 8
>   5636 | find_option(const char *name, bool create_placeholders, bool
> skip_errors,
>| ^~~~
>
> (Compiling with gcc 6.3.0 does not complain.)

I was testing 'upgrading CI Debian images to bookworm'. I tested the
bookworm on REL_15, REL_16 and upstream. REL_16 and upstream finished
successfully but the CompilerWarnings task failed on REL_15 with the
same error.

gcc version: 12.2.0

CI Run: https://cirrus-ci.com/task/6151742664998912

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Synchronizing slots from primary to standby

2023-10-27 Thread shveta malik
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/25/23 5:00 AM, shveta malik wrote:
> > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
> >  wrote:
> >>
> >> Hi,
> >>
> >> On 10/23/23 2:56 PM, shveta malik wrote:
> >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> >>>  wrote:
> >>
>  We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking 
>  if there
>  is new synced slot(s) to be created on the standby. Do we want to keep 
>  this behavior
>  for V1?
> 
> >>>
> >>> I think for the slotsync workers case, we should reduce the naptime in
> >>> the launcher to say 30sec and retain the default one of 3mins for
> >>> subscription apply workers. Thoughts?
> >>>
> >>
> >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> >> API on the standby that would refresh the list of sync slot at wish, 
> >> thoughts?
> >>
> >
> > Do you mean API to refresh list of DBIDs rather than sync-slots?
> > As per current design, launcher gets DBID lists for all the failover
> > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
>
> I mean an API to get a newly created slot on the primary being created/synced 
> on
> the standby at wish.
>
> Also let's imagine this scenario:
>
> - create logical_slot1 on the primary (and don't start using it)
>
> Then on the standby we'll get things like:
>
> 2023-10-25 08:33:36.897 UTC [740298] LOG:  waiting for remote slot 
> "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot 
> LSN (0/C0049530) and and catalog xmin (754)
>
> That's expected and due to the fact that ReplicationSlotReserveWal() does set 
> the slot
> restart_lsn to a value < at the corresponding restart_lsn slot on the primary.
>
> - create logical_slot2 on the primary (and start using it)
>
> Then logical_slot2 won't be created/synced on the standby until there is 
> activity on logical_slot1 on the primary
> that would produce things like:
> 2023-10-25 08:41:35.508 UTC [740298] LOG:  wait over for remote slot 
> "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed 
> local slot LSN (0/C0049530) and catalog xmin (754)


Slight correction to above. As soon as we start activity on
logical_slot2, it will impact all the slots on primary, as the WALs
are consumed by all the slots. So even if there is activity on
logical_slot2, logical_slot1 creation on standby will be unblocked and
it will then move to logical_slot2 creation. eg:

--on standby:
2023-10-27 15:15:46.069 IST [696884] LOG:  waiting for remote slot
"mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local
slot LSN (0/3C979A8) and and catalog xmin (756)

on primary:
newdb1=# select now();
   now
--
 2023-10-27 15:15:51.504835+05:30
(1 row)

--activity on mysubnew1_3
newdb1=# insert into tab1_3 values(1);
INSERT 0 1
newdb1=# select now();
   now
--
 2023-10-27 15:15:54.651406+05:30


--on standby, mysubnew1_1 is unblocked.
2023-10-27 15:15:56.223 IST [696884] LOG:  wait over for remote slot
"mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now
passed local slot LSN (0/3C979A8) and catalog xmin (756)

My Setup:
mysubnew1_1 -->mypubnew1_1 -->tab1_1
mysubnew1_3 -->mypubnew1_3-->tab1_3

thanks
Shveta




Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-10-27 Thread Michael Banck
Hi,

On Fri, Oct 27, 2023 at 09:03:04AM +0200, Laurenz Albe wrote:
> On Fri, 2022-11-04 at 10:49 +0100, Laurenz Albe wrote:
> > On Thu, 2022-11-03 at 11:32 +0100, Laurenz Albe wrote:
> > > On Wed, 2022-11-02 at 19:29 +, David Burns wrote:
> > > 
> > > > Some additional clarity in the versions 14/15 documentation
> > > > would be helpful specifically surrounding the "target_role"
> > > > clause for the ALTER DEFAULT PRIVILEGES command.  To the
> > > > uninitiated, the current description seems vague.  Maybe
> > > > something like the following would help:
> > 
> > After some more thinking, I came up with the attached patch.
> 
> I'm sending a reply to the hackers list, so that I can add the patch
> to the commitfest.

I think something like this is highly useful because I have also seen
people very confused about why default privileges are not applied.

However, maybe it could be made even clearer if the main description is
also amended, like

"You can change default privileges only for objects that will be created
by yourself or by roles that you are a member of (via target_role)."

or something.
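
For example, something along these lines (a sketch, assuming roles
alice and bob exist) is what typically trips people up:

ALTER DEFAULT PRIVILEGES FOR ROLE alice IN SCHEMA public
    GRANT SELECT ON TABLES TO bob;
-- only tables that alice herself creates later in schema public get
-- the grant; tables created there by other roles are not affected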


Michael




Re: A recent message added to pg_upgade

2023-10-27 Thread Amit Kapila
On Fri, Oct 27, 2023 at 2:02 PM Alvaro Herrera  wrote:
>
> On 2023-Oct-27, Kyotaro Horiguchi wrote:
>
> > @@ -1433,8 +1433,8 @@ 
> > InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> >   {
> >   ereport(ERROR,
> >   
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > - errmsg("replication slots must not be 
> > invalidated during the upgrade"),
> > - errhint("\"max_slot_wal_keep_size\" 
> > must be set to -1 during the upgrade"));
>
> Hmm, if I read this code right, this error is going to be thrown by the
> checkpointer while finishing a checkpoint.  Fortunately, the checkpoint
> record has already been written, but I don't know what would happen if
> this is thrown while trying to write the shutdown checkpoint.  Probably
> nothing terribly good.
>
> I don't think this is useful.  If the setting is invalid during binary
> upgrade, let's prevent it from being set at all right from the start of
> the upgrade process.

We are already forcing the required setting
"max_slot_wal_keep_size=-1" during the upgrade similar to some of the
other settings like "full_page_writes". However, the user can provide
an option for "max_slot_wal_keep_size" in which case it will be
overridden. Now, I think (a) we can ensure that our setting always
takes precedence in this case. The other idea is (b) to parse the
user-provided options and check if "max_slot_wal_keep_size" has a
value different than expected and raise an error accordingly. Or we
can simply (c) document the usage of max_slot_wal_keep_size in the
upgrade. I am not sure whether it is worth complicating the code for
this as the user shouldn't be using such an option during the upgrade.
So, I think doing (a) and (c) could be simpler.

>
>  In InvalidatePossiblyObsoleteSlot() we could have
> just an Assert() or elog(PANIC).
>

Yeah, we can change to either of those.

-- 
With Regards,
Amit Kapila.




btree_gin: Incorrect leftmost interval value

2023-10-27 Thread Dean Rasheed
In contrib/btree_gin, leftmostvalue_interval() does this:

leftmostvalue_interval(void)
{
Interval   *v = palloc(sizeof(Interval));

v->time = DT_NOBEGIN;
v->day = 0;
v->month = 0;
return IntervalPGetDatum(v);
}

which is a long way short of the minimum possible interval value.

As a result, a < or <= query using a GIN index on an interval column
may miss values. For example:

CREATE EXTENSION btree_gin;
CREATE TABLE foo (a interval);
INSERT INTO foo VALUES ('-100 years');
CREATE INDEX foo_idx ON foo USING gin (a);

SET enable_seqscan = off;
SELECT * FROM foo WHERE a < '1 year';
 a
---
(0 rows)

Attached is a patch fixing this by setting all the fields to their
minimum values, which is guaranteed to be less than any other
interval.

Note that this doesn't affect the contents of the index itself, so
reindexing is not necessary.
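
For reference, after the fix the reproduction above should behave like
this (a sketch; the exact display of the interval depends on
IntervalStyle):

SET enable_seqscan = off;
SELECT * FROM foo WHERE a < '1 year';
-- now expected to return the '-100 years' row (1 row) instead of an
-- empty result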

Regards,
Dean
diff --git a/contrib/btree_gin/btree_gin.c b/contrib/btree_gin/btree_gin.c
new file mode 100644
index c50d68c..b09bb8d
--- a/contrib/btree_gin/btree_gin.c
+++ b/contrib/btree_gin/btree_gin.c
@@ -306,9 +306,9 @@ leftmostvalue_interval(void)
 {
 	Interval   *v = palloc(sizeof(Interval));
 
-	v->time = DT_NOBEGIN;
-	v->day = 0;
-	v->month = 0;
+	v->time = PG_INT64_MIN;
+	v->day = PG_INT32_MIN;
+	v->month = PG_INT32_MIN;
 	return IntervalPGetDatum(v);
 }
 
diff --git a/contrib/btree_gin/expected/interval.out b/contrib/btree_gin/expected/interval.out
new file mode 100644
index 1f6ef54..8bb9806
--- a/contrib/btree_gin/expected/interval.out
+++ b/contrib/btree_gin/expected/interval.out
@@ -3,30 +3,34 @@ CREATE TABLE test_interval (
 	i interval
 );
 INSERT INTO test_interval VALUES
+	( '-17800 years' ),
 	( '03:55:08' ),
 	( '04:55:08' ),
 	( '05:55:08' ),
 	( '08:55:08' ),
 	( '09:55:08' ),
-	( '10:55:08' )
+	( '10:55:08' ),
+	( '17800 years' )
 ;
 CREATE INDEX idx_interval ON test_interval USING gin (i);
 SELECT * FROM test_interval WHERE i<'08:55:08'::interval ORDER BY i;
 i 
 --
+ @ 17800 years ago
  @ 3 hours 55 mins 8 secs
  @ 4 hours 55 mins 8 secs
  @ 5 hours 55 mins 8 secs
-(3 rows)
+(4 rows)
 
 SELECT * FROM test_interval WHERE i<='08:55:08'::interval ORDER BY i;
 i 
 --
+ @ 17800 years ago
  @ 3 hours 55 mins 8 secs
  @ 4 hours 55 mins 8 secs
  @ 5 hours 55 mins 8 secs
  @ 8 hours 55 mins 8 secs
-(4 rows)
+(5 rows)
 
 SELECT * FROM test_interval WHERE i='08:55:08'::interval ORDER BY i;
 i 
@@ -40,12 +44,14 @@ SELECT * FROM test_interval WHERE i>='08
  @ 8 hours 55 mins 8 secs
  @ 9 hours 55 mins 8 secs
  @ 10 hours 55 mins 8 secs
-(3 rows)
+ @ 17800 years
+(4 rows)
 
 SELECT * FROM test_interval WHERE i>'08:55:08'::interval ORDER BY i;
  i 
 ---
  @ 9 hours 55 mins 8 secs
  @ 10 hours 55 mins 8 secs
-(2 rows)
+ @ 17800 years
+(3 rows)
 
diff --git a/contrib/btree_gin/sql/interval.sql b/contrib/btree_gin/sql/interval.sql
new file mode 100644
index e385158..7a2f3ac
--- a/contrib/btree_gin/sql/interval.sql
+++ b/contrib/btree_gin/sql/interval.sql
@@ -5,12 +5,14 @@ CREATE TABLE test_interval (
 );
 
 INSERT INTO test_interval VALUES
+	( '-17800 years' ),
 	( '03:55:08' ),
 	( '04:55:08' ),
 	( '05:55:08' ),
 	( '08:55:08' ),
 	( '09:55:08' ),
-	( '10:55:08' )
+	( '10:55:08' ),
+	( '17800 years' )
 ;
 
 CREATE INDEX idx_interval ON test_interval USING gin (i);


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-10-27 Thread tender wang
Hi Alvaro,
I re-analyzed this issue, and here is my analysis process.
step 1: CREATE TABLE p ( id bigint PRIMARY KEY )
PARTITION BY list (id);
 step2: CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);
 step3: CREATE TABLE r_1 (
id   bigint PRIMARY KEY,
p_id bigint NOT NULL,
FOREIGN KEY (p_id) REFERENCES p (id)
  );
After above step 3 operations, we have below catalog tuples:
postgres=# select oid, relname from pg_class where relname = 'p';
  oid  | relname
-------+---------
 16384 | p
(1 row)
postgres=# select oid, relname from pg_class where relname = 'p_1';
  oid  | relname
-------+---------
 16389 | p_1
(1 row)
postgres=# select oid, relname from pg_class where relname = 'r_1';
  oid  | relname
-------+---------
 16394 | r_1
(1 row)
postgres=# select oid, conname,conrelid,conparentid,confrelid from
pg_constraint where conrelid = 16394;
  oid  |      conname      | conrelid | conparentid | confrelid
-------+-------------------+----------+-------------+-----------
 16397 | r_1_p_id_not_null |    16394 |           0 |         0
 16399 | r_1_pkey          |    16394 |           0 |         0
 16400 | r_1_p_id_fkey     |    16394 |           0 |     16384
 16403 | r_1_p_id_fkey1    |    16394 |       16400 |     16389
(4 rows)
postgres=# select oid, tgrelid, tgparentid,
tgconstrrelid,tgconstrindid,tgconstraint from pg_trigger where tgconstraint
= 16403;
  oid  | tgrelid | tgparentid | tgconstrrelid | tgconstrindid | tgconstraint
-------+---------+------------+---------------+---------------+--------------
 16404 |   16389 |      16401 |         16394 |         16392 |        16403
 16405 |   16389 |      16402 |         16394 |         16392 |        16403
(2 rows)
postgres=# select oid, tgrelid, tgparentid,
tgconstrrelid,tgconstrindid,tgconstraint from pg_trigger where tgconstraint
= 16400;
  oid  | tgrelid | tgparentid | tgconstrrelid | tgconstrindid | tgconstraint
-------+---------+------------+---------------+---------------+--------------
 16401 |   16384 |          0 |         16394 |         16387 |        16400
 16402 |   16384 |          0 |         16394 |         16387 |        16400
 16406 |   16394 |          0 |         16384 |         16387 |        16400
 16407 |   16394 |          0 |         16384 |         16387 |        16400
(4 rows)
Table p is a partitioned table with one child table, p_1. So when r_1
adds its foreign key constraint, according to addFkRecurseReferenced(),
each partition should get one pg_constraint row (e.g. r_1_p_id_fkey1).
After addFkRecurseReferenced() is called in ATAddForeignKeyConstraint(),
addFkRecurseReferencing() is called, which adds an INSERT check trigger
and an UPDATE check trigger for r_1_p_id_fkey but not for r_1_p_id_fkey1.
So when r_1 is detached from r, according to DetachPartitionFinalize(),
the inherited FKs should be unlinked from their parent, and the created
INSERT and UPDATE check triggers should be unlinked from those FKs. But
just like I said above, r_1_p_id_fkey1 actually doesn't have an INSERT
check trigger.

I slightly modified the previous patch, but I didn't add a test case,
because I found another issue.
After doing ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1);
I ran oidjoins.sql and got warnings as below:
psql:/tender/postgres/src/test/regress/sql/oidjoins.sql:49: WARNING:  FK
VIOLATION IN pg_trigger({tgparentid}): ("(0,3)",16401)
psql:/tender/postgres/src/test/regress/sql/oidjoins.sql:49: WARNING:  FK
VIOLATION IN pg_trigger({tgparentid}): ("(0,4)",16402)
postgres=# select oid, tgrelid, tgparentid,
tgconstrrelid,tgconstrindid,tgconstraint from pg_trigger where oid >= 16384;
  oid  | tgrelid | tgparentid | tgconstrrelid | tgconstrindid |
tgconstraint
---+-++---+---+--
 16404 |   16389 |  16401 | 16394 | 16392 |16403
 16405 |   16389 |  16402 | 16394 | 16392 |16403
 16415 |   16384 |  0 | 16408 | 16387 |16414
 16416 |   16384 |  0 | 16408 | 16387 |16414
 16418 |   16389 |  16415 | 16408 | 16392 |16417
 16419 |   16389 |  16416 | 16408 | 16392 |16417
 16420 |   16408 |  0 | 16384 | 16387 |16414
 16421 |   16408 |  0 | 16384 | 16387 |16414
 16406 |   16394 |  16420 | 16384 | 16387 |16400
 16407 |   16394 |  16421 | 16384 | 16387 |16400
(10 rows)
The tuples with oid = 16401 and oid = 16402 have been deleted.
These two trigger tuples are deleted in tryAttachPartitionForeignKey,
called by CloneFkReferencing.
/*
* Looks good!  Attach this constraint.  The action triggers in the new
* partition become redundant -- the parent table already has equivalent
* ones, and those will be able to reach the partition.  Remove the ones
* in the partition.  We identify them because they have our constraint
* OID, as well as being on the referenced rel.
*/
The attached patch can't fix the above issue. I'm not sure about the
impact of this issue; maybe the redundant triggers don't need to be
removed.

But it surely makes oidjoins.sql fail if I add a test case to the v2
patch, so I didn't add a test case in the v2 patch.
No test case is not 

Re: Use virtual tuple slot for Unique node

2023-10-27 Thread Ashutosh Bapat
On Fri, Oct 27, 2023 at 8:48 AM David Rowley  wrote:
>
> On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
>  wrote:
> > We may save the size of data in VirtualTupleTableSlot, thus avoiding
> > the first loop. I assume that when allocating
> > VirtualTupleTableSlot->data, we always know what size we are
> > allocating so it should be just a matter of saving it in
> > VirtualTupleTableSlot->size. This should avoid the first loop in
> > tts_virtual_materialize() and give some speed up. We will need a loop
> > to repoint non-byval Datums. I imagine that the pointers to non-byval
> > Datums can be computed as dest_slot->tts_values[natts] =
> > dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> > This would work as long as all the non-byval datums in the source slot
> > are all stored flattened in source slot's data. I am assuming that
> > that would be true in a materialized virtual slot. The byval datums
> > are copied as is. I think, this will avoid multiple memcpy calls, one
> > per non-byval attribute and hence some speedup. I may be wrong though.
>
> hmm, do you think it's common enough that we copy an already
> materialised virtual slot?
>
> I tried adding the following code totts_virtual_copyslot and didn't
> see the NOTICE message when running each of your test queries. "make
> check" also worked without anything failing after adjusting
> nodeUnique.c to always use a TTSOpsVirtual slot.
>
> +   if (srcslot->tts_ops ==  && TTS_SHOULDFREE(srcslot))
> +   elog(NOTICE, "We copied a materialized virtual slot!");
>
> I did get a failure in postgres_fdw's tests:
>
>server loopback options (table_name 'tab_batch_sharded_p1_remote');
>  insert into tab_batch_sharded select * from tab_batch_local;
> +NOTICE:  We copied a materialized virtual slot!
> +NOTICE:  We copied a materialized virtual slot!
>
> so I think it's probably not that common that we'd gain from that 
> optimisation.

Thanks for this analysis. If we aren't copying a materialized virtual
slot often, no point in adding that optimization.

>
> What might buy us a bit more would be to get rid of the for loop
> inside tts_virtual_copyslot() and copy the guts of
> tts_virtual_materialize() into tts_virtual_copyslot() and set the
> dstslot tts_isnull and tts_values arrays in the same loop that we use
> to calculate the size.
>
> I tried that in the attached patch and then tested it alongside the
> patch that changes the slot type.
>
> master = 74604a37f
> 1 = [1]
> 2 = optimize_tts_virtual_copyslot.patch
>
> Using the script from [2] and the setup from [3] but reduced to 10k
> tuples instead of 1 million.
>
> Times the average query time in milliseconds for a 60 second pgbench run.
>
> query   master   master+1   master+1+2   m/m+1     m/m+1+2
> Q1      2.616    2.082      1.903        125.65%   137.47%
> Q2      9.479    10.593     10.361       89.48%    91.49%
> Q3      10.329   10.357     10.627       99.73%    97.20%
> Q4      7.272    7.306      6.941        99.53%    104.77%
> Q5      7.597    7.043      6.645        107.87%   114.33%
> Q6      162.177  161.037    162.807      100.71%   99.61%
> Q7      59.288   59.43      61.294       99.76%    96.73%
>
>                                          103.25%   105.94%
>
> I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
> did, so the results might not be stable enough to warrant making any
> decisions from.

I am actually surprised to see that eliminating the loop is showing
improvements. There aren't hundreds of attributes involved in those
queries. So I share your stability concerns. And even with these
patches, Q2 and Q3 are still slower.

Q1 is consistently giving performance > 25% for both of us. But other
queries aren't showing a whole lot improvement. So I am wondering
whether it's worth pursuing this change; similar to the opinion you
expressed a few emails earlier.

>
> I was uncertain if the old behaviour of when srcslot contains fewer
> attributes than dstslot was intended or not.  What happens there is
> that we'd leave the additional old dstslot tts_values in place and
> only overwrite up to srcslot->natts but then we'd go on and
> materialize all the dstslot attributes. I think this might not be
> needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> be ok just to materialize the srcslot attributes and ignore the
> previous additional dstslot attributes.  Changing it to that would
> make the attached patch more simple.

We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forget
what the difference is. If we do what you say, we might end up trying
to access unmaterialized values beyond tts_nvalid. Better to
investigate whether such a hazard exists.

-- 
Best Wishes,
Ashutosh Bapat




Invalid Path with UpperRel

2023-10-27 Thread Alena Rybakina

Hi, hackers!

I recently encountered strange behavior when, after running the
create_misc.sql test, I ran the last query from that test. In general,
the reproduction looks like this:


\i src/test/regress/sql/create_misc.sql

I added Assert(0) in create_sort_plan() before the call to
create_plan_recurse and restarted postgres. After that I ran the query:


SELECT relname, reltoastrelid <> 0 AS has_toast_table
   FROM pg_class
   WHERE oid::regclass IN ('a_star', 'c_star')
   ORDER BY 1;

I found an invalid path in cheapest_startup_path:

 (gdb) p *(IndexPath *)((SortPath 
*)best_path)->subpath->parent->cheapest_startup_path
$12 = {path = {type = T_Invalid, pathtype = T_Invalid, parent = 
0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f, param_info = 
0x7f7f7f7f7f7f7f7f,
parallel_aware = 127, parallel_safe = 127, parallel_workers = 
2139062143 , rows = 1.3824172084878715e+306, 
startup_cost = 1.3824172084878715e+306,
total_cost = 1.3824172084878715e+306, pathkeys = 
0x7f7f7f7f7f7f7f7f}, indexinfo = 0x7f7f7f7f7f7f7f7f, indexclauses = 
0x7f7f7f7f7f7f7f7f,
  indexorderbys = 0x7f7f7f7f7f7f7f7f, indexorderbycols = 
0x7f7f7f7f7f7f7f7f, indexscandir = 2139062143 , 
indextotalcost = 1.3824172084878715e+306,

  indexselectivity = 1.3824172084878715e+306}

(gdb) p (IndexPath *)((SortPath 
*)best_path)->subpath->parent->cheapest_startup_path

$11 = (IndexPath *) 0x555febc66160

I found that this begins with the creation of the upperrel (in the
fetch_upper_rel function):


/* primary planning entry point (may recurse for subqueries) */
root = subquery_planner(glob, parse, NULL, false, tuple_fraction);

/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);


(gdb) p *(IndexPath *)((SortPath *)final_rel->cheapest_total_path 
)->subpath->parent->cheapest_startup_path
$15 = {path = {type = T_Invalid, pathtype = T_Invalid, parent = 
0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f, param_info = 
0x7f7f7f7f7f7f7f7f,
parallel_aware = 127, parallel_safe = 127, parallel_workers = 
2139062143 , rows = 1.3824172084878715e+306, 
startup_cost = 1.3824172084878715e+306,
total_cost = 1.3824172084878715e+306, pathkeys = 
0x7f7f7f7f7f7f7f7f}, indexinfo = 0x7f7f7f7f7f7f7f7f, indexclauses = 
0x7f7f7f7f7f7f7f7f,
  indexorderbys = 0x7f7f7f7f7f7f7f7f, indexorderbycols = 
0x7f7f7f7f7f7f7f7f, indexscandir = 2139062143 , 
indextotalcost = 1.3824172084878715e+306,

  indexselectivity = 1.3824172084878715e+306}
(gdb) p (IndexPath *)((SortPath *)final_rel->cheapest_total_path 
)->subpath->parent->cheapest_startup_path

$16 = (IndexPath *) 0x555febc66160

I know it doesn't cause a crash anywhere, but can anybody explain to me
what's going on here and why this invalid path appears?


--
Regards,
Alena Rybakina


Re: Synchronizing slots from primary to standby

2023-10-27 Thread shveta malik
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/25/23 5:00 AM, shveta malik wrote:
> > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
> >  wrote:
> >>
> >> Hi,
> >>
> >> On 10/23/23 2:56 PM, shveta malik wrote:
> >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> >>>  wrote:
> >>
>  We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking 
>  if there
>  is new synced slot(s) to be created on the standby. Do we want to keep 
>  this behavior
>  for V1?
> 
> >>>
> >>> I think for the slotsync workers case, we should reduce the naptime in
> >>> the launcher to say 30sec and retain the default one of 3mins for
> >>> subscription apply workers. Thoughts?
> >>>
> >>
> >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> >> API on the standby that would refresh the list of sync slot at wish, 
> >> thoughts?
> >>
> >
> > Do you mean API to refresh list of DBIDs rather than sync-slots?
> > As per current design, launcher gets DBID lists for all the failover
> > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
>
> I mean an API to get a newly created slot on the primary being created/synced 
> on
> the standby at wish.
>
> Also let's imagine this scenario:
>
> - create logical_slot1 on the primary (and don't start using it)
>
> Then on the standby we'll get things like:
>
> 2023-10-25 08:33:36.897 UTC [740298] LOG:  waiting for remote slot 
> "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot 
> LSN (0/C0049530) and and catalog xmin (754)
>
> That's expected and due to the fact that ReplicationSlotReserveWal() does set 
> the slot
> restart_lsn to a value < at the corresponding restart_lsn slot on the primary.
>
> - create logical_slot2 on the primary (and start using it)
>
> Then logical_slot2 won't be created/synced on the standby until there is 
> activity on logical_slot1 on the primary
> that would produce things like:
>
> 2023-10-25 08:41:35.508 UTC [740298] LOG:  wait over for remote slot 
> "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed 
> local slot LSN (0/C0049530) and catalog xmin (754)
>
> With this new dedicated API, it will be:
>
> - clear that the API call is "hanging" until there is some activity on the 
> newly created slot
> (currently there is "waiting for remote slot " message in the logfile as 
> mentioned above but
> I'm not sure that's enough)
>
> - be possible to create/sync logical_slot2 in the example above without 
> waiting for activity
> on logical_slot1.
>
> Maybe we should change our current algorithm during slot creation so that a 
> newly created inactive
> slot on the primary does not block other newly created "active" slots on the 
> primary to be created
> on the standby? Depending on how we implement that, the new API may not be 
> needed at all.
>
> Thoughts?
>

I discussed this with my colleague Hou-San and we think that one
possibility could be to somehow accelerate the increment of
restart_lsn on primary.  This can be achieved by connecting to the
remote and executing pg_log_standby_snapshot() at reasonable intervals
while waiting on standby during slot creation. This may increase speed
to a reasonable extent w/o having to wait for the user or bgwriter to
do the same for us. The current logical decoding uses a similar
approach to speed up slot creation.  I refer to the usage of
LogStandbySnapshot() in SnapBuildWaitSnapshot() and
ReplicationSlotReserveWal().
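
For reference, the call that would be issued periodically on the
primary over that connection is simply (illustrative):

SELECT pg_log_standby_snapshot();
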
Thoughts?

thanks
Shveta




Re: pg_upgrade's object listing

2023-10-27 Thread Daniel Gustafsson
> On 27 Oct 2023, at 10:44, Alvaro Herrera  wrote:

> I honestly doubt that this sort of message is in any way useful, other
> than for program debugging.  Maybe listing databases and perhaps slots
> in verbose mode is OK, but tables?  I don't think so.

Outputting this in verbose mode is unlikely to help in regular usage and
instead risks drowning out other output.  It would be more useful to be
able to specify a logfile for objects and keep verbose output for more
informational and actionable messages.

--
Daniel Gustafsson





Re: pg_upgrade's object listing

2023-10-27 Thread Alvaro Herrera
Hi,

On 2023-Oct-27, Kyotaro Horiguchi wrote:

> I found the following message recently introduced in pg_upgrade:
> 
> > pg_log(PG_VERBOSE, "slot_name: \"%s\", plugin: \"%s\", 
> > two_phase: %s",
> >slot_info->slotname,
> >slot_info->plugin,
> >slot_info->two_phase ? "true" : "false");
> 
> If the labels correspond to the struct member names, the first label
> ought to be "slotname". If not, all labels of this type, including
> those adjucent, should have a more natural spelling.
> 
> What do you think about this?

I think this shouldn't be a translatable message in the first place.

Looking at the wording of other messages in pg_upgrade --verbose, it
doesn't look like any of it is intended for user consumption.  I mean,
look at this monstrosity

pg_log(PG_VERBOSE, "relname: \"%s.%s\", reloid: %u, 
reltblspace: \"%s\"",

Before 249d74394500 it used to be even more hideous.  This message comes
straight from the initial pg_upgrade commit in 2010, c2e9b2f28818, where
it was a debug message.  We seem to have promoted it to a verbose
message (commit 717f6d60859c) for no particular reason and without
careful consideration.

I honestly doubt that this sort of message is in any way useful, other
than for program debugging.  Maybe listing databases and perhaps slots
in verbose mode is OK, but tables?  I don't think so.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php




Re: Infinite Interval

2023-10-27 Thread Dean Rasheed
On Wed, 4 Oct 2023 at 14:29, Ashutosh Bapat
 wrote:
>
> I think we should introduce interval_out_of_range_error() function on
> the lines of float_overflow_error(). Later patches introduce more
> places where we raise that error. We can introduce the function as
> part of those patches.
>

I'm not convinced that it is really worth it. Note also that even with
this patch, there are still more places that throw "timestamp out of
range" errors than "interval out of range" errors.

> > 4. I tested pg_upgrade on a table with an interval with INT_MAX
> > months, and it was silently converted to infinity. I think that's
> > probably the best outcome (better than failing). However, this means
> > that we really should require all 3 fields of an interval to be
> > INT_MIN/MAX for it to be considered infinite, otherwise it would be
> > possible to have multiple internal representations of infinity that do
> > not compare as equal.
> >
> My first patch was comparing all the three fields to determine whether
> a given Interval value represents infinity. [3] changed that to use
> only the month field. I guess that was based on the discussion at [4].
> You may want to review that discussion if not already done. I am fine
> either way. We should be able to change the comparison code later if
> we see performance getting impacted.
>

Before looking at the details more closely, I might have agreed with
that earlier discussion. However, given that things like pg_upgrade
have the possibility of turning formerly allowed, finite intervals
into infinity, we really need to ensure that there is only one value
equal to infinity, otherwise the results are likely to be very
confusing and counter-intuitive. That means that we have to continue
to regard intervals like INT32_MAX months + 10 days as finite.

While I haven't done any performance testing, I wouldn't expect this
to have much impact. In a 64-bit build, this actually generates 2
comparisons rather than 3 -- one comparing the combined month and day
fields against a 64-bit value containing 2 copies of INT32_MAX, and
one testing the time field. In practice, only the first test will be
executed in the vast majority of cases.


Something that perhaps does need discussing is the fact that
'2147483647 months 2147483647 days 9223372036854775807 usecs' is now
accepted by interval_in() and gives infinity. That's a bit ugly, but I
think it's defensible as a measure to prevent dump/restore errors from
older databases, and in any case, such an interval is outside the
documented range of supported intervals, and is a highly contrived
example, vanishingly improbable in practice.
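
For illustration, with the patch that contrived input would behave
roughly like this (assumed output):

SELECT '2147483647 months 2147483647 days 9223372036854775807 usecs'::interval;
 interval
----------
 infinity
(1 row)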

Alternatively, we could have interval_in() reject this, which would
open up the possibility of dump/restore errors. It could be argued
that that's OK, for similar reasons -- the failing value is highly
unlikely/contrived, and out of the documented range. I don't like that
though. I don't think dump/restore should fail under any
circumstances, however unlikely.

Another alternative is to accept this input, but emit a WARNING. I
don't particularly like that either, since it's forcing a check on
every input value, just to cater for this one specific highly unlikely
input. In fact, both these alternative approaches (rejecting the
value, or emitting a warning), would impose a small performance
penalty on every interval input, which I don't think is really worth
it.

So overall, my preference is to just accept it. Anything else is more
work, for no practical benefit.

Regards,
Dean




Re: Synchronizing slots from primary to standby

2023-10-27 Thread Amit Kapila
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
 wrote:
>
> On 10/25/23 5:00 AM, shveta malik wrote:
> > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
> >  wrote:
> >>
> >> Hi,
> >>
> >> On 10/23/23 2:56 PM, shveta malik wrote:
> >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> >>>  wrote:
> >>
>  We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking 
>  if there
>  is new synced slot(s) to be created on the standby. Do we want to keep 
>  this behavior
>  for V1?
> 
> >>>
> >>> I think for the slotsync workers case, we should reduce the naptime in
> >>> the launcher to say 30sec and retain the default one of 3mins for
> >>> subscription apply workers. Thoughts?
> >>>
> >>
> >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> >> API on the standby that would refresh the list of sync slot at wish, 
> >> thoughts?
> >>
> >
> > Do you mean API to refresh list of DBIDs rather than sync-slots?
> > As per current design, launcher gets DBID lists for all the failover
> > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
>
> I mean an API to get a newly created slot on the primary being created/synced 
> on
> the standby at wish.
>
> Also let's imagine this scenario:
>
> - create logical_slot1 on the primary (and don't start using it)
>
> Then on the standby we'll get things like:
>
> 2023-10-25 08:33:36.897 UTC [740298] LOG:  waiting for remote slot 
> "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot 
> LSN (0/C0049530) and and catalog xmin (754)
>
> That's expected and due to the fact that ReplicationSlotReserveWal() does set 
> the slot
> restart_lsn to a value < at the corresponding restart_lsn slot on the primary.
>
> - create logical_slot2 on the primary (and start using it)
>
> Then logical_slot2 won't be created/synced on the standby until there is 
> activity on logical_slot1 on the primary
> that would produce things like:
>
> 2023-10-25 08:41:35.508 UTC [740298] LOG:  wait over for remote slot 
> "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed 
> local slot LSN (0/C0049530) and catalog xmin (754)
>
> With this new dedicated API, it will be:
>
> - clear that the API call is "hanging" until there is some activity on the 
> newly created slot
> (currently there is "waiting for remote slot " message in the logfile as 
> mentioned above but
> I'm not sure that's enough)
>

I think even if we provide such an API, we need to have logic to get
the slots from the primary and create them. Say, even if the user used
the APIs, there may still be some new slots that the sync worker needs
to create. I think it might be better to provide a view for users to
view the current state of sync. For example, in the above case, we can
say "waiting for the primary to advance remote LSN" or something like
that.

-- 
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgade

2023-10-27 Thread Alvaro Herrera
On 2023-Oct-27, Kyotaro Horiguchi wrote:

> @@ -1433,8 +1433,8 @@ 
> InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
>   {
>   ereport(ERROR,
>   
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> - errmsg("replication slots must not be 
> invalidated during the upgrade"),
> - errhint("\"max_slot_wal_keep_size\" 
> must be set to -1 during the upgrade"));

Hmm, if I read this code right, this error is going to be thrown by the
checkpointer while finishing a checkpoint.  Fortunately, the checkpoint
record has already been written, but I don't know what would happen if
this is thrown while trying to write the shutdown checkpoint.  Probably
nothing terribly good.

I don't think this is useful.  If the setting is invalid during binary
upgrade, let's prevent it from being set at all right from the start of
the upgrade process.  In InvalidatePossiblyObsoleteSlot() we could have
just an Assert() or elog(PANIC).

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: pg_dump not dumping the run_as_owner setting from version 16?

2023-10-27 Thread Laurenz Albe
On Fri, 2023-10-27 at 18:05 +1100, Philip Warner wrote:
> As far as I can tell, pg_dump does not dump the "run_as_owner" setting for a
> subscription.
> 
> Should it? Should I submit a patch? It seems pretty trivial to fix if anyone 
> else is working on it.

Yes, it certainly should.  That is an omission in 482675987b.
Go ahead and write a fix!


> Further to this: it seems that `Alter Subscription X Set(Run_As_Owner=True);`
> has no influence on the `subrunasowner` column of pg_subscription.

This seems to have been fixed in f062cddafe.
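
A quick way to verify the catalog state (a sketch; "sub1" is just a
placeholder subscription name):

ALTER SUBSCRIPTION sub1 SET (run_as_owner = true);
SELECT subname, subrunasowner FROM pg_subscription;
-- subrunasowner should now show 't' for sub1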

Yours,
Laurenz Albe




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-10-27 Thread Michael Paquier
On Fri, Oct 27, 2023 at 09:45:52AM +0200, Drouvot, Bertrand wrote:
> Oh I see, yeah I do agree to set tablestatus->trans to NULL to avoid
> any undesired interference with tabentry->trans.
> 
> Done in V8 attached (pgindent has been run on pgstatfuncs.c and
> pgstat_relation.c).

LGTM.
--
Michael


signature.asc
Description: PGP signature


Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-10-27 Thread Drouvot, Bertrand

Hi,

On 10/27/23 8:07 AM, Michael Paquier wrote:


The part that I found disturbing is here:
+   tabentry = (PgStat_TableStatus *) entry_ref->pending;
+   tablestatus = palloc(sizeof(PgStat_TableStatus));
+   *tablestatus = *tabentry;

This causes tablestatus->trans to point to the same location as
tabentry->trans, but wouldn't it be better to set tablestatus->trans
to NULL instead for the copy returned to the caller?


Oh I see, yeah I do agree to set tablestatus->trans to NULL to avoid
any undesired interference with tabentry->trans.

Done in V8 attached (pgindent has been run on pgstatfuncs.c and
pgstat_relation.c).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comFrom 9f69363296058bd473efa50f8d8b995627f0dbdf Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 26 Oct 2023 06:38:08 +
Subject: [PATCH v8] Reconcile stats in find_tabstat_entry()

---
 src/backend/utils/activity/pgstat_relation.c | 38 ++-
 src/backend/utils/adt/pgstatfuncs.c  | 66 ++--
 2 files changed, 41 insertions(+), 63 deletions(-)
  42.3% src/backend/utils/activity/
  57.6% src/backend/utils/adt/

diff --git a/src/backend/utils/activity/pgstat_relation.c 
b/src/backend/utils/activity/pgstat_relation.c
index 9876e0c1e8..4c694d925d 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -478,20 +478,52 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid)
  * Find any existing PgStat_TableStatus entry for rel_id in the current
  * database. If not found, try finding from shared tables.
  *
+ * If an entry is found, copy it and increment the copy's counters with their
+ * subtransactions counterparts. Then return the copy. The caller may need to
+ * pfree the copy (in case the MemoryContext is not reset soon after).
+ *
  * If no entry found, return NULL, don't create a new one
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
PgStat_EntryRef *entry_ref;
+   PgStat_TableXactStatus *trans;
+   PgStat_TableStatus *tabentry = NULL;
+   PgStat_TableStatus *tablestatus = NULL;
 
entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, 
MyDatabaseId, rel_id);
if (!entry_ref)
+   {
entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, 
InvalidOid, rel_id);
+   if (!entry_ref)
+   return tablestatus;
+   }
+
+   tabentry = (PgStat_TableStatus *) entry_ref->pending;
+   tablestatus = palloc(sizeof(PgStat_TableStatus));
+   *tablestatus = *tabentry;
+
+   /*
+* We don't want tablestatus->trans to point to the same location as
+* tabentry->trans as that could lead to undesired behavior.
+*/
+   tablestatus->trans = NULL;
+
+   /*
+* Live subtransactions' counts aren't in counts yet. This is not a hot
+* code path so it sounds ok to reconcile for tuples_inserted,
+* tuples_updated and tuples_deleted even if this is not what the caller
+* is interested in.
+*/
+   for (trans = tabentry->trans; trans != NULL; trans = trans->upper)
+   {
+   tablestatus->counts.tuples_inserted += trans->tuples_inserted;
+   tablestatus->counts.tuples_updated += trans->tuples_updated;
+   tablestatus->counts.tuples_deleted += trans->tuples_deleted;
+   }
 
-   if (entry_ref)
-   return entry_ref->pending;
-   return NULL;
+   return tablestatus;
 }
 
 /*
diff --git a/src/backend/utils/adt/pgstatfuncs.c 
b/src/backend/utils/adt/pgstatfuncs.c
index 6468b6a805..998c69e241 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1588,68 +1588,14 @@ PG_STAT_GET_XACT_RELENTRY_INT64(blocks_fetched)
 /* pg_stat_get_xact_blocks_hit */
 PG_STAT_GET_XACT_RELENTRY_INT64(blocks_hit)
 
-Datum
-pg_stat_get_xact_tuples_inserted(PG_FUNCTION_ARGS)
-{
-   Oid relid = PG_GETARG_OID(0);
-   int64   result;
-   PgStat_TableStatus *tabentry;
-   PgStat_TableXactStatus *trans;
+/* pg_stat_get_xact_tuples_inserted */
+PG_STAT_GET_XACT_RELENTRY_INT64(tuples_inserted)
 
-   if ((tabentry = find_tabstat_entry(relid)) == NULL)
-   result = 0;
-   else
-   {
-   result = tabentry->counts.tuples_inserted;
-   /* live subtransactions' counts aren't in tuples_inserted yet */
-   for (trans = tabentry->trans; trans != NULL; trans = 
trans->upper)
-   result += trans->tuples_inserted;
-   }
+/* pg_stat_get_xact_tuples_updated */
+PG_STAT_GET_XACT_RELENTRY_INT64(tuples_updated)
 
-   PG_RETURN_INT64(result);
-}
-
-Datum
-pg_stat_get_xact_tuples_updated(PG_FUNCTION_ARGS)
-{
-   Oid relid = PG_GETARG_OID(0);
-   int64   result;
-   

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-27 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

I found that several machines on the BF got angry (e.g. [1]) because of
a missing update to meson.build. Sorry for that.
PSA the patch to fix it.

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual=2023-10-27%2006%3A08%3A31

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_meson.patch
Description: fix_meson.patch


Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-27 Thread Michael Paquier
On Mon, Oct 16, 2023 at 02:54:35PM +0900, Michael Paquier wrote:
> On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:
>> On 9/28/23 19:59, Michael Paquier wrote:
>>> Another idea I had was to force the creation of recovery.signal by
>>> pg_basebackup even if -R is not used.  All the reports we've seen with
>>> people getting confused came from pg_basebackup that enforces no
>>> configuration.
>> 
>> This change makes it more obvious if configuration is missing (since you'll
>> get an error), however +1 for adding this to pg_basebackup.
> 
> Looking at the streaming APIs of pg_basebackup, it looks like this
> would be a matter of using bbstreamer_inject_file() to inject an empty
> file into the stream.  Still something seems to be off once
> compression methods are involved..  Hmm.  I am not sure.  Well, this
> could always be done as a patch independant of this one, under a
> separate discussion.  There are extra arguments about whether it would
> be a good idea to add a recovery.signal even when taking a backup from
> a standby, and do that only in 17~.

Hmm.  On this specific point, it would actually be much simpler to
force recovery.signal to be in the contents streamed to a BASE_BACKUP.
This does not step on your proposal at [1], though, because you'd
still require a .signal file for recovery as far as I understand :/ 

[1]: 
https://www.postgresql.org/message-id/2daf8adc-8db7-4204-a7f2-a7e94e2bf...@pgmasters.net

Would folks be OK to move on with the patch of this thread at the end?
I am attempting a last-call kind of thing.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Add REINDEX tag to event triggers

2023-10-27 Thread Michael Paquier
On Mon, Sep 04, 2023 at 08:00:52PM +0200, Jim Jones wrote:
> LGTM. It applies and builds cleanly, all tests pass and documentation also
> builds ok. The CFbot seems also much happier now :)

+   /*
+* Open and lock the relation.  ShareLock is sufficient since we only 
need
+* to prevent schema and data changes in it.  The lock level used here
+* should match catalog's reindex_relation().
+*/
+   rel = try_table_open(relid, ShareLock);

I was eyeing at 0003, and this strikes me as incorrect.  Sure, this
matches what reindex_relation() does, but you've missed that
CONCURRENTLY takes a lighter ShareUpdateExclusiveLock, and ShareLock
conflicts with it.  See:
https://www.postgresql.org/docs/devel/explicit-locking.html

So, doesn't this disrupt a concurrent REINDEX?
--
Michael


signature.asc
Description: PGP signature


Re: run pgindent on a regular basis / scripted manner

2023-10-27 Thread Étienne BERSAC
Hello,


> Yes, there's a lot to look out for, and you're a damn sight better at
> it 
> than I am. But we should try to automate the things that can be 
> automated, even if that leaves many tasks that can't be. I have three
> things in my pre-commit hook: a check for catalog updates, a check
> for 
> new typedefs, and an indent check.

Could you share your configuration? Could we provide more helpers and
integration to help produce consistent code?

For the logfmt extension, I configured clang-format so that Emacs
formats the buffer on save. Any editor running clangd will use this.
It eases my mind about formatting. I need to investigate how to use
pgindent instead, or at least ensure clang-format produces the same
output as pgindent.

https://gitlab.com/dalibo/logfmt/-/blob/0d808b368e649b23ac06ce2657354b67be398b21/.clang-format

Automating nitpicking in CI is good, but checking locally before sending
the patch will save way more round-trips.

Regards,
Étienne




Re: RFC: Logging plan of the running query

2023-10-27 Thread Étienne BERSAC


Hi Torikoshia,

> If so, we once tried to implement such function for getting memory 
> contexts.
> However, this attempt didn't succeed because it could lead dead lock 
> situation[1].

Thanks for the pointer. Why not use a client log message to allow the
client to get the plan without locking the backend memory context and
without access to the server log? I missed the rationale for not sending
the plan to the client.

Regards,
Étienne




RE: pg_dump not dumping the run_as_owner setting from version 16?

2023-10-27 Thread Philip Warner
Further to this: it seems that `Alter Subscription X Set(Run_As_Owner=True);` 
has no influence on the `subrunasowner` column of pg_subscription.


From: Philip Warner
Sent: Friday, 27 October 2023 3:26 PM
To: pgsql-hackers@lists.postgresql.org
Subject: pg_dump not dumping the run_as_owner setting from version 16?

Hi,

As far as I can tell, pg_dump does not dump the "run_as_owner" setting
for a subscription.

Should it? Should I submit a patch? It seems pretty trivial to fix if anyone 
else is working on it.





Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-10-27 Thread Laurenz Albe
On Fri, 2022-11-04 at 10:49 +0100, Laurenz Albe wrote:
> On Thu, 2022-11-03 at 11:32 +0100, Laurenz Albe wrote:
> > On Wed, 2022-11-02 at 19:29 +, David Burns wrote:
> > 
> > > Some additional clarity in the versions 14/15 documentation would be 
> > > helpful specifically
> > > surrounding the "target_role" clause for the ALTER DEFAULT PRIVILEGES 
> > > command.
> > > To the uninitiated, the current description seems vague.  Maybe something 
> > > like the following would help:
> 
> After some more thinking, I came up with the attached patch.

I'm sending a reply to the hackers list, so that I can add the patch to the 
commitfest.

Yours,
Laurenz Albe




RE: CRC32C Parallel Computation Optimization on ARM

2023-10-27 Thread Xiang Gao
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>> We consider that a runtime check needs to be done in any scenario.
>> Here we only confirm that the compilation can be successful.
> >A runtime check will be done when choosing which algorithm.
> >You can think of us as merging USE_ARMV8_VMULL and 
> >USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

>Oh.  Looking again, I see that we are using a runtime check for ARM in all
>cases with this patch.  If so, maybe we should just remove
>USE_ARV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>opportunities to simplify things, too.

Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.


0004-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0004-crc32c-parallel-computation-optimization-on-arm.patch


Re: Postgres Partitions Limitations (5.11.2.3)

2023-10-27 Thread Laurenz Albe
On Mon, 2023-01-09 at 16:40 +0100, Laurenz Albe wrote:
> > "Using ONLY to add or drop a constraint on only the partitioned table is
> > supported as long as there are no partitions. Once partitions exist, using
> > ONLY will result in an error. Instead, constraints on the partitions
> > themselves can be added and (if they are not present in the parent table)
> > dropped." This seems in contradiction to the example involving adding a
> > unique constraint while minimizing locking at the bottom of "5.11.2.2.
> > Partition Maintenance", which seems to run fine on my local Pg instance:
> > 
> > This technique can be used with UNIQUE and PRIMARY KEY constraints too; the
> > indexes are created implicitly when the constraint is created. Example:
> 
> No, that is actually an omission in the documentation.
> 
> The attached patch tries to improve that.

I am sending a reply to the hackers list, so that I can add the patch to the 
commitfest.

Yours,
Laurenz Albe




Re: "38.10.10. Shared Memory and LWLocks" may require a clarification

2023-10-27 Thread Michael Paquier
On Tue, May 23, 2023 at 01:47:52PM +0300, Aleksander Alekseev wrote:
> That's me still talking to myself :)

Let's be two then.

> Evidently this works differently from what I initially thought on
> Windows due to lack of fork() on this system.

This comes down to the fact that processes executed with EXEC_BACKEND,
because Windows does not know how to do a fork(), need to update their
local variables to point to the shared memory structures already
created, so we have to call CreateSharedMemoryAndSemaphores() in this
case.

> PFA the patch v3. Your feedback is most welcomed.

+
+It is convenient to use shmem_startup_hook which allows
+placing all the code responsible for initializing shared memory in one 
place.
+When using shmem_startup_hook the extension still needs
+to acquire AddinShmemInitLock in order to work 
properly
+on all the supported platforms including Windows.

Yeah, AddinShmemInitLock is useful because extensions have no base
point outside that, and they may want to update their local variables.
Still, this is not completely exact because EXEC_BACKEND on
non-Windows platforms would still need it, so this should be mentioned.
Another thing is that extensions may do something like autoprewarm.c,
where the shmem area is not initialized in the startup shmem hook.  This
is a bit of cheating because this is a scenario where shmem_request_hook
is not used, so shmem needs overallocation, but you'd also need an
LWLock in this case, even for non-WIN32.

+on all the supported platforms including Windows. This is also the reason
+why the return value of GetNamedLWLockTranche is
+conventionally stored in shared memory instead of local process memory.
+

Not sure I follow this sentence; the result of GetNamedLWLockTranche
is the lock, so this sentence does not seem really necessary?

While we're on it, why not make this part of the documentation more
modern?  We don't mention LWLockNewTrancheId() and
LWLockRegisterTranche() at all, so do you think that it would be worth
adding a code sample using those, mentioning autoprewarm.c as an
example?
--
Michael


signature.asc
Description: PGP signature


Re: unnest multirange, returned order

2023-10-27 Thread Laurenz Albe
On Fri, 2023-10-13 at 15:33 -0400, Daniel Fredouille wrote:
> sorry it took me some time to reply. Yes, the patch is perfect if this is 
> indeed the behavior.

I'm sending a reply to the hackers list so that I can add the patch to the 
commitfest.

Tiny as the patch is, I don't want it to fall between the cracks.

Yours,
Laurenz Albe




Re: pg_upgrade and logical replication

2023-10-27 Thread vignesh C
On Thu, 21 Sept 2023 at 11:27, Michael Paquier  wrote:
>
> On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote:
> > On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu)
> >  wrote:
> >> Is there a possibility that apply worker on old cluster connects to the
> >> publisher during the upgrade? Regarding the pg_upgrade on publisher, the we
> >> refuse TCP/IP connections from remotes and port number is also changed, so 
> >> we can
> >> assume that subscriber does not connect to. But IIUC such settings may not 
> >> affect
> >> to the connection source, so that the apply worker may try to connect to 
> >> the
> >> publisher. Also, is there any hazards if it happens?
> >
> > Yes, there is a possibility that the apply worker gets started and new
> > transaction data is being synced from the publisher. I have made a fix
> > not to start the launcher process in binary upgrade mode as we don't
> > want the launcher to start apply worker during upgrade.
>
> Hmm.  I was wondering if 0001 is the right way to handle this case,
> but at the end I'm OK to paint one extra isBinaryUpgrade in the code
> path where apply launchers are registered.  I don't think that the
> patch is complete, though.  A comment should be added in pg_upgrade's
> server.c, exactly start_postmaster(), to tell that -b also stops apply
> workers.  I am attaching a version updated as of the attached, that
> I'd be OK to apply.

I have added comments

> I don't really think that we need to worry about a subscriber
> connecting back to a publisher in this case, though?  I mean, each
> postmaster instance started by pg_upgrade restricts the access to the
> instance with unix_socket_directories set to a custom path and
> permissions at 0700, and a subscription's connection string does not
> know the unix path used by pg_upgrade.  I certainly agree that
> stopping these processes could lead to inconsistencies in the data the
> subscribers have been holding though, if we are not careful, so
> preventing them from running is a good practice anyway.

I have made the fix similar to how the publisher upgrade does it, to
keep it consistent.

> I have also reviewed 0002.  As a whole, I think that I'm OK with the
> main approach of the patch in pg_dump to use a new type of dumpable
> object for subscription relations that are dumped with their upgrade
> functions after.  This still needs more work, and more documentation.

Added documentation

> Also, perhaps we should really have an option to control if this part
> of the copy happens or not.  With a --no-subscription-relations for
> pg_dump at least?

Currently this is done by default in binary upgrade mode; I will add a
separate patch a little later to allow skipping the dump of subscription
relations from upgrade and from pg_dump.

>
> +{ oid => '4551', descr => 'add a relation with the specified relation state 
> to pg_subscription_rel table',
>
> During a development cycle, any new function added needs to use an OID
> in the range 8000-9999.  Running unused_oids will suggest new random OIDs.

Modified

> FWIW, I am not convinced that there is a need for two functions to add
> an entry to pg_subscription_rel, with the sole difference between them
> being the handling of a valid or invalid LSN.  We should have only one
> function that's able to handle NULL for the LSN.  So let's remove
> rel_state_a and rel_state_b, and have a single rel_state().  The
> description of the SQL functions is inconsistent with the other binary
> upgrade ones; I would suggest for the two functions:
> "for use by pg_upgrade (relation for pg_subscription_rel)"
> "for use by pg_upgrade (remote_lsn for origin)"

Removed rel_state_a and rel_state_b and updated the description accordingly
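
As a rough illustration of the single-function approach (the function name
and the argument order below are hypothetical, not taken from the patch), a
NULL LSN can simply map to InvalidXLogRecPtr:

/* Needs fmgr.h, utils/builtins.h and utils/pg_lsn.h. */
Datum
binary_upgrade_add_sub_rel_state(PG_FUNCTION_ARGS)
{
    char       *subname = text_to_cstring(PG_GETARG_TEXT_PP(0));
    Oid         relid = PG_GETARG_OID(1);
    char        state = PG_GETARG_CHAR(2);
    XLogRecPtr  sublsn;

    /* A NULL LSN maps to InvalidXLogRecPtr in pg_subscription_rel. */
    if (PG_ARGISNULL(3))
        sublsn = InvalidXLogRecPtr;
    else
        sublsn = PG_GETARG_LSN(3);

    /*
     * ... look up the subscription by subname, then insert the row for
     * (relid, state, sublsn) into pg_subscription_rel ...
     */

    PG_RETURN_VOID();
}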

> +   i_srsublsn = PQfnumber(res, "srsublsn");
> [...]
> +   subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, 
> i_srsublsn));
>
> In getSubscriptionTables(), this should check for PQgetisnull()
> because we would have a NULL value for InvalidXLogRecPtr in the
> catalog.  Using a char* for srsublsn is OK, but just assign NULL to
> it, then just pass a hardcoded NULL value to the function as we do in
> other places.  So I don't quite get why this is not the same handling
> as suboriginremotelsn.

Modified
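
In other words, something like this (a sketch only, reusing the variable
names from the snippet quoted above):

if (PQgetisnull(res, i, i_srsublsn))
    subrinfo[cur_rel].srsublsn = NULL;
else
    subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, i_srsublsn));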

>
> getSubscriptionTables() is entirely skipped if we don't want any
> subscriptions, if we deal with a server of 9.6 or older or if we don't
> do binary upgrades, which is OK.
>
> +/*
> + * getSubscriptionTables
> + *   get information about subscription membership for dumpable tables.
> + */
> This comment is slightly misleading and should mention that this is an
> upgrade-only path?

Modified

>
> The code for dumpSubscriptionTable() is a copy-paste of
> dumpPublicationTable(), but a lot of what you are doing here is
> actually pointless if we are not in binary mode?  Why should this code
> path not be taken only under dataOnly?  I mean, this is a code path we
> should never take except if we are in binary mode.  This should have
> at least a cross-check to make 

Re: On login trigger: take three

2023-10-27 Thread Mikhail Gribkov
Hi Alexander,

>> Thank you for catching it.  Please, post this.

Just for a more complete picture of the final state here.
I have posted the described fix (for avoiding race condition in the tests)
separately:
https://commitfest.postgresql.org/45/4616/

--
 best regards,
Mikhail A. Gribkov

e-mail: youzh...@gmail.com





Re: pg_upgrade's object listing

2023-10-27 Thread Kyotaro Horiguchi
At Fri, 27 Oct 2023 05:56:31 +, "Zhijie Hou (Fujitsu)" 
 wrote in 
> On Friday, October 27, 2023 1:21 PM Kyotaro Horiguchi 
>  wrote:
> > 
> > Hello.
> > 
> > I found the following message recently introduced in pg_upgrade:
> > 
> > >   pg_log(PG_VERBOSE, "slot_name: \"%s\", plugin: \"%s\",
> > two_phase: %s",
> > >  slot_info->slotname,
> > >  slot_info->plugin,
> > >  slot_info->two_phase ? "true" : "false");
> > 
> > If the labels correspond to the struct member names, the first label ought 
> > to be
> > "slotname". If not, all labels of this type, including those adjucent, 
> > should have a
> > more natural spelling.
> > 
> > What do you think about this?
> 
> Thanks for reporting. But I am not sure that renaming it to slotname (or
> something else) would be an improvement. I don't think we have a rule that
> the output must match the struct field names; existing messages don't
> follow it either [1]. So the current message looks OK to me.
> 
> [1]
> pg_log(PG_VERBOSE, "relname: \"%s.%s\", reloid: %u, reltblspace: 
> \"%s\"",
>rel_arr->rels[relnum].nspname,
>rel_arr->rels[relnum].relname,
>rel_arr->rels[relnum].reloid,
>rel_arr->rels[relnum].tablespace);

Thanks for sharing your perspective. I share similar sentiments. The
initial question arose during message translation.  For the most recent
messages, I opted not to translate the labels since they looked like struct
member names. From that viewpoint, "slot_name" is rather ambiguous.

If there's no interest in modifying it, I will retain the original
labels in translated messages, and that should suffice.
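
For what it's worth, if more natural spellings were wanted, a hypothetical
variant (not something I am proposing in a patch) could be:

    pg_log(PG_VERBOSE, "slot name: \"%s\", plugin: \"%s\", two-phase: %s",
           slot_info->slotname,
           slot_info->plugin,
           slot_info->two_phase ? "true" : "false");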

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-10-27 Thread Michael Paquier
On Thu, Oct 26, 2023 at 10:04:25AM +0200, Drouvot, Bertrand wrote:
> By "used in an unexpected way in the future", what do you mean exactly? Do 
> you mean
> the caller forgetting that it is working on a copy and then working with
> "stale" counters?

(Be careful about the code indentation.)

The part that I found disturbing is here:
+   tabentry = (PgStat_TableStatus *) entry_ref->pending;
+   tablestatus = palloc(sizeof(PgStat_TableStatus));
+   *tablestatus = *tabentry;

This causes tablestatus->trans to point to the same location as
tabentry->trans, but wouldn't it be better to set tablestatus->trans
to NULL instead for the copy returned to the caller?
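
Something like the following, for instance (just a sketch of the idea):

    tabentry = (PgStat_TableStatus *) entry_ref->pending;
    tablestatus = palloc(sizeof(PgStat_TableStatus));
    *tablestatus = *tabentry;

    /* Do not expose the live pending transaction data through the copy. */
    tablestatus->trans = NULL;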
--
Michael

