Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Pavel Stehule
On Thu, Dec 15, 2022 at 8:53 AM Kyotaro Horiguchi wrote:

> At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule 
> wrote in
> > On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada wrote:
> > > Is this a bug in plpgsql?
> > >
> >
> > I think it is by design.  There is no callback that is called after an
> > exception.
> >
> > Still, some callbacks on statement errors and function errors would
> > be nice. They could help me implement profilers or tracers more simply
> > and more robustly.
> >
> > But I am not sure about performance impacts. This is on a critical path.
>
> I didn't search for it, but I guess none of the end-side callbacks of
> begin/end-style callback pairs are called on exception. An additional
> PG_TRY level wouldn't be acceptable for performance reasons.
>
> What we (pg_hint_plan people) want is any means to know that the
> top-level function has exited, in order to reset the function nest level.
> That would be simpler than calling an end callback at every nest level.
>
>
I found a solution based on using the fmgr hook:

https://github.com/okbob/plpgsql_check/commit/9a17e97354a48913de5219048ee3be6f8460bae9
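
The key point is that the fmgr hook fires not only on function entry
(FHET_START) and normal exit (FHET_END) but also on error exit (FHET_ABORT),
which is what makes it usable here.  A minimal sketch of the idea (illustrative
only - the actual plpgsql_check code differs, and nest_level and the my_*
names are made up):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

static needs_fmgr_hook_type prev_needs_fmgr_hook = NULL;
static fmgr_hook_type prev_fmgr_hook = NULL;
static int	nest_level = 0;		/* per-backend nesting counter */

static bool
my_needs_fmgr_hook(Oid fn_oid)
{
	/* hook every function; a real plugin would filter on plpgsql here */
	return true;
}

static void
my_fmgr_hook(FmgrHookEventType event, FmgrInfo *flinfo, Datum *private)
{
	switch (event)
	{
		case FHET_START:
			nest_level++;
			break;
		case FHET_END:
		case FHET_ABORT:		/* fires even when the call ends in error */
			nest_level--;
			break;
	}

	if (prev_fmgr_hook)
		prev_fmgr_hook(event, flinfo, private);
}

void
_PG_init(void)
{
	prev_needs_fmgr_hook = needs_fmgr_hook;
	needs_fmgr_hook = my_needs_fmgr_hook;
	prev_fmgr_hook = fmgr_hook;
	fmgr_hook = my_fmgr_hook;
}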

regards

Pavel


> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-15 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san, Amit,

> > Yes, that would be ideal. But do you know why that is a must?
> 
> I believe a graceful shutdown (fast and smart) of a replication set is
> expected to leave both ends in sync.  Of course, we can change the policy to
> allow walsender to stop before confirming that all WAL has been applied.
> However, walsender has no idea whether the peer is intentionally delaying
> or not.

This mechanism was introduced by 985bd7 [1], which was needed to support a
"clean" switchover. I think it is needed for physical replication, but it is
not clear that it is for the logical case.

When the postmaster is stopped in fast or smart mode, we expect that all
modifications have been received by the secondary. This requirement seems
unchanged since the initial commit.

Before 985bd7, the walsender exited just after sending the final WAL, which
meant that sometimes the last packet could not reach the secondary. So there
was a possibility of failing to restart the old primary as a new secondary,
because the new primary did not have the last WAL record. To avoid this, the
walsender started waiting for the flush before exiting.

But in the case of logical replication, I'm not sure whether this limitation
is really needed. I think it may be acceptable for the walsender to exit
without waiting when applies are being delayed, because we don't have to
consider the above issue for logical replication.

[1]: 
https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459


Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: pg_upgrade: Make testing different transfer modes easier

2022-12-15 Thread Daniel Gustafsson
> On 15 Dec 2022, at 01:56, Kyotaro Horiguchi  wrote:
> 
> At Wed, 14 Dec 2022 10:40:45 +0100, Daniel Gustafsson wrote in
>>> On 14 Dec 2022, at 08:04, Peter Eisentraut 
>>>  wrote:
>>> 
>>> On 07.12.22 17:33, Peter Eisentraut wrote:
 I think if we want to make this configurable on the fly, an environment
 variable would be much easier, like
my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
>>> 
>>> Here is an updated patch set that incorporates this idea.
> 
> We have already PG_TEST_EXTRA. Shouldn't we use it here as well?

I think those are two different things.  PG_TEST_EXTRA adds test suites that
aren't run by default; this proposal is about being able to inject options
into a test suite to modify its behavior.

--
Daniel Gustafsson   https://vmware.com/





Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:03:12 +0100, Pavel Stehule wrote in
> I found a solution based on using the fmgr hook:
> 
> https://github.com/okbob/plpgsql_check/commit/9a17e97354a48913de5219048ee3be6f8460bae9

Oh! Thanks for the pointer, will look into that.

By the way, it seems to me that the tool is using
RegisterResourceReleaseCallback to reset the function nest level. But
since there's a case where the mechanism doesn't work, I suspect that
the callback can be missed in some error-return paths, which would be
a bug if true.

# I haven't confirmed that behavior by myself, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Ability to reference other extensions by schema in extension scripts

2022-12-15 Thread Sandro Santilli
On Tue, Nov 22, 2022 at 11:24:19PM -0500, Regina Obe wrote:

> Here is first version of my patch using the @extschema:extensionname@ syntax
> you proposed.
> 
> This patch includes:
> 1) Changes to replace references of @extschema:extensionname@ with the
> schema of the required extension
> 2) Documentation for the feature
> 3) Tests for the feature.
> 
> There is one issue I thought about that is not addressed by this.
> 
> If an extension is required by another extension and that required extension
> schema is referenced in the extension scripts using the
> @extschema:extensionname@ syntax, then ideally we should prevent the
> required extension from being relocatable.  This would prevent a user from
> accidentally moving the required extension, thus breaking the dependent
> extensions.
> 
> I didn't add that feature because I wasn't sure if it was overstepping the
> bounds of what should be done, or if we leave it up to the user to just know
> better.

An alternative would be to forbid using @extschema:extensionname@ to
reference relocatable extensions. The DBA can toggle the relocatability of
an extension to allow it to be referenced.

--strk;




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Pavel Luzanov

On 15.12.2022 03:18, Jeff Davis wrote:

Right, that's what I had in mind: a user is only granted operations on
the partitioned table, not the partitions.


It's all clear now.


There's definitely a problem with this patch and partitioning, because
REINDEX affects the partitions, CLUSTER is a no-op, and VACUUM/ANALYZE
skip them.


I think the approach that Nathan implemented [1] for TOAST tables
in the latest version can be used for partitioned tables as well:
skip the privilege check for partitions while working with
a partitioned table. In that case we would get exactly the same behavior
as for the INSERT, SELECT, etc. privileges - the MAINTAIN privilege would
work for the whole partitioned table, but not for individual partitions.

[1] 
https://www.postgresql.org/message-id/20221215002705.GA889413%40nathanxps13


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





Re: static assert cleanup

2022-12-15 Thread Peter Eisentraut

On 09.12.22 11:01, John Naylor wrote:


On Fri, Dec 9, 2022 at 2:47 PM Peter Eisentraut wrote:

 >
 > 0003-Move-some-static-assertions-to-better-places.patch
 >
 > This moves some that I thought were suboptimally placed but it could be
 > debated or refined.

+ * We really want ItemPointerData to be exactly 6 bytes.  This is rather a
+ * random place to check, but there is no better place.

Since the assert is no longer in a random function body, it seems we can 
remove the second sentence.
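
For illustration, a file-scope assertion of this kind would look roughly like
the following (a sketch; the exact committed form may differ):

StaticAssertDecl(sizeof(ItemPointerData) == 6,
				 "ItemPointerData is expected to be exactly 6 bytes");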


Committed with the discussed adjustments.





Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Pavel Luzanov

On 15.12.2022 03:27, Nathan Bossart wrote:

Another option I'm looking at is skipping the privilege checks when VACUUM
recurses to a TOAST table.  This won't allow you to VACUUM the TOAST table
directly, but it would at least address the originally-reported issue


This approach can be implemented for partitioned tables too: skipping
the privilege checks when VACUUM/ANALYZE recurses to partitions.


I don't know if this is good enough.


At least it's better than before.


It seems like ideally you should be
able to VACUUM a TOAST table directly if you have MAINTAIN on its main
relation.


I agree, that would be ideal.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





Re: Inconsistency in reporting checkpointer stats

2022-12-15 Thread Nitin Jadhav
> Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
> buffer writes in SlruInternalWritePage(). However, does it need to be
> done immediately there? The stats will not be visible to the users
> until the next pgstat_report_checkpointer(). Incrementing
> buf_written_checkpoints in BufferSync() makes sense as the
> pgstat_report_checkpointer() gets called in there via
> CheckpointWriteDelay() and it becomes visible to the user immediately.
> Have you checked if pgstat_report_checkpointer() gets called while the
> checkpoint calls SlruInternalWritePage()? If not, then you can just
> assign ckpt_bufs_written to buf_written_checkpoints in
> LogCheckpointEnd() like its other friends
> checkpoint_write_time and checkpoint_sync_time.

In the case of an immediate checkpoint, CheckpointWriteDelay() never
gets called until the checkpoint is completed, so there are no issues
there. CheckpointWriteDelay() comes into the picture for non-immediate
checkpoints (i.e. when the checkpoint timeout or max WAL size is
reached).

If we remove the increments in BufferSync() and SlruInternalWritePage()
and just assign ckpt_bufs_written to buf_written_checkpoints in
LogCheckpointEnd(), then the update only becomes available after the
end of each checkpoint, which is no better than the existing behaviour
(without the patch). If we keep the increment in BufferSync(), then we
have to calculate the remaining buffers counted in
SlruInternalWritePage() and increment buf_written_checkpoints by that
number in LogCheckpointEnd(). That just complicates things, and the
buffers counted in SlruInternalWritePage() would still only become
visible at the end of the checkpoint.

In the case of checkpoint_write_time and checkpoint_sync_time,
assigning at the end makes sense because that information covers the
entire checkpoint operation. So I feel the patch handles it in a
better way: even though pgstat_report_checkpointer() does not get
called immediately, it will be called during the next increment in
BufferSync(), which is before the end of the checkpoint. Please share
if you have any other ideas.

On Wed, Dec 14, 2022 at 4:55 PM Bharath Rupireddy
 wrote:
>
> On Wed, Dec 14, 2022 at 1:02 PM Nitin Jadhav
>  wrote:
> >
> > Hi,
> >
> > While working on checkpoint related stuff, I have encountered that
> > there is some inconsistency while reporting checkpointer stats. When a
> > checkpoint gets completed, a checkpoint complete message gets logged.
> > This message has a lot of information including the buffers written
> > (CheckpointStats.ckpt_bufs_written). This variable gets incremented in
> > 2 cases. First is in BufferSync() and the second is in
> > SlruInternalWritePage(). On the other hand the checkpointer stats
> > exposed using pg_stat_bgwriter contains a lot of information including
> > buffers written (PendingCheckpointerStats.buf_written_checkpoints).
> > This variable gets incremented in only one place and that is in
> > BufferSync(). So there is inconsistent behaviour between these two
> > data. Please refer to the sample output below.
> >
> > In order to fix this, the
> > PendingCheckpointerStats.buf_written_checkpoints should be incremented
> > in SlruInternalWritePage() similar to
> > CheckpointStats.ckpt_bufs_written. I have attached the patch for the
> > same. Please share your thoughts.
>
> Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
> buffer writes in SlruInternalWritePage(). However, does it need to be
> done immediately there? The stats will not be visible to the users
> until the next pgstat_report_checkpointer(). Incrementing
> buf_written_checkpoints in BufferSync() makes sense as the
> pgstat_report_checkpointer() gets called in there via
> CheckpointWriteDelay() and it becomes visible to the user immediately.
> Have you checked if pgstat_report_checkpointer() gets called while the
> checkpoint calls SlruInternalWritePage()? If not, then you can just
> assign ckpt_bufs_written to buf_written_checkpoints in
> LogCheckpointEnd() like its other friends
> checkpoint_write_time and checkpoint_sync_time.
>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com




Re: Inconsistency in reporting checkpointer stats

2022-12-15 Thread Nitin Jadhav
> /* If part of a checkpoint, count this as a buffer written. */
> if (fdata)
>     CheckpointStats.ckpt_bufs_written++;
> +   PendingCheckpointerStats.buf_written_checkpoints++;
> Also, the proposed patch would touch PendingCheckpointerStats even
> when there is no fdata, aka outside the context of a checkpoint or
> shutdown sequence.

Sorry. I missed adding braces. Fixed in the v2 patch attached.
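
For reference, with the braces the hunk presumably becomes:

	/* If part of a checkpoint, count this as a buffer written. */
	if (fdata)
	{
		CheckpointStats.ckpt_bufs_written++;
		PendingCheckpointerStats.buf_written_checkpoints++;
	}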

Thanks & Regards,
Nitin Jadhav

On Thu, Dec 15, 2022 at 3:16 AM Michael Paquier  wrote:
>
> On Wed, Dec 14, 2022 at 04:54:53PM +0530, Bharath Rupireddy wrote:
> > Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
> > buffer writes in SlruInternalWritePage(). However, does it need to be
> > done immediately there? The stats will not be visible to the users
> > until the next pgstat_report_checkpointer(). Incrementing
> > buf_written_checkpoints in BufferSync() makes sense as the
> > pgstat_report_checkpointer() gets called in there via
> > CheckpointWriteDelay() and it becomes visible to the user immediately.
> > Have you checked if pgstat_report_checkpointer() gets called while the
> > checkpoint calls SlruInternalWritePage()? If not, then you can just
> > assign ckpt_bufs_written to buf_written_checkpoints in
> > LogCheckpointEnd() like its other friends
> > checkpoint_write_time and checkpoint_sync_time.
>
> /* If part of a checkpoint, count this as a buffer written. */
> if (fdata)
>     CheckpointStats.ckpt_bufs_written++;
> +   PendingCheckpointerStats.buf_written_checkpoints++;
> Also, the proposed patch would touch PendingCheckpointerStats even
> when there is no fdata, aka outside the context of a checkpoint or
> shutdown sequence.
> --
> Michael


v2-0001-Fix-inconsistency-in-checkpointer-stats.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2022-12-15 Thread Drouvot, Bertrand

Hi,

On 12/14/22 6:48 PM, Robert Haas wrote:

On Wed, Dec 14, 2022 at 12:35 PM Andres Freund  wrote:

  typedef struct xl_heap_prune

I think this is unsafe on alignment-picky machines. I think it will
cause the offset numbers to be aligned at an odd address.
heap_xlog_prune() doesn't copy the data into aligned memory, so I
think this will result in a misaligned pointer being passed down to
heap_page_prune_execute.


I think the offset numbers are stored separately from the record, even
though it doesn't quite look like that in the above due to the way the
'OFFSET NUMBERS' is embedded in the struct. As they're stored with
block reference 0, the added boolean shouldn't make a difference
alignment-wise?

Or am I misunderstanding your point?


Oh, you're right. So this is another case similar to
xl_btree_reuse_page. In heap_xlog_prune(), we access the offset number
data like this:

    redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
    end = (OffsetNumber *) ((char *) redirected + datalen);
    nowdead = redirected + (nredirected * 2);
    nowunused = nowdead + ndead;
    nunused = (end - nowunused);
    heap_page_prune_execute(buffer,
                            redirected, nredirected,
                            nowdead, ndead,
                            nowunused, nunused);

This is only safe if the return value of XLogRecGetBlockData is
guaranteed to be properly aligned,


Why, could you please elaborate?

It looks to me like here we are "just" accessing the
members of the xl_heap_prune struct to get the numbers.

Then, the actual data will be read later in heap_page_prune_execute() from
the buffer/page based on the numbers we got from xl_heap_prune.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2022-12-15 Thread Drouvot, Bertrand

Hi,

On 12/14/22 4:55 PM, Robert Haas wrote:
On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand wrote:

Other comments:


+    if RelationIsAccessibleInLogicalDecoding(rel)
+        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;

This is a few parentheses short of where it should be. Hilariously it
still compiles because there are parentheses in the macro definition.


Oops, thanks will fix.



+xlrec.onCatalogAccessibleInLogicalDecoding =
RelationIsAccessibleInLogicalDecoding(relation);

These lines are quite long.  I think we should consider (1) picking a
shorter name for the xlrec field and, if such lines are going to
still routinely exceed 80 characters, (2) splitting them into two
lines, with the second one indented to match pgindent's preferences in
such cases, which I think is something like this:

xlrec.onCatalogAccessibleInLogicalDecoding =
    RelationIsAccessibleInLogicalDecoding(relation);

As far as renaming, I think we could at least remove onCatalog part
from the identifier, as that doesn't seem to be adding much. And maybe
we could even think of changing it to something like
logicalDecodingConflict or even decodingConflict, which would shave
off a bunch more characters.



I'm not sure I like the decodingConflict proposal. Indeed, it might be that
there is no conflict (depending on the xid comparison).

What about "checkForConflict"?



+    if (heapRelation->rd_options)
+        isusercatalog = ((StdRdOptions *) (heapRelation)->rd_options)->user_catalog_table;

Couldn't you get rid of the if statement here and also the
initialization at the top of the function and just write isusercatalog
= RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
the variable entirely and pass
RelationIsUsedAsCatalogTable(heapRelation) as the argument to
UpdateIndexRelation directly?



Yeah, that's better, will do, thanks!
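
That is, roughly (a sketch - the macro itself already guards against a NULL
rd_options):

-	if (heapRelation->rd_options)
-		isusercatalog = ((StdRdOptions *) (heapRelation)->rd_options)->user_catalog_table;
+	isusercatalog = RelationIsUsedAsCatalogTable(heapRelation);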

While at it, I'm not sure that isusercatalog should be visible in pg_index.
I mean, this information could be retrieved with a join on pg_class (on the
table the index is linked to), hence the weirdness of having it visible.
I did not check how difficult it would be to make it "invisible", though.
What do you think?


I think this could use some test cases demonstrating that
indisusercatalog gets set correctly in all the relevant cases: table
is created with user_catalog_table = true/false, reloption is changed,
reloptions are reset, new index is added later, etc.



v31 already provides a few checks:

- After index creation on relation with user_catalog_table = true
- Propagation is done correctly after a user_catalog_table RESET
- Propagation is done correctly after an ALTER SET user_catalog_table = true
- Propagation is done correctly after an ALTER SET user_catalog_table = false

In v32, I can add a check for index creation after each of the last 3 cases
mentioned above, and one for when a table is created with
user_catalog_table = false.

Having said that, we would need a function to retrieve the isusercatalog value 
should we make it invisible.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Pavel Stehule
On Thu, Dec 15, 2022 at 9:34 AM Kyotaro Horiguchi wrote:

> At Thu, 15 Dec 2022 09:03:12 +0100, Pavel Stehule wrote in
> > I found a solution based on using the fmgr hook:
> >
> >
> https://github.com/okbob/plpgsql_check/commit/9a17e97354a48913de5219048ee3be6f8460bae9
>
> Oh! Thanks for the pointer, will look into that.
>
> By the way, it seems to me that the tool is using
> RegisterResourceReleaseCallback to reset the function nest level. But
> since there's a case where the mechanism doesn't work, I suspect that
> the callback can be missed in some error-return paths, which would be
> a bug if true.
>
> # I haven't confirmed that behavior by myself, though.
>

It should be executed:

/*
 * Register or deregister callback functions for resource cleanup
 *
 * These functions are intended for use by dynamically loaded modules.
 * For built-in modules we generally just hardwire the appropriate calls.
 *
 * Note that the callback occurs post-commit or post-abort, so the callback
 * functions can only do noncritical cleanup.
 */
void
RegisterResourceReleaseCallback(ResourceReleaseCallback callback, void *arg)
{

But it is based on the resource owner, so the timing can be different from
what you expect.
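
For illustration, a registration along these lines (a minimal sketch;
plugin_nest_level and the function names are hypothetical):

#include "postgres.h"
#include "fmgr.h"
#include "utils/resowner.h"

PG_MODULE_MAGIC;

static int	plugin_nest_level = 0;	/* hypothetical nesting counter */

/* Runs post-commit or post-abort, so only noncritical cleanup here. */
static void
plugin_resource_release(ResourceReleasePhase phase,
						bool isCommit, bool isTopLevel, void *arg)
{
	if (phase == RESOURCE_RELEASE_AFTER_LOCKS && isTopLevel)
		plugin_nest_level = 0;
}

void
_PG_init(void)
{
	RegisterResourceReleaseCallback(plugin_resource_release, NULL);
}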

Regards

Pavel


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: Raising the SCRAM iteration count

2022-12-15 Thread Daniel Gustafsson
> On 14 Dec 2022, at 19:59, Jonathan S. Katz  wrote:
> On 12/14/22 6:25 AM, Daniel Gustafsson wrote:
>>> On 14 Dec 2022, at 02:00, Michael Paquier  wrote:

>>> So, you mean that the GUC should be named like password_iterations,
>>> taking a grammar with a list like 'scram-sha-256=4096,algo2=5000'?
>> I was thinking about it but opted for the simpler approach of a GUC name with
>> the algorithm baked into it: scram_sha256_iterations.  It doesn't seem all 
>> that
>> likely that we'll have more than two versions of SCRAM (sha256/sha512) so
>> the additional complexity doesn't seem worth it.
> 
> I would not rule this out. There is a RFC draft for SCRAM-SHA3-512[1].

Note that this draft is very far from RFC status; it has already expired twice
and hasn't been updated for a year.  The SCRAM-SHA-512 draft has an almost
identical history, and neither is assigned a working group.  The author is
also drafting scram-bis, which sets up more context around these proposals;
it has yet to expire but is also very early.  The work on SCRAM-2FA seems the
most promising right now.

There might be additional versions of SCRAM published but it's looking pretty
distant now.

> Reviewing patch as is.

Thanks for review! Fixes coming downthread in an updated version.

> ==snip==
> The number of computational iterations to perform when generating
> a SCRAM-SHA-256 secret. The default is 4096. A
> higher number of iterations provides additional protection against
> brute-force attacks on stored passwords, but makes authentication
> slower. Changing the value has no effect on previously created
> SCRAM-SHA-256 secrets as the iteration count at the time of creation
> is fixed. A password must be re-hashed to use an updated iteration
> value.
> ==snip==

I've rewritten it to a version of this.  We don't use the terminology "SCRAM
secret" anywhere else, so I used password instead.

> /*
> - * Default number of iterations when generating secret.  Should be at least
> - * 4096 per RFC 7677.
> + * Default number of iterations when generating secret.
>  */
> 
> I don't think we should remove the RFC 7677 reference entirely.

Fixed.

> -pg_fe_scram_build_secret(const char *password, const char **errstr)
> +pg_fe_scram_build_secret(const char *password, int iterations, const char 
> **errstr)
> 
> I have a mild worry about changing this function definition for downstream
> usage, esp. for drivers. Perhaps it's not that big of a deal, and perhaps
> this will end up being needed for the work we've discussed around
> "\password", but I do want to note that this could be a breaking change.

Not sure driver authors should be relying on this function.  Code scans
don't turn up any public consumers of it right now, at least.  If we want to
support multiple SCRAM versions we'd still need to change it, though, as
noted downthread.

> + else if (strcmp(name, "scram_sha256_iterations") == 0)
> + {
> + conn->scram_iterations = atoi(value);
> + }
> 
> Maybe out of scope for this patch based on what else is in the patch, but I 
> was wondering why we don't use a "strncmp" here?

strncmp() would allow scram_sha256_iterations_foo to match, which we don't
want; we want an exact match.
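
To illustrate the prefix-match hazard (a standalone toy, not libpq code):

#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *name = "scram_sha256_iterations_foo";
	const char *opt = "scram_sha256_iterations";

	/* strncmp bounded by strlen(opt) wrongly accepts the longer name... */
	printf("strncmp: %d\n", strncmp(name, opt, strlen(opt)));	/* 0 */
	/* ...while strcmp demands an exact match and rejects it. */
	printf("strcmp: %d\n", strcmp(name, opt));	/* nonzero */
	return 0;
}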

--
Daniel Gustafsson   https://vmware.com/





Re: Raising the SCRAM iteration count

2022-12-15 Thread Daniel Gustafsson
> On 15 Dec 2022, at 00:52, Michael Paquier  wrote:

>conn->in_hot_standby = PG_BOOL_UNKNOWN;
> +   conn->scram_iterations = SCRAM_DEFAULT_ITERATIONS;
> 
> s/SCRAM_DEFAULT_ITERATIONS/SCRAM_SHA_256_DEFAULT_ITERATIONS/ and
> s/scram_iterations/scram_sha_256_iterations/ perhaps?

Distinct members in the conn object are only of interest if there is a way
for the user to select a different password method in \password, right?  I
can rename it now, but I think doing too much here is premature; awaiting
work on \password (should that materialize) seems reasonable, no?

> +#ifndef FRONTEND
> +/*
> + * Number of iterations when generating new secrets.
> + */
> +extern PGDLLIMPORT int scram_sha256_iterations;
> +#endif
> 
> It looks like libpq/scram.h, which is backend-only, would be a better
> location.

Fixed.

> @@ -692,7 +697,7 @@ mock_scram_secret(const char *username, int *iterations, 
> char **salt,
>encoded_salt[encoded_len] = '\0';
> 
>*salt = encoded_salt;
> -   *iterations = SCRAM_DEFAULT_ITERATIONS;
> +   *iterations = scram_sha256_iterations;
> 
> This looks incorrect to me?  The mock authentication is here to
> produce a realistic verifier, still it will fail.  It seems to me that
> we'd better stick to the default in all the cases.

For avoiding revealing anything, I think a case can be argued for both.  I've
reverted back to the default, though.

I also renamed the GUC to use sha_256, to match the terminology we use.

--
Daniel Gustafsson   https://vmware.com/



v3-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-12-15 Thread David Rowley
On Tue, 13 Dec 2022 at 20:53, David Rowley  wrote:
> I think what we need to do is:  Do our best to give incremental sort
> the most realistic costs we can and accept that it might choose a
> worse plan in some cases. Users can turn it off if they really have no
> other means to convince the planner it's wrong.
>
> Additionally, I think what we also need to add a GUC such as
> enable_presorted_aggregate.  People can use that when their Index Scan
> -> Incremental Sort -> Aggregate plan is worse than their previous Seq
> Scan -> Sort -> Aggregate plan that they were getting in < 16.
> Turning off enable_incremental_sort alone won't give them the same
> aggregate plan that they had in pg15 as we always set the
> query_pathkeys to request a sort order that will suit the order by /
> distinct aggregates.
>
> I'll draft up a patch for the enable_presorted_aggregate.

I've attached a patch series for this.

v3-0001 can be ignored here. I've posted about that in [1]; any
discussion about that patch should take place over there.  The patch
is required to get the 0002 patch to pass the isolation check.
v3-0002 removes the 1.5x cost pessimism from incremental sort and
also rewrites how we make incremental sort paths.  I've now gone
through the remaining places where we create an incremental sort path
to give all those the same treatment that I'd added to
add_paths_to_grouping_rel(). There were 1 or 2 plan changes in the
regression tests.  One was the isolation test change, which I claim is
a broken test that should be fixed another way.  The other was
performing a Sort on the cheapest input path which had presorted keys;
that plan now uses an Incremental Sort to make use of the presorted
keys. I'm happy to see just how much redundant code this removes -
about 200 lines.

v3-0003 adds the enable_presorted_aggregate GUC.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrbDhObhLV+=u_k_-t+2av2av1al9d+2j_3ao-xnda...@mail.gmail.com
From ed0e0bd525ea16ea724e9794b5f4feddf29da497 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Thu, 15 Dec 2022 17:49:40 +1300
Subject: [PATCH v3 1/3] Re-adjust drop-index-concurrently-1 isolation test

It seems that drop-index-concurrently has started to forget what it was
originally meant to be testing.  d2d8a229b, which added incremental sorts,
changed the expected plan to be an Index Scan plan instead of a Seq Scan
plan.  This occurred because the primary key index of the table in question
provided presorted input and, because that index happened to be the
cheapest input path due to enable_seqscan being disabled, the incremental
sort changes just added a Sort on top of that.  It seems, based on the name
of the PREPAREd statement, that the intention here is that the query produce
a seqscan plan.

The reason this test has become broken seems to be due to how the test was
originally coded. It tried to force a seqscan plan by performing some
casting so that the test_dc index couldn't be used to perform the required
filtering.  This likely wasn't the best design.  It seems much better just
to toggle enable_seqscan to allow the planner to produce a more
predictable plan.  Trying to coax the planner into using a plan which has
costed in a disable_cost seems like it's always going to be flaky.  So
let's get rid of that and put the expected plans closer to what they were
originally when the test was added.

Additionally, rename a few things in the test and add some additional
wording to the comments to try and make it more clear in the future what
we expect this test to be doing.
---
 .../expected/drop-index-concurrently-1.out| 31 ++-
 .../expected/drop-index-concurrently-1_2.out  | 31 ++-
 .../specs/drop-index-concurrently-1.spec  | 22 +++--
 3 files changed, 44 insertions(+), 40 deletions(-)

diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out 
b/src/test/isolation/expected/drop-index-concurrently-1.out
index 97e1a6e779..1cb2250891 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -1,17 +1,17 @@
 Parsed test spec with 3 sessions
 
-starting permutation: noseq chkiso prepi preps begin explaini explains select2 drop insert2 end2 selecti selects end
-step noseq: SET enable_seqscan = false;
+starting permutation: chkiso prepi preps begin disableseq explaini enableseq explains select2 drop insert2 end2 selecti selects end
 step chkiso: SELECT (setting in ('read committed','read uncommitted')) AS is_read_committed FROM pg_settings WHERE name = 'default_transaction_isolation';
 is_read_committed
 -
 t
 (1 row)
 
-step prepi: PREPARE getrow_idx AS SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
-step preps: PREPARE getrow_seq AS SELECT * FROM test_dc WHERE data::text=34::text ORDER BY id,data;
+step prepi: PREPARE getrow_idxscan AS SELECT * FROM test_dc WHERE data = 34 ORDER BY id,data;
+step preps: PREPARE get

cirrus scripts could use make -k

2022-12-15 Thread Peter Eisentraut
I would find it useful if the cirrus scripts used make -k (and ninja -k) 
to keep building everything in the presence of errors.  For example, I 
once had some issue in code that caused a bunch of compiler warnings on 
Windows.  The cirrus build would only show me the first one, then I had 
to fix, reupload, see the next one, etc.  Are there any drawbacks to 
using these options in this context?





Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Masahiko Sawada
On Thu, Dec 15, 2022 at 4:53 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule wrote in
> > On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada wrote:
> > > Is this a bug in plpgsql?
> > >
> >
> > I think it is by design.  There is no callback that is called after an
> > exception.
> >
> > Still, some callbacks on statement errors and function errors would
> > be nice. They could help me implement profilers or tracers more simply
> > and more robustly.
> >
> > But I am not sure about performance impacts. This is on a critical path.
>
> I didn't search for it, but I guess none of the end-side callbacks of
> begin/end-style callback pairs are called on exception. An additional
> PG_TRY level wouldn't be acceptable for performance reasons.

I don't think we need additional PG_TRY() for that since exec_stmts()
is already called in PG_TRY() if there is an exception block. I meant
to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e. only when
an error is caught by the exception block). Currently, if an error is
caught, we call stmt_begin() and stmt_end() for statements executed
inside the exception block but call only stmt_begin() for the
statement that raised an error.
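
A rough sketch of that idea in exec_stmt_block()'s exception handler
(illustrative only, not a tested patch; estate->err_stmt is the statement
that was executing when the error was raised):

	PG_CATCH();
	{
		/* ... existing error-data capture ... */

		/*
		 * Hypothetical: report stmt_end for the statement that raised the
		 * error, which today only ever gets a stmt_beg callback.
		 */
		if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_end)
			((*plpgsql_plugin_ptr)->stmt_end) (estate, estate->err_stmt);

		/* ... existing rollback and exception matching ... */
	}
	PG_END_TRY();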

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-12-15 Thread Melih Mutlu
Hi,

Attached new versions of the patch with some changes/fixes.

Here are also some numbers comparing the performance of logical replication
with this patch against the current master branch.

My method of benchmarking is the same as what I did earlier in this
thread (but on a different environment, so don't compare the results from
this email with the ones from earlier emails).

> With those changes, I did some benchmarking to see if it improves anything.
> This results compares this patch with the latest version of master branch.
> "max_sync_workers_per_subscription" is set to 2 as default.
> Got some results simply averaging timings from 5 consecutive runs for each
> branch.


Since this patch is expected to improve logical replication of
empty/close-to-empty tables, I started by measuring performance with empty
tables.

        |  10 tables  |  100 tables    |  1000 tables
--------+-------------+----------------+----------------
 master |  283.430 ms |  22739.107 ms  |  105226.177 ms
 patch  |  189.139 ms |   1554.802 ms  |   23091.434 ms

> After the changes discussed here [1], concurrent replication origin drops
> by the apply worker and tablesync workers may block each other due to
> locks taken by replorigin_drop_by_name.
> I see that this harms the performance of logical replication quite a bit
> in terms of speed.
> [1]
> https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com


Firstly, as I mentioned, replication origin drops made things worse for the
master branch. Locks become a more serious issue as the number of tables
increases. The patch reuses origins, so it does not need to drop them in
each iteration. That's why the difference between master and the patch is
more significant now than it was when I first sent the patch.

To show that the improvement is not only the result of the reuse of
origins, but also of the reuse of replication slots and workers, I reverted
the commits which cause the origin drop issue.

          |  10 tables  |  100 tables   |  1000 tables
----------+-------------+---------------+---------------
 reverted |  270.012 ms |  2483.907 ms  |  31660.758 ms
 patch    |  189.139 ms |  1554.802 ms  |  23091.434 ms

With this patch, logical replication is still faster, even if we didn't
have the issue with replication origin drops.

Also, here are some numbers with 10 tables loaded with some data:

        |    10 MB     |    100 MB
--------+--------------+---------------
 master |  2868.524 ms |  14281.711 ms
 patch  |  1750.226 ms |  14592.800 ms

The difference between master and the patch narrows as the size of the
tables increases, as expected.


I would appreciate any feedback/thoughts on the approach/patch/numbers, etc.

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v5-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-15 Thread Amit Kapila
On Thu, Dec 15, 2022 at 8:58 AM houzj.f...@fujitsu.com
 wrote:
>

A few minor comments:
=
1.
+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+ if (xid_tmp == subxid)
+ {
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);

I find that there is always one extra element in the list after
rollback to savepoint. Don't we need to truncate the list to 'i' as
shown in the diff below?

2.
* Note that If it's an empty sub-transaction then we will not find
* the subxid here.

The "If" in the above comment is in the wrong case. Anyway, I have slightly
modified it, as you can see in the diff below.

$ git diff
diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 11695c75fa..c809b1fd01 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1516,8 +1516,8 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
 * Search the subxactlist, determine the offset tracked for the
 * subxact, and truncate the list.
 *
-	 * Note that If it's an empty sub-transaction then we will not find
-	 * the subxid here.
+	 * Note that for an empty sub-transaction we won't find the subxid
+	 * here.
 */
for (i = list_length(subxactlist) - 1; i >= 0; i--)
{
@@ -1527,7 +1527,7 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
{
RollbackToSavepoint(spname);
CommitTransactionCommand();
-   subxactlist = list_truncate(subxactlist, i + 1);
+   subxactlist = list_truncate(subxactlist, i);
break;
}
}


-- 
With Regards,
Amit Kapila.




RE: Ability to reference other extensions by schema in extension scripts

2022-12-15 Thread Regina Obe
> On Tue, Nov 22, 2022 at 11:24:19PM -0500, Regina Obe wrote:
> 
> > Here is first version of my patch using the @extschema:extensionname@
> > syntax you proposed.
> >
> > This patch includes:
> > 1) Changes to replace references of @extschema:extensionname@ with the
> > schema of the required extension
> > 2) Documentation for the feature
> > 3) Tests for the feature.
> >
> > There is one issue I thought about that is not addressed by this.
> >
> > If an extension is required by another extension and that required
> > extension schema is referenced in the extension scripts using the
> > @extschema:extensionname@ syntax, then ideally we should prevent the
> > required extension from being relocatable.  This would prevent a user
> > from accidentally moving the required extension, thus breaking the
> > dependent extensions.
> >
> > I didn't add that feature because I wasn't sure if it was overstepping
> > the bounds of what should be done, or if we leave it up to the user to
> > just know better.
> 
> An alternative would be to forbid using @extschema:extensionname@ to
> reference relocatable extensions. The DBA can toggle the relocatability of
> an extension to allow it to be referenced.
> 
> --strk;
That would be hard to do in a DBaaS setup, and not many users know they can
fiddle with extension control files.
Plus, those would get overwritten with upgrades.

In my case, for example, I have postgis_tiger_geocoder, which relies on both
postgis and fuzzystrmatch.
I'd rather not have to explain to users how to fiddle with the
fuzzystrmatch.control file to make it not relocatable.

But I don't think anyone would mind if it's forced after install, because
it's a rare thing for people to be moving extensions to different schemas
after install.

Thanks,
Regina





Re: Temporary tables versus wraparound... again

2022-12-15 Thread Robert Haas
On Wed, Dec 14, 2022 at 4:44 PM Greg Stark  wrote:
> > You do have to lock a table in order to update its pg_class row,
> > though, whether the table is temporary or not. Otherwise, another
> > session could drop it while you're doing something with it, after
> > which bad things would happen.
>
> I was responding to this from Andres:
>
> > Is that actually true? Don't we skip some locking operations for temporary
> > tables, which then also means catalog modifications cannot safely be done in
> > other sessions?
>
> I don't actually see this in the code ...

Yes, I think Andres may be wrong in this case.

(Dang, I don't get to say that very often.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Pavel Stehule
On Thu, Dec 15, 2022 at 12:51 PM Masahiko Sawada wrote:

> On Thu, Dec 15, 2022 at 4:53 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule
> > <pavel.steh...@gmail.com> wrote in
> > > On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada
> > > <sawada.m...@gmail.com> wrote:
> > > > Is this a bug in plpgsql?
> > > >
> > >
> > > I think it is by design.  There is no callback that is called
> > > after an exception.
> > >
> > > Still, some callbacks on statement errors and function errors
> > > would be nice. They could help me implement profilers or tracers
> > > more simply and more robustly.
> > >
> > > But I am not sure about performance impacts. This is on a critical
> > > path.
> >
> > I didn't search for it, but I guess none of the end-side callbacks of
> > begin/end-style callback pairs are called on exception. An additional
> > PG_TRY level wouldn't be acceptable for performance reasons.
>
> I don't think we need additional PG_TRY() for that since exec_stmts()
> is already called in PG_TRY() if there is an exception block. I meant
> to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e. only when
> an error is caught by the exception block). Currently, if an error is
> caught, we call stmt_begin() and stmt_end() for statements executed
> inside the exception block but call only stmt_begin() for the
> statement that raised an error.
>

PG_TRY is used only for STMT_BLOCK; other statements don't use PG_TRY.

I have no idea about possible performance impacts; I never tested it.
Personally, I like the possibility of having some error callback function.
Maybe PG_TRY can be used only when this callback is set, so there would
not be any performance impact without an extension that uses it.
Unfortunately, two functions are necessary, since some exceptions can be
raised after the last statement, just before the function ends. Changing the
behaviour of stmt_end or func_end can be problematic, because after an
exception a lot of the internal API is not available, and you need to know
that you are in that situation. Today, everybody knows that at stmt_end the
code is not running after an exception.

But it may not be easy, because there can be multiple chained extensions
that use the dbg API - like PL profiler, PL debugger and plpgsql_check - and
maybe others.

Regards

Pavel



>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>


Re: Error-safe user functions

2022-12-15 Thread Robert Haas
On Wed, Dec 14, 2022 at 6:24 PM Tom Lane  wrote:
> I've gone through these now and revised/pushed most of them.

Tom, I just want to extend huge thanks to you for working on this
infrastructure. jsonpath aside, I think this is going to pay dividends
in many ways for many years to come. It's something that we've needed
for a really long time, and I'm very happy that we're moving forward
with it.

Thanks so much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Matthias van de Meent
On Wed, 14 Dec 2022 at 00:07, Peter Geoghegan  wrote:
>
> On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan  wrote:
> > That's not the only thing we care about, though. And to the extent we
> > care about it, we mostly care about the consequences of either
> > freezing or not freezing eagerly. Concentration of unfrozen pages in
> > one particular table is a lot more of a concern than the same number
> > of heap pages being spread out across multiple tables. Those tables
> > can all be independently vacuumed, and come with their own
> > relfrozenxid, that can be advanced independently, and are very likely
> > to be frozen as part of a vacuum that needed to happen anyway.
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

In the mentioned wiki page, in the section "Simple append-only", the
following is written:

> Our "transition from lazy to eager strategies" concludes with an autovacuum 
> that actually advanced relfrozenxid eagerly:
>> automatic vacuum of table "regression.public.pgbench_history": index scans: 0
>> pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total)
>> [...]
>> frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen
>> [...]
>> WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes

I think that this 'transition from lazy to eager' could benefit from a
limit on how many all_visible blocks each autovacuum iteration can
freeze. This first run of (auto)vacuum after the 8GB threshold shows
up as a significant IO event (both in WAL and relation read/write
traffic), with 50% of the table updated and WAL-logged. I think this
should be limited to some degree, such as only freezing all_visible
blocks up to 10% of the table's blocks in an eager vacuum, so that the
load is spread across a larger time frame and more VACUUM runs.


Kind regards,

Matthias van de Meent.




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Tom Lane
Masahiko Sawada  writes:
> I don't think we need additional PG_TRY() for that since exec_stmts()
> is already called in PG_TRY() if there is an exception block. I meant
> to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e. only when
> an error is caught by the exception block). Currently, if an error is
> caught, we call stmt_begin() and stmt_end() for statements executed
> inside the exception block but call only stmt_begin() for the
> statement that raised an error.

I fail to see anything wrong with that.  We never completed execution
of the statement that raised an error, but calling stmt_end for it
would imply that we did.  I think changing this will break more things
than it fixes, completely independently of whatever cost it would add.

Or in other words: the initial complaint describes a bug in pg_hint_plan,
not one in plpgsql.

regards, tom lane




Re: [PATCH] psql: \dn+ to show size of each schema (and \dA+ for AMs)

2022-12-15 Thread Justin Pryzby
Rebased on c727f511b.

This patch record was "closed for lack of interest", but I think what's
actually needed is committer review of which approach to take.

 - add backend functions but do not modify psql?
 - add to psql slash-plus commands?
 - introduce psql double-plus commands for new options?
 - change pre-existing psql plus commands to only show size with
   double-plus?
 - go back to the original, two-line client-side sum()?

Until then, the patchset is organized with those questions in mind.

-- 
Justin
>From 84316b06d1a6286d30e163c4815c840548d639db Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 13 Jul 2021 21:25:48 -0500
Subject: [PATCH 1/4] Add pg_am_size(), pg_namespace_size() ..

See also: 358a897fa, 528ac10c7
---
 src/backend/utils/adt/dbsize.c  | 131 
 src/include/catalog/pg_proc.dat |  19 +
 2 files changed, 150 insertions(+)

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 141db7c9c1c..2e7e98f5676 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -13,19 +13,25 @@
 
 #include 
 
+#include "access/genam.h"
 #include "access/htup_details.h"
 #include "access/relation.h"
+#include "access/table.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_namespace.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/dbcommands.h"
+#include "commands/defrem.h"
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/numeric.h"
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
@@ -833,6 +839,131 @@ pg_size_bytes(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(result);
 }
 
+/*
+ * Return the sum of size of relations for which the given attribute of
+ * pg_class matches the specified OID value.
+ */
+static int64
+calculate_size_attvalue(int attnum, Oid attval)
+{
+	int64		totalsize = 0;
+	ScanKeyData 	skey;
+	Relation	pg_class;
+	SysScanDesc	scan;
+	HeapTuple	tuple;
+
	ScanKeyInit(&skey, attnum,
			BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(attval));
+
+	pg_class = table_open(RelationRelationId, AccessShareLock);
+	scan = systable_beginscan(pg_class, InvalidOid, false, NULL, 1, &skey);
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Relation rel;
+		Form_pg_class classtuple = (Form_pg_class) GETSTRUCT(tuple);
+
+		rel = try_relation_open(classtuple->oid, AccessShareLock);
+		if (!rel)
+			continue;
+
+		for (ForkNumber forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
+			totalsize += calculate_relation_size(&(rel->rd_locator), rel->rd_backend, forkNum);
+
+		relation_close(rel, AccessShareLock);
+	}
+
+	systable_endscan(scan);
+	table_close(pg_class, AccessShareLock);
+	return totalsize;
+}
+
+/* Compute the size of relations in a schema (namespace) */
+static int64
+calculate_namespace_size(Oid nspOid)
+{
+	/*
+	 * User must be a member of pg_read_all_stats or have CREATE privilege for
+	 * target namespace.
+	 */
+	if (!is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
+	{
+		AclResult aclresult;
+		aclresult = object_aclcheck(NamespaceRelationId, nspOid, GetUserId(), ACL_CREATE);
+		if (aclresult != ACLCHECK_OK)
+			aclcheck_error(aclresult, OBJECT_SCHEMA,
+		   get_namespace_name(nspOid));
+	}
+
+	return calculate_size_attvalue(Anum_pg_class_relnamespace, nspOid);
+}
+
+Datum
+pg_namespace_size_oid(PG_FUNCTION_ARGS)
+{
+	int64		size;
+	Oid			nspOid = PG_GETARG_OID(0);
+
+	size = calculate_namespace_size(nspOid);
+
+	if (size < 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_INT64(size);
+}
+
+Datum
+pg_namespace_size_name(PG_FUNCTION_ARGS)
+{
+	int64		size;
+	Name		nspName = PG_GETARG_NAME(0);
+	Oid			nspOid = get_namespace_oid(NameStr(*nspName), false);
+
+	size = calculate_namespace_size(nspOid);
+
+	if (size < 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_INT64(size);
+}
+
+/* Compute the size of relations using the given access method */
+static int64
+calculate_am_size(Oid amOid)
+{
+	/* XXX acl_check? */
+
+	return calculate_size_attvalue(Anum_pg_class_relam, amOid);
+}
+
+Datum
+pg_am_size_oid(PG_FUNCTION_ARGS)
+{
+	int64		size;
+	Oid			amOid = PG_GETARG_OID(0);
+
+	size = calculate_am_size(amOid);
+
+	if (size < 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_INT64(size);
+}
+
+Datum
+pg_am_size_name(PG_FUNCTION_ARGS)
+{
+	int64		size;
+	Name		amName = PG_GETARG_NAME(0);
+	Oid			amOid = get_am_oid(NameStr(*amName), false);
+
+	size = calculate_am_size(amOid);
+
+	if (size < 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_INT64(size);
+}
+
 /*
  * Get the filenode of a relation
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 719599649a9..cd4342c864f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7331,6 +7331,25 @@
   descr => 'total disk space usage for the spec

Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread David G. Johnston
On Thu, Dec 15, 2022 at 8:49 AM Tom Lane  wrote:

> Masahiko Sawada  writes:
> > I don't think we need additional PG_TRY() for that since exec_stmts()
> > is already called in PG_TRY() if there is an exception block. I meant
> > to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e. only when
> > an error is caught by the exception block). Currently, if an error is
> > caught, we call stmt_begin() and stmt_end() for statements executed
> > inside the exception block but call only stmt_begin() for the
> > statement that raised an error.
>
> I fail to see anything wrong with that.  We never completed execution
> of the statement that raised an error, but calling stmt_end for it
> would imply that we did.  I think changing this will break more things
> than it fixes, completely independently of whatever cost it would add.
>
> Or in other words: the initial complaint describes a bug in pg_hint_plan,
> not one in plpgsql.
>
>
The OP suggests needing something akin to a "finally" callback for
statements.  While that's a fine feature request for plpgsql, its absence
doesn't constitute a bug.

David J.


Re: postgres_fdw: using TABLESAMPLE to collect remote sample

2022-12-15 Thread James Finnerty
This patch looks good to me.  I have two very minor nits.  First, the
inflation of the sample size by 10% is arbitrary, but it doesn't seem
unreasonable or concerning; it just makes me curious whether there are any
known cases that motivated adding this logic.  Secondly, if the earliest
non-deprecated version of PostgreSQL supports sampling, then you could
optionally remove the logic that tests for that, since the affected lines
should be unreachable.

Re: postgres_fdw: using TABLESAMPLE to collect remote sample

2022-12-15 Thread Tom Lane
James Finnerty  writes:
> This patch looks good to me.  I have two very minor nits: The inflation
> of the sample size by 10% is arbitrary but it doesn't seem unreasonable
> or concerning.  It just makes me curious if there are any known cases
> that motivated adding this logic.

I wondered why, too.

> Secondly, if the earliest non-deprecated version of PostgreSQL supports
> sampling, then you could optionally remove the logic that tests for
> that. The affected lines should be unreachable.

We've tried to keep postgres_fdw compatible with quite ancient remote
servers (I think the manual claims back to 8.3, though it's unlikely
anyone's tested that far back lately).  This patch should not move those
goalposts, especially if it's easy not to.

regards, tom lane




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Jeff Davis
On Thu, 2022-12-15 at 12:31 +0300, Pavel Luzanov wrote:
> I think the approach that Nathan implemented [1] for TOAST tables
> in the latest version can be used for partitioned tables as well:
> skipping the privilege check for partitions while working with
> a partitioned table. In that case we would get exactly the same behavior
> as for the INSERT, SELECT, etc. privileges - the MAINTAIN privilege would
> work for the whole partitioned table, but not for individual partitions.

There is some weirdness in 15, too:

   create user foo;
   create user su superuser;
   grant all privileges on schema public to foo;

   \c - foo
   create table p(i int) partition by range (i);
   create index p_idx on p (i);
   create table p0 partition of p for values from (0) to (10);

   \c - su
   create table p1 partition of p for values from (10) to (20);

   \c - foo

   -- possibly weird because on 15 this inserts into p1 (owned by su)
   insert into p values (5), (15);

   -- all these are as expected:
   select * from p; -- returns 5 & 15
   insert into p1 values(16); -- permission denied
   select * from p1; -- permission denied
   
   -- the following commands seem inconsistent to me:
   vacuum p; -- skips p1 with warning
   analyze p; -- skips p1 with warning
   cluster p using p_idx; -- silently skips p1
   reindex table p; -- reindexes p0 and p1 (owned by su)

   -- RLS is also bypassed
   \c - su
   grant select, insert on p1 to foo;
   alter table p1 enable row level security;
   create policy odd on p1 using (i % 2 = 1) with check (i % 2 = 1);
   \c - foo
   insert into p1 values (16); -- RLS error
   insert into p values (16); -- succeeds
   select * from p1; -- returns only 15
   select * from p; -- returns 5, 15, 16
   
The proposal to skip privilege checks for partitions would be
consistent with INSERT, SELECT, and REINDEX, which flow through to the
underlying partitions regardless of permissions/ownership (and even
RLS). It would be a very minor behavior change relative to 15 for this
weird case of superuser-owned partitions, but I doubt anyone would be
relying on that.

+1.

I do have some lingering doubts about whether we should even allow
inconsistent ownership/permissions. But I don't think we need to settle
that question now.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Jeff Davis
On Wed, 2022-12-14 at 16:27 -0800, Nathan Bossart wrote:
> I don't know if this is good enough.  It seems like ideally you
> should be
> able to VACUUM a TOAST table directly if you have MAINTAIN on its
> main
> relation.

Right now, targeting the toast table directly requires the USAGE
privilege on the toast schema, and you have to look up the name first,
right? As it is, that's not a great UI.

How about if we add a VACUUM option like TOAST_ONLY (or combine it with
the PROCESS_TOAST option)? Then, you're always looking at the parent
table first so there's no deadlock, do the permission checks on the
parent, and then expand to the toast table with no check. This can be a
follow-up patch; for now, the idea of skipping the privilege checks
when expanding looks like an improvement.
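
For concreteness, a hypothetical invocation under that proposal (TOAST_ONLY
does not exist today; the spelling is illustrative only):

    -- check privileges on p itself, then vacuum only p's TOAST table
    VACUUM (TOAST_ONLY) p;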


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 6:50 AM Matthias van de Meent
 wrote:
> This first run of (auto)vacuum after the 8GB threshold seems
> to appear as a significant IO event (both in WAL and relation
> read/write traffic) with 50% of the table updated and WAL-logged. I
> think this should be limited to some degree, such as only freeze
> all_visible blocks up to 10% of the table's blocks in eager vacuum, so
> that the load is spread across a larger time frame and more VACUUM
> runs.

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

In v9 of the patch, we switch over to eager freezing when table size
crosses 4GB (since that is the value of the
vacuum_freeze_strategy_threshold GUC). The catch up freezing that you
draw attention to here occurs when table size exceeds 8GB, which is a
separate physical table size threshold that forces eager relfrozenxid
advancement. The second threshold is hard-coded to 2x the first one.

I think that this issue can be addressed by making the second
threshold 4x or even 8x vacuum_freeze_strategy_threshold, not just 2x.
That would mean that we'd have to freeze just as many pages whenever
we did the catch-up freezing -- so no change in the added *absolute*
cost of freezing. But, the *relative* cost would be much lower, simply
because catch-up freezing would take place when the table was much
larger. So it would be a lot less noticeable.
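
To make those numbers concrete, using the v9 values described above:

    vacuum_freeze_strategy_threshold = 4GB  -> switch to eager freezing
    2x threshold (v9, hard-coded)    = 8GB  -> forced relfrozenxid advancement
    4x threshold (proposed)          = 16GB -> same absolute catch-up work,
                                               at a much lower relative cost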

Note that we might never reach the second table size threshold before
we must advance relfrozenxid, in any case. The catch-up freezing might
actually take place because table age created pressure to advance
relfrozenxid. It's useful to have a purely physical/table-size
threshold like this, especially in bulk loading scenarios. But it's
not like table age doesn't have any influence at all, anymore. The
cost model weighs physical units/costs as well as table age, and in
general the most likely trigger for advancing relfrozenxid is usually
some combination of the two, not any one factor on its own [1].

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Justin Pryzby
The patches (003 and 005) are missing a word
should use to decide whether to its eager freezing strategy.

On the wiki, missing a word:
builds on related added




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Nathan Bossart
On Thu, Dec 15, 2022 at 10:42:15AM -0800, Jeff Davis wrote:
> Right now, targeting the toast table directly requires the USAGE
> privilege on the toast schema, and you have to look up the name first,
> right? As it is, that's not a great UI.
> 
> How about if we add a VACUUM option like TOAST_ONLY (or combine it with
> the PROCESS_TOAST option)? Then, you're always looking at the parent
> table first so there's no deadlock, do the permission checks on the
> parent, and then expand to the toast table with no check. This can be a
> follow-up patch; for now, the idea of skipping the privilege checks
> when expanding looks like an improvement.

I originally suggested an option to allow specifying whether to process the
main relation, but we ended up only adding PROCESS_TOAST [0].  FWIW I still
think that such an option would be useful for the reasons you describe.

[0] https://postgr.es/m/BA8951E9-1524-48C5-94AF-73B1F0D7857F%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Justin Pryzby
On Thu, Dec 15, 2022 at 10:10:43AM -0800, Jeff Davis wrote:
> On Thu, 2022-12-15 at 12:31 +0300, Pavel Luzanov wrote:
> > I think the approach that Nathan implemented [1] for TOAST tables
> > in the latest version can be used for partitioned tables as well.
> > Skipping the privilege check for partitions while working with
> > a partitioned table. In that case we would get exactly the same
> > behavior
> > as for INSERT, SELECT, etc privileges - the MAINTAIN privilege would 
> > work for
> > the whole partitioned table, but not for individual partitions.
> 
> There is some weirdness in 15, too:

I gather you mean postgresql v15.1 and master ?

>-- the following commands seem inconsistent to me:
>vacuum p; -- skips p1 with warning
>analyze p; -- skips p1 with warning
>cluster p using p_idx; -- silently skips p1
>reindex table p; -- reindexes p0 and p1 (owned by su)

Clustering on a partitioned table is new in v15, and this behavior is
from 3f19e176ae0 and cfdd03f45e6, which added
get_tables_to_cluster_partitioned(), borrowing from expand_vacuum_rel()
and get_tables_to_cluster().

vacuum initially calls vacuum_is_permitted_for_relation() only for the
partitioned table, and *later* locks the partition and then checks its
permissions, which is when the message is output.

Since v15, cluster calls get_tables_to_cluster_partitioned(), which
silently discards partitions failing ACL.

We could change it to emit a message, which would seem to behave like
vacuum, except that the check is happening earlier, and (unlike vacuum)
partitions skipped later during CLUOPT_RECHECK wouldn't have any message
output.

Or we could change cluster_rel() to output a message when skipping.  But
these patches hardly touched that function at all.  I suppose we could
change to emit a message during RECHECK (maybe only in master branch).
If need be, that could be controlled by a new CLUOPT_*.

-- 
Justin




Re: Pluggable toaster

2022-12-15 Thread Nikita Malakhov
Hi hackers!

I want to bump this thread and remind the community about Pluggable TOAST.
Overall Pluggable TOAST was revised to work without altering PG_ATTRIBUTE
table
- no atttoaster column, allow to drop unused Toasters and some other
improvements.
Sorry for the big delay, but it was a big piece of work to do, and the work
is still going on.

Here are the main highlights:

1) No need to modify the PG_ATTRIBUTE table. We've introduced a new catalog
table, with a set of internal support functions, that keeps all table-Toaster
relations:
postgres@postgres=# \d+ pg_toastrel;
                                          Table "pg_catalog.pg_toastrel"
    Column    |   Type   | Collation | Nullable | Default | Storage | Toaster | Compression | Stats target | Description
--------------+----------+-----------+----------+---------+---------+---------+-------------+--------------+-------------
 oid          | oid      |           | not null |         | plain   |         |             |              |
 toasteroid   | oid      |           | not null |         | plain   |         |             |              |
 relid        | oid      |           | not null |         | plain   |         |             |              |
 toastentid   | oid      |           | not null |         | plain   |         |             |              |
 attnum       | smallint |           | not null |         | plain   |         |             |              |
 version      | smallint |           | not null |         | plain   |         |             |              |
 relname      | name     |           | not null |         | plain   |         |             |              |
 toastentname | name     |           | not null |         | plain   |         |             |              |
 flag         | "char"   |           | not null |         | plain   |         |             |              |
 toastoptions | "char"   |           | not null |         | plain   |         |             |              |
Indexes:
    "pg_toastrel_oid_index" PRIMARY KEY, btree (oid)
    "pg_toastrel_name_index" UNIQUE CONSTRAINT, btree (toasteroid, relid, version, attnum)
    "pg_toastrel_rel_index" btree (relid, attnum)
    "pg_toastrel_tsr_index" btree (toasteroid)
Access method: heap
(This is not the final definition.)

This approach allows us to keep the full Toaster assignment history, to
correctly store Toasters assigned to different columns of the relation, and
even to have separate TOAST storage entities (which need not be regular TOAST
tables) for different columns.

When a table with a TOASTable column is created, a new row is inserted into
PG_TOASTREL with the source table OID, Toaster OID, created TOAST entity OID,
and column (attribute) index. The special field "version" keeps the history of
Toasters assigned to the column: it is a counter that increases with each
assignment, and the highest version identifies the current Toaster for the
column. All assigned Toasters are automatically cached; when a value is
TOASTed, the lookup is done in the cache first, and if there is no cached
Toaster it is looked up in PG_TOASTREL and inserted into the cache.

2) Attribute "reltoastrelid" was replaced with calls of PG_TOASTREL support
functions.
This was done to allow each TOASTed column to be assigned with different
Toaster
and have its individual TOAST table.

3) The DROP TABLE command was modified to remove the corresponding records
from PG_TOASTREL, to allow dropping Toasters that are no longer in use.

4) A DROP TOASTER command was introduced. It allows dropping unused Toasters -
the ones that have no records in PG_TOASTREL. If a Toaster is assigned to a
column, it cannot be dropped, because all data TOASTed with it would be lost.
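
For illustration, a sketch of the intended usage (the Toaster name here is
made up):

    DROP TOASTER my_custom_toaster;
    -- fails if any PG_TOASTREL record still references the Toaster, since
    -- data TOASTed with it would otherwise become unreadable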

The branch is still in development, so it is too early for a patch, but here
is a link to the repo:
https://github.com/postgrespro/postgres/tree/toastapi_with_ctl

On Mon, Nov 7, 2022 at 1:35 PM Aleksander Alekseev 
wrote:

> Hi Nikita,
>
> > Pluggable TOAST is provided as an interface to allow developers to plug
> > in custom TOAST mechanics. It does not determines would it be universal
> > Toaster or one data type, but syntax for universal Toaster is out of
> scope
> > for this patchset.
>
> If I understand correctly, what is going to happen - the same
> interface TsrRoutine is going to be used for doing two things:
>
> 1) Implementing type-aware TOASTers as a special case for the default
> TOAST algorithm when EXTERNAL storage strategy is used.
> 2) Implementing universal TOASTers from scratch that have nothing to
> do with the default TOAST algorithm.
>
> Assuming this is the case, using the same interface for doing two very
> different things doesn't strike me as a great design decision. While
> working on v24 you may want to rethink this.
>
> Personally I believe that Pluggable TOASTers should support only case
> #2. If there is a need of reusing parts of the default TOASTer,
> corresponding pieces of code should be declared as non-static and
> called from the pluggable TOASTers directly.
>
> Alternative

Re: [PATCH] random_normal function

2022-12-15 Thread Paul Ramsey


> On Dec 14, 2022, at 9:17 PM, Michael Paquier  wrote:
> 
> On Tue, Dec 13, 2022 at 03:51:11PM -0800, Paul Ramsey wrote:
>> Clearing up one CI failure.
> 
> +-- normal values converge on stddev == 2.0
> +SELECT round(stddev(random_normal(2, 2)))
> +  FROM generate_series(1, 1);
> 
> I am not sure that it is a good idea to make a test based on a random
> behavior that should tend to a normalized value.  This is costly in
> cycles, requiring a lot of work just for generate_series().  You could
> do the same kind of thing as random() a few lines above?
> 
> +SELECT bool_and(random_string(16) != random_string(16)) AS same
> +  FROM generate_series(1,8);
> That should be fine in terms of impossible chances :)
> 
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("return size must be non-negative")))
> This could have a test, same for 0.
> 
> +#ifndef M_PI
> +#define M_PI 3.14159265358979323846
> +#endif
> Postgres' float.h includes one version of that.

Thanks again!

P



random_normal_07a.patch
Description: Binary data


random_normal_07b.patch
Description: Binary data


Re: Speedup generation of command completion tags

2022-12-15 Thread David Rowley
On Mon, 12 Dec 2022 at 14:48, David Rowley  wrote:
>
> On Sun, 11 Dec 2022 at 11:05, Andres Freund  wrote:
> > What about moving make_completion_tag() to cmdtag.c? Then we could just get
> > the entire CommandTagBehaviour struct at once. It's not super pretty to pass
> > QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem
> > it problematic, we could just pass qc->commandTag, qc->nprocessed as a
> > separate arguments.
>
> That seems like a good idea. I've renamed and moved the function in
> the attached. I also adjusted how the trailing NUL char is appended to
> avoid having to calculate len + 1 and append the NUL char twice for
> the most commonly taken path.

I've pushed the updated patch. Thanks for having a look.

David




Re: Error-safe user functions

2022-12-15 Thread Tom Lane
Robert Haas  writes:
> Tom, I just want to extend huge thanks to you for working on this
> infrastructure.

Thanks.  I agree it's an important bit of work.

I'm going to step back from this for now and get on with other work,
but before that I thought there was one more input function I should
look at: xml_in, because xml.c is such a hairy can of worms.  It
turns out to be not too bad, given our design principle that only
"bad input" errors should be reported softly.  xml_parse() now has
two different ways of reporting errors depending on whether they're
hard or soft, but it didn't take an undue amount of refactoring to
make that work.

While fixing that, my attention was drawn to wellformed_xml(),
whose error handling is unbelievably horrid: it traps any longjmp
whatsoever (query cancel, for instance) and reports it as ill-formed XML.
0002 attached makes use of this new code to get rid of the need for any
PG_TRY there at all; instead, soft errors result in a "false" return
but hard errors are allowed to propagate.  xml_is_document was much more
careful, but we can change it the same way to save code and cycles.

regards, tom lane

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 3884babc1b..4b9877ee0b 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -119,9 +119,10 @@ struct PgXmlErrorContext
 
 static xmlParserInputPtr xmlPgEntityLoader(const char *URL, const char *ID,
 		   xmlParserCtxtPtr ctxt);
+static void xml_errsave(Node *escontext, PgXmlErrorContext *errcxt,
+		int sqlcode, const char *msg);
 static void xml_errorHandler(void *data, xmlErrorPtr error);
-static void xml_ereport_by_code(int level, int sqlcode,
-const char *msg, int code);
+static int	errdetail_for_xml_code(int code);
 static void chopStringInfoNewlines(StringInfo str);
 static void appendStringInfoLineSeparator(StringInfo str);
 
@@ -143,7 +144,8 @@ static bool print_xml_decl(StringInfo buf, const xmlChar *version,
 		   pg_enc encoding, int standalone);
 static bool xml_doctype_in_content(const xmlChar *str);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
-		   bool preserve_whitespace, int encoding);
+		   bool preserve_whitespace, int encoding,
+		   Node *escontext);
 static text *xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt);
 static int	xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
    ArrayBuildState *astate,
@@ -261,14 +263,18 @@ xml_in(PG_FUNCTION_ARGS)
 	xmltype*vardata;
 	xmlDocPtr	doc;
 
+	/* Build the result object. */
 	vardata = (xmltype *) cstring_to_text(s);
 
 	/*
-	 * Parse the data to check if it is well-formed XML data.  Assume that
-	 * ERROR occurred if parsing failed.
+	 * Parse the data to check if it is well-formed XML data.
+	 *
+	 * Note: we don't need to worry about whether a soft error is detected.
 	 */
-	doc = xml_parse(vardata, xmloption, true, GetDatabaseEncoding());
-	xmlFreeDoc(doc);
+	doc = xml_parse(vardata, xmloption, true, GetDatabaseEncoding(),
+	fcinfo->context);
+	if (doc != NULL)
+		xmlFreeDoc(doc);
 
 	PG_RETURN_XML_P(vardata);
 #else
@@ -323,9 +329,10 @@ xml_out_internal(xmltype *x, pg_enc target_encoding)
 		return buf.data;
 	}
 
-	xml_ereport_by_code(WARNING, ERRCODE_INTERNAL_ERROR,
-		"could not parse XML declaration in stored value",
-		res_code);
+	ereport(WARNING,
+			errcode(ERRCODE_INTERNAL_ERROR),
+			errmsg_internal("could not parse XML declaration in stored value"),
+			errdetail_for_xml_code(res_code));
 #endif
 	return str;
 }
@@ -392,7 +399,7 @@ xml_recv(PG_FUNCTION_ARGS)
 	 * Parse the data to check if it is well-formed XML data.  Assume that
 	 * xml_parse will throw ERROR if not.
 	 */
-	doc = xml_parse(result, xmloption, true, encoding);
+	doc = xml_parse(result, xmloption, true, encoding, NULL);
 	xmlFreeDoc(doc);
 
 	/* Now that we know what we're dealing with, convert to server encoding */
@@ -754,7 +761,7 @@ xmlparse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace)
 	xmlDocPtr	doc;
 
 	doc = xml_parse(data, xmloption_arg, preserve_whitespace,
-	GetDatabaseEncoding());
+	GetDatabaseEncoding(), NULL);
 	xmlFreeDoc(doc);
 
 	return (xmltype *) data;
@@ -895,7 +902,7 @@ xml_is_document(xmltype *arg)
 	PG_TRY();
 	{
 		doc = xml_parse((text *) arg, XMLOPTION_DOCUMENT, true,
-		GetDatabaseEncoding());
+		GetDatabaseEncoding(), NULL);
 		result = true;
 	}
 	PG_CATCH();
@@ -1500,17 +1507,26 @@ xml_doctype_in_content(const xmlChar *str)
 
 
 /*
- * Convert a C string to XML internal representation
+ * Convert a text object to XML internal representation
+ *
+ * data is the source data (must not be toasted!), encoding is its encoding,
+ * and xmloption_arg and preserve_whitespace are options for the
+ * transformation.
+ *
+ * Errors normally result in ereport(ERROR), but if escontext is an
+ * ErrorSaveContext, then "safe" errors are reported there inste

RE: Partial aggregates pushdown

2022-12-15 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Freund.

> cfbot complains about some compiler warnings when building with clang:
> https://cirrus-ci.com/task/6606268580757504
> 
> deparse.c:3459:22: error: equality comparison with extraneous parentheses
> [-Werror,-Wparentheses-equality]
> if ((node->aggsplit == AGGSPLIT_SIMPLE)) {
>  ~~~^~
> deparse.c:3459:22: note: remove extraneous parentheses around the
> comparison to silence this warning
> if ((node->aggsplit == AGGSPLIT_SIMPLE)) {
> ~   ^ ~
> deparse.c:3459:22: note: use '=' to turn this equality comparison into an
> assignment
> if ((node->aggsplit == AGGSPLIT_SIMPLE)) {
> ^~
> =
I fixed this error.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


0001-Partial-aggregates-push-down-v17.patch
Description: 0001-Partial-aggregates-push-down-v17.patch


Re: The drop-index-concurrently-1 isolation test no longer tests what it was meant to

2022-12-15 Thread David Rowley
On Thu, 15 Dec 2022 at 18:26, David Rowley  wrote:
> I propose the attached which gets rid of the not-so-great casting
> method that was originally added to this test to try and force the seq
> scan.  It seems a little dangerous to put in hacks like that to force
> a particular plan when the resulting plan ends up penalized with a
> (1.0e10) disable_cost.  The planner is just not going to be stable
> when the plan includes such a large penalty. To force the planner,
> I've added another test step that sets enable_seqscan to true and
> adjusted the permutations to run that just before preparing the seq
> scan query.

Pushed and backpatched to 13, where incremental sorts were added.

David




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-15 Thread Nathan Bossart
I tried setting wal_retrieve_retry_interval to 1ms for all TAP tests
(similar to what was done in 2710ccd), and I noticed that the recovery
tests consistently took much longer.  Upon further inspection, it looks
like the same (or a very similar) race condition described in e5d494d's
commit message [0].  With some added debug logs, I see that all of the
callers of MaybeStartWalReceiver() complete before SIGCHLD is processed, so
ServerLoop() waits for a minute before starting the WAL receiver.

A simple fix is to have DetermineSleepTime() take the WalReceiverRequested
flag into consideration.  The attached 0002 patch shortens the sleep time
to 100ms if it looks like we are waiting on a SIGCHLD.  I'm not certain
this is the best approach, but it seems to fix the tests.

On my machine, I see the following improvements in the tests (all units in
seconds):
                  HEAD   patched (v9)
 check-world -j8   165    138
 subscription      120     75
 recovery          111    108

[0] https://postgr.es/m/21344.1498494720%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a80c2b3b2a80d024b72be9fca6b5d4136b8b8272 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 21 Nov 2022 16:01:01 -0800
Subject: [PATCH v9 1/3] wake up logical workers as needed instead of relying
 on periodic wakeups

---
 src/backend/access/transam/xact.c   |  3 ++
 src/backend/commands/alter.c|  7 
 src/backend/commands/subscriptioncmds.c |  4 ++
 src/backend/replication/logical/tablesync.c | 10 +
 src/backend/replication/logical/worker.c| 46 +
 src/include/replication/logicalworker.h |  3 ++
 6 files changed, 73 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b7c7fd9f00..70ad51c591 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -47,6 +47,7 @@
 #include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
 #include "replication/syncrep.h"
@@ -2360,6 +2361,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true, is_parallel_worker);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_LogicalRepWorkers(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2860,6 +2862,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_LogicalRepWorkers(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 10b6fe19a2..d095cd3ced 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -59,6 +59,7 @@
 #include "commands/user.h"
 #include "miscadmin.h"
 #include "parser/parse_func.h"
+#include "replication/logicalworker.h"
 #include "rewrite/rewriteDefine.h"
 #include "tcop/utility.h"
 #include "utils/builtins.h"
@@ -279,6 +280,12 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
 		if (strncmp(new_name, "regress_", 8) != 0)
 			elog(WARNING, "subscriptions created by regression test cases should have names starting with \"regress_\"");
 #endif
+
+		/*
+		 * Wake up the logical replication workers to handle this change
+		 * quickly.
+		 */
+		LogicalRepWorkersWakeupAtCommit(objectId);
 	}
 	else if (nameCacheId >= 0)
 	{
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index d673557ea4..d6993c26e5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
@@ -1362,6 +1363,9 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 	InvokeObjectPostAlterHook(SubscriptionRelationId, subid, 0);
 
+	/* Wake up the logical replication workers to handle this change quickly. */
+	LogicalRepWorkersWakeupAtCommit(subid);
+
 	return myself;
 }
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 94e813ac53..509fe2eb19 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -105,6 +105,7 @@
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "replication/slot.h"
@@ -619,6 +620,15 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 
 	if (started_tx)
 	{
+		/*
+		 * If we are ready to enable two_phase mode, wake u

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-15 Thread David Christensen
On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier  wrote:
>
> On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > I can get one sent in tomorrow.

This v10 should incorporate your feedback as well as Bharath's.

> -XLogRecordHasFPW(XLogReaderState *record)
> +XLogRecordHasFPI(XLogReaderState *record)
> This still refers to a FPW, so let's leave that out as well as any
> renamings of this kind..

Reverted those changes.

> +   if (config.save_fpi_path != NULL)
> +   {
> +   /* Create the dir if it doesn't exist */
> +   if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
> +   {
> +   pg_log_error("could not create output directory \"%s\": %m",
> +config.save_fpi_path);
> +   goto bad_argument;
> +   }
> +   }
> It seems to me that you could allow things to work even if the
> directory exists and is empty.  See for example
> verify_dir_is_empty_or_create() in pg_basebackup.c.

The `pg_mkdir_p()` supports an existing directory (and I don't think
we want to require it to be empty first), so this only errors when it
can't create a directory for some reason.

> +my $file_re =
> +  
> qr/^([0-9A-F]{8})-([0-9A-F]{8})[.][0-9]+[.][0-9]+[.][0-9]+[.][0-9]+(?:_vm|_init|_fsm|_main)?$/;
> This is artistic to parse for people not used to regexps (I do, a
> little).  Perhaps this could use a small comment with an example of
> name or a reference describing this format?

Added a description of what this is looking for.
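
For readers less used to regexps: a matching file name would look something
like this (values illustrative) - apparently the record LSN split into two
hex halves, then tablespace OID, database OID, relfilenode, and block number,
plus an optional fork suffix:

    00000001-0000002B.1663.5.16384.0_main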

> +# Set umask so test directories and files are created with default 
> permissions
> +umask(0077);
> Incorrect copy-paste coming from elsewhere like the TAP tests of
> pg_basebackup with group permissions?  Doesn't
> PostgreSQL::Test::Utils::tempdir give already enough protection in
> terms of umask() and permissions?

I'd expect that's where that came from. Removed.

> +   if (config.save_fpi_path != NULL)
> +   {
> +   /* We fsync our output directory only; since these files are not part
> +* of the production database we do not require the performance hit
> +* that fsyncing every FPI would entail, so are doing this as a
> +* compromise. */
> +
> +   fsync_fname(config.save_fpi_path, true);
> +   }
> Why is it necessary to flush anything at all, then?

I personally don't think it is, but added it per Bharath's request.
Removed in this revision.

> +my $relation = $node->safe_psql('postgres',
> +   q{SELECT format('%s/%s/%s', CASE WHEN reltablespace = 0 THEN
> dattablespace ELSE reltablespace END, pg_database.oid,
> pg_relation_filenode(pg_class.oid)) FROM pg_class, pg_database WHERE
> relname = 'test_table' AND datname = current_database()}
> Could you rewrite that to be on multiple lines?

Sure, reformated.

> +diag "using walfile: $walfile";
> You should avoid the use of "diag", as this would cause extra output
> noise with a simple make check.

Had been using it for debugging and didn't realize it'd cause issues.
Removed both instances.

> +$node->safe_psql('postgres', "SELECT 
> pg_drop_replication_slot('regress_pg_waldump_slot')")
> That's not really necessary, the nodes are wiped out at the end of the
> test.

Removed.

> +$node->safe_psql('postgres', <<EOF);
> +SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> Using an EOF to execute a multi-line query would be a first.  Couldn't
> you use the same thing as anywhere else?  009_twophase.pl just to
> mention one.  (Mentioned by Bharath upthread, where he asked for an
> extra opinion so here it is.)

Fair enough, while idiomatic perl to me, not a hill to die on;
converted to a standard multiline string.

Best,

David


v10-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data


Exclusion constraints on partitioned tables

2022-12-15 Thread Paul Jungwirth

Hello Hackers,

I'm trying to get things going again on my temporal tables work, and 
here is a small patch to move that forward.


It lets you create exclusion constraints on partitioned tables, similar 
to today's rules for b-tree primary keys & unique constraints:
just as we permit a PK on a partitioned table when the PK's columns are 
a superset of the partition keys, so we could also allow an exclusion 
constraint when its columns are a superset of the partition keys.


This patch also requires the matching constraint columns to use equality 
comparisons (`(foo WITH =)`), so it is really equivalent to the existing 
b-tree rule. Perhaps that is more conservative than necessary, but we 
can't permit an arbitrary operator, since some might require testing 
rows that fall into other partitions. For example `(foo WITH <>)` would 
obviously cause problems.


The exclusion constraint may still include other columns beyond the 
partition keys, and those may use equality operators or something else.


This patch is required to support temporal partitioned tables, because 
temporal tables use exclusion constraints as their primary key.
Essentially they are `(id WITH =, valid_at with &&)`. Since the primary 
key is not a b-tree, partitioning them would be forbidden prior to this 
patch. But now you could partition that table on `id`, and we could 
still correctly validate the temporal PK without requiring rows from 
other partitions.
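
For illustration, a minimal sketch of what this patch permits (the names are
mine, and the btree_gist extension is assumed for the integer equality
operator class):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE reservations (
        id        int,
        valid_at  tsrange,
        EXCLUDE USING gist (id WITH =, valid_at WITH &&)
    ) PARTITION BY RANGE (id);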


This patch may be helpful beyond just temporal tables (or for DIY 
temporal tables), so it seems worth submitting it separately.


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.comFrom 7daadf9e822509186c9e32794d0e29effdc90edc Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Wed, 23 Nov 2022 14:55:43 -0800
Subject: [PATCH v1] Allow some exclusion constraints on partitions

Previously we only allowed UNIQUE B-tree constraints on partitions
(and only if the constraint included all the partition keys). But we
could allow exclusion constraints with the same restriction. We also
require that those columns be compared for equality, not something like
&&.
---
 doc/src/sgml/ddl.sgml  | 12 ++--
 src/backend/commands/indexcmds.c   | 36 +--
 src/backend/parser/parse_utilcmd.c |  6 --
 src/test/regress/expected/alter_table.out  | 31 +++--
 src/test/regress/expected/create_table.out | 16 +++--
 src/test/regress/expected/indexing.out | 73 ++
 src/test/regress/sql/alter_table.sql   | 29 +++--
 src/test/regress/sql/create_table.sql  | 13 +++-
 src/test/regress/sql/indexing.sql  | 57 +++--
 9 files changed, 221 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 6e92bbddd2..59be911471 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4206,11 +4206,13 @@ ALTER INDEX measurement_city_id_logdate_key
 
  
   
-   There is no way to create an exclusion constraint spanning the
-   whole partitioned table.  It is only possible to put such a
-   constraint on each leaf partition individually.  Again, this
-   limitation stems from not being able to enforce cross-partition
-   restrictions.
+   Similarly an exclusion constraint must include all the 
+   partition key columns. Furthermore the constraint must compare those
+   columns for equality (not e.g. &&).
+   Again, this limitation stems from not being able to enforce
+   cross-partition restrictions. The constraint may include additional
+   columns that aren't part of the partition key, and it may compare
+   those with any operators you like.
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7dc1aca8fe..28840544b5 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -709,11 +709,6 @@ DefineIndex(Oid relationId,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
 			RelationGetRelationName(rel;
-		if (stmt->excludeOpNames)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create exclusion constraints on partitioned table \"%s\"",
-			RelationGetRelationName(rel;
 	}
 
 	/*
@@ -926,7 +921,7 @@ DefineIndex(Oid relationId,
 	 * We could lift this limitation if we had global indexes, but those have
 	 * their own problems, so this is a useful feature combination.
 	 */
-	if (partitioned && (stmt->unique || stmt->primary))
+	if (partitioned && (stmt->unique || stmt->primary || stmt->excludeOpNames != NIL))
 	{
 		PartitionKey key = RelationGetPartitionKey(rel);
 		const char *constraint_type;
@@ -983,6 +978,8 @@ DefineIndex(Oid relationId,
 			 */
 			if (accessMethodId == BTREE_AM_OID)
 eq_strategy = BTEqualStrategyNumber;
+			else if (accessMethodId == GIST_AM_OID)
+eq_strategy = RTEqualStr

Re: Infinite Interval

2022-12-15 Thread Joseph Koshakow
On Mon, Dec 12, 2022 at 8:05 AM Ashutosh Bapat
 wrote:
>
> Hi Joseph,
> I stumbled upon this requirement a few times. So I started working on
> this support in my spare time as a hobby project to understand
> horology code in PostgreSQL. This was sitting in my repositories for
> more than a year. Now that I have someone else showing an interest,
> it's time for it to face the world. Rebased it, fixed conflicts.
>
> PFA patch implementing infinite interval. It's still WIP, there are
> TODOs in the code and also the commit message lists things that are
> known to be incomplete. You might want to assess expected output
> carefully

That's great! I was also planning to just work on it as a hobby
project, so I'll try and review and add updates as I find free
time as well.

> > The proposed design from the most recent thread was to reserve
> > INT32_MAX months for infinity and INT32_MIN months for negative
> > infinity. As pointed out in the thread, these are currently valid
> > non-infinite intervals, but they are out of the documented range.
>
> The patch uses both months and days together to avoid this problem.

Can you expand on this part? I believe the full range of representable
intervals is considered valid as of v15.
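
For instance, this is accepted today even though it is outside the documented
range, so reserving it for infinity would change the meaning of
already-storable values:

    SELECT interval '2147483647 months';  -- INT32_MAX months, currently valid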

- Joe Koshakow




Add enable_presorted_aggregate GUC

2022-12-15 Thread David Rowley
Over on [1] Pavel reports that PG16's performance is slower than
PG14's for his ORDER BY aggregate query.  This seems to be mainly down
to how the costing works for Incremental Sort. Now, ever since
1349d279, the planner will make a plan that's ordered by the GROUP BY
clause and add on the path key requirements for any ORDER BY
aggregates.  I had in mind that if someone had already tuned their
schema to have a btree index on the GROUP BY clause, allowing Group
Aggregate to receive presorted rows from the index, then the planner would
likely just take that index and add an Incremental Sort on top of it
to obtain the rows in the order required for the aggregate functions.

The main cause of Pavel's reported regression is a cost
"pessimism factor" that was added to incremental sorts where we
multiply by 1.5 to reduce the chances of an Incremental Sort plan due
to uncertainties around the number of tuples per presorted group. We
assume each group will have the same number of tuples. If there's some
skew then there will be a larger N factor in the qsort complexity of
O(N * log2(N)).

Over on [1], I'm proposing that we remove that pessimism factor. If we
keep teaching the planner new tricks but cost them pessimistically
then we're not taking full advantage of said new tricks. If you
disagree with that, then best to raise it on [1].

On [1], in addition to removing the * 1.5 factor, I also propose that
we add a new GUC named "enable_presorted_aggregate", which, if turned
off will make the planner not request that the plan is also ordered by
the aggregate function's ORDER BY / DISTINCT clause.  The reason I
think that this is required is that even with removing the pessimism
factor from incremental sort, it's possible the planner will choose to
perform a full sort rather than an incremental sort on top of some
existing index which provides presorted input for only the query's
GROUP BY clause. e.g.

CREATE TABLE ab (a INT, b INT);
CREATE INDEX ON ab(a);

EXPLAIN SELECT a,array_agg(b ORDER BY b) FROM ab GROUP BY a;

Previous to 1349d279, the planner, assuming there's a good amount of
rows in the table, would be very likely to use the index for the GROUP
BY then the executor would be left to do a sort on "b" within
nodeAgg.c.  The equivalent of that post-1349d279 is Index Scan ->
Incremental Sort -> Group Aggregate (with presorted input).  However,
the planner could choose to: Seq Scan -> Sort (a,b) -> Group Aggregate
(with presorted input).  So this leaves an opportunity for the planner
to choose a worse plan.
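
For example, the proposed escape hatch would be used like this (GUC name as
in the attached patch):

    SET enable_presorted_aggregate = off;
    -- the planner no longer requests rows ordered by the aggregate's
    -- ORDER BY, so nodeAgg.c sorts "b" itself, as it did before 1349d279
    EXPLAIN SELECT a, array_agg(b ORDER BY b) FROM ab GROUP BY a;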

Normally we add some enable_* GUC to leave an escape hatch when we add
some new feature like this. I likely should have done that when I
added 1349d279, but I didn't and I want to now.

I mainly just shifted this discussion out of [1] as we normally like
to debate GUC names and I feel that the discussion over on the other
thread is buried a little too deep to be visible to most people.

Can anyone think of a better name?  Or does anyone see error with my ambition?

David

[1] 
https://www.postgresql.org/message-id/9f61ddbf-2989-1536-b31e-6459370a6...@postgrespro.ru
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e4145979d..9eedab652d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5311,6 +5311,29 @@ ANY num_sync ( 
+  enable_presorted_aggregate 
(boolean)
+  
+   enable_presorted_aggregate configuration 
parameter
+  
+  
+  
+   
+Controls if the query planner will produce a plan which will provide
+rows which are presorted in the order required for the query's
+ORDER BY / DISTINCT aggregate
+functions.  When disabled, the query planner will produce a plan which
+will always require the executor to perform a sort before performing
+aggregation of each aggregate function containing an
+ORDER BY or DISTINCT clause.
+When enabled, the planner will try to produce a more efficient plan
+which provides input to the aggregate functions which is presorted in
+the order they require for aggregation.  The default value is
+on.
+   
+  
+ 
+
  
   enable_seqscan (boolean)
   
diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index 0f0bcfb7e5..89d3c4352c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -151,6 +151,7 @@ bool		enable_partitionwise_aggregate = false;
 bool   enable_parallel_append = true;
 bool   enable_parallel_hash = true;
 bool   enable_partition_pruning = true;
+bool   enable_presorted_aggregate = true;
 bool   enable_async_append = true;
 
 typedef struct
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index dfda251d95..e21e72eb87 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -

Re: Exclusion constraints on partitioned tables

2022-12-15 Thread Tom Lane
Paul Jungwirth  writes:
> It lets you create exclusion constraints on partitioned tables, similar 
> to today's rules for b-tree primary keys & unique constraints:
> just as we permit a PK on a partitioned table when the PK's columns are 
> a superset of the partition keys, so we could also allow an exclusion 
> constraint when its columns are a superset of the partition keys.

OK.  AFAICS that works in principle.

> This patch also requires the matching constraint columns to use equality 
> comparisons (`(foo WITH =)`), so it is really equivalent to the existing 
> b-tree rule.

That's not quite good enough: you'd better enforce that it's the same
equality operator (and same collation, if relevant) as is being used
in the partition key.  Remember that we don't have a requirement that
a datatype have only one equality operator; and these days I think
collation can affect equality, too.
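
As a concrete illustration of the collation point (assuming a build with ICU
support), equality itself can be collation-dependent:

    CREATE COLLATION case_insensitive
        (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
    SELECT 'abc' = 'ABC' COLLATE case_insensitive;  -- true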

Another problem is that while we can safely assume that we know what
BTEqualStrategyNumber means in btree, we can NOT assume that we know
what gist opclass strategy numbers mean: each opclass is free to
define those as it sees fit.  The part of your patch that is looking
at RTEqualStrategyNumber seems dangerously broken to me.

It might work better to consider the operator itself and ask if
it's equality in the same btree opfamily that's used by the
partition key.  (Hm, do we use btree opfamilies for all types of
partitioning?)

Anyway, I think something can be made of this, but you need to be less
fuzzy about matching the equality semantics of the partition key.

regards, tom lane




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Nathan Bossart
On Thu, Dec 15, 2022 at 01:48:13PM -0600, Justin Pryzby wrote:
> vacuum initially calls vacuum_is_permitted_for_relation() only for the
> partitioned table, and *later* locks the partition and then checks its
> permissions, which is when the message is output.
> 
> Since v15, cluster calls get_tables_to_cluster_partitioned(), which
> silently discards partitions failing ACL.
> 
> We could change it to emit a message, which would seem to behave like
> vacuum, except that the check is happening earlier, and (unlike vacuum)
> partitions skipped later during CLUOPT_RECHECK wouldn't have any message
> output.
> 
> Or we could change cluster_rel() to output a message when skipping.  But
> these patches hardly touched that function at all.  I suppose we could
> change to emit a message during RECHECK (maybe only in master branch).
> If need be, that could be controlled by a new CLUOPT_*.

Yeah, VACUUM/ANALYZE works differently.  For one, you can specify multiple
relations in the command.  Also, VACUUM/ANALYZE only takes an
AccessShareLock when first assessing permissions (which actually skips
partitions).  CLUSTER immediately takes an AccessExclusiveLock, so the
permissions are checked up front.  This is done "to avoid lock-upgrade
hazard in the single-transaction case," which IIUC is what allows CLUSTER
on a single table to run within a transaction block (unlike VACUUM).  I
don't know if running CLUSTER in a transaction block is important.  IMO we
should consider making CLUSTER look a lot more like VACUUM/ANALYZE in this
regard.  The attached patch adds WARNING messages, but we're still far from
consistency with VACUUM/ANALYZE.
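
With the patch applied, Jeff's earlier example should presumably report
something like this (message text taken from the new errmsg):

    cluster p using p_idx;
    WARNING:  permission denied to cluster "p1", skipping it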

Side note: I see that CLUSTER on a partitioned table is disallowed in a
transaction block, which should probably be added to my documentation patch
[0].

[0] https://commitfest.postgresql.org/41/4054/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8966b75bd1..6d09d1c639 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -80,6 +80,7 @@ static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 static List *get_tables_to_cluster(MemoryContext cluster_context);
 static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
 			   Oid indexOid);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
 
 
 /*---
@@ -366,8 +367,7 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
 	if (recheck)
 	{
 		/* Check that the user still has privileges for the relation */
-		if (!object_ownercheck(RelationRelationId, tableOid, save_userid) &&
-			pg_class_aclcheck(tableOid, save_userid, ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(tableOid, save_userid))
 		{
 			relation_close(OldHeap, AccessExclusiveLock);
 			goto out;
@@ -1646,8 +1646,7 @@ get_tables_to_cluster(MemoryContext cluster_context)
 
 		index = (Form_pg_index) GETSTRUCT(indexTuple);
 
-		if (!object_ownercheck(RelationRelationId, index->indrelid, GetUserId()) &&
-			pg_class_aclcheck(index->indrelid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1695,11 +1694,8 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 		if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
 			continue;
 
-		/* Silently skip partitions which the user has no access to. */
-		if (!object_ownercheck(RelationRelationId, relid, GetUserId()) &&
-			pg_class_aclcheck(relid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK &&
-			(!object_ownercheck(DatabaseRelationId, MyDatabaseId, GetUserId()) ||
-			 IsSharedRelation(relid)))
+		/* skip partitions which the user has no access to */
+		if (!cluster_is_permitted_for_relation(relid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1715,3 +1711,21 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 
 	return rtcs;
 }
+
+/*
+ * Check if the current user has privileges to cluster the relation.  If not,
+ * issue a WARNING log message and return false to let the caller decide what
+ * to do with this relation.
+ */
+static bool
+cluster_is_permitted_for_relation(Oid relid, Oid userid)
+{
+	if (object_ownercheck(RelationRelationId, relid, userid) ||
+		pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
+		return true;
+
+	ereport(WARNING,
+			(errmsg("permission denied to cluster \"%s\", skipping it",
+	get_rel_name(relid;
+	return false;
+}
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..830b306907 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -352,7 

Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-12-15 Thread David Rowley
On Fri, 16 Dec 2022 at 00:10, David Rowley  wrote:
> v3-0002 removes the 1.5 x cost pessimism from incremental sort and
> also rewrites how we make incremental sort paths.  I've now gone
> through the remaining places where we create an incremental sort path
> to give all those the same treatment that I'd added to
> add_paths_to_grouping_rel(). There was a 1 or 2 plan changes in the
> regression tests.  One was the isolation test change, which I claim to
> be a broken test and should be fixed another way.  The other was
> performing a Sort on the cheapest input path which had presorted keys.
> That plan now uses an Incremental Sort to make use of the presorted
> keys. I'm happy to see just how much redundant code this removes.
> About 200 lines.

I've now pushed this patch. Thanks for the report and everyone for all
the useful discussion. Also Richard for the review.

> v3-0003 adds the enable_presorted_aggregate GUC.

This I've moved off to [1]. We tend to have lengthy discussions about
GUCs, what to name them and if we actually need them. I didn't want to
bury that discussion in this old and already long thread.

David

[1] 
https://postgr.es/m/CAApHDvqzuHerD8zN1Qu=d66e3bp1=9ifn09ziq3zug_phi6...@mail.gmail.com




Re: generic plans and "initial" pruning

2022-12-15 Thread Amit Langote
On Wed, Dec 14, 2022 at 5:35 PM Amit Langote  wrote:
> I have moved the original functionality of GetCachedPlan() to
> GetCachedPlanInternal(), turning the former into a sort of controller
> as described shortly.  The latter's CheckCachedPlan() part now only
> locks the "minimal" set of, non-prunable, relations, making a note of
> whether the plan contains any prunable subnodes and thus prunable
> relations whose locking is deferred to the caller, GetCachedPlan().
> GetCachedPlan(), as a sort of controller as mentioned before, does the
> pruning if needed on the minimally valid plan returned by
> GetCachedPlanInternal(), locks the partitions that survive, and redoes
> the whole thing if the locking of partitions invalidates the plan.

After sleeping on it, I realized this doesn't have to be that
complicated.   Rather than turn GetCachedPlan() into a wrapper for
handling deferred partition locking as outlined above, I could have
changed it more simply as follows to get the same thing done:

if (!customplan)
{
-   if (CheckCachedPlan(plansource))
+   boolhasUnlockedParts = false;
+
+   if (CheckCachedPlan(plansource, &hasUnlockedParts) &&
+   hasUnlockedParts &&
+   CachedPlanLockPartitions(plansource, boundParams, owner, extra))
{
/* We want a generic plan, and we already have a valid one */
plan = plansource->gplan;

Attached updated patch does it like that.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v30-0002-In-GetCachedPlan-only-lock-unpruned-partitions.patch
Description: Binary data


v30-0001-Preparatory-refactoring-before-reworking-CachedP.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-12-15 Thread Amit Kapila
On Thu, Dec 15, 2022 at 5:33 PM Melih Mutlu  wrote:
>
> Also here are some numbers with 10 tables loaded with some data :
>
>          | 10 MB        | 100 MB
> ---------+--------------+---------------
>  master  |  2868.524 ms |  14281.711 ms
>  patch   |  1750.226 ms |  14592.800 ms
>
> The gap between master and the patch narrows as the table size
> increases, as expected.
>

Right, but when the size is 100MB, it seems to be taking a bit more
time. Do we want to evaluate with different sizes to see how it looks?
Other than that all the numbers are good.

-- 
With Regards,
Amit Kapila.




Typo macro name on FreeBSD?

2022-12-15 Thread Japin Li


Hi, hackers

Recently, I compiled PostgreSQL on FreeBSD and found that commit a2a8acd152
introduces a __freebsd__ macro; however, I cannot find this macro on FreeBSD 13.
There is only a __FreeBSD__ macro. Is this a typo?

root@freebsd:~ # uname -a
FreeBSD freebsd 13.1-RELEASE-p3 FreeBSD 13.1-RELEASE-p3 GENERIC amd64
root@freebsd:~ # echo | gcc10 -dM -E - | grep -i 'freebsd'
#define __FreeBSD__ 13


-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-15 Thread Amit Kapila
On Thu, Dec 15, 2022 at 1:42 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Horiguchi-san, Amit,
>
> > > Yes, that would be ideal. But do you know why that is a must?
> >
> > I believe a graceful shutdown (fast and smart) of a replication set is 
> > expected to
> > be in sync.  Of course we can change the policy to allow walsnder to stop 
> > before
> > confirming all WAL have been applied. However walsender doesn't have an idea
> > of  wheter the peer is intentionally delaying or not.
>
> This mechanism was introduced by 985bd7[1], which was needed to support a
> "clean" switchover. I think it is needed for physical replication, but it is 
> not
> clear for the logical case.
>
> When the postmaster is stopped in fast or smart mode, we expected that all
> modifications were received by secondary. This requirement seems to be not 
> changed
> from the initial commit.
>
> Before 985bd7, the walsender exited just after sending the final WAL, which 
> meant
> that sometimes the last packet could not reach to secondary. So there was a 
> possibility
> of failing to reboot the primary as a new secondary because the new primary 
> does
> not have the last WAL record. To avoid the above walsender started waiting for
> flush before exiting.
>
> But in the case of logical replication, I'm not sure whether this limitation 
> is
> really needed or not. I think it may be OK that walsender exits without 
> waiting,
> in case of delaying applies. Because we don't have to consider the above issue
> for logical replication.
>

I also don't see the need for this mechanism for logical replication,
and in fact, why do we need to even wait for sending the existing WAL?

I think the reason why we don't need to wait for logical replication
is that after the restart, we always start sending WAL from the
location requested by the subscriber, or till the point where the
publisher knows the confirmed flush location of the subscriber.
Consider another case where after restart publisher (node-1) wants to
act as a subscriber for the previous subscriber (node-2). Now, the new
subscriber (node-1) won't have a way to tell the new publisher
(node-2) to start from the location where node-1 went down, as
WAL locations between publisher and subscriber need not be the same.

This brings us to the question of whether users can use logical
replication for the scenario where they want the old master to follow
the new master after the restart which we typically do in physical
replication, if so how?

Another related point to consider is what is the behavior of
synchronous replication when shutdown has been performed both in the
case of physical and logical replication especially when the
time-delayed replication feature is enabled?

-- 
With Regards,
Amit Kapila.




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 11:11 AM Justin Pryzby  wrote:
> The patches (003 and 005) are missing a word
> should use to decide whether to its eager freezing strategy.

I mangled this during rebasing for v9, which reordered the commits.
Will be fixed in v10.

> On the wiki, missing a word:
> builds on related added

Fixed.

Thanks
-- 
Peter Geoghegan




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-15 Thread Amit Kapila
On Thu, Dec 15, 2022 at 11:22 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila  
> wrote in
> > On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  
> > > wrote in
> > > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
> > > >  wrote:
> > > > subscriber was busy enough that it doesn't need to add an additional
> > > > delay before applying a particular transaction(s) but adding a delay
> > > > to such a transaction on the publisher will actually make it take much
> > > > longer to reflect than expected. We probably need to name this
> > >
> > > Isn't the name min_apply_delay implying the same behavior? Even though
> > > the delay time will be a bit prolonged.
> > >
> >
> > Sorry, I don't understand what you intend to say in this point. In
> > above, I mean that the currently proposed patch won't have such a
> > problem but if we apply delay on publisher the problem can happen.
>
> Are you saing about the sender-side delay lets the whole transaction
> (if it have not streamed out) stay on the sender side?
>

It will not stay on the sender side forever but rather will be sent
after the min_apply_delay. The point I wanted to raise is that the
delay may not be needed at the moment we would end up applying it.
When we apply the delay on the apply side, it takes into account the
other load on the apply side. I don't know how much it matters,
but it appears logical to add the delay on the applying side.

-- 
With Regards,
Amit Kapila.




Re: Typo macro name on FreeBSD?

2022-12-15 Thread Thomas Munro
On Fri, Dec 16, 2022 at 4:44 PM Japin Li  wrote:
> Recently, I compile PostgreSQL on FreeBSD, I find commit a2a8acd152 introduecs
> __freebsd__ macro, however, I cannot find this macro on FreeBSD 13. There only
> has __FreeBSD__ macro. Is this a typo?

Yeah, that seems to be my fault.  Will fix.  Thanks!




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Jeff Davis
On Thu, 2022-12-15 at 17:19 -0800, Nathan Bossart wrote:
> The attached patch adds WARNING messages, but we're still far from
> consistency with VACUUM/ANALYZE.

But why make CLUSTER more like VACUUM? Shouldn't we make
VACUUM/CLUSTER/ANALYZE more like INSERT/SELECT/REINDEX?

As far as I can tell, the only way you can get in this situation in 15
is by having a superuser create partitions in a non-superuser's table.
I don't think that was an intended use case, so it's probably just
historical reasons that VACUUM/CLUSTER/ANALYZE ended up that way, not
precedent we should necessarily follow.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: fix and document CLUSTER privileges

2022-12-15 Thread Nathan Bossart
Here is a new version of the patch.  I've moved the privilege checks to a
new function, and I added a note in the docs about clustering partitioned
tables in a transaction block (it's not allowed).
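
As a usage sketch of what the docs below describe (the MAINTAIN privilege
and pg_maintain role come from this patch series, and the object names
are hypothetical):

    CREATE ROLE maint_user LOGIN;
    GRANT MAINTAIN ON measurement TO maint_user;  -- per-table grant
    GRANT pg_maintain TO maint_user;              -- or the blanket role

    SET ROLE maint_user;
    CLUSTER measurement USING measurement_idx;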

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 145101e6a5..9ca66cd2ee 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -67,7 +67,8 @@ CLUSTER [VERBOSE]
   
 
   
-   CLUSTER without any parameter reclusters all the
+   CLUSTER without a
+   table_name reclusters all the
previously-clustered tables in the current database that the calling user
owns or has the MAINTAIN privilege for, or all such tables
if called by a superuser or a role with privileges of the
@@ -134,6 +135,16 @@ CLUSTER [VERBOSE]
  
   Notes
 
+   
+To cluster a table, one must have the MAINTAIN privilege
+on the table or be the table's owner, a superuser, or a role with
+privileges of the
+pg_maintain
+role.  Database-wide clusters and clusters on partitioned tables will
+silently skip over any tables that the calling user does not have
+permission to cluster.
+   
+

 In cases where you are accessing single rows randomly
 within a table, the actual order of the data in the
@@ -202,7 +213,8 @@ CLUSTER [VERBOSE]

 Clustering a partitioned table clusters each of its partitions using the
 partition of the specified partitioned index.  When clustering a partitioned
-table, the index may not be omitted.
+table, the index may not be omitted.  CLUSTER on a
+partitioned table cannot be executed inside a transaction block.

 
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8966b75bd1..14328abfa6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -80,6 +80,7 @@ static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 static List *get_tables_to_cluster(MemoryContext cluster_context);
 static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
 			   Oid indexOid);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
 
 
 /*---
@@ -366,8 +367,7 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
 	if (recheck)
 	{
 		/* Check that the user still has privileges for the relation */
-		if (!object_ownercheck(RelationRelationId, tableOid, save_userid) &&
-			pg_class_aclcheck(tableOid, save_userid, ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(tableOid, save_userid))
 		{
 			relation_close(OldHeap, AccessExclusiveLock);
 			goto out;
@@ -1646,8 +1646,7 @@ get_tables_to_cluster(MemoryContext cluster_context)
 
 		index = (Form_pg_index) GETSTRUCT(indexTuple);
 
-		if (!object_ownercheck(RelationRelationId, index->indrelid, GetUserId()) &&
-			pg_class_aclcheck(index->indrelid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1696,10 +1695,7 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 			continue;
 
 		/* Silently skip partitions which the user has no access to. */
-		if (!object_ownercheck(RelationRelationId, relid, GetUserId()) &&
-			pg_class_aclcheck(relid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK &&
-			(!object_ownercheck(DatabaseRelationId, MyDatabaseId, GetUserId()) ||
-			 IsSharedRelation(relid)))
+		if (!cluster_is_permitted_for_relation(relid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1715,3 +1711,10 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 
 	return rtcs;
 }
+
+static bool
+cluster_is_permitted_for_relation(Oid relid, Oid userid)
+{
+	return object_ownercheck(RelationRelationId, relid, userid) ||
+		   pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK;
+}


Re: Exclusion constraints on partitioned tables

2022-12-15 Thread Paul Jungwirth

On 12/15/22 16:12, Tom Lane wrote:

This patch also requires the matching constraint columns to use equality
comparisons (`(foo WITH =)`), so it is really equivalent to the existing
b-tree rule.


That's not quite good enough: you'd better enforce that it's the same
equality operator (and same collation, if relevant) as is being used
in the partition key.
[snip]
It might work better to consider the operator itself and ask if
it's equality in the same btree opfamily that's used by the
partition key.


Thank you for taking a look! Here is a version that compares just the
operator itself.


I included a collation check too, but I'm not sure it's necessary. 
Exclusion constraints don't have a collation per se; it comes from the 
index, and we choose it just a little above in this function. (I'm not 
even sure how to elicit that new error message in a test case.)


I'm not sure what to do about matching the opfamily. In practice an 
exclusion constraint will typically use gist, but the partition key will 
always use btree/hash. You're saying that the equals operator can be 
inconsistent between those access methods? That is surprising to me, but 
I admit op classes/families are still sinking in. (Even prior to this 
patch, isn't the code for hash-based partitions looking up ptkey_eqop 
via the hash opfamily, and then comparing it to idx_eqop looked up via 
the btree opfamily?)


If partitions can only support btree-based exclusion constraints, you 
still wouldn't be able to partition a temporal table, because those 
constraints would always be gist. So I guess what I really want is to 
support gist index constraints on partitioned tables.
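
To make the temporal case concrete, this is the kind of constraint I have
in mind (a sketch; integer equality under gist needs the btree_gist
extension, and whether gist-backed constraints should qualify is exactly
the open question above):

    CREATE EXTENSION btree_gist;

    CREATE TABLE meas (
        room   int,
        during tsrange
    ) PARTITION BY LIST (room);

    -- partition key column compared with =, the other column with &&:
    ALTER TABLE meas
        ADD EXCLUDE USING gist (room WITH =, during WITH &&);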


Regards,

--
Paul  ~{:-)
p...@illuminatedcomputing.comFrom df25cb63cacd992fc7453fc432488c6046e59479 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Wed, 23 Nov 2022 14:55:43 -0800
Subject: [PATCH v2] Allow some exclusion constraints on partitions

Previously we only allowed UNIQUE B-tree constraints on partitions
(and only if the constraint included all the partition keys). But we
could allow exclusion constraints with the same restriction. We also
require that those columns be compared for equality, not something like
&&.
---
 doc/src/sgml/ddl.sgml  | 12 ++--
 src/backend/commands/indexcmds.c   | 42 +++--
 src/backend/parser/parse_utilcmd.c |  6 --
 src/test/regress/expected/alter_table.out  | 31 +++--
 src/test/regress/expected/create_table.out | 16 +++--
 src/test/regress/expected/indexing.out | 73 ++
 src/test/regress/sql/alter_table.sql   | 29 +++--
 src/test/regress/sql/create_table.sql  | 13 +++-
 src/test/regress/sql/indexing.sql  | 57 +++--
 9 files changed, 227 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 6e92bbddd2..59be911471 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4206,11 +4206,13 @@ ALTER INDEX measurement_city_id_logdate_key
 
  
   
-   There is no way to create an exclusion constraint spanning the
-   whole partitioned table.  It is only possible to put such a
-   constraint on each leaf partition individually.  Again, this
-   limitation stems from not being able to enforce cross-partition
-   restrictions.
+   Similarly, an exclusion constraint must include all the
+   partition key columns. Furthermore, the constraint must compare those
+   columns for equality (not e.g. &&).
+   Again, this limitation stems from not being able to enforce
+   cross-partition restrictions. The constraint may include additional
+   columns that aren't part of the partition key, and it may compare
+   those with any operators you like.
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7dc1aca8fe..a9ecb7df19 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -709,11 +709,6 @@ DefineIndex(Oid relationId,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
 			RelationGetRelationName(rel;
-		if (stmt->excludeOpNames)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create exclusion constraints on partitioned table \"%s\"",
-			RelationGetRelationName(rel;
 	}
 
 	/*
@@ -926,7 +921,7 @@ DefineIndex(Oid relationId,
 	 * We could lift this limitation if we had global indexes, but those have
 	 * their own problems, so this is a useful feature combination.
 	 */
-	if (partitioned && (stmt->unique || stmt->primary))
+	if (partitioned && (stmt->unique || stmt->primary || stmt->excludeOpNames != NIL))
 	{
 		PartitionKey key = RelationGetPartitionKey(rel);
 		const char *constraint_type;
@@ -983,6 +978,8 @@ DefineIndex(Oid relationId,
 			 */
 			if (accessMethodId == BTREE_AM_OID)
				eq_strategy = BTEqualStrategyNumber;

Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Nathan Bossart
On Thu, Dec 15, 2022 at 08:35:53PM -0800, Jeff Davis wrote:
> But why make CLUSTER more like VACUUM? Shouldn't we make
> VACUUM/CLUSTER/ANALYZE more like INSERT/SELECT/REINDEX?

Hm.  Since VACUUM may happen across multiple transactions, it is careful to
re-check the privileges for each relation.  For example, vacuum_rel()
contains this comment:

	/*
	 * Check if relation needs to be skipped based on privileges.  This check
	 * happens also when building the relation list to vacuum for a manual
	 * operation, and needs to be done additionally here as VACUUM could
	 * happen across multiple transactions where privileges could have changed
	 * in-between.  Make sure to only generate logs for VACUUM in this case.
	 */

I do wonder whether this is something we really need to be concerned about.
In the worst case, you might be able to VACUUM a table with a concurrent
owner change.

The logic for gathering all relations to process (i.e.,
get_all_vacuum_rels() and get_tables_to_cluster()) would also need to be
adjusted to handle partitioned tables.  Right now, we gather all the
partitions and check privileges on each.  I think this would be easy to
change.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-15 Thread Nathan Bossart
On Thu, Dec 15, 2022 at 09:13:54PM -0800, Nathan Bossart wrote:
> I do wonder whether this is something we really need to be concerned about.
> In the worst case, you might be able to VACUUM a table with a concurrent
> owner change.

I suppose we might want to avoid running the index functions as the new
owner.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Typo macro name on FreeBSD?

2022-12-15 Thread Japin Li


On Fri, 16 Dec 2022 at 12:25, Thomas Munro  wrote:
> On Fri, Dec 16, 2022 at 4:44 PM Japin Li  wrote:
>> Recently, while compiling PostgreSQL on FreeBSD, I found that commit
>> a2a8acd152 introduces a __freebsd__ macro; however, I cannot find this
>> macro on FreeBSD 13. There is only a __FreeBSD__ macro. Is this a typo?
>
> Yeah, that seems to be my fault.  Will fix.  Thanks!

Thanks!

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> I also don't see the need for this mechanism for logical replication,
> and in fact, why do we need to even wait for sending the existing WAL?

Do you mean that logicalrep walsenders do not have to track WalSndCaughtUp
and any pending data in the output buffer?

> I think the reason why we don't need to wait for logical replication
> is that after the restart, we always start sending WAL from the
> location requested by the subscriber, or till the point where the
> publisher knows the confirmed flush location of the subscriber.
> Consider another case where after restart publisher (node-1) wants to
> act as a subscriber for the previous subscriber (node-2). Now, the new
> subscriber (node-1) won't have a way to tell the new publisher
> (node-2) that starts from the location where the node-1 went down as
> WAL locations between publisher and subscriber need not be same.

You mean to say that such a mechanism was made to support switchover,
but logical replication cannot do the same because the new subscriber
cannot request changes from a position that is unknown to it, right?
It seems reasonable to me.

> This brings us to the question of whether users can use logical
> replication for the scenario where they want the old master to follow
> the new master after the restart which we typically do in physical
> replication, if so how?

Maybe to support such a use case, two-way replication is needed
(but that is out of scope for this thread).

> Another related point to consider is: what is the behavior of
> synchronous replication when a shutdown has been performed, both in
> the case of physical and logical replication, especially when the
> time-delayed replication feature is enabled?

In physical replication without any failures, it seems that users can
stop the primary server even if applies are being delayed on the
secondary. This is because sent WAL is immediately flushed on the
secondary and the walreceiver replies with its position. The
transaction has already been committed at that time, and the
transported changes will be applied on the secondary once the delay
has elapsed.
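
For reference, a sketch of the two configurations (recovery_min_apply_delay
is the existing physical-standby GUC; the subscription option is the one
proposed in this thread, so its syntax may still change):

    -- on a physical standby (existing):
    ALTER SYSTEM SET recovery_min_apply_delay = '5min';

    -- on a logical subscriber (proposed by this patch):
    CREATE SUBSCRIPTION sub CONNECTION 'dbname=src host=node1'
        PUBLICATION pub WITH (min_apply_delay = '5min');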

IIUC we can achieve that if logical walsenders do not consider the
remote status while shutting down, but I want to hear other opinions,
and we must confirm this by testing...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Removing unneeded self joins

2022-12-15 Thread Andrey Lepikhov

On 12/6/22 23:46, Andres Freund wrote:

This doesn't pass the main regression tests due to a plan difference.
https://cirrus-ci.com/task/5537518245380096
https://api.cirrus-ci.com/v1/artifact/task/5537518245380096/testrun/build/testrun/regress/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/join.out 
/tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/join.out 2022-12-05 
19:11:52.453920838 +
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out 
2022-12-05 19:15:21.864183651 +
@@ -5806,7 +5806,7 @@
   Nested Loop
 Join Filter: (sj_t3.id = sj_t1.id)
 ->  Nested Loop
- Join Filter: (sj_t3.id = sj_t2.id)
+ Join Filter: (sj_t2.id = sj_t3.id)
   ->  Nested Loop Semi Join
 ->  Nested Loop
   ->  HashAggregate

This change in the test behaviour is induced by commit a5fc4641,
"Avoid making commutatively-duplicate clauses in EquivalenceClasses."
Nothing special, as far as I can see. The attached patch fixes this.
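
For reference, a minimal example of the kind of join the patch removes
(a sketch; it assumes the join key is unique):

    CREATE TABLE t (id int PRIMARY KEY, payload text);
    SET enable_self_join_removal = on;  -- GUC added by this patch
    EXPLAIN (COSTS OFF)
    SELECT t1.payload FROM t t1 JOIN t t2 ON t1.id = t2.id;
    -- with the patch, this is planned as a single scan of t instead of a join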

--
Regards
Andrey Lepikhov
Postgres Professional
From 3e546637561bf4c6d195bc7c95b1e05263e797e2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 5 Oct 2022 16:58:34 +0500
Subject: [PATCH] Remove self-joins.

The Self Join Removal (SJR) feature removes an inner join of a plain table
to itself from a query plan if it can be proved that the join can be
replaced with a scan. The proof is based on the innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation.
3. If the uniqueness of the outer relation is deduced from a baserestrictinfo
clause like 'x=const' on unique field(s), check that the inner relation has
the same clause with the same constant(s).
4. Compare RowMarks of inner and outer relations. They must have the same
strength.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |   16 +
 src/backend/optimizer/path/indxpath.c |   38 +
 src/backend/optimizer/plan/analyzejoins.c | 1046 -
 src/backend/optimizer/plan/planmain.c |5 +
 src/backend/utils/misc/guc_tables.c   |   22 +
 src/include/optimizer/paths.h |3 +
 src/include/optimizer/planmain.h  |7 +
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  774 +++
 src/test/regress/expected/sysviews.out|3 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  340 +++
 src/tools/pgindent/typedefs.list  |2 +
 13 files changed, 2278 insertions(+), 26 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e4145979d..2f9948d5f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5311,6 +5311,22 @@ ANY num_sync ( 
+  enable_self_join_removal (boolean)
+  
+   enable_self_join_removal configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's optimization which analyses
+query tree and replaces self joins with semantically equivalent single
+scans. Take into consideration only plain tables.
+The default is on.
+   
+  
+ 
+
  
   enable_seqscan (boolean)
   
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 914bfd90bc..8d57c68b1f 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3494,6 +3494,21 @@ bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 			  List *restrictlist,
 			  List *exprlist, List *oprlist)
+{
+	return relation_has_unique_index_ext(root, rel, restrictlist,
+		 exprlist, oprlist, NULL);
+}
+
+/*
+ * relation_has_unique_index_ext
+ * if extra_clauses isn't NULL, return baserestrictinfo clauses which were
+ * used to derive uniqueness.
+ */
+bool
+relation_has_unique_index_ext(PlannerInfo *root, RelOptInfo *rel,
+			  List *restrictlist,
+			  List *exprlist, List *oprlist,
+			  List **extra_clauses)
 {
 	ListCell   *ic;
 
@@ -3549,6 +3564,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List	   *exprs = NIL;
 
 		/*
 		 * If the index is not unique, or not immediately enforced, or if it

Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-15 Thread Masahiko Sawada
On Thu, Dec 15, 2022 at 12:28 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, December 14, 2022 2:49 PM Amit Kapila  
> wrote:
>
> >
> > On Wed, Dec 14, 2022 at 9:50 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada
> >  wrote:
> > > >
> > > > Here are comments on v59 0001, 0002 patches:
> > >
> > > Thanks for the comments!
> > >
> > > > +void
> > > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > > > +{
> > > > +    while (1)
> > > > +    {
> > > > +        SpinLockAcquire(&wshared->mutex);
> > > > +
> > > > +        /*
> > > > +         * Don't try to increment the count if the parallel apply worker is
> > > > +         * taking the stream lock. Otherwise, there would be a race condition
> > > > +         * that the parallel apply worker checks there is no pending streaming
> > > > +         * block and before it actually starts waiting on a lock, the leader
> > > > +         * sends another streaming block and takes the stream lock again. In
> > > > +         * this case, the parallel apply worker will start waiting for the next
> > > > +         * streaming block whereas there is actually a pending streaming block
> > > > +         * available.
> > > > +         */
> > > > +        if (!wshared->pa_wait_for_stream)
> > > > +        {
> > > > +            wshared->pending_stream_count++;
> > > > +            SpinLockRelease(&wshared->mutex);
> > > > +            break;
> > > > +        }
> > > > +
> > > > +        SpinLockRelease(&wshared->mutex);
> > > > +    }
> > > > +}
> > > >
> > > > I think we should add an assertion to check if we don't hold the stream 
> > > > lock.
> > > >
> > > > I think that waiting for pa_wait_for_stream to be false in a busy
> > > > loop is not a good idea. It's not interruptible and there is no
> > > > guarantee that we can break from this loop in a short time. For
> > > > instance, if PA executes
> > > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > > pa_increment_stream_block(), LA has to wait for PA to acquire and
> > > > release the stream lock in a busy loop. It should not be long in
> > > > normal cases but the duration LA needs to wait for PA depends on PA,
> > > > which could be long. Also, what if PA raises an error in
> > > > pa_lock_stream() for some reason? I think LA won't be able to
> > > > detect the failure.
> > > >
> > > > I think we should at least make it interruptible and maybe need to
> > > > add some sleep. Or perhaps we can use the condition variable for this 
> > > > case.
> > >
> >
> > Or we can leave this while (true) logic altogether for the first version 
> > and have a
> > comment to explain this race. Anyway, after restarting, it will probably be
> > solved. We can always change this part of the code later if this really 
> > turns out
> > to be problematic.
>
> Agreed, and reverted this part.
>
> >
> > > Thanks for the analysis, I will research this part.
> > >
> > > > ---
> > > > In worker.c, we have the following common pattern:
> > > >
> > > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > > > write change to the file;
> > > > do some work;
> > > > break;
> > > >
> > > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > > > pa_send_data();
> > > >
> > > > if (winfo->serialize_changes)
> > > > {
> > > > do some work required after writing changes to the file.
> > > > }
> > > > :
> > > > break;
> > > >
> > > > IIUC there are two different paths for partial serialization: (a)
> > > > where apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > > > apply_action is TRANS_LEADER_SEND_TO_PARALLEL and
> > > > winfo->serialize_changes became true. And we need to match what we do
> > > > in (a) and (b). Rather than having two different paths for the same
> > > > case, how about falling through TRANS_LEADER_PARTIAL_SERIALIZE when
> > > > we could not send the changes? That is, pa_send_data() just returns
> > > > false when the timeout exceeds and we need to switch to serialize
> > > > changes, otherwise returns true. If it returns false, we prepare for
> > > > switching to serialize changes such as initializing fileset, and
> > > > fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code would be
> > like:
> > > >
> > > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > > > ret = pa_send_data();
> > > >
> > > > if (ret)
> > > > {
> > > > do work for sending changes to PA.
> > > > break;
> > > > }
> > > >
> > > > /* prepare for switching to serialize changes */
> > > > winfo->serialize_changes = true;
> > > > initialize fileset;
> > > > acquire stream lock if necessary;
> > > >
> > > > /* FALLTHROUGH */

Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Masahiko Sawada
On Fri, Dec 16, 2022 at 12:49 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > I don't think we need additional PG_TRY() for that since exec_stmts()
> > is already called in PG_TRY() if there is an exception block. I meant
> > to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e. only when
> > an error is caught by the exception block). Currently, if an error is
> > caught, we call stmt_begin() and stmt_end() for statements executed
> > inside the exception block but call only stmt_begin() for the
> > statement that raised an error.
>
> I fail to see anything wrong with that.  We never completed execution
> of the statement that raised an error, but calling stmt_end for it
> would imply that we did.

Thank you for the comment. Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread John Naylor
On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan  wrote:
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

Thanks for this. This is the kind of concrete, data-based evidence that I
find much more convincing, or at least easy to reason about. I'd actually
recommend opening future discussions with this kind of analysis --
even before coding, it's possible to indicate what a design is *intended*
to achieve. And reviewers can likewise bring up cases of their own in a
concrete fashion.

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:

> At the very least, a given VACUUM operation has to choose its freezing
> strategy based on how it expects the table will look when it's done
> vacuuming the table, and how that will impact the next VACUUM against
> the same table. Without that, then vacuuming an append-only table will
> fall into a pattern of setting pages all-visible in one vacuum, and
> then freezing those same pages all-frozen in the very next vacuum
> because there are too many. Which makes little sense; we're far better
> off freezing the pages at the earliest opportunity instead.

That makes sense, but I wonder if we can actually be more specific: One
motivating example mentioned is the append-only table. If we detected that
case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist,
we could use that information as one way to drive eager freezing
independently of size. At least in theory -- it's very possible size will
be a necessary part of the decision, but it's less clear that it's as
useful as a user-tunable knob.

If we then ignored the append-only case when evaluating a freezing policy,
maybe other ideas will fall out. I don't have a well-thought-out idea about
policy or knobs, but it's worth thinking about.

Aside from that, I've only given the patches a brief reading. Having seen
the VM snapshot in practice (under "Scanned pages, visibility map snapshot"
in the wiki page), it's neat to see fewer pages being scanned. Prefetching
not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
asynchronous IO. Keeping only one VM snapshot page in memory makes perfect
sense.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding
"skipping strategy". That's phrased as a kind of negative -- what we're
*not* doing. Many times I had to pause and compute in my head what we're
*doing*, i.e. the "scanning strategy". For example, I wonder if the VM
strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness.

--
John Naylor
EDB: http://www.enterprisedb.com