Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-03-27 Thread Michael Paquier
On Mon, Mar 27, 2023 at 11:34:35PM -0500, Justin Pryzby wrote:
> I realized that one difference with tablespaces is that, as written,
> partitioned tables will *always* have an AM specified, and partitions
> will never use default_table_access_method.  Is that what's intended?
>
> Or do we need logic similar to tablespaces, such that the relam of a
> partitioned table is set only if it differs from default_table_am?

Hmm.  This is a good point.  It is true that the patch feels
incomplete on this side.  I don't see why we could not be flexible,
and allow a value of 0 in a partitioned table's relam to mean that we
pick up the database default when a partition is created on it.  This
would have the advantage of being consistent with older versions,
where we fall back to the default.  Unfortunately, we cannot be
completely consistent with the reltablespace handling of the leaf
partitions, as relam should always be set if a relation has storage.
And allowing a value of 0 likely means other tricky cases with dumps?
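
To make that concrete, here is a minimal sketch of the fallback semantics
being proposed (my_am is an assumed, purely illustrative access method):

  SET default_table_access_method = heap;
  CREATE TABLE parts (i int) PARTITION BY RANGE (i);   -- relam stored as 0
  SET default_table_access_method = my_am;             -- hypothetical AM
  CREATE TABLE parts_1 PARTITION OF parts
    FOR VALUES FROM (1) TO (10);                       -- picks up my_am

With the patch as written today, parts_1 would instead inherit heap from
its parent, regardless of default_table_access_method.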

Another thing: would it make sense to allow an empty string in
default_table_access_method so that we'd always fall back to the
database default?
--
Michael




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-03-27 Thread Drouvot, Bertrand

Hi,

On 3/28/23 7:23 AM, Michael Paquier wrote:

On Tue, Mar 28, 2023 at 12:36:15PM +0900, Kyotaro Horiguchi wrote:

I found that commit ddfc2d9a37 removed the descriptions for
pg_stat_get_blocks_fetched and pg_stat_get_blocks_hit. Right before
that commit, monitoring.sgml had these lines:

- pg_stat_get_blocks_fetched minus
- pg_stat_get_blocks_hit gives the number of kernel
- read() calls issued for the table, index, or
- database; the number of actual physical reads is usually
- lower due to kernel-level buffering.  The *_blks_read
- statistics columns use this subtraction, i.e., fetched minus hit.

The commit then added the following sentence to the description for
pg_statio_all_tables.heap_blks_read.

Later, in 5f2b089387 it was revised as:
+ Number of disk blocks read in this database


Yeah, maybe adding something like that at the bottom of the table for
stat functions, telling that the difference is the number of read()
calls, may help.  Perhaps also adding a mention that these are used in
none of the existing system views..


The confusion stems from the inconsistency between the views and
underlying functions related to block reads and hits. If we add
descriptions for the two functions, we should also explain their
relationship. 


I agree that adding more explanation would help and avoid confusion.

What about something like this?

For pg_stat_get_xact_blocks_fetched(): "block read requests for table or
index, in the current transaction. This number minus
pg_stat_get_xact_blocks_hit() gives the number of kernel read() calls."

For pg_stat_get_xact_blocks_hit(): "block read requests for table or index,
in the current transaction, found in cache (not triggering kernel read()
calls)".
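
A quick illustration of the relationship those descriptions are trying to
convey (my_table is just a placeholder; run inside a transaction that has
already touched it):

  SELECT pg_stat_get_xact_blocks_fetched('my_table'::regclass)
       - pg_stat_get_xact_blocks_hit('my_table'::regclass) AS xact_kernel_read_calls;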

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Support logical replication of DDLs

2023-03-27 Thread Amit Kapila
On Mon, Mar 27, 2023 at 5:37 PM Amit Kapila  wrote:
>
> On Mon, Mar 27, 2023 at 12:07 PM Amit Kapila  wrote:
> >
> > On Mon, Mar 27, 2023 at 2:52 AM Tom Lane  wrote:
> > >
> >
> > > I suggest taking a couple of steps back from the minutiae of the
> > > patch, and spending some hard effort thinking about how the thing
> > > would be controlled in a useful fashion (that is, a real design for
> > > the filtering that was mentioned at the very outset), and about the
> > > security issues, and about how we could get to a committable patch.
> > >
> >
> > Agreed. I'll try to summarize the discussion we have till now on this
> > and share my thoughts on the same in a separate email.
> >
>
> The idea to control what could be replicated is to introduce a new
> publication option 'ddl' along with current options 'publish' and
> 'publish_via_partition_root'. The values of this new option could be
> 'table', 'function', 'all', etc. Here 'all' enables the replication of
> all supported DDL commands. Example usage for this would be:
> Example:
> Create a new publication with all ddl replication enabled:
>   CREATE PUBLICATION pub1 FOR ALL TABLES with (ddl = 'all');
>
> Enable table ddl replication for an existing Publication:
>   ALTER PUBLICATION pub2 SET (ddl = 'table');
>
> This is what seems to have been discussed but I think we can even
> extend it to support based on operations/commands, say one would like
> to publish only 'create' and 'drop' of tables. Then we can extend the
> existing publish option to have values like 'create', 'alter', and
> 'drop'.
>

The other idea could be that, for the new option ddl, we input
command tags such that replication will happen only for those commands.
For example, ALTER PUBLICATION pub2 SET (ddl = 'Create Table, Alter
Table, ..'); This would obviate the need to have additional values like
'create', 'alter', and 'drop' for the publish option.

The other thought related to filtering is that one might want to
filter DDLs and/or DMLs performed by specific roles in the future. We
would then need to introduce another option, say ddl_role, or something
like that; a sketch of what that might look like is below.
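
Purely as a hypothetical sketch (neither the ddl_role option nor this
grammar exists today):

  ALTER PUBLICATION pub2 SET (ddl = 'table', ddl_role = 'app_owner');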

Can we think of some other kind of filter for DDL replication?

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: running logical replication as the subscription owner

2023-03-27 Thread Jeff Davis
On Mon, 2023-03-27 at 16:47 -0400, Robert Haas wrote:
> No, not really. It's pretty common for a lot of things to be in the
> public schema, and the public schema is likely to be in the search
> path of every user involved.

Consider this trigger function which uses an unqualified reference to
pfunc(), therefore implicitly depending on the public schema:

  CREATE FUNCTION public.pfunc() RETURNS INT4 LANGUAGE plpgsql AS
$$ BEGIN RETURN 42; END; $$;
  CREATE FUNCTION foo.tfunc() RETURNS TRIGGER LANGUAGE plpgsql AS
$$ BEGIN
   NEW.i = pfunc();
   RETURN NEW;
   END; $$;
  CREATE TABLE foo.a(i INT);
  CREATE TRIGGER a_trigger BEFORE INSERT ON foo.a
FOR EACH ROW EXECUTE PROCEDURE foo.tfunc();
  ALTER TABLE foo.a ENABLE ALWAYS TRIGGER a_trigger;

But that trigger function is broken today for logical replication --
pfunc() isn't found by the subscription process, which has an empty
search_path.
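
For what it's worth, the usual workarounds would be to schema-qualify the
call as public.pfunc() inside the function body, or to pin a search_path on
the function, e.g.:

  ALTER FUNCTION foo.tfunc() SET search_path = public;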

> Let's back up a minute here. Suppose someone were to request a new
> command, ALTER TABLE  DO NOT LET THE SUPERUSER ACCESS THIS. We
> would reject that proposal. The reason we would reject it is because
> it wouldn't actually work as documented. We know that the superuser
> has the power to access that account and reverse that command, either
> by SET ROLE or by changing the account password or by changing
> pg_hba.conf or by shelling out to the filesystem and doing whatever.
> The feature purports to do something that it actually cannot do. No
> one can defend themselves against the superuser. Not even another
> superuser can defend themselves against a superuser. Pretending
> otherwise would be confusing and have no security benefits.

Good framing. With you so far...

> Now let's think about this case. Can a table owner defend themselves
> against a subscription owner who wants to usurp their privileges? If
> they cannot, then I think that what the patch does is correct for the
> same reason that I think we would correctly reject the hypothetical
> command proposed above.

Agreed...

[ Aside: A user foo who accesses a table owned by bar has no means to
defend themselves against SECURITY INVOKER code that usurps their
privileges. Applying your logic above, foo should have to grant SET
ROLE privileges to bar before accessing the table. ]

> If the table owner has no executable code at all attached to the table --
> not just triggers, but also expression indexes and default expressions
> and so forth -- then they can.

Right...

>  If they do have those things then in
> theory they might be able to protect themselves, but in practice they
> are unlikely to be careful enough. I judge that practically every
> installation where table owners use triggers would be easily
> compromised. Only the most security-conscious of table owners are
> going to get this right.

This is the interesting part.

Commit 11da97024ab set the subscription process's search_path to empty.
It seems that was done to protect the subscription owner against users
writing into the public schema. But after your apply-as-table-owner
patch, it seems to also offer some protection for table owners against
a malicious subscription owner.

But I must be missing something obvious here -- if the subscription
process has an empty search_path, what can an attacker do to trick it?

Regards,
Jeff Davis





Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-03-27 Thread Michael Paquier
On Tue, Mar 28, 2023 at 12:36:15PM +0900, Kyotaro Horiguchi wrote:
> I found that commit ddfc2d9a37 removed the descriptions for
> pg_stat_get_blocks_fetched and pg_stat_get_blocks_hit. Right before
> that commit, monitoring.sgml had these lines:
> 
> - pg_stat_get_blocks_fetched minus
> - pg_stat_get_blocks_hit gives the number of kernel
> - read() calls issued for the table, index, or
> - database; the number of actual physical reads is usually
> - lower due to kernel-level buffering.  The *_blks_read
> - statistics columns use this subtraction, i.e., fetched minus hit.
> 
> The commit then added the following sentence to the description for
> pg_statio_all_tables.heap_blks_read.
>
> Later, in 5f2b089387 it was revised as:
> + Number of disk blocks read in this database

Yeah, maybe adding something like that at the bottom of the table for
stat functions, telling that the difference is the number of read()
calls, may help.  Perhaps also adding a mention that these are used in
none of the existing system views..

> The confusion stems from the inconsistency between the views and
> underlying functions related to block reads and hits. If we add
> descriptions for the two functions, we should also explain their
> relationship.  Otherwise, it might be better to add the functions
> pg_stat_*_blocks_read() instead.

I am not sure that we really need to get down to that as this holds
the same meaning as the current system views showing read as the
difference between fetched and hit.
--
Michael




Re: Generate pg_stat_get_xact*() functions with Macros

2023-03-27 Thread Drouvot, Bertrand




On 3/28/23 12:41 AM, Michael Paquier wrote:

On Mon, Mar 27, 2023 at 07:08:51PM +0900, Michael Paquier wrote:

The patch has one mistake: PG_STAT_GET_XACT_FUNCENTRY_FLOAT8_MS does
not need a backslash on its last line, or it would include the next, empty
line.  This could lead to mistakes (no need to send a new patch just
for that).


Adjusted that, and the rest was fine after a second look, so applied.
It looks like we are done for now with this thread, so I have marked
it as committed in the CF app.


Thanks for having corrected the mistake and applied the patch!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PGdoc: add missing ID attribute to create_subscription.sgml

2023-03-27 Thread Peter Smith
Thanks for this patch.

v5-0001 looks good to me.

v5-0002 looks good to me.

I've marked the CF entry [1] as "ready for committer".

--
[1] https://commitfest.postgresql.org/43/4256/

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Why mark empty pages all visible?

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 9:32 PM Andres Freund  wrote:
> On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> > On Mon, Mar 27, 2023 at 6:48 PM Andres Freund  wrote:
> > > It seems odd that we enter the page into the VM at this point. That means 
> > > that
> > > use of that page will now require a bit more work (including
> > > RelationGetBufferForTuple() pinning it).
> >
> > I think that it's fairly obvious that it's *not* odd at all. If it
> > didn't do this then the pages would have to be scanned by VACUUM.
>
> Yes - just like in the case of new pages.

I'm not saying that the status quo is free of contradictions. Only
that there seem to be contradictions in what you're saying now.

> > You haven't said anything about the leading cause of marking empty
> > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > marking empty pages all-frozen?
>
> It's not obvious that it should - but it's not as clear a case as the
> ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> latter, we know
> a) that we don't have to do any work to be able to advance the horizon
> b) we know that somebody else has the page pinned
>
> What's the point in marking it all-visible at that point? It's quite likely
> from RelationGetBufferForTuple() having extended the relation and then briefly
> needing to release the lock (to acquire the lock on otherBuffer or in
> GetVisibilityMapPins()).

I think that there is significant value in avoiding special cases, on
general principle. If we stopped doing this in
lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
locks are supposed to protect against. Maybe something like that would
make sense, but if so then make that argument, and make it explicitly
represented in the code.

> I don't follow. In the ConditionalLockBufferForCleanup() ->
> lazy_scan_new_or_empty() case we are dealing with a new or empty
> page. Whereas lazy_vacuum_heap_page() deals with a page that definitely has
> dead tuples on it.  How are those two cases comparable?

It doesn't have dead tuples anymore, though.

ISTM that there is an issue here with the definition of an empty page.
You're concerned about PageIsEmpty() pages. Which actually aren't
quite the same thing as an empty page left behind by
lazy_vacuum_heap_page(). It's just that this distinction isn't quite
acknowledged anywhere, and probably didn't exist at all at some point.
Maybe that should be addressed.

> > > It seems pretty clear that we shouldn't enter a currently-in-use page 
> > > into the
> > > VM or freespacemap. All that's going to do is to "disturb" the backend 
> > > trying
> > > to use that page (by directing other backends to it) and to make its job 
> > > more
> > > expensive.
> >
> > I don't think that it's clear. What about the case where there is only
> > one tuple, on a page that we cannot cleanup lock? Where do you draw
> > the line?
>
> I don't see how that's comparable? For one, we might need to clean up that
> tuple for vacuum to be able to advance the horizon. And as far as the
> non-cleanup lock path goes, it actually can perform work there. And it doesn't
> even need to acquire an exclusive lock.

So we should put space in the FSM if it has one tuple, but not if it
has zero tuples? Though not if it has zero tuples following processing
by lazy_vacuum_heap_page()?

-- 
Peter Geoghegan




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-03-27 Thread Justin Pryzby
On Tue, Mar 28, 2023 at 09:13:10AM +0900, Michael Paquier wrote:
> On Mon, Mar 20, 2023 at 09:30:50AM +0900, Michael Paquier wrote:
> > Did you check dump and restore flows with partition
> > trees and --no-table-access-method?  Perhaps there should be
> > some regression tests with partitioned tables?
> 
> I was looking at the patch, and as I suspected the dumps generated
> are forgetting to apply the AM to the partitioned tables.

The patch said:

+   else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)

pg_dump was missing a similar change that's conditional on
RELKIND_HAS_TABLE_AM().  It looks like those are the only two places
that need to be specially handled.

I dug up my latest patch from 2021 and incorporated the changes from the
0001 patch here, and added a test case.

I realized that one difference with tablespaces is that, as written,
partitioned tables will *always* have an AM specified, and partitions
will never use default_table_access_method.  Is that what's intended?

Or do we need logic similar to tablespaces, such that the relam of a
partitioned table is set only if it differs from default_table_am?

-- 
Justin
From 2c0c5068cf849ad91de8f5276a430b022d1243b0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Mar 2021 00:11:38 -0600
Subject: [PATCH] Allow specifying access method of partitioned tables..

..to be inherited by partitions

See also:
ca4103025dfe26eaaf6a500dec9170fbb176eebc
8586bf7ed8889f39a59dd99b292014b73be85342

ebfe2dbd6b624e2a428e14b7ee9322cc096f63f7 - prevent DROP AM

Authors: Justin Pryzby, Soumyadeep Chakraborty
---
 doc/src/sgml/ref/create_table.sgml  |  9 ++-
 src/backend/catalog/heap.c  |  3 +-
 src/backend/commands/tablecmds.c| 87 +++--
 src/backend/utils/cache/lsyscache.c | 22 +++
 src/backend/utils/cache/relcache.c  |  4 ++
 src/bin/pg_dump/pg_dump.c   |  3 +-
 src/bin/pg_dump/t/002_pg_dump.pl| 16 -
 src/include/utils/lsyscache.h   |  1 +
 src/test/regress/expected/create_am.out | 39 +++
 src/test/regress/sql/create_am.sql  | 16 +++--
 10 files changed, 158 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 10ef699fab9..b45632619e5 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1295,9 +1295,12 @@ WITH ( MODULUS numeric_literal, REM
   This optional clause specifies the table access method to use to store
   the contents for the new table; the method needs be an access method of
   type TABLE. See  for more
-  information.  If this option is not specified, the default table access
-  method is chosen for the new table. See  for more information.
+  information.  If this option is not specified, a default table access
+  method is chosen for the new table.
+  When creating a partitioned table, the default table access method
+  is the access method of the parent.
+  For other tables, the default is determined by
+  .
  
 

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4f006820b85..146d8b36b09 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1452,7 +1452,8 @@ heap_create_with_catalog(const char *relname,
 		 * No need to add an explicit dependency for the toast table, as the
 		 * main table depends on it.
 		 */
-		if (RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE)
+		if ((RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE) ||
+relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			ObjectAddressSet(referenced, AccessMethodRelationId, accessmtd);
 			add_exact_object_address(&referenced, addrs);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c510a01fd8d..de1fe15206a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -571,6 +571,7 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName,
 	 LOCKMODE lockmode);
 static void ATExecDropCluster(Relation rel, LOCKMODE lockmode);
 static void ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname);
+static void ATExecSetAccessMethodNoStorage(Relation rel, Oid newAccessMethod);
 static bool ATPrepChangePersistence(Relation rel, bool toLogged);
 static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
 const char *tablespacename, LOCKMODE lockmode);
@@ -680,7 +681,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	Oid			ofTypeId;
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
-	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
 
 	/*
@@ -945,20 +945,20 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	 * a type of relation that needs one, use the default.
 	 */
 	if (stmt->accessMethod != NULL)
+		accessMethodId = 

Re: Why mark empty pages all visible?

2023-03-27 Thread Andres Freund
Hi,

On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 6:48 PM Andres Freund  wrote:
> > It seems odd that we enter the page into the VM at this point. That means 
> > that
> > use of that page will now require a bit more work (including
> > RelationGetBufferForTuple() pinning it).
> 
> I think that it's fairly obvious that it's *not* odd at all. If it
> didn't do this then the pages would have to be scanned by VACUUM.

Yes - just like in the case of new pages.


> You haven't said anything about the leading cause of marking empty
> pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> marking empty pages all-frozen?

It's not obvious that it should - but it's not as clear a case as the
ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
latter, we know
a) that we don't have to do any work to be able to advance the horizon
b) we know that somebody else has the page pinned

What's the point in marking it all-visible at that point? It's quite likely
from RelationGetBufferForTuple() having extended the relation and then briefly
needing to release the lock (to acquire the lock on otherBuffer or in
GetVisibilityMapPins()).


> > This got a bit stranger with 44fa84881fff, because now we add the page into
> > the VM even if it currently is pinned:
> 
> > It seems quite odd to set a page to all visible that we could not currently
> > get a cleanup lock on - obviously evidence of another backend trying to do
> > something with the page.
> 
> You can say the same thing about lazy_vacuum_heap_page(), too.
> Including the part about cleanup locking.

I don't follow. In the ConditionalLockBufferForCleanup() ->
lazy_scan_new_or_empty() case we are dealing with a new or empty
page. Whereas lazy_vacuum_heap_page() deals with a page that definitely has
dead tuples on it.  How are those two cases comparable?


> > It seems pretty clear that we shouldn't enter a currently-in-use page into 
> > the
> > VM or freespacemap. All that's going to do is to "disturb" the backend 
> > trying
> > to use that page (by directing other backends to it) and to make its job 
> > more
> > expensive.
> 
> I don't think that it's clear. What about the case where there is only
> one tuple, on a page that we cannot cleanup lock? Where do you draw
> the line?

I don't see how that's comparable? For one, we might need to clean up that
tuple for vacuum to be able to advance the horizon. And as far as the
non-cleanup lock path goes, it actually can perform work there. And it doesn't
even need to acquire an exclusive lock.

Greetings,

Andres Freund




RE: PGdoc: add missing ID attribute to create_subscription.sgml

2023-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for prompt reply!

> Hmm, my above-suggested wording was “publish_via_partition_root
> parameter “ but it seems you (accidentally?) omitted the word
> “parameter”.

It was my carelessness, sorry for the inconvenience. PSA new ones.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v5-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch
Description:  v5-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch


v5-0002-Add-XML-ID-attributes-to-create_publication.sgml.patch
Description:  v5-0002-Add-XML-ID-attributes-to-create_publication.sgml.patch


Re: allow_in_place_tablespaces vs. pg_basebackup

2023-03-27 Thread Thomas Munro
The commit message explains it pretty well, and it seems to work in
simple testing, and yeah, commit c6f2f016 was not a work of art.
pg_basebackup --format=plain "worked", but your way is better.  I
guess we should add a test of -Fp too, to keep it working?  Here's one
of those.

I know it's not your patch's fault, but I wonder if this might break
something.  We have this strange beast ti->rpath (commit b168c5ef in
2014):

+   /*
+    * Relpath holds the relative path of the tablespace directory
+    * when it's located within PGDATA, or NULL if it's located
+    * elsewhere.
+    */

That's pretty confusing, because relative paths have been banned since
the birth of tablespaces (commit 2467394ee15, 2004):

+   /*
+    * Allowing relative paths seems risky
+    *
+    * this also helps us ensure that location is not empty or whitespace
+    */
+   if (!is_absolute_path(location))
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+                errmsg("tablespace location must be an absolute path")));

The discussion that produced the contradiction:
https://www.postgresql.org/message-id/flat/m2ob3vl3et.fsf%402ndQuadrant.fr

I guess what I'm wondering here is if there is a hazard where we
confuse these outlawed tablespaces that supposedly roam the plains
somewhere in your code that is assuming that "relative implies
in-place".  Or if not now, maybe in future changes.  I wonder if these
"semi-supported-but-don't-tell-anyone" relative symlinks are worthy of
a defensive test (or is it in there somewhere already?).

Originally when trying to implement allow_in_place_tablespaces, I
instead tried allowing limited relative tablespaces, so you could use
LOCATION 'pg_tblspc/my_directory', and then it would still create a
symlink 1234 -> my_directory, which probably would have All Just
Worked™ given the existing ti->rpath stuff, and possibly made the
users that thread was about happy, but I had to give that up because
it didn't work on Windows (no relative symlinks).  Oh well.
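
For anyone following along, a minimal sketch of the in-place form that was
committed instead (a developer-only GUC; regress_inplace is just an
illustrative name):

  SET allow_in_place_tablespaces = on;
  CREATE TABLESPACE regress_inplace LOCATION '';  -- creates a directory directly under pg_tblspc/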
From 4d2a8fac9c5ecbbbee5c55c5da997a7e6cb03952 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Thu, 9 Mar 2023 16:00:02 -0500
Subject: [PATCH 1/2] Fix pg_basebackup with in-place tablespaces some more.

Commit c6f2f01611d4f2c412e92eb7893f76fa590818e8 purported to make
this work, but problems remained. In a plain-format backup, the
files from an in-place tablespace got included in the tar file for
the main tablespace, which is wrong but it's not clear that it
has any user-visible consequences. In a tar-format backup, the
TABLESPACE_MAP option is used, and so we never iterated over
pg_tblspc and thus never backed up the in-place tablespaces
anywhere at all.

To fix this, reverse the changes in that commit, so that when we scan
pg_tblspc during a backup, we create tablespaceinfo objects even for
in-place tablespaces. We set the field that would normally contain the
absolute pathname to the relative path pg_tblspc/${TSOID}, and that's
good enough to make basebackup.c happy without any further changes.

However, pg_basebackup needs a couple of adjustments to make it work.
First, it needs to understand that a relative path for a tablespace
means it's an in-place tablespace.  Second, it needs to tolerate the
situation where restoring the main tablespace tries to create
pg_tblspc or a subdirectory and finds that it already exists, because
we restore user-defined tablespaces before the main tablespace.
---
 src/backend/access/transam/xlog.c | 110 ++
 src/bin/pg_basebackup/bbstreamer_file.c   |  53 ++---
 src/bin/pg_basebackup/pg_basebackup.c |  22 +++-
 .../t/011_in_place_tablespace.pl  |  40 +++
 4 files changed, 159 insertions(+), 66 deletions(-)
 create mode 100644 src/bin/pg_basebackup/t/011_in_place_tablespace.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..89529ac158 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8431,9 +8431,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 			char		fullpath[MAXPGPATH + 10];
 			char		linkpath[MAXPGPATH];
 			char	   *relpath = NULL;
-			int			rllen;
-			StringInfoData escapedpath;
 			char	   *s;
+			PGFileType	de_type;
 
 			/* Skip anything that doesn't look like a tablespace */
 			if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
@@ -8441,66 +8440,83 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 
 			snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
 
-			/*
-			 * Skip anything that isn't a symlink/junction.  For testing only,
-			 * we sometimes use allow_in_place_tablespaces to create
-			 * directories directly under 

RE: Support logical replication of DDLs

2023-03-27 Thread houzj.f...@fujitsu.com
On Monday, March 27, 2023 8:08 PM Amit Kapila  wrote:
> On Mon, Mar 27, 2023 at 12:07 PM Amit Kapila 
> wrote:
> >
> > On Mon, Mar 27, 2023 at 2:52 AM Tom Lane  wrote:
> > >
> >
> > > I suggest taking a couple of steps back from the minutiae of the
> > > patch, and spending some hard effort thinking about how the thing
> > > would be controlled in a useful fashion (that is, a real design for
> > > the filtering that was mentioned at the very outset), and about the
> > > security issues, and about how we could get to a committable patch.
> > >
> >
> > Agreed. I'll try to summarize the discussion we have till now on this
> > and share my thoughts on the same in a separate email.
> >
> 
> The idea to control what could be replicated is to introduce a new publication
> option 'ddl' along with current options 'publish' and
> 'publish_via_partition_root'. The values of this new option could be 'table',
> 'function', 'all', etc. Here 'all' enables the replication of all supported 
> DDL
> commands. Example usage for this would be:
> Example:
> Create a new publication with all ddl replication enabled:
>   CREATE PUBLICATION pub1 FOR ALL TABLES with (ddl = 'all');
> 
> Enable table ddl replication for an existing Publication:
>   ALTER PUBLICATION pub2 SET (ddl = 'table');
> 
> This is what seems to have been discussed but I think we can even extend it to
> support based on operations/commands, say one would like to publish only
> 'create' and 'drop' of tables. Then we can extend the existing publish option 
> to
> have values like 'create', 'alter', and 'drop'.
> 
> Another thing we are considering related to this is at what level these
> additional options should be specified. We have three variants FOR TABLE, FOR
> ALL TABLES, and FOR TABLES IN SCHEMA that enables replication. Now, for the
> sake of simplicity, this new option is discussed to be provided only with FOR
> ALL TABLES variant but I think we can provide it with other variants with some
> additional restrictions like with FOR TABLE, we can only specify 'alter' and
> 'drop' for publish option. Now, though possible, it brings additional
> complexity to support it with variants other than FOR ALL TABLES because then
> we need to ensure additional filtering and possible modification of the 
> content
> we have to send to downstream. So, we can even decide to first support it only
> FOR ALL TABLES variant.
> 
> The other point to consider for publish option 'ddl = table' is whether we 
> need
> to allow replicating dependent objects like say some user-defined type is used
> in the table. I guess the difficulty here would be to identify which 
> dependents
> we want to allow.
> 
> I think in the first version we should allow to replicate only some of the 
> objects
> instead of everything. For example, can we consider only allowing tables and
> indexes in the first version? Then extend it in a phased manner?

I think supporting table-related stuff in the first version makes sense, and the
patch size could be reduced to a suitable size. I also checked other DBs' designs
for reference; IBM DB2's DDL replication functionality[1] is similar to what is
proposed here (e.g. it only replicates table-related DDL: TABLE/INDEX/KEY ...). We
can extend it to support other non-table objects in a following patch set.

[1] 
https://www.ibm.com/docs/en/idr/11.4.0?topic=dr-how-q-capture-handles-ddl-operations-source-database

Best Regards,
Hou zj


Re: PGdoc: add missing ID attribute to create_subscription.sgml

2023-03-27 Thread Peter Smith
On Tue, Mar 28, 2023 at 2:04 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Peter,
>
> Thank you for reviewing. PSA new version.
>

v4-0001 LGTM

>
> > 
> > v3-0002
> > 
> >
>
> > 2.2 doc/src/sgml/logical-replication.sgml (31.6. Restrictions)
> >
> > Publications can also specify that changes are to be replicated using
> > the identity and schema of the partitioned root table instead of that
> > of the individual leaf partitions in which the changes actually
> > originate (see CREATE PUBLICATION).
> >
> > ~
> >
> > Maybe that text can be changed now to say something like "(see
> > publish_via_partition_root parameter of CREATE PUBLICATION)” -- so
> > only the parameter part has the link, not the CREATE PUBLICATION part.
>
> Seems better, added.
>

- originate (see CREATE
PUBLICATION).
+ originate (see publish_via_partition_root
+ of CREATE PUBLICATION).

Hmm, my above-suggested wording was “publish_via_partition_root
parameter “ but it seems you (accidentally?) omitted the word
“parameter”.

Otherwise, the patch v4-0002 also LGTM

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Michael Paquier
On Mon, Mar 27, 2023 at 06:07:09PM -0700, Peter Geoghegan wrote:
> I see a bug on HEAD, following yesterday's commit 0276ae42dd.
> 
> GetWALRecordInfo() will now output the value of the fpi_len variable
> before it has actually been set by our call to XLogRecGetBlockRefInfo(). So it'll always
> be 0.

Indeed, good catch.  It looks like I was not careful enough with the
block controlled by XLogRecHasAnyBlockRefs().
--
Michael




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-03-27 Thread Kyotaro Horiguchi
At Tue, 28 Mar 2023 07:38:25 +0900, Michael Paquier  wrote 
in 
> On Mon, Mar 27, 2023 at 10:24:46AM -0400, Melanie Plageman wrote:
> > I do like/prefer "block read requests" and
> > "blocks requested found in cache"
> > Though, now I fear my initial complaint may have been a bit pedantic.
> 
> That's fine.  Let's ask for extra opinions, then.
> 
> So, have others an opinion to share here?

I do not have a strong preference for the wording, but consistency is
important.  IMHO simply swapping out a few words won't really improve
things.

I found that commit ddfc2d9a37 removed the descriptions for
pg_stat_get_blocks_fetched and pg_stat_get_blocks_hit. Right before
that commit, monitoring.sgml had these lines:

- pg_stat_get_blocks_fetched minus
- pg_stat_get_blocks_hit gives the number of kernel
- read() calls issued for the table, index, or
- database; the number of actual physical reads is usually
- lower due to kernel-level buffering.  The *_blks_read
- statistics columns use this subtraction, i.e., fetched minus hit.

The commit then added the following sentence to the description for
pg_statio_all_tables.heap_blks_read.

> Number of disk blocks read from this table.
> This value can also be returned by directly calling
> the pg_stat_get_blocks_fetched and
> pg_stat_get_blocks_hit functions and
> subtracting the results.

Later, in 5f2b089387 it was revised as:
+ Number of disk blocks read in this database

This revision lost the explanation regarding the relationship among
fetch, hit and read, as it became hidden in the views' definitions.
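
For reference, the views now bury the subtraction in their definitions;
pg_statio_all_tables computes heap_blks_read roughly like this (my_table is
just a placeholder):

  SELECT pg_stat_get_blocks_fetched(c.oid) - pg_stat_get_blocks_hit(c.oid)
         AS heap_blks_read
  FROM pg_class c
  WHERE c.relname = 'my_table';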

As a result, in the current state, it doesn't make sense to just add
a description for pg_stat_get_xact_blocks_fetched().

The confusion stems from the inconsistency between the views and
underlying functions related to block reads and hits. If we add
descriptions for the two functions, we should also explain their
relationship.  Otherwise, it might be better to add the functions
pg_stat_*_blocks_read() instead.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SQL/JSON revisited

2023-03-27 Thread Amit Langote
On Tue, Mar 28, 2023 at 6:18 AM Justin Pryzby  wrote:
> I ran sqlsmith on this patch for a short while, and reduced one of its
> appalling queries to this:
>
> postgres=# SELECT jsonb_object_agg_unique_strict('', null::xid8);
> ERROR:  unexpected jsonb type as object key

I think this may have to do with the following changes to
uniqueifyJsonbObject() that the patch makes:

@@ -1936,7 +1942,7 @@ lengthCompareJsonbPair(const void *a, const void *b, void *binequal)
  * Sort and unique-ify pairs in JsonbValue object
  */
 static void
-uniqueifyJsonbObject(JsonbValue *object)
+uniqueifyJsonbObject(JsonbValue *object, bool unique_keys, bool skip_nulls)
 {
    bool        hasNonUniq = false;

@@ -1946,15 +1952,32 @@ uniqueifyJsonbObject(JsonbValue *object)
    qsort_arg(object->val.object.pairs, object->val.object.nPairs,
              sizeof(JsonbPair),
              lengthCompareJsonbPair, &hasNonUniq);

-   if (hasNonUniq)
+   if (hasNonUniq && unique_keys)
+   ereport(ERROR,
+   errcode(ERRCODE_DUPLICATE_JSON_OBJECT_KEY_VALUE),
+   errmsg("duplicate JSON object key value"));
+
+   if (hasNonUniq || skip_nulls)
{
-   JsonbPair  *ptr = object->val.object.pairs + 1,
-  *res = object->val.object.pairs;
+   JsonbPair  *ptr,
+  *res;
+
+   while (skip_nulls && object->val.object.nPairs > 0 &&
+  object->val.object.pairs->value.type == jbvNull)
+   {
+   /* If skip_nulls is true, remove leading items with null */
+   object->val.object.pairs++;
+   object->val.object.nPairs--;
+   }
+
+   ptr = object->val.object.pairs + 1;
+   res = object->val.object.pairs;

The code below the while loop does not take into account the
possibility that object->val.object.pairs would be pointing to garbage
when object->val.object.nPairs is 0.

Attached delta patch that applies on top of Alvaro's v12-0001 fixes
the case for me:

postgres=# SELECT jsonb_object_agg_unique_strict('', null::xid8);
 jsonb_object_agg_unique_strict

 {}
(1 row)

postgres=# SELECT jsonb_object_agg_unique_strict('1', null::xid8);
 jsonb_object_agg_unique_strict

 {}
(1 row)

SELECT jsonb_object_agg_unique_strict('1', '1'::xid8);
 jsonb_object_agg_unique_strict

 {"1": "1"}
(1 row)

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v12-0001-delta-uniqueifyJsonbObject-bugfix.patch
Description: Binary data


Re: Why mark empty pages all visible?

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 6:48 PM Andres Freund  wrote:
> It seems odd that we enter the page into the VM at this point. That means that
> use of that page will now require a bit more work (including
> RelationGetBufferForTuple() pinning it).

I think that it's fairly obvious that it's *not* odd at all. If it
didn't do this then the pages would have to be scanned by VACUUM.

You haven't said anything about the leading cause of marking empty
pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
marking empty pages all-frozen?

Actually it isn't technically an empty page according to
PageIsEmpty(), since I wrote PageTruncateLinePointerArray() in a way
that made it leave a heap page with at least a single LP_UNUSED item.
But it'll essentially leave behind an empty page in many cases. The
regression tests mark pages all-frozen in this path quite a bit more
often than any other path according to gcov.

> This got a bit stranger with 44fa84881fff, because now we add the page into
> the VM even if it currently is pinned:

> It seems quite odd to set a page to all visible that we could not currently
> get a cleanup lock on - obviously evidence of another backend trying to do
> something with the page.

You can say the same thing about lazy_vacuum_heap_page(), too.
Including the part about cleanup locking.

> It seems pretty clear that we shouldn't enter a currently-in-use page into the
> VM or freespacemap. All that's going to do is to "disturb" the backend trying
> to use that page (by directing other backends to it) and to make its job more
> expensive.

I don't think that it's clear. What about the case where there is only
one tuple, on a page that we cannot cleanup lock? Where do you draw
the line?

-- 
Peter Geoghegan




RE: PGdoc: add missing ID attribute to create_subscription.sgml

2023-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing. PSA new version.

> 
> v3-0001
> 
> 
> This patch looks good, but I think there are a couple of other places
> where you could add links:
> 
> ~~~
> 
> 1.1 doc/src/sgml/logical-replication.sgml (31.5 Conflicts)
> 
> "When the streaming mode is parallel, the finish LSN ..."
> 
> Maybe you can add a "streaming" link there.

Added. It could not be detected because this is not tagged as .

> 1.2. doc/src/sgml/logical-replication.sgml (31.5 31.8. Monitoring)
> 
> "Moreover, if the streaming transaction is applied in parallel, there
> may be additional parallel apply workers."
> 
> Maybe you can add a "streaming" link there.

Added.

> 
> v3-0002
> 
> 
> There is one bug, and I think there are a couple of other places where
> you could add links:
> 
> ~~~
> 
> 2.1 doc/src/sgml/logical-replication.sgml (31.4. Column Lists blurb)
> 
> For partitioned tables, the publication parameter
> publish_via_partition_root determines which column list is used.
> 
> ~
> 
> Maybe you can add a "publish_via_partition_root" link there.

Added. I'm not sure why I missed it...

> 2.2 doc/src/sgml/logical-replication.sgml (31.6. Restrictions)
> 
> Publications can also specify that changes are to be replicated using
> the identity and schema of the partitioned root table instead of that
> of the individual leaf partitions in which the changes actually
> originate (see CREATE PUBLICATION).
> 
> ~
> 
> Maybe that text can be changed now to say something like "(see
> publish_via_partition_root parameter of CREATE PUBLICATION)” -- so
> only the parameter part has the link, not the CREATE PUBLICATION part.

Seems better, added.

> 2.3 doc/src/sgml/logical-replication.sgml (31.9. Security)
> 
> +   subscription  linkend="sql-createpublication-for-all-tables">FOR ALL
> TABLES
> +   or  linkend="sql-createpublication-for-tables-in-schema">FOR
> TABLES IN SCHEMAFOR TABLES IN
> SCHEMA
> +   only when superusers trust every user permitted to create a non-temp table
> +   on the publisher or the subscriber.
> 
> There is a cut/paste typo here -- it renders badly with "FOR TABLES IN
> SCHEMA" appearing 2x.
>

That's my fault, fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v4-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch
Description:  v4-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch


v4-0002-Add-XML-ID-attributes-to-create_publication.sgml.patch
Description:  v4-0002-Add-XML-ID-attributes-to-create_publication.sgml.patch


Re: Moving forward with TDE

2023-03-27 Thread Stephen Frost
Greetings,

On Mon, Mar 27, 2023 at 21:35 Bruce Momjian  wrote:

> On Tue, Mar 28, 2023 at 02:03:50AM +0200, Stephen Frost wrote:
> > The remote storage is certainly an independent system. Multi-mount LUNs
> are
> > entirely possible in a SAN (and absolutely with NFS, or just the NFS
> server
> > itself is compromised..), so while the attacker may not have any access
> to the
> > database server itself, they may have access to these other systems, and
> that’s
> > not even considering in-transit attacks which are also absolutely
> possible,
> > especially with iSCSI or NFS.
> >
> > I don’t understand what is being claimed that the remote storage is “not
> an
> > independent system” based on my understanding of, eg, NFS. With NFS, a
> > directory on the NFS server is exported and the client mounts that
> directory as
> > NFS locally, all over a network which may or may not be secured against
> > manipulation.  A user on the NFS server with root access is absolutely
> able to
> > access and modify files on the NFS server trivially, even if they have no
> > access to the PG server.  Would you explain what you mean?
>
> The point is that someone could change values in the storage, pg_xact,
> encryption settings, binaries, that would allow the attacker to learn
> the encryption key.  This is not possible for two secure endpoints and
> someone changing data in transit.  Yeah, it took me a while to
> understand these boundaries too.


This depends on the specific configuration of the systems, clearly. Being
able to change values in other parts of the system isn’t great and we
should work to improve on that, but clearly that isn’t so much of an issue
that people aren’t willing to accept a partial solution or existing
commercial solutions wouldn’t be accepted or considered viable. Indeed,
using GCM is objectively an improvement over what’s being offered commonly
today.

I also generally object to the idea that being able to manipulate the
PGDATA directory necessarily means being able to gain access to the KEK. In
trivial solutions, sure, it’s possible, but the NFS server should never be
asking some external KMS for the key to a given DB server and a reasonable
implementation won’t allow this, and instead would flag and log such an
attempt for someone to review, leading to a much faster realization of a
compromised system.

Certainly it’s much simpler to reason about an attacker with no knowledge
of either system and only network access to see if they can penetrate the
communications between the two end-points, but that is not the only case
where authenticated encryption is useful.

> So the idea is that the backup user can be compromised without the
> data
> > being vulnerable --- makes sense, though that use-case seems narrow.
> >
> > That’s perhaps a fair consideration- but it’s clearly of enough value
> that many
> > of our users are asking for it and not using PG because we don’t have it
> today.
> > Ultimately though, this clearly makes it more than a “checkbox” feature.
> I hope
> > we are able to agree on that now.
>
> It is more than a check box feature, yes, but I am guessing few people
> are wanting this for the actual features beyond the check box.


As I explained previously, perhaps the people asking are doing so for only
the “checkbox”, but that doesn’t mean it isn’t a useful feature or that it
isn’t valuable in its own right. Those checklists were compiled and
enforced for a reason, which the end users might not understand but is
still absolutely valuable.  Sad to say, but frankly this is becoming more
and more common, but we shouldn’t be faulting the users asking for it - if it
were truly useless then eventually it would be removed from the standard,
but it hasn’t and it won’t be because, while not every end user has the depth
of understanding to explain it, it is actually a useful and important
capability to have and one that is important to implement.

> Yes, there is value beyond the check-box, but in most cases those
> > values are limited considering the complexity of the features, and
> the
> > check-box is what most people are asking for, I think.
> >
> > For the users who ask on the lists for this feature, regularly, how many
> don’t
> > ask because they google or find prior responses on the list to the
> question of
> > if we have this capability?  How do we know that their cases are
> “checkbox”?
>
> Because I have rarely heard people articulate the value beyond check
> box.


Have I done so sufficiently then that we can agree that calling it
“checkbox” is inappropriate and detrimental to our user base?

> Consider that there are standards groups which explicitly consider these
> attack
> > vectors and consider them important enough to require mitigations to
> address
> > those vectors. Do the end users of PG understand the attack vectors or
> why they
> > matter?  Perhaps not, but just because they can’t articulate the
> reasoning does
> > NOT mean that the attack vector 

Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Bharath Rupireddy
On Tue, Mar 28, 2023 at 5:29 AM Peter Geoghegan  wrote:
>
> On Mon, Mar 27, 2023 at 12:42 AM Bharath Rupireddy
>  wrote:
> > Thanks. Here's the v6 patch (last patch that I have with me for
> > pg_walinspect) for adding per-record info to pg_get_wal_block_info.
> > Note that I addressed all review comments received so far. Any
> > thoughts?
>
> Looking at this now, with the intention of committing it for 16.
>
> In addition to what I said a little while ago about the forknum
> parameter and parameter ordering, I have a concern about the data
> type: perhaps the forknum paramater should be declared as
> "relforknumber smallint", instead of using text? That would match the
> approach taken by pg_buffercache, and would be more efficient.
>
> I don't think that using a text column with the fork name adds too
> much, since this is after all supposed to be a tool used by experts.
> Plus it's usually pretty clear what it is from context. Not that many
> WAL records touch the visibility map, and those that do make it
> relatively obvious which block is from the VM based on other details.
> Details such as blockid and relblocknumber (the VM is approximately
> 32k times smaller than the heap). Once I see that the record is (say)
> a VISIBLE record, I'm already looking at the order of each block
> reference, and maybe at relblocknumber -- I'm not likely to visually
> scan the forknum column at all.

Hm, agreed. Changed in the attached v7-0002 patch. We could also
write a CASE statement in the function's SQL to output the fork name
instead of the fork number, but I'd refrain from doing that to keep in
sync with pg_buffercache; a sketch of what that would look like is below.
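
For the record, the considered-but-not-adopted mapping would look roughly
like this, assuming the usual fork numbering (0=main, 1=fsm, 2=vm, 3=init);
the LSN arguments are just the example values used elsewhere in this thread:

  SELECT b.*,
         CASE b.relforknumber
              WHEN 0 THEN 'main'
              WHEN 1 THEN 'fsm'
              WHEN 2 THEN 'vm'
              WHEN 3 THEN 'init'
         END AS forkname
  FROM pg_get_wal_block_info('0/10D8280', '0/10D82B8') AS b;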

On Tue, Mar 28, 2023 at 6:37 AM Peter Geoghegan  wrote:
>
> On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan  wrote:
> > Looking at this now, with the intention of committing it for 16.
>
> I see a bug on HEAD, following yesterday's commit 0276ae42dd.
>
> GetWALRecordInfo() will now output the value of the fpi_len variable
> before it has actually been set by our call to XLogRecGetBlockRefInfo(). So it'll always
> be 0.
>
> Can you post a bugfix patch for this, Bharath?

Oh, thanks for finding that. Fixed in the attached v7-0001 patch. I
also removed the "invalid fork number" error, as users can figure that
out themselves if the fork number is ever wrong.

On the ordering of the columns, I kept start_lsn, end_lsn and prev_lsn
first, then the rel* columns (whose order follows pg_buffercache), and
then the block-data-related columns. Michael and Kyotaro are of the
opinion that it's better to keep the LSNs first to be consistent, and
given that this function is WAL related, it makes sense to have the
LSNs first.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From f2fba90775ef2fee6ea6f04dff3a08c2ed7724f1 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 28 Mar 2023 01:46:08 +
Subject: [PATCH v7] Fix fpi_len issue introduced by 0276ae42dd in
 pg_walinspect

0276ae42dd changed the code to emit block references only if any
exist. That change moved the block reference calculation code
around, which left fpi_len always emitted as 0. This commit
fixes that issue.
---
 contrib/pg_walinspect/pg_walinspect.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 2933734122..062e90dbce 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -186,6 +186,7 @@ GetWALRecordInfo(XLogReaderState *record, Datum *values,
 	RmgrData	desc;
 	uint32		fpi_len = 0;
 	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
 	int			i = 0;
 
 	desc = GetRmgr(XLogRecGetRmid(record));
@@ -197,6 +198,12 @@ GetWALRecordInfo(XLogReaderState *record, Datum *values,
	initStringInfo(&rec_desc);
	desc.rm_desc(&rec_desc, record);
 
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
 	values[i++] = LSNGetDatum(record->ReadRecPtr);
 	values[i++] = LSNGetDatum(record->EndRecPtr);
 	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
@@ -205,7 +212,6 @@ GetWALRecordInfo(XLogReaderState *record, Datum *values,
 	values[i++] = CStringGetTextDatum(id);
 	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
 	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
-
 	values[i++] = UInt32GetDatum(fpi_len);
 
 	if (rec_desc.len > 0)
@@ -213,15 +219,8 @@ GetWALRecordInfo(XLogReaderState *record, Datum *values,
 	else
 		nulls[i++] = true;
 
-	/* Block references. */
 	if (XLogRecHasAnyBlockRefs(record))
-	{
-		StringInfoData rec_blk_ref;
-
-		initStringInfo(&rec_blk_ref);
-		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
 		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
-	}
 	else
 		nulls[i++] = true;
 
-- 
2.34.1

From 55126ffd91bae629d1116949467a53ba8ba58f8b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 28 Mar 2023 

Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan  wrote:
> Looking at this now, with the intention of committing it for 16.

Attached revision v7 adjusts the column order. This is still WIP, but
gives a good idea of the direction I'm going in.

v7 makes the column output look like this:

pg@regression:5432 [1209761]=# select * from
pg_get_wal_block_info('0/10D8280', '0/10D82B8');
┌─[ RECORD 1 ]─┬──┐
│ start_lsn│ 0/10D8280│
│ end_lsn  │ 0/10D82B8│
│ prev_lsn │ 0/10D8208│
│ blockid  │ 0│
│ reltablespace│ 1,663│
│ reldatabase  │ 1│
│ relfilenode  │ 2,610│
│ relforknumber│ 0│
│ relblocknumber   │ 0│
│ xid  │ 0│
│ resource_manager │ Heap2│
│ record_type  │ PRUNE│
│ record_length│ 56   │
│ main_data_length │ 8│
│ block_fpi_length │ 0│
│ block_fpi_info   │ ∅│
│ description  │ snapshotConflictHorizon 10 nredirected 0 ndead 1 │
│ block_data   │ \x2b00   │
│ fpi_data │ ∅│
└──┴──┘

Few things to note here:

* blockid is now given more prominence, in that it appears just after
the LSN related output params.

The idea is that the blockid is conceptually closer to the LSN stuff.

* There is now a smallint relforknumber, for consistency with
pg_buffercache. This replaces the previous text column.

As I mentioned earlier on, I don't think that a text based output
param adds much.

* The integer fields record_length, main_data_length, and block_fpi_length
now all appear together. This is for consistency with the similar
output params from the other function.

v7 allows block_fpi_length to be 0 instead of NULL, for consistency
with the fpi_length param from the other function. The intention is to
make it relatively obvious which information "comes from the record"
and which information "comes from the block reference".

* The block_fpi_info output param appears right after block_fpi_length.

This is not very verbose, and I find that it is hard to find by
scrolling horizontally in pspg if it gets placed after either
block_data or fpi_data, which tend to have at least some huge/wide
outputs. It seemed sensible to place block_fpi_info next to the param
I'm now calling block_fpi_length, after it was moved next to the other
"length" fields.

How do people feel about this approach? I'll need to write
documentation to help the user to understand what's really going on
here.

-- 
Peter Geoghegan


v7-0001-Emit-WAL-record-info-via-pg_get_wal_block_info.patch
Description: Binary data


Why mark empty pages all visible?

2023-03-27 Thread Andres Freund
Hi,

visibilitymap.c currently marks empty pages as all visible, including WAL
logging them:

if (PageIsEmpty(page))
...
/*
 * Empty pages are always all-visible and all-frozen (note that
 * the same is currently not true for new pages, see above).
 */
if (!PageIsAllVisible(page))
...
visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                  vmbuffer, InvalidTransactionId,
                  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);


It seems odd that we enter the page into the VM at this point. That means that
use of that page will now require a bit more work (including
RelationGetBufferForTuple() pinning it).

Note that we do *not* do so for new pages:

if (PageIsNew(page))
{
    /*
     * All-zeroes pages can be left over if either a backend extends the
     * relation by a single page, but crashes before the newly initialized
     * page has been written out, or when bulk-extending the relation
     * (which creates a number of empty pages at the tail end of the
     * relation), and then enters them into the FSM.
     *
     * Note we do not enter the page into the visibilitymap. That has the
     * downside that we repeatedly visit this page in subsequent vacuums,
     * but otherwise we'll never discover the space on a promoted standby.
     * The harm of repeated checking ought to normally not be too bad. The
     * space usually should be used at some point, otherwise there
     * wouldn't be any regular vacuums.
...
    return true;
}


The standby.c reasoning seems to hold just as well for empty pages? In fact,
there might very well be many more empty pages than new pages.

Which of course is also the only argument for putting empty pages into
the VM - there could be many of them, so we might not want to rescan them on a
regular basis. But there's actually also no real bound on the number of new
pages, so I'm not sure that argument goes all that far?

The standby argument also doesn't just seem to apply to the standby, but also
to crashes on the primary, as the FSM is not crashsafe...


I traced that through the versions - that behaviour originates in the original
commit adding the visibilitymap (608195a3a365). There's no comments back then
explaining why this behaviour was chosen.


This got a bit stranger with 44fa84881fff, because now we add the page into
the VM even if it currently is pinned:

if (!ConditionalLockBufferForCleanup(buf))
...
/* Check for new or empty pages before lazy_scan_noprune call */
if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
                           vmbuffer))
...


It seems quite odd to set a page to all visible that we could not currently
get a cleanup lock on - obviously evidence of another backend trying to do
something with the page.

The main way to encounter this situation, afaict, is when
RelationGetBufferForTuple() briefly releases the lock on a newly
extended page, to acquire the lock on the source page. The buffer is pinned,
but not locked in that situation.


I started to look into this in the context of
https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de and
https://postgr.es/m/20221029025420.eplyow6k7tgu6he3%40awork3.anarazel.de

which both would ever so slightly extend the window in which we don't hold a
lock on the page (to do a visibilitymap_pin() and RecordPageWithFreeSpace()
respectively).


It seems pretty clear that we shouldn't enter a currently-in-use page into the
VM or freespacemap. All that's going to do is to "disturb" the backend trying
to use that page (by directing other backends to it) and to make its job more
expensive.


It's less clear, but imo worth discussing, whether we should continue to set
empty pages to all-visible.


Greetings,

Andres Freund




Re: Assertion in pgstat_assoc_relation() fails intermittently

2023-03-27 Thread Kyotaro Horiguchi
At Mon, 27 Mar 2023 11:46:08 +0530, Bharath Rupireddy 
 wrote in 
> I recently observed an assertion failure [1] a few times on my dev
> setup during initdb. The code was built with --enable-debug
> --enable-cassert CFLAGS="-ggdb3 -O0". The assertion was gone after I
> did make distclean and built the source code again. It looks like the
> same relation (pg_type [2]) is linked to multiple relcache entries.
> I'm not sure if anyone else has seen this, but thought of reporting it
> here. Note that I'm not seeing this issue any more.

This seems like the same issue with [a] and it was fixed by cb2e7ddfe5
on Dec 2, 2022.

[a] 
https://www.postgresql.org/message-id/CALDaNm2yXz%2BzOtv7y5zBd5WKT8O0Ld3YxikuU3dcyCvxF7gypA%40mail.gmail.com

a> #5  0x5590bf283139 in ExceptionalCondition
a> (conditionName=0x5590bf468170 "rel->pgstat_info->relation == NULL",
a> fileName=0x5590bf46812b "pgstat_relation.c", lineNumber=143) at
a> assert.c:66
a> #6  0x5590bf0ce5f8 in pgstat_assoc_relation (rel=0x7efcce996a48)
a> at pgstat_relation.c:143
a> #7  0x5590beb83046 in initscan (scan=0x5590bfbf4af8, key=0x0,
a> keep_startblock=false) at heapam.c:343
a> #8  0x5590beb8466f in heap_beginscan (relation=0x7efcce996a48,
snapshot=0x5590bfb5a520, nkeys=0, key=0x0, parallel_scan=0x0,
flags=449) at heapam.c:1223


> [1]
> running bootstrap script ... TRAP: failed
> Assert("rel->pgstat_info->relation == NULL"), File:
> "pgstat_relation.c", Line: 143, PID: 837245
> /home/ubuntu/postgres/inst/bin/postgres(ExceptionalCondition+0xbb)[0x55d98ff6abc4]
> /home/ubuntu/postgres/inst/bin/postgres(pgstat_assoc_relation+0xcd)[0x55d98fdb3db7]
> /home/ubuntu/postgres/inst/bin/postgres(+0x1326f5)[0x55d98f8576f5]
> /home/ubuntu/postgres/inst/bin/postgres(heap_beginscan+0x17a)[0x55d98f8586b5]
> /home/ubuntu/postgres/inst/bin/postgres(table_beginscan_catalog+0x6e)[0x55d98f8c4cf3]

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Moving forward with TDE

2023-03-27 Thread Bruce Momjian
On Tue, Mar 28, 2023 at 02:03:50AM +0200, Stephen Frost wrote:
> The remote storage is certainly an independent system. Multi-mount LUNs are
> entirely possible in a SAN (and absolutely with NFS, or just the NFS server
> itself is compromised..), so while the attacker may not have any access to the
> database server itself, they may have access to these other systems, and 
> that’s
> not even considering in-transit attacks which are also absolutely possible,
> especially with iSCSI or NFS. 
> 
> I don’t understand what is being claimed that the remote storage is “not an
> independent system” based on my understanding of, eg, NFS. With NFS, a
> directory on the NFS server is exported and the client mounts that directory 
> as
> NFS locally, all over a network which may or may not be secured against
> manipulation.  A user on the NFS server with root access is absolutely able to
> access and modify files on the NFS server trivially, even if they have no
> access to the PG server.  Would you explain what you mean?

The point is that someone could change values in the storage, pg_xact,
encryption settings, binaries, that would allow the attacker to learn
the encryption key.  This is not possible for two secure endpoints and
someone changing data in transit.  Yeah, it took me a while to
understand these boundaries too.

> So the idea is that the backup user can be compromised without the data
> being vulnerable --- makes sense, though that use-case seems narrow.
> 
> That’s perhaps a fair consideration- but it’s clearly of enough value that many
> of our users are asking for it and not using PG because we don’t have it today.
> Ultimately though, this clearly makes it more than a “checkbox” feature. I hope
> we are able to agree on that now.

It is more than a check box feature, yes, but I am guessing few people
are wanting this for the actual features beyond the check box.

> Yes, there is value beyond the check-box, but in most cases those
> values are limited considering the complexity of the features, and the
> check-box is what most people are asking for, I think.
> 
> For the users who ask on the lists for this feature, regularly, how many don’t
> ask because they google or find prior responses on the list to the question of
> if we have this capability?  How do we know that their cases are “checkbox”? 

Because I have rarely heard people articulate the value beyond check
box.

> Consider that there are standards groups which explicitly consider these attack
> vectors and consider them important enough to require mitigations to address
> those vectors. Do the end users of PG understand the attack vectors or why they
> matter?  Perhaps not, but just because they can’t articulate the reasoning does
> NOT mean that the attack vector doesn’t exist or that their environment is
> somehow immune to it- indeed, as the standards bodies surely know, the opposite
> is true- they’re almost certainly at risk of those attack vectors and therefore
> the standards bodies are absolutely justified in requiring them to provide a
> solution. Treating these users as unimportant because they don’t have the depth
> of understanding that we do or that the standards body does is not helping
> them- it’s actively driving them away from PG.

Well, then who is going to explain them here, because I have not heard
them yet.

> The RLS arguments were that queries could expoose some of the underlying
> data, but in summary, that was considered acceptable.
> 
> This is an excellent point- and dovetails very nicely into my argument that
> protecting primary data (what is provided by users and ends up in indexes and
> heaps) is valuable even if we don’t (yet..) have protection for other parts of
> the system. Reducing the size of the attack vector is absolutely useful,
> especially when it’s such a large amount of the data in the system. Yes, we
> should, and will, continue to improve- as we do with many features, but we
> don’t need to wait for perfection to include this feature, just as with RLS 
> and
> numerous other features we have. 

The issue is that you needed a certain type of user with a certain type
of access to break RLS, while for this, writing to PGDATA is the simple
case for all the breakage, and the thing we are protecting with
authentication.

> >     > We, as a community, are clearly losing value by lack of this capability, if by
> >     > no other measure than simply the numerous users of the commercial
> >     > implementations feeling that they simply can’t use PG without this feature, for
> >     > whatever their reasoning.
> >
> >     That is true, but I go back to my concern over useful feature vs. check
> >     box.
> >
> > While it’s easy to label something as checkbox, I don’t feel we have been fair
> 
> No, actually, it isn't.  I am not sure why you are saying that.
> 
> 

Re: PGdoc: add missing ID attribute to create_subscription.sgml

2023-03-27 Thread Peter Smith
Hi Kuroda-san. Here are my review comments for both patches v3-0001 and v3-0002.



v3-0001


This patch looks good, but I think there are a couple of other places
where you could add links:

~~~

1.1 doc/src/sgml/logical-replication.sgml (31.5 Conflicts)

"When the streaming mode is parallel, the finish LSN ..."

Maybe you can add a "streaming" link there.

~~~

1.2. doc/src/sgml/logical-replication.sgml (31.8. Monitoring)

"Moreover, if the streaming transaction is applied in parallel, there
may be additional parallel apply workers."

Maybe you can add a "streaming" link there.



v3-0002


There is one bug, and I think there are a couple of other places where
you could add links:

~~~

2.1 doc/src/sgml/logical-replication.sgml (31.4. Column Lists blurb)

For partitioned tables, the publication parameter
publish_via_partition_root determines which column list is used.

~

Maybe you can add a "publish_via_partition_root" link there.

~~~

2.2 doc/src/sgml/logical-replication.sgml (31.6. Restrictions)

Publications can also specify that changes are to be replicated using
the identity and schema of the partitioned root table instead of that
of the individual leaf partitions in which the changes actually
originate (see CREATE PUBLICATION).

~

Maybe that text can be changed now to say something like "(see
publish_via_partition_root parameter of CREATE PUBLICATION)” -- so
only the parameter part has the link, not the CREATE PUBLICATION part.

~~~

2.3 doc/src/sgml/logical-replication.sgml (31.9. Security)

+   subscription FOR ALL TABLES
+   or FOR TABLES IN SCHEMAFOR TABLES IN SCHEMA
+   only when superusers trust every user permitted to create a non-temp table
+   on the publisher or the subscriber.

There is a cut/paste typo here -- it renders badly with "FOR TABLES IN
SCHEMA" appearing 2x.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan  wrote:
> Looking at this now, with the intention of committing it for 16.

I see a bug on HEAD, following yesterday's commit 0276ae42dd.

GetWALRecordInfo() will now output the value of the fpi_len variable
before it has actually been set by our call to . So it'll always
be 0.

Can you post a bugfix patch for this, Bharath?

-- 
Peter Geoghegan




Re: Improve logging when using Huge Pages

2023-03-27 Thread Michael Paquier
On Thu, Mar 23, 2023 at 08:50:50PM -0500, Justin Pryzby wrote:
> I'm sure it's possible, but it's also not worth writing a special
> implementation just to handle huge_pages_active, which is better written
> in 30 lines than in 300 lines.
> 
> If we needed to avoid using a GUC, maybe it'd work to add
> huge_pages_active to PGShmemHeader.  But "avoid using gucs at all costs"
> isn't the goal, and using a GUC parallels all the similar, and related
> things without needing to allocate extra bits of shared_memory.

Yeah, I kind of agree that the complications are not appealing
compared to the use case.  FWIW, I am fine with just using "unknown"
even under -C because we don't know the status without the mmap()
call.  And we don't want to assign what would be potentially a bunch
of memory when running that.

> This goes back to using a GUC, and:
> 
>  - removes GUC_RUNTIME_COMPUTED, since that causes a useless error when
>using -C if the server is running, rather than successfully printing
>"unknown".
>  - I also renamed it from huge_pages_active = {true,false,unknown} to
>huge_pages_STATUS = {on,off,unknown}.  This parallels huge_pages,
>which is documented to accept values on/off/try.  Also, true/false
>isn't how bools are displayed.
>  - I realized that the rename suggested implementing it as an enum GUC,
>and re-using the existing HUGE_PAGES_{ON,OFF} values (and adding an
>"UNKNOWN").  Maybe this also avoids Stephen's earlier objection to
>using a string ?

huge_pages_status is OK here, as is reusing the enum struct to avoid
an unnecessary duplication.

> I mis-used cirrusci to check that the GUC does work correctly under
> windows.

You mean that you abused of it in some custom branch running the CI on
github?  If I may ask, what did you do to make sure that huge pages
are set when re-attaching a backend to a shmem area?

> If there's continuing aversions to using a GUC, please say so, and try
> to suggest an alternate implementation you think would be justified.
> It'd need to work under windows with EXEC_BACKEND.

Looking at the patch, it seems like this should work even for
EXEC_BACKEND on *nix when a backend reattaches..  It may be better to
be sure of that, as well, if it has not been tested.

+++ b/src/backend/port/win32_shmem.c
@@ -327,6 +327,10 @@ retry:
Sleep(1000);
continue;
}
+
+   SetConfigOption("huge_pages_status", (flProtect & SEC_LARGE_PAGES) ?
+   "on" : "off", PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT)

Perhaps better to just move that at the end of PGSharedMemoryCreate()
for WIN32?

+ 
+  huge_pages_status (enum)
+  
+   huge_pages_status configuration 
parameter
+  
+  
+  
+   
+Reports the state of huge pages in the current instance.
+See  for more information.
+   
+  
+ 

The patch needs to provide more details about the unknown state, IMO,
to make it crystal-clear to the users what are the limitations of this
GUC, especially regarding the fact that this is useful when "try"-ing
to allocate huge pages, and that the value can only be consulted after
the server has been started.
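
For reference, consuming the GUC would presumably look like this once the
patch is in (a sketch based on the on/off/unknown values discussed here,
not on tested behaviour):

-- Once the server is running, reports what was actually allocated:
SHOW huge_pages_status;
SELECT current_setting('huge_pages_status');
-- "postgres -C huge_pages_status" would report "unknown", as noted above.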
--
Michael


signature.asc
Description: PGP signature


Re: Moving forward with TDE

2023-03-27 Thread Stephen Frost
Greetings,

On Mon, Mar 27, 2023 at 19:19 Bruce Momjian  wrote:

> On Tue, Mar 28, 2023 at 12:57:42AM +0200, Stephen Frost wrote:
> > I consider the operating system and its processes as much more of a
> > single entity than TLS over a network.
> >
> > This may be the case sometimes but there’s absolutely no shortage of other
> > cases and it’s almost more the rule these days, that there is some kind of
> > network between the OS processes and the storage- a SAN, an iSCSI network, NFS,
> > are all quite common.
>
> Yes, but consider that the database cluster is having to get its data
> from that remote storage --- the remote storage is not an independent
> entity that can be corrupted without the databaes server being
> compromised. If everything in PGDATA was GCM-verified, it would be
> secure, but because some parts are not, I don't think it would be.


The remote storage is certainly an independent system. Multi-mount LUNs are
entirely possible in a SAN (and absolutely with NFS, or just the NFS server
itself is compromised..), so while the attacker may not have any access to
the database server itself, they may have access to these other systems,
and that’s not even considering in-transit attacks which are also
absolutely possible, especially with iSCSI or NFS.

I don’t understand what is being claimed that the remote storage is “not an
independent system” based on my understanding of, eg, NFS. With NFS, a
directory on the NFS server is exported and the client mounts that
directory as NFS locally, all over a network which may or may not be
secured against manipulation.  A user on the NFS server with root access is
absolutely able to access and modify files on the NFS server trivially,
even if they have no access to the PG server.  Would you explain what you
mean?

I do agree that the ideal case would be that everything we can encrypt in
the PGDATA directory (not everything can be, for various reasons, but we
don’t actually need to either) is encrypted and authenticated, just like
it would be ideal if everything was checksum’d, which isn’t the case today. We are
progressing in that direction thanks to efforts such as reworking the other
subsystems to used shared buffers and a consistent page format, but just
like with checksums we do not need to have the perfect solution for us to
provide a lot of value here- and our users know that as the same is true of
the unauthenticated encryption approaches being offered by the commercial
solutions.

> > > As specific examples, consider:
> > >
> > > An attack against the database system where the database server is shut down,
> > > or a backup, and  the encryption key isn’t available on the system.
> > >
> > > The backup system itself, not running as the PG user (an option supported by PG
> > > and at least pgbackrest) being compromised, thus allowing for injection of
> > > changes into a backup or into a restore.
> >
> > I then question why we are not adding encryption to pg_basebackup or
> > pgbackrest rather than the database system.
> >
> > Pgbackrest has encryption and authentication of it … but that doesn’t actually
> > address the attack vector that I outlined. If the backup user is compromised
> > then they can change the data before it gets to the storage.  If the backup
> > user is compromised then they have access to whatever key is used to encrypt
> > and authenticate the backup and therefore can trivially manipulate the data.
>
> So the idea is that the backup user can be compromised without the data
> being vulnerable --- makes sense, though that use-case seems narrow.


That’s perhaps a fair consideration- but it’s clearly of enough value that
many of our users are asking for it and not using PG because we don’t have
it today. Ultimately though, this clearly makes it more than a “checkbox”
feature. I hope we are able to agree on that now.

> > What were the _technical_ reasons for those objections?
> >
> > I believe largely the ones I’m bringing up here and which I outline above… I
> > don’t mean to pretend that any of this is of my own independent construction. I
> > don’t believe it is and my apologies if it came across that way.
>
> Yes, there is value beyond the check-box, but in most cases those
> values are limited considering the complexity of the features, and the
> check-box is what most people are asking for, I think.


For the users who ask on the lists for this feature, regularly, how many
don’t ask because they google or find prior responses on the list to the
question of if we have this capability?  How do we know that their cases
are “checkbox”?  Consider that there are standards groups which explicitly
consider these attack vectors and consider them important enough to require
mitigations to address those vectors. Do the end users of PG understand the
attack vectors or why they matter?  Perhaps not, but just because they
can’t articulate the 

Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-03-27 Thread Michael Paquier
On Mon, Mar 20, 2023 at 09:30:50AM +0900, Michael Paquier wrote:
> Did you check dump and restore flows with partition
> trees and --no-table-access-method?  Perhaps there should be
> some regression tests with partitioned tables?

I was looking at the patch, and as I suspected the dumps generated
are forgetting to apply the AM to the partitioned tables.  For
example, assuming that the default AM is heap, the patch creates
child_0_10 with heap2 as AM, which is what we want:
CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id) USING heap2;
CREATE TABLE child_0_10 PARTITION OF parent_tab
  FOR VALUES FROM (0) TO (10);

However a dump produces that (content cut except for its most relevant
parts):
CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
SET default_tablespace = '';
CREATE TABLE public.parent_tab (
id integer
)
PARTITION BY RANGE (id);
SET default_table_access_method = heap2;
CREATE TABLE public.child_0_10 (
id integer
);

This would restore the previous contents incorrectly, where parent_tab
would use heap and child_0_10 would use heap2, causing any partitions
created after the restore to silently use heap.  This is going to
require logic similar to tablespaces, where we generate SET commands
on default_table_access_method so that --no-table-access-method in
pg_dump and pg_restore is able to work correctly.  Having tests to
cover all that is a must-have.
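
To be clear, the kind of output I would expect from pg_dump is something
like the following (a sketch of the tablespace-like approach, abbreviated,
not what the patch currently produces):

CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;

SET default_table_access_method = heap2;
CREATE TABLE public.parent_tab (
    id integer
)
PARTITION BY RANGE (id);

-- default_table_access_method is still heap2 here, so child_0_10 keeps its
-- original AM, and --no-table-access-method only has to skip the SET commands.
CREATE TABLE public.child_0_10 (
    id integer
);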
--
Michael


signature.asc
Description: PGP signature


Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 12:42 AM Bharath Rupireddy
 wrote:
> Thanks. Here's the v6 patch (last patch that I have with me for
> pg_walinspect) for adding per-record info to pg_get_wal_block_info.
> Note that I addressed all review comments received so far. Any
> thoughts?

Looking at this now, with the intention of committing it for 16.

In addition to what I said a little while ago about the forknum
parameter and parameter ordering, I have a concern about the data
type: perhaps the forknum parameter should be declared as
"relforknumber smallint", instead of using text? That would match the
approach taken by pg_buffercache, and would be more efficient.
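
For instance, with a smallint relforknumber, matching block references
against pg_buffercache becomes a plain join (sketch only; this assumes the
block-info function also exposes relfilenode and relblocknumber, and the
LSN range is made up):

SELECT w.relfilenode, w.relforknumber, w.relblocknumber, b.isdirty
FROM pg_get_wal_block_info('0/1E913618', '0/1E913740') w
JOIN pg_buffercache b
  ON b.relfilenode = w.relfilenode
 AND b.relforknumber = w.relforknumber
 AND b.relblocknumber = w.relblocknumber;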

I don't think that using a text column with the fork name adds too
much, since this is after all supposed to be a tool used by experts.
Plus it's usually pretty clear what it is from context. Not that many
WAL records touch the visibility map, and those that do make it
relatively obvious which block is from the VM based on other details.
Details such as blockid and relblocknumber (the VM is approximately
32k times smaller than the heap). Once I see that the record is (say)
a VISIBLE record, I'm already looking at the order of each block
reference, and maybe at relblocknumber -- I'm not likely to visually
scan the forknum column at all.

-- 
Peter Geoghegan




Re: Moving forward with TDE

2023-03-27 Thread Bruce Momjian
On Tue, Mar 28, 2023 at 12:57:42AM +0200, Stephen Frost wrote:
> I consider the operating system and its processes as much more of a
> single entity than TLS over a network.
> 
> This may be the case sometimes but there’s absolutely no shortage of other
> cases and it’s almost more the rule these days, that there is some kind of
> network between the OS processes and the storage- a SAN, an iSCSI network, 
> NFS,
> are all quite common.

Yes, but consider that the database cluster is having to get its data
from that remote storage --- the remote storage is not an independent
entity that can be corrupted without the database server being
compromised. If everything in PGDATA was GCM-verified, it would be
secure, but because some parts are not, I don't think it would be.

> > As specific examples, consider:
> >
> > An attack against the database system where the database server is shut down,
> > or a backup, and  the encryption key isn’t available on the system.
> >
> > The backup system itself, not running as the PG user (an option supported by PG
> > and at least pgbackrest) being compromised, thus allowing for injection of
> > changes into a backup or into a restore.
> 
> I then question why we are not adding encryption to pg_basebackup or
> pgbackrest rather than the database system.
> 
> Pgbackrest has encryption and authentication of it … but that doesn’t actually
> address the attack vector that I outlined. If the backup user is compromised
> then they can change the data before it gets to the storage.  If the backup
> user is compromised then they have access to whatever key is used to encrypt
> and authenticate the backup and therefore can trivially manipulate the data.

So the idea is that the backup user can be compromised without the data
being vulnerable --- makes sense, though that use-case seems narrow.

> What were the _technical_ reasons for those objections?
> 
> I believe largely the ones I’m bringing up here and which I outline above… I
> don’t mean to pretend that any of this is of my own independent construction. 
> I
> don’t believe it is and my apologies if it came across that way.

Yes, there is value beyond the check-box, but in most cases those
values are limited considering the complexity of the features, and the
check-box is what most people are asking for, I think.

> > I’ve grown weary of this argument as the other major piece of work it was
> > routinely applied to was RLS and yet that has certainly been seen broadly as a
> > beneficial feature with users clearly leveraging it and in more than some
> > “checkbox” way.
> 
> RLS has to overcome that objection, and I think it did, as was better
> for doing that.
> 
> Beyond it being called a checkbox - what were the arguments against it?  I

The RLS arguments were that queries could expose some of the underlying
data, but in summary, that was considered acceptable.

> > We, as a community, are clearly losing value by lack of this capability, if by
> > no other measure than simply the numerous users of the commercial
> > implementations feeling that they simply can’t use PG without this feature, for
> > whatever their reasoning.
> 
> That is true, but I go back to my concern over useful feature vs. check
> box.
> 
> While it’s easy to label something as checkbox, I don’t feel we have been fair

No, actually, it isn't.  I am not sure why you are saying that.

> to our users in doing so as it has historically prevented features which our
> users are demanding and end up getting from commercial providers until we
> implement them ultimately anyway.  This particular argument simply doesn’t 
> seem
> to actually hold the value that proponents of it claim, for us at least, and 
> we
> have clear counter-examples which we can point to and I hope we learn from
> those.

I don't think you are addressing actual issues above.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.




Re: Parallel Full Hash Join

2023-03-27 Thread Thomas Munro
On Sun, Mar 26, 2023 at 9:52 AM Melanie Plageman
 wrote:
> I have some very minor pieces of feedback, mainly about extraneous
> commas that made me uncomfortable ;)

Offensive punctuation removed.

> > discussion).  Therefore FULL JOIN inhibited page-based parallelism,
> > as the other join strategies can't do it either.
>
> I actually don't quite understand what this means? It's been awhile for
> me, so perhaps I'm being dense, but what is page-based parallelism?

Reworded.  I just meant our usual kind of "partial path" parallelism
(the kind when you don't know anything at all about the values of the
tuples that each process sees, and typically it's chopped up by
storage pages at the scan level).

> > That unfairness is considered acceptable for now, because it's better
> > than no parallelism at all.  The build and probe phases are run in
> > parallel, and the new scan-for-unmatched phase, while serial, is usually
> > applied to the smaller of the two relations and is either limited by
> > some multiple of work_mem, or it's too big and is partitioned into
> > batches and then the situation is improved by batch-level parallelism.
> > In future work on deadlock avoidance strategies, we may find a way to
> > parallelize the new phase safely.
>
> Is it worth mentioning something about parallel-oblivious parallel hash
> join not being able to do this still? Or is that obvious?

That's kind of what I meant above.

> > @@ -3116,18 +3256,31 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)

> full/right joins should never fall into this code path, right?

Yeah, this is the normal way we detach from a batch.  This is reached
when shutting down the executor early, or when moving to the next
batch, etc.

***

I found another problem.  I realised that ... FULL JOIN ... LIMIT n
might be able to give wrong answers with unlucky scheduling.
Unfortunately I have been unable to reproduce the phenomenon I am
imagining yet but I can't think of any mechanism that prevents the
following sequence of events:

P0 probes, pulls n tuples from the outer relation and then the
executor starts to shut down (see commit 3452dc52 which pushed down
LIMIT), but just then P1 attaches, right before P0 does.  P1
continues, and finds < n outer tuples while probing and then runs out
so it enters the unmatched scan phase, and starts emitting bogusly
unmatched tuples.  Some outer tuples we needed to get the complete set
of match bits and thus the right answer were buffered inside P0's
subplan and abandoned.

I've attached a simple fixup for this problem.  Short version: if
you're abandoning your PHJ_BATCH_PROBE phase without reaching the end,
you must be shutting down, so the executor must think it's OK to
abandon tuples this process has buffered, so it must also be OK to
throw all unmatched tuples out the window too, as if this process was
about to emit them.  Right?

***

With all the long and abstract discussion of hard to explain problems
in this thread and related threads, I thought I should take a step
back and figure out a way to demonstrate what this thing really does
visually.  I wanted to show that this is a very useful feature that
unlocks previously unobtainable parallelism, and to show the
compromise we've had to make so far in an intuitive way.  With some
extra instrumentation hacked up locally, I produced the attached
"progress" graphs for a very simple query: SELECT COUNT(*) FROM r FULL
JOIN s USING (i).  Imagine a time axis along the bottom, but I didn't
bother to add numbers because I'm just trying to convey the 'shape' of
execution with relative times and synchronisation points.

Figures 1-3 show that phases 'h' (hash) and 'p' (probe) are
parallelised and finish sooner as we add more processes to help out,
but 's' (= the unmatched inner tuple scan) is not.  Note that if all
inner tuples are matched, 's' becomes extremely small and the
parallelism is almost as fair as a plain old inner join, but here I've
maximised it: all inner tuples were unmatched, because the two
relations have no matches at all.  Even if we achieve perfect linear
scalability for the other phases, the speedup will be governed by
https://en.wikipedia.org/wiki/Amdahl%27s_law and the only thing that
can mitigate that is if there is more useful work those early-quitting
processes could do somewhere else in your query plan.

Figure 4 shows that it gets a lot fairer in a multi-batch join,
because there is usually useful work to do on other batches of the
same join.  Notice how processes initially work on loading, probing
and scanning different batches to reduce contention, but they are
capable of ganging up to load and/or probe the same batch if there is
nothing else left to do (for example P2 and P3 both work on p5 near
the end).  For now, they can't do that for the s phases.  (BTW, the
little gaps before loading is the allocation phase that I didn't
bother to plot because they can't fit a label on them; this
visualisation technique is a WIP.)

With the 

Re: Data is copied twice when specifying both child and parent table in publication

2023-03-27 Thread Jacob Champion
On Mon, Mar 20, 2023 at 11:22 PM Amit Kapila 
wrote:
> If the tests you have in mind are only related to this patch set then
> feel free to propose them here if you feel the current ones are not
> sufficient.

I think the new tests added by Wang cover my concerns (thanks!). I share
Peter's comment that we don't seem to have a regression test covering
only the bug description itself -- just ones that combine that case with
row and column restrictions -- but if you're all happy with the existing
approach then I have nothing much to add there.

I was staring at this subquery in fetch_table_list():

> +"  ( SELECT array_agg(a.attname ORDER BY 
> a.attnum)\n"
> +"FROM pg_attribute a\n"
> +"WHERE a.attrelid = gpt.relid AND\n"
> +"  a.attnum = ANY(gpt.attrs)\n"
> +"  ) AS attnames\n"

On my machine this takes up roughly 90% of the runtime of the query,
which makes for a noticeable delay with a bigger test case (a couple of
FOR ALL TABLES subscriptions on the regression database). And it seems
like we immediately throw all that work away: if I understand correctly,
we only use the third column for its interaction with DISTINCT. Would it
be enough to just replace that whole thing with gpt.attrs?
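
Concretely, the simplification I have in mind is roughly this (heavily
abbreviated and illustrative -- the FROM clause and the rest of the query
are unchanged, and I'm assuming gpt exposes relid and attrs as in the
fragment above):

-- current shape: a correlated subquery per row, used only to feed DISTINCT
SELECT DISTINCT n.nspname, c.relname,
       ( SELECT array_agg(a.attname ORDER BY a.attnum)
           FROM pg_attribute a
          WHERE a.attrelid = gpt.relid
            AND a.attnum = ANY(gpt.attrs) ) AS attnames
  FROM ...;

-- suggested: let DISTINCT work directly on the attribute-number array
SELECT DISTINCT n.nspname, c.relname, gpt.attrs
  FROM ...;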

Thanks,
--Jacob




Re: Moving forward with TDE

2023-03-27 Thread Stephen Frost
Greetings,

On Mon, Mar 27, 2023 at 18:17 Bruce Momjian  wrote:

> On Tue, Mar 28, 2023 at 12:01:56AM +0200, Stephen Frost wrote:
> > Greetings,
> >
> > On Mon, Mar 27, 2023 at 12:38 Bruce Momjian  wrote:
> >
> > On Wed, Mar  8, 2023 at 04:25:04PM -0500, Stephen Frost wrote:
> > > Agreed, though the latest efforts include an option for *authenticated*
> > > encryption as well as unauthenticated.  That makes it much more
> > > difficult to make undetected changes to the data that's protected by
> > > the authenticated encryption being used.
> >
> > I thought some more about this.  GCM-style authentication of encrypted
> > data has value because it assumes the two end points are secure but that
> > a malicious actor could modify data during transfer.  In the Postgres
> > case, it seems the two end points and the transfer are all in the same
> > place.  Therefore, it is unclear to me the value of using GCM-style
> > authentication because if the GCM-level can be modified, so can the end
> > points, and the encryption key exposed.
> >
> >
> > What are the two end points you are referring to and why don’t you feel there
> > is an opportunity between them for a malicious actor to attack the system?
>
> Uh, TLS can use GCM and in this case you assume the sender and receiver
> are secure, no?


TLS does use GCM.. pretty much exclusively as far as I can recall. So do a
lot of other things though..

> > There are simpler cases to consider than an online attack on a single
> > independent system where an attacker having access to modify the data in
> > transit between PG and the storage would imply the attacker also having access
> > to read keys out of PG’s memory.
>
> I consider the operating system and its processes as much more of a
> single entity than TLS over a network.


This may be the case sometimes but there’s absolutely no shortage of other
cases and it’s almost more the rule these days, that there is some kind of
network between the OS processes and the storage- a SAN, an iSCSI network,
NFS, are all quite common.

> > As specific examples, consider:
> >
> > An attack against the database system where the database server is shut down,
> > or a backup, and  the encryption key isn’t available on the system.
> >
> > The backup system itself, not running as the PG user (an option supported by PG
> > and at least pgbackrest) being compromised, thus allowing for injection of
> > changes into a backup or into a restore.
>
> I then question why we are not adding encryption to pg_basebackup or
> pgbackrest rather than the database system.


Pgbackrest has encryption and authentication of it … but that doesn’t
actually address the attack vector that I outlined. If the backup user is
compromised then they can change the data before it gets to the storage.
If the backup user is compromised then they have access to whatever key is
used to encrypt and authenticate the backup and therefore can trivially
manipulate the data.

Encryption of backups by the backup tool serves to protect the data after
it leaves the backup system and is stored in cloud storage or in whatever
format the repository takes.  This is beneficial, particularly when the
data itself offers no protection, but simply not the same.

> > The beginning of this discussion also very clearly had individuals voicing
> > strong opinions that unauthenticated encryption methods were not acceptable as
> > an end-state for PG due to the clear issue of there then being no protection
> > against modification of data.  The approach we are working towards provides
>
> What were the _technical_ reasons for those objections?


I believe largely the ones I’m bringing up here and which I outline above…
I don’t mean to pretend that any of this is of my own independent
construction. I don’t believe it is and my apologies if it came across that
way.

> > both the unauthenticated option, which clearly has value to a large number of
> > our collective user base considering the number of commercial implementations
> > which have now arisen, and the authenticated solution which goes further and
> > provides the level clearly expected of the PG community. This gets us a win-win
> > situation.
> >
> > > There's clearly user demand for it as there's a number of organizations
> > > who have forks which are providing it in one shape or another.  This
> > > kind of splintering of the community is actually an actively bad thing
> > > for the project and is part of what killed Unix, by at least some pretty
> > > reputable accounts, in my view.
> >
> > Yes, the number of commercial implementations of this is a concern.  Of
> > course, it is also possible that those commercial implementations are
> > meeting checkbox requirements rather than technical ones, and the
> > community has been hostile to check box-only features.
> >
> >
> > I’ve grown weary of this argument 

Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-27 Thread Sandro Santilli
On Mon, Mar 13, 2023 at 02:48:56PM -0400, Regina Obe wrote:
> 
> I still see the main use-case as for those that micro version and for this
> use case, they would need a way, not necessarily to have a single upgrade
> script, but a script for each minor.
> 
> So something like 
> 
> 3.2.%--3.4.0 = 3.2--3.4.0

I could implement this too if there's an agreement about it.
For now I'm attaching an updated patch with conflicts resolved.
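
As a reminder of the intended behavior: with the two scripts shipped by the
test module (test_ext_wildcard1--1.0.sql and test_ext_wildcard1--%--2.0.sql,
and no explicit 1.0--2.0 script), the upgrade goes through the wildcard
script, roughly like this:

CREATE EXTENSION test_ext_wildcard1 VERSION '1.0';
ALTER EXTENSION test_ext_wildcard1 UPDATE TO '2.0';  -- uses the '%' script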

--strk;
>From a10a1a7200f76bbc6e2def8d4684b647a99316cd Mon Sep 17 00:00:00 2001
From: Sandro Santilli 
Date: Wed, 14 Sep 2022 11:10:10 +0200
Subject: [PATCH v2] Allow wildcard (%) in extension upgrade paths

A wildcard character "%" will be accepted in the
"source" side of the upgrade script and be considered
usable to upgrade any version to the "target" side.

Includes regression test and documentation.
---
 doc/src/sgml/extend.sgml  |  8 
 src/backend/commands/extension.c  | 42 ---
 src/test/modules/test_extensions/Makefile |  7 ++--
 .../expected/test_extensions.out  | 16 +++
 src/test/modules/test_extensions/meson.build  |  3 ++
 .../test_extensions/sql/test_extensions.sql   |  9 
 .../test_ext_wildcard1--%--2.0.sql|  6 +++
 .../test_ext_wildcard1--1.0.sql   |  6 +++
 .../test_ext_wildcard1.control|  3 ++
 9 files changed, 91 insertions(+), 9 deletions(-)
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 218940ee5c..bdd463b81f 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -1120,6 +1120,14 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr
  1.1).
 
 
+
+ The literal value % can be used as the
+ old_version component in an extension
+ update script for it to match any version. Such wildcard update
+ scripts will only be used when no explicit path is found from
+ old to target version.
+
+
 
  Given that a suitable update script is available, the command
  ALTER EXTENSION UPDATE will update an installed extension
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 0eabe18335..36b6d7e01a 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -132,6 +132,7 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   bool cascade,
   bool is_create);
 static char *read_whole_file(const char *filename, int *length);
+static bool file_exists(const char *name);
 
 
 /*
@@ -913,7 +914,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (from_version == NULL)
 		elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version);
 	else
+	{
+		if ( ! file_exists(filename) )
+		{
+			/* if filename does not exist, try wildcard */
+			filename = get_extension_script_filename(control, "%", version);
+		}
 		elog(DEBUG1, "executing extension script for \"%s\" update from version '%s' to '%s'", control->name, from_version, version);
+	}
 
 	/*
 	 * If installing a trusted extension on behalf of a non-superuser, become
@@ -1258,14 +1266,19 @@ identify_update_path(ExtensionControlFile *control,
 
 	/* Find shortest path */
 	result = find_update_path(evi_list, evi_start, evi_target, false, false);
+	if (result != NIL)
+		return result;
 
-	if (result == NIL)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
-		control->name, oldVersion, newVersion)));
+	/* Find wildcard path, if no explicit path was found */
+	evi_start = get_ext_ver_info("%", &evi_list);
+	result = find_update_path(evi_list, evi_start, evi_target, false, false);
+	if (result != NIL)
+		return result;
 
-	return result;
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
+	control->name, oldVersion, newVersion)));
 }
 
 /*
@@ -3470,3 +3483,20 @@ read_whole_file(const char *filename, int *length)
 	buf[*length] = '\0';
 	return buf;
 }
+
+static bool
+file_exists(const char *name)
+{
+	struct stat st;
+
+	Assert(name != NULL);
+
+	if (stat(name, &st) == 0)
+		return !S_ISDIR(st.st_mode);
+	else if (!(errno == ENOENT || errno == ENOTDIR || errno == EACCES))
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+
+	return false;
+}
diff --git a/src/test/modules/test_extensions/Makefile b/src/test/modules/test_extensions/Makefile
index 70fc0c8e66..beec04eea3 100644
--- a/src/test/modules/test_extensions/Makefile
+++ 

Re: Memory leak from ExecutorState context?

2023-03-27 Thread Tomas Vondra
On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> On Mon, 20 Mar 2023 15:12:34 +0100
> Jehan-Guillaume de Rorthais  wrote:
> 
>> On Mon, 20 Mar 2023 09:32:17 +0100
>> Tomas Vondra  wrote:
>>
> * Patch 1 could be rebased/applied/backpatched

 Would it help if I rebase Patch 1 ("move BufFile stuff into separate
 context")?   
>>>
>>> Yeah, I think this is something we'd want to do. It doesn't change the
>>> behavior, but it makes it easier to track the memory consumption etc.  
>>
>> Will do this week.
> 
> Please, find in attachment a patch to allocate bufFiles in a dedicated 
> context.
> I picked up your patch, backpatch'd it, went through it and did some minor
> changes to it. I have some comment/questions thought.
> 
>   1. I'm not sure why we must allocate the "HashBatchFiles" new context
>   under ExecutorState and not under hashtable->hashCxt?
> 
>   The only references I could find was in hashjoin.h:30:
> 
>/* [...]
> * [...] (Exception: data associated with the temp files lives in the
> * per-query context too, since we always call buffile.c in that context.)
> 
>   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>   original comment in the patch):
> 
>/* [...]
> * Note: it is important always to call this in the regular executor
> * context, not in a shorter-lived context; else the temp file buffers
> * will get messed up.
> 
> 
>   But these are not explanation of why BufFile related allocations must be 
> under
>   a per-query context. 
> 

Doesn't that simply describe the current (unpatched) behavior where
BufFile is allocated in the per-query context? I mean, the current code
calls BufFileCreateTemp() without switching the context, so it's in the
ExecutorState. But with the patch it very clearly is not.

And I'm pretty sure the patch should do

hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
                                           "HashBatchFiles",
                                           ALLOCSET_DEFAULT_SIZES);

and it'd still work. Or why do you think we *must* allocate it under
ExecutorState?

FWIW The comment in hashjoin.h needs updating to reflect the change.

> 
>   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context 
> switch
>   seems fragile as it could be forgotten in futur code path/changes. So I
>   added an Assert() in the function to make sure the current memory context is
>   "HashBatchFiles" as expected.
>   Another way to tie this up might be to pass the memory context as argument 
> to
>   the function.
>   ... Or maybe I'm over precautionary.
> 

I'm not sure I'd call that fragile, we have plenty other code that
expects the memory context to be set correctly. Not sure about the
assert, but we don't have similar asserts anywhere else.

But I think it's just ugly and overly verbose - it'd be much nicer to
e.g. pass the memory context as a parameter, and do the switch inside.
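
Roughly something like this, I mean (an untested sketch of the idea only,
with error handling and the existing comments left out):

void
ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                      BufFile **fileptr, MemoryContext filecxt)
{
    BufFile    *file = *fileptr;

    if (file == NULL)
    {
        /* First write to this batch file, so create it in the right context. */
        MemoryContext oldcxt = MemoryContextSwitchTo(filecxt);

        file = BufFileCreateTemp(false);
        *fileptr = file;
        MemoryContextSwitchTo(oldcxt);
    }

    BufFileWrite(file, &hashvalue, sizeof(uint32));
    BufFileWrite(file, tuple, tuple->t_len);
}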

> 
>   3. You wrote:
> 
>>> A separate BufFile memory context helps, although people won't see it
>>> unless they attach a debugger, I think. Better than nothing, but I was
>>> wondering if we could maybe print some warnings when the number of batch
>>> files gets too high ...
> 
>   So I added a WARNING when batches memory are exhausting the memory size
>   allowed.
> 
>+   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
>+   elog(WARNING, "Growing number of hash batch is exhausting memory");
> 
>   This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
>   overflows the memory budget. I realize now I should probably add the memory
>   limit, the number of current batch and their memory consumption.
>   The message is probably too cryptic for a user. It could probably be
>   reworded, but some doc or additionnal hint around this message might help.
> 

Hmmm, not sure is WARNING is a good approach, but I don't have a better
idea at the moment.

>   4. I left the debug messages for some more review rounds
> 

OK


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Generate pg_stat_get_xact*() functions with Macros

2023-03-27 Thread Michael Paquier
On Mon, Mar 27, 2023 at 07:08:51PM +0900, Michael Paquier wrote:
> The patch has one mistake: PG_STAT_GET_XACT_FUNCENTRY_FLOAT8_MS does
> not need a slash on its last line or it would include the next, empty
> line.  This could lead to mistakes (no need to send a new patch just
> for that).

Adjusted that, and the rest was fine after a second look, so applied.
It looks like we are done for now with this thread, so I have marked
it as committed in the CF app.
--
Michael


signature.asc
Description: PGP signature


Re: Add pg_walinspect function with block info columns

2023-03-27 Thread Peter Geoghegan
On Sun, Mar 26, 2023 at 8:41 PM Kyotaro Horiguchi
 wrote:
> > I'd still put the LSN data before the three OIDs for consistency with
> > the structures, though my opinion does not seem to count much..
>
> I agree with Michael on this point. Also, although it may not be
> significant for SQL, the rows are output in lsn order from the
> function.

I guess that it makes sense to have the LSN data first, but to have
the other columns after that. I certainly don't dislike that approach.

I just noticed that "forkname" appears towards the end among
declared output parameters, even in Bharath's v6. I think that it
should be after "relblocknumber" instead, because it is conceptually
part of the "composite primary key", and so belongs right next to
"relblocknumber". I failed to mention this detail upthread, I think.

-- 
Peter Geoghegan




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-03-27 Thread Michael Paquier
On Mon, Mar 27, 2023 at 10:24:46AM -0400, Melanie Plageman wrote:
> I do like/prefer "block read requests" and
> "blocks requested found in cache"
> Though, now I fear my initial complaint may have been a bit pedantic.

That's fine.  Let's ask for extra opinions, then.

So, have others an opinion to share here?
--
Michael


signature.asc
Description: PGP signature


Re: Add LZ4 compression in pg_dump

2023-03-27 Thread gkokolatos






--- Original Message ---
On Thursday, March 16th, 2023 at 11:30 PM, Tomas Vondra 
 wrote:


> 
> 
> 
> 
> On 3/16/23 01:20, Justin Pryzby wrote:
> 
> > On Mon, Mar 13, 2023 at 10:47:12PM +0100, Tomas Vondra wrote:
> > 
> > > 
> > > Thanks. I don't want to annoy you too much, but could you split the
> > > patch into the "empty-data" fix and all the other changes (rearranging
> > > functions etc.)? I'd rather not mix those in the same commit.
> > 
> > I don't know if that makes sense? The "empty-data" fix creates a new
> > function called DeflateCompressorInit(). My proposal was to add the new
> > function in the same place in the file as it used to be.
> 
> 
> Got it. In that case I agree it's fine to do that in a single commit.

For what it's worth, I think that this patch should get a +1 and get in. It
solves the empty writes problem and includes a test for a previously
untested case.

Cheers,
//Georgios

> 
> > The patch also moves the pg_fatal() that's being removed. I don't think
> > it's going to look any cleaner to read a history involving the
> > pg_fatal() first being added, then moved, then removed. Anyway, I'll
> > wait while the community continues discussion about the pg_fatal().
> 
> 
> I think the agreement was to replace the pg_fatal with an assert, and I
> see your patch already does that.
> 
> 
> regards
> 
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: Show various offset arrays for heap WAL records

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
 wrote:
> I went to add dedup records and noticed that since the actual
> BTDedupInterval struct is what is put in the xlog, I would need access
> to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> work because it includes files that cannot be included in frontend code.

I suppose that the BTDedupInterval struct could have just as easily
gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
moment where I thought about doing it that way, but I guess I found it
slightly preferable to use that symbol name (BTDedupInterval) rather
than (say) xl_btree_dedup_interval in places like the nearby
BTDedupStateData struct.

Actually, I suppose that it's hard to make that alternative work, at least
without including nbtxlog.h in nbtree.h. Which sounds wrong.

> I, of course, could make some local struct in nbtdesc.c which has an
> OffsetNumber and a uint16, since the BTDedupInterval is pretty
> straightforward, but that seems a bit annoying.
> I'm probably missing something obvious, but is there a better way to do
> this?

It was probably just one of those cases where I settled on the
arrangement that looked least odd overall. Not a particularly
principled approach. But the approach that I'm going to take once more
here.  ;-)

All of the available alternatives are annoying in roughly the same
way, though perhaps to varying degrees. All except one: I'm okay with
just not adding coverage for deduplication records, for the time being
-- just seeing the number of intervals alone is relatively informative
with deduplication records, unlike (say) nbtree delete records. I'm
also okay with having coverage for dedup records if you feel it's
worth having. Your call.

If we're going to have coverage for deduplication records then it
seems to me that we have to have a struct in nbtxlog.h for your code
to work off of. It also seems likely that we'll want to use that same
struct within nbtxlog.c. What's less clear is what that means for the
BTDedupInterval struct. I don't think that we should include nbtxlog.h
in nbtree.h, nor should we do the converse.

I guess maybe two identical structs would be okay. BTDedupInterval,
and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
and the latter used through a pointer at the point that nbtxlog.c
reads a dedup record. Then maybe add a sizeof() static assert beside
the existing btree_xlog_dedup() assertions that check that the dedup
state interval array matches the array taken from the WAL record.
That's still a bit weird, but I find it preferable to any alternative
that I can think of.
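
To make that concrete, I'm thinking of something along these lines (names
and placement are illustrative only):

/* nbtree.h, unchanged */
typedef struct BTDedupInterval
{
    OffsetNumber baseoff;
    uint16       nitems;
} BTDedupInterval;

/* nbtxlog.h, new twin that frontend code such as nbtdesc.c can include */
typedef struct xl_btree_dedup_interval
{
    OffsetNumber baseoff;
    uint16       nitems;
} xl_btree_dedup_interval;

/* nbtxlog.c, keep the two structs from drifting apart */
StaticAssertDecl(sizeof(xl_btree_dedup_interval) == sizeof(BTDedupInterval),
                 "xl_btree_dedup_interval must match BTDedupInterval");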

> On another note, I've thought about how to include some example output
> in docs, and, for example we could modify the example output in the
> pgwalinspect docs which includes a PRUNE record already for
> pg_get_wal_record_info() docs. We'd probably just want to keep it short.

Yeah. Perhaps a PRUNE record for one of the system catalogs whose
relfilenode is relatively recognizable. Say pg_class. It probably
doesn't matter that much, but there is perhaps some small value in
picking an example that is relatively easy to recreate later on (or to
approximately recreate). I'm certainly not insisting on that, though.

--
Peter Geoghegan




Re: Non-superuser subscription owners

2023-03-27 Thread Jeff Davis
On Mon, 2023-03-27 at 14:06 -0400, Robert Haas wrote:
> I thought you were asking for those changes to be made before this
> patch got committed, so that's what I was responding to. If you're
> asking for it not to be committed at all, that's a different
> discussion.

I separately had a complaint (in a separate subthread) about the scope
of the predefined role you are introducing, which I think encompasses
two concepts that should be treated differently and I think that may
need to be revisited later. If you ignore this complaint it wouldn't be
the end of the world.

This subthread is about the order in which the patches get committed
(which is a topic you brought up), not whether they are ever to be
committed.

> 
> I kind of agree with you about the feature itself. Even though the
> basic feature works quite well and does something people really want,
> there are a lot of loose ends to sort out, and not just about
> security. But I also want to make some progress. If there are
> problems
> with what I'm proposing that will make us regret committing things
> right before feature freeze, then we shouldn't. But waiting a whole
> additional year to see any kind of improvement is not free; these
> issues are serious.

The non-superuser-subscription-owner patch without the apply-as-table-
owner patch feels like a facade to me, at least right now. Perhaps I
can be convinced otherwise, but that's what it looks like to me.

> 
> I think this patch is a lot better-baked and less speculative than
> that one. I think that patch is more important, so if they were
> equally mature, I'd favor getting that one committed first. But
> that's
> not the case.

You explicitly asked about the order of the patches, which made me
think it was more of an option?

If the apply-as-table-owner patch gets held up for whatever reason, we
might have to make a difficult decision. I'd prefer focus on the apply-
as-table-owner patch briefly, and now that it's getting some review
attention, we might find out how ready it is quite soon.


Regards,
Jeff Davis





Re: Moving forward with TDE

2023-03-27 Thread Bruce Momjian
On Tue, Mar 28, 2023 at 12:01:56AM +0200, Stephen Frost wrote:
> Greetings,
> 
> On Mon, Mar 27, 2023 at 12:38 Bruce Momjian  wrote:
> 
> On Wed, Mar  8, 2023 at 04:25:04PM -0500, Stephen Frost wrote:
> > Agreed, though the latest efforts include an option for *authenticated*
> > encryption as well as unauthenticated.  That makes it much more
> > difficult to make undetected changes to the data that's protected by
> > the authenticated encryption being used.
> 
> I thought some more about this.  GCM-style authentication of encrypted
> data has value because it assumes the two end points are secure but that
> a malicious actor could modify data during transfer.  In the Postgres
> case, it seems the two end points and the transfer are all in the same
> place.  Therefore, it is unclear to me the value of using GCM-style
> authentication because if the GCM-level can be modified, so can the end
> points, and the encryption key exposed.
> 
> 
> What are the two end points you are referring to and why don’t you feel there
> is an opportunity between them for a malicious actor to attack the system?

Uh, TLS can use GCM and in this case you assume the sender and receiver
are secure, no?

> There are simpler cases to consider than an online attack on a single
> independent system where an attacker having access to modify the data in
> transit between PG and the storage would imply the attacker also having access
> to read keys out of PG’s memory. 

I consider the operating system and its processes as much more of a
single entity than TLS over a network.

> As specific examples, consider:
> 
> An attack against the database system where the database server is shut down,
> or a backup, and  the encryption key isn’t available on the system.
> 
> The backup system itself, not running as the PG user (an option supported by 
> PG
> and at least pgbackrest) being compromised, thus allowing for injection of
> changes into a backup or into a restore.

I then question why we are not adding encryption to pg_basebackup or
pgbackrest rather than the database system.

> The beginning of this discussion also very clearly had individuals voicing
> strong opinions that unauthenticated encryption methods were not acceptable as
> an end-state for PG due to the clear issue of there then being no protection
> against modification of data.  The approach we are working towards provides

What were the _technical_ reasons for those objections?

> both the unauthenticated option, which clearly has value to a large number of
> our collective user base considering the number of commercial implementations
> which have now arisen, and the authenticated solution which goes further and
> provides the level clearly expected of the PG community. This gets us a 
> win-win
> situation.
> 
> > There's clearly user demand for it as there's a number of organizations
> > who have forks which are providing it in one shape or another.  This
> > kind of splintering of the community is actually an actively bad thing
> > for the project and is part of what killed Unix, by at least some pretty
> > reputable accounts, in my view.
> 
> Yes, the number of commercial implementations of this is a concern.  Of
> course, it is also possible that those commercial implementations are
> meeting checkbox requirements rather than technical ones, and the
> community has been hostile to check box-only features.
> 
> 
> I’ve grown weary of this argument as the other major piece of work it was
> routinely applied to was RLS and yet that has certainly been seen broadly as a
> beneficial feature with users clearly leveraging it and in more than some
> “checkbox” way.

RLS had to overcome that objection, and I think it did, and was better
for doing so.

> We, as a community, are clearly losing value by lack of this capability, if by
> no other measure than simply the numerous users of the commercial
> implementations feeling that they simply can’t use PG without this feature, 
> for
> whatever their reasoning.

That is true, but I go back to my concern over useful feature vs. check
box.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.




Re: doc: add missing "id" attributes to extension packaging page

2023-03-27 Thread Peter Smith
On Tue, Mar 28, 2023 at 4:06 AM Brar Piening  wrote:
>
> On 23.03.2023 at 20:08, Brar Piening wrote:
> > Since the need for ids is starting to grow again (ecb696527c added an
> > id to a varlistentry in doc/src/sgml/ref/create_subscription.sgml)
> > I've also amended the add-needed-ids patch once again so that the
> > build does not fail after applying the make_html_ids_discoverable patch.
>
> New add-needed-ids patch since it was outdated again.
>

FYI, there is a lot of overlap between this last attachment and the
patches of Kuroda-san's current thread here [1] which is also adding
ids to create_subscription.sgml.

(Anyway, I guess you might already be aware of that other thread
because your new ids look like they have the same names as those
chosen by Kuroda-san)

--
[1] 
https://www.postgresql.org/message-id/flat/CAHut%2BPvzo6%3DKKLqMR6-mAQdM%2Bj_dse0eUreGmrFouL7gbLbv2w%40mail.gmail.com#7da8d0e3b73096375847c16c856b4aed

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: running logical replication as the subscription owner

2023-03-27 Thread Jelte Fennema
> I don't get it. If we just return, that would result in skipping
> changes rather than erroring out on changes, but it wouldn't preserve
> the current behavior, because we'd still care about the table owner's
> permissions rather than, as now, the subscription owner's permissions.

Attached is an updated version of your patch with what I had in mind
(admittedly it needed one more line than "just" the return to make it
work). But as you can see, all previous tests for a low-privileged
subscription owner that **cannot** SET ROLE to the table owner
continue to work as they did before, while still downgrading to the
table owner's role when the subscription owner **can** SET ROLE to the
table owner.

Obviously this needs some comments explaining what's going on and
probably some code refactoring and/or variable renaming, but I hope
it's clear what I meant now: for high-privileged subscription owners,
we downgrade to the permissions of the table owner, but for
low-privileged ones we care about the permissions of the subscription
owner itself.


v2-0001-Perform-logical-replication-actions-as-the-table-.patch
Description: Binary data


Re: Moving forward with TDE

2023-03-27 Thread Stephen Frost
Greetings,

On Mon, Mar 27, 2023 at 12:38 Bruce Momjian  wrote:

> On Wed, Mar  8, 2023 at 04:25:04PM -0500, Stephen Frost wrote:
> > Agreed, though the latest efforts include an option for *authenticated*
> > encryption as well as unauthenticated.  That makes it much more
> > difficult to make undetected changes to the data that's protected by
> > the authenticated encryption being used.
>
> I thought some more about this.  GCM-style authentication of encrypted
> data has value because it assumes the two end points are secure but that
> a malicious actor could modify data during transfer.  In the Postgres
> case, it seems the two end points and the transfer are all in the same
> place.  Therefore, it is unclear to me the value of using GCM-style
> authentication because if the GCM-level can be modified, so can the end
> points, and the encryption key exposed.


What are the two end points you are referring to and why don’t you feel
there is an opportunity between them for a malicious actor to attack the
system?

There are simpler cases to consider than an online attack on a single
independent system where an attacker having access to modify the data in
transit between PG and the storage would imply the attacker also having
access to read keys out of PG’s memory.

As specific examples, consider:

An attack against the database system where the database server is shut
down, or a backup, and  the encryption key isn’t available on the system.

The backup system itself, not running as the PG user (an option supported
by PG and at least pgbackrest) being compromised, thus allowing for
injection of changes into a backup or into a restore.

The beginning of this discussion also very clearly had individuals voicing
strong opinions that unauthenticated encryption methods were not acceptable
as an end-state for PG due to the clear issue of there then being no
protection against modification of data.  The approach we are working
towards provides both the unauthenticated option, which clearly has value
to a large number of our collective user base considering the number of
commercial implementations which have now arisen, and the authenticated
solution which goes further and provides the level clearly expected of the
PG community. This gets us a win-win situation.

> There's clearly user demand for it as there's a number of organizations
> > who have forks which are providing it in one shape or another.  This
> > kind of splintering of the community is actually an actively bad thing
> > for the project and is part of what killed Unix, by at least some pretty
> > reputable accounts, in my view.
>
> Yes, the number of commercial implementations of this is a concern.  Of
> course, it is also possible that those commercial implementations are
> meeting checkbox requirements rather than technical ones, and the
> community has been hostile to check box-only features.


I’ve grown weary of this argument as the other major piece of work it was
routinely applied to was RLS and yet that has certainly been seen broadly
as a beneficial feature with users clearly leveraging it and in more than
some “checkbox” way.

Indeed, it’s similar also in that commercial implementations were done of
RLS while there were arguments made about it being a checkbox feature which
were used to discourage it from being implemented in core.  Were it truly
checkbox, I don’t feel we would have the regular and ongoing discussion
about it on the lists that we do, nor see other tools built on top of PG
which specifically leverage it. Perhaps there are truly checkbox features
out there which we will never implement, but I’m (perhaps due to what my
dad would call selective listening on my part, perhaps not) having trouble
coming up with any presently. Features that exist in other systems that we
don’t want?  Certainly. We don’t characterize those as simply “checkbox”
though. Perhaps that’s in part because we provide alternatives- but that’s
not the case here. We have no comparable way to have this capability as
part of the core system.

We, as a community, are clearly losing value by lack of this capability, if
by no other measure than simply the numerous users of the commercial
implementations feeling that they simply can’t use PG without this feature,
for whatever their reasoning.

Thanks,

Stephen


Re: Show various offset arrays for heap WAL records

2023-03-27 Thread Melanie Plageman
On Mon, Mar 13, 2023 at 9:41 PM Peter Geoghegan  wrote:
> On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman  
> wrote:
>
> > I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> > the updated/deleted target offset numbers and the updated tuples
> > metadata.
> >
> > I wondered if there was any reason to do xl_btree_dedup deduplication
> > intervals.
>
> No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
> intervals -- each element is a page offset number, and a corresponding
> count of index tuples to merge together in the REDO routine. That's
> slightly different to anything else, but not in a way that seems like
> it requires very much additional effort.

I went to add dedup records and noticed that, since the actual
BTDedupInterval struct is what is put in the xlog, I would need access
to that type from nbtdesc.c; however, including nbtree.h doesn't seem to
work because it includes files that cannot be included in frontend code.

I, of course, could make some local struct in nbtdesc.c which has an
OffsetNumber and a uint16, since the BTDedupInterval is pretty
straightforward, but that seems a bit annoying.
I'm probably missing something obvious, but is there a better way to do
this?
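
For what it's worth, a minimal sketch of that local-struct workaround,
assuming the interval really is just a page offset plus a count as
described above (the type and field names here are made up, not the
ones from nbtree.h; OffsetNumber and uint16 come from postgres.h, which
nbtdesc.c already includes):

    /* Hypothetical local mirror of the dedup interval layout, so that
     * nbtdesc.c can walk the record data without including nbtree.h. */
    typedef struct xl_btree_dedup_interval
    {
        OffsetNumber    baseoff;    /* page offset number of base tuple */
        uint16          nitems;     /* number of tuples to merge together */
    } xl_btree_dedup_interval;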

On another note, I've thought about how to include some example output
in docs, and, for example we could modify the example output in the
pgwalinspect docs which includes a PRUNE record already for
pg_get_wal_record_info() docs. We'd probably just want to keep it short.

- Melanie




Re: SQL/JSON revisited

2023-03-27 Thread Justin Pryzby
I ran sqlsmith on this patch for a short while, and reduced one of its
appalling queries to this:

postgres=# SELECT jsonb_object_agg_unique_strict('', null::xid8);
ERROR:  unexpected jsonb type as object key

postgres=# \errverbose 
ERROR:  XX000: unexpected jsonb type as object key
UBICACIÓN:  JsonbIteratorNext, jsonb_util.c:958

As you know, it's considered bad if elog()s are reachable as user-facing errors.

2023-03-27 15:46:47.351 CDT client backend[13361] psql ERROR:  unexpected jsonb 
type as object key
2023-03-27 15:46:47.351 CDT client backend[13361] psql BACKTRACE:  
postgres: pryzbyj postgres [local] SELECT(JsonbIteratorNext+0x1e5) 
[0x5638fa11ba82]
postgres: pryzbyj postgres [local] SELECT(+0x4ff951) [0x5638fa114951]
postgres: pryzbyj postgres [local] SELECT(JsonbToCString+0x12) 
[0x5638fa116584]
postgres: pryzbyj postgres [local] SELECT(jsonb_out+0x24) 
[0x5638fa1165ad]
postgres: pryzbyj postgres [local] SELECT(FunctionCall1Coll+0x51) 
[0x5638fa1ef585]
postgres: pryzbyj postgres [local] SELECT(OutputFunctionCall+0x15) 
[0x5638fa1f067d]
postgres: pryzbyj postgres [local] SELECT(+0xe7ef7) [0x5638f9cfcef7]
postgres: pryzbyj postgres [local] SELECT(+0x2b4271) [0x5638f9ec9271]
postgres: pryzbyj postgres [local] SELECT(standard_ExecutorRun+0x146) 
[0x5638f9ec9402]

What might indicate a worse problem is that with debug_discard_caches=1, it
does something different:

postgres=# \errverbose 
ERROR:  XX000: invalid jsonb scalar type
UBICACIÓN:  convertJsonbScalar, jsonb_util.c:1865

2023-03-27 15:51:21.788 CDT client backend[15939] psql ERROR:  invalid jsonb 
scalar type
2023-03-27 15:51:21.788 CDT client backend[15939] psql CONTEXT:  parallel worker
2023-03-27 15:51:21.788 CDT client backend[15939] psql BACKTRACE:  
postgres: pryzbyj postgres [local] SELECT(ThrowErrorData+0x2a6) 
[0x5638fa1ec8f3]
postgres: pryzbyj postgres [local] SELECT(+0x194820) [0x5638f9da9820]
postgres: pryzbyj postgres [local] SELECT(HandleParallelMessages+0x15d) 
[0x5638f9daac95]
postgres: pryzbyj postgres [local] SELECT(ProcessInterrupts+0x906) 
[0x5638fa094873]
postgres: pryzbyj postgres [local] SELECT(+0x2d202b) [0x5638f9ee702b]
postgres: pryzbyj postgres [local] SELECT(+0x2d2206) [0x5638f9ee7206]
postgres: pryzbyj postgres [local] SELECT(+0x2d245a) [0x5638f9ee745a]
postgres: pryzbyj postgres [local] SELECT(+0x2bbcec) [0x5638f9ed0cec]
postgres: pryzbyj postgres [local] SELECT(+0x2b4240) [0x5638f9ec9240]
postgres: pryzbyj postgres [local] SELECT(standard_ExecutorRun+0x146) 
[0x5638f9ec9402]

+valgrind indicates this:

==14095== Use of uninitialised value of size 8
==14095==at 0x60D1C9: convertJsonbScalar (jsonb_util.c:1822)
==14095==by 0x60D44F: convertJsonbObject (jsonb_util.c:1741)
==14095==by 0x60D630: convertJsonbValue (jsonb_util.c:1611)
==14095==by 0x60D903: convertToJsonb (jsonb_util.c:1565)
==14095==by 0x60F272: JsonbValueToJsonb (jsonb_util.c:117)
==14095==by 0x60A504: jsonb_object_agg_finalfn (jsonb.c:2057)
==14095==by 0x3D0806: finalize_aggregate (nodeAgg.c:1119)
==14095==by 0x3D2210: finalize_aggregates (nodeAgg.c:1353)
==14095==by 0x3D2E7F: agg_retrieve_direct (nodeAgg.c:2512)
==14095==by 0x3D32DC: ExecAgg (nodeAgg.c:2172)
==14095==by 0x3C3CEB: ExecProcNodeFirst (execProcnode.c:464)
==14095==by 0x3BC23F: ExecProcNode (executor.h:272)
==14095==by 0x3BC23F: ExecutePlan (execMain.c:1633)

And then it shows a different error:
2023-03-27 16:00:10.072 CDT standalone backend[14095] ERROR:  unknown type of 
jsonb container to convert

In the docs:

+The key can not be null. If the
+value is null then the entry is skipped,

s/can not/cannot/
The "," is dangling.

-- 
Justin




Re: Memory leak from ExecutorState context?

2023-03-27 Thread Jehan-Guillaume de Rorthais
Hi,

On Mon, 20 Mar 2023 15:12:34 +0100
Jehan-Guillaume de Rorthais  wrote:

> On Mon, 20 Mar 2023 09:32:17 +0100
> Tomas Vondra  wrote:
> 
> > >> * Patch 1 could be rebased/applied/backpatched
> > > 
> > > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > > context")?   
> > 
> > Yeah, I think this is something we'd want to do. It doesn't change the
> > behavior, but it makes it easier to track the memory consumption etc.  
> 
> Will do this week.

Please find attached a patch to allocate BufFiles in a dedicated context.
I picked up your patch, backpatch'd it, went through it and did some minor
changes to it. I have some comments/questions though.

  1. I'm not sure why we must allocate the new "HashBatchFiles" context
  under ExecutorState and not under hashtable->hashCxt?

  The only references I could find were in hashjoin.h:30:

   /* [...]
* [...] (Exception: data associated with the temp files lives in the
* per-query context too, since we always call buffile.c in that context.)

  And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
  original comment in the patch):

   /* [...]
* Note: it is important always to call this in the regular executor
* context, not in a shorter-lived context; else the temp file buffers
* will get messed up.


  But these do not explain why BufFile-related allocations must be under
  a per-query context.


  2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context switch
  seems fragile, as it could be forgotten in future code paths/changes. So I
  added an Assert() in the function to make sure the current memory context is
  "HashBatchFiles" as expected (see the short sketch after point 4 below).
  Another way to tie this up might be to pass the memory context as an argument
  to the function.
  ... Or maybe I'm over-precautionary.


  3. You wrote:

>> A separate BufFile memory context helps, although people won't see it
>> unless they attach a debugger, I think. Better than nothing, but I was
>> wondering if we could maybe print some warnings when the number of batch
>> files gets too high ...

  So I added a WARNING when the batches' BufFile memory exhausts the allowed
  memory size.

   +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
   +   elog(WARNING, "Growing number of hash batch is exhausting memory");

  This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
  overflows the memory budget. I realize now I should probably add the memory
  limit, the current number of batches and their memory consumption.
  The message is probably too cryptic for a user. It could probably be
  reworded, but some doc or an additional hint around this message might help.

  4. I left the debug messages for some more review rounds
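
  Regarding point 2, the guard would look roughly like this (sketch only; it
  assumes the fileCxt is reachable from inside ExecHashJoinSaveTuple(), e.g.
  passed down as an argument):

    /* Fail fast if a future caller forgets to switch into the BufFile
     * context before any buffile.c allocation happens. */
    Assert(CurrentMemoryContext == fileCxt);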

Regards,
>From a0dd541ad4e47735fc388d0c553e935f0956f254 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 39 +++--
 src/backend/executor/nodeHashjoin.c | 14 ---
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..3f1ae9755e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(CurrentMemoryContext,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
+
+		

Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 1:17 PM Robert Haas  wrote:
> Patch attached.

This is fine, as far as it goes. Obviously it fixes the immediate problem.

> > I don't think that it's particularly likely that having refined
> > aborted speculative insertion amcheck coverage will make a critical
> > difference to any user, at any time. But "amcheck as documentation of
> > the on-disk format" is reason enough to have it.
>
> Sure, if someone feels like writing the code. I have to admit that I
> have mixed feelings about this whole direction. In concept, I agree
> with you entirely: a fringe benefit of having checks that tell us
> whether or not a page is valid is that it helps to make clear what
> page states we think are valid.

I don't think that it's a fringe benefit; it's just not necessarily of
direct benefit to amcheck users.

Before the HOT chain validation patch went in, it was unclear whether
certain conceivable on-disk states should constitute corruption. In
particular, it wasn't clear to anybody whether or not it was okay for
an LP_REDIRECT to point to an LP_DEAD until recently (and probably
other things besides that). I don't think that we should assume that
the easy part is abstractly defining corruption, while the hard part
is writing the tool to check for the corruption. Sometimes it is, but
I think that it's often the other way around.

> In practice, however, the point you
> raise in your first sentence weighs awfully heavily with me. Spending
> a lot of energy on checks that are unlikely to catch practical
> problems feels like it may not be the best use of time.

That definitely could be true, but I don't think that it's terribly
much extra effort in most cases.

> I'm not sure
> exactly where to draw the line, but it seems highly likely that
> there are things we could deduce about the page that wouldn't be worth
> the effort. For example, would we bother checking that a tuple with an
> in-progress xmin does not have a smaller natts value than a tuple with
> a committed xmin? Or that natts values are non-decreasing across a HOT
> chain? I suspect there are even more obscure examples of things that
> should be true but might not really be worth worrying about in the
> code.

A related way of looking at it (that I also find appealing) is that
it's often easier (far easier) to just have the check, and be done
with it. Of course there is bound to be uncertainty about how useful
any given check might be; we're looking for something that is
theoretically never supposed to happen. Why not just assume that it
might matter if it's not costing very much to check for it?

This is quite a different mentality than the one we bring to core
heapam code, where it's quite natural to just avoid strange corner
cases in the on-disk format like the plague. The risk profile is
totally different for amcheck code. Within amcheck, I'd rather go too
far than not go far enough.

-- 
Peter Geoghegan




Re: running logical replication as the subscription owner

2023-03-27 Thread Robert Haas
On Mon, Mar 27, 2023 at 3:04 PM Jeff Davis  wrote:
> On Mon, 2023-03-27 at 12:53 -0400, Robert Haas wrote:
> > We do know that an empty search_path is secure,
> > but it's probably also going to cause any code we run to fail, unless
> > that code schema-qualifies all references outside of pg_catalog, or
> > unless it sets search_path itself.
>
> I am confused.
>
> If the trigger function requires schema "xyz" to be in the search_path,
> and the function itself doesn't set it, how will it ever get set in the
> subscription process? Won't such a function be simply broken for all
> logical subscriptions (in current code or with any of the proposals
> active right now)?
>
> And if the trigger function requires object "abc" (regardless of
> schema) to be somehow accessible without qualification, and if it
> doesn't set the search path itself, and it's not in pg_catalog; then
> again I think that function would be broken today.

No, not really. It's pretty common for a lot of things to be in the
public schema, and the public schema is likely to be in the search
path of every user involved.

> It feels like we are reaching to say that a trigger function might be
> broken based on some contrived cases, but we can already contrive cases
> that are broken for logical replication today. It might not be exactly
> the same set, but unless I'm missing something it would be a very
> similar set.

No, it's not a contrived case, and it's not the same set of cases, not
even close. Running functions with a different search path than the
author intended is a really common cause of all kinds of things
breaking. See for example commit
582edc369cdbd348d68441fc50fa26a84afd0c1a, which certainly did break
things for some users.

> Are you suggesting that we require the subscription owner to have SET
> ROLE on everybody so that everyone is equally insecure and there's no
> way to ever fix it even if you don't have any triggers on your tables
> at all?

I certainly am not. I don't even know how to respond to this. I want
to make it secure, not insecure. But we don't gain any security by
pretending that certain exploits don't exist or aren't problems when
they do and are. Quite the contrary.

> And all of this is to protect a user that writes a trigger function
> that executes arbitrary SQL based on the input data?

Or is insecure in any other way, and there are quite a few ways. If
you don't think that this is a problem in reality, I really don't know
how to carry this conversation forward. The idea that the average
trigger function is safe if it can be unexpectedly called by someone
other than the table owner with arguments and GUC settings chosen by
the caller doesn't seem remotely correct to me. It matches no part of
my experience with user-defined functions, either written by me or by
EDB customers. Every database I've ever seen that used triggers at all
would be vulnerable to the subscription owner compromising the table
owner's account.

> > Well, I continue to feel that if you can compromise the subscription
> > owner's account, we haven't really fixed anything yet.
>
> I'm trying to reconcile the following two points:
>
>   A. That you are proposing a patch to allow non-superuser
> subscriptions; and
>   B. That you are arguing that the subscription owner basically needs
> to be superuser and we should just accept that.
>
> Granted, there's some nuance here and I won't say that those two are
> mutually exclusive. But I think it would be helpful to explain how
> those two ideas fit together.

I do think that what this patch does is tantamount to B, because you
can have SET ROLE to some users without having SET ROLE to all users.
That's a big part of the point of the CREATEROLE and
createrole_self_grant work.

> But I think you're saying something slightly different about the
> subscription process compromising the table owner, and assuming that
> problem exists, your patch doesn't do anything to stop it. In fact,
> your patch requires that the subscription owner can SET ROLE to each of the
> table owners, guaranteeing that the table owners can never do anything
> at all to protect themselves from the subscription owner.

Yeah, that's true, and like I said, if there's a way to avoid that,
great. But wishing it were so is not that way.

Let's back up a minute here. Suppose someone were to request a new
command, ALTER TABLE  DO NOT LET THE SUPERUSER ACCESS THIS. We
would reject that proposal. The reason we would reject it is because
it wouldn't actually work as documented. We know that the superuser
has the power to access that account and reverse that command, either
by SET ROLE or by changing the account password or by changing
pg_hba.conf or by shelling out to the filesystem and doing whatever.
The feature purports to do something that it actually cannot do. No
one can defend themselves against the superuser. Not even another
superuser can defend themselves against a superuser. Pretending
otherwise would be confusing and have no 

Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

2023-03-27 Thread Robert Haas
On Mon, Mar 27, 2023 at 2:34 PM Peter Geoghegan  wrote:
> > Since this was back-patched, I think it's probably better to just
> > remove the error. We can introduce new validation if we want, but that
> > should probably be master-only.
>
> That makes sense.

Patch attached.

> I don't think that it's particularly likely that having refined
> aborted speculative insertion amcheck coverage will make a critical
> difference to any user, at any time. But "amcheck as documentation of
> the on-disk format" is reason enough to have it.

Sure, if someone feels like writing the code. I have to admit that I
have mixed feelings about this whole direction. In concept, I agree
with you entirely: a fringe benefit of having checks that tell us
whether or not a page is valid is that it helps to make clear what
page states we think are valid. In practice, however, the point you
raise in your first sentence weighs awfully heavily with me. Spending
a lot of energy on checks that are unlikely to catch practical
problems feels like it may not be the best use of time. I'm not sure
exactly where to draw the line, but it seems highly likely that
there are things we could deduce about the page that wouldn't be worth
the effort. For example, would we bother checking that a tuple with an
in-progress xmin does not have a smaller natts value than a tuple with
a committed xmin? Or that natts values are non-decreasing across a HOT
chain? I suspect there are even more obscure examples of things that
should be true but might not really be worth worrying about in the
code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0001-amcheck-In-verify_heapam-allows-tuples-with-xmin-0.patch
Description: Binary data


Re: Non-superuser subscription owners

2023-03-27 Thread Jeff Davis
On Mon, 2023-03-27 at 10:46 -0700, Andres Freund wrote:
> > There are some big issues, like the security model for replaying
> > changes.
> 
> That seems largely unrelated.

They are self-evidently related in a fundamental way. The behavior of
the non-superuser-subscription patch depends on the presence of the
apply-as-table-owner patch.

I think I'd like to understand the apply-as-table-owner patch better to
understand the interaction.

Regards,
Jeff Davis





Re: running logical replication as the subscription owner

2023-03-27 Thread Jeff Davis
On Mon, 2023-03-27 at 12:53 -0400, Robert Haas wrote:
> We do know that an empty search_path is secure,
> but it's probably also going to cause any code we run to fail, unless
> that code schema-qualifies all references outside of pg_catalog, or
> unless it sets search_path itself.

I am confused.

If the trigger function requires schema "xyz" to be in the search_path,
and the function itself doesn't set it, how will it ever get set in the
subscription process? Won't such a function be simply broken for all
logical subscriptions (in current code or with any of the proposals
active right now)?

And if the trigger function requires object "abc" (regardless of
schema) to be somehow accessible without qualification, and if it
doesn't set the search path itself, and it's not in pg_catalog; then
again I think that function would be broken today.

It feels like we are reaching to say that a trigger function might be
broken based on some contrived cases, but we can already contrive cases
that are broken for logical replication today. It might not be exactly
the same set, but unless I'm missing something it would be a very
similar set.

>  search_path also isn't necessarily
> the only problem. As a hypothetical example, suppose I create a table
> with one text column, revoke all public access to that table, and
> then
> create an on-insert trigger that executes as an SQL command any text
> value inserted into the table. This is perfectly secure as long as
> I'm
> the only one who can access the table, but if someone else gets
> access
> to insert things into that table using my credentials then they can
> easily break into my account. Real examples aren't necessarily that
> dramatically bad, but that doesn't mean they don't exist.

Are you suggesting that we require the subscription owner to have SET
ROLE on everybody so that everyone is equally insecure and there's no
way to ever fix it even if you don't have any triggers on your tables
at all?

And all of this is to protect a user that writes a trigger function
that executes arbitrary SQL based on the input data?

> 
> Well, I continue to feel that if you can compromise the subscription
> owner's account, we haven't really fixed anything yet.

I'm trying to reconcile the following two points:

  A. That you are proposing a patch to allow non-superuser
subscriptions; and
  B. That you are arguing that the subscription owner basically needs
to be superuser and we should just accept that.

Granted, there's some nuance here and I won't say that those two are
mutually exclusive. But I think it would be helpful to explain how
those two ideas fit together.

> We just ... fixed it so that no
> compromise was possible. And I think that's also the right approach
> here.

If you are saying that we fix the subscription process so that it can't
easily be compromised by the table owner, then that appears to be the
entire purpose of this patch and I agree.

But I think you're saying something slightly different about the
subscription process compromising the table owner, and assuming that
problem exists, your patch doesn't do anything to stop it. In fact,
your patch requires that the subscription owner can SET ROLE to each of the
table owners, guaranteeing that the table owners can never do anything
at all to protect themselves from the subscription owner.

> I do agree with you that allowing SET ROLE on tons of accounts is not
> great, though.

Regardless of security concerns, the UI is bad.


Regards,
Jeff Davis





Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

2023-03-27 Thread Peter Geoghegan
On Mon, Mar 27, 2023 at 10:17 AM Robert Haas  wrote:
> > What about aborted speculative insertions? See
> > heap_abort_speculative(), which directly sets the speculatively
> > inserted heap tuple's xmin to InvalidTransactionId/zero.
>
> Oh, dear. I didn't know about that case.

A big benefit of having extensive amcheck coverage is that it
effectively centralizes information about the on-disk format, in an
easy to understand way, and (over time) puts things on a more rigorous
footing. Now it'll be a lot harder for somebody else to overlook that
case in the future, which is good. Things are trending in the right
direction.

> > It probably does make sense to keep something close to this check --
> > it just needs to account for speculative insertions to avoid false
> > positive reports of corruption. We could perform cross-checks against
> > a tuple whose xmin is InvalidTransactionId/zero to verify that it
> > really is from an aborted speculative insertion, to the extent that
> > that's possible. For example, such a tuple can't be a heap-only tuple,
> > and it can't have any xmax value other than InvalidTransactionId/zero.
>
> Since this was back-patched, I think it's probably better to just
> remove the error. We can introduce new validation if we want, but that
> should probably be master-only.

That makes sense.

I don't think that it's particularly likely that having refined
aborted speculative insertion amcheck coverage will make a critical
difference to any user, at any time. But "amcheck as documentation of
the on-disk format" is reason enough to have it.

-- 
Peter Geoghegan




Re: Should vacuum process config file reload more often

2023-03-27 Thread Melanie Plageman
On Sat, Mar 25, 2023 at 3:03 PM Melanie Plageman
 wrote:
>
> On Thu, Mar 23, 2023 at 8:27 PM Melanie Plageman
>  wrote:
> > On Thu, Mar 23, 2023 at 2:09 AM Masahiko Sawada  
> > wrote:
> > > > And, I was wondering if it was worth trying to split up the part that
> > > > reloads the config file and all of the autovacuum stuff. The reloading
> > > > of the config file by itself won't actually result in autovacuum workers
> > > > having updated cost delays because of them overwriting it with
> > > > wi_cost_delay, but it will allow VACUUM to have those updated values.
> > >
> > > It makes sense to me to have changes for overhauling the rebalance
> > > mechanism in a separate patch.
> > >
> > > Looking back at the original concern you mentioned[1]:
> > >
> > > speed up long-running vacuum of a large table by
> > > decreasing autovacuum_vacuum_cost_delay/vacuum_cost_delay, however the
> > > config file is only reloaded between tables (for autovacuum) or after
> > > the statement (for explicit vacuum).
> > >
> > > does it make sense to have autovac_balance_cost() update workers'
> > > wi_cost_delay too? Autovacuum launcher already reloads the config file
> > > and does the rebalance. So I thought autovac_balance_cost() can update
> > > the cost_delay as well, and this might be a minimal change to deal
> > > with your concern. This doesn't have the effect for manual VACUUM but
> > > since vacuum delay is disabled by default it won't be a big problem.
> > > As for manual VACUUMs, we would need to reload the config file in
> > > vacuum_delay_point() as the part of your patch does. Overhauling the
> > > rebalance mechanism would be another patch to improve it further.
> >
> > So, we can't do this without acquiring an access shared lock on every
> > call to vacuum_delay_point() because cost delay is a double.
> >
> > I will work on a patchset with separate commits for reloading the config
> > file, though (with autovac not benefitting in the first commit).
>
> So, I realized we could actually do as you say and have autovac workers
> update their wi_cost_delay and keep the balance changes in a separate
> commit. I've done this in attached v8.
>
> Workers take the exclusive lock to update their wi_cost_delay and
> wi_cost_limit only when there is a config reload. So, there is one
> commit that implements this behavior and a separate commit to revise the
> worker rebalancing.

So, I've attached an alternate version of the patchset which takes the
approach of having one commit which only enables cost-based delay GUC
refresh for VACUUM and another commit which enables it for autovacuum
and makes the changes to balancing variables.

I still think the commit which has workers updating their own
wi_cost_delay in vacuum_delay_point() is a bit weird. It relies on no one
else emulating our bad behavior and reading from wi_cost_delay without a
lock and on no one else deciding to ever write to wi_cost_delay (even
though it is in shared memory [this is the same as master]). It is only
safe because our process is the only one (right now) writing to
wi_cost_delay, so when we read from it without a lock, we know it isn't
being written to. And everyone else takes a lock when reading from
wi_cost_delay right now. So, it seems...not great.
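
To make the hazard concrete, here is a simplified sketch of the
single-writer pattern I'm describing (the names below are made up for
illustration and are not the actual autovacuum structures):

    #include "postgres.h"
    #include "storage/lwlock.h"

    /* Hypothetical stand-in for a worker's shared-memory slot. */
    typedef struct WorkerSlot
    {
        double      cost_delay;     /* written only by the owning worker */
    } WorkerSlot;

    /* Owning worker: updates its own delay under an exclusive lock,
     * e.g. after a config reload. */
    static void
    worker_refresh_delay(WorkerSlot *slot, LWLock *lock, double new_delay)
    {
        LWLockAcquire(lock, LW_EXCLUSIVE);
        slot->cost_delay = new_delay;
        LWLockRelease(lock);
    }

    /* Everyone else: must take (at least) a shared lock, since a double
     * cannot be assumed to be read atomically. */
    static double
    read_other_delay(WorkerSlot *slot, LWLock *lock)
    {
        double      d;

        LWLockAcquire(lock, LW_SHARED);
        d = slot->cost_delay;
        LWLockRelease(lock);
        return d;
    }

    /* The owning worker reads its own slot without the lock only because
     * it is the sole writer -- the fragile assumption described above. */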

This approach also introduces a function that is only around for
one commit until the next commit obsoletes it, which seems a bit silly.

Basically, I think it is probably better to just have one commit
enabling guc refresh for VACUUM and then another which correctly
implements what is needed for autovacuum to do the same.
Attached v9 does this.

I've provided both complete versions of both approaches (v9 and v8).

- Melanie
From 94c08c1b764619ad6cc3a0c75295f416e1863b26 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 25 Mar 2023 11:59:33 -0400
Subject: [PATCH v9 1/4] Zero out VacuumCostBalance

Though it is unlikely to matter, we should zero out VacuumCostBalance
whenever we may be transitioning the state of VacuumCostActive to false.
---
 src/backend/commands/vacuum.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c54360a6a0..7d01df7e48 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -550,6 +550,7 @@ vacuum(List *relations, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		VacuumCostBalance = 0;
 	}
 	PG_END_TRY();
 
-- 
2.37.2

From 3bdf9414c5840bcc94a9c0292f25ec41d2e95b69 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 25 Mar 2023 14:14:55 -0400
Subject: [PATCH v9 4/4] Autovacuum refreshes cost-based delay params more
 often

The previous commit allowed VACUUM to reload the config file more often
so that cost-based delay parameters could take effect while VACUUMing a
relation. Autovacuum, however, did not benefit from this change.

In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without 

Re: Non-superuser subscription owners

2023-03-27 Thread Robert Haas
On Sat, Mar 25, 2023 at 3:16 PM Jeff Davis  wrote:
> On Fri, 2023-03-24 at 09:24 -0400, Robert Haas wrote:
> > I certainly agree that the security model isn't in a reasonable place
> > right now. However, I feel that:
> >
> > (1) adding an extra predefined role
>
> > (2) even adding the connection string security stuff
>
> I don't see how these points are related to the question of whether you
> should commit your non-superuser-subscription-owners patch or logical-
> repl-as-table-owner patch first.

I thought you were asking for those changes to be made before this
patch got committed, so that's what I was responding to. If you're
asking for it not to be committed at all, that's a different
discussion.

> My perspective is that logical replication is an unfinished feature
> with an incomplete design. As I said earlier, that's why I backed away
> from trying to do non-superuser subscriptions as a documented feature:
> it feels like we need to settle some of the underlying pieces first.

I kind of agree with you about the feature itself. Even though the
basic feature works quite well and does something people really want,
there are a lot of loose ends to sort out, and not just about
security. But I also want to make some progress. If there are problems
with what I'm proposing that will make us regret committing things
right before feature freeze, then we shouldn't. But waiting a whole
additional year to see any kind of improvement is not free; these
issues are serious.

> I don't mean to say all of the above issues are blockers or that they
> should all be resolved in my favor. But there are enough issues and
> some of those issues are serious enough that I feel like it's premature
> to just go ahead with the non-superuser subscriptions and the
> predefined role.
>
> There are already users, which complicates things. And you make a good
> point that some important users may be already working around the
> flaws. But there's already a patch and discussion going on for some
> security model improvements (thanks to you), so let's try to get that
> one in first. If we can't, it's probably because we learned something
> important.

I think this patch is a lot better-baked and less speculative than
that one. I think that patch is more important, so if they were
equally mature, I'd favor getting that one committed first. But that's
not the case.

Also, I don't really understand how we could end up not wanting this
patch. I mean there's a lot of things I don't understand that are
still true anyway, so the mere fact that I don't understand how we
could not end up wanting this patch doesn't mean that it couldn't
happen. But like, the current state of play is that subscription
owners are always going to be superusers at the time the subscription
is created, and literally nobody thinks that's a good idea. Some
people (like me) think that we ought to assume that subscription
owners will be and need to be high-privilege users like superusers,
but to my knowledge every such person thinks that it's OK for the
subscription owner to be a non-superuser if they have adequate
privileges. I just think that's a high amount of privileges, not that
it has to be all the privileges i.e. superuser. Other people (like
you, AIUI) think that we ought to try to set things up so that
subscription owners can be low-privilege users, in which case we, once
again, don't want the user who owns the subscription to start out a
superuser. I actually can't imagine anyone defending the idea of
having the subscription owner always be a superuser at the time they
first own the subscription. That's a weird rule that can only serve to
reduce security. Nor can I imagine anyone saying that forcing
subscriptions to be created only by superusers improves security. I
don't think anyone thinks that.

If we're going to delay this patch, probably for a full year, because
of other ongoing discussions, it should be because there is some
outcome of those discussions that would involve deciding that this
patch isn't needed or should be significantly redesigned. If this
patch is going to end up being desirable no matter how those
discussions turn out, and if it's not going to change significantly no
matter how those discussions turn out, then those discussions aren't a
reason not to get it into this release.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-03-27 Thread Andres Freund
Hi,

On 2023-03-25 12:16:35 -0700, Jeff Davis wrote:
> On Fri, 2023-03-24 at 09:24 -0400, Robert Haas wrote:
> > I certainly agree that the security model isn't in a reasonable place
> > right now. However, I feel that:
> > 
> > (1) adding an extra predefined role
> 
> > (2) even adding the connection string security stuff
> 
> I don't see how these points are related to the question of whether you
> should commit your non-superuser-subscription-owners patch or logical-
> repl-as-table-owner patch first.
> 
> 
> My perspective is that logical replication is an unfinished feature
> with an incomplete design.

I agree with that much.


>  As I said earlier, that's why I backed away from trying to do non-superuser
> subscriptions as a documented feature: it feels like we need to settle some
> of the underlying pieces first.

I don't agree. The patch allows using logical rep in a far less dangerous
fashion than now. The alternative is to release 16 without a real way to use
logical rep less insanely, which I think is worse.


> There are some big issues, like the security model for replaying
> changes.

That seems largely unrelated.


> And some smaller issues like feature gaps (RLS doesn't work,
> if I remember correctly, and maybe something with partitioning).

Entirely unrelated?

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2023-03-27 Thread Robert Haas
On Thu, Mar 23, 2023 at 3:06 PM Robert Haas  wrote:
> On Thu, Mar 23, 2023 at 1:34 PM Robert Haas  wrote:
> > OK, let me spend some more time on this and I'll post a patch (or
> > patches) in a bit.
>
> All right, here are some more fixups.

It looks like e88754a1965c0f40a723e6e46d670cacda9e19bd make skink
happy (although Peter Geoghegan has spotted a problem with it, see the
thread that begins with the commit email) so I went ahead and
committed these fixups. Hopefully that won't again make the buildfarm
unhappy, but I guess we'll see.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: zstd compression for pg_dump

2023-03-27 Thread Justin Pryzby
On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
> >> wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> > 
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> > 
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> > 
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an opened question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
> I may be missing something, but why would the patch affect this? Why
> would it even affect safety of the parallel dump? And I don't see any
> changes to the clone stuff ...

zstd supports using threads during compression, with -Z zstd:workers=N.
On Unix, pg_dump forks, so the child processes can't do anything to mess up
the state of the parent process.

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread.  I suspect there's no issue,
but we still ought to verify that before declaring it safe.

> > The function is first checking if it was passed a filename which already
> > has a suffix.  And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix.  The search with stat()
> > doesn't happen if it has a suffix.  I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code.  Another opened question.
> 
> AFAICS it's done this way because of this comment in pg_backup_directory
> 
>  * ...
>  * ".gz" suffix is added to the filenames. The TOC files are never
>  * compressed by pg_dump, however they are accepted with the .gz suffix
>  * too, in case the user has manually compressed them with 'gzip'.
> 
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names.  Maybe it worked differently
in an older release.  Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".

-- 
Justin
>From b4cf9df2f8672b012301d58ff83f2b0344de9e96 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/3] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml |  13 +-
 src/bin/pg_dump/Makefile  |   2 +
 src/bin/pg_dump/compress_io.c |  62 ++-
 src/bin/pg_dump/compress_zstd.c   | 535 ++
 src/bin/pg_dump/compress_zstd.h   |  25 ++
 src/bin/pg_dump/meson.build   |   4 +-
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c |  16 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |  79 +++-
 src/tools/pginclude/cpluspluscheck|   1 +
 11 files changed, 694 insertions(+), 54 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index d6b1faa8042..62b3ed2dad6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
machine-readable format that pg_restore
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed 

Re: meson/msys2 fails with plperl/Strawberry

2023-03-27 Thread Andres Freund
Hi,

On 2023-03-26 21:13:41 -0400, Andrew Dunstan wrote:
> > On Mar 26, 2023, at 5:28 PM, Andres Freund  wrote:
> >> On 2023-03-26 12:39:08 -0700, Andres Freund wrote:
> >> First: I am *not* arguing we shouldn't repair building against strawberry 
> >> perl
> >> with mingw.
> > 
> > Hm - can you describe the failure more - I just tried, and it worked to 
> > build
> > against strawberry perl on mingw, without any issues. All I did was set
> > -DPERL="c:/strawberrly/perl/bin/perl.exe".

> That might be the secret sauce I’m missing. I will be offline for a day or 
> three, will test when I’m back.

It should suffice to put strawberry perl first in PATH. All that the -DPERL
does is to use that, instead of 'perl' from PATH. If putting strawberry perl
ahead in PATH failed, something else must have been going on...

Greetings,

Andres Freund




Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

2023-03-27 Thread Robert Haas
On Sat, Mar 25, 2023 at 6:25 PM Peter Geoghegan  wrote:
> On Fri, Mar 24, 2023 at 8:13 AM Robert Haas  wrote:
> > If we're checking xmin and find that it is invalid (i.e. 0) just
> > report that as corruption, similar to what's already done in the
> > three cases that seem correct. If we're checking xmax and find
> > that's invalid, that's fine: it just means that the tuple hasn't
> > been updated or deleted.
>
> What about aborted speculative insertions? See
> heap_abort_speculative(), which directly sets the speculatively
> inserted heap tuple's xmin to InvalidTransactionId/zero.

Oh, dear. I didn't know about that case.

> It probably does make sense to keep something close to this check --
> it just needs to account for speculative insertions to avoid false
> positive reports of corruption. We could perform cross-checks against
> a tuple whose xmin is InvalidTransactionId/zero to verify that it
> really is from an aborted speculative insertion, to the extent that
> that's possible. For example, such a tuple can't be a heap-only tuple,
> and it can't have any xmax value other than InvalidTransactionId/zero.

Since this was back-patched, I think it's probably better to just
remove the error. We can introduce new validation if we want, but that
should probably be master-only.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: running logical replication as the subscription owner

2023-03-27 Thread Robert Haas
On Sat, Mar 25, 2023 at 5:24 AM Jelte Fennema  wrote:
> For my purposes I always trust the publisher, what I don't trust is
> the table owners. But I can indeed imagine scenarios where that's the
> other way around, and indeed you can protect against that currently,
> but not with your new patch. That seems fairly easily solvable though.
>
> +   if (!member_can_set_role(context->save_userid, userid))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +errmsg("role \"%s\" cannot SET ROLE to \"%s\"",
> +   GetUserNameFromId(context->save_userid, false),
> +   GetUserNameFromId(userid, false;
>
> If we don't throw an error here, but instead simply return, then the
> current behaviour is preserved and people can manually configure
> permissions to protect against an untrusted publisher. This would
> still mean that the table owners can escalate privileges to the
> subscription owner, but if that subscription owner actually has fewer
> privileges than the table owner then you don't have that issue.

I don't get it. If we just return, that would result in skipping
changes rather than erroring out on changes, but it wouldn't preserve
the current behavior, because we'd still care about the table owner's
permissions rather than, as now, the subscription owner's permissions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: doc: add missing "id" attributes to extension packaging page

2023-03-27 Thread Brar Piening

On 23.03.2023 at 20:08, Brar Piening wrote:

Since the need for ids is starting to grow again (ecb696527c added an
id to a varlistentry in doc/src/sgml/ref/create_subscription.sgml)
I've also amended the add-needed-ids patch once again so that the
build does not fail after applying the make_html_ids_discoverable patch.


New add-needed-ids patch since it was outdated again.

Regards,

Brar


diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 71730cc52f..b0546f47b4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1132,7 +1132,7 @@ include_dir 'conf.d'
   
  
 
- 
+ 
   scram_iterations (integer)
   
scram_iterations configuration 
parameter
diff --git a/doc/src/sgml/logical-replication.sgml 
b/doc/src/sgml/logical-replication.sgml
index 3836d13ad3..1432c67c2a 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -252,7 +252,7 @@
additional columns not provided by the published table.  Any such columns
will be filled with the default value as specified in the definition of the
target table. However, logical replication in binary format is more
-   restrictive. See the binary
+   restrictive. See the binary
option of CREATE SUBSCRIPTION for details.
   
 
diff --git a/doc/src/sgml/pgwalinspect.sgml b/doc/src/sgml/pgwalinspect.sgml
index 9a0241a8d6..62d9f9eb22 100644
--- a/doc/src/sgml/pgwalinspect.sgml
+++ b/doc/src/sgml/pgwalinspect.sgml
@@ -157,7 +157,7 @@ combined_size_percentage | 2.8634072910530795
 

 
-   
+   
 
  pg_get_wal_block_info(start_lsn pg_lsn, end_lsn pg_lsn) returns 
setof record
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml 
b/doc/src/sgml/ref/alter_subscription.sgml
index e92346edef..f0ab6a918e 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -178,7 +178,7 @@ ALTER SUBSCRIPTION name RENAME TO <
   origin parameter.
  
  
-  See the binary
+  See the binary
   option of CREATE SUBSCRIPTION for details
   about copying pre-existing data in binary format.
  
diff --git a/doc/src/sgml/ref/create_subscription.sgml 
b/doc/src/sgml/ref/create_subscription.sgml
index 9d4b9d4e33..605b11bc67 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -61,7 +61,7 @@ CREATE SUBSCRIPTION subscription_nameParameters
 
   
-   
+   
 subscription_name
 
  
@@ -70,7 +70,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 CONNECTION 'conninfo'
 
  
@@ -81,7 +81,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 PUBLICATION publication_name [, ...]
 
  
@@ -90,7 +90,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 WITH ( subscription_parameter [= value] [, ... ] )
 
  
@@ -102,7 +102,7 @@ CREATE SUBSCRIPTION subscription_name
 
-   
+   
 connect (boolean)
 
  
@@ -129,7 +129,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 create_slot (boolean)
 
  
@@ -145,7 +145,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 enabled (boolean)
 
  
@@ -156,7 +156,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 slot_name (string)
 
  
@@ -185,7 +185,7 @@ CREATE SUBSCRIPTION subscription_name
 
-   
+   
 binary (boolean)
 
  
@@ -222,7 +222,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 copy_data (boolean)
 
  
@@ -243,7 +243,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 streaming (enum)
 
  
@@ -271,7 +271,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 synchronous_commit (enum)
 
  
@@ -303,7 +303,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 two_phase (boolean)
 
  
@@ -334,7 +334,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 disable_on_error (boolean)
 
  
@@ -346,7 +346,7 @@ CREATE SUBSCRIPTION subscription_name

 
-   
+   
 origin (string)
 
  


Re: running logical replication as the subscription owner

2023-03-27 Thread Robert Haas
On Sat, Mar 25, 2023 at 12:01 PM Noah Misch  wrote:
> (One might allow temp
> tables by introducing NewTempSchemaNestLevel(), called whenever we call
> NewGUCNestLevel().  The transaction would then proceed as though it has no
> temp schema, allocating an additional schema if creating a temp object.)

Neat idea. That would require some adjustments, I believe, because of
the way that temp schemas are named. But it sounds doable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: running logical replication as the subscription owner

2023-03-27 Thread Robert Haas
On Fri, Mar 24, 2023 at 8:02 PM Jeff Davis  wrote:
> Without a reasonable example, we should probably be on some kind of
> path to disallowing crazy stuff in triggers that poses only risks and
> no benefits. Not the job of this patch, but perhaps it can be seen as a
> step in that direction?

Possibly, but it's a little harder to say what's crazy in a trigger
than in some other contexts. I feel like it would be fine to say that
your index expression should probably not be doing "ALTER USER
somebody SUPERUSER" really ever, but it's not quite so clear in case
of a trigger. And the stuff that's prevented by
SECURITY_RESTRICTED_OPERATION isn't that sort of thing, but has rather
to do with stuff that messes with the session state, maybe leaving
booby-traps behind for the next user. For instance, if user A runs
some code as user B and user B closes a cursor opened by A and opens a
new one with the same name, that has a rather good chance of making A
do something they didn't intend to do. SECURITY_RESTRICTED_OPERATION
is aimed at preventing that sort of attack.

> > I am not thrilled with this either, but if you can arrange to run
> > code
> > as a certain user without the ability to SET ROLE to that user then
> > there is, by definition, a potential privilege escalation.
>
> I don't think that's "by definition".
>
> The code is being executed with the privileges of the person who wrote
> it. I don't see anything inherently insecure about that. There could be
> incidental or practical risks, but it's a pretty reasonable thing to do
> at a high level.

Not really. My home directory on my laptop is full of Perl and sh
scripts that I wouldn't want someone else to execute as me. They don't
have any defenses against malicious use because I don't expect anyone
else has access to run them, especially as me. If someone got access
to run them as me, they'd compromise my laptop account in no time at
all. I don't see any reason to believe the situation is any different
inside of a database. People have no reason to harden code unless
someone else is going to have access to run it.

> >  I don't
> > think we can just dismiss that as a non-issue. If the ability of one
> > user to potentially become some other user weren't a problem, we
> > wouldn't need any patch at all. Imagine for example that the table
> > owner has a trigger which doesn't sanitize search_path. The
> > subscription owner can potentially leverage that to get the table
> > owner's privileges.
>
> Can you explain? Couldn't we control the subscription process's search
> path?

There's no place in the code right now where when we switch user
identities we also change the search_path. There is nothing to prevent
us from writing such code, but we have no place from which to obtain a
search_path that will cause the called code to behave properly. We
don't have access to the search_path that would prevail at the time
the target user logged in, and even if we did, we don't know that that
search_path is secure. We do know that an empty search_path is secure,
but it's probably also going to cause any code we run to fail, unless
that code schema-qualifies all references outside of pg_catalog, or
unless it sets search_path itself. search_path also isn't necessarily
the only problem. As a hypothetical example, suppose I create a table
with one text column, revoke all public access to that table, and then
create an on-insert trigger that executes as an SQL command any text
value inserted into the table. This is perfectly secure as long as I'm
the only one who can access the table, but if someone else gets access
to insert things into that table using my credentials then they can
easily break into my account. Real examples aren't necessarily that
dramatically bad, but that doesn't mean they don't exist.

> The benefit of delegating to a non-superuser is to contain the risk if
> that account is compromised. Allowing SET ROLE on tons of accounts
> diminishes that benefit, a lot.

Well, I continue to feel that if you can compromise the subscription
owner's account, we haven't really fixed anything yet. I mean, it used
to be that autovacuum could compromise the superuser's account, and we
fixed that. If we find more ways for that same thing to happen, we
would presumably fix those too. We would never accept a situation
where autovacuum can compromise the superuser's account. And we
shouldn't accept a situation where either the table owner can
compromise the subscription owner's account, either. And similarly
nobody ever proposed that that issue should be fixed by running the
autovacuum worker process as some kind of low-privileged user that we
created specially for that purpose. We just ... fixed it so that no
compromise was possible. And I think that's also the right approach
here.

I do agree with you that allowing SET ROLE on tons of accounts is not
great, though. I don't really think it matters very much today,
because basically all subscriptions today are owned by 

Re: Moving forward with TDE

2023-03-27 Thread Bruce Momjian
On Wed, Mar  8, 2023 at 04:25:04PM -0500, Stephen Frost wrote:
> Agreed, though the latest efforts include an option for *authenticated*
> encryption as well as unauthenticated.  That makes it much more
> difficult to make undetected changes to the data that's protected by
> the authenticated encryption being used.

I thought some more about this.  GCM-style authentication of encrypted
data has value because it assumes the two end points are secure but that
a malicious actor could modify data during transfer.  In the Postgres
case, it seems the two end points and the transfer are all in the same
place.  Therefore, the value of using GCM-style authentication is unclear
to me, because if the GCM level can be modified, so can the end points, and
the encryption key can be exposed.

> There's clearly user demand for it as there's a number of organizations
> who have forks which are providing it in one shape or another.  This
> kind of splintering of the community is actually an actively bad thing
> for the project and is part of what killed Unix, by at least some pretty
> reputable accounts, in my view.

Yes, the number of commercial implementations of this is a concern.  Of
course, it is also possible that those commercial implementations are
meeting checkbox requirements rather than technical ones, and the
community has been hostile to checkbox-only features.

> Certainly agree with you there though there's an overall trajectory of
> patches involved in all of this that's a bit deep.  The plan is to
> discuss that at PGCon (On the Road to TDE) and at the PGCon
> Unconference after.  I certainly hope those interested will be there.
> I'm also happy to have a call with anyone interested in this effort
> independent of that, of course.

I will not be attending Ottawa.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.




Re: zstd compression for pg_dump

2023-03-27 Thread Tomas Vondra
Hi,

On 3/17/23 03:43, Tomas Vondra wrote:
> 
> ...
>
>>> I'm a little suspicious of the replacement of supports_compression()
>>> with parse_compress_specification(). For example:
>>>
 -   errmsg = supports_compression(AH->compression_spec);
 -   if (errmsg)
 +   parse_compress_specification(AH->compression_spec.algorithm,
 +   NULL, _spec);
 +   if (compress_spec.parse_error != NULL)
 {
 pg_log_warning("archive is compressed, but this installation does 
 not support compression (%s
 -   errmsg);
 -   pg_free(errmsg);
 +   compress_spec.parse_error);
 +   pg_free(compress_spec.parse_error);
 }
>>>
>>> The top-level error here is "does not support compression", but wouldn't
>>> a bad specification option with a supported compression method trip this
>>> path too?
>>
>> No - since the 2nd argument is passed as NULL, it just checks whether
>> the compression is supported.  Maybe there ought to be a more
>> direct/clean way to do it.  But up to now evidently nobody needed to do
>> that.
>>
> 
> I don't think the patch can use parse_compress_specification() to replace
> supports_compression(). The parsing simply determines if the
> build has the library, it doesn't say if a particular tool was modified
> to support the algorithm. I might build --with-zstd and yet pg_dump does
> not support that algorithm yet.
> 
> Even after we add zstd to pg_dump, it's quite likely other compression
> algorithms may not be supported by pg_dump from day 1.
> 
> 
> I haven't looked at / tested the patch yet, but I wonder if you have any
> thoughts regarding the size_t / int tweaks. I don't know what types zstd
> library uses, how it reports errors etc.
> 

Any thoughts regarding my comments on removing supports_compression()?

Also, this patch needs a rebase to adapt it to the API changes from last
week. The sooner the better, considering we're getting fairly close to
the end of the CF and code freeze.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Aleksander Alekseev
Hi,

> I guess you didn't set up the hostnames in /etc/hosts as described in
> 004_load_balance_dns.pl. Then it's expected that the loop body isn't
> covered. As discussed upthread, running this test manually is much
> more cumbersome than is desirable, but it's still better than not
> having the test at all, because it is run in CI.

Got it, thanks.

I guess I'm completely out of nitpicks then!

-- 
Best regards,
Aleksander Alekseev




Re: running logical replication as the subscription owner

2023-03-27 Thread Robert Haas
On Fri, Mar 24, 2023 at 4:11 PM Mark Dilger
 wrote:
> > > Imagine for example that the table
> > > owner has a trigger which doesn't sanitize search_path. The
> > > subscription owner can potentially leverage that to get the table
> > > owner's privileges.
>
> I don't find that terribly convincing.  First, there's no reason a 
> subscription owner should be an ordinary user able to volitionally do 
> anything.  The subscription owner should just be a role that the subscription 
> runs under, as a means of superuser dropping privileges before applying 
> changes.  So the only real problem would be that the changes coming from the 
> publisher might, upon application, hack the table owner.  But if that's the 
> case, the table owner's vulnerability on the subscription-database side is 
> equal to their vulnerability on the publication-database side (assuming equal 
> schemas on both).  Flagging this vulnerability as being logical replication 
> related seems a category error.  Instead, it's a schema vulnerability.

I think there are actually a number of reasons why the subscription
owner should be a regular user account rather than a special
low-privilege account. First, it's only barely possible to do anything
else. As of today, you can't create a subscription owned by a
non-superuser except by making the subscription first and then
removing superuser from the account. You can't even do this:

rhaas=# alter subscription s1 owner to alice;
ERROR:  permission denied to change owner of subscription "s1"
HINT:  The owner of a subscription must be a superuser.

That hint is actually a lie because of the loophole mentioned above,
but even if making the subscription owner a low-privilege account were
the right model (which I don't believe) we've got error messages in
the current source code saying that you're not allowed to even do it.

Second, even if this kind of setup were fully supported and all the
stuff worked as you expect, it's not very convenient. It requires you
to create this extra dummy account that doesn't otherwise need to
exist. I don't see a good reason to impose that requirement on
everybody. Given that subscriptions initially could only be owned by
superusers, and that's still mostly true, it seems to me that the
feature was intended to be used with the superuser as the subscription
owner, and I think that's what most people must be doing now and will
probably want to continue to do, and we should try to make it safe
instead of back-pedaling and saying, hey, do it this totally other way
instead.

More discussion of problems below.

> > So I think that if we allow user A to replicate into user B's
> > table with fewer privileges than A-can-set-role-to-B, we're building a
> > privilege-escalation attack into the system. But if we do require
> > A-can-set-role-to-B, then things change as described above.
>
> I don't understand the direction this patch is going.  I'm emphatically not 
> objecting to it, merely expressing my confusion about it.
>
> I had imagined the solution to the replication security problem was to stop 
> running the replication as superuser, and instead as a trivial user.  Imagine 
> that superuser creates roles "deadhead_bob" and "deadhead_alice" which cannot 
> log in, are not members of any groups nor have any other roles as members of 
> themselves, and have no privileges beyond begin able to replicate into bob's 
> and alice's tables, respectively.  The superuser sets up the subscriptions 
> disabled, transfers ownership to deadhead_bob and deadhead_alice, and only 
> then enables the subscriptions.
>
> Since deadhead_bob and deadhead_alice cannot log in, and nobody can set role 
> to them, I don't see what the vulnerability is.  Sure, maybe alice can attack 
> deadhead_alice, or bob can attack deadhead_bob, but that's more of a 
> privilege deescalation than a privilege escalation, so where's the risk?  
> That's not a rhetorical question.  Is there a risk here?  Or are we just 
> concerned that most users will set up replication with superuser or some 
> other high-privilege user, and we're trying to protect them from the 
> consequences of that choice?

Having a separate subscription for each different table owner is
extremely undesirable from a performance perspective, because it means
that the WAL on the origin server has to be decoded once per table
owner. That's not very much fun even if the number of table owners is
only two, as in your example, and there could be many more than two
users who own tables. In addition, it breaks the transactional
consistency of replication. If there's any single transaction that
modifies both a table owned by alice and a table owned by bob, we lose
transactional consistency on the subscriber side unless both of those
transactions are replicated in a single transaction, which means they
also need to be part of a single subscription.

I admit that hacking into deadhead_alice or deadhead_bob is probably
only minimally interesting. There 

Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Daniel Gustafsson



> On 27 Mar 2023, at 16:33, Tom Lane  wrote:
> 
> Julien Rouhaud  writes:
>> On Mon, Mar 27, 2023 at 02:06:34PM +0200, Daniel Gustafsson wrote:
>> On 27 Mar 2023, at 14:04, Dagfinn Ilmari Mannsåker  wrote:
 Doesn't this apply to Apple Silicon generally, not just M1? M2 already
 exists, and M3 etc. will presumably also appear at some point. The
 linked Homebrew issue refers to Apple Silicon, not any specific models.
> 
>>> Thats a good point, it should say Apple Silicon and not M1 specifically.
>>> Thanks, I'll go fix.
> 
>> Ah indeed that's a good point.  Thanks for pushing and fixing!
> 
> Also, this needs to be back-patched, as this same text appears in
> the back branches.

Yeah, it’s on my TODO for tonight when I get back. Since I botched the first 
commit before I had prepped the backbranches I figured I’d give the second some 
time in case that needed an update as well.

./daniel



Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Jelte Fennema
> > ```
> > if (conn->addr == NULL && conn->naddr != 0)
> > ```

Afaict this is not necessary, since getaddrinfo already returns an
error if the host could not be resolved to any addresses. A quick test
gives me this error:
error: could not translate host name "doesnotexist" to address: Name
or service not known

>
> ```
> +}
> +else
> +conn->load_balance_type = LOAD_BALANCE_DISABLE;
> ```
>
> The else branch is never executed.

I don't think that line is coverable then. There are definitely places
in the test suite where load_balance_hosts is not explicitly set. But
even in those cases I guess the argument parsing logic will use
DefaultLoadBalanceHosts instead of NULL as a value for
conn->load_balance_type.

> Strangely enough the body of the for loop is never executed either.
> Apparently only one address is used and there is nothing to shuffle?
>
> Here is the exact command I used to build the code coverage report:

I guess you didn't set up the hostnames in /etc/hosts as described in
004_load_balance_dns.pl. Then it's expected that the loop body isn't
covered. As discussed upthread, running this test manually is much
more cumbersome than is desirable, but it's still better than not
having the test at all, because it is run in CI.




Re: logical decoding and replication of sequences, take 2

2023-03-27 Thread Tomas Vondra



On 3/27/23 03:32, Masahiko Sawada wrote:
> Hi,
> 
> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
>  wrote:
>>
>> I merged the earlier "fixup" patches into the relevant parts, and left
>> two patches with new tweaks (deducing the corrent "WAL" state from the
>> current state read by copy_sequence), and the interlock discussed here.
>>
> 
> Apart from that, how does the publication having sequences work with
> subscribers who are not able to handle sequence changes, e.g. in a
> case where PostgreSQL version of publication is newer than the
> subscriber? As far as I tested the latest patches, the subscriber
> (v15)  errors out with the error 'invalid logical replication message
> type "Q"' when receiving a sequence change. I'm not sure it's sensible
> behavior. I think we should instead either (1) deny starting the
> replication if the subscriber isn't able to handle sequence changes
> and the publication includes that, or (2) not send sequence changes to
> such subscribers.
> 

I agree the "invalid message" error is not great, but it's not clear to
me how to do either (1). The trouble is we don't really know if the
publication contains (or will contain) sequences. I mean, what would
happen if the replication starts and then someone adds a sequence?

For (2), I think that's not something we should do - silently discarding
some messages seems error-prone. If the publication includes sequences,
presumably the user wanted to replicate those. If they want to replicate
to an older subscriber, create a publication without sequences.

Perhaps the right solution would be to check if the subscriber supports
replication of sequences in the output plugin, while attempting to write
the "Q" message. And error-out if the subscriber does not support it.

What do you think?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Tom Lane
Julien Rouhaud  writes:
> On Mon, Mar 27, 2023 at 02:06:34PM +0200, Daniel Gustafsson wrote:
> On 27 Mar 2023, at 14:04, Dagfinn Ilmari Mannsåker  wrote:
>>> Doesn't this apply to Apple Silicon generally, not just M1? M2 already
>>> exists, and M3 etc. will presumably also appear at some point. The
>>> linked Homebrew issue refers to Apple Silicon, not any specific models.

>> Thats a good point, it should say Apple Silicon and not M1 specifically.
>> Thanks, I'll go fix.

> Ah indeed that's a good point.  Thanks for pushing and fixing!

Also, this needs to be back-patched, as this same text appears in
the back branches.

regards, tom lane




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-03-27 Thread Damir Belyalov
Hi!

I made the specified changes and my patch turned out the same as yours. The
performance measurements were the same too.

The only thing left to do is to avoid adding IGNORE_DATATYPE_ERRORS as a
keyword. See how this is done for parameters such as FORCE_NOT_NULL,
FORCE_NULL, and FORCE_QUOTE: they are not in kwlist.h and are not keywords
in gram.y.
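
For illustration, such options arrive as plain DefElems and are matched by
name in ProcessCopyOptions(); a minimal sketch (the option and struct field
names below come from the patch under discussion, not from committed code)
would be:

```c
/* Sketch: recognize the option by name instead of adding a grammar keyword */
else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
{
	if (opts_out->ignore_datatype_errors)
		errorConflictingDefElem(defel, pstate);
	opts_out->ignore_datatype_errors = defGetBoolean(defel);
}
```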

Regards,
Damir Belyalov
Postgres Professional


Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-03-27 Thread Melanie Plageman
On Wed, Mar 22, 2023 at 6:42 PM Michael Paquier  wrote:
>
> On Wed, Mar 22, 2023 at 02:21:12PM -0400, Melanie Plageman wrote:
> > Apologies as I know this docs update has already been committed, but
> > buffers fetched and blocks fetched both feel weird to me. If you have a
> > cache hit, you don't end up really "fetching" anything at all (since
> > pgstat_count_buffer_read() is called before ReadBuffer_common() and we
> > don't know if it is a hit or miss yet). And, I would normally associate
> > fetching with fetching a block into a buffer. It seems like this counter
> > is really reflecting the number of buffers acquired or used.
>
> Well, it is the number of times we've requested a block read, though
> it may not actually be a read if the block was in the cache already.
>
> > This isn't really the fault of this patch since that member was already
> > called blocks_fetched.
>
> The original documentation of these functions added by 46aa77c refers
> to "block fetch requests" and "block requests found in cache", so that
> would not be right either based on your opinion here.  If you find
> "fetch" to be incorrect in this context, here is another idea:
> - "block read requests" for blocks_fetched().
> - "block read requested but actually found in cache" for blocks_hit().

I do like/prefer "block read requests" and
"blocks requested found in cache"
Though, now I fear my initial complaint may have been a bit pedantic.

- Melanie




Re: facing issues in downloading of packages in pgadmin4

2023-03-27 Thread parth ratra
Thank you, I will do that.

On Mon, Mar 27, 2023 at 7:34 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Mon, Mar 27, 2023 at 7:13 PM parth ratra 
> wrote:
> >
> > Hey Everyone,
> >
> > I am Parth Ratra from India majoring in CSE. I have hands-on experience
> with HTML/Vanilla CSS/ JS, Reactjs and CI/CD pipelines through various
> projects. I am proficient in C/C++, Python, and Django as well.
> >
> > I am super excited to work with PostgreSQL and would like to contribute
> to these projects: “GUI representation of monitoring System Activity with
> the system stats Extension in pgAdmin 4” as they match my skills and
> interests. While Setting up I ran into some issues configuring the venv for
> faulty packages.
>
> Hi Parth, we appreciate your interest. There's a separate mailing list
> for pgAdmin Support , you may want to
> reach out to them for quicker help. FYI - you can find all postgres
> mailing lists here https://www.postgresql.org/list/.
>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>


Re: facing issues in downloading of packages in pgadmin4

2023-03-27 Thread Bharath Rupireddy
On Mon, Mar 27, 2023 at 7:13 PM parth ratra  wrote:
>
> Hey Everyone,
>
> I am Parth Ratra from India majoring in CSE. I have hands-on experience with 
> HTML/Vanilla CSS/ JS, Reactjs and CI/CD pipelines through various projects. I 
> am proficient in C/C++, Python, and Django as well.
>
> I am super excited to work with PostgreSQL and would like to contribute to 
> these projects: “GUI representation of monitoring System Activity with the 
> system stats Extension in pgAdmin 4” as they match my skills and interests. 
> While Setting up I ran into some issues configuring the venv for faulty 
> packages.

Hi Parth, we appreciate your interest. There's a separate mailing list
for pgAdmin Support , you may want to
reach out to them for quicker help. FYI - you can find all postgres
mailing lists here https://www.postgresql.org/list/.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Memory leak from ExecutorState context?

2023-03-27 Thread Melanie Plageman
On Thu, Mar 23, 2023 at 2:49 PM Tomas Vondra
 wrote:
> > On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
> >  wrote:
> >> BNJL and/or other considerations are for 17 or even after. In the meantime,
> >> Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with 
> >> other
> >> discussed solutions. No one down vote since then. Melanie, what is your
> >> opinion today on this patch? Did you change your mind as you worked for 
> >> many
> >> months on BNLJ since then?
> >
> > So, in order to avoid deadlock, my design of adaptive hash join/block
> > nested loop hash join required a new parallelism concept not yet in
> > Postgres at the time -- the idea of a lone worker remaining around to do
> > work when others have left.
> >
> > See: BarrierArriveAndDetachExceptLast()
> > introduced in 7888b09994
> >
> > Thomas Munro had suggested we needed to battle test this concept in a
> > more straightforward feature first, so I implemented parallel full outer
> > hash join and parallel right outer hash join with it.
> >
> > https://commitfest.postgresql.org/42/2903/
> >
> > This has been stalled ready-for-committer for two years. It happened to
> > change timing such that it made an existing rarely hit parallel hash
> > join bug more likely to be hit. Thomas recently committed our fix for
> > this in 8d578b9b2e37a4d (last week). It is my great hope that parallel
> > full outer hash join goes in before the 16 feature freeze.
> >
>
> Good to hear this is moving forward.
>
> > If it does, I think it could make sense to try and find committable
> > smaller pieces of the adaptive hash join work. As it is today, parallel
> > hash join does not respect work_mem, and, in some sense, is a bit broken.
> >
> > I would be happy to work on this feature again, or, if you were
> > interested in picking it up, to provide review and any help I can if for
> > you to work on it.
> >
>
> I'm no parallel hashjoin expert, but I'm willing to take a
> look at a rebased patch. I'd however recommend breaking the patch into
> smaller pieces - the last version I see in the thread is ~250kB, which
> is rather daunting, and difficult to review without interruption. (I
> don't remember the details of the patch, so maybe that's not possible
> for some reason.)

Great! I will rebase and take a stab at splitting up the patch into
smaller commits, with a focus on finding pieces that may have standalone
benefits, in the 17 dev cycle.

- Melanie




Re: cataloguing NOT NULL constraints

2023-03-27 Thread Peter Eisentraut

On 15.03.23 23:44, Alvaro Herrera wrote:

Here's v5.  I removed the business of renaming constraints in child
relations: recursing now just relies on matching column names.  Each
column has only one NOT NULL constraint; if you try to add another,
nothing happens.  All in all, this code is pretty similar to how we
handle inheritance of columns, which I think is good.


This patch looks pretty okay to me now.  It matches all the functional 
expectations.


I suggest going through the tests carefully again and make sure all the 
changes are sensible and all the comments are correct.  There are a few 
places where the behavior of tests has changed (intentionally) but the 
surrounding comments don't match anymore, or objects that previously 
weren't created now succeed but then affect following tests.  Also, it 
seems some tests are left over from the first variant of this patch 
(where not-null constraints were converted to check constraints), and 
test names or comments should be updated to the current behavior.


I suppose we don't need any changes in pg_dump, since ruleutils.c 
handles that?


The information schema should be updated.  I think the following views:

- CHECK_CONSTRAINTS
- CONSTRAINT_COLUMN_USAGE
- DOMAIN_CONSTRAINTS
- TABLE_CONSTRAINTS

It looks like these have no test coverage; maybe that could be addressed 
at the same time.






Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-03-27 Thread torikoshia

On 2023-03-23 02:50, Andres Freund wrote:
Thanks again for your review.
Attached v5 patch.

Have you measured whether this has negative performance effects when *NOT*
using the new option?


I loaded 1000 rows of pgbench_accounts on my laptop and compared the 
elapsed time.
GUCs changed from the default are logging_collector = on, 
log_error_verbosity = verbose.


Three tests were run under each condition and the median of them is
listed below:


- patch NOT applied(36f40ce2dc66): 35299ms
- patch applied, without IGNORE_DATATYPE_ERRORS: 34409ms
- patch applied, with IGNORE_DATATYPE_ERRORS: 35510ms

It seems there is no significant degradation.

Also tested the elapsed time when loading data which has some datatype 
error with IGNORE_DATATYPE_ERRORS:

- data has 100 rows of error: 35269ms
- data has 1000 rows of error: 34577ms
- data has 500 rows of error: 48925ms

The case with 500 rows of error consumes much more time, but that seems to
be influenced by logging time.
Here are the test results with log_min_messages and client_min_messages
set to 'error':

- data has 500 data type error: 23972ms
- data has 0 data type error: 34320ms

Now, conversely, when there were many data type errors, it consumed less
time.
This seems like a reasonable result, since the amount of skipped data
increases.



As-is this does not work with FORMAT BINARY - and converting the binary
input functions to support soft errors won't happen for 16. So I think you
need to raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.


Added the option check.


On 2023-03-22 22:34:20 +0900, torikoshia wrote:
> @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
>
>   ExecClearTuple(myslot);
>
> + if (cstate->opts.ignore_datatype_errors)
> + {
> + escontext.details_wanted = true;
> + cstate->escontext = escontext;
> + }

I think it might be worth pulling this out of the loop. That does mean
you'd have to reset escontext.error_occurred after an error, but that
doesn't seem too bad, you need to do other cleanup anyway.

Pulled this out of the loop and added a step for resetting escontext.


> @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
> 			values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
> 		}
> 		else
> -			values[m] = InputFunctionCall(&in_functions[m],
> -										  string,
> -										  typioparams[m],
> -										  att->atttypmod);
> +			/* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
> +			if (!InputFunctionCallSafe(&in_functions[m],
> +									   string,
> +									   typioparams[m],
> +									   att->atttypmod,
> +									   (Node *) &cstate->escontext,
> +									   &values[m]))
> +			{
> +				cstate->ignored_errors++;
> +
> +				ereport(WARNING,
> +						errmsg("%s", cstate->escontext.error_data->message));

That isn't right - you lose all the details of the message. As is you'd
also leak the error context.

I think the best bet for now is to do something like
/* adjust elevel so we don't jump out */
cstate->escontext.error_data->elevel = WARNING;
/* despite the name, this won't raise an error if elevel < ERROR */
ThrowErrorData(cstate->escontext.error_data);


As I mentioned in a previous email, I added the above code for now.

I wonder if we ought to provide a wrapper for this? It could e.g. know to
mention the original elevel and such?


I don't think NextCopyFrom() is the right place to emit this warning - it
e.g. is also called from file_fdw.c, which might want to do something else
with the error. From a layering POV it seems cleaner to do this in
CopyFrom(). You already have a check for escontext.error_occurred there
anyway.


Agreed.


> @@ -3378,6 +3378,10 @@ copy_opt_item:
>   {
>   $$ = makeDefElem("freeze", (Node *) 
makeBoolean(true), @1);
>   }
> + | IGNORE_DATATYPE_ERRORS
> + {
> + $$ = 

facing issues in downloading of packages in pgadmin4

2023-03-27 Thread parth ratra
Hey Everyone,

I am Parth Ratra from India majoring in CSE. I have hands-on experience
with HTML/Vanilla CSS/JS, Reactjs and CI/CD pipelines through various
projects. I am proficient in C/C++, Python, and Django as well.

I am super excited to work with PostgreSQL and would like to contribute to
these projects: “GUI representation of monitoring System Activity with
the system stats Extension in pgAdmin 4” as they match my skills and
interests. While setting up I ran into some issues configuring the venv for
faulty packages.


Regards,


Parth Ratra


Re: Infinite Interval

2023-03-27 Thread Ashutosh Bapat
On Sat, Mar 25, 2023 at 9:13 PM Joseph Koshakow  wrote:
>
> On Fri, Mar 24, 2023 at 9:43 AM Ashutosh Bapat  
> wrote:
> >
> >You don't need to do this, but looks like we can add DAYS_PER_WEEK macro 
> > and
> >use it here.
>
> I've attached a patch with this new macro. There's probably tons of
> places it can be used instead of hardcoding the number 7, but I'll save
> that for a future patch.

Thanks. Yes, changing other existing usages is out of scope for this patch.

Looks good to me.
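
For readers following along, the change essentially boils down to this (the
call site below is illustrative, not quoted from the patch):

```c
/* New macro, plus one example of replacing a hard-coded 7 */
#define DAYS_PER_WEEK	7

itm_in->tm_mday += val * DAYS_PER_WEEK;		/* was: val * 7 */
```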

-- 
Best Wishes,
Ashutosh Bapat




Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Aleksander Alekseev
Hi,

> ```
> +ret = store_conn_addrinfo(conn, addrlist);
> +pg_freeaddrinfo_all(hint.ai_family, addrlist);
> +if (ret)
> +goto error_return;/* message already logged */
> ```
> The goto path is not test-covered.

D'oh, this one is fine since store_conn_addrinfo() is going to fail
only when we are out of memory.

-- 
Best regards,
Aleksander Alekseev




Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Aleksander Alekseev
Hi,

> So I think it should be:
>
> ```
> if (conn->addr == NULL && conn->naddr != 0)
> ```
>
> [...]
>
> I will take a look at v16 now.

The code coverage could be slightly better.

In v16-0001:

```
+ret = store_conn_addrinfo(conn, addrlist);
+pg_freeaddrinfo_all(hint.ai_family, addrlist);
+if (ret)
+goto error_return;/* message already logged */
```

The goto path is not test-covered.

In v16-0002:

```
+}
+else
+conn->load_balance_type = LOAD_BALANCE_DISABLE;
```

The else branch is never executed.

```
 if (ret)
 goto error_return;/* message already logged */

+/*
+ * If random load balancing is enabled we shuffle the addresses.
+ */
+if (conn->load_balance_type == LOAD_BALANCE_RANDOM)
+{
+/*
+ * This is the "inside-out" variant of the Fisher-Yates shuffle
[...]
+ */
+for (int i = 1; i < conn->naddr; i++)
+{
+intj =
pg_prng_uint64_range(>prng_state, 0, i);
+AddrInfotemp = conn->addr[j];
+
+conn->addr[j] = conn->addr[i];
+conn->addr[i] = temp;
+}
+}
```

Strangely enough the body of the for loop is never executed either.
Apparently only one address is used and there is nothing to shuffle?

Here is the exact command I used to build the code coverage report:

```
git clean -dfx && meson setup --buildtype debug -Db_coverage=true
-Dcassert=true -DPG_TEST_EXTRA="kerberos ldap ssl load_balance"
-Dldap=disabled -Dssl=openssl -Dtap_tests=enabled
-Dprefix=/home/eax/projects/pginstall build && ninja -C build &&
PG_TEST_EXTRA=1 meson test -C build && ninja -C build coverage-html
```

I'm sharing this for the sake of completeness. I don't have a strong
opinion on whether we should bother with covering every new line of
code with tests.

Except for the named nitpicks v16 looks good to me.

-- 
Best regards,
Aleksander Alekseev




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Julien Rouhaud
On Mon, Mar 27, 2023 at 02:06:34PM +0200, Daniel Gustafsson wrote:
> > On 27 Mar 2023, at 14:04, Dagfinn Ilmari Mannsåker  
> > wrote:
> > Daniel Gustafsson  writes:
>
> >> Applied with a tiny but of changes to make it look like the rest of the
> >> paragraph more. Thanks!
> >
> > Doesn't this apply to Apple Silicon generally, not just M1? M2 already
> > exists, and M3 etc. will presumably also appear at some point. The
> > linked Homebrew issue refers to Apple Silicon, not any specific models.
>
> Thats a good point, it should say Apple Silicon and not M1 specifically.
> Thanks, I'll go fix.

Ah indeed that's a good point.  Thanks for pushing and fixing!




Re: Support logical replication of DDLs

2023-03-27 Thread Amit Kapila
On Mon, Mar 27, 2023 at 12:07 PM Amit Kapila  wrote:
>
> On Mon, Mar 27, 2023 at 2:52 AM Tom Lane  wrote:
> >
>
> > I suggest taking a couple of steps back from the minutiae of the
> > patch, and spending some hard effort thinking about how the thing
> > would be controlled in a useful fashion (that is, a real design for
> > the filtering that was mentioned at the very outset), and about the
> > security issues, and about how we could get to a committable patch.
> >
>
> Agreed. I'll try to summarize the discussion we have till now on this
> and share my thoughts on the same in a separate email.
>

The idea to control what could be replicated is to introduce a new
publication option 'ddl' along with current options 'publish' and
'publish_via_partition_root'. The values of this new option could be
'table', 'function', 'all', etc. Here 'all' enables the replication of
all supported DDL commands. Example usage for this would be:
Create a new publication with all ddl replication enabled:
  CREATE PUBLICATION pub1 FOR ALL TABLES with (ddl = 'all');

Enable table ddl replication for an existing Publication:
  ALTER PUBLICATION pub2 SET (ddl = 'table');

This is what seems to have been discussed, but I think we can even
extend it to filter based on operations/commands, say if one would like
to publish only 'create' and 'drop' of tables. Then we could extend the
existing publish option to accept values like 'create', 'alter', and
'drop'.

Another thing we are considering is at what level these additional
options should be specified. We have three variants that enable
replication: FOR TABLE, FOR ALL TABLES, and FOR TABLES IN SCHEMA. For the
sake of simplicity, this new option has so far been discussed as being
provided only with the FOR ALL TABLES variant, but I think we can provide
it with the other variants with some additional restrictions; for example,
with FOR TABLE we could only allow 'alter' and 'drop' for the publish
option. Though possible, supporting it with variants other than FOR ALL
TABLES brings additional complexity, because then we need to ensure
additional filtering and possibly modify the content we send downstream.
So, we could even decide to first support it only for the FOR ALL TABLES
variant.

The other point to consider for the publish option 'ddl = table' is
whether we need to allow replicating dependent objects, say when a
user-defined type is used in the table. I guess the difficulty here
would be to identify which dependents we want to allow.

I think in the first version we should allow replicating only some of
the objects instead of everything. For example, can we consider allowing
only tables and indexes in the first version, and then extend it in a
phased manner?

AFAICR, we have discussed two things related to security. (a)
Ownership of objects created via DDL replication: we have discussed
providing an option at the subscription level to allow objects to keep
the same ownership (as on the publisher) after being applied on the
subscriber. If that option is not enabled, the objects will be owned by
the subscription owner. (b) Allowing replicated functions to be used
even if they don't schema-qualify objects: currently, we override the
search_path in the apply worker to an empty string to ensure that the
apply worker doesn't execute arbitrary expressions, as it works with the
privileges of the subscription owner, which would be a superuser. We have
discussed providing a search_path as an option at the subscription level,
or a GUC, to allow apply workers to use a specified search_path.

Do you have anything else in mind?

-- 
With Regards,
Amit Kapila.




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Daniel Gustafsson
> On 27 Mar 2023, at 14:04, Dagfinn Ilmari Mannsåker  wrote:
> Daniel Gustafsson  writes:

>> Applied with a tiny but of changes to make it look like the rest of the
>> paragraph more. Thanks!
> 
> Doesn't this apply to Apple Silicon generally, not just M1? M2 already
> exists, and M3 etc. will presumably also appear at some point. The
> linked Homebrew issue refers to Apple Silicon, not any specific models.

That's a good point, it should say Apple Silicon and not M1 specifically.
Thanks, I'll go fix.

--
Daniel Gustafsson





Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Dagfinn Ilmari Mannsåker
Daniel Gustafsson  writes:

>> On 27 Mar 2023, at 10:41, Julien Rouhaud  wrote:
>> On Mon, Mar 27, 2023 at 10:32:52AM +0200, Daniel Gustafsson wrote:
 On 27 Mar 2023, at 10:24, Julien Rouhaud  wrote:
>
 FTR the documented XML_CATALOG_FILES environment variable is only valid for
 Intel based machines, as homebrew installs everything in a different 
 location
 for M1...
 
 I'm attaching a patch to make that distinction, hoping that no one else 
 will
 have to waste time trying to figure out how to get it working on such 
 hardware.
>>> 
>>> LGTM apart from the double // in the export which is easy enough to fix 
>>> before
>>> pushing.
>>> 
>>> +export XML_CATALOG_FILES=/opt/homebrew//etc/xml/catalog
>> 
>> Oh, I didn't notice it.  Apparently apple's find isn't smart enough to trim 
>> a /
>> when fed with a directory with a trailing /
>
> Applied with a tiny but of changes to make it look like the rest of the
> paragraph more. Thanks!

Doesn't this apply to Apple Silicon generally, not just M1? M2 already
exists, and M3 etc. will presumably also appear at some point. The
linked Homebrew issue refers to Apple Silicon, not any specific models.

- ilmari




Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Aleksander Alekseev
Hi,

> > ▶ 6/6 - received at least one connection on node1   OK
> > ▶ 6/6 - received at least one connection on node2   OK
> > ▶ 6/6 - received at least one connection on node3   OK
> > ▶ 6/6 - received 50 connections across all nodesOK
>
> Good point.
>
> > Finally, I changed a few small typos in your updated commit message
> > (some of which originated from my earlier commit messages)
>
> +1

Hi,

> I would like to see this wrapped up in the current CF, what do you think about
> the attached?

In v15-0001:

```
+conn->addr = calloc(conn->naddr, sizeof(AddrInfo));
+if (conn->addr == NULL)
+{
+libpq_append_conn_error(conn, "out of memory");
+return 1;
+}
```

According to the man pages, in a corner case when naddr is 0 calloc
can return NULL which will not indicate an error.

So I think it should be:

```
if (conn->addr == NULL && conn->naddr != 0)
```

Other than that v15 looked very good. It was checked on Linux and
MacOS including running it under sanitizer.

I will take a look at v16 now.

-- 
Best regards,
Aleksander Alekseev




Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Daniel Gustafsson
> On 27 Mar 2023, at 13:50, Jelte Fennema  wrote:
> 
> Looks good overall. I attached a new version with a few small changes:
> 
>>  * Changed store_conn_addrinfo to return int like how all the functions
>>dealing with addrinfo does.  Also moved the error reporting to inside 
>> there
>>where the error happened.
> 
> I don't feel strong about the int vs bool return type. The existing
> static libpq functions are a bit of a mixed bag around this, so either
> way seems fine to me. And moving the log inside the function seems
> fine too. But it seems you accidentally removed the "goto
> error_return" part as well, so now we're completely ignoring the
> allocation failure. The attached patch fixes that.

Ugh, thanks.  I had a conflict here when rebasing with the load balancing
commit in place and clearly fat-fingered that one.

>> +ok($node1_occurences > 1, "expected at least one execution on node1, found 
>> none");
>> +ok($node2_occurences > 1, "expected at least one execution on node2, found 
>> none");
>> +ok($node3_occurences > 1, "expected at least one execution on node3, found 
>> none");
> 
> I changed the message to be a description of the expected case,
> instead of the failure case. This is in line with the way these
> messages are used in other tests, and indeed seems like the correct
> way because you get output from "meson test -v postgresql:libpq /
> libpq/003_load_balance_host_list" like this:
> ▶ 6/6 - received at least one connection on node1   OK
> ▶ 6/6 - received at least one connection on node2   OK
> ▶ 6/6 - received at least one connection on node3   OK
> ▶ 6/6 - received 50 connections across all nodesOK

Good point.

> Finally, I changed a few small typos in your updated commit message
> (some of which originated from my earlier commit messages)

+1

--
Daniel Gustafsson





Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Jelte Fennema
Looks good overall. I attached a new version with a few small changes:

>   * Changed store_conn_addrinfo to return int like how all the functions
> dealing with addrinfo does.  Also moved the error reporting to inside 
> there
> where the error happened.

I don't feel strong about the int vs bool return type. The existing
static libpq functions are a bit of a mixed bag around this, so either
way seems fine to me. And moving the log inside the function seems
fine too. But it seems you accidentally removed the "goto
error_return" part as well, so now we're completely ignoring the
allocation failure. The attached patch fixes that.

>+ok($node1_occurences > 1, "expected at least one execution on node1, found 
>none");
>+ok($node2_occurences > 1, "expected at least one execution on node2, found 
>none");
>+ok($node3_occurences > 1, "expected at least one execution on node3, found 
>none");

I changed the message to be a description of the expected case,
instead of the failure case. This is in line with the way these
messages are used in other tests, and indeed seems like the correct
way because you get output from "meson test -v postgresql:libpq /
libpq/003_load_balance_host_list" like this:
▶ 6/6 - received at least one connection on node1   OK
▶ 6/6 - received at least one connection on node2   OK
▶ 6/6 - received at least one connection on node3   OK
▶ 6/6 - received 50 connections across all nodesOK

Finally, I changed a few small typos in your updated commit message
(some of which originated from my earlier commit messages)


v16-0001-Copy-and-store-addrinfo-in-libpq-owned-private-m.patch
Description: Binary data


v16-0002-Support-connection-load-balancing-in-libpq.patch
Description: Binary data


Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-27 Thread Alexander Korotkov
Hi!

On Fri, Mar 24, 2023 at 3:39 AM Andres Freund  wrote:
> On 2023-03-23 23:24:19 +0300, Alexander Korotkov wrote:
> > On Thu, Mar 23, 2023 at 8:06 PM Andres Freund  wrote:
> > > I seriously doubt that solving this at the tuple locking level is the 
> > > right
> > > thing. If we want to avoid refetching tuples, why don't we add a 
> > > parameter to
> > > delete/update to generally put the old tuple version into a slot, not 
> > > just as
> > > an optimization for a subsequent lock_tuple()? Then we could remove all
> > > refetching tuples for triggers. It'd also provide the basis for adding 
> > > support
> > > for referencing the OLD version in RETURNING, which'd be quite powerful.

After some thought, I think I like the idea of fetching the old tuple
version in update/delete.  Anything that avoids extra tuple fetching and
does more of the related work in a single table AM call makes the table AM
API more flexible.

I'm working on a patch implementing this.  I'm going to post it later today.
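
To make the direction a bit more concrete, here is a sketch of the kind of
signature change I mean (the extra parameter is my assumption about the shape
of the patch, not its actual API):

```c
/*
 * Sketch: let the delete path hand back the old tuple version in a slot,
 * instead of requiring callers (e.g. trigger code) to refetch it by TID.
 */
TM_Result
table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
				   Snapshot snapshot, Snapshot crosscheck, bool wait,
				   TM_FailureData *tmfd, bool changingPart,
				   TupleTableSlot *oldSlot);	/* new: receives old version */
```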

--
Regards,
Alexander Korotkov




Re: CREATE INDEX CONCURRENTLY on partitioned index

2023-03-27 Thread Alexander Pyhalov

Justin Pryzby писал 2023-03-26 17:51:

On Sun, Dec 04, 2022 at 01:09:35PM -0600, Justin Pryzby wrote:

This currently handles partitions with a loop around the whole CIC
implementation, which means that things like WaitForLockers() happen
once for each index, the same as REINDEX CONCURRENTLY on a partitioned
table.  Contrast that with ReindexRelationConcurrently(), which 
handles
all the indexes on a table in one pass by looping around indexes 
within

each phase.


Rebased over the progress reporting fix (27f5c712b).

I added a list of (intermediate) partitioned tables, rather than 
looping

over the list of inheritors again, to save calling rel_get_relkind().

I think this patch is done.


Hi.

Overall looks good to me. However, I think that using 'partitioned' as a
list of partitioned index OIDs in DefineIndex() is a bit misleading -
previously we just used it as a boolean, specifying whether we are dealing
with a partitioned relation.


--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Generate pg_stat_get_xact*() functions with Macros

2023-03-27 Thread Michael Paquier
On Mon, Mar 27, 2023 at 10:11:21AM +0200, Drouvot, Bertrand wrote:
> Please find attached V2 adding pg_stat_get_xact_function_total_time()
> and pg_stat_get_xact_function_self_time() to the party.

The patch has one mistake: PG_STAT_GET_XACT_FUNCENTRY_FLOAT8_MS does
not need a backslash on its last line, or it would include the next, empty
line.  This could lead to mistakes (no need to send a new patch just
for that).
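
To illustrate the hazard with a contrived example (this is not the actual
macro):

```c
/*
 * The trailing backslash on the last line continues the definition onto the
 * following line.  If the blank line below is ever removed, the next
 * declaration silently becomes part of the macro's replacement text.
 */
#define TWICE(x) \
	((x) * 2) \

static int	unrelated_counter = 0;
```
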
--
Michael


signature.asc
Description: PGP signature


Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-03-27 Thread Daniel Gustafsson
> On 27 Mar 2023, at 10:41, Julien Rouhaud  wrote:
> On Mon, Mar 27, 2023 at 10:32:52AM +0200, Daniel Gustafsson wrote:
>>> On 27 Mar 2023, at 10:24, Julien Rouhaud  wrote:

>>> FTR the documented XML_CATALOG_FILES environment variable is only valid for
>>> Intel based machines, as homebrew installs everything in a different 
>>> location
>>> for M1...
>>> 
>>> I'm attaching a patch to make that distinction, hoping that no one else will
>>> have to waste time trying to figure out how to get it working on such 
>>> hardware.
>> 
>> LGTM apart from the double // in the export which is easy enough to fix 
>> before
>> pushing.
>> 
>> +export XML_CATALOG_FILES=/opt/homebrew//etc/xml/catalog
> 
> Oh, I didn't notice it.  Apparently apple's find isn't smart enough to trim a 
> /
> when fed with a directory with a trailing /

Applied with a tiny bit of changes to make it look more like the rest of
the paragraph. Thanks!

--
Daniel Gustafsson





JsonPath version bits

2023-03-27 Thread Nikita Malakhov
Hi hackers!

I've got a question on the JsonPath header - currently the header size
is 4 bytes, holding both version and mode bits. Is there a defined size
of the version part somewhere? There are some extensions working with
JsonPath, and we have some too, so it is important to know how many bits
can be used to store a version value.
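
For context, if I read src/include/utils/jsonpath.h correctly (quoting from
memory, so worth double-checking against the tree), the 4-byte header
currently reserves only the top bit for the lax/strict mode flag:

```c
/* From jsonpath.h, as I understand it (please verify) */
#define JSONPATH_VERSION	(0x01)			/* current version value */
#define JSONPATH_LAX		(0x80000000)	/* mode flag in the top bit */
```

So it looks like everything below the top bit is in principle available for
the version, but I'd like to know whether some of those bits are reserved.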

Thanks!

--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: [EXTERNAL] Support load balancing in libpq

2023-03-27 Thread Daniel Gustafsson
> On 17 Mar 2023, at 09:50, Jelte Fennema  wrote:
> 
>> The documentation lists the modes disabled and random, but I wonder if it's
>> worth expanding the docs to mention that "disabled" is pretty much a round
>> robin load balancing scheme?  It reads a bit odd to present load balancing
>> without a mention of round robin balancing given how common it is.
> 
> I think you misunderstood what I meant in that section, so I rewrote
> it to hopefully be clearer. Because disabled really isn't the same as
> round-robin.

Thinking more about it, I removed that section since it adds more confusion
than it resolves, I think.  It would be interesting to make it a true
round-robin with some form of locally stored pointer to the last connection,
but that's for future hacking.

>> -#ifndef WIN32
>> +/* MinGW has sys/time.h, but MSVC doesn't */
>> +#ifndef _MSC_VER
>> #include 
>> This seems unrelated to the patch in question, and should be a separate 
>> commit IMO.
> 
> It's not really unrelated. This only started to be needed because
> libpq_prng_init calls gettimeofday . That did not work on MinGW
> systems. Before this patch libpq was never calling gettimeofday. So I
> think it makes sense to leave it in the commit.

Gotcha.

>> A test
>> which requires root-permission-level manual system changes stands a very low
>> chance of ever being executed, and as such will equate to dead code that may
>> easily be broken or subtly broken.
> 
> While I definitely agree that it makes it hard to execute, I don't
> think that means it will be executed nearly as few times as you
> suggest. Maybe you missed it, but I modified the .cirrus.yml file to
> configure the hosts file for both Linux and Windows runs. So, while I
> agree it is unlikely to be executed manually by many people, it would
> still be run on every commit fest entry (which should capture most
> issues that I can imagine could occur).

I did see it was used in the CI since the jobs there are containerized; what
I'm less happy about is that we won't be able to test this in the BF.  That
being said, not having the test at all would mean even less testing, so in the
end I agree that including it is the least bad option.  Longer term I would
like to rework it into something that relies less on manual setup, but I have
no good ideas right now.

I've played around some more with this and came up with the attached v15 which
I think is close to the final state.  The changes I've made are:

  * Added the DNS test back into the main commit.
  * Fixed a few incorrect comments in the tests which still referred to how
the test worked previously.
  * The check against PG_TEST_EXTRA is now performed before any processing is
done.
  * Reworked the check for the hosts file content to try to make it a bit
more robust.
  * Changed store_conn_addrinfo to return int, like all the other functions
dealing with addrinfo do.  Also moved the error reporting to where the error
actually happens.
  * Made the prng init function void as it always returned true anyway.
  * Minor comment and docs tweaking.
  * Removed the change to geqo; while I don't think it's incorrect, it hardly
seems worth the churn.
  * Commit messages are reworded.

I would like to see this wrapped up in the current CF; what do you think about
the attached?

--
Daniel Gustafsson



v15-0002-Support-connection-load-balancing-in-libpq.patch
Description: Binary data


v15-0001-Copy-and-store-addrinfo-in-libpq-owned-private-m.patch
Description: Binary data


Re: awkward cancellation of parallel queries on standby.

2023-03-27 Thread Kyotaro Horiguchi
At Sun, 26 Mar 2023 11:12:48 -0400, Jeff Janes  wrote in 
> I don't know if this is actually a problem.  It isn't for me as I don't do
> this kind of thing outside of testing, but it seems untidy and I can see it
> being frustrating from a catch-and-retry perspective and from a log-spam
> perspective.
> 
> It looks like the backend gets signalled by the startup process, and then
> it signals the postmaster to signal the parallel workers, and then they
> ignore it for quite a long time (tens to hundreds of ms).  By the time they
> get around to responding, someone has decided to escalate things.  Which
> doesn't seem to be useful, because no one can do anything until the workers
> respond anyway.

I believe you are seeing autovacuum_naptime as the latency, since the
killed backend is running a busy query.  It seems to me that the
signals get processed pretty much instantly in most cases.  Detection
can take longer if a session is sitting idle in a transaction, but
that's just how we deal with that situation.  There could also be a
delay when the system load is pretty high, but that's not really our
concern unless messages start going missing irregularly.

> This behavior seems to go back a long way, but the propensity for both
> messages to show up at the same time vs. in different round-trips changes
> from version to version.
> 
> Is this something we should do something about?

I can't say for certain about the version dependency, but the latency
you mentioned doesn't really seem to be an issue, so we don't need to
worry about it.  Regarding session cancellation, taking action might be
an option.  However, even if we check the transaction status in
PostgresMain, there is still a possibility of cancellation if a
conflict arrives right before the session, reading the next command,
ends the ongoing transaction.  Although we might prevent cancellations
in those final moments, it seems like things could get complicated.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



