Re: Logical replication - schema change not invalidating the relation cache

2022-03-11 Thread vignesh C
On Fri, Dec 3, 2021 at 3:21 PM vignesh C  wrote:
>
> On Fri, Dec 3, 2021 at 1:13 PM Michael Paquier  wrote:
> >
> > On Thu, Aug 26, 2021 at 09:00:39PM +0530, vignesh C wrote:
> > > The previous patch was failing because of the recent test changes made
> > > by commit 201a76183e2 which unified new and get_new_node, attached
> > > patch has the changes to handle the changes accordingly.
> >
> > Please note that the CF app is complaining about this patch, so a
> > rebase is required.  I have moved it to next CF, waiting on author,
> > for now.
>
> Thanks for letting me know, I have rebased it on top of HEAD, the
> attached v2 version has the rebased changes.

The patch was not applying on top of HEAD; attached is a v3 version
which has the rebased changes.
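
For reference, the failure mode this fixes can be exercised with a
sequence along these lines (illustrative SQL modelled on the TAP test
below; connection strings and object names are placeholders):

-- Publisher:
CREATE SCHEMA sch1;
CREATE TABLE sch1.t1 (c1 int);
CREATE PUBLICATION pub1 FOR TABLE sch1.t1;

-- Subscriber (a matching table must exist):
CREATE SCHEMA sch1;
CREATE TABLE sch1.t1 (c1 int);
CREATE SUBSCRIPTION sub1 CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1;

-- Publisher: rename the schema while the walsender is active, then
-- keep inserting into the renamed table.
ALTER SCHEMA sch1 RENAME TO sch2;
INSERT INTO sch2.t1 VALUES (1);

Without a NAMESPACEOID invalidation callback, the walsender's relation
sync cache can keep using stale schema information for sch2.t1.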

Regards,
Vignesh
From ad79b95e9362239e32cc597aeac1d6e22aaa7504 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Sat, 12 Mar 2022 13:04:55 +0530
Subject: [PATCH v3] Fix for invalidating logical replication relations when
 there is a change in schema.

When a schema was changed, the relation sync cache was not being
invalidated. Fix this by registering an invalidation callback for
schema changes.
---
 src/backend/replication/pgoutput/pgoutput.c | 44 
 src/test/subscription/t/001_rep_changes.pl  | 74 +
 2 files changed, 118 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ea57a0477f..2ce634a90b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -176,6 +176,8 @@ static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 		  uint32 hashvalue);
+static void rel_sync_cache_namespace_cb(Datum arg, int cacheid,
+		uint32 hashvalue);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 			TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -1658,6 +1660,9 @@ init_rel_sync_cache(MemoryContext cachectx)
 	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
   rel_sync_cache_publication_cb,
   (Datum) 0);
+	CacheRegisterSyscacheCallback(NAMESPACEOID,
+  rel_sync_cache_namespace_cb,
+  (Datum) 0);
 }
 
 /*
@@ -1989,6 +1994,45 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	}
 }
 
+/*
+ * Namespace syscache invalidation callback
+ */
+static void
+rel_sync_cache_namespace_cb(Datum arg, int cacheid, uint32 hashvalue)
+{
+	HASH_SEQ_STATUS status;
+	RelationSyncEntry *entry;
+
+	/*
+	 * We can get here if the plugin was used in SQL interface as the
+	 * RelSchemaSyncCache is destroyed when the decoding finishes, but there
+	 * is no way to unregister the relcache invalidation callback.
+	 */
+	if (RelationSyncCache == NULL)
+		return;
+
+	hash_seq_init(&status, RelationSyncCache);
+	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
+	{
+		/*
+		 * Reset schema sent status as the relation definition may have changed.
+		 * Also free any objects that depended on the earlier definition.
+		 */
+		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+
+		if (entry->attrmap)
+			free_attrmap(entry->attrmap);
+		entry->attrmap = NULL;
+
+		if (hash_search(RelationSyncCache,
+(void *) &entry->relid,
+HASH_REMOVE, NULL) == NULL)
+			elog(ERROR, "hash table corrupted");
+	}
+}
+
 /*
  * Publication relation/schema map syscache invalidation callback
  *
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index af0cff6a30..4b7b9e54ff 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -520,6 +520,80 @@ is($result, qq(0), 'check replication origin was dropped on subscriber');
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
 
+# Test schema invalidation by renaming the schema
+# Create tables on publisher
+# Initialize publisher node
+my $node_publisher1 = PostgreSQL::Test::Cluster->new('publisher1');
+$node_publisher1->init(allows_streaming => 'logical');
+$node_publisher1->start;
+
+# Create subscriber node
+my $node_subscriber1 = PostgreSQL::Test::Cluster->new('subscriber1');
+$node_subscriber1->init(allows_streaming => 'logical');
+$node_subscriber1->start;
+
+my $publisher1_connstr = $node_publisher1->connstr . ' dbname=postgres';
+
+$node_publisher1->safe_psql('postgres', "CREATE SCHEMA sch1");
+$node_publisher1->safe_psql('postgres', "CREATE TABLE sch1.t1 (c1 int)");
+
+# Create tables on subscriber
+$node_subscriber1->safe_psql('postgres', "CREATE SCHEMA sch1");
+$node_subscriber1->safe_psql('postgres', "CREATE TABLE sch1.t1 (c1 int)");
+$node_subscriber1->safe_psql('postgres', "CREATE SCHEMA sch2");
+$node_subscriber1->safe_psql('postgres', "CREATE TABLE sch2.t1 (c1 int)");
+
+# 

Re: Printing backtrace of postgres processes

2022-03-11 Thread vignesh C
On Wed, Mar 9, 2022 at 9:26 PM Justin Pryzby  wrote:
>
> rebased to appease cfbot.
>
> + couple of little fixes as 0002.

Thanks for rebasing and fixing a few issues. I have taken all your
changes except the mcxtfuncs changes, as those were not made as part of
this patch. The attached v19 patch includes the rest of the changes.
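
As a quick usage sketch (illustrative; superuser only, and the server
must have been built with backtrace support), the function can be
pointed at another backend's PID taken from pg_stat_activity:

SELECT pg_log_backtrace(pid)
  FROM pg_stat_activity
 WHERE backend_type = 'client backend'
   AND pid <> pg_backend_pid();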

Regards,
Vignesh
From baff4bd9331eff3d3d021724847c2fb89d4bcdcf Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Tue, 25 Jan 2022 08:21:22 +0530
Subject: [PATCH v19] Add function to log the backtrace of the specified
 postgres process.

This commit adds a pg_log_backtrace() function that requests logging of
the backtrace of the specified backend or auxiliary process, except for
the logger and statistics collector.

Only superusers are allowed to request logging of a backtrace, because
allowing any user to issue this request at an unbounded rate would
produce a large number of log messages, which could lead to a denial of
service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its backtrace at LOG_SERVER_ONLY level,
so that the backtrace will appear in the server log but not
be sent to the client.

Bump catalog version.

Authors: Vignesh C, Bharath Rupireddy, Justin Pryzby
Reviewers: Bharath Rupireddy, Justin Pryzby, Fujii Masao, Atsushi Torikoshi, Dilip Kumar, Robert Haas, Andres Freund, Tom lane, Craig Ringer
Discussion: https://www.postgresql.org/message-id/CALDaNm3ZzmFS-=r7oduzj7y7bgqv+n06kqyft6c3xzdoknk...@mail.gmail.com
---
 doc/src/sgml/func.sgml| 62 +++
 src/backend/catalog/system_functions.sql  |  2 +
 src/backend/postmaster/autovacuum.c   |  4 ++
 src/backend/postmaster/checkpointer.c |  4 ++
 src/backend/postmaster/interrupt.c|  4 ++
 src/backend/postmaster/pgarch.c   | 10 +++-
 src/backend/postmaster/startup.c  |  4 ++
 src/backend/postmaster/walwriter.c|  4 ++
 src/backend/storage/ipc/procarray.c   | 56 +
 src/backend/storage/ipc/procsignal.c  | 42 +
 src/backend/storage/ipc/signalfuncs.c | 73 ---
 src/backend/tcop/postgres.c   |  4 ++
 src/backend/utils/adt/mcxtfuncs.c | 40 +++--
 src/backend/utils/error/elog.c| 19 --
 src/backend/utils/init/globals.c  |  1 +
 src/include/catalog/pg_proc.dat   |  5 ++
 src/include/miscadmin.h   |  1 +
 src/include/storage/procarray.h   |  2 +
 src/include/storage/procsignal.h  |  3 +-
 src/include/utils/elog.h  |  2 +
 src/test/regress/expected/backtrace.out   | 49 +++
 src/test/regress/expected/backtrace_1.out | 55 +
 src/test/regress/parallel_schedule|  2 +-
 src/test/regress/sql/backtrace.sql| 33 ++
 24 files changed, 416 insertions(+), 65 deletions(-)
 create mode 100644 src/test/regress/expected/backtrace.out
 create mode 100644 src/test/regress/expected/backtrace_1.out
 create mode 100644 src/test/regress/sql/backtrace.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8a802fb225..4171208719 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25355,6 +25355,30 @@ SELECT collation for ('foo' COLLATE "de_DE");

   
 
+  
+   
+
+ pg_log_backtrace
+
+pg_log_backtrace ( pid integer )
+boolean
+   
+   
+Requests to log the backtrace of the backend with the
+specified process ID.  This function can send the request to
+backends and auxiliary processes except the logger and statistics
+collector.  The backtraces will be logged at LOG
+message level. They will appear in the server log based on the log
+configuration set (See  for
+more information), but will not be sent to the client regardless of
+. A backtrace identifies
+which function a process is currently executing and it may be useful
+for developers to diagnose stuck processes and other problems. This
+function is supported only if PostgreSQL was built with the ability to
+capture backtraces, otherwise it will emit a warning.
+   
+  
+
   

 
@@ -25574,6 +25598,44 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.

 
+   
+pg_log_backtrace can be used to log the backtrace of
+a backend process. For example:
+
+postgres=# select pg_log_backtrace(pg_backend_pid());
+ pg_log_backtrace
+--
+ t
+(1 row)
+
+The backtrace will be logged as specified by the logging configuration.
+For example:
+
+2021-01-27 11:33:50.247 IST [111735] LOG:  current backtrace:
+postgres: postgresdba postgres [local] SELECT(set_backtrace+0x38) [0xae06c5]
+postgres: postgresdba postgres [local] SELECT(ProcessInterrupts+0x788) [0x950c34]
+

Re: Handle infinite recursion in logical replication setup

2022-03-11 Thread vignesh C
On Fri, Mar 11, 2022 at 4:28 PM kuroda.hay...@fujitsu.com
 wrote:
>
> Hi Vignesh,
>
> While considering the second problem, I was getting quite confused about it.
> I'd be happy if you could answer my question.
>
> > To handle this if user has specified only_local option, we could throw
> > a warning or error out while creating subscription in this case, we
> > could have a column srreplicateddata in pg_subscription_rel which
> > could indicate if the table has any replicated data or not:
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn  | srreplicateddata
> > -+-++---+--
> >16389 |   16384 | r  | 0/14A4640 |t
> >16389 |   16385 | r  | 0/14A4690 |f
> > (1 row)
> > In the above example, srreplicateddata being true indicates that table t1
> > whose relid is 16384 has replicated data and the other row having
> > srreplicateddata  as false indicates table t2 whose relid is 16385
> > does not have replicated data.
> > When creating a new subscription, the subscriber will connect to the
> > publisher and check if the relation has replicated data by checking
> > srreplicateddata in pg_subscription_rel table.
> > If the table has any replicated data, log a warning or error for this.
>
> IIUC srreplicateddata represents whether the subscribed data was
> generated not by the publisher itself but by another node.
> My first impression was that the name 'srreplicateddata' is not friendly,
> because all subscribed data is replicated from the publisher.
> Also, I was not sure how the value of the column would be set.
> IIUC the filtering by replication origin is done on the publisher node,
> and the subscriber node cannot know
> whether some data was really filtered or not.
> If we distinguish by the subscriber option publish_local_only,
> it cannot reproduce your example, because the same subscriber would have
> different 'srreplicateddata' values.

Let's consider an existing multi-master logical replication setup
between Node1 and Node2 that was created using the following steps:
a) Node1 - Publication publishing employee table - pub1
b) Node2 - Subscription subscribing from publication pub1 with
publish_local_only - sub1_pub1_node1
c) Node2 - Publication publishing employee table - pub2
d) Node1 - Subscription subscribing from publication pub2 with
publish_local_only - sub2_pub2_node2
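
In SQL terms, that initial two-node setup is roughly the following
(connection strings are placeholders, and the employee table is assumed
to exist on both nodes):

-- Node1:
CREATE PUBLICATION pub1 FOR TABLE employee;
-- Node2:
CREATE SUBSCRIPTION sub1_pub1_node1
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub1 WITH (publish_local_only = on);
CREATE PUBLICATION pub2 FOR TABLE employee;
-- Node1:
CREATE SUBSCRIPTION sub2_pub2_node2
    CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub2 WITH (publish_local_only = on);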

To create a subscription in node3, we will be using the following steps:
a) Node2 - Publication publishing employee table. - pub3
b) Node3 - Subscription subscribing from publication in Node2 with
publish_local_only - sub3_pub3_node2

When we create a subscription on Node3, Node3 will connect to
Node2 (this check is not done on Node3) and check if the employee table
is present in pg_subscription_rel. In our case Node2 will have the
employee table present in pg_subscription_rel (sub1_pub1_node1
subscribes to the employee table from pub1 on Node1). As the employee
table is being subscribed on Node2 from Node1, we will throw an error
like the one below:
postgres=# create subscription sub2 CONNECTION 'dbname =postgres port
= ' publication pub2 with (publish_local_only=on);
ERROR:  CREATE/ALTER SUBSCRIPTION with publish_local_only and
copy_data as true is not allowed when the publisher might have
replicated data, table:public.t1 might have replicated data in the
publisher
HINT:  Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force

I was initially planning to add the srreplicateddata field, but I have
changed the design slightly to keep it simple. Now we just check if
the relation is present in pg_subscription_rel and throw an error if
the copy_data and publish_local_only options are specified. The changes
for this are available at [1].

[1] - https://www.postgresql.org/message-id/CALDaNm0V%2B%3Db%3DCeZJNAAUO2PmSXH5QzNX3jADXb-0hGO_jVj0vA%40mail.gmail.com
Thoughts?

Regards,
Vignesh




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Dilip Kumar
On Sat, Mar 12, 2022 at 1:55 AM Robert Haas  wrote:
>
Responding to this specific issue..

> The changes to createdb_failure_params make me a little nervous. I
> think we'd be in real trouble if we failed before completing both
> DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
> looks to me like those are both intended to be no-fail operations, so
> I don't see an actual hazard here.

I might be missing something but even without that I do not see a real
problem here.  The reason we are dropping the database buffers and
pending sync request is because right after this we are removing the
underlying files and if we just remove the files without dropping the
buffer from the buffer cache then the checkpointer will fail while
trying to flush the buffer.

> But, hmm, what about on the
> recovery side? Suppose that we start copying the database block by
> block and then either (a) the standby is promoted before the copy is
> finished or (b) the copy fails.

I think the above logic will be valid in case of standby as well
because we are not really deleting the underlying files.

> Now the standby has data in
> shared_buffers for a database that does not exist. If that's not bad,
> then why does createdb_failure_params need to DropDatabaseBuffers()?
> But I bet it is bad. I wonder if we should be using
> RelationCopyStorage() rather than this new function
> RelationCopyStorageUsingBuffer().

I am not sure how RelationCopyStorage() would help on the standby side,
because we would still log the same WAL (XLOG_FPI) for each page, and
on the standby side we would use buffers to apply those FPIs. So if you
think there is a problem, it would be the same with
RelationCopyStorage(), at least on the standby side.

In fact, while rewriting the relation during VACUUM FULL we also call
log_newpage() under RelationCopyStorage(), and if the standby gets
promoted at that point it will have some buffers in the buffer pool
with the new relfilenode.  So I think our case is the same.

So my position here is that we need to drop the database buffers and
remove pending sync requests because we are deleting the underlying
files. In the extreme cases where we do not delete the files, there is
no need to drop the buffers or remove the pending sync requests, and
the worst consequence would be some wasted disk space.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Handle infinite recursion in logical replication setup

2022-03-11 Thread vignesh C
On Wed, Mar 2, 2022 at 3:59 PM Amit Kapila  wrote:
>
> On Wed, Feb 23, 2022 at 11:59 AM vignesh C  wrote:
> >
> ...
> ...
> > I have attached a basic patch for this, if the idea is accepted, I
> > will work further to test more scenarios, add documentation, and test
> > and post an updated patch.
> > For the second problem, Table synchronization of table including local
> > data and replicated data using copy command.
> >
> > Let us consider the following scenario:
> > a) node1 publishing to node2 b) node2 publishing to node1. Here in
> > this case node1 will have replicated data from node2 and vice versa.
> >
> > In the above if user wants to include node3 to subscribe data from
> > node2. Users will have to create a subscription in node3 to get the
> > data from node2. During table synchronization we send the complete
> > table data from node2 to node3. Node2 will have local data from node2
> > and also replicated data from node1. Currently we don't have an option
> > to differentiate between the locally generated data and replicated
> > data in the heap which will cause infinite recursion as described
> > above.
> >
> > To handle this if user has specified only_local option, we could throw
> > a warning or error out while creating subscription in this case, we
> > could have a column srreplicateddata in pg_subscription_rel which
> > could indicate if the table has any replicated data or not:
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn  | srreplicateddata
> > -+-++---+--
> >16389 |   16384 | r  | 0/14A4640 |t
> >16389 |   16385 | r  | 0/14A4690 |f
> > (1 row)
> >
> > In the above example, srreplicateddata being true indicates that table t1
> > whose relid is 16384 has replicated data and the other row having
> > srreplicateddata  as false indicates table t2 whose relid is 16385
> > does not have replicated data.
> > When creating a new subscription, the subscriber will connect to the
> > publisher and check if the relation has replicated data by checking
> > srreplicateddata in pg_subscription_rel table.
> > If the table has any replicated data, log a warning or error for this.
> >
>
> If you want to give the error in this case, then I think we need to
> provide an option to the user to allow copy. One possibility could be
> to extend existing copy_data option as 'false', 'true', 'force'. For
> 'false', there shouldn't be any change, for 'true', if 'only_local'
> option is also set and the new column indicates replicated data then
> give an error, for 'force', we won't give an error even if the
> conditions as mentioned for 'true' case are met, rather we will allow
> copy in this case.

When a subscription is created with publish_local_only and copy_data,
it will connect to the publisher and check if the published tables have
also been subscribed from other nodes, by checking if an entry is
present in pg_subscription_rel, and throw an error if one is present.
The attached v4-0002-Support-force-option-for-copy_data-check-and-thro.patch
has the implementation of this.
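
With that patch, the intended usage is roughly the following (names and
connection strings are placeholders):

-- Errors out if the publisher's tables may already contain replicated data:
CREATE SUBSCRIPTION sub3 CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub3 WITH (publish_local_only = on, copy_data = on);
-- Proceed anyway, once the initial data has been reconciled by the user:
CREATE SUBSCRIPTION sub3 CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub3 WITH (publish_local_only = on, copy_data = force);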
Thoughts?

Regards,
Vignesh
From ef978c53b5b195b861210a2ef7f2ff876f004e94 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Mon, 7 Mar 2022 11:19:10 +0530
Subject: [PATCH v4 1/2] Skip replication of non local data.

Add an option publish_local_only which will subscribe only to the locally
generated data on the publisher node. If the subscription is created with
this option, the publisher will skip publishing data that was itself
subscribed from other nodes. It can be created using the following syntax:
ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port=' PUBLICATION pub1 WITH (publish_local_only = on);
---
 contrib/test_decoding/test_decoding.c |  20 +++
 doc/src/sgml/ref/alter_subscription.sgml  |   3 +-
 doc/src/sgml/ref/create_subscription.sgml |  12 ++
 src/backend/catalog/pg_subscription.c |   1 +
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/commands/subscriptioncmds.c   |  26 +++-
 .../libpqwalreceiver/libpqwalreceiver.c   |   4 +
 src/backend/replication/logical/decode.c  |  15 ++-
 src/backend/replication/logical/logical.c |  33 +
 src/backend/replication/logical/worker.c  |   2 +
 src/backend/replication/pgoutput/pgoutput.c   |  45 +++
 src/bin/pg_dump/pg_dump.c |  16 ++-
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/psql/describe.c   |   8 +-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/pg_subscription.h |   3 +
 src/include/replication/logical.h |   4 +
 src/include/replication/output_plugin.h   |   7 ++
 src/include/replication/pgoutput.h|   1 +
 src/include/replication/walreceiver.h |   1 +
 src/test/regress/expected/subscription.out| 115 ++
 src/test/regress/sql/subs

Re: Column Filtering in Logical Replication

2022-03-11 Thread Amit Kapila
On Fri, Mar 11, 2022 at 6:20 PM Tomas Vondra
 wrote:
>
> On 3/11/22 10:52, Amit Kapila wrote:
> > On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra
> >  wrote:
> >>
> >> On 3/10/22 20:10, Tomas Vondra wrote:
> >>>
> >>>
> >>> FWIW I think the reason is pretty simple - pgoutput_row_filter_init is
> >>> broken. It assumes you can just do this
> >>>
> >>> rftuple = SearchSysCache2(PUBLICATIONRELMAP,
> >>>   ObjectIdGetDatum(entry->publish_as_relid),
> >>>   ObjectIdGetDatum(pub->oid));
> >>>
> >>> if (HeapTupleIsValid(rftuple))
> >>> {
> >>> /* Null indicates no filter. */
> >>> rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
> >>>   Anum_pg_publication_rel_prqual,
> >>>   &pub_no_filter);
> >>> }
> >>> else
> >>> {
> >>> pub_no_filter = true;
> >>> }
> >>>
> >>>
> >>> and pub_no_filter=true means there's no filter at all. Which is
> >>> nonsense, because we're using publish_as_relid here - the publication
> >>> may not include this particular ancestor, in which case we need to just
> >>> ignore this publication.
> >>>
> >>> So yeah, this needs to be reworked.
> >>>
> >>
> >> I spent a bit of time looking at this, and I think a minor change in
> >> get_rel_sync_entry() fixes this - it's enough to ensure rel_publications
> >> only includes publications that actually include publish_as_relid.
> >>
> >
> > Thanks for looking into this. I think in the first patch before
> > calling get_partition_ancestors() we need to ensure it is a partition
> > (the call expects that) and pubviaroot is true.
>
> Does the call really require that?
>

There may not be any harm, but I mentioned it because (a) the
comments atop get_partition_ancestors() (...it should only be called when
it is known that the relation is a partition.) indicate the same; (b)
all existing callers seem to use it only for partitions.

> Also, I'm not sure why we'd need to
> look at pubviaroot - that's already considered earlier when calculating
> publish_as_relid, here we just need to know the relationship of the two
> OIDs (if one is ancestor/child of the other).
>

I thought of avoiding the call to get_partition_ancestors() when pubviaroot
is not set. Otherwise it will unnecessarily check the whole hierarchy for
partitions even when that is not required. I agree that this is not a
common code path, but I still felt: why do it needlessly?
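
(For readers following along: pubviaroot corresponds to a publication
created with publish_via_partition_root, e.g. the following illustrative
setup, where changes to the partition are published as changes to the
root table.)

CREATE TABLE measurement (ts timestamptz, val int) PARTITION BY RANGE (ts);
CREATE TABLE measurement_2022 PARTITION OF measurement
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE PUBLICATION pub_root FOR TABLE measurement
    WITH (publish_via_partition_root = true);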

> > I think it would be
> > good if we can avoid an additional call to get_partition_ancestors()
> > as it could be costly.
>
> Maybe. OTOH we only should do this only very rarely anyway.
>
> > I wonder why it is not sufficient to ensure
> > that publish_as_relid exists after ancestor in ancestors list before
> > assigning the ancestor to publish_as_relid? This only needs to be done
> > in case of (if (!publish)). I have not tried this so I could be wrong.
> >
>
> I'm not sure what exactly are you proposing. Maybe try coding it? That's
> probably faster than trying to describe what the code might do ...
>

Okay, please find the patch attached. I have done basic testing of this; if
we agree with this approach then it will require some more testing.

-- 
With Regards,
Amit Kapila.


v1-0001-fixup-publish_as_relid.patch
Description: Binary data


v1-0002-fixup-row-filter-publications.patch
Description: Binary data


Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Jeff Davis
On Thu, 2022-03-10 at 15:54 -0500, Stephen Frost wrote:
> The standard is basically that all of the functions it brings are
> written to enforce the PG privilege system and you aren't able to use
> the extension to bypass those privileges.  In some cases that means
> that

Every extension should follow that standard, right? If it doesn't (e.g.
creating dangerous functions and granting them to public), then even
superuser should not install it.

> the C-language functions installed have if(!superuser) ereport()
> calls

I'm curious why not rely on the grant system where possible? I thought
we were trying to get away from explicit superuser checks.
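
For instance, a minimal sketch of the grant-based approach (the function
name is taken from the patch; the target role here is hypothetical):

-- In the extension script, lock the function down by default ...
REVOKE EXECUTE ON FUNCTION pg_get_raw_wal_record FROM PUBLIC;
-- ... and let an administrator open it up to a suitable role explicitly:
GRANT EXECUTE ON FUNCTION pg_get_raw_wal_record TO wal_inspector;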

> I've not looked back on this thread, but I'd expect pg_walinspect to
> need those superuser checks and with those it *could* be marked as
> trusted, but that again brings into question how useful it is to mark
> it
> thusly.

As long as any functions are safely accessible to public or a
predefined role, there is some utility for the 'trusted' marker.

As this patch is currently written, pg_monitor has access to these
functions, though I don't think that's the right privilege level, at
least for pg_get_raw_wal_record().

> I certainly don't think we should allow either database owners or
> regular users on a system the ability to access the WAL traffic of
> the
> entire system.

Agreed. That was not what I intended by asking if it should be marked
'trusted'. The marker only allows the non-superuser to run the CREATE
EXTENSION command; it's up to the extension script to decide whether
any non-superusers can do anything at all with the extension.

>   More forcefully- we should *not* be throwing more access
> rights towards $owners in general and should be thinking about how we
> can allow admins, providers, whomever, the ability to control what
> rights users are given.  If they're all lumped under 'owner' then
> there's no way for people to provide granular access to just those
> things they wish and intend to.

Not sure I understand, but that sounds like a larger discussion.

Regards,
Jeff Davis






Re: range_agg with multirange inputs

2022-03-11 Thread Paul Jungwirth

On 3/10/22 14:07, Chapman Flack wrote:

> When I apply this patch, I get a func.sgml with two entries for
> range_intersect_agg(anymultirange).


Arg, fixed.


> In range_agg_transfn, you've changed the message in the "must be called
> with a range or multirange"; that seems like another good candidate to
> be an elog.


Agreed. Updated here.


> I think your query finds aggregate declarations that share the same SQL
> function declaration as their finalizer functions. That seems to be more
> common.
>
> The query I used looks for cases where different SQL-declared functions
> appear as finalizers of aggregates, but the different SQL declared functions
> share the same internal C implementation.


Okay, I see. I believe that is quite common for ordinary SQL functions. 
Sharing a prosrc seems even less remarkable than sharing an aggfinalfn. 
You're right there are no cases for other finalfns yet, but I don't 
think there is anything special about finalfns that would make this a 
weirder thing to do there than with ordinary functions. Still, noting it 
with a comment does seem helpful. I've updated the remark to match what 
you suggested.
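
(As a quick illustration of what the patch enables -- aggregating
multirange inputs directly; values are made up:)

SELECT range_agg(mr)
  FROM (VALUES ('{[1,3),[5,7)}'::int4multirange),
               ('{[2,4)}'::int4multirange)) AS t(mr);
-- expected: {[1,4),[5,7)}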


Thank you again for the review, and sorry for so many iterations! :-)

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com
From b409d7333132e34a928ec8cc0ecca0a3421b7268 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Fri, 10 Dec 2021 16:04:57 -0800
Subject: [PATCH v4] Add range_agg with multirange inputs

---
 doc/src/sgml/func.sgml|  30 ++
 src/backend/utils/adt/multirangetypes.c   |  69 ++--
 src/include/catalog/pg_aggregate.dat  |   3 +
 src/include/catalog/pg_proc.dat   |  11 ++
 src/test/regress/expected/multirangetypes.out | 100 ++
 src/test/regress/expected/opr_sanity.out  |   1 +
 src/test/regress/sql/multirangetypes.sql  |  21 
 src/test/regress/sql/opr_sanity.sql   |   1 +
 8 files changed, 230 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index df3cd5987b..ab6c4f0093 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20012,8 +20012,23 @@ SELECT NULLIF(value, '(none)') ...

No
   
 
+  
+   
+
+ range_agg
+
+range_agg ( value
+ anymultirange )
+anymultirange
+   
+   
+Computes the union of the non-null input values.
+   
+   No
+  
+
   

 
  range_intersect_agg
@@ -20027,8 +20042,23 @@ SELECT NULLIF(value, '(none)') ...

No
   
 
+  
+   
+
+ range_intersect_agg
+
+range_intersect_agg ( value
+ anymultirange )
+anymultirange
+   
+   
+Computes the intersection of the non-null input values.
+   
+   No
+  
+
   

 
  string_agg
diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index c474b24431..a6d376b083 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1344,11 +1344,9 @@ range_agg_transfn(PG_FUNCTION_ARGS)
 		elog(ERROR, "range_agg_transfn called in non-aggregate context");
 
 	rngtypoid = get_fn_expr_argtype(fcinfo->flinfo, 1);
 	if (!type_is_range(rngtypoid))
-		ereport(ERROR,
-(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("range_agg must be called with a range")));
+		elog(ERROR, "range_agg must be called with a range or multirange");
 
 	if (PG_ARGISNULL(0))
 		state = initArrayResult(rngtypoid, aggContext, false);
 	else
@@ -1362,8 +1360,11 @@ range_agg_transfn(PG_FUNCTION_ARGS)
 }
 
 /*
  * range_agg_finalfn: use our internal array to merge touching ranges.
+ *
+ * Shared by range_agg_finalfn(anyrange) and
+ * multirange_agg_finalfn(anymultirange).
  */
 Datum
 range_agg_finalfn(PG_FUNCTION_ARGS)
 {
@@ -1397,8 +1398,66 @@ range_agg_finalfn(PG_FUNCTION_ARGS)
 
 	PG_RETURN_MULTIRANGE_P(make_multirange(mltrngtypoid, typcache->rngtype, range_count, ranges));
 }
 
+/*
+ * multirange_agg_transfn: combine adjacent/overlapping multiranges.
+ *
+ * All we do here is gather the input multiranges' ranges into an array
+ * so that the finalfn can sort and combine them.
+ */
+Datum
+multirange_agg_transfn(PG_FUNCTION_ARGS)
+{
+	MemoryContext aggContext;
+	Oid			mltrngtypoid;
+	TypeCacheEntry *typcache;
+	TypeCacheEntry *rngtypcache;
+	ArrayBuildState *state;
+	MultirangeType *current;
+	int32		range_count;
+	RangeType **ranges;
+	int32		i;
+
+	if (!AggCheckCallContext(fcinfo, &aggContext))
+		elog(ERROR, "multirange_agg_transfn called in non-aggregate context");
+
+	mltrngtypoid = get_fn_expr_argtype(fcinfo->flinfo, 1);
+	if (!type_is_multirange(mltrngtypoid))
+		elog(ERROR, "range_agg must be called with a range or multirange");
+
+	typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
+	rngtypcache = typcache->rngtype;
+
+	if (PG_ARGISNULL

Re: Issue with pg_stat_subscription_stats

2022-03-11 Thread Amit Kapila
On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman
 wrote:
>
> So, I noticed that pg_stat_reset_subscription_stats() wasn't working
> properly, and, upon further investigation, I'm not sure the view
> pg_stat_subscription_stats is being properly populated.
>

I have tried the below scenario based on this:
Step 1: Create some data that generates conflicts and leads to apply
failures, and then check the view:

postgres=# select * from pg_stat_subscription_stats;
 subid | subname | apply_error_count | sync_error_count | stats_reset
---+-+---+--+-
 16389 | sub1| 4 |0 |
(1 row)

Step 2: Reset the view
postgres=# select * from pg_stat_reset_subscription_stats(16389);
 pg_stat_reset_subscription_stats
--

(1 row)

Step 3: Again, check the view:
postgres=# select * from pg_stat_subscription_stats;
 subid | subname | apply_error_count | sync_error_count |   stats_reset
---+-+---+--+--
 16389 | sub1| 0 |0 | 2022-03-12
08:21:39.156971+05:30
(1 row)

The stats_reset time seems to be populated. Similarly, I have tried
passing NULL to pg_stat_reset_subscription_stats and it works. I think
I am missing something here; could you please explain the exact
scenario/steps where you observed that this API is not working?


-- 
With Regards,
Amit Kapila.




Re: On login trigger: take three

2022-03-11 Thread Andres Freund
Hi,

On 2022-02-15 21:07:15 +1100, Greg Nancarrow wrote:
> Subject: [PATCH v25] Add a new "login" event and login event trigger support.
> 
> The login event occurs when a user logs in to the system.

I think this needs a HUGE warning in the docs about most event triggers
needing to check pg_is_in_recovery() or breaking hot standby. And certainly
the example given needs to include a pg_is_in_recovery() check.
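
A sketch of what a recovery-safe login trigger could look like with this
patch (the "login" event comes from the patch; the audit table and
function are made up for illustration):

CREATE FUNCTION on_login_proc() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Writes are not possible during recovery, so guard them to avoid
    -- breaking logins on a hot standby.
    IF NOT pg_is_in_recovery() THEN
        INSERT INTO login_audit (username, logged_in_at)
        VALUES (session_user, now());
    END IF;
END;
$$;

CREATE EVENT TRIGGER on_login ON login
    EXECUTE FUNCTION on_login_proc();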


> +   
> + The login event occurs when a user logs in to the
> + system.
> + Any bugs in a trigger procedure for this event may prevent successful
> + login to the system. Such bugs may be fixed after first restarting the
> + system in single-user mode (as event triggers are disabled in this 
> mode).
> + See the  reference page for details about
> + using single-user mode.
> +   

I'm strongly against adding any new dependencies on single user mode.

A saner approach might be a superuser-only GUC that can be set as part of the
connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true').


> @@ -293,6 +297,27 @@ insert_event_trigger_tuple(const char *trigname, const 
> char *eventname, Oid evtO
>   CatalogTupleInsert(tgrel, tuple);
>   heap_freetuple(tuple);
>  
> + if (strcmp(eventname, "login") == 0)
> + {
> + Form_pg_database db;
> + Relationpg_db = table_open(DatabaseRelationId, 
> RowExclusiveLock);
> +
> + /* Set dathasloginevt flag in pg_database */
> + tuple = SearchSysCacheCopy1(DATABASEOID, 
> ObjectIdGetDatum(MyDatabaseId));
> + if (!HeapTupleIsValid(tuple))
> + elog(ERROR, "cache lookup failed for database %u", 
> MyDatabaseId);
> + db = (Form_pg_database) GETSTRUCT(tuple);
> + if (!db->dathasloginevt)
> + {
> + db->dathasloginevt = true;
> + CatalogTupleUpdate(pg_db, &tuple->t_self, tuple);
> + }
> + else
> + CacheInvalidateRelcacheByTuple(tuple);
> + table_close(pg_db, RowExclusiveLock);
> + heap_freetuple(tuple);
> + }
> +
>   /* Depend on owner. */
>   recordDependencyOnOwner(EventTriggerRelationId, trigoid, evtOwner);

Maybe I am confused, but isn't that CacheInvalidateRelcacheByTuple call
*entirely* bogus? CacheInvalidateRelcacheByTuple() expects a pg_class tuple,
but you're passing in a pg_database tuple?  And what is relcache invalidation
even supposed to achieve here?


I think this should mention that ->dathasloginevt is unset on login when
appropriate.



> +/*
> + * Fire login event triggers.
> + */
> +void
> +EventTriggerOnLogin(void)
> +{
> + List   *runlist;
> + EventTriggerData trigdata;
> +
> + /*
> +  * See EventTriggerDDLCommandStart for a discussion about why event
> +  * triggers are disabled in single user mode.
> +  */
> + if (!IsUnderPostmaster || !OidIsValid(MyDatabaseId))
> + return;
> +
> + StartTransactionCommand();
> +
> + if (DatabaseHasLoginEventTriggers())
> + {
> + runlist = EventTriggerCommonSetup(NULL,
> + 
>   EVT_Login, "login",
> + 
>   &trigdata);
> +
> + if (runlist != NIL)
> + {
> + /*
> +  * Make sure anything the main command did will be 
> visible to the
> +  * event triggers.
> +  */
> + CommandCounterIncrement();

"Main command"?

It's not clear to me why a CommandCounterIncrement() could be needed here -
which previous writes do you need to make visible?



> + /*
> +  * Runlist is empty: clear dathasloginevt flag
> +  */
> + Relationpg_db = table_open(DatabaseRelationId, 
> RowExclusiveLock);
> + HeapTuple   tuple = 
> SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
> + Form_pg_database db;
> +
> + if (!HeapTupleIsValid(tuple))
> + elog(ERROR, "cache lookup failed for database 
> %u", MyDatabaseId);
> +
> + db = (Form_pg_database) GETSTRUCT(tuple);
> + if (db->dathasloginevt)
> + {
> + /*
> +  * There can be a race condition: a login event 
> trigger may
> +  * have been added after the pg_event_trigger 
> table was
> +  * scanned, and we don't want to erroneously 
> clear the
> +  * dathasloginevt flag in this case. To be sure 
> that this
> +  * hasn't happened, repeat the scan under the 
>

Re: role self-revocation

2022-03-11 Thread Mark Dilger



> On Mar 11, 2022, at 4:56 PM, Stephen Frost  wrote:
> 
> First … I outlined a fair bit of further description in the message you just 
> responded to but neglected to include in your response, which strikes me as 
> odd that you’re now asking for further explanation.



>  When it comes to completing the idea of role ownership- I didn’t come up 
> with that idea nor champion it

Sorry, not "completing", but "competing".  It seems we're discussing different 
ways to fix how roles and CREATEROLE work, and we have several ideas competing 
against each other.  (This differs from *people* competing against each other, 
as I don't necessarily like the patch I wrote better than I like your idea.)

> and therefore I’m not really sure how many of the steps are required to fully 
> support that concept..?

There are problems that the ownership concepts solve.  I strongly suspect that
your proposal could also solve those same problems, and I am just trying to
identify the specific portions of your proposal necessary to do so.

>  For my part, I would think that those steps necessary to satisfy the spec 
> would get us pretty darn close to what true folks advocating for role 
> ownership are asking for

I have little idea what "true folks" means in this context.  As for "advocating 
for role ownership", I'm not in that group.  Whether role ownership or 
something else, I just want some solution to a set of problems, mostly to do 
with needing superuser to do role management tasks.

> , but that doesn’t include the superuser-only alter role attributes piece.  
> Is that included in role ownership?  I wouldn’t think so, but some might 
> argue otherwise, and I don’t know that it is actually useful to divert into a 
> discussion about what is or isn’t in that.

Introducing the idea of role ownership doesn't fix that.  But a patch which 
introduces role ownership is useless unless CREATEROLE is also fixed.  There 
isn't any point having non-superusers create and own roles if, to do so, they 
need a privilege which can break into superuser.  But that argument is no 
different with a patch along the lines of what you are proposing.  CREATEROLE 
needs fixing either way.

> If we agree that the role attribute bits are independent

Yes.

> then I think I agree that we need 1 and 2 to get the capabilities that the 
> folks asking for role ownership want

Yes.

> as 2 is where we make sure that one admin of a role can’t revoke another 
> admin’s rights over that role.

Exactly, so #2 is part of the competing proposal.  (I get the sense you might 
not see these as competing proposals, but I find that framing useful for 
deciding which approach to pursue.)

>  Perhaps 2 isn’t strictly necessary in a carefully managed environment where 
> no one else is given admin rights over the mini-superuser roles, but I’d 
> rather not have folks depending on that.

I think it is necessary, and for the reason you say.

>  I’d still push back though and ask those advocating for this if it meets 
> what they’re asking for.  I got the impression that it did but maybe I 
> misunderstood.
> 
> In terms of exactly how things would work with these changes… I thought I 
> explained it pretty clearly, so it’s kind of hard to answer that further 
> without a specific question to answer.  Did you have something specific in 
> mind?  Perhaps I could answer a specific question and provide more clarity 
> that way.

Your emails contained a lot of "we could do this or that depending on what 
people want, and maybe this other thing, but that isn't really necessary, and 
" which left me unclear on the proposal.  I don't mean to disparage your 
communication style; it's just that when trying to distill technical details, 
high level conversation can be hard to grok.

I have the sense that you aren't going to submit a patch, so I wanted this 
thread to contain enough detail for somebody else to do so.  Thanks.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: refactoring basebackup.c

2022-03-11 Thread Andres Freund
Hi,

On 2022-03-11 10:19:29 -0500, Robert Haas wrote:
> Thanks for the report. The problem here is that, when the output is
> standard output (-D -), pg_basebackup can only produce a single output
> file, so the manifest gets injected into the tar file on the client
> side rather than being written separately as we do in normal cases.
> However, that only works if we're receiving a tar file that we can
> parse from the server, and here the server is sending a compressed
tarfile. The current code mistakenly attempts to parse the compressed
> tarfile as if it were an uncompressed tarfile, which causes the error
> messages that you are seeing (and which I can also reproduce here). We
> actually have enough infrastructure available in pg_basebackup now
> that we could do the "right thing" in this case: decompress the data
> received from the server, parse the resulting tar file, inject the
> backup manifest, construct a new tar file, and recompress. However, I
> think that's probably not a good idea, because it's unlikely that the
> user will understand that the data is being compressed on the server,
> then decompressed, and then recompressed again, and the performance of
> the resulting pipeline will probably not be very good. So I think we
> should just refuse this command. Patch for that attached.

You could also just append a manifest as a compressed tar to the compressed tar
stream. Unfortunately GNU tar requires -i to read concatenated compressed
archives, so perhaps that's not quite an alternative.

Greetings,

Andres Freund




Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

On Fri, Mar 11, 2022 at 19:03 Mark Dilger 
wrote:

> > On Mar 11, 2022, at 2:46 PM, Stephen Frost  wrote:
> >
> > I do think that’s reasonable … and believe I suggested it about 3
> messages ago in this thread. ;)  (step #3 I think it was?  Or maybe 4).
>
> Yes, and you mentioned it to me off-list.


Indeed.

> I'm soliciting a more concrete specification for what you are proposing.
> To me, that means understanding how the SQL spec behavior that you champion
> translates into specific changes.  You specified some of this in steps #1
> through #5, but I'd like a clearer indication of how many of those (#1
> alone, both #1 and #2, or what?) constitute a competing idea to the idea of
> role ownership, and greater detail about how each of those steps translate
> into specific behavior changes in postgres.  Your initial five-step email
> seems to be claiming that #1 by itself is competitive, but to me it seems
> at least #1 and #2 would be required.


First … I outlined a fair bit of further description in the message you
just responded to but neglected to include in your response, which strikes
me as odd that you’re now asking for further explanation.  When it comes to
completing the idea of role ownership- I didn’t come up with that idea nor
champion it and therefore I’m not really sure how many of the steps are
required to fully support that concept..?  For my part, I would think that
those steps necessary to satisfy the spec would get us pretty darn close to
what true folks advocating for role ownership are asking for, but that
doesn’t include the superuser-only alter role attributes piece.  Is that
included in role ownership?  I wouldn’t think so, but some might argue
otherwise, and I don’t know that it is actually useful to divert into a
discussion about what is or isn’t in that.

If we agree that the role attribute bits are independent then I think I
agree that we need 1 and 2 to get the capabilities that the folks asking
for role ownership want, as 2 is where we make sure that one admin of a
role can’t revoke another admin’s rights over that role.  Perhaps 2 isn’t
strictly necessary in a carefully managed environment where no one else is
given admin rights over the mini-superuser roles, but I’d rather not have
folks depending on that.  I’d still push back though and ask those
advocating for this if it meets what they’re asking for.  I got the
impression that it did but maybe I misunderstood.

In terms of exactly how things would work with these changes… I thought I
explained it pretty clearly, so it’s kind of hard to answer that further
without a specific question to answer.  Did you have something specific in
mind?  Perhaps I could answer a specific question and provide more clarity
that way.

Thanks,

Stephen

>


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Andres Freund
Hi,

On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
> Have you been able to create a test case for that? The largest record I can
> think of is a commit record with a huge number of subtransactions, dropped
> relations, and shared inval messages. I'm not sure if you can overflow a
> uint32 with that, but exceeding MaxAllocSize seems possible.

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long)
  FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 2145386550 at 0/360 too long



> I wonder if these checks hurt performance. These are very cheap, but then
> again, this codepath is very hot. It's probably fine, but it still worries
> me a little. Maybe some of these could be Asserts.

I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.

FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.

So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?


> > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> > if (needs_data)
> > {
> > +   /* protect against overflow */
> > +   if (unlikely(regbuf->rdata_len > UINT16_MAX))
> > +   elog(ERROR, "too much WAL data for registered 
> > buffer");
> > +
> > /*
> >  * Link the caller-supplied rdata chain for this buffer 
> > to the
> >  * overall list.

FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.

ISTM that this could just be an assert?

Greetings,

Andres Freund




Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2022-03-11 Thread David Zhang
Applied patches v6-0002 and v6-0003 to the master branch, and the
`make check` test passes.


Here are my test results, averaged over 10 runs, on 3 virtual machines:
before the patches:

   abort.1 = 2.5473 ms

   abort.2 = 4.1572 ms

after the patches with OPTIONS (ADD parallel_abort 'true'):

   abort.1 = 1.7136 ms

   abort.2 = 2.5833 ms

Overall, it shows about a 32 to 37% improvement in my testing environment.
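
For reference, the option is enabled per foreign server, e.g. (the
server name is a placeholder):

ALTER SERVER remote_node OPTIONS (ADD parallel_abort 'true');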

On 2022-03-05 2:32 a.m., Etsuro Fujita wrote:

On Mon, Feb 28, 2022 at 6:53 PM Etsuro Fujita  wrote:

Here is an updated version.  I added to the 0003 patch a macro for
defining the milliseconds to wait, as proposed by David upthread.

I modified the 0003 patch further: 1) I added to
pgfdw_cancel_query_end/pgfdw_exec_cleanup_query_end the PQconsumeInput
optimization that we have in do_sql_command_end, and 2) I
added/tweaked comments a bit further.  Attached is an updated version.

Like [1], I ran a simple performance test using the following transaction:

BEGIN;
SAVEPOINT s;
INSERT INTO ft1 VALUES (10, 10);
INSERT INTO ft2 VALUES (20, 20);
ROLLBACK TO SAVEPOINT s;
RELEASE SAVEPOINT s;
INSERT INTO ft1 VALUES (10, 10);
INSERT INTO ft2 VALUES (20, 20);
ABORT;

where ft1 is a foreign table created on a foreign server hosted on the
same machine as the local server, and ft2 is a foreign table created
on a foreign server hosted on a different machine.  (In this test I
used two machines, while in [1] I used three machines: one for the
local server and the others for ft1 and ft2.)  The average latencies
for the ROLLBACK TO SAVEPOINT and ABORT commands over ten runs of the
above transaction with the parallel_abort option disabled/enabled are:

* ROLLBACK TO SAVEPOINT
   parallel_abort=0: 0.3217 ms
   parallel_abort=1: 0.2396 ms

* ABORT
   parallel_abort=0: 0.4749 ms
   parallel_abort=1: 0.3733 ms

This option reduces the latency for ROLLBACK TO SAVEPOINT by 25.5
percent, and the latency for ABORT by 21.4 percent.  From the results,
I think the patch is useful.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK17dAZCXvwnfpr1eTfknTGdt%3DhYTV9405Gt5SqPOX8K84w%40mail.gmail.com

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

Re: Kerberos delegation support in libpq and postgres_fdw

2022-03-11 Thread Stephen Frost
Greetings,

On Fri, Mar 11, 2022 at 18:55 Jacob Champion  wrote:

> On Mon, 2022-02-28 at 20:28 -0500, Stephen Frost wrote:
> > Will add to the CF for consideration.
>
> GSSAPI newbie here, so caveat lector.


No worries, thanks for your interest!

> diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
> > index efc53f3135..6f820a34f1 100644
> > --- a/src/backend/libpq/auth.c
> > +++ b/src/backend/libpq/auth.c
> > @@ -920,6 +920,7 @@ pg_GSS_recvauth(Port *port)
> >   int mtype;
> >   StringInfoData buf;
> >   gss_buffer_desc gbuf;
> > + gss_cred_id_t proxy;
> >
> >   /*
> >* Use the configured keytab, if there is one.  Unfortunately,
> Heimdal
> > @@ -949,6 +950,9 @@ pg_GSS_recvauth(Port *port)
> >*/
> >   port->gss->ctx = GSS_C_NO_CONTEXT;
> >
> > + proxy = NULL;
> > + port->gss->proxy_creds = false;
> > +
> >   /*
> >* Loop through GSSAPI message exchange. This exchange can consist
> of
> >* multiple messages sent in both directions. First message is
> always from
> > @@ -999,7 +1003,7 @@ pg_GSS_recvauth(Port *port)
> >
>&port->gss->outbuf,
> >
>&gflags,
> >
>NULL,
> > -
>NULL);
> > +
>&proxy);
> >
> >   /* gbuf no longer used */
> >   pfree(buf.data);
> > @@ -1011,6 +1015,12 @@ pg_GSS_recvauth(Port *port)
> >
> >   CHECK_FOR_INTERRUPTS();
> >
> > + if (proxy != NULL)
> > + {
> > + pg_store_proxy_credential(proxy);
> > + port->gss->proxy_creds = true;
> > + }
> > +
>
> Some implementation docs [1] imply that a delegated_cred_handle is only
> valid if the ret_flags include GSS_C_DELEG_FLAG. The C-binding RFC [2],
> though, says that we can rely on it being set to GSS_C_NO_CREDENTIAL if
> no handle was sent...
>
> I don't know if there are any implementation differences here, but in
> any case I think it'd be more clear to use the GSS_C_NO_CREDENTIAL
> spelling (instead of NULL) here, if we do decide not to check
> ret_flags.


Hmmm, yeah, that seems like it might be better and is something I’ll take a
look at.

[5] says we have to free the proxy credential with GSS_Release_cred();
> I don't see that happening anywhere, but I may have missed it.


I’m not sure that it’s really necessary or worthwhile to do that at process
end since … the process is about to end. I suppose we could provide a
function that a user could call to ask for it to be released sooner if we
really wanted..?

>   maj_stat = gss_init_sec_context(&min_stat,
> > -
>  GSS_C_NO_CREDENTIAL,
> > +
>  proxy,
> >
>  &conn->gctx,
> >
>  conn->gtarg_nam,
> >
>  GSS_C_NO_OID,
> > -
>  GSS_C_MUTUAL_FLAG,
> > +
>  GSS_C_MUTUAL_FLAG | GSS_C_DELEG_FLAG,
> >   0,
> >
>  GSS_C_NO_CHANNEL_BINDINGS,
> >
>  (ginbuf.value == NULL) ? GSS_C_NO_BUFFER : &ginbuf,
> > diff --git a/src/interfaces/libpq/fe-secure-gssapi.c
> b/src/interfaces/libpq/fe-secure-gssapi.c
> > index 6ea52ed866..566c89f52f 100644
> > --- a/src/interfaces/libpq/fe-secure-gssapi.c
> > +++ b/src/interfaces/libpq/fe-secure-gssapi.c
> > @@ -631,7 +631,7 @@ pqsecure_open_gss(PGconn *conn)
> >*/
> >   major = gss_init_sec_context(&minor, conn->gcred, &conn->gctx,
> >
> conn->gtarg_nam, GSS_C_NO_OID,
> > -
> GSS_REQUIRED_FLAGS, 0, 0, &input, NULL,
> > +
> GSS_REQUIRED_FLAGS | GSS_C_DELEG_FLAG, 0, 0, &input, NULL,
>
> It seems like there should be significant security implications to
> allowing delegation across the board. Especially since one postgres_fdw
> might delegate to another server, and then another... Should this be
> opt-in, maybe via a connection parameter?


This is already opt-in: at kinit time a user can decide whether they'd like a
proxy-able ticket or not.  I don't know that we really need to have our own
option for it … though I'm not really against adding such an option either.

(It also looks like there are some mechanisms for further constraining
> delegation scope, either by administrator policy or otherwise [3, 4].
> Might be a good thing for a v2 of this feature to have.)


Yes, constrained delegation is a pretty neat extension to Kerberos and one
I’d like to look at later as a future enhancement but I don’t think it
needs to be in the initial version.

Similarly, it feels a little strange that the server would allow the
> client to unilaterally force the use of a delegated credential. I think
> that should be opt-in on the server side too, unless there's some
> context I'm missing around why that's safe.


Perhaps you could explain what isn't safe about accepting a delegated
credential from a client..?  I am not aware of a risk in accepting such a
delegated credential.  Even so, I'm not against adding an option… but
exactly how would that option be configured?  Server level?  On the HBA
line?  Role level..?

> + /* Make the proxy credential only av

Re: wal_compression=zstd

2022-03-11 Thread Michael Paquier
On Fri, Mar 11, 2022 at 03:49:00PM -0600, Justin Pryzby wrote:
> While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC.
> 
> Also, there's a dangling "and".

Right.  I'll address that a bit later today.  Thanks!
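
For anyone testing, once the patch is in, the setting can be enabled at
runtime on a server built with zstd support (illustrative):

ALTER SYSTEM SET wal_compression = 'zstd';
SELECT pg_reload_conf();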
--
Michael


signature.asc
Description: PGP signature


Re: role self-revocation

2022-03-11 Thread Mark Dilger



> On Mar 11, 2022, at 2:46 PM, Stephen Frost  wrote:
> 
> I do think that’s reasonable … and believe I suggested it about 3 messages 
> ago in this thread. ;)  (step #3 I think it was?  Or maybe 4).

Yes, and you mentioned it to me off-list.

I'm soliciting a more concrete specification for what you are proposing.  To 
me, that means understanding how the SQL spec behavior that you champion 
translates into specific changes.  You specified some of this in steps #1 
through #5, but I'd like a clearer indication of how many of those (#1 alone, 
both #1 and #2, or what?) constitute a competing idea to the idea of role 
ownership, and greater detail about how each of those steps translate into 
specific behavior changes in postgres.  Your initial five-step email seems to 
be claiming that #1 by itself is competitive, but to me it seems at least #1 
and #2 would be required.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Kerberos delegation support in libpq and postgres_fdw

2022-03-11 Thread Jacob Champion
On Mon, 2022-02-28 at 20:28 -0500, Stephen Frost wrote:
> Will add to the CF for consideration.

GSSAPI newbie here, so caveat lector.

> diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
> index efc53f3135..6f820a34f1 100644
> --- a/src/backend/libpq/auth.c
> +++ b/src/backend/libpq/auth.c
> @@ -920,6 +920,7 @@ pg_GSS_recvauth(Port *port)
>   int mtype;
>   StringInfoData buf;
>   gss_buffer_desc gbuf;
> + gss_cred_id_t proxy;
>  
>   /*
>* Use the configured keytab, if there is one.  Unfortunately, Heimdal
> @@ -949,6 +950,9 @@ pg_GSS_recvauth(Port *port)
>*/
>   port->gss->ctx = GSS_C_NO_CONTEXT;
>  
> + proxy = NULL;
> + port->gss->proxy_creds = false;
> +
>   /*
>* Loop through GSSAPI message exchange. This exchange can consist of
>* multiple messages sent in both directions. First message is always 
> from
> @@ -999,7 +1003,7 @@ pg_GSS_recvauth(Port *port)
>   
>   &port->gss->outbuf,
>   
>   &gflags,
>   
>   NULL,
> - 
>   NULL);
> + 
>   &proxy);
>  
>   /* gbuf no longer used */
>   pfree(buf.data);
> @@ -1011,6 +1015,12 @@ pg_GSS_recvauth(Port *port)
>  
>   CHECK_FOR_INTERRUPTS();
>  
> + if (proxy != NULL)
> + {
> + pg_store_proxy_credential(proxy);
> + port->gss->proxy_creds = true;
> + }
> +

Some implementation docs [1] imply that a delegated_cred_handle is only
valid if the ret_flags include GSS_C_DELEG_FLAG. The C-binding RFC [2],
though, says that we can rely on it being set to GSS_C_NO_CREDENTIAL if
no handle was sent...

I don't know if there are any implementation differences here, but in
any case I think it'd be more clear to use the GSS_C_NO_CREDENTIAL
spelling (instead of NULL) here, if we do decide not to check
ret_flags.

[5] says we have to free the proxy credential with GSS_Release_cred();
I don't see that happening anywhere, but I may have missed it.
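As a sketch only (this is my assumption, not what the patch does), and
assuming pg_store_proxy_credential() copies the handle it is given, the
accept path could check ret_flags and release the handle afterwards:

	/* Only accept a credential the peer actually delegated. */
	if ((gflags & GSS_C_DELEG_FLAG) && proxy != GSS_C_NO_CREDENTIAL)
	{
		pg_store_proxy_credential(proxy);
		port->gss->proxy_creds = true;
		gss_release_cred(&min_stat, &proxy);
	}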

>   maj_stat = gss_init_sec_context(&min_stat,
> - 
> GSS_C_NO_CREDENTIAL,
> + proxy,
>   
> &conn->gctx,
>   
> conn->gtarg_nam,
>   
> GSS_C_NO_OID,
> - 
> GSS_C_MUTUAL_FLAG,
> + 
> GSS_C_MUTUAL_FLAG | GSS_C_DELEG_FLAG,
>   0,
>   
> GSS_C_NO_CHANNEL_BINDINGS,
>   
> (ginbuf.value == NULL) ? GSS_C_NO_BUFFER : &ginbuf,
> diff --git a/src/interfaces/libpq/fe-secure-gssapi.c 
> b/src/interfaces/libpq/fe-secure-gssapi.c
> index 6ea52ed866..566c89f52f 100644
> --- a/src/interfaces/libpq/fe-secure-gssapi.c
> +++ b/src/interfaces/libpq/fe-secure-gssapi.c
> @@ -631,7 +631,7 @@ pqsecure_open_gss(PGconn *conn)
>*/
>   major = gss_init_sec_context(&minor, conn->gcred, &conn->gctx,
>
> conn->gtarg_nam, GSS_C_NO_OID,
> -  
> GSS_REQUIRED_FLAGS, 0, 0, &input, NULL,
> +  
> GSS_REQUIRED_FLAGS | GSS_C_DELEG_FLAG, 0, 0, &input, NULL,

It seems like there should be significant security implications to
allowing delegation across the board. Especially since one postgres_fdw
might delegate to another server, and then another... Should this be
opt-in, maybe via a connection parameter?

(It also looks like there are some mechanisms for further constraining
delegation scope, either by administrator policy or otherwise [3, 4].
Might be a good thing for a v2 of this feature to have.)

Similarly, it feels a little strange that the server would allow the
client to unilaterally force the use of a delegated credential. I think
that should be opt-in on the server side too, unless there's some
context I'm missing around why that's safe.

> + /* Make the proxy credential only available to current process */
> + major = gss_store_cred_into(&minor,
> + cred,
> + GSS_C_INITIATE, /* cr

Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

On Fri, Mar 11, 2022 at 12:32 Mark Dilger 
wrote:

>
>
> > On Mar 11, 2022, at 8:48 AM, Stephen Frost  wrote:
> >
> > I agree that it would have an impact on backwards compatibility to
> > change how WITH ADMIN works- but it would also get us more in line with
> > what the SQL standard says for how WITH ADMIN is supposed to work and
> > that seems worth the change to me.
>
> I'm fine with giving up some backwards compatibility to get some SQL
> standard compatibility, as long as we're clear that is what we're doing.
> What you say about the SQL spec isn't great, though, because too much power
> is vested in "ADMIN".  I see "ADMIN" as at least three separate privileges
> together.  Maybe it would be spec compliant to implement "ADMIN" as a
> synonym for a set of separate privileges?


I do think that’s reasonable … and believe I suggested it about 3 messages
ago in this thread. ;)  (step #3 I think it was?  Or maybe 4).

> On Mar 11, 2022, at 8:41 AM, Stephen Frost  wrote:
> >
> > That we aren't discussing the issues with the current GRANT ... WITH
> > ADMIN OPTION and how we deviate from what the spec calls for when it
> > comes to DROP ROLE, which seems to be the largest thing that's
> > 'solved' with this ownership concept, is concerning to me.
>
> Sure, let's discuss that a bit more.  Here is my best interpretation of
> your post about the spec, when applied to postgres with an eye towards not
> doing any more damage than necessary:
>
> > On Mar 10, 2022, at 11:58 AM, Stephen Frost  wrote:
> >
> > let's look at what the spec says:
> >
> > CREATE ROLE
> >  - Who is allowed to run CREATE ROLE is implementation-defined
>
> This should be anyone with membership in pg_create_role.


That could be the case if we wished to go that route. I’d think in such
case we’d then also remove CREATEROLE as otherwise the documentation feels
like it’d be quite confusing.

>  - After creation, this is effectively run:
> >GRANT new_role TO creator_role WITH ADMIN, GRANTOR "_SYSTEM"
>
> This should internally be implemented as three separate privileges, one
> which means you can grant the role, another which means you can drop the
> role, and a third that means you're a member of the role.  That way, they
> can be independently granted and revoked.  We could make "WITH ADMIN" a
> short-hand for "WITH G, D, M" where G, D, and M are whatever we name the
> independent privileges Grant, Drop, and Member-of.


I mean, sure, we can get there, and possibly add more like if you’re
allowed to change or reset that role’s password and other things, but I
don’t see that this piece is required as part of the very first change in
this area.  Further, WITH ADMIN already gives grant and member today, so
you’re saying the only thing this change does that makes “WITH ADMIN” too
powerful is adding DROP to it, yet that’s explicitly what the spec calls
for.  In short, I disagree that moving the DROP ROLE right from CREATEROLE
roles having that across the entire system (excluding superusers) to WITH
ADMIN where the role who has that right can:

A) already become that role and drop all their objects
B) already GRANT that role to some other role

is a big issue.

> Splitting G and D helps with backwards compatibility, because it gives
> people who want the traditional postgres "admin" a way to get there, by
> granting "G+M".  Splitting M from G and D makes it simpler to implement the
> "bot" idea, since the bot shouldn't have M.  But it does raise a question
> about always granting G+D+M to the creator, since the bot is the creator
> and we don't want the bot to have M.  This isn't a problem I've invented
> from thin air, mind you, as G+D+M is just the definition of ADMIN per the
> SQL spec, if I've understood you correctly.  So we need to think a bit more
> about the pg_create_role built-in role and whether that needs to be further
> refined to distinguish those who can get membership in roles they create
> vs. those who cannot.  This line of reasoning takes me in the direction of
> what I think you were calling #5 upthread, but you'd have to elaborate on
> that, and how it interacts with the spec, for us to have a useful
> conversation about it.


All that said, as I said before, I’m in favor of splitting things up and so
if you want to do that as part of this initial work, sure. Idk that it’s
absolutely required as part of this but I’m not going to complain if it’s
included either.  I agree that would allow folks to get something similar
to what they could get today if they want.

I agree that the split up helps with the “bot” idea, as we could at least
then create a security definer function that the bot runs and which creates
roles that the bot then has G for but not M or D.  Even better would be to
also provide a way for the “bot” to be able to create roles without the
need for a security definer function where it doesn’t automatically get all
three, and that was indeed what I was thinking about with the template
idea. The genera

Re: generic plans and "initial" pruning

2022-03-11 Thread Zhihong Yu
Hi,
w.r.t. v5-0003-Teach-AcquireExecutorLocks-to-skip-locking-pruned.patch :

(pruning steps containing expressions that can be computed before
before the executor proper has started)

the word 'before' was repeated.

For ExecInitParallelPlan():

+   char   *execlockrelsinfo_data;
+   char   *execlockrelsinfo_space;

the content of execlockrelsinfo_data is copied into execlockrelsinfo_space.
I wonder if having one of execlockrelsinfo_data and
execlockrelsinfo_space suffices.

Cheers


Re: wal_compression=zstd

2022-03-11 Thread Justin Pryzby
While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC.

Also, there's a dangling "and".

+The supported methods are pglz,
+lz4 (if PostgreSQL
+was compiled with --with-lz4) and
+zstd (if PostgreSQL
+was compiled with --with-zstd) and
+The default value is off.
+Only superusers can change this setting.

-- 
Justin




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas  wrote:
> Have you been able to create a test case for that? The largest record I
> can think of is a commit record with a huge number of subtransactions,
> dropped relations, and shared inval messages. I'm not sure if you can
> overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

I believe that wal_level=logical can generate very large update and
delete records, especially with REPLICA IDENTITY FULL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: On login trigger: take three

2022-03-11 Thread Robert Haas
On Tue, Feb 15, 2022 at 5:07 AM Greg Nancarrow  wrote:
> I've attached a re-based version (no functional changes from the
> previous) to fix cfbot failures.

I tried this:

rhaas=# create function on_login_proc() returns event_trigger as
$$begin perform pg_sleep(1000); end$$ language plpgsql;
CREATE FUNCTION
rhaas=# create event trigger on_login_trigger on login execute
procedure on_login_proc();

When I then attempt to connect via psql, it hangs, as expected. When I
press ^C, psql exits, but the backend process is not cancelled and
just keeps chugging along in the background. The good news is that if
I connect to another database, I can cancel all of the hung sessions
using pg_cancel_backend(), and all of those processes then promptly
exit, and presumably I could accomplish the same thing by sending them
SIGINT directly. But it's still not great behavior. It would be easy
to use up a pile of connection slots inadvertently and have to go to
some trouble to get access to the server again. Since this is a psql
behavior and not a server behavior, one could argue that it's
unrelated to this patch, but in practice this patch seems to increase
the chances of people running into problems quite a lot.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Issue with pg_stat_subscription_stats

2022-03-11 Thread Melanie Plageman
So, I noticed that pg_stat_reset_subscription_stats() wasn't working
properly, and, upon further investigation, I'm not sure the view
pg_stat_subscription_stats is being properly populated.

I don't think subscriptionStatHash will be created properly, and the
reset timestamp won't be initialized, without the following code:

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 53ddd930e6..0b8c5436e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3092,7 +3092,7 @@ pgstat_fetch_stat_subscription(Oid subid)
/* Load the stats file if needed */
backend_read_statsfile();

-   return pgstat_get_subscription_entry(subid, false);
+   return pgstat_get_subscription_entry(subid, true);
 }

 /*
@@ -6252,7 +6252,7 @@ pgstat_get_subscription_entry(Oid subid, bool create)

/* If not found, initialize the new one */
if (!found)
-   pgstat_reset_subscription(subentry, 0);
+   pgstat_reset_subscription(subentry, GetCurrentTimestamp());

return subentry;
 }

- melanie




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Heikki Linnakangas

On 11/03/2022 17:42, Matthias van de Meent wrote:

Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize; but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.


Oops, that would be nasty.


Similarly; it is possible to repeatedly call XlogRegisterData() so as
to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
writes while processing record data;


And that too.

Have you been able to create a test case for that? The largest record I 
can think of is a commit record with a huge number of subtransactions, 
dropped relations, and shared inval messages. I'm not sure if you can 
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.



PFA a patch that attempts to fix both of these issues in the insertion
API; by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B; and why that limit exists.
diff --git a/src/backend/access/transam/xloginsert.c 
b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
 
 	if (num_rdatas >= max_rdatas)

elog(ERROR, "too much WAL data");
+
+   /* protect against overflow */
+   if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+   elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
 
 	rdata->data = data;


Could check for just AllocSizeValid(mainrdata_len), if we're only 
worried about the total size of the data to exceed the limit, and assume 
that each individual piece of data is smaller.


We also don't check for negative 'len'. I think that's fine, the caller 
bears some responsibility for passing valid arguments too. But maybe 
uint32 or size_t would be more appropriate here.


I wonder if these checks hurt performance. These are very cheap, but 
then again, this codepath is very hot. It's probably fine, but it still 
worries me a little. Maybe some of these could be Asserts.
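For illustration, the Assert variant could look like this (my sketch, not
part of the posted patch), keeping the defense in assert-enabled builds
only:

	/* Overflow should be impossible for well-behaved callers. */
	Assert((uint64) mainrdata_len + (uint64) len <= UINT32_MAX);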



@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 
 	if (num_rdatas >= max_rdatas)

elog(ERROR, "too much WAL data");
+
+   /* protect against overflow */
+   if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+   elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
 
 	rdata->data = data;


Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble, 
that's the real limit. And if you check for that here, you don't need to 
check it in XLogRecordAssemble.
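Something like this, perhaps (sketch only, reusing the names from the
quoted hunk):

	/* Per-buffer data is length-limited, so reject large chunks here. */
	if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT16_MAX))
		elog(ERROR, "too much WAL data for registered buffer");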



@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
   XLogRecPtr *fpw_lsn, int *num_fpi, bool 
*topxid_included)
 {
XLogRecData *rdt;
-   uint32  total_len = 0;
+   uint64  total_len = 0;
int block_id;
pg_crc32c   rdata_crc;
registered_buffer *prev_regbuf = NULL;


I don't think the change to uint64 is necessary, if all the data blocks 
are limited to 64 kB and the number of blocks is limited too.



@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 		if (needs_data)

{
+   /* protect against overflow */
+   if (unlikely(regbuf->rdata_len > UINT16_MAX))
+   elog(ERROR, "too much WAL data for registered 
buffer");
+
/*
 * Link the caller-supplied rdata chain for this buffer 
to the
 * overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
 
+	/*

+* Ensure that xlogreader.c can read the record; and check that we don't
+* accidentally overflow the size of the record.
+* */
+   if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+   elog(ERROR, "too much registered data for WAL record");
+
/*
 * Fill in the fields in the record header. Prev-link is filled in 
later,
 * once we know where in the WAL the record will be inserted. The CRC 
does


It's enough to check AllocSizeIsValid(total_len), no need to also check 
against UINT32_MAX.
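That is, the final check could reduce to something like this (sketch;
MaxAllocSize is already well below UINT32_MAX):

	/* Ensure xlogreader.c will accept the assembled record. */
	if (unlikely(!AllocSizeIsValid(total_len)))
		elog(ERROR, "too much registered data for WAL record");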


- Heikki




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Robert Haas
On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi
 wrote:
> It seems to me too rigorous that pg_get_wal_records_info/stats()
> reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> the real end-of-WAL is more kind to users.  I think the same with the
> restriction that start and end LSN are required to be different.

In his review just yesterday, Jeff suggested this: "Don't issue
WARNINGs or other messages for ordinary situations, like when
pg_get_wal_records_info() hits the end of WAL." I think he's entirely
right, and I don't think any patch that does otherwise should get
committed. It is worth remembering that the results of queries are
often examined by something other than a human being sitting at a psql
terminal. Any tool that uses this is going to want to understand what
happened from the result set, not by parsing strings that may show up
inside warning messages.

I think that the right answer here is to have a function that returns
one row per record parsed, and each row should also include the start
and end LSN of the record. If for some reason the WAL records return
start after the specified start LSN (e.g. because we skip over a page
header) or end before the specified end LSN (e.g. because we reach
end-of-WAL) the user can figure it out from looking at the LSNs in the
output rows and comparing them to the LSNs provided as input.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar  wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.

I think there's something wrong with what this patch is doing with the
XLOG records. It adds XLOG_DBASE_CREATEDIR, but the only new
XLogInsert() calls in the patch are passing XLOG_DBASE_CREATE, and no
existing references are adjusted. Similarly with xl_dbase_create_rec
and xl_dbase_createdir_rec. Why would we introduce a new record type
and not use it?

Let's not call the functions for the different strategies
CopyDatabase() and CopyDatabaseWithWal() but rather something that
matches up to the strategy names e.g. create_database_using_wal_log()
and create_database_using_file_copy(). There's something a little
funny about the names wal_log and file_copy ... they're not quite
parallel grammatically. But it's probably OK.

The changes to createdb_failure_params make me a little nervous. I
think we'd be in real trouble if we failed before completing both
DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
looks to me like those are both intended to be no-fail operations, so
I don't see an actual hazard here. But, hmm, what about on the
recovery side? Suppose that we start copying the database block by
block and then either (a) the standby is promoted before the copy is
finished or (b) the copy fails. Now the standby has data in
shared_buffers for a database that does not exist. If that's not bad,
then why does createdb_failure_params need to DropDatabaseBuffers()?
But I bet it is bad. I wonder if we should be using
RelationCopyStorage() rather than this new function
RelationCopyStorageUsingBuffer(). That would avoid having the buffers
in shared_buffers, dodging the problem. But then we have a problem
with checkpoint interlocking: we could begin replay from a checkpoint
and find that the pages that were supposed to get copied prior to the
checkpoint were actually not copied, because the checkpoint record
could be written after we've logged a page being copied and before we
actually write the page. Or, we could crash after writing a whole lot
of pages and a checkpoint record, but before RelationCopyStorage()
fsyncs the destination fork. It doesn't seem advisable to hold off
checkpoints for the time it takes to copy an entire relation fork, so
the solution is apparently to keep the data in shared buffers after
all. But that brings us right back to square one. Have you thought
through this whole problem carefully? It seems like a total mess to me
at the moment, but maybe I'm missing something.

There seems to be no reason to specify specific values for the members
of enum CreateDBStrategy.

I think the naming of some of the new functions might need work, in
particular GetRelInfoFromTuple, GetRelListFromPage, and
GetDatabaseRelationList. The names seem kind of generic for what
they're doing. Maybe ScanSourceDatabasePgClass,
ScanSourceDatabasePgClassPage, ScanSourceDatabasePgClassTuple?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposal for internal Numeric to Uint64 conversion function.

2022-03-11 Thread Tom Lane
Amul Sul  writes:
> On Wed, Feb 16, 2022 at 4:50 PM Peter Eisentraut
>  wrote:
>> On 16.02.22 06:00, Amul Sul wrote:
>>> Currently, numeric_pg_lsn is the only one that accepts the Numeric
>>> value and converts to uint64 and that is the reason all the type
>>> conversion code is embedded into it.

>> There are other functions such as numeric_int8() that work similarly.
>> If you are going to refactor, then they should all be treated similarly.
>> I'm not sure if it's going to end up being beneficial.

> Yeah, that's true, I am too not sure if we really need to refactor
> all those; If we want, I can give it a try.

There are several places that call numeric_int8, and right now they
have to go through DirectFunctionCall1, which is ugly and inefficient.
I think a refactoring that exposes some more-convenient API for that
could be useful.  The patch as-presented isn't very compelling for
lack of callers of the new function; but if it were handling the
int64 and uint64 cases alike, and maybe the float8 case too, that
would seem more convincing.  We already expose APIs like int64_to_numeric,
so the lack of similar APIs for the other direction seems odd.

It also feels to me that numeric_pg_lsn(), per se, doesn't belong in
numeric.c.  A pretty direct comparison is numeric_cash(), which is
not in numeric.c but cash.c.

regards, tom lane




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-11 Thread Nathan Bossart
+CheckpointStats.repl_map_cutoff_lsn = cutoff;

Could we set repl_map_cutoff_lsn closer to where it is calculated?  Right
now, it's set at the bottom of the function just before the directory is
freed.  Is there a strong reason to do it there?

+"logical rewrite mapping file(s) removed/synced=" 
UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",

Can we split out the "removed" versus "synced" files?  Otherwise, I think
this is basically just telling us how many files are in the mappings
directory.
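For example, something along these lines at the point where each mapping
file is either unlinked or fsync'd (a sketch with hypothetical names, not
from the patch; "unlink_needed" stands for whatever condition decides
removal):

	/* Hypothetical split counters; names are illustrative only. */
	if (unlink_needed)
		CheckpointStats.repl_map_files_removed++;
	else
		CheckpointStats.repl_map_files_synced++;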

+CheckpointStats.repl_snap_cutoff_lsn = cutoff;

I have the same note for this one as I have for repl_map_cutoff_lsn.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-11 Thread Nathan Bossart
On Fri, Mar 11, 2022 at 02:00:37PM -0500, Chapman Flack wrote:
> v7 looks good to me. I have repeated the installcheck-world and given
> the changed documentation sections one last read-through, and will
> change the status to RfC.

Thanks for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-11 Thread Chapman Flack
On 03/10/22 19:38, Nathan Bossart wrote:
> On Thu, Mar 10, 2022 at 07:13:14PM -0500, Chapman Flack wrote:
>> +postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn);
> 
> Ah, good catch.  I made this change in v7.  I considered doing something

v7 looks good to me. I have repeated the installcheck-world and given
the changed documentation sections one last read-through, and will
change the status to RfC.

Regards,
-Chap




Re: refactoring basebackup.c

2022-03-11 Thread Robert Haas
On Tue, Feb 15, 2022 at 11:26 AM Robert Haas  wrote:
> On Wed, Feb 9, 2022 at 8:41 AM Abhijit Menon-Sen  wrote:
> > It took me a while to assimilate these patches, including the backup
> > targets one, which I hadn't looked at before. Now that I've wrapped my
> > head around how to put the pieces together, I really like the idea. As
> > you say, writing non-trivial integrations in C will take some effort,
> > but it seems worthwhile. It's also nice that one can continue to use
> > pg_basebackup to trigger the backups and see progress information.
>
> Cool. Thanks for having a look.
>
> > Yes, it looks simple to follow the example set by basebackup_to_shell to
> > write a custom target. The complexity will be in whatever we need to do
> > to store/forward the backup data, rather than in obtaining the data in
> > the first place, which is exactly as it should be.
>
> Yeah, that's what made me really happy with how this came out.
>
> Here's v2, rebased and with documentation added.

I don't hear many comments on this, but I'm pretty sure that it's a
good idea, and there haven't been many objections to this patch series
as a whole, so I'd like to proceed with it. If nobody objects
vigorously, I'll commit this next week.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 1:10 PM Robert Haas  wrote:
> I don't think you've adequately considered temporary relations here.
> It seems to me that ReadBufferWithoutRelcache() could not be safe on a
> temprel, because we'd need a BackendId to access the underlying
> storage. So I think that ReadBufferWithoutRelcache can only accept
> unlogged or permanent, and maybe the argument ought to be a Boolean
> instead of a relpersistence value. I thought that this problem might
> be only cosmetic, but I checked the code that actually does the copy,
> and there's no filter there on relpersistence either. And I think
> there should be.

I hit "send" too quickly there:

rhaas=# create database fudge;
CREATE DATABASE
rhaas=# \c fudge
You are now connected to database "fudge" as user "rhaas".
fudge=# create temp table q ();
CREATE TABLE
fudge=# ^Z
[2]+  Stopped psql
[rhaas Downloads]$ pg_ctl stop -mi
waiting for server to shut down done
server stopped
[rhaas Downloads]$ %%
psql
\c
You are now connected to database "fudge" as user "rhaas".
fudge=# select * from pg_class where relpersistence='t';
  oid  | relname | relnamespace | reltype | reloftype | relowner |
relam | relfilenode | reltablespace | relpages | reltuples |
relallvisible | reltoastrelid | relhasindex | relisshared |
relpersistence | relkind | relnatts | relchecks | relhasrules |
relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity
| relispopulated | relreplident | relispartition | relrewrite |
relfrozenxid | relminmxid | relacl | reloptions | relpartbound
---+-+--+-+---+--+---+-+---+--+---+---+---+-+-++-+--+---+-++++-++--+++--++++--
 16388 | q   |16386 |   16390 | 0 |   10 |
2 |   16388 | 0 |0 |-1 | 0
| 0 | f   | f   | t  | r
|0 | 0 | f   | f  | f
| f  | f   | t  | d
| f  |  0 |  721 |  1 ||
 |
(1 row)

fudge=# \c rhaas
You are now connected to database "rhaas" as user "rhaas".
rhaas=# alter database fudge is_template true;
ALTER DATABASE
rhaas=# create database cookies template fudge;
CREATE DATABASE
rhaas=# \c cookies
You are now connected to database "cookies" as user "rhaas".
cookies=# select count(*) from pg_class where relpersistence='t';
 count
---
 1
(1 row)

You have to be quick, because autovacuum will drop the orphaned temp
table when it notices it, but it is possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar  wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.

I don't think you've adequately considered temporary relations here.
It seems to me that ReadBufferWithoutRelcache() could not be safe on a
temprel, because we'd need a BackendId to access the underlying
storage. So I think that ReadBufferWithoutRelcache can only accept
unlogged or permanent, and maybe the argument ought to be a Boolean
instead of a relpersistence value. I thought that this problem might
be only cosmetic, but I checked the code that actually does the copy,
and there's no filter there on relpersistence either. And I think
there should be.


-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: refactoring basebackup.c

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 11:29 AM Justin Pryzby  wrote:
> Sounds right.

OK, committed.

> Also, I think the magic 8 for .gz should actually be a 7.
>
> I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to 
> make
> them all less magic.
>
> commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e
> Author: Justin Pryzby 
> Date:   Thu Mar 10 21:22:16 2022 -0600

Yeah, your patch looks right. Committed that, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Mark Dilger



> On Mar 11, 2022, at 8:48 AM, Stephen Frost  wrote:
> 
> I agree that it would have an impact on backwards compatibility to
> change how WITH ADMIN works- but it would also get us more in line with
> what the SQL standard says for how WITH ADMIN is supposed to work and
> that seems worth the change to me.

I'm fine with giving up some backwards compatibility to get some SQL standard 
compatibility, as long as we're clear that is what we're doing.  What you say 
about the SQL spec isn't great, though, because too much power is vested in 
"ADMIN".  I see "ADMIN" as at least three separate privileges together.  Maybe 
it would be spec compliant to implement "ADMIN" as a synonym for a set of 
separate privileges? 

> On Mar 11, 2022, at 8:41 AM, Stephen Frost  wrote:
> 
> That we aren't discussing the issues with the current GRANT ... WITH
> ADMIN OPTION and how we deviate from what the spec calls for when it
> comes to DROP ROLE, which seems to be the largest thing that's
> 'solved' with this ownership concept, is concerning to me.

Sure, let's discuss that a bit more.  Here is my best interpretation of your 
post about the spec, when applied to postgres with an eye towards not doing any 
more damage than necessary:

> On Mar 10, 2022, at 11:58 AM, Stephen Frost  wrote:
> 
> let's look at what the spec says:
> 
> CREATE ROLE
>  - Who is allowed to run CREATE ROLE is implementation-defined

This should be anyone with membership in pg_create_role.  

>  - After creation, this is effectively run:
>GRANT new_role TO creator_role WITH ADMIN, GRANTOR "_SYSTEM"

This should internally be implemented as three separate privileges, one which 
means you can grant the role, another which means you can drop the role, and a 
third that means you're a member of the role.  That way, they can be 
independently granted and revoked.  We could make "WITH ADMIN" a short-hand for 
"WITH G, D, M" where G, D, and M are whatever we name the independent 
privileges Grant, Drop, and Member-of.

Splitting G and D helps with backwards compatibility, because it gives people 
who want the traditional postgres "admin" a way to get there, by granting 
"G+M".  Splitting M from G and D makes it simpler to implement the "bot" idea, 
since the bot shouldn't have M.  But it does raise a question about always 
granting G+D+M to the creator, since the bot is the creator and we don't want 
the bot to have M.  This isn't a problem I've invented from thin air, mind you, 
as G+D+M is just the definition of ADMIN per the SQL spec, if I've understood 
you correctly.  So we need to think a bit more about the pg_create_role 
built-in role and whether that needs to be further refined to distinguish those 
who can get membership in roles they create vs. those who cannot.  This line of 
reasoning takes me in the direction of what I think you were calling #5 
upthread, but you'd have to elaborate on that, and how it interacts with the 
spec, for us to have a useful conversation about it.

> DROP ROLE
>  - Any user who has been GRANT'd a role with ADMIN option is able to
>DROP that role.

Change this to "Any role who has D on the role".  That's spec compliant, 
because anyone granted ADMIN necessarily has D.

> GRANT ROLE
>  - No cycles allowed
>  - A role must have ADMIN rights on the role to be able to GRANT it to
>another role.

Change this to "Any role who has G on the role".  That's spec compliant, 
because anyone grant ADMIN necessarily has G.

We should also fix the CREATE ROLE command to require the grantor have G on a 
role in order to give it to the new role as part of the command.  Changing the 
CREATEROLE, CREATEDB, REPLICATION, and BYPASSRLS attributes into 
pg_create_role, pg_create_db, pg_replication, and pg_bypassrls, the creator 
could only give them to the created role if the creator has G on the roles.  If 
we do this, we could keep the historical privilege bits and their syntax 
support for backward compatibility, or we could rip them out, but the decision 
between those two options is independent of the rest of the design.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Collecting statistics about contents of JSONB columns

2022-03-11 Thread Mahendra Singh Thalor
On Fri, 4 Feb 2022 at 08:30, Tomas Vondra 
wrote:
>
>
>
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 3 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 3 2 1000 8 6 1000
>

Thanks, Tomas for this test case.

Hi Hackers,

For the last few days, I was doing testing on top of these JSON
optimizer patches and was taking help from Tomas Vondra to understand the
patches and the testing results.
Thanks, Tomas, for your feedback and suggestions.

Below is the summary:
*Point 1)* analyze takes a very long time for large documents:
For large JSON documents, analyze took a very long time as compared to the
current head. For reference, I am attaching the test file (./json-generate.py
3 2 1000 8 6 1000)

Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours

analyze takes so long because, with these patches, we first iterate over all
sample rows (in the above case 3) and store all the paths (here around 850k
paths).
In a second pass, we take one path at a time and collect stats for that
particular path by analyzing all the sample rows, and we continue this
process for all 850k paths; in other words, we do 850k loops, and in each
loop we extract values for a single path.

*Point 2)* memory consumption increases rapidly for large documents:
In the above test case, there are about 851k paths in total, and to keep
stats for one path, we allocate 1120 bytes.

Total paths : 852689 ~ 852k

Memory for 1 path to keep stats: 1120 ~ 1 KB

(sizeof(JsonValueStats) = 1120 from “Analyze Column”)

Total memory for all paths: 852689 * 1120 = 955011680 ~ 955 MB

The actual memory per path will be even more: while analyzing each path, we
allocate some additional memory based on frequency and other factors.

To keep all entries (851k paths) in the hash, we use around 1 GB of memory
for the hash, so this is also very large.

*Point 3*) Review comment noticed by Tomas Vondra:

+   oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+   pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+   MemoryContextSwitchTo(oldcxt);

Above should be:
+   oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+   pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+   MemoryContextSwitchTo(oldcxt);

*Response from Tomas Vondra:*
The problem is "anl_context" is actually "Analyze", i.e. the context for
the whole ANALYZE command, for all the columns. But we only want to keep
those path stats while processing a particular column. At the end, after
processing all paths from a column, we need to "build" the final stats in
the column, and this result needs to go into "Analyze" context. But all the
partial results need to go into "Analyze Column" context.

*Point 4)*

+/*
+ * jsonAnalyzeCollectPath
+ * Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+   JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+   JsonbValue  jbvtmp;
+   JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+   JsonPathEntry *path;
+   JsonPathEntry **entries;
+   int i;
+
+   entries = palloc(sizeof(*entries) * pstats->depth);
+
+   /* Build entry array in direct order */
+   for (path = &pstats->path, i = pstats->depth - 1;
+        path->parent && i >= 0;
+        path = path->parent, i--)
+       entries[i] = path;
+
+   jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+   pfree(entries);

Many times we are trying to palloc with zero size, and entries then points
to invalid memory (because pstats->depth = 0), so I think we should not try
to palloc with 0.

*Fix:*

+   if (pstats->depth)
+       entries = palloc(sizeof(*entries) * pstats->depth);
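A slightly fuller sketch of that guard (my addition, since pfree() does not
accept NULL), assuming entries is never dereferenced when pstats->depth is
0:

	JsonPathEntry **entries = NULL;

	/* Avoid palloc(0); entries stays NULL for depth-0 paths. */
	if (pstats->depth > 0)
		entries = palloc(sizeof(*entries) * pstats->depth);

	/* ... build the entry array and collect the subpath as before ... */

	if (entries != NULL)
		pfree(entries);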



From these points, we can say that we should rethink our design for
collecting stats for all paths.

We can set limits (like MCV) on the number of paths, or we can let the user
give an explicit path to collect stats for that particular path only, or we
can pass a subset of the JSON values.

In the above case, there are 851k paths in total, but we could collect stats
for only the 1000 most common paths; this way we minimize both time and
memory, and we might even keep at least frequencies for the non-analyzed
paths.

Next, I will take the latest patches from Nikita's last email and I will do
more tests.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


Re: role self-revocation

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 11:34 AM Tom Lane  wrote:
> Note that either case would also involve making entries in pg_shdepend;
> although for the case of roles owned by/granted to the bootstrap
> superuser, we could omit those on the usual grounds that we don't need
> to record dependencies on pinned objects.

That makes sense to me, but it still doesn't solve the problem of
agreeing on role ownership vs. WITH ADMIN OPTION vs. something else.

I find it ironic (and frustrating) that Mark implemented what I think
is basically what you're arguing for, it got stuck because Stephen
didn't like it, we then said OK so let's try to find out what Stephen
would like, only to have you show up and say that it's right the way
he already had it. I'm not saying that you're wrong, or for that
matter that he's wrong. I'm just saying that if both of you are
absolutely bent on having it the way you want it, either one of you is
going to be sad, or we're not going to make any progress.

Never mind the fact that neither of you seem interested in even giving
a hearing to my preferred way of doing it. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Mar 11, 2022 at 11:12 AM Mark Dilger
>  wrote:
> > This issue of how much backwards compatibility breakage we're willing to 
> > tolerate is just as important as questions about how we would want roles to 
> > work in a green-field development project.  The sense I got a year ago, on 
> > this list, was that changing CREATEROLE was acceptable, but changing other 
> > parts of the system, such as how ADMIN OPTION works, would go too far.

That we deviate as far as we do when it comes to the SQL spec is
something that I don't feel like I had a good handle on when discussing
this previously (that the spec doesn't talk about 'admin option' really
but rather 'grantable authorization identifiers' or whatever it is
doesn't help... but still, that's on me, sorry about that).

> > Role ownership did not yet exist, and that was a big motivation in 
> > introducing that concept, because you couldn't credibly say it broke other 
> > existing features.  It introduces the new notion that when a superuser 
> > creates a role, the superuser owns it, which is identical to how things 
> > implicitly work today; and when a CREATEROLE non-superuser creates a role, 
> > that role owns the new role, which is different from how it works today, 
> > arguably breaking CREATEROLE's prior behavior.  *But it doesn't break 
> > anything else*.
> >
> > If we're going to change how ADMIN OPTION works, or how role membership 
> > works, or how inherit/noinherit works, let's first be clear that we are 
> > willing to accept whatever backwards incompatibility that entails.  This is 
> > not a green-field development project.  The constant spinning around with 
> > regard to how much compatibility we need to preserve is giving me vertigo.

I agree that it would have an impact on backwards compatibility to
change how WITH ADMIN works- but it would also get us more in line with
what the SQL standard says for how WITH ADMIN is supposed to work and
that seems worth the change to me.

> On the other hand, changing ADMIN OPTION does have compatibility and
> spec-compliance ramifications. I think Stephen is arguing that we can
> solve this problem while coming closer to the spec, and I think we
> usually consider getting closer to the spec to be a sufficient reason
> for breaking backward compatibility (cf. standard_conforming_strings).

Indeed.

> But I don't know whether he is correct when he argues that the spec
> makes admin option on a role sufficient to drop the role. I've never
> had any luck understanding what the SQL specification is saying about
> any topic.

I'm happy to point you to what the spec says and to discuss it further
if that would be helpful, or to get other folks to comment on it.  I
agree that it's definitely hard to grok at times.  In this particular
case what I'm looking at is, under DROP ROLE / Access Rules, there's
only one sentence:

There shall exist at least one grantable role authorization descriptor
whose role name is R and whose grantee is an enabled authorization
identifier.

A bit of decoding: 'grantable role authorization descriptor' is a GRANT
of a role WITH ADMIN OPTION.  The role name 'R' is the role specified.
The 'grantee' is who that role R was GRANT'd to, and 'enabled
authorization identifier' is basically "has_privs_of_role()" (note that
in the spec you can have roles that you're a member of but which are
*not* currently enabled).

Hopefully that helps.

Thanks,

Stephen




Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > If we implement the link between the creating role and the created
> > role as role ownership, then we are surely just going to add a
> > rolowner column to pg_authid, and when the role is owned by nobody, I
> > think we should always just store a valid OID in it, rather than
> > sometimes storing 0. It just seems simpler. Any time we would store 0,
> > store the bootstrap superuser's pg_authid.oid value instead. That way
> > the OID is always valid, which probably lets us get by with fewer
> > special cases in the code.

We haven't got any particularly special cases in the code today for what
happens if we run up the role hierarchy to a point that it ends and so
I'm not sure why adding in a whole new concept around role ownership,
which doesn't exist in the spec, would somehow leave us with fewer such
special cases.

> +1.
> 
> Note that either case would also involve making entries in pg_shdepend;
> although for the case of roles owned by/granted to the bootstrap
> superuser, we could omit those on the usual grounds that we don't need
> to record dependencies on pinned objects.

That we aren't discussing the issues with the current GRANT ... WITH
ADMIN OPTION and how we deviate from what the spec calls for when it
comes to DROP ROLE, which seems to be the largest thing that's
'solved' with this ownership concept, is concerning to me.

If we go down the route of adding role ownership, are we going to
document that we explicitly go against the SQL standard when it comes to
how DROP ROLE works?  Or are we going to fix DROP ROLE?  I'd much prefer
the latter, but doing so then largely negates the point of this role
ownership concept.  I don't see how it makes sense to do both.

Thanks,

Stephen




Re: role self-revocation

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 11:12 AM Mark Dilger
 wrote:
> This issue of how much backwards compatibility breakage we're willing to 
> tolerate is just as important as questions about how we would want roles to 
> work in a green-field development project.  The sense I got a year ago, on 
> this list, was that changing CREATEROLE was acceptable, but changing other 
> parts of the system, such as how ADMIN OPTION works, would go too far.
>
> Role ownership did not yet exist, and that was a big motivation in 
> introducing that concept, because you couldn't credibly say it broke other 
> existing features.  It introduces the new notion that when a superuser 
> creates a role, the superuser owns it, which is identical to how things 
> implicitly work today; and when a CREATEROLE non-superuser creates a role, 
> that role owns the new role, which is different from how it works today, 
> arguably breaking CREATEROLE's prior behavior.  *But it doesn't break 
> anything else*.
>
> If we're going to change how ADMIN OPTION works, or how role membership 
> works, or how inherit/noinherit works, let's first be clear that we are 
> willing to accept whatever backwards incompatibility that entails.  This is 
> not a green-field development project.  The constant spinning around with 
> regard to how much compatibility we need to preserve is giving me vertigo.

I mean, I agree that the backward compatibility ramifications of every
idea need to be considered, but I agree even more that the amount of
spinning around here is pretty insane. My feeling is that neither role
owners nor tenants introduce any real concerns about
backward-compatibility or, for that matter, SQL standards compliance,
notwithstanding Stephen's argument to the contrary. Every vendor
extends the standard with their own stuff, and we've done that as
well, and we can do it in more places.

On the other hand, changing ADMIN OPTION does have compatibility and
spec-compliance ramifications. I think Stephen is arguing that we can
solve this problem while coming closer to the spec, and I think we
usually consider getting closer to the spec to be a sufficient reason
for breaking backward compatibility (cf. standard_conforming_strings).
But I don't know whether he is correct when he argues that the spec
makes admin option on a role sufficient to drop the role. I've never
had any luck understanding what the SQL specification is saying about
any topic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Tom Lane
Robert Haas  writes:
> If we implement the link between the creating role and the created
> role as role ownership, then we are surely just going to add a
> rolowner column to pg_authid, and when the role is owned by nobody, I
> think we should always just store a valid OID in it, rather than
> sometimes storing 0. It just seems simpler. Any time we would store 0,
> store the bootstrap superuser's pg_authid.oid value instead. That way
> the OID is always valid, which probably lets us get by with fewer
> special cases in the code.

+1.

Note that either case would also involve making entries in pg_shdepend;
although for the case of roles owned by/granted to the bootstrap
superuser, we could omit those on the usual grounds that we don't need
to record dependencies on pinned objects.

regards, tom lane




Re: refactoring basebackup.c

2022-03-11 Thread Justin Pryzby
On Fri, Mar 11, 2022 at 10:19:29AM -0500, Robert Haas wrote:
> So I think we should just refuse this command. Patch for that attached.

Sounds right.

Also, I think the magic 8 for .gz should actually be a 7.

I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to make
them all less magic.

commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e
Author: Justin Pryzby 
Date:   Thu Mar 10 21:22:16 2022 -0600

pg_basebackup: make magic numbers less magic

The magic 8 for .gz should actually be a 7.

.tar.gz
1234567

.tar.lz4
.tar.zst
12345678

See d45099425, 751b8d23b, 7cf085f07.

diff --git a/src/bin/pg_basebackup/pg_basebackup.c 
b/src/bin/pg_basebackup/pg_basebackup.c
index 9f3ecc60fbe..8dd9721323d 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1223,17 +1223,17 @@ CreateBackupStreamer(char *archive_name, char 
*spclocation,
is_tar = (archive_name_len > 4 &&
  strcmp(archive_name + archive_name_len - 4, ".tar") 
== 0);
 
-   /* Is this a gzip archive? */
-   is_tar_gz = (archive_name_len > 8 &&
-strcmp(archive_name + archive_name_len - 3, 
".gz") == 0);
+   /* Is this a .tar.gz archive? */
+   is_tar_gz = (archive_name_len > 7 &&
+strcmp(archive_name + archive_name_len - 7, 
".tar.gz") == 0);
 
-   /* Is this a LZ4 archive? */
+   /* Is this a .tar.lz4 archive? */
is_tar_lz4 = (archive_name_len > 8 &&
- strcmp(archive_name + archive_name_len - 4, 
".lz4") == 0);
+ strcmp(archive_name + archive_name_len - 8, 
".tar.lz4") == 0);
 
-   /* Is this a ZSTD archive? */
+   /* Is this a .tar.zst archive? */
is_tar_zstd = (archive_name_len > 8 &&
-  strcmp(archive_name + archive_name_len - 4, 
".zst") == 0);
+  strcmp(archive_name + archive_name_len - 8, 
".tar.zst") == 0);
 
/*
 * We have to parse the archive if (1) we're suppose to extract it, or 
if




Re: role self-revocation

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 10:46 AM Tom Lane  wrote:
> The bootstrap superuser clearly must be a special case in some way.
> I'm not convinced that that means there should be other special
> cases.  Maybe there is a use-case for other "unowned" roles, but in
> exactly what way would that be different from deeming such roles
> to be owned by the bootstrap superuser?

I think that just boils down to how many useless catalog entries you
want to make.

If we implement the link between the creating role and the created
role as role ownership, then we are surely just going to add a
rolowner column to pg_authid, and when the role is owned by nobody, I
think we should always just store a valid OID in it, rather than
sometimes storing 0. It just seems simpler. Any time we would store 0,
store the bootstrap superuser's pg_authid.oid value instead. That way
the OID is always valid, which probably lets us get by with fewer
special cases in the code.
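To illustrate (hypothetical names throughout, since pg_authid has no
rolowner column today; creator_roleid and Anum_pg_authid_rolowner are made
up), the tuple-building side of CREATE ROLE would then always record a
valid owner:

	/* Always store a valid owner OID, never InvalidOid. */
	Oid		owner = OidIsValid(creator_roleid) ? creator_roleid
											   : BOOTSTRAP_SUPERUSERID;

	new_record[Anum_pg_authid_rolowner - 1] = ObjectIdGetDatum(owner);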

If we implement the link between the creating role and the created
role as an automatically-granted WITH ADMIN OPTION, then we could
choose to put (CREATOR_OID, CREATED_OID, whatever, TRUE) into
pg_auth_members for the creating superuser or, indeed, every superuser
in the system. Or we can leave it out. The result will be exactly the
same. Here, I would favor leaving it out, because extra catalog
entries that don't do anything are usually a thing that we do not
want. See a49d081235997c67e8aab7a523b17e8d1cb93184, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Mark Dilger



> On Mar 11, 2022, at 7:58 AM, Robert Haas  wrote:
> 
> This kind of thing is always a judgement call. If we were talking
> about breaking 'SELECT * from table', I'm sure it would be hard to
> convince anybody to agree to do that at all, let alone with no
> deprecation period. Fortunately, CREATEROLE is less used, so breaking
> it will inconvenience fewer people.

This issue of how much backwards compatibility breakage we're willing to 
tolerate is just as important as questions about how we would want roles to 
work in a green-field development project.  The sense I got a year ago, on this 
list, was that changing CREATEROLE was acceptable, but changing other parts of 
the system, such as how ADMIN OPTION works, would go too far.

Role ownership did not yet exist, and that was a big motivation in introducing 
that concept, because you couldn't credibly say it broke other existing 
features.  It introduces the new notion that when a superuser creates a role, 
the superuser owns it, which is identical to how things implicitly work today; 
and when a CREATEROLE non-superuser creates a role, that role owns the new 
role, which is different from how it works today, arguably breaking 
CREATEROLE's prior behavior.  *But it doesn't break anything else*.

If we're going to change how ADMIN OPTION works, or how role membership works, 
or how inherit/noinherit works, let's first be clear that we are willing to 
accept whatever backwards incompatibility that entails.  This is not a 
green-field development project.  The constant spinning around with regard to 
how much compatibility we need to preserve is giving me vertigo.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: role self-revocation

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 10:37 AM David G. Johnston
 wrote:
> I largely agree and am perfectly fine with going with the majority on this 
> point.  My vote would just fall on the conservative side.  But as so far no 
> one else seems to be overly concerned, nerfing CREATEROLE seems to be the 
> path forward.

This kind of thing is always a judgement call. If we were talking
about breaking 'SELECT * from table', I'm sure it would be hard to
convince anybody to agree to do that at all, let alone with no
deprecation period. Fortunately, CREATEROLE is less used, so breaking
it will inconvenience fewer people. Moreover, unlike 'SELECT * FROM
table', CREATEROLE is kinda broken, and it's less scary to make
changes to behavior that sucks in the first place than it is to make
changes to the behavior of things that are working well. For all of
that, there's no hard-and-fast rule that we couldn't keep the existing
behavior around, introduce a substitute, and eventually drop the old
thing. I'm just not clear that it's really worth it in this case. It'd
certainly be interesting to hear from anyone who is finding some
utility in the current system. It looks pretty crap to me, but it's
easy to bring too much of one's own bias to such judgements.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 11, 2022 at 10:27 AM Stephen Frost  wrote:
>> I agree that there would be a recorded relationship (that is, one that
>> we write into the catalog and keep around until and unless it's removed
>> by an admin) between creating and created roles and that's probably the
>> default when CREATE ROLE is run but, unlike tables or such objects in
>> the system, I don't agree that we should require this to exist at
>> absolutely all times for every role (what would it be for the bootstrap
>> superuser..?).  At least today, that's distinct from how ownership in
>> the system works.  I also don't believe that this is necessarily an
>> issue for Robert's use-case, as long as there are appropriate
>> restrictions around who is allowed to remove or modify these
>> relationships.

> I agree.

The bootstrap superuser clearly must be a special case in some way.
I'm not convinced that that means there should be other special
cases.  Maybe there is a use-case for other "unowned" roles, but in
exactly what way would that be different from deeming such roles
to be owned by the bootstrap superuser?

regards, tom lane




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Bharath Rupireddy
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi
 wrote:
> >
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of argument is the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns..  It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost same
> since the latter can show a single record.  However it is a bit
> annoying to do that. Since, other than it doesn't accept same LSNs for
> start and end, it doesn't show a record when there's no record in the
> specified LSN range.  But I don't think there's no usefulness of the
> behavior.

I would like to reassert the usability of pg_get_wal_record_info and
pg_get_wal_records_info:

pg_get_wal_record_info(lsn):
if lsn is invalid i.e. '0/0' - throws an error
if lsn is future lsn - throws an error
if lsn looks okay, it figures out the next available valid WAL record
and returns info about that

pg_get_wal_records_info(start_lsn, end_lsn default null) -> omitting
end_lsn gives the WAL records info till the end of WAL:
if start_lsn is invalid i.e. '0/0' - throws an error
if start_lsn is future lsn - throws an error
if end_lsn isn't provided by the user - calculates the end_lsn as
server's current flush lsn
if end_lsn is provided by the user - throws an error if it's future LSN
if start_lsn and end_lsn look okay, it returns info about all WAL
records from the next available valid WAL record of start_lsn until
end_lsn

So, both pg_get_wal_record_info and pg_get_wal_records_info are necessary IMHO.
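
To make that concrete, here is a minimal usage sketch (the LSNs are made
up and purely illustrative):

-- info about the single next valid WAL record at/after the given LSN
SELECT * FROM pg_get_wal_record_info('0/158A7F0');

-- info about all WAL records in a range; omitting end_lsn means
-- "up to the server's current flush LSN"
SELECT * FROM pg_get_wal_records_info('0/158A7F0', '0/1591400');
SELECT * FROM pg_get_wal_records_info('0/158A7F0');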

Coming to the behaviour when input lsn is '0/100', it's an issue
with XLogSegmentOffset(lsn, wal_segment_size) != 0 check, which I will
fix in the next version.

if (*first_record != lsn && XLogSegmentOffset(lsn, wal_segment_size) != 0)
    ereport(WARNING,
            (errmsg_plural("first record is after %X/%X, at %X/%X, skipping over %u byte",
                           "first record is after %X/%X, at %X/%X, skipping over %u bytes",
                           (*first_record - lsn),
                           LSN_FORMAT_ARGS(lsn),
                           LSN_FORMAT_ARGS(*first_record),
                           (uint32) (*first_record - lsn))));

Regards,
Bharath Rupireddy.




Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Matthias van de Meent
Hi,

xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize, but this is not currently enforced in the
XLogRecordAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.
Similarly, it is possible to repeatedly call XLogRegisterData() so as
to overflow rec->xl_tot_len, resulting in out-of-bounds reads and
writes while processing record data.

PFA a patch that attempts to fix both of these issues in the insertion
API; by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B; and why that limit exists.
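
For illustration only, the overflow case could be reached by hypothetical
extension code along these lines (SIZE stands in for a length a bit under
1 GB; this is not taken from the attached patch):

	char	   *buf = palloc(SIZE);

	XLogBeginInsert();
	for (int i = 0; i < 5; i++)
		XLogRegisterData(buf, SIZE);	/* total registered length exceeds UINT32_MAX */
	XLogInsert(RM_XLOG_ID, XLOG_NOOP);	/* xl_tot_len has silently wrapped around */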

Kind regards,

Matthias van de Meent
From a72121b8fccbbf8c289e71a2c76801a1004f5353 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v1] Add protections in xlog record APIs against large numbers
 and overflows.

Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
 src/backend/access/transam/xloginsert.c | 23 ++-
 src/include/access/xlogrecord.h |  4 
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
 
 	if (num_rdatas >= max_rdatas)
 		elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
 	rdata = &rdatas[num_rdatas++];
 
 	rdata->data = data;
@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 
 	if (num_rdatas >= max_rdatas)
 		elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
 	rdata = &rdatas[num_rdatas++];
 
 	rdata->data = data;
@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
    XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
 {
 	XLogRecData *rdt;
-	uint32		total_len = 0;
+	uint64		total_len = 0;
 	int			block_id;
 	pg_crc32c	rdata_crc;
 	registered_buffer *prev_regbuf = NULL;
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 		if (needs_data)
 		{
+			/* protect against overflow */
+			if (unlikely(regbuf->rdata_len > UINT16_MAX))
+elog(ERROR, "too much WAL data for registered buffer");
+
 			/*
 			 * Link the caller-supplied rdata chain for this buffer to the
 			 * overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
 		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
 
+	/*
+	 * Ensure that xlogreader.c can read the record; and check that we don't
+	 * accidentally overflow the size of the record.
+	 * */
+	if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+		elog(ERROR, "too much registered data for WAL record");
+
 	/*
 	 * Fill in the fields in the record header. Prev-link is filled in later,
 	 * once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index c1b1137aa7..950e1f22b0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
  * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
  * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
  * used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
  */
 typedef struct XLogRecord
 {
-- 
2.30.2



Re: role self-revocation

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 10:27 AM Stephen Frost  wrote:
> I agree that there would be a recorded relationship (that is, one that
> we write into the catalog and keep around until and unless it's removed
> by an admin) between creating and created roles and that's probably the
> default when CREATE ROLE is run but, unlike tables or such objects in
> the system, I don't agree that we should require this to exist at
> absolutely all times for every role (what would it be for the bootstrap
> superuser..?).  At least today, that's distinct from how ownership in
> the system works.  I also don't believe that this is necessarily an
> issue for Robert's use-case, as long as there are appropriate
> restrictions around who is allowed to remove or modify these
> relationships.

I agree.

> > I agree. [ but we need to get consensus ]
>
> Well ... [ how about we do it my way? ]

Repeating the same argument over again isn't necessarily going to help
anything here. I read your argument and I can believe there could be a
solution along those lines, although you haven't addressed my concern
about NOINHERIT. Tom is apparently less convinced, and you know, I
think that's OK. Not everybody has to agree with the way you want to
do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread David G. Johnston
On Fri, Mar 11, 2022 at 8:32 AM Stephen Frost  wrote:

>
> Such scripts as will break will still
> break in a pretty clear way with a clear answer as to how to fix them
> and I don't think there's some kind of data corruption or something that
> would happen.
>
>
I largely agree and am perfectly fine with going with the majority on this
point.  My vote would just fall on the conservative side.  But as so far no
one else seems to be overly concerned, nerfing CREATEROLE seems to be the
path forward.

David J.


Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

* David G. Johnston (david.g.johns...@gmail.com) wrote:
> On Thu, Mar 10, 2022 at 3:01 PM Robert Haas  wrote:
> 
> > On Thu, Mar 10, 2022 at 4:00 PM David G. Johnston
> >  wrote:
> > > I dislike changing the documented behavior of CREATEROLE to the degree
> > suggested here.  However, there are three choices here, only one of which
> > can be chosen:
> > >
> > > 1. Leave CREATEROLE alone entirely
> > > 2. Make it so CREATEROLE cannot assign membership to the predefined
> > roles or superuser (inheritance included), but leave the rest alone.  This
> > would be the hard-coded version, not the role attribute one.
> > > 3. Make it so CREATEROLE can only assign membership to roles for which
> > it has been made an admin; as well as the other things mentioned
> > >
> > > Moving forward I'd prefer options 1 or 2, leaving the ability to
> > create/alter/drop a role to be vested via predefined roles.
> >
> > It sounds like you prefer a behavior where CREATEROLE gives power over
> > all non-superusers, but that seems pretty limiting to me.
> 
> Doh!  I edited out the part where I made clear I considered options 1 and 2
> as basically being done for a limited period of time while deprecating the
> CREATEROLE attribute altogether in favor of the fine-grained and predefined
> role based permission granting.  I don't want to nerf CREATEROLE as part of
> adding this new feature, instead leave it as close to status quo as
> reasonable so as not to mess up existing setups that make use of it.  We
> can note in the release notes and documentation that we consider CREATEROLE
> to be deprecated and that the new predefined role should be used to give a
> user the ability to create/alter/drop roles, etc...  DBAs should consider
> revoking CREATEROLE from their users and granting them proper memberships
> in the predefined roles and the groups those roles should be administering.

I disagree entirely with the idea that we should push this out however
many years it'd take to get through some deprecation period.  We are
absolutely terrible when it comes to that and what we're talking about
here, at this point anyway, is making changes that get us closer to what
the spec says.  I agree that we can't back-patch these changes, but I
don't think we need a deprecation period.  If we were just getting rid
of CREATEROLE, I don't think we'd have a deprecation period.  If we need
to get rid of CREATEROLE and introduce something new that more-or-less
means the same thing, to make it so that people's scripts break in a
more obvious way, maybe we can consider that, but I don't really think
that's actually the case here.  Such scripts as will break will still
break in a pretty clear way with a clear answer as to how to fix them
and I don't think there's some kind of data corruption or something that
would happen.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: role self-revocation

2022-03-11 Thread David G. Johnston
On Fri, Mar 11, 2022 at 6:55 AM Robert Haas  wrote:

> On Thu, Mar 10, 2022 at 5:14 PM Tom Lane  wrote:
> > This seems reasonable in isolation, but
> >
> > (1) it implies a persistent relationship between creating and created
> > roles.  Whether you want to call that ownership or not, it sure walks
> > and quacks like ownership.
>


> I like my TENANT idea best, but I'm perfectly willing to call
> it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen
> seems to prefer if one of those ideas gains consensus.


If WITH ADMIN OPTION is sufficient to meet our immediate goals I do not see
the benefit of adding an ownership concept where there is not one today.
If added, I'd much rather have it be ownership as to fit in with the rest
of the existing system rather than introduce an entirely new term.


> If Alice creates non-superusers Bob and Charlie, and Charlie creates
> Doug, we need the persistent relationship to know that Charlie is
> allowed to drop Doug and Bob is not
>

The interesting question seems to be whether Alice can drop Doug, not
whether Bob can.

> It's more important
> at this point to get agreement on the principles.
>

What are the principles you want to get agreement on and how do they differ
from what we have in place today?  What are the proposed changes you would
make to enforce the new principles?  Which principles are now obsolete and
what do you want to do about the features that were built to enforce them
(including backward compatibility concerns)?

David J.


Re: role self-revocation

2022-03-11 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Mar 10, 2022 at 5:14 PM Tom Lane  wrote:
> > This seems reasonable in isolation, but
> >
> > (1) it implies a persistent relationship between creating and created
> > roles.  Whether you want to call that ownership or not, it sure walks
> > and quacks like ownership.

I agree that there would be a recorded relationship (that is, one that
we write into the catalog and keep around until and unless it's removed
by an admin) between creating and created roles and that's probably the
default when CREATE ROLE is run but, unlike tables or such objects in
the system, I don't agree that we should require this to exist at
absolutely all times for every role (what would it be for the bootstrap
superuser..?).  At least today, that's distinct from how ownership in
the system works.  I also don't believe that this is necessarily an
issue for Robert's use-case, as long as there are appropriate
restrictions around who is allowed to remove or modify these
relationships.

> I agree. It's been obvious to me from the beginning that we needed
> such a persistent relationship, and also that it needed to be a
> relationship from which the created role couldn't simply walk away.
> Yet, more than six months after the first discussions of this topic,
> we still don't have any kind of agreement on what that thing should be
> called. I like my TENANT idea best, but I'm perfectly willing to call
> it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen
> seems to prefer if one of those ideas gains consensus. But we've
> managed to waste all hope of making any significant progress here for
> an entire release cycle for lack of ability to agree on spelling. I
> think that's unfair to Mark, who put a lot of work into this area and
> got nothing out of it, and I think it sucks for users of PostgreSQL,
> too.

Well ... one of those actually already exists and also happens to be in
the SQL spec.  I don't necessarily agree that we should absolutely
require that the system always enforce that this relationship exist (I'd
like a superuser to be able to get rid of it and to be able to change it
too if they want) and that seems a bit saner than having the bootstrap
superuser be special in some way here as would seem to otherwise be
required.  I also feel that it would be generally useful to have more
than one of these relationships, if the user wishes, and that's
something that ownership doesn't (directly) support today.  Further,
that's supported and expected by the SQL spec too.  Even if we invented
some concept of ownership of roles, it seems like we should make most of
the other changes discussed here to bring us closer to what the spec
says about CREATE ROLE, DROP ROLE, GRANT, REVOKE, etc.  At that point
though, what's the point of having ownership?

> > (2) it seems exactly contradictory to your later point that
> >
> > > Agree. I also think that it would be a good idea to attribute grants
> > > performed by any superuser to the bootstrap superuser, or leave them
> > > unattributed somehow. Because otherwise dropping superusers becomes a
> > > pain in the tail for no good reason.
> >
> > Either there's a persistent relationship or there's not.  I don't
> > think it's sensible to treat superusers differently here.
> >
> > I think that this argument about the difficulty of dropping superusers
> > may in fact be the motivation for the existing behavior that object-
> > permissions GRANTs done by superusers are attributed to the object
> > owner; something you were unhappy about upthread.
> >
> > In the end these requirements seem mutually contradictory.  Either
> > we can have a persistent ownership relationship or not, but I don't
> > think we can have it apply in some cases and not others without
> > creating worse problems than we solve.  I'm inclined to toss overboard
> > the requirement that superusers need to be an easy thing to drop.
> > Why is that important, anyway?
> 
> Well, I think you're looking at it the wrong way. Compared to getting
> useful functionality, the relative ease of dropping users is
> completely unimportant. I'm happy to surrender it in exchange for
> something else. I just don't see why we should give it up for nothing.
> If Alice creates non-superusers Bob and Charlie, and Charlie creates
> Doug, we need the persistent relationship to know that Charlie is
> allowed to drop Doug and Bob is not. But if Charlie is a superuser
> anyway, then the persistent relationship is of no use. I don't see the
> point of cluttering up the system with such dependencies. Will I do it
> that way, if that's what it takes to get the patch accepted? Sure. But
> I can't imagine any end-user actually liking it.

We need to know that Charlie is allowed to drop Doug and Bob isn't but
that doesn't make it absolutely required that this be tracked
permanently or that Alice can't decide later to make it such that Doug
can't be dropped by Charlie for whatever reason 

Re: refactoring basebackup.c

2022-03-11 Thread Robert Haas
On Thu, Mar 10, 2022 at 8:02 PM Justin Pryzby  wrote:
> I'm getting errors from pg_basebackup when using both -D- and 
> --compress=server-*
> The issue seems to go away if I use --no-manifest.
>
> $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none 
> --compress=server-gzip >/dev/null ; echo $?
> pg_basebackup: error: tar member has empty name
> 1
>
> $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none 
> --compress=server-gzip >/dev/null ; echo $?
> NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL 
> segments are copied through other means to complete the backup
> pg_basebackup: error: COPY stream ended before last file was finished
> 1

Thanks for the report. The problem here is that, when the output is
standard output (-D -), pg_basebackup can only produce a single output
file, so the manifest gets injected into the tar file on the client
side rather than being written separately as we do in normal cases.
However, that only works if we're receiving a tar file that we can
parse from the server, and here the server is sending a compressed
tarfile. The current code mistakenly attempts to parse the compressed
tarfile as if it were an uncompressed tarfile, which causes the error
messages that you are seeing (and which I can also reproduce here). We
actually have enough infrastructure available in pg_basebackup now
that we could do the "right thing" in this case: decompress the data
received from the server, parse the resulting tar file, inject the
backup manifest, construct a new tar file, and recompress. However, I
think that's probably not a good idea, because it's unlikely that the
user will understand that the data is being compressed on the server,
then decompressed, and then recompressed again, and the performance of
the resulting pipeline will probably not be very good. So I think we
should just refuse this command. Patch for that attached.
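
For anyone hitting this: the combinations that should keep working
(illustrative command lines based on Justin's report above) are either
compressing on the client side, or skipping the manifest when using
server-side compression:

pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=client-gzip > base.tar.gz
pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=server-gzip --no-manifest > base.tar.gz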

-- 
Robert Haas
EDB: http://www.enterprisedb.com


reject-compressed-inject.patch
Description: Binary data


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas  wrote:
>
> On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma  
> wrote:
> > Looks better, but why do you want to pass elevel to the
> > load_relmap_file()? Are we calling this function from somewhere other
> > than read_relmap_file()? If not, do we have any plans to call this
> > function  directly bypassing read_relmap_file for any upcoming patch?
>
> If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
> that case, we only need to stop trying to create a database; we don't
> need to terminate the session. On the other hand if we can't read our
> own database's relmap files, that's an unrecoverable error, because we
> will not be able to run any queries at all, so FATAL is appropriate.
>

OK. I can see it being used in the v13 patch. In the previous patches
it was hard-coded with FATAL. Also, we simply error out when doing
file copy as I can see in the copy_file function. So yes FATAL is not
the right option to use when creating a database. Thanks.

--
With Regards,
Ashutosh Sharma.




Re: generic plans and "initial" pruning

2022-03-11 Thread Amit Langote
On Fri, Mar 11, 2022 at 11:35 PM Amit Langote  wrote:
> Attached is v5, now broken into 3 patches:
>
> 0001: Some refactoring of runtime pruning code
> 0002: Add a plan_tree_walker
> 0003: Teach AcquireExecutorLocks to skip locking pruned relations

Repeated the performance tests described in the 1st email of this thread:

HEAD: (copied from the 1st email)

32  tps = 20561.776403 (without initial connection time)
64  tps = 12553.131423 (without initial connection time)
128 tps = 13330.365696 (without initial connection time)
256 tps = 8605.723120 (without initial connection time)
512 tps = 4435.951139 (without initial connection time)
1024 tps = 2346.902973 (without initial connection time)
2048 tps = 1334.680971 (without initial connection time)

Patched v1: (copied from the 1st email)

32  tps = 27554.156077 (without initial connection time)
64  tps = 27531.161310 (without initial connection time)
128 tps = 27138.305677 (without initial connection time)
256 tps = 25825.467724 (without initial connection time)
512 tps = 19864.386305 (without initial connection time)
1024 tps = 18742.668944 (without initial connection time)
2048 tps = 16312.412704 (without initial connection time)

Patched v5:

32  tps = 28204.197738 (without initial connection time)
64  tps = 26795.385318 (without initial connection time)
128 tps = 26387.920550 (without initial connection time)
256 tps = 25601.141556 (without initial connection time)
512 tps = 19911.947502 (without initial connection time)
1024 tps = 20158.692952 (without initial connection time)
2048 tps = 16180.195463 (without initial connection time)

Good to see that these rewrites haven't really hurt the numbers much,
which makes sense because the rewrites have really been about putting
the code in the right place.

BTW, these are the numbers for the same benchmark repeated with
plan_cache_mode = auto, which causes a custom plan to be chosen for
every execution and so unaffected by this patch.

32  tps = 13359.225082 (without initial connection time)
64  tps = 15760.533280 (without initial connection time)
128 tps = 15825.734482 (without initial connection time)
256 tps = 15017.693905 (without initial connection time)
512 tps = 13479.973395 (without initial connection time)
1024 tps = 13200.444397 (without initial connection time)
2048 tps = 12884.645475 (without initial connection time)

Comparing them to numbers when using force_generic_plan shows that
making the generic plans faster is indeed worthwhile.
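
(For reference, the difference between the two sets of numbers is just the
plan_cache_mode setting used by the benchmark clients, i.e.:

SET plan_cache_mode = force_generic_plan;  -- the earlier sets of numbers
SET plan_cache_mode = auto;                -- the numbers immediately above

with the rest of the setup as described in the first email of the thread.)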

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma  wrote:
> Looks better, but why do you want to pass elevel to the
> load_relmap_file()? Are we calling this function from somewhere other
> than read_relmap_file()? If not, do we have any plans to call this
> function  directly bypassing read_relmap_file for any upcoming patch?

If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
that case, we only need to stop trying to create a database; we don't
need to terminate the session. On the other hand if we can't read our
own database's relmap files, that's an unrecoverable error, because we
will not be able to run any queries at all, so FATAL is appropriate.
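
To illustrate the idea, a rough sketch of the elevel plumbing (this is not
the actual patch; names are simplified):

static void
load_relmap_file_sketch(const char *mapfilename, int elevel)
{
	int			fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);

	if (fd < 0)
		ereport(elevel,		/* ERROR during CREATE DATABASE, FATAL for our own map */
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", mapfilename)));

	/* ... read and validate the map, reporting any failure at elevel ... */
}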

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
Some comments on pg_walinspect-docc.patch this time:

+   
+
+ pg_get_wal_record_info(in_lsn pg_lsn, lsn OUT pg_lsn,
prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length
OUT int4, total_length OUT int4, description OUT text, block_ref OUT
text, data OUT bytea, data_len OUT int4)
+

You may shorten this by mentioning just the function input parameters
and specify "returns record" like shown below. So no need to specify
all the OUT params.

pg_get_wal_record_info(in_lsn pg_lsn) returns record.

Please check the documentation for other functions for reference.

==

+
+ pg_get_wal_records_info(start_lsn pg_lsn, end_lsn
pg_lsn DEFAULT NULL, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid,
resource_manager OUT text, length OUT int4, total_length OUT int4,
description OUT text, block_ref OUT text, data OUT bytea, data_len OUT
int4)
+

Same comment applies here as well. In the return type you can just
mention - "returns setof record" like shown below:

pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof records.

You may also check for such optimizations at other places. I might
have missed some.

==

+
+postgres=# select prev_lsn, xid, resource_manager, length, total_length, block_ref from pg_get_wal_records_info('0/158A7F0', '0/1591400');
+ prev_lsn  | xid | resource_manager | length | total_length |                                block_ref
+-----------+-----+------------------+--------+--------------+-------------------------------------------------------------------------
+ 0/158A7B8 | 735 | Heap             |     54 |         7838 | blkref #0: rel 1663/5/2619 blk 18 (FPW); hole: offset: 88, length: 408
+ 0/158A7F0 | 735 | Btree            |     53 |         8133 | blkref #0: rel 1663/5/2696 blk 1 (FPW); hole: offset: 1632, length: 112
+ 0/158C6A8 | 735 | Heap             |     53 |          873 | blkref #0: rel 1663/5/1259 blk 0 (FPW); hole: offset: 212, length: 7372

Instead of specifying column names in the targetlist I think it's
better to use "*" so that it will display all the output columns. Also
you may shorten the gap between start and end lsn to reduce the output
size.
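
For example, something like this (the LSNs are just illustrative):

postgres=# select * from pg_get_wal_records_info('0/158A7F0', '0/158AD10');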

==

Any reason for not specifying author name in the .sgml file. Do you
want me to add my name to the author? :)

  
   Ashutosh Sharma ashu.coe...@gmail.com
  
 

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 10:15 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> >
> > On Wed, 2022-03-02 at 22:37 +0530, Bharath Rupireddy wrote:
> > >
> > > Attaching v6 patch set with above review comments addressed. Please
> > > review it further.
>
> Thanks Jeff for reviewing it. I've posted the latest v7 patch-set
> upthread [1] which is having more simple-yet-useful-and-effective
> functions.
>
> > * Don't issue WARNINGs or other messages for ordinary situations, like
> > when pg_get_wal_records_info() hits the end of WAL.
>
> v7 patch-set [1] has no warnings, but the functions will error out if
> future LSN is specified.
>
> > * It feels like the APIs that allow waiting for the end of WAL are
> > slightly off. Can't you just do pg_get_wal_records_info(start_lsn,
> > least(pg_current_wal_flush_lsn(), end_lsn)) if you want the non-waiting
> > behavior? Try to make the API more orthogonal, where a few basic
> > functions can be combined to give you everything you need, rather than
> > specifying extra parameters and issuing WARNINGs. I
>
> v7 patch-set [1] onwards waiting mode has been removed for all of the
> functions, again to keep things simple-yet-useful-and-effective.
> However, we can always add new pg_walinspect functions that wait for
> future WAL in the next versions once basic stuff gets committed and if
> many users ask for it.
>
> > * In the docs, include some example output. I don't see any output in
> > the tests, which makes sense because it's mostly non-deterministic, but
> > it would be helpful to see sample output of at least
> > pg_get_wal_records_info().
>
> +1. Added for pg_get_wal_records_info and pg_get_wal_stats.
>
> > * Is pg_get_wal_stats() even necessary, or can you get the same
> > information with a query over pg_get_wal_records_info()? For instance,
> > if you want to group by transaction ID rather than rmgr, then
> > pg_get_wal_stats() is useless.
>
> Yes, you are right pg_get_wal_stats provides WAL stats per resource
> manager which is similar to pg_waldump with --start, --end and --stats
> option. It provides more information than pg_get_wal_records_info and
> is a good way of getting stats than adding more columns to
> pg_get_wal_records_info, calculating percentage in sql and having
> group by clause. IMO, pg_get_wal_stats is more readable and useful.
>
> > * Would be nice to have a pg_wal_file_is_valid() or similar, which
> > would test that it exists, and the header matches the filename (e.g. if
> > it was recycled but not used, that would count as invalid). I think
> > pg_get_first_valid_wal_record_lsn() would

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Ashutosh Sharma
You may also need to add documentation to app-createdb.sgml. Currently
you have just added to create_database.sgml. Also, I had a quick look
at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch
and they seemed fine to me although I haven't put much emphasis on the
comments and other cosmetic stuff.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar  wrote:
>
> On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar  wrote:
> >
> > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas  wrote:
> > >
> > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar  wrote:
> > > > I have completely changed the logic for this refactoring.  Basically,
> > > > write_relmap_file(), is already having parameters to control whether
> > > > to write wal, send inval and we are already passing the dbpath.
> > > > Instead of making a new function I just pass one additional parameter
> > > > to this function itself about whether we are creating a new map or not
> > > > and I think with that changes are very less and this looks cleaner to
> > > > me.  Similarly for load_relmap_file() also I just had to pass the
> > > > dbpath and memory for destination map.  Please have a look and let me
> > > > know your thoughts.
> > >
> > > It's not terrible, but how about something like the attached instead?
> > > I think this has the effect of reducing the number of cases that the
> > > low-level code needs to know about from 2 to 1, instead of making it
> > > go up from 2 to 3.
> >
> > Yeah this looks cleaner, I will rebase the remaining patch.
>
> Here is the updated version of the patch set.
>
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: role self-revocation

2022-03-11 Thread Robert Haas
On Thu, Mar 10, 2022 at 5:14 PM Tom Lane  wrote:
> This seems reasonable in isolation, but
>
> (1) it implies a persistent relationship between creating and created
> roles.  Whether you want to call that ownership or not, it sure walks
> and quacks like ownership.

I agree. It's been obvious to me from the beginning that we needed
such a persistent relationship, and also that it needed to be a
relationship from which the created role couldn't simply walk away.
Yet, more than six months after the first discussions of this topic,
we still don't have any kind of agreement on what that thing should be
called. I like my TENANT idea best, but I'm perfectly willing to call
it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen
seems to prefer if one of those ideas gains consensus. But we've
managed to waste all hope of making any significant progress here for
an entire release cycle for lack of ability to agree on spelling. I
think that's unfair to Mark, who put a lot of work into this area and
got nothing out of it, and I think it sucks for users of PostgreSQL,
too.

> (2) it seems exactly contradictory to your later point that
>
> > Agree. I also think that it would be a good idea to attribute grants
> > performed by any superuser to the bootstrap superuser, or leave them
> > unattributed somehow. Because otherwise dropping superusers becomes a
> > pain in the tail for no good reason.
>
> Either there's a persistent relationship or there's not.  I don't
> think it's sensible to treat superusers differently here.
>
> I think that this argument about the difficulty of dropping superusers
> may in fact be the motivation for the existing behavior that object-
> permissions GRANTs done by superusers are attributed to the object
> owner; something you were unhappy about upthread.
>
> In the end these requirements seem mutually contradictory.  Either
> we can have a persistent ownership relationship or not, but I don't
> think we can have it apply in some cases and not others without
> creating worse problems than we solve.  I'm inclined to toss overboard
> the requirement that superusers need to be an easy thing to drop.
> Why is that important, anyway?

Well, I think you're looking at it the wrong way. Compared to getting
useful functionality, the relative ease of dropping users is
completely unimportant. I'm happy to surrender it in exchange for
something else. I just don't see why we should give it up for nothing.
If Alice creates non-superusers Bob and Charlie, and Charlie creates
Doug, we need the persistent relationship to know that Charlie is
allowed to drop Doug and Bob is not. But if Charlie is a superuser
anyway, then the persistent relationship is of no use. I don't see the
point of cluttering up the system with such dependencies. Will I do it
that way, if that's what it takes to get the patch accepted? Sure. But
I can't imagine any end-user actually liking it.

> I'm a bit disturbed that parts of this discussion seem to be getting
> conducted with little understanding of the system's existing behaviors.
> We should not be reinventing things we already have perfectly good
> solutions for in the object-privileges domain.

I did wonder whether that might be the existing behavior, but stopping
to check right at that moment didn't seem that important to me. Maybe
I should have taken the time, but it's not like we're writing the
final patch for commit next Tuesday at this point. It's more important
at this point to get agreement on the principles. That said, I do
agree that there have been times when we haven't thought hard enough
about the existing behavior in proposing new behavior. On the third
hand, though, part of the problem here is that neither Stephen nor I
are entirely happy with the existing behavior, if for somewhat
different reasons. It really isn't "perfectly good." On the one hand,
from a purely technical standpoint, a lot of the behavior around roles
in particular seems well below the standard that anyone would consider
committable today. On the other hand, even the parts of the code that
are in reasonable shape from a code quality point of view don't
actually do the things that we think users want done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences

2022-03-11 Thread Tomas Vondra



On 3/11/22 12:34, Amit Kapila wrote:
> On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
>  wrote:
>>
>> On 3/7/22 22:25, Tomas Vondra wrote:

 Interesting. I can think of one reason that might cause this - we log
 the first sequence increment after a checkpoint. So if a checkpoint
 happens in an unfortunate place, there'll be an extra WAL record. On
 slow / busy machines that's quite possible, I guess.

>>>
>>> I've tweaked the checkpoint_interval to make checkpoints more aggressive
>>> (set it to 1s), and it seems my hunch was correct - it produces failures
>>> exactly like this one. The best fix probably is to just disable decoding
>>> of sequences in those tests that are not aimed at testing sequence decoding.
>>>
>>
>> I've pushed a fix for this, adding "include-sequences=0" to a couple
>> test_decoding tests, which were failing with concurrent checkpoints.
>>
>> Unfortunately, I realized we have a similar issue in the "sequences"
>> tests too :-( Imagine you do a series of sequence increments, e.g.
>>
>>   SELECT nextval('s') FROM generate_series(1,100);
>>
>> If there's a concurrent checkpoint, this may add an extra WAL record,
>> affecting the decoded output (and also the data stored in the sequence
>> relation itself). Not sure what to do about this ...
>>
> 
> I am also not sure what to do for it but maybe if in some way we can
> increase checkpoint timeout or other parameters for these tests then
> it would reduce the chances of such failures. The other idea could be
> to perform checkpoint before the start of tests to reduce the
> possibility of another checkpoint.
> 

Yeah, I had the same ideas, but I'm not sure I like any of them. I doubt
we want to make checkpoints extremely rare, and even if we do that it'll
still fail on slow machines (e.g. with valgrind, clobber cache etc.).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Column Filtering in Logical Replication

2022-03-11 Thread Tomas Vondra



On 3/11/22 10:52, Amit Kapila wrote:
> On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra
>  wrote:
>>
>> On 3/10/22 20:10, Tomas Vondra wrote:
>>>
>>>
>>> FWIW I think the reason is pretty simple - pgoutput_row_filter_init is
>>> broken. It assumes you can just do this
>>>
>>> rftuple = SearchSysCache2(PUBLICATIONRELMAP,
>>>   ObjectIdGetDatum(entry->publish_as_relid),
>>>   ObjectIdGetDatum(pub->oid));
>>>
>>> if (HeapTupleIsValid(rftuple))
>>> {
>>> /* Null indicates no filter. */
>>> rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
>>>   Anum_pg_publication_rel_prqual,
>>>   &pub_no_filter);
>>> }
>>> else
>>> {
>>> pub_no_filter = true;
>>> }
>>>
>>>
>>> and pub_no_filter=true means there's no filter at all. Which is
>>> nonsense, because we're using publish_as_relid here - the publication
>>> may not include this particular ancestor, in which case we need to just
>>> ignore this publication.
>>>
>>> So yeah, this needs to be reworked.
>>>
>>
>> I spent a bit of time looking at this, and I think a minor change in
>> get_rel_sync_entry() fixes this - it's enough to ensure rel_publications
>> only includes publications that actually include publish_as_relid.
>>
> 
> Thanks for looking into this. I think in the first patch before
> calling get_partition_ancestors() we need to ensure it is a partition
> (the call expects that) and pubviaroot is true.

Does the call really require that? Also, I'm not sure why we'd need to
look at pubviaroot - that's already considered earlier when calculating
publish_as_relid, here we just need to know the relationship of the two
OIDs (if one is ancestor/child of the other).

> I think it would be
> good if we can avoid an additional call to get_partition_ancestors()
> as it could be costly.

Maybe. OTOH we only should do this only very rarely anyway.

> I wonder why it is not sufficient to ensure
> that publish_as_relid exists after ancestor in ancestors list before
> assigning the ancestor to publish_as_relid? This only needs to be done
> in case of (if (!publish)). I have not tried this so I could be wrong.
> 

I'm not sure what exactly are you proposing. Maybe try coding it? That's
probably faster than trying to describe what the code might do ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Column Filtering in Logical Replication

2022-03-11 Thread Tomas Vondra



On 3/11/22 08:05, wangw.f...@fujitsu.com wrote:
> On Fri, Mar 11, 2022 at 9:57 AM Tomas Vondra  
> wrote:
>>
> Hi Tomas,
> Thanks for your patches.
> 
> On Mon, Mar 9, 2022 at 9:53 PM Tomas Vondra  
> wrote:
>> On Wed, Mar 9, 2022 at 6:04 PM Amit Kapila  wrote:
>>> On Mon, Mar 7, 2022 at 11:18 PM Tomas Vondra 
>>>  wrote:
>>>> On Fri, Mar 4, 2022 at 6:43 PM Amit Kapila  wrote:
>>>>> Fetching column filter info in tablesync.c is quite expensive. It
>>>>> seems to be using four round-trips to get the complete info whereas
>>>>> for row-filter we use just one round trip. I think we should try to
>>>>> get both row filter and column filter info in just one round trip.
>>>>>
>>>>
>>>> Maybe, but I really don't think this is an issue.
>>>>
>>>
>>> I am not sure but it might matter for small tables. Leaving aside the
>>> performance issue, I think the current way will get the wrong column
>>> list in many cases: (a) The ALL TABLES IN SCHEMA case handling won't
>>> work for partitioned tables when the partitioned table is part of one
>>> schema and partition table is part of another schema. (b) The handling
>>> of partition tables in other cases will fetch incorrect lists as it
>>> tries to fetch the column list of all the partitions in the hierarchy.
>>>
>>> One of my colleagues has even tested these cases both for column
>>> filters and row filters and we find the behavior of row filter is okay
>>> whereas for column filter it uses the wrong column list. We will share
>>> the tests and results with you in a later email. We are trying to
>>> unify the column filter queries with row filter to make their behavior
>>> the same and will share the findings once it is done. I hope if we are
>>> able to achieve this that we will reduce the chances of bugs in this
>>> area.
>>>
>>
>> OK, I'll take a look at that email.
> I tried to get both the column filters and the row filters with one SQL, but
> it failed because I think the result is not easy to parse.
> 
> I noted that we use two SQLs to get column filters in the latest
> patches(20220311). I think maybe we could use one SQL to get column filters to
> reduce network cost. Like the SQL in the attachment.
> 

I'll take a look. But as I said before - I very much prefer SQL that is
easy to understand, and I don't think the one extra round trip is an
issue during tablesync (which is a very rare action).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Bharath Rupireddy
On Fri, Mar 11, 2022 at 8:08 AM Kyotaro Horiguchi
 wrote:
>
> > Attaching the v8 patch-set resolving above comments and some tests for
> > checking function permissions. Please review it further.
>
> I played with this a bit, and would like to share some thoughts on it.

Thanks a lot Kyotaro-san for reviewing.

> It seems to me too rigorous that pg_get_wal_records_info/stats()
> reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> the real end-of-WAL is more kind to users.  I think the same with the
> restriction that start and end LSN are required to be different.

Throwing error on future LSNs is the same behaviour for all of the
pg_walinspect function input LSNs. IMO it is a cleaner thing to do
rather than confuse the users with different behaviours for each
function. The principle is this - pg_walinspect functions can't show
future WAL info. Having said that, I agree to make it a WARNING
instead of ERROR, for the simple reason that an ERROR aborts the txn,
whereas with a WARNING the applications can retry without the txn
getting aborted. For instance,
pg_terminate_backend emits a WARNING if the PID isn't a postgres
process id.

PS: WARNING may not be a better idea than ERROR if we turn
pg_get_wal_stats a SQL function, see my response below.

> The definition of end-lsn is fuzzy here.  If I fed a future LSN to the
> functions, they tell me the beginning of the current insertion point
> in error message.  On the other hand they don't accept the same
> value as end-LSN.  I think it is right that they tell the current
> insertion point and they should take the end-LSN as the LSN to stop
> reading.

The future LSN is determined by this:

if (!RecoveryInProgress())
    available_lsn = GetFlushRecPtr(NULL);
else
    available_lsn = GetXLogReplayRecPtr(NULL);

GetFlushRecPtr returns the last flushed byte + 1, meaning this is the end
LSN currently known to the server, but it is not the start LSN of the
last WAL record in the server. The same goes for GetXLogReplayRecPtr,
which gives the end+1 position of lastReplayedEndRecPtr. I picked
GetFlushRecPtr and GetXLogReplayRecPtr to determine the future WAL LSN
because this is how read_local_xlog_page determines how far to read WAL
and when to go into wait mode, but I wanted to avoid the wait mode
completely for all the pg_walinspect functions (to keep things simple
for now), hence doing similar checks within the input validation code
and emitting a warning.

And you are right: when we emit something like below, users tend to use
0/15B6D68 (from the DETAIL message) as the end LSN. I don't want to
drop this DETAIL message altogether as it gives an idea of where the
server is. How about rephrasing the DETAIL message a bit, something
like "Database system flushed the WAL up to WAL LSN %X/%X." or some
other better phrasing?

WARNING:  WAL start LSN cannot be a future WAL LSN
DETAIL:  Last known WAL LSN on the database system is 0/15B6D68.

If the users aren't sure about what's the end record LSN, they can
just use pg_get_wal_records_info and pg_get_wal_stats without end LSN:
select * from pg_get_wal_records_info('0/15B6D68');
select * from pg_get_wal_stats('0/15B6D68');

> I think pg_get_wal_stats() is worth to have but I think it should be
> implemented in SQL.  Currently pg_get_wal_records_info() doesn't tell
> about FPI since pg_waldump doesn't but it is internally collected (of
> course!) and easily revealed. If we do that, the
> pg_get_wal_records_stats() would be reduced to the following SQL
> statement
>
> SELECT resource_manager resmgr,
>        count(*) AS N,
>        (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
>        sum(total_length) AS "combined size",
>        (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
>        sum(fpi_len) AS fpilen,
>        (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
> FROM pg_get_wal_records_info('0/100', '0/175DD7f')
> GROUP by resource_manager
> WINDOW tot AS ()
> ORDER BY "combined size" desc;
>
> The only difference with pg_waldump is the statement above doesn't
> show lines for the resource managers that don't contained in the
> result of pg_get_wal_records_info(). But I don't think that matters.

Yeah, this is better. One problem with the above is when
pg_get_wal_records_info emits a warning for a future LSN. But this
shouldn't stop us from doing it via SQL. Instead I would let all the
pg_walinspect functions emit errors as opposed to WARNINGs. Thoughts?

postgres=# SELECT resource_manager, count(*) AS count,
(count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS count_percentage,
sum(total_length) AS combined_size,
(sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS combined_size_percentage
FROM pg_get_wal_records_info('0/10A3E50', '0/25B6F00')
GROUP BY resource_manager
WINDOW tot AS ()
ORDER BY combined_size desc;
WARNING:  WAL end LSN cannot be a future WAL LSN
DETAIL:  L

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-11 Thread Julien Rouhaud
On Fri, Mar 11, 2022 at 04:59:11PM +0530, Nitin Jadhav wrote:
> > That "throttled" flag should be the same as having or not a "force" in the
> > flags.  We should be consistent and report information the same way, so 
> > either
> > a lot of flags (is_throttled, is_force...) or as now a single field 
> > containing
> > the set flags, so the current approach seems better.  Also, it wouldn't be 
> > much
> > better to show the checkpoint as not having the force flags and still not 
> > being
> > throttled.
> 
> I think your understanding is wrong here. The flag which affects
> throttling behaviour is CHECKPOINT_IMMEDIATE.

Yes sorry, that's what I meant and later used in the flags.

> I am not suggesting
> removing the existing 'flags' field of pg_stat_progress_checkpoint
> view and adding a new field 'throttled'. The content of the 'flags'
> field remains the same. I was suggesting replacing the 'next_flags'
> field with 'throttled' field since the new request with
> CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

Are you saying that this new throttled flag will only be set by the overloaded
flags in ckpt_flags?  So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
flag that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
that's not throttled?

> > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be 
> > used
> > to detect that someone wants a new checkpoint afterwards, whatever it's and
> > whether or not the current checkpoint to be finished quickly.  For this 
> > flag I
> > think it's better to not report it in the view flags but with a new field, 
> > as
> > discussed before, as it's really what it means.
> 
> I understand your suggestion of adding a new field to indicate whether
any of the new requests have been made or not. Do you want this
field to represent only a new request, or does it also indicate that the
current checkpoint should finish quickly?

Only represent what it means: a new checkpoint is requested.  An additional
CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.

> > CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> > progress checkpoint, so it can be simply added to the view flags.
> 
As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same through the checkpoint. We cannot add a new
checkpoint's flag (CHECKPOINT_IMMEDIATE) to the current one even
though it affects the current checkpoint's behaviour. The only thing we
can do is to add a new field to show that the current checkpoint is
affected by new requests.

I don't get it.  The checkpoint flags and the view flags (set by
pgstat_progrss_update*) are different, so why can't we add this flag to the
view flags?  The fact that checkpointer.c doesn't update the passed flag and
instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in.  If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.

> > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> > view?
> 
> Where do you want to add this in the path?

Same as in your current patch I guess.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi
 wrote:
>
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of argument is the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns..  It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost the same
> since the latter can show a single record.  However it is a bit
> annoying to do that: besides not accepting the same LSN for start and
> end, it doesn't show a record when there's no record in the specified
> LSN range.  But I don't think that behavior is without usefulness.
>

So, do you want the pg_get_wal_record_info function to be removed as
we can use pg_get_wal_records_info() to achieve what it does?

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi
 wrote:
>
> Sorry, some minor non-syntactical corrections.
>
> At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I played with this a bit, and would like to share some thoughts on it.
> >
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> > the real end-of-WAL is more kind to users.  I think the same with the
> > restriction that start and end LSN are required to be different.
> >
> > The definition of end-lsn is fuzzy here.  If I fed a future LSN to the
> > functions, they tell me the beginning of the current insertion point
> > in error message.  On the other hand they don't accept the same
> > value as end-LSN.  I think it is right that they tell the current
> > insertion point and they should take the end-LSN as the LSN to stop
> > reading.
> >
> > I think pg_get_wal_stats() is worth to have but I think it should be
> > implemented in SQL.  Currently pg_get_wal_records_info() doesn't tell
> > about FPI since pg_waldump doesn't but it is internally collected (of
> > course!) and easily revealed. If we do that, the
> > pg_get_wal_records_stats() would be reduced to the following SQL
> > statement
> >
> > SELECT resource_manager resmgr,
> >        count(*) AS N,
> >        (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
> >        sum(total_length) AS "combined size",
> >        (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
> >        sum(fpi_len) AS fpilen,
> >        (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
> > FROM pg_get_wal_records_info('0/100', '0/175DD7f')
> > GROUP by resource_manager
> > WINDOW tot AS ()
> > ORDER BY "combined size" desc;
> >
> > The only difference with pg_waldump is the statement above doesn't
> > show lines for the resource managers that don't contained in the
> > result of pg_get_wal_records_info(). But I don't think that matters.
> >
> >
> > Sometimes the field description has very long (28kb long) content. It
> > makes the result output almost unreadable and I had a bit hard time
> > struggling with the output full of '-'s.  I would like have a default
> > limit on the length of such fields that can be long but I'm not sure
> > we want that.
> >
> >
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of argument is the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns..  It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost the same
> since the latter can show a single record.  However it is a bit
> annoying to do that: besides not accepting the same LSN for start and
> end, it doesn't show a record when there's no record in the specified
> LSN range.  But I don't think that behavior is without usefulness.
>
> The following works,
>  pg_get_wal_record_info('0/100');

This does work but it doesn't show any WARNING message for the start
pointer adjustment. I think it should.

>  pg_get_wal_records_info('0/100');
>

I think this is fine. It should be working because the user hasn't
specified the end pointer so we assume the default end pointer is
end-of-WAL.

> but this doesn't
>  pg_get_wal_records_info('0/100', '0/100');
> > ERROR:  WAL start LSN must be less than end LSN
>

I think this behaviour is fine. We cannot have the same start and end
lsn pointers.

> And the following shows no records.
>  pg_get_wal_records_info('0/100', '0/101');
>  pg_get_wal_records_info('0/100', '0/128');
>

I think we should be erroring out here saying - couldn't find any
valid WAL record between given start and end lsn because there exists
no valid wal records between the specified start and end lsn pointers.

--
With Regards,
Ashutosh Sharma.




Re: logical decoding and replication of sequences

2022-03-11 Thread Amit Kapila
On Fri, Mar 11, 2022 at 5:04 PM Amit Kapila  wrote:
>
> On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
>  wrote:
> >
> > On 3/7/22 22:25, Tomas Vondra wrote:
> > >>
> > >> Interesting. I can think of one reason that might cause this - we log
> > >> the first sequence increment after a checkpoint. So if a checkpoint
> > >> happens in an unfortunate place, there'll be an extra WAL record. On
> > >> slow / busy machines that's quite possible, I guess.
> > >>
> > >
> > > I've tweaked the checkpoint_interval to make checkpoints more aggressive
> > > (set it to 1s), and it seems my hunch was correct - it produces failures
> > > exactly like this one. The best fix probably is to just disable decoding
> > > of sequences in those tests that are not aimed at testing sequence 
> > > decoding.
> > >
> >
> > I've pushed a fix for this, adding "include-sequences=0" to a couple
> > test_decoding tests, which were failing with concurrent checkpoints.
> >
> > Unfortunately, I realized we have a similar issue in the "sequences"
> > tests too :-( Imagine you do a series of sequence increments, e.g.
> >
> >   SELECT nextval('s') FROM generate_series(1,100);
> >
> > If there's a concurrent checkpoint, this may add an extra WAL record,
> > affecting the decoded output (and also the data stored in the sequence
> > relation itself). Not sure what to do about this ...
> >
>
> I am also not sure what to do for it but maybe if in some way we can
> increase checkpoint timeout or other parameters for these tests then
> it would reduce the chances of such failures. The other idea could be
> to perform checkpoint before the start of tests to reduce the
> possibility of another checkpoint.
>

One more thing I noticed while checking the commit for this feature is
that the include below seems to be out of order:
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -42,6 +42,7 @@
 #include "replication/reorderbuffer.h"
 #include "replication/snapbuild.h"
 #include "storage/standby.h"
+#include "commands/sequence.h"
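
For illustration, keeping the list alphabetical would mean moving the new
include up, roughly like the sketch below (the neighbouring headers shown
here are only illustrative; the actual neighbours in decode.c may differ):

#include "access/xlogreader.h"
#include "catalog/pg_control.h"
#include "commands/sequence.h"		/* moved up to keep the list sorted */
#include "replication/decode.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
#include "storage/standby.h"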

-- 
With Regards,
Amit Kapila.




RE: Skipping logical replication transactions on subscriber side

2022-03-11 Thread osumi.takami...@fujitsu.com
On Friday, March 11, 2022 5:20 PM Masahiko Sawada  wrote:
> I've attached an updated version patch. This patch can be applied on top of 
> the
> latest disable_on_error patch[1].
Hi, thank you for the patch. I'll share my review comments on v13.


(a) src/backend/commands/subscriptioncmds.c

@@ -84,6 +86,8 @@ typedef struct SubOpts
 	bool		streaming;
 	bool		twophase;
 	bool		disableonerr;
+	XLogRecPtr	lsn;			/* InvalidXLogRecPtr for resetting purpose,
+								 * otherwise a valid LSN */


I think this explanation is slightly odd and can be improved.
Strictly speaking, from the functional perspective, a *valid* LSN is what
serves the transaction-skipping purpose. Also, the wording "resetting purpose"
is unclear by itself. I'd suggest the change below.

From:
InvalidXLogRecPtr for resetting purpose, otherwise a valid LSN
To:
A valid LSN when we skip transaction, otherwise InvalidXLogRecPtr

(b) The code position of additional append in describeSubscriptions


+
+	/* Skip LSN is only supported in v15 and higher */
+	if (pset.sversion >= 15)
+		appendPQExpBuffer(&buf,
+						  ", subskiplsn AS \"%s\"\n",
+						  gettext_noop("Skip LSN"));

I suggest placing this code right after the subdisableonerr column.

(c) parse_subscription_options


+	/* Parse the argument as LSN */
+	lsn = DatumGetTransactionId(DirectFunctionCall1(pg_lsn_in,


Here, shouldn't we call DatumGetLSN instead of DatumGetTransactionId?


(d) parse_subscription_options

+	if (strcmp(lsn_str, "none") == 0)
+	{
+		/* Setting lsn = NONE is treated as resetting LSN */
+		lsn = InvalidXLogRecPtr;
+	}
+

We should remove this pair of curly braces, since they enclose just a single
statement.
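
Putting (c) and (d) together, the block might end up looking roughly like the
sketch below (just an illustration of the two suggestions, not the final
wording of the patch):

	if (strcmp(lsn_str, "none") == 0)
		/* Setting lsn = NONE is treated as resetting LSN */
		lsn = InvalidXLogRecPtr;
	else
	{
		/* Parse the argument as LSN */
		lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
											  CStringGetDatum(lsn_str)));

		if (XLogRecPtrIsInvalid(lsn))
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("invalid WAL location (LSN): %s", lsn_str)));
	}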


(e) src/backend/replication/logical/worker.c

+ * to skip applying the changes when starting to apply changes.  The subskiplsn is
+ * cleared after successfully skipping the transaction or applying non-empty
+ * transaction, where the later avoids the mistakenly specified subskiplsn from
+ * being left.

typo "the later" -> "the latter"

At the same time, I feel the last part of this sentence can be an independent 
sentence.
From:
, where the later avoids the mistakenly specified subskiplsn from being left
To:
. The latter prevents the mistakenly specified subskiplsn from being left


* Note that my comments below apply only if we choose not to merge the
disable_on_error test with the skip lsn tests.

(f) src/test/subscription/t/030_skip_xact.pl

+use Test::More tests => 4;

It's better to use the new style for the TAP test, i.e. drop the test count
and call done_testing() at the end of the test instead.

(g) src/test/subscription/t/030_skip_xact.pl

I think there's no need to create two types of subscriptions.
Just one subscription with two_phase = on and streaming = on
would be sufficient for the tests (normal commit, commit prepared,
and stream commit cases), as sketched below. This would also reduce
the number of tables and publications, which will make the whole
test simpler.
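
For example, something along these lines (the subscription and publication
names and the connection string are purely illustrative):

  CREATE SUBSCRIPTION tap_sub
      CONNECTION 'host=localhost dbname=postgres'
      PUBLICATION tap_pub
      WITH (two_phase = on, streaming = on);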


Best Regards,
Takamichi Osumi



Re: logical decoding and replication of sequences

2022-03-11 Thread Amit Kapila
On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
 wrote:
>
> On 3/7/22 22:25, Tomas Vondra wrote:
> >>
> >> Interesting. I can think of one reason that might cause this - we log
> >> the first sequence increment after a checkpoint. So if a checkpoint
> >> happens in an unfortunate place, there'll be an extra WAL record. On
> >> slow / busy machines that's quite possible, I guess.
> >>
> >
> > I've tweaked the checkpoint_interval to make checkpoints more aggressive
> > (set it to 1s), and it seems my hunch was correct - it produces failures
> > exactly like this one. The best fix probably is to just disable decoding
> > of sequences in those tests that are not aimed at testing sequence decoding.
> >
>
> I've pushed a fix for this, adding "include-sequences=0" to a couple
> test_decoding tests, which were failing with concurrent checkpoints.
>
> Unfortunately, I realized we have a similar issue in the "sequences"
> tests too :-( Imagine you do a series of sequence increments, e.g.
>
>   SELECT nextval('s') FROM generate_series(1,100);
>
> If there's a concurrent checkpoint, this may add an extra WAL record,
> affecting the decoded output (and also the data stored in the sequence
> relation itself). Not sure what to do about this ...
>

I am also not sure what to do about it, but maybe if we can somehow
increase the checkpoint timeout or other parameters for these tests,
it would reduce the chances of such failures. Another idea could be
to perform a checkpoint before the start of the tests to reduce the
possibility of another checkpoint occurring during them.

-- 
With Regards,
Amit Kapila.




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-11 Thread Nitin Jadhav
> >
> > Ok. I agree that it is difficult to interpret it correctly. So even if
> > say that a new checkpoint has been explicitly requested, the user may
> > not understand that it affects current checkpoint behaviour unless the
> > user knows the internals of the checkpoint. How about naming the field
> > to 'throttled' (Yes/ No) since our objective is to show that the
> > current checkpoint is throttled or not.
>
> -1
>
> That "throttled" flag should be the same as having or not a "force" in the
> flags.  We should be consistent and report information the same way, so either
> a lot of flags (is_throttled, is_force...) or as now a single field containing
> the set flags, so the current approach seems better.  Also, it wouldn't be 
> much
> better to show the checkpoint as not having the force flags and still not 
> being
> throttled.

I think your understanding is wrong here. The flag which affects
throttling behaviour is CHECKPOINT_IMMEDIATE. I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

> CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be 
> used
> to detect that someone wants a new checkpoint afterwards, whatever it's and
> whether or not the current checkpoint to be finished quickly.  For this flag I
> think it's better to not report it in the view flags but with a new field, as
> discussed before, as it's really what it means.

I understand your suggestion of adding a new field to indicate whether
any new request has been made. Do you want this field to represent only
a new request, or should it also indicate that the current checkpoint
has to finish quickly?

> CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> progress checkpoint, so it can be simply added to the view flags.

As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same throughout the checkpoint. We cannot add a new
request's flag (CHECKPOINT_IMMEDIATE) to the current one even though it
affects the current checkpoint's behaviour. The only thing we can do is
to add a new field to show that the current checkpoint is affected by
new requests.

> Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> view?

Where do you want to add this in the path?

I feel the name of the new field is the confusing part here.
'next_flags' - It shows all the flag values of the next checkpoint.
Based on this, the user can tell that a new request has been made, and
if CHECKPOINT_IMMEDIATE is enabled here, it also indicates that the
current checkpoint gets affected. You are not OK with using this name,
as it confuses the user.
'throttled' - The value would be set to Yes/No based on whether the
CHECKPOINT_IMMEDIATE bit is set in the new checkpoint request's flags.
This says that the current checkpoint is affected, and I also thought
it would be an indication that a new request has been made. But there is
a confusion here too: if the current checkpoint starts with
CHECKPOINT_IMMEDIATE (which is described by the 'flags' field) and there
is no new request, then the value of this field is 'Yes' (not
throttling), which again confuses the user.
'new request' - The value would be set to Yes/No depending on whether
any new checkpoint request has been made. This just indicates whether a
new request has been made or not; it cannot be used to infer anything
else.

Thoughts?

Thanks & Regards,
Nitin Jadhav

On Fri, Mar 11, 2022 at 3:34 PM Julien Rouhaud  wrote:
>
> On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote:
> >
> > Ok. I agree that it is difficult to interpret it correctly. So even if
> > say that a new checkpoint has been explicitly requested, the user may
> > not understand that it affects current checkpoint behaviour unless the
> > user knows the internals of the checkpoint. How about naming the field
> > to 'throttled' (Yes/ No) since our objective is to show that the
> > current checkpoint is throttled or not.
>
> -1
>
> That "throttled" flag should be the same as having or not a "force" in the
> flags.  We should be consistent and report information the same way, so either
> a lot of flags (is_throttled, is_force...) or as now a single field containing
> the set flags, so the current approach seems better.  Also, it wouldn't be 
> much
> better to show the checkpoint as not having the force flags and still not 
> being
> throttled.
>
> Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> view?
>
> CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be 
> used
> to detect that someone wants a new checkpoint afterwards, whatever it

Re: On login trigger: take three

2022-03-11 Thread Daniel Gustafsson
> On 15 Feb 2022, at 11:07, Greg Nancarrow  wrote:

> I've attached a re-based version (no functional changes from the
> previous) to fix cfbot failures.

Thanks for adopting this patch, I took another look at it today and have some
comments.

+This flag is used internally by PostgreSQL and should not be manually changed by DBA or application.
I think we should reword this to something like "This flag is used internally
by PostgreSQL and should not be altered or relied
upon for monitoring".  We really don't want anyone touching or interpreting
this value since there is no guarantee that it will be accurate.


+	new_record[Anum_pg_database_dathasloginevt - 1] = BoolGetDatum(datform->dathasloginevt);
Since the corresponding new_record_repl value is false, this won't do anything
and can be removed.  We will use the old value anyway.


+   if (strcmp(eventname, "login") == 0)
I think this block warrants a comment on why it only applies to login triggers.


+   db = (Form_pg_database) GETSTRUCT(tuple);
+   if (!db->dathasloginevt)
+   {
+   db->dathasloginevt = true;
+   CatalogTupleUpdate(pg_db, &tuple->t_self, tuple);
+   }
+   else
+   CacheInvalidateRelcacheByTuple(tuple);
This needs a CommandCounterIncrement inside the if () { .. } block, right?


+   /* Get the command tag. */
+   tag = (event == EVT_Login) ? CMDTAG_LOGIN : CreateCommandTag(parsetree);
To match project style I think this should be an if-then-else block.  Also,
while it's tempting to move this before the assertion block in order to reuse
it there, it does mean that we are calling this in a hot codepath before we
know if it's required.  I think we should restore the previous programming with
a separate CreateCommandTag call inside the assertion block and move this one
back underneath the fast-path exit.


+   CacheInvalidateRelcacheByTuple(tuple);
+   }
+   table_close(pg_db, RowExclusiveLock);
+   heap_freetuple(tuple);
+   }
+   }
+   CommitTransactionCommand();
Since we are committing the transaction just after closing the table, is the
relcache invalidation going to achieve much?  I guess registering the event
doesn't hurt much?


+   /*
+* There can be a race condition: a login event trigger may
+* have been added after the pg_event_trigger table was
+* scanned, and we don't want to erroneously clear the
+* dathasloginevt flag in this case. To be sure that this
+* hasn't happened, repeat the scan under the pg_database
+* table lock.
+*/
+   AcceptInvalidationMessages();
+   runlist = EventTriggerCommonSetup(NULL,
+ EVT_Login, "login",
+ &trigdata);
It seems conceptually wrong to allocate this in the TopTransactionContext when
we only generate the runlist to throw it away.  At the very least I think we
should free the returned list.


+   if (runlist == NULL)/* list is still empty, so clear the
This should check for NIL and not NULL.


Have you done any benchmarking on this patch?  With this version I see a small
slowdown on connection time of ~1.5% using pgbench for the case where there are
no login event triggers defined.  With a login event trigger calling an empty
plpgsql function there is a ~30% slowdown (corresponding to ~1ms on my system).
When there is a login event trigger defined all bets are off however, since the
function called can be arbitrarily complex and it's up to the user to measure
and decide - but the bare overhead we impose is still of interest of course.

--
Daniel Gustafsson   https://vmware.com/





RE: Handle infinite recursion in logical replication setup

2022-03-11 Thread kuroda.hay...@fujitsu.com
Hi Vignesh,

While considering the second problem, I got quite confused about it.
I'd be happy if you could answer my question.

> To handle this if user has specified only_local option, we could throw
> a warning or error out while creating subscription in this case, we
> could have a column srreplicateddata in pg_subscription_rel which
> could indicate if the table has any replicated data or not:
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srreplicateddata
> -+-++---+--
>16389 |   16384 | r  | 0/14A4640 |t
>16389 |   16385 | r  | 0/14A4690 |f
> (1 row)
> In the above example, srreplicateddata with true indicates, tabel t1
> whose relid is 16384 has replicated data and the other row having
> srreplicateddata  as false indicates table t2 whose relid is 16385
> does not have replicated data.
> When creating a new subscription, the subscriber will connect to the
> publisher and check if the relation has replicated data by checking
> srreplicateddata in pg_subscription_rel table.
> If the table has any replicated data, log a warning or error for this.

IIUC srreplicateddata represents whether the subscribed data was generated
not by the publisher but by another node.
My first impression was that the name 'srreplicateddata' is not friendly,
because all subscribed data is replicated from the publisher.
Also, I was not sure how the value of the column gets set.
IIUC the filtering by replication origins is done on the publisher node,
and the subscriber node cannot know whether some data was actually
filtered out or not.
If we distinguish this by the subscriber option publish_local_only, your
example cannot be reproduced, because there the same subscription has
different 'srreplicateddata' values.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Fix typo in progress reporting doc

2022-03-11 Thread Nitin Jadhav
Hi,

I have observed that the table naming conventions used in
'progress-reporting.html' are not consistent across different
sections. For some cases "Phases" (Table 28.37. CREATE INDEX Phases)
is used and for some cases "phases" (Table 28.35. ANALYZE phases) is
used. I have attached a patch to correct this.

Thanks and Regards,
Nitin Jadhav


0001-fix-typo-in-progress-reporting-doc.patch
Description: Binary data


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Dilip Kumar
On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar  wrote:
>
> On Thu, Mar 10, 2022 at 10:18 PM Robert Haas  wrote:
> >
> > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar  wrote:
> > > I have completely changed the logic for this refactoring.  Basically,
> > > write_relmap_file(), is already having parameters to control whether
> > > to write wal, send inval and we are already passing the dbpath.
> > > Instead of making a new function I just pass one additional parameter
> > > to this function itself about whether we are creating a new map or not
> > > and I think with that changes are very less and this looks cleaner to
> > > me.  Similarly for load_relmap_file() also I just had to pass the
> > > dbpath and memory for destination map.  Please have a look and let me
> > > know your thoughts.
> >
> > It's not terrible, but how about something like the attached instead?
> > I think this has the effect of reducing the number of cases that the
> > low-level code needs to know about from 2 to 1, instead of making it
> > go up from 2 to 3.
>
> Yeah this looks cleaner, I will rebase the remaining patch.

Here is the updated version of the patch set.

Changes:
1) Take Robert's patch as the first refactoring patch.
2) Rebase the other new relmapper APIs on top of that in 0002.
3) Some code refactoring in the main patch 0005, and also one problem fix:
earlier, in the wal_log method, I had removed ForceSyncCommit(), but IMHO
that is equally valid whether we use file_copy or wal_log, because in both
cases we are creating the disk files.
4) Support strategy in the createdb tool and add a test case as part of 0006.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 7bcc39c740a2a737f1ebd6c7b0441c9df4fab6d3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 11 Mar 2022 09:03:09 +0530
Subject: [PATCH v13 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled with the shared_map and local_map of the database the
backend is connected to.  But for the higher-level patch set we need
interfaces where we can read the relmap into any given memory, and
while writing we should also be able to pass in the map.

So as part of this patch, we are doing refactoring of the
existing code such that we can expose the read and write
interfaces that are independent of the shared_map and the
local_map, without changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 147 ++--
 1 file changed, 74 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..f172f61 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,9 +136,11 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 			 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 			  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-			  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+			 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+			  bool send_sinval, bool preserve_files,
 			  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -405,12 +407,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			read_relmap_file(true, false);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			read_relmap_file(false, false);
 	}
 }
 
@@ -425,9 +427,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		read_relmap_file(true, false);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		read_relmap_file(false, false);
 }
 
 /*
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-	  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+	  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 	  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -612,7 +614,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	read_relmap_file(true, false);
 }
 
 /*
@@ -633,7 +635,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map fil

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-11 Thread Julien Rouhaud
On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote:
>
> Ok. I agree that it is difficult to interpret it correctly. So even if
> say that a new checkpoint has been explicitly requested, the user may
> not understand that it affects current checkpoint behaviour unless the
> user knows the internals of the checkpoint. How about naming the field
> to 'throttled' (Yes/ No) since our objective is to show that the
> current checkpoint is throttled or not.

-1

That "throttled" flag should be the same as having or not a "force" in the
flags.  We should be consistent and report information the same way, so either
a lot of flags (is_throttled, is_force...) or as now a single field containing
the set flags, so the current approach seems better.  Also, it wouldn't be much
better to show the checkpoint as not having the force flags and still not being
throttled.

Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
view?

CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it's and
whether or not the current checkpoint to be finished quickly.  For this flag I
think it's better to not report it in the view flags but with a new field, as
discussed before, as it's really what it means.

CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.




Re: Column Filtering in Logical Replication

2022-03-11 Thread Amit Kapila
On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra
 wrote:
>
> On 3/10/22 20:10, Tomas Vondra wrote:
> >
> >
> > FWIW I think the reason is pretty simple - pgoutput_row_filter_init is
> > broken. It assumes you can just do this
> >
> > rftuple = SearchSysCache2(PUBLICATIONRELMAP,
> >   ObjectIdGetDatum(entry->publish_as_relid),
> >   ObjectIdGetDatum(pub->oid));
> >
> > if (HeapTupleIsValid(rftuple))
> > {
> > /* Null indicates no filter. */
> > rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
> >   Anum_pg_publication_rel_prqual,
> >   &pub_no_filter);
> > }
> > else
> > {
> > pub_no_filter = true;
> > }
> >
> >
> > and pub_no_filter=true means there's no filter at all. Which is
> > nonsense, because we're using publish_as_relid here - the publication
> > may not include this particular ancestor, in which case we need to just
> > ignore this publication.
> >
> > So yeah, this needs to be reworked.
> >
>
> I spent a bit of time looking at this, and I think a minor change in
> get_rel_sync_entry() fixes this - it's enough to ensure rel_publications
> only includes publications that actually include publish_as_relid.
>

Thanks for looking into this. I think in the first patch, before
calling get_partition_ancestors(), we need to ensure the relation is a
partition (the call expects that) and that pubviaroot is true. I think it
would be good if we could avoid an additional call to
get_partition_ancestors(), as it could be costly. I wonder why it is not
sufficient to ensure that publish_as_relid exists after the ancestor in
the ancestors list before assigning the ancestor to publish_as_relid?
This only needs to be done in the (if (!publish)) case. I have not tried
this, so I could be wrong.

-- 
With Regards,
Amit Kapila.




Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Yura Sokolov
В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет:
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Thanks!  I looked into dynahash part.
> > 
> >  struct HASHHDR
> >  {
> > - /*
> > -  * The freelist can become a point of contention in high-concurrency hash
> > 
> > Why did you move around the freeList?
> > 
> > 
> > -	long		nentries;	/* number of entries in associated buckets */
> > +	long		nfree;		/* number of free entries in the list */
> > +	long		nalloced;	/* number of entries initially allocated for
> > 
> > Why do we need nfree?  HASH_ASSIGN should do the same thing with
> > HASH_REMOVE.  Maybe the reason is that the code tries to put the detached
> > bucket on a different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
> > 
> > -static void hdefault(HTAB *hashp);
> > +static void hdefault(HTAB *hashp, bool partitioned);
> > 
> > That optimization may work even a bit, but isn't it irrelevant to
> > this patch?

(Forgot to answer this in the previous letter.)
Yes, the third commit is very optional. But adding `nalloced` to
`FreeListData` increases the allocation a lot even for usual
non-shared, non-partitioned dynahashes. And that allocation is
quite large right now for no meaningful reason.

> > 
> > + case HASH_REUSE:
> > + if (currBucket != NULL)
> > + {
> > + /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> > + Assert(DynaHashReuse.hashp == NULL);
> > + Assert(DynaHashReuse.element == NULL);
> > 
> > I think all cases in the switch(action) other than HASH_ASSIGN needs
> > this assertion and no need for checking both, maybe only for element
> > would be enough.
> 
> While I looked buf_table part, I came up with additional comments.
> 
> BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
> {
> hash_search_with_hash_value(SharedBufHash,
> 
> HASH_ASSIGN,
> ...
> BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> BufTableDelete considers both the reuse and !reuse cases, but
> BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> odd. We should use HASH_ENTER here.  Thus I think it is more
> reasonable that HASH_ENTER uses the stashed entry if it exists and
> is needed, or returns it to the freelist if it exists but is not needed.
> 
> What do you think about this?

Well... I don't like it, but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs considerably.
On the other hand, it is probably possible to merge them carefully.
I'll try.

-

regards

Yura Sokolov





Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-11 Thread Nitin Jadhav
> > I just wanted to avoid extra calculations just to show the progress in
> > the view. Since it's a good metric, I have added an additional field
> > named 'next_flags' to the view which holds all possible flag values of
> > the next checkpoint.
>
> I still don't think that's ok.  IIUC the only way to know if the current
> checkpoint is throttled or not is to be aware that the "next_flags" can apply
> to the current checkpoint too, look for it and see if that changes the
> semantics of what the view say the current checkpoint is.  Most users will get
> it wrong.
>
> Again I would just display a bool flag saying whether a new checkpoint has 
> been
> explicitly requested or not, it seems enough.

OK. I agree that it is difficult to interpret it correctly. So even if we
say that a new checkpoint has been explicitly requested, the user may
not understand that it affects the current checkpoint's behaviour unless
the user knows the internals of the checkpoint. How about naming the field
'throttled' (Yes/No), since our objective is to show whether the
current checkpoint is throttled or not?

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 9, 2022 at 7:48 PM Julien Rouhaud  wrote:
>
> On Tue, Mar 08, 2022 at 08:57:23PM +0530, Nitin Jadhav wrote:
> >
> > I just wanted to avoid extra calculations just to show the progress in
> > the view. Since it's a good metric, I have added an additional field
> > named 'next_flags' to the view which holds all possible flag values of
> > the next checkpoint.
>
> I still don't think that's ok.  IIUC the only way to know if the current
> checkpoint is throttled or not is to be aware that the "next_flags" can apply
> to the current checkpoint too, look for it and see if that changes the
> semantics of what the view say the current checkpoint is.  Most users will get
> it wrong.
>
> > This gives more information than just saying
> > whether the new checkpoint is requested or not with the same memory.
>
> So that next_flags will be empty most of the time?  It seems confusing.
>
> Again I would just display a bool flag saying whether a new checkpoint has 
> been
> explicitly requested or not, it seems enough.
>
> If you're interested in that next checkpoint, you probably want a quick
> completion of the current checkpoint first (and thus need to know if it's
> throttled or not).  And then you will have to keep monitoring that view for 
> the
> next checkpoint anyway, and at that point the view will show the relevant
> information.




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-11 Thread Nitin Jadhav
> > The current format matches with the server log message for the
> > checkpoint start in LogCheckpointStart(). Just to be consistent, I
> > have not changed the code.
> >
>
> See below, how flags are shown in other sql functions like:
>
> ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
>raw_flags| combined_flags
> -+
> {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
> (1 row)
>
> This looks more readable and it's easy to understand for the
> end-users.. Further comparing the way log messages are displayed with
> the way sql functions display its output doesn't look like a right
> comparison to me. Obviously both should show matching data but the way
> it is shown doesn't need to be the same. In fact it is not in most of
> the cases.

OK. I will take care of this in the next patch. I would like to handle it at
the SQL level in system_views.sql. The following can be used to display the
flags in the format described above.

  ( '{' ||
    CASE WHEN (S.param2 & 4) > 0 THEN 'immediate' ELSE '' END ||
    CASE WHEN (S.param2 & 4) > 0 AND (S.param2 & -8) > 0 THEN ', ' ELSE '' END ||
    CASE WHEN (S.param2 & 8) > 0 THEN 'force' ELSE '' END ||
    CASE WHEN (S.param2 & 8) > 0 AND (S.param2 & -16) > 0 THEN ', ' ELSE '' END ||
    CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all' ELSE '' END ||
    CASE WHEN (S.param2 & 16) > 0 AND (S.param2 & -32) > 0 THEN ', ' ELSE '' END ||
    CASE WHEN (S.param2 & 32) > 0 THEN 'wait' ELSE '' END ||
    CASE WHEN (S.param2 & 32) > 0 AND (S.param2 & -128) > 0 THEN ', ' ELSE '' END ||
    CASE WHEN (S.param2 & 128) > 0 THEN 'wal' ELSE '' END ||
    CASE WHEN (S.param2 & 128) > 0 AND (S.param2 & -256) > 0 THEN ', ' ELSE '' END ||
    CASE WHEN (S.param2 & 256) > 0 THEN 'time' ELSE '' END
    || '}'

Basically, a separate CASE expression is used to decide whether a comma
has to be printed or not, by checking whether the previous flag bit is
enabled (so its name has just been displayed) and whether any higher bits
are enabled (so there are more flag names to be displayed). Kindly let me
know if you know of a better approach; one alternative is sketched below.
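
For illustration, an alternative sketch relying on concat_ws(), which skips
NULL arguments (the flag bit values are assumed to match the CASE checks
above):

  '{' ||
  concat_ws(', ',
            CASE WHEN (S.param2 & 4)   > 0 THEN 'immediate' END,
            CASE WHEN (S.param2 & 8)   > 0 THEN 'force' END,
            CASE WHEN (S.param2 & 16)  > 0 THEN 'flush-all' END,
            CASE WHEN (S.param2 & 32)  > 0 THEN 'wait' END,
            CASE WHEN (S.param2 & 128) > 0 THEN 'wal' END,
            CASE WHEN (S.param2 & 256) > 0 THEN 'time' END) ||
  '}'

Each CASE without an ELSE yields NULL when its bit is not set, and
concat_ws() simply skips NULLs, so the commas come out right without the
extra bit tests.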

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 9, 2022 at 7:07 PM Ashutosh Sharma  wrote:
>
> On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav
>  wrote:
> >
> > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > > > -[ RECORD 1 ]-+-
> > > > pid   | 22043
> > > > type  | checkpoint
> > > > kind  | immediate force wait requested time
> > > >
> > > > I think the output in the kind column can be displayed as {immediate,
> > > > force, wait, requested, time}. By the way these are all checkpoint
> > > > flags so it is better to display it as checkpoint flags instead of
> > > > checkpoint kind as mentioned in one of my previous comments.
> > >
> > > I will update in the next patch.
> >
> > The current format matches with the server log message for the
> > checkpoint start in LogCheckpointStart(). Just to be consistent, I
> > have not changed the code.
> >
>
> See below, how flags are shown in other sql functions like:
>
> ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
> raw_flags| combined_flags
> -+
>  {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
> (1 row)
>
> This looks more readable and it's easy to understand for the
> end-users.. Further comparing the way log messages are displayed with
> the way sql functions display its output doesn't look like a right
> comparison to me. Obviously both should show matching data but the way
> it is shown doesn't need to be the same. In fact it is not in most of
> the cases.
>
> > I have taken care of the rest of the comments in v5 patch for which
> > there was clarity.
> >
>
> Thank you very much. Will take a look at it later.
>
> --
> With Regards,
> Ashutosh Sharma.




Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Yura Sokolov
В Пт, 11/03/2022 в 15:30 +0900, Kyotaro Horiguchi пишет:
> At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov  
> wrote in 
> > В Вт, 01/03/2022 в 10:24 +0300, Yura Sokolov пишет:
> > > Ok, here is v4.
> > 
> > And here is v5.
> > 
> > First, there was compilation error in Assert in dynahash.c .
> > Excuse me for not checking before sending previous version.
> > 
> > Second, I add third commit that reduces HASHHDR allocation
> > size for non-partitioned dynahash:
> > - moved freeList to last position
> > - alloc and memset offset(HASHHDR, freeList[1]) for
> >   non-partitioned hash tables.
> > I didn't benchmarked it, but I will be surprised if it
> > matters much in performance sence.
> > 
> > Third, I put all three commits into single file to not
> > confuse commitfest application.
> 
> Thanks!  I looked into dynahash part.
> 
>  struct HASHHDR
>  {
> -   /*
> -	* The freelist can become a point of contention in high-concurrency hash
> 
> Why did you move around the freeList?
> 
> 
> -	long		nentries;	/* number of entries in associated buckets */
> +	long		nfree;		/* number of free entries in the list */
> +	long		nalloced;	/* number of entries initially allocated for
> 
> Why do we need nfree?  HASH_ASSIGN should do the same thing with
> HASH_REMOVE.  Maybe the reason is that the code tries to put the detached
> bucket on a different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp.  I think that
> should largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
freeList partition and the new one. That is two freeList[partition]->mutex
lock+unlock pairs.

But the count of free elements doesn't change, so if we change nentries
to nfree, there is no need to fix the freeList[partition]->nfree counters,
and hence no need to lock+unlock.

> 
> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
> 
> That optimization may work even a bit, but isn't it irrelevant to
> this patch?
> 
> +   case HASH_REUSE:
> +   if (currBucket != NULL)
> +   {
> +   /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> +   Assert(DynaHashReuse.hashp == NULL);
> +   Assert(DynaHashReuse.element == NULL);
> 
> I think all cases in the switch(action) other than HASH_ASSIGN needs
> this assertion and no need for checking both, maybe only for element
> would be enough.

Agree.





Re: WIP: WAL prefetch (another approach)

2022-03-11 Thread Julien Rouhaud
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote:
> On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud  wrote:
> >
> > This could use XLogRecGetBlock?  Note that this macro is for now never used.
> > xlogreader.c also has some similar forgotten code that could use
> > XLogRecMaxBlockId.
>
> That is true, but I was thinking of it like this: most of the existing
> code that interacts with xlogreader.c is working with the old model,
> where the XLogReader object holds only one "current" record.  For that
> reason the XLogRecXXX() macros continue to work as before, implicitly
> referring to the record that XLogReadRecord() most recently returned.
> For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
> even when referring to the "current" record, since xlogreader.c has
> switched to a new multi-record model.  In other words, they're sort of
> 'old API' accessors provided for continuity.  Does this make sense?

Ah I see, it does make sense.  I'm wondering if there should be some comment
somewhere on the top of the file to mention it, as otherwise someone may be
tempted to change it to avoid some record->record->xxx usage.

> > +DecodedXLogRecord *
> > +XLogNextRecord(XLogReaderState *state, char **errormsg)
> > +{
> > [...]
> > +   /*
> > +* state->EndRecPtr is expected to have been set by the last call to
> > +* XLogBeginRead() or XLogNextRecord(), and is the location of the
> > +* error.
> > +*/
> > +
> > +   return NULL;
> >
> > The comment should refer to XLogFindNextRecord, not XLogNextRecord?
> 
> No, it does mean to refer to the XLogNextRecord() (ie the last time
> you called XLogNextRecord and successfully dequeued a record, we put
> its end LSN there, so if there is a deferred error, that's the
> corresponding LSN).  Make sense?

It does, thanks!

> 
> > Also, is it worth an assert (likely at the top of the function) for that?
> 
> How could I assert that EndRecPtr has the right value?

Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid).
It can only make sure that the first call is done after XLogBeginRead /
XLogFindNextRecord, but that's better than nothing and consistent with the top
comment.

> > +   if (unlikely(state->decode_buffer == NULL))
> > +   {
> > +   if (state->decode_buffer_size == 0)
> > +   state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
> > +   state->decode_buffer = palloc(state->decode_buffer_size);
> > +   state->decode_buffer_head = state->decode_buffer;
> > +   state->decode_buffer_tail = state->decode_buffer;
> > +   state->free_decode_buffer = true;
> > +   }
> >
> > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
> > here too?  Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
> > the only caller is the recovery prefetching.
> 
> I don't think it matters much?

The thing is that for now the only caller of XLogReaderSetDecodeBuffer (in
0002) only uses it to set the length, so a buffer is actually never passed to
that function.  Since frontend code can rely on a palloc emulation, is there
really a use case for passing e.g. some stack buffer there, or something in a
specific memory context?  Those seem to be the only use cases for having
XLogReaderSetDecodeBuffer() rather than simply a
XLogReaderSetDecodeBufferSize().  But overall I agree it doesn't matter much,
so no objection to keeping it as-is.

> > It's also not particulary obvious why XLogFindNextRecord() doesn't check for
> > this value.  AFAICS callers don't (and should never) call it with a
> > nonblocking == true state, maybe add an assert for that?
> 
> Fair point.  I have now explicitly cleared that flag.  (I don't much
> like state->nonblocking, which might be better as an argument to
> page_read(), but in fact I don't like the fact that page_read
> callbacks are blocking in the first place, which is why I liked
> Horiguchi-san's patch to get rid of that... but that can be a subject
> for later work.)

Agreed.

> >  static void
> >  ResetDecoder(XLogReaderState *state)
> >  {
> > [...]
> > +   /* Reset the decoded record queue, freeing any oversized records. */
> > +   while ((r = state->decode_queue_tail))
> >
> > nit: I think it's better to explicitly check for the assignment being != 
> > NULL,
> > and existing code is more frequently written this way AFAICS.
> 
> I think it's perfectly normal idiomatic C, but if you think it's
> clearer that way, OK, done like that.

The thing I don't like about this form is that you can never be sure that an
assignment was really meant unless you read the rest of the nearby code.  Other
than that, agreed, it is perfectly normal idiomatic C.

> I realised that this version has broken -DWAL_DEBUG.  I'll fix that
> shortly, but I wanted to post this update ASAP, so here's a new
> version.

+ * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read cal

Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Thanks!  I looked into dynahash part.

Then I looked into bufmgr part.  It looks fine to me but I have some
comments on code comments.

>* To change the association of a valid buffer, we'll need to 
> have
>* exclusive lock on both the old and new mapping partitions.
>   if (oldFlags & BM_TAG_VALID)

We don't take a lock on the new mapping partition here.


+* Clear out the buffer's tag and flags.  We must do this to ensure that
+* linear scans of the buffer array don't think the buffer is valid. We
+* also reset the usage_count since any recency of use of the old 
content
+* is no longer relevant.
+*
+* We are single pinner, we hold buffer header lock and exclusive
+* partition lock (if tag is valid). Given these statements it is safe 
to
+* clear tag since no other process can inspect it to the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc.  But I think what we need to explain here is why we
invalidate the buffer here despite the fact that we are going to reuse it
soon.  And I think we need to state here that the old buffer is now safe
to use for the new tag.  I'm not sure that statement is really correct,
but clearing it out actually looks safer.

> Now it is safe to use victim buffer for new tag.  Invalidate the
> buffer before releasing header lock to ensure that linear scans of
> the buffer array don't think the buffer is valid.  It is safe
> because it is guaranteed that we're the single pinner of the buffer.
> That pin also prevents the buffer from being stolen by others until
> we reuse it or return it to freelist.

So I want to revise the following comment.

-* Now it is safe to use victim buffer for new tag.
+* Now reuse victim buffer for new tag.
>* Make sure BM_PERMANENT is set for buffers that must be written at 
> every
>* checkpoint.  Unlogged buffers only need to be written at shutdown
>* checkpoints, except for their "init" forks, which need to be treated
>* just like permanent relations.
>*
>* The usage_count starts out at 1 so that the buffer can survive one
>* clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Skipping logical replication transactions on subscriber side

2022-03-11 Thread Masahiko Sawada
On Thu, Mar 10, 2022 at 9:02 PM Amit Kapila  wrote:
>
> On Tue, Mar 1, 2022 at 8:31 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch along with two patches for cfbot tests
> > since the main patch (0003) depends on the other two patches. Both
> > 0001 and 0002 patches are the same ones I attached on another
> > thread[2].
> >
>
> Few comments on 0003:
> =
> 1.
> + 
> +  
> +   subskiplsn pg_lsn
> +  
> +  
> +   Commit LSN of the transaction whose changes are to be skipped,
> if a valid
> +   LSN; otherwise 0/0.
> +  
> + 
>
> Can't this be prepared LSN or rollback prepared LSN? Can we say
> Finish/End LSN and then add some details which all LSNs can be there?

Right, changed to finish LSN.

>
> 2. The conflict resolution explanation needs an update after the
> latest commits and we should probably change the commit LSN
> terminology as mentioned in the previous point.

Updated.

>
> 3. The text in alter_subscription.sgml looks a bit repetitive to me
> (similar to what we have in logical-replication.sgml related to
> conflicts). Here also we refer to only commit LSN which needs to be
> changed as mentioned in the previous two points.

Updated.

>
> 4.
> if (strcmp(lsn_str, "none") == 0)
> + {
> + /* Setting lsn = NONE is treated as resetting LSN */
> + lsn = InvalidXLogRecPtr;
> + }
> + else
> + {
> + /* Parse the argument as LSN */
> + lsn = DatumGetTransactionId(DirectFunctionCall1(pg_lsn_in,
> + CStringGetDatum(lsn_str)));
> +
> + if (XLogRecPtrIsInvalid(lsn))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid WAL location (LSN): %s", lsn_str)));
>
> Is there a reason that we don't want to allow setting 0
> (InvalidXLogRecPtr) for skip LSN?

0 is obviously an invalid value for the skip LSN, which should not be
allowed, similar to other options (like setting '' as slot_name). Also,
we use 0 (InvalidXLogRecPtr) internally to reset subskiplsn when
NONE is specified.

>
> 5.
> +# The subscriber will enter an infinite error loop, so we don't want
> +# to overflow the server log with error messages.
> +$node_subscriber->append_conf(
> + 'postgresql.conf',
> + qq[
> +wal_retrieve_retry_interval = 2s
> +]);
>
> Can we change this test to use disable_on_error feature? I am thinking
> if the disable_on_error feature got committed first, maybe we can have
> one test file for this and disable_on_error feature (something like
> conflicts.pl).

Good idea. Updated.

I've attached an updated version patch. This patch can be applied on
top of the latest disable_on_error patch[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAA4eK1Kes9TsMpGL6m%2BAJNHYCGRvx6piYQt5v6TEbH_t9jh8nA%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v13-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transa.patch
Description: Binary data