Re: Logical replication - schema change not invalidating the relation cache
On Fri, Dec 3, 2021 at 3:21 PM vignesh C wrote:
>
> On Fri, Dec 3, 2021 at 1:13 PM Michael Paquier wrote:
> >
> > On Thu, Aug 26, 2021 at 09:00:39PM +0530, vignesh C wrote:
> > > The previous patch was failing because of the recent test changes made
> > > by commit 201a76183e2, which unified new and get_new_node; the attached
> > > patch has the changes to handle this accordingly.
> >
> > Please note that the CF app is complaining about this patch, so a
> > rebase is required. I have moved it to the next CF, waiting on author,
> > for now.
>
> Thanks for letting me know. I have rebased it on top of HEAD; the
> attached v2 version has the rebased changes.

The patch was not applying on top of HEAD; attached is the v3 version, which has the rebased changes.

Regards,
Vignesh

From ad79b95e9362239e32cc597aeac1d6e22aaa7504 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Sat, 12 Mar 2022 13:04:55 +0530
Subject: [PATCH v3] Fix for invalidating logical replication relations when
 there is a change in schema.

When the schema gets changed, the rel sync cache invalidation was not
happening; fix this by adding a callback for schema changes.
---
 src/backend/replication/pgoutput/pgoutput.c | 44 +++++++++
 src/test/subscription/t/001_rep_changes.pl  | 74 +++++++++++++++
 2 files changed, 118 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ea57a0477f..2ce634a90b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -176,6 +176,8 @@ static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void rel_sync_cache_namespace_cb(Datum arg, int cacheid,
+										uint32 hashvalue);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -1658,6 +1660,9 @@ init_rel_sync_cache(MemoryContext cachectx)
 	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
+	CacheRegisterSyscacheCallback(NAMESPACEOID,
+								  rel_sync_cache_namespace_cb,
+								  (Datum) 0);
 }
 
 /*
@@ -1989,6 +1994,45 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	}
 }
 
+/*
+ * Namespace syscache invalidation callback
+ */
+static void
+rel_sync_cache_namespace_cb(Datum arg, int cacheid, uint32 hashvalue)
+{
+	HASH_SEQ_STATUS status;
+	RelationSyncEntry *entry;
+
+	/*
+	 * We can get here if the plugin was used in SQL interface as the
+	 * RelSchemaSyncCache is destroyed when the decoding finishes, but there
+	 * is no way to unregister the relcache invalidation callback.
+	 */
+	if (RelationSyncCache == NULL)
+		return;
+
+	hash_seq_init(&status, RelationSyncCache);
+	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
+	{
+		/*
+		 * Reset schema sent status as the relation definition may have
+		 * changed.  Also free any objects that depended on the earlier
+		 * definition.
+		 */
+		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+
+		if (entry->attrmap)
+			free_attrmap(entry->attrmap);
+		entry->attrmap = NULL;
+
+		if (hash_search(RelationSyncCache,
+						(void *) &entry->relid,
+						HASH_REMOVE, NULL) == NULL)
+			elog(ERROR, "hash table corrupted");
+	}
+}
+
 /*
  * Publication relation/schema map syscache invalidation callback
  *
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index af0cff6a30..4b7b9e54ff 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -520,6 +520,80 @@ is($result, qq(0), 'check replication origin was dropped on subscriber');
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
 
+# Test schema invalidation by renaming the schema
+# Create tables on publisher
+# Initialize publisher node
+my $node_publisher1 = PostgreSQL::Test::Cluster->new('publisher1');
+$node_publisher1->init(allows_streaming => 'logical');
+$node_publisher1->start;
+
+# Create subscriber node
+my $node_subscriber1 = PostgreSQL::Test::Cluster->new('subscriber1');
+$node_subscriber1->init(allows_streaming => 'logical');
+$node_subscriber1->start;
+
+my $publisher1_connstr = $node_publisher1->connstr . ' dbname=postgres';
+
+$node_publisher1->safe_psql('postgres', "CREATE SCHEMA sch1");
+$node_publisher1->safe_psql('postgres', "CREATE TABLE sch1.t1 (c1 int)");
+
+# Create tables on subscriber
+$node_subscriber1->safe_psql('postgres', "CREATE SCHEMA sch1");
+$node_subscriber1->safe_psql('postgres', "CREATE TABLE sch1.t1 (c1 int)");
+$node_subscriber1->safe_psql('postgres', "CREATE SCHEMA sch2");
+$node_subscriber1->safe_psql('postgres', "CREATE TABLE sch2.t1 (c1 int)");
+
+#
Re: Printing backtrace of postgres processes
On Wed, Mar 9, 2022 at 9:26 PM Justin Pryzby wrote:
>
> rebased to appease cfbot.
>
> + couple of little fixes as 0002.

Thanks for rebasing and fixing a few issues. I have taken all your changes except for the mcxtfuncs changes, as those were not done as part of this patch. Attached v19 patch which has the changes for the same.

Regards,
Vignesh

From baff4bd9331eff3d3d021724847c2fb89d4bcdcf Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Tue, 25 Jan 2022 08:21:22 +0530
Subject: [PATCH v19] Add function to log the backtrace of the specified
 postgres process.

This commit adds a pg_log_backtrace() function that requests logging the
backtrace of the specified backend or auxiliary process, except the logger
and statistics collector. Only superusers are allowed to request logging
the backtrace, because allowing any user to issue such requests at an
unbounded rate would cause lots of log messages, which can lead to denial
of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target
backend logs its backtrace at LOG_SERVER_ONLY level, so that the backtrace
will appear in the server log but not be sent to the client.

Bump catalog version.
Authors: Vignesh C, Bharath Rupireddy, Justin Pryzby
Reviewers: Bharath Rupireddy, Justin Pryzby, Fujii Masao, Atsushi Torikoshi, Dilip Kumar, Robert Haas, Andres Freund, Tom Lane, Craig Ringer
Discussion: https://www.postgresql.org/message-id/CALDaNm3ZzmFS-=r7oduzj7y7bgqv+n06kqyft6c3xzdoknk...@mail.gmail.com
---
 doc/src/sgml/func.sgml                    | 62 +++++++++++++
 src/backend/catalog/system_functions.sql  |  2 +
 src/backend/postmaster/autovacuum.c       |  4 ++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/postmaster/interrupt.c        |  4 ++
 src/backend/postmaster/pgarch.c           | 10 ++-
 src/backend/postmaster/startup.c          |  4 ++
 src/backend/postmaster/walwriter.c        |  4 ++
 src/backend/storage/ipc/procarray.c       | 56 ++++++++++++
 src/backend/storage/ipc/procsignal.c      | 42 +++++++++
 src/backend/storage/ipc/signalfuncs.c     | 73 +++++++++-----
 src/backend/tcop/postgres.c               |  4 ++
 src/backend/utils/adt/mcxtfuncs.c         | 40 +++++---
 src/backend/utils/error/elog.c            | 19 +++-
 src/backend/utils/init/globals.c          |  1 +
 src/include/catalog/pg_proc.dat           |  5 ++
 src/include/miscadmin.h                   |  1 +
 src/include/storage/procarray.h           |  2 +
 src/include/storage/procsignal.h          |  3 +-
 src/include/utils/elog.h                  |  2 +
 src/test/regress/expected/backtrace.out   | 49 ++++++++++
 src/test/regress/expected/backtrace_1.out | 55 +++++++++++
 src/test/regress/parallel_schedule        |  2 +-
 src/test/regress/sql/backtrace.sql        | 33 +++++++
 24 files changed, 416 insertions(+), 65 deletions(-)
 create mode 100644 src/test/regress/expected/backtrace.out
 create mode 100644 src/test/regress/expected/backtrace_1.out
 create mode 100644 src/test/regress/sql/backtrace.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8a802fb225..4171208719 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25355,6 +25355,30 @@ SELECT collation for ('foo' COLLATE "de_DE");
+
+ pg_log_backtrace
+
+ pg_log_backtrace ( pid integer )
+ boolean
+
+ Requests to log the backtrace of the backend with the
+ specified process ID. This function can send the request to
+ backends and auxiliary processes except the logger and statistics
+ collector. The backtraces will be logged at LOG
+ message level. They will appear in the server log based on the log
+ configuration set (see for
+ more information), but will not be sent to the client regardless of
+ . A backtrace identifies
+ which function a process is currently executing, and it may be useful
+ for developers to diagnose stuck processes and other problems. This
+ function is supported only if PostgreSQL was built with the ability to
+ capture backtraces; otherwise it will emit a warning.
+
@@ -25574,6 +25598,44 @@ LOG: Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.
+
+ pg_log_backtrace can be used to log the backtrace of
+ a backend process. For example:
+
+ postgres=# select pg_log_backtrace(pg_backend_pid());
+  pg_log_backtrace
+ ------------------
+  t
+ (1 row)
+
+ The backtrace will be logged as specified by the logging configuration.
+ For example:
+
+ 2021-01-27 11:33:50.247 IST [111735] LOG:  current backtrace:
+ postgres: postgresdba postgres [local] SELECT(set_backtrace+0x38) [0xae06c5]
+ postgres: postgresdba postgres [local] SELECT(ProcessInterrupts+0x788) [0x950c34]
+
Re: Handle infinite recursion in logical replication setup
On Fri, Mar 11, 2022 at 4:28 PM kuroda.hay...@fujitsu.com wrote:
>
> Hi Vignesh,
>
> While considering the second problem, I was very confused about it.
> I'd be happy if you could answer my question.
>
> > To handle this, if the user has specified the only_local option, we could
> > throw a warning or error out while creating the subscription; in this case
> > we could have a column srreplicateddata in pg_subscription_rel which
> > could indicate if the table has any replicated data or not:
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn  | srreplicateddata
> > ---------+---------+------------+-----------+------------------
> >    16389 |   16384 | r          | 0/14A4640 | t
> >    16389 |   16385 | r          | 0/14A4690 | f
> > (2 rows)
> > In the above example, srreplicateddata with true indicates that table t1,
> > whose relid is 16384, has replicated data, and the other row, having
> > srreplicateddata as false, indicates that table t2, whose relid is 16385,
> > does not have replicated data.
> > When creating a new subscription, the subscriber will connect to the
> > publisher and check if the relation has replicated data by checking
> > srreplicateddata in the pg_subscription_rel table.
> > If the table has any replicated data, log a warning or error for this.
>
> IIUC srreplicateddata represents whether the subscribed data was not
> generated by the publisher, but by another node.
> My first impression was that the name 'srreplicateddata' is not friendly
> because all subscribed data is replicated from the publisher.
> Also I was not sure how the value of the column is set.
> IIUC filtering by replication origins is done on the publisher node,
> and the subscriber node cannot know
> whether some data was really filtered or not.
> If we distinguish by the subscriber option publish_local_only,
> it cannot reproduce your example, because the same subscriber would have
> different 'srreplicateddata'.
Let's consider an existing multi-master logical replication setup between Node1 and Node2 that was created using the following steps:
a) Node1 - Publication publishing the employee table - pub1
b) Node2 - Subscription subscribing from publication pub1 with publish_local_only - sub1_pub1_node1
c) Node2 - Publication publishing the employee table - pub2
d) Node1 - Subscription subscribing from publication pub2 with publish_local_only - sub2_pub2_node2

To create a subscription in Node3, we will use the following steps:
a) Node2 - Publication publishing the employee table - pub3
b) Node3 - Subscription subscribing from the publication in Node2 with publish_local_only - sub3_pub3_node2

When we create a subscription in Node3, Node3 will connect to Node2 (this check is not done in Node3) and check if the employee table is present in pg_subscription_rel; in our case Node2 will have the employee table present in pg_subscription_rel (sub1_pub1_node1 subscribing to the employee table from pub1 in Node1). As the employee table is being subscribed in Node2 from Node1, we will throw an error like below:

postgres=# create subscription sub2 CONNECTION 'dbname =postgres port = ' publication pub2 with (publish_local_only=on);
ERROR:  CREATE/ALTER SUBSCRIPTION with publish_local_only and copy_data as true is not allowed when the publisher might have replicated data, table:public.t1 might have replicated data in the publisher
HINT:  Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force

I was initially planning to add the srreplicateddata field, but I have changed it slightly to keep the design simple. Now we just check if the relation is present in pg_subscription_rel and throw an error if the copy_data and publish_local_only options are specified. The changes for the same are available at [1].

[1] - https://www.postgresql.org/message-id/CALDaNm0V%2B%3Db%3DCeZJNAAUO2PmSXH5QzNX3jADXb-0hGO_jVj0vA%40mail.gmail.com

Thoughts?

Regards,
Vignesh
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Sat, Mar 12, 2022 at 1:55 AM Robert Haas wrote:
>
> Responding to this specific issue..
>
> The changes to createdb_failure_params make me a little nervous. I
> think we'd be in real trouble if we failed before completing both
> DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
> looks to me like those are both intended to be no-fail operations, so
> I don't see an actual hazard here.

I might be missing something, but even without that I do not see a real problem here. The reason we are dropping the database buffers and pending sync requests is that right after this we are removing the underlying files, and if we just remove the files without dropping the buffers from the buffer cache, the checkpointer will fail while trying to flush those buffers.

> But, hmm, what about on the recovery side? Suppose that we start
> copying the database block by block and then either (a) the standby is
> promoted before the copy is finished or (b) the copy fails.

I think the above logic will be valid in the case of a standby as well, because we are not really deleting the underlying files.

> Now the standby has data in shared_buffers for a database that does
> not exist. If that's not bad, then why does createdb_failure_params
> need to DropDatabaseBuffers()? But I bet it is bad. I wonder if we
> should be using RelationCopyStorage() rather than this new function
> RelationCopyStorageUsingBuffer().

I am not sure how RelationCopyStorage() will help on the standby side, because then also we will log the same WAL (XLOG_FPI) for each page, and on the standby side we will use a buffer to apply this FPI; so if you think there is a problem, it will be the same with RelationCopyStorage(), at least on the standby side. In fact, while rewriting the relation during VACUUM FULL we also call log_newpage() under RelationCopyStorage(), and if the standby gets promoted we will have some buffers in the buffer pool with the new relfilenode. So I think our case is also the same.

So here my stand is that we need to drop the database buffers and remove the pending sync requests because we are deleting the underlying files; if in some extreme case we do not delete the files, then there is no need to drop the buffers or remove the pending sync requests, and the worst consequence would be a waste of disk space.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Handle infinite recursion in logical replication setup
On Wed, Mar 2, 2022 at 3:59 PM Amit Kapila wrote:
>
> On Wed, Feb 23, 2022 at 11:59 AM vignesh C wrote:
> >
> ...
> ...
> > I have attached a basic patch for this; if the idea is accepted, I
> > will work further to test more scenarios, add documentation, and test
> > and post an updated patch.
> > For the second problem, table synchronization of a table including local
> > data and replicated data using the copy command:
> >
> > Let us consider the following scenario:
> > a) node1 publishing to node2 b) node2 publishing to node1. Here in
> > this case node1 will have replicated data from node2 and vice versa.
> >
> > In the above, if the user wants to include node3 to subscribe data from
> > node2, the user will have to create a subscription in node3 to get the
> > data from node2. During table synchronization we send the complete
> > table data from node2 to node3. Node2 will have local data from node2
> > and also replicated data from node1. Currently we don't have an option
> > to differentiate between the locally generated data and replicated
> > data in the heap, which will cause infinite recursion as described
> > above.
> >
> > To handle this, if the user has specified the only_local option, we could
> > throw a warning or error out while creating the subscription; in this case
> > we could have a column srreplicateddata in pg_subscription_rel which
> > could indicate if the table has any replicated data or not:
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn  | srreplicateddata
> > ---------+---------+------------+-----------+------------------
> >    16389 |   16384 | r          | 0/14A4640 | t
> >    16389 |   16385 | r          | 0/14A4690 | f
> > (2 rows)
> >
> > In the above example, srreplicateddata with true indicates that table t1,
> > whose relid is 16384, has replicated data, and the other row, having
> > srreplicateddata as false, indicates that table t2, whose relid is 16385,
> > does not have replicated data.
> > When creating a new subscription, the subscriber will connect to the
> > publisher and check if the relation has replicated data by checking
> > srreplicateddata in the pg_subscription_rel table.
> > If the table has any replicated data, log a warning or error for this.
> >
>
> If you want to give the error in this case, then I think we need to
> provide an option to the user to allow the copy. One possibility could be
> to extend the existing copy_data option as 'false', 'true', 'force'. For
> 'false', there shouldn't be any change; for 'true', if the 'only_local'
> option is also set and the new column indicates replicated data, then
> give an error; for 'force', we won't give an error even if the
> conditions as mentioned for the 'true' case are met, rather we will allow
> the copy in this case.

When a subscription is created with publish_local_only and copy_data, it will connect to the publisher and check if the published tables have also been subscribed from other nodes, by checking if the entry is present in pg_subscription_rel, and throw an error if present. The attached v4-0002-Support-force-option-for-copy_data-check-and-thro.patch has the implementation for the same.

Thoughts?

Regards,
Vignesh

From ef978c53b5b195b861210a2ef7f2ff876f004e94 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Mon, 7 Mar 2022 11:19:10 +0530
Subject: [PATCH v4 1/2] Skip replication of non local data.

Add an option publish_local_only which will subscribe only to the locally
generated data in the publisher node. If the subscriber is created with
this option, the publisher will skip publishing the data that was
subscribed from other nodes.
It can be created using the following syntax:
ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port=' PUBLICATION pub1 with (publish_local_only = on);
---
 contrib/test_decoding/test_decoding.c         |  20 +++
 doc/src/sgml/ref/alter_subscription.sgml      |   3 +-
 doc/src/sgml/ref/create_subscription.sgml     |  12 ++
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   5 +-
 src/backend/commands/subscriptioncmds.c       |  26 +++-
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/decode.c      |  15 ++-
 src/backend/replication/logical/logical.c     |  33 +++++
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/pgoutput/pgoutput.c   |  45 +++++++
 src/bin/pg_dump/pg_dump.c                     |  16 ++-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   8 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/logical.h             |   4 +
 src/include/replication/output_plugin.h       |   7 ++
 src/include/replication/pgoutput.h            |   1 +
 src/include/replication/walreceiver.h         |   1 +
 src/test/regress/expected/subscription.out    | 115 ++++++++++++++++++
 src/test/regress/sql/subs
Re: Column Filtering in Logical Replication
On Fri, Mar 11, 2022 at 6:20 PM Tomas Vondra wrote:
>
> On 3/11/22 10:52, Amit Kapila wrote:
> > On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra wrote:
> >>
> >> On 3/10/22 20:10, Tomas Vondra wrote:
> >>>
> >>> FWIW I think the reason is pretty simple - pgoutput_row_filter_init is
> >>> broken. It assumes you can just do this
> >>>
> >>> rftuple = SearchSysCache2(PUBLICATIONRELMAP,
> >>>                           ObjectIdGetDatum(entry->publish_as_relid),
> >>>                           ObjectIdGetDatum(pub->oid));
> >>>
> >>> if (HeapTupleIsValid(rftuple))
> >>> {
> >>>     /* Null indicates no filter. */
> >>>     rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
> >>>                               Anum_pg_publication_rel_prqual,
> >>>                               &pub_no_filter);
> >>> }
> >>> else
> >>> {
> >>>     pub_no_filter = true;
> >>> }
> >>>
> >>> and pub_no_filter=true means there's no filter at all. Which is
> >>> nonsense, because we're using publish_as_relid here - the publication
> >>> may not include this particular ancestor, in which case we need to just
> >>> ignore this publication.
> >>>
> >>> So yeah, this needs to be reworked.
> >>
> >> I spent a bit of time looking at this, and I think a minor change in
> >> get_rel_sync_entry() fixes this - it's enough to ensure rel_publications
> >> only includes publications that actually include publish_as_relid.
> >
> > Thanks for looking into this. I think in the first patch before
> > calling get_partition_ancestors() we need to ensure it is a partition
> > (the call expects that) and pubviaroot is true.
>
> Does the call really require that?

There may not be any harm, but I have mentioned it because (a) the comments atop get_partition_ancestors() ("...it should only be called when it is known that the relation is a partition.") indicate the same; (b) all existing callers seem to use it only for partitions.

> Also, I'm not sure why we'd need to
> look at pubviaroot - that's already considered earlier when calculating
> publish_as_relid, here we just need to know the relationship of the two
> OIDs (if one is ancestor/child of the other).

I thought of avoiding calling get_partition_ancestors() when pubviaroot is not set. It will needlessly check the whole hierarchy of partitions even when it is not required. I agree that this is not a common code path, but still felt: why do it needlessly?

> > I think it would be
> > good if we can avoid an additional call to get_partition_ancestors()
> > as it could be costly.
>
> Maybe. OTOH we should do this only very rarely anyway.
>
> > I wonder why it is not sufficient to ensure
> > that publish_as_relid exists after ancestor in ancestors list before
> > assigning the ancestor to publish_as_relid? This only needs to be done
> > in case of (if (!publish)). I have not tried this so I could be wrong.
>
> I'm not sure what exactly you are proposing. Maybe try coding it? That's
> probably faster than trying to describe what the code might do ...

Okay, please find attached. I have done basic testing of this; if we agree with this approach then it will require some more testing.

--
With Regards,
Amit Kapila.

v1-0001-fixup-publish_as_relid.patch
Description: Binary data

v1-0002-fixup-row-filter-publications.patch
Description: Binary data
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Thu, 2022-03-10 at 15:54 -0500, Stephen Frost wrote:
> The standard is basically that all of the functions it brings are
> written to enforce the PG privilege system and you aren't able to use
> the extension to bypass those privileges. In some cases that means
> that

Every extension should follow that standard, right? If it doesn't (e.g. creating dangerous functions and granting them to public), then even a superuser should not install it.

> the C-language functions installed have if(!superuser) ereport()
> calls

I'm curious why not rely on the grant system where possible? I thought we were trying to get away from explicit superuser checks.

> I've not looked back on this thread, but I'd expect pg_walinspect to
> need those superuser checks and with those it *could* be marked as
> trusted, but that again brings into question how useful it is to mark
> it thusly.

As long as any functions are safely accessible to public or a predefined role, there is some utility in the 'trusted' marker. As this patch is currently written, pg_monitor has access to these functions, though I don't think that's the right privilege level, at least for pg_get_raw_wal_record().

> I certainly don't think we should allow either database owners or
> regular users on a system the ability to access the WAL traffic of
> the entire system.

Agreed. That was not what I intended by asking if it should be marked 'trusted'. The marker only allows a non-superuser to run the CREATE EXTENSION command; it's up to the extension script to decide whether any non-superusers can do anything at all with the extension.

> More forcefully- we should *not* be throwing more access
> rights towards $owners in general and should be thinking about how we
> can allow admins, providers, whomever, the ability to control what
> rights users are given. If they're all lumped under 'owner' then
> there's no way for people to provide granular access to just those
> things they wish and intend to.

Not sure I understand, but that sounds like a larger discussion.

Regards,
Jeff Davis
Re: range_agg with multirange inputs
On 3/10/22 14:07, Chapman Flack wrote:
> When I apply this patch, I get a func.sgml with two entries for
> range_intersect_agg(anymultirange).

Arg, fixed.

> In range_agg_transfn, you've changed the message to the "must be called
> with a range or multirange"; that seems like another good candidate to
> be an elog.

Agreed. Updated here.

> I think your query finds aggregate declarations that share the same SQL
> function declaration as their finalizer functions. That seems to be more
> common. The query I used looks for cases where different SQL-declared
> functions appear as finalizers of aggregates, but the different
> SQL-declared functions share the same internal C implementation.

Okay, I see. I believe that is quite common for ordinary SQL functions. Sharing a prosrc seems even less remarkable than sharing an aggfinalfn. You're right there are no cases for other finalfns yet, but I don't think there is anything special about finalfns that would make this a weirder thing to do there than with ordinary functions. Still, noting it with a comment does seem helpful. I've updated the remark to match what you suggested.

Thank you again for the review, and sorry for so many iterations! :-)

Yours,

--
Paul ~{:-)
p...@illuminatedcomputing.com

From b409d7333132e34a928ec8cc0ecca0a3421b7268 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth"
Date: Fri, 10 Dec 2021 16:04:57 -0800
Subject: [PATCH v4] Add range_agg with multirange inputs

---
 doc/src/sgml/func.sgml                        |  30 ++++++
 src/backend/utils/adt/multirangetypes.c       |  69 ++++++++++--
 src/include/catalog/pg_aggregate.dat          |   3 +
 src/include/catalog/pg_proc.dat               |  11 ++
 src/test/regress/expected/multirangetypes.out | 100 ++++++++++++++++++
 src/test/regress/expected/opr_sanity.out      |   1 +
 src/test/regress/sql/multirangetypes.sql      |  21 ++++
 src/test/regress/sql/opr_sanity.sql           |   1 +
 8 files changed, 230 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index df3cd5987b..ab6c4f0093 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20012,8 +20012,23 @@ SELECT NULLIF(value, '(none)') ...
 No
+
+ range_agg
+
+ range_agg ( value anymultirange )
+ anymultirange
+
+ Computes the union of the non-null input values.
+
+ No
+
  range_intersect_agg
@@ -20027,8 +20042,23 @@ SELECT NULLIF(value, '(none)') ...
 No
+
+ range_intersect_agg
+
+ range_intersect_agg ( value anymultirange )
+ anymultirange
+
+ Computes the intersection of the non-null input values.
+
+ No
+
  string_agg

diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index c474b24431..a6d376b083 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1344,11 +1344,9 @@ range_agg_transfn(PG_FUNCTION_ARGS)
 		elog(ERROR, "range_agg_transfn called in non-aggregate context");
 	rngtypoid = get_fn_expr_argtype(fcinfo->flinfo, 1);
 	if (!type_is_range(rngtypoid))
-		ereport(ERROR,
-				(errcode(ERRCODE_DATATYPE_MISMATCH),
-				 errmsg("range_agg must be called with a range")));
+		elog(ERROR, "range_agg must be called with a range or multirange");
 	if (PG_ARGISNULL(0))
 		state = initArrayResult(rngtypoid, aggContext, false);
 	else
@@ -1362,8 +1360,11 @@ range_agg_transfn(PG_FUNCTION_ARGS)
 }
 
 /*
  * range_agg_finalfn: use our internal array to merge touching ranges.
+ *
+ * Shared by range_agg_finalfn(anyrange) and
+ * multirange_agg_finalfn(anymultirange).
  */
 Datum
 range_agg_finalfn(PG_FUNCTION_ARGS)
 {
@@ -1397,8 +1398,66 @@ range_agg_finalfn(PG_FUNCTION_ARGS)
 	PG_RETURN_MULTIRANGE_P(make_multirange(mltrngtypoid, typcache->rngtype,
 										   range_count, ranges));
 }
 
+/*
+ * multirange_agg_transfn: combine adjacent/overlapping multiranges.
+ *
+ * All we do here is gather the input multiranges' ranges into an array
+ * so that the finalfn can sort and combine them.
+ */
+Datum
+multirange_agg_transfn(PG_FUNCTION_ARGS)
+{
+	MemoryContext aggContext;
+	Oid			mltrngtypoid;
+	TypeCacheEntry *typcache;
+	TypeCacheEntry *rngtypcache;
+	ArrayBuildState *state;
+	MultirangeType *current;
+	int32		range_count;
+	RangeType **ranges;
+	int32		i;
+
+	if (!AggCheckCallContext(fcinfo, &aggContext))
+		elog(ERROR, "multirange_agg_transfn called in non-aggregate context");
+
+	mltrngtypoid = get_fn_expr_argtype(fcinfo->flinfo, 1);
+	if (!type_is_multirange(mltrngtypoid))
+		elog(ERROR, "range_agg must be called with a range or multirange");
+
+	typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
+	rngtypcache = typcache->rngtype;
+
+	if (PG_ARGISNULL
Re: Issue with pg_stat_subscription_stats
On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman wrote:
>
> So, I noticed that pg_stat_reset_subscription_stats() wasn't working
> properly, and, upon further investigation, I'm not sure the view
> pg_stat_subscription_stats is being properly populated.

I have tried the below scenario based on this:

Step 1: Create some data that generates conflicts and leads to apply failures, and then check the view:
postgres=# select * from pg_stat_subscription_stats;
 subid | subname | apply_error_count | sync_error_count | stats_reset
-------+---------+-------------------+------------------+-------------
 16389 | sub1    |                 4 |                0 |
(1 row)

Step 2: Reset the stats:
postgres=# select * from pg_stat_reset_subscription_stats(16389);
 pg_stat_reset_subscription_stats
----------------------------------

(1 row)

Step 3: Again, check the view:
postgres=# select * from pg_stat_subscription_stats;
 subid | subname | apply_error_count | sync_error_count |           stats_reset
-------+---------+-------------------+------------------+----------------------------------
 16389 | sub1    |                 0 |                0 | 2022-03-12 08:21:39.156971+05:30
(1 row)

The stats_reset time seems to be populated. Similarly, I have tried passing NULL to pg_stat_reset_subscription_stats() and it works. I think I am missing something here; can you please explain the exact scenario/steps where you observed that this API is not working?

--
With Regards,
Amit Kapila.
Re: On login trigger: take three
Hi,

On 2022-02-15 21:07:15 +1100, Greg Nancarrow wrote:
> Subject: [PATCH v25] Add a new "login" event and login event trigger support.
>
> The login event occurs when a user logs in to the system.

I think this needs a HUGE warning in the docs about most event triggers needing to check pg_is_in_recovery() or breaking hot standby. And certainly the example given needs to include a pg_is_in_recovery() check.

> + The login event occurs when a user logs in to the
> + system.
> + Any bugs in a trigger procedure for this event may prevent successful
> + login to the system. Such bugs may be fixed after first restarting the
> + system in single-user mode (as event triggers are disabled in this mode).
> + See the reference page for details about
> + using single-user mode.

I'm strongly against adding any new dependencies on single-user mode. A saner approach might be a superuser-only GUC that can be set as part of the connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true').

> @@ -293,6 +297,27 @@ insert_event_trigger_tuple(const char *trigname, const char *eventname, Oid evtO
>  	CatalogTupleInsert(tgrel, tuple);
>  	heap_freetuple(tuple);
>
> +	if (strcmp(eventname, "login") == 0)
> +	{
> +		Form_pg_database db;
> +		Relation	pg_db = table_open(DatabaseRelationId, RowExclusiveLock);
> +
> +		/* Set dathasloginevt flag in pg_database */
> +		tuple = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
> +		if (!HeapTupleIsValid(tuple))
> +			elog(ERROR, "cache lookup failed for database %u", MyDatabaseId);
> +		db = (Form_pg_database) GETSTRUCT(tuple);
> +		if (!db->dathasloginevt)
> +		{
> +			db->dathasloginevt = true;
> +			CatalogTupleUpdate(pg_db, &tuple->t_self, tuple);
> +		}
> +		else
> +			CacheInvalidateRelcacheByTuple(tuple);
> +		table_close(pg_db, RowExclusiveLock);
> +		heap_freetuple(tuple);
> +	}
> +
>  	/* Depend on owner. */
>  	recordDependencyOnOwner(EventTriggerRelationId, trigoid, evtOwner);

Maybe I am confused, but isn't that CacheInvalidateRelcacheByTuple call *entirely* bogus? CacheInvalidateRelcacheByTuple() expects a pg_class tuple, but you're passing in a pg_database tuple? And what is relcache invalidation even supposed to achieve here?

I think this should mention that ->dathasloginevt is unset on login when appropriate.

> +/*
> + * Fire login event triggers.
> + */
> +void
> +EventTriggerOnLogin(void)
> +{
> +	List	   *runlist;
> +	EventTriggerData trigdata;
> +
> +	/*
> +	 * See EventTriggerDDLCommandStart for a discussion about why event
> +	 * triggers are disabled in single user mode.
> +	 */
> +	if (!IsUnderPostmaster || !OidIsValid(MyDatabaseId))
> +		return;
> +
> +	StartTransactionCommand();
> +
> +	if (DatabaseHasLoginEventTriggers())
> +	{
> +		runlist = EventTriggerCommonSetup(NULL,
> +										  EVT_Login, "login",
> +										  &trigdata);
> +
> +		if (runlist != NIL)
> +		{
> +			/*
> +			 * Make sure anything the main command did will be visible to the
> +			 * event triggers.
> +			 */
> +			CommandCounterIncrement();

"Main command"? It's not clear to me why a CommandCounterIncrement() could be needed here - which previous writes do you need to make visible?

> +			/*
> +			 * Runlist is empty: clear dathasloginevt flag
> +			 */
> +			Relation	pg_db = table_open(DatabaseRelationId, RowExclusiveLock);
> +			HeapTuple	tuple = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
> +			Form_pg_database db;
> +
> +			if (!HeapTupleIsValid(tuple))
> +				elog(ERROR, "cache lookup failed for database %u", MyDatabaseId);
> +
> +			db = (Form_pg_database) GETSTRUCT(tuple);
> +			if (db->dathasloginevt)
> +			{
> +				/*
> +				 * There can be a race condition: a login event trigger may
> +				 * have been added after the pg_event_trigger table was
> +				 * scanned, and we don't want to erroneously clear the
> +				 * dathasloginevt flag in this case. To be sure that this
> +				 * hasn't happened, repeat the scan under the
Re: role self-revocation
> On Mar 11, 2022, at 4:56 PM, Stephen Frost wrote:
>
> First … I outlined a fair bit of further description in the message you just
> responded to but neglected to include in your response, which strikes me as
> odd that you’re now asking for further explanation.

> When it comes to completing the idea of role ownership- I didn’t come up
> with that idea nor champion it

Sorry, not "completing", but "competing". It seems we're discussing different ways to fix how roles and CREATEROLE work, and we have several ideas competing against each other. (This differs from *people* competing against each other, as I don't necessarily like the patch I wrote better than I like your idea.)

> and therefore I’m not really sure how many of the steps are required to fully
> support that concept..?

There are problems that the ownership concepts solve. I strongly suspect that your proposal could also solve those same problems, and I'm just trying to identify the specific portions of your proposal necessary to do so.

> For my part, I would think that those steps necessary to satisfy the spec
> would get us pretty darn close to what true folks advocating for role
> ownership are asking for

I have little idea what "true folks" means in this context. As for "advocating for role ownership", I'm not in that group. Whether role ownership or something else, I just want some solution to a set of problems, mostly to do with needing superuser to do role management tasks.

> , but that doesn’t include the superuser-only alter role attributes piece.
> Is that included in role ownership? I wouldn’t think so, but some might
> argue otherwise, and I don’t know that it is actually useful to divert into a
> discussion about what is or isn’t in that.

Introducing the idea of role ownership doesn't fix that. But a patch which introduces role ownership is useless unless CREATEROLE is also fixed.
There isn't any point having non-superusers create and own roles if, to do so, they need a privilege which can break into superuser. But that argument is no different with a patch along the lines of what you are proposing. CREATEROLE needs fixing either way. > If we agree that the role attribute bits are independent Yes. > then I think I agree that we need 1 and 2 to get the capabilities that the > folks asking for role ownership want Yes. > as 2 is where we make sure that one admin of a role can’t revoke another > admin’s rights over that role. Exactly, so #2 is part of the competing proposal. (I get the sense you might not see these as competing proposals, but I find that framing useful for deciding which approach to pursue.) > Perhaps 2 isn’t strictly necessary in a carefully managed environment where > no one else is given admin rights over the mini-superuser roles, but I’d > rather not have folks depending on that. I think it is necessary, and for the reason you say. > I’d still push back though and ask those advocating for this if it meets > what they’re asking for. I got the impression that it did but maybe I > misunderstood. > > In terms of exactly how things would work with these changes… I thought I > explained it pretty clearly, so it’s kind of hard to answer that further > without a specific question to answer. Did you have something specific in > mind? Perhaps I could answer a specific question and provide more clarity > that way. Your emails contained a lot of "we could do this or that depending on what people want, and maybe this other thing, but that isn't really necessary, and " which left me unclear on the proposal. I don't mean to disparage your communication style; it's just that when trying to distill technical details, high level conversation can be hard to grok. I have the sense that you aren't going to submit a patch, so I wanted this thread to contain enough detail for somebody else to do so. Thanks. 
— Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: refactoring basebackup.c
Hi,

On 2022-03-11 10:19:29 -0500, Robert Haas wrote:
> Thanks for the report. The problem here is that, when the output is
> standard output (-D -), pg_basebackup can only produce a single output
> file, so the manifest gets injected into the tar file on the client
> side rather than being written separately as we do in normal cases.
> However, that only works if we're receiving a tar file that we can
> parse from the server, and here the server is sending a compressed
> tarfile. The current code mistakenly attempts to parse the compressed
> tarfile as if it were an uncompressed tarfile, which causes the error
> messages that you are seeing (and which I can also reproduce here). We
> actually have enough infrastructure available in pg_basebackup now
> that we could do the "right thing" in this case: decompress the data
> received from the server, parse the resulting tar file, inject the
> backup manifest, construct a new tar file, and recompress. However, I
> think that's probably not a good idea, because it's unlikely that the
> user will understand that the data is being compressed on the server,
> then decompressed, and then recompressed again, and the performance of
> the resulting pipeline will probably not be very good. So I think we
> should just refuse this command. Patch for that attached.

You could also just append a manifest as a compressed tar to the compressed tar stream. Unfortunately GNU tar requires -i to read concatenated compressed archives, so perhaps that's not quite an alternative.

Greetings,

Andres Freund
Re: role self-revocation
Greetings,

On Fri, Mar 11, 2022 at 19:03 Mark Dilger wrote:
>
> > On Mar 11, 2022, at 2:46 PM, Stephen Frost wrote:
> >
> > I do think that’s reasonable … and believe I suggested it about 3
> > messages ago in this thread. ;) (step #3 I think it was? Or maybe 4).
>
> Yes, and you mentioned it to me off-list.

Indeed.

> I'm soliciting a more concrete specification for what you are proposing.
> To me, that means understanding how the SQL spec behavior that you champion
> translates into specific changes. You specified some of this in steps #1
> through #5, but I'd like a clearer indication of how many of those (#1
> alone, both #1 and #2, or what?) constitute a competing idea to the idea of
> role ownership, and greater detail about how each of those steps translate
> into specific behavior changes in postgres. Your initial five-step email
> seems to be claiming that #1 by itself is competitive, but to me it seems
> at least #1 and #2 would be required.

First … I outlined a fair bit of further description in the message you just responded to but neglected to include in your response, which strikes me as odd that you’re now asking for further explanation.

When it comes to completing the idea of role ownership- I didn’t come up with that idea nor champion it and therefore I’m not really sure how many of the steps are required to fully support that concept..?

For my part, I would think that those steps necessary to satisfy the spec would get us pretty darn close to what true folks advocating for role ownership are asking for, but that doesn’t include the superuser-only alter role attributes piece. Is that included in role ownership? I wouldn’t think so, but some might argue otherwise, and I don’t know that it is actually useful to divert into a discussion about what is or isn’t in that.
If we agree that the role attribute bits are independent then I think I agree that we need 1 and 2 to get the capabilities that the folks asking for role ownership want, as 2 is where we make sure that one admin of a role can’t revoke another admin’s rights over that role. Perhaps 2 isn’t strictly necessary in a carefully managed environment where no one else is given admin rights over the mini-superuser roles, but I’d rather not have folks depending on that. I’d still push back though and ask those advocating for this if it meets what they’re asking for. I got the impression that it did but maybe I misunderstood. In terms of exactly how things would work with these changes… I thought I explained it pretty clearly, so it’s kind of hard to answer that further without a specific question to answer. Did you have something specific in mind? Perhaps I could answer a specific question and provide more clarity that way. Thanks, Stephen >
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
Hi, On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote: > Have you been able to create a test case for that? The largest record I can > think of is a commit record with a huge number of subtransactions, dropped > relations, and shared inval messages. I'm not sure if you can overflow a > uint32 with that, but exceeding MaxAllocSize seems possible. MaxAllocSize is pretty easy: SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long); on a standby: 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/360 too long > I wonder if these checks hurt performance. These are very cheap, but then > again, this codepath is very hot. It's probably fine, but it still worries > me a little. Maybe some of these could be Asserts. I wouldn't expect the added branch itself to hurt much in XLogRegisterData() - it should be statically predicted to be not taken with the unlikely. I don't think it's quite inner-loop enough for the instructions or the number of "concurrently out of order branches" to be a problem. FWIW, often the added elog()s are worse, because they require a decent amount of code and restrict the optimizer somewhat (e.g. no sibling calls, more local variables etc). They can't even be deduplicated because of the line-numbers embedded. So maybe just collapse the new elog() with the previous elog, with a common unlikely()? > > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, > > if (needs_data) > > { > > + /* protect against overflow */ > > + if (unlikely(regbuf->rdata_len > UINT16_MAX)) > > + elog(ERROR, "too much WAL data for registered > > buffer"); > > + > > /* > > * Link the caller-supplied rdata chain for this buffer > > to the > > * overall list. FWIW, this branch I'm a tad more concerned about - it's in a loop body where plausibly a lot of branches could be outstanding at the same time. ISTM that this could just be an assert? Greetings, Andres Freund
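The "collapse the new elog() with the previous elog, with a common unlikely()" suggestion above can be sketched as follows. This is a minimal standalone model, not the actual xloginsert.c code: the function name, the return-code convention, and the limits are stand-ins for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler branch hint; plain condition on non-GCC compilers. */
#if defined(__GNUC__)
#define pg_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define pg_unlikely(x) (x)
#endif

/*
 * Sketch of the "single cold branch" shape: the existing slot-count check
 * and the new uint32 overflow check share one unlikely() condition and one
 * error path.  Returns 1 if registering 'len' more bytes is safe; 0 where
 * the real code would elog(ERROR, "too much WAL data").
 */
static int
can_register(int num_rdatas, int max_rdatas,
             uint64_t mainrdata_len, int len)
{
    if (pg_unlikely(num_rdatas >= max_rdatas ||
                    mainrdata_len + (uint64_t) len > UINT32_MAX))
        return 0;
    return 1;
}
```

With both conditions behind one statically-not-taken branch and one shared message, there is also only a single elog call site, which addresses the code-size/deduplication concern raised above.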
Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit
Applied patches v6-0002 and v6-0003 to the master branch, and the `make check` test is ok. Here are my test results (average of 10 runs) on 3 virtual machines:

before the patches:
  abort.1 = 2.5473 ms
  abort.2 = 4.1572 ms

after the patches with OPTIONS (ADD parallel_abort 'true'):
  abort.1 = 1.7136 ms
  abort.2 = 2.5833 ms

Overall, it is about a 32 ~ 37 % improvement in my testing environment.

On 2022-03-05 2:32 a.m., Etsuro Fujita wrote:
> On Mon, Feb 28, 2022 at 6:53 PM Etsuro Fujita wrote:
> > Here is an updated version. I added to the 0003 patch a macro for
> > defining the milliseconds to wait, as proposed by David upthread.
>
> I modified the 0003 patch further: 1) I added to
> pgfdw_cancel_query_end/pgfdw_exec_cleanup_query_end the PQconsumeInput
> optimization that we have in do_sql_command_end, and 2) I added/tweaked
> comments a bit further. Attached is an updated version.
>
> Like [1], I ran a simple performance test using the following transaction:
>
> BEGIN;
> SAVEPOINT s;
> INSERT INTO ft1 VALUES (10, 10);
> INSERT INTO ft2 VALUES (20, 20);
> ROLLBACK TO SAVEPOINT s;
> RELEASE SAVEPOINT s;
> INSERT INTO ft1 VALUES (10, 10);
> INSERT INTO ft2 VALUES (20, 20);
> ABORT;
>
> where ft1 is a foreign table created on a foreign server hosted on the
> same machine as the local server, and ft2 is a foreign table created on a
> foreign server hosted on a different machine. (In this test I used two
> machines, while in [1] I used three machines: one for the local server
> and the others for ft1 and ft2.)
>
> The average latencies for the ROLLBACK TO SAVEPOINT and ABORT commands
> over ten runs of the above transaction with the parallel_abort option
> disabled/enabled are:
>
> * ROLLBACK TO SAVEPOINT
>   parallel_abort=0: 0.3217 ms
>   parallel_abort=1: 0.2396 ms
> * ABORT
>   parallel_abort=0: 0.4749 ms
>   parallel_abort=1: 0.3733 ms
>
> This option reduces the latency for ROLLBACK TO SAVEPOINT by 25.5
> percent, and the latency for ABORT by 21.4 percent. From the results, I
> think the patch is useful.
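The quoted figures are internally consistent; the percentage improvements can be checked with plain arithmetic (the helper below is illustrative, not part of any patch):

```c
#include <assert.h>

/* Percent latency reduction going from 'before' to 'after'. */
static double
improvement_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/*
 * Applied to the numbers quoted in this thread:
 *   abort.1: 2.5473 -> 1.7136 ms  => ~32.7% (the "32%" end of the range)
 *   abort.2: 4.1572 -> 2.5833 ms  => ~37.9% (the "37%" end of the range)
 *   ROLLBACK TO SAVEPOINT: 0.3217 -> 0.2396 ms => ~25.5%
 *   ABORT:                 0.4749 -> 0.3733 ms => ~21.4%
 */
```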
Best regards, Etsuro Fujita [1]https://www.postgresql.org/message-id/CAPmGK17dAZCXvwnfpr1eTfknTGdt%3DhYTV9405Gt5SqPOX8K84w%40mail.gmail.com -- David Software Engineer Highgo Software Inc. (Canada) www.highgo.ca
Re: Kerberos delegation support in libpq and postgres_fdw
Greetings, On Fri, Mar 11, 2022 at 18:55 Jacob Champion wrote: > On Mon, 2022-02-28 at 20:28 -0500, Stephen Frost wrote: > > Will add to the CF for consideration. > > GSSAPI newbie here, so caveat lector. No worries, thanks for your interest! > diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c > > index efc53f3135..6f820a34f1 100644 > > --- a/src/backend/libpq/auth.c > > +++ b/src/backend/libpq/auth.c > > @@ -920,6 +920,7 @@ pg_GSS_recvauth(Port *port) > > int mtype; > > StringInfoData buf; > > gss_buffer_desc gbuf; > > + gss_cred_id_t proxy; > > > > /* > >* Use the configured keytab, if there is one. Unfortunately, > Heimdal > > @@ -949,6 +950,9 @@ pg_GSS_recvauth(Port *port) > >*/ > > port->gss->ctx = GSS_C_NO_CONTEXT; > > > > + proxy = NULL; > > + port->gss->proxy_creds = false; > > + > > /* > >* Loop through GSSAPI message exchange. This exchange can consist > of > >* multiple messages sent in both directions. First message is > always from > > @@ -999,7 +1003,7 @@ pg_GSS_recvauth(Port *port) > > >&port->gss->outbuf, > > >&gflags, > > >NULL, > > - >NULL); > > + >&proxy); > > > > /* gbuf no longer used */ > > pfree(buf.data); > > @@ -1011,6 +1015,12 @@ pg_GSS_recvauth(Port *port) > > > > CHECK_FOR_INTERRUPTS(); > > > > + if (proxy != NULL) > > + { > > + pg_store_proxy_credential(proxy); > > + port->gss->proxy_creds = true; > > + } > > + > > Some implementation docs [1] imply that a delegated_cred_handle is only > valid if the ret_flags include GSS_C_DELEG_FLAG. The C-binding RFC [2], > though, says that we can rely on it being set to GSS_C_NO_CREDENTIAL if > no handle was sent... > > I don't know if there are any implementation differences here, but in > any case I think it'd be more clear to use the GSS_C_NO_CREDENTIAL > spelling (instead of NULL) here, if we do decide not to check > ret_flags. Hmmm, yeah, that seems like it might be better and is something I’ll take a look at. 
[5] says we have to free the proxy credential with GSS_Release_cred(); > I don't see that happening anywhere, but I may have missed it. I’m not sure that it’s really necessary or worthwhile to do that at process end since … the process is about to end. I suppose we could provide a function that a user could call to ask for it to be released sooner if we really wanted..? > maj_stat = gss_init_sec_context(&min_stat, > > - > GSS_C_NO_CREDENTIAL, > > + > proxy, > > > &conn->gctx, > > > conn->gtarg_nam, > > > GSS_C_NO_OID, > > - > GSS_C_MUTUAL_FLAG, > > + > GSS_C_MUTUAL_FLAG | GSS_C_DELEG_FLAG, > > 0, > > > GSS_C_NO_CHANNEL_BINDINGS, > > > (ginbuf.value == NULL) ? GSS_C_NO_BUFFER : &ginbuf, > > diff --git a/src/interfaces/libpq/fe-secure-gssapi.c > b/src/interfaces/libpq/fe-secure-gssapi.c > > index 6ea52ed866..566c89f52f 100644 > > --- a/src/interfaces/libpq/fe-secure-gssapi.c > > +++ b/src/interfaces/libpq/fe-secure-gssapi.c > > @@ -631,7 +631,7 @@ pqsecure_open_gss(PGconn *conn) > >*/ > > major = gss_init_sec_context(&minor, conn->gcred, &conn->gctx, > > > conn->gtarg_nam, GSS_C_NO_OID, > > - > GSS_REQUIRED_FLAGS, 0, 0, &input, NULL, > > + > GSS_REQUIRED_FLAGS | GSS_C_DELEG_FLAG, 0, 0, &input, NULL, > > It seems like there should be significant security implications to > allowing delegation across the board. Especially since one postgres_fdw > might delegate to another server, and then another... Should this be > opt-in, maybe via a connection parameter? This is already opt-in- at kinit time a user can decide if they’d like a proxy-able ticket or not. I don’t know that we really need to have our own option for it … tho I’m not really against adding such an option either. (It also looks like there are some mechanisms for further constraining > delegation scope, either by administrator policy or otherwise [3, 4]. > Might be a good thing for a v2 of this feature to have.) 
Yes, constrained delegation is a pretty neat extension to Kerberos and one I’d like to look at later as a future enhancement, but I don’t think it needs to be in the initial version.

> Similarly, it feels a little strange that the server would allow the
> client to unilaterally force the use of a delegated credential. I think
> that should be opt-in on the server side too, unless there's some
> context I'm missing around why that's safe.

Perhaps you could explain what isn’t safe about accepting a delegated credential from a client..? I am not aware of a risk in accepting such a delegated credential. Even so, I’m not against adding an option… but exactly how would that option be configured? Server level? On the HBA line? Role level..?

> + /* Make the proxy credential only av
Re: wal_compression=zstd
On Fri, Mar 11, 2022 at 03:49:00PM -0600, Justin Pryzby wrote: > While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC. > > Also, there's a dangling "and". Right. I'll address that a bit later today. Thanks! -- Michael signature.asc Description: PGP signature
Re: role self-revocation
> On Mar 11, 2022, at 2:46 PM, Stephen Frost wrote: > > I do think that’s reasonable … and believe I suggested it about 3 messages > ago in this thread. ;) (step #3 I think it was? Or maybe 4). Yes, and you mentioned it to me off-list. I'm soliciting a more concrete specification for what you are proposing. To me, that means understanding how the SQL spec behavior that you champion translates into specific changes. You specified some of this in steps #1 through #5, but I'd like a clearer indication of how many of those (#1 alone, both #1 and #2, or what?) constitute a competing idea to the idea of role ownership, and greater detail about how each of those steps translate into specific behavior changes in postgres. Your initial five-step email seems to be claiming that #1 by itself is competitive, but to me it seems at least #1 and #2 would be required. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Kerberos delegation support in libpq and postgres_fdw
On Mon, 2022-02-28 at 20:28 -0500, Stephen Frost wrote: > Will add to the CF for consideration. GSSAPI newbie here, so caveat lector. > diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c > index efc53f3135..6f820a34f1 100644 > --- a/src/backend/libpq/auth.c > +++ b/src/backend/libpq/auth.c > @@ -920,6 +920,7 @@ pg_GSS_recvauth(Port *port) > int mtype; > StringInfoData buf; > gss_buffer_desc gbuf; > + gss_cred_id_t proxy; > > /* >* Use the configured keytab, if there is one. Unfortunately, Heimdal > @@ -949,6 +950,9 @@ pg_GSS_recvauth(Port *port) >*/ > port->gss->ctx = GSS_C_NO_CONTEXT; > > + proxy = NULL; > + port->gss->proxy_creds = false; > + > /* >* Loop through GSSAPI message exchange. This exchange can consist of >* multiple messages sent in both directions. First message is always > from > @@ -999,7 +1003,7 @@ pg_GSS_recvauth(Port *port) > > &port->gss->outbuf, > > &gflags, > > NULL, > - > NULL); > + > &proxy); > > /* gbuf no longer used */ > pfree(buf.data); > @@ -1011,6 +1015,12 @@ pg_GSS_recvauth(Port *port) > > CHECK_FOR_INTERRUPTS(); > > + if (proxy != NULL) > + { > + pg_store_proxy_credential(proxy); > + port->gss->proxy_creds = true; > + } > + Some implementation docs [1] imply that a delegated_cred_handle is only valid if the ret_flags include GSS_C_DELEG_FLAG. The C-binding RFC [2], though, says that we can rely on it being set to GSS_C_NO_CREDENTIAL if no handle was sent... I don't know if there are any implementation differences here, but in any case I think it'd be more clear to use the GSS_C_NO_CREDENTIAL spelling (instead of NULL) here, if we do decide not to check ret_flags. [5] says we have to free the proxy credential with GSS_Release_cred(); I don't see that happening anywhere, but I may have missed it. 
> maj_stat = gss_init_sec_context(&min_stat, > - > GSS_C_NO_CREDENTIAL, > + proxy, > > &conn->gctx, > > conn->gtarg_nam, > > GSS_C_NO_OID, > - > GSS_C_MUTUAL_FLAG, > + > GSS_C_MUTUAL_FLAG | GSS_C_DELEG_FLAG, > 0, > > GSS_C_NO_CHANNEL_BINDINGS, > > (ginbuf.value == NULL) ? GSS_C_NO_BUFFER : &ginbuf, > diff --git a/src/interfaces/libpq/fe-secure-gssapi.c > b/src/interfaces/libpq/fe-secure-gssapi.c > index 6ea52ed866..566c89f52f 100644 > --- a/src/interfaces/libpq/fe-secure-gssapi.c > +++ b/src/interfaces/libpq/fe-secure-gssapi.c > @@ -631,7 +631,7 @@ pqsecure_open_gss(PGconn *conn) >*/ > major = gss_init_sec_context(&minor, conn->gcred, &conn->gctx, > > conn->gtarg_nam, GSS_C_NO_OID, > - > GSS_REQUIRED_FLAGS, 0, 0, &input, NULL, > + > GSS_REQUIRED_FLAGS | GSS_C_DELEG_FLAG, 0, 0, &input, NULL, It seems like there should be significant security implications to allowing delegation across the board. Especially since one postgres_fdw might delegate to another server, and then another... Should this be opt-in, maybe via a connection parameter? (It also looks like there are some mechanisms for further constraining delegation scope, either by administrator policy or otherwise [3, 4]. Might be a good thing for a v2 of this feature to have.) Similarly, it feels a little strange that the server would allow the client to unilaterally force the use of a delegated credential. I think that should be opt-in on the server side too, unless there's some context I'm missing around why that's safe. > + /* Make the proxy credential only available to current process */ > + major = gss_store_cred_into(&minor, > + cred, > + GSS_C_INITIATE, /* cr
Re: role self-revocation
Greetings, On Fri, Mar 11, 2022 at 12:32 Mark Dilger wrote: > > > > On Mar 11, 2022, at 8:48 AM, Stephen Frost wrote: > > > > I agree that it would have an impact on backwards compatibility to > > change how WITH ADMIN works- but it would also get us more in line with > > what the SQL standard says for how WITH ADMIN is supposed to work and > > that seems worth the change to me. > > I'm fine with giving up some backwards compatibility to get some SQL > standard compatibility, as long as we're clear that is what we're doing. > What you say about the SQL spec isn't great, though, because too much power > is vested in "ADMIN". I see "ADMIN" as at least three separate privileges > together. Maybe it would be spec compliant to implement "ADMIN" as a > synonym for a set of separate privileges? I do think that’s reasonable … and believe I suggested it about 3 messages ago in this thread. ;) (step #3 I think it was? Or maybe 4). > On Mar 11, 2022, at 8:41 AM, Stephen Frost wrote: > > > > That we aren't discussing the issues with the current GRANT ... WITH > > ADMIN OPTION and how we deviate from what the spec calls for when it > > comes to DROP ROLE, which seems to be the largest thing that's > > 'solved' with this ownership concept, is concerning to me. > > Sure, let's discuss that a bit more. Here is my best interpretation of > your post about the spec, when applied to postgres with an eye towards not > doing any more damage than necessary: > > > On Mar 10, 2022, at 11:58 AM, Stephen Frost wrote: > > > > let's look at what the spec says: > > > > CREATE ROLE > > - Who is allowed to run CREATE ROLE is implementation-defined > > This should be anyone with membership in pg_create_role. That could be the case if we wished to go that route. I’d think in such case we’d then also remove CREATEROLE as otherwise the documentation feels like it’d be quite confusing. 
> - After creation, this is effictively run: > >GRANT new_role TO creator_role WITH ADMIN, GRANTOR "_SYSTEM" > > This should internally be implemented as three separate privileges, one > which means you can grant the role, another which means you can drop the > role, and a third that means you're a member of the role. That way, they > can be independently granted and revoked. We could make "WITH ADMIN" a > short-hand for "WITH G, D, M" where G, D, and M are whatever we name the > independent privileges Grant, Drop, and Member-of. I mean, sure, we can get there, and possibly add more like if you’re allowed to change or reset that role’s password and other things, but I don’t see that this piece is required as part of the very first change in this area. Further, WITH ADMIN already gives grant and member today, so you’re saying the only thing this change does that makes “WITH ADMIN” too powerful is adding DROP to it, yet that’s explicitly what the spec calls for. In short, I disagree that moving the DROP ROLE right from CREATEROLE roles having that across the entire system (excluding superusers) to WITH ADMIN where the role who has that right can: A) already become that role and drop all their objects B) already GRANT that role to some other role is a big issue. Splitting G and D helps with backwards compatibility, because it gives > people who want the traditional postgres "admin" a way to get there, by > granting "G+M". Splitting M from G and D makes it simpler to implement the > "bot" idea, since the bot shouldn't have M. But it does raise a question > about always granting G+D+M to the creator, since the bot is the creator > and we don't want the bot to have M. This isn't a problem I've invented > from thin air, mind you, as G+D+M is just the definition of ADMIN per the > SQL spec, if I've understood you correctly. 
So we need to think a bit more > about the pg_create_role built-in role and whether that needs to be further > refined to distinguish those who can get membership in roles they create > vs. those who cannot. This line of reasoning takes me in the direction of > what I think you were calling #5 upthread, but you'd have to elaborate on > that, and how it interacts with the spec, for us to have a useful > conversation about it. All that said, as I said before, I’m in favor of splitting things up and so if you want to do that as part of this initial work, sure. Idk that it’s absolutely required as part of this but I’m not going to complain if it’s included either. I agree that would allow folks to get something similar to what they could get today if they want. I agree that the split up helps with the “bot” idea, as we could at least then create a security definer function that the bot runs and which creates roles that the bot then has G for but not M or D. Even better would be to also provide a way for the “bot” to be able to create roles without the need for a security definer function where it doesn’t automatically get all three, and that was indeed what I was thinking about with the template idea. The genera
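The split discussed above — ADMIN as a synonym for three independently grantable privileges, Grant (G), Drop (D), and Member-of (M) — can be modeled as independent bits. This is purely a toy illustration of the proposal's semantics; it is not how PostgreSQL represents role grants, and all names here are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: each privilege is an independent bit; ADMIN is their union. */
enum role_priv
{
    PRIV_GRANT  = 1 << 0,   /* may GRANT the role to others */
    PRIV_DROP   = 1 << 1,   /* may DROP the role */
    PRIV_MEMBER = 1 << 2,   /* may exercise membership in the role */
    PRIV_ADMIN  = PRIV_GRANT | PRIV_DROP | PRIV_MEMBER
};

/* True iff every bit in 'wanted' is present in 'granted'. */
static bool
has_priv(unsigned granted, unsigned wanted)
{
    return (granted & wanted) == wanted;
}
```

Under this model, the "bot" case from the discussion holds G alone (it can grant the roles it creates but is not a member of them), the traditional postgres-style admin maps to G+M, and spec-style WITH ADMIN is shorthand for G+D+M.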
Re: generic plans and "initial" pruning
Hi, w.r.t. v5-0003-Teach-AcquireExecutorLocks-to-skip-locking-pruned.patch : (pruning steps containing expressions that can be computed before before the executor proper has started) the word 'before' was repeated. For ExecInitParallelPlan(): + char *execlockrelsinfo_data; + char *execlockrelsinfo_space; the content of execlockrelsinfo_data is copied into execlockrelsinfo_space. I wonder if having one of execlockrelsinfo_data and execlockrelsinfo_space suffices. Cheers
Re: wal_compression=zstd
While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC.

Also, there's a dangling "and".

+The supported methods are pglz,
+lz4 (if PostgreSQL
+was compiled with --with-lz4) and
+zstd (if PostgreSQL
+was compiled with --with-zstd) and
+The default value is off.
+Only superusers can change this setting.

--
Justin
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas wrote: > Have you been able to create a test case for that? The largest record I > can think of is a commit record with a huge number of subtransactions, > dropped relations, and shared inval messages. I'm not sure if you can > overflow a uint32 with that, but exceeding MaxAllocSize seems possible. I believe that wal_level=logical can generate very large update and delete records, especially with REPLICA IDENTITY FULL. -- Robert Haas EDB: http://www.enterprisedb.com
Re: On login trigger: take three
On Tue, Feb 15, 2022 at 5:07 AM Greg Nancarrow wrote: > I've attached a re-based version (no functional changes from the > previous) to fix cfbot failures. I tried this: rhaas=# create function on_login_proc() returns event_trigger as $$begin perform pg_sleep(1000); end$$ language plpgsql; CREATE FUNCTION rhaas=# create event trigger on_login_trigger on login execute procedure on_login_proc(); When I then attempt to connect via psql, it hangs, as expected. When I press ^C, psql exits, but the backend process is not cancelled and just keeps chugging along in the background. The good news is that if I connect to another database, I can cancel all of the hung sessions using pg_cancel_backend(), and all of those processes then promptly exit, and presumably I could accomplish the same thing by sending them SIGINT directly. But it's still not great behavior. It would be easy to use up a pile of connection slots inadvertently and have to go to some trouble to get access to the server again. Since this is a psql behavior and not a server behavior, one could argue that it's unrelated to this patch, but in practice this patch seems to increase the chances of people running into problems quite a lot. -- Robert Haas EDB: http://www.enterprisedb.com
Issue with pg_stat_subscription_stats
So, I noticed that pg_stat_reset_subscription_stats() wasn't working properly, and, upon further investigation, I'm not sure the view pg_stat_subscription_stats is being properly populated.

I don't think subscriptionStatHash will be created properly, and the reset timestamp won't be initialized, without the following code:

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 53ddd930e6..0b8c5436e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3092,7 +3092,7 @@ pgstat_fetch_stat_subscription(Oid subid)
 	/* Load the stats file if needed */
 	backend_read_statsfile();

-	return pgstat_get_subscription_entry(subid, false);
+	return pgstat_get_subscription_entry(subid, true);
 }

 /*
@@ -6252,7 +6252,7 @@ pgstat_get_subscription_entry(Oid subid, bool create)

 	/* If not found, initialize the new one */
 	if (!found)
-		pgstat_reset_subscription(subentry, 0);
+		pgstat_reset_subscription(subentry, GetCurrentTimestamp());

 	return subentry;
 }

- melanie
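The shape of the proposed fix — create the entry on fetch, and stamp its reset time at creation instead of leaving it zero — can be sketched in miniature. Names and the fixed-size table below are illustrative assumptions; the real code uses a dynahash (subscriptionStatHash) keyed by subscription OID and TimestampTz values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Toy model of a subscription-stats entry; oid 0 marks an unused slot. */
typedef struct SubEntry
{
    unsigned oid;
    long     apply_error_count;
    time_t   stats_reset;
} SubEntry;

#define N_ENTRIES 8
static SubEntry entries[N_ENTRIES];     /* zero-initialized */

/*
 * Look up the entry for 'oid'; if absent and 'create' is true, make it and
 * initialize stats_reset to 'now' (the point of the fix) rather than 0.
 */
static SubEntry *
get_subscription_entry(unsigned oid, bool create, time_t now)
{
    for (int i = 0; i < N_ENTRIES; i++)
        if (entries[i].oid == oid)
            return &entries[i];

    if (!create)
        return NULL;

    for (int i = 0; i < N_ENTRIES; i++)
        if (entries[i].oid == 0)
        {
            entries[i].oid = oid;
            entries[i].apply_error_count = 0;
            entries[i].stats_reset = now;   /* stamp creation time */
            return &entries[i];
        }
    return NULL;                            /* table full */
}
```

This mirrors both hunks of the diff: the fetch path passes create = true so the entry (and hash) exists when the view is read, and newly created entries get a real timestamp instead of 0.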
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On 11/03/2022 17:42, Matthias van de Meent wrote:
> Xlogreader limits the size of what it considers valid xlog records to
> MaxAllocSize; but this is not currently enforced in the XLogRecAssemble
> API. This means it is possible to assemble a record that postgresql
> cannot replay.

Oops, that would be nasty.

> Similarly; it is possible to repeatedly call XlogRegisterData() so as
> to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
> writes while processing record data;

And that too. Have you been able to create a test case for that? The largest record I can think of is a commit record with a huge number of subtransactions, dropped relations, and shared inval messages. I'm not sure if you can overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

> PFA a patch that attempts to fix both of these issues in the insertion
> API; by checking against overflows and other incorrectly large values
> in the relevant functions in xloginsert.c.
>
> In this patch, I've also added a comment to the XLogRecord spec to
> document that xl_tot_len should not be larger than 1GB - 1B; and why
> that limit exists.

> diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
> index c260310c4c..ae654177de 100644
> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
>  	if (num_rdatas >= max_rdatas)
>  		elog(ERROR, "too much WAL data");
> +
> +	/* protect against overflow */
> +	if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
> +		elog(ERROR, "too much WAL data");
> +
>  	rdata = &rdatas[num_rdatas++];
>  	rdata->data = data;

Could check for just AllocSizeIsValid(mainrdata_len), if we're only worried about the total size of the data to exceed the limit, and assume that each individual piece of data is smaller.

We also don't check for negative 'len'. I think that's fine, the caller bears some responsibility for passing valid arguments too. But maybe uint32 or size_t would be more appropriate here.

I wonder if these checks hurt performance. These are very cheap, but then again, this codepath is very hot. It's probably fine, but it still worries me a little. Maybe some of these could be Asserts.

> @@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>  	if (num_rdatas >= max_rdatas)
>  		elog(ERROR, "too much WAL data");
> +
> +	/* protect against overflow */
> +	if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
> +		elog(ERROR, "too much WAL data");
> +
>  	rdata = &rdatas[num_rdatas++];
>  	rdata->data = data;

Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble, that's the real limit. And if you check for that here, you don't need to check it in XLogRecordAssemble.

> @@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>  				   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
>  {
>  	XLogRecData *rdt;
> -	uint32		total_len = 0;
> +	uint64		total_len = 0;
>  	int			block_id;
>  	pg_crc32c	rdata_crc;
>  	registered_buffer *prev_regbuf = NULL;

I don't think the change to uint64 is necessary, if all the data blocks are limited to 64 kB and the number of blocks is limited too.

> @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>  		if (needs_data)
>  		{
> +			/* protect against overflow */
> +			if (unlikely(regbuf->rdata_len > UINT16_MAX))
> +				elog(ERROR, "too much WAL data for registered buffer");
> +
>  			/*
>  			 * Link the caller-supplied rdata chain for this buffer to the
>  			 * overall list.

> @@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>  	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
>  		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
> +
> +	/*
> +	 * Ensure that xlogreader.c can read the record; and check that we don't
> +	 * accidentally overflow the size of the record.
> +	 */
> +	if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
> +		elog(ERROR, "too much registered data for WAL record");
> +
>  	/*
>  	 * Fill in the fields in the record header. Prev-link is filled in later,
>  	 * once we know where in the WAL the record will be inserted. The CRC does

It's enough to check AllocSizeIsValid(total_len), no need to also check against UINT32_MAX.

- Heikki
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi wrote:
> It seems to me too rigorous that pg_get_wal_records_info/stats()
> reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> the real end-of-WAL is more kind to users. I think the same with the
> restriction that start and end LSN are required to be different.

In his review just yesterday, Jeff suggested this: "Don't issue WARNINGs or other messages for ordinary situations, like when pg_get_wal_records_info() hits the end of WAL."

I think he's entirely right, and I don't think any patch that does otherwise should get committed. It is worth remembering that the results of queries are often examined by something other than a human being sitting at a psql terminal. Any tool that uses this is going to want to understand what happened from the result set, not by parsing strings that may show up inside warning messages.

I think that the right answer here is to have a function that returns one row per record parsed, and each row should also include the start and end LSN of the record. If for some reason the returned WAL records start after the specified start LSN (e.g. because we skip over a page header) or end before the specified end LSN (e.g. because we reach end-of-WAL), the user can figure it out by looking at the LSNs in the output rows and comparing them to the LSNs provided as input.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files. 4) Support strategy in createdb
> tool and add test case as part of 0006.

I think there's something wrong with what this patch is doing with the XLOG records. It adds XLOG_DBASE_CREATEDIR, but the only new XLogInsert() calls in the patch are passing XLOG_DBASE_CREATE, and no existing references are adjusted. Similarly with xl_dbase_create_rec and xl_dbase_createdir_rec. Why would we introduce a new record type and not use it?

Let's not call the functions for the different strategies CopyDatabase() and CopyDatabaseWithWal() but rather something that matches up to the strategy names e.g. create_database_using_wal_log() and create_database_using_file_copy(). There's something a little funny about the names wal_log and file_copy ... they're not quite parallel grammatically. But it's probably OK.

The changes to createdb_failure_params make me a little nervous. I think we'd be in real trouble if we failed before completing both DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it looks to me like those are both intended to be no-fail operations, so I don't see an actual hazard here. But, hmm, what about on the recovery side? Suppose that we start copying the database block by block and then either (a) the standby is promoted before the copy is finished or (b) the copy fails. Now the standby has data in shared_buffers for a database that does not exist. If that's not bad, then why does createdb_failure_params need to DropDatabaseBuffers()? But I bet it is bad.
I wonder if we should be using RelationCopyStorage() rather than this new function RelationCopyStorageUsingBuffer(). That would avoid having the buffers in shared_buffers, dodging the problem. But then we have a problem with checkpoint interlocking: we could begin replay from a checkpoint and find that the pages that were supposed to get copied prior to the checkpoint were actually not copied, because the checkpoint record could be written after we've logged a page being copied and before we actually write the page. Or, we could crash after writing a whole lot of pages and a checkpoint record, but before RelationCopyStorage() fsyncs the destination fork. It doesn't seem advisable to hold off checkpoints for the time it takes to copy an entire relation fork, so the solution is apparently to keep the data in shared buffers after all. But that brings us right back to square one. Have you thought through this whole problem carefully? It seems like a total mess to me at the moment, but maybe I'm missing something. There seems to be no reason to specify specific values for the members of enum CreateDBStrategy. I think the naming of some of the new functions might need work, in particular GetRelInfoFromTuple, GetRelListFromPage, and GetDatabaseRelationList. The names seem kind of generic for what they're doing. Maybe ScanSourceDatabasePgClass, ScanSourceDatabasePgClassPage, ScanSourceDatabasePgClassTuple? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Proposal for internal Numeric to Uint64 conversion function.
Amul Sul writes: > On Wed, Feb 16, 2022 at 4:50 PM Peter Eisentraut > wrote: >> On 16.02.22 06:00, Amul Sul wrote: >>> Currently, numeric_pg_lsn is the only one that accepts the Numeric >>> value and converts to uint64 and that is the reason all the type >>> conversion code is embedded into it. >> There are other functions such as numeric_int8() that work similarly. >> If you are going to refactor, then they should all be treated similarly. >> I'm not sure if it's going to end up being beneficial. > Yeah, that's true, I am too not sure if we really need to refactor > all those; If we want, I can give it a try. There are several places that call numeric_int8, and right now they have to go through DirectFunctionCall1, which is ugly and inefficient. I think a refactoring that exposes some more-convenient API for that could be useful. The patch as-presented isn't very compelling for lack of callers of the new function; but if it were handling the int64 and uint64 cases alike, and maybe the float8 case too, that would seem more convincing. We already expose APIs like int64_to_numeric, so the lack of similar APIs for the other direction seems odd. It also feels to me that numeric_pg_lsn(), per se, doesn't belong in numeric.c. A pretty direct comparison is numeric_cash(), which is not in numeric.c but cash.c. regards, tom lane
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
+	CheckpointStats.repl_map_cutoff_lsn = cutoff;

Could we set repl_map_cutoff_lsn closer to where it is calculated? Right now, it's set at the bottom of the function just before the directory is freed. Is there a strong reason to do it there?

+			"logical rewrite mapping file(s) removed/synced=" UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",

Can we split out the "removed" versus "synced" files? Otherwise, I think this is basically just telling us how many files are in the mappings directory.

+	CheckpointStats.repl_snap_cutoff_lsn = cutoff;

I have the same note for this one as I have for repl_map_cutoff_lsn.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Fri, Mar 11, 2022 at 02:00:37PM -0500, Chapman Flack wrote: > v7 looks good to me. I have repeated the installcheck-world and given > the changed documentation sections one last read-through, and will > change the status to RfC. Thanks for reviewing! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 03/10/22 19:38, Nathan Bossart wrote:
> On Thu, Mar 10, 2022 at 07:13:14PM -0500, Chapman Flack wrote:
>> +postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn);
>
> Ah, good catch. I made this change in v7. I considered doing something

v7 looks good to me. I have repeated the installcheck-world and given the changed documentation sections one last read-through, and will change the status to RfC.

Regards,
-Chap
Re: refactoring basebackup.c
On Tue, Feb 15, 2022 at 11:26 AM Robert Haas wrote: > On Wed, Feb 9, 2022 at 8:41 AM Abhijit Menon-Sen wrote: > > It took me a while to assimilate these patches, including the backup > > targets one, which I hadn't looked at before. Now that I've wrapped my > > head around how to put the pieces together, I really like the idea. As > > you say, writing non-trivial integrations in C will take some effort, > > but it seems worthwhile. It's also nice that one can continue to use > > pg_basebackup to trigger the backups and see progress information. > > Cool. Thanks for having a look. > > > Yes, it looks simple to follow the example set by basebackup_to_shell to > > write a custom target. The complexity will be in whatever we need to do > > to store/forward the backup data, rather than in obtaining the data in > > the first place, which is exactly as it should be. > > Yeah, that's what made me really happy with how this came out. > > Here's v2, rebased and with documentation added. I don't hear many comments on this, but I'm pretty sure that it's a good idea, and there haven't been many objections to this patch series as a whole, so I'd like to proceed with it. If nobody objects vigorously, I'll commit this next week. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 1:10 PM Robert Haas wrote:
> I don't think you've adequately considered temporary relations here.
> It seems to me that ReadBufferWithoutRelcache() could not be safe on a
> temprel, because we'd need a BackendId to access the underlying
> storage. So I think that ReadBufferWithoutRelcache can only accept
> unlogged or permanent, and maybe the argument ought to be a Boolean
> instead of a relpersistence value. I thought that this problem might
> be only cosmetic, but I checked the code that actually does the copy,
> and there's no filter there on relpersistence either. And I think
> there should be.

I hit "send" too quickly there:

rhaas=# create database fudge;
CREATE DATABASE
rhaas=# \c fudge
You are now connected to database "fudge" as user "rhaas".
fudge=# create temp table q ();
CREATE TABLE
fudge=# ^Z
[2]+  Stopped                 psql
[rhaas Downloads]$ pg_ctl stop -mi
waiting for server to shut down.... done
server stopped
[rhaas Downloads]$ %%
psql
\c
You are now connected to database "fudge" as user "rhaas".
fudge=# select * from pg_class where relpersistence='t';
  oid  | relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasrules | relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated | relreplident | relispartition | relrewrite | relfrozenxid | relminmxid | relacl | reloptions | relpartbound
-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
 16388 | q       |        16386 |   16390 |         0 |       10 |     2 |       16388 |             0 |        0 |        -1 |             0 |             0 | f           | f           | t              | r       |        0 |         0 | f           | f              | f              | f              | f                   | t              | d            | f              |          0 |          721 |          1 |        |            | 
(1 row)

fudge=# \c rhaas
You are now connected to database "rhaas" as user "rhaas".
rhaas=# alter database fudge is_template true; ALTER DATABASE rhaas=# create database cookies template fudge; CREATE DATABASE rhaas=# \c cookies You are now connected to database "cookies" as user "rhaas". cookies=# select count(*) from pg_class where relpersistence='t'; count --- 1 (1 row) You have to be quick, because autovacuum will drop the orphaned temp table when it notices it, but it is possible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files. 4) Support strategy in createdb
> tool and add test case as part of 0006.

I don't think you've adequately considered temporary relations here. It seems to me that ReadBufferWithoutRelcache() could not be safe on a temprel, because we'd need a BackendId to access the underlying storage. So I think that ReadBufferWithoutRelcache can only accept unlogged or permanent, and maybe the argument ought to be a Boolean instead of a relpersistence value. I thought that this problem might be only cosmetic, but I checked the code that actually does the copy, and there's no filter there on relpersistence either. And I think there should be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: refactoring basebackup.c
On Fri, Mar 11, 2022 at 11:29 AM Justin Pryzby wrote: > Sounds right. OK, committed. > Also, I think the magic 8 for .gz should actually be a 7. > > I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to > make > them all less magic. > > commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e > Author: Justin Pryzby > Date: Thu Mar 10 21:22:16 2022 -0600 Yeah, your patch looks right. Committed that, too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
> On Mar 11, 2022, at 8:48 AM, Stephen Frost wrote: > > I agree that it would have an impact on backwards compatibility to > change how WITH ADMIN works- but it would also get us more in line with > what the SQL standard says for how WITH ADMIN is supposed to work and > that seems worth the change to me. I'm fine with giving up some backwards compatibility to get some SQL standard compatibility, as long as we're clear that is what we're doing. What you say about the SQL spec isn't great, though, because too much power is vested in "ADMIN". I see "ADMIN" as at least three separate privileges together. Maybe it would be spec compliant to implement "ADMIN" as a synonym for a set of separate privileges? > On Mar 11, 2022, at 8:41 AM, Stephen Frost wrote: > > That we aren't discussing the issues with the current GRANT ... WITH > ADMIN OPTION and how we deviate from what the spec calls for when it > comes to DROP ROLE, which seems to be the largest thing that's > 'solved' with this ownership concept, is concerning to me. Sure, let's discuss that a bit more. Here is my best interpretation of your post about the spec, when applied to postgres with an eye towards not doing any more damage than necessary: > On Mar 10, 2022, at 11:58 AM, Stephen Frost wrote: > > let's look at what the spec says: > > CREATE ROLE > - Who is allowed to run CREATE ROLE is implementation-defined This should be anyone with membership in pg_create_role. > - After creation, this is effictively run: >GRANT new_role TO creator_role WITH ADMIN, GRANTOR "_SYSTEM" This should internally be implemented as three separate privileges, one which means you can grant the role, another which means you can drop the role, and a third that means you're a member of the role. That way, they can be independently granted and revoked. We could make "WITH ADMIN" a short-hand for "WITH G, D, M" where G, D, and M are whatever we name the independent privileges Grant, Drop, and Member-of. 
Splitting G and D helps with backwards compatibility, because it gives people who want the traditional postgres "admin" a way to get there, by granting "G+M". Splitting M from G and D makes it simpler to implement the "bot" idea, since the bot shouldn't have M. But it does raise a question about always granting G+D+M to the creator, since the bot is the creator and we don't want the bot to have M. This isn't a problem I've invented from thin air, mind you, as G+D+M is just the definition of ADMIN per the SQL spec, if I've understood you correctly. So we need to think a bit more about the pg_create_role built-in role and whether that needs to be further refined to distinguish those who can get membership in roles they create vs. those who cannot. This line of reasoning takes me in the direction of what I think you were calling #5 upthread, but you'd have to elaborate on that, and how it interacts with the spec, for us to have a useful conversation about it.

> DROP ROLE
> - Any user who has been GRANT'd a role with ADMIN option is able to
>   DROP that role.

Change this to "Any role who has D on the role". That's spec compliant, because anyone granted ADMIN necessarily has D.

> GRANT ROLE
> - No cycles allowed
> - A role must have ADMIN rights on the role to be able to GRANT it to
>   another role.

Change this to "Any role who has G on the role". That's spec compliant, because anyone granted ADMIN necessarily has G.

We should also fix the CREATE ROLE command to require that the grantor have G on a role in order to give it to the new role as part of the command. Changing the CREATEROLE, CREATEDB, REPLICATION, and BYPASSRLS attributes into pg_create_role, pg_create_db, pg_replication, and pg_bypassrls, the creator could only give them to the created role if the creator has G on the roles.
If we do this, we could keep the historical privilege bits and their syntax support for backward compatibility, or we could rip them out, but the decision between those two options is independent of the rest of the design. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Collecting statistics about contents of JSONB columns
On Fri, 4 Feb 2022 at 08:30, Tomas Vondra wrote:
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 3 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 3 2 1000 8 6 1000

Thanks, Tomas, for this test case.

Hi Hackers,

For the last few days, I have been testing on top of these JSON optimizer patches, with help from Tomas Vondra in understanding the patches and the test results. Thanks, Tomas, for your feedback and suggestions. Below is the summary:

*Point 1)* ANALYZE takes a very long time for large documents:

For large JSON documents, ANALYZE took a very long time as compared to the current head. For reference, I am attaching a test file (./json-generate.py 3 2 1000 8 6 1000).

Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours

ANALYZE takes a very long time because with these patches, we first iterate over all sample rows (in the above case 3), and we store all the paths (here around 850k paths). In another pass, we take one path at a time and collect stats for that particular path by analyzing all the sample rows, and we continue this process for all 850k paths. In other words, we do 850k loops, and in each loop we extract values for a single path.

*Point 2)* memory consumption increases rapidly for large documents:

In the above test case, there are 851k paths in total, and to keep stats for one path, we allocate 1120 bytes.

Total paths: 852689 ~ 852k
Memory for 1 path to keep stats: 1120 ~ 1 KB (sizeof(JsonValueStats) = 1120, from "Analyze Column")
Total memory for all paths: 852689 * 1120 = 955011680 ~ 955 MB

The extra memory for each path will be more than that; while analyzing each path, we allocate some more memory based on frequency and other factors. To keep all entries (851k paths) in the hash, we use around 1 GB of memory for the hash, so this is also very large.
*Point 3)* Review comment noticed by Tomas Vondra:

+	oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+	pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+	MemoryContextSwitchTo(oldcxt);

Above should be:

+	oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+	pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+	MemoryContextSwitchTo(oldcxt);

*Response from Tomas Vondra:*
The problem is "anl_context" is actually "Analyze", i.e. the context for the whole ANALYZE command, for all the columns. But we only want to keep those path stats while processing a particular column. At the end, after processing all paths from a column, we need to "build" the final stats in the column, and this result needs to go into the "Analyze" context. But all the partial results need to go into the "Analyze Column" context.

*Point 4)*

+/*
+ * jsonAnalyzeCollectPath
+ *		Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+	JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+	JsonbValue	jbvtmp;
+	JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+	JsonPathEntry *path;
+	JsonPathEntry **entries;
+	int			i;
+
+	entries = palloc(sizeof(*entries) * pstats->depth);
+
+	/* Build entry array in direct order */
+	for (path = &pstats->path, i = pstats->depth - 1;
+		 path->parent && i >= 0;
+		 path = path->parent, i--)
+		entries[i] = path;
+
+	jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+	pfree(entries);

Many times we try to palloc with zero size here, and entries points to invalid memory (because pstats->depth = 0), so I think we should not palloc with 0.

*Fix:*

+	if (pstats->depth)
+		entries = palloc(sizeof(*entries) * pstats->depth);

From these points, we can say that we should rethink our design to collect stats for all paths.
We can set limits (like MCV) for paths, or we can give an explicit path to collect stats for a particular path only, or we can pass a subset of the JSON values. In the above case, there are 851k paths in total, but we could collect stats for only the 1000 most common paths; this way we can minimize both time and memory, and we might even keep at least frequencies for the non-analyzed paths.

Next, I will take the latest patches from Nikita's last email and I will do more tests.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: role self-revocation
On Fri, Mar 11, 2022 at 11:34 AM Tom Lane wrote: > Note that either case would also involve making entries in pg_shdepend; > although for the case of roles owned by/granted to the bootstrap > superuser, we could omit those on the usual grounds that we don't need > to record dependencies on pinned objects. That makes sense to me, but it still doesn't solve the problem of agreeing on role ownership vs. WITH ADMIN OPTION vs. something else. I find it ironic (and frustrating) that Mark implemented what I think is basically what you're arguing for, it got stuck because Stephen didn't like it, we then said OK so let's try to find out what Stephen would like, only to have you show up and say that it's right the way he already had it. I'm not saying that you're wrong, or for that matter that he's wrong. I'm just saying that if both of you are absolutely bent on having it the way you want it, either one of you is going to be sad, or we're not going to make any progress. Never mind the fact that neither of you seem interested in even giving a hearing to my preferred way of doing it. :-( -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Fri, Mar 11, 2022 at 11:12 AM Mark Dilger > wrote: > > This issue of how much backwards compatibility breakage we're willing to > > tolerate is just as important as questions about how we would want roles to > > work in a green-field development project. The sense I got a year ago, on > > this list, was that changing CREATEROLE was acceptable, but changing other > > parts of the system, such as how ADMIN OPTION works, would go too far. That we deviate as far as we do when it comes to the SQL spec is something that I don't feel like I had a good handle on when discussing this previously (that the spec doesn't talk about 'admin option' really but rather 'grantable authorization identifiers' or whatever it is doesn't help... but still, that's on me, sorry about that). > > Role ownership did not yet exist, and that was a big motivation in > > introducing that concept, because you couldn't credibly say it broke other > > existing features. It introduces the new notion that when a superuser > > creates a role, the superuser owns it, which is identical to how things > > implicitly work today; and when a CREATEROLE non-superuser creates a role, > > that role owns the new role, which is different from how it works today, > > arguably breaking CREATEROLE's prior behavior. *But it doesn't break > > anything else*. > > > > If we're going to change how ADMIN OPTION works, or how role membership > > works, or how inherit/noinherit works, let's first be clear that we are > > willing to accept whatever backwards incompatibility that entails. This is > > not a green-field development project. The constant spinning around with > > regard to how much compatibility we need to preserve is giving me vertigo. 
I agree that it would have an impact on backwards compatibility to change how WITH ADMIN works- but it would also get us more in line with what the SQL standard says for how WITH ADMIN is supposed to work, and that seems worth the change to me.

> On the other hand, changing ADMIN OPTION does have compatibility and
> spec-compliance ramifications. I think Stephen is arguing that we can
> solve this problem while coming closer to the spec, and I think we
> usually consider getting closer to the spec to be a sufficient reason
> for breaking backward compatibility (cf. standard_conforming_strings).

Indeed.

> But I don't know whether he is correct when he argues that the spec
> makes admin option on a role sufficient to drop the role. I've never
> had any luck understanding what the SQL specification is saying about
> any topic.

I'm happy to point you to what the spec says and to discuss it further if that would be helpful, or to get other folks to comment on it. I agree that it's definitely hard to grok at times. In this particular case what I'm looking at is, under DROP ROLE / Access Rules, there's only one sentence:

    There shall exist at least one grantable role authorization
    descriptor whose role name is R and whose grantee is an enabled
    authorization identifier.

A bit of decoding: a 'grantable role authorization descriptor' is a GRANT of a role WITH ADMIN OPTION. The role name 'R' is the role specified. The 'grantee' is who that role R was GRANT'd to, and 'enabled authorization identifier' is basically "has_privs_of_role()" (note that in the spec you can have roles that you're a member of but which are *not* currently enabled).

Hopefully that helps.

Thanks,

Stephen

signature.asc
Description: PGP signature
Re: role self-revocation
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Robert Haas writes: > > If we implement the link between the creating role and the created > > role as role ownership, then we are surely just going to add a > > rolowner column to pg_authid, and when the role is owned by nobody, I > > think we should always just store a valid OID in it, rather than > > sometimes storing 0. It just seems simpler. Any time we would store 0, > > store the bootstrap superuser's pg_authid.oid value instead. That way > > the OID is always valid, which probably lets us get by with fewer > > special cases in the code. We haven't got any particularly special cases in the code today for what happens if we run up the role hierarchy to a point that it ends and so I'm not sure why adding in a whole new concept around role ownership, which doesn't exist in the spec, would somehow leave us with fewer such special cases. > +1. > > Note that either case would also involve making entries in pg_shdepend; > although for the case of roles owned by/granted to the bootstrap > superuser, we could omit those on the usual grounds that we don't need > to record dependencies on pinned objects. That we aren't discussing the issues with the current GRANT ... WITH ADMIN OPTION and how we deviate from what the spec calls for when it comes to DROP ROLE, which seems to be the largest thing that's 'solved' with this ownership concept, is concerning to me. If we go down the route of adding role ownership, are we going to document that we explicitly go against the SQL standard when it comes to how DROP ROLE works? Or are we going to fix DROP ROLE? I'd much prefer the latter, but doing so then largely negates the point of this role ownership concept. I don't see how it makes sense to do both. Thanks, Stephen signature.asc Description: PGP signature
Re: role self-revocation
On Fri, Mar 11, 2022 at 11:12 AM Mark Dilger wrote: > This issue of how much backwards compatibility breakage we're willing to > tolerate is just as important as questions about how we would want roles to > work in a green-field development project. The sense I got a year ago, on > this list, was that changing CREATEROLE was acceptable, but changing other > parts of the system, such as how ADMIN OPTION works, would go too far. > > Role ownership did not yet exist, and that was a big motivation in > introducing that concept, because you couldn't credibly say it broke other > existing features. It introduces the new notion that when a superuser > creates a role, the superuser owns it, which is identical to how things > implicitly work today; and when a CREATEROLE non-superuser creates a role, > that role owns the new role, which is different from how it works today, > arguably breaking CREATEROLE's prior behavior. *But it doesn't break > anything else*. > > If we're going to change how ADMIN OPTION works, or how role membership > works, or how inherit/noinherit works, let's first be clear that we are > willing to accept whatever backwards incompatibility that entails. This is > not a green-field development project. The constant spinning around with > regard to how much compatibility we need to preserve is giving me vertigo. I mean, I agree that the backward compatibility ramifications of every idea need to be considered, but I agree even more that the amount of spinning around here is pretty insane. My feeling is that neither role owners nor tenants introduce any real concerns about backward-compatibility or, for that matter, SQL standards compliance, notwithstanding Stephen's argument to the contrary. Every vendor extends the standard with their own stuff, and we've done that as well, and we can do it in more places. On the other hand, changing ADMIN OPTION does have compatibility and spec-compliance ramifications.
I think Stephen is arguing that we can solve this problem while coming closer to the spec, and I think we usually consider getting closer to the spec to be a sufficient reason for breaking backward compatibility (cf. standard_conforming_strings). But I don't know whether he is correct when he argues that the spec makes admin option on a role sufficient to drop the role. I've never had any luck understanding what the SQL specification is saying about any topic. -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
Robert Haas writes: > If we implement the link between the creating role and the created > role as role ownership, then we are surely just going to add a > rolowner column to pg_authid, and when the role is owned by nobody, I > think we should always just store a valid OID in it, rather than > sometimes storing 0. It just seems simpler. Any time we would store 0, > store the bootstrap superuser's pg_authid.oid value instead. That way > the OID is always valid, which probably lets us get by with fewer > special cases in the code. +1. Note that either case would also involve making entries in pg_shdepend; although for the case of roles owned by/granted to the bootstrap superuser, we could omit those on the usual grounds that we don't need to record dependencies on pinned objects. regards, tom lane
Re: refactoring basebackup.c
On Fri, Mar 11, 2022 at 10:19:29AM -0500, Robert Haas wrote: > So I think we should just refuse this command. Patch for that attached. Sounds right. Also, I think the magic 8 for .gz should actually be a 7. I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to make them all less magic. commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e Author: Justin Pryzby Date: Thu Mar 10 21:22:16 2022 -0600 pg_basebackup: make magic numbers less magic The magic 8 for .gz should actually be a 7. .tar.gz 1234567 .tar.lz4 .tar.zst 12345678 See d45099425, 751b8d23b, 7cf085f07. diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 9f3ecc60fbe..8dd9721323d 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -1223,17 +1223,17 @@ CreateBackupStreamer(char *archive_name, char *spclocation, is_tar = (archive_name_len > 4 && strcmp(archive_name + archive_name_len - 4, ".tar") == 0); - /* Is this a gzip archive? */ - is_tar_gz = (archive_name_len > 8 && -strcmp(archive_name + archive_name_len - 3, ".gz") == 0); + /* Is this a .tar.gz archive? */ + is_tar_gz = (archive_name_len > 7 && +strcmp(archive_name + archive_name_len - 7, ".tar.gz") == 0); - /* Is this a LZ4 archive? */ + /* Is this a .tar.lz4 archive? */ is_tar_lz4 = (archive_name_len > 8 && - strcmp(archive_name + archive_name_len - 4, ".lz4") == 0); + strcmp(archive_name + archive_name_len - 8, ".tar.lz4") == 0); - /* Is this a ZSTD archive? */ + /* Is this a .tar.zst archive? */ is_tar_zstd = (archive_name_len > 8 && - strcmp(archive_name + archive_name_len - 4, ".zst") == 0); + strcmp(archive_name + archive_name_len - 8, ".tar.zst") == 0); /* * We have to parse the archive if (1) we're suppose to extract it, or if
Re: role self-revocation
On Fri, Mar 11, 2022 at 10:46 AM Tom Lane wrote: > The bootstrap superuser clearly must be a special case in some way. > I'm not convinced that that means there should be other special > cases. Maybe there is a use-case for other "unowned" roles, but in > exactly what way would that be different from deeming such roles > to be owned by the bootstrap superuser? I think that just boils down to how many useless catalog entries you want to make. If we implement the link between the creating role and the created role as role ownership, then we are surely just going to add a rolowner column to pg_authid, and when the role is owned by nobody, I think we should always just store a valid OID in it, rather than sometimes storing 0. It just seems simpler. Any time we would store 0, store the bootstrap superuser's pg_authid.oid value instead. That way the OID is always valid, which probably lets us get by with fewer special cases in the code. If we implement the link between the creating role and the created role as an automatically-granted WITH ADMIN OPTION, then we could choose to put (CREATOR_OID, CREATED_OID, whatever, TRUE) into pg_auth_members for the creating superuser or, indeed, every superuser in the system. Or we can leave it out. The result will be exactly the same. Here, I would favor leaving it out, because extra catalog entries that don't do anything are usually a thing that we do not want. See a49d081235997c67e8aab7a523b17e8d1cb93184, for example. -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
> On Mar 11, 2022, at 7:58 AM, Robert Haas wrote: > > This kind of thing is always a judgement call. If we were talking > about breaking 'SELECT * from table', I'm sure it would be hard to > convince anybody to agree to do that at all, let alone with no > deprecation period. Fortunately, CREATEROLE is less used, so breaking > it will inconvenience fewer people. This issue of how much backwards compatibility breakage we're willing to tolerate is just as important as questions about how we would want roles to work in a green-field development project. The sense I got a year ago, on this list, was that changing CREATEROLE was acceptable, but changing other parts of the system, such as how ADMIN OPTION works, would go too far. Role ownership did not yet exist, and that was a big motivation in introducing that concept, because you couldn't credibly say it broke other existing features. It introduces the new notion that when a superuser creates a role, the superuser owns it, which is identical to how things implicitly work today; and when a CREATEROLE non-superuser creates a role, that role owns the new role, which is different from how it works today, arguably breaking CREATEROLE's prior behavior. *But it doesn't break anything else*. If we're going to change how ADMIN OPTION works, or how role membership works, or how inherit/noinherit works, let's first be clear that we are willing to accept whatever backwards incompatibility that entails. This is not a green-field development project. The constant spinning around with regard to how much compatibility we need to preserve is giving me vertigo. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: role self-revocation
On Fri, Mar 11, 2022 at 10:37 AM David G. Johnston wrote: > I largely agree and am perfectly fine with going with the majority on this > point. My vote would just fall on the conservative side. But as so far no > one else seems to be overly concerned, nerfing CREATEROLE seems to be the > path forward. This kind of thing is always a judgement call. If we were talking about breaking 'SELECT * from table', I'm sure it would be hard to convince anybody to agree to do that at all, let alone with no deprecation period. Fortunately, CREATEROLE is less used, so breaking it will inconvenience fewer people. Moreover, unlike 'SELECT * FROM table', CREATEROLE is kinda broken, and it's less scary to make changes to behavior that sucks in the first place than it is to make changes to the behavior of things that are working well. For all of that, there's no hard-and-fast rule that we couldn't keep the existing behavior around, introduce a substitute, and eventually drop the old thing. I'm just not clear that it's really worth it in this case. It'd certainly be interesting to hear from anyone who is finding some utility in the current system. It looks pretty crap to me, but it's easy to bring too much of one's own bias to such judgements. -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
Robert Haas writes: > On Fri, Mar 11, 2022 at 10:27 AM Stephen Frost wrote: >> I agree that there would be a recorded relationship (that is, one that >> we write into the catalog and keep around until and unless it's removed >> by an admin) between creating and created roles and that's probably the >> default when CREATE ROLE is run but, unlike tables or such objects in >> the system, I don't agree that we should require this to exist at >> absolutely all times for every role (what would it be for the bootstrap >> superuser..?). At least today, that's distinct from how ownership in >> the system works. I also don't believe that this is necessarily an >> issue for Robert's use-case, as long as there are appropriate >> restrictions around who is allowed to remove or modify these >> relationships. > I agree. The bootstrap superuser clearly must be a special case in some way. I'm not convinced that that means there should be other special cases. Maybe there is a use-case for other "unowned" roles, but in exactly what way would that be different from deeming such roles to be owned by the bootstrap superuser? regards, tom lane
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi wrote: > > > - The difference between pg_get_wal_record_info and _records_ other than > - the number of argument is the former accepts incorrect LSNs. > > The discussion is somewhat confused after some twists and turns.. It > should be something like the following. > > pg_get_wal_record_info and pg_get_wal_records_info are almost same > since the latter can show a single record. However it is a bit > annoying to do that. Since, other than it doesn't accept same LSNs for > start and end, it doesn't show a record when there's no record in the > specified LSN range. But I don't think there's no usefulness of the > behavior. I would like to reassert the usability of pg_get_wal_record_info and pg_get_wal_records_info: pg_get_wal_record_info(lsn): if lsn is invalid i.e. '0/0' - throws an error if lsn is future lsn - throws an error if lsn looks okay, it figures out the next available valid WAL record and returns info about that pg_get_wal_records_info(start_lsn, end_lsn default null) -> if start and end lsns are provided no end_lsn would give the WAL records info till the end of WAL, if start_lsn is invalid i.e. '0/0' - throws an error if start_lsn is future lsn - throws an error if end_lsn isn't provided by the user - calculates the end_lsn as server's current flush lsn if end_lsn is provided by the user - throws an error if it's future LSN if start_lsn and end_lsn look okay, it returns info about all WAL records from the next available valid WAL record of start_lsn until end_lsn So, both pg_get_wal_record_info and pg_get_wal_records_info are necessary IMHO. Coming to the behaviour when input lsn is '0/100', it's an issue with XLogSegmentOffset(lsn, wal_segment_size) != 0 check, which I will fix in the next version.
if (*first_record != lsn && XLogSegmentOffset(lsn, wal_segment_size) != 0) ereport(WARNING, (errmsg_plural("first record is after %X/%X, at %X/%X, skipping over %u byte", "first record is after %X/%X, at %X/%X, skipping over %u bytes", (*first_record - lsn), LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(*first_record), (uint32) (*first_record - lsn))); Regards, Bharath Rupireddy.
Non-replayable WAL records through overflows and >MaxAllocSize lengths
Hi, Xlogreader limits the size of what it considers valid xlog records to MaxAllocSize, but this is not currently enforced in the XLogRecAssemble API. This means it is possible to assemble a record that postgresql cannot replay. Similarly, it is possible to repeatedly call XlogRegisterData() so as to overflow rec->xl_tot_len, resulting in out-of-bounds reads and writes while processing record data. PFA a patch that attempts to fix both of these issues in the insertion API, by checking against overflows and other incorrectly large values in the relevant functions in xloginsert.c. In this patch, I've also added a comment to the XLogRecord spec to document that xl_tot_len should not be larger than 1GB - 1B, and why that limit exists. Kind regards, Matthias van de Meent From a72121b8fccbbf8c289e71a2c76801a1004f5353 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Fri, 11 Mar 2022 16:16:22 +0100 Subject: [PATCH v1] Add protections in xlog record APIs against large numbers and overflows. Before this, it was possible for an extension to create malicious WAL records that were too large to replay, or that would overflow the xl_tot_len field, causing potential corruption in WAL record IO ops.
--- src/backend/access/transam/xloginsert.c | 23 ++- src/include/access/xlogrecord.h | 4 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index c260310c4c..ae654177de 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len) if (num_rdatas >= max_rdatas) elog(ERROR, "too much WAL data"); + + /* protect against overflow */ + if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX)) + elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++]; rdata->data = data; @@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len) if (num_rdatas >= max_rdatas) elog(ERROR, "too much WAL data"); + + /* protect against overflow */ + if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX)) + elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++]; rdata->data = data; @@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included) { XLogRecData *rdt; - uint32 total_len = 0; + uint64 total_len = 0; int block_id; pg_crc32c rdata_crc; registered_buffer *prev_regbuf = NULL; @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, if (needs_data) { + /* protect against overflow */ + if (unlikely(regbuf->rdata_len > UINT16_MAX)) +elog(ERROR, "too much WAL data for registered buffer"); + /* * Link the caller-supplied rdata chain for this buffer to the * overall list. @@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next) COMP_CRC32C(rdata_crc, rdt->data, rdt->len); + /* + * Ensure that xlogreader.c can read the record; and check that we don't + * accidentally overflow the size of the record. 
+ * */ + if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX)) + elog(ERROR, "too much registered data for WAL record"); + /* * Fill in the fields in the record header. Prev-link is filled in later, * once we know where in the WAL the record will be inserted. The CRC does diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h index c1b1137aa7..950e1f22b0 100644 --- a/src/include/access/xlogrecord.h +++ b/src/include/access/xlogrecord.h @@ -37,6 +37,10 @@ * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's * used to distinguish between block references, and the main data structs. + * + * Note that xlogreader.c limits the total size of the xl_tot_len to + * MaxAllocSize (1GB - 1). This means that this is also the maximum size + * of a single WAL record - we would be unable to replay any larger record. */ typedef struct XLogRecord { -- 2.30.2
Re: role self-revocation
On Fri, Mar 11, 2022 at 10:27 AM Stephen Frost wrote: > I agree that there would be a recorded relationship (that is, one that > we write into the catalog and keep around until and unless it's removed > by an admin) between creating and created roles and that's probably the > default when CREATE ROLE is run but, unlike tables or such objects in > the system, I don't agree that we should require this to exist at > absolutely all times for every role (what would it be for the bootstrap > superuser..?). At least today, that's distinct from how ownership in > the system works. I also don't believe that this is necessarily an > issue for Robert's use-case, as long as there are appropriate > restrictions around who is allowed to remove or modify these > relationships. I agree. > > I agree. [ but we need to get consensus ] > > Well ... [ how about we do it my way? ] Repeating the same argument over again isn't necessarily going to help anything here. I read your argument and I can believe there could be a solution along those lines, although you haven't addressed my concern about NOINHERIT. Tom is apparently less convinced, and you know, I think that's OK. Not everybody has to agree with the way you want to do it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: role self-revocation
On Fri, Mar 11, 2022 at 8:32 AM Stephen Frost wrote: > > Such scripts as will break will still > break in a pretty clear way with a clear answer as to how to fix them > and I don't think there's some kind of data corruption or something that > would happen. > > I largely agree and am perfectly fine with going with the majority on this point. My vote would just fall on the conservative side. But as so far no one else seems to be overly concerned, nerfing CREATEROLE seems to be the path forward. David J.
Re: role self-revocation
Greetings, * David G. Johnston (david.g.johns...@gmail.com) wrote: > On Thu, Mar 10, 2022 at 3:01 PM Robert Haas wrote: > > > On Thu, Mar 10, 2022 at 4:00 PM David G. Johnston > > wrote: > > > I dislike changing the documented behavior of CREATEROLE to the degree > > suggested here. However, there are three choices here, only one of which > > can be chosen: > > > > > > 1. Leave CREATEROLE alone entirely > > > 2. Make it so CREATEROLE cannot assign membership to the predefined > > roles or superuser (inheritance included), but leave the rest alone. This > > would be the hard-coded version, not the role attribute one. > > > 3. Make it so CREATEROLE can only assign membership to roles for which > > it has been made an admin; as well as the other things mentioned > > > > > > Moving forward I'd prefer options 1 or 2, leaving the ability to > > create/alter/drop a role to be vested via predefined roles. > > > > It sounds like you prefer a behavior where CREATEROLE gives power over > > all non-superusers, but that seems pretty limiting to me. > > Doh! I edited out the part where I made clear I considered options 1 and 2 > as basically being done for a limited period of time while deprecating the > CREATEROLE attribute altogether in favor of the fine-grained and predefined > role based permission granting. I don't want to nerf CREATEROLE as part of > adding this new feature, instead leave it as close to status quo as > reasonable so as not to mess up existing setups that make use of it. We > can note in the release notes and documentation that we consider CREATEROLE > to be deprecated and that the new predefined role should be used to give a > user the ability to create/alter/drop roles, etc... DBAs should consider > revoking CREATEROLE from their users and granting them proper memberships > in the predefined roles and the groups those roles should be administering. 
I disagree entirely with the idea that we should push this out for however many years it'd take to get through some deprecation period. We are absolutely terrible when it comes to that and what we're talking about here, at this point anyway, is making changes that get us closer to what the spec says. I agree that we can't back-patch these changes, but I don't think we need a deprecation period. If we were just getting rid of CREATEROLE, I don't think we'd have a deprecation period. If we need to get rid of CREATEROLE and introduce something new that more-or-less means the same thing, to make it so that people's scripts break in a more obvious way, maybe we can consider that, but I don't really think that's actually the case here. Such scripts as will break will still break in a pretty clear way with a clear answer as to how to fix them and I don't think there's some kind of data corruption or something that would happen. Thanks, Stephen signature.asc Description: PGP signature
Re: role self-revocation
On Fri, Mar 11, 2022 at 6:55 AM Robert Haas wrote: > On Thu, Mar 10, 2022 at 5:14 PM Tom Lane wrote: > > This seems reasonable in isolation, but > > > > (1) it implies a persistent relationship between creating and created > > roles. Whether you want to call that ownership or not, it sure walks > > and quacks like ownership. > > I like my TENANT idea best, but I'm perfectly willing to call > it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen > seems to prefer if one of those ideas gains consensus. If WITH ADMIN OPTION is sufficient to meet our immediate goals I do not see the benefit of adding an ownership concept where there is not one today. If added, I'd much rather have it be ownership as to fit in with the rest of the existing system rather than introduce an entirely new term. > If Alice creates non-superusers Bob and Charlie, and Charlie creates > Doug, we need the persistent relationship to know that Charlie is > allowed to drop Doug and Bob is not > The interesting question seems to be whether Alice can drop Doug, not whether Bob can. > It's more important > at this point to get agreement on the principles. > What are the principles you want to get agreement on and how do they differ from what we have in place today? What are the proposed changes you would make to enforce the new principles. Which principles are now obsolete and what do you want to do about the features that were built to enforce them (including backward compatibility concerns)? David J.
Re: role self-revocation
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Mar 10, 2022 at 5:14 PM Tom Lane wrote: > > This seems reasonable in isolation, but > > > > (1) it implies a persistent relationship between creating and created > > roles. Whether you want to call that ownership or not, it sure walks > > and quacks like ownership. I agree that there would be a recorded relationship (that is, one that we write into the catalog and keep around until and unless it's removed by an admin) between creating and created roles and that's probably the default when CREATE ROLE is run but, unlike tables or such objects in the system, I don't agree that we should require this to exist at absolutely all times for every role (what would it be for the bootstrap superuser..?). At least today, that's distinct from how ownership in the system works. I also don't believe that this is necessarily an issue for Robert's use-case, as long as there are appropriate restrictions around who is allowed to remove or modify these relationships. > I agree. It's been obvious to me from the beginning that we needed > such a persistent relationship, and also that it needed to be a > relationship from which the created role couldn't simply walk away. > Yet, more than six months after the first discussions of this topic, > we still don't have any kind of agreement on what that thing should be > called. I like my TENANT idea best, but I'm perfectly willing to call > it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen > seems to prefer if one of those ideas gains consensus. But we've > managed to waste all hope of making any significant progress here for > an entire release cycle for lack of ability to agree on spelling. I > think that's unfair to Mark, who put a lot of work into this area and > got nothing out of it, and I think it sucks for users of PostgreSQL, > too. Well ... one of those actually already exists and also happens to be in the SQL spec. 
I don't necessarily agree that we should absolutely require that the system always enforce that this relationship exist (I'd like a superuser to be able to get rid of it and to be able to change it too if they want) and that seems a bit saner than having the bootstrap superuser be special in some way here as would seem to otherwise be required. I also feel that it would be generally useful to have more than one of these relationships, if the user wishes, and that's something that ownership doesn't (directly) support today. Further, that's supported and expected by the SQL spec too. Even if we invented some concept of ownership of roles, it seems like we should make most of the other changes discussed here to bring us closer to what the spec says about CREATE ROLE, DROP ROLE, GRANT, REVOKE, etc. At that point though, what's the point of having ownership? > > (2) it seems exactly contradictory to your later point that > > > > > Agree. I also think that it would be a good idea to attribute grants > > > performed by any superuser to the bootstrap superuser, or leave them > > > unattributed somehow. Because otherwise dropping superusers becomes a > > > pain in the tail for no good reason. > > > > Either there's a persistent relationship or there's not. I don't > > think it's sensible to treat superusers differently here. > > > > I think that this argument about the difficulty of dropping superusers > > may in fact be the motivation for the existing behavior that object- > > permissions GRANTs done by superusers are attributed to the object > > owner; something you were unhappy about upthread. > > > > In the end these requirements seem mutually contradictory. Either > > we can have a persistent ownership relationship or not, but I don't > > think we can have it apply in some cases and not others without > > creating worse problems than we solve. I'm inclined to toss overboard > > the requirement that superusers need to be an easy thing to drop. 
> > Why is that important, anyway? > > Well, I think you're looking at it the wrong way. Compared to getting > useful functionality, the relative ease of dropping users is > completely unimportant. I'm happy to surrender it in exchange for > something else. I just don't see why we should give it up for nothing. > If Alice creates non-superusers Bob and Charlie, and Charlie creates > Doug, we need the persistent relationship to know that Charlie is > allowed to drop Doug and Bob is not. But if Charlie is a superuser > anyway, then the persistent relationship is of no use. I don't see the > point of cluttering up the system with such dependencies. Will I do it > that way, if that's what it takes to get the patch accepted? Sure. But > I can't imagine any end-user actually liking it. We need to know that Charlie is allowed to drop Doug and Bob isn't but that doesn't make it absolutely required that this be tracked permanently or that Alice can't decide later to make it such that Doug can't be dropped by Charlie for whatever reason
Re: refactoring basebackup.c
On Thu, Mar 10, 2022 at 8:02 PM Justin Pryzby wrote: > I'm getting errors from pg_basebackup when using both -D- and > --compress=server-* > The issue seems to go away if I use --no-manifest. > > $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none > --compress=server-gzip >/dev/null ; echo $? > pg_basebackup: error: tar member has empty name > 1 > > $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none > --compress=server-gzip >/dev/null ; echo $? > NOTICE: WAL archiving is not enabled; you must ensure that all required WAL > segments are copied through other means to complete the backup > pg_basebackup: error: COPY stream ended before last file was finished > 1 Thanks for the report. The problem here is that, when the output is standard output (-D -), pg_basebackup can only produce a single output file, so the manifest gets injected into the tar file on the client side rather than being written separately as we do in normal cases. However, that only works if we're receiving a tar file that we can parse from the server, and here the server is sending a compressed tarfile. The current code mistakenly attempts to parse the compressed tarfile as if it were an uncompressed tarfile, which causes the error messages that you are seeing (and which I can also reproduce here). We actually have enough infrastructure available in pg_basebackup now that we could do the "right thing" in this case: decompress the data received from the server, parse the resulting tar file, inject the backup manifest, construct a new tar file, and recompress. However, I think that's probably not a good idea, because it's unlikely that the user will understand that the data is being compressed on the server, then decompressed, and then recompressed again, and the performance of the resulting pipeline will probably not be very good. So I think we should just refuse this command. Patch for that attached.
-- Robert Haas EDB: http://www.enterprisedb.com reject-compressed-inject.patch Description: Binary data
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas wrote: > > On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma > wrote: > > Looks better, but why do you want to pass elevel to the > > load_relmap_file()? Are we calling this function from somewhere other > > than read_relmap_file()? If not, do we have any plans to call this > > function directly bypassing read_relmap_file for any upcoming patch? > > If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In > that case, we only need to stop trying to create a database; we don't > need to terminate the session. On the other hand if we can't read our > own database's relmap files, that's an unrecoverable error, because we > will not be able to run any queries at all, so FATAL is appropriate. > OK. I can see it being used in the v13 patch. In the previous patches it was hard-coded with FATAL. Also, we simply error out when doing file copy as I can see in the copy_file function. So yes FATAL is not the right option to use when creating a database. Thanks. -- With Regards, Ashutosh Sharma.
Re: generic plans and "initial" pruning
On Fri, Mar 11, 2022 at 11:35 PM Amit Langote wrote: > Attached is v5, now broken into 3 patches: > > 0001: Some refactoring of runtime pruning code > 0002: Add a plan_tree_walker > 0003: Teach AcquireExecutorLocks to skip locking pruned relations Repeated the performance tests described in the 1st email of this thread: HEAD: (copied from the 1st email) 32 tps = 20561.776403 (without initial connection time) 64 tps = 12553.131423 (without initial connection time) 128 tps = 13330.365696 (without initial connection time) 256 tps = 8605.723120 (without initial connection time) 512 tps = 4435.951139 (without initial connection time) 1024 tps = 2346.902973 (without initial connection time) 2048 tps = 1334.680971 (without initial connection time) Patched v1: (copied from the 1st email) 32 tps = 27554.156077 (without initial connection time) 64 tps = 27531.161310 (without initial connection time) 128 tps = 27138.305677 (without initial connection time) 256 tps = 25825.467724 (without initial connection time) 512 tps = 19864.386305 (without initial connection time) 1024 tps = 18742.668944 (without initial connection time) 2048 tps = 16312.412704 (without initial connection time) Patched v5: 32 tps = 28204.197738 (without initial connection time) 64 tps = 26795.385318 (without initial connection time) 128 tps = 26387.920550 (without initial connection time) 256 tps = 25601.141556 (without initial connection time) 512 tps = 19911.947502 (without initial connection time) 1024 tps = 20158.692952 (without initial connection time) 2048 tps = 16180.195463 (without initial connection time) Good to see that these rewrites haven't really hurt the numbers much, which makes sense because the rewrites have really been about putting the code in the right place. BTW, these are the numbers for the same benchmark repeated with plan_cache_mode = auto, which causes a custom plan to be chosen for every execution and so is unaffected by this patch.
32   tps = 13359.225082 (without initial connection time)
64   tps = 15760.533280 (without initial connection time)
128  tps = 15825.734482 (without initial connection time)
256  tps = 15017.693905 (without initial connection time)
512  tps = 13479.973395 (without initial connection time)
1024 tps = 13200.444397 (without initial connection time)
2048 tps = 12884.645475 (without initial connection time)

Comparing them to numbers when using force_generic_plan shows that making the generic plans faster is indeed worthwhile.

--
Amit Langote
EDB: http://www.enterprisedb.com
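For a quick read on how much the patch buys at high client counts, the reported numbers can be compared directly. This throwaway sketch just copies the HEAD and patched-v5 figures from the tables above and prints the per-client-count speedup:

```python
# pgbench TPS figures copied from the tables above (HEAD vs. patched v5).
head = {32: 20561.776403, 64: 12553.131423, 128: 13330.365696,
        256: 8605.723120, 512: 4435.951139, 1024: 2346.902973,
        2048: 1334.680971}
patched_v5 = {32: 28204.197738, 64: 26795.385318, 128: 26387.920550,
              256: 25601.141556, 512: 19911.947502, 1024: 20158.692952,
              2048: 16180.195463}

for clients in sorted(head):
    speedup = patched_v5[clients] / head[clients]
    print(f"{clients:>4} clients: {speedup:.1f}x faster than HEAD")
```

The gap widens as the client count grows, which matches the thesis of the thread: AcquireExecutorLocks dominates generic-plan execution under contention.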
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma wrote: > Looks better, but why do you want to pass elevel to the > load_relmap_file()? Are we calling this function from somewhere other > than read_relmap_file()? If not, do we have any plans to call this > function directly bypassing read_relmap_file for any upcoming patch? If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In that case, we only need to stop trying to create a database; we don't need to terminate the session. On the other hand if we can't read our own database's relmap files, that's an unrecoverable error, because we will not be able to run any queries at all, so FATAL is appropriate. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Some comments on pg_walinspect-docc.patch this time:

+ pg_get_wal_record_info(in_lsn pg_lsn, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length OUT int4, total_length OUT int4, description OUT text, block_ref OUT text, data OUT bytea, data_len OUT int4)

You may shorten this by mentioning just the function's input parameters and specifying "returns record" like shown below, so there is no need to spell out all the OUT params:

pg_get_wal_record_info(in_lsn pg_lsn) returns record

Please check the documentation of the other functions for reference.

==

+ pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length OUT int4, total_length OUT int4, description OUT text, block_ref OUT text, data OUT bytea, data_len OUT int4)

The same comment applies here as well. For the return type you can just mention "returns setof record" like shown below:

pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof record

You may also check for such simplifications at other places; I might have missed some.

==

+postgres=# select prev_lsn, xid, resource_manager, length, total_length, block_ref from pg_get_wal_records_info('0/158A7F0', '0/1591400');
+ prev_lsn  | xid | resource_manager | length | total_length | block_ref
+-----------+-----+------------------+--------+--------------+-----------
+ 0/158A7B8 | 735 | Heap             |     54 |         7838 | blkref #0: rel 1663/5/2619 blk 18 (FPW); hole: offset: 88, length: 408
+ 0/158A7F0 | 735 | Btree            |     53 |         8133 | blkref #0: rel 1663/5/2696 blk 1 (FPW); hole: offset: 1632, length: 112
+ 0/158C6A8 | 735 | Heap             |     53 |          873 | blkref #0: rel 1663/5/1259 blk 0 (FPW); hole: offset: 212, length: 7372

Instead of specifying column names in the target list, I think it's better to use "*" so that all the output columns are displayed. You may also shorten the gap between the start and end LSN to reduce the output size.

==

Any reason for not specifying an author name in the .sgml file?
Do you want me to add my name to the author? :) Ashutosh Sharma ashu.coe...@gmail.com -- With Regards, Ashutosh Sharma. On Thu, Mar 10, 2022 at 10:15 PM Bharath Rupireddy wrote: > > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis wrote: > > > > On Wed, 2022-03-02 at 22:37 +0530, Bharath Rupireddy wrote: > > > > > > Attaching v6 patch set with above review comments addressed. Please > > > review it further. > > Thanks Jeff for reviewing it. I've posted the latest v7 patch-set > upthread [1] which is having more simple-yet-useful-and-effective > functions. > > > * Don't issue WARNINGs or other messages for ordinary situations, like > > when pg_get_wal_records_info() hits the end of WAL. > > v7 patch-set [1] has no warnings, but the functions will error out if > future LSN is specified. > > > * It feels like the APIs that allow waiting for the end of WAL are > > slightly off. Can't you just do pg_get_wal_records_info(start_lsn, > > least(pg_current_wal_flush_lsn(), end_lsn)) if you want the non-waiting > > behavior? Try to make the API more orthogonal, where a few basic > > functions can be combined to give you everything you need, rather than > > specifying extra parameters and issuing WARNINGs. I > > v7 patch-set [1] onwards waiting mode has been removed for all of the > functions, again to keep things simple-yet-useful-and-effective. > However, we can always add new pg_walinspect functions that wait for > future WAL in the next versions once basic stuff gets committed and if > many users ask for it. > > > * In the docs, include some example output. I don't see any output in > > the tests, which makes sense because it's mostly non-deterministic, but > > it would be helpful to see sample output of at least > > pg_get_wal_records_info(). > > +1. Added for pg_get_wal_records_info and pg_get_wal_stats. > > > * Is pg_get_wal_stats() even necessary, or can you get the same > > information with a query over pg_get_wal_records_info()? 
For instance, > > if you want to group by transaction ID rather than rmgr, then > > pg_get_wal_stats() is useless. > > Yes, you are right pg_get_wal_stats provides WAL stats per resource > manager which is similar to pg_waldump with --start, --end and --stats > option. It provides more information than pg_get_wal_records_info and > is a good way of getting stats than adding more columns to > pg_get_wal_records_info, calculating percentage in sql and having > group by clause. IMO, pg_get_wal_stats is more readable and useful. > > > * Would be nice to have a pg_wal_file_is_valid() or similar, which > > would test that it exists, and the header matches the filename (e.g. if > > it was recycled but not used, that would count as invalid). I think > > pg_get_first_valid_wal_record_lsn() would
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
You may also need to add documentation to app-createdb.sgml. Currently you have just added to create_database.sgml. Also, I had a quick look at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch and they seemed fine to me although I haven't put much emphasis on the comments and other cosmetic stuff. -- With Regards, Ashutosh Sharma. On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar wrote: > > On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar wrote: > > > > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas wrote: > > > > > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar wrote: > > > > I have completely changed the logic for this refactoring. Basically, > > > > write_relmap_file(), is already having parameters to control whether > > > > to write wal, send inval and we are already passing the dbpath. > > > > Instead of making a new function I just pass one additional parameter > > > > to this function itself about whether we are creating a new map or not > > > > and I think with that changes are very less and this looks cleaner to > > > > me. Similarly for load_relmap_file() also I just had to pass the > > > > dbpath and memory for destination map. Please have a look and let me > > > > know your thoughts. > > > > > > It's not terrible, but how about something like the attached instead? > > > I think this has the effect of reducing the number of cases that the > > > low-level code needs to know about from 2 to 1, instead of making it > > > go up from 2 to 3. > > > > Yeah this looks cleaner, I will rebase the remaining patch. > > Here is the updated version of the patch set. > > Changes, 1) it take Robert's patch as first refactoring patch 2) > Rebase other new relmapper apis on top of that in 0002 3) Some code > refactoring in main patch 0005 and also one problem fix, earlier in > wal log method I have removed ForceSyncCommit(), but IMHO that is > equally valid whether we use file_copy or wal_log because in both > cases we are creating the disk files. 
4) Support strategy in createdb > tool and add test case as part of 0006. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Re: role self-revocation
On Thu, Mar 10, 2022 at 5:14 PM Tom Lane wrote: > This seems reasonable in isolation, but > > (1) it implies a persistent relationship between creating and created > roles. Whether you want to call that ownership or not, it sure walks > and quacks like ownership. I agree. It's been obvious to me from the beginning that we needed such a persistent relationship, and also that it needed to be a relationship from which the created role couldn't simply walk away. Yet, more than six months after the first discussions of this topic, we still don't have any kind of agreement on what that thing should be called. I like my TENANT idea best, but I'm perfectly willing to call it ownership as you seem to prefer or WITH ADMIN OPTION as Stephen seems to prefer if one of those ideas gains consensus. But we've managed to waste all hope of making any significant progress here for an entire release cycle for lack of ability to agree on spelling. I think that's unfair to Mark, who put a lot of work into this area and got nothing out of it, and I think it sucks for users of PostgreSQL, too. > (2) it seems exactly contradictory to your later point that > > > Agree. I also think that it would be a good idea to attribute grants > > performed by any superuser to the bootstrap superuser, or leave them > > unattributed somehow. Because otherwise dropping superusers becomes a > > pain in the tail for no good reason. > > Either there's a persistent relationship or there's not. I don't > think it's sensible to treat superusers differently here. > > I think that this argument about the difficulty of dropping superusers > may in fact be the motivation for the existing behavior that object- > permissions GRANTs done by superusers are attributed to the object > owner; something you were unhappy about upthread. > > In the end these requirements seem mutually contradictory. 
Either > we can have a persistent ownership relationship or not, but I don't > think we can have it apply in some cases and not others without > creating worse problems than we solve. I'm inclined to toss overboard > the requirement that superusers need to be an easy thing to drop. > Why is that important, anyway? Well, I think you're looking at it the wrong way. Compared to getting useful functionality, the relative ease of dropping users is completely unimportant. I'm happy to surrender it in exchange for something else. I just don't see why we should give it up for nothing. If Alice creates non-superusers Bob and Charlie, and Charlie creates Doug, we need the persistent relationship to know that Charlie is allowed to drop Doug and Bob is not. But if Charlie is a superuser anyway, then the persistent relationship is of no use. I don't see the point of cluttering up the system with such dependencies. Will I do it that way, if that's what it takes to get the patch accepted? Sure. But I can't imagine any end-user actually liking it. > I'm a bit disturbed that parts of this discussion seem to be getting > conducted with little understanding of the system's existing behaviors. > We should not be reinventing things we already have perfectly good > solutions for in the object-privileges domain. I did wonder whether that might be the existing behavior, but stopping to check right at that moment didn't seem that important to me. Maybe I should have taken the time, but it's not like we're writing the final patch for commit next Tuesday at this point. It's more important at this point to get agreement on the principles. That said, I do agree that there have been times when we haven't thought hard enough about the existing behavior in proposing new behavior. On the third hand, though, part of the problem here is that neither Stephen nor I are entirely happy with the existing behavior, if for somewhat different reasons. It really isn't "perfectly good." 
On the one hand, from a purely technical standpoint, a lot of the behavior around roles in particular seems well below the standard that anyone would consider committable today. On the other hand, even the parts of the code that are in reasonable shape from a code quality point of view don't actually do the things that we think users want done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences
On 3/11/22 12:34, Amit Kapila wrote:
> On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra wrote:
>> On 3/7/22 22:25, Tomas Vondra wrote:
>>>> Interesting. I can think of one reason that might cause this - we log
>>>> the first sequence increment after a checkpoint. So if a checkpoint
>>>> happens in an unfortunate place, there'll be an extra WAL record. On
>>>> slow / busy machines that's quite possible, I guess.
>>>
>>> I've tweaked checkpoint_timeout to make checkpoints more aggressive
>>> (set it to 1s), and it seems my hunch was correct - it produces failures
>>> exactly like this one. The best fix probably is to just disable decoding
>>> of sequences in those tests that are not aimed at testing sequence
>>> decoding.
>>
>> I've pushed a fix for this, adding "include-sequences=0" to a couple of
>> test_decoding tests, which were failing with concurrent checkpoints.
>>
>> Unfortunately, I realized we have a similar issue in the "sequences"
>> tests too :-( Imagine you do a series of sequence increments, e.g.
>>
>> SELECT nextval('s') FROM generate_series(1,100);
>>
>> If there's a concurrent checkpoint, this may add an extra WAL record,
>> affecting the decoded output (and also the data stored in the sequence
>> relation itself). Not sure what to do about this ...
>
> I am also not sure what to do for it, but maybe if in some way we can
> increase checkpoint_timeout or other parameters for these tests then
> it would reduce the chances of such failures. The other idea could be
> to perform a checkpoint before the start of the tests to reduce the
> possibility of another checkpoint.

Yeah, I had the same ideas, but I'm not sure I like any of them. I doubt we want to make checkpoints extremely rare, and even if we do that it'll still fail on slow machines (e.g. with valgrind, clobber cache etc.).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Column Filtering in Logical Replication
On 3/11/22 10:52, Amit Kapila wrote:
> On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra wrote:
>> On 3/10/22 20:10, Tomas Vondra wrote:
>>> FWIW I think the reason is pretty simple - pgoutput_row_filter_init is
>>> broken. It assumes you can just do this
>>>
>>>     rftuple = SearchSysCache2(PUBLICATIONRELMAP,
>>>                               ObjectIdGetDatum(entry->publish_as_relid),
>>>                               ObjectIdGetDatum(pub->oid));
>>>
>>>     if (HeapTupleIsValid(rftuple))
>>>     {
>>>         /* Null indicates no filter. */
>>>         rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
>>>                                   Anum_pg_publication_rel_prqual,
>>>                                   &pub_no_filter);
>>>     }
>>>     else
>>>     {
>>>         pub_no_filter = true;
>>>     }
>>>
>>> and pub_no_filter=true means there's no filter at all. Which is
>>> nonsense, because we're using publish_as_relid here - the publication
>>> may not include this particular ancestor, in which case we need to just
>>> ignore this publication.
>>>
>>> So yeah, this needs to be reworked.
>>
>> I spent a bit of time looking at this, and I think a minor change in
>> get_rel_sync_entry() fixes this - it's enough to ensure rel_publications
>> only includes publications that actually include publish_as_relid.
>
> Thanks for looking into this. I think in the first patch before
> calling get_partition_ancestors() we need to ensure it is a partition
> (the call expects that) and pubviaroot is true.

Does the call really require that? Also, I'm not sure why we'd need to look at pubviaroot - that's already considered earlier when calculating publish_as_relid; here we just need to know the relationship of the two OIDs (whether one is an ancestor/child of the other).

> I think it would be
> good if we can avoid an additional call to get_partition_ancestors()
> as it could be costly.

Maybe. OTOH we should only do this very rarely anyway.

> I wonder why it is not sufficient to ensure
> that publish_as_relid exists after the ancestor in the ancestors list
> before assigning the ancestor to publish_as_relid?
> This only needs to be done
> in case of (if (!publish)). I have not tried this so I could be wrong.

I'm not sure what exactly you are proposing. Maybe try coding it? That's probably faster than trying to describe what the code might do ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
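To make the fix under discussion concrete, here is a toy model of the invariant being proposed for get_rel_sync_entry(): once publish_as_relid has been chosen, rel_publications should keep only the publications that actually publish that relation. The names, OIDs, and the `publications_for` helper below are illustrative stand-ins, not the actual pgoutput data structures:

```python
def publications_for(publish_as_relid, publications):
    """Return the publications that actually publish publish_as_relid.

    publications: mapping of publication name -> set of relation OIDs it
    publishes (after expanding ALL TABLES / schema membership).
    """
    return [name for name, rels in sorted(publications.items())
            if publish_as_relid in rels]


# A partitioned root (OID 100) published via pub_root, and its partition
# (OID 102) published directly via pub_leaf. If changes to the partition
# are published as the root (publish_as_relid == 100), pub_leaf must be
# ignored -- its row/column filters don't apply to the root.
pubs = {"pub_root": {100}, "pub_leaf": {102}}
print(publications_for(100, pubs))
print(publications_for(102, pubs))
```

The bug described above is exactly the absence of this filtering step: looking up publish_as_relid in a publication that never contained it, and misreading the cache miss as "no filter".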
Re: Column Filtering in Logical Replication
On 3/11/22 08:05, wangw.f...@fujitsu.com wrote: > On Fri, Mar 11, 2022 at 9:57 AM Tomas Vondra > wrote: >> > Hi Tomas, > Thanks for your patches. > > On Mon, Mar 9, 2022 at 9:53 PM Tomas Vondra > wrote: >> On Wed, Mar 9, 2022 at 6:04 PM Amit Kapila wrote: >>> On Mon, Mar 7, 2022 at 11:18 PM Tomas Vondra >>> wrote: >>>> On Fri, Mar 4, 2022 at 6:43 PM Amit Kapila wrote: >>>>> Fetching column filter info in tablesync.c is quite expensive. It >>>>> seems to be using four round-trips to get the complete info whereas >>>>> for row-filter we use just one round trip. I think we should try to >>>>> get both row filter and column filter info in just one round trip. >>>>> >>>> >>>> Maybe, but I really don't think this is an issue. >>>> >>> >>> I am not sure but it might matter for small tables. Leaving aside the >>> performance issue, I think the current way will get the wrong column >>> list in many cases: (a) The ALL TABLES IN SCHEMA case handling won't >>> work for partitioned tables when the partitioned table is part of one >>> schema and partition table is part of another schema. (b) The handling >>> of partition tables in other cases will fetch incorrect lists as it >>> tries to fetch the column list of all the partitions in the hierarchy. >>> >>> One of my colleagues has even tested these cases both for column >>> filters and row filters and we find the behavior of row filter is okay >>> whereas for column filter it uses the wrong column list. We will share >>> the tests and results with you in a later email. We are trying to >>> unify the column filter queries with row filter to make their behavior >>> the same and will share the findings once it is done. I hope if we are >>> able to achieve this that we will reduce the chances of bugs in this >>> area. >>> >> >> OK, I'll take a look at that email. > I tried to get both the column filters and the row filters with one SQL, but > it failed because I think the result is not easy to parse. 
> > I noted that we use two SQLs to get column filters in the latest > patches(20220311). I think maybe we could use one SQL to get column filters to > reduce network cost. Like the SQL in the attachment. > I'll take a look. But as I said before - I very much prefer SQL that is easy to understand, and I don't think the one extra round trip is an issue during tablesync (which is a very rare action). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:08 AM Kyotaro Horiguchi wrote:
> > Attaching the v8 patch-set resolving above comments and some tests for
> > checking function permissions. Please review it further.
>
> I played with this a bit, and would like to share some thoughts on it.

Thanks a lot Kyotaro-san for reviewing.

> It seems to me too rigorous that pg_get_wal_records_info/stats()
> reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> the real end-of-WAL is more kind to users. I think the same with the
> restriction that start and end LSN are required to be different.

Throwing an error on future LSNs is the same behaviour for all of the pg_walinspect function input LSNs. IMO it is cleaner to do that rather than confuse users with different behaviours for each function. The principle is this - pg_walinspect functions can't show future WAL info. Having said that, I agree to make it a WARNING instead of ERROR, for the simple reason that ERROR aborts the txn, whereas with a WARNING applications can retry without aborting the txn. For instance, pg_terminate_backend emits a WARNING if the PID isn't a postgres process id.

PS: WARNING may not be a better idea than ERROR if we turn pg_get_wal_stats into a SQL function, see my response below.

> The definition of end-lsn is fuzzy here. If I fed a future LSN to the
> functions, they tell me the beginning of the current insertion point
> in the error message. On the other hand they don't accept the same
> value as end-LSN. I think it is right that they tell the current
> insertion point and they should take the end-LSN as the LSN to stop
> reading.

The future LSN is determined by this:

    if (!RecoveryInProgress())
        available_lsn = GetFlushRecPtr(NULL);
    else
        available_lsn = GetXLogReplayRecPtr(NULL);

GetFlushRecPtr returns the last flushed byte + 1, meaning this is the end LSN currently known in the server, but it is not the start LSN of the last WAL record in the server.
Same goes for GetXLogReplayRecPtr, which gives the lastReplayedEndRecPtr end + 1 position. I picked GetFlushRecPtr and GetXLogReplayRecPtr to determine the future WAL LSN because this is how read_local_xlog_page determines how far to read WAL before going into wait mode, but I wanted to avoid the wait mode completely for all the pg_walinspect functions (to keep things simple for now), hence doing similar checks within the input validation code and emitting a warning.

And you are right that when we emit something like the below, users tend to use 0/15B6D68 (from the DETAIL message) as the end LSN. I don't want to drop this DETAIL message altogether as it gives an idea where the server is. How about rephrasing the DETAIL message a bit, something like "Database system flushed the WAL up to WAL LSN %X/%X." or some other better phrasing?

WARNING: WAL start LSN cannot be a future WAL LSN
DETAIL: Last known WAL LSN on the database system is 0/15B6D68.

If users aren't sure what the end record LSN is, they can just use pg_get_wal_records_info and pg_get_wal_stats without an end LSN:

select * from pg_get_wal_records_info('0/15B6D68');
select * from pg_get_wal_stats('0/15B6D68');

> I think pg_get_wal_stats() is worth having but I think it should be
> implemented in SQL. Currently pg_get_wal_records_info() doesn't tell
> about FPIs since pg_waldump doesn't, but that is internally collected
> (of course!) and easily revealed.
> If we do that, the
> pg_get_wal_records_stats() would be reduced to the following SQL
> statement:
>
> SELECT resource_manager resmgr,
>        count(*) AS N,
>        (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
>        sum(total_length) AS "combined size",
>        (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
>        sum(fpi_len) AS fpilen,
>        (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
> FROM pg_get_wal_records_info('0/100', '0/175DD7f')
> GROUP BY resource_manager
> WINDOW tot AS ()
> ORDER BY "combined size" desc;
>
> The only difference with pg_waldump is that the statement above doesn't
> show lines for the resource managers that aren't contained in the
> result of pg_get_wal_records_info(). But I don't think that matters.

Yeah, this is better. One problem with the above is when pg_get_wal_records_info emits a warning for a future LSN. But this shouldn't stop us doing it via SQL. Instead I would let all the pg_walinspect functions emit errors as opposed to WARNINGs. Thoughts?

postgres=# SELECT resource_manager,
       count(*) AS count,
       (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS count_percentage,
       sum(total_length) AS combined_size,
       (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS combined_size_percentage
FROM pg_get_wal_records_info('0/10A3E50', '0/25B6F00')
GROUP BY resource_manager
WINDOW tot AS ()
ORDER BY combined_size desc;
WARNING: WAL end LSN cannot be a future WAL LSN
DETAIL: L
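The SQL above groups records by resource manager and derives each manager's share of record count and size. The same arithmetic can be sanity-checked in a few lines; the (rmgr, total_length) tuples below are made up for illustration and merely stand in for rows of pg_get_wal_records_info():

```python
from collections import defaultdict

# Made-up (resource_manager, total_length) pairs standing in for the
# output of pg_get_wal_records_info(); not real WAL data.
records = [("Heap", 7838), ("Btree", 8133), ("Heap", 873), ("XLOG", 114)]

by_rmgr = defaultdict(lambda: {"n": 0, "size": 0})
for rmgr, total_len in records:
    by_rmgr[rmgr]["n"] += 1
    by_rmgr[rmgr]["size"] += total_len

total_n = sum(s["n"] for s in by_rmgr.values())
total_size = sum(s["size"] for s in by_rmgr.values())

# Mirrors the GROUP BY + window-total percentages in the SQL above.
for rmgr, s in sorted(by_rmgr.items(), key=lambda kv: -kv[1]["size"]):
    print(f'{rmgr:8} n={s["n"]} ({100 * s["n"] / total_n:.2f}%) '
          f'size={s["size"]} ({100 * s["size"] / total_size:.2f}%)')
```

This is the whole substance of the per-rmgr stats view: a single GROUP BY with window-function totals, which is the argument for implementing pg_get_wal_stats() in SQL on top of pg_get_wal_records_info().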
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Fri, Mar 11, 2022 at 04:59:11PM +0530, Nitin Jadhav wrote: > > That "throttled" flag should be the same as having or not a "force" in the > > flags. We should be consistent and report information the same way, so > > either > > a lot of flags (is_throttled, is_force...) or as now a single field > > containing > > the set flags, so the current approach seems better. Also, it wouldn't be > > much > > better to show the checkpoint as not having the force flags and still not > > being > > throttled. > > I think your understanding is wrong here. The flag which affects > throttling behaviour is CHECKPOINT_IMMEDIATE. Yes sorry, that's what I meant and later used in the flags. > I am not suggesting > removing the existing 'flags' field of pg_stat_progress_checkpoint > view and adding a new field 'throttled'. The content of the 'flags' > field remains the same. I was suggesting replacing the 'next_flags' > field with 'throttled' field since the new request with > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint. Are you saying that this new throttled flag will only be set by the overloaded flags in ckpt_flags? So you can have a checkpoint with a CHECKPOINT_IMMEDIATE flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag that's not throttled? > > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be > > used > > to detect that someone wants a new checkpoint afterwards, whatever it's and > > whether or not the current checkpoint to be finished quickly. For this > > flag I > > think it's better to not report it in the view flags but with a new field, > > as > > discussed before, as it's really what it means. > > I understand your suggestion of adding a new field to indicate whether > any of the new requests have been made or not. You just want this > field to represent only a new request or does it also represent the > current checkpoint to finish quickly. 
Only represent what it means: a new checkpoint is requested. An additional CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.

> > CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> > progress checkpoint, so it can be simply added to the view flags.
>
> As discussed upthread it is not advisable to do so. The content of
> 'flags' remains the same through the checkpoint. We cannot add a new
> checkpoint's flag (CHECKPOINT_IMMEDIATE) to the current one even
> though it affects current checkpoint behaviour. The only thing we can do
> is add a new field to show that the current checkpoint is affected
> by new requests.

I don't get it. The checkpoint flags and the view flags (set by pgstat_progress_update*) are different, so why can't we add this flag to the view flags? The fact that checkpointer.c doesn't update the passed flags and instead looks in shmem to see if CHECKPOINT_IMMEDIATE has been set since is an implementation detail, and the view shouldn't focus on which flags were initially passed to the checkpointer but instead on which flags the checkpointer is actually enforcing, as that's what the user is interested in. If you want to store it in another field internally but display it in the view with the rest of the flags, I'm fine with it.

> > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> > view?
>
> Where do you want to add this in the path?

Same as in your current patch I guess.
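The semantics being argued for above reduce to a couple of bitmask tests: the view should show the flags the checkpointer is actually enforcing, so a CHECKPOINT_IMMEDIATE arriving in shmem after the checkpoint started unthrottles it, while CHECKPOINT_REQUESTED only signals that another checkpoint is queued. The flag values and the `view_fields` helper here are illustrative, not PostgreSQL's actual constants or code:

```python
# Illustrative flag bits in the style of the CHECKPOINT_* macros
# (the actual values in xlog.h may differ).
CHECKPOINT_IMMEDIATE = 0x0004   # do it without throttling
CHECKPOINT_REQUESTED = 0x0040   # somebody asked for a (new) checkpoint

def view_fields(started_flags, shmem_flags):
    """Derive the two proposed view fields from the flags the current
    checkpoint started with and the flags currently in shared memory."""
    # A later CHECKPOINT_IMMEDIATE request is folded into what the
    # checkpointer enforces, even though started_flags never changes.
    effective = started_flags | (shmem_flags & CHECKPOINT_IMMEDIATE)
    return {
        "throttled": not (effective & CHECKPOINT_IMMEDIATE),
        "new_request": bool(shmem_flags & CHECKPOINT_REQUESTED),
    }

# A running throttled checkpoint, then someone runs CHECKPOINT (which
# requests a new, immediate one): the current cycle speeds up.
print(view_fields(0, CHECKPOINT_REQUESTED | CHECKPOINT_IMMEDIATE))
```

Whether "throttled" is derived this way or CHECKPOINT_IMMEDIATE is simply merged into the displayed flags is exactly the presentation question debated above; the underlying state is the same.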
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi wrote:
>
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of arguments is that the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns. It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost the same
> since the latter can show a single record. However it is a bit
> annoying to do that, since, other than not accepting the same LSN for
> start and end, it doesn't show a record when there's no record in the
> specified LSN range. But I don't think the behavior is useless.

So, do you want the pg_get_wal_record_info function to be removed, since we can use pg_get_wal_records_info() to achieve what it does?

--
With Regards,
Ashutosh Sharma.
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi wrote:
>
> Sorry, some minor non-syntactical corrections.
>
> At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi wrote in
> > I played with this a bit, and would like to share some thoughts on it.
> >
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> > the real end-of-WAL is more kind to users. I think the same with the
> > restriction that start and end LSN are required to be different.
> >
> > The definition of end-lsn is fuzzy here. If I fed a future LSN to the
> > functions, they tell me the beginning of the current insertion point
> > in the error message. On the other hand they don't accept the same
> > value as end-LSN. I think it is right that they tell the current
> > insertion point and they should take the end-LSN as the LSN to stop
> > reading.
> >
> > I think pg_get_wal_stats() is worth having but I think it should be
> > implemented in SQL. Currently pg_get_wal_records_info() doesn't tell
> > about FPIs since pg_waldump doesn't, but that is internally collected
> > (of course!) and easily revealed. If we do that, the
> > pg_get_wal_records_stats() would be reduced to the following SQL
> > statement:
> >
> > SELECT resource_manager resmgr,
> >        count(*) AS N,
> >        (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
> >        sum(total_length) AS "combined size",
> >        (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
> >        sum(fpi_len) AS fpilen,
> >        (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
> > FROM pg_get_wal_records_info('0/100', '0/175DD7f')
> > GROUP BY resource_manager
> > WINDOW tot AS ()
> > ORDER BY "combined size" desc;
> >
> > The only difference with pg_waldump is that the statement above doesn't
> > show lines for the resource managers that aren't contained in the
> > result of pg_get_wal_records_info().
> > But I don't think that matters.
> >
> > Sometimes the field description has very long (28kB) content. It
> > makes the result output almost unreadable and I had a hard time
> > struggling with output full of '-'s. I would like to have a default
> > limit on the length of such fields that can be long, but I'm not sure
> > we want that.
>
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of arguments is that the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns. It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost the same
> since the latter can show a single record. However it is a bit
> annoying to do that, since, other than not accepting the same LSN for
> start and end, it doesn't show a record when there's no record in the
> specified LSN range. But I don't think the behavior is useless.
>
> The following works,
> pg_get_wal_record_info('0/100');

This does work, but it doesn't show any WARNING message for the start pointer adjustment. I think it should.

> pg_get_wal_records_info('0/100');

I think this is fine. It should work because the user hasn't specified the end pointer, so we assume the default end pointer is end-of-WAL.

> but this doesn't
> pg_get_wal_records_info('0/100', '0/100');
>
> ERROR: WAL start LSN must be less than end LSN

I think this behaviour is fine. We cannot have the same start and end LSN pointers.

> And the following shows no records.
> pg_get_wal_records_info('0/100', '0/101');
> pg_get_wal_records_info('0/100', '0/128');

I think we should be erroring out here, saying something like "couldn't find any valid WAL record between given start and end LSN", because no valid WAL records exist between the specified start and end LSN pointers.

--
With Regards,
Ashutosh Sharma.
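The range validation being debated is easy to reason about once the textual pg_lsn form (two hex halves separated by "/") is mapped to a 64-bit byte position. A quick sketch, where the error text simply mirrors the ERROR message quoted above and the helper names are made up:

```python
def parse_lsn(text):
    """Parse the textual pg_lsn form 'X/X' (two hexadecimal halves)
    into a 64-bit WAL byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def check_range(start_lsn, end_lsn):
    """Reject an empty or inverted range, like the functions above do."""
    start, end = parse_lsn(start_lsn), parse_lsn(end_lsn)
    if start >= end:
        raise ValueError("WAL start LSN must be less than end LSN")
    return start, end

# '0/100'..'0/128' is a valid 40-byte window; '0/100'..'0/100' is empty
# and rejected, matching the ERROR shown above.
print(check_range("0/100", "0/128"))
```

Note that a valid, non-empty byte range can still contain no complete WAL record, which is exactly the "shows no records" case discussed above: range validation and record lookup are separate checks.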
Re: logical decoding and replication of sequences
On Fri, Mar 11, 2022 at 5:04 PM Amit Kapila wrote: > > On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra > wrote: > > > > On 3/7/22 22:25, Tomas Vondra wrote: > > >> > > >> Interesting. I can think of one reason that might cause this - we log > > >> the first sequence increment after a checkpoint. So if a checkpoint > > >> happens in an unfortunate place, there'll be an extra WAL record. On > > >> slow / busy machines that's quite possible, I guess. > > >> > > > > > > I've tweaked the checkpoint_interval to make checkpoints more aggressive > > > (set it to 1s), and it seems my hunch was correct - it produces failures > > > exactly like this one. The best fix probably is to just disable decoding > > > of sequences in those tests that are not aimed at testing sequence > > > decoding. > > > > > > > I've pushed a fix for this, adding "include-sequences=0" to a couple > > test_decoding tests, which were failing with concurrent checkpoints. > > > > Unfortunately, I realized we have a similar issue in the "sequences" > > tests too :-( Imagine you do a series of sequence increments, e.g. > > > > SELECT nextval('s') FROM generate_series(1,100); > > > > If there's a concurrent checkpoint, this may add an extra WAL record, > > affecting the decoded output (and also the data stored in the sequence > > relation itself). Not sure what to do about this ... > > > > I am also not sure what to do for it but maybe if in some way we can > increase checkpoint timeout or other parameters for these tests then > it would reduce the chances of such failures. The other idea could be > to perform checkpoint before the start of tests to reduce the > possibility of another checkpoint.
> One more thing, I notice while checking the commit for this feature is that the below include seems to be out of order: --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -42,6 +42,7 @@ #include "replication/reorderbuffer.h" #include "replication/snapbuild.h" #include "storage/standby.h" +#include "commands/sequence.h" -- With Regards, Amit Kapila.
RE: Skipping logical replication transactions on subscriber side
On Friday, March 11, 2022 5:20 PM Masahiko Sawada wrote: > I've attached an updated version patch. This patch can be applied on top of > the > latest disable_on_error patch[1]. Hi, thank you for the patch. I'll share my review comments on v13. (a) src/backend/commands/subscriptioncmds.c @@ -84,6 +86,8 @@ typedef struct SubOpts bool streaming; bool twophase; bool disableonerr; + XLogRecPtr lsn; /* InvalidXLogRecPtr for resetting purpose, + * otherwise a valid LSN */ I think this explanation is slightly odd and can be improved. Strictly speaking, I feel a *valid* LSN is for the purpose of skipping a transaction, from the functional perspective. Also, the wording "resetting purpose" is unclear by itself. I'd suggest the change below. From: InvalidXLogRecPtr for resetting purpose, otherwise a valid LSN To: A valid LSN when we skip transaction, otherwise InvalidXLogRecPtr (b) The code position of the additional append in describeSubscriptions + + /* Skip LSN is only supported in v15 and higher */ + if (pset.sversion >= 15) + appendPQExpBuffer(&buf, + ", subskiplsn AS \"%s\"\n", + gettext_noop("Skip LSN")); I suggest moving this code to just after subdisableonerr. (c) parse_subscription_options + /* Parse the argument as LSN */ + lsn = DatumGetTransactionId(DirectFunctionCall1(pg_lsn_in, Here, shouldn't we call DatumGetLSN instead of DatumGetTransactionId? (d) parse_subscription_options + if (strcmp(lsn_str, "none") == 0) + { + /* Setting lsn = NONE is treated as resetting LSN */ + lsn = InvalidXLogRecPtr; + } + We should remove this pair of curly brackets, since it encloses a single statement. (e) src/backend/replication/logical/worker.c + * to skip applying the changes when starting to apply changes. The subskiplsn is + * cleared after successfully skipping the transaction or applying non-empty + * transaction, where the later avoids the mistakenly specified subskiplsn from + * being left.
typo "the later" -> "the latter" At the same time, I feel the last part of this sentence can be an independent sentence. From: , where the later avoids the mistakenly specified subskiplsn from being left To: . The latter prevents the mistakenly specified subskiplsn from being left * Note that my comments below are applied if we choose we don't merge disable_on_error test with skip lsn tests. (f) src/test/subscription/t/030_skip_xact.pl +use Test::More tests => 4; It's better to utilize the new style for the TAP test. Then, probably we should introduce done_testing() at the end of the test. (g) src/test/subscription/t/030_skip_xact.pl I think there's no need to create two types of subscriptions. Just one subscription with two_phase = on and streaming = on would be sufficient for the tests(normal commit, commit prepared, stream commit cases). I think this point of view will reduce the number of the table and the publication, which will make the whole test simpler. Best Regards, Takamichi Osumi
Re: logical decoding and replication of sequences
On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra wrote: > > On 3/7/22 22:25, Tomas Vondra wrote: > >> > >> Interesting. I can think of one reason that might cause this - we log > >> the first sequence increment after a checkpoint. So if a checkpoint > >> happens in an unfortunate place, there'll be an extra WAL record. On > >> slow / busy machines that's quite possible, I guess. > >> > > > > I've tweaked the checkpoint_interval to make checkpoints more aggressive > > (set it to 1s), and it seems my hunch was correct - it produces failures > > exactly like this one. The best fix probably is to just disable decoding > > of sequences in those tests that are not aimed at testing sequence decoding. > > > > I've pushed a fix for this, adding "include-sequences=0" to a couple > test_decoding tests, which were failing with concurrent checkpoints. > > Unfortunately, I realized we have a similar issue in the "sequences" > tests too :-( Imagine you do a series of sequence increments, e.g. > > SELECT nextval('s') FROM generate_series(1,100); > > If there's a concurrent checkpoint, this may add an extra WAL record, > affecting the decoded output (and also the data stored in the sequence > relation itself). Not sure what to do about this ... > I am also not sure what to do for it but maybe if in some way we can increase checkpoint timeout or other parameters for these tests then it would reduce the chances of such failures. The other idea could be to perform checkpoint before the start of tests to reduce the possibility of another checkpoint. -- With Regards, Amit Kapila.
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
> > > > Ok. I agree that it is difficult to interpret it correctly. So even if > > say that a new checkpoint has been explicitly requested, the user may > > not understand that it affects current checkpoint behaviour unless the > > user knows the internals of the checkpoint. How about naming the field > > to 'throttled' (Yes/ No) since our objective is to show that the > > current checkpoint is throttled or not. > > -1 > > That "throttled" flag should be the same as having or not a "force" in the > flags. We should be consistent and report information the same way, so either > a lot of flags (is_throttled, is_force...) or as now a single field containing > the set flags, so the current approach seems better. Also, it wouldn't be > much > better to show the checkpoint as not having the force flags and still not > being > throttled. I think your understanding is wrong here. The flag which affects throttling behaviour is CHECKPOINT_IMMEDIATE. I am not suggesting removing the existing 'flags' field of pg_stat_progress_checkpoint view and adding a new field 'throttled'. The content of the 'flags' field remains the same. I was suggesting replacing the 'next_flags' field with 'throttled' field since the new request with CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint. > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be > used > to detect that someone wants a new checkpoint afterwards, whatever it's and > whether or not the current checkpoint to be finished quickly. For this flag I > think it's better to not report it in the view flags but with a new field, as > discussed before, as it's really what it means. I understand your suggestion of adding a new field to indicate whether any of the new requests have been made or not. You just want this field to represent only a new request or does it also represent the current checkpoint to finish quickly. 
> CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in > progress checkpoint, so it can be simply added to the view flags. As discussed upthread, it is not advisable to do so. The content of 'flags' remains the same throughout the checkpoint. We cannot add a new checkpoint's flag (CHECKPOINT_IMMEDIATE) to the current one even though it affects current checkpoint behaviour. The only thing we can do is add a new field to show that the current checkpoint is affected by new requests. > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED | > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the > view? Where do you want to add this in the path? I feel the new field name is confusing here. 'next_flags' - It shows all the flag values of the next checkpoint. Based on this, the user can tell that a new request has been made, and if CHECKPOINT_IMMEDIATE is enabled here, that the current checkpoint is also affected. You are not OK with this name as it confuses the user. 'throttled' - The value will be set to Yes/No based on the CHECKPOINT_IMMEDIATE bit set in the new checkpoint request's flags. This says that the current checkpoint is affected, and I also thought it would indicate that new requests have been made. But there is confusion here too. If the current checkpoint starts with CHECKPOINT_IMMEDIATE, which is described by the 'flags' field, and there is no new request, then the value of this field is 'Yes' (not throttling), which again confuses the user. 'new request' - The value will be set to Yes/No based on whether any new checkpoint request has been made. This just indicates whether new requests have been made or not. It cannot be used to infer other information. Thoughts? Thanks & Regards, Nitin Jadhav On Fri, Mar 11, 2022 at 3:34 PM Julien Rouhaud wrote: > > On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote: > > > > Ok. I agree that it is difficult to interpret it correctly.
So even if > > say that a new checkpoint has been explicitly requested, the user may > > not understand that it affects current checkpoint behaviour unless the > > user knows the internals of the checkpoint. How about naming the field > > to 'throttled' (Yes/ No) since our objective is to show that the > > current checkpoint is throttled or not. > > -1 > > That "throttled" flag should be the same as having or not a "force" in the > flags. We should be consistent and report information the same way, so either > a lot of flags (is_throttled, is_force...) or as now a single field containing > the set flags, so the current approach seems better. Also, it wouldn't be > much > better to show the checkpoint as not having the force flags and still not > being > throttled. > > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED | > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the > view? > > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be > used > to detect that someone wants a new checkpoint afterwards, whatever it
Re: On login trigger: take three
> On 15 Feb 2022, at 11:07, Greg Nancarrow wrote: > I've attached a re-based version (no functional changes from the > previous) to fix cfbot failures. Thanks for adopting this patch, I took another look at it today and have some comments. +This flag is used internally by PostgreSQL and should not be manually changed by DBA or application. I think we should reword this to something like "This flag is used internally by PostgreSQL and should not be altered or relied upon for monitoring". We really don't want anyone touching or interpreting this value since there is no guarantee that it will be accurate. + new_record[Anum_pg_database_dathasloginevt - 1] = BoolGetDatum(datform->dathasloginevt); Since the corresponding new_record_repl value is false, this won't do anything and can be removed. We will use the old value anyway. + if (strcmp(eventname, "login") == 0) I think this block warrants a comment on why it only applies to login triggers. + db = (Form_pg_database) GETSTRUCT(tuple); + if (!db->dathasloginevt) + { + db->dathasloginevt = true; + CatalogTupleUpdate(pg_db, &tuple->t_self, tuple); + } + else + CacheInvalidateRelcacheByTuple(tuple); This needs a CommandCounterIncrement inside the if () { .. } block, right? + /* Get the command tag. */ + tag = (event == EVT_Login) ? CMDTAG_LOGIN : CreateCommandTag(parsetree); To match project style I think this should be an if-then-else block. Also, while it's tempting to move this before the assertion block in order to reuse it there, it does mean that we are calling this in a hot codepath before we know if it's required. I think we should restore the previous programming with a separate CreateCommandTag call inside the assertion block and move this one back underneath the fast-path exit.
+ CacheInvalidateRelcacheByTuple(tuple); + } + table_close(pg_db, RowExclusiveLock); + heap_freetuple(tuple); + } + } + CommitTransactionCommand(); Since we are committing the transaction just after closing the table, is the relcache invalidation going to achieve much? I guess registering the event doesn't hurt much? + /* + * There can be a race condition: a login event trigger may + * have been added after the pg_event_trigger table was + * scanned, and we don't want to erroneously clear the + * dathasloginevt flag in this case. To be sure that this + * hasn't happened, repeat the scan under the pg_database + * table lock. + */ + AcceptInvalidationMessages(); + runlist = EventTriggerCommonSetup(NULL, + EVT_Login, "login", + &trigdata); It seems conceptually wrong to allocate this in the TopTransactionContext when we only generate the runlist to throw it away. At the very least I think we should free the returned list. + if (runlist == NULL) /* list is still empty, so clear the This should check for NIL and not NULL. Have you done any benchmarking on this patch? With this version I see a small slowdown on connection time of ~1.5% using pgbench for the case where there are no login event triggers defined. With a login event trigger calling an empty plpgsql function there is a ~30% slowdown (corresponding to ~1ms on my system). When there is a login event trigger defined all bets are off however, since the function called can be arbitrarily complex and it's up to the user to measure and decide - but the bare overhead we impose is still of interest of course. -- Daniel Gustafsson https://vmware.com/
RE: Handle infinite recursion in logical replication setup
Hi Vignesh, While considering the second problem, I got quite confused about it. I'd be happy if you could answer my question. > To handle this if user has specified only_local option, we could throw > a warning or error out while creating subscription in this case, we > could have a column srreplicateddata in pg_subscription_rel which > could indicate if the table has any replicated data or not: > postgres=# select * from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn | srreplicateddata > -+-++---+-- >16389 | 16384 | r | 0/14A4640 |t >16389 | 16385 | r | 0/14A4690 |f > (1 row) > In the above example, srreplicateddata with true indicates, table t1 > whose relid is 16384 has replicated data and the other row having > srreplicateddata as false indicates table t2 whose relid is 16385 > does not have replicated data. > When creating a new subscription, the subscriber will connect to the > publisher and check if the relation has replicated data by checking > srreplicateddata in pg_subscription_rel table. > If the table has any replicated data, log a warning or error for this. IIUC srreplicateddata represents whether the subscribed data was generated not by the publisher but by another node. My first impression was that the name 'srreplicateddata' is not friendly, because all subscribed data is replicated from the publisher. Also, I was not sure how the value of the column is set. IIUC filtering by replication origin is done on the publisher node, and the subscriber node cannot know whether some data was actually filtered or not. If we distinguish this by the subscriber option publish_local_only, your example cannot be reproduced, because the same subscriber would have different 'srreplicateddata' values. Best Regards, Hayato Kuroda FUJITSU LIMITED
Fix typo in progress reporting doc
Hi, I have observed that the table naming conventions used in 'progress-reporting.html' are not consistent across different sections. In some cases "Phases" (Table 28.37. CREATE INDEX Phases) is used, and in others "phases" (Table 28.35. ANALYZE phases). I have attached a patch to correct this. Thanks and Regards, Nitin Jadhav 0001-fix-typo-in-progress-reporting-doc.patch Description: Binary data
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar wrote: > > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas wrote: > > > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar wrote: > > > I have completely changed the logic for this refactoring. Basically, > > > write_relmap_file(), is already having parameters to control whether > > > to write wal, send inval and we are already passing the dbpath. > > > Instead of making a new function I just pass one additional parameter > > > to this function itself about whether we are creating a new map or not > > > and I think with that changes are very less and this looks cleaner to > > > me. Similarly for load_relmap_file() also I just had to pass the > > > dbpath and memory for destination map. Please have a look and let me > > > know your thoughts. > > > > It's not terrible, but how about something like the attached instead? > > I think this has the effect of reducing the number of cases that the > > low-level code needs to know about from 2 to 1, instead of making it > > go up from 2 to 3. > > Yeah this looks cleaner, I will rebase the remaining patch. Here is the updated version of the patch set. Changes, 1) it take Robert's patch as first refactoring patch 2) Rebase other new relmapper apis on top of that in 0002 3) Some code refactoring in main patch 0005 and also one problem fix, earlier in wal log method I have removed ForceSyncCommit(), but IMHO that is equally valid whether we use file_copy or wal_log because in both cases we are creating the disk files. 4) Support strategy in createdb tool and add test case as part of 0006. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From 7bcc39c740a2a737f1ebd6c7b0441c9df4fab6d3 Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Fri, 11 Mar 2022 09:03:09 +0530 Subject: [PATCH v13 1/6] Refactor relmap load and relmap write functions Currently, relmap reading and writing interfaces are tightly coupled with shared_map and local_map of the database it is connected to. 
But as higher level patch set we need interfaces where we can read relmap into any input memory and while writing also we should be able to pass the map. So as part of this patch, we are doing refactoring of the existing code such that we can expose the read and write interfaces that are independent of the shared_map and the local_map, without changing any logic. Author: Robert Haas --- src/backend/utils/cache/relmapper.c | 147 ++-- 1 file changed, 74 insertions(+), 73 deletions(-) diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c index 4f6811f..f172f61 100644 --- a/src/backend/utils/cache/relmapper.c +++ b/src/backend/utils/cache/relmapper.c @@ -136,9 +136,11 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, bool add_okay); static void merge_map_updates(RelMapFile *map, const RelMapFile *updates, bool add_okay); -static void load_relmap_file(bool shared, bool lock_held); -static void write_relmap_file(bool shared, RelMapFile *newmap, - bool write_wal, bool send_sinval, bool preserve_files, +static void read_relmap_file(bool shared, bool lock_held); +static void load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, + int elevel); +static void write_relmap_file(RelMapFile *newmap, bool write_wal, + bool send_sinval, bool preserve_files, Oid dbid, Oid tsid, const char *dbpath); static void perform_relmap_update(bool shared, const RelMapFile *updates); @@ -405,12 +407,12 @@ RelationMapInvalidate(bool shared) if (shared) { if (shared_map.magic == RELMAPPER_FILEMAGIC) - load_relmap_file(true, false); + read_relmap_file(true, false); } else { if (local_map.magic == RELMAPPER_FILEMAGIC) - load_relmap_file(false, false); + read_relmap_file(false, false); } } @@ -425,9 +427,9 @@ void RelationMapInvalidateAll(void) { if (shared_map.magic == RELMAPPER_FILEMAGIC) - load_relmap_file(true, false); + read_relmap_file(true, false); if (local_map.magic == RELMAPPER_FILEMAGIC) - load_relmap_file(false, 
false); + read_relmap_file(false, false); } /* @@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void) Assert(pending_local_updates.num_mappings == 0); /* Write the files; no WAL or sinval needed */ - write_relmap_file(true, &shared_map, false, false, false, - InvalidOid, GLOBALTABLESPACE_OID, NULL); - write_relmap_file(false, &local_map, false, false, false, + write_relmap_file(&shared_map, false, false, false, + InvalidOid, GLOBALTABLESPACE_OID, "global"); + write_relmap_file(&local_map, false, false, false, MyDatabaseId, MyDatabaseTableSpace, DatabasePath); } @@ -612,7 +614,7 @@ RelationMapInitializePhase2(void) /* * Load the shared map file, die on error. */ - load_relmap_file(true, false); + read_relmap_file(true, false); } /* @@ -633,7 +635,7 @@ RelationMapInitializePhase3(void) /* * Load the local map fil
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote: > > Ok. I agree that it is difficult to interpret it correctly. So even if > say that a new checkpoint has been explicitly requested, the user may > not understand that it affects current checkpoint behaviour unless the > user knows the internals of the checkpoint. How about naming the field > to 'throttled' (Yes/ No) since our objective is to show that the > current checkpoint is throttled or not. -1 That "throttled" flag should be the same as having or not having "force" in the flags. We should be consistent and report information the same way, so either a lot of flags (is_throttled, is_force...) or, as now, a single field containing the set flags, so the current approach seems better. Also, it wouldn't be much better to show the checkpoint as not having the force flags and still not being throttled. Why not just report (ckpt_flags & (CHECKPOINT_REQUESTED | CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the view? CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used to detect that someone wants a new checkpoint afterwards, whatever it is, and whether or not they want the current checkpoint to finish quickly. For this flag I think it's better to not report it in the view flags but in a new field, as discussed before, as that's really what it means. CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already-in-progress checkpoint, so it can be simply added to the view flags.
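To make the bit arithmetic in this proposal concrete, here is a small standalone C sketch. The flag values mirror the bit tests used in the SQL proposed elsewhere in this thread, but the server's xlog.h is authoritative, and the two helper functions are invented for illustration, not actual backend code:

```c
#include <assert.h>

/* Checkpoint request flags with the bit values implied by the SQL
 * bit tests used in this thread (see src/include/access/xlog.h for
 * the authoritative definitions). */
#define CHECKPOINT_IMMEDIATE  0x0004 /* do it without throttling */
#define CHECKPOINT_FORCE      0x0008
#define CHECKPOINT_REQUESTED  0x0040 /* someone asked for a checkpoint */

/* The proposed expression: has a new checkpoint been requested,
 * possibly one that asks the current one to finish quickly? */
int new_request_pending(int new_flags)
{
    return (new_flags & (CHECKPOINT_REQUESTED | CHECKPOINT_IMMEDIATE)) != 0;
}

/* CHECKPOINT_IMMEDIATE is the only flag that retroactively changes an
 * in-progress checkpoint: it runs unthrottled if either its own flags
 * or a later request carry that bit. */
int current_ckpt_unthrottled(int cur_flags, int new_flags)
{
    return ((cur_flags | new_flags) & CHECKPOINT_IMMEDIATE) != 0;
}
```

This is the crux of the naming debate above: a single 'throttled' boolean conflates the current checkpoint's own flags with bits arriving from a later request, whereas reporting the two inputs separately keeps them distinguishable.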
Re: Column Filtering in Logical Replication
On Fri, Mar 11, 2022 at 7:26 AM Tomas Vondra wrote: > > On 3/10/22 20:10, Tomas Vondra wrote: > > > > > > FWIW I think the reason is pretty simple - pgoutput_row_filter_init is > > broken. It assumes you can just do this > > > > rftuple = SearchSysCache2(PUBLICATIONRELMAP, > > ObjectIdGetDatum(entry->publish_as_relid), > > ObjectIdGetDatum(pub->oid)); > > > > if (HeapTupleIsValid(rftuple)) > > { > > /* Null indicates no filter. */ > > rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple, > > Anum_pg_publication_rel_prqual, > > &pub_no_filter); > > } > > else > > { > > pub_no_filter = true; > > } > > > > > > and pub_no_filter=true means there's no filter at all. Which is > > nonsense, because we're using publish_as_relid here - the publication > > may not include this particular ancestor, in which case we need to just > > ignore this publication. > > > > So yeah, this needs to be reworked. > > > > I spent a bit of time looking at this, and I think a minor change in > get_rel_sync_entry() fixes this - it's enough to ensure rel_publications > only includes publications that actually include publish_as_relid. > Thanks for looking into this. I think in the first patch before calling get_partition_ancestors() we need to ensure it is a partition (the call expects that) and pubviaroot is true. I think it would be good if we can avoid an additional call to get_partition_ancestors() as it could be costly. I wonder why it is not sufficient to ensure that publish_as_relid exists after ancestor in ancestors list before assigning the ancestor to publish_as_relid? This only needs to be done in case of (if (!publish)). I have not tried this so I could be wrong. -- With Regards, Amit Kapila.
Re: BufferAlloc: don't take two simultaneous locks
В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет: > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi > wrote in > > Thanks! I looked into the dynahash part. > > > > struct HASHHDR > > { > > - /* > > - * The freelist can become a point of contention in high-concurrency > > hash > > > > Why did you move around the freeList? > > > > > > - long nentries; /* number of entries in > > associated buckets */ > > + long nfree; /* number of free entries in > > the list */ > > + long nalloced; /* number of entries > > initially allocated for > > > > Why do we need nfree? HASH_ASSIGN should do the same thing as > > HASH_REMOVE. Maybe the reason is the code tries to put the detached > > bucket to a different free list, but we can just remember the > > freelist_idx for the detached bucket as we do for hashp. I think that > > should largely reduce the footprint of this patch. > > > > -static void hdefault(HTAB *hashp); > > +static void hdefault(HTAB *hashp, bool partitioned); > > > > That optimization may work even a bit, but isn't it irrelevant to > > this patch? (forgot to answer in previous letter). Yes, the third commit is very optional. But adding `nalloced` to `FreeListData` increases allocation a lot even for usual non-shared non-partitioned dynahashes. And this allocation is quite huge right now for no meaningful reason. > > > > + case HASH_REUSE: > > + if (currBucket != NULL) > > + { > > + /* check there is no unfinished > > HASH_REUSE+HASH_ASSIGN pair */ > > + Assert(DynaHashReuse.hashp == NULL); > > + Assert(DynaHashReuse.element == NULL); > > > > I think all cases in the switch(action) other than HASH_ASSIGN need > > this assertion, and there is no need to check both; maybe only the element > > would be enough. > While I looked at the buf_table part, I came up with additional comments. > BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id) > { > hash_search_with_hash_value(SharedBufHash, > > HASH_ASSIGN, > ...
> BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse) > > BufTableDelete considers both reuse and !reuse cases but > BufTableInsert doesn't and always does HASH_ASSIGN. That looks > odd. We should use HASH_ENTER here. Thus I think it is more > reasonable that HASH_ENTER uses the stashed entry if it exists and > is needed, or returns it to the freelist if it exists but is not needed. > > What do you think about this? Well... I don't like it but I don't mind either. The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot. On the other hand, it is probably possible to merge them carefully. I'll try. - regards Yura Sokolov
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
> > I just wanted to avoid extra calculations just to show the progress in > > the view. Since it's a good metric, I have added an additional field > > named 'next_flags' to the view which holds all possible flag values of > > the next checkpoint. > > I still don't think that's ok. IIUC the only way to know if the current > checkpoint is throttled or not is to be aware that the "next_flags" can apply > to the current checkpoint too, look for it and see if that changes the > semantics of what the view say the current checkpoint is. Most users will get > it wrong. > > Again I would just display a bool flag saying whether a new checkpoint has > been > explicitly requested or not, it seems enough. Ok. I agree that it is difficult to interpret it correctly. So even if say that a new checkpoint has been explicitly requested, the user may not understand that it affects current checkpoint behaviour unless the user knows the internals of the checkpoint. How about naming the field to 'throttled' (Yes/ No) since our objective is to show that the current checkpoint is throttled or not. Thanks & Regards, Nitin Jadhav On Wed, Mar 9, 2022 at 7:48 PM Julien Rouhaud wrote: > > On Tue, Mar 08, 2022 at 08:57:23PM +0530, Nitin Jadhav wrote: > > > > I just wanted to avoid extra calculations just to show the progress in > > the view. Since it's a good metric, I have added an additional field > > named 'next_flags' to the view which holds all possible flag values of > > the next checkpoint. > > I still don't think that's ok. IIUC the only way to know if the current > checkpoint is throttled or not is to be aware that the "next_flags" can apply > to the current checkpoint too, look for it and see if that changes the > semantics of what the view say the current checkpoint is. Most users will get > it wrong. > > > This gives more information than just saying > > whether the new checkpoint is requested or not with the same memory. > > So that next_flags will be empty most of the time? 
It seems confusing. > > Again I would just display a bool flag saying whether a new checkpoint has > been > explicitly requested or not, it seems enough. > > If you're interested in that next checkpoint, you probably want a quick > completion of the current checkpoint first (and thus need to know if it's > throttled or not). And then you will have to keep monitoring that view for > the > next checkpoint anyway, and at that point the view will show the relevant > information.
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
> > The current format matches with the server log message for the > > checkpoint start in LogCheckpointStart(). Just to be consistent, I > > have not changed the code. > > > > See below, how flags are shown in other sql functions like: > > ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1); >raw_flags| combined_flags > -+ > {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} > (1 row) > > This looks more readable and it's easy to understand for the > end-users.. Further comparing the way log messages are displayed with > the way sql functions display its output doesn't look like a right > comparison to me. Obviously both should show matching data but the way > it is shown doesn't need to be the same. In fact it is not in most of > the cases. ok. I will take care in the next patch. I would like to handle this at the SQL level in system_views.sql. The following can be used to display in the format described above. ( '{' || CASE WHEN (S.param2 & 4) > 0 THEN 'immediate' ELSE '' END || CASE WHEN (S.param2 & 4) > 0 AND (S.param2 & -8) > 0 THEN ', ' ELSE '' END || CASE WHEN (S.param2 & 8) > 0 THEN 'force' ELSE '' END || CASE WHEN (S.param2 & 8) > 0 AND (S.param2 & -16) > 0 THEN ', ' ELSE '' END || CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all' ELSE '' END || CASE WHEN (S.param2 & 16) > 0 AND (S.param2 & -32) > 0 THEN ', ' ELSE '' END || CASE WHEN (S.param2 & 32) > 0 THEN 'wait' ELSE '' END || CASE WHEN (S.param2 & 32) > 0 AND (S.param2 & -128) > 0 THEN ', ' ELSE '' END || CASE WHEN (S.param2 & 128) > 0 THEN 'wal' ELSE '' END || CASE WHEN (S.param2 & 128) > 0 AND (S.param2 & -256) > 0 THEN ', ' ELSE '' END || CASE WHEN (S.param2 & 256) > 0 THEN 'time' ELSE '' END || '}' Basically, a separate CASE statement is used to decide whether a comma has to be printed or not, which is done by checking whether the previous flag bit is enabled (so that the appropriate flag has to be displayed) and if any next bits are enabled (So there are some more flags to be displayed). 
Kindly let me know if you know any other better approach.

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 9, 2022 at 7:07 PM Ashutosh Sharma wrote:
>
> On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav wrote:
> >
> > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > > > -[ RECORD 1 ]-+-------------------------------------
> > > > pid           | 22043
> > > > type          | checkpoint
> > > > kind          | immediate force wait requested time
> > > >
> > > > I think the output in the kind column can be displayed as {immediate,
> > > > force, wait, requested, time}. By the way these are all checkpoint
> > > > flags so it is better to display it as checkpoint flags instead of
> > > > checkpoint kind as mentioned in one of my previous comments.
> > >
> > > I will update in the next patch.
> >
> > The current format matches with the server log message for the
> > checkpoint start in LogCheckpointStart(). Just to be consistent, I
> > have not changed the code.
>
> See below, how flags are shown in other sql functions like:
>
> ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
>                 raw_flags                | combined_flags
> -----------------------------------------+----------------
>  {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
> (1 row)
>
> This looks more readable and it's easy to understand for the
> end-users. Further, comparing the way log messages are displayed with
> the way sql functions display their output doesn't look like a right
> comparison to me. Obviously both should show matching data, but the
> way it is shown doesn't need to be the same. In fact it is not in most
> of the cases.
>
> > I have taken care of the rest of the comments in the v5 patch for
> > which there was clarity.
>
> Thank you very much. Will take a look at it later.
>
> --
> With Regards,
> Ashutosh Sharma.
Re: BufferAlloc: don't take two simultaneous locks
On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov wrote in
> > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > Ok, here is v4.
> >
> > And here is v5.
> >
> > First, there was a compilation error in an Assert in dynahash.c.
> > Excuse me for not checking before sending the previous version.
> >
> > Second, I added a third commit that reduces the HASHHDR allocation
> > size for non-partitioned dynahash:
> > - moved freeList to the last position
> > - alloc and memset offsetof(HASHHDR, freeList[1]) for
> >   non-partitioned hash tables.
> > I didn't benchmark it, but I will be surprised if it
> > matters much in a performance sense.
> >
> > Third, I put all three commits into a single file so as not to
> > confuse the commitfest application.
>
> Thanks! I looked into the dynahash part.
>
>  struct HASHHDR
>  {
> -	/*
> -	 * The freelist can become a point of contention in high-concurrency hash
>
> Why did you move around the freeList?
>
> -	long		nentries;	/* number of entries in associated buckets */
> +	long		nfree;		/* number of free entries in the list */
> +	long		nalloced;	/* number of entries initially allocated for
>
> Why do we need nfree? HASH_ASSIGN should do the same thing with
> HASH_REMOVE. Maybe the reason is the code tries to put the detached
> bucket on a different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp. I think that
> should largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
"freeList" partition and the new one. That is two
freeList[partition]->mutex lock+unlock pairs. But the count of free
elements doesn't change, so if we change nentries to nfree, then there is
no need to fix the freeList[partition]->nfree counters, and no need to
lock+unlock.

> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
>
> That optimization may work even a bit, but isn't it unrelated to
> this patch?
> > +		case HASH_REUSE:
> > +			if (currBucket != NULL)
> > +			{
> > +				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> > +				Assert(DynaHashReuse.hashp == NULL);
> > +				Assert(DynaHashReuse.element == NULL);
>
> I think all cases in the switch(action) other than HASH_ASSIGN need
> this assertion, and there is no need to check both; maybe checking only
> the element would be enough.

Agree.
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote:
> On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud wrote:
> >
> > This could use XLogRecGetBlock? Note that this macro is for now never
> > used. xlogreader.c also has some similar forgotten code that could use
> > XLogRecMaxBlockId.
>
> That is true, but I was thinking of it like this: most of the existing
> code that interacts with xlogreader.c is working with the old model,
> where the XLogReader object holds only one "current" record. For that
> reason the XLogRecXXX() macros continue to work as before, implicitly
> referring to the record that XLogReadRecord() most recently returned.
> For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
> even when referring to the "current" record, since xlogreader.c has
> switched to a new multi-record model. In other words, they're sort of
> 'old API' accessors provided for continuity. Does this make sense?

Ah I see, it does make sense. I'm wondering if there should be some
comment at the top of the file to mention it, as otherwise someone may be
tempted to change it to avoid some record->record->xxx usage.

> > +DecodedXLogRecord *
> > +XLogNextRecord(XLogReaderState *state, char **errormsg)
> > +{
> > [...]
> > +	/*
> > +	 * state->EndRecPtr is expected to have been set by the last call to
> > +	 * XLogBeginRead() or XLogNextRecord(), and is the location of the
> > +	 * error.
> > +	 */
> > +
> > +	return NULL;
> >
> > The comment should refer to XLogFindNextRecord, not XLogNextRecord?
>
> No, it does mean to refer to XLogNextRecord() (ie the last time
> you called XLogNextRecord and successfully dequeued a record, we put
> its end LSN there, so if there is a deferred error, that's the
> corresponding LSN). Make sense?

It does, thanks!

> > Also, is it worth an assert (likely at the top of the function) for
> > that?
>
> How could I assert that EndRecPtr has the right value?
Sorry, I meant to assert that some value was assigned
(!XLogRecPtrIsInvalid). It can only make sure that the first call is done
after XLogBeginRead / XLogFindNextRecord, but that's better than nothing
and consistent with the top comment.

> > +	if (unlikely(state->decode_buffer == NULL))
> > +	{
> > +		if (state->decode_buffer_size == 0)
> > +			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
> > +		state->decode_buffer = palloc(state->decode_buffer_size);
> > +		state->decode_buffer_head = state->decode_buffer;
> > +		state->decode_buffer_tail = state->decode_buffer;
> > +		state->free_decode_buffer = true;
> > +	}
> >
> > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and
> > use it here too? Otherwise XLogReaderSetDecodeBuffer should probably go
> > in 0002 as the only caller is the recovery prefetching.
>
> I don't think it matters much?

The thing is that for now the only caller of XLogReaderSetDecodeBuffer (in
0002) only uses it to set the length, so a buffer is actually never passed
to that function. Since frontend code can rely on a palloc emulation, is
there really a use case for using e.g. some stack buffer there, or
something in a specific memory context? That seems to be the only use case
for having XLogReaderSetDecodeBuffer() rather than simply an
XLogReaderSetDecodeBufferSize(). But overall I agree it doesn't matter
much, so no objection to keeping it as-is.

> > It's also not particularly obvious why XLogFindNextRecord() doesn't
> > check for this value. AFAICS callers don't (and should never) call it
> > with a nonblocking == true state, maybe add an assert for that?
>
> Fair point. I have now explicitly cleared that flag. (I don't much
> like state->nonblocking, which might be better as an argument to
> page_read(), but in fact I don't like the fact that page_read
> callbacks are blocking in the first place, which is why I liked
> Horiguchi-san's patch to get rid of that... but that can be a subject
> for later work.)

Agreed.
> > static void
> > ResetDecoder(XLogReaderState *state)
> > {
> > [...]
> > +	/* Reset the decoded record queue, freeing any oversized records. */
> > +	while ((r = state->decode_queue_tail))
> >
> > nit: I think it's better to explicitly check for the assignment being
> > != NULL, and existing code is more frequently written this way AFAICS.
>
> I think it's perfectly normal idiomatic C, but if you think it's
> clearer that way, OK, done like that.

The thing I don't like about this form is that you can never be sure that
an assignment was really meant unless you read the rest of the nearby
code. Other than that, agreed, it is perfectly normal idiomatic C.

> I realised that this version has broken -DWAL_DEBUG. I'll fix that
> shortly, but I wanted to post this update ASAP, so here's a new
> version.

+ * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read cal
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi wrote in
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi wrote in
> > Thanks! I looked into the dynahash part.

Then I looked into the bufmgr part. It looks fine to me, but I have some
comments on the code comments.

>	 * To change the association of a valid buffer, we'll need to have
>	 * exclusive lock on both the old and new mapping partitions.
>	if (oldFlags & BM_TAG_VALID)

We don't take a lock on the new mapping partition here.

+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid.  We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
+	 *
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). Given these statements it is safe to
+	 * clear tag since no other process can inspect it to the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc. But I think what we need to explain here is why we invalidate
the buffer here despite the fact that we are going to reuse it soon. And I
think we need to state that the old buffer is now safe to use for the new
tag here. I'm not sure the statement is really correct, but clearing out
actually looks safer.

> Now it is safe to use the victim buffer for the new tag. Invalidate the
> buffer before releasing the header lock to ensure that linear scans of
> the buffer array don't think the buffer is valid. It is safe because it
> is guaranteed that we're the single pinner of the buffer. That pin also
> prevents the buffer from being stolen by others until we reuse it or
> return it to the freelist.

So I want to revise the following comment:

-	 * Now it is safe to use victim buffer for new tag.
+	 * Now reuse victim buffer for new tag.

>	 * Make sure BM_PERMANENT is set for buffers that must be written at every
>	 * checkpoint.
>	 * Unlogged buffers only need to be written at shutdown checkpoints,
>	 * except for their "init" forks, which need to be treated just like
>	 * permanent relations.
>	 *
>	 * The usage_count starts out at 1 so that the buffer can survive one
>	 * clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Skipping logical replication transactions on subscriber side
On Thu, Mar 10, 2022 at 9:02 PM Amit Kapila wrote:
>
> On Tue, Mar 1, 2022 at 8:31 PM Masahiko Sawada wrote:
> >
> > I've attached an updated patch along with two patches for cfbot tests
> > since the main patch (0003) depends on the other two patches. Both
> > 0001 and 0002 patches are the same ones I attached on another
> > thread[2].
>
> Few comments on 0003:
> =====================
> 1.
> +       subskiplsn pg_lsn
> +
> +       Commit LSN of the transaction whose changes are to be skipped,
> +       if a valid LSN; otherwise 0/0.
>
> Can't this be prepared LSN or rollback prepared LSN? Can we say
> Finish/End LSN and then add some details about which LSNs can appear
> there?

Right, changed to finish LSN.

> 2. The conflict resolution explanation needs an update after the
> latest commits, and we should probably change the commit LSN
> terminology as mentioned in the previous point.

Updated.

> 3. The text in alter_subscription.sgml looks a bit repetitive to me
> (similar to what we have in logical-replication.sgml related to
> conflicts). Here also we refer to only commit LSN, which needs to be
> changed as mentioned in the previous two points.

Updated.

> 4.
>  	if (strcmp(lsn_str, "none") == 0)
> +	{
> +		/* Setting lsn = NONE is treated as resetting LSN */
> +		lsn = InvalidXLogRecPtr;
> +	}
> +	else
> +	{
> +		/* Parse the argument as LSN */
> +		lsn = DatumGetTransactionId(DirectFunctionCall1(pg_lsn_in,
> +														CStringGetDatum(lsn_str)));
> +
> +		if (XLogRecPtrIsInvalid(lsn))
> +			ereport(ERROR,
> +					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +					 errmsg("invalid WAL location (LSN): %s", lsn_str)));
>
> Is there a reason that we don't want to allow setting 0
> (InvalidXLogRecPtr) for skip LSN?

0 is obviously an invalid value for the skip LSN, which should not be
allowed, similar to other options (like setting '' for slot_name). Also,
we use 0 (InvalidXLogRecPtr) internally to reset subskipxid when NONE is
specified.

> 5.
> +# The subscriber will enter an infinite error loop, so we don't want
> +# to overflow the server log with error messages.
> +$node_subscriber->append_conf(
> +	'postgresql.conf',
> +	qq[
> +wal_retrieve_retry_interval = 2s
> +]);
>
> Can we change this test to use the disable_on_error feature? I am
> thinking that if the disable_on_error feature got committed first,
> maybe we can have one test file for this and the disable_on_error
> feature (something like conflicts.pl).

Good idea. Updated.

I've attached an updated version of the patch. This patch can be applied
on top of the latest disable_on_error patch[1].

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1Kes9TsMpGL6m%2BAJNHYCGRvx6piYQt5v6TEbH_t9jh8nA%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

v13-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transa.patch
Description: Binary data