Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf
On Wed, Jan 18, 2023 at 10:35:29AM +0100, Jelte Fennema wrote:
> Anything I can do to help with this? Or will you do that yourself?

So, I have taken a second look, and tweaked a few things:
- Addition of a macro for pg_strcasecmp(), to match with token_matches().
- Fixed the documentation a bit.
- Tweaked some comments and descriptions in the tests; I was rather fine with the role and group names.

Jelte, do you like this version?
--
Michael

From 0e42a5e9c2a532355df49346e47cc5612da1889d Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 19 Jan 2023 16:54:59 +0900
Subject: [PATCH v6] Support same user patterns in pg_ident.conf as in
 pg_hba.conf

While pg_hba.conf has support for non-literal username matches,
pg_ident.conf doesn't have this same functionality. This changes
permission checking in pg_ident.conf to handle all the special cases
for username checking:
1. The "all" keyword
2. Membership checks using the + prefix
3. Using a regex to match against multiple roles

This allows matching certain system users against many different
postgres users with a single line in pg_ident.conf. Without this, you
would need a line for each of the postgres users that a system user
can log in as.

There is some risk of breaking existing pg_ident.conf configs that
contained such special strings and which were treated as literal user
names. But the advantages brought by the features added in this change
far outweigh the risk of breakage. And we only risk breaking when there
exist roles that are called "all", start with a literal + or start with
a literal /. Only "all" seems like a somewhat reasonable role name, but
such a role existing seems unlikely to me, given its special meaning in
pg_hba.conf.
This compatibility change should be mentioned in the release notes.
---
 src/backend/libpq/hba.c               |  98 +++-
 src/test/authentication/t/003_peer.pl | 160 --
 doc/src/sgml/client-auth.sgml         |  27 -
 3 files changed, 246 insertions(+), 39 deletions(-)

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 029b8e4483..69f87667be 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -73,8 +73,10 @@ typedef struct
 } tokenize_error_callback_arg;

 #define token_has_regexp(t) (t->regex != NULL)
+#define token_is_member_check(t) (!t->quoted && t->string[0] == '+')
 #define token_is_keyword(t, k) (!t->quoted && strcmp(t->string, k) == 0)
 #define token_matches(t, k) (strcmp(t->string, k) == 0)
+#define token_matches_insensitive(t, k) (pg_strcasecmp(t->string, k) == 0)

 /*
  * Memory context holding the list of TokenizedAuthLines when parsing
@@ -995,10 +997,11 @@ is_member(Oid userid, const char *role)
  *
  * Each AuthToken listed is checked one-by-one.  Keywords are processed
  * first (these cannot have regular expressions), followed by regular
- * expressions (if any) and the exact match.
+ * expressions (if any), the case-insensitive match (if requested) and
+ * the exact match.
  */
 static bool
-check_role(const char *role, Oid roleid, List *tokens)
+check_role(const char *role, Oid roleid, List *tokens, bool case_insensitive)
 {
 	ListCell   *cell;
 	AuthToken  *tok;
@@ -1006,7 +1009,7 @@ check_role(const char *role, Oid roleid, List *tokens)
 	foreach(cell, tokens)
 	{
 		tok = lfirst(cell);
-		if (!tok->quoted && tok->string[0] == '+')
+		if (token_is_member_check(tok))
 		{
 			if (is_member(roleid, tok->string + 1))
 				return true;
@@ -1018,6 +1021,11 @@ check_role(const char *role, Oid roleid, List *tokens)
 			if (regexec_auth_token(role, tok, 0, NULL) == REG_OKAY)
 				return true;
 		}
+		else if (case_insensitive)
+		{
+			if (token_matches_insensitive(tok, role))
+				return true;
+		}
 		else if (token_matches(tok, role))
 			return true;
 	}
@@ -2614,7 +2622,7 @@ check_hba(hbaPort *port)
 							hba->databases))
 			continue;

-		if (!check_role(port->user_name, roleid, hba->roles))
+		if (!check_role(port->user_name, roleid, hba->roles, false))
 			continue;

 		/* Found a record that matched! */
@@ -2804,7 +2812,7 @@ parse_ident_line(TokenizedAuthLine *tok_line, int elevel)

 	/*
 	 * Now that the field validation is done, compile a regex from the user
-	 * token, if necessary.
+	 * tokens, if necessary.
 	 */
 	if (regcomp_auth_token(parsedline->system_user, file_name, line_num,
 						   err_msg, elevel))
@@ -2813,6 +2821,13 @@ parse_ident_line(TokenizedAuthLine *tok_line, int elevel)
 		return NULL;
 	}

+	if (regcomp_auth_token(parsedline->pg_user, file_name, line_num,
+						   err_msg, elevel))
+	{
+		/* err_msg includes the error to report */
+		return NULL;
+	}
+
 	return parsedline;
 }
@@ -2827,6 +2842,8 @@ check_ident_usermap(IdentLine *identLine, const char *usermap_name,
 					const char *pg_user, const char *system_user,
 					bool case_insensitive, bool *found_p, bool *error_p)
 {
+	Oid			roleid;
+
 	*found_p = false;
 	*error_p = false;
@@ -2834,6 +2851,9 @@
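Illustrating the three forms from the commit message, a hypothetical pg_ident.conf could read as follows (the map name, system users, and role names here are invented for the example, and the regex form on the system-username side is the pre-existing feature):

```
# MAPNAME   SYSTEM-USERNAME   PG-USERNAME
mymap       admin             all            # admin may connect as any role
mymap       ops               +support       # ops may connect as any member of role "support"
mymap       /^svc-(.*)$       \1             # existing: regex on the system-user side
mymap       backup            /^report_      # new: regex on the Postgres-user side
```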
Re: Inconsistency in vacuum behavior
Justin Pryzby wrote on 2023-01-19 04:49:
> On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
> > Hi,
> >
> > Currently there is no error in this case, so an additional thrown error
> > would require a new test.
> > Besides, throwing an error here does not make sense - it is just a check
> > for vacuum permission. I think the right way is to just skip a relation
> > that is not suitable for vacuum.
> > Any thoughts or objections?
>
> Could you check if this is consistent between the behavior of VACUUM
> FULL and CLUSTER?  See also Nathan's patches.

Hi.
CLUSTER behaves in a different way - it errors out immediately if the relation is not owned by the user. For a partitioned rel it would raise an error later anyway.
VACUUM and VACUUM FULL behave consistently after applying Nikita's patch (for partitioned and regular tables) - they issue the warning "skipping TABLE_NAME --- only table or database owner can vacuum it" and return a success status.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
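A sketch of the difference being compared, with hypothetical role and table names (the exact CLUSTER error wording may differ across versions):

```sql
-- Connected as role "alice", who does not own table "t":
VACUUM t;
-- WARNING:  skipping "t" --- only table or database owner can vacuum it
-- The command itself still completes with a success status.

CLUSTER t USING t_idx;
-- Errors out immediately with an ownership error instead of a warning.
```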
Re: Inconsistency in vacuum behavior
Hi!
I've found the discussion you'd mentioned before; checking now.

On Thu, Jan 19, 2023 at 4:49 AM Justin Pryzby wrote:
> On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
> > Hi,
> >
> > Currently there is no error in this case, so additional thrown error would
> > require a new test.
> > Besides, throwing an error here does not make sense - it is just a check
> > for a vacuum permission, I think the right way is to just skip a relation
> > that is not suitable for vacuum.
> > Any thoughts or objections?
>
> Could you check if this is consistent between the behavior of VACUUM
> FULL and CLUSTER?  See also Nathan's patches.
>
> --
> Justin

--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/
RE: Time delayed LR (WAS Re: logical replication restrictions)
On Wednesday, January 18, 2023 4:06 PM Peter Smith wrote:
> Here are my review comments for the latest patch v16-0001. (excluding the
> test code)

Hi, thank you for your review!

> ==
> General
>
> 1.
> Since the value of min_apply_delay cannot be < 0, I was thinking probably it
> should have been declared everywhere in this patch as a uint64 instead of
> an int64, right?

No, we won't be able to adopt this idea. It seems that we are not able to use an unsigned integer for a catalog type, so we can't apply it to the pg_subscription.h definitions, nor, similarly, to the catalog variables stored via Int64GetDatum() and the argument variable of Int64GetDatum(). Plus, there is a possibility that a value of type Interval becomes negative, so we are not able to change the int64 variable that receives the return value of interval2ms().

> ==
> Commit message
>
> 2.
> If the subscription sets min_apply_delay parameter, the logical replication
> worker will delay the transaction commit for min_apply_delay milliseconds.
>
> ~
>
> IMO there should be another sentence before this just to say that a new
> parameter is being added:
>
> e.g.
> This patch implements a new subscription parameter called
> 'min_apply_delay'.

Added.

> ==
> doc/src/sgml/config.sgml
>
> 3.
> +        For time-delayed logical replication, the apply worker sends a Standby
> +        Status Update message to the corresponding publisher per the indicated
> +        time of this parameter. Therefore, if this parameter is longer than
> +        wal_sender_timeout on the publisher, then the
> +        walsender doesn't get any update message during the delay and
> +        repeatedly terminates due to the timeout errors. Hence, make sure this
> +        parameter is shorter than the wal_sender_timeout of the
> +        publisher.
> +        If this parameter is set to zero with time-delayed replication, the
> +        apply worker doesn't send any feedback messages during the
> +        min_apply_delay.
>
> This paragraph seemed confusing. I think it needs to be reworded to change all
> of the "this parameter" references because there are at least 3 different
> parameters mentioned in this paragraph. e.g. maybe just change them to
> explicitly name the parameter you are talking about.
>
> I also think it needs to mention the 'min_apply_delay' subscription parameter
> up-front and then refer to it appropriately.
>
> The end result might be something like I wrote below (this is just my guess --
> probably you can word it better).
>
> SUGGESTION
> For time-delayed logical replication (i.e. when the subscription is created
> with parameter min_apply_delay > 0), the apply worker sends a Standby Status
> Update message to the publisher with a period of wal_receiver_status_interval.
> Make sure to set wal_receiver_status_interval less than the
> wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly
> terminate due to the timeout errors. If wal_receiver_status_interval is set
> to zero, the apply worker doesn't send any feedback messages during the
> subscriber's min_apply_delay period.

Applied. Also, I added one reference for the min_apply_delay parameter at the end of this description.

> ==
> doc/src/sgml/ref/create_subscription.sgml
>
> 4.
> +          By default, the subscriber applies changes as soon as possible. As
> +          with the physical replication feature (recovery_min_apply_delay),
> +          it can be useful to have a time-delayed logical replica. This
> +          parameter lets the user to delay the application of changes by a
> +          specified amount of time. If this value is specified without units,
> +          it is taken as milliseconds. The default is zero(no delay).
>
> 4a.
> As with the physical replication feature (recovery_min_apply_delay), it can
> be useful to have a time-delayed logical replica.
>
> IMO not sure that the above sentence is necessary. It seems only to be saying
> that this parameter can be useful. Why do we need to say that?

Removed the sentence.

> ~
>
> 4b.
> "This parameter lets the user to delay" -> "This parameter lets the user delay"
> OR
> "This parameter lets the user to delay" -> "This parameter allows the user to delay"

Fixed.

> ~
>
> 4c.
> "If this value is specified without units" -> "If the value is specified
> without units"

Fixed.

> ~
>
> 4d.
> "zero(no delay)." -> "zero (no delay)."

Fixed.

> 5.
> +          The delay occurs only on WAL records for transaction begins and
> +          after the initial table synchronization. It is possible that the
> +          replication delay between publisher and subscriber exceeds the
> +          value of this parameter, in which case no delay is added. Note that
> +          the delay is calculated between the WAL time stamp as written on
> +          publisher and the current time on the subscriber. Time spent in
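For reference, the parameter under review is exercised like this (names are illustrative; per the doc text being reviewed, a bare number is taken as milliseconds):

```sql
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (min_apply_delay = '3s');

-- Units are optional; without them the value is milliseconds.
ALTER SUBSCRIPTION mysub SET (min_apply_delay = 500);
```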
Re: Support logical replication of DDLs
On Thu, Jan 19, 2023 at 8:39 AM Zheng Li wrote:
> On Wed, Jan 18, 2023 at 6:27 AM Amit Kapila wrote:
> > On Sat, Jan 7, 2023 at 8:58 PM Zheng Li wrote:
>
> Foreign Tables can also be considered replicated with DDL replication
> because we don't even need to replicate the data as it resides on the
> external server. Users need to configure the external server to allow
> connection from the subscriber for foreign tables to work on the
> subscriber.

So, this would mean that we expect the subscriber to also have the same foreign server as the publisher, because we will replicate the entire connection/user information of the foreign server from the publisher. But what about data inserted by the publisher on the foreign server?

> > We should also think
> > about initial sync for all those objects as well.
>
> Agree, we're starting an investigation on initial sync. But I think
> initial sync depends on DDL replication to work reliably, not the other
> way around. DDL replication can work on its own without the initial sync
> of schema; users just need to set up the initial schema just like they
> would today.

The difference is that today users need to take care of all schema setup on both nodes and mirror any further schema changes made on the publisher. But with DDL replication, there has to be a point prior to which both the nodes have the same setup. For that, before setting up DDL replication, users need to ensure that both nodes have the same schema, and then during setup, the user doesn't perform any DDL on the publisher.

--
With Regards,
Amit Kapila.
Re: [PATCH] random_normal function
On 1/19/23 11:01, Tom Lane wrote:
> Andrey Lepikhov writes:
> > On 1/9/23 23:52, Tom Lane wrote:
> > > BTW, if this does bring the probability of failure down to the
> > > one-in-a-billion range, I think we could also nuke the whole
> > > "ignore:" business, simplifying pg_regress and allowing the
> > > random test to be run in parallel with others.
> >
> > We have used the pg_sleep() function to interrupt a query at a certain
> > execution phase. But on some platforms, especially in containers, the
> > query execution time can vary so widely that the pg_sleep() timeout,
> > required to get rid of the dependency on query execution time, has
> > become unacceptable. So, the "ignore" option was the best choice.
>
> But does such a test have any actual value? If your test infrastructure
> ignores the result, what makes you think you'd notice if the test did
> indeed detect a problem?

Yes, it is good to catch SEGFAULTs and assertion failures, which may be frequent because of the logic complexity in the case of timeouts.

> I think "ignore:" was a kluge we put in twenty-plus years ago when our
> testing standards were a lot lower, and it's way past time we got rid
> of it.

Ok, I will try to invent an alternative way for deep (and stable) testing of timeouts. Thank you for the answer.

--
Regards,
Andrey Lepikhov
Postgres Professional
RE: Time delayed LR (WAS Re: logical replication restrictions)
On Thursday, January 19, 2023 10:49 AM Peter Smith wrote:
> On Wed, Jan 18, 2023 at 6:06 PM Peter Smith wrote:
> >
> > Here are my review comments for the latest patch v16-0001. (excluding
> > the test code)
>
> And here are some review comments for the v16-0001 test code.

Hi, thanks for your review!

> ==
> src/test/regress/sql/subscription.sql
>
> 1. General
> For all comments
>
> "time delayed replication" -> "time-delayed replication" maybe is better?

Fixed.

> ~~~
>
> 2.
> -- fail - utilizing streaming = parallel with time delayed replication is
> not supported.
>
> For readability please put a blank line before this test.

Fixed.

> ~~~
>
> 3.
> -- success -- value without unit is taken as milliseconds
>
> "value" -> "min_apply_delay value"

Fixed.

> ~~~
>
> 4.
> -- success -- interval is converted into ms and stored as integer
>
> "interval" -> "min_apply_delay interval"
> "integer" -> "an integer"

Both are fixed.

> ~~~
>
> 5.
> You could also add another test where min_apply_delay is 0
>
> Then the following combination can be confirmed OK -- success create
> subscription with (streaming=parallel, min_apply_delay=0)

This combination is added with the modification for #6.

> ~~~
>
> 6.
> -- fail - alter subscription with min_apply_delay should fail when
> streaming = parallel is set.
> CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
> streaming = parallel);
>
> There is another way to do this test without creating a brand-new
> subscription. You could just alter the existing subscription like:
> ALTER ... SET (min_apply_delay = 0)
> then ALTER ... SET (streaming = parallel)
> then ALTER ... SET (min_apply_delay = 123)

Fixed.

> ==
>
> src/test/subscription/t/032_apply_delay.pl
>
> 7. sub check_apply_delay_log
>
> my ($node_subscriber, $message, $expected) = @_;
>
> Why pass in the message text? It is always the same, so it can be
> hardwired in this function, right?

Fixed.

> ~~~
>
> 8.
> # Get the delay time in the server log
>
> "in the server log" -> "from the server log" (?)

Fixed.

> ~~~
>
> 9.
> qr/$message: (\d+) ms/
>   or die "could not get delayed time";
> my $logged_delay = $1;
>
> # Is it larger than expected?
> cmp_ok($logged_delay, '>', $expected,
>   "The wait time of the apply worker is long enough expectedly");
>
> 9a.
> "could not get delayed time" -> "could not get the apply worker wait time"
>
> 9b.
> "The wait time of the apply worker is long enough expectedly" -> "The
> apply worker wait time has expected duration"

Both are fixed.

> ~~~
>
> 10.
> sub check_apply_delay_time
>
> Maybe a brief explanatory comment for this function is needed to explain
> the unreplicated column c.

Added.

> ~~~
>
> 11.
> $node_subscriber->safe_psql('postgres',
>   "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> application_name=$appname' PUBLICATION tap_pub WITH (streaming = on,
> min_apply_delay = '3s')"
>
> I think there should be a comment here highlighting that you are setting
> up a subscriber time delay of 3 seconds, and then later you can better
> describe the parameters for the checking functions...

Added a comment for the CREATE SUBSCRIPTION command.

> e.g. (add this comment)
> # verifies that the subscriber lags the publisher by at least 3 seconds
> check_apply_delay_time($node_publisher, $node_subscriber, '5', '3');
>
> e.g.
> # verifies that the subscriber lags the publisher by at least 3 seconds
> check_apply_delay_time($node_publisher, $node_subscriber, '8', '3');

Added.

> ~~~
>
> 12.
> # Test whether ALTER SUBSCRIPTION changes the delayed time of the apply
> worker (1 day 1 minute).
> $node_subscriber->safe_psql('postgres',
>   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)");
>
> Update the comment with another note.
> # Note - The extra 1 min is to account for any decoding/network overhead.

Okay, added the comment. In general, TAP tests fail if we wait for more than 3 minutes, so we should think about whether the maximum consumed time can safely stay under 3 minutes. For example, if (which should not happen usually, but) we consumed more than 1 minute between this ALTER SUBSCRIPTION SET and the check_apply_delay_log() below, then the test would fail. So I made the extra time bigger.

> ~~~
>
> 13.
> # Make sure we have long enough min_apply_delay after the ALTER command
> check_apply_delay_log($node_subscriber, "logical replication apply delay",
> "8000");
>
> IMO the expectation of 1 day (86400000 ms) wait time might be a better
> number for your "expected" value.
>
> So update the comment/call like this:
>
> # Make sure the apply worker knows to wait for more than 1 day (86400000 ms)
> check_apply_delay_log($node_subscriber, "logical replication apply delay",
> "86400000");

Updated the comment and the function call. Kindly have a look at the updated
Re: Modify the document of Logical Replication configuration settings
On Wed, Jan 18, 2023 at 02:04:16PM +0530, Bharath Rupireddy wrote:
> [1]
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 89d53f2a64..6f9509267c 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -4326,7 +4326,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
>
>          Terminate replication connections that are inactive for longer
>          than this amount of time. This is useful for
> -        the sending server to detect a standby crash or network outage.
> +        the sending server to detect a standby crash or logical replication
> +        subscriber crash or network outage.
>          If this value is specified without units, it is taken as
>          milliseconds.
>          The default value is 60 seconds.
>          A value of zero disables the timeout mechanism.

Perhaps we could do that; I am not sure whether this brings much to this section, though.
--
Michael
Re: [PATCH] random_normal function
Andrey Lepikhov writes:
> On 1/9/23 23:52, Tom Lane wrote:
> > BTW, if this does bring the probability of failure down to the
> > one-in-a-billion range, I think we could also nuke the whole
> > "ignore:" business, simplifying pg_regress and allowing the
> > random test to be run in parallel with others.
>
> We have used the pg_sleep() function to interrupt a query at certain
> execution phase. But on some platforms, especially in containers, the
> query can vary execution time in so widely that the pg_sleep() timeout,
> required to get rid of dependency on a query execution time, has become
> unacceptable. So, the "ignore" option was the best choice.

But does such a test have any actual value? If your test infrastructure ignores the result, what makes you think you'd notice if the test did indeed detect a problem?

I think "ignore:" was a kluge we put in twenty-plus years ago when our testing standards were a lot lower, and it's way past time we got rid of it.

regards, tom lane
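The "one-in-a-billion" figure is about choosing statistical bounds wide enough that a correct generator essentially never fails the check. A hedged Python sketch of the idea (not the actual regression test, which is SQL):

```python
import math
import random


def normal_sample_ok(n: int = 100_000, k: float = 15.0, seed: int = 42) -> bool:
    """Draw n standard-normal samples and check their mean against zero.

    The sample mean has standard error 1/sqrt(n), so requiring
    |mean| < k/sqrt(n) fails spuriously with probability P(|Z| > k).
    Already at k = 6 that is roughly 2e-9 -- the one-in-a-billion regime
    the thread is aiming for; k = 15 makes this demo effectively
    failure-proof.
    """
    rng = random.Random(seed)
    mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
    return abs(mean) < k / math.sqrt(n)
```

With bounds like these, a failing run signals a real defect rather than bad luck, which is what makes dropping "ignore:" feasible.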
bug: copy progress reporting of backends which run multiple COPYs
pg_stat_progress_copy was added in v14 (8a4f618e7, 9d2d45700). But if a command JOINs file_fdw tables, the progress report gets bungled up. This will warn/assert during file_fdw tests.

diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 6743e68cef6..7abcb4f60db 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -10,6 +10,7 @@
  */
 #include "postgres.h"

+#include "commands/progress.h"
 #include "port/atomics.h"		/* for memory barriers */
 #include "utils/backend_progress.h"
 #include "utils/backend_status.h"
@@ -105,6 +106,20 @@ pgstat_progress_end_command(void)
 	if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
 		return;

+// This currently fails file_fdw tests, since pgstat_progress evidently fails
+// to support simultaneous copy commands, as happens during JOIN.
+	/* bytes progress is not available in all cases */
+	if (beentry->st_progress_command == PROGRESS_COMMAND_COPY &&
+		beentry->st_progress_param[PROGRESS_COPY_BYTES_TOTAL] > 0)
+	{
+		volatile int64 *a = beentry->st_progress_param;
+
+		if (a[PROGRESS_COPY_BYTES_PROCESSED] > a[PROGRESS_COPY_BYTES_TOTAL])
+			elog(WARNING, "PROGRESS_COPY_BYTES_PROCESSED %ld %ld",
+				 a[PROGRESS_COPY_BYTES_PROCESSED],
+				 a[PROGRESS_COPY_BYTES_TOTAL]);
+		// Assert(a[PROGRESS_COPY_BYTES_PROCESSED] <= a[PROGRESS_COPY_BYTES_TOTAL]);
+	}
+
 	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
 	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
 	beentry->st_progress_command_target = InvalidOid;
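The failure mode can be modeled in a few lines (a toy Python model, not the pgstat code): each backend has a single progress slot, and starting a new command resets it, so a nested COPY started by a file_fdw scan wipes the outer COPY's counters.

```python
# One progress slot per backend, as in the backend-status machinery:
# starting a command clears whatever the previous command had reported.
progress = {}


def progress_start(cmd):
    progress.clear()
    progress["command"] = cmd


def progress_update(param, value):
    progress[param] = value


# Outer COPY reports its total size...
progress_start("COPY")
progress_update("bytes_total", 1000)

# ...then a file_fdw scan in the same backend starts its own COPY FROM,
# resetting the slot and losing the outer command's total.
progress_start("COPY")
progress_update("bytes_processed", 4096)

# Now processed (4096) can exceed the (vanished) total, which is exactly
# the condition the WARNING/Assert above trips on.
```

This is only a sketch of the interaction; the real slot is the fixed-size st_progress_param array, not a dict.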
Re: Experiments with Postgres and SSL
On Wed, Jan 18, 2023 at 7:16 PM Greg Stark wrote:
> So I took a look into what it would take to do and I think it would
> actually be quite feasible. The first byte of a standard TLS
> connection can't look anything like the first byte of any flavour of
> Postgres startup packet because it would be the high order bits of the
> length so unless we start having multi-megabyte startup packets

This is a fascinating idea! I like it a lot. But do we have to treat any unknown start sequence of bytes as a TLS connection? Or is there some definite subset of possible first bytes that clearly indicates whether this is a TLS connection or not?

Best regards, Andrey Borodin.
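The first-byte observation can be sketched as follows (a simplification for illustration; real detection would inspect more of the record, and the byte values are the ones implied by the formats, not taken from any patch):

```python
def classify_first_byte(first: int) -> str:
    """Guess the protocol from the first byte on the wire.

    A TLS record begins with a one-byte content type; 0x16 marks a
    handshake record, which is what a ClientHello arrives in.  A
    PostgreSQL startup packet begins with a 4-byte big-endian length,
    so its first byte is the high-order length byte -- 0x00 for any
    packet smaller than 16 MB, and real startup packets are tiny.
    """
    if first == 0x16:
        return "tls"
    if first == 0x00:
        return "postgres"
    return "unknown"
```

This suggests the answer to the question above: the two legitimate cases occupy disjoint first-byte values, so anything else can be rejected outright rather than treated as TLS.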
Re: Perform streaming logical transactions by background workers and parallel apply
On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila wrote:
> On Fri, Jan 13, 2023 at 11:50 AM Peter Smith wrote:
> >
> > Here are some review comments for patch v79-0002.

So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.

> > I feel this patch just adds more complexity for almost no gain:
> > - reducing the 'max_apply_workers_per_subscription' seems not very
> > common in the first place.
> > - even when the GUC is reduced, at that point in time all the workers
> > might be in use so there may be nothing that can be immediately done.
> > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > naturally anyway over time as more transactions are completed so the
> > pool size will reduce accordingly.

I am still not sure if it is worth pursuing this patch because of the above reasons. I don't think it would be difficult to add this even at a later point in time if we really see a use case for it. Sawada-San, IIRC, you raised this point. What do you think?

The other point I am wondering about is whether we can have a different way to test partial serialization apart from introducing another developer GUC (stream_serialize_threshold). One possibility could be a subscription option (parallel_send_timeout or something like that) with some default value (the current timeout used in the patch) which will be used only when streaming = parallel. Users may want to wait for more time before serialization starts depending on the workload (say when resource usage is high on the subscriber-side machine, or there are concurrent long-running transactions that can block parallel apply for a bit longer). I know that with this as well it may not be straightforward to test the functionality, because we can't be sure how many changes would be required for a timeout to occur. This is just brainstorming other options to test the partial serialization functionality.

Apart from the above, we can also have a subscription option to specify parallel_shm_queue_size (the queue_size used to size the queue between the leader and a parallel worker), and that can be used for this purpose. Basically, configuring it to a smaller value can help in reducing the test time, but still, it will not eliminate the dependency on the time we have to wait before switching to partial serialize mode. I think this can be used in production as well to tune the performance depending on the workload.

Yet another way is to use the existing parameter logical_decoding_mode [1]. If the value of logical_decoding_mode is 'immediate', then we can immediately switch to partial serialize mode. This will eliminate the dependency on timing. The one argument against using this is that it won't be as clear as a separate parameter like 'stream_serialize_threshold' proposed by the patch, but OTOH we already have a few parameters that serve a different purpose when used on the subscriber. For example, 'max_replication_slots' is used to define the maximum number of replication slots on the publisher and the maximum number of origins on subscribers. Similarly, 'wal_retrieve_retry_interval' is used for different purposes on subscriber and standby nodes.

[1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html

--
With Regards,
Amit Kapila.
Re: [PATCH] random_normal function
On 1/9/23 23:52, Tom Lane wrote:
> BTW, if this does bring the probability of failure down to the
> one-in-a-billion range, I think we could also nuke the whole
> "ignore:" business, simplifying pg_regress and allowing the
> random test to be run in parallel with others.

With the 'ignore' option, we got used to covering with tests some of the time-dependent features, such as "statement_timeout", "idle_in_transaction_session_timeout", usage of user timeouts in extensions, and so on.

We have used the pg_sleep() function to interrupt a query at a certain execution phase. But on some platforms, especially in containers, the query execution time can vary so widely that the pg_sleep() timeout, required to get rid of the dependency on query execution time, has become unacceptable. So, the "ignore" option was the best choice.

For now, do we only have the "isolation tests" option to create stable execution-time-dependent tests? Or am I not aware of some test machinery?

--
Regards,
Andrey Lepikhov
Postgres Professional
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 19/01/23 08:58, David Rowley wrote:
> The problem is that the next item looked at is 1 and the value 2 is
> skipped.

Okay, I see the issue.

> I think you are misinterpreting the results, but the main point remains -
> it's slower. The explain analyze timing shows the time between outputting
> the first row and the last row. For sort, there's a lot of work that needs
> to be done before you output the first row.

Okay, got it.

> I looked into this a bit further and, using the same table as you, and the
> attached set of hacks that adjust the ORDER BY path generation to split a
> Sort into a Sort and Incremental Sort when the number of pathkeys to sort
> by is > 2.

Okay, so this is an issue with the strict sort itself. I will play around with the current patch, but it should be fine for the current work and would account for the issue in the strict sort.

> I suspect this might be to do with having to perform more swaps in the
> array elements that we're sorting by with the full sort. When work_mem is
> large this array no longer fits in CPU cache and I suspect those swaps
> become more costly when there are more cache lines having to be fetched
> from RAM.

Looks like a possible reason.

> I think we really should fix tuplesort.c so that it batches sorts into
> about L3 CPU cache-sized chunks in RAM rather than trying to sort much
> larger arrays.

I assume the same thing is being done for incremental sort, and that's why it performs better here? Or is it due to the shorter tuple width, and thus being able to fit into memory more easily?

> On the other hand, we already sort WindowClauses with the most strict sort
> order first which results in a full Sort and no additional sort for
> subsequent WindowClauses that can make use of that sort order. It would be
> bizarre to reverse the final few lines of common_prefix_cmp so that we
> sort the least strict order first so we end up injecting Incremental Sorts
> into the plan to make it faster!

I would see this as beneficial when the tuple size is too huge for a strict sort. Basically:

cost(sort(a,b,c,d,e)) > cost(sort(a) + sort(b) + sort(c) + sort(d,e))

I don't know if a cost-based decision is realistic or not in the current state of things.

> I think more benchmarking is required so we can figure out if this is a
> corner case or a common case

I will do a few more benchmarks. I assume all the scaling issues with the strict sort remain as they are, but that no further issue comes up due to this change itself.

> I'm just unsure if we should write this off as the expected behaviour of
> Sort and continue with the patch or delay the whole thing until we make
> some improvements to sort.

Unless we find more issues, we might be good with the former, but you can make a better call on this. As much as I would love my first patch to get merged, I would want it to be right.

Thanks,
Ankit
Re: Make use of assign_checkpoint_completion_target() to calculate CheckPointSegments correctly
On Tue, Jan 17, 2023 at 07:55:53PM +0530, Bharath Rupireddy wrote:
> On Tue, Jan 17, 2023 at 12:31 PM Michael Paquier wrote:
> > Oops. It looks like you are right here. This would impact the
> > calculation of CheckPointSegments on reload when
> > checkpoint_completion_target is updated. This is wrong since we've
> > switched to max_wal_size as of 88e9823, so this had better be
> > backpatched all the way down.
> >
> > Thoughts?
>
> +1 to backpatch as setting checkpoint_completion_target will not take
> effect immediately.

Okay, applied down to 11. I have double-checked the surroundings to see if there was a similar mistake or something hiding around CheckPointSegments, but noticed nothing (I also did some primary/standby pgbench-ing with periodic reloads of checkpoint_completion_target while on it).
--
Michael
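For context, CheckPointSegments is derived from max_wal_size and checkpoint_completion_target, which is why a reload that changes the target must recompute it. A rough Python rendering of the calculation's shape (a sketch mirroring CalculateCheckpointSegments() in xlog.c, not the exact backend code):

```python
def checkpoint_segments(max_wal_size_mb: int,
                        wal_segment_size_mb: int,
                        completion_target: float) -> int:
    """Approximate the CheckPointSegments derivation.

    max_wal_size is converted into WAL segments, then divided by
    (2.0 + checkpoint_completion_target): the budget covers the WAL of
    the current and previous checkpoint cycles plus the spread-out
    portion controlled by the target.  A larger target therefore means
    checkpoints are triggered after fewer segments.
    """
    segments = int(max_wal_size_mb / wal_segment_size_mb /
                   (2.0 + completion_target))
    return max(segments, 1)
```

For example, with max_wal_size = 1GB and 16MB segments, raising the target from 0.5 to 0.9 shrinks the trigger threshold, which is the effect the missed reload hook failed to apply.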
Re: doc: add missing "id" attributes to extension packaging page
On Tue, 17 Jan 2023 16:43:13 -0600 "Karl O. Pinc" wrote:
> It might be useful to add --nonet to the xsltproc invocation(s)
> in the Makefile(s). Just in case; to keep from retrieving
> stylesheets from the net. (If the option is not already there.
> I didn't look.)
>
> If this is the first time that PG uses the XSLT import mechanism
> I imagine that "things could go wrong" depending on what sort
> of system is being used to build the docs. I'm not worried,
> but it is something to note for the committer.

Looks like doc/src/sgml/stylesheet-fo.xsl already uses xsl:import, although it is unclear to me whether the import is applied.

Regards,
Karl

Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Remove source code display from \df+?
st 18. 1. 2023 v 16:27 odesílatel Isaac Morland napsal: > On Wed, 18 Jan 2023 at 00:00, Pavel Stehule > wrote: > >> >> út 17. 1. 2023 v 20:29 odesílatel Isaac Morland >> napsal: >> >>> >>> I welcome comments and feedback. Now to try to find something manageable >>> to review. >>> >> >> looks well >> >> you miss update psql documentation >> >> https://www.postgresql.org/docs/current/app-psql.html >> >> If the form \df+ is used, additional information about each function is >> shown, including volatility, parallel safety, owner, security >> classification, access privileges, language, source code and description. >> > > Thanks, and sorry about that, it just completely slipped my mind. I’ve > attached a revised patch with a slight update of the psql documentation. > > you should to assign your patch to commitfest app >> >> https://commitfest.postgresql.org/ >> > > I thought I had: https://commitfest.postgresql.org/42/4133/ > ok > > Did I miss something? > it looks well regards Pavel
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Thu, 19 Jan 2023 at 06:27, Ankit Kumar Pandey wrote:
> Hmm, not really sure why did I miss that. I tried this again (added
> following in postgres.c above PortalStart)
>
> List* l = NIL;
> l = lappend(l, 1);
> l = lappend(l, 2);
> l = lappend(l, 3);
> l = lappend(l, 4);
>
> ListCell *lc;
> foreach_reverse(lc, l)
> {
>     if (foreach_current_index(lc) == 2) // delete 3
>     {
>         foreach_delete_current(l, lc);
>     }
> }

The problem is that the next item looked at is 1 and the value 2 is skipped.

> I also tried a bit unrealistic case.
>
> create table abcdefgh(a int, b int, c int, d int, e int, f int, g int, h int);
>
> insert into abcdefgh select a,b,c,d,e,f,g,h from
> generate_series(1,7) a,
> generate_series(1,7) b,
> generate_series(1,7) c,
> generate_series(1,7) d,
> generate_series(1,7) e,
> generate_series(1,7) f,
> generate_series(1,7) g,
> generate_series(1,7) h;
>
> explain analyze select count(*) over (order by a),
> row_number() over (partition by a order by b) from abcdefgh order by
> a,b,c,d,e,f,g,h;
>
> In patch version
> Execution Time: 82748.789 ms
> In Master
> Execution Time: 71172.586 ms
> Patch version sort took around 15 sec, whereas master sort took ~2 + 46
> = around 48 sec
> Maybe I am interpreting it wrong but still wanted to bring this to notice.

I think you are misinterpreting the results, but the main point remains - it's slower. The explain analyze timing shows the time between outputting the first row and the last row. For sort, there's a lot of work that needs to be done before you output the first row.

I looked into this a bit further using the same table as you, and the attached set of hacks that adjust the ORDER BY path generation to split a Sort into a Sort and Incremental Sort when the number of pathkeys to sort by is > 2. work_mem = '1GB' in all cases below.

I turned off the timing in EXPLAIN so that wasn't a factor in the results. Generally having more nodes means more timing requests and that's got > 0 overhead.
explain (analyze, timing off) select * from abcdefgh order by a,b,c,d,e,f,g,h;

7^8 rows:
Master: Execution Time: 7444.479 ms
master + sort_hacks.diff: Execution Time: 5147.937 ms

So I'm also seeing Incremental Sort -> Sort faster than a single Sort for 7^8 rows.

With 5^8 rows:
master: Execution Time: 299.949 ms
master + sort_hacks: Execution Time: 239.447 ms

and 4^8 rows:
master: Execution Time: 62.596 ms
master + sort_hacks: Execution Time: 67.900 ms

So at 4^8 sort_hacks becomes slower. I suspect this might be to do with having to perform more swaps of the array elements that we're sorting when doing the full sort. When work_mem is large this array no longer fits in CPU cache, and I suspect those swaps become more costly when there are more cache lines having to be fetched from RAM. I think we really should fix tuplesort.c so that it batches sorts into about L3-CPU-cache-sized chunks in RAM rather than trying to sort much larger arrays.

I'm just unsure if we should write this off as the expected behaviour of Sort and continue with the patch, or delay the whole thing until we make some improvements to sort. I think more benchmarking is required so we can figure out if this is a corner case or a common case.

On the other hand, we already sort WindowClauses with the most strict sort order first, which results in a full Sort and no additional sort for subsequent WindowClauses that can make use of that sort order. It would be bizarre to reverse the final few lines of common_prefix_cmp so that we sort the least strict order first and end up injecting Incremental Sorts into the plan to make it faster!

David

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 093ace0864..a714e83a69 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5085,11 +5085,30 @@ create_ordered_paths(PlannerInfo *root,
 	 * incremental sort when there are presorted keys.
 	 */
 	if (presorted_keys == 0 || !enable_incremental_sort)
-		sorted_path = (Path *) create_sort_path(root,
-												ordered_rel,
-												input_path,
-												root->sort_pathkeys,
-												limit_tuples);
+	{
+		if (list_length(root->sort_pathkeys) > 2)
+		{
+			sorted_path = (Path *) create_sort_path(root,
+
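The foreach_reverse deletion bug described earlier in this thread can be simulated outside of Postgres. The sketch below is a toy Python model (not the real C list code) of an array-backed list where the delete-current operation applies the forward-iteration cursor fixup, stepping the cursor back one slot, which is wrong when walking in reverse:

```python
def reverse_scan_deleting(items, doomed):
    """Toy model of foreach_reverse + foreach_delete_current.

    Deleting at index i while walking backwards does not shift any of the
    elements still to be visited, so no cursor fixup is needed; applying
    the forward-loop fixup (an extra step back) skips an element.
    """
    visited = []
    i = len(items) - 1
    while i >= 0:
        value = items[i]
        visited.append(value)
        if value == doomed:
            del items[i]
            i -= 1  # forward-iteration fixup: incorrect for a reverse walk
        i -= 1
    return visited

items = [1, 2, 3, 4]
visited = reverse_scan_deleting(items, doomed=3)
assert visited == [4, 3, 1]   # 2 was never visited
assert items == [1, 2, 4]     # the delete itself worked fine
```

This reproduces the observation above: after deleting 3, the next item looked at is 1 and the value 2 is skipped.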
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund wrote:
> > You seem to be saying that it's a problem if we don't update reltuples
> > -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> > But why? Why can't we just do nothing sometimes? I mean in general,
> > leaving aside the heuristics I came up with for a moment?
>
> The problem isn't that we might apply the heuristic once, that'd be fine. But
> that there's nothing preventing it from applying until there basically are no
> tuples left, as long as the vacuum is frequent enough.
>
> As a demo: The attached sql script ends up with a table containing 10k rows,
> but relpages being set 1 million.

I saw that too. But then I checked again a few seconds later, and autoanalyze had run, so reltuples was 10k. Just like it would have if there were no VACUUM statements in your script.

-- Peter Geoghegan
Experiments with Postgres and SSL
I had a conversation a while back with Heikki where he expressed that it was annoying that we negotiate SSL/TLS the way we do, since it introduces an extra round trip. Aside from the performance optimization, I think accepting standard TLS connections would open the door to a number of other opportunities that would be worth it on their own.

So I took a look into what it would take to do, and I think it would actually be quite feasible. The first byte of a standard TLS connection can't look anything like the first byte of any flavour of Postgres startup packet, because that byte would hold the high-order bits of the packet length, so there is no ambiguity unless we start having multi-megabyte startup packets.

So I put together a POC patch, and it's working quite well and didn't require very much kludgery. Well, it required some, but it's really not bad. I do have a bug I'm still trying to work out, and the code isn't quite in committable form, but I can send the POC patch.

Other things it would open the door to, in order from least controversial to most:

* Hiding Postgres behind a standard SSL proxy terminating SSL without implementing the Postgres protocol.
* "Service Mesh" type tools that hide multiple services behind a single host/port ("Service Mesh" is just a new buzzword for "proxy").
* Browser-based protocol implementations using websockets for things like pgadmin or other tools to connect directly to postgres using the Postgres wire protocol but using native SSL implementations.
* Postgres could even implement an HTTP based version of its protocol and enable things like queries or browser based tools using straight up HTTP requests so they don't need to use websockets.
* Postgres could implement other protocols to serve up data like status queries or monitoring metrics, using HTTP based standard protocols instead of using our own protocol.

Incidentally, I find the logic in ProcessStartupPacket incredibly confusing. It took me a while before I realized it's using tail recursion to implement the startup logic.
I think it would be way more straightforward and extensible if it used a much more common iterative style. I think it would make it possible to keep more state than just ssl_done and gss_done without changing the function signature every time for example. -- greg
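The disambiguation trick described above can be sketched concretely. The following toy Python sketch is illustrative only (byte values come from the TLS record format and the documented Postgres startup packet layout, not from the POC patch), showing why the very first byte is enough to tell the two apart:

```python
import struct

TLS_HANDSHAKE = 0x16  # first byte of every TLS ClientHello record

def looks_like_tls(first_byte: int) -> bool:
    # A Postgres startup packet begins with a 4-byte big-endian length,
    # so its first byte is the length's high-order byte.  That byte would
    # only equal 0x16 for a packet of at least 0x16000000 bytes (~369 MB).
    return first_byte == TLS_HANDSHAKE

# An SSLRequest startup packet: int32 length (8) + magic code 80877103.
sslrequest = struct.pack("!ii", 8, 80877103)
assert sslrequest[0] == 0x00          # tiny length => high byte is zero
assert not looks_like_tls(sslrequest[0])
assert looks_like_tls(0x16)           # a raw TLS ClientHello
```

So a server could peek at one byte and route the connection either into the TLS stack or into the normal startup-packet path.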
Re: Support logical replication of DDLs
On Wed, Jan 18, 2023 at 6:27 AM Amit Kapila wrote: > > On Sat, Jan 7, 2023 at 8:58 PM Zheng Li wrote: > > > > I added documentation and changed user interface design in the > > attached v60 patch set. > > The patch set addressed comments from Peter in [1]. > > > > The motivation for the user interface change is that we want to manage > > DDL replication feature in stages with fine grained replication > > levels. > > For example, we can focus on reviewing and testing table commands > > first, then other commands. It also make sense to introduce different > > DDL replication levels > > from the user perspective as pointed out in [1]. We can add more > > replication levels along the way. > > > > In this patch DDL replication is disabled by default and it can be > > enabled at different levels > > using the new PUBLICATION option 'ddl'. This option currently has two > > levels and are > > only allowed to be set if the PUBLICATION is FOR ALL TABLES. > > > > all: this option enables replication of all supported DDL commands. > > table: this option enables replication of Table DDL commands which > > include: > > -CREATE/ALTER/DROP TABLE > > -CREATE TABLE AS > > > > I think this point needs some thought. When you say 'all', how do you > think it will help to support DDL replication for foreign tables, > materialized views, views, etc where changes to such relations are > currently not supported by logical replication? I think DDL replication naturally provides support for views and materialized views, if the publication is FOR ALL TABLES since all the tables in the view/MV definition are replicated. Foreign Tables can also be considered replicated with DDL replication because we don't even need to replicate the data as it resides on the external server. Users need to configure the external server to allow connection from the subscriber for foreign tables to work on the subscriber. > We should also think > about initial sync for all those objects as well. 
Agreed, we're starting an investigation on initial sync. But I think initial sync depends on DDL replication to work reliably, not the other way around. DDL replication can work on its own without the initial sync of schema; users just need to set up the initial schema, just like they would today.

Regards,
Znae
Re: Deduplicate logicalrep_read_tuple()
On Wed, Jan 18, 2023 at 6:26 PM Bharath Rupireddy wrote: > > Hi, > > logicalrep_read_tuple() duplicates code for LOGICALREP_COLUMN_TEXT and > LOGICALREP_COLUMN_BINARY introduced by commit 9de77b5. While it > doesn't hurt anyone, deduplication makes code a bit leaner by 57 bytes > [1]. I've attached a patch for $SUBJECT. > > Thoughts? > The code looks the same but there is a subtle comment difference where previously only LOGICALREP_COLUMN_BINARY case said: /* not strictly necessary but per StringInfo practice */ So if you de-duplicate the code then should that comment be modified to say /* not strictly necessary for LOGICALREP_COLUMN_BINARY but per StringInfo practice */ -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi, On 2023-01-18 18:21:33 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 5:49 PM Andres Freund wrote: > > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > > > Perhaps we should make vac_estimate_reltuples focus on the pages that > > > VACUUM newly set all-visible each time (not including all-visible > > > pages that got scanned despite being all-visible) -- only that subset > > > of scanned_pages seems to be truly relevant. That way you wouldn't be > > > able to do stuff like this. We'd have to start explicitly tracking the > > > number of pages that were newly set in the VM in vacuumlazy.c to be > > > able to do that, but that seems like a good idea anyway. > > > > Can you explain a bit more what you mean with "focus on the pages" means? > > We don't say anything about pages we didn't scan right now -- only > scanned_pages have new information, so we just extrapolate. Why not go > even further than that, by only saying something about the pages that > were both scanned and newly set all-visible? > > Under this scheme, the pages that were scanned but couldn't be newly > set all-visible are treated just like the pages that weren't scanned > at all -- they get a generic estimate from the existing reltuples. I don't think that'd work well either. If we actually removed a large chunk of the tuples in the table it should be reflected in reltuples, otherwise costing and autovac scheduling suffers. And you might not be able to set all those page to all-visible, because of more recent rows. > > I don't think it's hard to see this causing problems. Set > > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat > > frequently vacuum manually. Incrementally delete old data. Unless analyze > > saves you - which might not be run or might have a different scale factor or > > not be run manually - reltuples will stay exactly the same, despite data > > changing substantially. 
> You seem to be saying that it's a problem if we don't update reltuples
> -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> But why? Why can't we just do nothing sometimes? I mean in general,
> leaving aside the heuristics I came up with for a moment?

The problem isn't that we might apply the heuristic once, that'd be fine. But that there's nothing preventing it from applying until there basically are no tuples left, as long as the vacuum is frequent enough.

As a demo: The attached sql script ends up with a table containing 10k rows, but relpages being set 1 million.

VACUUM foo;
EXPLAIN (ANALYZE) SELECT * FROM foo;

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            QUERY PLAN                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Seq Scan on foo (cost=0.00..14425.00 rows=100 width=4) (actual time=3.251..4.693 rows=1 loops=1) │
│ Planning Time: 0.056 ms                                                                          │
│ Execution Time: 5.491 ms                                                                         │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

> It will cause problems if we remove the heuristics. Much less
> theoretical problems. What about those?

I don't think what I describe is a theoretical problem.

> How often does VACUUM scan so few pages, anyway? We've been talking
> about how ineffective autovacuum_vacuum_scale_factor is, at great
> length, but now you're saying that it *can* meaningfully trigger not
> just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages
> are not all-visible (pages, not tuples)? Not just once, mind you, but
> many times?

I've seen autovacuum_vacuum_scale_factor set to 0.01 repeatedly. But that's not even needed - you just need a longrunning transaction preventing at least one dead row from getting removed and hit autovacuum_freeze_max_age. There'll be continuous VACUUMs of the table, all only processing a small fraction of the table.

And I have many times seen bulk loading / deletion scripts that do VACUUM on a regular schedule, which also could easily trigger this.

> And in the presence of some kind of highly variable tuple
> size, where it actually could matter to the planner at some point?
I don't see how a variable tuple size needs to be involved? As the EXPLAIN ANALYZE above shows, we'll end up with wrong row count estimates etc. > I would be willing to just avoid even these theoretical problems if > there was some way to do so, that didn't also create new problems. I > have my doubts that that is possible, within the constraints of > updating pg_class. Or the present constraints, at least. I am not a > miracle worker -- I can only do so much with the information that's > available to
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund wrote: > > This creates an awkward but logical question, though: what if > > dead_tuples doesn't go down at all? What if VACUUM actually has to > > increase it, because VACUUM runs so slowly relative to the workload? > > Sure, that can happen - but it's not made better by having wrong stats :) Maybe it's that simple. It's reasonable to wonder how far we want to go with letting dead tuples grow and grow, even when VACUUM is running constantly. It's not a question that has an immediate and obvious answer IMV. Maybe the real question is: is this an opportunity to signal to the user (say via a LOG message) that VACUUM cannot keep up? That might be very useful, in a general sort of way (not just to avoid new problems). > We have reasonably sophisticated accounting in pgstats what newly live/dead > rows a transaction "creates". So an obvious (and wrong) idea is just decrement > reltuples by the number of tuples removed by autovacuum. Did you mean dead_tuples? > But we can't do that, > because inserted/deleted tuples reported by backends can be removed by > on-access pruning and vacuumlazy doesn't know about all changes made by its > call to heap_page_prune(). I'm not following here. Perhaps this is a good sign that I should stop working for the day. :-) > But I think that if we add a > pgstat_count_heap_prune(nredirected, ndead, nunused) > around heap_page_prune() and a > pgstat_count_heap_vacuum(nunused) > in lazy_vacuum_heap_page(), we'd likely end up with a better approximation > than what vac_estimate_reltuples() does, in the "partially scanned" case. What does vac_estimate_reltuples() have to do with dead tuples? -- Peter Geoghegan
Re: Minimal logical decoding on standbys
Hi,

On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
> On 1/6/23 4:40 AM, Andres Freund wrote:
> > Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
> > for us instead? This will likely also be needed in real applications after
> > all.
>
> Not sure I got it. What the C helper would be supposed to do?

Call LogStandbySnapshot().

> With a reload in place in my testing, now I notice that the catalog_xmin
> is updated on the primary physical slot after logical slots invalidation
> when reloading hot_standby_feedback from "off" to "on".
>
> This is not the case after a re-start (aka catalog_xmin is NULL).
>
> I think a re-start and reload should produce identical behavior on
> the primary physical slot. If so, I'm tempted to think that the catalog_xmin
> should be updated in case of a re-start too (even if all the logical slots
> are invalidated) because the slots are not dropped yet. What do you think?

I can't quite follow the steps leading up to the difference. Could you list them in a bit more detail?

> > Can we do something cheaper than rewriting the entire database? Seems
> > rewriting a single table ought to be sufficient?
>
> While implementing the test at the table level I discovered that it looks
> like there is no guarantee that say a "vacuum full pg_class;" would
> produce a conflict.

I assume that's mostly when there weren't any removals.

> Indeed, from what I can see in my testing it could generate a
> XLOG_HEAP2_PRUNE with snapshotConflictHorizon to 0:
>
> "rmgr: Heap2 len (rec/tot): 107/ 107, tx: 848, lsn:
> 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"
>
> Having a snapshotConflictHorizon of zero leads to
> ResolveRecoveryConflictWithSnapshot() simply returning
> without any conflict handling.

That doesn't have to mean anything bad. Some row versions can be removed without creating a conflict.
See HeapTupleHeaderAdvanceConflictHorizon(), specifically * Ignore tuples inserted by an aborted transaction or if the tuple was * updated/deleted by the inserting transaction. > It does look like that in the standby decoding case that's not the right > behavior (and that the xid that generated the PRUNING should be used instead) > , what do you think? That'd not work, because that'll be typically newer than the catalog_xmin. So we'd start invalidating things left and right, despite not needing to. Did you see anything else around this making you suspicious? > > > +## > > > +# Test standby promotion and logical decoding behavior > > > +# after the standby gets promoted. > > > +## > > > + > > > > I think this also should test the streaming / walsender case. > > > > Do you mean cascading standby? I mean a logical walsender that starts on a standby and continues across promotion of the standby. Greetings, Andres Freund
Re: Parallelize correlated subqueries that execute within each worker
On Wed, Jan 18, 2023 at 2:09 PM Tomas Vondra wrote: > > Hi, > > This patch hasn't been updated since September, and it got broken by > 4a29eabd1d91c5484426bc5836e0a7143b064f5a which the incremental sort > stuff a little bit. But the breakage was rather limited, so I took a > stab at fixing it - attached is the result, hopefully correct. Thanks for fixing this up; the changes look correct to me. > I also added a couple minor comments about stuff I noticed while > rebasing and skimming the patch, I kept those in separate commits. > There's also a couple pre-existing TODOs. I started work on some of these, but wasn't able to finish this evening, so I don't have an updated series yet. > James, what's your plan with this patch. Do you intend to work on it for > PG16, or are there some issues I missed in the thread? I'd love to see it get into PG16. I don't have any known issues, but reviewing activity has been light. Originally Robert had had some concerns about my original approach; I think my updated approach resolves those issues, but it'd be good to have that sign-off. Beyond that I'm mostly looking for review and evaluation of the approach I've taken; of note is my description of that in [1]. > One of the queries in in incremental_sort changed plans a little bit: > > explain (costs off) select distinct > unique1, > (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1) > from tenk1 t, generate_series(1, 1000); > > switched from > > Unique (cost=18582710.41..18747375.21 rows=1 width=8) >-> Gather Merge (cost=18582710.41..18697375.21 rows=1000 ...) > Workers Planned: 2 > -> Sort (cost=18582710.39..18593127.06 rows=417 ...) >Sort Key: t.unique1, ((SubPlan 1)) > ... > > to > > Unique (cost=18582710.41..18614268.91 rows=1 ...) >-> Gather Merge (cost=18582710.41..18614168.91 rows=2 ...) > Workers Planned: 2 > -> Unique (cost=18582710.39..18613960.39 rows=1 ...) >-> Sort (cost=18582710.39..18593127.06 ...) > Sort Key: t.unique1, ((SubPlan 1)) >... 
> > which probably makes sense, as the cost estimate decreases a bit. Off the cuff that seems fine. I'll read it over again when I send the updated series. James Coleman 1: https://www.postgresql.org/message-id/CAAaqYe8m0DHUWk7gLKb_C4abTD4nMkU26ErE%3Dahow4zNMZbzPQ%40mail.gmail.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund wrote: > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > > Perhaps we should make vac_estimate_reltuples focus on the pages that > > VACUUM newly set all-visible each time (not including all-visible > > pages that got scanned despite being all-visible) -- only that subset > > of scanned_pages seems to be truly relevant. That way you wouldn't be > > able to do stuff like this. We'd have to start explicitly tracking the > > number of pages that were newly set in the VM in vacuumlazy.c to be > > able to do that, but that seems like a good idea anyway. > > Can you explain a bit more what you mean with "focus on the pages" means? We don't say anything about pages we didn't scan right now -- only scanned_pages have new information, so we just extrapolate. Why not go even further than that, by only saying something about the pages that were both scanned and newly set all-visible? Under this scheme, the pages that were scanned but couldn't be newly set all-visible are treated just like the pages that weren't scanned at all -- they get a generic estimate from the existing reltuples. > I don't think it's hard to see this causing problems. Set > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat > frequently vacuum manually. Incrementally delete old data. Unless analyze > saves you - which might not be run or might have a different scale factor or > not be run manually - reltuples will stay exactly the same, despite data > changing substantially. You seem to be saying that it's a problem if we don't update reltuples -- an estimate -- when less than 2% of the table is scanned by VACUUM. But why? Why can't we just do nothing sometimes? I mean in general, leaving aside the heuristics I came up with for a moment? It will cause problems if we remove the heuristics. Much less theoretical problems. What about those? How often does VACUUM scan so few pages, anyway? 
We've been talking about how ineffective autovacuum_vacuum_scale_factor is, at great length, but now you're saying that it *can* meaningfully trigger not just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages are not all-visible (pages, not tuples)? Not just once, mind you, but many times? And in the presence of some kind of highly variable tuple size, where it actually could matter to the planner at some point? I would be willing to just avoid even these theoretical problems if there was some way to do so, that didn't also create new problems. I have my doubts that that is possible, within the constraints of updating pg_class. Or the present constraints, at least. I am not a miracle worker -- I can only do so much with the information that's available to vac_update_relstats (and/or the information that can easily be made available). -- Peter Geoghegan
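The dynamic under debate here (a reltuples update skipped for small scans, applied repeatedly) can be played out with a toy model. The Python sketch below paraphrases the extrapolation in vac_estimate_reltuples() and the 2% threshold from the discussion; it is an illustration of the failure mode only, not the actual backend code:

```python
def estimate_reltuples(old_pages, old_tuples, total_pages,
                       scanned_pages, scanned_tuples):
    """Toy model of vac_estimate_reltuples(), paraphrased from this thread.

    Partial scans extrapolate from the old tuple density, and scans
    covering under 2% of the table keep the old estimate unchanged.
    """
    if scanned_pages < 0.02 * total_pages:
        return old_tuples
    old_density = old_tuples / old_pages
    unscanned = total_pages - scanned_pages
    return old_density * unscanned + scanned_tuples

pages, estimate, actual = 1000, 100_000, 100_000  # 100 tuples per page
for _ in range(100):
    actual -= 1_000                    # each cycle deletes 1% of the rows
    estimate = estimate_reltuples(pages, estimate, pages,
                                  scanned_pages=10, scanned_tuples=0)
assert actual == 0            # the table is now empty...
assert estimate == 100_000    # ...but the estimate never moved
```

This mirrors the demo upthread of a nearly empty table whose planner row estimate stays at the original value as long as each VACUUM touches only a small fraction of the pages.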
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi, On 2023-01-18 17:00:48 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 4:37 PM Andres Freund wrote: > > I can, it should be just about trivial code-wise. A bit queasy about trying > > to > > forsee the potential consequences. > > That's always going to be true, though. > > > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to > > what VACUUM itself observed, ignoring any concurrently reported dead > > tuples. As far as I can tell, when vacuum takes a long time, that can lead > > to > > severely under-accounting dead tuples. > > Did I not mention that one? There are so many that it can be hard to > keep track! That's why I catalog them. I don't recall you doing, but there's lot of emails and holes in my head. > This creates an awkward but logical question, though: what if > dead_tuples doesn't go down at all? What if VACUUM actually has to > increase it, because VACUUM runs so slowly relative to the workload? Sure, that can happen - but it's not made better by having wrong stats :) > > I do think this is an argument for splitting up dead_tuples into separate > > "components" that we track differently. I.e. tracking the number of dead > > items, not-yet-removable rows, and the number of dead tuples reported from > > DML > > statements via pgstats. > > Is it? Why? We have reasonably sophisticated accounting in pgstats what newly live/dead rows a transaction "creates". So an obvious (and wrong) idea is just decrement reltuples by the number of tuples removed by autovacuum. But we can't do that, because inserted/deleted tuples reported by backends can be removed by on-access pruning and vacuumlazy doesn't know about all changes made by its call to heap_page_prune(). 
But I think that if we add a pgstat_count_heap_prune(nredirected, ndead, nunused) around heap_page_prune() and a pgstat_count_heap_vacuum(nunused) in lazy_vacuum_heap_page(), we'd likely end up with a better approximation than what vac_estimate_reltuples() does, in the "partially scanned" case. > I'm all in favor of doing that, of course. I just don't particularly > think that it's related to this other problem. One problem is that we > count dead tuples incorrectly because we don't account for the fact > that things change while VACUUM runs. The other problem is that the > thing that is counted isn't broken down into distinct subcategories of > things -- things are bunched together that shouldn't be. If we only adjust the counters incrementally, as we go, we'd not update them at the end of vacuum. I think it'd be a lot easier to only update the counters incrementally if we split ->dead_tuples into sub-counters. So I don't think it's entirely unrelated. You probably could get close without splitting the counters, by just pushing down the counting, and only counting redirected and unused during heap pruning. But I think it's likely to be more accurate with the split counter. > Oh wait, you were thinking of what I said before -- my "awkward but > logical question". Is that it? I'm not quite following? The "awkward but logical" bit is in the email I'm just replying to, right? Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund wrote: > How about a float autovacuum_no_auto_cancel_age where positive values are > treated as absolute values, and negative values are a multiple of > autovacuum_freeze_max_age? And where the "computed" age is capped at > vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled. > > And maybe a default setting of -1.8 or so? Attached is a new revision, v5. I'm not happy with this, but thought it would be useful to show you where I am with it. It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age) that can sometimes work as a cutoff that works similarly to both freeze_max_age and multixact_freeze_max_age, but usually works as a multiplier. It's both an XID age value, an MXID age value, and a multiplier on XID/MXID age values. What if it was just a simple multiplier on freeze_max_age/multixact_freeze_max_age, without changing any other detail? -- Peter Geoghegan v5-0002-Add-table-age-trigger-concept-to-autovacuum.patch Description: Binary data v5-0001-Add-autovacuum-trigger-instrumentation.patch Description: Binary data
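The GUC semantics quoted above can be sketched directly. This toy Python sketch is hypothetical (the autovacuum_no_auto_cancel_age GUC is only a proposal in this thread): positive values are absolute XID ages, negative values are multiples of autovacuum_freeze_max_age, and the computed age is capped at vacuum_failsafe_age:

```python
def no_auto_cancel_age(setting, freeze_max_age, failsafe_age):
    # setting > 0: absolute XID age; setting < 0: multiple of freeze_max_age
    if setting < 0:
        age = -setting * freeze_max_age
    else:
        age = setting
    return min(int(age), failsafe_age)

# Proposed default of -1.8 with the stock autovacuum_freeze_max_age
# (200 million) and vacuum_failsafe_age (1.6 billion):
assert no_auto_cancel_age(-1.8, 200_000_000, 1_600_000_000) == 360_000_000
# An oversized multiple gets capped at the failsafe age:
assert no_auto_cancel_age(-10.0, 200_000_000, 1_600_000_000) == 1_600_000_000
```

The simpler alternative floated at the end of this message would drop the positive/absolute branch entirely and keep only the multiplier behavior.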
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith wrote:
>
> Here are my review comments for the latest patch v16-0001. (excluding
> the test code)
>

And here are some review comments for the v16-0001 test code.

==
src/test/regress/sql/subscription.sql

1. General

For all comments, maybe "time delayed replication" -> "time-delayed replication" is better?

~~~

2. -- fail - utilizing streaming = parallel with time delayed replication is not supported.

For readability please put a blank line before this test.

~~~

3. -- success -- value without unit is taken as milliseconds

"value" -> "min_apply_delay value"

~~~

4. -- success -- interval is converted into ms and stored as integer

"interval" -> "min_apply_delay interval"
"integer" -> "an integer"

~~~

5. You could also add another test where min_apply_delay is 0. Then the following combination can be confirmed OK:

-- success create subscription with (streaming=parallel, min_apply_delay=0)

~~~

6. -- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set.

CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel);

There is another way to do this test without creating a brand-new subscription. You could just alter the existing subscription like:
ALTER ... SET (min_apply_delay = 0)
then ALTER ... SET (parallel = streaming)
then ALTER ... SET (min_apply_delay = 123)

==
src/test/subscription/t/032_apply_delay.pl

7. sub check_apply_delay_log

my ($node_subscriber, $message, $expected) = @_;

Why pass in the message text? It is always the same, so it can be hardwired in this function, right?

~~~

8. # Get the delay time in the server log

"in the server log" -> "from the server log" (?)

~~~

9. qr/$message: (\d+) ms/ or die "could not get delayed time";
my $logged_delay = $1;
# Is it larger than expected?
cmp_ok($logged_delay, '>', $expected,
	"The wait time of the apply worker is long enough expectedly"
);

9a.
"could not get delayed time" -> "could not get the apply worker wait time" 9b. "The wait time of the apply worker is long enough expectedly" -> "The apply worker wait time has expected duration" ~~~ 10. sub check_apply_delay_time Maybe a brief explanatory comment for this function is needed to explain the unreplicated column c. ~~~ 11. $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, min_apply_delay = '3s')" I think there should be a comment here highlighting that you are setting up a subscriber time delay of 3 seconds, and then later you can better describe the parameters for the checking functions... e.g. (add this comment) # verifies that the subscriber lags the publisher by at least 3 seconds check_apply_delay_time($node_publisher, $node_subscriber, '5', '3'); e.g. # verifies that the subscriber lags the publisher by at least 3 seconds check_apply_delay_time($node_publisher, $node_subscriber, '8', '3'); ~~~ 12. # Test whether ALTER SUBSCRIPTION changes the delayed time of the apply worker # (1 day 1 minute). $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 8646)" ); Update the comment with another note. # Note - The extra 1 min is to account for any decoding/network overhead. ~~~ 13. # Make sure we have long enough min_apply_delay after the ALTER command check_apply_delay_log($node_subscriber, "logical replication apply delay", "8000"); IMO the expectation of 1 day (8646 ms) wait time might be a better number for your "expected" value. So update the comment/call like this: # Make sure the apply worker knows to wait for more than 1 day (8640 ms) check_apply_delay_log($node_subscriber, "logical replication apply delay", "8640"); -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Inconsistency in vacuum behavior
On Mon, Jan 16, 2023 at 08:12:18PM +0300, Nikita Malakhov wrote:
> Hi,
>
> Currently there is no error in this case, so an additional thrown error would
> require a new test. Besides, throwing an error here does not make sense - it
> is just a check for a vacuum permission. I think the right way is to just
> skip a relation that is not suitable for vacuum. Any thoughts or objections?

Could you check whether this behavior is consistent between VACUUM FULL and CLUSTER? See also Nathan's patches.

--
Justin
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 16:19:02 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 4:02 PM Andres Freund wrote:
> > vacuum-no  reltuples/n_live_tup  n_dead_tup
> > 1          476                   500
> > 2          2500077               500
> > 3          1250184               500
> > 4          625266                500
> > 5          312821                500
> > 10         10165                 500
> >
> > Each vacuum halves reltuples. That's going to screw badly with all kinds of
> > things. Planner costs completely out of whack etc.
>
> I get that that could be a big problem, even relative to the more
> immediate problem of VACUUM just spinning like it does in your test
> case. What do you think we should do about it?

The change made in 7c91a0364fc imo isn't right. We need to fix it.

I think it's correct that now pgstat_report_vacuum() doesn't include recently dead tuples in livetuples - they're still tracked via deadtuples. But it's wrong for vacuum's call to vac_update_relstats() to not include recently dead tuples, at least when we only scanned part of the relation.

I think the right thing would be to not undo the semantic change of 7c91a0364fc as a whole, but instead take recently-dead tuples into account only in the "Okay, we've covered the corner cases." part, to avoid the spiraling seen above. Not super clean, but it seems somewhat fundamental that we'll re-scan pages full of recently-dead tuples in the near future. If we, in a way, subtract the recently dead tuples from reltuples in this cycle, we shouldn't do so again in the next - but not taking recently dead into account does exactly that.

It's a bit complicated because of the APIs involved. vac_estimate_reltuples() computes vacrel->new_live_tuples and contains the logic for how to compute the new reltuples. But we use ->new_live_tuples both in vac_update_relstats(), where we, imo, should take recently-dead into account for partial scans, and in pgstat_report_vacuum(), where we shouldn't. I guess we would need to add an output parameter both for "reltuples" and "new_live_tuples".
> What do you think about my idea of focussing on the subset of pages newly
> set all-visible in the VM?

I don't understand it yet.

On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> Perhaps we should make vac_estimate_reltuples focus on the pages that
> VACUUM newly set all-visible each time (not including all-visible
> pages that got scanned despite being all-visible) -- only that subset
> of scanned_pages seems to be truly relevant. That way you wouldn't be
> able to do stuff like this. We'd have to start explicitly tracking the
> number of pages that were newly set in the VM in vacuumlazy.c to be
> able to do that, but that seems like a good idea anyway.

Can you explain a bit more what "focus on the pages" means?

> > I wonder if this is part of the reason for the distortion you addressed with
> > 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> > large relation 2% of blocks is a significant number of rows, and simply never
> > adjusting reltuples seems quite problematic. At the very least we ought to
> > account for dead tids we removed or such, instead of just freezing reltuples.
>
> As I mentioned, it only kicks in when relpages is *precisely* the same
> as last time (not one block more or one block less), *and* we only
> scanned less than 2% of rel_pages. It's quite possible that that's
> insufficient, but I can't imagine it causing any new problems.

In OLTP workloads relpages will often not change, even if there's lots of write activity, because there's plenty of free space in the relation, and there's something not-removable on the last page. relpages also won't change if data is deleted anywhere but the end.

I don't think it's hard to see this causing problems. Set autovacuum_vacuum_scale_factor to something smaller than 2%, or vacuum manually somewhat frequently. Incrementally delete old data. Unless analyze saves you - which might not be run, or might have a different scale factor, or might not be run manually - reltuples will stay exactly the same, despite the data changing substantially.

Greetings,

Andres Freund
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith wrote: > > Here are my review comments for the latest patch v16-0001. (excluding > the test code) > ... > > 8. AlterSubscription (general) > > I observed during testing there are 3 different errors…. > > At subscription CREATE time you can get this error: > ERROR: min_apply_delay > 0 and streaming = parallel are mutually > exclusive options > > If you try to ALTER the min_apply_delay when already streaming = > parallel you can get this error: > ERROR: cannot enable min_apply_delay for subscription in streaming = > parallel mode > > If you try to ALTER the streaming to be parallel if there is already a > min_apply_delay > 0 then you can get this error: > ERROR: cannot enable streaming = parallel mode for subscription with > min_apply_delay > > ~ > > IMO there is no need to have 3 different error message texts. I think > all these cases are explained by just the first text (ERROR: > min_apply_delay > 0 and streaming = parallel are mutually exclusive > options) > > After checking the regression test output I can see the merit of your separate error messages like this, even if they are maybe not strictly necessary. So feel free to ignore my previous review comment. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf
On Wed, Jan 18, 2023 at 10:35:29AM +0100, Jelte Fennema wrote: > Anything I can do to help with this? Or will you do that yourself? Nope. I just need some time to finish wrapping it, that's all. -- Michael signature.asc Description: PGP signature
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund wrote: > I can, it should be just about trivial code-wise. A bit queasy about trying to > forsee the potential consequences. That's always going to be true, though. > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to > what VACUUM itself observed, ignoring any concurrently reported dead > tuples. As far as I can tell, when vacuum takes a long time, that can lead to > severely under-accounting dead tuples. Did I not mention that one? There are so many that it can be hard to keep track! That's why I catalog them. As you point out, it's the dead tuples equivalent of my ins_since_vacuum complaint. The problem is exactly analogous to my recent complaint about insert-driven autovacuums. > I wonder if we ought to remember the dead_tuples value at the start of the > heap scan and use that to adjust the final dead_tuples value. I'd lean towards > over-counting rather than under-counting and thus probably would go for > something like > > tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - > deadtuples_at_start); > > i.e. assuming we might have missed all concurrently reported dead tuples. This is exactly what I was thinking of doing for both issues (the ins_since_vacuum one and this similar dead tuples one). It's completely logical. This creates an awkward but logical question, though: what if dead_tuples doesn't go down at all? What if VACUUM actually has to increase it, because VACUUM runs so slowly relative to the workload? Of course the whole point is to make it more likely that VACUUM will keep up with the workload. I'm just not quite sure that the consequences of doing it that way are strictly a good thing. Bearing in mind that we don't differentiate between recently dead and dead here. 
Fun fact: autovacuum can spin with pgbench because of recently dead tuples, even absent an old snapshot/long running xact, if you set things aggressively enough: https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com I think that we probably need to do something like always make sure that dead_items goes down by a small amount at the end of each VACUUM, even when that's a lie. Maybe we also have a log message about autovacuum not keeping up, so as to not feel too guilty about it. You know, to give the user a chance to reconfigure autovacuum so that it stops happening. > Of course we could instead move to something like ins_since_vacuum and reset > it at the *start* of the vacuum. But that'd make the error case harder, > without giving us more accuracy, I think? It would. It seems illogical to me. > I do think this is an argument for splitting up dead_tuples into separate > "components" that we track differently. I.e. tracking the number of dead > items, not-yet-removable rows, and the number of dead tuples reported from DML > statements via pgstats. Is it? Why? I'm all in favor of doing that, of course. I just don't particularly think that it's related to this other problem. One problem is that we count dead tuples incorrectly because we don't account for the fact that things change while VACUUM runs. The other problem is that the thing that is counted isn't broken down into distinct subcategories of things -- things are bunched together that shouldn't be. Oh wait, you were thinking of what I said before -- my "awkward but logical question". Is that it? -- Peter Geoghegan
Re: Switching XLog source from archive to streaming when primary available
On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
> On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart wrote:
>> With your patch, we might replay one of these "old" files in pg_wal instead
>> of the complete version of the file from the archives,
>
> That's true even today, without the patch, no? We're not changing the
> existing behaviour of the state machine. Can you explain how it
> happens with the patch?

My point is that on HEAD, we will always prefer a complete archive file. With your patch, we might instead choose to replay an old file in pg_wal because we are artificially advancing the state machine. IOW, even if there's a complete archive available, we might not use it. This is a behavior change, but I think it is okay.

>> Would you mind testing this scenario?
>
> How about something like below for testing the above scenario? If it
> looks okay, I can add it as a new TAP test file.
>
> 1. Generate WAL files f1 and f2 and archive them.
> 2. Check the replay lsn and WAL file name on the standby; when it
>    replays up to f2, stop the standby.
> 3. Set recovery to fail on the standby, and stop the standby.
> 4. Generate f3, f4 (partially filled) on the primary.
> 5. Manually copy f3, f4 to the standby's pg_wal.
> 6. Start the standby; since recovery is set to fail, and there're new
>    WAL files (f3, f4) under its pg_wal, it must replay those WAL files
>    (check the replay lsn and WAL file name, it must be f4) before
>    switching to streaming.
> 7. Generate f5 on the primary.
> 8. The standby should receive f5 and replay it (check the replay lsn
>    and WAL file name, it must be f5).
> 9. Set streaming to fail on the standby and set recovery to succeed.
> 10. Generate f6 on the primary.
> 11. The standby should receive f6 via archive and replay it (check the
>     replay lsn and WAL file name, it must be f6).

I meant testing the scenario where there's an old file in pg_wal, a complete file in the archives, and your new GUC forces replay of the former. This might be difficult to do in a TAP test. Ultimately, I just want to validate the assumptions discussed above.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 13:45:19 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 1:08 PM Andres Freund wrote:
> > I suggested nearby to only have ANALYZE dead_tuples it if there's been no
> > [auto]vacuum since the stats entry was created. That allows recovering from
> > stats resets, be it via crashes or explicitly. What do you think?
>
> I like that idea. It's far simpler than the kind of stuff I was
> thinking about, and probably just as effective. Even if it introduces
> some unforeseen problem (which seems unlikely), we can still rely on
> pgstat_report_vacuum() to set things straight before too long.
>
> Are you planning on writing a patch for this? I'd be very interested
> in seeing this through. Could definitely review it.

I can, it should be just about trivial code-wise. A bit queasy about trying to foresee the potential consequences.

A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to what VACUUM itself observed, ignoring any concurrently reported dead tuples. As far as I can tell, when vacuum takes a long time, that can lead to severely under-accounting dead tuples. We probably lose track of a bit more than 50% of the dead tuples reported since vacuum started. During the heap scan phase we don't notice all the tuples reported before the current scan point, and then we don't notice them at all during the index/heap vacuuming.

The only saving grace is that it'll be corrected at the next VACUUM. But the next vacuum might very well be delayed noticeably due to this. This is similar to the issue around ins_since_vacuum that Peter pointed out.

I wonder if we ought to remember the dead_tuples value at the start of the heap scan and use that to adjust the final dead_tuples value. I'd lean towards over-counting rather than under-counting and thus probably would go for something like

    tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);

i.e. assuming we might have missed all concurrently reported dead tuples.

Of course we could instead move to something like ins_since_vacuum and reset it at the *start* of the vacuum. But that'd make the error case harder, without giving us more accuracy, I think?

I do think this is an argument for splitting up dead_tuples into separate "components" that we track differently. I.e. tracking the number of dead items, not-yet-removable rows, and the number of dead tuples reported from DML statements via pgstats.

Greetings,

Andres Freund
Re: Issue with psql's create_help.pl under perlcritic
On Wed, Jan 18, 2023 at 08:43:01AM -0500, Andrew Dunstan wrote: > Looks fine. Thanks for double-checking, will apply shortly. > Interesting it's not caught by perlcritic on my Fedora 35 > instance, nor my Ubuntu 22.04 instance. Perhaps that just shows up on the latest version of perlcritic? I use 1.148 in the box where this issue shows up, for all the stable branches of Postgres. -- Michael signature.asc Description: PGP signature
Unicode grapheme clusters
Just my luck, I had to dig into a two-"character" emoji that came to me as part of a Google Calendar entry --- here it is: 👩🏼‍⚕️🩺

        Unicode   UTF8          libc len
        U+1F469   f0 9f 91 a9   2    woman
        U+1F3FC   f0 9f 8f bc   2    emoji modifier fitzpatrick type-3 (skin tone)
        U+200D    e2 80 8d      0    zero width joiner (ZWJ)
        U+2695    e2 9a 95      1    staff with snake
        U+FE0F    ef b8 8f      0    variation selector-16 (VS16) (previous character as emoji)
        U+1FA7A   f0 9f a9 ba   2    stethoscope

Now, in Debian 11 character apps like vi, I see:

        a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses. I also see '<200d>' in blue. In current Firefox, I see a woman with a stethoscope around her neck, and then a stethoscope. Copying the Unicode string above into a browser URL bar should show you the same thing, though it might be too small to see.

For those looking for details on how these should be handled, see this for an explanation of grapheme clusters that use things like skin tone modifiers and zero-width joiners:

        https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:

        https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

        https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

        This is by design. wcwidth() is utterly broken. Any terminal or
        terminal application that uses it is also utterly broken. Forget
        about emoji wcwidth() doesn't even work with combining characters,
        zero width joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

        show lc_ctype;
           lc_ctype
        -------------
         en_US.UTF-8

        select octet_length('👩🏼‍⚕️🩺');
         octet_length
        --------------
                   21

        select character_length('👩🏼‍⚕️🩺');
         character_length
        ------------------
                        6

The octet_length() is verified as correct by counting the UTF8 bytes above. I think character_length() is correct if we consider the number of Unicode characters, display and non-display.
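As a cross-check outside Postgres (my own sketch, not from the thread): the same two counts fall out of any UTF-8 implementation, here Python, with the string built from the six code points listed in the table above:

```python
# The six code points of the woman-health-worker + stethoscope sequence.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

print(len(s))                  # code points: 6, matching character_length()
print(len(s.encode("utf-8")))  # bytes: 4+4+3+3+3+4 = 21, matching octet_length()
```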
I then started looking at how Postgres computes and uses _display_ width. The display width, when properly processed like by Firefox, is 4 (two double-wide displayed characters.) Based on the libc display lengths above and incorrect displayed character lengths in Debian 11, it would be 7.

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls the per-encoding width function stored in pg_wchar_table.dsplen --- for UTF8, the function is pg_utf_dsplen(). There is no SQL API for display length, but PQdsplen() can be called with a string via pg_wcswidth() in the gdb debugger:

        pg_wcswidth(const char *pwcs, size_t len, int encoding)

        UTF8 encoding == 6

        (gdb) print (int)pg_wcswidth("abcd", 4, 6)
        $8 = 4
        (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
        $9 = 7

Here is the psql output:

        SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
         octet_length | ?column? | character_length
        --------------+----------+------------------
                   21 | 👩🏼‍⚕️🩺   |                6

More often called from psql are pg_wcssize() and pg_wcsformat(), which also call PQdsplen(). I think the question is whether we want to report a string width that assumes the display doesn't understand the more complex UTF8 controls/"characters" listed above.

tsearch has p_isspecial(), which calls pg_dsplen(), which also uses pg_wchar_table.dsplen. p_isspecial() also has a small table of what it calls "strange_letter". Here is a report about Unicode variation selector and combining characters from May, 2022:

        https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect, which you will never be.
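The 7 that pg_wcswidth() reports can be reproduced with a naive wcwidth()-style rule that knows about zero-width/format characters and East Asian Wide characters, but nothing about ZWJ sequences or VS16 emoji presentation. This rough Python sketch is my own illustration, not PostgreSQL code; it approximates the width tables with Python's unicodedata module (a Unicode 12-aware Python, 3.8 or later, is assumed so that U+1FA7A is known):

```python
import unicodedata

def naive_width(ch: str) -> int:
    # Combining marks and format characters (ZWJ is Cf, VS16 is Mn)
    # occupy no columns.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide/Fullwidth characters (including most emoji) take two.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"
print(sum(naive_width(ch) for ch in s))  # 2+2+0+1+0+2 = 7
```

A grapheme-aware renderer like Firefox instead collapses the first five code points into one two-column glyph, giving a total of 4.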
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund wrote:
> vacuum-no  reltuples/n_live_tup  n_dead_tup
> 1          476                   500
> 2          2500077               500
> 3          1250184               500
> 4          625266                500
> 5          312821                500
> 10         10165                 500
>
> Each vacuum halves reltuples. That's going to screw badly with all kinds of
> things. Planner costs completely out of whack etc.

I get that that could be a big problem, even relative to the more immediate problem of VACUUM just spinning like it does in your test case. What do you think we should do about it?

What do you think about my idea of focussing on the subset of pages newly set all-visible in the VM?

> I wonder if this is part of the reason for the distortion you addressed with
> 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> large relation 2% of blocks is a significant number of rows, and simply never
> adjusting reltuples seems quite problematic. At the very least we ought to
> account for dead tids we removed or such, instead of just freezing reltuples.

As I mentioned, it only kicks in when relpages is *precisely* the same as last time (not one block more or one block less), *and* we only scanned less than 2% of rel_pages. It's quite possible that that's insufficient, but I can't imagine it causing any new problems.

I think that we need to be realistic about what's possible while storing a small, fixed amount of information. There is always going to be some distortion of this kind. We can do something about the obviously pathological cases, where errors can grow without bound. But you have to draw the line somewhere, unless you're willing to replace the whole approach with something that stores historic metadata. What kind of tradeoff do you want to make here? I think that you have to make one.

--
Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 14:37:20 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 2:22 PM Andres Freund wrote:
> > The problem with the change is here:
> >
> > /*
> >  * Okay, we've covered the corner cases. The normal calculation is to
> >  * convert the old measurement to a density (tuples per page), then
> >  * estimate the number of tuples in the unscanned pages using that figure,
> >  * and finally add on the number of tuples in the scanned pages.
> >  */
> > old_density = old_rel_tuples / old_rel_pages;
> > unscanned_pages = (double) total_pages - (double) scanned_pages;
> > total_tuples = old_density * unscanned_pages + scanned_tuples;
> > return floor(total_tuples + 0.5);
>
> My assumption has always been that vac_estimate_reltuples() is prone
> to issues like this because it just doesn't have access to very much
> information each time it runs. It can only see the delta between what
> VACUUM just saw, and what the last VACUUM (or possibly the last
> ANALYZE) saw according to pg_class. You're always going to find
> weaknesses in such a model if you go looking for them. You're always
> going to find a way to salami slice your way from good information to
> total nonsense, if you pick the right/wrong test case, which runs
> VACUUM in a way that allows whatever bias there may be to accumulate.
> It's sort of like the way floating point values can become very
> inaccurate through a process that allows many small inaccuracies to
> accumulate over time.

Sure. To start with, there's always going to be some inaccuracy when you assume an even distribution across a table. But I think this goes beyond that. This problem occurs with a completely even distribution, exactly the same inputs to the estimation function every time.

My example under-sold the severity, because I had only 5% non-deletable tuples. Here it is with 50% non-removable tuples (I've seen way worse than 50% in many real-world cases), and a bunch of complexity removed (attached).

vacuum-no  reltuples/n_live_tup  n_dead_tup
1          476                   500
2          2500077               500
3          1250184               500
4          625266                500
5          312821                500
10         10165                 500

Each vacuum halves reltuples. That's going to screw badly with all kinds of things. Planner costs completely out of whack etc.

I wonder if this is part of the reason for the distortion you addressed with 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a large relation 2% of blocks is a significant number of rows, and simply never adjusting reltuples seems quite problematic. At the very least we ought to account for dead tids we removed or such, instead of just freezing reltuples.

Greetings,

Andres Freund

vactest_more_extreme.sql
Description: application/sql
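To see how the quoted calculation produces this halving, here is a toy re-implementation (a hedged sketch of the mechanism, not the attached vactest_more_extreme.sql; the table geometry and page counts are made-up illustration values). Each VACUUM scans only the pages holding dead tuples, and every tuple on those pages is recently dead, so none of them is counted in scanned_tuples:

```python
def vac_estimate_reltuples(old_rel_pages, old_rel_tuples,
                           total_pages, scanned_pages, scanned_tuples):
    # Mirrors the "normal calculation" quoted above: extrapolate the old
    # density over the unscanned pages, then add what was counted as live
    # on the scanned pages.
    old_density = old_rel_tuples / old_rel_pages
    unscanned_pages = total_pages - scanned_pages
    return int(old_density * unscanned_pages + scanned_tuples + 0.5)

total_pages = 1000
reltuples = 100_000    # starting estimate: 100 tuples/page
scanned_pages = 500    # the half of the table containing dead tuples

for vacuum_no in range(1, 6):
    # All tuples on the scanned pages are recently dead, so scanned_tuples
    # is 0 -- and the estimate halves on every cycle.
    reltuples = vac_estimate_reltuples(total_pages, reltuples,
                                       total_pages, scanned_pages, 0)
    print(vacuum_no, reltuples)   # 50000, 25000, 12500, 6250, 3125
```

Feeding the recently-dead tuples on the scanned pages back into scanned_tuples, as proposed upthread for the vac_update_relstats() path, makes the same loop converge instead of spiraling toward zero.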
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan wrote:
> The problems in this area tend to be that vac_estimate_reltuples()
> behaves as if it sees a random sample, when in fact it's far from
> random -- it's the same scanned_pages as last time, and the ten other
> times before that. That's a feature of your test case, and other
> similar vac_estimate_reltuples test cases that I came up with in the
> past. Both scenarios involved skipping using the visibility map in
> multiple successive VACUUM operations.

FWIW, the problem in your test case goes away if you just change this line:

DELETE FROM foo WHERE i < (1000 * 0.1)

To this:

DELETE FROM foo WHERE i < (1000 * 0.065)

Which is not a huge difference, overall. This effect is a consequence of the heuristics I added in commit 74388a1a, so it's only present on Postgres 15+. Whether or not this is sufficient protection is of course open to interpretation.

One problem with those heuristics (as far as your test case is concerned) is that they either work, or they don't work. For example they're conditioned on "old_rel_pages == total_pages".

--
Peter Geoghegan
Re: [DOCS] Stats views and functions not in order?
On Thu, Jan 19, 2023 at 2:55 AM David G. Johnston wrote: > > On Wed, Jan 18, 2023 at 8:38 AM Tom Lane wrote: >> >> "David G. Johnston" writes: >> > ... I was going for the html effect >> > of having these views chunked into their own pages, any other changes being >> > non-detrimental. >> >> But is that a result we want? It will for example break any bookmarks >> that people might have for these documentation entries. It will also >> pretty thoroughly break the cross-version navigation links in this >> part of the docs. >> >> >> Maybe the benefit is worth those costs, but I'm entirely not convinced >> of that. I think we need to tread pretty lightly when rearranging >> longstanding documentation-layout decisions. >> > David already gave a good summary [1], but since I was the OP here is the background of v10-0001 from my PoV. ~ The original $SUBJECT requirements evolved to also try to make each view appear on a separate page after that was suggested by DavidJ [2]. I was unable to achieve per-page views "without radically changing the document structure." [3], but DavidJ found a way [4] to do it using refentry. I then wrote the patch v8-0003 using that strategy, which after more rebasing became the v10-0001 you see today. I did prefer the view-per-page results (although I also only use HTML docs). But my worry is that there seem still to be a few unknowns about how this might affect other (not the HTML) renderings of the docs. If you think that risk is too great, or if you feel this patch will cause unwarranted link/bookmark grief, then I am happy to just drop it. 
-- [1] DJ overview - https://www.postgresql.org/message-id/CAKFQuwaVm%3D6d_sw9Wrp4cdSm5_k%3D8ZVx0--v2v4BH4KnJtqXqg%40mail.gmail.com [2] DJ suggested view-per-page - https://www.postgresql.org/message-id/CAKFQuwa9JtoCBVc6CJb7NC5FqMeEAy_A8X4H8t6kVaw7fz9LTw%40mail.gmail.com [3] PS don't know how to do it - https://www.postgresql.org/message-id/CAHut%2BPv5Efz1TLWOLSoFvoyC0mq%2Bs92yFSd534ctWSdjEFtKCw%40mail.gmail.com [4] DJ how to do it using refentry - https://www.postgresql.org/message-id/CAKFQuwYkM5UZT%2B6tG%2BNgZvDcd5VavS%2BxNHsGsWC8jS-KJsxh7w%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan wrote: > Maybe you're right to be concerned to the degree that you're concerned > -- I'm not sure. I'm just adding what I see as important context. The problems in this area tend to be that vac_estimate_reltuples() behaves as if it sees a random sample, when in fact it's far from random -- it's the same scanned_pages as last time, and the ten other times before that. That's a feature of your test case, and other similar vac_estimate_reltuples test cases that I came up with in the past. Both scenarios involved skipping using the visibility map in multiple successive VACUUM operations. Perhaps we should make vac_estimate_reltuples focus on the pages that VACUUM newly set all-visible each time (not including all-visible pages that got scanned despite being all-visible) -- only that subset of scanned_pages seems to be truly relevant. That way you wouldn't be able to do stuff like this. We'd have to start explicitly tracking the number of pages that were newly set in the VM in vacuumlazy.c to be able to do that, but that seems like a good idea anyway. This probably has consequences elsewhere, but maybe that's okay. We know when the existing pg_class has no information, since that is explicitly encoded by a reltuples of -1. Obviously we'd need to be careful about stuff like that. Overall, the danger from assuming that "unsettled" pages (pages that couldn't be newly set all-visible by VACUUM) have a generic tuple density seems less than the danger of assuming that they're representative. We know that we're bound to scan these same pages in the next VACUUM anyway, so they'll have another chance to impact our view of the table's tuple density at that time. -- Peter Geoghegan
Re: CREATEROLE users vs. role properties
On Wed, Jan 18, 2023 at 12:15:33PM -0500, Robert Haas wrote: > On Mon, Jan 16, 2023 at 2:29 PM Robert Haas wrote: >> 1. It's still possible for a CREATEROLE user to hand out role >> attributes that they don't possess. The new prohibitions in >> cf5eb37c5ee0cc54c80d95c1695d7fca1f7c68cb prevent a CREATEROLE user >> from handing out membership in a role on which they lack sufficient >> permissions, but they don't prevent a CREATEROLE user who lacks >> CREATEDB from creating a new user who does have CREATEDB. I think we >> should subject the CREATEDB, REPLICATION, and BYPASSRLS attributes to >> the same rule that we now use for role memberships: you've got to have >> the property in order to give it to someone else. In the case of >> CREATEDB, this would tighten the current rules, which allow you to >> give out CREATEDB without having it. In the case of REPLICATION and >> BYPASSRLS, this would liberalize the current rules: right now, a >> CREATEROLE user cannot give REPLICATION or BYPASSRLS to another user >> even if they possess those attributes. >> >> This proposal doesn't address the CREATEROLE or CONNECTION LIMIT >> properties. It seems possible to me that someone might want to set up >> a CREATEROLE user who can't make more such users, and this proposal >> doesn't manufacture any way of doing that. It also doesn't let you >> constraint the ability of a CREATEROLE user to set a CONNECTION LIMIT >> for some other user. I think that's OK. It might be nice to have ways >> of imposing such restrictions at some point in the future, but it is >> not very obvious what to do about such cases and, importantly, I don't >> think there's any security impact from failing to address those cases. >> If a CREATEROLE user without CREATEDB can create a new role that does >> have CREATEDB, that's a privilege escalation. If they can hand out >> CREATEROLE, that isn't: they already have it. > > Here is a patch implementing the above proposal. 
> Since this is fairly closely related to already-committed work, I would
> like to get this into v16. That way, all the changes to how CREATEROLE
> works will go into a single release, which seems less confusing for
> users. It is also fairly clear to me that this is an improvement over
> the status quo. Sometimes things that seem clear to me turn out to be
> false, so if this change seems like a problem to you, please let me
> know.

This seems like a clear improvement to me. However, as the attribute
system becomes more sophisticated, I think we ought to improve the error
messages in user.c. IMHO messages like "permission denied" could be
greatly improved with some added context. For example, if I want to
change a role's password, I need both CREATEROLE and ADMIN OPTION on the
role, but the error message only mentions CREATEROLE.

postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole password 'test';
ERROR:  must have CREATEROLE privilege to change another user's password

Similarly, if I want to allow a role to grant REPLICATION to another
role, I have to give it CREATEROLE, REPLICATION, and membership with
ADMIN OPTION. If the role is missing CREATEROLE or membership with ADMIN
OPTION, it'll only ever see a "permission denied" error.
postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# alter role createrole with replication;
ALTER ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# grant otherrole to createrole;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  permission denied
postgres=> reset role;
RESET
postgres=# grant otherrole to createrole with admin option;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ALTER ROLE

If it has both CREATEROLE and membership with ADMIN OPTION (but not
REPLICATION), it'll see a more helpful message.

postgres=# create role createrole with createrole;
CREATE ROLE
postgres=# create role otherrole;
CREATE ROLE
postgres=# grant otherrole to createrole with admin option;
GRANT ROLE
postgres=# set role createrole;
SET
postgres=> alter role otherrole with replication;
ERROR:  must have replication privilege to change
Re: Rethinking the implementation of ts_headline()
Alvaro Herrera writes:
> I tried this other test, based on looking at the new regression tests
> you added,
>
> SELECT ts_headline('english', '
> Day after day, day after day,
> We stuck, nor breath nor motion,
> As idle as a painted Ship
> Upon a painted Ocean.
> Water, water, every where
> And all the boards did shrink;
> Water, water, every where,
> Nor any drop to drink.
> S. T. Coleridge (1772-1834)
> ', to_tsquery('english', '(day & drink) | (idle & painted)'),
> 'MaxFragments=5, MaxWords=9, MinWords=4');
>
>         ts_headline
> ─────────────────────────────
>  motion,↵
>  As idle as a painted Ship↵
>  Upon
> (1 fila)
>
> and was surprised that the match for the 'day & drink' arm of the OR
> disappears from the reported headline.

I'd argue that that's exactly what should happen. It's supposed to find
as-short-as-possible cover strings that satisfy the query. In this case,
satisfying the 'day & drink' condition would require nearly the entire
input text, whereas 'idle & painted' can be satisfied in just the third
line. So what you get is the third line, slightly expanded because of
some later rules that like to add context if the cover is shorter than
MaxWords.

I don't find anything particularly principled about the old behavior:

> Day after day, day after day,↵
> We stuck ... motion, ↵
> As idle as a painted Ship ↵
> Upon

It's including hits for "day" into the cover despite the lack of any
nearby match to "drink".

> Another thing I think might be a regression is the way fragments are
> selected. Consider what happens if I change the "idle & painted" in the
> earlier query to "idle <-> painted", and MaxWords is kept low:

Of course, "idle <-> painted" is satisfied nowhere in this text (the
words are there, but not adjacent). So the cover has to run from the
last 'day' to the 'drink'. I think v15 is deciding that it runs from the
first 'day' to the 'drink', which while not exactly wrong is not the
shortest cover.
The rest of this is just minor variations in what mark_hl_fragments() decides to do based on the precise cover string it's given. I don't dispute that mark_hl_fragments() could be improved, but this patch doesn't touch its algorithm and I think that doing so should be material for a different patch. (I have no immediate ideas about what would be a better algorithm for it, anyway.) > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean" > verse, which perhaps they shouldn't do, since it's not preceded by > 'idle'.) Yeah, and 'idle' too. Once it's chosen a string to show, it'll highlight all the query words within that string, whether they constitute part of the match or not. I can see arguments on both sides of doing it that way; it was probably more sensible before we had phrase match than it is now. But again, changing that phase of the processing is outside the scope of this patch. I'm just trying to undo the damage I did to the cover-string-selection phase. regards, tom lane
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund wrote:
> The problem with the change is here:
>
> /*
>  * Okay, we've covered the corner cases.  The normal calculation is to
>  * convert the old measurement to a density (tuples per page), then
>  * estimate the number of tuples in the unscanned pages using that figure,
>  * and finally add on the number of tuples in the scanned pages.
>  */
> old_density = old_rel_tuples / old_rel_pages;
> unscanned_pages = (double) total_pages - (double) scanned_pages;
> total_tuples = old_density * unscanned_pages + scanned_tuples;
> return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone to
issues like this because it just doesn't have access to very much
information each time it runs. It can only see the delta between what
VACUUM just saw, and what the last VACUUM (or possibly the last ANALYZE)
saw according to pg_class. You're always going to find weaknesses in
such a model if you go looking for them. You're always going to find a
way to salami-slice your way from good information to total nonsense, if
you pick the right/wrong test case, which runs VACUUM in a way that
allows whatever bias there may be to accumulate. It's sort of like the
way floating point values can become very inaccurate through a process
that allows many small inaccuracies to accumulate over time.

Maybe you're right to be concerned to the degree that you're concerned
-- I'm not sure. I'm just adding what I see as important context.

--
Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 13:42:40 -0800, Andres Freund wrote:
> The real point of change appears to be 10->11.
>
> There's a relevant looking difference in the vac_estimate_reltuples call:
>
> 10:
> /* now we can compute the new value for pg_class.reltuples */
> vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
>                                                      nblocks,
>                                                      vacrelstats->tupcount_pages,
>                                                      num_tuples);
>
> 11:
> /* now we can compute the new value for pg_class.reltuples */
> vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
>                                                       nblocks,
>                                                       vacrelstats->tupcount_pages,
>                                                       live_tuples);
>
> which points to:
>
> commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
> Author: Tom Lane
> Date:   2018-03-22 15:47:29 -0400
>
>     Sync up our various ways of estimating pg_class.reltuples.

The problem with the change is here:

    /*
     * Okay, we've covered the corner cases.  The normal calculation is to
     * convert the old measurement to a density (tuples per page), then
     * estimate the number of tuples in the unscanned pages using that figure,
     * and finally add on the number of tuples in the scanned pages.
     */
    old_density = old_rel_tuples / old_rel_pages;
    unscanned_pages = (double) total_pages - (double) scanned_pages;
    total_tuples = old_density * unscanned_pages + scanned_tuples;
    return floor(total_tuples + 0.5);

Because we'll re-scan the pages for not-yet-removable rows in subsequent
vacuums, the next vacuum will process the same pages again. By using
scanned_tuples = live_tuples, we basically remove not-yet-removable
tuples from reltuples, each time.

The commit *did* try to account for that to some degree:

+    /* also compute total number of surviving heap entries */
+    vacrelstats->new_rel_tuples =
+        vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples;

but new_rel_tuples isn't used for pg_class.reltuples or pgstat.

This is pretty nasty. We use reltuples for a lot of things.
And while analyze might fix it sometimes, that won't reliably be the
case, particularly when there are repeated autovacuums due to a
long-running transaction - there's no cause for auto-analyze to trigger
again soon, while autovacuum will go at it again and again.

Greetings,

Andres Freund
Re: Extracting cross-version-upgrade knowledge from buildfarm client
Andrew Dunstan writes: > I think we can do what you want but it's a bit harder than what you've > done. If we're not going to save the current run's product then we need > to run the upgrade test from a different directory (probably directly in > "$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to > the saved product of a previous run of this branch. Hmm, maybe that explains some inconsistent results I remember getting. > I'll take a stab at it tomorrow if you like. Please do. regards, tom lane
Re: Extracting cross-version-upgrade knowledge from buildfarm client
On 2023-01-18 We 16:14, Tom Lane wrote: > Andrew Dunstan writes: >> On 2023-01-18 We 14:32, Tom Lane wrote: >>> I suppose that the reason for not running under $from_source is to >>> avoid corrupting the saved installations with unofficial versions. >>> However, couldn't we skip the "save" step and still run the upgrade >>> tests against whatever we have saved? (Maybe skip the same-version >>> test, as it's not quite reflecting any real case then.) >> Something like this should do it: >> my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql"; > Ah, I didn't understand that $from_source is a path not just a bool. > > What do you think about the above questions? Is this $from_source > exclusion for the reason I guessed, or some other one? > > Yes, the reason is that, unlike almost everything else in the buildfarm, cross version upgrade testing requires saved state (binaries and data directory), and we don't want from-source builds corrupting that state. I think we can do what you want but it's a bit harder than what you've done. If we're not going to save the current run's product then we need to run the upgrade test from a different directory (probably directly in "$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to the saved product of a previous run of this branch. I'll take a stab at it tomorrow if you like. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
RE: Ability to reference other extensions by schema in extension scripts
> On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:
> > What would be more bullet-proof is having an extra column in
> > pg_extension or adding an extra array element to
> > pg_extension.extcondition[] that allows you to say "Hey, don't allow
> > this to be relocatable cause other extensions depend on it that have
> > explicitly referenced the schema."
>
> I've given this some more thought and I think a good compromise could be to
> add the safety net in ALTER EXTENSION SET SCHEMA so that it does not only
> check "extrelocatable" but also the presence of any extension effectively
> depending on it, in which case the operation could be prevented with a more
> useful message than "extension does not support SET SCHEMA" (what is
> currently output).
>
> Example query to determine those cases:
>
> SELECT e.extname, array_agg(v.name)
> FROM pg_extension e, pg_available_extension_versions v
> WHERE e.extname = ANY( v.requires )
>   AND e.extrelocatable
>   AND v.installed
> GROUP BY e.extname;
>
>     extname    |        array_agg
> ---------------+--------------------------
>  fuzzystrmatch | {postgis_tiger_geocoder}
>
> --strk;

The only problem with the above is that it then bars an extension from
being relocated even if no extensions reference its schema. Note that
you wouldn't be able to tell whether an extension references a schema
without analyzing the install script, which is why I was thinking
another property would be better, because that could be checked during
the install/upgrade of the dependent extensions.

I personally would be okay with this, and I think it is easier to code
and doesn't require structural changes. But I'm not sure others would
be, as it's taking away permissions they had before when it wasn't
necessary to do so.

Thanks,
Regina
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund wrote:
> I suggested nearby to only have ANALYZE update dead_tuples if there's
> been no [auto]vacuum since the stats entry was created. That allows
> recovering from stats resets, be it via crashes or explicitly. What do
> you think?

I like that idea. It's far simpler than the kind of stuff I was thinking
about, and probably just as effective. Even if it introduces some
unforeseen problem (which seems unlikely), we can still rely on
pgstat_report_vacuum() to set things straight before too long.

Are you planning on writing a patch for this? I'd be very interested in
seeing this through. I could definitely review it.

--
Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 13:08:44 -0800, Andres Freund wrote:
> One complicating factor is that VACUUM sometimes computes an incrementally
> more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> computes something sane. I unintentionally encountered one when I was trying
> something while writing this email, reproducer attached.
>
> VACUUM (DISABLE_PAGE_SKIPPING) foo;
> SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │        901 │         50 │
> └────────────┴────────────┘
>
> after one VACUUM:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    8549905 │         50 │
> └────────────┴────────────┘
>
> after 9 more VACUUMs:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    5388421 │         50 │
> └────────────┴────────────┘
> (1 row)
>
> I briefly tried it out, and it does *not* reproduce in 11, but does in
> 12. Haven't dug into what the cause is, but we probably use the wrong
> denominator somewhere...

Oh, it does actually reproduce in 11 too - my script just didn't see it
because it was "too fast". For some reason < 12 it takes longer for the
new pgstat snapshot to be available. If I add a few sleeps, it shows in
11.

The real point of change appears to be 10->11.

There's a relevant looking difference in the vac_estimate_reltuples call:

10:
	/* now we can compute the new value for pg_class.reltuples */
	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
	                                                     nblocks,
	                                                     vacrelstats->tupcount_pages,
	                                                     num_tuples);

11:
	/* now we can compute the new value for pg_class.reltuples */
	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
	                                                      nblocks,
	                                                      vacrelstats->tupcount_pages,
	                                                      live_tuples);

which points to:

commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
Author: Tom Lane
Date:   2018-03-22 15:47:29 -0400

    Sync up our various ways of estimating pg_class.reltuples.

Greetings,

Andres Freund
Re: Ability to reference other extensions by schema in extension scripts
On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:
> What would be more bullet-proof is having an extra column in pg_extension or
> adding an extra array element to pg_extension.extcondition[] that allows you
> to say "Hey, don't allow this to be relocatable cause other extensions
> depend on it that have explicitly referenced the schema."

I've given this some more thought and I think a good compromise could be
to add the safety net in ALTER EXTENSION SET SCHEMA so that it does not
only check "extrelocatable" but also the presence of any extension
effectively depending on it, in which case the operation could be
prevented with a more useful message than "extension does not support
SET SCHEMA" (what is currently output).

Example query to determine those cases:

SELECT e.extname, array_agg(v.name)
FROM pg_extension e, pg_available_extension_versions v
WHERE e.extname = ANY( v.requires )
  AND e.extrelocatable
  AND v.installed
GROUP BY e.extname;

    extname    |        array_agg
---------------+--------------------------
 fuzzystrmatch | {postgis_tiger_geocoder}

--strk;
Re: document the need to analyze partitioned tables
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> >
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.
> >
> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables? It sounds bad to me.
>
> Andrey Lepikhov had an example earlier in this thread[1]. It doesn't take
> an exotic query.
>
> Attached is a new version of my patch that tries to improve the wording.

Ah, yes, that is the example I saw but could not re-find. Here is the
output:

CREATE TABLE test (id integer, val integer) PARTITION BY hash (id);
CREATE TABLE test_0 PARTITION OF test FOR VALUES WITH (modulus 2, remainder 0);
CREATE TABLE test_1 PARTITION OF test FOR VALUES WITH (modulus 2, remainder 1);
INSERT INTO test (SELECT q, q FROM generate_series(1,10) AS q);
VACUUM ANALYZE test;
INSERT INTO test (SELECT q, q%2 FROM generate_series(11,200) AS q);
VACUUM ANALYZE test_0,test_1;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;

                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t1.id = t2.val)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

VACUUM ANALYZE test;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;

                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t2.val = t1.id)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

I see the inner side uses 'val' in the first EXPLAIN and 'id' in the
second, and you are right that 'val' has mostly 0/1 values. Is it
possible to document when partitioned-table statistics help?

--
  Bruce Momjian        https://momjian.us
  EDB                  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.
Re: Extracting cross-version-upgrade knowledge from buildfarm client
Andrew Dunstan writes: > On 2023-01-18 We 14:32, Tom Lane wrote: >> I suppose that the reason for not running under $from_source is to >> avoid corrupting the saved installations with unofficial versions. >> However, couldn't we skip the "save" step and still run the upgrade >> tests against whatever we have saved? (Maybe skip the same-version >> test, as it's not quite reflecting any real case then.) > Something like this should do it: > my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql"; Ah, I didn't understand that $from_source is a path not just a bool. What do you think about the above questions? Is this $from_source exclusion for the reason I guessed, or some other one? regards, tom lane
Re: almost-super-user problems that we haven't fixed yet
On Wed, Jan 18, 2023 at 02:51:38PM -0500, Robert Haas wrote:
> Should (nfree < SuperuserReservedBackends) be using <=, or am I confused?

I believe < is correct. At this point, the new backend will have already
claimed a proc struct, so if the number of remaining free slots equals
the number of reserved slots, it is okay.

> What's the deal with removing "and no new replication connections will
> be accepted" from the documentation? Is the existing documentation
> just wrong? If so, should we fix that first? And maybe delete
> "non-replication" from the error message that says "remaining
> connection slots are reserved for non-replication superuser
> connections"? It seems like right now the comments say that
> replication connections are a completely separate pool of connections,
> but the documentation and the error message make it sound otherwise.
> If that's true, then one of them is wrong, and I think it's the
> docs/error message. Or am I just misreading it?

I think you are right. This seems to have been missed in ea92368. I
moved this part to a new patch that should probably be back-patched to
v12.

On that note, I wonder if it's worth changing the "sorry, too many
clients already" message to make it clear that max_connections has been
reached. IME some users are confused by this error, and I think it would
be less confusing if it pointed to the parameter that governs the number
of connection slots. I'll create a new thread for this.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 8f0aa2fa54ae01149cffe9a69265f98e76a08a23 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Wed, 18 Jan 2023 12:43:41 -0800
Subject: [PATCH v3 1/3] Code review for ea92368.

This commit missed an error message and a line in the docs.  Back-patch
to v12.
---
 doc/src/sgml/config.sgml          | 3 +--
 src/backend/utils/init/postinit.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89d53f2a64..e019a1aac9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -725,8 +725,7 @@ include_dir 'conf.d'
         number of active concurrent connections is at least
         max_connections minus
         superuser_reserved_connections, new
-        connections will be accepted only for superusers, and no
-        new replication connections will be accepted.
+        connections will be accepted only for superusers.

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..9145d96b38 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -931,7 +931,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 		!HaveNFreeProcs(ReservedBackends))
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
-				 errmsg("remaining connection slots are reserved for non-replication superuser connections")));
+				 errmsg("remaining connection slots are reserved for superusers")));

 	/* Check replication permissions needed for walsender processes. */
 	if (am_walsender)
--
2.25.1

>From bc110461e4f0a73c2b76ba0a6e821349c2cbe3df Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v3 2/3] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC that
will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +++++++++---------
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char *Unix_socket_directories;
 char *ListenAddresses;

 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int ReservedBackends;
+int SuperuserReservedBackends;

 /* The socket(s) we're listening to. */
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan wrote: > Some of what I'm proposing arguably amounts to deliberately adding a > bias. But that's not an unreasonable thing in itself. I think of it as > related to the bias-variance tradeoff, which is a concept that comes > up a lot in machine learning and statistical inference. To be clear, I was thinking of unreservedly trusting what pgstat_report_analyze() says about dead_tuples in the event of its estimate increasing our dead_tuples estimate, while being skeptical (to a varying degree) when it's the other way around. But now I need to go think about what Andres just brought up... -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi,

On 2023-01-18 12:15:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 11:02 AM Robert Haas wrote:
> > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan wrote:
> > > pgstat_report_analyze() will totally override the
> > > tabentry->dead_tuples information that drives autovacuum.c, based on
> > > an estimate derived from a random sample -- which seems to me to be an
> > > approach that just doesn't have any sound theoretical basis.
> >
> > Yikes. I think we don't have a choice but to have a method to correct
> > the information somehow, because AFAIK the statistics system is not
> > crash-safe. But that approach does seem to carry significant risk of
> > overwriting correct information with wrong information.

I suggested nearby to only have ANALYZE update dead_tuples if there's
been no [auto]vacuum since the stats entry was created. That allows
recovering from stats resets, be it via crashes or explicitly. What do
you think?

To add insult to injury, we overwrite accurate information gathered by
VACUUM with bad information gathered by ANALYZE if you do VACUUM
ANALYZE.

One complicating factor is that VACUUM sometimes computes an
incrementally more bogus n_live_tup when it skips pages due to the VM,
whereas ANALYZE computes something sane. I unintentionally encountered
one when I was trying something while writing this email, reproducer
attached.

VACUUM (DISABLE_PAGE_SKIPPING) foo;
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│        901 │         50 │
└────────────┴────────────┘

after one VACUUM:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    8549905 │         50 │
└────────────┴────────────┘

after 9 more VACUUMs:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    5388421 │         50 │
└────────────┴────────────┘
(1 row)

I briefly tried it out, and it does *not* reproduce in 11, but does in
12. Haven't dug into what the cause is, but we probably use the wrong
denominator somewhere...

Greetings,

Andres Freund

vactest.sql
Description: application/sql
Re: Extracting cross-version-upgrade knowledge from buildfarm client
On 2023-01-18 We 14:32, Tom Lane wrote:
> One more thing before we move on from this topic. I'd been testing
> modified versions of the AdjustUpgrade.pm logic by building from a
> --from-source source tree, which seemed way easier than dealing
> with a private git repo. As it stands, TestUpgradeXversion.pm
> refuses to run under $from_source, but I just diked that check out
> and it seemed to work fine for my purposes. Now, that's going to be
> a regular need going forward, so I'd like to not need a hacked version
> of the BF client code to do it.
>
> Also, your committed version of TestUpgradeXversion.pm breaks that
> use-case because you did
>
> - unshift(@INC, "$self->{pgsql}/src/test/perl");
> + unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");
>
> which AFAICS is an empty directory in a $from_source run.
>
> I suppose that the reason for not running under $from_source is to
> avoid corrupting the saved installations with unofficial versions.
> However, couldn't we skip the "save" step and still run the upgrade
> tests against whatever we have saved? (Maybe skip the same-version
> test, as it's not quite reflecting any real case then.)
>
> Here's a quick draft patch showing what I have in mind. There may
> well be a better way to deal with the wheres-the-source issue than
> what is in hunk 2. Also, I didn't reindent the unchanged code in
> sub installcheck, and I didn't add anything about skipping
> same-version tests.

No, that won't work if we're using vpath builds (which is why I changed
it from what you had). $self->{pgsql} is always the build directory.
Something like this should do it:

my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas wrote: > I don't know enough about the specifics of how this works to have an > intelligent opinion about how likely these particular ideas are to > work out. However, I think it's risky to look at estimates and try to > infer whether they are reliable. It's too easy to be wrong. What we > really want to do is anchor our estimates to some data source that we > know we can trust absolutely. If you trust possibly-bad data less, it > screws up your estimates more slowly, but it still screws them up. Some of what I'm proposing arguably amounts to deliberately adding a bias. But that's not an unreasonable thing in itself. I think of it as related to the bias-variance tradeoff, which is a concept that comes up a lot in machine learning and statistical inference. We can afford to be quite imprecise at times, especially if we choose a bias that we know has much less potential to do us harm -- some mistakes hurt much more than others. We cannot afford to ever be dramatically wrong, though -- especially in the direction of vacuuming less often. Besides, there is something that we *can* place a relatively high degree of trust in that will still be in the loop here: VACUUM itself. If VACUUM runs then it'll call pgstat_report_vacuum(), which will set the record straight in the event of over estimating dead tuples. To some degree the problem of over estimating dead tuples is self-limiting. > If Andres is correct that what really matter is the number of pages > we're going to have to dirty, we could abandon counting dead tuples > altogether and just count not-all-visible pages in the VM map. That's what matters most from a cost point of view IMV. So it's a big part of the overall picture, but not everything. It tells us relatively little about the benefits, except perhaps when most pages are all-visible. -- Peter Geoghegan
Re: Non-superuser subscription owners
> On Jan 18, 2023, at 12:51 PM, Robert Haas wrote: > > Unless I'm missing something, it seems like this could be a quite small patch. I didn't like the idea of the create/alter subscription commands needing to parse the connection string and think about what it might do, because at some point in the future we might extend what things are allowed in that string, and we have to keep everything that contemplates that string in sync. I may have been overly hesitant to tackle that problem. Or maybe I just ran short of round tuits. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Non-superuser subscription owners
On Wed, Jan 18, 2023 at 3:26 PM Mark Dilger wrote: > Prior to the patch, if a superuser created a subscription, then later was > demoted to non-superuser, the subscription apply workers still applied the > changes with superuser force. So creating a superuser Alice, letting Alice > create a subscription, then revoking superuser from Alice didn't accomplish > anything interesting. But after the patch, it does. The superuser can now > create non-superuser subscriptions. (I'm not sure this ability is well > advertised.) But the problem of non-superuser roles creating non-superuser > subscriptions is not solved. Ah, OK, thanks for the clarification! > There are different ways of solving (1), and Jeff and I discussed them in Dec > 2021. My recollection was that idea (3) was the cleanest. Other ideas might > be simpler than (3), or they may just appear simpler but in truth turn into a > can of worms. I don't know, since I never went as far as trying to implement > either approach. > > Idea (2) seems to contemplate non-superuser subscription owners as a > theoretical thing, but it's quite real already. Again, see > 027_nosuperuser.pl. I think the solution to the problem of a connection string trying to access local files is to just look at the connection string, decide whether it does that, and if yes, require the owner to have pg_read_server_files as well as pg_create_subscription. (3) is about creating some more sophisticated and powerful solution to that problem, but that seems like a nice-to-have, not something essential, and a lot more complicated to implement. I guess what I listed as (2) is not relevant since I didn't understand correctly what the current state of things is. Unless I'm missing something, it seems like this could be a quite small patch. -- Robert Haas EDB: http://www.enterprisedb.com
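The rule Robert sketches -- require pg_read_server_files when the connection string would touch local files -- could look roughly like the toy check below. This is only an illustration of the cases named in the thread (passfile, sslcert, sslkey, and a "host" value that is a path); the function name is invented and the real list of file-referencing libpq parameters would need to be worked out carefully:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Toy check: does a single libpq connection-string parameter imply
 * access to local server files?  Covers only the cases mentioned in
 * the thread; illustrative, not an actual patch.
 */
static bool
param_touches_local_files(const char *keyword, const char *value)
{
	if (strcmp(keyword, "passfile") == 0 ||
		strcmp(keyword, "sslcert") == 0 ||
		strcmp(keyword, "sslkey") == 0)
		return true;
	if (strcmp(keyword, "host") == 0 && value != NULL && value[0] == '/')
		return true;			/* a path means a Unix-socket directory */
	return false;
}
```

A caller would loop over the parsed connection options and demand pg_read_server_files as soon as any parameter trips this check.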
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 3:15 PM Peter Geoghegan wrote: > Suppose that we notice that its new > estimate for live_tuples approximately matches what the stats > subsystem already thought about live_tuples, while dead_tuples is far > far lower. We shouldn't be so credulous as to believe the new > dead_tuples estimate at that point. > > Another angle of attack is the PD_ALL_VISIBLE page-level bit, which > acquire_sample_rows() could pay attention to -- especially in larger > tables, where the difference between all pages and just the > all-visible subset of pages is most likely to matter. The more sampled > pages that had PD_ALL_VISIBLE set, the less credible the new > dead_tuples estimate will be (relative to existing information), and > so pgstat_report_analyze() should prefer the new estimate over the old > one in proportion to that. I don't know enough about the specifics of how this works to have an intelligent opinion about how likely these particular ideas are to work out. However, I think it's risky to look at estimates and try to infer whether they are reliable. It's too easy to be wrong. What we really want to do is anchor our estimates to some data source that we know we can trust absolutely. If you trust possibly-bad data less, it screws up your estimates more slowly, but it still screws them up. If Andres is correct that what really matters is the number of pages we're going to have to dirty, we could abandon counting dead tuples altogether and just count not-all-visible pages in the VM map. That would be cheap enough to recompute periodically. However, it would also be a big behavior change from the way we do things now, so I'm not sure it's a good idea. Still, a quantity that we can be certain we're measuring accurately is better than one we can't measure accurately even if it's a somewhat worse proxy for the thing we really care about. There's a ton of value in not being completely and totally wrong. 
> FWIW, part of my mental model with VACUUM is that the rules kind of > change in the case of a big table. We're far more vulnerable to issues > such as (say) waiting for cleanup locks because the overall cadence > used by autovacuum is so infrequent relative to everything else. > There are more opportunities for things to go wrong, worse > consequences when they do go wrong, and greater potential for the > problems to compound. Yes. A lot of parts of PostgreSQL, including this one, were developed a long time ago when PostgreSQL databases were a lot smaller than they often are today. -- Robert Haas EDB: http://www.enterprisedb.com
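Robert's alternative above -- count not-all-visible pages instead of dead tuples -- amounts to a single pass over the visibility map. A deliberately simplified model (one byte per heap page; the real visibility map packs two bits per page, all-visible and all-frozen, four pages to a byte):

```c
#define VM_ALL_VISIBLE	0x01	/* simplified per-page flag */

/*
 * Count heap pages whose all-visible bit is not set.  Toy model of the
 * metric discussed above, not the actual visibilitymap.c layout.
 */
static long
count_not_all_visible(const unsigned char *vm, long npages)
{
	long		n = 0;

	for (long p = 0; p < npages; p++)
		if ((vm[p] & VM_ALL_VISIBLE) == 0)
			n++;
	return n;
}

/* example map: pages 1 and 4 are not all-visible */
static const unsigned char demo_vm[5] = {1, 0, 1, 1, 0};
```

Because the map is tiny relative to the heap (two bits per page in the real layout), rescanning it periodically is what makes the "cheap enough to recompute" claim plausible.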
Re: Non-superuser subscription owners
> On Jan 18, 2023, at 11:38 AM, Robert Haas wrote: > > I was just noticing that what was committed here didn't actually fix > the problem implied by the subject line. That is, non-superuser still > can't own subscriptions. Not so. They can. See src/test/subscription/027_nosuperuser.pl > To put that another way, there's no way for > the superuser to delegate the setup and administration of logical > replication to a non-superuser. True. > That's a bummer. Also true. > Reading the thread, I'm not quite sure why we seemingly did all the > preparatory work and then didn't actually fix the problem. Prior to the patch, if a superuser created a subscription, then later was demoted to non-superuser, the subscription apply workers still applied the changes with superuser force. So creating a superuser Alice, letting Alice create a subscription, then revoking superuser from Alice didn't accomplish anything interesting. But after the patch, it does. The superuser can now create non-superuser subscriptions. (I'm not sure this ability is well advertised.) But the problem of non-superuser roles creating non-superuser subscriptions is not solved. From a security perspective, the bit that was solved may be the more important part; from a usability perspective, perhaps not. > It was > previously proposed that we introduce a new predefined role > pg_create_subscriptions and allow users who have the privileges of > that predefined role to create and alter subscriptions. There are a > few issues with that which, however, seem fairly solvable to me: > > 1. Jeff pointed out that if you supply a connection string that is > going to try to access local files, you'd better have > pg_read_server_files, or else we should not let you use that > connection string. I guess that's mostly a function of which > parameters you specify, e.g. passfile, sslcert, sslkey, though maybe > for host it depends on whether the value starts with a slash. 
We might > need to think a bit here to make sure we get the rules right but it > seems like a pretty solvable problem. > > 2. There was also quite a bit of discussion of what to do if a user > who was previously eligible to own a subscription ceases to be > eligible, in particular around a superuser who is made into a > non-superuser, but the same problem would apply if you had > pg_create_subscriptions or pg_read_server_files and then lost it. My > vote is to not worry about it too much. Specifically, I think we > should certainly check whether the user has permission to create a > subscription before letting them do so, but we could handle the case > where the user already owns a subscription and tries to modify it by > either allowing or denying the operation and I think either of those > would be fine. I even think we could do one of those in some cases and > the other in other cases and as long as there is some principle to the > thing, it's fine. I argue that it's not a normal configuration and > therefore it doesn't have to work in a particularly useful way. It > shouldn't break the world in some horrendous way, but that's about as > good as it needs to be. I'd argue for example that DROP SUBSCRIPTION > could just check whether you own the object, and that ALTER > SUBSCRIPTION could check whether you own the object and, if you're > changing the connection string, also whether you would have privileges > to set that new connection string on a new subscription. > > 3. There was a bit of discussion of maybe wanting to allow users to > create subscriptions with some connection strings but not others, > perhaps by having some kind of intermediate object that owns the > connection string and is owned by a superuser or someone with lots of > privileges, and then letting a less-privileged user point a > subscription at that object. I agree that might be useful to somebody, > but I don't see why it's a hard requirement to get anything at all > done here. 
Right now, a subscription contains a connection string > directly. If in the future someone wants to introduce a CREATE > REPLICATION DESTINATION command (or whatever) and have a way to point > a subscription at a replication destination rather than a connection > string directly, cool. Or if someone wants to wire this into CREATE > SERVER somehow, also cool. But if you don't care about restricting > which IPs somebody can try to access by providing a connection string > of their choice, then you would be happy if we just did something > simple here and left this problem for another day. > > I am very curious to know (a) why work on this was abandoned (perhaps > the answer is just lack of round tuits, in which case there is no more > to be said) Mostly, it was a lack of round-tuits. After the patch was committed, I quickly switched my focus elsewhere. > , and (b) what people think of (1)-(3) above There are different ways of solving (1), and Jeff and I discussed them in Dec 2021. My recollection was that idea (3) was the cleanest. Other ideas might be simpler than (3), or they may just appear simpler but in truth turn into a can of worms. I don't know, since I never went as far as trying to implement either approach. Idea (2) seems to contemplate non-superuser subscription owners as a theoretical thing, but it's quite real already. Again, see 027_nosuperuser.pl.
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Hi, On 2023-01-17 12:08:01 -0800, Peter Geoghegan wrote: > > I think that's not the fault of relfrozenxid as a trigger, but that we > > simply > > don't keep enough other stats. We should imo at least keep track of: > > If you assume that there is chronic undercounting of dead tuples > (which I think is very common), then of course anything that triggers > vacuuming is going to help with that problem -- it might be totally > inadequate, but still make the critical difference by not allowing the > system to become completely destabilized. I absolutely accept that > users that are relying on that exist, and that those users ought to > not have things get even worse -- I'm pragmatic. But overall, what we > should be doing is fixing the real problem, which is that the dead > tuples accounting is deeply flawed. Actually, it's not just that the > statistics are flat out wrong; the whole model is flat-out wrong. I think that depends on what "whole model" encompasses... > The assumptions that work well for optimizer statistics quite simply > do not apply here. Random sampling for this is just wrong, because > we're not dealing with something that follows a distribution that can > be characterized with a sufficiently large sample. With optimizer > statistics, the entire contents of the table is itself a sample taken > from the wider world -- so even very stale statistics can work quite > well (assuming that the schema is well normalized). Whereas the > autovacuum dead tuples stuff is characterized by constant change. I > mean of course it is -- that's the whole point! The central limit > theorem obviously just doesn't work for something like this -- we > cannot generalize from a sample, at all. If we were to stop dropping stats after crashes, I think we likely could afford to stop messing with dead tuple stats during analyze. Right now it's valuable to some degree because it's a way to reasonably quickly recover from lost stats. 
The main way to collect inserted / dead tuple info for autovacuum's benefit is via the stats collected when making changes. We probably ought to simply not update dead tuples after analyze if the stats entry has information about a prior [auto]vacuum. Or at least split the fields. > How many dead heap-only tuples are equivalent to one LP_DEAD item? > What about page-level concentrations, and the implication for > line-pointer bloat? I don't have a good answer to any of these > questions myself. And I have my doubts that there are *any* good > answers. Hence my suggestion to track several of these via page level stats. In the big picture it doesn't really matter that much whether there's 10 or 100 (dead tuples|items) on a page that needs to be removed. It matters that the page needs to be processed. > Even these questions are the wrong questions (they're just less wrong). I don't agree. Nothing is going to be perfect, but you're not going to be able to do sensible vacuum scheduling without some stats, and it's fine if those are an approximation, as long as the approximation makes some sense. > I'd like to use the visibility map more for stuff here, too. It is > totally accurate about all-visible/all-frozen pages, so many of my > complaints about statistics don't really apply. Or need not apply, at > least. If 95% of a table's pages are all-frozen in the VM, then of > course it's pretty unlikely to be the right time to VACUUM the table > if it's to clean up bloat -- this is just about the most reliable > information we have access to. I think querying that from stats is too expensive for most things. I suggested tracking all-frozen in pg_class. Perhaps we should also track when pages are *removed* from the VM in pgstats, I don't think we do today. That should give a decent picture? 
> > > This sounds like a great argument in favor of suspend-and-resume as a > > > way of handling autocancellation -- no useful work needs to be thrown > > > away for AV to yield for a minute or two. > > > Hm, that seems a lot of work. Without having held a lock you don't even know > > whether your old dead items still apply. Of course it'd improve the > > situation > > substantially, if we could get it. > > I don't think it's all that much work, once the visibility map > snapshot infrastructure is there. > > Why wouldn't your old dead items still apply? Well, for one the table could have been rewritten. Of course we can add the code to deal with that, but it is definitely something to be aware of. There might also be some oddities around indexes getting added / removed. > > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be > > > special cases? Maybe they should always be able to auto cancel an > > > autovacuum? > > > > Yea, I think so. It's not obvious how to best pass down that knowledge into > > ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best > > way would be to change LockAcquireExtended() to get a flags argument instead > > of
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas wrote: > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan wrote: > > pgstat_report_analyze() will totally override the > > tabentry->dead_tuples information that drives autovacuum.c, based on > > an estimate derived from a random sample -- which seems to me to be an > > approach that just doesn't have any sound theoretical basis. > > Yikes. I think we don't have a choice but to have a method to correct > the information somehow, because AFAIK the statistics system is not > crash-safe. But that approach does seem to carry significant risk of > overwriting correct information with wrong information. This situation is really quite awful, so maybe we should do something about it soon, in the scope of the Postgres 16 work on autovacuum that is already underway. In fact I think that the problem here is so bad that even something slightly less naive would be far more effective. You're right to point out that pgstat_report_analyze needs to update the stats in case there is a hard crash, of course. But there is plenty of context with which to make better decisions close at hand. For example, I bet that pgstat_report_analyze already does a pretty good job of estimating live_tuples -- my spiel about statistics mostly doesn't apply to live_tuples. Suppose that we notice that its new estimate for live_tuples approximately matches what the stats subsystem already thought about live_tuples, while dead_tuples is far far lower. We shouldn't be so credulous as to believe the new dead_tuples estimate at that point. Perhaps we can change nothing about dead_tuples at all when this happens. Or perhaps we can set dead_tuples to a value that is scaled from the old estimate. The new dead_tuples value could be derived by taking the ratio of the old live_tuples to the old dead_tuples, and then using that to scale from the new live_tuples. This is just a first pass, to see what you and others think. 
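That first-pass idea could be read concretely as something like the sketch below. This is my own toy rendering, not anyone's patch: the function name is invented, and the 5% cutoff for "approximately matches" is an arbitrary stand-in for whatever threshold would actually be chosen:

```c
#include <stdlib.h>

/*
 * Sketch of the heuristic described above: if the sampled live_tuples
 * roughly agrees with the existing estimate (within ~5% here) while
 * the sampled dead_tuples is far lower, distrust the sample and
 * instead rescale the old dead:live ratio onto the new live count.
 */
static long
adjusted_dead_tuples(long old_live, long old_dead,
					 long new_live, long new_dead)
{
	if (old_live > 0 &&
		labs(new_live - old_live) * 20 < old_live &&	/* within ~5% */
		new_dead < old_dead)
		return (long) ((double) new_live * old_dead / old_live);
	return new_dead;
}
```

With old stats of 1M live / 200k dead, a sample reporting ~1.01M live but only 5k dead would be overridden to roughly 202k dead; a sample whose live count moved substantially would be trusted as-is.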
Even very simple heuristics seem like they could make things much better. Another angle of attack is the PD_ALL_VISIBLE page-level bit, which acquire_sample_rows() could pay attention to -- especially in larger tables, where the difference between all pages and just the all-visible subset of pages is most likely to matter. The more sampled pages that had PD_ALL_VISIBLE set, the less credible the new dead_tuples estimate will be (relative to existing information), and so pgstat_report_analyze() should prefer the new estimate over the old one in proportion to that. We probably shouldn't change anything about pgstat_report_vacuum as part of this effort to make pgstat_report_analyze less terrible in the near term. It certainly has its problems (what is true for pages that VACUUM scanned at the end of VACUUM is far from representative for new pages!), but it's probably much less of a contributor to issues like those that Andres reports seeing. BTW, one of the nice things about the insert-driven autovacuum stats is that pgstat_report_analyze doesn't have an opinion about how many tuples were inserted since the last VACUUM ran. It does have other problems, but they seem less serious to me. > Yep. I think what we should try to evaluate is which number is > furthest from the truth. My guess is that the threshold is so high > relative to what a reasonable value would be that you can't get any > benefit out of making the dead tuple count more accurate. Like, if the > threshold is 100x too high, or something, then who cares how accurate > the dead tuples number is? Right. Or if we don't make any reasonable distinction between LP_DEAD items and dead heap-only tuples, then the total number of both things together may matter very little. Better to be approximately correct than exactly wrong. Deliberately introducing a bias to lower the variance is a perfectly valid approach. 
> Maybe that's even true in some scenarios, but I bet that > it's never the issue when people have really big tables. The fact that > I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with > 1TB of bloat in my 10TB table. Among other problems, I can't even > vacuum away that much bloat in one index pass, because autovacuum > can't use enough work memory for that. Also, the absolute space > wastage matters. I certainly agree with all that. FWIW, part of my mental model with VACUUM is that the rules kind of change in the case of a big table. We're far more vulnerable to issues such as (say) waiting for cleanup locks because the overall cadence used by autovacuum is so infrequent relative to everything else. There are more opportunities for things to go wrong, worse consequences when they do go wrong, and greater potential for the problems to compound. -- Peter Geoghegan
Re: minor bug
Laurenz Albe writes: > On Tue, 2023-01-17 at 10:32 -0500, Tom Lane wrote: >> I seem to recall that the original idea was to report the timestamp >> of the commit/abort record we are stopping at. Maybe my memory is >> faulty, but I think that'd be significantly more useful than the >> current behavior, *especially* when the replay stopping point is >> defined by something other than time. >> (Also, the wording of the log message suggests that that's what >> the reported time is supposed to be. I wonder if somebody messed >> this up somewhere along the way.) > recoveryStopTime is set to recordXtime (the time of the xlog record) > a few lines above that patch, so this is useful information if it is > present. Ah, but that only happens if recoveryTarget == RECOVERY_TARGET_TIME. Digging in the git history, I see that this did use to work as I remember: we always extracted the record time before printing it. That was accidentally broken in refactoring in c945af80c. I think the correct fix is more like the attached. regards, tom lane diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 5e65785306..c14d1f3ef6 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -2548,8 +2548,13 @@ recoveryStopsBefore(XLogReaderState *record) stopsHere = (recordXid == recoveryTargetXid); } - if (recoveryTarget == RECOVERY_TARGET_TIME && - getRecordTimestamp(record, &recordXtime)) + /* + * Note: we must fetch recordXtime regardless of recoveryTarget setting. + * We don't expect getRecordTimestamp ever to fail, since we already know + * this is a commit or abort record; but test its result anyway. + */ + if (getRecordTimestamp(record, &recordXtime) && + recoveryTarget == RECOVERY_TARGET_TIME) { /* * There can be many transactions that share the same commit time, so
Re: almost-super-user problems that we haven't fixed yet
On Wed, Jan 18, 2023 at 2:00 PM Nathan Bossart wrote: > On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote: > > In general, looks good. I think this will often call HaveNFreeProcs > > twice, though, and that would be better to avoid, e.g. > > I should have thought of this. This is fixed in v2. Should (nfree < SuperuserReservedBackends) be using <=, or am I confused? > > In the common case where we hit neither limit, this only counts free > > connection slots once. We could do even better by making > > HaveNFreeProcs have an out parameter for the number of free procs > > actually found when it returns false, but that's probably not > > important. > > Actually, I think it might be important. IIUC the separate calls to > HaveNFreeProcs might return different values for the same input, which > could result in incorrect error messages (e.g., you might get the > reserved_connections message despite setting reserved_connections to 0). > So, I made this change in v2, too. I thought of that briefly and it didn't seem that important, but the way you did it seems fine, so let's go with that. What's the deal with removing "and no new replication connections will be accepted" from the documentation? Is the existing documentation just wrong? If so, should we fix that first? And maybe delete "non-replication" from the error message that says "remaining connection slots are reserved for non-replication superuser connections"? It seems like right now the comments say that replication connections are a completely separate pool of connections, but the documentation and the error message make it sound otherwise. If that's true, then one of them is wrong, and I think it's the docs/error message. Or am I just misreading it? -- Robert Haas EDB: http://www.enterprisedb.com
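The out-parameter shape discussed above could look something like the toy model below, with an int array standing in for the PGPROC freelist. All names here are illustrative, not the actual backend code; the point is that one scan reports how many free slots were seen, so the caller can test that count against both superuser_reserved_connections and reserved_connections without counting twice:

```c
#include <stdbool.h>

/*
 * Toy version of the out-parameter idea: scan for free slots (0 marks
 * a free slot here), stopping once n have been found.  *nfree gets the
 * number of free slots actually seen, capped at n.
 */
static bool
have_n_free_procs(int n, int *nfree, const int slots[], int nslots)
{
	int			count = 0;

	for (int i = 0; i < nslots && count < n; i++)
		if (slots[i] == 0)
			count++;
	*nfree = count;
	return count >= n;
}

static const int demo_slots[5] = {0, 1, 0, 1, 0};

/* convenience wrapper used in the examples below */
static int
free_procs_seen(int n)
{
	int			nfree;

	(void) have_n_free_procs(n, &nfree, demo_slots, 5);
	return nfree;
}
```

Note that because counting stops early, *nfree is only exact when the function returns false; that is precisely the case where the caller needs it to pick the right error message.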
Small omission in type_sanity.sql
Hi, I was playing around with splitting up the tablespace test in regress so that I could use the tablespaces it creates in another test and happened to notice that the pg_class validity checks in type_sanity.sql are incomplete. It seems that 8b08f7d4820fd did not update the pg_class tests in type_sanity to include partitioned indexes and tables. Patch attached. I only changed these few lines in type_sanity to be more correct; I didn't change anything else in regress to actually exercise them (e.g. ensuring a partitioned table is around when running type_sanity). It might be worth moving type_sanity down in the parallel schedule? It does seem a bit hard to remember to update these tests in type_sanity.sql when adding some new value for a pg_class field. I wonder if there is a better way of testing this. - Melanie From 1531cdf257af92d31836e50e1a0b865131f1a018 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Wed, 18 Jan 2023 14:26:36 -0500 Subject: [PATCH v1] Update pg_class validity test Partitioned tables and indexes were not considered in the pg_class validity checks in type_sanity.sql. This is not currently exercised because all partitioned tables and indexes have been dropped by the time type_sanity is run. 
--- src/test/regress/expected/type_sanity.out | 44 --- src/test/regress/sql/type_sanity.sql | 36 ++- 2 files changed, 42 insertions(+), 38 deletions(-) diff --git a/src/test/regress/expected/type_sanity.out b/src/test/regress/expected/type_sanity.out index a640cfc476..95435e3de6 100644 --- a/src/test/regress/expected/type_sanity.out +++ b/src/test/regress/expected/type_sanity.out @@ -498,34 +498,36 @@ ORDER BY 1; -- pg_class -- Look for illegal values in pg_class fields -SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR +SELECT oid, relname, relkind, relpersistence, relreplident +FROM pg_class +WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR relpersistence NOT IN ('p', 'u', 't') OR relreplident NOT IN ('d', 'n', 'f', 'i'); - oid | relname --+- + oid | relname | relkind | relpersistence | relreplident +-+-+-++-- (0 rows) --- All tables and indexes should have an access method. -SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and -c1.relam = 0; - oid | relname --+- +-- All tables and indexes except partitioned tables should have an access +-- method. 
+SELECT oid, relname, relkind, relam +FROM pg_class +WHERE relkind NOT IN ('S', 'v', 'f', 'c', 'p') and +relam = 0; + oid | relname | relkind | relam +-+-+-+--- (0 rows) --- Conversely, sequences, views, types shouldn't have them -SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE c1.relkind IN ('S', 'v', 'f', 'c') and -c1.relam != 0; - oid | relname --+- +-- Conversely, sequences, views, types, and partitioned tables shouldn't have +-- them +SELECT oid, relname, relkind, relam +FROM pg_class +WHERE relkind IN ('S', 'v', 'f', 'c', 'p') and +relam != 0; + oid | relname | relkind | relam +-+-+-+--- (0 rows) --- Indexes should have AMs of type 'i' +-- Indexes and partitioned indexes should have AMs of type 'i' SELECT pc.oid, pc.relname, pa.amname, pa.amtype FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid) WHERE pc.relkind IN ('i') and @@ -534,7 +536,7 @@ WHERE pc.relkind IN ('i') and -+-++ (0 rows) --- Tables, matviews etc should have AMs of type 't' +-- Tables, matviews etc should have AMs of type 't' (except partitioned tables) SELECT pc.oid, pc.relname, pa.amname, pa.amtype FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid) WHERE pc.relkind IN ('r', 't', 'm') and diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql index 79ec410a6c..22a2dba94d 100644 --- a/src/test/regress/sql/type_sanity.sql +++ b/src/test/regress/sql/type_sanity.sql @@ -358,31 +358,33 @@ ORDER BY 1; -- Look for illegal values in pg_class fields -SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR +SELECT oid, relname, relkind, relpersistence, relreplident +FROM pg_class +WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR relpersistence NOT IN ('p', 'u', 't') OR relreplident NOT IN ('d', 'n', 'f', 'i'); --- All tables and indexes should have an access method. 
-SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and -c1.relam = 0; - --- Conversely, sequences, views, types shouldn't have them -SELECT c1.oid, c1.relname -FROM pg_class as c1 -WHERE c1.relkind IN ('S', 'v', 'f', 'c') and -c1.relam != 0; - --- Indexes should have AMs of type 'i'
Re: Non-superuser subscription owners
On Sat, Jan 8, 2022 at 2:38 AM Jeff Davis wrote: > Committed. I was just noticing that what was committed here didn't actually fix the problem implied by the subject line. That is, non-superuser still can't own subscriptions. To put that another way, there's no way for the superuser to delegate the setup and administration of logical replication to a non-superuser. That's a bummer. Reading the thread, I'm not quite sure why we seemingly did all the preparatory work and then didn't actually fix the problem. It was previously proposed that we introduce a new predefined role pg_create_subscriptions and allow users who have the privileges of that predefined role to create and alter subscriptions. There are a few issues with that which, however, seem fairly solvable to me: 1. Jeff pointed out that if you supply a connection string that is going to try to access local files, you'd better have pg_read_server_files, or else we should not let you use that connection string. I guess that's mostly a function of which parameters you specify, e.g. passfile, sslcert, sslkey, though maybe for host it depends on whether the value starts with a slash. We might need to think a bit here to make sure we get the rules right but it seems like a pretty solvable problem. 2. There was also quite a bit of discussion of what to do if a user who was previously eligible to own a subscription ceases to be eligible, in particular around a superuser who is made into a non-superuser, but the same problem would apply if you had pg_create_subscriptions or pg_read_server_files and then lost it. My vote is to not worry about it too much. Specifically, I think we should certainly check whether the user has permission to create a subscription before letting them do so, but we could handle the case where the user already owns a subscription and tries to modify it by either allowing or denying the operation and I think either of those would be fine. 
I even think we could do one of those in some cases and the other in other cases and as long as there is some principle to the thing, it's fine. I argue that it's not a normal configuration and therefore it doesn't have to work in a particularly useful way. It shouldn't break the world in some horrendous way, but that's about as good as it needs to be. I'd argue for example that DROP SUBSCRIPTION could just check whether you own the object, and that ALTER SUBSCRIPTION could check whether you own the object and, if you're changing the connection string, also whether you would have privileges to set that new connection string on a new subscription. 3. There was a bit of discussion of maybe wanting to allow users to create subscriptions with some connection strings but not others, perhaps by having some kind of intermediate object that owns the connection string and is owned by a superuser or someone with lots of privileges, and then letting a less-privileged user point a subscription at that object. I agree that might be useful to somebody, but I don't see why it's a hard requirement to get anything at all done here. Right now, a subscription contains a connection string directly. If in the future someone wants to introduce a CREATE REPLICATION DESTINATION command (or whatever) and have a way to point a subscription at a replication destination rather than a connection string directly, cool. Or if someone wants to wire this into CREATE SERVER somehow, also cool. But if you don't care about restricting which IPs somebody can try to access by providing a connection string of their choice, then you would be happy if we just did something simple here and left this problem for another day. I am very curious to know (a) why work on this was abandoned (perhaps the answer is just lack of round tuits, in which case there is no more to be said), and (b) what people think of (1)-(3) above, and (c) whether anyone knows of further problems that need to be considered here. 
Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
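The parameter classification in point (1) could be sketched roughly as follows. This is a hypothetical illustration only, not the rule PostgreSQL would actually adopt: the option set is illustrative rather than an exhaustive inventory of libpq's file-touching options, and the function name is invented for the sketch.

```python
# Hypothetical sketch of point (1): require pg_read_server_files when a
# subscription connection string could cause local file access on the
# server. FILE_ACCESS_OPTIONS is an illustrative subset, not an
# authoritative list of libpq options.
FILE_ACCESS_OPTIONS = {"passfile", "sslcert", "sslkey", "sslrootcert"}

def needs_server_file_privilege(conn_options):
    """conn_options: dict of keyword -> value from a parsed connection string."""
    for key, value in conn_options.items():
        if key in FILE_ACCESS_OPTIONS:
            return True
        # A host value starting with a slash names a Unix-socket
        # directory, i.e. a local filesystem path on the server.
        if key == "host" and value.startswith("/"):
            return True
    return False
```

The point being that the check is a simple predicate over the parsed option list, so getting the rules right is mostly a matter of agreeing on which options belong in the set.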
Re: Extracting cross-version-upgrade knowledge from buildfarm client
One more thing before we move on from this topic. I'd been testing modified versions of the AdjustUpgrade.pm logic by building from a --from-source source tree, which seemed way easier than dealing with a private git repo. As it stands, TestUpgradeXversion.pm refuses to run under $from_source, but I just diked that check out and it seemed to work fine for my purposes. Now, that's going to be a regular need going forward, so I'd like to not need a hacked version of the BF client code to do it. Also, your committed version of TestUpgradeXversion.pm breaks that use-case because you did - unshift(@INC, "$self->{pgsql}/src/test/perl"); + unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl"); which AFAICS is an empty directory in a $from_source run. I suppose that the reason for not running under $from_source is to avoid corrupting the saved installations with unofficial versions. However, couldn't we skip the "save" step and still run the upgrade tests against whatever we have saved? (Maybe skip the same-version test, as it's not quite reflecting any real case then.) Here's a quick draft patch showing what I have in mind. There may well be a better way to deal with the wheres-the-source issue than what is in hunk 2. Also, I didn't reindent the unchanged code in sub installcheck, and I didn't add anything about skipping same-version tests. 
regards, tom lane diff --git a/PGBuild/Modules/TestUpgradeXversion.pm b/PGBuild/Modules/TestUpgradeXversion.pm index a784686..408432d 100644 --- a/PGBuild/Modules/TestUpgradeXversion.pm +++ b/PGBuild/Modules/TestUpgradeXversion.pm @@ -51,8 +51,6 @@ sub setup my $conf = shift;# ref to the whole config object my $pgsql = shift;# postgres build dir - return if $from_source; - return if $branch !~ /^(?:HEAD|REL_?\d+(?:_\d+)?_STABLE)$/; my $animal = $conf->{animal}; @@ -351,7 +349,16 @@ sub test_upgrade## no critic (Subroutines::ProhibitManyArgs) if $verbose; # load helper module from source tree - unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl"); + my $source_tree; + if ($from_source) + { + $source_tree = $self->{pgsql}; + } + else + { + $source_tree = "$self->{buildroot}/$this_branch/pgsql"; + } + unshift(@INC, "$source_tree/src/test/perl"); require PostgreSQL::Test::AdjustUpgrade; PostgreSQL::Test::AdjustUpgrade->import; shift(@INC); @@ -694,6 +701,11 @@ sub installcheck my $upgrade_loc = "$upgrade_install_root/$this_branch"; my $installdir = "$upgrade_loc/inst"; + # Don't save in a $from_source build: we want the saved trees always + # to correspond to branch tips of the animal's standard repo. We can + # perform upgrade tests against previously-saved trees, though. + if (!$from_source) + { # for saving we need an exclusive lock. get_lock($self, $this_branch, 1); @@ -716,6 +728,7 @@ sub installcheck if ($verbose > 1); send_result('XversionUpgradeSave', $status, \@saveout) if $status; $steps_completed .= " XVersionUpgradeSave"; + } # in saveonly mode our work is done return if $ENV{PG_UPGRADE_SAVE_ONLY}; @@ -744,7 +757,7 @@ sub installcheck # other branch from being removed or changed under us. get_lock($self, $oversion, 0); - $status = + my $status = test_upgrade($self, $save_env, $this_branch, $upgrade_install_root, $dport, $install_loc, $other_branch) ? 0 : 1;
Re: Implement missing join selectivity estimation for range types
Hi Tomas, Thanks for picking up the patch and for the interesting discussions you bring! > Interesting. Are there any particular differences compared to how we > estimate for example range clauses on regular columns? The theory is the same for scalar types. Yet, the statistics that are currently collected for scalar types include other synopses besides the histogram, such as MCV, which should be incorporated in the estimation. The theory for using the additional statistics is ready in the paper, but we didn't implement it yet. We thought of sharing the ready part until time allows us to implement the rest, or other developers continue it. > Right. I think 0.5% is roughly expected for the default statistics > target, which creates 100 histogram bins, each representing ~1% of the > values. Which on average means ~0.5% error. Since this patch deals with join selectivity, we are then crossing 100*100 bins. The ~0.5% error estimation comes from our experiments, rather than from a mathematical analysis. > independence means we take (P1*P2). But maybe we should be very > pessimistic and use e.g. Min(P1,P2)? Or maybe something in between? > > Another option is to use the length histogram, right? I mean, we know > what the average length is, and it should be possible to use that to > calculate how "far" ranges in a histogram can overlap. The independence assumption exists if we use the lower and upper histograms. It equally exists if we use the lower and length histograms. In both cases, the link between the two histograms is lost during their construction. Your discussion raises an interesting trade-off between optimistic and pessimistic estimations. A typical way to deal with such a trade-off is to average the two, as is done, for example, in model validation in machine learning. Do you think we should implement something like average( (P1*P2), Min(P1,P2) )? > OK. But ideally we'd cross-check elements of the two multiranges, no?
> So if the column(s) contain a couple very common (multi)ranges that make > it into an MCV, we'll ignore those? That's a bit unfortunate, because > those MCV elements are potentially the main contributors to selectivity. Both ideas would require collecting more detailed statistics, for example similar to arrays. In this patch, we restricted ourselves to the existing statistics. > Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs > a proper comment, not just "this is a copy from rangetypes". Right, the comment should elaborate that the collected statistics are currently the same as for range types but may potentially deviate. > However, it seems the two functions are exactly the same. Would the > functions diverge in the future? If not, maybe there should be just a > single shared function? Indeed, it is possible that the two functions will diverge if the statistics for multirange types are refined. -- Best regards Mahmoud SAKR On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra wrote: > > Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs > a proper comment, not just "this is a copy from rangetypes". > > However, it seems the two functions are exactly the same. Would the > functions diverge in the future? If not, maybe there should be just a > single shared function? > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
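The compromise floated above, averaging the independence-based estimate with the pessimistic fully-correlated bound, can be written down in a few lines. A sketch, assuming p1 and p2 are the two per-histogram match probabilities:

```python
def join_selectivity_estimate(p1, p2):
    """Sketch of the trade-off discussed above, for probabilities in [0, 1]."""
    independent = p1 * p2        # independence assumption (optimistic)
    correlated = min(p1, p2)     # fully-correlated bound (pessimistic)
    # The proposed compromise: average the optimistic and pessimistic
    # estimates, as is sometimes done in model validation.
    return (independent + correlated) / 2.0
```

Since p1*p2 <= min(p1, p2) always holds for values in [0, 1], the average is guaranteed to sit between the two assumptions.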
Re: document the need to analyze partitioned tables
On Wed, 2023-01-18 at 11:49 -0600, Justin Pryzby wrote: > I tweaked this a bit to end up with: > > > - Partitioned tables are not processed by autovacuum. Statistics > > - should be collected by running a manual ANALYZE > > when it is > > + The leaf partitions of a partitioned table are normal tables and are > > processed > > + by autovacuum; however, autovacuum does not process the partitioned > > table itself. > > + This is no problem as far as VACUUM is concerned, > > since > > + there's no need to vacuum the empty, partitioned table. But, as > > mentioned in > > + , it also means that autovacuum > > won't > > + run ANALYZE on the partitioned table. > > + Although statistics are automatically gathered on its leaf partitions, > > some queries also need > > + statistics on the partitioned table to run optimally. You should > > collect statistics by > > + running a manual ANALYZE when the partitioned table > > is > > first populated, and again whenever the distribution of data in its > > partitions changes significantly. > > > > "partitions are normal tables" was technically wrong, as partitions can > also be partitioned. I am fine with your tweaks. I think this is good to go. Yours, Laurenz Albe
Re: Parallelize correlated subqueries that execute within each worker
Hi, This patch hasn't been updated since September, and it got broken by 4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental sort stuff a little bit. But the breakage was rather limited, so I took a stab at fixing it - attached is the result, hopefully correct. I also added a couple minor comments about stuff I noticed while rebasing and skimming the patch; I kept those in separate commits. There's also a couple pre-existing TODOs. James, what's your plan with this patch? Do you intend to work on it for PG16, or are there some issues I missed in the thread? One of the queries in incremental_sort changed plans a little bit: explain (costs off) select distinct unique1, (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1) from tenk1 t, generate_series(1, 1000); switched from Unique (cost=18582710.41..18747375.21 rows=1 width=8) -> Gather Merge (cost=18582710.41..18697375.21 rows=1000 ...) Workers Planned: 2 -> Sort (cost=18582710.39..18593127.06 rows=417 ...) Sort Key: t.unique1, ((SubPlan 1)) ... to Unique (cost=18582710.41..18614268.91 rows=1 ...) -> Gather Merge (cost=18582710.41..18614168.91 rows=2 ...) Workers Planned: 2 -> Unique (cost=18582710.39..18613960.39 rows=1 ...) -> Sort (cost=18582710.39..18593127.06 ...) Sort Key: t.unique1, ((SubPlan 1)) ... which probably makes sense, as the cost estimate decreases a bit.
-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 95db15fe16303ed3f4fdea52af3b8d6d05a8d7a6 Mon Sep 17 00:00:00 2001 From: jcoleman Date: Mon, 26 Sep 2022 20:30:23 -0400 Subject: [PATCH 1/5] Add tests before change --- src/test/regress/expected/select_parallel.out | 108 ++ src/test/regress/sql/select_parallel.sql | 25 2 files changed, 133 insertions(+) diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out index 91f74fe47a3..9b4d7dd44a4 100644 --- a/src/test/regress/expected/select_parallel.out +++ b/src/test/regress/expected/select_parallel.out @@ -311,6 +311,114 @@ select count(*) from tenk1 where (two, four) not in 1 (1 row) +-- test parallel plans for queries containing correlated subplans +-- where the subplan only needs params available from the current +-- worker's scan. +explain (costs off, verbose) select + (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1) + from tenk1 t, generate_series(1, 10); + QUERY PLAN + + Gather + Output: (SubPlan 1) + Workers Planned: 4 + -> Nested Loop + Output: t.unique1 + -> Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t + Output: t.unique1 + -> Function Scan on pg_catalog.generate_series + Output: generate_series.generate_series + Function Call: generate_series(1, 10) + SubPlan 1 + -> Index Only Scan using tenk1_unique1 on public.tenk1 + Output: t.unique1 + Index Cond: (tenk1.unique1 = t.unique1) +(14 rows) + +explain (costs off, verbose) select + (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1) + from tenk1 t; + QUERY PLAN +-- + Gather + Output: (SubPlan 1) + Workers Planned: 4 + -> Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t + Output: t.unique1 + SubPlan 1 + -> Index Only Scan using tenk1_unique1 on public.tenk1 + Output: t.unique1 + Index Cond: (tenk1.unique1 = t.unique1) +(9 rows) + +explain (costs off, verbose) select + (select t.unique1 from tenk1 where tenk1.unique1 
= t.unique1) + from tenk1 t + limit 1; +QUERY PLAN +--- + Limit + Output: ((SubPlan 1)) + -> Seq Scan on public.tenk1 t + Output: (SubPlan 1) + SubPlan 1 + -> Index Only Scan using tenk1_unique1 on public.tenk1 + Output: t.unique1 + Index Cond: (tenk1.unique1 = t.unique1) +(8 rows) + +explain (costs off, verbose) select t.unique1 + from tenk1 t + where t.unique1 = (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1); + QUERY PLAN +- + Seq Scan on public.tenk1 t + Output: t.unique1 + Filter: (t.unique1 = (SubPlan 1)) + SubPlan 1 + -> Index Only Scan using tenk1_unique1 on public.tenk1 + Output: t.unique1 + Index Cond: (tenk1.unique1 = t.unique1)
Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences
On Wed, 18 Jan 2023 18:34:47 +0100 Alvaro Herrera wrote: > On 2023-Jan-18, Karl O. Pinc wrote: > > > On Wed, 18 Jan 2023 13:30:45 +0100 > > Alvaro Herrera wrote: > > > > > Not related to this patch: it's very annoying that in the PDF > > > output, each section in the appendix doesn't start on a blank > > > page -- which means that the doc page for many modules starts in > > > the middle of a page were the previous one ends. > > > > > I wonder if we can tweak something in the stylesheet to include a > > > page break. > > > > Would this be something to be included in this patch? > > (If I can figure it out.) > > No, I think we should do that change separately. I just didn't think > a parenthetical complaint was worth a separate thread for it; but if you > do create a patch, please do create a new thread (unless the current > patch in this one is committed already.) Oops. Already sent a revised patch that includes starting each module on a new page, for PDF output. I'll wait to rip that out after review and start a new thread if necessary. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi Tomas, On Wed, 2023-01-18 at 17:29 +0100, Tomas Vondra wrote: > I took a quick look at this patch, to see if there's something we > want/can get into v16. The last version was submitted about 9 months > ago, and it doesn't apply cleanly anymore, but the bitrot is fairly > minor. Not sure there's still interest, though. Thank you for your attention to this patch! I'm still very interested in this patch, and I think I'll try to rebase it within a week or two if it seems possible to get it into 16. > > I'd probably call it "created_at" or something like > that, but that's a minor detail. Or maybe stats_reset, which is what > we > use in pgstat? Yes, there is a naming issue here. My thoughts were the following: - "stats_reset" is not quite correct here, because the moment a statement entry is created is definitely not a reset. The field is named for what it means - the time since which statistics have been collected for this particular entry. - "created_at" perfectly matches the purpose of the field, but it seems not as self-explanatory to me. > > But is the second timestamp for the min/max fields really useful? > AFAIK > to perform analysis, people take regular pg_stat_statements > snapshots, > which works fine for counters (calculating deltas) but not for gauges > (which need a reset, to track fresh values). But people analyzing > this > are already resetting the whole entry, and so the snapshots already > are > tracking deltas. > > So I'm not convinced we actually need the second timestamp. The main purpose of the patch is to provide the means for collecting solutions to avoid resetting pgss at all, just as happens for the pg_stat_* views. The only real need for a reset is that we can't be sure that an observed statement was not evicted and re-entered since the last sample. Right now we can only do a whole reset on each sample and see how many entries are in the pgss hashtable on the next sample - how close this value is to the max.
If there is plenty of space in the hashtable, we can hope that there were no evictions since the last sample. However, a reset could be performed by someone else, and we would know nothing about it. Having a timestamp in the stats_since field, we know for sure how long statistics for this statement have been tracked. That said, a sampling solution can totally avoid pgss resets. Avoiding such resets means avoiding interference between monitoring solutions. But if no more resets are done we can't track min/max values, because they still need a reset, and we can do nothing about such resets - they are necessary. However, I still want to know when a min/max reset was performed. This will help to detect possible interference from such resets. > > > A couple more comments: > > 1) I'm not sure why the patch is adding tests of permissions on the > pg_stat_statements_reset function? > > 2) If we want the second timestamp, shouldn't it also cover resets of > the mean values, not just min/max? I think that mean values shouldn't be a target for a partial reset, because mean values can be easily reconstructed by the sampling solution without a reset. > > 3) I don't understand why the patch is adding "IS NOT NULL AS t" to > various places in the regression tests. Most of the tests were copied from the previous version. I'll recheck them. > > 4) I rather dislike the "minmax" naming, because that's often used in > other contexts (for BRIN indexes), and as I mentioned maybe it should > also cover the "mean" fields. Agreed, but I couldn't make it better. Other versions seemed worse to me... > > Regards, Andrei Zubkov
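The sampling workflow described above can be made concrete with a rough sketch (in Python, using hypothetical field layouts; only the stats_since column name comes from the patch): a monitoring tool computes deltas between two snapshots without ever resetting pgss, and uses stats_since to detect that an entry was evicted and re-created in between, in which case the previous sample must be discarded for that entry.

```python
# Sketch of no-reset delta sampling over pg_stat_statements snapshots.
# prev/curr map queryid -> (stats_since, calls); real snapshots would
# carry more counters, but the eviction-detection logic is the same.
def sample_delta(prev, curr):
    """Return queryid -> delta of calls between two snapshots."""
    deltas = {}
    for queryid, (since, calls) in curr.items():
        if queryid in prev:
            prev_since, prev_calls = prev[queryid]
            if since == prev_since:
                # Same entry lifetime: a plain counter delta is safe.
                deltas[queryid] = calls - prev_calls
            else:
                # Entry was evicted and re-entered: restart from zero.
                deltas[queryid] = calls
        else:
            # New entry since the last sample.
            deltas[queryid] = calls
    return deltas
```

This is exactly the scenario where a whole-pgss reset is no longer needed: the timestamp, not the reset, establishes the baseline.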
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan wrote: > pgstat_report_analyze() will totally override the > tabentry->dead_tuples information that drives autovacuum.c, based on > an estimate derived from a random sample -- which seems to me to be an > approach that just doesn't have any sound theoretical basis. Yikes. I think we don't have a choice but to have a method to correct the information somehow, because AFAIK the statistics system is not crash-safe. But that approach does seem to carry significant risk of overwriting correct information with wrong information. > On reflection, maybe you're right here. Maybe it's true that the > bigger problem is just that the implementation is bad, even on its own > terms -- since it's pretty bad! Hard to say at this point. > > Depends on how you define it, too. Statistical sampling is just not > fit for purpose here. But is that a problem with > autovacuum_vacuum_scale_factor? I may have said words that could > reasonably be interpreted that way, but I'm not prepared to blame it > on the underlying autovacuum_vacuum_scale_factor model now. It's > fuzzy. Yep. I think what we should try to evaluate is which number is furthest from the truth. My guess is that the threshold is so high relative to what a reasonable value would be that you can't get any benefit out of making the dead tuple count more accurate. Like, if the threshold is 100x too high, or something, then who cares how accurate the dead tuples number is? It's going to be insufficient to trigger vacuuming whether it's right or wrong. We should try substituting a less-bogus threshold calculation and see what happens then. An alternative theory is that the threshold is fine and we're only failing to reach it because the dead tuple calculation is so inaccurate. Maybe that's even true in some scenarios, but I bet that it's never the issue when people have really big tables.
The fact that I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with 1TB of bloat in my 10TB table. Among other problems, I can't even vacuum away that much bloat in one index pass, because autovacuum can't use enough work memory for that. Also, the absolute space wastage matters. -- Robert Haas EDB: http://www.enterprisedb.com
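For reference, the trigger being complained about is the standard autovacuum condition: a table is queued for vacuum once n_dead_tup exceeds autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times reltuples. A sketch with the default settings (50 and 0.2) shows how the tolerated bloat grows linearly with table size:

```python
def autovacuum_vacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead-tuple count at which autovacuum queues a VACUUM.

    Defaults match autovacuum_vacuum_threshold (50) and
    autovacuum_vacuum_scale_factor (0.2).
    """
    return threshold + scale_factor * reltuples
```

So a 10-million-row table tolerates roughly 2 million dead tuples, and a 10-billion-row table roughly 2 billion, before autovacuum reacts at all; the percentage stays constant while the absolute wastage grows without bound, which is exactly the objection above.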
Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences
On Wed, 18 Jan 2023 13:25:57 +0100 Alvaro Herrera wrote: > On 2023-Jan-02, Karl O. Pinc wrote: > > Attached is a patch: contrib_v1.patch > > > > It modifies Appendix F, the contrib directory. > > > > It adds brief text into the titles shown in the > > table of contents so it's easier to tell what > > each module does. It also suffixes [trusted] or [obsolete] > > on the relevant titles. > > I'm not 100% sold on having the > "trusted" or "obsolete" marker on the titles themselves, though. Not > sure what alternative we have, though, other than leaving them out > completely. The alternative would be to have a separate table with modules for rows and "trusted" and "obsolete" columns. It seems like more of a maintenance hassle than having the markers in the titles. Let me know if you want a table. I do like having a place to look over all the modules to see what is "trusted" or "obsolete". I suppose there could just be a table, with module names, descriptions, and trusted and obsolete flags. Instead of a table of contents for the modules, the module names in the table could be links. But that'd involve suppressing the table of contents showing all the module names, and it has the problem of possible mismatch between the modules listed in the table and the modules that exist. > There's a typo "equalivent" in two places. Fixed. > In passwordcheck, I would say just "check for weak passwords" or maybe > "verify password strength". I used "verify password strength". > > pg_buffercache is missing. Maybe "-- inspect state of the Postgres > buffer cache". I used "inspect Postgres buffer cache state" > For pg_stat_statements I suggest "track statistics of planning and > execution of SQL queries" I had written "track SQL query planning and execution statistics". Changed to: "track statistics of SQL planning and execution" I don't really care. If you want your version I'll submit another patch. > For sepgsql, as I understand it is strictly SELinux based, not just > "-like".
So this needs rewording: "label-based, SELinux-like, > mandatory access control". Maybe "SELinux-based implementation of > mandatory access control for row-level security". Changed to: "SELinux-based row-level security mandatory access control" > xml -- typo "qeurying" Fixed. I have also made the patch put each module on a separate page when producing PDF documents. This did produce one warning, which seems unrelated to me. The pdf seems right. I also tried just "make", to be sure I didn't break anything unrelated. Seemed to work. So..., works for me. New patch attached: contrib_v2.patch Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein diff --git a/doc/src/sgml/adminpack.sgml b/doc/src/sgml/adminpack.sgml index 184e96d7a0..04f3b52379 100644 --- a/doc/src/sgml/adminpack.sgml +++ b/doc/src/sgml/adminpack.sgml @@ -1,7 +1,7 @@ - adminpack + adminpack pgAdmin support toolpack adminpack diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml index 923cbde9dd..4006c75cdf 100644 --- a/doc/src/sgml/amcheck.sgml +++ b/doc/src/sgml/amcheck.sgml @@ -1,7 +1,7 @@ - amcheck + amcheck tools to verify index consistency amcheck diff --git a/doc/src/sgml/auth-delay.sgml b/doc/src/sgml/auth-delay.sgml index 40629311b1..0571f2a99d 100644 --- a/doc/src/sgml/auth-delay.sgml +++ b/doc/src/sgml/auth-delay.sgml @@ -1,7 +1,7 @@ - auth_delay + auth_delay pause on authentication failure auth_delay diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml index bb7342b120..0c4656ee30 100644 --- a/doc/src/sgml/auto-explain.sgml +++ b/doc/src/sgml/auto-explain.sgml @@ -1,7 +1,7 @@ - auto_explain + auto_explain log execution plans of slow queries auto_explain diff --git a/doc/src/sgml/basebackup-to-shell.sgml b/doc/src/sgml/basebackup-to-shell.sgml index 491368eb8f..b6a3b39541 100644 --- a/doc/src/sgml/basebackup-to-shell.sgml +++ b/doc/src/sgml/basebackup-to-shell.sgml @@ -1,7 +1,7 @@ - basebackup_to_shell + 
basebackup_to_shell example "shell" pg_basebackup module basebackup_to_shell diff --git a/doc/src/sgml/basic-archive.sgml b/doc/src/sgml/basic-archive.sgml index 60f23d2855..b4d43ced20 100644 --- a/doc/src/sgml/basic-archive.sgml +++ b/doc/src/sgml/basic-archive.sgml @@ -1,7 +1,7 @@ - basic_archive + basic_archive an example WAL archive module basic_archive diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml index 98d0316175..672ac2ed19 100644 --- a/doc/src/sgml/bloom.sgml +++ b/doc/src/sgml/bloom.sgml @@ -1,7 +1,7 @@ - bloom + bloom bloom filter index access bloom diff --git a/doc/src/sgml/btree-gin.sgml b/doc/src/sgml/btree-gin.sgml index 870c25559e..0eaea0dbcd 100644 --- a/doc/src/sgml/btree-gin.sgml +++ b/doc/src/sgml/btree-gin.sgml @@ -1,7 +1,8 @@ - btree_gin + btree_gin + sample GIN B-tree equivalent operator
Re: almost-super-user problems that we haven't fixed yet
On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote: > In general, looks good. I think this will often call HaveNFreeProcs > twice, though, and that would be better to avoid, e.g. I should have thought of this. This is fixed in v2. > In the common case where we hit neither limit, this only counts free > connection slots once. We could do even better by making > HaveNFreeProcs have an out parameter for the number of free procs > actually found when it returns false, but that's probably not > important. Actually, I think it might be important. IIUC the separate calls to HaveNFreeProcs might return different values for the same input, which could result in incorrect error messages (e.g., you might get the reserved_connections message despite setting reserved_connections to 0). So, I made this change in v2, too. > I don't think that we should default both the existing GUC and the new > one to 3, because that raises the default limit in the case where the > new feature is not used from 3 to 6. I think we should default one of > them to 0 and the other one to 3. Not sure which one should get which > value. I chose to set reserved_connections to 0 since it is new and doesn't have a pre-existing default value. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 94ee89548fd1080447b784993a1418480f407b49 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 17 Jan 2023 13:58:56 -0800 Subject: [PATCH v2 1/2] Rename ReservedBackends to SuperuserReservedBackends. This is in preparation for adding a new reserved_connections GUC that will use the ReservedBackends variable name.
--- src/backend/postmaster/postmaster.c | 18 +- src/backend/utils/init/postinit.c | 4 ++-- src/backend/utils/misc/guc_tables.c | 2 +- src/include/postmaster/postmaster.h | 2 +- 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 9cedc1b9f0..470704f364 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -204,15 +204,15 @@ char *Unix_socket_directories; char *ListenAddresses; /* - * ReservedBackends is the number of backends reserved for superuser use. - * This number is taken out of the pool size given by MaxConnections so + * SuperuserReservedBackends is the number of backends reserved for superuser + * use. This number is taken out of the pool size given by MaxConnections so * number of backend slots available to non-superusers is - * (MaxConnections - ReservedBackends). Note what this really means is - * "if there are <= ReservedBackends connections available, only superusers - * can make new connections" --- pre-existing superuser connections don't - * count against the limit. + * (MaxConnections - SuperuserReservedBackends). Note what this really means + * is "if there are <= SuperuserReservedBackends connections available, only + * superusers can make new connections" --- pre-existing superuser connections + * don't count against the limit. */ -int ReservedBackends; +int SuperuserReservedBackends; /* The socket(s) we're listening to. */ #define MAXLISTEN 64 @@ -908,11 +908,11 @@ PostmasterMain(int argc, char *argv[]) /* * Check for invalid combinations of GUC settings. 
*/ - if (ReservedBackends >= MaxConnections) + if (SuperuserReservedBackends >= MaxConnections) { write_stderr("%s: superuser_reserved_connections (%d) must be less than max_connections (%d)\n", progname, - ReservedBackends, MaxConnections); + SuperuserReservedBackends, MaxConnections); ExitPostmaster(1); } if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL) diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index ae5a85ed65..6fa696fe8d 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -927,8 +927,8 @@ InitPostgres(const char *in_dbname, Oid dboid, * limited by max_connections or superuser_reserved_connections. */ if (!am_superuser && !am_walsender && - ReservedBackends > 0 && - !HaveNFreeProcs(ReservedBackends)) + SuperuserReservedBackends > 0 && + !HaveNFreeProcs(SuperuserReservedBackends)) ereport(FATAL, (errcode(ERRCODE_TOO_MANY_CONNECTIONS), errmsg("remaining connection slots are reserved for non-replication superuser connections"))); diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 5025e80f89..5aa2cda8f9 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2163,7 +2163,7 @@ struct config_int ConfigureNamesInt[] = gettext_noop("Sets the number of connection slots reserved for superusers."), NULL }, - , + , 3, 0, MAX_BACKENDS, NULL, NULL, NULL }, diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h index 203177e1ff..168d85a3d1 100644 ---
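The "count the free slots once, then compare against both limits" idea from the discussion above could be sketched like this. The names and message texts here are hypothetical Python stand-ins for the C logic being patched, not the actual implementation:

```python
def check_connection_limits(n_free, reserved, superuser_reserved,
                            is_superuser, has_reserved_privilege):
    """Return None if the connection is allowed, else an error string.

    n_free is obtained once (the out-parameter idea) rather than by
    calling HaveNFreeProcs() separately for each limit, so both checks
    see the same snapshot and the error message is always consistent.
    """
    if is_superuser:
        return None
    if n_free <= superuser_reserved:
        # The innermost pool of slots is reserved for superusers only.
        return "remaining connection slots are reserved for superusers"
    if not has_reserved_privilege and n_free <= superuser_reserved + reserved:
        # The next pool is reserved for roles with the new privilege.
        return ("remaining connection slots are reserved for roles with "
                "privileges of pg_use_reserved_connections")
    return None
```

Using a single snapshot avoids the inconsistency described above, where two separate counts could disagree and produce the reserved_connections message even with reserved_connections set to 0.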
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 7:54 AM Robert Haas wrote: > > It just fits: the dead tuples approach can sometimes be so > > completely wrong that even an alternative triggering condition based > > on something that is virtually unrelated to the thing we actually care > > about can do much better in practice. Consistently, reliably, for a > > given table/workload. > > Hmm, I don't know. I have no intuition one way or the other for > whether we're undercounting dead tuples, and I don't understand what > would cause us to do that. I thought that we tracked that accurately, > as part of the statistics system, not by sampling > (pg_stat_all_tables.n_dead_tup). It's both, kind of. pgstat_report_analyze() will totally override the tabentry->dead_tuples information that drives autovacuum.c, based on an estimate derived from a random sample -- which seems to me to be an approach that just doesn't have any sound theoretical basis. So while there is a sense in which we track dead tuples incrementally and accurately using the statistics system, we occasionally call pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT we might as well not even bother tracking things reliably the rest of the time. Random sampling works because the things that you don't sample are very well represented by the things that you do sample. That's why even very stale optimizer statistics can work quite well (and why the EAV anti-pattern makes query optimization impossible) -- the distribution is often fixed, more or less. The statistics generalize very well because the data meets certain underlying assumptions that all data stored in a relational database is theoretically supposed to meet. Whereas with dead tuples, the whole point is to observe and count dead tuples so that autovacuum can then go remove the dead tuples -- which then utterly changes the situation! That's a huge difference.
ISTM that you need a *totally* different approach for something that's fundamentally dynamic, which is what this really is. Think about how the random sampling will work in a very large table with concentrated updates. The modified pages need to outweigh the large majority of pages in the table that can be skipped by VACUUM anyway. I wonder how workable it would be to just teach pgstat_report_analyze and pgstat_report_vacuum to keep out of this, or to not update the stats unless it's to increase the number of dead_tuples... > I think we ought to fire autovacuum_vacuum_scale_factor out of an > airlock. Couldn't agree more. I think that this and the underlying statistics are the really big problem as far as under-vacuuming is concerned. > I think we also ought to invent some sort of better cost limit system > that doesn't shoot you in the foot automatically as the database > grows. Nobody actually wants to limit the rate at which the database > vacuums stuff to a constant. What they really want to do is limit it > to a rate that is somewhat faster than the minimum rate needed to > avoid disaster. We should try to develop metrics for whether vacuum is > keeping up. Definitely agree that doing some kind of dynamic updating is promising. What we thought at the start versus what actually happened. Something cyclic, just like autovacuum itself. > I don't actually see any reason why dead tuples, even counted in a > relatively stupid way, isn't fundamentally good enough to get all > tables vacuumed before we hit the XID age cutoff. It doesn't actually > do that right now, but I feel like that must be because we're doing > other stupid things, not because there's anything that terrible about > the metric as such. Maybe that's wrong, but I find it hard to imagine. On reflection, maybe you're right here. Maybe it's true that the bigger problem is just that the implementation is bad, even on its own terms -- since it's pretty bad! Hard to say at this point. 
Depends on how you define it, too. Statistical sampling is just not fit for purpose here. But is that a problem with autovacuum_vacuum_scale_factor? I may have said words that could reasonably be interpreted that way, but I'm not prepared to blame it on the underlying autovacuum_vacuum_scale_factor model now. It's fuzzy.

> > We're
> > still subject to the laws of physics. VACUUM would still be something
> > that more or less works at the level of the whole table, or not at
> > all. So being omniscient seems kinda overrated to me. Adding more
> > information does not in general lead to better outcomes.
>
> Yeah, I think that's true. In particular, it's not much use being
> omniscient but stupid. It would be better to have limited information
> and be smart about what you did with it.

I would put it like this: autovacuum shouldn't ever be a sucker. It should pay attention to disconfirmatory signals. The information that drives its decision-making process should be treated as provisional. Even if the information was correct at one point, the contents of the table are constantly changing
Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions
Hi, On 2023-01-18 10:22:14 -0800, Andres Freund wrote: > On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote: > > On 07.01.23 08:21, Peter Eisentraut wrote: > > > This patch version looks correct to me. It is almost the same as the > > > one that Andres had posted in his thread, except that yours also > > > modifies slist_delete() and dlist_member_check(). Both of these changes > > > also look correct to me. > > > > committed > > Unfortunately this causes a build failure with ILIST_DEBUG > enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work > with const :(. I'll push a quick workaround. Pushed. Greetings, Andres Freund
Re: [PATCH] Add <> support to sepgsql_restorecon
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

This needs regression test support for the feature and some minimal documentation that shows how to make use of it.

The new status of this patch is: Waiting on Author
Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions
Hi, On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote: > On 07.01.23 08:21, Peter Eisentraut wrote: > > On 23.11.22 14:57, Aleksander Alekseev wrote: > > > Hi Andres, > > > > > > Thanks for the review! > > > > > > > I don't think it is correct for any of these to add const. The > > > > only reason it > > > > works is because of casting etc. > > > > > > Fair enough. PFA the corrected patch v2. > > > > This patch version looks correct to me. It is almost the same as the > > one that Andres had posted in his thread, except that yours also > > modifies slist_delete() and dlist_member_check(). Both of these changes > > also look correct to me. > > committed Unfortunately this causes a build failure with ILIST_DEBUG enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work with const :(. I'll push a quick workaround. Greetings, Andres Freund
Re: document the need to analyze partitioned tables
On Wed, Jan 18, 2023 at 11:49:19AM -0600, Justin Pryzby wrote: > On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote: > > On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote: > > > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote: > > > > Maybe (all?) the clarification the docs need is to say: > > > > "Partitioned tables are not *themselves* processed by autovacuum." > > > > > > Yes, I think the lack of autovacuum needs to be specifically mentioned > > > since most people assume autovacuum handles _all_ statistics updating. > > That's what 61fa6ca79 aimed to do. Laurenz is suggesting further > clarification. Ah, makes sense, thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: Implement missing join selectivity estimation for range types
Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs a proper comment, not just "this is a copy from rangetypes". However, it seems the two functions are exactly the same. Would the functions diverge in the future? If not, maybe there should be just a single shared function? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: document the need to analyze partitioned tables
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> >
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.

That's what 61fa6ca79 aimed to do. Laurenz is suggesting further clarification.

> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables? It sounds bad to me.
>
> Andrey Lepikhov had an example earlier in this thread[1]. It doesn't take
> an exotic query.
>
> Attached is a new version of my patch that tries to improve the wording.

I tweaked this a bit to end up with:

> -Partitioned tables are not processed by autovacuum. Statistics
> -should be collected by running a manual ANALYZE when it is
> +The leaf partitions of a partitioned table are normal tables and are processed
> +by autovacuum; however, autovacuum does not process the partitioned table itself.
> +This is no problem as far as VACUUM is concerned, since
> +there's no need to vacuum the empty, partitioned table. But, as mentioned in
> +, it also means that autovacuum won't
> +run ANALYZE on the partitioned table.
> +Although statistics are automatically gathered on its leaf partitions, some queries also need
> +statistics on the partitioned table to run optimally. You should collect statistics by
> +running a manual ANALYZE when the partitioned table is
> first populated, and again whenever the distribution of data in its
> partitions changes significantly.

"partitions are normal tables" was technically wrong, as partitions can also be partitioned.

-- Justin
Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences
On 2023-Jan-18, Karl O. Pinc wrote:
> On Wed, 18 Jan 2023 13:30:45 +0100
> Alvaro Herrera wrote:
>
> > Not related to this patch: it's very annoying that in the PDF output,
> > each section in the appendix doesn't start on a blank page -- which
> > means that the doc page for many modules starts in the middle of a
> > page where the previous one ends.
> >
> > I wonder if we can tweak something in the stylesheet to include a page
> > break.
>
> Would this be something to be included in this patch?
> (If I can figure it out.)

No, I think we should do that change separately. I just didn't think a parenthetical complaint was worth a separate thread for it; but if you do create a patch, please do create a new thread (unless the current patch in this one is committed already.)

-- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)
Re: Removing redundant grouping columns
David Rowley writes: > No objections from me. Pushed, thanks for looking at it. regards, tom lane
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 18/01/23 15:12, David Rowley wrote:
> I also thought I'd better test that foreach_delete_current() works with
> foreach_reverse(). I can confirm that it *does not* work correctly. I guess
> maybe you only tested the fact that it deleted the current item and not
> that the subsequent loop correctly went to the item directly before the
> deleted item. There's a problem with that. We skip an item.

Hmm, not really sure why I missed that. I tried this again (added the following in postgres.c above PortalStart):

List *l = NIL;
l = lappend(l, 1);
l = lappend(l, 2);
l = lappend(l, 3);
l = lappend(l, 4);

ListCell *lc;
foreach_reverse(lc, l)
{
    if (foreach_current_index(lc) == 2) /* delete 3 */
    {
        foreach_delete_current(l, lc);
    }
}

foreach(lc, l)
{
    int i = (int) lfirst(lc);
    ereport(LOG, (errmsg("%d", i)));
}

Got this result:

2023-01-18 20:23:28.115 IST [51007] LOG: 1
2023-01-18 20:23:28.115 IST [51007] STATEMENT: select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG: 2
2023-01-18 20:23:28.115 IST [51007] STATEMENT: select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG: 4
2023-01-18 20:23:28.115 IST [51007] STATEMENT: select pg_backend_pid();

I had expected list_delete_cell to take care of the rest.

> Instead of fixing that, I think it's likely better just to loop backwards
> manually with a for() loop, so I've adjusted the patch to work that way.
> It's quite likely that the additional code in foreach() and what was in
> foreach_reverse() is slower than looping manually due to the additional
> work those macros do to set the cell to NULL when we run out of cells to
> loop over.

Okay, the current version looks fine as well.

> I made another pass over the v7 patch and fixed a bug that was disabling
> the optimization when the deepest WindowAgg had a runCondition. This should
> have been using llast_node instead of linitial_node. The top-level
> WindowAgg is the last in the list.
> I also made a pass over the tests and added a test case for the
> runCondition check to make sure we disable the optimization when the
> top-level WindowAgg has one of those.
>
> I wasn't sure what your test comments case numbers were meant to represent.
> They were not aligned with the code comments that document when the
> optimisation is disabled; they started out aligned, but seemed to go off
> the rails at #3. I've now made them follow the comments in
> create_one_window_path() and made it more clear what we expect the test
> outcome to be in each case.

Those were just numbering for exceptional cases; making them in sync with the comments wasn't really on my mind, but now they look better.

> I've attached the v9 patch. I feel like this patch is quite self-contained
> and I'm quite happy with it now. If there are no objections soon, I'm
> planning on pushing it.

The patch is already rebased with latest master, and tests are all green. Tried some basic profiling and it looked good. I also tried a somewhat unrealistic case.
create table abcdefgh(a int, b int, c int, d int, e int, f int, g int, h int);

insert into abcdefgh select a,b,c,d,e,f,g,h
from generate_series(1,7) a, generate_series(1,7) b, generate_series(1,7) c,
     generate_series(1,7) d, generate_series(1,7) e, generate_series(1,7) f,
     generate_series(1,7) g, generate_series(1,7) h;

explain analyze select count(*) over (order by a),
row_number() over (partition by a order by b) from abcdefgh order by a,b,c,d,e,f,g,h;

In patch version:

                                  QUERY PLAN
-------------------------------------------------------------------------------
 WindowAgg (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=64957.894..81950.352 rows=5764801 loops=1)
   -> WindowAgg (cost=1023241.14..1138536.30 rows=5764758 width=40) (actual time=37959.055..60391.799 rows=5764801 loops=1)
      -> Sort (cost=1023241.14..1037653.03 rows=5764758 width=32) (actual time=37959.045..52968.791 rows=5764801 loops=1)
         Sort Key: a, b, c, d, e, f, g, h
         Sort Method: external merge Disk: 237016kB
         -> Seq Scan on abcdefgh (cost=0.00..100036.58 rows=5764758 width=32) (actual time=0.857..1341.107 rows=5764801 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 82748.789 ms
(8 rows)

In master:

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Incremental Sort (cost=1040889.72..1960081.97 rows=5764758 width=48) (actual time=23461.815..69654.700 rows=5764801 loops=1)
   Sort Key: a, b, c, d, e, f, g, h
   Presorted Key: a, b
   Full-sort Groups: 49 Sort Method: quicksort Average Memory: 30kB Peak Memory: 30kB
   Pre-sorted Groups: 49 Sort Method: external merge Average Disk: 6688kB Peak Disk: 6688kB
   -> WindowAgg (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=22729.171..40189.407 rows=5764801 loops=1)
      ->
Re: Implement missing join selectivity estimation for range types
Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
> SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the << predicate,
> which is equal to 0.005, not utilizing the statistics that the optimizer
> collects for range attributes.
>
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range
> types in this patch.

Interesting. Are there any particular differences compared to how we estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics for
> range types, which is the bounds histogram. It works fairly accurately for
> the operations <<, >>, &&, &<, &>, <=, >= with an estimation error of
> about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics target, which creates 100 histogram bins, each representing ~1% of the values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the
> operations @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the range
> bounds. A point to discuss is whether or not to keep these last two
> operations.

That's a good question. I think the independence assumption is rather foolish in this case, so I wonder if we could "stabilize" this by making some different - less optimistic - assumption. Essentially, we have estimates for the lower/upper boundaries:

P1 = P(lower(var1) <= lower(var2))
P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?

Another option is to use the length histogram, right?
I mean, we know what the average length is, and it should be possible to use that to calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange types,
> treating a multirange as a single range which is its bounding box.

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of scalar
> types. We, however, don't implement it for scalars, since more work is needed
> to make use of the other statistics available for scalars, such as the MCV.
> This is left as a future work.

So if the column(s) contain a couple very common (multi)ranges that make it into an MCV, we'll ignore those? That's a bit unfortunate, because those MCV elements are potentially the main contributors to selectivity.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
CREATEROLE users vs. role properties
On Mon, Jan 16, 2023 at 2:29 PM Robert Haas wrote: > 1. It's still possible for a CREATEROLE user to hand out role > attributes that they don't possess. The new prohibitions in > cf5eb37c5ee0cc54c80d95c1695d7fca1f7c68cb prevent a CREATEROLE user > from handing out membership in a role on which they lack sufficient > permissions, but they don't prevent a CREATEROLE user who lacks > CREATEDB from creating a new user who does have CREATEDB. I think we > should subject the CREATEDB, REPLICATION, and BYPASSRLS attributes to > the same rule that we now use for role memberships: you've got to have > the property in order to give it to someone else. In the case of > CREATEDB, this would tighten the current rules, which allow you to > give out CREATEDB without having it. In the case of REPLICATION and > BYPASSRLS, this would liberalize the current rules: right now, a > CREATEROLE user cannot give REPLICATION or BYPASSRLS to another user > even if they possess those attributes. > > This proposal doesn't address the CREATEROLE or CONNECTION LIMIT > properties. It seems possible to me that someone might want to set up > a CREATEROLE user who can't make more such users, and this proposal > doesn't manufacture any way of doing that. It also doesn't let you > constraint the ability of a CREATEROLE user to set a CONNECTION LIMIT > for some other user. I think that's OK. It might be nice to have ways > of imposing such restrictions at some point in the future, but it is > not very obvious what to do about such cases and, importantly, I don't > think there's any security impact from failing to address those cases. > If a CREATEROLE user without CREATEDB can create a new role that does > have CREATEDB, that's a privilege escalation. If they can hand out > CREATEROLE, that isn't: they already have it. Here is a patch implementing the above proposal. Since this is fairly closely related to already-committed work, I would like to get this into v16. 
That way, all the changes to how CREATEROLE works will go into a single release, which seems less confusing for users. It is also fairly clear to me that this is an improvement over the status quo. Sometimes things that seem clear to me turn out to be false, so if this change seems like a problem to you, please let me know. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Adjust-interaction-of-CREATEROLE-with-role-proper.patch Description: Binary data
Re: ANY_VALUE aggregate
On 1/18/23 16:06, Peter Eisentraut wrote: On 05.12.22 21:18, Vik Fearing wrote: On 12/5/22 15:57, Vik Fearing wrote: The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. PFA an implementation of this aggregate. Here is v2 of this patch. I had forgotten to update sql_features.txt. In your patch, the documentation says the definition is any_value("any") but the catalog definitions are any_value(anyelement). Please sort that out. Since the transition function is declared strict, null values don't need to be checked. Thank you for the review. Attached is a new version rebased to d540a02a72. -- Vik Fearing From 9cf2c5b56ea38d3080c0cb9f8ef9e6229d8696b4 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 9 Apr 2022 00:07:38 +0200 Subject: [PATCH] Implement ANY_VALUE aggregate SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an implementation-dependent (i.e. non-deterministic) value from the aggregated rows. --- doc/src/sgml/func.sgml | 14 ++ src/backend/catalog/sql_features.txt | 1 + src/backend/utils/adt/misc.c | 13 + src/include/catalog/pg_aggregate.dat | 4 src/include/catalog/pg_proc.dat | 8 src/test/regress/expected/aggregates.out | 24 src/test/regress/sql/aggregates.sql | 6 ++ 7 files changed, 70 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b8dac9ef46..8ff9decfec 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19735,6 +19735,20 @@ SELECT NULLIF(value, '(none)') ... + + + + any_value + +any_value ( anyelement ) +same as input type + + +Chooses a non-deterministic value from the non-null input values. 
+ + Yes + + diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index abad216b7e..dfd3882801 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -520,6 +520,7 @@ T622 Trigonometric functions YES T623 General logarithm functions YES T624 Common logarithm functions YES T625 LISTAGG NO +T626 ANY_VALUE YES T631 IN predicate with one list element YES T641 Multiple column assignment NO only some syntax variants supported T651 SQL-schema statements in SQL routines YES diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 220ddb8c01..a9251f977e 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -1041,3 +1041,16 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS) else PG_RETURN_NULL(); } + +/* + * Transition function for the ANY_VALUE aggregate + * + * Currently this just returns the first value, but in the future it might be + * able to signal to the aggregate that it does not need to be called anymore. 
+ */ +Datum +any_value_trans(PG_FUNCTION_ARGS) +{ + PG_RETURN_DATUM(PG_GETARG_DATUM(0)); +} + diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index 8c957437ea..aac60dee58 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -625,4 +625,8 @@ aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, +# any_value +{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans', + aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' }, + ] diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 86eb8e8c58..95e760440e 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11891,4 +11891,12 @@ prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary', prosrc => 'brin_minmax_multi_summary_send' }, +{ oid => '8981', descr => 'arbitrary value from among input values', + proname => 'any_value', prokind => 'a', proisstrict => 'f', + prorettype => 'anyelement', proargtypes => 'anyelement', + prosrc => 'aggregate_dummy' }, +{ oid => '8982', descr => 'any_value transition function', + proname => 'any_value_trans', prorettype => 'anyelement', proargtypes => 'anyelement anyelement', + prosrc => 'any_value_trans' }, + ] diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index ae3b905331..b240ef522b 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -25,6 +25,24 @@ SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100; 32.6667 (1 row) +SELECT any_value(v) FROM (VALUES (1)) AS v (v); + any_value +--- + 1 +(1 row) + +SELECT any_value(v) FROM (VALUES (NULL)) AS v (v); + any_value +--- + +(1 row) +
Re: Improve GetConfigOptionValues function
On Wed, Jan 18, 2023 at 9:44 PM Tom Lane wrote:
>
> Possibly a better answer is to refactor into separate functions,
> along the lines of
>
> static bool
> ConfigOptionIsShowable(struct config_generic *conf)
>
> static void
> GetConfigOptionValues(struct config_generic *conf, const char **values)

+1, and a ConfigOptionIsShowable() function can replace the explicit showable checks in two other places too.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi,

I took a quick look at this patch, to see if there's something we want/can get into v16. The last version was submitted about 9 months ago, and it doesn't apply cleanly anymore, but the bitrot is fairly minor. Not sure there's still interest, though.

As for the patch, I wonder if it's unnecessarily complex. It adds *two* timestamps for each pg_stat_statements entry - one for reset of the whole entry, one for reset of "min/max" times only. I can see why the first timestamp (essentially tracking creation of the entry) is useful. I'd probably call it "created_at" or something like that, but that's a minor detail. Or maybe stats_reset, which is what we use in pgstat?

But is the second timestamp for the min/max fields really useful? AFAIK to perform analysis, people take regular pg_stat_statements snapshots, which works fine for counters (calculating deltas) but not for gauges (which need a reset, to track fresh values). But people analyzing this are already resetting the whole entry, and so the snapshots already are tracking deltas. So I'm not convinced we actually need the second timestamp.

A couple more comments:

1) I'm not sure why the patch is adding tests of permissions on the pg_stat_statements_reset function?

2) If we want the second timestamp, shouldn't it also cover resets of the mean values, not just min/max?

3) I don't understand why the patch is adding "IS NOT NULL AS t" to various places in the regression tests.

4) I rather dislike the "minmax" naming, because that's often used in other contexts (for BRIN indexes), and as I mentioned maybe it should also cover the "mean" fields.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: almost-super-user problems that we haven't fixed yet
On Tue, Jan 17, 2023 at 7:15 PM Nathan Bossart wrote:
> Great. Here is a first attempt at the patch.

In general, looks good. I think this will often call HaveNFreeProcs twice, though, and that would be better to avoid, e.g.

if (!am_superuser && !am_walsender &&
    (SuperuserReservedBackends + ReservedBackends) > 0 &&
    !HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends))
{
    if (!HaveNFreeProcs(SuperuserReservedBackends))
        remaining connection slots are reserved for non-replication superuser connections;
    if (!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS))
        remaining connection slots are reserved for roles with privileges of pg_use_reserved_backends;
}

In the common case where we hit neither limit, this only counts free connection slots once. We could do even better by making HaveNFreeProcs have an out parameter for the number of free procs actually found when it returns false, but that's probably not important.

I don't think that we should default both the existing GUC and the new one to 3, because that raises the default limit in the case where the new feature is not used from 3 to 6. I think we should default one of them to 0 and the other one to 3. Not sure which one should get which value.

> > I think the documentation will need some careful wordsmithing,
> > including adjustments to superuser_reserved_connections. We want to
> > recast superuser_reserved_connections as a final reserve to be touched
> > after even reserved_connections has been exhausted.
>
> I tried to do this, but there is probably still room for improvement,
> especially for the parts that discuss the relationship between
> max_connections, superuser_reserved_connections, and reserved_connections.

I think it's pretty good the way you have it. I agree that there might be a way to make it even better, but I don't think I know what it is.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Improve GetConfigOptionValues function
Nitin Jadhav writes: > GetConfigOptionValues function extracts the config parameters for the > given variable irrespective of whether it results in noshow or not. > But the parent function show_all_settings ignores the values parameter > if it results in noshow. It's unnecessary to fetch all the values > during noshow. So a return statement in GetConfigOptionValues() when > noshow is set to true is needed. Attached the patch for the same. > Please share your thoughts. I do not think this is an improvement: it causes GetConfigOptionValues to be making assumptions about how its results will be used. If show_all_settings() were a big performance bottleneck, and there were a lot of no-show values that we could optimize, then maybe the extra coupling would be worthwhile. But I don't believe either of those things. Possibly a better answer is to refactor into separate functions, along the lines of static bool ConfigOptionIsShowable(struct config_generic *conf) static void GetConfigOptionValues(struct config_generic *conf, const char **values) regards, tom lane
Re: ANY_VALUE aggregate
On 1/18/23 16:55, Tom Lane wrote: Peter Eisentraut writes: On 05.12.22 21:18, Vik Fearing wrote: On 12/5/22 15:57, Vik Fearing wrote: The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. Since the transition function is declared strict, null values don't need to be checked. Hmm, but should it be strict? That means that what it's returning is *not* "any value" but "any non-null value". What does the draft spec have to say about that? It falls into the same category as AVG() etc. That is, nulls are removed before calculation. -- Vik Fearing
Re: ANY_VALUE aggregate
Peter Eisentraut writes: > On 05.12.22 21:18, Vik Fearing wrote: >> On 12/5/22 15:57, Vik Fearing wrote: >>> The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It >>> returns an implementation-dependent (i.e. non-deterministic) value >>> from the rows in its group. > Since the transition function is declared strict, null values don't need > to be checked. Hmm, but should it be strict? That means that what it's returning is *not* "any value" but "any non-null value". What does the draft spec have to say about that? regards, tom lane
Re: [DOCS] Stats views and functions not in order?
On Wed, Jan 18, 2023 at 8:38 AM Tom Lane wrote: > "David G. Johnston" writes: > > ... I was going for the html effect > > of having these views chunked into their own pages, any other changes > being > > non-detrimental. > > But is that a result we want? It will for example break any bookmarks > that people might have for these documentation entries. It will also > pretty thoroughly break the cross-version navigation links in this > part of the docs. > Maybe the benefit is worth those costs, but I'm entirely not convinced > of that. I think we need to tread pretty lightly when rearranging > longstanding documentation-layout decisions. > > Fair points. The external linking can be solved with redirect rules, as I believe we've done before, and fairly recently. Even if not I think when they see why the break happened they will be happy for the improved user experience. I do think it is important enough a change to warrant breaking the cross-version navigation links. I can imagine a linking scheme that would still work but I'm doubtful that this is important enough to expend the development effort. David J.