Some code cleanup for pgbench and pg_verifybackup
Hi all,

While looking at the code areas of $subject, I was surprised by a couple of things:

- pgbench has its own parsing routines for int64 and double, with an option
  to skip errors.  That's not surprising in itself, but, for strtodouble(),
  errorOK is always true, meaning that the error strings are dead code.  For
  strtoint64(), errorOK is false only when parsing a Variable, where a second
  error string is generated.  I don't really think that we need to be that
  pedantic about the type of errors generated in those code paths when
  failing to parse a variable, so I'd like to propose a simplification of the
  code where we reuse the same error message as for double, cutting down a
  bit on the number of translatable strings.

- pgbench and pg_verifybackup make use of pg_log_fatal() to report some
  failures in code paths dedicated to command-line options, which is
  inconsistent with all the other tools that use pg_log_error().

Please find attached a patch to clean up all those inconsistencies.

Thoughts?
--
Michael

diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index f5ebd57a47..5f0469373f 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -261,7 +261,7 @@ main(int argc, char **argv)
 	/* Get backup directory name */
 	if (optind >= argc)
 	{
-		pg_log_fatal("no backup directory specified");
+		pg_log_error("no backup directory specified");
 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 				progname);
 		exit(1);
@@ -272,7 +272,7 @@ main(int argc, char **argv)
 	/* Complain if any arguments remain */
 	if (optind < argc)
 	{
-		pg_log_fatal("too many command-line arguments (first is \"%s\")",
+		pg_log_error("too many command-line arguments (first is \"%s\")",
 					 argv[optind]);
 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 				progname);
diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 75432cedc6..b3322f01f7 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -205,19 +205,19 @@ notnull			[Nn][Oo][Tt][Nn][Uu][Ll][Ll]
 					return MAXINT_PLUS_ONE_CONST;
 				}
{digit}+			{
-					if (!strtoint64(yytext, true, &yylval->ival))
+					if (!strtoint64(yytext, &yylval->ival))
 						expr_yyerror_more(yyscanner, "bigint constant overflow",
 										  strdup(yytext));
 					return INTEGER_CONST;
 				}
{digit}+(\.{digit}*)?([eE][-+]?{digit}+)?	{
-					if (!strtodouble(yytext, true, &yylval->dval))
+					if (!strtodouble(yytext, &yylval->dval))
 						expr_yyerror_more(yyscanner, "double constant overflow",
 										  strdup(yytext));
 					return DOUBLE_CONST;
 				}
\.{digit}+([eE][-+]?{digit}+)?	{
-					if (!strtodouble(yytext, true, &yylval->dval))
+					if (!strtodouble(yytext, &yylval->dval))
 						expr_yyerror_more(yyscanner, "double constant overflow",
 										  strdup(yytext));
 					return DOUBLE_CONST;
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c51ebb8e31..87f3b9f3b6 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -803,7 +803,7 @@ is_an_int(const char *str)
  * If not errorOK, an error message is also printed out on errors.
  */
 bool
-strtoint64(const char *str, bool errorOK, int64 *result)
+strtoint64(const char *str, int64 *result)
 {
 	const char *ptr = str;
 	int64		tmp = 0;
@@ -862,19 +862,15 @@ strtoint64(const char *str, bool errorOK, int64 *result)
 	return true;
 
 out_of_range:
-	if (!errorOK)
-		pg_log_error("value \"%s\" is out of range for type bigint", str);
 	return false;
 
 invalid_syntax:
-	if (!errorOK)
-		pg_log_error("invalid input syntax for type bigint: \"%s\"", str);
 	return false;
 }
 
 /* convert string to double, detecting overflows/underflows */
 bool
-strtodouble(const char *str, bool errorOK, double *dv)
+strtodouble(const char *str, double *dv)
 {
 	char	   *end;
 
@@ -882,18 +878,9 @@ strtodouble(const char *str, bool errorOK, double *dv)
 	*dv = strtod(str, &end);
 
 	if (unlikely(errno != 0))
-	{
-		if (!errorOK)
-			pg_log_error("value \"%s\" is out of range for type double", str);
 		return false;
-	}
 
 	if (unlikely(end == str || *end != '\0'))
-	{
-		if (!errorOK)
-			pg_log_error("invalid input syntax for type double: \"%s\"", str);
 		return false;
-	}
 
 	return true;
 }
@@ -1525,16 +1512,19 @@ makeVariableValue(Variable *var)
 		/* if it looks like an int, it must be an int without overflow */
 		int64		iv;
 
-		if (!strtoint64(var->svalue, false, &iv))
+		if (!strtoint64(var->svalue, &iv))
+		{
+			pg_log_error("malformed variable \"%s\" value: \"%s\"",
+						 var->name, var->svalue);
 			return false;
-
+		}
 		setIntValue(&var->value, iv);
 	}
 	else						/* type should be double */
 	{
		double		dv;
 
-		if (!strtodouble(var->svalue, true, &dv))
+		if (!strtodouble(var->svalue, &dv))
 		{
 			pg_log_error("malformed variable \"%s\" value: \"%s\"",
 						 var->name, var->svalue);
@@ -5900,12 +5890,12 @@ main(int
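To illustrate the proposed division of labor, here is a standalone sketch:
strtodouble() only reports success or failure, and the caller owns any error
message.  (Simplified; pgbench's real code uses pg_log_error rather than
fprintf, and the int64 path is analogous.)

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the simplified contract: no errorOK, no messages in here. */
static bool
strtodouble_sketch(const char *str, double *dv)
{
	char	   *end;

	errno = 0;
	*dv = strtod(str, &end);

	if (errno != 0)
		return false;			/* out of range */
	if (end == str || *end != '\0')
		return false;			/* invalid syntax */
	return true;
}

int
main(void)
{
	double		dv;

	/* the caller decides how to complain, mirroring makeVariableValue() */
	if (!strtodouble_sketch("1e999", &dv))
		fprintf(stderr, "malformed variable \"%s\" value: \"%s\"\n",
				"x", "1e999");
	return 0;
}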
Re: something is wonky with pgbench pipelining
On Sat, Jul 24, 2021 at 04:08:33PM -0700, Andres Freund wrote:
> On 2021-07-21 16:55:08 -0700, Andres Freund wrote:
>> I'm inclined to push it to 14 and master, but ...
>
> RMT: ^

If it were me, I think that I would have back-patched this change even if it
had been found after the GA release of 14, as there are no advantages in
keeping the current behavior other than overloading pgbench with unnecessary
system calls.  No objections from me to change that now for 14~.
--
Michael
Re: pg_upgrade don't echo windows commands
On Fri, Jun 04, 2021 at 12:10:47PM -0400, Andrew Dunstan wrote:
> Here's a completely trivial command to turn off echoing of a couple of
> Windows commands pg_upgrade writes to cleanup scripts. This makes them
> behave more like the Unix equivalents.

Why not.  Perhaps you should add a comment to mention that prefixing those
commands with @ disables echoing of the command itself.  That's not obvious
to the reader.
--
Michael
Re: [Doc] Tiny fix for regression tests example
On Mon, Jul 26, 2021 at 05:50:28AM +0000, tanghy.f...@fujitsu.com wrote:
> Thanks for your comment. Agree with your suggestion.
> Modified it in the attachment patch.

Okay, applied, but without the pwd part.
--
Michael
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Jul 7, 2021 at 5:35 PM Bharath Rupireddy wrote:
>
> Attaching v6 patch rebased onto the latest master.

I came across the recent commit 81d5995 and have used the same error message
for temporary and unlogged tables. I also added test cases to cover these
error cases for foreign, temporary, unlogged and system tables with the
CREATE PUBLICATION command. PSA v7.

commit 81d5995b4b78017ef9e5c6f151361d1fb949924c
Author: Peter Eisentraut
Date:   Wed Jul 21 07:40:05 2021 +0200

    More improvements of error messages about mismatching relkind

Regards,
Bharath Rupireddy.

v7-0001-Improve-publication-error-messages.patch
Re: CREATE SEQUENCE with RESTART option
On Sat, Jul 24, 2021 at 09:56:40PM +0530, Bharath Rupireddy wrote:
> LGTM. PSA v2 patch.

FWIW, like Ashutosh upthread, my vote would be to do nothing here in terms
of behavior changes, as this is just breaking a behavior for the sake of
breaking it, so there is a chance that this is going to piss off some users
who relied accidentally on the existing behavior.

I think that we should document that RESTART is accepted within the set of
options, as a consequence of the set of options supported by gram.y, though.
--
Michael
Re: Some code cleanup for pgbench and pg_verifybackup
Bonjour Michaël,

My 0.02€:

> - pgbench has its own parsing routines for int64 and double, with an
> option to skip errors. That's not surprising in itself, but, for
> strtodouble(), errorOK is always true, meaning that the error strings
> are dead.

Indeed. However, there are "atof" calls for option parsing which should
rather use strtodouble, and those may want to call it with errorOK as
either true or false, depending on the case.

> For strtoint64(), errorOK is false only when parsing a Variable, where
> a second error string is generated.

ISTM that it just returns false; there is no message about the parsing
error, hence the message is generated in the function.

> I don't really think that we need to be that pedantic about the type of
> errors generated in those code paths when failing to parse a variable,
> so I'd like to propose a simplification of the code where we reuse the
> same error message as for double, cutting a bit the number of
> translatable strings.

ISTM that the point is that errors from the parser are handled differently
(by calling some "yyerror" function which does different things), so they
need a special call for that. For the other cases we would not want to have
to replicate generating an error message in each caller, so it is best done
directly in the function. Ok, currently there is only one call, but there
can be more, e.g. I have a not-yet-submitted patch to add a new option
which will need to parse an int64.

> - pgbench and pg_verifybackup make use of pg_log_fatal() to report some
> failures in code paths dedicated to command-line options, which is
> inconsistent with all the other tools that use pg_log_error().

The semantics for fatal vs error is that an error is somehow handled while
a fatal is not. If the log message is followed by a cold exit, ISTM that
fatal is the right call, and I cannot help it if other commands do not do
that. ISTM more logical to align the other commands to fatal when
appropriate.

> Thoughts?

I'd be in favor of leaving it as it is.

--
Fabien.
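To illustrate the atof point, option parsing could funnel through a small
checked helper; a hypothetical standalone sketch (parse_double_option is
invented here, and pgbench would use pg_log_error rather than fprintf):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper for option parsing: unlike atof(), this rejects
 * garbage and out-of-range input instead of silently returning 0.
 */
static double
parse_double_option(const char *optname, const char *value)
{
	char	   *end;
	double		val;

	errno = 0;
	val = strtod(value, &end);
	if (errno != 0 || end == value || *end != '\0')
	{
		fprintf(stderr, "invalid value for option %s: \"%s\"\n",
				optname, value);
		exit(1);
	}
	return val;
}

int
main(void)
{
	/* e.g. pgbench --rate takes a double; atof("oops") would yield 0 */
	double		rate = parse_double_option("--rate", "100.5");

	printf("rate = %g\n", rate);
	return 0;
}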
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila wrote:
>
> On Tue, Jul 20, 2021 at 9:24 AM Peter Smith wrote:
> >
> > Please find attached the latest patch set v98*
>
> Review comments:
> [...]
>
> With Regards,
> Amit Kapila.

Thanks for your review comments. I have been working through them today and
hope to post the v99* patches tomorrow.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
At Mon, 26 Jul 2021 15:01:35 +0900, Michael Paquier wrote in
> On Sat, Jul 24, 2021 at 07:41:12PM +0900, Michael Paquier wrote:
> > I have looked at that over the last couple of days, and applied it
> > after some small fixes, including an indentation.
>
> One thing that we forgot here is the handling of trailing
> whitespace.  Attached is a small patch to adjust that, with one
> positive and one negative test.
>
> > The int64 and float
> > parts are extra types we could handle, but I have not looked yet at
> > how much benefit we'd get in those cases.
>
> I have looked at these two but there is really less benefit, so for
> now I am not planning to do more in this area.  For float options,
> pg_basebackup --max-rate could be one target on top of the three sets
> of options in pgbench, but it needs to handle units :(

Thanks for revising and committing! I'm fine with all of the recent
discussions on the committed part.

Though I don't think it works for "live" command line options, making the
omitting policy symmetric looks good. I see the same done in several
similar uses of strto[il].

The change in 020_pg_receivewal.pl results in a chain of four successive
failures, which is fine. But the last failure (#23) happens for a bit
dubious reason.

> Use of uninitialized value in pattern match (m//) at t/020_pg_receivewal.pl line 114.
> not ok 23 - one partial WAL segment is now completed

It might not be worth amending, but we don't need to use m// there; eq
works fine.

020_pg_receivewal.pl:114
-	is($zlib_wals[0] =~ m/$partial_wals[0]/,
+	is($zlib_wals[0] eq $partial_wals[0],

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: shared-memory based stats collector
> > Yeah, thank you very much for checking that. However, this patch is
> > now developed in Andres' GitHub repository. So I'm at a loss what to
> > do about the failure..
>
> I'll post a rebased version soon.

(Sorry if you feel hurried, which I didn't mean to.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: speed up verifying UTF-8
Attached is v20, which has a number of improvements:

1. Cleaned up and explained the DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to
   produce an optimized loop, using branch-free accumulators. That way, it
   doesn't need to be rewritten for different input lengths. I also think
   it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.

With #2 above in place, I wanted to try different strides for the DFA, so
here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance.
Other platforms may differ, but I don't expect it to make a huge amount of
difference.

Just for fun, I was also a bit curious about what Vladimir mentioned
upthread about x86-64-v3 offering a different shift instruction. Somehow,
clang 12 refused to build with that target, even though the release notes
say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2":
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
  53:	c4 e2 eb f7 d1	shrxq	%rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16
shifts in the loop, which gcc won't do. To be clear, it's irrelevant, since
x86-64-v3 includes AVX2, and if we had that we would just use it with the
SIMD algorithm.

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

From c82cbcf342f986396152a743a552626757b0a2b3 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Sun, 25 Jul 2021 20:41:41 -0400
Subject: [PATCH v20] Add a fast path for validating UTF-8 text

Our previous validator is a traditional one that performs comparisons and
branching one byte at a time. It's useful in that we always know exactly
how many bytes we have validated, but that precision comes at a cost.
Input validation can show up prominently in profiles of COPY FROM, and
future improvements to COPY FROM such as parallelism or line and field
parsing will put more pressure on input validation. Hence, supplement with
two fast path implementations:

On machines that support SSE4, use an algorithm described in the paper
"Validating UTF-8 In Less Than One Instruction Per Byte" by John Keiser
and Daniel Lemire. The authors have made available an open source
implementation within the simdjson library (Apache 2.0 license).

The lookup tables and naming conventions were adopted from this library,
but the code was written from scratch.

On other hardware, use a "shift-based" DFA.

Both implementations are heavily optimized for blocks of ASCII text, are
relatively free of branching and thus robust against many kinds of byte
patterns, and delay error checking to the very end. With these algorithms,
UTF-8 validation is anywhere from two to seven times faster, depending on
platform and the distribution of byte sequences in the input.

The previous coding in pg_utf8_verifystr() is retained for short strings
and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki
Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and Greg
Stark

Discussion: https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com
---
 config/c-compiler.m4
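To give a flavor of item #2, a branch-free ASCII check accumulates the high
bit of every byte with OR and tests once at the end, so the hot loop has no
data-dependent branches. A minimal standalone sketch (not the patch's
actual is_valid_ascii):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool
is_valid_ascii_sketch(const unsigned char *s, size_t len)
{
	uint64_t	highbits = 0;
	size_t		i = 0;

	/* main loop: OR together 16 bytes per iteration, branch-free on data */
	for (; i + 16 <= len; i += 16)
	{
		uint64_t	half1,
					half2;

		memcpy(&half1, s + i, sizeof(half1));
		memcpy(&half2, s + i + 8, sizeof(half2));
		highbits |= half1 | half2;
	}

	/* tail: one byte at a time */
	for (; i < len; i++)
		highbits |= s[i];

	/* any set high bit anywhere means a non-ASCII byte was seen */
	return (highbits & UINT64_C(0x8080808080808080)) == 0;
}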
Re: speed up verifying UTF-8
Just wondering, do you have the code in a GitHub/Gitlab branch?

> +	utf8_advance(s, state, len);
> +
> +	/*
> +	 * If we saw an error during the loop, let the caller handle it. We treat
> +	 * all other states as success.
> +	 */
> +	if (state == ERR)
> +		return 0;

Did you mean "state = utf8_advance(s, state, len);" there? (reassign the
state variable)

> I wanted to try different strides for the DFA

Does that (and the "len >= 32" condition) mean the patch does not improve
validation of shorter strings (the ones less than 32 bytes)? It would
probably be nice to cover them as well (e.g. with 4- or 8-byte strides).

Vladimir
Re: 2021-07 CF now in progress
On Mon, Jul 19, 2021 at 4:37 PM Ibrar Ahmed wrote:
>
> On Mon, Jul 12, 2021 at 4:59 PM Ibrar Ahmed wrote:
>>
>> Hackers,
>>
>> The Commitfest 2021-07 is now in progress. It is one of the biggest ones:
>> the total number of patches in this commitfest is 342.
>>
>> Needs review: 204.
>> Waiting on Author: 40.
>> Ready for Committer: 18.
>> Committed: 57.
>> Moved to next CF: 3.
>> Withdrawn: 15.
>> Rejected: 3.
>> Returned with Feedback: 2.
>> Total: 342.
>>
>> If you are a patch author, please check http://commitfest.cputube.org to
>> be sure your patch still applies, compiles, and passes tests.
>>
>> We need your involvement and participation in reviewing the patches.
>> Let's try and make this happen.
>>
>> --
>> Regards.
>> Ibrar Ahmed
>
> Over the past week, the statuses of 47 patches have been changed from
> "Needs review". This still leaves us with 157 patches requiring reviews.
> As always, your continuous support is appreciated to get us over the line.
>
> Please look at the patches requiring review in the current commitfest.
> Test, provide feedback where needed, and update the patch status.
>
> Total: 342.
>
> Needs review: 157.
> Waiting on Author: 74.
> Ready for Committer: 15.
> Committed: 68.
> Moved to next CF: 3.
> Withdrawn: 19.
> Rejected: 4.
> Returned with Feedback: 2.
>
> --
> Ibrar Ahmed

Over the past week, some progress was made; however, there are still 155
patches in total that require reviews. Time to continue pushing to maximize
patch reviews and get stuff committed in PostgreSQL.

Total: 342.

Needs review: 155.
Waiting on Author: 67.
Ready for Committer: 20.
Committed: 70.
Moved to next CF: 3.
Withdrawn: 20.
Rejected: 5.
Returned with Feedback: 2.

Thank you for your continued effort and support.

--
Ibrar Ahmed
Re: Skip temporary table schema name from explain-verbose output.
On Thu, 29 Apr 2021 at 08:17, Amul Sul wrote:
> On Wed, Apr 28, 2021 at 7:56 PM Tom Lane wrote:
> > I don't think we should remove them. However, it could make sense to
> > print the "pg_temp" alias instead of the real schema name when we
> > are talking about myTempNamespace. Basically try to make that alias
> > a bit less leaky.
>
> +1, let's replace it by "pg_temp" -- did the same in the attached 0001 patch.

Sounds like a good change. Surely we need a test to exercise that this
works? Otherwise ready...

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: speed up verifying UTF-8
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state
> variable)

Yep, that's a bug, thanks for catching it!

> > I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve
> validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32-byte threshold was just a temporary need for testing a
32-byte stride -- testing different thresholds wouldn't hurt. I'm not
terribly concerned about short strings, though, as long as we don't
regress. That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even
speed up the SQL function length_in_encoding() -- that would be a better
reason to do it.

[1] https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
Re: speed up verifying UTF-8
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?

Sorry, I didn't see this earlier. No, I don't.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: .ready and .done files considered harmful
On Fri, Jul 23, 2021 at 5:46 PM Bossart, Nathan wrote:
> My apologies for chiming in so late to this thread, but a similar idea
> crossed my mind while working on a bug where .ready files get created
> too early [0].  Specifically, instead of maintaining a status file per
> WAL segment, I was thinking we could narrow it down to a couple of
> files to keep track of the boundaries we care about:
>
> 1. earliest_done: the oldest segment that has been archived and
>    can be recycled/removed
> 2. latest_done: the newest segment that has been archived
> 3. latest_ready: the newest segment that is ready for archival
>
> This might complicate matters for backup utilities that currently
> modify the .ready/.done files, but it would simplify this archive
> status stuff quite a bit and eliminate the need to worry about the
> directory scans in the first place.

In terms of immediate next steps, I think we should focus on eliminating
the O(n^2) problem and not get sucked into a bigger redesign. The patch on
the table aims to do just that much, and I think that's a good thing.

But in the longer term I agree that we want to redesign the signalling
somehow. I am not convinced that using a file is the right way to go. If we
had to rewrite that file for every change, and especially if we had to
fsync it, it would be almost as bad as what we're doing right now in terms
of the amount of traffic to the filesystem. Atomicity is a problem too,
because if we simply create a file then after a crash it will either exist
or not, but a file might end up garbled with a mix of old and new contents
unless we always write a temporary file and atomically rename that over the
existing one.

As I said in my original post, I'm kind of wondering about keeping the
information in shared memory instead of using the filesystem. I think we
would still need to persist it to disk at least occasionally, but perhaps
there is a way to avoid having to do that as frequently as what we do now.
I haven't thought too deeply about what the requirements are here.

--
Robert Haas
EDB: http://www.enterprisedb.com
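For reference, the temporary-file-plus-rename pattern mentioned above looks
roughly like this (a minimal POSIX sketch; PostgreSQL's own durable_rename()
additionally fsyncs the containing directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomic file update, assuming POSIX rename() semantics: readers see
 * either the old contents or the new, never a mix.
 */
static int
atomic_write_file(const char *path, const char *tmppath,
				  const char *buf, size_t len)
{
	int			fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len ||
		fsync(fd) != 0)			/* make the new contents durable first */
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* atomically replace the old file with the new one */
	return rename(tmppath, path);
}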
Re: WIP: Relaxing the constraints on numeric scale
On Fri, 23 Jul 2021 at 16:50, Tom Lane wrote:
>
> OK, I've now studied this more closely, and have some additional
> nitpicks:
>
> * I felt the way you did the documentation was confusing.  It seems
> better to explain the normal case first, and then describe the two
> extended cases.

OK, that looks much better. Re-reading the entire section, I think it's
much clearer now.

> * As long as we're encapsulating typmod construction/extraction, let's
> also encapsulate the checks for valid typmods.

Good idea.

> * Other places are fairly careful to declare typmod values as "int32",
> so I think this code should too.

OK, that seems sensible.

> Attached is a proposed delta patch making those changes.
>
> (I made the docs mention that the extension cases are allowed as of v15.
> While useful in the short run, that will look like noise in ten years;
> so I could go either way on whether to do that.)

Hmm, yeah. In general, I find such things in the documentation useful for
quite a few years. I'm regularly looking to see when a particular feature
was added, to see if I can use it in a particular situation. But
eventually, it'll become irrelevant, and I don't know if anyone will go
around tidying these things up. I have left it in, but perhaps there is a
wider discussion to be had about whether we should be doing that more (or
less) often. FWIW, I like the way some docs include an "available since"
tag (e.g., Java's @since tag).

> If you're good with these, then I think it's ready to go.
> I'll mark it RfC in the commitfest.

Thanks. That all looked good, so I have pushed it.

Regards,
Dean
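For context, the typmod being encapsulated is a single int32; a simplified
standalone sketch of the classic NUMERIC encoding (assumed here, and it
ignores the negative-scale sign-extension handling this patch adds):

#include <stdint.h>
#include <stdio.h>

#define VARHDRSZ 4				/* as in PostgreSQL's c.h */

/* precision in the high 16 bits, scale in the low 16, offset by VARHDRSZ */
static int32_t
make_numeric_typmod(int precision, int scale)
{
	return ((precision << 16) | (scale & 0xffff)) + VARHDRSZ;
}

static int
numeric_typmod_precision(int32_t typmod)
{
	return ((typmod - VARHDRSZ) >> 16) & 0xffff;
}

static int
numeric_typmod_scale(int32_t typmod)
{
	return (typmod - VARHDRSZ) & 0xffff;
}

int
main(void)
{
	int32_t		tm = make_numeric_typmod(10, 2);	/* numeric(10,2) */

	printf("precision=%d scale=%d\n",
		   numeric_typmod_precision(tm), numeric_typmod_scale(tm));
	return 0;
}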
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov wrote:
>
> Hi,
>
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time to.
>
> Let me join the friendly competition.
>
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty
> elements.

Thank you for proposing the new idea!

> Novelties:
> - 32 consecutive pages are stored together in a single sparse array
>   (called "chunks").
>   A chunk contains:
>   - its number,
>   - a 4-byte bitmap of non-empty pages,
>   - an array of non-empty page headers, 2 bytes each.
>     A page header contains the offset of the page's bitmap in the
>     bitmaps container.  (Except if there is just one dead tuple in a
>     page; then it is written into the header itself.)
>   - a container of concatenated bitmaps.
>
>   Ie, page metadata overhead varies from 2.4 bytes (32 pages in a single
>   chunk) to 18 bytes (1 page in a single chunk) per page.
>
> - If a page's bitmap is sparse, ie contains a lot of "all-zero" bytes,
>   it is compressed by removing the zero bytes and indexing with a
>   two-level bitmap index.
>   Two-level index - zero bytes in the first level are removed using the
>   second level. It is mostly done for 32kb pages, but let it stay since
>   it is almost free.
>
> - If a page's bitmap contains a lot of "all-one" bytes, it is inverted
>   and then encoded as sparse.
>
> - Chunks are allocated with a custom "allocator" that has no
>   per-allocation overhead. That is possible because there is no need
>   to perform "free": the allocator is freed as a whole at once.
>
> - The array of pointers to chunks is also bitmap indexed. It saves cpu
>   time when not every 32 consecutive pages have at least one dead tuple,
>   but it consumes time otherwise. Therefore an additional optimization
>   is added to quickly skip the lookup for the first non-empty run of
>   chunks.  (Ahhh, I believe this explanation is awful.)

It sounds better than my proposal.

> Andres Freund wrote 2021-07-20 02:49:
> > Hi,
> >
> > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> >> BTW is the implementation of the radix tree approach available
> >> somewhere? If so I'd like to experiment with that too.
> >>
> >> > I have toyed with implementing adaptively large radix nodes like
> >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> >> > gotten it quite working.
> >>
> >> That seems a promising approach.
> >
> > I've since implemented some, but not all of the ideas of that paper
> > (adaptive node sizes, but not the tree compression pieces).
> >
> > E.g. for
> >
> > select prepare(
> >     1000000, -- max block
> >     20,      -- # of dead tuples per page
> >     10,      -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >
> >          attach   size    shuffled  ordered
> > array    69 ms    120 MB  84.87 s   8.66 s
> > intset   173 ms   65 MB   68.82 s   11.75 s
> > rtbm     201 ms   67 MB   11.54 s   1.35 s
> > tbm      232 ms   100 MB  8.33 s    1.26 s
> > vtbm     162 ms   58 MB   10.01 s   1.22 s
> > radix    88 ms    42 MB   11.49 s   1.67 s
> >
> > and for
> >
> > select prepare(
> >     1000000, -- max block
> >     10,      -- # of dead tuples per page
> >     1,       -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >
> >          attach   size    shuffled  ordered
> > array    24 ms    60 MB   3.74 s    1.02 s
> > intset   97 ms    49 MB   3.14 s    0.75 s
> > rtbm     138 ms   36 MB   0.41 s    0.14 s
> > tbm      198 ms   101 MB  0.41 s    0.14 s
> > vtbm     118 ms   27 MB   0.39 s    0.12 s
> > radix    33 ms    10 MB   0.28 s    0.10 s
> >
> > (this is an almost unfairly good case for radix)
> >
> > Running out of time to format the results of the other testcases before
> > I have to run, unfortunately. radix uses 42MB both in test case 3 and
> > 4.
>
> My results (Ubuntu 20.04, Intel Core i7-1165G7):
>
> Test1.
>
> select prepare(1000000, 10, 20, 1); -- original
>
>          attach   size    shuffled
> array    29 ms    60 MB   93.99 s
> intset   93 ms    49 MB   80.94 s
> rtbm     171 ms   67 MB   14.05 s
> tbm      238 ms   100 MB  8.36 s
> vtbm     148 ms   59 MB   9.12 s
> radix    100 ms   42 MB   11.81 s
> svtm     75 ms    29 MB   8.90 s
>
> select prepare(1000000, 20, 10, 1); -- Andres's variant
>
>          attach   size    shuffled
> array    61 ms    120 MB  111.91 s
> intset   163 ms   66 MB   85.00 s
> rtbm     236 ms   67 MB   10.72 s
> tbm      290 ms   100 MB  8.40 s
> vtbm     190 ms   59 MB   9.28 s
> radix    117 ms   42 MB   12.00 s
> svtm     98 ms    29 MB   8.77 s
>
> Test2.
>
> select prepare(1000000, 10, 1, 1);
>
>          attach   size    shuffled
> array    31 ms    60 MB   4.68 s
> intset   97 ms    49 MB   4.03 s
> rtbm     163 ms   36 MB   0.42 s
> tbm      240 ms   100 MB  0.42 s
> vtbm     136 ms   27 MB   0.36 s
> radix    60 ms    10 MB   0.72 s
> svtm     39 ms    6 MB    0.19 s
>
> (Bad radix result probably due to smaller cache in notebook's CPU ?)
>
> Test3
>
> select prepare(1000000, 2, 100, 1);
>
>          attach   size    shuff
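To make the quoted chunk layout concrete, here is a rough C sketch (field
names are invented here; no such struct appears in any posted patch, and
the sizes simply follow the description above):

#include <stdint.h>

typedef struct DeadTupleChunk
{
	uint32_t	chunk_no;		/* which run of 32 consecutive heap pages */
	uint32_t	nonempty_pages; /* 4-byte bitmap: bit i is set if page i of
								 * the run has any dead tuples */
	uint16_t	page_headers[]; /* one 2-byte header per non-empty page:
								 * normally the offset of the page's bitmap
								 * within the shared bitmap container; if a
								 * page has a single dead tuple, its offset
								 * number is stored in the header directly */
} DeadTupleChunk;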
Re: SQL/JSON: JSON_TABLE
Below are a few small comments from a casual read-through. I noticed that
there was a new version posted after I had finished perusing, but it seems
to address other aspects.

+ Gerenates a column and inserts a composite SQL/JSON

s/Gerenates/Generates/

+ into both child and parrent columns for all missing values.

s/parrent/parent/

- objectname = "xmltable";
+ objectname = rte->tablefunc ?
+ rte->tablefunc->functype == TFT_XMLTABLE ?
+ "xmltable" : "json_table" : NULL;

In which case can rte->tablefunc be NULL for a T_TableFuncScan? Also,
nested ternary operators are confusing at best; I think this should be
rewritten as plain if statements. In general, when inspecting functype I
think it's better to spell it out with if statements rather than a ternary,
since it allows for grepping the code more easily. Having to grep for
TFT_XMLTABLE to find json_table isn't all that convenient. That also
removes the need for comments stating why a ternary operator is OK in the
first place.

+ errmsg("JSON_TABLE() is not yet implemented for json type"),

I can see this being potentially confusing to many; an errhint with a
reference to jsonb seems like a good idea.

+/* Recursively register column name in the path name list. */
+static void
+registerJsonTableColumn(JsonTableContext *cxt, char *colname)

This comment is misleading since the function isn't actually recursive,
but a helper function for a recursive function.

+    switch (get_typtype(typid))
+    {
+        case TYPTYPE_COMPOSITE:
+            return true;
+
+        case TYPTYPE_DOMAIN:
+            return typeIsComposite(getBaseType(typid));
+    }

switch statements without a default run the risk of attracting unwanted
compiler-warning attention, or making static analyzers angry. This one can
easily be rewritten with two if statements on a cached get_typtype() return
value.

+ * Returned false at the end of a scan, true otherwise.

s/Returned/Returns/ (applies in two places)

+        /* state->ordinal--; */ /* skip current outer row, reset counter */

Is this dead code to be removed, or left in there as a reminder to fix
something?

--
Daniel Gustafsson        https://vmware.com/
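For the record, the nested ternary could be spelled out along these lines
(a sketch of the suggested rewrite, not code from the patch):

/* Sketch of the suggested rewrite of the nested ternary above. */
if (rte->tablefunc == NULL)
	objectname = NULL;
else if (rte->tablefunc->functype == TFT_XMLTABLE)
	objectname = "xmltable";
else
	objectname = "json_table";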
Re: when the startup process doesn't (logging startup delays)
On Sun, Jul 25, 2021 at 1:56 PM Justin Pryzby wrote:
> On Fri, Jul 23, 2021 at 04:09:47PM +0530, Nitin Jadhav wrote:
> > > I think walkdir() should only call LogStartupProgress(FSYNC_IN_PROGRESS, path);
> > > when action == datadir_fsync_fname.
> >
> > I agree and fixed it.
>
> I saw that you fixed it by calling InitStartupProgress() after the walkdir()
> calls which do pre_sync_fname.  So then walkdir is calling
> LogStartupProgress(STARTUP_PROCESS_OP_FSYNC) even when it's not doing fsync,
> and then LogStartupProgress() is returning because !AmStartupProcess().
>
> That seems indirect, fragile, and confusing.  I suggest that walkdir() should
> take an argument for which operation to pass to LogStartupProgress().  You can
> pass a special enum for cases where nothing should be logged, like
> STARTUP_PROCESS_OP_NONE.

I don't think walkdir() has any business calling LogStartupProgress() at
all. It's supposed to be a generic function, not one that is only available
to be called from the startup process, or has different behavior there.
From my point of view, the right thing is to put the logging calls into the
particular callbacks that SyncDataDirectory() uses.

> On Wed, Jul 21, 2021 at 04:47:32PM +0530, Bharath Rupireddy wrote:
> > 1) I still don't see the need for two functions LogStartupProgress and
> > LogStartupProgressComplete. Most of the code is duplicate. I think we
> > can just do it with a single function something like [1]:
>
> I agree that one function can do this more succinctly.  I think it's best to
> use a separate enum value for START operations and END operations.

Maybe I'm missing something here, but I don't understand the purpose of
this. You can always combine two functions into one, but it's only worth
doing if you end up with less code, which doesn't seem to be the case here.
The strings are all different and that's most of the function, and the
other stuff that gets done isn't the same either, so you'd just end up with
a bunch of if-statements. That doesn't seem like an improvement.

Thinking further, I guess I understand it from the caller's perspective.
It's not necessarily clear why in some places we call LogStartupProgress()
and in others LogStartupProgressComplete(). Someone might expect a function
with "complete" in the name like that to only be called once at the very
end, rather than once at the end of a phase, and it does sort of make sense
that you'd want to call one function everywhere rather than sometimes one
and sometimes the other ... but I'm not exactly sure how to get there from
here. Having only LogStartupProgress() but having it do a giant
if-statement to figure out whether we're mid-phase or end-of-phase does not
seem like the right approach.

--
Robert Haas
EDB: http://www.enterprisedb.com
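A rough sketch of that shape, with the progress call moved into the fsync
callback that SyncDataDirectory() passes to walkdir(). LogStartupProgress()
and its enum are the patch's assumed API; the rest mirrors the shape of
fd.c's datadir_fsync_fname(), so this is illustrative only:

/*
 * Sketch: report progress from the callback, so walkdir() itself stays
 * generic and knows nothing about the startup process.
 */
static void
datadir_fsync_fname(const char *fname, bool isdir, int elevel)
{
	/* hypothetical call site for the patch's progress reporting */
	LogStartupProgress(STARTUP_PROCESS_OP_FSYNC, fname);

	/*
	 * We want to silently ignore errors here, which is what the "true"
	 * argument requests.
	 */
	fsync_fname_ext(fname, isdir, true, elevel);
}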
Re: Skip temporary table schema name from explain-verbose output.
Simon Riggs writes:
> Sounds like a good change.
> Surely we need a test to exercise this works? Otherwise ready...

There are lots of places in the core regression tests where we'd have used
a temp table, except that we needed to do EXPLAIN and the results would've
been unstable, so we used a short-lived plain table instead. Find one of
those and change it to use a temp table.

			regards, tom lane
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Sat, Jul 24, 2021 at 5:52 AM Amit Kapila wrote:
> I think for the consistency argument how about allowing users to
> specify a parallel-safety option for both partitioned and
> non-partitioned relations but for non-partitioned relations if users
> didn't specify, it would be computed automatically? If the user has
> specified parallel-safety option for non-partitioned relation then we
> would consider that instead of computing the value by ourselves.

Having the option for both partitioned and non-partitioned tables doesn't
seem like the worst idea ever, but I am also not entirely sure that I
understand the point.

> Another reason for hesitation to do automatically for non-partitioned
> relations was the new invalidation which will invalidate the cached
> parallel-safety for all relations in relcache for a particular
> database. As mentioned by Hou-San [1], it seems we need to do this
> whenever any function's parallel-safety is changed. OTOH, changing
> parallel-safety for a function is probably not that often to matter in
> practice which is why I think you seem to be fine with this idea.

Right. I think it should be quite rare, and invalidation events are also
not crazy expensive. We can test what the worst case is, but if you have to
sit there and run ALTER FUNCTION in a tight loop to see a measurable
performance impact, it's not a real problem. There may be a code-complexity
argument against trying to figure it out automatically, perhaps, but I
don't think there's a big performance issue.

What bothers me is that if this is something people have to set manually,
then many people won't, and they will not get the benefit of the feature.
And some of them will also set it incorrectly and have problems. So I am in
favor of trying to determine it automatically where possible, to make it
easy for people. However, other people may feel differently, and I'm not
trying to say they're necessarily wrong. I'm just telling you what I think.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Case expression pushdown
Alexander Pyhalov writes:
> [ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ]

This doesn't compile cleanly:

deparse.c: In function 'foreign_expr_walker.isra.4':
deparse.c:920:8: warning: 'collation' may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (collation != outer_cxt->collation)
    ^
deparse.c:914:3: warning: 'state' may be used uninitialized in this function [-Wmaybe-uninitialized]
    switch (state)
    ^~

These uninitialized variables very likely explain the fact that it fails
regression tests, both for me and for the cfbot. Even if this weren't an
outright bug, we don't tolerate code that produces warnings on common
compilers.

			regards, tom lane
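The usual shape of the fix is to give the locals well-defined initial
values at declaration, e.g. (a sketch only; the actual v7 change may
differ):

/* Sketch of the usual remedy for -Wmaybe-uninitialized in this spot. */
Oid			collation = InvalidOid;
FDWCollateState state = FDW_COLLATE_NONE;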
Re: when the startup process doesn't (logging startup delays)
On Mon, Jul 26, 2021 at 10:13:09AM -0400, Robert Haas wrote:
> I don't think walkdir() has any business calling LogStartupProgress()
> at all. It's supposed to be a generic function, not one that is only
> available to be called from the startup process, or has different
> behavior there. From my point of view, the right thing is to put the
> logging calls into the particular callbacks that SyncDataDirectory()
> uses.

You're right - this is better.

On Wed, Jul 21, 2021 at 04:47:32PM +0530, Bharath Rupireddy wrote:
> > > 1) I still don't see the need for two functions LogStartupProgress and
> > > LogStartupProgressComplete. Most of the code is duplicate. I think we
> > > can just do it with a single function something like [1]:
> >
> > I agree that one function can do this more succinctly. I think it's best to
> > use a separate enum value for START operations and END operations.
>
> Maybe I'm missing something here, but I don't understand the purpose
> of this. You can always combine two functions into one, but it's only
> worth doing if you end up with less code, which doesn't seem to be the
> case here.

 4 files changed, 39 insertions(+), 71 deletions(-)

> ... but I'm not exactly sure how to get there from here. Having only
> LogStartupProgress() but having it do a giant if-statement to figure
> out whether we're mid-phase or end-of-phase does not seem like the
> right approach.

I used a bool arg and negation to handle it within a single switch. Maybe
it's cleaner to use a separate enum value for each DONE, and set a local
done flag.

startup[29675] LOG:  database system was interrupted; last known up at 2021-07-26 10:23:04 CDT
startup[29675] LOG:  syncing data directory (fsync), elapsed time: 1.38 s, current path: ./pg_ident.conf
startup[29675] LOG:  data directory sync (fsync) complete after 1.72 s
startup[29675] LOG:  database system was not properly shut down; automatic recovery in progress
startup[29675] LOG:  resetting unlogged relations (cleanup) complete after 0.00 s
startup[29675] LOG:  redo starts at 0/17BE500
startup[29675] LOG:  redo in progress, elapsed time: 1.00 s, current LSN: 0/35D7CB8
startup[29675] LOG:  redo in progress, elapsed time: 2.00 s, current LSN: 0/54A6918
startup[29675] LOG:  redo in progress, elapsed time: 3.00 s, current LSN: 0/7370570
startup[29675] LOG:  redo in progress, elapsed time: 4.00 s, current LSN: 0/924D8A0
startup[29675] LOG:  redo done at 0/9B8 system usage: CPU: user: 4.28 s, system: 0.15 s, elapsed: 4.44 s
startup[29675] LOG:  resetting unlogged relations (init) complete after 0.03 s
startup[29675] LOG:  checkpoint starting: end-of-recovery immediate
startup[29675] LOG:  checkpoint complete: wrote 9872 buffers (60.3%); 0 WAL file(s) added, 0 removed, 8 recycled; write=0.136 s, sync=0.801 s, total=1.260 s; sync files=21, longest=0.774 s, average=B

From 9443005040be52225498d58678b5faa1668bd2ad Mon Sep 17 00:00:00 2001
From: Nitin
Date: Fri, 23 Jul 2021 15:46:56 +0530
Subject: [PATCH 1/2] Show the progress of the operations performed during
 the startup process.

The interval between each progress update is configurable, and it should be
specified in seconds.
---
 src/backend/access/transam/xlog.c             | 178 ++++++++++++++++++
 src/backend/postmaster/startup.c              |   1 +
 src/backend/storage/file/fd.c                 |  13 ++
 src/backend/storage/file/reinit.c             |  18 ++
 src/backend/utils/misc/guc.c                  |  13 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |  18 ++
 src/include/utils/timeout.h                   |   1 +
 8 files changed, 247 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402272..4f052050f3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
 #include "utils/relmapper.h"
 #include "utils/pg_rusage.h"
 #include "utils/snapmgr.h"
+#include "utils/timeout.h"
 #include "utils/timestamp.h"
 
 extern uint32 bootstrap_data_checksum_version;
@@ -397,6 +398,23 @@ static bool doRequestWalReceiverReply;
  */
 static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
 
+/*
+ * Start timestamp of the operations that occur during the startup process.
+ */
+static TimestampTz startupProcessOpStartTime;
+
+/*
+ * Indicates whether the startup progress interval mentioned by the user has
+ * elapsed or not. TRUE if a timeout occurred, FALSE otherwise.
+ */
+static bool logStartupProgressTimeout;
+
+/*
+ * The interval between each progress update of the operations that occur
+ * during the startup process.
+ */
+int			log_startup_progress_interval;
+
 /*----------
  * Shared-memory data structures for XLOG control
  *
@@ -7351,6 +7369,8 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							LSN_FORMAT_ARGS(ReadRecPtr))));
 
+			InitStartupProgress();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7358,6 +7378,8 @@
Re: Case expression pushdown
Tom Lane wrote 2021-07-26 18:18:
> Alexander Pyhalov writes:
>> [ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ]
>
> This doesn't compile cleanly:
>
> deparse.c: In function 'foreign_expr_walker.isra.4':
> deparse.c:920:8: warning: 'collation' may be used uninitialized in this function [-Wmaybe-uninitialized]
>     if (collation != outer_cxt->collation)
>     ^
> deparse.c:914:3: warning: 'state' may be used uninitialized in this function [-Wmaybe-uninitialized]
>     switch (state)
>     ^~
>
> These uninitialized variables very likely explain the fact that it fails
> regression tests, both for me and for the cfbot.  Even if this weren't
> an outright bug, we don't tolerate code that produces warnings on
> common compilers.
>
> 			regards, tom lane

Hi.

Of course, this is a patch issue. I don't understand how I overlooked this.
I've rebased the patch on master and fixed it. Tests are passing here (but
they also passed for the previous patch version). Which exact tests are
failing?

--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 9c9fa2e37fc62ddcd8dc6176306d74b7e219fd26 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov
Date: Thu, 22 Jul 2021 11:42:16 +0300
Subject: [PATCH] Allow pushing CASE expression to foreign server

---
 contrib/postgres_fdw/deparse.c                | 150 ++++++++++++++
 .../postgres_fdw/expected/postgres_fdw.out    | 184 ++++++++++++++++++
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  63 ++++++
 3 files changed, 397 insertions(+)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 31919fda8c6..df1aaf8e713 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -186,6 +186,8 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 									deparse_expr_cxt *context);
 
+static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
+
 /*
  * Helper functions
  */
@@ -627,6 +629,105 @@ foreign_expr_walker(Node *node,
 					state = FDW_COLLATE_NONE;
 			}
 			break;
+		case T_CaseExpr:
+			{
+				ListCell   *lc;
+				foreign_loc_cxt tmp_cxt;
+				CaseExpr   *ce = (CaseExpr *) node;
+
+				/*
+				 * case arg expression collation doesn't affect result
+				 * collation
+				 */
+				tmp_cxt.collation = InvalidOid;
+				tmp_cxt.state = FDW_COLLATE_NONE;
+				if (ce->arg && !foreign_expr_walker((Node *) ce->arg, glob_cxt, &tmp_cxt))
+					return false;
+
+				/* Recurse to case clause subexpressions. */
+				foreach(lc, ce->args)
+				{
+					CaseWhen   *cw = lfirst_node(CaseWhen, lc);
+					Node	   *whenExpr = (Node *) cw->expr;
+
+					/*
+					 * The parser should have produced WHEN clauses of the
+					 * form "CaseTestExpr = RHS", possibly with an implicit
+					 * coercion inserted above the CaseTestExpr. However in an
+					 * expression that's been through the optimizer, the WHEN
+					 * clause could be almost anything (since the equality
+					 * operator could have been expanded into an inline
+					 * function). In this case forbid pushdown.
+					 */
+					if (ce->arg)
+					{
+						List	   *whenExprArgs;
+
+						if (!IsA(whenExpr, OpExpr))
+							return false;
+
+						whenExprArgs = ((OpExpr *) whenExpr)->args;
+
+						if ((list_length(whenExprArgs) != 2) ||
+							!IsA(strip_implicit_coercions(linitial(whenExprArgs)), CaseTestExpr))
+							return false;
+					}
+
+					/*
+					 * case when expression collation doesn't affect result
+					 * collation
+					 */
+					tmp_cxt.collation = InvalidOid;
+					tmp_cxt.state = FDW_COLLATE_NONE;
+					/* Recurse to case clause expression. */
+					if (!foreign_expr_walker((Node *) cw->expr,
+											 glob_cxt, &tmp_cxt))
+						return false;
+
+					/* Recurse to result expression. */
+					if (!foreign_expr_walker((Node *) cw->result,
+											 glob_cxt, &inner_cxt))
+						return false;
+				}
+
+				if (!foreign_expr_walker((Node *) ce->defresult, glob_cxt, &inner_cxt))
+					return false;
+
+				/*
+				 * Collation rule is same as for function nodes.
+				 */
+				collation = ce->casecollid;
+				if (collation == InvalidOid)
+					state = FDW_COLLATE_NONE;
+				else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+						 collation == inner_cxt.collation)
+					state = FDW_COLLATE_SAFE;
+				else if (collation == DEFAULT_COLLATION_OID)
+					state = FDW_COLLATE_NONE;
+				else
+					state = FDW_COLLATE_UNSAFE;
+			}
+			break;
+		case T_CaseTestExpr:
+			{
+				CaseTestExpr *c = (CaseTestExpr *) node;
+
+				/*
+				 * Collation rule is same as for function nodes.
+				 */
+				collation = c->collation;
+				if (collation == InvalidOid)
+					state = FDW_COLLATE_NONE;
+				else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+						 collation == inner_cxt.collation)
+					state = FDW_COLLATE_SAFE;
+				else if (collation == DEFAULT_COLLATION_OID)
+					state = FDW_COLLATE_NONE;
+				else
+					state = FDW_COLLATE_UNSAFE;
+			}
+			break;
		c
Re: Inaccurate error message when set fdw batch_size to 0
On 2021/07/26 13:56, Bharath Rupireddy wrote:
> On Thu, Jul 15, 2021 at 7:54 PM Bharath Rupireddy wrote:
>> On Mon, Jul 12, 2021 at 10:11 PM Bharath Rupireddy wrote:
>>> On Mon, Jul 12, 2021 at 9:20 PM Fujii Masao wrote:
>>>> + Avoid Using non-negative Word in Error Messages
>>>> +
>>>> + Do not use non-negative word in error messages as it looks
>>>> + ambiguous. Instead, use foo must be an integer value greater than zero
>>>> + or foo must be an integer value greater than or equal to zero
>>>> + if option foo expects an integer value.
>>>>
>>>> It seems suitable to put this guide under "Tricky Words to Avoid"
>>>> rather than adding it as a separate section. Thoughts?
>>>
>>> +1. I will change.
>>
>> PSA v7 patch with the above change.
>
> PSA v8 patch rebased onto the latest master.

Thanks for updating the patch!

+ non-negative
+
+ Do not use non-negative word in error messages as it looks
+ ambiguous. Instead, use foo must be an integer value greater than
+ zero or foo must be an integer value greater than or equal
+ to zero if option foo expects an integer value.

This description looks a bit redundant. And IMO it's better to also
document how "non-negative" is ambiguous. So what about the following
description instead? I replaced the description with it; patch attached.
I also uppercased the first character "n" of "non-negative" in the title
for the sake of consistency with the other items.

+ Non-negative
+
+ Avoid non-negative as it is ambiguous
+ about whether it accepts zero. It's better to use
+ greater than zero or
+ greater than or equal to zero.

- /* Number of workers should be non-negative. */
+ /* Number of parallel workers should be greater than zero. */
  Assert(nworkers >= 0);

Shouldn't this be "greater than or equal to zero", instead? Anyway, since
this is a comment, not an error message, and there are still other comments
using "non-negative", I don't think we need to change only this comment for
now. So I excluded this change from the patch. Maybe we can get rid of all
"non-negative" from comments and documents later *if* necessary.

- errmsg("repeat count size must be a non-negative integer")));
+ errmsg("repeat count size must be greater than or equal to zero")));

- errmsg("number of workers must be a non-negative integer")));
+ errmsg("number of workers must be greater than or equal to zero")));

Isn't it better to replace "be greater" with "be an integer value greater"?
I applied this to the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4593cbc540..c574ca2cf3 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -119,7 +119,10 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 	else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 			 strcmp(def->defname, "fdw_tuple_cost") == 0)
 	{
-		/* these must have a non-negative numeric value */
+		/*
+		 * These must have a floating point value greater than or equal to
+		 * zero.
+		 */
 		char	   *value;
 		double		real_val;
 		bool		is_parsed;
@@ -136,7 +139,7 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		if (real_val < 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("\"%s\" requires a non-negative floating point value",
+					 errmsg("\"%s\" must be a floating point value greater than or equal to zero",
 							def->defname)));
 	}
 	else if (strcmp(def->defname, "extensions") == 0)
@@ -163,7 +166,7 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		if (int_val <= 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("\"%s\" requires a non-negative integer value",
+					 errmsg("\"%s\" must be an integer value greater than zero",
 							def->defname)));
 	}
 	else if (strcmp(def->defname, "password_required") == 0)
diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 3f2c40b750..e6ae02f2af 100644
--- a/do
needless complexity in StartupXLOG
StartupXLOG() has code beginning around line 7900 of xlog.c that decides,
at the end of recovery, between four possible courses of action. It either
writes an end-of-recovery record, or requests a checkpoint, or creates a
checkpoint, or does nothing, depending on the value of 3 flag variables,
and also on whether we're still able to read the last checkpoint record:

        checkPointLoc = ControlFile->checkPoint;

        /*
         * Confirm the last checkpoint is available for us to recover
         * from if we fail.
         */
        record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
        if (record != NULL)
        {
            promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that is
not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug in
the code someplace that removes WAL segments too soon. However, if we have
such a bug, we should fix it. What this code does is say "oh, I guess we
removed the old checkpoint too soon, no big deal, let's just be more
aggressive about getting the next one done," which I do not think is the
right response. Aside from a bug, the only other way I can see it happening
is if someone is manually removing WAL segments as the server is running
through recovery, perhaps as some last-ditch play to avoid running out of
disk space. I don't think the server needs to have - or should have - code
to cater to such crazy scenarios. Therefore I think that this check, at
least in its current form, is not sensible.

My first thought was that we should do the check unconditionally, rather
than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it
fails. But then I wondered what the point of that would be exactly. If we
have such a bug -- and to the best of my knowledge there's no evidence that
we do -- there's no particular reason it should only happen at the end of
recovery. It could happen any time the system -- or the user, or malicious
space aliens -- removes files from pg_wal, and we have no real idea about
the timing of malicious space alien activity, so doing the test here rather
than anywhere else just seems like a shot in the dark. Perhaps the most
logical place would be to move it to CreateCheckPoint() just after we
remove old xlog files, but we don't have the xlogreader there, and anyway I
don't see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just remove
this check entirely, as in the attached proposed patch.

Really, I think we should consider going further. If it's safe to write an
end-of-recovery record rather than a checkpoint, why not do so all the
time? Right now we fail to do that in the above-described "impossible"
scenario where the previous checkpoint record can't be read, or if we're
exiting archive recovery for some reason other than a promotion request, or
if we're in single-user mode, or if we're in crash recovery. Presumably,
people would like to start up the server quickly in all of those scenarios,
so the only reason not to use this technology all the time is if we think
it's safe in some scenarios and not others.

I can't think of a reason why it matters why we're exiting archive
recovery, nor can I think of a reason why it matters whether we're in
single-user mode. The distinction between crash recovery and archive
recovery does seem to matter, but if anything the crash recovery scenario
seems simpler, because then there's only one timeline involved.

I realize that conservatism may have played a role in this code ending up
looking the way that it does; someone seems to have thought it would be
better not to rely on a new idea in all cases. From my point of view,
though, it's scary to have so many cases, especially cases that don't seem
like they should ever be reached. I think that simplifying the logic here
and trying to do the same things in as many cases as we can will lead to
better robustness.

Imagine if instead of all the hairy logic we have now we just replaced this
whole "if (InRecovery)" stanza with this:

    if (InRecovery)
        CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have here
today. Now, I am not sure what it would take to get there, but I think that
is the direction we ought to be heading.

--
Robert Haas
EDB: http://www.enterprisedb.com

0001-Remove-pointless-call-to-ReadCheckpointRecord.patch
Re: .ready and .done files considered harmful
On 7/26/21, 6:31 AM, "Robert Haas" wrote:
> In terms of immediate next steps, I think we should focus on
> eliminating the O(n^2) problem and not get sucked into a bigger
> redesign. The patch on the table aims to do just that much and I think
> that's a good thing.

I agree. I'll leave further discussion about a redesign for another thread.

Nathan
Re: Skip temporary table schema name from explain-verbose output.
I wrote: > Simon Riggs writes: >> Surely we need a test to exercise this works? Otherwise ready... > There are lots of places in the core regression tests where we'd have > used a temp table, except that we needed to do EXPLAIN and the results > would've been unstable, so we used a short-lived plain table instead. > Find one of those and change it to use a temp table. Hmm ... I looked through the core regression tests' usages of EXPLAIN VERBOSE and didn't really find any that it seemed to make sense to change that way. I guess we've been more effective at programming around that restriction than I thought. Anyway, after looking at the 0001 patch, I think there's a pretty large oversight in that it doesn't touch ruleutils.c, although EXPLAIN relies heavily on that to print expressions and suchlike. We could account for that as in the attached revision of 0001. However, I wonder whether this isn't going in the wrong direction. Instead of piecemeal s/get_namespace_name/get_namespace_name_or_temp/, we should consider just putting this behavior right into get_namespace_name, and dropping the separate get_namespace_name_or_temp function. I can't really see any situation in which it's important to report the exact schema name of our own temp schema. On the other hand, I don't like 0002 one bit, because it's not accounting for whether the temp schema it's mangling is *our own* temp schema or some other session's. I do not think it is wise or even safe to report some other temp schema as being "pg_temp". By the same token, I wonder whether this bit in event_trigger.c is a good idea or a safety hazard: /* XXX not quite get_namespace_name_or_temp */ if (isAnyTempNamespace(schema_oid)) schema = pstrdup("pg_temp"); else schema = get_namespace_name(schema_oid); Alvaro, you seem to be responsible for both the existence of the separate get_namespace_name_or_temp function and the fact that it's being avoided here. I wonder what you think about this. 
regards, tom lane

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f15c97ad7a..51fac77f3d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2854,7 +2854,7 @@ postgresExplainForeignScan(ForeignScanState *node, ExplainState *es)
 		{
 			char	   *namespace;
 
-			namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			namespace = get_namespace_name_or_temp(get_rel_namespace(rte->relid));
 			appendStringInfo(relations, "%s.%s",
 							 quote_identifier(namespace),
 							 quote_identifier(relname));
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 340db2bac4..36fbe129cf 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3747,7 +3747,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 			Assert(rte->rtekind == RTE_RELATION);
 			objectname = get_rel_name(rte->relid);
 			if (es->verbose)
-				namespace = get_namespace_name(get_rel_namespace(rte->relid));
+				namespace = get_namespace_name_or_temp(get_rel_namespace(rte->relid));
 			objecttag = "Relation Name";
 			break;
 		case T_FunctionScan:
@@ -3774,8 +3774,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 					objectname = get_func_name(funcid);
 					if (es->verbose)
-						namespace =
-							get_namespace_name(get_func_namespace(funcid));
+						namespace = get_namespace_name_or_temp(get_func_namespace(funcid));
 				}
 			}
 			objecttag = "Function Name";
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e7108f903..4df8cc5abf 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -1617,7 +1617,7 @@ pg_get_statisticsobj_worker(Oid statextid, bool columns_only, bool missing_ok)
 	if (!columns_only)
 	{
-		nsp = get_namespace_name(statextrec->stxnamespace);
+		nsp = get_namespace_name_or_temp(statextrec->stxnamespace);
 		appendStringInfo(&buf, "CREATE STATISTICS %s",
 						 quote_qualified_identifier(nsp,
 													NameStr(statextrec->stxname)));
@@ -2811,7 +2811,7 @@ pg_get_functiondef(PG_FUNCTION_ARGS)
 	 * We always qualify the function name, to ensure the right function gets
 	 * replaced.
 	 */
-	nsp = get_namespace_name(proc->pronamespace);
+	nsp = get_namespace_name_or_temp(proc->pronamespace);
 	appendStringInfo(&buf, "CREATE OR REPLACE %s %s(",
 					 isfunction ? "FUNCTION" : "PROCEDURE",
 					 quote_qualified_identifier(nsp, name));
@@ -11183,7 +11183,7 @@ get_opclass_name(Oid opclass, Oid actual_datatype,
 			appendStringInfo(buf, " %s", quote_identifier(opcname));
 		else
 		{
-			nspname = get_namespace_name(opcrec->opcnamespace);
+			nspname = get_namespace_name_or_temp(opcrec->opcnamespace);
 			appendStringInfo(buf, " %s.%s
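For concreteness, a minimal sketch of what folding the substitution directly into get_namespace_name() could look like, so that the separate get_namespace_name_or_temp() goes away entirely. This is only a sketch of the idea under discussion, not the attached patch; note that isTempNamespace() is true only for our own session's temp schema, so other backends' temp schemas would keep their real names:

    char *
    get_namespace_name(Oid nspid)
    {
        HeapTuple	tp;

        /* hypothetical addition: report our own temp schema as "pg_temp" */
        if (isTempNamespace(nspid))
            return pstrdup("pg_temp");

        tp = SearchSysCache1(NAMESPACEOID, ObjectIdGetDatum(nspid));
        if (HeapTupleIsValid(tp))
        {
            Form_pg_namespace nsptup = (Form_pg_namespace) GETSTRUCT(tp);
            char	   *result = pstrdup(NameStr(nsptup->nspname));

            ReleaseSysCache(tp);
            return result;
        }

        return NULL;
    }

(As the follow-ups below show, this wholesale approach turns out to have user-visible fallout for regnamespace output.)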
Re: Skip temporary table schema name from explain-verbose output.
On Mon, 26 Jul 2021 at 17:49, Tom Lane wrote:
>
> I wrote:
> > Simon Riggs writes:
> >> Surely we need a test to exercise this works? Otherwise ready...
>
> > There are lots of places in the core regression tests where we'd have
> > used a temp table, except that we needed to do EXPLAIN and the results
> > would've been unstable, so we used a short-lived plain table instead.
> > Find one of those and change it to use a temp table.
>
> Hmm ... I looked through the core regression tests' usages of EXPLAIN
> VERBOSE and didn't really find any that it seemed to make sense to change
> that way. I guess we've been more effective at programming around that
> restriction than I thought.
>
> Anyway, after looking at the 0001 patch, I think there's a pretty large
> oversight in that it doesn't touch ruleutils.c, although EXPLAIN relies
> heavily on that to print expressions and suchlike. We could account
> for that as in the attached revision of 0001.
>
> However, I wonder whether this isn't going in the wrong direction.
> Instead of piecemeal s/get_namespace_name/get_namespace_name_or_temp/,
> we should consider just putting this behavior right into
> get_namespace_name, and dropping the separate get_namespace_name_or_temp
> function. I can't really see any situation in which it's important
> to report the exact schema name of our own temp schema.

That sounds much better because any wholesale change like that will affect 100s of places in extensions and it would be easier if we made just one change in core.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: when the startup process doesn't (logging startup delays)
On Mon, Jul 26, 2021 at 11:30 AM Justin Pryzby wrote:
> > Maybe I'm missing something here, but I don't understand the purpose
> > of this. You can always combine two functions into one, but it's only
> > worth doing if you end up with less code, which doesn't seem to be the
> > case here.
>
> 4 files changed, 39 insertions(+), 71 deletions(-)

Hmm. I don't completely hate that, but I don't love it either. I don't think the result is any easier to understand than the original, and in some ways it's worse. In particular, you've flattened the separate comments for each of the individual functions into a single-line comment that is more generic than the comments it replaced. You could fix that and you'd still be slightly ahead on LOC, but I'm not convinced that it's actually a real win.

> > ... but I'm not exactly sure how to get there from here. Having only
> > LogStartupProgress() but having it do a giant if-statement to figure
> > out whether we're mid-phase or end-of-phase does not seem like the
> > right approach.
>
> I used a bool arg and negation to handle within a single switch. Maybe it's
> cleaner to use a separate enum value for each DONE, and set a local done flag.

If we're going to go the route of combining the functions, I agree that a Boolean is the way to go; I think we have some existing precedent for 'bool finished' rather than 'bool done'. But I kind of wonder if we should have an enum in the first place. It feels like we've got code in a bunch of places that just exists to decide which enum value to use, and then code someplace else that turns around and decides which message to produce. That would be sensible if we were using the same enum values in lots of places, but that's not the case here. So suppose we just moved the messages to the places where we're now calling LogStartupProgress() or LogStartupProgressComplete()? So something like this:

    if (should_report_startup_progress())
        ereport(LOG,
                (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s",
                        secs, (usecs / 10000), path)));

Well, that doesn't quite work, because "secs" and "usecs" aren't going to exist in the caller. We can fix that easily enough: let's just make should_report_startup_progress take arguments long *secs, int *usecs, and bool done. Then the above becomes:

    if (should_report_startup_progress(&secs, &usecs, false))
        ereport(LOG,
                (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s",
                        secs, (usecs / 10000), path)));

And if this were the call site that corresponds to LogStartupProgressComplete(), we'd replace false with true. Now, the only real disadvantage of this that I can see is that it requires the caller to declare 'secs' and 'usecs', which is not a big deal, but mildly annoying perhaps. I think we can do better still with a little macro magic. Suppose we define a macro report_startup_progress(force, msg, ...) that expands to:

    {
        long    secs;
        int     usecs;

        if (startup_progress_timer_expired(&secs, &usecs, force))
            ereport(LOG, errmsg(msg, secs, usecs, ...));
    }

Then the above just becomes this:

    report_startup_progress(false, "syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s", path);

This would have the advantage that the strings we're using would be present in the code that arranges to emit them, instead of being removed to some other module, so I think it would be clearer. It would also have the advantage of making it much easier to add further calls, if someone feels they want to do that.
You don't have to run around and update enums and all the various things that use them, just copy and paste the line above and adjust as required. With this design, we avoid a lot of "action at a distance". We don't define the message strings in a place far-removed from the code that wants to emit them any more. When someone wants a new progress message, they can just add another call to report_startup_progress() wherever it needs to go and they're done. They don't have to go run and update the enum and various switch statements. They're just done.

-- Robert Haas EDB: http://www.enterprisedb.com
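A minimal sketch of the macro shape being proposed, assuming a hypothetical startup_progress_timer_expired() with the signature described above; the do/while wrapper, the GNU-style ##__VA_ARGS__ comma handling, and dividing usecs down to hundredths for the %02d format are all assumptions layered on the sketch in the message:

    #define report_startup_progress(force, msg, ...) \
        do { \
            long        secs; \
            int         usecs; \
            \
            if (startup_progress_timer_expired(&secs, &usecs, (force))) \
                ereport(LOG, \
                        errmsg(msg, secs, (usecs / 10000), ##__VA_ARGS__)); \
        } while (0)

The do/while wrapper keeps the macro usable as a single statement, e.g. inside an unbraced if, which the bare block in the message's sketch would not be.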
Re: Rename of triggers for partitioned tables
On 2021-Jul-25, Tom Lane wrote: > Perhaps there's no actual bug there, but it's still horrible coding. > For one thing, the implication that found could be negative is extremely > confusing to readers. A boolean might be better. However, I wonder why > you bothered with a flag in the first place. The usual convention if > we know there can be only one match is to just not write a loop at all, > with a suitable comment, like this pre-existing example elsewhere in > trigger.c: > > /* There should be at most one matching tuple */ > if (HeapTupleIsValid(tuple = systable_getnext(tgscan))) > > If you're not quite convinced there can be only one match, then it > still shouldn't be an Assert --- a real test-and-elog would be better. I agree that coding there was dubious. I've removed the flag and assert. Arne complained that there should be a unique constraint on (tgrelid, tgparentid) which would sidestep the need for this to be a loop. I don't think it's really necessary, and I'm not sure how to create a system index WHERE tgparentid <> 0. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós)
Re: Avoiding data loss with synchronous replication
On 7/24/21, 5:25 PM, "Andres Freund" wrote: > First, from the user experience side of things, the issue is that this seems > to propose violating read-your-own-writes. Within a single connection to a > single node. Which imo is *far* worse than seeing writes that haven't yet been > acknowledged as replicated after a query cancel. Right. I suspect others will have a similar opinion. Nathan
Re: Skip temporary table schema name from explain-verbose output.
On Mon, Jul 26, 2021 at 12:49 PM Tom Lane wrote: > I can't really see any situation in which it's important > to report the exact schema name of our own temp schema. It would actually be nice if there were some easy way of getting that for the rare situations in which there are problems. For example, if the catalog entries get corrupted and you can't access some table in your pg_temp schema, you might like to know which pg_temp schema you've got so that you can be sure to examine the right catalog entries to fix the problem or understand the problem or whatever you are trying to do. I don't much care exactly how we go about making that information available and I agree that showing pg_temp_NNN in EXPLAIN output is worse than just pg_temp. I'm just saying that concealing too thoroughly what is actually happening can be a problem in the rare instance where troubleshooting is required. -- Robert Haas EDB: http://www.enterprisedb.com
Re: badly calculated width of emoji in psql
On Fri, 2021-07-23 at 17:42 +0200, Pavel Stehule wrote:
> On Thu, Jul 22, 2021 at 0:12, Jacob Champion wrote:
> >
> > Pavel, I'd be interested to see what your benchmarks find with this
> > code. Does this fix the original issue for you?
>
> I can confirm that the original issue is fixed.

Great!

> I tested performance
>
> I had three data sets
>
> 1. typical data - mix ascii and utf characters typical for czech
> language - 25K lines - there is very small slowdown 2ms from 24 to
> 26ms (stored file of this result has 3MB)
>
> 2. the worst case - this reports has only emoji 1000 chars * 10K rows
> - there is more significant slowdown - from 160 ms to 220 ms (stored
> file has 39MB)

I assume the stored file size has grown with this patch, since we're now printing the correct number of spacing/border characters?

> 3. a little bit of obscure datasets generated by \x and select * from
> pg_proc - it has 99K lines - and there are a lot of unicode
> decorations (borders). The line has 17K chars. (the stored file has
> 1.7GB)
> In this dataset I see a slowdown from 4300 to 4700 ms.

These results are lining up fairly well with my profiling. To isolate the effects of the new algorithm (as opposed to printing time) I redirected to /dev/null:

    psql postgres -c "select repeat(unistr('\u115D'), 1000);" > /dev/null

This is what I expect to be the worst case for the new patch: a huge string consisting of nothing but \u115D, which is in the first interval and therefore takes the maximum number of iterations during the binary search. For that command, the wall time slowdown with this patch was about 240ms (from 1.128s to 1.366s, or 21% slower). Callgrind shows an increase of 18% in the number of instructions executed with the interval table patch, all of it coming from PQdsplen (no surprise). PQdsplen itself has a 36% increase in instruction count for that run.

I also did a microbenchmark of PQdsplen (code attached, requires Google Benchmark [1]). The three cases I tested were standard ASCII characters, a smiley-face emoji, and the worst-case \u115F character.

Without the patch:

    Benchmark            Time        CPU   Iterations
    ...
    BM_Ascii_mean      4.97 ns    4.97 ns            5
    BM_Ascii_median    4.97 ns    4.97 ns            5
    BM_Ascii_stddev   0.035 ns   0.035 ns            5
    ...
    BM_Emoji_mean      6.30 ns    6.30 ns            5
    BM_Emoji_median    6.30 ns    6.30 ns            5
    BM_Emoji_stddev   0.045 ns   0.045 ns            5
    ...
    BM_Worst_mean      12.4 ns    12.4 ns            5
    BM_Worst_median    12.4 ns    12.4 ns            5
    BM_Worst_stddev   0.038 ns   0.038 ns            5

With the patch:

    Benchmark            Time        CPU   Iterations
    ...
    BM_Ascii_mean      4.59 ns    4.59 ns            5
    BM_Ascii_median    4.60 ns    4.60 ns            5
    BM_Ascii_stddev   0.069 ns   0.068 ns            5
    ...
    BM_Emoji_mean      11.8 ns    11.8 ns            5
    BM_Emoji_median    11.8 ns    11.8 ns            5
    BM_Emoji_stddev   0.059 ns   0.059 ns            5
    ...
    BM_Worst_mean      18.5 ns    18.5 ns            5
    BM_Worst_median    18.5 ns    18.5 ns            5
    BM_Worst_stddev   0.077 ns   0.077 ns            5

So an incredibly tiny improvement in the ASCII case, which is reproducible across multiple runs and not just a fluke (I assume because the code is smaller now and has better cache line characteristics?). A ~90% slowdown for the emoji case, and a ~50% slowdown for the worst-performing characters. That seems perfectly reasonable considering we're talking about dozens of nanoseconds.

> 9% looks too high, but in absolute time it is 400ms for 99K lines and
> very untypical data, or 2ms for more typical results., 2ms are
> nothing (for interactive work). More - this is from a pspg
> perspective.
> In psql there can be overhead of network, protocol
> processing, formatting, and more and more, and psql doesn't need to
> calculate display width of decorations (borders), what is the reason
> for slowdowns in pspg.

Yeah. Considering the alignment code is for user display, the absolute performance is going to dominate, and I don't see any red flags so far. If you're regularly dealing with unbelievably huge amounts of emoji, I think the amount of extra time we're seeing here is unlikely to be a problem. If it is, you can always turn alignment off. (Do you rely on horizontal alignment for lines with millions of characters, anyway?)

Laurenz, Michael, what do you think?

--Jacob

[1] https://github.com/google/benchmark

/*
 * Copyright 2021 VMware, Inc.
 * SP
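The shape of the lookup being benchmarked here is roughly the following: a binary search over a sorted table of non-overlapping [first, last] codepoint ranges. This is only an illustrative sketch (names and structure are assumptions, not the patch itself):

    struct mbinterval
    {
        unsigned int first;
        unsigned int last;
    };

    /*
     * Return 1 if ucs falls inside any interval of the table, else 0.
     * "max" is the highest valid index into the table.
     */
    static int
    mbbisearch(unsigned int ucs, const struct mbinterval *table, int max)
    {
        int         min = 0;

        if (ucs < table[0].first || ucs > table[max].last)
            return 0;

        while (max >= min)
        {
            int         mid = (min + max) / 2;

            if (ucs > table[mid].last)
                min = mid + 1;
            else if (ucs < table[mid].first)
                max = mid - 1;
            else
                return 1;
        }

        return 0;
    }

A codepoint that lands in the very first interval still costs the full log2(N) probes, since the search only narrows toward the left edge one halving at a time, which is why such characters make a good worst case for the microbenchmark.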
Re: Skip temporary table schema name from explain-verbose output.
On 2021-Jul-26, Tom Lane wrote: > Alvaro, you seem to be responsible for both the existence of the separate > get_namespace_name_or_temp function and the fact that it's being avoided > here. I wonder what you think about this. The reason I didn't touch get_namespace_name then (e9a077cad379) was that I didn't want to change the user-visible behavior for any existing features; I was just after a way to implement dropped-object DDL trigger tracking. If we agree that displaying pg_temp instead of pg_temp_XXX everywhere is an improvement, then I don't see a reason not to change how get_namespace_name works and get rid of get_namespace_name_or_temp. I don't see much usefulness in displaying the exact name of the temp namespace anywhere, particularly since using "pg_temp" as a qualification in queries already refers to the current backend's temp namespace. Trying to refer to it by exact name in SQL may lead to affecting some other backend's temp objects ... -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
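To illustrate that last point with a quick example (behavior as observed on an unpatched server; the NNN suffix varies per backend):

    CREATE TEMP TABLE t (x int);

    -- "pg_temp" always resolves to *our own* temp schema:
    SELECT * FROM pg_temp.t;

    -- while the catalogs show the real, backend-specific name:
    SELECT pg_my_temp_schema()::regnamespace;   -- e.g. pg_temp_3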
Re: needless complexity in StartupXLOG
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

Haven't dug in deeply but at least following your explanation and reading over the patch and the code a bit, I tend to agree.

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.

Yeah, tend to agree with this too ... but something I find a bit curious is the comment:

    * Insert a special WAL record to mark the end of
    * recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of StartupXLOG() having:

    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?

> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> cases as we can will lead to better robustness. Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (InRecovery) stanza with this:
>
>     if (InRecovery)
>         CreateEndOfRecoveryRecord();
>
> That would be WAY easier to reason about than the rat's nest we have
> here today. Now, I am not sure what it would take to get there, but I
> think that is the direction we ought to be heading.

Agreed that simpler logic is better, provided it's correct logic, of course. Finding better ways to test all of this would be really nice.

Thanks!

Stephen

signature.asc Description: PGP signature
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On 7/24/21, 9:16 AM, "Bharath Rupireddy" wrote:
> On Fri, Jul 23, 2021 at 4:40 AM Bossart, Nathan wrote:
>> I would suggest changing "attach from a debugger" to "attaching with a
>> debugger."
>
> Thanks. IMO, the following looks better:
> Waiting on connection startup before authentication to allow
> attaching a debugger to the process.
> Waiting on connection startup after authentication to allow
> attaching a debugger to the process.

Your phrasing looks good to me.

>> IIUC you want to use the same set of flags as PostAuthDelay for
>> PreAuthDelay, but the stated reason in this comment for leaving out
>> WL_LATCH_SET suggests otherwise. It's not clear to me why the latch
>> possibly pointing to a shared latch in the future is an issue. Should
>> this instead say that we leave out WL_LATCH_SET for consistency with
>> PostAuthDelay?
>
> If WL_LATCH_SET is used for PostAuthDelay, the waiting doesn't happen
> because the MyLatch which is a shared latch would be set by
> SwitchToSharedLatch. More details at [1].
> If WL_LATCH_SET is used for PreAuthDelay, actually there's no problem
> because MyLatch is still not initialized properly in BackendInitialize
> when waiting for PreAuthDelay, it still points to local latch, but
> later gets pointed to shared latch and gets set SwitchToSharedLatch.
> But relying on MyLatch there seems to me somewhat relying on an
> uninitialized variable. More details at [1].
>
> For PreAuthDelay, with the comment I wanted to say that the MyLatch is
> not the correct one we would want to wait for. Since there is no
> problem in using it there, I changed the comment to following:
>
>     /*
>      * Let's not use WL_LATCH_SET for PreAuthDelay to be consistent with
>      * PostAuthDelay.
>      */

How about we elaborate a bit? WL_LATCH_SET is not used for consistency with PostAuthDelay. MyLatch isn't fully initialized for the backend at this point, anyway.

    +   /*
    +    * PostAuthDelay will not get applied, if WL_LATCH_SET is used. This
    +    * is because the latch could have been set initially.
    +    */

I would suggest the following: If WL_LATCH_SET is used, PostAuthDelay may not be applied, since the latch might already be set.

Otherwise, this patch looks good and could probably be marked ready-for-committer.

Nathan
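For reference, the WaitLatch() form being discussed would look something like the sketch below. This is only an assumption about the patch's shape: the WAIT_EVENT_PRE_AUTH_DELAY name is inferred from the wait-event descriptions quoted above, and the latch argument can simply be NULL, since WaitLatch() ignores it when WL_LATCH_SET is not requested:

    /* instead of pg_usleep(PreAuthDelay * 1000000L): */
    if (PreAuthDelay > 0)
        (void) WaitLatch(NULL,
                         WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         PreAuthDelay * 1000L,      /* timeout in msec */
                         WAIT_EVENT_PRE_AUTH_DELAY);

Unlike pg_usleep(), this form both reacts to postmaster death and shows up with a proper wait event in pg_stat_activity, which is the point of the patch.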
Re: Skip temporary table schema name from explain-verbose output.
Robert Haas writes:
> On Mon, Jul 26, 2021 at 12:49 PM Tom Lane wrote:
>> I can't really see any situation in which it's important
>> to report the exact schema name of our own temp schema.

> It would actually be nice if there were some easy way of getting that
> for the rare situations in which there are problems.

I experimented with pushing the behavior into get_namespace_name, and immediately ran into problems, for example

--- /home/postgres/pgsql/src/test/regress/expected/jsonb.out	2021-03-01 16:32:13.348655633 -0500
+++ /home/postgres/pgsql/src/test/regress/results/jsonb.out	2021-07-26 13:10:53.523540855 -0400
@@ -320,11 +320,9 @@
 where tablename = 'rows' and
       schemaname = pg_my_temp_schema()::regnamespace::text
 order by 1;
- attname |     histogram_bounds
----------+--------------------------
- x       | [1, 2, 3]
- y       | ["txt1", "txt2", "txt3"]
-(2 rows)
+ attname | histogram_bounds
+---------+------------------
+(0 rows)

-- to_jsonb, timestamps
select to_jsonb(timestamp '2014-05-28 12:22:35.614298');

What's happening here is that regnamespace_out is returning 'pg_temp' which doesn't match any name visible in pg_namespace. So that would pretty clearly break user queries as well as our own tests. I'm afraid that the wholesale behavior change I was imagining isn't going to work. Probably we'd better stick to doing something close to the v2 patch I posted.

I'm still suspicious of that logic in event_trigger.c, though.

regards, tom lane
Re: Have I found an interval arithmetic bug?
On Sun, Jul 25, 2021 at 11:56:54AM -0700, Bryn Llewellyn wrote:
> As far as I’ve been able, the PG documentation doesn’t do a good job of
> defining the semantics of any of these operations. Some (like the “justify”

This is because fractional interval values are not used or asked about often.

> functions” are sketched reasonably well. Others, like interval multiplication,
> are entirely undefined.

Yes, the “justify” functions were requested and implemented because they met a frequently-requested need unrelated to fractional values, though they do have spill-up uses.

> This makes discussion of simple test like the two I showed immediately above
> hard. It also makes any discussion of correctness, possible bugs, and proposed
> implementation changes very difficult.

Agreed. With fractional values an edge use-case, we are trying to find the most useful implementation.

> As I’ve said, my conclusion is that the only safe approach is to create and use
> only “pure” interval values (where just one of the internal fields is
> non-zero). For this reason (and having seen what I decided was the impossibly
> unmemorable rules that my modeled implementation uses) I didn’t look at the
> rules for the other fields that the interval literal allows (weeks, centuries,
> millennia, and so on).

I think the current page is clear about _specifying_ fractional units, but you are right that multiplication/division of fractional values is not covered.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
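For the record, what interval multiplication actually does today with fractional results is to spill the fraction down into the next-smaller field, using 30 days per month and 24 hours per day (results as observed on a current server; this is exactly the behavior the documentation leaves unspecified):

    SELECT interval '1 month' * 1.5;        -- 1 mon 15 days
    SELECT interval '1 year' * 1.5;         -- 1 year 6 mons
    SELECT interval '1 month 1 day' * 1.5;  -- 1 mon 16 days 12:00:00

The third case shows the compounding: the half month becomes 15 days, which combines with the 1.5 days, and the leftover half day spills into the time field.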
Re: Rename of triggers for partitioned tables
Alvaro Herrera writes: > Arne complained that there should be a unique constraint on (tgrelid, > tgparentid) which would sidestep the need for this to be a loop. I > don't think it's really necessary, and I'm not sure how to create a > system index WHERE tgparentid <> 0. Yeah, we don't support partial indexes on catalogs, and this example doesn't make me feel like we ought to open that can of worms. But then maybe this function shouldn't assume there's only one match? regards, tom lane
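A sketch of what the test-and-elog shape suggested upthread could look like, if a duplicate is worth detecting rather than assuming away (an illustrative fragment only; the variable names and message text are made up, not the actual trigger.c code):

    /* There should be at most one matching tuple */
    tuple = systable_getnext(tgscan);
    if (HeapTupleIsValid(tuple))
    {
        /* copy it out, since advancing the scan invalidates the tuple */
        tuple = heap_copytuple(tuple);

        /* a second match would mean corrupted catalogs */
        if (HeapTupleIsValid(systable_getnext(tgscan)))
            elog(ERROR, "unexpected duplicate pg_trigger row for relation %u",
                 tgrelid);
    }

Since the function is not performance critical, the extra systable_getnext() call costs essentially nothing, and a real ERROR gives a usable report if the catalogs ever do end up corrupted.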
Re: Skip temporary table schema name from explain-verbose output.
On 2021-Jul-26, Tom Lane wrote: > On the other hand, I don't like 0002 one bit, because it's not accounting > for whether the temp schema it's mangling is *our own* temp schema or some > other session's. I do not think it is wise or even safe to report some > other temp schema as being "pg_temp". By the same token, I wonder whether > this bit in event_trigger.c is a good idea or a safety hazard: > > /* XXX not quite get_namespace_name_or_temp */ > if (isAnyTempNamespace(schema_oid)) > schema = pstrdup("pg_temp"); > else > schema = get_namespace_name(schema_oid); Oh, you meant this one. To be honest I don't remember *why* this code wants to show remote temp tables as just "pg_temp" ... it's possible that some test in the DDL-to-JSON code depended on this behavior. Without spending too much time analyzing it, I agree that it seems dangerous and might lead to referring to unintended objects. (Really, my memory is not clear on *why* we would be referring to temp tables of other sessions.) -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Re: Fix around conn_duration in pgbench
On 2021/06/30 14:43, Yugo NAGATA wrote:
> On Wed, 30 Jun 2021 14:35:37 +0900 Yugo NAGATA wrote:
>> Hello Asif,
>>
>> On Tue, 29 Jun 2021 13:21:54 + Asif Rehman wrote:
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world: tested, passed
>>> Implements feature: tested, passed
>>> Spec compliant: tested, passed
>>> Documentation: not tested
>>>
>>> This patch looks fine to me. master 48cb244fb9
>>>
>>> The new status of this patch is: Ready for Committer
>>
>> Thank you for reviewing this patch!
>>
>> The previous patch included fixes about connection failures, but this
>> part was moved to another patch discussed in [1].
>>
>> [1] https://www.postgresql.org/message-id/alpine.DEB.2.22.394.2106181535400.3146194%40pseudo
>>
>> I attached the updated patch.
>
> I am sorry but I attached the other patch. Attached in this post is the
> latest patch.

    case CSTATE_FINISHED:
    +   /* per-thread last disconnection time is not measured */

Could you tell me why we don't need to do this measurement?

    -   /* no connection delay to record */
    -   thread->conn_duration = 0;
    +   /* connection delay is measured globally between the barriers */

Is this comment really correct? I was thinking that the measurement is not necessary here because this is the case where the -C option is not specified.

    done:
        start = pg_time_now();
        disconnect_all(state, nstate);
        thread->conn_duration += pg_time_now() - start;

Shouldn't we measure the disconnection time here only when the -C option is specified (i.e., the is_connect variable is true)? Though, I'm not sure how much this change would help reduce the performance overhead.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
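If we go that way, the "done:" block could end up looking something like this (a sketch only, built on the quoted code and pgbench's existing is_connect flag and pg_time_now() helper):

    done:
        if (is_connect)
        {
            pg_time_usec_t start = pg_time_now();

            /* only -C mode folds disconnection time into conn_duration */
            disconnect_all(state, nstate);
            thread->conn_duration += pg_time_now() - start;
        }
        else
            disconnect_all(state, nstate);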
Why don't update minimum recovery point in xact_redo_abort
Hi, all

Recently, I got a PANIC while restarting a standby, which can be reproduced by the following steps, based on pg 11:

1. begin a transaction in the primary node;
2. create a table in the transaction;
3. insert lots of data into the table;
4. do a checkpoint, and restart the standby after the checkpoint is done in the primary node;
5. insert/update lots of data into the table again;
6. abort the transaction.

After step 6, fast shutdown the standby node and then restart it; you will get a PANIC log, and the backtrace is:

#0  0x7fc663e5a277 in raise () from /lib64/libc.so.6
#1  0x7fc663e5b968 in abort () from /lib64/libc.so.6
#2  0x00c89f01 in errfinish (dummy=0) at elog.c:707
#3  0x00c8cba3 in elog_finish (elevel=22, fmt=0xdccc18 "WAL contains references to invalid pages") at elog.c:1658
#4  0x005e476a in XLogCheckInvalidPages () at xlogutils.c:253
#5  0x005cbc1a in CheckRecoveryConsistency () at xlog.c:9477
#6  0x005ca5c5 in StartupXLOG () at xlog.c:8609
#7  0x00a025a5 in StartupProcessMain () at startup.c:274
#8  0x00643a5c in AuxiliaryProcessMain (argc=2, argv=0x7ffe4e4849a0) at bootstrap.c:485
#9  0x00a00620 in StartChildProcess (type=StartupProcess) at postmaster.c:6215
#10 0x009f92c6 in PostmasterMain (argc=3, argv=0x4126500) at postmaster.c:1506
#11 0x008eab64 in main (argc=3, argv=0x4126500) at main.c:232

I think the reason for the above error is as follows:

1. the transaction in the primary node was finally aborted, and the standby node also deleted the table files after replaying the xlog record, but without updating the minimum recovery point;
2. the primary node did a checkpoint before the abort, and the standby node was then restarted, so the standby will recover from a point where the table has already been created and data has been inserted into it;
3. when the standby node restarts after step 6, it will find that a page needed during recovery doesn't exist, having already been deleted by xact_redo_abort, so it treats this page as an invalid page;
4. xact_redo_abort drops relation files without updating the minimum recovery point, so before the standby replays the abort xlog record and forgets the invalid pages again, it will already claim to have reached consistency, because the abort record's LSN is greater than minRecoveryPoint;
5. during CheckRecoveryConsistency, it checks the invalid pages, finds that one exists, and the PANIC log is generated.

So why don't we update the minimum recovery point in xact_redo_abort, just like the XLogFlush call in xact_redo_commit? That way the standby would only reach consistency and check invalid pages after it has replayed the abort record.

Hope to get your reply.

Thanks & Best Regards
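A sketch of the kind of fix being asked about here, mirroring the XLogFlush() call in xact_redo_commit(). This is only a guess at the shape; in particular, the record's end LSN would have to be passed down into xact_redo_abort(), which today doesn't receive it:

    if (parsed->nrels > 0)
    {
        /*
         * Make sure files supposed to be dropped are dropped only after
         * the minimum recovery point has been advanced past this record,
         * as xact_redo_commit() already does for commits.
         */
        XLogFlush(lsn);

        DropRelationFiles(parsed->xnodes, parsed->nrels, true);
    }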
Re: Rename of triggers for partitioned tables
From: Tom Lane
Sent: Monday, July 26, 2021 19:38
To: Alvaro Herrera
Subject: Re: Rename of triggers for partitioned tables

> Yeah, we don't support partial indexes on catalogs, and this example
> doesn't make me feel like we ought to open that can of worms.

I asked why such an index doesn't exist already. I guess that answers it. I definitely don't want to push for supporting that, especially not knowing what it entails.

> But then maybe this function shouldn't assume there's only one match?

I'd consider a duplication here a serious bug. That is pretty much catalog corruption. Having two children within a single child table sounds like it would break a lot of things. That being said, this function is hardly performance critical. I don't know whether throwing an elog indicating what's going wrong here would be worth it.

Regards

Arne
Re: Skip temporary table schema name from explain-verbose output.
Alvaro Herrera writes: > Oh, you meant this one. To be honest I don't remember *why* this code > wants to show remote temp tables as just "pg_temp" ... it's possible > that some test in the DDL-to-JSON code depended on this behavior. > Without spending too much time analyzing it, I agree that it seems > dangerous and might lead to referring to unintended objects. (Really, > my memory is not clear on *why* we would be referring to temp tables of > other sessions.) Yeah, it's not very clear why that would happen, but if it does, showing "pg_temp" seems pretty misleading. I tried replacing the code with just get_namespace_name_or_temp(), and it still gets through check-world, for whatever that's worth. I'm inclined to change this in HEAD but leave it alone in the back branches. While it seems pretty bogus, it's not clear if anyone out there could be relying on the current behavior. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Fri, Jul 23, 2021 at 4:57 PM Mark Dilger wrote: > > On Jul 23, 2021, at 1:54 PM, Robert Haas wrote: > > Yeah, but you're inventing a system for allowing the restriction on a > > GUC to be something other than is-superuser in the very patch we're > > talking about. So it could be something like is-database-security. > > Therefore I don't grok the objection. > > I'm not objecting to how hard it would be to implement. I'm objecting to the > semantics. If the only non-superuser who can set the GUC is > pg_database_security, then it is absolutely worthless in preventing > pg_database_security from trapping actions performed by pg_network_security > members. On the other hand, if pg_network_security can also set the GUC, > then pg_network_security can circumvent audit logging that > pg_database_security put in place. What's the point in having these as > separate roles if they can circumvent each other's authority? Right, that would be bad. I had forgotten how this worked, but it seems that event triggers are called with the privileges of the user whose action caused the event trigger to be fired, not the privileges of the user who owns the trigger. So as you say, if you can get somebody to do something that causes an event trigger to be fired, you can do anything they can do. As far as I can see, the only reasonable conclusion is that, unless we change the security model, doing anything with event triggers will have to remain superuser-only. In other words I don't think we can give it to any of pg_database_security or pg_host_security or pg_network_security, or any similar role. We could have a pg_event_triggers role that is documented as able to usurp superuser, but I don't see the point. Now, the other alternative is changing the security model for event triggers, but I am not sure that really fixes anything. You proposed having a new mode where someone could only do things that could be done by either user, but that troubles me for a number of reasons. One is that it often makes a difference who actually did a particular operation. For example it might be that alice and bob both have the ability to give charlie permission on some table, but the ACL for that table will record who actually issued the grant. It might be that both alice and bob have the ability to create a table, but the table will be owned by whoever actually does. Suppose bob is about to be terminated but can arrange for alice (who is a star performer) to grant permissions to his accomplice charlie, thus arranging for those permissions to survive his impending termination. That's bad. Also, what about just throwing an ERROR? Anybody's allowed to do that, but that doesn't mean that it's OK for one user to block everything some other user wants to do. If seward and bates respectively have pg_database_security and pg_network_security, it's OK for seward to interfere with attempts by bates to access database objects, but it's not OK for seward to prevent bates from reconfiguring network access to PostgreSQL. Because event triggers don't fire for ALTER SYSTEM or DDL commands on global objects, we might almost be OK here, but I'm not sure if it's completely OK. I'm pretty sure that the reason we set this up the way we did was because we assumed that the person creating the event trigger would always have maximum privileges i.e. superuser. Therefore, it seemed "safer" to run the code under the less-privileged account. 
If we'd thought about this from the perspective of having non-superuser-owned event triggers, I think we would have made the opposite decision, since running code as yourself in somebody else's session is less dangerous than running code as somebody else straight up. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing "long int"-related limit on hash table sizes
On 2021-Jul-25, Ranier Vilela wrote:

> > BTW, one aspect of this that I'm unsure how to tackle is the
> > common usage of "L" constants; in particular, "work_mem * 1024L"
> > is a really common idiom that we'll need to get rid of. Not sure
> > that grep will be a useful aid for finding those.
>
> I can see 30 matches in the head tree. (grep -d "1024L" *.c)

grep '[0-9]L\>' -- *.[chyl]

shows some more constants.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: list of extended statistics on psql (\dX)
Hi, I've pushed the last version of the fix, including the regression tests etc. Backpatch to 14, where \dX was introduced. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Some code cleanup for pgbench and pg_verifybackup
On 2021-Jul-26, Fabien COELHO wrote:

> > - pgbench and pg_verifybackup make use of pg_log_fatal() to report
> > some failures in code paths dedicated to command-line options, which
> > is inconsistent with all the other tools that use pg_log_error().
>
> The semantics for fatal vs error is that an error is somehow handled while a
> fatal is not. If the log message is followed by an cold exit, ISTM that
> fatal is the right call, and I cannot help if other commands do not do that.
> ISTM more logical to align other commands to fatal when appropriate.

I was surprised to discover a few weeks ago that pg_log_fatal() did not terminate the program, which was my expectation. If every single call to pg_log_fatal() is followed by exit(1), why not make pg_log_fatal() itself exit? Apparently this coding pattern confuses many people -- for example pg_verifybackup.c lines 291ff fail to exit(1) after "throwing" a fatal error, while the block at line 275 does the right thing.

(In reality we cannot literally just exit(1) in pg_log_fatal(), because there are quite a few places that do some other thing after the log call and before exit(1), or terminate the program in some other way than exit().)

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
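For the simple cases, an exiting variant could be a thin wrapper over the existing src/common logging machinery, something like the sketch below. The wrapper and its name are hypothetical; pg_log_generic_v() and PG_LOG_FATAL are the pieces that already exist:

    #include <stdarg.h>
    #include <stdlib.h>

    #include "common/logging.h"

    /* log at FATAL level, then terminate the program */
    void
    pg_log_fatal_exit(const char *fmt, ...)
    {
        va_list     ap;

        va_start(ap, fmt);
        pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
        va_end(ap);

        exit(1);
    }

The call sites that need cleanup (closing connections, printing the --help hint) before exiting would of course have to keep doing that themselves, per the caveat above.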
Re: needless complexity in StartupXLOG
On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost wrote:
> Yeah, tend to agree with this too ... but something I find a bit curious
> is the comment:
>
>     * Insert a special WAL record to mark the end of
>     * recovery, since we aren't doing a checkpoint.
>
> ... immediately after setting promoted = true, and then at the end of
> StartupXLOG() having:
>
>     if (promoted)
>         RequestCheckpoint(CHECKPOINT_FORCE);
>
> maybe I'm missing something, but seems like that comment isn't being
> terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
> sure looks like we're going to do one moments later regardless of
> anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might write WAL in the meantime. When we write the checkpoint record right here, there can't be any WAL from the new server lifetime until the checkpoint completes. When we write an end-of-recovery record, there can. And there could actually be quite a bit, because if we do the checkpoint right in this section of code, it will be a fast checkpoint, whereas in the code you quoted above, it's a spread checkpoint, which takes a lot longer. So the question is whether it's reasonable to give the checkpoint some time to complete or whether it needs to be completed right now.

> Agreed that simpler logic is better, provided it's correct logic, of
> course. Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as many cases. Now no single change is going to fix that, but the number of flag variables in xlog.c is simply bonkers. This particular stretch of code uses 3 of them to even decide whether to attempt the test in question, and all of those are set in complex ways depending on the values of still more flag variables. The comments where bgwriterLaunched is set claim that we only do this during archive recovery, not crash recovery, but the code depends on the value of ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we can't get the bgwriter to run even during crash recovery in the scenario described by:

    * It's possible that archive recovery was requested, but we don't
    * know how far we need to replay the WAL before we reach consistency.
    * This can happen for example if a base backup is taken from a
    * running server using an atomic filesystem snapshot, without calling
    * pg_start/stop_backup. Or if you just kill a running primary server
    * and put it into archive recovery by creating a recovery signal
    * file.

If we ran the bgwriter all the time during crash recovery, we'd know for sure whether that causes any problems. If we only do it like this in certain corner cases, it's much more likely that we have bugs. Grumble, grumble.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes:
> [ v4-0001-regexp-foo-functions.patch ]

I started to work through this and was distressed to realize that it's trying to redefine regexp_replace() in an incompatible way. We already have

    regression=# \df regexp_replace
                                   List of functions
       Schema   |      Name      | Result data type |  Argument data types   | Type
    ------------+----------------+------------------+------------------------+------
     pg_catalog | regexp_replace | text             | text, text, text       | func
     pg_catalog | regexp_replace | text             | text, text, text, text | func
    (2 rows)

The patch proposes to add (among other alternatives)

    +{ oid => '9608', descr => 'replace text using regexp',
    +  proname => 'regexp_replace', prorettype => 'text',
    +  proargtypes => 'text text text int4',
    +  prosrc => 'textregexreplace_extended_no_occurrence' },

which is going to be impossibly confusing for both humans and machines. I don't think we should go there. Even if you managed to construct examples that didn't result in "ambiguous function" failures, that doesn't mean that ordinary mortals won't get bit that way.

I'm inclined to just drop the regexp_replace additions. I don't think that the extra parameters Oracle provides here are especially useful. They're definitely not useful enough to justify creating compatibility hazards for.

regards, tom lane
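As a hypothetical illustration of the hazard (assuming the patch's (text, text, text, int4) variant were installed alongside the existing (text, text, text, text) one), two nearly identical calls would silently resolve to different functions:

    SELECT regexp_replace('banana', 'a', 'o', 2);
    -- would pick the int4 variant: 2 as an Oracle-style start position

    SELECT regexp_replace('banana', 'a', 'o', '2');
    -- picks the existing text variant: '2' taken as a flags string,
    -- which presumably then fails at runtime as an invalid option

The quoted literal is initially of unknown type, and the type-resolution rules would presumably prefer the text candidate for it, so nothing so much as a warning would point at the mismatch.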
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Fri, Jul 23, 2021 at 4:57 PM Mark Dilger > wrote: > > > On Jul 23, 2021, at 1:54 PM, Robert Haas wrote: > > > Yeah, but you're inventing a system for allowing the restriction on a > > > GUC to be something other than is-superuser in the very patch we're > > > talking about. So it could be something like is-database-security. > > > Therefore I don't grok the objection. > > > > I'm not objecting to how hard it would be to implement. I'm objecting to > > the semantics. If the only non-superuser who can set the GUC is > > pg_database_security, then it is absolutely worthless in preventing > > pg_database_security from trapping actions performed by pg_network_security > > members. On the other hand, if pg_network_security can also set the GUC, > > then pg_network_security can circumvent audit logging that > > pg_database_security put in place. What's the point in having these as > > separate roles if they can circumvent each other's authority? > > Right, that would be bad. I had forgotten how this worked, but it > seems that event triggers are called with the privileges of the user > whose action caused the event trigger to be fired, not the privileges > of the user who owns the trigger. So as you say, if you can get > somebody to do something that causes an event trigger to be fired, you > can do anything they can do. As far as I can see, the only reasonable > conclusion is that, unless we change the security model, doing > anything with event triggers will have to remain superuser-only. In > other words I don't think we can give it to any of > pg_database_security or pg_host_security or pg_network_security, or > any similar role. We could have a pg_event_triggers role that is > documented as able to usurp superuser, but I don't see the point. Right- event triggers work just the same as how regular triggers on tables do and how RLS works. All of these also have the possibility of leveraging security definer functions, of course, but that doesn't address the issue of the trigger author attempting to attack the individual running the trigger. I do think it'd be useful to have a pg_event_triggers or such role, so that someone could create them without being a superuser. A bit more discussion about that below though.. > Now, the other alternative is changing the security model for event > triggers, but I am not sure that really fixes anything. You proposed > having a new mode where someone could only do things that could be > done by either user, but that troubles me for a number of reasons. One > is that it often makes a difference who actually did a particular > operation. For example it might be that alice and bob both have the > ability to give charlie permission on some table, but the ACL for that > table will record who actually issued the grant. It might be that both > alice and bob have the ability to create a table, but the table will > be owned by whoever actually does. Suppose bob is about to be > terminated but can arrange for alice (who is a star performer) to > grant permissions to his accomplice charlie, thus arranging for those > permissions to survive his impending termination. That's bad. 
As I understood Mark's suggestion, the trigger would run but would have the privileges of the intersection of both user's permissions, which is an interesting idea but not one we've got any way to really do today as each privilege check would now need to check two different roles for privilege- and if one of the privilege checks fails, then what..? Presumably there would be an ERROR returned, meaning that the operation would be able to be prevented from happening by the trigger author, which was objected to as not being acceptable either, per below. > Also, what about just throwing an ERROR? Anybody's allowed to do that, > but that doesn't mean that it's OK for one user to block everything > some other user wants to do. If seward and bates respectively have > pg_database_security and pg_network_security, it's OK for seward to > interfere with attempts by bates to access database objects, but it's > not OK for seward to prevent bates from reconfiguring network access > to PostgreSQL. Because event triggers don't fire for ALTER SYSTEM or > DDL commands on global objects, we might almost be OK here, but I'm > not sure if it's completely OK. Regular table triggers can be used to block someone from messing with that table, so this isn't entirely unheard of. Deciding that someone with event trigger access is allowed to prevent certain things from happening in a database that they're allowed to connect and create event triggers in may not be completely unreasonable. > I'm pretty sure that the reason we set this up the way we did was > because we assumed that the person creating the event trigger would > always have maximum privileges i.e. superuser. Therefore, it seemed > "safer" to run the code under the less-privileged account. If
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:05 PM Stephen Frost wrote:
> As I understood Mark's suggestion, the trigger would run but would have
> the privileges of the intersection of both user's permissions, which is
> an interesting idea but not one we've got any way to really do today as
> each privilege check would now need to check two different roles for
> privilege- and if one of the privilege checks fails, then what..?
> Presumably there would be an ERROR returned, meaning that the operation
> would be able to be prevented from happening by the trigger author,
> which was objected to as not being acceptable either, per below.

I think I may not have expressed myself clearly enough here. What I'm concerned about is: Alice should not be permitted to prevent Bob from doing something which Bob is allowed to do and Alice is not allowed to do. If Alice is the administrator of PostgreSQL's XYZ subsystem, she can permit Bob from using it if she wishes. But if Bob is an administrator of XYZ and Alice is not, there shouldn't be a way for Alice to obstruct Bob's access to that system.

Do you agree?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:12 PM Robert Haas wrote:
> I think I may not have expressed myself clearly enough here. What I'm
> concerned about is: Alice should not be permitted to prevent Bob
> from doing something which Bob is allowed to do and Alice is not
> allowed to do. If Alice is the administrator of PostgreSQL's XYZ
> subsystem, she can permit Bob from using it if she wishes. But if Bob

argh, typo. I meant prevent, not permit.

> is an administrator of XYZ and Alice is not, there shouldn't be a way
> for Alice to obstruct Bob's access to that system.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: needless complexity in StartupXLOG
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost wrote: > > Yeah, tend to agree with this too ... but something I find a bit curious > > is the comment: > > > > * Insert a special WAL record to mark the end of > > * recovery, since we aren't doing a checkpoint. > > > > ... immediately after setting promoted = true, and then at the end of > > StartupXLOG() having: > > > > if (promoted) > > RequestCheckpoint(CHECKPOINT_FORCE); > > > > maybe I'm missing something, but seems like that comment isn't being > > terribly clear. Perhaps we aren't doing a full checkpoint *there*, but > > sure looks like we're going to do one moments later regardless of > > anything else since we've set promoted to true..? > > Yep. So it's a question of whether we allow operations that might > write WAL in the meantime. When we write the checkpoint record right > here, there can't be any WAL from the new server lifetime until the > checkpoint completes. When we write an end-of-recovery record, there > can. And there could actually be quite a bit, because if we do the > checkpoint right in this section of code, it will be a fast > checkpoint, whereas in the code you quoted above, it's a spread > checkpoint, which takes a lot longer. So the question is whether it's > reasonable to give the checkpoint some time to complete or whether it > needs to be completed right now. All I was really trying to point out above was that the comment might be improved upon, just so someone understands that we aren't doing a checkpoint at this particular place, but one will be done later due to the promotion. Maybe I'm being a bit extra with that, but that was my thought when reading the code and the use of the promoted flag variable. > > Agreed that simpler logic is better, provided it's correct logic, of > > course. Finding better ways to test all of this would be really nice. > > Yeah, and there again, it's a lot easier to test if we don't have as > many cases. Now no single change is going to fix that, but the number > of flag variables in xlog.c is simply bonkers. This particular stretch > of code uses 3 of them to even decide whether to attempt the test in > question, and all of those are set in complex ways depending on the > values of still more flag variables. The comments where > bgwriterLaunched is set claim that we only do this during archive > recovery, not crash recovery, but the code depends on the value of > ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we > can't get the bgwriter to run even during crash recovery in the > scenario described by: > > * It's possible that archive recovery was requested, but we don't > * know how far we need to replay the WAL before we reach consistency. > * This can happen for example if a base backup is taken from a > * running server using an atomic filesystem snapshot, without calling > * pg_start/stop_backup. Or if you just kill a running primary server > * and put it into archive recovery by creating a recovery signal > * file. > > If we ran the bgwriter all the time during crash recovery, we'd know > for sure whether that causes any problems. If we only do it like this > in certain corner cases, it's much more likely that we have bugs. > Grumble, grumble. Yeah ... not to mention that it really is just incredibly dangerous to use such an approach for PITR. 
For my 2c, I'd rather we figure out a way to prevent this than to imply that we support it when we have no way of knowing if we actually have replayed far enough to be consistent. That isn't to say that using snapshots for database backups isn't possible, but it should be done in-between pg_start/stop_backup calls which properly grab the returned info from those and store the backup label with the snapshot, etc. Thanks, Stephen signature.asc Description: PGP signature
Re: Removing "long int"-related limit on hash table sizes
Alvaro Herrera writes: > On 2021-Jul-25, Ranier Vilela wrote: > >> > BTW, one aspect of this that I'm unsure how to tackle is the >> > common usage of "L" constants; in particular, "work_mem * 1024L" >> > is a really common idiom that we'll need to get rid of. Not sure >> > that grep will be a useful aid for finding those. >> > >> I can see 30 matches in the head tree. (grep -d "1024L" *.c) > > grep grep '[0-9]L\>' -- *.[chyl] > shows some more constants. git grep -Eiw '(0x[0-9a-f]+|[0-9]+)U?LL?' -- *.[chyl] gives about a hundred more hits. We also have the (U)INT64CONST() macros, which are about about two thirds as common as the U?LL? suffixes. - ilmari
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Stephen Frost writes: > As I understood Mark's suggestion, the trigger would run but would have > the privileges of the intersection of both user's permissions, which is > an interesting idea but not one we've got any way to really do today as > each privilege check would now need to check two different roles for > privilege- and if one of the privilege checks fails, then what..? > Presumably there would be an ERROR returned, meaning that the operation > would be able to be prevented from happening by the trigger author, > which was objected to as not being acceptable either, per below. I've not been paying close attention, so maybe this was already considered, but ... What if we allow event triggers owned by non-superusers, but only fire them on commands performed by the trigger's owner? This sidesteps all the issues of who has which privileges and whether Alice is malicious towards Bob or vice versa, because there is no change of privilege domain. Admittedly, it fails to cover some use-cases, but I think it would still handle a lot of interesting cases. The impression I have is that a lot of applications do everything under just one or a few roles. Possibly this could be generalized to "fire on commands performed by any role the trigger owner is a member of", but then I'm a bit less sure that it's safe from both roles' perspectives. regards, tom lane
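Sketching the proposed rule in SQL terms (entirely hypothetical, since CREATE EVENT TRIGGER is currently superuser-only, and alice_audit_func is a made-up function name):

    -- alice, a non-superuser, creates and therefore owns the trigger
    SET ROLE alice;
    CREATE EVENT TRIGGER alice_audit ON ddl_command_end
        EXECUTE FUNCTION alice_audit_func();

    -- fires for alice's own DDL ...
    CREATE TABLE alice_t (x int);

    -- ... but under the proposed rule would not fire here:
    SET ROLE bob;
    CREATE TABLE bob_t (x int);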
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
> On Jul 26, 2021, at 1:12 PM, Robert Haas wrote: > > Alice should not be permitted to preventing Bob > from doing something which Bob is allowed to do and Alice is not > allowed to do. That sounds intuitively reasonable, though it depends on what "which Bob is allowed to do" means. For instance, if Alice is only allowed to enable or disable connections to the database, and she disables them, then she has prevented Bob from, for example, creating tables, something which Bob is otherwise allowed to do, because without the ability to connect, he cannot create tables. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Removing "long int"-related limit on hash table sizes
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes: > We also have the (U)INT64CONST() macros, which are about two > thirds as common as the U?LL? suffixes. Yeah. Ideally we'd forbid direct use of the suffixes and insist you go through those macros, but I don't know of any way that we could enforce such a coding rule, short of grepping the tree periodically. regards, tom lane
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/24/21, 8:10 PM, "Fujii Masao" wrote: > On 2021/07/25 7:50, Andres Freund wrote: >> Hi, >> >> I've been repeatedly confused by the number of WAL files supposedly >> added. Even when 100s of new WAL files are created the relevant portion >> of log_checkpoints will only ever list zero or one added WAL file. >> >> The reason for that is that CheckpointStats.ckpt_segs_added is only >> incremented in PreallocXlogFiles(). Which has the following comment: >> * XXX this is currently extremely conservative, since it forces only one >> * future log segment to exist, and even that only if we are 75% done with >> * the current one. This is only appropriate for very low-WAL-volume >> systems. >> >> Whereas in real workloads WAL files are almost exclusively created via >> XLogWrite()->XLogFileInit(). >> >> I think we should consider just removing that field. Or, even better, show >> something accurate instead. > > +1 to show something accurate instead. > > It's also worth showing them in monitoring stats view like pg_stat_wal? +1. I was confused by this when working on a WAL pre-allocation patch [0]. Perhaps it could be replaced by a new parameter and a new field in pg_stat_wal. How about something like log_wal_init_interval, where the value is the minimum amount of time between reporting the number of WAL segments created since the last report? Nathan [0] https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbn...@alap3.anarazel.de
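For reference, a sketch of the "accurate" option (hypothetical placement; an actual patch may well look different): since segments are almost exclusively created via XLogWrite()->XLogFileInit(), the counter could be bumped there instead.

/* in XLogFileInit(), once a brand-new segment file has been created
 * (as opposed to recycling an old one): */
CheckpointStats.ckpt_segs_added++;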
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Jul 26, 2021 at 4:12 PM Robert Haas wrote: > > I think I may not have expressed myself clearly enough here. What I'm > > concerned about is: Alice should not be permitted to prevent Bob > > from doing something which Bob is allowed to do and Alice is not > > allowed to do. If Alice is the administrator of PostgreSQL's XYZ > > subsystem, she can permit Bob from using it if she wishes. But if Bob > > argh, typo. I meant prevent, not permit. > > > is an administrator of XYZ and Alice is not, there shouldn't be a way > > for Alice to obstruct Bob's access to that system. > Do you agree? So ... yes and no. There's an awful lot being ascribed to 'administrator' without any definition of it actually being given. We are working in this thread to explicitly split up superuser privileges to allow them to be granted to non-superusers and talking about cases where those privileges end up interacting with each other. Is Alice, as the 'network' manager, considered an 'administrator' of XYZ? Is Bob, as the 'database' manager, considered an 'administrator'? Perhaps both are, perhaps neither are. It doesn't seem helpful to be vague. If Alice is given the right to create event triggers in a given database, then that's explicitly giving Alice the right to block anyone from dropping tables in that database, because that's an inherent part of the event trigger system. Should superusers be able to bypass that? Yes, they probably should be able to and, ideally, they'd be able to do that just in a particular session. Should a user who has been allowed to modify certain GUCs that perhaps Alice hasn't been allowed to modify be able to be prevented from modifying those GUCs by Alice, when neither is a superuser? That's definitely a trickier question and I don't know that I've got an answer offhand. Thanks, Stephen signature.asc Description: PGP signature
Re: Removing "long int"-related limit on hash table sizes
On 2021-Jul-26, Tom Lane wrote: > ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes: > > We also have the (U)INT64CONST() macros, which are about about two > > thirds as common as the U?LL? suffixes. > > Yeah. Ideally we'd forbid direct use of the suffixes and insist > you go through those macros, but I don't know of any way that > we could enforce such a coding rule, short of grepping the tree > periodically. IIRC we have one buildfarm member that warns us about perlcritic; maybe this is just another setup of that sort. (Personally I run the perlcritic check in my local commit-verifying script before pushing.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "XML!" Exclaimed C++. "What are you doing here? You're not a programming language." "Tell that to the people who use me," said XML. https://burningbird.net/the-parable-of-the-languages/
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On 2021-Jul-26, Tom Lane wrote: > What if we allow event triggers owned by non-superusers, but only fire > them on commands performed by the trigger's owner? This sidesteps all > the issues of who has which privileges and whether Alice is malicious > towards Bob or vice versa, because there is no change of privilege > domain. Admittedly, it fails to cover some use-cases, but I think it > would still handle a lot of interesting cases. The impression I have > is that a lot of applications do everything under just one or a few > roles. This is similar to, but not quite the same as, an idea I had: have event triggers owned by non-superusers run for all non-superusers, but not for superusers. It is still the case that all non-superusers have to trust everyone with the event-trigger-create permission, but that's probably the database owner, so most of the time you have to trust them already. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Take out the book your religion considers the right one to find the prayer that brings peace to your soul. Then reboot the computer and see if it works." (Carlos Duclós)
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote: > On 2021-Jul-26, Tom Lane wrote: > > > What if we allow event triggers owned by non-superusers, but only fire > > them on commands performed by the trigger's owner? This sidesteps all > > the issues of who has which privileges and whether Alice is malicious > > towards Bob or vice versa, because there is no change of privilege > > domain. Admittedly, it fails to cover some use-cases, but I think it > > would still handle a lot of interesting cases. The impression I have > > is that a lot of applications do everything under just one or a few > > roles. > > This is similar to, but not quite the same as, an idea I had: have event triggers owned > by non-superusers run for all non-superusers, but not for superusers. > It is still the case that all non-superusers have to trust everyone with > the event-trigger-create permission, but that's probably the database > owner so most of the time you have to trust them already. This sort of logic is what has caused issues with CREATEROLE, imv. It's simply not as simple as "don't run this when the superuser flag is set" because non-superuser roles can become superusers. We need something better to have something like this actually be safe. Tom's suggestion would work, of course, but it would mean having to create event triggers for all the roles in the system, and would those roles who own those event triggers be able to disable them..? If so, it would almost certainly defeat the point of an auditing event trigger.. Thanks, Stephen signature.asc Description: PGP signature
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
I wrote: > Possibly this could be generalized to "fire on commands performed by > any role the trigger owner is a member of", but then I'm a bit less > sure that it's safe from both roles' perspectives. After further thought, I can't poke a hole in that concept. We'd keep the rule that the trigger executes as the calling user. Therefore, the trigger cannot perform any action that the calling user couldn't do if she chose. Conversely, since the trigger owner could become a member of that role and then do whatever the trigger intends to do, this scheme does not give the trigger owner any new abilities either. All we've done is provide what some programming languages call an observer or annotation. I also like the fact that with this rule, superusers' ability to create event triggers that fire for everything is not a special case. regards, tom lane
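A sketch of that firing rule in C (hypothetical; evtowner stands for the event trigger's owner OID, and is_member_of_role() is the existing membership test in acl.c):

/* in the loop that decides which event triggers to fire */
if (!is_member_of_role(evtowner, GetUserId()))
	continue;			/* owner is not a member of the invoking role */

Since is_member_of_role() treats superusers as members of every role, a superuser-owned trigger firing for everything falls out of the same test, matching the last point above.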
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Stephen Frost writes: > ... Tom's suggestion > would work, of course, but it would mean having to create event triggers > for all the roles in the system, and would those roles who own those > event triggers be able to disable them..? Uh, why not? If you own the trigger, you can drop it, so why shouldn't you be able to temporarily disable it? > If so, it would almost > certainly be against the point of an auditing event trigger.. If you want auditing capability, you make an auditor role that is a member of every other role, and then it owns the trigger. (If you need to audit superuser actions too, then the auditor has to be a superuser itself, but that's no worse than before; and I'd argue that non-superusers shouldn't be able to audit superusers anyway.) regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On 2021-Jul-26, Tom Lane wrote: > Stephen Frost writes: > > ... Tom's suggestion > > would work, of course, but it would mean having to create event triggers > > for all the roles in the system, and would those roles who own those > > event triggers be able to disable them..? > > Uh, why not? If you own the trigger, you can drop it, so why shouldn't > you be able to temporarily disable it? I think an auditing system that can be turned off by the audited user is pretty much useless. Or did I misunderstand what you are suggesting? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Creating is as difficult as being free." (Elsa Triolet)
Re: automatically generating node support functions
Peter Eisentraut writes: >> The first eight patches are to clean up various inconsistencies to make >> parsing or generation easier. > Are there any concerns about the patches 0001 through 0008? Otherwise, > maybe we could get those out of the way. I looked through those and don't have any complaints (though I just eyeballed them, I didn't see what a compiler would say). I see you pushed a couple of them already. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Alvaro Herrera writes: > On 2021-Jul-26, Tom Lane wrote: >> Uh, why not? If you own the trigger, you can drop it, so why shouldn't >> you be able to temporarily disable it? > I think an auditing system that can be turned off by the audited user is > pretty much useless. Or did I misunderstood what you are suggesting? For auditing purposes, you make a trusted role that owns the trigger, and is a member of the roles whose actions are to be audited (but NOT vice versa). I think that any idea that the auditing role doesn't need to be trusted that much is foolhardy. What we can buy here is not requiring the auditing role to be full superuser ... assuming that you don't need auditing of superusers. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:28 PM Stephen Frost wrote: > so ... yes and no. There's an awful lot being ascribed to > 'administrator' without any definition of it being actually given. We > are working in this thread to explicitly split up superuser privileges > to allow them to be granted to non-superusers and talking about cases > where those privileges end up interacting with each other. Is Alice, as > the 'network' manager considered an 'administrator' of XYZ? Is Bob, as > the 'database' manager considered an 'administrator'? Perhaps both are, > perhaps neither are. It doesn't seem helpful to be vague. XYZ was intended to stand in for something like 'network' or 'database' or whatever other particular part of PostgreSQL Alice might be charged with administering. > If Alice is given the right to create event triggers in a given > database, then that's explicitly giving Alice the right to block anyone > from dropping tables in that database because that's an inherent part of > the event trigger system. Should superusers be able to bypass that? > Yes, they probably should be able to and, ideally, they'd be able to do > that just in a particular session. I agree. > Should a user who has been allowed > to modify certain GUCs that perhaps Alice hasn't been allowed to modify > be able to be prevented from modifying those GUCs by Alice, when neither > is a superuser? That's definitely a trickier question and I don't know > that I've got an answer offhand. My answer would be "no". I concede Mark's point in another email that if Alice can entirely prevent Bob from connecting to the database then by inference she can also prevent him from exercising any other privileges he may have. I'm prepared to say that's OK; if Alice is administering network connections to the database and cuts everyone else off, then I guess that's just how it is. But if Bob does somehow succeed in getting a connection to the database, then he should be able to exercise his right to change those GUCs which he has permission to change. Alice shouldn't be able to thwart that. -- Robert Haas EDB: http://www.enterprisedb.com
pg_settings.pending_restart not set when line removed
Hi,

I got a report from Gabriele Bartolini and team that the pg_settings view does not get the pending_restart flag set when a setting's line is removed from a file (as opposed to its value changed). The explanation seems to be that GUC_PENDING_RESTART is set by set_config_option, but when ProcessConfigFileInternal is called only to provide data (applySettings=false), then set_config_option is never called and thus the flag doesn't get set. I tried the attached patch, which sets GUC_PENDING_RESTART if we're doing pg_file_settings(). Then any subsequent read of pg_settings will have the pending_restart flag set. This seems to work correctly, and consistently with the case where you change a line (without removing it) in unpatched master.

You could argue that this is *weird* (why does reading pg_file_settings set a flag in global state?) ... but that weirdness is not something this patch is introducing.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "There are two moments in a man's life when he should not speculate: when he can afford it and when he can't." (Mark Twain)

>From 71fa384a6bf7aef58bdbe6d382cbc1219dbaa420 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 26 Jul 2021 18:35:09 -0400
Subject: [PATCH] fix guc

---
 src/backend/utils/misc/guc-file.l | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index 986ce542e3..1fe3af6284 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -354,6 +354,7 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 											 gconf->name),
 							  NULL, 0,
 							  &head, &tail);
+			gconf->status |= GUC_PENDING_RESTART;
 			error = true;
 			continue;
 		}
-- 
2.20.1
Re: Added schema level support for publication.
On Mon, Jul 26, 2021 at 3:21 PM vignesh C wrote: > > Thanks for the comment, this is modified in the v15 patch attached. > I have several minor review comments. (1) src/backend/catalog/objectaddress.c Should start comment sentences with an uppercase letter, for consistency. + /* fetch publication name and schema oid from input list */ I also notice that some 1-sentence comments end with "." (full-stop) and others don't. It seems to alternate all over the place, and so is quite noticeable. Unfortunately, it already seems to be like this in much of the code that this patch patches. Ideally (at least my personal preference is) 1-sentence comments should not end with a ".". (2) src/backend/catalog/pg_publication.c errdetail message I think the following should say "Temporary schemas ..." (since the existing error message for tables says "System tables cannot be added to publications."). + errdetail("Temporary schema cannot be added to publications."))); (3) src/backend/commands/publicationcmds.c PublicationAddTables I think that the Assert below is not correct (i.e. not restrictive enough). Although the condition passes, it is allowing, for example, stmt->for_all_tables==true if stmt->schemas==NIL, and that doesn't seem to be correct. I suggest the following change: BEFORE: + Assert(!stmt || !stmt->for_all_tables || !stmt->schemas); AFTER: + Assert(!stmt || (!stmt->for_all_tables && !stmt->schemas)); (4) src/backend/commands/publicationcmds.c PublicationAddSchemas Similarly, I think that the Assert below is not restrictive enough, and think it should be changed: BEFORE: + Assert(!stmt || !stmt->for_all_tables || !stmt->tables); AFTER: + Assert(!stmt || (!stmt->for_all_tables && !stmt->tables)); (5) src/bin/pg_dump/common.c Spelling mistake. BEFORE: + pg_log_info("reading publciation schemas"); AFTER: + pg_log_info("reading publication schemas"); Regards, Greg Nancarrow Fujitsu Australia
Slim down integer formatting
Folks,

Please find attached a patch to do $subject. It's down to one table lookup and 3 instructions.

In covering the int64 versions, I swiped a lightweight division from the Ryu stuff. I'm pretty sure that what I did is not how to do #includes, but it's a PoC. What would be a better way to do this?

Best, David.
-- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate

>From 1cf7202facd9fee161865d90304e5ede1e3c65cf Mon Sep 17 00:00:00 2001
From: David Fetter
Date: Mon, 26 Jul 2021 16:43:43 -0700
Subject: [PATCH v1] PoC: Slim down integer printing some more

---
 src/backend/utils/adt/numutils.c | 62 +++-
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git src/backend/utils/adt/numutils.c src/backend/utils/adt/numutils.c
index b93096f288..c4f2d5cfb1 100644
--- src/backend/utils/adt/numutils.c
+++ src/backend/utils/adt/numutils.c
@@ -18,6 +18,7 @@
 #include <limits.h>
 #include <ctype.h>
 
+#include "../common/d2s_intrinsics.h"
 #include "common/int.h"
 #include "utils/builtins.h"
 #include "port/pg_bitutils.h"
@@ -39,50 +40,45 @@ static const char DIGIT_TABLE[200] =
 "90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
 
 /*
- * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ * Adapted from
+ * https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
  */
 static inline int
 decimalLength32(const uint32 v)
 {
-	int			t;
-	static const uint32 PowersOfTen[] = {
-		1, 10, 100,
-		1000, 10000, 100000,
-		1000000, 10000000, 100000000,
-		1000000000
+	/*
+	 * Each element in the array, when added to a number with MSB that is the
+	 * array index, will produce a number whose top 32 bits are its decimal
+	 * length, so that can now be had using:
+	 *
+	 * 1 CLZ instruction,
+	 * 1 table lookup,
+	 * 1 add, and
+	 * 1 shift
+	 *
+	 */
+	static const uint64 digit_transitions[32] = {
+		4294967295, 8589934582, 8589934582, 8589934582, 12884901788,
+		12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
+		21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
+		25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
+		34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
+		38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
+		42949672960, 42949672960
 	};
 
-	/*
-	 * Compute base-10 logarithm by dividing the base-2 logarithm by a
-	 * good-enough approximation of the base-2 logarithm of 10
-	 */
-	t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096;
-	return t + (v >= PowersOfTen[t]);
+	return (v + digit_transitions[pg_leftmost_one_pos32(v)]) >> 32;
 }
 
 static inline int
 decimalLength64(const uint64 v)
 {
-	int			t;
-	static const uint64 PowersOfTen[] = {
-		UINT64CONST(1), UINT64CONST(10),
-		UINT64CONST(100), UINT64CONST(1000),
-		UINT64CONST(10000), UINT64CONST(100000),
-		UINT64CONST(1000000), UINT64CONST(10000000),
-		UINT64CONST(100000000), UINT64CONST(1000000000),
-		UINT64CONST(10000000000), UINT64CONST(100000000000),
-		UINT64CONST(1000000000000), UINT64CONST(10000000000000),
-		UINT64CONST(100000000000000), UINT64CONST(1000000000000000),
-		UINT64CONST(10000000000000000), UINT64CONST(100000000000000000),
-		UINT64CONST(1000000000000000000), UINT64CONST(10000000000000000000)
-	};
-
-	/*
-	 * Compute base-10 logarithm by dividing the base-2 logarithm by a
-	 * good-enough approximation of the base-2 logarithm of 10
-	 */
-	t = (pg_leftmost_one_pos64(v) + 1) * 1233 / 4096;
-	return t + (v >= PowersOfTen[t]);
+	if (v >> 32 == 0)
+		return decimalLength32(v);
+	else if (div1e8(v) >> 32 == 0)
+		return 8 + decimalLength32(div1e8(v));
+	else
+		return 16 + decimalLength32(div1e8(div1e8(v)));
 }
 
 /*
-- 
2.31.1
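As a worked example of the table trick (using the digit_transitions[] values above):

/*
 * For v = 999, pg_leftmost_one_pos32(999) = 9 and
 * digit_transitions[9] = 17179868184 = 4 * 2^32 - 1000, so
 *
 *	 (999 + 17179868184) >> 32 = (4 * 2^32 - 1) >> 32 = 3
 *	(1000 + 17179868184) >> 32 = (4 * 2^32)     >> 32 = 4
 *
 * i.e. the addition carries into the next multiple of 2^32 exactly when
 * v gains a decimal digit.
 */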
Re: Some code cleanup for pgbench and pg_verifybackup
On Mon, Jul 26, 2021 at 03:35:29PM -0400, Alvaro Herrera wrote: > On 2021-Jul-26, Fabien COELHO wrote: >> The semantics for fatal vs error is that an error is somehow handled while a >> fatal is not. If the log message is followed by a cold exit, ISTM that >> fatal is the right call, and I cannot help if other commands do not do that. >> ISTM more logical to align other commands to fatal when appropriate. I disagree. pgbench is an outlier here. There are 71 calls to pg_log_fatal() in src/bin/, and pgbench counts for 54 of them. It would be more consistent to align pgbench with the others. > I was surprised to discover a few weeks ago that pg_log_fatal() did not > terminate the program, which was my expectation. If every single call > to pg_log_fatal() is followed by exit(1), why not make pg_log_fatal() > itself exit? Apparently this coding pattern confuses many people -- for > example pg_verifybackup.c lines 291ff fail to exit(1) after "throwing" a > fatal error, while the block at lines 275 does the right thing. I remember having the same reaction when those logging APIs got introduced (I may be wrong here), and I also recall that this point has been discussed, where the conclusion was that the logging should never exit() directly. > (In reality we cannot literally just exit(1) in pg_log_fatal(), because > there are quite a few places that do some other thing after the log > call and before exit(1), or terminate the program in some other way than > exit().) Yes, that's obviously wrong. I am wondering if there is more of that. -- Michael signature.asc Description: PGP signature
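To illustrate the semantics being debated, a hypothetical wrapper (no such helper exists in src/common/logging.c; pg_log_generic_v() does, and is the variadic entry point used below):

#include <stdarg.h>
#include <stdlib.h>

#include "common/logging.h"

static void
log_fatal_and_exit(const char *fmt,...)
{
	va_list		ap;

	va_start(ap, fmt);
	pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
	va_end(ap);
	exit(1);				/* callers would need no trailing exit(1) */
}

As noted above, this could not simply replace pg_log_fatal(), because some callers do other work between the log call and exit(1), or terminate in some way other than exit().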
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 2021/07/27 5:27, Bossart, Nathan wrote: > +1. I was confused by this when working on a WAL pre-allocation > patch [0]. Perhaps it could be replaced by a new parameter and a new > field in pg_stat_wal. How about something like log_wal_init_interval, > where the value is the minimum amount of time between reporting the > number of WAL segments created since the last report? You mean to introduce a new GUC like log_wal_init_interval, so that the number of WAL files created since the last report is logged at that interval? But isn't it better and simpler to just expose the accumulated number of WAL files created, in the pg_stat_wal view or elsewhere? If so, we can easily get to know the number of WAL files created in each interval by checking the view and calculating the diff. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: list of extended statistics on psql (\dX)
Hi Tomas and Justin, On 2021/07/27 4:26, Tomas Vondra wrote: > Hi, > > I've pushed the last version of the fix, including the regression tests > etc. Backpatch to 14, where \dX was introduced. Thank you! Regards, Tatsuro Yamada
Re: needless complexity in StartupXLOG
On Mon, Jul 26, 2021 at 03:53:20PM -0400, Robert Haas wrote: > Yeah, and there again, it's a lot easier to test if we don't have as > many cases. Now no single change is going to fix that, but the number > of flag variables in xlog.c is simply bonkers. This particular stretch > of code uses 3 of them to even decide whether to attempt the test in > question, and all of those are set in complex ways depending on the > values of still more flag variables. The comments where > bgwriterLaunched is set claim that we only do this during archive > recovery, not crash recovery, but the code depends on the value of > ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we > can't get the bgwriter to run even during crash recovery in the > scenario described by: I'm not following along closely and maybe you're already aware of this one? https://commitfest.postgresql.org/33/2706/ Background writer and checkpointer in crash recovery @Thomas: https://www.postgresql.org/message-id/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com > * It's possible that archive recovery was requested, but we don't > * know how far we need to replay the WAL before we reach consistency. > * This can happen for example if a base backup is taken from a > * running server using an atomic filesystem snapshot, without calling > * pg_start/stop_backup. Or if you just kill a running primary server > * and put it into archive recovery by creating a recovery signal > * file. > > If we ran the bgwriter all the time during crash recovery, we'd know > for sure whether that causes any problems. If we only do it like this > in certain corner cases, it's much more likely that we have bugs. > Grumble, grumble. -- Justin
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/26/21, 5:23 PM, "Fujii Masao" wrote: > On 2021/07/27 5:27, Bossart, Nathan wrote: >> +1. I was confused by this when working on a WAL pre-allocation >> patch [0]. Perhaps it could be replaced by a new parameter and a new >> field in pg_stat_wal. How about something like log_wal_init_interval, >> where the value is the minimum amount of time between reporting the >> number of WAL segments created since the last report? > > You mean to introduce new GUC like log_wal_init_interval and that > the number of WAL files created since the last report will be logged > every that interval? But isn't it better and simpler to just expose > the accumulated number of WAL files created, in pg_stat_wal view > or elsewhere? If so, we can easily get to know the number of WAL files > created in every interval by checking the view and calculating the diff. I agree with you about adding a new field to pg_stat_wal. The parameter would just be a convenient way of logging this information for future reference. I don't feel strongly about the parameter if you think the pg_stat_wal addition is enough. Nathan
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi, On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote: > +1. I was confused by this when working on a WAL pre-allocation > patch [0]. Perhaps it could be replaced by a new parameter and a new > field in pg_stat_wal. How about something like log_wal_init_interval, > where the value is the minimum amount of time between reporting the > number of WAL segments created since the last report? Why not just make the number in log_checkpoints accurate? There's no point in the current number, so we don't need to preserve it... Greetings, Andres Freund
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi, On 2021-07-25 12:10:07 +0900, Fujii Masao wrote: > It's also worth showing them in monitoring stats view like pg_stat_wal? I'm not convinced that's all that meaningful. It makes sense to include it as part of the checkpoint output, because checkpoints determine when WAL can be recycled etc. It's not that clear to me how to represent that as part of pg_stat_wal? I guess we could add columns for the amount of WAL that has been a) newly created b) recycled c) removed. In combination that *does* seem useful. But also a mostly independent change... Greetings, Andres Freund
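A sketch of the counters being suggested (hypothetical field names, to live in the struct behind pg_stat_wal):

PgStat_Counter wal_segs_created;	/* brand-new segment files */
PgStat_Counter wal_segs_recycled;	/* old segments renamed for reuse */
PgStat_Counter wal_segs_removed;	/* segments unlinked outright */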
RE: row filtering for logical replication
On July 23, 2021 6:16 PM Amit Kapila wrote: > On Fri, Jul 23, 2021 at 2:27 PM Rahila Syed wrote: > > > > The column comparison for row filtering happens before the unchanged > > toast columns are filtered. Unchanged toast columns are filtered just > > before writing the tuple to output stream. > > > > To perform filtering, you need to use the tuple from WAL and that tuple > doesn't > seem to have unchanged toast values, so how can we do filtering? I think it > is a > good idea to test this once. I agree. Currently, neither unchanged toasted key columns nor unchanged toasted non-key columns are logged. So, we cannot get the toasted value directly for these columns when doing row filtering. I tested the current patch for toasted data and found a problem: the current patch will try to fetch the toast data from the toast table when doing row filtering[1]. But it's unsafe to do that in the walsender. We can see that it uses a HISTORIC snapshot in heap_fetch_toast_slice(), and the comments of init_toast_snapshot() also say "Detoasting *must* happen in the same transaction that originally fetched the toast pointer.". The toast data could have been changed by the time we do the row filtering. For example, I tested the following steps and got an error. 1) UPDATE a nonkey column in the publisher. 2) Use a debugger to block the walsender process in function pgoutput_row_filter_exec_expr(). 3) Open another psql to connect to the publisher, and drop the table which was updated in 1). 4) Unblock the debugger in 2), and then I can see the following error: --- ERROR: could not read block 0 in file "base/13675/16391" --- [1] (1)--publisher-- CREATE TABLE toasted_key ( id serial, toasted_key text PRIMARY KEY, toasted_col1 text, toasted_col2 text ); select repeat('99', 200) as tvalue \gset CREATE PUBLICATION pub FOR TABLE toasted_key WHERE (toasted_col2 = :'tvalue'); ALTER TABLE toasted_key REPLICA IDENTITY USING INDEX toasted_key_pkey; ALTER TABLE toasted_key ALTER COLUMN toasted_key SET STORAGE EXTERNAL; ALTER TABLE toasted_key ALTER COLUMN toasted_col1 SET STORAGE EXTERNAL; ALTER TABLE toasted_key ALTER COLUMN toasted_col2 SET STORAGE EXTERNAL; INSERT INTO toasted_key(toasted_key, toasted_col1, toasted_col2) VALUES(repeat('1234567890', 200), repeat('9876543210', 200), repeat('99', 200)); (2)--subscriber-- CREATE TABLE toasted_key ( id serial, toasted_key text PRIMARY KEY, toasted_col1 text, toasted_col2 text ); CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=10000' PUBLICATION pub; (3)--publisher-- UPDATE toasted_key SET toasted_col1 = repeat('113113', 200); Based on the above steps, the row filter will go through the following path and fetch toast data in the walsender. -- pgoutput_row_filter_exec_expr ... texteq ... text *targ1 = DatumGetTextPP(arg1); pg_detoast_datum_packed detoast_attr --
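A hypothetical guard, sketched only to illustrate the hazard (the patch under review has no such check; isnull and value are stand-in names): an on-disk external datum in a tuple taken from WAL means the value itself was never logged, so evaluating the filter on it chases a toast pointer under the walsender's historic snapshot, which is exactly what blows up in the repro above.

if (!isnull && VARATT_IS_EXTERNAL_ONDISK(DatumGetPointer(value)))
	ereport(ERROR,
			(errmsg("cannot evaluate row filter on an unchanged toasted column")));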
Re: ORDER BY pushdowns seem broken in postgres_fdw
On Thu, 22 Jul 2021 at 20:49, Ronan Dunklau wrote: > > On Thursday, 22 July 2021, 09:44:54 CEST, David Rowley wrote: > > Can you also use explain (verbose, costs off) the same as the other > > tests in that area. Having the costs there would never survive a run > > of the buildfarm. Different hardware will produce different costs, e.g. > > 32-bit hardware might produce cheaper costs due to narrower widths. > > > > Sorry about that. Here it is.

I had a look over the v5 patch and noticed a few issues and a few things that could be improved.

This is not OK:

+tuple = SearchSysCache4(AMOPSTRATEGY,
+						ObjectIdGetDatum(pathkey->pk_opfamily),
+						em->em_datatype,
+						em->em_datatype,
+						pathkey->pk_strategy);

SearchSysCache* expects Datum inputs, so you must use the *GetDatum() macro for each input that isn't already a Datum.

I also: 1. Changed the error message for when that lookup fails so that it's the same as the others that perform a lookup with AMOPSTRATEGY. 2. Put back the comment in equivclass.c for find_em_expr_for_rel. I saw no reason that comment should be changed when the function does exactly what it did before. 3. Renamed appendOrderbyUsingClause to appendOrderBySuffix. I wasn't happy that the name indicated it was only handling USING clauses when it also handled ASC/DESC. I also moved the NULLS FIRST/LAST stuff in there. 4. Adjusted is_foreign_pathkey() to make it easier to read and do is_shippable() before calling find_em_expr_for_rel(). I didn't see the need to call find_em_expr_for_rel() when is_shippable() returned false. 5. Adjusted find_em_expr_for_rel() to remove the ternary operator.

I've attached what I ended up with. It seems that it was the following commit that introduced the ability for sorts to be pushed down to the foreign server, so it would be good if the authors of that patch could look over this.

commit f18c944b6137329ac4a6b2dce5745c5dc21a8578 Author: Robert Haas Date: Tue Nov 3 12:46:06 2015 -0500 postgres_fdw: Add ORDER BY to some remote SQL queries.

David v6_fix_postgresfdw_orderby_handling.patch Description: Binary data
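For clarity, the corrected shape of the SearchSysCache4() call criticized above would presumably be (pk_strategy is an int16, hence Int16GetDatum()):

tuple = SearchSysCache4(AMOPSTRATEGY,
						ObjectIdGetDatum(pathkey->pk_opfamily),
						ObjectIdGetDatum(em->em_datatype),
						ObjectIdGetDatum(em->em_datatype),
						Int16GetDatum(pathkey->pk_strategy));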
Re: shared-memory based stats collector
Hi, On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote: > > > Yeah, thank you very much for checking that. However, this patch is > > > now developed in Andres' GitHub repository. So I'm at a loss what to > > > do for the failure.. > > > > I'll post a rebased version soon. > > (Sorry if you feel hurried, which I didn't mean to.) No worries! I had intended to post a rebase by now. But while I did mostly finish that (see [1]) I unfortunately encountered a new issue around partitioned tables, see [2]. Currently I'm hoping for a few thoughts on that thread about the best way to address the issues. Greetings, Andres Freund [1] https://github.com/anarazel/postgres/tree/shmstat [2] https://www.postgresql.org/message-id/20210722205458.f2bug3z6qzxzpx2s%40alap3.anarazel.de
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
On Mon, Jul 26, 2021 at 05:46:22PM +0900, Kyotaro Horiguchi wrote: > Thanks for revising and committing! I'm fine with all of the recent > discussions on the committed part. Though I don't think it works for > "live" command line options, making the omitting policy symmetric > looks good. I see the same done in several similar uses of strto[il]. OK, applied this one. So for now we are done here. > The change in 020_pg_receivewal.pl results in a chain of four > successive failures, which is fine. But the last failure (#23) happens > for a somewhat dubious reason. Yes, I saw that as well. I'll address that separately. -- Michael signature.asc Description: PGP signature
Re: Fix around conn_duration in pgbench
Hello Fujii-san,

Thank you for looking at it.

On Tue, 27 Jul 2021 03:04:35 +0900 Fujii Masao wrote:

> case CSTATE_FINISHED:
> + /* per-thread last disconnection time is not measured */
>
> Could you tell me why we don't need to do this measurement?

We don't need to do it because it is already done in the CSTATE_END_TX state when the transaction successfully finished. Also, we don't need it when the thread is aborted (that is, in the CSTATE_ABORTED case) because we can't report complete results anyway in such cases. I updated the comment.

> - /* no connection delay to record */
> - thread->conn_duration = 0;
> + /* connection delay is measured globally between the barriers */
>
> This comment is really correct? I was thinking that the measurement is not
> necessary here because this is the case where -C option is not specified.

This comment means that, when -C is not specified, the connection delay is measured between the barrier point where the benchmark starts

 /* READY */
 THREAD_BARRIER_WAIT(&barrier);

and the barrier point where all the threads finish making their initial connections.

 /* GO */
 THREAD_BARRIER_WAIT(&barrier);

> done:
> start = pg_time_now();
> disconnect_all(state, nstate);
> thread->conn_duration += pg_time_now() - start;
>
> We should measure the disconnection time here only when -C option specified
> (i.e., is_connect variable is true)? Though, I'm not sure how much this
> change is helpful to reduce the performance overhead

You are right. We are measuring the disconnection time only when the -C option is specified, but it is already done at the end of the transaction (i.e., CSTATE_END_TX). We need to disconnect here only when we get an error. Therefore, we don't need the measurement here.

I attached the updated patch.

Regards,
Yugo Nagata
-- Yugo NAGATA

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 364b5a2e47..ffd74db3ad 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3545,8 +3545,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 			if (is_connect)
 			{
+				pg_time_usec_t start = now;
+
+				pg_time_now_lazy(&start);
 				finishCon(st);
-				now = 0;
+				now = pg_time_now();
+				thread->conn_duration += now - start;
 			}
 
 			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3570,6 +3574,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 			 */
 		case CSTATE_ABORTED:
 		case CSTATE_FINISHED:
+			/*
+			 * Per-thread last disconnection time is not measured because it
+			 * is already done when the transaction successfully finished.
+			 * Also, we don't need it when the thread is aborted because we
+			 * can't report complete results anyway in such cases.
+			 */
 			finishCon(st);
 			return;
 	}
@@ -6615,6 +6625,7 @@ threadRun(void *arg)
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
+	thread->conn_duration = 0;
 	last_report = thread_start;
 	next_report = last_report + (int64) 1000000 * progress;
@@ -6638,14 +6649,7 @@ threadRun(void *arg)
 				goto done;
 			}
 		}
-
-		/* compute connection delay */
-		thread->conn_duration = pg_time_now() - thread->started_time;
-	}
-	else
-	{
-		/* no connection delay to record */
-		thread->conn_duration = 0;
+		/* connection delay is measured globally between the barriers */
 	}
 
 	/* GO */
@@ -6846,9 +6850,7 @@ threadRun(void *arg)
 	}
 
 done:
-	start = pg_time_now();
 	disconnect_all(state, nstate);
-	thread->conn_duration += pg_time_now() - start;
 
 	if (thread->logfile)
 	{
a thinko in b676ac443b6
Hi, I noticed $subject while rebasing my patch at [1] to enable batching for the inserts used in cross-partition UPDATEs. b676ac443b6 did this: - resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] = - MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor, -planSlot->tts_ops); ... + { + TupleDesc tdesc = CreateTupleDescCopy(slot->tts_tupleDescriptor); + + resultRelInfo->ri_Slots[resultRelInfo->ri_NumSlots] = + MakeSingleTupleTableSlot(tdesc, slot->tts_ops); ... + resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] = + MakeSingleTupleTableSlot(tdesc, planSlot->tts_ops); I think it can be incorrect to use the same TupleDesc for both the slots in ri_Slots (for ready-to-be-inserted tuples) and ri_PlanSlots (for subplan output tuples). Especially if you consider what we did in 86dc90056df that was committed into v14. In that commit, we changed the way a subplan under ModifyTable produces its output for an UPDATE statement. Previously, it would produce a tuple matching the target table's TupleDesc exactly (plus any junk columns), but now it produces only a partial tuple containing the values for the changed columns. So it's better to revert to using planSlot->tts_tupleDescriptor for the slots in ri_PlanSlots. Attached a patch to do so. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://commitfest.postgresql.org/33/2992/ b676ac443b6-thinko-fix.patch Description: Binary data
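Concretely, the fix presumably restores the pre-b676ac443b6 form for the plan slots, matching the removed code shown above:

resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] =
	MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor,
							 planSlot->tts_ops);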
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 9:51 AM David Fetter wrote: > > In covering the int64 versions, I swiped a light weight division from > the Ryu stuff. I'm pretty sure that what I did is not how to do > #includes, but it's a PoC. What would be a better way to do this? > That patch didn't apply for me (on latest source) so I've attached an equivalent with those changes, that does apply, and also tweaks the Makefile include path to address that #include issue. Regards, Greg Nancarrow Fujitsu Australia v2-0001-PoC-Slim-down-integer-printing-some-more.patch Description: Binary data
Re: Allow batched insert during cross-partition updates
On Thu, Jul 22, 2021 at 2:18 PM vignesh C wrote: > On Fri, Jul 2, 2021 at 7:35 AM Amit Langote wrote: > > On Fri, Jul 2, 2021 at 1:39 AM Tom Lane wrote: > > > > > > Amit Langote writes: > > > > [ v6-0001-Allow-batching-of-inserts-during-cross-partition-.patch ] > > > > > > Per the cfbot, this isn't applying anymore, so I'm setting it back > > > to Waiting on Author. > > > > Rebased patch attached. Thanks for the reminder. > > One of the test postgres_fdw has failed, can you please post an > updated patch for the fix: > test postgres_fdw ... FAILED (test process exited with > exit code 2) 7264 ms Thanks Vignesh. I found a problem with the underlying batching code that caused this failure and have just reported it here: https://www.postgresql.org/message-id/CA%2BHiwqEWd5B0-e-RvixGGUrNvGkjH2s4m95%3DJcwUnyV%3Df0rAKQ%40mail.gmail.com Here's v8, including the patch for the above problem. -- Amit Langote EDB: http://www.enterprisedb.com v8-0002-Allow-batching-of-inserts-during-cross-partition-.patch Description: Binary data v8-0001-Fix-a-thinko-in-b676ac443b6.patch Description: Binary data
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
This patch has been applied back to 9.6 and will appear in the next minor release. --- On Tue, May 18, 2021 at 01:26:38PM +0200, Drouvot, Bertrand wrote: > Hi, > > On 5/4/21 10:17 AM, Drouvot, Bertrand wrote: > > > Hi, > > On 4/24/21 3:00 AM, Andres Freund wrote: > > Hi, > > On 2021-04-23 19:28:27 -0500, Justin Pryzby wrote: > > This (combination of) thread(s) seems relevant. > > Subject: pg_upgrade failing for 200+ million Large Objects > > https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com > > https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com > > https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e > > Huh. Thanks for digging these up. > > > > Allows the user to provide a constant via pg_upgrade > command-line, that > overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the > (window of) Transaction IDs available for pg_upgrade to > complete. > > That seems entirely the wrong approach to me, buying further into > the broken idea of inventing random wrong values for oldestXid. > > We drive important things like the emergency xid limits off > oldestXid. On > databases with tables that are older than ~147 million xids (i.e. not > even > affected by the default autovacuum_freeze_max_age) the current > constant leads > to setting the oldestXid to a value *in the future*/wrapped around. > Any > different constant (or pg_upgrade parameter) will do that > too in > other scenarios. > > As far as I can tell there is precisely *no* correct behaviour here > other than > exactly copying the oldestXid limit from the source database. > > > Please find attached a patch proposal doing so: it adds a new (-u) > parameter to pg_resetwal that allows to specify the oldest unfrozen XID to > set. > Then this new parameter is being used in pg_upgrade to copy the source > Latest checkpoint's oldestXID. > > Questions: > > □ Should we keep the old behavior in case -x is being used without -u? > (The proposed patch does not set an arbitrary oldestXID anymore in > case > -x is used.) > □ Also shouldn't we ensure that the xid provided with -x or -u is >= > FirstNormalTransactionId (Currently the only check is that it is != 0)? > > > Copy/pasting Andres feedback (Thanks Andres for this feedback) on those > questions from another thread [1]. > > > I was also wondering if: > > > > * We should keep the old behavior in case pg_resetwal -x is being used > > without -u? (The proposed patch does not set an arbitrary oldestXID > > anymore in case -x is used) > > Andres: I don't think we should. I don't see anything in the old behaviour > maintaining. > > > * We should ensure that the xid provided with -x or -u is > > >= FirstNormalTransactionId (Currently the only check is that it is > > != 0)? > > Andres: Applying TransactionIdIsNormal() seems like a good idea. > > => I am attaching a new version that makes use of TransactionIdIsNormal() > checks. > > Andres: I think it's important to verify that the xid provided with -x is > within a reasonable range of the oldest xid. > > => What do you mean by "a reasonable range"?
> > Thanks > > Bertrand > > [1]: https://www.postgresql.org/message-id/ > 20210517185646.pwe4klaufwmdhe2a%40alap3.anarazel.de > > > > > src/bin/pg_resetwal/pg_resetwal.c | 65 > ++- > src/bin/pg_upgrade/controldata.c | 17 +- > src/bin/pg_upgrade/pg_upgrade.c | 6 > src/bin/pg_upgrade/pg_upgrade.h | 1 + > 4 files changed, 60 insertions(+), 29 deletions(-) > diff --git a/src/bin/pg_resetwal/pg_resetwal.c > b/src/bin/pg_resetwal/pg_resetwal.c > index 805dafef07..5e864760ed 100644 > --- a/src/bin/pg_resetwal/pg_resetwal.c > +++ b/src/bin/pg_resetwal/pg_resetwal.c > @@ -65,6 +65,7 @@ static bool guessed = false;/* T if we had to guess > at any values */ > static const char *progname; > static uint32 set_xid_epoch = (uint32) -1; > static TransactionId set_xid = 0; > +static TransactionId set_oldest_unfrozen_xid = 0; > static TransactionId set_oldest_commit_ts_xid = 0; > static TransactionId set_newest_commit_ts_xid = 0; > static Oid set_oid = 0; > @@ -102,6 +103,7 @@ main(int argc, char *argv[]) > {"next-oid", required_argument, NULL, 'o'}, > {"multixact-offset", required_argument, NULL, 'O'}, > {"next-transaction-id", required_argument, NULL, 'x'}, > + {"oldest-transaction-id"
Re: visibility map corruption
On Sat, Jul 24, 2021 at 10:01:05AM -0400, Bruce Momjian wrote: > On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote: > > On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote: > > > > I could perhaps see corruption happening if pg_control's oldest xid > > > > value was closer to the current xid value than it should be, but I can't > > > > see how having it 2-billion away could cause harm, unless perhaps > > > > pg_upgrade itself used enough xids to cause the counter to wrap more > > > > than 2^31 away from the oldest xid recorded in pg_control. > > > > > > > > What I am basically asking is how to document this and what it fixes. > > > > > > ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe > > > take a look at those? > > > > Agreed. I just wanted to make sure I wasn't missing an important aspect > > of this patch. Thanks. > > Another question --- with the previous code, the oldest xid was always > set to a reasonable value, -2 billion less than the current xid. With > the new code, the oldest xid might be slightly higher than the current > xid if they use -x but not -u. Is that acceptable? I think we agreed it > was. pg_upgrade will always set both. This patch has been applied back to 9.6 and will appear in the next minor release. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 12:28:22PM +1000, Greg Nancarrow wrote: > That patch didn't apply for me (on latest source) so I've attached an > equivalent with those changes, that does apply, and also tweaks the > Makefile include path to address that #include issue. When applying some micro-benchmarking to stress those APIs, how much does this change things? At the end of the day, this also comes down to an evaluation of pg_ulltoa_n() and pg_ultoa_n(). #include "common/int.h" +#include "d2s_intrinsics.h" Er, are you sure about this part? The first version of the patch did that in a different, also incorrect, way. -- Michael signature.asc Description: PGP signature
Re: Slim down integer formatting
On Tue, 27 Jul 2021 at 14:42, Michael Paquier wrote: > When applying some micro-benchmarking to stress those APIs, how much > does this change things? At the end of the day, this also comes down > to an evaluation of pg_ulltoa_n() and pg_ultoa_n(). I'd suggest something like creating a table with, say, 1 million INTs and testing the performance of COPY to '/dev/null'. Repeat for BIGINT. David
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier wrote: > > #include "common/int.h" > +#include "d2s_intrinsics.h" > Er, are you sure about this part? The first version of the patch did > that in a different, also incorrect, way. Er, I was just trying to help out, so at least the patch could be applied (whether the patch has merit is a different story). Are you saying that it's incorrect to include that header file in this source, or that's the wrong way to do it? (i.e. it's wrong to adjust the makefile include path to pick up the location where that header is located and use #include ""? That header is in src/common, which is not on the default include path). The method I used certainly works, but you have objections? Can you clarify what you're saying is incorrect? Regards, Greg Nancarrow Fujitsu Australia
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/26/21, 5:48 PM, "Andres Freund" wrote: > On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote: >> +1. I was confused by this when working on a WAL pre-allocation >> patch [0]. Perhaps it could be replaced by a new parameter and a new >> field in pg_stat_wal. How about something like log_wal_init_interval, >> where the value is the minimum amount of time between reporting the >> number of WAL segments created since the last report? > > Why not just make the number in log_checkpoints accurate? There's no > point in the current number, so we don't need to preserve it... My understanding is that the statistics logged for log_checkpoints are just for that specific checkpoint. From that angle, the value for the number of WAL files added is technically correct. Checkpoints will only ever create zero or one new file via PreallocXlogFiles(). If we also added all the segments created outside of the checkpoint, the value for "added" would go from meaning "WAL files created by this checkpoint" to "WAL files created since the last checkpoint." That's probably less confusing than what's there today, but it's still slightly different from all the other log_checkpoints stats. Nathan
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada wrote:
>
> I'll experiment with the proposed ideas including this idea in more
> scenarios and share the results tomorrow.
>

I've done some benchmarks for the proposed data structures. In this trial, I've added scenarios where dead tuples are concentrated in a particular range of table blocks (tests 5-8), in addition to the scenarios from the previous trial. Also, I've run the benchmarks of each scenario while increasing the table size: in the first test, the maximum block number of the table is 1,000,000 (i.e., an 8GB table), and in the second test it's 10,000,000 (an 80GB table). We can see how performance and memory consumption change with a large-scale table. Here are the results:

* Test 1

select prepare(
1000000, -- max block
10,      -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
20       -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.040 |   98.613 | 572.21 MB  |      0.387 |     1521.981
 intset | 46.88 MB  |  0.114 |   75.944 | 468.67 MB  |      0.961 |      997.760
 radix  | 40.26 MB  |  0.102 |   18.427 | 336.64 MB  |      0.797 |      266.146
 rtbm   | 64.02 MB  |  0.234 |   22.443 | 512.02 MB  |      2.230 |      275.143
 svtm   | 27.28 MB  |  0.060 |   13.568 | 274.07 MB  |      0.476 |      211.073
 tbm    | 96.01 MB  |  0.273 |   10.347 | 768.01 MB  |      2.882 |      128.103

* Test 2

select prepare(
1000000, -- max block
10,      -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.041 |    4.757 | 572.21 MB  |      0.344 |       71.228
 intset | 46.88 MB  |  0.127 |    3.762 | 468.67 MB  |      1.093 |       49.573
 radix  | 9.95 MB   |  0.048 |    0.679 | 82.57 MB   |      0.371 |       16.211
 rtbm   | 34.02 MB  |  0.179 |    0.534 | 288.02 MB  |      2.092 |        8.693
 svtm   | 5.78 MB   |  0.043 |    0.239 | 54.60 MB   |      0.342 |        7.759
 tbm    | 96.01 MB  |  0.274 |    0.521 | 768.01 MB  |      2.685 |        6.360

* Test 3

select prepare(
1000000, -- max block
2,       -- # of dead tuples per page
100,     -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 11.45 MB  |  0.009 |   57.698 | 114.45 MB  |      0.076 |     1045.639
 intset | 15.63 MB  |  0.031 |   46.083 | 156.23 MB  |      0.243 |      848.525
 radix  | 40.26 MB  |  0.063 |   13.755 | 336.64 MB  |      0.501 |      223.413
 rtbm   | 36.02 MB  |  0.123 |   11.527 | 320.02 MB  |      1.843 |      180.977
 svtm   | 9.28 MB   |  0.053 |    9.631 | 92.59 MB   |      0.438 |      212.626
 tbm    | 96.01 MB  |  0.228 |   10.381 | 768.01 MB  |      2.258 |      126.630

* Test 4

select prepare(
1000000, -- max block
100,     -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 572.21 MB |  0.367 |   78.047 | 5722.05 MB |      3.942 |     1154.776
 intset | 93.74 MB  |  0.777 |   45.146 | 937.34 MB  |      7.716 |      643.708
 radix  | 40.26 MB  |  0.203 |    9.015 | 336.64 MB  |      1.775 |      133.294
 rtbm   | 36.02 MB  |  0.369 |    5.639 | 320.02 MB  |      3.823 |       88.832
 svtm   | 7.28 MB   |  0.294 |    3.891 | 73.60 MB   |      2.690 |      103.744
 tbm    | 96.01 MB  |  0.534 |    5.223 | 768.01 MB  |      5.679 |       60.632

* Test 5

select prepare(
1000000, -- max block
150,     -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
2        -- page interval
);

There is 1 consecutive page that has 150 dead tuples at every 2 pages.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 429.16 MB |  0.274 |   75.664 | 4291.54 MB |      3.067 |     1259.501
 intset | 46.88 MB  |  0.559 |   36.449 | 468.67 MB  |      4.565 |      517.445
 radix  | 20.26 MB  |  0.166 |    8.466 | 196.90 MB  |      1.273 |      166.587
 rtbm   | 18.02 MB  |  0.242 |    8.491 | 160.02 MB  |      2.407 |      171.725
 svtm   | 3.66 MB   |  0.243 |    3.635 | 37.10 MB   |      2.022 |       86.165
 tbm    | 48.01 MB  |  0.344 |    9.763 | 384.01 MB  |      3.327 |      151.824

* Test 6

select pre
Re: row filtering for logical replication
On Tue, Jul 27, 2021 at 6:21 AM houzj.f...@fujitsu.com wrote: > 1) UPDATE a nonkey column in publisher. > 2) Use debugger to block the walsender process in function >pgoutput_row_filter_exec_expr(). > 3) Open another psql to connect the publisher, and drop the table which updated >in 1). > 4) Unblock the debugger in 2), and then I can see the following error: > --- > ERROR: could not read block 0 in file "base/13675/16391" Yeah, that's a big problem: it seems the expression evaluation machinery goes and detoasts the externally stored data directly, using some random snapshot. Ideally, we should never attempt to detoast the data in the walsender, because there is no guarantee that those data are preserved. I think that before handing the tuple to the expression evaluation machinery, we will have to deform it and do something about the externally stored data; otherwise it will be very difficult to control this inside the expression evaluation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Slim down integer formatting
On Tue, 27 Jul 2021 at 15:08, Greg Nancarrow wrote: > > On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier wrote: > > > > #include "common/int.h" > > +#include "d2s_intrinsics.h" > > Er, are you sure about this part? The first version of the patch did > > that in a different, also incorrect, way. > > Er, I was just trying to help out, so at least the patch could be > applied (whether the patch has merit is a different story). > Are you saying that it's incorrect to include that header file in this > source, or that's the wrong way to do it? (i.e. it's wrong to adjust > the makefile include path to pickup the location where that header is > located and use #include ""? That header is in src/common, > which is not on the default include path). > The method I used certainly works, but you have objections? > Can you clarify what you say is incorrect? I think the mistake is that the header file is not in src/include/common. For some reason, it's ended up with all the .c files in src/common. I imagine Andrew did this because he didn't ever expect anything else to have a use for these. He indicates that in [1]. Maybe Andrew can confirm? [1] https://www.postgresql.org/message-id/87mup9192t.fsf%40news-spur.riddles.org.uk David
Re: Some code cleanup for pgbench and pg_verifybackup
Bonjour Michaël-san,

>>> The semantics for fatal vs error is that an error is somehow handled
>>> while a fatal is not. If the log message is followed by a cold exit,
>>> ISTM that fatal is the right call, and I cannot help if other commands
>>> do not do that. ISTM more logical to align other commands to fatal
>>> when appropriate.

> I disagree. pgbench is an outlier here. There are 71 calls to
> pg_log_fatal() in src/bin/, and pgbench counts for 54 of them. It would
> be more consistent to align pgbench with the others.

I do not understand your disagreement. Do you disagree about the expected semantics of fatal? Then why provide fatal if it should not be used? What is the expected usage of fatal?

I do not dispute that pgbench is a statistical outlier. However, pgbench is somewhat special because it does not handle a lot of errors, hence the "fatal & exit" pattern is used a lot, and the user has to restart. There are 76 calls to "exit" from pgbench, but only 23 from psql, which is much larger. ISTM that most "normal" pg programs won't do that because they are nice and try to handle errors. So for me "fatal" is the right choice before exiting with a non-zero status, but if "error" is called instead I do not think it matters much; do as you please.

>> I was surprised to discover a few weeks ago that pg_log_fatal() did not
>> terminate the program, which was my expectation.

Mine too.

-- Fabien.