Re: Is there any documentation on how to correctly create extensions in HA(primary-standby) setup?
Hi, On Thu, Dec 09, 2021 at 08:19:06AM +0530, Bharath Rupireddy wrote: > > Thanks. Attaching v1 patch specifying the notes there. Please review. I think that the common terminology is "module", not "extension". That's especially important here as this information is also relevant for modules that may come with an SQL-level extension. This should be made clear in that new documentation, same for the CREATE EXTENSION part that may not be relevant. It also seems that this documentation is only aimed at physical replication. That should also be explicitly stated, as it might not be obvious to the intended readers. + [...] set it either via ALTER SYSTEM + command or postgresql.conf file on both primary and + standys, reload the postgresql.conf file and restart + the servers. Isn't the reload terrible advice? By definition, changing shared_preload_libraries isn't compatible with a simple reload and will emit an error. + [...] Create the extension on the primary, there is no need to + create it on the standbys as the CREATE EXTENSION + command is replicated. The "no need" here is quite ambiguous, as it seems to indicate that trying to create the extension on the standby will work but is unnecessary.
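For reference, the restart requirement discussed above can be sketched as a config fragment (the module name is purely illustrative, not from the patch):

```
# postgresql.conf -- set identically on the primary and on every standby
shared_preload_libraries = 'some_module'   # takes effect only at server start;
                                           # a reload (SIGHUP / pg_reload_conf()) is not enough
```

This is what makes "reload ... and restart" in the proposed wording misleading: only the restart matters for this parameter.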
Re: generic plans and "initial" pruning
On Tue, 11 Jan 2022 at 16:22, Robert Haas wrote: > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. I haven't followed this in any detail, but this patch and its goal of reducing the O(N) drag effect on partition execution time is very important. Locking a long list of objects that then get pruned is very wasteful, as the results show. Ideally, we want an O(1) algorithm for single-partition access, and DDL is rare. So perhaps that is the starting point for a safe design - invent a single lock or cache that allows us to check whether the partition hierarchy has changed in any way: if so, replan; if not, skip the locks. Please excuse me if this idea falls short; if so, please just note my comment about how important this is. Thanks. -- Simon Riggs http://www.EnterpriseDB.com/
Re: 32TB relation size make mdnblocks overflow
Hi, On Tue, Jan 18, 2022 at 02:21:14PM +0800, 陈佳昕(步真) wrote: > > We know that PostgreSQL doesn't support a single relation size over 32TB, > limited by the MaxBlockNumber. But if we just 'insert into' one relation over > 32TB, it will get an error message 'unexpected data beyond EOF in block 0 of > relation' in ReadBuffer_common. The '0 block' is from mdnblocks function > where the segment number is over 256 and make segno * RELSEG_SIZE over > uint32's max value. So is it necessary to make the error message more > readable like 'The relation size is over max value ...' and elog in > mdnblocks? I didn't try it but this is supposed to be caught by mdextend(): /* * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any * more --- we mustn't create a block whose number actually is * InvalidBlockNumber. (Note that this failure should be unreachable * because of upstream checks in bufmgr.c.) */ if (blocknum == InvalidBlockNumber) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("cannot extend file \"%s\" beyond %u blocks", relpath(reln->smgr_rnode, forknum), InvalidBlockNumber))); Didn't you hit this?
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
Hi, On Sun, Jan 02, 2022 at 02:55:04PM +0100, Fabien COELHO wrote: > > > Here is my review about v32: > > I forgot to tell that doc generation for the cumulated changes also works. Unfortunately the patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_2377.log === Applying patches on top of PostgreSQL commit ID 4483b2cf29bfe8091b721756928ccbe31c5c8e14 === === applying patch ./v32-0003-Add-pg_ls_dir_metadata-to-list-a-dir-with-file-m.patch [...] patching file src/test/regress/expected/misc_functions.out Hunk #1 succeeded at 274 (offset 7 lines). patching file src/test/regress/expected/tablespace.out Hunk #1 FAILED at 16. 1 out of 1 hunk FAILED -- saving rejects to file src/test/regress/expected/tablespace.out.rej Justin, could you send a rebased version?
Re: missing indexes in indexlist with partitioned tables
Hi, On Mon, Jan 17, 2022 at 08:32:40PM +, Arne Roland wrote: > > Afaiac the join pruning where the outer table is a partitioned table is the > relevant case. The last version of the patch now fails on all platforms, with plan changes. For instance: https://cirrus-ci.com/task/4825629131538432 https://api.cirrus-ci.com/v1/artifact/task/4825629131538432/regress_diffs/src/test/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/partition_join.out /tmp/cirrus-ci-build/src/test/regress/results/partition_join.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/partition_join.out	2022-01-17 23:08:47.158198249 +0000
+++ /tmp/cirrus-ci-build/src/test/regress/results/partition_join.out	2022-01-17 23:12:34.163488567 +0000
@@ -4887,37 +4887,23 @@
 SET enable_partitionwise_join = on;
 EXPLAIN (COSTS OFF) SELECT * FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY id ASC LIMIT 10;
-                           QUERY PLAN
------------------------------------------------------------------
+                      QUERY PLAN
+---------------------------------------------------------
 Limit
-   ->  Merge Append
-         Sort Key: x.id
-         ->  Merge Left Join
-               Merge Cond: (x_1.id = y_1.id)
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
-         ->  Merge Left Join
-               Merge Cond: (x_2.id = y_2.id)
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2
-(11 rows)
+   ->  Append
+         ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
+         ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
+(4 rows)
 
 EXPLAIN (COSTS OFF) SELECT * FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY id DESC LIMIT 10;
-                              QUERY PLAN
--------------------------------------------------------------------------
+                          QUERY PLAN
+------------------------------------------------------------------
 Limit
-   ->  Merge Append
-         Sort Key: x.id DESC
-         ->  Nested Loop Left Join
-               ->  Index Only Scan Backward using fract_t0_pkey on fract_t0 x_1
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
-                     Index Cond: (id = x_1.id)
-         ->  Nested Loop Left Join
-               ->  Index Only Scan Backward using fract_t1_pkey on fract_t1 x_2
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2
-                     Index Cond: (id = x_2.id)
-(11 rows)
+   ->  Append
+         ->  Index Only Scan Backward using fract_t1_pkey on fract_t1 x_2
+         ->  Index Only Scan Backward using fract_t0_pkey on fract_t0 x_1
+(4 rows)
Re: BUFFERS enabled by default in EXPLAIN (ANALYZE)
Hi, On Wed, Dec 01, 2021 at 06:58:20PM -0600, Justin Pryzby wrote: > > The attached patch series now looks like this (some minor patches are not > included in this list): This version of the patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_3409.log === Applying patches on top of PostgreSQL commit ID 4483b2cf29bfe8091b721756928ccbe31c5c8e14 === === applying patch ./0003-Make-explain-default-to-BUFFERS-TRUE.patch patching file doc/src/sgml/config.sgml Hunk #1 FAILED at 7615. 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/config.sgml.rej By the way, could you add versioning to the patchsets you send, e.g. git format-patch -vXXX. It would make it easier to know which version is being used or discussed. I'm switching the CF entry to Waiting on Author.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Tue, Jan 18, 2022 at 12:35 AM Robert Haas wrote: > > On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda wrote: > > Thanks, Robert for the updated version. I reviewed the changes and it > > looks fine. > > I also tested the patch. The patch works as expected. > > Thanks. > > > > - I adjusted the function header comment for heap_create. Your > > > proposed comment seemed like it was pretty detailed but not 100% > > > correct. It also made one of the lines kind of long because you didn't > > > wrap the text in the surrounding style. I decided to make it simpler > > > and shorter instead of longer still and 100% correct. > > > > The comment update looks fine. However, I still feel it would be good to > > mention on which (rare) circumstance a valid relfilenode can get passed. > > In general, I think it's the job of a function parameter comment to > describe what the parameter does, not how the callers actually use it. > One problem with describing the latter is that, if someone later adds > another caller, there is a pretty good chance that they won't notice > that the comment needs to be changed. More fundamentally, the > parameter function comments should be like an instruction manual for > how to use the function. If you are trying to figure out how to use > this function, it is not helpful to know that "most callers like to > pass false" for this parameter. What you need to know is what value > your new call site should pass, and knowing what "most callers" do or > that something is "rare" doesn't really help. If we want to make this > comment more detailed, we should approach it from the point of view of > explaining how it ought to be set. It's clear now. Thanks for clarifying. > I've committed the v8-0001 patch you attached. I'll write separately > about v8-0002. Sure. Thank you.
Re: Use generation context to speed up tuplesorts
Hi, On Fri, Jan 07, 2022 at 12:03:55PM +0100, Ronan Dunklau wrote: > > (Sorry for trying to merge back the discussion on the two sides of the thread) > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I > expressed the idea of being able to tune glibc's malloc behaviour. > > I implemented that (patch 0001) to provide a new hook which is called on > backend startup, and anytime we set work_mem. This hook is # defined > depending > on the malloc implementation: currently a default, no-op implementation is > provided as well as a glibc's malloc implementation. This patch apparently misses something to have the no-op on windows: https://cirrus-ci.com/task/4978247640285184 [03:03:01.676] Build FAILED. [03:03:01.676] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(19,15): warning C4013: 'mallinfo2' undefined; assuming extern returning int [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(18,19): error C2079: 'm' uses undefined struct 'mallinfo2' [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.arena' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.hblkhd' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.uordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.fordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.keepcost' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] I'm also a bit confused about which patch(es) should actually be reviewed in that thread. 
My understanding is that the original patch might not be relevant anymore, but I might be wrong. The CF entry should probably be updated to clarify that, with an annotation / title change / author update or something. In the meantime I will switch the entry to Waiting on Author.
Re: Support for NSS as a libpq TLS backend
Hi, On Mon, Jan 17, 2022 at 03:09:11PM +0100, Daniel Gustafsson wrote: > > I must've fat-fingered the "git add -p" for v50 as the fix was in configure.ac > but not configure. Fixed now. Thanks! Apparently this version now fails on all OS, e.g.: https://cirrus-ci.com/task/4643868095283200 [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in PEM format' [22:17:39.965] # at t/001_ssltests.pl line 456. [22:17:39.965] # got: '2' [22:17:39.965] # expected: '0' [22:17:39.965] [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in PEM format: no stderr' [22:17:39.965] # at t/001_ssltests.pl line 456. [22:17:39.965] # got: 'psql: error: connection to server at "127.0.0.1", port 50023 failed: certificate present, but not private key file "/home/postgres/.postgresql/postgresql.key"' [22:17:39.965] # expected: '' [22:17:39.965] [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in DER format' [22:17:39.965] # at t/001_ssltests.pl line 475. [22:17:39.965] # got: '2' [22:17:39.965] # expected: '0' [...]
Re: SQL/JSON: functions
Hi, The last version conflicts with recent c4cc2850f4d1 (Rename value node fields). Can you send a rebased version?
Re: SQL/JSON: JSON_TABLE
Hi, On Tue, Jan 04, 2022 at 09:03:05AM -0500, Andrew Dunstan wrote: > > rebased again. This version conflicts with recent c4cc2850f4d1 (Rename value node fields). Can you send a rebased version?
Re: simplifying foreign key/RI checks
Thanks for the review. On Tue, Dec 21, 2021 at 5:54 PM Zhihong Yu wrote: > Hi, > > + int lockflags = 0; > + TM_Result test; > + > + lockflags = TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS; > > The above assignment can be merged with the line where variable lockflags is > declared. Sure. > + GetUserIdAndSecContext(&save_userid, &save_sec_context); > > save_userid -> saved_userid > save_sec_context -> saved_sec_context I agree that's better, though I guess I had kept the names as they were in other functions. Fixed nevertheless. > +* the transaction-snapshot mode. If we didn't push one already, do > > didn't push -> haven't pushed Done. > For ri_PerformCheck(): > > + bool source_is_pk = true; > > It seems the value of source_is_pk doesn't change - the value true can be > plugged into ri_ExtractValues() calls directly. OK, done. v13 is attached. -- Amit Langote EDB: http://www.enterprisedb.com v13-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data v13-0001-Add-isolation-tests-for-snapshot-behavior-in-ri_.patch Description: Binary data
Re: In-placre persistance change of a relation
Hi, On Fri, Jan 14, 2022 at 11:43:10AM +0900, Kyotaro Horiguchi wrote: > I found a bug. > > mdmarkexists() didn't close the tentatively opened fd. This is a silent > leak on Linux and similar systems, and it causes a delete failure on Windows. > It was the reason for the CI failure. > > 027_persistence_change.pl uses interactive_psql(), which doesn't work on > the Windows VM on the CI. > > In this version the following changes have been made in 0001. > > - Properly close the file descriptor in mdmarkexists. > > - Skip some tests when IO::Pty is not available. > It might be better to separate that part. > > Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I > noticed that it doesn't implement the OWNED BY part and doesn't have tests > or documentation (it was a PoC). Added all of them to 0002. The cfbot is failing on all OS with this version of the patch. Apparently v16-0002 introduces some usage of a "testtablespace" client-side variable that's never defined, e.g. https://api.cirrus-ci.com/v1/artifact/task/4670105480069120/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs: diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out --- /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out 2022-01-18 04:26:38.744707547 +0000 +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out 2022-01-18 04:30:37.557078083 +0000 @@ -948,76 +948,71 @@ CREATE SCHEMA testschema; GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; +ERROR: syntax error at or near ":" +LINE 1: CREATE TABLESPACE regress_tablespace LOCATION :'testtablespa...
Re: tweak to a few index tests to hits ambuildempty() routine.
On Mon, Nov 29, 2021 at 10:34 AM Amul Sul wrote: > Hi, > > The attached patch makes small changes to the brin, gin & gist index tests > to use an unlogged table, without changing the original intention of > those tests, so that they hit the ambuildempty() routine which is > otherwise not reachable by the current tests. > +1 for the idea, as it gives better code coverage. > > -- > Regards, > Amul Sul > EDB: http://www.enterprisedb.com > -- Rushabh Lathia
32TB relation size make mdnblocks overflow
Hello

We know that PostgreSQL doesn't support a single relation size over 32TB, limited by MaxBlockNumber. But if we just 'insert into' one relation past 32TB, we get the error message 'unexpected data beyond EOF in block 0 of relation' in ReadBuffer_common. The '0 block' comes from the mdnblocks function, where the segment number exceeds 256 and makes segno * RELSEG_SIZE exceed uint32's max value. So is it necessary to make the error message more readable, like 'The relation size is over the max value ...', and elog in mdnblocks?

The scenario we hit is shown below: the 'mov %ebx,%eax; shl $0x18,%eax' sequence turns the 256 held in %ebx into 0, which makes segno wrap from 256 to 0.

   0x00c2cc51 <+289>:  callq  0xc657f0
   0x00c2cc56 <+294>:  mov    -0x8(%r15),%rdi
   0x00c2cc5a <+298>:  mov    %r15,%rsi
   0x00c2cc5d <+301>:  mov    %eax,%r14d
   0x00c2cc60 <+304>:  mov    0x10(%rdi),%rax
   0x00c2cc64 <+308>:  callq  *0x8(%rax)
   0x00c2cc67 <+311>:  test   %r14d,%r14d
   0x00c2cc6a <+314>:  jns    0xc2cd68
=> 0x00c2cc70 <+320>:  add    $0x28,%rsp
   0x00c2cc74 <+324>:  mov    %ebx,%eax
   0x00c2cc76 <+326>:  shl    $0x18,%eax
   0x00c2cc79 <+329>:  pop    %rbx
   0x00c2cc7a <+330>:  pop    %r12
   0x00c2cc7c <+332>:  pop    %r13
   0x00c2cc7e <+334>:  pop    %r14
   0x00c2cc80 <+336>:  pop    %r15
   0x00c2cc82 <+338>:  pop    %rbp
   0x00c2cc83 <+339>:  retq
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 2:41 PM Julien Rouhaud wrote: > On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote: > > On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > > > I'm not sure why this test failed as it doesn't seem like something > > > impacted by > > > the patch, but I may have missed something as I only had a quick look at > > > the > > > patch and discussion. > > > > This issue is discussed here: > > https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de > > Oh I missed it, thanks! Sorry for the noise. Thanks, it had puzzled me too when I first saw it this morning. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 2:37 PM osumi.takami...@fujitsu.com wrote: > > On Tuesday, January 18, 2022 1:39 PM Masahiko Sawada > wrote: > > I've attached an updated patch. All comments I got so far were incorporated > > into this patch unless I'm missing something. > > Hi, thank you for your new patch v7. > For your information, I've encountered a failure to apply patch v7 > on top of the latest commit (d3f4532) > > $ git am v7-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch > Applying: Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on > subscriber nodes > error: patch failed: src/backend/parser/gram.y:9954 > error: src/backend/parser/gram.y: patch does not apply > > Could you please rebase it when it's necessary ? Thank you for reporting! I've attached a rebased patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v8-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch Description: Binary data
Re: Push down time-related SQLValue functions to foreign server
> > Hmm ... not really, because for these particular functions, the > point is exactly that we *don't* translate them to some function > call on the remote end. We evaluate them locally and push the > resulting constant to the far side, thus avoiding issues like > clock skew. > Ah, my pattern matching brain was so excited to see a use for routine mapping that I didn't notice that important detail.
Re: Proposal: More structured logging
Le lundi 17 janvier 2022, 09:18:04 CET Ronan Dunklau a écrit : > Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > > Hi, > > > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > > Done, and I added anoher commit per your suggestion to add this comment. > > > > The cfbot reports that the patchset doesn't apply anymore: > > > > http://cfbot.cputube.org/patch_36_3293.log > > === Applying patches on top of PostgreSQL commit ID > > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > > [...] > > patching file src/backend/utils/error/elog.c > > Hunk #2 FAILED at 3014. > > 1 out of 2 hunks FAILED -- saving rejects to file > > src/backend/utils/error/elog.c.rej > > > > Could you send a rebased version? In the meantime I will switch the cf > > entry to Waiting on Author. > > Thank you for this notification ! > > The csvlog has been refactored since I wrote the patch, and the new jsonlog > destination has been introduced. I rebased to fix the first patch, and added > a new patch to add support for tags in jsonlog output. > > Best regards, Hum, there was a missing import in csvlog.c from the fix above. Sorry about that. 
-- Ronan Dunklau

v6-0001-Add-ErrorTag-support.patch ([PATCH v6 1/5] Add ErrorTag support; touches src/backend/utils/error/elog.c and src/include/utils/elog.h)
Description: inline patch text, garbled and truncated in the archive
Re: Push down time-related SQLValue functions to foreign server
Corey Huinker writes: > I'm very late to the party, but it seems to me that this effort is > describing a small subset of what "routine mapping" seems to be for: > defining function calls that can be pushed down to the foreign server, and > the analogous function on the foreign side. We may want to consider > implementing just enough of CREATE ROUTINE MAPPING and DROP ROUTINE MAPPING > to support these specific fixed functions. Hmm ... not really, because for these particular functions, the point is exactly that we *don't* translate them to some function call on the remote end. We evaluate them locally and push the resulting constant to the far side, thus avoiding issues like clock skew. Having said that: why couldn't that implementation sketch be used for ANY stable subexpression? What's special about the datetime SQLValueFunctions? regards, tom lane
Re: a misbehavior of partition row movement (?)
Hi, On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote: > On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > > I'm not sure why this test failed as it doesn't seem like something > > impacted by > > the patch, but I may have missed something as I only had a quick look at the > > patch and discussion. > > This issue is discussed here: > https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de Oh I missed it, thanks! Sorry for the noise.
RE: Skipping logical replication transactions on subscriber side
On Tuesday, January 18, 2022 1:39 PM Masahiko Sawada wrote: > I've attached an updated patch. All comments I got so far were incorporated > into this patch unless I'm missing something. Hi, thank you for your new patch v7. For your information, I've encountered a failure to apply patch v7 on top of the latest commit (d3f4532) $ git am v7-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch Applying: Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on subscriber nodes error: patch failed: src/backend/parser/gram.y:9954 error: src/backend/parser/gram.y: patch does not apply Could you please rebase it when it's necessary ? Best Regards, Takamichi Osumi
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > I'm not sure why this test failed as it doesn't seem like something impacted > by > the patch, but I may have missed something as I only had a quick look at the > patch and discussion. This issue is discussed here: https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de -- Michael signature.asc Description: PGP signature
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas wrote: > On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan wrote: > > That just seems like semantics to me. The very next sentence after the > > one you quoted in your reply was "And so it's highly unlikely that any > > given VACUUM will ever *completely* fail to advance relfrozenxid". > > It's continuous *within* each VACUUM. As far as I can tell there is > > pretty much no way that the patch series will ever fail to advance > > relfrozenxid *by at least a little bit*, barring pathological cases > > with cursors and whatnot. > > I mean this boils down to saying that VACUUM will advance relfrozenxid > except when it doesn't. It actually doesn't boil down, at all. The world is complicated and messy, whether we like it or not. > > I never said that anti-wraparound vacuums just won't happen anymore. I > > said that they'll be limited to cases like the stock table or > > customers table case. I was very clear on that point. > > I don't know how I'm supposed to sensibly respond to a statement like > this. If you were very clear, then I'm being deliberately obtuse if I > fail to understand. I don't know if I'd accuse you of being obtuse, exactly. Mostly I just think it's strange that you don't seem to take what I say seriously when it cannot be proven very easily. I don't think that you intend this to be disrespectful, and I don't take it personally. I just don't understand it. > > It isn't that hard to see that the cases where we continue to get any > > anti-wraparound VACUUMs with the patch seem to be limited to cases > > like the stock/customers table, or cases like the pathological idle > > cursor cases we've been discussing. Pretty narrow cases, overall. > > Don't take my word for it - see for yourself. > > I don't think that's really possible. Words like "narrow" and > "pathological" are value judgments, not factual statements. 
If I do an > experiment where no wraparound autovacuums happen, as I'm sure I can, > then those are the normal cases where the patch helps. If I do an > experiment where they do happen, as I'm sure that I also can, you'll > probably say either that the case in question is like the > stock/customers table, or that it's pathological. What will any of > this prove? You seem to be suggesting that I used words like "pathological" in some kind of highly informal, totally subjective way, when I did no such thing. I quite clearly said that you'll only get an anti-wraparound VACUUM with the patch applied when the only factor that *ever* causes *any* autovacuum worker to VACUUM the table (assuming the workload is stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With a table like this, even increasing autovacuum_freeze_max_age to its absolute maximum of 2 billion would not make it any more likely that we'd get a non-aggressive VACUUM -- it would merely make the anti-wraparound VACUUMs less frequent. No big change should be expected with a table like that. Also, since the patch is not magic, and doesn't even change the basic invariants for relfrozenxid, it's still true that any scenario in which it's fundamentally impossible for VACUUM to keep up will also have anti-wraparound VACUUMs. But that's the least of the user's trouble -- in the long run we're going to have the system refuse to allocate new XIDs with such a workload. The claim that I have made is 100% testable. Even if it was flat out incorrect, not getting anti-wraparound VACUUMs per se is not the important part. The important part is that the work is managed intelligently, and the burden is spread out over time. I am particularly concerned about the "freezing cliff" we get when many pages are all-visible but not also all-frozen. 
Consistently avoiding an anti-wraparound VACUUM (except with very particular workload characteristics) is really just a side effect -- it's something that makes the overall benefit relatively obvious, and relatively easy to measure. I thought that you'd appreciate that. -- Peter Geoghegan
Re: Push down time-related SQLValue functions to foreign server
> The implementation of converting now() to CURRENT_TIMESTAMP > seems like an underdocumented kluge, too. > I'm very late to the party, but it seems to me that this effort is describing a small subset of what "routine mapping" seems to be for: defining function calls that can be pushed down to the foreign server, and the analogous function on the foreign side. We may want to consider implementing just enough of CREATE ROUTINE MAPPING and DROP ROUTINE MAPPING to support these specific fixed functions.
Re: TAP test to cover "EndOfLogTLI != replayTLI" case
At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud wrote in
> Hi,
>
> On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
> >
> > Thanks for the updated patch! Note that thanks to Andres and Thomas work,
> > you can now easily rely on the exact same CI than the cfbot on your own
> > github repository, if you need to debug something on a platform you don't
> > have access to. You can check the documentation at [1] for more detail on
> > how to setup the CI.
>
> The cfbot reports that the patch still fails on Windows but also failed on
> macos with the same error: https://cirrus-ci.com/task/5655777858813952:
>
> [14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
> [14:20:43.950] #   at t/003_recovery_targets.pl line 239.
> [14:20:43.950] #          got: '6000'
> [14:20:43.950] #     expected: '7000'
> [14:20:43.950] # Looks like you failed 1 test of 10.
>
> I'm switching the CF entry to Waiting on Author.

The most significant advantages of the local CI setup are:

- CI immediately responds to push.
- You can dirty the code with additional logging aid as much as you like to
  see closely what is going on. It makes us hesitant to do the same on this
  ML:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 2:31 PM Amit Kapila wrote: > > On Tue, Jan 18, 2022 at 8:41 AM Greg Nancarrow wrote: > > > > On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com > > wrote: > > > > > > > (2) GetTopMostAncestorInPublication > > > > Is there a reason why there is no "break" after finding a > > > > topmost_relid? Why keep searching and potentially overwrite a > > > > previously-found topmost_relid? If it's intentional, I think that a > > > > comment should be added to explain it. > > > > > > The code was moved from get_rel_sync_entry, and was trying to get the > > > last oid in the ancestor list which is published by the publication. Do > > > you > > > have some suggestions for the comment ? > > > > > > > Maybe the existing comment should be updated to just spell it out like that: > > > > /* > > * Find the "topmost" ancestor that is in this publication, by getting the > > * last Oid in the ancestors list which is published by the publication. > > */ > > > > I am not sure that is helpful w.r.t what Peter is looking for as that > is saying what code is doing and he wants to know why it is so? I > think one can understand this by looking at get_partition_ancestors > which will return the top-most ancestor as the last element. I feel > either we can say see get_partition_ancestors or maybe explain how the > ancestors are stored in this list. > (note: I asked the original question about why there is no "break", not Peter) Maybe instead, an additional comment could be added to the GetTopMostAncestorInPublication function to say "Note that the ancestors list is ordered such that the topmost ancestor is at the end of the list". Unfortunately the get_partition_ancestors function currently doesn't explicitly say that the topmost ancestors are returned at the end of the list (I guess you could conclude it by then looking at get_partition_ancestors_worker code which it calls). 
Also, this leads me to wonder if searching the ancestors list backwards might be better here, and break at the first match? Perhaps there is only a small gain in doing that ... Regards, Greg Nancarrow Fujitsu Australia
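The two strategies being compared in GetTopMostAncestorInPublication can be sketched for illustration. This is a hypothetical Python stand-in for the C code (not pgoutput itself); it assumes, as the discussion notes, that the ancestors list is ordered bottom-up with the topmost ancestor as the last element, as get_partition_ancestors returns it:

```python
# Two equivalent ways to find the "topmost" ancestor that is in the
# publication, given a bottom-up ancestors list (topmost is LAST).


def topmost_forward(ancestors, published):
    """Current coding: scan forward without break, keep overwriting,
    so the last published ancestor in the list wins."""
    topmost = None
    for rel in ancestors:
        if rel in published:
            topmost = rel
    return topmost


def topmost_backward(ancestors, published):
    """Suggested variant: scan from the end and break (return) at the
    first published ancestor found."""
    for rel in reversed(ancestors):
        if rel in published:
            return rel
    return None
```

Both return the same ancestor, so the backward scan is purely an early-exit optimization; with typically shallow partition hierarchies the gain is indeed small, as noted above.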
Re: Null commitTS bug
At Tue, 18 Jan 2022 10:43:55 +0900, Michael Paquier wrote in
> On Sun, Jan 16, 2022 at 11:01:25PM -0500, Tom Lane wrote:
> > Isn't that a very bad way to write "i = j + 1"?
> >
> > I agree with Horiguchi-san that
> > 	for (i = 0, headxid = xid;;)
>
> Okay. Horiguchi-san, would you like to write a patch?

Yes, I will.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:20 PM osumi.takami...@fujitsu.com wrote: > > On Monday, January 17, 2022 9:52 PM Masahiko Sawada > wrote: > > Thank you for the comments! > .. > > > (2) Minor improvement suggestion of comment in > > > src/backend/replication/logical/worker.c > > > > > > + * reset during that. Also, we don't skip receiving the changes in > > > + streaming > > > + * cases, since we decide whether or not to skip applying the changes > > > + when > > > > > > I sugguest that you don't use 'streaming cases', because what > > > "streaming cases" means sounds a bit broader than actual your > > implementation. > > > We do skip transaction of streaming cases but not during the spooling > > > phase, > > right ? > > > > > > I suggest below. > > > > > > "We don't skip receiving the changes at the phase to spool streaming > > transactions" > > > > I might be missing your point but I think it's correct that we don't skip > > receiving > > the change of the transaction that is sent via streaming protocol. And it > > doesn't > > sound broader to me. Could you elaborate on that? > OK. Excuse me for lack of explanation. > > I felt "streaming cases" implies "non-streaming cases" > to compare a diffference (in my head) when it is > used to explain something at first. > I imagined the contrast between those, when I saw it. > > Thus, I thought "streaming cases" meant > whole flow of streaming transactions which consists of messages > surrounded by stream start and stream stop and which are finished by > stream commit/stream abort (including 2PC variations). 
> > When I come back to the subject, you wrote below in the comment > > "we don't skip receiving the changes in streaming cases, > since we decide whether or not to skip applying the changes > when starting to apply changes" > > The first part of this sentence > ("we don't skip receiving the changes in streaming cases") > gives me an impression where we don't skip changes in the streaming cases > (of my understanding above), but the last part > ("we decide whether or not to skip applying the changes > when starting to apply change") means we skip transactions for streaming at > apply phase. > > So, this sentence looked confusing to me slightly. > Thus, I suggested below (and when I connect it with existing part) > > "we don't skip receiving the changes at the phase to spool streaming > transactions > since we decide whether or not to skip applying the changes when starting to > apply changes" > > For me this looked better, but of course, this is a suggestion. Thank you for your explanation. I've modified the comment with some changes since "the phase to spool streaming transaction" seems not commonly be used in worker.c. > > > > > > > (3) in the comment of apply_handle_prepare_internal, two full-width > > characters. > > > > > > 3-1 > > > +* won’t be resent in a case where the server crashes between > > them. > > > > > > 3-2 > > > +* COMMIT PREPARED or ROLLBACK PREPARED. But that’s okay > > > + because this > > > > > > You have full-width characters for "won't" and "that's". > > > Could you please check ? > > > > Which characters in "won't" are full-width characters? I could not find > > them. > All characters I found and mentioned as full-width are single quotes. > > It might be good that you check the entire patch once > by some tool that helps you to detect it. Thanks! 
> > > > (5) > > > > > > I can miss something here but, in one of the past discussions, there > > > seems a consensus that if the user specifies XID of a subtransaction, > > > it would be better to skip only the subtransaction. > > > > > > This time, is it out of the range of the patch ? > > > If so, I suggest you include some description about it either in the > > > commit message or around codes related to it. > > > > How can the user know subtransaction XID? I suppose you refer to streaming > > protocol cases but while applying spooled changes we don't report > > subtransaction XID neither in server log nor pg_stat_subscription_workers. > Yeah, usually subtransaction XID is not exposed to the users. I agree. > > But, clarifying the target of this feature is only top-level transactions > sounds better to me. Thank you Amit-san for your support > about how we should write it in [1] ! Yes, I've included the sentence proposed by Amit in the latest patch. > > > > (6) > > > > > > I feel it's a better idea to include a test whether to skip aborted > > > streaming transaction clears the XID in the TAP test for this feature, > > > in a sense to cover various new code paths. Did you have any special > > > reason to omit the case ? > > > > Which code path is newly covered by this aborted streaming transaction > > tests? > > I think that this patch is already covered even by the test for a > > committed-and-streamed transaction. It doesn't matter whether the streamed > > transaction is committed or aborted because an error occurs while applying > > the > > spooled changes
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:04 PM tanghy.f...@fujitsu.com wrote: > > On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote: > > > > I've attached an updated patch. Please review it. > > > > Thanks for updating the patch. Few comments: > > 1) > /* Two_phase is only supported in v15 and higher */ > if (pset.sversion >= 15) > appendPQExpBuffer(&buf, > - ", subtwophasestate > AS \"%s\"\n", > - gettext_noop("Two > phase commit")); > + ", subtwophasestate > AS \"%s\"\n" > + ", subskipxid AS > \"%s\"\n", > + gettext_noop("Two > phase commit"), > + gettext_noop("Skip > XID")); > > appendPQExpBuffer(&buf, > ", subsynccommit AS > \"%s\"\n" > > I think "skip xid" should be mentioned in the comment. Maybe it could be > changed to: > "Two_phase and skip XID are only supported in v15 and higher" Added. > > 2) The following two places are not consistent in whether "= value" is > surround > with square brackets. > > +ALTER SUBSCRIPTION name SKIP ( > skip_option [= class="parameter">value] [, ... ] ) > > +SKIP ( class="parameter">skip_option = class="parameter">value [, ... ] ) > > Should we modify the first place to: > +ALTER SUBSCRIPTION name SKIP ( > skip_option = class="parameter">value [, ... ] ) > > Because currently there is only one skip_option - xid, and a parameter must be > specified when using it. Good catch. Fixed. > > 3) > +* Protect subskip_xid of pg_subscription from being concurrently > updated > +* while clearing it. > > "subskip_xid" should be "subskipxid" I think. Fixed. > > 4) > +/* > + * Start skipping changes of the transaction if the given XID matches the > + * transaction ID specified by skip_xid option. > + */ > > The option name was "skip_xid" in the previous version, and it is "xid" in > latest patch. So should we modify "skip_xid option" to "skip xid option", or > "skip option xid", or something else? > > Also the following place has similar issue: > + * the subscription if hte user has specified skip_xid. Once we start > skipping Fixed. 
I've attached an updated patch. All comments I got so far were incorporated into this patch unless I'm missing something.

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From 9faf874a7388368f86c500e1fef9616ecf86e5b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada
Date: Fri, 10 Dec 2021 14:41:30 +0900
Subject: [PATCH v7] Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on
 subscriber nodes

If incoming change violates any constraint, logical replication stops until it's resolved. This commit introduces another way to skip the transaction in question, other than manually updating the subscriber's database or using pg_replication_origin_advance().

The user can specify XID by ALTER SUBSCRIPTION ... SKIP (xid = XXX), updating pg_subscription.subskipxid field, telling the apply worker to skip the transaction. The apply worker skips all data modification changes within the specified transaction. After skipping the transaction the apply worker clears pg_subscription.subskipxid.
---
 doc/src/sgml/catalogs.sgml                 |  10 +
 doc/src/sgml/logical-replication.sgml      |  49 +++-
 doc/src/sgml/ref/alter_subscription.sgml   |  42
 src/backend/catalog/pg_subscription.c      |   1 +
 src/backend/commands/subscriptioncmds.c    |  53 +
 src/backend/parser/gram.y                  |   9 +
 src/backend/replication/logical/worker.c   | 265 -
 src/bin/pg_dump/pg_dump.c                  |   4 +
 src/bin/psql/describe.c                    |  10 +-
 src/bin/psql/tab-complete.c                |   8 +-
 src/include/catalog/pg_subscription.h      |   4 +
 src/include/nodes/parsenodes.h             |   3 +-
 src/test/regress/expected/subscription.out | 124 ++
 src/test/regress/sql/subscription.sql      |  17 ++
 src/test/subscription/t/028_skip_xact.pl   | 217 +
 15 files changed, 755 insertions(+), 61 deletions(-)
 create mode 100644 src/test/subscription/t/028_skip_xact.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2aeb2ef346..16f429b853 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7746,6 +7746,16 @@ SCRAM-SHA-256$:&l
+
+
+      subskipxid xid
+
+
+      ID of the transaction whose changes are to be skipped, if a valid
+      transaction ID; otherwise 0.
+
+
+
       subconninfo text

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/log
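The behavior the commit message describes can be summarized with a small state-machine sketch. This is a hypothetical Python illustration of the described flow (match the incoming XID against pg_subscription.subskipxid, suppress data-modification changes, clear the field afterwards), not the patch's actual C implementation:

```python
# Minimal sketch, assuming the behavior described in the commit message.
# Class and method names are illustrative, not from the patch.


class ApplyWorker:
    def __init__(self, subskipxid=0):
        self.subskipxid = subskipxid   # 0 means "no transaction to skip"
        self.skipping = False
        self.applied = []

    def begin(self, xid):
        # Start skipping when the transaction's XID matches subskipxid.
        self.skipping = (self.subskipxid != 0 and xid == self.subskipxid)

    def apply_change(self, change):
        # Data-modification changes within the skipped transaction are
        # suppressed; everything else is applied normally.
        if not self.skipping:
            self.applied.append(change)

    def commit(self):
        # After the skipped transaction finishes, clear subskipxid so
        # only that one transaction is ever skipped.
        if self.skipping:
            self.subskipxid = 0
            self.skipping = False
```

Under this sketch a transaction whose XID matches is silently dropped exactly once; the next transaction (even with a different error) replicates normally, which is why the field must be cleared after use.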
Re: libpq compression (part 2)
On Tue, Jan 18, 2022 at 02:06:32AM +0500, Daniil Zakhlystov wrote:
> > => Since March, errmsg doesn't need extra parenthesis around it (e3a87b4).
> I’ve resolved the stuck tests and added zlib support for CI Windows builds to
> patch 0003-*. Thanks for the suggestion, all tests seem to be OK now, except
> the macOS one. It won't schedule in Cirrus CI for some reason, but I guess it
> happens because of my GitHub account limitation.

I don't know about your github account, but it works for cfbot, which is now green. Thanks for implementing zlib for windows.

Did you try this with default compressions set to lz4 and zstd ?

The thread from 2019 is very long, and starts off with the guidance that compression had been implemented at the wrong layer. It looks like this hasn't changed since then. secure_read/write are passed as function pointers to the ZPQ interface, which then calls back to them to read and flush its compression buffers. As I understand, the suggestion was to leave the socket reads and writes alone. And then conditionally de/compress buffers after reading / before writing from the socket if compression was negotiated.

It's currently done like this:

  pq_recvbuf() => secure_read()           -- when compression is disabled
  pq_recvbuf() => ZPQ => secure_read()    -- when compression is enabled

Dmitri sent a partial, POC patch which changes the de/compression to happen in secure_read/write, which is changed to call ZPQ:
https://www.postgresql.org/message-id/ca+q6zcuprssnars+fyobsd-f2stk1roa-4sahfofajowlzi...@mail.gmail.com

  pq_recvbuf() => secure_read() => ZPQ

The same thing is true of the frontend: function pointers to pqsecure_read/write are being passed to zpq_create, and then the ZPQ interface called instead of the original functions. Those are the functions which read using SSL, so they should also handle compression. That's where SSL is handled, and it seems like the right place to handle compression. Have you evaluated that way to do things ?
Konstantin said he put ZPQ at that layer to 1) avoid code duplication between client/server; and 2) allow compression to happen before SSL, to allow both (if the admin decides it's okay). But I don't see why compression can't happen before sending to SSL, or after reading from it?

-- 
Justin
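As an illustration of the layering being suggested, where de/compression is folded into the secure read path so that callers like pq_recvbuf() need no changes, here is a hypothetical Python sketch with zlib standing in for whatever codec is negotiated (function names and signatures are invented for the sketch, not from libpq):

```python
# Sketch of "decompress just above the socket/SSL read": the caller keeps
# calling a single read function; compression is an internal detail.
import zlib


def secure_read(sock_buf, bufsize):
    """Stand-in for the raw (possibly SSL-wrapped) socket read: consume
    up to bufsize bytes from a bytearray acting as the wire."""
    data, sock_buf[:] = sock_buf[:bufsize], sock_buf[bufsize:]
    return bytes(data)


def make_decompressing_read(read_fn, compression_enabled):
    """Wrap the secure read so de/compression happens right after it,
    only when compression was negotiated; otherwise pass through."""
    if not compression_enabled:
        return read_fn
    d = zlib.decompressobj()

    def read(sock_buf, bufsize):
        return d.decompress(read_fn(sock_buf, bufsize))

    return read
```

The point of the sketch is the shape of the call chain: the caller sees `read()`, which internally does `secure_read()` then decompression, matching the `pq_recvbuf() => secure_read() => ZPQ` ordering described above rather than routing reads through the compression layer first.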
Re: TAP test to cover "EndOfLogTLI != replayTLI" case
Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
>
> Thanks for the updated patch! Note that thanks to Andres and Thomas work, you
> can now easily rely on the exact same CI than the cfbot on your own github
> repository, if you need to debug something on a platform you don't have access
> to. You can check the documentation at [1] for more detail on how to setup the
> CI.

The cfbot reports that the patch still fails on Windows but also failed on macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] #   at t/003_recovery_targets.pl line 239.
[14:20:43.950] #          got: '6000'
[14:20:43.950] #     expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.
Re: a misbehavior of partition row movement (?)
Hi,

On Mon, Jan 17, 2022 at 08:40:54PM +0900, Amit Langote wrote:
>
> Okay, I created versions of the patch series for branches 13 and 14
> (.txt files). The one for HEAD is also re-attached.

FYI the patch failed today on FreeBSD, while it was previously quite stable on all platforms (https://cirrus-ci.com/build/4551468081479680), with this error:
https://api.cirrus-ci.com/v1/artifact/task/6360787076775936/regress_diffs/src/test/recovery/tmp_check/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out 2022-01-18 00:12:52.533086000 +
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out 2022-01-18 00:28:00.000524000 +
@@ -133,7 +133,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

I'm not sure why this test failed as it doesn't seem like something impacted by the patch, but I may have missed something as I only had a quick look at the patch and discussion.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan wrote: > That just seems like semantics to me. The very next sentence after the > one you quoted in your reply was "And so it's highly unlikely that any > given VACUUM will ever *completely* fail to advance relfrozenxid". > It's continuous *within* each VACUUM. As far as I can tell there is > pretty much no way that the patch series will ever fail to advance > relfrozenxid *by at least a little bit*, barring pathological cases > with cursors and whatnot. I mean this boils down to saying that VACUUM will advance relfrozenxid except when it doesn't. > Actually, I think that even the people in the first category might > well have about the same improved experience. Not just because of this > patch series, mind you. It would also have a lot to do with the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to > mention the freeze map. What version are these users on? I think it varies. I expect the increase in the default cost limit to have had a much more salutary effect than autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At any rate, if you make the database big enough and generate dirty data fast enough, it doesn't matter what the default limits are. > I never said that anti-wraparound vacuums just won't happen anymore. I > said that they'll be limited to cases like the stock table or > customers table case. I was very clear on that point. I don't know how I'm supposed to sensibly respond to a statement like this. If you were very clear, then I'm being deliberately obtuse if I fail to understand. If I say you weren't very clear, then we're just contradicting each other. > It isn't that hard to see that the cases where we continue to get any > anti-wraparound VACUUMs with the patch seem to be limited to cases > like the stock/customers table, or cases like the pathological idle > cursor cases we've been discussing. Pretty narrow cases, overall. 
> Don't take my word for it - see for yourself. I don't think that's really possible. Words like "narrow" and "pathological" are value judgments, not factual statements. If I do an experiment where no wraparound autovacuums happen, as I'm sure I can, then those are the normal cases where the patch helps. If I do an experiment where they do happen, as I'm sure that I also can, you'll probably say either that the case in question is like the stock/customers table, or that it's pathological. What will any of this prove? I think we're reaching the point of diminishing returns in this conversation. What I want to know is that users aren't going to be harmed - even in cases where they have behavior that is like the stock/customers table, or that you consider pathological, or whatever other words we want to use to describe the weird things that happen to people. And I think we've made perhaps a bit of modest progress in exploring that issue, but certainly less than I'd like. I don't want to spend the next several days going around in circles about it though. That does not seem likely to make anyone happy. -- Robert Haas EDB: http://www.enterprisedb.com
Re: ICU for global collation
Hi, On Thu, Jan 13, 2022 at 09:39:42AM +0100, Peter Eisentraut wrote: > On 11.01.22 12:08, Julien Rouhaud wrote: > > > So, unless there are concerns, I'm going to see about making a patch to > > > call > > > pg_newlocale_from_collation() even with the default collation. That would > > > make the actual feature patch quite a bit smaller, since we won't have to > > > patch every call site of pg_newlocale_from_collation(). > > +1 for me! > > Here is that patch. The patch is quite straightforward so I don't have much to say, it all looks good and passes all regression tests. > If this is applied, then in my estimation all these hunks will completely > disappear from the global ICU patch. So this would be a significant win. Agreed, the patch will be quite smaller and also easier to review. Are you planning to apply it on its own or add it to the default ICU patchset? Given the possible backpatch conflicts it can bring I'm not sure it's worthy enough to apply on its own.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:37 PM Amit Kapila wrote:
>
> On Tue, Jan 18, 2022 at 8:34 AM tanghy.f...@fujitsu.com wrote:
> >
> > On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
> > >
> > 2) The following two places are not consistent in whether "= value" is
> > surrounded with square brackets.
> >
> > +ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )
> >
> > +SKIP ( skip_option = value [, ... ] )
> >
> > Should we modify the first place to:
> > +ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )
> >
> > Because currently there is only one skip_option - xid, and a parameter must
> > be specified when using it.
>
> Good observation. Do we really need [, ... ] as currently, we support
> only one value for XID?

I think no. In the doc, it should be:

ALTER SUBSCRIPTION name SKIP ( skip_option = value )

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: ICU for global collation
Hi, On Mon, Jan 17, 2022 at 07:07:38PM +, Finnerty, Jim wrote: > On 10.01.22 12:49, Daniel Verite wrote: > > > I think some users would want their db-wide ICU collation to be > > case/accent-insensitive. > ... > > IIRC, that was the context for some questions where people were > > enquiring about db-wide ICU collations. > > +1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default > collation, a.k.a. the "global collation", as distinct from the "db-wide" > database-level default collation, which controls the default type of the > collatable types within that database. There's no cluster-level default collation, and DEFAULT_COLLATION_OID is always database-level (and template1 having a specific default collation). The template0 database is there to be able to support different databases with different default collations, among other things. So this patchset would allow per-database default ICU collation, although not non-deterministic ones.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 8:34 AM tanghy.f...@fujitsu.com wrote:
>
> On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
> >
> 2) The following two places are not consistent in whether "= value" is
> surrounded with square brackets.
>
> +ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )
>
> +SKIP ( skip_option = value [, ... ] )
>
> Should we modify the first place to:
> +ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )
>
> Because currently there is only one skip_option - xid, and a parameter must be
> specified when using it.

Good observation. Do we really need [, ... ] as currently, we support only one value for XID?

-- 
With Regards,
Amit Kapila.
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 8:41 AM Greg Nancarrow wrote: > > On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com > wrote: > > > > > (2) GetTopMostAncestorInPublication > > > Is there a reason why there is no "break" after finding a > > > topmost_relid? Why keep searching and potentially overwrite a > > > previously-found topmost_relid? If it's intentional, I think that a > > > comment should be added to explain it. > > > > The code was moved from get_rel_sync_entry, and was trying to get the > > last oid in the ancestor list which is published by the publication. Do you > > have some suggestions for the comment ? > > > > Maybe the existing comment should be updated to just spell it out like that: > > /* > * Find the "topmost" ancestor that is in this publication, by getting the > * last Oid in the ancestors list which is published by the publication. > */ > I am not sure that is helpful w.r.t what Peter is looking for as that is saying what code is doing and he wants to know why it is so? I think one can understand this by looking at get_partition_ancestors which will return the top-most ancestor as the last element. I feel either we can say see get_partition_ancestors or maybe explain how the ancestors are stored in this list. -- With Regards, Amit Kapila.
RE: Skipping logical replication transactions on subscriber side
On Monday, January 17, 2022 9:52 PM Masahiko Sawada wrote:
> Thank you for the comments!
..
> > (2) Minor improvement suggestion of comment in
> > src/backend/replication/logical/worker.c
> >
> > + * reset during that. Also, we don't skip receiving the changes in streaming
> > + * cases, since we decide whether or not to skip applying the changes when
> >
> > I suggest that you don't use 'streaming cases', because what
> > "streaming cases" means sounds a bit broader than your actual implementation.
> > We do skip transaction of streaming cases but not during the spooling phase,
> > right ?
> >
> > I suggest below.
> >
> > "We don't skip receiving the changes at the phase to spool streaming
> > transactions"
>
> I might be missing your point but I think it's correct that we don't skip
> receiving the change of the transaction that is sent via streaming protocol.
> And it doesn't sound broader to me. Could you elaborate on that?

OK. Excuse me for lack of explanation.

I felt "streaming cases" implies "non-streaming cases" to compare a difference (in my head) when it is used to explain something at first. I imagined the contrast between those, when I saw it.

Thus, I thought "streaming cases" meant the whole flow of streaming transactions which consists of messages surrounded by stream start and stream stop and which are finished by stream commit/stream abort (including 2PC variations).

When I come back to the subject, you wrote below in the comment:

"we don't skip receiving the changes in streaming cases, since we decide whether or not to skip applying the changes when starting to apply changes"

The first part of this sentence ("we don't skip receiving the changes in streaming cases") gives me an impression where we don't skip changes in the streaming cases (of my understanding above), but the last part ("we decide whether or not to skip applying the changes when starting to apply change") means we skip transactions for streaming at apply phase.
So, this sentence looked confusing to me slightly. Thus, I suggested below (and when I connect it with existing part) "we don't skip receiving the changes at the phase to spool streaming transactions since we decide whether or not to skip applying the changes when starting to apply changes" For me this looked better, but of course, this is a suggestion. > > > > (3) in the comment of apply_handle_prepare_internal, two full-width > characters. > > > > 3-1 > > +* won’t be resent in a case where the server crashes between > them. > > > > 3-2 > > +* COMMIT PREPARED or ROLLBACK PREPARED. But that’s okay > > + because this > > > > You have full-width characters for "won't" and "that's". > > Could you please check ? > > Which characters in "won't" are full-width characters? I could not find them. All characters I found and mentioned as full-width are single quotes. It might be good that you check the entire patch once by some tool that helps you to detect it. > > (5) > > > > I can miss something here but, in one of the past discussions, there > > seems a consensus that if the user specifies XID of a subtransaction, > > it would be better to skip only the subtransaction. > > > > This time, is it out of the range of the patch ? > > If so, I suggest you include some description about it either in the > > commit message or around codes related to it. > > How can the user know subtransaction XID? I suppose you refer to streaming > protocol cases but while applying spooled changes we don't report > subtransaction XID neither in server log nor pg_stat_subscription_workers. Yeah, usually subtransaction XID is not exposed to the users. I agree. But, clarifying the target of this feature is only top-level transactions sounds better to me. Thank you Amit-san for your support about how we should write it in [1] ! 
> > (6) > > > > I feel it's a better idea to include a test whether to skip aborted > > streaming transaction clears the XID in the TAP test for this feature, > > in a sense to cover various new code paths. Did you have any special > > reason to omit the case ? > > Which code path is newly covered by this aborted streaming transaction tests? > I think that this patch is already covered even by the test for a > committed-and-streamed transaction. It doesn't matter whether the streamed > transaction is committed or aborted because an error occurs while applying the > spooled changes. Oh, this was my mistake. What I expressed as a new patch is apply_handle_stream_abort -> clear_subscription_skip_xid. But, this was totally wrong as you explained. > > > > (7) > > > > I want more explanation for the reason to restart the subscriber in > > the TAP test because this is not mandatory operation. > > (We can pass the TAP tests without this restart) > > > > From : > > # Restart the subscriber node to restart logical replication with no > > interval > > > > IIUC, below would be better. > > > > To : > > # As an optimization to finish test
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com wrote:
>
> > (2) GetTopMostAncestorInPublication
> > Is there a reason why there is no "break" after finding a
> > topmost_relid? Why keep searching and potentially overwrite a
> > previously-found topmost_relid? If it's intentional, I think that a
> > comment should be added to explain it.
>
> The code was moved from get_rel_sync_entry, and was trying to get the
> last oid in the ancestor list which is published by the publication. Do you
> have some suggestions for the comment ?

Maybe the existing comment should be updated to just spell it out like that:

/*
 * Find the "topmost" ancestor that is in this publication, by getting the
 * last Oid in the ancestors list which is published by the publication.
 */

Regards,
Greg Nancarrow
Fujitsu Australia
RE: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
>
> I've attached an updated patch. Please review it.

Thanks for updating the patch. Few comments:

1)
	/* Two_phase is only supported in v15 and higher */
	if (pset.sversion >= 15)
		appendPQExpBuffer(&buf,
-						  ", subtwophasestate AS \"%s\"\n",
-						  gettext_noop("Two phase commit"));
+						  ", subtwophasestate AS \"%s\"\n"
+						  ", subskipxid AS \"%s\"\n",
+						  gettext_noop("Two phase commit"),
+						  gettext_noop("Skip XID"));

	appendPQExpBuffer(&buf,
					  ", subsynccommit AS \"%s\"\n"

I think "skip xid" should be mentioned in the comment. Maybe it could be changed to:
"Two_phase and skip XID are only supported in v15 and higher"

2) The following two places are not consistent in whether "= value" is surrounded with square brackets.

+ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )

+SKIP ( skip_option = value [, ... ] )

Should we modify the first place to:
+ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )

Because currently there is only one skip_option - xid, and a parameter must be specified when using it.

3)
+	 * Protect subskip_xid of pg_subscription from being concurrently updated
+	 * while clearing it.

"subskip_xid" should be "subskipxid" I think.

4)
+/*
+ * Start skipping changes of the transaction if the given XID matches the
+ * transaction ID specified by skip_xid option.
+ */

The option name was "skip_xid" in the previous version, and it is "xid" in latest patch. So should we modify "skip_xid option" to "skip xid option", or "skip option xid", or something else?

Also the following place has similar issue:
+ * the subscription if hte user has specified skip_xid. Once we start skipping

Regards,
Tang
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 7:15 AM Alvaro Herrera wrote: > On 2022-Jan-17, Tom Lane wrote: > > But could we please do it in a way that is designed to keep the > > code readable, rather than to minimize the number of lines of diff? > > It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not > > be adjacent. So what should happen here is to renumber the symbols > > in between to move their bits over one place. > > Is it typical to enumerate bits starting from the right of each byte, > when doing it from the high bits of the word? DONE and IN_PROGRESS have > been defined as 0x1 and 0x2 of that byte for a very long time and I > found that very strange. I am inclined to count from the left, so I'd > pick 8 first, defining the set like this:
>
> #define AFTER_TRIGGER_OFFSET          0x07FFFFFF  /* must be low-order bits */
> #define AFTER_TRIGGER_DONE            0x80000000
> #define AFTER_TRIGGER_IN_PROGRESS     0x40000000
> /* bits describing the size and tuple sources of this event */
> #define AFTER_TRIGGER_FDW_REUSE       0x00000000
> #define AFTER_TRIGGER_FDW_FETCH       0x20000000
> #define AFTER_TRIGGER_1CTID           0x10000000
> #define AFTER_TRIGGER_2CTID           0x30000000
> #define AFTER_TRIGGER_CP_UPDATE       0x08000000
> #define AFTER_TRIGGER_TUP_BITS        0x38000000
Agree that the ultimate code readability trumps diff minimalism. :-) Would you like me to update the patch with the above renumbering or are you working on a new version yourself? -- Amit Langote EDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 8:02 AM Masahiko Sawada wrote: > > On Mon, Jan 17, 2022 at 10:15 PM Amit Kapila wrote: > > > > On Mon, Jan 17, 2022 at 6:22 PM Masahiko Sawada > > wrote: > > > > > > > > > > > (5) > > > > > > > > I can miss something here but, in one of > > > > the past discussions, there seems a consensus that > > > > if the user specifies XID of a subtransaction, > > > > it would be better to skip only the subtransaction. > > > > > > > > This time, is it out of the range of the patch ? > > > > If so, I suggest you include some description about it > > > > either in the commit message or around codes related to it. > > > > > > How can the user know subtransaction XID? I suppose you refer to > > > streaming protocol cases but while applying spooled changes we don't > > > report subtransaction XID neither in server log nor > > > pg_stat_subscription_workers. > > > > > > > I also think in the current system users won't be aware of > > subtransaction's XID but I feel Osumi-San's point is valid that we > > should at least add it in docs that we allow to skip only top-level > > xacts. Also, in the future, it won't be impossible to imagine that we > > can have subtransaction's XID info also available to users as we have > > that in the case of streaming xacts (See subxact_data). > > Fair point and more accurate, but I'm a bit concerned that using these > words could confuse the user. There are some places in the doc where > we use the words “top-level transaction” and "sub transactions” but > these are not commonly used in the doc. The user normally would not be > aware that sub transactions are used to implement SAVEPOINTs. Also, > the publisher's subtransaction ID doesn’t appear anywhere on the > subscriber. So if we want to mention it, I think we should use other > words instead of them but I don’t have a good idea for that. Do you > have any ideas? 
> How about changing existing text: + Specifies the ID of the transaction whose changes are to be skipped + by the logical replication worker. Setting NONE + resets the transaction ID. to Specifies the top-level transaction identifier whose changes are to be skipped by the logical replication worker. We don't support skipping individual subtransactions. Setting NONE resets the transaction ID. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 10:36 AM Greg Nancarrow wrote: > > On Mon, Jan 17, 2022 at 5:18 PM Masahiko Sawada wrote: > > > > I've attached an updated patch. Please review it. > > > > Some review comments for the v6 patch: Thank you for the comments! > > > doc/src/sgml/logical-replication.sgml > > (1) Expanded output > > Since the view output is shown in "expanded output" mode, perhaps the > doc should say that, or alternatively add the following lines prior to > it, to make it clear: > > postgres=# \x > Expanded display is on. I'm not sure it's really necessary. A similar example would be perform.sgml but it doesn't say "\x". > > > (2) Message output in server log > > The actual CONTEXT text now just says "at ..." instead of "with commit > timestamp ...", so the doc needs to be updated as follows: > > BEFORE: > +CONTEXT: processing remote data during "INSERT" for replication > target relation "public.test" in transaction 716 with commit timestamp > 2021-09-29 15:52:45.165754+00 > AFTER: > +CONTEXT: processing remote data during "INSERT" for replication > target relation "public.test" in transaction 716 at 2021-09-29 > 15:52:45.165754+00 Will fix. > > (3) > The wording "the change" doesn't seem right here, so I suggest the > following update: > > BEFORE: > + Skipping the whole transaction includes skipping the change that > may not violate > AFTER: > + Skipping the whole transaction includes skipping changes that may > not violate > > > doc/src/sgml/ref/alter_subscription.sgml Will fix. > > (4) > I have a number of suggested wording improvements: > > BEFORE: > + Skips applying changes of the particular transaction. If incoming data > + violates any constraints the logical replication will stop until it is > + resolved. The resolution can be done either by changing data on the > + subscriber so that it doesn't conflict with incoming change or > by skipping > + the whole transaction. 
The logical replication worker skips all data > + modification changes within the specified transaction including > the changes > + that may not violate the constraint, so, it should only be used as a > last > + resort. This option has no effect on the transaction that is already > + prepared by enabling two_phase on subscriber. > > AFTER: > + Skips applying all changes of the specified transaction. If > incoming data > + violates any constraints, logical replication will stop until it is > + resolved. The resolution can be done either by changing data on the > + subscriber so that it doesn't conflict with incoming change or > by skipping > + the whole transaction. Using the SKIP option, the logical > replication worker skips all data > + modification changes within the specified transaction, including > changes > + that may not violate the constraint, so, it should only be used as a > last > + resort. This option has no effect on transactions that are already > + prepared by enabling two_phase on the subscriber. > Will fix. > > (5) > change -> changes > > BEFORE: > + subscriber so that it doesn't conflict with incoming change or > by skipping > AFTER: > + subscriber so that it doesn't conflict with incoming changes or > by skipping Will fix. > > > src/backend/replication/logical/worker.c > > (6) Missing word? > The following should say "worth doing" or "worth it"? > > + * doesn't seem to be worth since it's a very minor case. > WIll fix > > src/test/regress/sql/subscription.sql > > (7) Misleading test case > I think the following test case is misleading and should be removed, > because the "1.1" xid value is only regarded as invalid because "1" is > an invalid xid (and there's already a test case for a "1" xid) - the > fractional part gets thrown away, and doesn't affect the validity > here. > >+ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1.1); > Good point. Will remove. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: row filtering for logical replication
On Mon, Jan 17, 2022 at 9:00 PM houzj.f...@fujitsu.com wrote: > > On Mon, Jan 17, 2022 12:34 PM Peter Smith wrote: > > > > Here are some review comments for v65-0001 (review of updates since > > v64-0001) > > Thanks for the comments! > > > ~~~ > > > > 1. src/include/commands/publicationcmds.h - rename func > > > > +extern bool contain_invalid_rfcolumn(Oid pubid, Relation relation, > > +List *ancestors, AttrNumber *invalid_rfcolumn); > > > > I thought that function should be called "contains_..." instead of > > "contain_...". > > > > ~~~ > > > > 2. src/backend/commands/publicationcmds.c - rename funcs > > > > Suggested renaming (same as above #1). > > > > "contain_invalid_rfcolumn_walker" --> "contains_invalid_rfcolumn_walker" > > "contain_invalid_rfcolumn" --> "contains_invalid_rfcolumn" > > > > Also, update it in the comment for rf_context: > > +/* > > + * Information used to validate the columns in the row filter > > +expression. See > > + * contain_invalid_rfcolumn_walker for details. > > + */ > > I am not sure about the name because many existing > functions are named contain_xxx_xxx. > (for example contain_mutable_functions) > I also see many similar functions whose name start with contain_* like contain_var_clause, contain_agg_clause, contain_window_function, etc. So, it is probably okay to retain the name as it is in the patch. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 at 10:15 PM Amit Kapila wrote: > > On Mon, Jan 17, 2022 at 6:22 PM Masahiko Sawada wrote: > > > > > > > > (5) > > > > > > I can miss something here but, in one of > > > the past discussions, there seems a consensus that > > > if the user specifies XID of a subtransaction, > > > it would be better to skip only the subtransaction. > > > > > > This time, is it out of the range of the patch ? > > > If so, I suggest you include some description about it > > > either in the commit message or around codes related to it. > > > > How can the user know subtransaction XID? I suppose you refer to > > streaming protocol cases but while applying spooled changes we don't > > report subtransaction XID neither in server log nor > > pg_stat_subscription_workers. > > > > I also think in the current system users won't be aware of > subtransaction's XID but I feel Osumi-San's point is valid that we > should at least add it in docs that we allow to skip only top-level > xacts. Also, in the future, it won't be impossible to imagine that we > can have subtransaction's XID info also available to users as we have > that in the case of streaming xacts (See subxact_data). Fair point and more accurate, but I'm a bit concerned that using these words could confuse the user. There are some places in the doc where we use the words “top-level transaction” and "sub transactions” but these are not commonly used in the doc. The user normally would not be aware that sub transactions are used to implement SAVEPOINTs. Also, the publisher's subtransaction ID doesn’t appear anywhere on the subscriber. So if we want to mention it, I think we should use other words instead of them but I don’t have a good idea for that. Do you have any ideas? > > Few minor points: > === > 1. > + * the subscription if hte user has specified skip_xid. > > Typo. /hte/the Will fix. > > 2. > + * PREPARED won’t be resent but subskipxid is left. > > In diffmerge tool, won't is showing some funny chars. 
When I manually > removed 't and added it again, everything is fine. I am not sure why > it is so? I think Osumi-San has also raised this complaint. Oh I didn't realize that. I'll check it again by using diffmerge tool. > > 3. > + /* > + * We don't expect that the user set the XID of the transaction that is > + * rolled back but if the skip XID is set, clear it. > + */ > > /user set/user to set/ Will fix. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set
On Sat, Jan 15, 2022 at 01:52:39PM +0800, Julien Rouhaud wrote: > libpq environment variable PGHOST has a non-local server value: > C:/Users/ContainerAdministrator/AppData/Local/Temp/FhBIlsw6SV > Failure, exiting > not ok 3 - run of pg_upgrade for new instance There are two things here, as far as I understand: 1) This is a valid Windows path. So shouldn't we fix pg_upgrade's server.c to be a bit more compliant with Windows paths? The code accepts only paths beginning with '/' as local paths, so this breaks. 2) It looks safer in the long run to disable completely PGHOST and PGHOSTADDR when running the pg_upgrade command in the test, and we'd better not use Cluster::command_ok() or we would fall down to each node's local environment. This could be done in the tests as of the attached, I guess, and this would bypass the problem coming from 1). The patch needed a refresh as --make-testtablespace-dir has been removed as of d6d317d. -- Michael From 5b4c3f279781418aa4bc24689342b41926d681e4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jan 2022 11:13:09 +0900 Subject: [PATCH v7] Switch tests of pg_upgrade to use TAP --- src/bin/pg_upgrade/.gitignore| 5 + src/bin/pg_upgrade/Makefile | 23 +- src/bin/pg_upgrade/TESTING | 33 ++- src/bin/pg_upgrade/t/001_basic.pl| 9 + src/bin/pg_upgrade/t/002_pg_upgrade.pl | 286 +++ src/bin/pg_upgrade/test.sh | 279 -- src/test/perl/PostgreSQL/Test/Cluster.pm | 25 ++ src/tools/msvc/vcregress.pl | 94 +--- 8 files changed, 366 insertions(+), 388 deletions(-) create mode 100644 src/bin/pg_upgrade/t/001_basic.pl create mode 100644 src/bin/pg_upgrade/t/002_pg_upgrade.pl delete mode 100644 src/bin/pg_upgrade/test.sh diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore index 2d3bfeaa50..3b64522ab6 100644 --- a/src/bin/pg_upgrade/.gitignore +++ b/src/bin/pg_upgrade/.gitignore @@ -7,3 +7,8 @@ /loadable_libraries.txt /log/ /tmp_check/ + +# Generated by pg_regress +/sql/ +/expected/ +/results/ diff --git 
a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index 44d06be5a6..35b6c123a5 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -28,6 +28,12 @@ OBJS = \ override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) +# required for 002_pg_upgrade.pl +REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB +REGRESS_OUTPUTDIR=$(abs_top_builddir)/src/bin/pg_upgrade +export REGRESS_OUTPUTDIR + all: pg_upgrade pg_upgrade: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils @@ -49,17 +55,8 @@ clean distclean maintainer-clean: pg_upgrade_dump_globals.sql \ pg_upgrade_dump_*.custom pg_upgrade_*.log -# When $(MAKE) is present, make automatically infers that this is a -# recursive make. which is not actually what we want here, as that -# e.g. prevents output synchronization from working (as make thinks -# that the subsidiary make knows how to deal with that itself, but -# we're invoking a shell script that doesn't know). Referencing -# $(MAKE) indirectly avoids that behaviour. 
-# See https://www.gnu.org/software/make/manual/html_node/MAKE-Variable.html#MAKE-Variable -NOTSUBMAKEMAKE=$(MAKE) +check: + $(prove_check) -check: test.sh all temp-install - MAKE=$(NOTSUBMAKEMAKE) $(with_temp_install) bindir=$(abs_top_builddir)/tmp_install/$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) $< - -# installcheck is not supported because there's no meaningful way to test -# pg_upgrade against a single already-running server +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING index 78b9747908..43a71566e2 100644 --- a/src/bin/pg_upgrade/TESTING +++ b/src/bin/pg_upgrade/TESTING @@ -2,21 +2,30 @@ THE SHORT VERSION - On non-Windows machines, you can execute the testing process -described below by running +described below by running the following command in this directory: make check -in this directory. This will run the shell script test.sh, performing -an upgrade from the version in this source tree to a new instance of -the same version. -To test an upgrade from a different version, you must have a built -source tree for the old version as well as this version, and you -must have done "make install" for both versions. Then do: +This will run the TAP tests to run pg_upgrade, performing an upgrade +from the version in this source tree to a new instance of the same +version. + +To test an upgrade from a different version, there are two options +available: + +1) You have a built source tree for the old version as well as this +version's binaries. Then set up the following variables before +launching the test: export oldsrc=...s
Re: Null commitTS bug
On Sun, Jan 16, 2022 at 11:01:25PM -0500, Tom Lane wrote: > Isn't that a very bad way to write "i = j + 1"? > > I agree with Horiguchi-san that > for (i = 0, headxid = xid;;) Okay. Horiguchi-san, would you like to write a patch? -- Michael
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:34 PM Alvaro Herrera wrote: > > On 2022-Jan-14, James Coleman wrote: > > > The logical slot can't flush past the > > last commit, so even if there's 100s of megabytes of unflushed WAL on > > the slot there may be zero lag (in terms of what's possible to > > process). > > > > I've attached a simple patch (sans tests and documentation) to get > > feedback early. After poking around this afternoon it seemed to me > > that the simplest approach was to hook into the commit timestamps > > infrastructure and store the commit's XLogRecPtr in the cache of the > > most recent value (but of course don't write it out to disk). > > Maybe it would work to have a single LSN in shared memory, as an atomic > variable, which uses monotonic advance[1] to be updated. Whether this is > updated or not would depend on a new GUC, maybe track_latest_commit_lsn. > Causing performance pain during transaction commit is not great, but at > least this way it shouldn't be *too* a large hit. > > [1] part of a large patch at > https://www.postgresql.org/message-id/202111222156.xmo2yji5ifi2%40alvherre.pgsql I'd be happy to make it a separate GUC, though it seems adding an additional atomic access is worse (assuming we can convince ourselves putting this into the commit timestamps infrastructure is acceptable) given here we're already under a lock. Thanks, James Coleman
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:20 PM Robert Haas wrote: > > On Fri, Jan 14, 2022 at 7:42 PM James Coleman wrote: > > I've attached a simple patch (sans tests and documentation) to get > > feedback early. After poking around this afternoon it seemed to me > > that the simplest approach was to hook into the commit timestamps > > infrastructure and store the commit's XLogRecPtr in the cache of the > > most recent value (but of course don't write it out to disk). That the > > downside of making this feature dependent on "track_commit_timestamps > > = on", but that seems reasonable: > > > > 1. Getting the xid of the last commit is similarly dependent on commit > > timestamps infrastructure. > > 2. It's a simple place to hook into and avoids new shared data and locking. > > > > Thoughts? > > It doesn't seem great to me. It's making commit_ts do something other > than commit timestamps, which looks kind of ugly. I wondered about that, but commit_ts already does more than commit timestamps by recording the xid of the last commit. For that matter, keeping a cache of last commit metadata in shared memory is arguably not obviously implied by "track_commit_timestamps", which leads to the below... > In general, I'm concerned about the cost of doing something like this. > Extra shared memory updates as part of the process of committing a > transaction are not (and can't be made) free. It seems to me that to the degree there's a hot path concern here we ought to separate out the last commit metadata caching from the "track_commit_timestamps" feature (at least in terms of how it's controlled by GUCs). If that were done we could also, in theory, allow controlling which items are tracked to reduce hot path cost if only a subset is needed. For that matter it'd also allow turning on this metadata caching without enabling the commit timestamp storage. 
I'm curious, though: I realize it's in the hot path, and I realize that there's an accretive cost to even small features, but given we're already paying the lock cost and updating memory in what is presumably the same cache line, would you expect this cost to be clearly measurable? Thanks, James Coleman
Re: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 at 5:18 PM Masahiko Sawada wrote: > > I've attached an updated patch. Please review it. > Some review comments for the v6 patch: doc/src/sgml/logical-replication.sgml (1) Expanded output Since the view output is shown in "expanded output" mode, perhaps the doc should say that, or alternatively add the following lines prior to it, to make it clear: postgres=# \x Expanded display is on. (2) Message output in server log The actual CONTEXT text now just says "at ..." instead of "with commit timestamp ...", so the doc needs to be updated as follows: BEFORE: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 with commit timestamp 2021-09-29 15:52:45.165754+00 AFTER: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 at 2021-09-29 15:52:45.165754+00 (3) The wording "the change" doesn't seem right here, so I suggest the following update: BEFORE: + Skipping the whole transaction includes skipping the change that may not violate AFTER: + Skipping the whole transaction includes skipping changes that may not violate doc/src/sgml/ref/alter_subscription.sgml (4) I have a number of suggested wording improvements: BEFORE: + Skips applying changes of the particular transaction. If incoming data + violates any constraints the logical replication will stop until it is + resolved. The resolution can be done either by changing data on the + subscriber so that it doesn't conflict with incoming change or by skipping + the whole transaction. The logical replication worker skips all data + modification changes within the specified transaction including the changes + that may not violate the constraint, so, it should only be used as a last + resort. This option has no effect on the transaction that is already + prepared by enabling two_phase on subscriber. AFTER: + Skips applying all changes of the specified transaction. 
If incoming data + violates any constraints, logical replication will stop until it is + resolved. The resolution can be done either by changing data on the + subscriber so that it doesn't conflict with incoming change or by skipping + the whole transaction. Using the SKIP option, the logical replication worker skips all data + modification changes within the specified transaction, including changes + that may not violate the constraint, so, it should only be used as a last + resort. This option has no effect on transactions that are already + prepared by enabling two_phase on the subscriber. (5) change -> changes BEFORE: + subscriber so that it doesn't conflict with incoming change or by skipping AFTER: + subscriber so that it doesn't conflict with incoming changes or by skipping src/backend/replication/logical/worker.c (6) Missing word? The following should say "worth doing" or "worth it"? + * doesn't seem to be worth since it's a very minor case. src/test/regress/sql/subscription.sql (7) Misleading test case I think the following test case is misleading and should be removed, because the "1.1" xid value is only regarded as invalid because "1" is an invalid xid (and there's already a test case for a "1" xid) - the fractional part gets thrown away, and doesn't affect the validity here. +ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1.1); Regards, Greg Nancarrow Fujitsu Australia
Re: Refactoring of compression options in pg_basebackup
On Mon, Jan 17, 2022 at 12:48:12PM -0500, Robert Haas wrote: > Alvaro's proposal is fine with me. I don't see any value in replacing > --compress with --compression. It's longer but not superior in any way > that I can see. Having both seems worst of all -- that's just > confusing. Okay, that looks like a consensus, then. Robert, would it be better to gather all that on the thread that deals with the server-side compression? Doing that here would be fine by me, with the option to only specify the client. Now it would be a bit weird to do things with only the client part and not the server part :) -- Michael
Re: generic plans and "initial" pruning
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote wrote: > On Thu, Jan 6, 2022 at 3:45 PM Amul Sul wrote: > > Here are few comments for v1 patch: > > Thanks Amul. I'm thinking about Robert's latest comments addressing > which may need some rethinking of this whole design, but I decided to > post a v2 taking care of your comments. cfbot tells me there is an unused variable warning, which is fixed in the attached v3. -- Amit Langote EDB: http://www.enterprisedb.com v3-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch
Re: row filtering for logical replication
Here are some review comments for v66-0001 (review of updates since v65-0001) ~~~ 1. src/backend/catalog/pg_publication.c - GetTopMostAncestorInPublication @@ -276,17 +276,45 @@ GetPubPartitionOptionRelations(List *result, PublicationPartOpt pub_partopt, } /* + * Returns the relid of the topmost ancestor that is published via this + * publication if any, otherwise return InvalidOid. + */ Suggestion: "otherwise return InvalidOid." --> "otherwise returns InvalidOid." ~~~ 2. src/backend/commands/publicationcmds.c - contain_invalid_rfcolumn_walker @@ -235,6 +254,337 @@ CheckObjSchemaNotAlreadyInPublication(List *rels, List *schemaidlist, } /* + * Returns true, if any of the columns used in the row filter WHERE clause are + * not part of REPLICA IDENTITY, false, otherwise. Suggestion: ", false, otherwise" --> ", otherwise returns false." ~~~ 3. src/backend/replication/logical/tablesync.c - fetch_remote_table_info + * We do need to copy the row even if it matches one of the publications, + * so, we later combine all the quals with OR. Suggestion: BEFORE * We do need to copy the row even if it matches one of the publications, * so, we later combine all the quals with OR. AFTER * We need to copy the row even if it matches just one of the publications, * so, we later combine all the quals with OR. ~~~ 4. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_exec_expr + ret = ExecEvalExprSwitchContext(state, econtext, &isnull); + + elog(DEBUG3, "row filter evaluates to %s (isnull: %s)", + DatumGetBool(ret) ? "true" : "false", + isnull ? "false" : "true"); + + if (isnull) + return false; + + return DatumGetBool(ret); That change to the logging looks incorrect - the "(isnull: %s)" value is backwards now. I guess maybe the intent was to change it something like below: elog(DEBUG3, "row filter evaluates to %s (isnull: %s)", isnull ? "false" : DatumGetBool(ret) ? "true" : "false", isnull ? "true" : "false"); -- Kind Regards, Peter Smith. Fujitsu Australia
Re: docs: pg_replication_origin_oid() description does not match behaviour
On Tue, Jan 18, 2022 at 10:19:41AM +0900, Ian Lawrence Barwick wrote: > Given that the code has remained unchanged since the function was > introduced in 9.5, it seems reasonable to change the documentation > to match the function behaviour rather than the other way round. Obviously. I'll go fix that as you suggest, if there are no objections. -- Michael
docs: pg_replication_origin_oid() description does not match behaviour
Hi From the documentation for pg_replication_origin_oid() [1]: > Looks up a replication origin by name and returns the internal ID. > If no such replication origin is found an error is thrown. However, it actually returns NULL if the origin does not exist: postgres=# SELECT * FROM pg_replication_origin; roident | roname -+ (0 rows) postgres=# SELECT pg_replication_origin_oid('foo'), pg_replication_origin_oid('foo') IS NULL; pg_replication_origin_oid | ?column? ---+-- | t (1 row) Given that the code has remained unchanged since the function was introduced in 9.5, it seems reasonable to change the documentation to match the function behaviour rather than the other way round. Regards Ian Barwick [1] https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-REPLICATION-TABLE -- EnterpriseDB: https://www.enterprisedb.com commit 9e0d549002d97b20261554c5f45ad5149916994a Author: Ian Barwick Date: Tue Jan 18 09:52:16 2022 +0900 doc: fix pg_replication_origin_oid() description If the specified replication origin does not exist, NULL is returned; an error is not thrown. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a270f89dfe..027dfa4b82 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26524,7 +26524,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); Looks up a replication origin by name and returns the internal ID. If -no such replication origin is found an error is thrown. +no such replication origin is found, NULL is returned.
Re: drop tablespace failed when location contains .. on win32
On Tue, Jan 18, 2022 at 01:08:01AM +, wangsh.f...@fujitsu.com wrote: > Yes, I will send a new version before next weekend Thanks! -- Michael
RE: drop tablespace failed when location contains .. on win32
Hi > This patch is a wanted bugfix and has been waiting for an update for 2 months. > > Do you plan to send a new version soon? Yes, I will send a new version before next weekend Regards Shenhao Wang
Re: \d with triggers: more than one row returned by a subquery used as an expression
I wrote: > Justin Pryzby writes: >> Is there any reason why WITH ORDINALITY can't work ? >> This is passing the smoke test. > How hard did you try to break it? It still seems to me that > this can be fooled by an unrelated trigger with the same tgname. Hmm ... no, it does work, because we'll stop at the first trigger with tgparentid = 0, so unrelated triggers further up the partition stack don't matter. But this definitely requires commentary. (And I'm not too happy with burying such a complicated thing inside a conditional inside a printf, either.) Will see about cleaning it up. regards, tom lane
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
It seems the man page of TCP_USER_TIMEOUT does not align with reality then. When I use it on my local machine it is effectively used as a connection timeout too. The second command times out after two seconds: sudo iptables -A INPUT -p tcp --destination-port 5432 -j DROP psql 'host=localhost tcp_user_timeout=2000' The keepalive settings only apply once you get to the recv however. And yes, it is pretty unlikely for the connection to break right when it is waiting for data. But it has happened for us. And when it happens it is really bad, because the process will be blocked forever. Since it is a blocking call. After investigation when this happened it seemed to be a combination of a few things making this happen: 1. The way citus uses cancelation requests: A Citus query on the coordinator creates multiple connections to a worker and with 2PC for distributed transactions. If one connection receives an error it sends a cancel request for all others. 2. When a machine is under heavy CPU or memory pressure things don't work well: i. errors can occur pretty frequently, causing lots of cancels to be sent by Citus. ii. postmaster can be slow in handling new cancelation requests. iii. Our failover system can think the node is down, because health checks are failing. 3. Our failover system effectively cuts the power and the network of the primary when it triggers a fail over to the secondary This all together can result in a cancel request being interrupted right at that wrong moment. And when it happens a distributed query on the Citus coordinator, becomes blocked forever. We've had queries stuck in this state for multiple days. The only way to get out of it at that point is either by restarting postgres or manually closing the blocked socket (either with ss or gdb). Jelte
Re: \d with triggers: more than one row returned by a subquery used as an expression
Justin Pryzby writes: > On Mon, Jan 17, 2022 at 05:02:00PM -0500, Tom Lane wrote: >> ISTM the real problem is the assumption that only related triggers could >> share a tgname, which evidently isn't true. I think this query needs to >> actually match on tgparentid, rather than taking shortcuts. > I don't think that should be needed - tgparentid should match > pg_partition_ancestors(). Uh, what? tgparentid is a trigger OID, not a relation OID. > Is there any reason why WITH ORDINALITY can't work ? > This is passing the smoke test. How hard did you try to break it? It still seems to me that this can be fooled by an unrelated trigger with the same tgname. regards, tom lane
Re: \d with triggers: more than one row returned by a subquery used as an expression
On Mon, Jan 17, 2022 at 05:02:00PM -0500, Tom Lane wrote: > Justin Pryzby writes: > > On Fri, Dec 17, 2021 at 09:43:56AM -0600, Justin Pryzby wrote: > >> I want to mention that the 2nd problem I mentioned here is still broken. > >> https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com > >> It happens if non-inheritted triggers on child and parent have the same > >> name. > > > This is the fix I was proposing > > It depends on pg_partition_ancestors() to return its partitions in order: > > this partition => parent => ... => root. > > I don't think that works at all. I might be willing to accept the > assumption about pg_partition_ancestors()'s behavior, but you're also > making an assumption about how the output of pg_partition_ancestors() > is joined to "pg_trigger AS u", and I really don't think that's safe. > ISTM the real problem is the assumption that only related triggers could > share a tgname, which evidently isn't true. I think this query needs to > actually match on tgparentid, rather than taking shortcuts. I don't think that should be needed - tgparentid should match pg_partition_ancestors(). > If we don't > want to use a recursive CTE, maybe we could define it as only looking up to > the immediate parent, rather than necessarily finding the root? I think that defeats the intent of c33869cc3. Is there any reason why WITH ORDINALITY can't work ? This is passing the smoke test. -- Justin >From 99f8ee921258d9b2e75299e69826004fda3b8919 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 3 Apr 2020 22:43:26 -0500 Subject: [PATCH 1/2] psql: Add newlines and spacing in trigger query for \d cosmetic change to c33869cc3 to make the output of psql -E look pretty. 
---
 src/bin/psql/describe.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 61cec118aca..ed9f320b015 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2993,12 +2993,12 @@ describeOneTableDetails(const char *schemaname,
 					  "t.tgenabled, t.tgisinternal, %s\n"
 					  "FROM pg_catalog.pg_trigger t\n"
 					  "WHERE t.tgrelid = '%s' AND ",
-					  (pset.sversion >= 13 ?
-					   "(SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass"
-					   " FROM pg_catalog.pg_trigger AS u, "
-					   " pg_catalog.pg_partition_ancestors(t.tgrelid) AS a"
-					   " WHERE u.tgname = t.tgname AND u.tgrelid = a.relid"
-					   " AND u.tgparentid = 0) AS parent" :
+					  (pset.sversion >= 13 ? "\n"
+					   " (SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass\n"
+					   "  FROM pg_catalog.pg_trigger AS u,\n"
+					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) AS a\n"
+					   "  WHERE u.tgname = t.tgname AND u.tgrelid = a.relid\n"
+					   "AND u.tgparentid = 0) AS parent" :
 					   "NULL AS parent"),
 					  oid);
-- 
2.17.1

>From d866624abbd523afdbc3a539f14c935c42fc345f Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Fri, 16 Jul 2021 19:57:00 -0500
Subject: [PATCH 2/2] psql: fix \d for statement triggers with same name on
 partition and its parent

This depends on pg_partition_ancestors() to return its partitions in
order: this partition => parent => ... => root.

20210716193319.gs20...@telsasoft.com
---
 src/bin/psql/describe.c                |  4 ++--
 src/test/regress/expected/triggers.out | 13 +++++++++++++
 src/test/regress/sql/triggers.sql      |  4 ++++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ed9f320b015..8787849e8be 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2996,9 +2996,9 @@ describeOneTableDetails(const char *schemaname,
 					  (pset.sversion >= 13 ? "\n"
 					   " (SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass\n"
 					   "  FROM pg_catalog.pg_trigger AS u,\n"
-					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) AS a\n"
+					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) WITH ORDINALITY AS a(relid,depth)\n"
 					   "  WHERE u.tgname = t.tgname AND u.tgrelid = a.relid\n"
-					   "AND u.tgparentid = 0) AS parent" :
+					   "AND u.tgparentid = 0 ORDER BY depth LIMIT 1) AS parent" :
 					   "NULL AS parent"),
 					  oid);

diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 5c0e7c2b79e..9278cc07237 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -2122,6 +2122,19 @@ Triggers:
 alter table trigpart attach partition trigpart3 FOR VALUES FROM (2000) to (3000); -- fail
 ERROR:  trigger "trg1" for relation "trigpart3" already exists
 drop table trigpart3;
+create trigger samename after delete on trigpart execute function trigger_nothing();
+create trigger samename after delete on trigpart1 execute function trigger_nothing();
+\d trigpart1
+ Table "public.
Re: a misbehavior of partition row movement (?)
Alvaro Herrera writes: > On 2022-Jan-17, Tom Lane wrote: >> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not >> be adjacent. So what should happen here is to renumber the symbols >> in between to move their bits over one place. > Is it typical to enumerate bits starting from the right of each byte, > when doing it from the high bits of the word? DONE and IN_PROGRESS have > been defined as 0x1 and 0x2 of that byte for a very long time and I > found that very strange. I am inclined to count from the left, so I'd > pick 8 first, defining the set like this: Doesn't matter to me either way, as long as the values look like they were all defined by the same person ;-) > (The fact that FDW_REUSE bits is actually an empty mask comes from > 7cbe57c34dec, specifically [1]) That seemed a bit odd to me too, but this is not the patch to be changing it in, I suppose. regards, tom lane
Re: Push down time-related SQLValue functions to foreign server
Alexander Pyhalov writes: >> Perhaps in a MACRO? > Changed this check to a macro, also fixed condition in > is_foreign_param() and added test for it. > Also fixed comment in prepare_query_params(). I took a quick look at this. I'm unconvinced that you need the TIME_RELATED_SQLVALUE_FUNCTION macro, mainly because I think testing that in is_foreign_param() is pointless. The only way we'll be seeing a SQLValueFunction in is_foreign_param() is if we decided it was shippable, so you really don't need two duplicate tests. (In the same vein, I would not bother with including a switch in deparseSQLValueFunction that knows about these opcodes explicitly. Just use the typmod field; exprTypmod() does.) I also find it pretty bizarre that contain_unsafe_functions isn't placed beside its one caller. Maybe that indicates that is_foreign_expr is mis-located and should be in shippable.c. More generally, it's annoying that you had to copy-and-paste all of contain_mutable_functions to make this. That creates a rather nasty maintenance hazard for future hackers, who will need to keep these widely-separated functions in sync. Not sure what to do about that though. Do we want to extend contain_mutable_functions itself to cover this use-case? The test cases seem a bit overkill --- what is the point of the two nigh-identical PREPARE tests, or the GROUP BY test? If anything is broken about GROUP BY, surely it's not specific to this patch. 
I'm not really convinced by the premise of 0002, particularly this bit:

 static bool
-contain_mutable_functions_checker(Oid func_id, void *context)
+contain_unsafe_functions_checker(Oid func_id, void *context)
 {
-	return (func_volatile(func_id) != PROVOLATILE_IMMUTABLE);
+	/* now() is stable, but we can ship it as it's replaced by parameter */
+	return !(func_volatile(func_id) == PROVOLATILE_IMMUTABLE || func_id == F_NOW);
 }

The point of the check_functions_in_node callback API is to verify the mutability of functions that are embedded in various sorts of expression nodes ... but they might not be in a plain FuncExpr node, which is the only case you'll deparse correctly. It might be that now() cannot legally appear in any of the other node types that check_functions_in_node knows about, but I'm not quite convinced of that, and even less convinced that that'll stay true as we add more expression node types. Also, if we commit this, for sure some poor soul will try to expand the logic to some other stable function that *can* appear in those contexts, and that'll be broken. The implementation of converting now() to CURRENT_TIMESTAMP seems like an underdocumented kluge, too. On the whole I'm a bit inclined to drop 0002; I'm not sure it's worth the trouble.

regards, tom lane
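The idea in the quoted checker can be sketched with a toy expression walker (hypothetical names and a stand-in catalog, not the real pg_proc or Node machinery): immutable functions are shippable, and stable now() gets a special pass because it is replaced by a parameter before the query is sent to the remote server. Note the toy only handles one "func" node shape, which is exactly the limitation being pointed out.

```python
PROVOLATILE_IMMUTABLE = "i"

# stand-in catalog: function name -> provolatile ('i'mmutable/'s'table/'v'olatile)
CATALOG = {"lower": "i", "now": "s", "random": "v"}
F_NOW = "now"

def contain_unsafe_functions(expr):
    """expr is a nested tuple: ("func", name, *args) or ("const", value).
    Returns True if the tree contains a function that is not safe to ship."""
    if expr[0] == "func":
        name, args = expr[1], expr[2:]
        # immutable is shippable; stable now() is special-cased because it
        # is sent as a parameter rather than deparsed
        if not (CATALOG[name] == PROVOLATILE_IMMUTABLE or name == F_NOW):
            return True
        return any(contain_unsafe_functions(a) for a in args)
    return False
```

A real implementation must also handle operator, coercion, and other node types that embed function calls, which is where the special case becomes fragile.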
Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Hi Aliaksandr,

Nice work on this. I've been following it a bit since the regression was noted, and it sparked renewed interest in R-tree structure and optimization for me.

As for ideas: I'm not deep into the details of postgresql and gist, but I've learned that the node size for gist indexes is equal to the page size, and as such the nodes are quite large in this context. This is what caused the overly flat tree when increasing the fill factor, if I'm not mistaken, and why it helps to decrease the fill factor. I think there is large potential if the tree node size could be tweaked to fit the use case, combining a higher fill factor with optimal tree depth. It's data dependent, so it would even make sense to make it an index parameter, if possible. There might be some deep reason in the architecture that I'm unaware of that could make it difficult to affect the node size, but regardless, I believe there could be a substantial win if node size could be controlled.

Regards,
Björn

On Mon, 17 Jan 2022 at 23:46, Aliaksandr Kalenik wrote: > Hey! > > Postgres 14 introduces an option to create a GiST index using a sort > method. It allows to create indexes much faster but as it had been > mentioned in sort support patch discussion the faster build performance > comes at cost of higher degree of overlap between pages than for indexes > built with regular method. > > Sort support was implemented for GiST opclass in PostGIS but eventually > got removed as default behaviour in latest 3.2 release because as it had > been discovered by Paul Ramsey > https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html > performance of queries might degrade by 50%. > > Together with Darafei Praliaskouski, Andrey Borodin and me we tried > several approaches to solve query performance degrade: > >- The first attempt was try to decide whether to make a split >depending on direction of curve (Z-curve for Postgres geometry type, >Hilbert curve for PostGIS).
It was implemented by filling a page until >fillfactor / 2 and then checking the penalty for every next item, keeping >inserting into the current page if the penalty is 0 or starting a new page if the penalty is >not 0. It turned out that with this approach the index becomes significantly >larger whereas page overlap still remains high. >- Andrey Borodin implemented LRU + split: a fixed amount of pages are >kept in memory and the best candidate page to insert the next item in is >selected by minimum penalty among these pages. If the best page for >insertion is full, it gets split into multiple pages, and if the amount >of candidate pages after split exceeds the limit, the pages insertion to >which has not happened recently are flushed. > https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73. >We made some tests and while query performance using the index built with >this approach is fine, the size of the index is extremely large. > > Eventually we made an implementation of an idea outlined in the sort support > patch discussion here > https://www.postgresql.org/message-id/flat/08173bd0-488d-da76-a904-912c35da4...@iki.fi#09ac9751a4cde897c99b99b2170faf3a > that several pages can be collected and then divided into actual index > pages by calling picksplit. My benchmarks with data provided in > postgis-devel show that query performance using the index built with patched > sort support is comparable with the performance of a query using an index built with > the regular method. The size of the index also matches the size of an index built with > the non-sorting method. > > It should be noted that with the current implementation of the sorting > build method, pages are always filled up to fillfactor. This patch changes > this behavior to what it would be if using a non-sorting method, and pages > are not always filled to fillfactor for the sake of query performance. I'm > interested in improving it and I wonder if there are any ideas on this.
> > > Benchmark summary: > > > create index roads_rdr_idx on roads_rdr using gist (geom); > > > with sort support before patch / CREATE INDEX 76.709 ms > > with sort support after patch / CREATE INDEX 225.238 ms > > without sort support / CREATE INDEX 446.071 ms > > > select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom; > > > with sort support before patch / SELECT 5766.526 ms > > with sort support after patch / SELECT 2646.554 ms > > without sort support / SELECT 2721.718 ms > > > index size > > > with sort support before patch / IDXSIZE 2940928 bytes > > with sort support after patch / IDXSIZE 4956160 bytes > > without sort support / IDXSIZE 5447680 bytes > > More detailed: > > Before patch using sorted method: > > > postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using > gist(geom); > > CREATE INDEX > > Time: 76.709 ms > > > postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && > b.geom; > > count > > > > 505806 > > (1 row) > > > Time:
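For readers following along, the "direction of curve" discussed in this thread refers to the order produced by a space-filling curve key; a minimal two-dimensional Z-order (Morton) key can be sketched as follows. Consecutive runs of this order can still yield overlapping page bounding boxes, which is what the collect-several-pages-then-picksplit approach addresses.

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) key.
    The sorted GiST build orders entries along such a curve
    (Z-curve for Postgres geometry, Hilbert curve in PostGIS)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key
```

Sorting points by this key keeps spatially close points mostly adjacent, but adjacent runs can still jump across the plane at curve boundaries, producing the page overlap measured in the benchmarks above.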
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas wrote: > On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan wrote: > > Updating relfrozenxid should now be thought of as a continuous thing, > > not a discrete thing. > > I think that's pretty nearly 100% wrong. The most simplistic way of > expressing that is to say - clearly it can only happen when VACUUM > runs, which is not all the time. That just seems like semantics to me. The very next sentence after the one you quoted in your reply was "And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid". It's continuous *within* each VACUUM. As far as I can tell there is pretty much no way that the patch series will ever fail to advance relfrozenxid *by at least a little bit*, barring pathological cases with cursors and whatnot. > That's a bit facile, though; let me > try to say something a little smarter. There are real production > systems that exist today where essentially all vacuums are > anti-wraparound vacuums. And there are also real production systems > that exist today where virtually none of the vacuums are > anti-wraparound vacuums. So if we ship your proposed patches, the > frequency with which relfrozenxid gets updated is going to increase by > a large multiple, perhaps 100x, for the second group of people, who > will then perceive the movement of relfrozenxid to be much closer to > continuous than it is today even though, technically, it's still a > step function. But the people in the first category are not going to > see any difference at all. Actually, I think that even the people in the first category might well have about the same improved experience. Not just because of this patch series, mind you. It would also have a lot to do with the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to mention the freeze map. What version are these users on? I have actually seen this for myself. 
With BenchmarkSQL, the largest table (the order lines table) starts out having its autovacuums driven entirely by autovacuum_vacuum_insert_scale_factor, even though there is a fair amount of bloat from updates. It stays like that for hours on HEAD. But even with my reasonably tuned setup, there is eventually a switchover point. Eventually all autovacuums end up as aggressive anti-wraparound VACUUMs -- this happens once the table gets sufficiently large (this is one of the two that is append-only, with one update to every inserted row from the delivery transaction, which happens hours after the initial insert). With the patch series, we have a kind of virtuous circle with freezing and with advancing relfrozenxid with the same order lines table. As far as I can tell, we fix the problem with the patch series. Because there are about 10 tuples inserted per new order transaction, the actual "XID consumption rate of the table" is much lower than the "worst case XID consumption" for such a table. It's also true that even with the patch we still get anti-wraparound VACUUMs for two fixed-size, hot-update-only tables: the stock table, and the customers table. But that's no big deal. It only happens because nothing else will ever trigger an autovacuum, no matter the autovacuum_freeze_max_age setting. > And therefore the reasoning that says - anti-wraparound vacuums just > aren't going to happen any more - or - relfrozenxid will advance > continuously seems like dangerous wishful thinking to me. I never said that anti-wraparound vacuums just won't happen anymore. I said that they'll be limited to cases like the stock table or customers table case. I was very clear on that point. With pgbench, whether or not you ever see any anti-wraparound VACUUMs will depend on how heap fillfactor for the accounts table -- set it low enough (maybe to 90) and you will still get them, since there won't be any other reason to VACUUM. 
As for the branches table, and the tellers table, they'll get VACUUMs in any case, regardless of heap fillfactor. And so they'll always advance relfrozenxid during each VACUUM, and never have even one anti-wraparound VACUUM. > It's only > true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need > not be true in any particular environment, which to me means that all > conclusions based on the idea that it has to be true are pretty > dubious. There's no doubt in my mind that advancing relfrozenxid > opportunistically is a good idea. However, I'm not sure how reasonable > it is to change any other behavior on the basis of the fact that we're > doing it, because we don't know how often it really happens. It isn't that hard to see that the cases where we continue to get any anti-wraparound VACUUMs with the patch seem to be limited to cases like the stock/customers table, or cases like the pathological idle cursor cases we've been discussing. Pretty narrow cases, overall. Don't take my word for it - see for yourself. -- Peter Geoghegan
Re: Pluggable toaster
Hi,

>This sounds interesting, but very much like column compression, which
>was proposed some time ago. If we haven't made much progress with that
>patch (AFAICS), what's the likelihood we'll succeed here, when it's
>combined with yet more complexity?

The main concern is that this patch provides an open API for Toast functionality, so new Toasters could be written as extensions and plugged in instead of the default one, for any column (or datatype). That was not possible before, because Toast functionality was part of the Postgres core code and was not meant to be modified.

>Maybe doing that kind of compression in TOAST is somehow simpler, but I
>don't see it.

A [custom] Toaster extension itself is not restricted to only toasting data; it could be used for compression, encryption, etc. - any case when data is meant to be transformed in some complicated way before being stored and (possibly) transformed back while being selected from the table, along with some simpler but not so obvious transformations, like removing common parts shared by all data in a column before storing it and restoring the column value in full during selection.

>Seems you'd need a mapping table, to allow M:N mapping between types and
>toasters, linking it to all "compatible" types. It's not clear to me how
>would this work with custom data types, domains etc.

Any suitable [custom] Toaster could be plugged in for any table column, or [debatable] for a datatype and assigned by default to the according column of any table using this datatype.

>Also, what happens to existing values when you change the toaster? What
>if the toasters don't use the same access method to store the chunks
>(heap vs. btree)? And so on.

For newer data there is no problem - it will be toasted by the newly assigned Toaster. For detoasting - the Toaster ID is stored in the Toast pointer, and in principle all data could be detoasted by the according toaster if it is available.
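The dispatch-by-toaster-ID idea can be sketched as a tiny registry (a purely hypothetical API, not the patch's actual C interface): the pointer carries the ID of the toaster that produced it, and detoasting fails cleanly when that toaster is no longer installed, which is exactly the open question in the thread.

```python
TOASTERS = {}  # toaster_id -> (toast_fn, detoast_fn); stand-in registry

def register_toaster(toaster_id, toast_fn, detoast_fn):
    TOASTERS[toaster_id] = (toast_fn, detoast_fn)

def toast(value, toaster_id):
    toast_fn, _ = TOASTERS[toaster_id]
    # the "toast pointer" records which toaster produced the payload
    return (toaster_id, toast_fn(value))

def detoast(pointer):
    toaster_id, payload = pointer
    if toaster_id not in TOASTERS:
        # the problem case: the producing toaster was uninstalled or replaced
        raise LookupError("toaster %r is not available" % toaster_id)
    return TOASTERS[toaster_id][1](payload)
```

The two mitigations discussed above map directly onto this model: either forbid unregistering a toaster that still owns data, or re-toast every affected datum under the new toaster before dropping the old one.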
But this is a topic for discussion and we are open to proposals, because there are possible cases where the older Toaster is not available - the previously used Toaster extension is not installed at all, was uninstalled, or was upgraded to a newer version. Currently we see two ways of handling this case - restricting changing the toaster, or re-toasting all toasted data, which could be very heavy if a Toaster is assigned to a widely used datatype - and we're looking forward to any ideas.

>So you suggest we move all of this to toaster? I'd say -1 to that,
>because it makes it much harder to e.g. add custom compression method, etc.

Original compression methods, etc. are not affected by this patch.

Regards,
-- 
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

On Mon, Jan 17, 2022 at 10:23 PM Tomas Vondra wrote: > > > On 1/14/22 19:41, Teodor Sigaev wrote: > > > >> In my understanding, we want to be able to > >> 1. Access data from a toasted object one slice at a time, by using > >> knowledge of the structure > >> 2. If toasted data is updated, then update a minimum number of > >> slice(s), without rewriting the existing slices > >> 3. If toasted data is expanded, then allow new slices to be appended to > >> the object without rewriting the existing slices > > > > There are more options: > > 1 share common parts between not only versions of row but between all > > rows in a column. Seems strange but examples: > >- urls often have a common prefix and so storing in a prefix tree (as > > SP-GiST does) allows significantly decrease storage size > >- the same for json - it's often use case with common part of its > > hierarchical structure > >- one more usecase for json. If json use only a few schemes > > (structure) it's possible to store in toast storage only values and > > don't store keys and structure > > This sounds interesting, but very much like column compression, which > was proposed some time ago.
If we haven't made much progrees with that > patch (AFAICS), what's the likelihood we'll succeed here, when it's > combined with yet more complexity? > > Maybe doing that kind of compression in TOAST is somehow simpler, but I > don't see it. > > > 2 Current toast storage stores chunks in heap accesses method and to > > provide fast access by toast id it makes an index. Ideas: > >- store chunks directly in btree tree, pgsql's btree already has an > > INCLUDE columns, so, chunks and visibility data will be stored only > > in leaf pages. Obviously it reduces number of disk's access for > > "untoasting". > >- use another access method for chunk storage > > > > Maybe, but that probably requires more thought - e.g. btree requires the > values to be less than 1/3 page, so I wonder how would that play with > toasting of values. > > >> ISTM that we would want the toast algorithm to be associated with the > >> datatype, not the column? > >> Can you explain your thinking? > > Hm. I'll try to explain my motivation. > > 1) Datatype could have more th
Re: a misbehavior of partition row movement (?)
On 2022-Jan-17, Tom Lane wrote:

> But could we please do it in a way that is designed to keep the
> code readable, rather than to minimize the number of lines of diff?
> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
> be adjacent. So what should happen here is to renumber the symbols
> in between to move their bits over one place.

Is it typical to enumerate bits starting from the right of each byte, when doing it from the high bits of the word? DONE and IN_PROGRESS have been defined as 0x1 and 0x2 of that byte for a very long time and I found that very strange. I am inclined to count from the left, so I'd pick 8 first, defining the set like this:

#define AFTER_TRIGGER_OFFSET		0x07FFFFFF	/* must be low-order bits */
#define AFTER_TRIGGER_DONE			0x80000000
#define AFTER_TRIGGER_IN_PROGRESS	0x40000000
/* bits describing the size and tuple sources of this event */
#define AFTER_TRIGGER_FDW_REUSE		0x00000000
#define AFTER_TRIGGER_FDW_FETCH		0x20000000
#define AFTER_TRIGGER_1CTID			0x10000000
#define AFTER_TRIGGER_2CTID			0x30000000
#define AFTER_TRIGGER_CP_UPDATE		0x08000000
#define AFTER_TRIGGER_TUP_BITS		0x38000000

(The fact that the FDW_REUSE bit pattern is actually an empty mask comes from 7cbe57c34dec, specifically [1])

Is this what you were thinking?

[1] https://postgr.es/m/20140306033644.ga3527...@tornado.leadboat.com

-- 
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
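Tom's adjacency requirement can be sanity-checked mechanically. A quick Python sketch of Alvaro's proposed layout (assuming the hex constants, truncated in the mail archive, are the usual 32-bit masks, e.g. 0x07FF... is 0x07FFFFFF):

```python
AFTER_TRIGGER_OFFSET      = 0x07FFFFFF  # must be low-order bits
AFTER_TRIGGER_DONE        = 0x80000000
AFTER_TRIGGER_IN_PROGRESS = 0x40000000
AFTER_TRIGGER_FDW_REUSE   = 0x00000000
AFTER_TRIGGER_FDW_FETCH   = 0x20000000
AFTER_TRIGGER_1CTID       = 0x10000000
AFTER_TRIGGER_2CTID       = 0x30000000
AFTER_TRIGGER_CP_UPDATE   = 0x08000000
AFTER_TRIGGER_TUP_BITS    = 0x38000000

def layout_ok() -> bool:
    # the tuple-source bits are adjacent: TUP_BITS covers exactly
    # FDW_FETCH | 1CTID | CP_UPDATE, and 2CTID is FDW_FETCH | 1CTID
    adjacent = AFTER_TRIGGER_TUP_BITS == (AFTER_TRIGGER_FDW_FETCH
                                          | AFTER_TRIGGER_1CTID
                                          | AFTER_TRIGGER_CP_UPDATE)
    composed = AFTER_TRIGGER_2CTID == (AFTER_TRIGGER_FDW_FETCH
                                       | AFTER_TRIGGER_1CTID)
    # and they overlap neither the offset field nor DONE/IN_PROGRESS
    disjoint = ((AFTER_TRIGGER_TUP_BITS & AFTER_TRIGGER_OFFSET) == 0
                and (AFTER_TRIGGER_TUP_BITS
                     & (AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS)) == 0)
    return adjacent and composed and disjoint
```

Under those assumed values, all three properties hold, so CP_UPDATE fits in the bit released from the offset field without disturbing the others.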
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan wrote: > Updating relfrozenxid should now be thought of as a continuous thing, > not a discrete thing. I think that's pretty nearly 100% wrong. The most simplistic way of expressing that is to say - clearly it can only happen when VACUUM runs, which is not all the time. That's a bit facile, though; let me try to say something a little smarter. There are real production systems that exist today where essentially all vacuums are anti-wraparound vacuums. And there are also real production systems that exist today where virtually none of the vacuums are anti-wraparound vacuums. So if we ship your proposed patches, the frequency with which relfrozenxid gets updated is going to increase by a large multiple, perhaps 100x, for the second group of people, who will then perceive the movement of relfrozenxid to be much closer to continuous than it is today even though, technically, it's still a step function. But the people in the first category are not going to see any difference at all. And therefore the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me. It's only true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need not be true in any particular environment, which to me means that all conclusions based on the idea that it has to be true are pretty dubious. There's no doubt in my mind that advancing relfrozenxid opportunistically is a good idea. However, I'm not sure how reasonable it is to change any other behavior on the basis of the fact that we're doing it, because we don't know how often it really happens. If someone says "every time I travel to Europe on business, I will use the opportunity to bring you back a nice present," you can't evaluate how much impact that will have on your life without knowing how often they travel to Europe on business. 
And that varies radically from "never" to "a lot" based on the person. -- Robert Haas EDB: http://www.enterprisedb.com
Re: \d with triggers: more than one row returned by a subquery used as an expression
Justin Pryzby writes: > On Fri, Dec 17, 2021 at 09:43:56AM -0600, Justin Pryzby wrote: >> I want to mention that the 2nd problem I mentioned here is still broken. >> https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com >> It happens if non-inheritted triggers on child and parent have the same name. > This is the fix I was proposing > It depends on pg_partition_ancestors() to return its partitions in order: > this partition => parent => ... => root. I don't think that works at all. I might be willing to accept the assumption about pg_partition_ancestors()'s behavior, but you're also making an assumption about how the output of pg_partition_ancestors() is joined to "pg_trigger AS u", and I really don't think that's safe. ISTM the real problem is the assumption that only related triggers could share a tgname, which evidently isn't true. I think this query needs to actually match on tgparentid, rather than taking shortcuts. If we don't want to use a recursive CTE, maybe we could define it as only looking up to the immediate parent, rather than necessarily finding the root? regards, tom lane
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:34 PM Alvaro Herrera wrote:
> On 2022-Jan-14, James Coleman wrote:
> > The logical slot can't flush past the last commit, so even if there's
> > 100s of megabytes of unflushed WAL on the slot there may be zero lag
> > (in terms of what's possible to process).
> >
> > I've attached a simple patch (sans tests and documentation) to get
> > feedback early. After poking around this afternoon it seemed to me
> > that the simplest approach was to hook into the commit timestamps
> > infrastructure and store the commit's XLogRecPtr in the cache of the
> > most recent value (but of course don't write it out to disk).
>
> Maybe it would work to have a single LSN in shared memory, as an atomic
> variable, which uses monotonic advance[1] to be updated. Whether this is
> updated or not would depend on a new GUC, maybe track_latest_commit_lsn.
> Causing performance pain during transaction commit is not great, but at
> least this way it shouldn't be *too* large a hit.

I don't know if it would or not, but it's such a hot path that I find the idea a bit worrisome. Atomics aren't free - especially inside of a loop.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: pg14 psql broke \d datname.nspname.relname
On Mon, Jan 17, 2022 at 1:06 PM Mark Dilger wrote:
> > On Jan 15, 2022, at 12:28 AM, Julien Rouhaud wrote:
> > Could you send a rebased version?
> Yes. Here it is:

This is not a full review, but I just noticed that:

+ * dotcnt: how many separators were parsed from the pattern, by reference.
+ *		Can be NULL.

But then:

+	Assert(dotcnt != NULL);

On a related note, it's unclear why you've added three new arguments to processSQLNamePattern() but only one of them gets a mention in the function header comment. It's also pretty clear that the behavior of patternToSQLRegex() is changing, but the function header comments are not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: a misbehavior of partition row movement (?)
Alvaro Herrera writes: > So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it > become AFTER_TRIGGER_CP_UPDATE. As far as I can tell there is no harm > in doing so. I agree that taking a bit away from AFTER_TRIGGER_OFFSET is okay (it could spare even a couple more, if we need them). But could we please do it in a way that is designed to keep the code readable, rather than to minimize the number of lines of diff? It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not be adjacent. So what should happen here is to renumber the symbols in between to move their bits over one place. (Since this data is only known within trigger.c, I don't even see an ABI-stability argument for not changing these assignments.) regards, tom lane
Re: Add last commit LSN to pg_last_committed_xact()
On 2022-Jan-14, James Coleman wrote:

> The logical slot can't flush past the last commit, so even if there's
> 100s of megabytes of unflushed WAL on the slot there may be zero lag
> (in terms of what's possible to process).
>
> I've attached a simple patch (sans tests and documentation) to get
> feedback early. After poking around this afternoon it seemed to me
> that the simplest approach was to hook into the commit timestamps
> infrastructure and store the commit's XLogRecPtr in the cache of the
> most recent value (but of course don't write it out to disk).

Maybe it would work to have a single LSN in shared memory, as an atomic variable, which uses monotonic advance[1] to be updated. Whether this is updated or not would depend on a new GUC, maybe track_latest_commit_lsn. Causing performance pain during transaction commit is not great, but at least this way it shouldn't be *too* large a hit.

[1] part of a large patch at https://www.postgresql.org/message-id/202111222156.xmo2yji5ifi2%40alvherre.pgsql

-- 
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
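The "monotonic advance" being proposed can be sketched in Python (a toy model: the real patch would use a compare-and-swap loop on an atomic u64 in shared memory; a lock stands in for the CAS here). The key property is that a straggling committer with an older LSN can never move the shared value backwards.

```python
import threading

class MonotonicCommitLsn:
    """A single shared last-commit LSN that only ever moves forward."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for the CAS retry loop
        self._lsn = 0

    def advance(self, commit_lsn: int) -> int:
        with self._lock:
            if commit_lsn > self._lsn:  # monotonic: never regress
                self._lsn = commit_lsn
            return self._lsn

    def read(self) -> int:
        with self._lock:
            return self._lsn

last_commit = MonotonicCommitLsn()
```

Robert's concern above is about exactly this `advance` call sitting on the commit path: even a short CAS loop is not free when every committing backend executes it.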
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas wrote: > On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan wrote: > > 1. Cases where our inability to get a cleanup lock signifies nothing > > at all about the page in question, or any page in the same table, with > > the same workload. > > > > 2. Pathological cases. Cases where we're at least at the mercy of the > > application to do something about an idle cursor, where the situation > > may be entirely hopeless on a long enough timeline. (Whether or not it > > actually happens in the end is less significant.) > > Sure. I'm worrying about case (2). I agree that in case (1) waiting > for the lock is almost always the wrong idea. I don't doubt that we'd each have little difficulty determining which category (1 or 2) a given real world case should be placed in, using a variety of methods that put the issue in context (e.g., looking at the application code, talking to the developers or the DBA). Of course, it doesn't follow that it would be easy to teach vacuumlazy.c how to determine which category the same "can't get cleanup lock" falls under, since (just for starters) there is no practical way for VACUUM to see all that context. That's what I'm effectively trying to work around with this "wait and see approach" that demotes FreezeLimit to a backstop (and so justifies removing the vacuum_freeze_min_age GUC that directly dictates our FreezeLimit today). The cure may be worse than the disease, and the cure isn't actually all that great at the best of times, so we should wait until the disease visibly gets pretty bad before being "interventionist" by waiting for a cleanup lock. I've already said plenty about why I don't like vacuum_freeze_min_age (or FreezeLimit) due to XIDs being fundamentally the wrong unit. But that's not the only fundamental problem that I see. The other problem is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM will start to wait for a cleanup lock. 
But why should the first thing be the same as the second thing? I see absolutely no reason for it. (Hence the idea of making FreezeLimit a backstop, and getting rid of the GUC itself.) > > This is my concern -- what I've called category 2 cases have this > > exact quality. So given that, why not freeze what you can, elsewhere, > > on other pages that don't have the same issue (presumably the vast > > vast majority in the table)? That way you have the best possible > > chance of recovering once the DBA gets a clue and fixes the issue. > > That's the part I'm not sure I believe. To be clear, I think that I have yet to adequately demonstrate that this is true. It's a bit tricky to do so -- absence of evidence isn't evidence of absence. I think that your principled skepticism makes sense right now. Fortunately the early refactoring patches should be uncontroversial. The controversial parts are all in the last patch in the patch series, which isn't too much code. (Plus another patch to at least get rid of vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that hasn't been written just yet.) > Imagine a table with a > gigantic number of pages that are not yet all-visible, a small number > of all-visible pages, and one page containing very old XIDs on which a > cursor holds a pin. I don't think it's obvious that not waiting is > best. Maybe you're going to end up vacuuming the table repeatedly and > doing nothing useful. If you avoid vacuuming it repeatedly, you still > have a lot of work to do once the DBA locates a clue. Maybe this is a simpler way of putting it: I want to delay waiting on a pin until it's pretty clear that we truly have a pathological case, which should in practice be limited to an anti-wraparound VACUUM, which will now be naturally rare -- most individual tables will literally never have even one anti-wraparound VACUUM. 
We don't need to reason about the vacuuming schedule this way, since anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't really have to predict anything. Maybe we'll need to do an anti-wraparound VACUUM immediately after a non-aggressive autovacuum runs, without getting a cleanup lock (due to an idle cursor pathological case). We won't be able to advance relfrozenxid until the anti-wraparound VACUUM runs (at the earliest) in this scenario, but it makes no difference. Rather than predicting the future, we're covering every possible outcome (at least to the extent that that's possible). > I think there's probably an important principle buried in here: the > XID threshold that forces a vacuum had better also force waiting for > pins. If it doesn't, you can tight-loop on that table without getting > anything done. I absolutely agree -- that's why I think that we still need FreezeLimit. Just as a backstop, that in practice very rarely influences our behavior. Probably just in those remaining cases that are never vacuumed except for the occasional anti-wraparound VACUUM (even then it might not be very important).
Re: a misbehavior of partition row movement (?)
> @@ -3398,7 +3432,7 @@ typedef SetConstraintStateData *SetConstraintState;
>  */
> typedef uint32 TriggerFlags;
>
> -#define AFTER_TRIGGER_OFFSET			0x0FFF	/* must be low-order bits */
> +#define AFTER_TRIGGER_OFFSET			0x07FF	/* must be low-order bits */
> #define AFTER_TRIGGER_DONE				0x1000
> #define AFTER_TRIGGER_IN_PROGRESS		0x2000
> /* bits describing the size and tuple sources of this event */
> @@ -3406,7 +3440,8 @@ typedef uint32 TriggerFlags;
> #define AFTER_TRIGGER_FDW_FETCH		0x8000
> #define AFTER_TRIGGER_1CTID			0x4000
> #define AFTER_TRIGGER_2CTID			0xC000
> -#define AFTER_TRIGGER_TUP_BITS			0xC000
> +#define AFTER_TRIGGER_CP_UPDATE		0x0800
> +#define AFTER_TRIGGER_TUP_BITS			0xC800

So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it become AFTER_TRIGGER_CP_UPDATE. As far as I can tell there is no harm in doing so.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: Add last commit LSN to pg_last_committed_xact()
On Fri, Jan 14, 2022 at 7:42 PM James Coleman wrote: > I've attached a simple patch (sans tests and documentation) to get > feedback early. After poking around this afternoon it seemed to me > that the simplest approach was to hook into the commit timestamps > infrastructure and store the commit's XLogRecPtr in the cache of the > most recent value (but of course don't write it out to disk). That's the > downside of making this feature dependent on "track_commit_timestamps > = on", but that seems reasonable: > > 1. Getting the xid of the last commit is similarly dependent on commit > timestamps infrastructure. > 2. It's a simple place to hook into and avoids new shared data and locking. > > Thoughts? It doesn't seem great to me. It's making commit_ts do something other than commit timestamps, which looks kind of ugly. In general, I'm concerned about the cost of doing something like this. Extra shared memory updates as part of the process of committing a transaction are not (and can't be made) free. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Adding CI to our tree
On 1/17/22 13:19, Andres Freund wrote: > Hi, > > On 2022-01-17 10:25:12 -0500, Andrew Dunstan wrote: >> The buildfarm is moving in the opposite direction, to disaggregate >> steps. > I'm a bit confused as to where you want changes to vcregress.pl > going. Upthread you argued against adding more granular targets to > vcregress. But this seems to be arguing against that? OK, let me clear that up. Yes I argued upthread for a 'make check-world' equivalent, because it's useful for developers on Windows. But I don't really want to use it in the buildfarm client, where I prefer to run fine-grained tests. > > >> There are several reasons for that, including that it makes for >> less log output that you need to churn through to find out what's gone >> wrong in a particular case, and that it makes disabling certain test >> sets via the buildfarm client's `skip-steps' feature possible. > FWIW, to me this shouldn't require a lot of separate manual test > invocations. And continuing to have lots of granular test invocations from the > buildfarm client is *bad*, because it requires constantly syncing up the set > of test targets. Well, the logic we use for TAP tests is we run them for the following if they have a t subdirectory, subject to configuration settings like PG_TEST_EXTRA:

src/test/bin/*
contrib/*
src/test/modules/*
src/test/*

As long as any new TAP test gets placed in such a location nothing new is required in the buildfarm client. > > It also makes the buildfarm far slower than necessary, because it runs a lot > of stuff serially that it could run in parallel. That's a legitimate concern. I have made some strides towards gross parallelism in the buildfarm by providing for different branches to be run in parallel. This has proven to be fairly successful (i.e. I have not run into any complaints, and several of my own animals utilize it). I can certainly look at doing something of the sort for an individual branch run. 
> This is particularly a > problem for things like valgrind runs, where individual tests are quite slow - > but just throwing more CPUs at it would help a lot. When I ran a valgrind animal, `make check` was horribly slow, and it's already parallelized. But it was on a VM and I forget how many CPUs the VM had. > > We should set things up so that: > > a) The output of each test can easily be associated with the corresponding set > of log files. > b) The list of tests run can be determined generically by looking at the > filesystems > c) For each test run, it's easy to see whether it failed or succeeded > > As part of the meson stuff (which has its own test runner), I managed to get a > bit towards this by changing the log output hierarchy so that each test gets > its own directory for log files (regress_log_*, initdb.log, postmaster.log, > regression.diffs, server log files). What's missing is a > test.{failed,succeeded} marker or such, to make it easy to report the log > files for just failed tasks. One thing I have been working on is a way to split the log output of an individual buildfarm step, hitherto just a text blob, so that you can easily navigate to say regress_log_0003-foo-step.log without having to page through myriads of stuff. It's been on the back burner but I need to prioritize it, maybe when the current CF is done. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda wrote: > I have rebased and generated the patches on top of PostgreSQL commit > ID cf925936ecc031355cd56fbd392ec3180517a110. > Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch > first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch. OK, so looking over 0002, I noticed a few things: 1. datlastsysoid isn't being used for anything any more. That's not a defect in your patch, but I've separately proposed to remove it. 2. I realized that the whole idea here depends on not having initdb create more than one database without a fixed OID. The patch solves that by nailing down the OID of template0, which is a sufficient solution. However, I think nailing down the (initial) OID of postgres as well would be a good idea, just in case somebody in the future decides to add another system-created database. 3. The changes to gram.y don't do anything. Sure, you've added a new "OID" token, but nothing generates that token, so it has no effect. The reason the syntax works is that createdb_opt_name accepts "IDENT", which means any string that's not in the keyword list (see kwlist.h). But that's there already, so you don't need to do anything in this file. 4. I felt that the documentation and comments could be somewhat improved. Here's an updated version in which I've reverted the changes to gram.y and tried to improve the comments and documentation. Could you have a look at implementing (2) above? -- Robert Haas EDB: http://www.enterprisedb.com v9-0001-pg_upgrade-perserve-database-OID-patch-from-shrut.patch Description: Binary data
Re: slowest tap tests - split or accelerate?
Hi, On 2022-01-17 15:13:57 -0500, Robert Haas wrote: > I guess there must be something explaining it, but I don't know what > it could be. The client and the server are each running the checksum > algorithm over the same data. If that's not the same speed then I > don't get it. Unless, somehow, they're using different implementations > of it? I think that actually might be the issue. On linux a test of pg_verifybackup was much faster than on windows (as in 10x). But if I disable openssl, it's only 2x. On the windows instance I *do* have openssl enabled. But I suspect something is off and the windows buildsystem ends up with our hand-rolled implementation on the client side, but not the server side. Which'd explain the times I'm seeing: We have a fast CRC implementation, but the rest is pretty darn unoptimized. Greetings, Andres Freund
Re: slowest tap tests - split or accelerate?
Andres Freund writes: > I've occasionally pondered caching initdb results and reusing them across > tests - just the locking around it seems a bit nasty, but perhaps that could > be done as part of the tmp_install step. Of course, it'd need to deal with > different options etc... I'd actually built a prototype to do that, based on making a reference cluster and then "cp -a"'ing it instead of re-running initdb. I gave up when I found that on slower, disk-bound machines it was hardly any faster. Thinking about it now, I wonder why not just re-use one cluster for many tests, only dropping and re-creating the database in which the testing happens. regards, tom lane
Re: removing datlastsysoid
Robert Haas writes: > Since that doesn't seem like an especially good idea, PFA a patch to > remove it. Note that, even prior to that commit, it wasn't being used > for anything when dumping modern servers, so it would still have been > OK to remove it from the current system catalog structure. Now, > though, we can remove all references to it. +1. Another reason to get rid of it is that it has nothing to do with the system OID ranges defined in access/transam.h. regards, tom lane
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Jelte Fennema writes: > Thanks for all the cleanup and adding of windows support. To me it now looks > good to merge. I was about to commit this when I started to wonder if it actually does anything useful. In particular, I read in the Linux tcp(7) man page:

TCP_USER_TIMEOUT (since Linux 2.6.37)
    ... This option can be set during any state of a TCP connection, but is effective only during the synchronized states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK).

ISTM that the case we care about is where the server fails to respond to the TCP connection request. If it does so, it seems pretty unlikely that it wouldn't then eat the small amount of data we're going to send. While the keepalive options aren't so explicitly documented, I'd bet that they too don't have any effect until the connection is known established. So I'm unconvinced that setting these values really has much effect to make PQcancel more robust (or more accurately, make it fail faster). I would like to know what scenarios you tested to convince you that this is worth doing. regards, tom lane
Re: missing indexes in indexlist with partitioned tables
Hi! Afaiac the join pruning where the outer table is a partitioned table is the relevant case. I am not sure whether there are other cases. The join pruning, which works great for plain relations since 9.0, falls short for partitioned tables, since the optimizer fails to prove uniqueness there. In practical cases inner and outer tables are almost surely different ones, but I reattached a simpler example. It's the one I came up with back in September. I've seen this can be a reason to avoid partitioning for the time being, if the application relies on join pruning. I think generic views make it almost necessary to have it. If you had a different answer in mind, please don't hesitate to ask again. Regards Arne From: Alvaro Herrera Sent: Monday, January 17, 2022 7:16:08 PM To: Arne Roland Cc: Julien Rouhaud; pgsql-hackers Subject: Re: missing indexes in indexlist with partitioned tables Hmm, can you show cases of queries for which having this new partIndexlist changes plans? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós) indexlist_to_late.sql Description: indexlist_to_late.sql
Re: Adding CI to our tree
Hi, On 2022-01-17 14:30:53 -0500, Tom Lane wrote: > Andres Freund writes: > > I think it's not actually that hard, with something like I described in the > > email upthread, with each test going into a prescribed location, and the > > on-disk status being inspectable in an automated way. check-world could > > invoke > > a command to summarize the tests at the end in a .NOTPARALLEL, to make the > > local case easier. > > That sounds a bit, um, make-centric. At this point it seems to me > we ought to be thinking about how it'd work under meson. Some of this is a lot easier with meson. It has a builtin test runner, which understands tap (thereby not requiring prove anymore). Those tests can be listed (meson test --list). Depending on the option this results in a list of all tests with just the "topline" name of passing tests and error output from failing tests, or all output all the time or ... At the end it prints a summary of test counts and a list of failed tests. Here's an example (the leading timestamps are from CI, not meson), on windows: https://api.cirrus-ci.com/v1/task/6009638771490816/logs/check.log The test naming isn't necessarily great, but that's my fault. Running all the tests with meson takes a good bit less time than running most, but far from all, tests using vcregress.pl: https://cirrus-ci.com/build/4611852939296768 meson test makes it far easier to spot which tests failed, it's consistent across operating systems, allows to skip individual tests, etc. However: It doesn't address the log collection issue in itself. For that we'd still need to collect them in a way that's easier to associate with individual tests. In the meson branch I made it so that each test (including individual tap ones) has its own log directory, which makes it easier to select all the logs for a failing test etc. > > This subthread is about the windows tests specifically, where it's even > > worse
> > That's precisely because the windows build doesn't use make. > We shouldn't be thinking about inventing two separate dead-end > solutions to this problem. Agreed. I think some improvements, e.g. around making the logs easier to associate with an individual test, are orthogonal to the buildsystem issue. I think it might still be worth adding a stopgap way of running all tap tests on windows though. Having a vcregress.pl function to find all directories with t/ and run the tests there, shouldn't be a lot of code... Greetings, Andres Freund
Re: slowest tap tests - split or accelerate?
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund wrote: > I wonder if there's something explaining why pg_verifybackup is greatly slowed > down by sha224 but not crc32c, but the server's runtime only differs by ~20ms? > It seems incongruous that pg_basebackup, with all the complexity of needing to > communicate with the server, transferring the backup over network, and *also* > computing checksums, takes as long as the pg_verifybackup invocation? I guess there must be something explaining it, but I don't know what it could be. The client and the server are each running the checksum algorithm over the same data. If that's not the same speed then I don't get it. Unless, somehow, they're using different implementations of it? > I've occasionally pondered caching initdb results and reusing them across > tests - just the locking around it seems a bit nasty, but perhaps that could > be done as part of the tmp_install step. Of course, it'd need to deal with > different options etc... It's a thought, but it does seem like a bit of a pain to implement. -- Robert Haas EDB: http://www.enterprisedb.com
removing datlastsysoid
Hi, While reviewing another patch, I noticed that it slightly adjusted the treatment of datlastsysoid. That made me wonder what datlastsysoid is used for, so I started poking around and discovered that the answer, at least insofar as I can determine, is "nothing". The documentation claims that the value is useful "particularly to pg_dump," which turns out not to be true any more. Tom's recent commit, 30e7c175b81d53c0f60f6ad12d1913a6d7d77008, to remove pg_dump/pg_dumpall support for dumping from pre-9.2 servers, removed all remaining uses of this value from the source tree. It's still maintained. We just don't do anything with it. Since that doesn't seem like an especially good idea, PFA a patch to remove it. Note that, even prior to that commit, it wasn't being used for anything when dumping modern servers, so it would still have been OK to remove it from the current system catalog structure. Now, though, we can remove all references to it. -- Robert Haas EDB: http://www.enterprisedb.com 0001-Remove-datlastsysoid.patch Description: Binary data
Re: SLRUs in the main buffer pool, redux
On Mon, Jan 17, 2022 at 11:23 PM Heikki Linnakangas wrote: > IIRC one issue with this has been performance. When an SLRU is working > well, a cache hit in the SLRU is very cheap. Linearly scanning the SLRU > array is cheap, compared to computing the hash and looking up a buffer > in the buffer cache. Would be good to do some benchmarking of that. One trick I want to experiment with is trying to avoid the mapping table lookup by using ReadRecentBuffer(). In the prototype patch I do that for one buffer per SLRU cache (so that'll often point to the head CLOG page), but it could be extended to a small array of recently accessed buffers to scan linearly, much like the current SLRU happy case except that it's not shared and doesn't need a lock so it's even happier. I'm half joking here, but that would let us keep calling this subsystem SLRU :-) > I wanted to do this with the CSN (Commit Sequence Number) work a long > time ago. That would benefit from something like an SLRU with very fast > access, to look up CSNs. Even without CSNs, if the CLOG was very fast to > access we might not need hint bits anymore. I guess trying to make SLRUs > faster than they already are is moving the goalposts, but at a minimum > let's make sure they don't get any slower. One idea I've toyed with is putting a bitmap into shared memory where each bit corresponds to a range of (say) 256 xids, accessed purely with atomics. If a bit is set we know they're all committed, and otherwise you have to do more pushups. I've attached a quick and dirty experimental patch along those lines, but I don't think it's quite right, especially with people waving 64 bit xid patches around. Perhaps it'd need to be a smaller sliding window, somehow, and perhaps people will balk at the wild and unsubstantiated assumption that transactions generally commit. 
From ee0db89b7c054ca0297d42ef94058b37bb8b9436 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 14 Jan 2022 11:15:48 +1300
Subject: [PATCH] WIP: Try to cache CLOG results.

To make CLOG lookups faster, maintain a bitmap of xid groups that are
known to be all committed, using atomic ops.

XXX Just an experiment...
---
 src/backend/access/transam/clog.c | 144 +-
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index de787c3d37..0678fa828c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -41,6 +41,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "storage/proc.h"
 #include "storage/sync.h"
@@ -81,6 +82,127 @@
  */
 #define THRESHOLD_SUBTRANS_CLOG_OPT	5
+
+/*
+ * The number of transactions to summarize into a single bit of shared memory
+ * that says 'everything in this group has committed'.  Must be a power of two
+ * and <= CLOG_XACTS_PER_PAGE.  Currently set so that it corresponds to one
+ * cache line of CLOG page, so that it's cheap to scan a group on every
+ * commit.  That results in a 1MB summary array.  XXX Many other policies
+ * including lazy computation and course granularity are possible.
+ */
+#define CLOG_XACTS_PER_SUMMARY_GROUP (64 * CLOG_XACTS_PER_BYTE)
+
+/* Fixed size of the array of all-committed summary bits, in bytes. */
+#define CLOG_SUMMARY_SIZE (0x8000 / CLOG_XACTS_PER_SUMMARY_GROUP / 8)
+
+static pg_atomic_uint32 *CLOGSummary;
+
+/*
+ * Write the all-committed bit for the summary group containing 'xid'.
+ */
+static void
+TransactionXidSetSummaryBit(TransactionId xid, bool value)
+{
+	size_t		bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP;
+	size_t		bits_per_word = sizeof(uint32) * 8;
+	size_t		word_index = bit_index / bits_per_word;
+	uint32		mask = 1 << (bit_index % bits_per_word);
+
+	if (value)
+		pg_atomic_fetch_or_u32(&CLOGSummary[word_index], mask);
+	else
+		pg_atomic_fetch_and_u32(&CLOGSummary[word_index], ~mask);
+}
+
+/*
+ * Read the all-committed bit for the summary group containing 'xid'.
+ */
+static bool
+TransactionXidGetSummaryBit(TransactionId xid)
+{
+	size_t		bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP;
+	size_t		bits_per_word = sizeof(uint32) * 8;
+	size_t		word_index = bit_index / bits_per_word;
+	uint32		mask = 1 << (bit_index % bits_per_word);
+
+	return (pg_atomic_read_u32(&CLOGSummary[word_index]) & mask) != 0;
+}
+
+/*
+ * Summarize a cache line's worth of CLOG statuses into an all-committed bit,
+ * if possible.  Called for each CLOG commit, so had better be fast.
+ */
+static void
+TransactionXidSummarizeGroup(TransactionId xid, char *page)
+{
+	int			first_chunk;
+	int			end_chunk;
+	uint64		all_committed_chunk;
+	uint64	   *page_chunks = (uint64 *) page;
+
+	/* Already set? */
+	if (unlikely(TransactionXidGetSummaryBit(xid)))
+		return;
+
+	/*
+	 * XXX
Re: slowest tap tests - split or accelerate?
Hi, On 2022-01-17 14:05:17 -0500, Robert Haas wrote: > On Mon, Jan 17, 2022 at 1:41 PM Andres Freund wrote: > > The reason these in particular are slow is that they do a lot of > > pg_basebackups without either / one-of -cfast / --no-sync. The lack of > > -cfast > > in particularly is responsible for a significant proportion of the test > > time. The only reason this didn't cause the tests to take many minutes is > > that > > spread checkpoints only throttle when writing out a buffer and there aren't > > that many dirty buffers... > > Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not > sure what else can realistically be done to speed it up without losing > the point of the test. And it's basically just a single loop, so > splitting it up doesn't seem to make a lot of sense either. It's also not that slow compared other tests after the -cfast addition. However, I'm a bit surprised at how long the individual pg_verifybackup calls take on windows - about as long as the pg_basebackup call itself. 
Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224 --manifest-checksums sha224 --no-sync -cfast
# timing: [4.798 + 0.704s]: complete
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224
backup successfully verified
# timing: [5.507 + 0.697s]: completed

Interestingly, with crc32c, this is not so:

# Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c --manifest-checksums crc32c --no-sync -cfast
# timing: [3.500 + 0.688s]: completed
ok 5 - backup ok with algorithm "crc32c"
ok 6 - crc32c is mentioned many times in the manifest
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c
backup successfully verified
# timing: [4.194 + 0.197s]: completed

I wonder if there's something explaining why pg_verifybackup is greatly slowed down by sha224 but not crc32c, but the server's runtime only differs by ~20ms? It seems incongruous that pg_basebackup, with all the complexity of needing to communicate with the server, transferring the backup over network, and *also* computing checksums, takes as long as the pg_verifybackup invocation? > pg_basebackup's 010_pg_basebackup.pl looks like it could be split up, > though. That one, at least to me, looks like people have just kept > adding semi-related things into the same test file. Yea. It's generally interesting how much time initdb takes in these tests. It's about 1.1s on my linux workstation, and 2.1s on windows. I've occasionally pondered caching initdb results and reusing them across tests - just the locking around it seems a bit nasty, but perhaps that could be done as part of the tmp_install step. Of course, it'd need to deal with different options etc... Greetings, Andres Freund
Re: Blank archive_command
Robert Haas writes: > It might be nice to do something about the fact that you can't change > archive_mode without a server restart, though. I suspect we had a good > reason for that limitation from an engineering perspective, but from a > user perspective, it sucks pretty hard. Agreed. I don't recall what the motivation for that was, but maybe it could be fixed with some more elbow grease. regards, tom lane
Re: Adding CI to our tree
Andres Freund writes: > I think it's not actually that hard, with something like I described in the > email upthread, with each test going into a prescribed location, and the > on-disk status being inspectable in an automated way. check-world could invoke > a command to summarize the tests at the end in a .NOTPARALLEL, to make the > local case easier. That sounds a bit, um, make-centric. At this point it seems to me we ought to be thinking about how it'd work under meson. > This subthread is about the windows tests specifically, where it's even worse > - there's no way to run all tests. That's precisely because the windows build doesn't use make. We shouldn't be thinking about inventing two separate dead-end solutions to this problem. regards, tom lane
Re: Adding CI to our tree
Hi, On 2022-01-17 13:50:04 -0500, Robert Haas wrote: > On Mon, Jan 17, 2022 at 1:19 PM Andres Freund wrote: > > FWIW, to me this shouldn't require a lot of separate manual test > > invocations. And continuing to have lots of granular test invocations from > > the > > buildfarm client is *bad*, because it requires constantly syncing up the set > > of test targets. > > I have a lot of sympathy with Andrew here, actually. If you just do > 'make check-world' and assume that will cover everything, you get one > giant output file. That is not great at all. I very much agree with check-world output being near unusable. > That's really hard to accomplish if you just run all the tests with one > invocation - any technique you use to find the boundaries between one test's > output and the next will prove to be unreliable. I think it's not actually that hard, with something like I described in the email upthread, with each test going into a prescribed location, and the on-disk status being inspectable in an automated way. check-world could invoke a command to summarize the tests at the end in a .NOTPARALLEL, to make the local case easier. pg_regress/isolation style tests have the "summary" output in regression.out, but we remove it on success. Tap tests have their output in the regress_log_* files, however these are far more verbose than the output from running the tests directly. I wonder if we should keep regression.out and also keep a copy of the "shorter" tap test output in a file? > But having said that, I also agree that it sucks to have to keep > updating the BF client every time we want to do any kind of > test-related changes in the main source tree. It also sucks locally. I *hate* having to dig through a long check-world output to find the first failure. This subthread is about the windows tests specifically, where it's even worse - there's no way to run all tests. Nor a list of test targets that's even halfway complete :/. 
One just has to know that one has to invoke a myriad of "vcregress.pl taptest path/to/directory" commands.

Greetings,

Andres Freund
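A rough sketch of what "run all the TAP tests" currently amounts to on Windows: one vcregress.pl invocation per test directory. The directory list here is illustrative and deliberately incomplete, which is exactly the complaint above.

```python
# Hypothetical sketch: no single "run everything" target exists on Windows,
# so each TAP suite needs its own vcregress.pl invocation. The directory
# list is illustrative, not a complete inventory.
tap_dirs = [
    "src/bin/pg_basebackup",
    "src/bin/pg_verifybackup",
    "src/test/recovery",
]

commands = [f"perl src/tools/msvc/vcregress.pl taptest {d}" for d in tap_dirs]
for cmd in commands:
    print(cmd)
```

Keeping such a list in sync by hand is the "constantly syncing up the set of test targets" problem mentioned earlier in the thread.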
Re: Pluggable toaster
On 1/14/22 19:41, Teodor Sigaev wrote:

> In my understanding, we want to be able to
> 1. Access data from a toasted object one slice at a time, by using knowledge of the structure
> 2. If toasted data is updated, then update a minimal number of slice(s), without rewriting the existing slices
> 3. If toasted data is expanded, then allow new slices to be appended to the object without rewriting the existing slices
>
> There are more options:
>
> 1. Share common parts not only between versions of a row but between all rows in a column. Seems strange, but some examples:
>    - URLs often have a common prefix, so storing them in a prefix tree (as SP-GiST does) allows a significant decrease in storage size
>    - the same for json - a common use case is documents sharing part of their hierarchical structure
>    - one more use case for json: if the documents use only a few schemes (structures), it's possible to store only the values in toast storage, without storing keys and structure

This sounds interesting, but very much like column compression, which was proposed some time ago. If we haven't made much progress with that patch (AFAICS), what's the likelihood we'll succeed here, when it's combined with yet more complexity? Maybe doing that kind of compression in TOAST is somehow simpler, but I don't see it.

> 2. The current toast storage stores chunks in the heap access method and, to provide fast access by toast id, builds an index. Ideas:
>    - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. Obviously that reduces the number of disk accesses for "untoasting".
>    - use another access method for chunk storage

Maybe, but that probably requires more thought - e.g. btree requires values to be less than 1/3 of a page, so I wonder how that would play with toasting of values.

> > ISTM that we would want the toast algorithm to be associated with the datatype, not the column? Can you explain your thinking?
>
> Hm. I'll try to explain my motivation.
> 1) A datatype could have more than one suitable toaster, for different use cases: fast retrieval, compact storage, fast update, etc. As I said above, for jsonb there are several optimal toasting strategies: for values with a few distinct structures, for close-to-hierarchical structures, for values whose parts differ by access mode (it's easy to imagine json with some keys used for search and some keys only for output to the user).
>
> 2) A toaster could be designed to work with different data types. The suggested appendable toaster is designed to work with bytea but could work with text.
>
> Looking at this point, I have doubts about where to store the connection between toaster and datatype. If we add toasteroid to pg_type, how do we deal with several toasters for one datatype? (And we could want different toasters in one table!) If we add typoid to pg_toaster, how will it work with several datatypes? Adding a new many-to-many connection table seems workable, but that raises further questions, such as whether any toaster will work with any table access method. To resolve this bundle of questions, we propose a validate() method of the toaster, to be called during DDL operations, i.e. when a toaster is assigned to a column or a column's datatype is changed.

Seems you'd need a mapping table to allow an M:N mapping between types and toasters, linking each toaster to all "compatible" types. It's not clear to me how this would work with custom data types, domains, etc. Also, what happens to existing values when you change the toaster? What if the toasters don't use the same access method to store the chunks (heap vs. btree)? And so on.

> More thoughts: Postgres now has two options for a column, storage and compression, and now we add toaster. To me that seems too redundant. Storage should be a binary value: inplace (plain, as now) or toastable.
> All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options of the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast.

So you suggest we move all of this to the toaster? I'd say -1 to that, because it makes it much harder to e.g. add custom compression methods, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
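The prefix-sharing idea discussed above (URLs in a prefix tree) can be illustrated with a minimal front-coding sketch - not the proposed toaster implementation, just the storage intuition: in a sorted list, each entry is stored as the length of the prefix it shares with its predecessor plus the remaining suffix, so common prefixes are kept only once.

```python
# Minimal front-coding sketch (illustrative only, not the toaster API):
# store each sorted entry as (shared-prefix length, suffix).
import os


def front_encode(sorted_urls):
    prev, out = "", []
    for u in sorted_urls:
        # length of the prefix shared with the previous entry
        n = len(os.path.commonprefix([prev, u]))
        out.append((n, u[n:]))
        prev = u
    return out


def front_decode(encoded):
    prev, out = "", []
    for n, suffix in encoded:
        u = prev[:n] + suffix
        out.append(u)
        prev = u
    return out


urls = sorted([
    "https://example.com/a/1",
    "https://example.com/a/2",
    "https://example.com/b/1",
])
enc = front_encode(urls)
assert front_decode(enc) == urls
```

After encoding, only the first entry stores a full URL; the others store short suffixes, which is where the storage savings come from.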
Re: Adding CI to our tree
Robert Haas writes: > I have a lot of sympathy with Andrew here, actually. If you just do > 'make check-world' and assume that will cover everything, you get one > giant output file. That is not great at all. Yeah. I agree with Andrew that we want output that is more modular, not less so. But we do need to find a way to have less knowledge hard-wired in the buildfarm client script. > But having said that, I also agree that it sucks to have to keep > updating the BF client every time we want to do any kind of > test-related changes in the main source tree. One way around that > would be to put a file in the main source tree that the build farm > client can read to know what to do. Another would be to have the BF > client download the latest list of steps from somewhere instead of > having it in the source code, so that it can be updated without > everyone needing to update their machine. The obvious place for "somewhere" is "the main source tree", so I doubt your second suggestion is better than your first. But your first does seem like a plausible way to proceed. Another way to think of it, maybe, is to migrate chunks of the buildfarm client script itself into the source tree. I'd rather that developers not need to become experts on the buildfarm client to make adjustments to the test process --- but I suspect that a simple script like "run make check in these directories" is not going to be flexible enough for everything. regards, tom lane
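The "file in the main source tree that the buildfarm client reads" idea could be sketched like this. The file name, schema, and step list below are invented for illustration; the point is only that the client learns its steps from the tree rather than hard-coding them.

```python
# Hypothetical sketch: a test-step manifest shipped in the source tree, read
# by the buildfarm client. Name, schema, and steps are all invented here.
import json

MANIFEST = """
{
  "steps": [
    {"name": "regress",   "command": ["make", "-C", "src/test/regress", "check"]},
    {"name": "isolation", "command": ["make", "-C", "src/test/isolation", "check"]}
  ]
}
"""

steps = json.loads(MANIFEST)["steps"]
for step in steps:
    # A real client would run each command and file its output under step["name"],
    # giving the per-test red/green granularity discussed in the thread.
    print(step["name"], "->", " ".join(step["command"]))
```

Updating the manifest in the same commit that adds a test would keep the client and the tree in sync without a new client release.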
Re: ICU for global collation
On 10.01.22 12:49, Daniel Verite wrote:
> I think some users would want their db-wide ICU collation to be
> case/accent-insensitive. ...
> IIRC, that was the context for some questions where people were
> enquiring about db-wide ICU collations.

+1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default collation, a.k.a. the "global collation", as distinct from the "db-wide" database-level default collation, which controls the default collation of the collatable types within that database.

> With the current patch, it's not possible, AFAICS, because the user
> can't tell that the collation is non-deterministic. Presumably this
> would require another option to CREATE DATABASE and another
> column to store that bit of information.

On 1/11/22, 6:24 AM, "Peter Eisentraut" wrote:
> Adding this would be easy, but since pattern matching currently does not
> support nondeterministic collations, if you make a global collation
> nondeterministic, a lot of system views, psql, pg_dump queries etc.
> would break, so it's not practical. I view this as an orthogonal
> project. Once we can support this without breaking system views etc.,
> then it's easy to enable with a new column in pg_database.

So this patch only enables the default cluster collation (DEFAULT_COLLATION_OID) to be a deterministic ICU collation, but doesn't extend the metadata in a way that would enable database collations to be ICU collations? Waiting for the pattern matching problem to be solved in general before creating the metadata to support ICU collations everywhere will make it more difficult for members of the community to help solve the pattern matching problem. What additional metadata changes would be required to enable an ICU collation to be specified at either the cluster level or the database level, even if new checks need to be added to disallow a nondeterministic collation at the cluster level for now?
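What "nondeterministic" means here can be sketched outside ICU: under a case- and accent-insensitive collation, byte-wise different strings compare equal. This rough approximation uses Python's Unicode machinery rather than ICU, purely to show why prefix-based pattern matching becomes hard.

```python
# Rough sketch (Python's unicodedata, not ICU) of a case/accent-insensitive
# comparison key, i.e. what a nondeterministic collation does.
import unicodedata


def ci_ai_key(s):
    # strip accents: decompose, drop combining marks, then casefold
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


assert ci_ai_key("Résumé") == ci_ai_key("resume")
# Equality no longer implies byte equality, which is why LIKE and other
# pattern-matching optimizations break under such a collation:
assert "Résumé" != "resume"
```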
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda wrote:
> Thanks, Robert, for the updated version. I reviewed the changes and they
> look fine. I also tested the patch. The patch works as expected.

Thanks.

> > - I adjusted the function header comment for heap_create. Your
> > proposed comment seemed like it was pretty detailed but not 100%
> > correct. It also made one of the lines kind of long because you didn't
> > wrap the text in the surrounding style. I decided to make it simpler
> > and shorter instead of longer still and 100% correct.
>
> The comment update looks fine. However, I still feel it would be good to
> mention in which (rare) circumstances a valid relfilenode can get passed.

In general, I think it's the job of a function parameter comment to describe what the parameter does, not how the callers actually use it. One problem with describing the latter is that, if someone later adds another caller, there is a pretty good chance that they won't notice that the comment needs to be changed. More fundamentally, function parameter comments should be like an instruction manual for how to use the function. If you are trying to figure out how to use this function, it is not helpful to know that "most callers like to pass false" for this parameter. What you need to know is what value your new call site should pass, and knowing what "most callers" do or that something is "rare" doesn't really help. If we want to make this comment more detailed, we should approach it from the point of view of explaining how it ought to be set.

I've committed the v8-0001 patch you attached. I'll write separately about v8-0002.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: slowest tap tests - split or accelerate?
On Mon, Jan 17, 2022 at 1:41 PM Andres Freund wrote: > The reason these in particular are slow is that they do a lot of > pg_basebackups without either / one-of -cfast / --no-sync. The lack of -cfast > in particularly is responsible for a significant proportion of the test > time. The only reason this didn't cause the tests to take many minutes is that > spread checkpoints only throttle when writing out a buffer and there aren't > that many dirty buffers... Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not sure what else can realistically be done to speed it up without losing the point of the test. And it's basically just a single loop, so splitting it up doesn't seem to make a lot of sense either. pg_basebackup's 010_pg_basebackup.pl looks like it could be split up, though. That one, at least to me, looks like people have just kept adding semi-related things into the same test file. -- Robert Haas EDB: http://www.enterprisedb.com
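The fix under discussion (implemented in Perl in the patch posted in this thread) boils down to prepending shared speed-oriented defaults to every pg_basebackup invocation. A small Python sketch of the same idiom:

```python
# Sketch of the shared-defaults idiom the patch's @pg_basebackup_defs array
# uses: -cfast requests an immediate checkpoint instead of a spread one, and
# --no-sync skips fsync'ing the (throwaway) backup files.
BASEBACKUP_DEFAULTS = ["pg_basebackup", "--no-sync", "-cfast"]


def basebackup_cmd(*extra_args):
    """Build a pg_basebackup command line with the shared defaults first."""
    return BASEBACKUP_DEFAULTS + list(extra_args)


print(" ".join(basebackup_cmd("-D", "/tmp/backup")))
```

Centralizing the defaults means a test that genuinely needs a spread checkpoint can still override them explicitly, as Andres suggests for one test case.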
Re: Adding CI to our tree
On Mon, Jan 17, 2022 at 1:19 PM Andres Freund wrote: > FWIW, to me this shouldn't require a lot of separate manual test > invocations. And continuing to have lots of granular test invocations from the > buildfarm client is *bad*, because it requires constantly syncing up the set > of test targets. I have a lot of sympathy with Andrew here, actually. If you just do 'make check-world' and assume that will cover everything, you get one giant output file. That is not great at all. People who are looking through buildfarm results do not want to have to look through giant logfiles hunting for the failure; they want to look at the stuff that's just directly relevant to the failure they saw. The current BF is actually pretty bad at this. You can click on various things on a buildfarm results page, but it's not very clear where those links are taking you, and the pages at least in my browser (normally Chrome) render so slowly as to make the whole thing almost unusable. I'd like to have a thing where the buildfarm shows a list of tests in red or green and I can click links next to each test to see the various logs that test produced. That's really hard to accomplish if you just run all the tests with one invocation - any technique you use to find the boundaries between one test's output and the next will prove to be unreliable. But having said that, I also agree that it sucks to have to keep updating the BF client every time we want to do any kind of test-related changes in the main source tree. One way around that would be to put a file in the main source tree that the build farm client can read to know what to do. Another would be to have the BF client download the latest list of steps from somewhere instead of having it in the source code, so that it can be updated without everyone needing to update their machine. There might well be other approaches that are even better. 
But the "ask Andrew to adjust the BF client and then have everybody install the new version" approach upon which we have been relying up until now is not terribly scalable. -- Robert Haas EDB: http://www.enterprisedb.com
Re: slowest tap tests - split or accelerate?
Hi,

On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> cfbot now runs most tests on windows, the windows task is by far the slowest,
> and the task limited most in concurrency [2]. Running tap tests is the
> biggest part of that. This is a bigger issue on windows because we don't have
> infrastructure (yet) to run tests in parallel.
>
> There are a few tests which stand out in their slowness, which seem worth
> addressing even if we tackle test parallelism on windows at some point. I
> often find them to be the slowest tests on linux too.
>
> Picking a random successful cfbot run [1] I see the following tap tests taking
> more than 20 seconds:
>
> 67188 ms pg_basebackup t/010_pg_basebackup.pl
> 25751 ms pg_verifybackup t/002_algorithm.pl

The reason these in particular are slow is that they do a lot of pg_basebackups without one or both of -cfast / --no-sync. The lack of -cfast in particular is responsible for a significant proportion of the test time. The only reason this didn't cause the tests to take many minutes is that spread checkpoints only throttle when writing out a buffer, and there aren't that many dirty buffers...

Attached is a patch changing the parameters in all the instances I found. Testing on a local instance, it roughly halves the runtime of t/010_pg_basebackup.pl on linux and windows (but there's still a 2x time difference between the two); the gain is smaller when running the tests concurrently in CI.

It might be worth having one explicit use of -cspread, perhaps combined with an explicit checkpoint beforehand?

Greetings,

Andres Freund

>From 4abed85e9085786f82d87667f2451821c01d37c0 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 17 Jan 2022 00:51:43 -0800
Subject: [PATCH v1 1/4] tests: Consistently use pg_basebackup -cfast --no-sync
 to accelerate tests.

Most tests invoking pg_basebackup themselves did not yet use -cfast, which makes pg_basebackup take considerably longer.
The only reason this didn't cause the tests to take many minutes is that spread checkpoints only throttle when writing out a buffer and there aren't that many dirty buffers...

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 104 ++-
 src/bin/pg_verifybackup/t/002_algorithm.pl   |   2 +-
 src/bin/pg_verifybackup/t/003_corruption.pl  |   2 +-
 src/bin/pg_verifybackup/t/004_options.pl     |   2 +-
 src/bin/pg_verifybackup/t/006_encoding.pl    |   2 +-
 src/bin/pg_verifybackup/t/007_wal.pl         |   4 +-
 6 files changed, 61 insertions(+), 55 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 872fec3d350..ebd102cd098 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -20,6 +20,13 @@
 my $tempdir = PostgreSQL::Test::Utils::tempdir;
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 
+# For nearly all tests some options should be specified, to keep test times
+# reasonable. Using @pg_basebackup_defs as the first element of the array
+# passed to IPC::Run interpolates the array (as it is not a reference to an
+# array).
+my @pg_basebackup_defs = ('pg_basebackup', '--no-sync', '-cfast');
+
+
 # Set umask so test directories and files are created with default permissions
 umask(0077);
 
@@ -43,7 +50,7 @@ $node->set_replication_conf();
 system_or_bail 'pg_ctl', '-D', $pgdata, 'reload';
 
 $node->command_fails(
-	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
+	[ @pg_basebackup_defs, '-D', "$tempdir/backup" ],
 	'pg_basebackup fails because of WAL configuration');
 ok(!-d "$tempdir/backup", 'backup directory was cleaned up');
@@ -54,7 +61,7 @@ mkdir("$tempdir/backup") or BAIL_OUT("unable to create $tempdir/backup");
 append_to_file("$tempdir/backup/dir-not-empty.txt", "Some data");
 
-$node->command_fails([ 'pg_basebackup', '-D', "$tempdir/backup", '-n' ],
+$node->command_fails([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-n' ],
 	'failing run with no-clean option');
 
 ok(-d "$tempdir/backup", 'backup directory was created and left behind');
@@ -105,7 +112,7 @@ foreach my $filename (@tempRelationFiles)
 }
 
 # Run base backup.
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/backup", '-X', 'none' ],
+$node->command_ok([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-X', 'none' ],
 	'pg_basebackup runs');
 ok(-f "$tempdir/backup/PG_VERSION", 'backup was created');
 ok(-f "$tempdir/backup/backup_manifest", 'backup manifest included');
@@ -165,9 +172,9 @@ rmtree("$tempdir/backup");
 
 $node->command_ok(
 	[
-		'pg_basebackup',    '-D',
-		"$tempdir/backup2", '--no-manifest',
-		'--waldir',         "$tempdir/xlog2"
+		@pg_basebackup_defs, '-D',
+		"$tempdir/backup2",  '--no-manifest',
+		'--waldir',          "$tempdir/xlog2"
 	],
 	'separate xlog directory');
 ok(-f "$tempdir/backup2/PG_VERSION", 'backup was created');
@@ -176,31 +183,31 @@ ok(-d "$tempdir/xlog2/",