Re: planner chooses incremental but not the best one
On 15/12/2023 15:58, Richard Guo wrote: With the patch the estimate for the number of distinct 'b' values is more accurate. +1 to commit this patch. It looks good and resolves what is arguably a bug in the code. BTW, this patch does not change any existing regression test results. I attempted to devise a regression test that shows how this change can improve query plans, but failed. Should I try harder to find such a test case? The test that was changed refers to different features. Its behaviour could change in the future and mask testing of this code. IMO, you should add a test directly checking the appendrel->tuples correction. -- regards, Andrei Lepikhov Postgres Professional
Re: pg_upgrade and logical replication
On Wed, Feb 14, 2024 at 9:07 AM Hayato Kuroda (Fujitsu) wrote: > > > pg_upgrade/t/004_subscription.pl says > > > > |my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy'; > > > > ..but I think maybe it should not. > > > > When you try to use --link, it fails: > > https://cirrus-ci.com/task/4669494061170688 > > > > |Adding ".old" suffix to old global/pg_control ok > > | > > |If you want to start the old cluster, you will need to remove > > |the ".old" suffix from > > /tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_su > > bscription_old_sub_data/pgdata/global/pg_control.old. > > |Because "link" mode was used, the old cluster cannot be safely > > |started once the new cluster has been started. > > |... > > | > > |postgres: could not find the database system > > |Expected to find it in the directory > > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s > > ubscription_old_sub_data/pgdata", > > |but could not open file > > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s > > ubscription_old_sub_data/pgdata/global/pg_control": No such file or > > directory > > |# No postmaster PID for node "old_sub" > > |[19:36:01.396](0.250s) Bail out! pg_ctl start failed > > > > Good catch! The primary reason for the failure is that the old cluster is reused, > even after > the successful upgrade. The documentation says [1]: > > > > If you use link mode, the upgrade will be much faster (no file copying) and > use less > disk space, but you will not be able to access your old cluster once you > start the new > cluster after the upgrade. > > > > > You could rename pg_control.old to avoid that immediate error, but that > > doesn't > > address the essential issue that "the old cluster cannot be safely started > > once > > the new cluster has been started." > > Yeah, I agree that accessing the old cluster after the upgrade should be > avoided. > IIUC, pg_upgrade would be run three times in 004_subscription. 
> > 1. successful upgrade > 2. failure due to the insufficient max_replication_slot > 3. failure because the pg_subscription_rel has 'd' state > > And old instance is reused in all of runs. Therefore, the most reasonable fix > is to > change the ordering of tests, i.e., "successful upgrade" should be done at > last. > This sounds like a reasonable way to address the reported problem. Justin, do let me know if you think otherwise? Comment: === * -# Setup an enabled subscription to verify that the running status and failover -# option are retained after the upgrade. +# Setup a subscription to verify that the failover option are retained after +# the upgrade. $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); $old_sub->safe_psql('postgres', - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION regress_pub1 WITH (failover = true)" + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION regress_pub1 WITH (failover = true, enabled = false)" ); I think it is better not to create a subscription in the early stage which we wanted to use for the success case. Let's have separate subscriptions for failure and success cases. I think that will avoid the newly added ALTER statements in the patch. -- With Regards, Amit Kapila.
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240213.173340.1518143507526518973@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 13 Feb 2024 17:33:40 +0900 (JST), Sutou Kouhei wrote: > I'll reply to the other comments later... I've read the other comments, and my answers to them are the same as Michael's. I'll prepare the v15 patch with static inline functions and fixed arguments after the fcinfo cache patches are merged. I think that the v15 patch will conflict with the fcinfo cache patches. Thanks, -- kou
Re: planner chooses incremental but not the best one
On 18/12/2023 19:53, Tomas Vondra wrote: On 12/18/23 11:40, Richard Guo wrote: The challenge is where to get usable information about correlation between columns. I only have a couple of very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c) If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c) Did you mean here ndistinct(c,d) and the formula: ndistinct(b,c,d) / ndistinct(c,d) ? Do you implicitly bear in mind here the necessity of tracking clauses that were applied to the data up to the moment of grouping? -- regards, Andrei Lepikhov Postgres Professional
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 14 Feb 2024 15:52:36 +0900, Michael Paquier wrote: >> How about InputFunctionCallSafeWithInfo(), >> InputFunctionCallSafeInfo() or >> InputFunctionCallInfoCallSafe()? > > WithInfo() would not be a new thing. There are a couple of APIs named > like this when manipulating catalogs, so that sounds kind of a good > choice from here. Thanks for the info. Let's use InputFunctionCallSafeWithInfo(). See the attached patch: v2-0001-Reuse-fcinfo-used-in-COPY-FROM.patch I also attach a patch for COPY TO: v1-0001-Reuse-fcinfo-used-in-COPY-TO.patch I measured the COPY TO patch on my environment with: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=5 master: 740.066ms 734.884ms 738.579ms 734.170ms 727.953ms patched: 730.714ms 741.483ms 714.149ms 715.436ms 713.578ms It seems that it improves performance a bit, but my environment isn't suitable for benchmarking, so these may not be valid numbers. Thanks, -- kou >From b677732f46f735a5601b889f79671e91be41 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Thu, 15 Feb 2024 15:01:08 +0900 Subject: [PATCH v2] Reuse fcinfo used in COPY FROM Each NextCopyFrom() calls input functions or binary-input functions. We can reuse fcinfo for them instead of creating a local fcinfo for each call. This will improve performance. 
--- src/backend/commands/copyfrom.c | 4 + src/backend/commands/copyfromparse.c | 20 ++-- src/backend/utils/fmgr/fmgr.c| 126 +++ src/include/commands/copyfrom_internal.h | 1 + src/include/fmgr.h | 12 +++ 5 files changed, 154 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index 1fe70b9133..ed375c012e 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1691,6 +1691,10 @@ BeginCopyFrom(ParseState *pstate, /* We keep those variables in cstate. */ cstate->in_functions = in_functions; cstate->typioparams = typioparams; + if (cstate->opts.binary) + cstate->fcinfo = PrepareReceiveFunctionCallInfo(); + else + cstate->fcinfo = PrepareInputFunctionCallInfo(); cstate->defmap = defmap; cstate->defexprs = defexprs; cstate->volatile_defexprs = volatile_defexprs; diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 7cacd0b752..7907e16ea8 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -861,6 +861,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, num_defaults = cstate->num_defaults; FmgrInfo *in_functions = cstate->in_functions; Oid *typioparams = cstate->typioparams; + FunctionCallInfoBaseData *fcinfo = cstate->fcinfo; int i; int *defmap = cstate->defmap; ExprState **defexprs = cstate->defexprs; @@ -961,12 +962,13 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, * If ON_ERROR is specified with IGNORE, skip rows with soft * errors */ - else if (!InputFunctionCallSafe(&in_functions[m], - string, - typioparams[m], - att->atttypmod, - (Node *) cstate->escontext, - &values[m])) + else if (!InputFunctionCallSafeWithInfo(fcinfo, + &in_functions[m], + string, + typioparams[m], + att->atttypmod, + (Node *) cstate->escontext, + &values[m])) { cstate->num_errors++; return true; @@ -1966,7 +1968,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, if (fld_size == -1) { 
*isnull = true; - return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod); + return ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, NULL, typioparam, typmod); } if (fld_size < 0) ereport(ERROR, @@ -1987,8 +1989,8 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, cstate->attribute_buf.data[fld_size] = '\0'; /* Call the column type's binary input converter */ - result = ReceiveFunctionCall(flinfo, &cstate->attribute_buf, - typioparam, typmod); + result = ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, &cstate->attribute_buf, + typioparam, typmod); /* Trouble if it didn't eat the whole buffer */ if (cstate->attribute_buf.cursor != cstate->attribute_buf.len) diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index e48a86be54..14c3ed2bdb 100644 --- a/src/backend/utils/fmgr/fmgr.c +++ b/src/backend/utils/fmgr/fmgr.c @@ -1672,6 +1672,73 @@ DirectInputFunctionCallSafe(PGFunction func, char *str, return true; } +/* + * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo + * instead of initializing it for each call. Thi
Re: About a recently-added message
At Thu, 15 Feb 2024 09:22:23 +0530, shveta malik wrote in > On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila wrote: > > > > On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira wrote: > > > > > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote: > > > > > > Now, I am less clear about whether to quote "logical" or not in the > > > above message. Do you have any suggestions? > > > > > > > > > The possible confusion comes from the fact that the sentence contains > > > "must be" > > > in the middle of a comparison expression. For pg_createsubscriber, we are > > > using > > > > > > publisher requires wal_level >= logical > > > > > > I suggest to use something like > > > > > > slot synchronization requires wal_level >= logical > > > > > > > This sounds like a better idea but shouldn't we directly use this in > > 'errmsg' instead of a separate 'errhint'? Also, if change this, then > > we should make a similar change for other messages in the same > > function. > > +1 on changing the msg(s) suggested way. Please find the patch for the > same. It also removes double quotes around the variable names Thanks for the discussion. With a translator hat on, I would be happy if I could determine whether a word requires translation with minimal background information. In this case, a translator needs to know which values wal_level can take. It's relatively easy in this case, but I'm not sure if this is always the case. Therefore, I would be slightly happier if "logical" were double-quoted. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Skip collecting decoded changes of already-aborted transactions
On Fri, Feb 2, 2024 at 12:48 AM vignesh C wrote: > > On Tue, 3 Oct 2023 at 15:54, vignesh C wrote: > > > > On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada wrote: > > > > > > On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar > > > wrote: > > > > > > > > On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > In logical decoding, we don't need to collect decoded changes of > > > > > aborted transactions. While streaming changes, we can detect > > > > > concurrent abort of the (sub)transaction but there is no mechanism to > > > > > skip decoding changes of transactions that are known to already be > > > > > aborted. With the attached WIP patch, we check CLOG when decoding the > > > > > transaction for the first time. If it's already known to be aborted, > > > > > we skip collecting decoded changes of such transactions. That way, > > > > > when the logical replication is behind or restarts, we don't need to > > > > > decode large transactions that already aborted, which helps improve > > > > > the decoding performance. > > > > > > > > > +1 for the idea of checking the transaction status only when we need > > > > to flush it to the disk or send it downstream (if streaming in > > > > progress is enabled). Although this check is costly since we are > > > > planning only for large transactions then it is worth it if we can > > > > occasionally avoid disk or network I/O for the aborted transactions. > > > > > > > > > > Thanks. > > > > > > I've attached the updated patch. With this patch, we check the > > > transaction status for only large-transactions when eviction. For > > > regression test purposes, I disable this transaction status check when > > > logical_replication_mode is set to 'immediate'. 
> > May be there is some changes that are missing in the patch, which is > > giving the following errors: > > reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’: > > reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared > > (first use in this function) > > 3584 | if (unlikely(logical_replication_mode == > > LOGICAL_REP_MODE_IMMEDIATE)) > > | ^~~~ > > With no update to the thread and the compilation still failing I'm > marking this as returned with feedback. Please feel free to resubmit > to the next CF when there is a new version of the patch. > I resumed working on this item. I've attached the new version of the patch. I rebased the patch to the current HEAD and updated comments and commit messages. The patch is straightforward and I'm somewhat satisfied with it, but I'm thinking of adding some tests for it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com From c5c78f5d53d375f7a79b2561c551f7bb3ff57717 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada Date: Mon, 3 Jul 2023 10:28:00 +0900 Subject: [PATCH v3] Skip logical decoding of already-aborted transactions. If we detect a concurrent abort of a streaming transaction, we discard all changes and skip decoding further changes of the transaction. This commit introduces a new check, by CLOG lookup, of whether a (streaming or non-streaming) transaction is already aborted, enabling us to skip decoding further changes of the transaction. This helps logical decoding performance a lot in cases where the transaction is large and already rolled back, since we can save disk or network I/O. We do this new check only for large transactions at eviction time, since checking CLOG is costly and could cause a slowdown with lots of small transactions, where most transactions commit. 
Reviewed-by: Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com --- .../replication/logical/reorderbuffer.c | 98 --- src/include/replication/reorderbuffer.h | 13 ++- 2 files changed, 94 insertions(+), 17 deletions(-) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index bbf0966182..f3284708bf 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -100,6 +100,7 @@ #include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */ #include "storage/bufmgr.h" #include "storage/fd.h" +#include "storage/procarray.h" #include "storage/sinval.h" #include "utils/builtins.h" #include "utils/combocid.h" @@ -256,7 +257,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, char *data); static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, - bool txn_prepared); + bool txn_prepared, bool streaming); static void ReorderBufferCleanupSerializedTXNs(const char *slotname); static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid, XLogSegNo segno); @@
Re: speed up a logical replica setup
On Thu, Feb 15, 2024 at 10:37 AM Hayato Kuroda (Fujitsu) wrote: > > Dear Euler, > > Here are my minor comments for 17. > > 01. > ``` > /* Options */ > static const char *progname; > > static char *primary_slot_name = NULL; > static bool dry_run = false; > > static bool success = false; > > static LogicalRepInfo *dbinfo; > static int num_dbs = 0; > ``` > > The comment seems out-of-date. There is only one option. > > 02. check_subscriber and check_publisher > > Missing pg_catalog prefix in some lines. > > 03. get_base_conninfo > > I think dbname would not be set. IIUC, dbname should be a pointer to a > pointer. > > > 04. > > I checked the coverage and found two functions that have never been called: > - drop_subscription > - drop_replication_slot > > Also, some cases were not tested. The bullets below show notable ones for me. > (Some of them would not be needed based on discussions) > > * -r is specified > * -t is specified > * -P option contains dbname > * -d is not specified > * GUC settings are wrong > * primary_slot_name is specified on the standby > * standby server is not working > > At the feature level, we may be able to check that the server log is surely removed in > case of success. > > So, which tests should be added? drop_subscription() is called only in the > cleanup phase, so it may be difficult to test. According to others, it seems that > -r and -t are not tested. GUC settings have many test cases, so I'm not sure they > should be added. Based on this, others can be tested. > > PSA my top-up patch set. > > V18-0001: same as your patch, v17-0001. > V18-0002: modify the alignment of codes. > V18-0003: change an argument of get_base_conninfo. Per comment 3. > === experimental patches === > V18-0004: Add testcases per comment 4. > V18-0005: Remove -P option. I'm not sure it should be needed, but I made it just > in case. I created a cascade physical replication system like node1->node2->node3 and ran pg_createsubscriber for node2. 
After running the script, I started node2 again and found that the pg_createsubscriber command had succeeded, after which the physical replication between node2 and node3 was broken. I feel pg_createsubscriber should check this scenario and throw an error in this case to avoid breaking the cascaded replication setup. I have attached the script which was used to verify this. Thanks and Regards, Shubham Khanna. #!/bin/bash # # This script tests a case of cascading replication. # # Created system: # node1-(physical replication)->node2-(physical replication)->node3 # # Stop instances pg_ctl stop -D node1 pg_ctl stop -D node2 pg_ctl stop -D node3 # Remove old files rm -rf node1 rm -rf node2 rm -rf node3 rm log_node2 log_node1 log_node3 # Initialize primary initdb -D node1 echo "wal_level = logical" >> node1/postgresql.conf echo "max_replication_slots=3" >> node1/postgresql.conf pg_ctl -D node1 -l log_node1 start psql -d postgres -c "CREATE DATABASE db1"; psql -d db1 -c "CHECKPOINT"; sleep 3 # Initialize standby pg_basebackup -v -R -D node2 echo "Port=9000" >> node2/postgresql.conf pg_ctl -D node2 -l log_node2 start # Initialize another standby pg_basebackup -p 9000 -v -R -D node3 echo "Port=9001" >> node3/postgresql.conf pg_ctl -D node3 -l log_node3 start # Insert a tuple to primary psql -d db1 -c "CREATE TABLE c1(a int)"; psql -d db1 -c "Insert into c1 Values(2)"; sleep 3 # And verify it can be seen on another standby psql -d db1 -p 9001 -c "Select * from c1"; pg_createsubscriber -D node2/ -S "host=localhost port=9000 dbname=postgres" -d postgres -d db1 -r -v pg_ctl -D node2 -l log_node2 start
Re: Returning non-terminated string in ECPG Informix-compatible function
On Thu, Feb 15, 2024 at 12:15:40PM +0700, Oleg Tselebrovskiy wrote: > Greetings again. > I was looking through more static analyzer output and found another problem. > In ecpg/pgtypeslib/dt_common.c there are 4 calls of pgtypes_alloc. > This function uses calloc and returns NULL if OOM, but we don't check its > return value and immediately pass it to strcpy, which could lead to > segfault. > > I suggest adding a check for a return value since all other calls of > pgtypes_alloc are checked for NULL. Right. > @@ -654,7 +654,7 @@ intoasc(interval * i, char *str) > if (!tmp) > return -errno; > > -memcpy(str, tmp, strlen(tmp)); > +strcpy(str, tmp); For this stuff, Ashutosh's point was to integrate a test case in the patch. With the pgc you have posted, most of the work is done, but this should be added to src/interfaces/ecpg/test/sql/ to add some automated coverage. See the area for references showing how this is achieved. > @@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, > timestamp * d, > case 'T': > pfmt++; > tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1); > +if(!tmp) > +return 1; These are obviously wrong as pgtypes_alloc() could return NULL. -- Michael signature.asc Description: PGP signature
xmlBufferCreate return value not checked in pgxmlNodeSetToText
Greetings, everyone! While analyzing the output of the Svace static analyzer [1], I found a bug. In the function pgxmlNodeSetToText there is a call of xmlBufferCreate that doesn't have its return value checked. In all four other calls of xmlBufferCreate there is a try...catch that checks the return value inside. I suggest adding the same checks here that are used in the other four calls of xmlBufferCreate. The proposed patch is attached. [1] - https://svace.pages.ispras.ru/svace-website/en/ Oleg Tselebrovskiy, Postgres Pro
diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c index 94641930f7b..2d72ade9c20 100644 --- a/contrib/xml2/xpath.c +++ b/contrib/xml2/xpath.c @@ -122,62 +122,85 @@ pgxmlNodeSetToText(xmlNodeSetPtr nodeset, xmlChar *septagname, xmlChar *plainsep) { - xmlBufferPtr buf; + xmlBufferPtr buf = NULL; xmlChar*result; int i; + PgXmlErrorContext *xmlerrcxt; - buf = xmlBufferCreate(); + xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_LEGACY); - if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0)) - { - xmlBufferWriteChar(buf, "<"); - xmlBufferWriteCHAR(buf, toptagname); - xmlBufferWriteChar(buf, ">"); - } - if (nodeset != NULL) + PG_TRY(); { - for (i = 0; i < nodeset->nodeNr; i++) - { - if (plainsep != NULL) - { -xmlBufferWriteCHAR(buf, - xmlXPathCastNodeToString(nodeset->nodeTab[i])); + buf = xmlBufferCreate(); -/* If this isn't the last entry, write the plain sep. 
*/ -if (i < (nodeset->nodeNr) - 1) - xmlBufferWriteChar(buf, (char *) plainsep); - } - else + if (buf == NULL) + xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY, + "could not allocate xmlBuffer"); + + if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0)) + { + xmlBufferWriteChar(buf, "<"); + xmlBufferWriteCHAR(buf, toptagname); + xmlBufferWriteChar(buf, ">"); + } + if (nodeset != NULL) + { + for (i = 0; i < nodeset->nodeNr; i++) { -if ((septagname != NULL) && (xmlStrlen(septagname) > 0)) +if (plainsep != NULL) { - xmlBufferWriteChar(buf, "<"); - xmlBufferWriteCHAR(buf, septagname); - xmlBufferWriteChar(buf, ">"); -} -xmlNodeDump(buf, - nodeset->nodeTab[i]->doc, - nodeset->nodeTab[i], - 1, 0); + xmlBufferWriteCHAR(buf, + xmlXPathCastNodeToString(nodeset->nodeTab[i])); -if ((septagname != NULL) && (xmlStrlen(septagname) > 0)) + /* If this isn't the last entry, write the plain sep. */ + if (i < (nodeset->nodeNr) - 1) + xmlBufferWriteChar(buf, (char *) plainsep); +} +else { - xmlBufferWriteChar(buf, ""); + if ((septagname != NULL) && (xmlStrlen(septagname) > 0)) + { + xmlBufferWriteChar(buf, "<"); + xmlBufferWriteCHAR(buf, septagname); + xmlBufferWriteChar(buf, ">"); + } + xmlNodeDump(buf, +nodeset->nodeTab[i]->doc, +nodeset->nodeTab[i], +1, 0); + + if ((septagname != NULL) && (xmlStrlen(septagname) > 0)) + { + xmlBufferWriteChar(buf, ""); + } } } } - } - if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0)) + if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0)) + { + xmlBufferWriteChar(buf, ""); + } + result = xmlStrdup(buf->content); + } + PG_CATCH(); { - xmlBufferWriteChar(buf, ""); + if (buf) + xmlBufferFree(buf); + + pg_xml_done(xmlerrcxt, true); + + PG_RE_THROW(); } - result = xmlStrdup(buf->content); + PG_END_TRY(); + xmlBufferFree(buf); + pg_xml_done(xmlerrcxt, false); + return result; }
Re: Why is subscription/t/031_column_list.pl failing so much?
On Thu, 15 Feb 2024 at 08:42, Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 12:58 PM vignesh C wrote: > > > > On Wed, 7 Feb 2024 at 16:27, vignesh C wrote: > > > > > > I was able to reproduce the issue consistently with the changes shared > > > by Tom Lane at [1]. > > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE > > > SUBSCRIPTION and verified that the test has passed consistently for > > > >50 runs that I ran. Also, the increase in test execution time for this > > > case is negligible: > > > Without patch: 7.991 seconds > > > With test change patch: 8.121 seconds > > > > > > The test changes for the same are attached. > > > > Alternative, this could also be fixed like the changes proposed by > > Amit at [1]. In this case we ignore publications that are not found > > for the purpose of computing RelSyncEntry attributes. We won't mark > > such an entry as valid till all the publications are loaded without > > anything missing. This means we won't publish operations on tables > > corresponding to that publication till we found such a publication and > > that seems okay. > > > > Tomas had raised a performance issue forcing us to reload it for every > > replicated change/row in case the publications are invalid at [2]. > > > > Did you measure the performance impact? I think it should impact the > performance only when there is a dropped/non-existent publication as > per the subscription definition. The performance was almost the same in this case: Without patch: 7.991 seconds With skip_not_exist_publication option patch: 7.996 seconds > Also, in the same email[2], Tomas > raised another concern that in such cases it is better to break the > replication. I will handle this in the next version. > > > How > > about keeping the default option as it is and providing a new option > > skip_not_exist_publication while creating/altering a subscription. 
In > > this case if skip_not_exist_publication is specified we will ignore > > the case if publication is not present and publications will be kept > > in invalid and get validated later. > > > > I don't know if this deserves a separate option or not but I think it > is better to discuss this in a separate thread. I will start a separate thread for this. > To resolve the BF > failure, I still feel, we should just recreate the subscription. This > is a pre-existing problem and we can track it via a separate patch > with a test case targetting such failures. +1 for going with recreation of the subscription, the changes for this are available at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com Regards, Vignesh
Re: Returning non-terminated string in ECPG Informix-compatible function
Greetings again. I was looking through more static analyzer output and found another problem. In ecpg/pgtypeslib/dt_common.c there are 4 calls of pgtypes_alloc. This function uses calloc and returns NULL if OOM, but we don't check its return value and immediately pass it to strcpy, which could lead to a segfault. I suggest adding a check for the return value, since all other calls of pgtypes_alloc are checked for NULL. A proposed patch (with previous and current changes) is attached. Oleg Tselebrovskiy, Postgres Pro
diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c index dccf39582da..80d40aa3e09 100644 --- a/src/interfaces/ecpg/compatlib/informix.c +++ b/src/interfaces/ecpg/compatlib/informix.c @@ -654,7 +654,7 @@ intoasc(interval * i, char *str) if (!tmp) return -errno; - memcpy(str, tmp, strlen(tmp)); + strcpy(str, tmp); free(tmp); return 0; } diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c index 99bdc94d6d7..d4ca0cbff6e 100644 --- a/src/interfaces/ecpg/pgtypeslib/dt_common.c +++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c @@ -2659,6 +2659,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d, */ pfmt++; tmp = pgtypes_alloc(strlen("%m/%d/%y") + strlen(pstr) + 1); +if(!tmp) + return 1; strcpy(tmp, "%m/%d/%y"); strcat(tmp, pfmt); err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz); @@ -2784,6 +2786,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d, case 'r': pfmt++; tmp = pgtypes_alloc(strlen("%I:%M:%S %p") + strlen(pstr) + 1); +if(!tmp) + return 1; strcpy(tmp, "%I:%M:%S %p"); strcat(tmp, pfmt); err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz); @@ -2792,6 +2796,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d, case 'R': pfmt++; tmp = pgtypes_alloc(strlen("%H:%M") + strlen(pstr) + 1); +if(!tmp) + return 1; strcpy(tmp, "%H:%M"); 
strcat(tmp, pfmt); err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz); @@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d, case 'T': pfmt++; tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1); +if(!tmp) + return 1; strcpy(tmp, "%H:%M:%S"); strcat(tmp, pfmt); err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
Re: Refactoring backend fork+exec code
On Thu, Feb 15, 2024 at 3:07 AM Andres Freund wrote: > > I think the last remaining question here is about the 0- vs 1-based indexing > > of BackendIds. Is it a good idea to switch to 0-based indexing? And if we do > > it, should we reserve PGPROC 0. I'm on the fence on this one. > > I lean towards it being a good idea. Having two internal indexing schemes was > bad enough so far, but at least one would fairly quickly notice if one used > the wrong one. If they're just offset by 1, it might end up taking longer, > because that'll often also be a valid id. Yeah, I think making everything 0-based is probably the best way forward long term. It might require more cleanup work to get there, but it's just a lot simpler in the end, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Thu, Jan 18, 2024 at 02:20:28PM +, Bertrand Drouvot wrote: > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote: >> I'm not sure if it >> can happen at all, but I think we can rely on the previous conflict reason >> instead of the previous effective_xmin/effective_catalog_xmin/restart_lsn. > > I'm not sure what you mean here. I think we should still keep the "initial" > LSN > so that the next check done with it still makes sense. The previous conflict > reason as you're proposing also makes sense to me but for another reason: > PANIC > in case the issue still happens (for cases we did not think about, meaning cases not > covered by what the added previous LSNs are covering). Using a PANIC seems like an overreaction to me for this path. Sure, we should not record a conflict reason twice, but wouldn't an assertion be enough to check that a termination is not logged twice for this code path run in the checkpointer? +if (!terminated) +{ +initial_restart_lsn = s->data.restart_lsn; +initial_effective_xmin = s->effective_xmin; +initial_catalog_effective_xmin = s->effective_catalog_xmin; +} It seems important to document why we are saving this data here: we do it while we hold the LWLock for repslots, before we perform any termination, because we want to protect conflict reports from incorrect data if the slot data got changed while the lwlock is temporarily released during a termination. No need to mention the pesky standby snapshot records, I guess, as there could be different reasons why these xmins advance. >> 3. Is there a way to reproduce this racy issue, maybe by adding some >> sleeps or something? If yes, can you share it here, just for the >> records, and verify that whatever fix is provided actually works? > > Alexander was able to reproduce it on a slow machine and the issue was not > there > anymore with v1 in place. I think it's tricky to reproduce as it would need > the > slot to advance between the 2 checks. 
I'd really want a test for that in the long term. Wouldn't an injection point placed between the point where ReplicationSlotControlLock is released and then re-acquired, when there is an active PID, be enough? For example just after the kill()? It seems to me that we should hold the checkpointer there, perform a forced advance of the xmins, release the checkpointer, and observe the effects of the change in the second loop. Now, modules/injection_points/ does not allow that yet, but I've posted a patch along these lines that can stop a process and release it using a condition variable (should be 0006 on [1]). I was planning to start a new thread with a patch for the next CF to add this kind of facility, with a regression test for an old bug; the patch needs a few hours of love first. I should be able to post that next week. [1]: https://www.postgresql.org/message-id/caexhw5uwp9rmct9ba91bufknqeurosawtmens4ah2puyzv2...@mail.gmail.com -- Michael signature.asc Description: PGP signature
Re: index prefetching
On Thu, Feb 15, 2024 at 10:33 AM Andres Freund wrote: > The issue here is that we need to read index leaf pages (synchronously for > now!) to get the tids to do readahead of table data. What you describe is done > for the table data (IMO not a good idea medium term [1]), but the problem at > hand is that once we've done readahead for all the tids on one index page, we > can't do more readahead without looking at the next index leaf page. Oh, right. > However, if we want to issue table readahead for tids on the neighboring index > leaf page, we'll - as the patch stands - not hold a pin on the "current" index > leaf page. Which makes index prefetching as currently implemented incompatible > with kill_prior_tuple, as that requires the index leaf page pin being held. But I think it probably also breaks MVCC, as Peter was saying. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Properly pathify the union planner
On Wed, 7 Feb 2024 at 12:05, Andy Fan wrote: > +static int > +pathkeys_useful_for_setop(PlannerInfo *root, List *pathkeys) > +{ > + int n_common_pathkeys; > + > + if (root->setop_pathkeys == NIL) > + return 0; /* no special setop > ordering requested */ > + > + if (pathkeys == NIL) > + return 0; /* unordered path */ > + > + (void) pathkeys_count_contained_in(root->setop_pathkeys, pathkeys, > + > &n_common_pathkeys); > + > + return n_common_pathkeys; > +} > > The two if-clauses looks unnecessary, it should be handled by > pathkeys_count_contained_in already. The same issue exists in > pathkeys_useful_for_ordering as well. Attached patch fix it in master. I agree. I'd rather not have those redundant checks in pathkeys_useful_for_setop(), and I do want those functions to be as similar as possible. So I think adjusting it in master is a good idea. I've pushed your patch. David
Re: index prefetching
Hi, On 2024-02-15 09:59:27 +0530, Robert Haas wrote: > I would have thought that the way this prefetching would work is that > we would bring pages into shared_buffers sooner than we currently do, > but not actually pin them until we're ready to use them, so that it's > possible they might be evicted again before we get around to them, if > we prefetch too far and the system is too busy. The issue here is that we need to read index leaf pages (synchronously for now!) to get the tids to do readahead of table data. What you describe is done for the table data (IMO not a good idea medium term [1]), but the problem at hand is that once we've done readahead for all the tids on one index page, we can't do more readahead without looking at the next index leaf page. Obviously that would lead to a sawtooth like IO pattern, where you'd regularly have to wait for IO for the first tuples referenced by an index leaf page. However, if we want to issue table readahead for tids on the neighboring index leaf page, we'll - as the patch stands - not hold a pin on the "current" index leaf page. Which makes index prefetching as currently implemented incompatible with kill_prior_tuple, as that requires the index leaf page pin being held. > Alternately, it also seems OK to read those later pages and pin them right > away, as long as (1) we don't also give up pins that we would have held in > the absence of prefetching and (2) we have some mechanism for limiting the > number of extra pins that we're holding to a reasonable number given the > size of shared_buffers. FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently used to limit how many buffers a backend may pin for bulk relation extension. 
Greetings, Andres Freund [1] The main reasons that I think that just doing readahead without keeping a pin is a bad idea, at least medium term, are: a) To do AIO you need to hold a pin on the page while the IO is in progress, as the target buffer contents will be modified at some moment you don't control, so that buffer should better not be replaced while IO is in progress. So at the very least you need to hold a pin until the IO is over. b) If you do not keep a pin until you actually use the page, you need to either do another buffer lookup (expensive!) or you need to remember the buffer id and revalidate that it's still pointing to the same block (cheaper, but still not cheap). That's not just bad because it's slow in an absolute sense; more importantly, it increases the potential performance downside of doing readahead for fully cached workloads, because you don't gain anything, but pay the price of two lookups/revalidation. Note that these reasons really just apply to cases where we read ahead because we are quite certain we'll need exactly those blocks (leaving errors or queries ending early aside), not for "heuristic" prefetching. If we e.g. were to issue prefetch requests for neighboring index pages while descending during an ordered index scan, without checking that we'll need those, it'd make sense to just do a "throwaway" prefetch request.
Re: Properly pathify the union planner
On Tue, 6 Feb 2024 at 22:05, Richard Guo wrote: > I'm thinking that maybe it'd be better to move the work of sorting the > subquery's paths to the outer query level, specifically within the > build_setop_child_paths() function, just before we stick SubqueryScanPath > on top of the subquery's paths. I think this is better because: > > 1. This minimizes the impact on subquery planning and reduces the > footprint within the grouping_planner() function as much as possible. > > 2. This can help avoid the aforementioned add_path() issue because the > two involved paths will be structured as: Yes, this is a good idea. I agree with both of your points. I've taken your suggested changes with minor fixups and expanded on them to do the partial paths too. I've also removed almost all of the changes to planner.c. I fixed a bug where I was overwriting the union child's TargetEntry.ressortgroupref without consideration that it might be set for some other purpose in the subquery. I wrote generate_setop_child_grouplist() to handle this, which is almost like generate_setop_grouplist() except it calls assignSortGroupRef() to figure out the next free tleSortGroupRef (or reuses the existing one if the TargetEntry already has one set). Earlier, I pushed a small comment change to pathnode.c in order to shrink this patch down a little. It was also a change that could be made in isolation from this work. v2 attached. David diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index b5a38aeb21..62b7103d0c 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -11454,6 +11454,10 @@ DROP INDEX base_tbl1_idx; DROP INDEX base_tbl2_idx; DROP INDEX async_p3_idx; -- UNION queries +SET enable_sort TO off; +SET enable_incremental_sort TO off; +-- Adjust fdw_startup_cost so that we get an unordered path in the Append. 
+ALTER SERVER loopback2 OPTIONS (ADD fdw_startup_cost '0.00'); EXPLAIN (VERBOSE, COSTS OFF) INSERT INTO result_tbl (SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10) @@ -11535,6 +11539,9 @@ SELECT * FROM result_tbl ORDER BY a; (12 rows) DELETE FROM result_tbl; +RESET enable_incremental_sort; +RESET enable_sort; +ALTER SERVER loopback2 OPTIONS (DROP fdw_startup_cost); -- Disable async execution if we use gating Result nodes for pseudoconstant -- quals EXPLAIN (VERBOSE, COSTS OFF) diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index f410c3db4e..0ed9c5d61f 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -3857,6 +3857,11 @@ DROP INDEX base_tbl2_idx; DROP INDEX async_p3_idx; -- UNION queries +SET enable_sort TO off; +SET enable_incremental_sort TO off; +-- Adjust fdw_startup_cost so that we get an unordered path in the Append. +ALTER SERVER loopback2 OPTIONS (ADD fdw_startup_cost '0.00'); + EXPLAIN (VERBOSE, COSTS OFF) INSERT INTO result_tbl (SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10) @@ -3883,6 +3888,10 @@ UNION ALL SELECT * FROM result_tbl ORDER BY a; DELETE FROM result_tbl; +RESET enable_incremental_sort; +RESET enable_sort; +ALTER SERVER loopback2 OPTIONS (DROP fdw_startup_cost); + -- Disable async execution if we use gating Result nodes for pseudoconstant -- quals EXPLAIN (VERBOSE, COSTS OFF) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 4bd60a09c6..c036911309 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -2867,6 +2867,58 @@ add_child_join_rel_equivalences(PlannerInfo *root, MemoryContextSwitchTo(oldcontext); } +/* + * add_setop_child_rel_equivalences + * Add equivalence members for each non-resjunk target in 'child_tlist' + * to the EquivalenceClass in the corresponding setop_pathkey's pk_class. 
+ * + * 'root' is the PlannerInfo belonging to the top-level set operation. + * 'child_relids' are the Relids of the child relation we're adding + * EquivalenceMembers for. + * 'child_tlist' is the target list for the setop child relation. The target + * list expressions are what we add as EquivalenceMembers. + * 'setop_pathkeys' is a list of PathKeys which must contain an entry for each + * non-resjunk target in 'child_tlist'. + */ +void +add_setop_child_rel_equivalences(PlannerInfo *root, Relids child_relids, +List *child_tlist, List *setop_pathkeys) +{ + ListCell *lc; + ListCell *lc2 = list_head(setop_pathkeys); + + foreach(lc, child_tlist) + { + TargetEntry *tle = lfirst_node(TargetEntry, lc); + EquivalenceMember *parent_em; + PathKey*pk; + + if (tle->resjunk) + continue; + + if (lc2 == NULL) + el
Re: index prefetching
On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra wrote: > I don't think it's just a bookkeeping problem. In a way, nbtree already > does keep an array of tuples to kill (see btgettuple), but it's always > for the current index page. So it's not that we immediately go and kill > the prior tuple - nbtree already stashes it in an array, and kills all > those tuples when moving to the next index page. > > The way I understand the problem is that with prefetching we're bound to > determine the kill_prior_tuple flag with a delay, in which case we might > have already moved to the next index page ... Well... I'm not clear on all of the details of how this works, but this sounds broken to me, for the reasons that Peter G. mentions in his comments about desynchronization. If we currently have a rule that you hold a pin on the index page while processing the heap tuples it references, you can't just throw that out the window and expect things to keep working. Saying that kill_prior_tuple doesn't work when you throw that rule out the window is probably understating the extent of the problem very considerably. I would have thought that the way this prefetching would work is that we would bring pages into shared_buffers sooner than we currently do, but not actually pin them until we're ready to use them, so that it's possible they might be evicted again before we get around to them, if we prefetch too far and the system is too busy. Alternately, it also seems OK to read those later pages and pin them right away, as long as (1) we don't also give up pins that we would have held in the absence of prefetching and (2) we have some mechanism for limiting the number of extra pins that we're holding to a reasonable number given the size of shared_buffers. However, it doesn't seem OK at all to give up pins that the current code holds sooner than the current code would do. -- Robert Haas EDB: http://www.enterprisedb.com
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Fri, Jan 19, 2024 at 11:35:30AM +1300, Thomas Munro wrote: > On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin > wrote: >> I believe there is a small problem in the 039_end_of_wal.pl's >> "xl_tot_len zero" test. It supposes that after immediate shutdown the >> server, upon startup recovery, should always produce a message matching >> "invalid record length at .*: wanted 24, got 0". However, if the >> circumstances are just right and we happened to hit exactly on the edge >> of WAL page, then the message on startup recovery would be "invalid >> magic number in log segment .*, offset .*". The test does not take >> that into account. > > Thanks for figuring this out! Right, I see. I will look more closely > when I'm back from summer vacation in a few days, but first reaction: Thomas, are you planning to look at that? -- Michael signature.asc Description: PGP signature
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra wrote: > The way I think about non-transactional sequence changes is as if they > were tiny transactions that happen "fully" (including commit) at the LSN > where the LSN change is logged. 100% this. > It does not mean we can arbitrarily reorder the changes, it only means > the changes are applied as if they were independent transactions (but in > the same order as they were executed originally). Both with respect to > the other non-transactional changes, and to "commits" of other stuff. Right, this is very important and I agree completely. I'm feeling more confident about this now that I heard you say that stuff -- this is really the key issue I've been worried about since I first looked at this, and I wasn't sure that you were in agreement, but it sounds like you are. I think we should (a) fix the locking bug I found (but that can be independent of this patch) and (b) make sure that this patch documents the points from the quoted material above so that everyone who reads the code (and maybe tries to enhance it) is clear on what the assumptions are. (I haven't checked whether it documents that stuff or not. I'm just saying it should, because I think it's a subtlety that someone might miss.) -- Robert Haas EDB: http://www.enterprisedb.com
Re: Memory consumed by paths during partitionwise join planning
On 6/2/2024 19:51, Ashutosh Bapat wrote: On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track. Let me offer some opinions on this feature. I generally like the idea of a refcount field. We had problems generating alternative paths in extensions while designing the Asymmetric Join feature [1]: when we proposed an alternative path one level below and called the add_path() routine, we lost the path in the pathlist referenced from the upper RelOptInfo. This approach enables cleaner solutions than hacking around with copying and restoring the pathlist. But I'm not sure about freeing unreferenced paths; I would still want to see the alternatives in the pathlist. About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule by which we try to generate the full set of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: restrict the scope of paths considered for subsequent partitions based on which paths were filtered out for previously planned partitions. Also, I am glad to see a positive opinion about the path_walker() routine. Somewhere else, for example in [2], it seems we need it too. [1] https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
Re: About a recently-added message
On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira wrote: > > > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote: > > > > Now, I am less clear about whether to quote "logical" or not in the > > above message. Do you have any suggestions? > > > > > > The possible confusion comes from the fact that the sentence contains "must > > be" > > in the middle of a comparison expression. For pg_createsubscriber, we are > > using > > > > publisher requires wal_level >= logical > > > > I suggest to use something like > > > > slot synchronization requires wal_level >= logical > > > > This sounds like a better idea but shouldn't we directly use this in > 'errmsg' instead of a separate 'errhint'? Also, if change this, then > we should make a similar change for other messages in the same > function. +1 on changing the message(s) in the suggested way. Please find attached a patch for the same. It also removes the double quotes around the variable names. Thanks, Shveta v1-0001-Fix-quotation-of-variable-names.patch Description: Binary data
RE: Synchronizing slots from primary to standby
On Thursday, February 15, 2024 10:49 AM Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > wrote: > > > > On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote: > > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila > wrote: > > > > > > > > To ensure that restart_lsn has been moved to a recent position, we > > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed > > > > as well by walsender. The attached patch does the required change. > > > > > > > > Hou-San can reproduce this problem by adding additional > > > > checkpoints in the test and after applying the attached it fixes > > > > the problem. Now, this patch is mostly based on the theory we > > > > formed based on LOGs on BF and a reproducer by Hou-San, so still, > > > > there is some chance that this doesn't fix the BF failures in which > > > > case I'll > again look into those. > > > > > > I have verified that the patch can fix the issue on my machine(after > > > adding few more checkpoints before slot invalidation test.) I also > > > added one more check in the test to confirm the synced slot is not temp > > > slot. > Here is the v2 patch. > > > > Thanks! > > > > +# To ensure that restart_lsn has moved to a recent WAL position, we > > +need # to log XLOG_RUNNING_XACTS and make sure the same is processed > > +as well $primary->psql('postgres', "CHECKPOINT"); > > > > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT > pg_log_standby_snapshot();" > > be enough? > > > > Yeah, that would be enough. However, the test still fails randomly due to the > same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to > recreate the subscription/slot so that it can get the latest restart_lsn > rather than > relying on pg_log_standby_snapshot() to move it. > > > Not a big deal but maybe we could do the change while modifying > > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync > worker". 
> > > > Right, we can do that or probably this test would have made more sense with a > worker patch where we could wait for the slot to be synced. > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think > that > adding a LOG when we are not able to sync will help in debugging such > problems? I think eventually we can change it to DEBUG1 but for now, it can > help > with stabilizing BF and or some other reported issues. Here is the patch that attempts the re-create sub idea. I also think that a LOG/DEBUG would be useful for such analysis, so the 0002 is to add such a log. Best Regards, Hou zj 0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch Description: 0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch 0001-fix-BF-error-take-2.patch Description: 0001-fix-BF-error-take-2.patch
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > > On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada wrote: > > > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote: > > > > My todo: > > > - benchmark tid store / vacuum again, since we haven't since varlen > > > types and removing unnecessary locks. > > I ran a vacuum benchmark similar to the one in [1] (unlogged tables > for reproducibility), but smaller tables (100 million records), > deleting only the last 20% of the table, and including a parallel > vacuum test. Scripts attached. > > monotonically ordered int column index: > > master: > system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s > system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s > system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s > > v-59: > system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s > system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s > system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s > > uuid column index: > > master: > system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s > system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s > system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s > > v-59: > system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s > system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s > system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s > > int & uuid indexes in parallel: > > master: > system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s > system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s > system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s > > v59: > system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s > system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s > system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s > > Over all, I'm pleased with these results, although I'm confused why > 
sometimes with the patch the first run reports running faster than the > others. I'm curious what others get. Traversing a tree that lives in > DSA has some overhead, as expected, but still comes out way ahead of > master. Thanks! That's a great improvement. I've also run the same scripts in my environment just in case and got similar results: monotonically ordered int column index: master: system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s v-59: system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s uuid column index: master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s v-59: system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s int & uuid indexes in parallel: master: system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s v-59: system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s > > There are still some micro-benchmarks we could do on tidstore, and > it'd be good to find out worse-case memory use (1 dead tuple each on > spread-out pages), but this is decent demonstration. I've tested a simple case where vacuum removes 33k dead tuples spread about every 10 pages. 
master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > > > I'm not sure what the test_node_types_* functions are testing that > > > test_basic doesn't. They have a different, and confusing, way to stop > > > at every size class and check the keys/values. It seems we can replace > > > all that with two more calls (asc/desc) to test_basic, with the > > > maximum level. > > v58-0008: > > + /* borrowed from RT_MAX_SHIFT */ > + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) / > BITS_PER_BYTE) * BITS_PER_BYTE; > > This is harder to read than "64 - 8", and doesn't really help > maintainability either. > Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise. > > + /* leaf nodes */ > + test_basic(test_info, 0); > > + /* internal nodes */ > + test_basic(test_info, 8); > + > + /* max-level nodes */ > + test_basic
Re: make add_paths_to_append_rel aware of startup cost
Thomas Munro writes: > On Thu, Oct 5, 2023 at 9:07 PM David Rowley wrote: >> Thanks. Pushed. > > FYI somehow this plan from a8a968a8212e flipped in this run: > > === dumping > /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs > === > diff -U3 > /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out > /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out > --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out > 2024-01-15 00:31:13.947555940 + > +++ > /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out > 2024-02-14 00:06:17.075584839 + > @@ -1447,9 +1447,9 @@ > -> Append > -> Nested Loop > Join Filter: (t1.tenthous = t2.tenthous) > - -> Seq Scan on tenk1 t1 > + -> Seq Scan on tenk2 t2 > -> Materialize > - -> Seq Scan on tenk2 t2 > + -> Seq Scan on tenk1 t1 > -> Result > (8 rows) > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-02-14%2000%3A01%3A03 Thanks for this information! I will take a look at this. -- Best Regards Andy Fan
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, Feb 14, 2024 at 12:58 PM vignesh C wrote: > > On Wed, 7 Feb 2024 at 16:27, vignesh C wrote: > > > > I was able to reproduce the issue consistently with the changes shared > > by Tom Lane at [1]. > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE > > SUBSCRIPTION and verified that the test has passed consistently for > > >50 runs that I ran. Also the test execution time increased for this > > case is very negligibly: > > Without patch: 7.991 seconds > > With test change patch: 8.121 seconds > > > > The test changes for the same are attached. > > Alternative, this could also be fixed like the changes proposed by > Amit at [1]. In this case we ignore publications that are not found > for the purpose of computing RelSyncEntry attributes. We won't mark > such an entry as valid till all the publications are loaded without > anything missing. This means we won't publish operations on tables > corresponding to that publication till we found such a publication and > that seems okay. > > Tomas had raised a performance issue forcing us to reload it for every > replicated change/row in case the publications are invalid at [2]. > Did you measure the performance impact? I think it should impact the performance only when there is a dropped/non-existent publication as per the subscription definition. Also, in the same email[2], Tomas raised another concern that in such cases it is better to break the replication. > How > about keeping the default option as it is and providing a new option > skip_not_exist_publication while creating/altering a subscription. In > this case if skip_not_exist_publication is specified we will ignore > the case if publication is not present and publications will be kept > in invalid and get validated later. > I don't know if this deserves a separate option or not but I think it is better to discuss this in a separate thread. To resolve the BF failure, I still feel, we should just recreate the subscription. 
This is a pre-existing problem and we can track it via a separate patch with a test case targeting such failures. > The attached patch has the changes for the same. Thoughts? > > [1] - > https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com > [2] - > https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com > -- With Regards, Amit Kapila.
Re: Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm
On Thu, 15 Feb 2024 at 07:24, Michael Paquier wrote: > > On Wed, Feb 14, 2024 at 03:51:08PM +0530, vignesh C wrote: > > First regex is the testname_clusterinstance_data, second regex is the > > timestamp used for pg_upgrade, third regex is for the text files > > generated by pg_upgrade and fourth regex is for the log files > > generated by pg_upgrade. > > > > Can we include these log files also in the buildfarm? > > > > [1] - > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-02-10%2007%3A03%3A10 > > [2] - > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-12-07%2003%3A56%3A20 > > Indeed, these lack some patterns. Why not sending a pull request > around [1] to get more patterns covered? I have added a few more patterns to include the pg_upgrade generated files. The attached patch has the changes for the same. Adding Andrew also to get his thoughts on this. Regards, Vignesh diff --git a/PGBuild/Modules/TestUpgrade.pm b/PGBuild/Modules/TestUpgrade.pm index ad3e00e..83f62b7 100644 --- a/PGBuild/Modules/TestUpgrade.pm +++ b/PGBuild/Modules/TestUpgrade.pm @@ -139,6 +139,8 @@ sub check $self->{pgsql}/src/bin/pg_upgrade/log/* $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/*.diffs $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/* + $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/pgdata/pg_upgrade_output.d/*/*.txt + $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/pgdata/pg_upgrade_output.d/*/log/*.log $self->{pgsql}/src/test/regress/*.diffs" ); $log->add_log($_) foreach (@logfiles);
Re: About a recently-added message
On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira wrote: > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote: > > Now, I am less clear about whether to quote "logical" or not in the > above message. Do you have any suggestions? > > > The possible confusion comes from the fact that the sentence contains "must > be" > in the middle of a comparison expression. For pg_createsubscriber, we are > using > > publisher requires wal_level >= logical > > I suggest to use something like > > slot synchronization requires wal_level >= logical > This sounds like a better idea but shouldn't we directly use this in 'errmsg' instead of a separate 'errhint'? Also, if change this, then we should make a similar change for other messages in the same function. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot wrote: > > On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote: > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila > > wrote: > > > > > > To ensure that restart_lsn has been moved to a recent position, we need > > > to log > > > XLOG_RUNNING_XACTS and make sure the same is processed as well by > > > walsender. The attached patch does the required change. > > > > > > Hou-San can reproduce this problem by adding additional checkpoints in the > > > test and after applying the attached it fixes the problem. Now, this > > > patch is > > > mostly based on the theory we formed based on LOGs on BF and a reproducer > > > by Hou-San, so still, there is some chance that this doesn't fix the BF > > > failures in > > > which case I'll again look into those. > > > > I have verified that the patch can fix the issue on my machine(after adding > > few > > more checkpoints before slot invalidation test.) I also added one more > > check in > > the test to confirm the synced slot is not temp slot. Here is the v2 patch. > > Thanks! > > +# To ensure that restart_lsn has moved to a recent WAL position, we need > +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well > +$primary->psql('postgres', "CHECKPOINT"); > > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT > pg_log_standby_snapshot();" > be enough? > Yeah, that would be enough. However, the test still fails randomly due to the same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to recreate the subscription/slot so that it can get the latest restart_lsn rather than relying on pg_log_standby_snapshot() to move it. > Not a big deal but maybe we could do the change while modifying > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync > worker". > Right, we can do that or probably this test would have made more sense with a worker patch where we could wait for the slot to be synced. 
Anyway, let's try the idea of recreating the slot/subscription. BTW, do you think that adding a LOG when we are not able to sync will help in debugging such problems? I think eventually we can change it to DEBUG1, but for now it can help with stabilizing the BF and/or some other reported issues. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-15%2000%3A14%3A38 -- With Regards, Amit Kapila.
Re: Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm
On Wed, Feb 14, 2024 at 03:51:08PM +0530, vignesh C wrote: > First regex is the testname_clusterinstance_data, second regex is the > timestamp used for pg_upgrade, third regex is for the text files > generated by pg_upgrade and fourth regex is for the log files > generated by pg_upgrade. > > Can we include these log files also in the buildfarm? > > [1] - > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-02-10%2007%3A03%3A10 > [2] - > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-12-07%2003%3A56%3A20 Indeed, these lack some patterns. Why not send a pull request around [1] to get more patterns covered? [1]: https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/TestUpgrade.pm -- Michael
Re: Add system identifier to backup manifest
On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote:
> Ok, I did that way in the attached version, I have passed the control file's
> full path as a second argument to verify_system_identifier(), which we get in
> verify_backup_file(), but that is not doing anything useful with it, since we
> were using get_controlfile() to open the control file, which takes the
> directory as an input and computes the full path on its own.

I've read through the patch, and that's pretty cool.

-static void
-parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
-                    manifest_wal_range **first_wal_range_p)
+static manifest_data *
+parse_manifest_file(char *manifest_path)

In 0001, should the comment describing this routine be updated as well?

+   identifier with pg_control of the backup directory or fails verification

This is missing a markup here.

+      PostgreSQL 17, it is 2; in older versions,
+      it is 1.

Perhaps a couple of s here.

+   if (strcmp(parse->manifest_version, "1") != 0 &&
+       strcmp(parse->manifest_version, "2") != 0)
+       json_manifest_parse_failure(parse->context,
+                                   "unexpected manifest version");
+
+   /* Parse version. */
+   version = strtoi64(parse->manifest_version, &ep, 10);
+   if (*ep)
+       json_manifest_parse_failure(parse->context,
+                                   "manifest version not an integer");
+
+   /* Invoke the callback for version */
+   context->version_cb(context, version);

Shouldn't these two checks be reversed? And is there actually a need for the first check at all, knowing that the version callback should be in charge of performing the validation based on the version received?

+my $node2;
+{
+   local $ENV{'INITDB_TEMPLATE'} = undef;

Not sure that it is a good idea to duplicate this pattern twice. Shouldn't this be something we'd want to control with an option in the init() method instead?
+static void
+verify_system_identifier(verifier_context *context, char *controlpath)

Relying both on controlpath, being a full path to the control file including the data directory, and context->backup_directory to read the contents of the control file looks a bit weird. Wouldn't it be cleaner to just use one of them?

-- Michael
Re: index prefetching
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund wrote: > On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrently TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > Given that the interlock is only needed for non-mvcc scans, that non-mvcc > scans are rare due to catalog accesses using snapshots these days and that > most non-mvcc scans do single-tuple lookups, it might be viable to be more > restrictive about prefetching iff non-mvcc snapshots are in use and to use > method of cleanup that allows multiple pages to be cleaned up otherwise. I agree, but don't think that it matters all that much. If you have an MVCC snapshot, that doesn't mean that TID recycle safety problems automatically go away. It only means that you have one known and supported alternative approach to dealing with such problems. It's not like you just get that for free, just by using an MVCC snapshot, though -- it has downsides. Downsides such as the current _bt_killitems() behavior with a concurrently-modified leaf page (modified when we didn't hold a leaf page pin). It'll just give up on setting any LP_DEAD bits due to noticing that the leaf page's LSN changed. (Plus there are implementation restrictions that I won't repeat again now.) When I refer to the buffer pin interlock, I'm mostly referring to the general need for something like that in the context of index scans. Principally in order to make kill_prior_tuple continue to work in something more or less like its current form. 
> However, I don't think we would necessarily have to relax the IAM pinning > rules, just to be able to do prefetching of more than one index leaf > page. To be clear, we already do relax the IAM pinning rules. Or at least nbtree selectively opts out, as I've gone into already. > Restricting prefetching to entries within a single leaf page obviously > has the disadvantage of not being able to benefit from concurrent IO whenever > crossing a leaf page boundary, but at the same time processing entries from > just two leaf pages would often allow for a sufficiently aggressive > prefetching. Pinning a small number of leaf pages instead of a single leaf > page shouldn't be a problem. You're probably right. I just don't see any need to solve that problem in v1. > One argument for loosening the tight coupling between kill_prior_tuples and > index scan progress is that the lack of kill_prior_tuples for bitmap scans is > quite problematic. I've seen numerous production issues with bitmap scans > caused by subsequent scans processing a growing set of dead tuples, where > plain index scans were substantially slower initially but didn't get much > slower over time. I've seen production issues like that too. No doubt it's a problem. > We might be able to design a system where the bitmap > contains a certain number of back-references to the index, allowing later > cleanup if there weren't any page splits or such. That does seem possible, but do you really want a design for index prefetching that relies on that massive enhancement (a total redesign of kill_prior_tuple) happening at some point in the not-too-distant future? Seems risky, from a project management point of view. This back-references idea seems rather complicated, especially if it needs to work with very large bitmap index scans. Since you'll still have the basic problem of TID recycle safety to deal with (even with an MVCC snapshot), you don't just have to revisit the leaf pages. 
You also have to revisit the corresponding heap pages (generally they'll be a lot more numerous than leaf pages). You'll have traded one problem for another (which is not to say that it's not a good trade-off). Right now the executor uses a amgettuple interface, and knows nothing about index related costs (e.g., pages accessed in any index, index qual costs). While the index AM has some limited understanding of heap access costs. So the index AM kinda knows a small bit about both types of costs (possibly not enough, but something). That informs the language I'm using to describe all this. To do something like your "back-references to the index" thing well, I think that you need more dynamic behavior around when you visit the heap to get heap tuples pointed to by TIDs from index pages (i.e. dynamic behavior that determines how many leaf pages to go before going to the heap to get pointed-to TIDs). That is basically what I meant by "put the index AM in control" -- it doesn't *strictly* require that the index AM actually do that. Just that a single piece of code has to have access to the full co
Re: index prefetching
Hi, On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). Given that the interlock is only needed for non-mvcc scans, that non-mvcc scans are rare due to catalog accesses using snapshots these days and that most non-mvcc scans do single-tuple lookups, it might be viable to be more restrictive about prefetching iff non-mvcc snapshots are in use and to use method of cleanup that allows multiple pages to be cleaned up otherwise. However, I don't think we would necessarily have to relax the IAM pinning rules, just to be able to do prefetching of more than one index leaf page. Restricting prefetching to entries within a single leaf page obviously has the disadvantage of not being able to benefit from concurrent IO whenever crossing a leaf page boundary, but at the same time processing entries from just two leaf pages would often allow for a sufficiently aggressive prefetching. Pinning a small number of leaf pages instead of a single leaf page shouldn't be a problem. One argument for loosening the tight coupling between kill_prior_tuples and index scan progress is that the lack of kill_prior_tuples for bitmap scans is quite problematic. I've seen numerous production issues with bitmap scans caused by subsequent scans processing a growing set of dead tuples, where plain index scans were substantially slower initially but didn't get much slower over time. 
We might be able to design a system where the bitmap contains a certain number of back-references to the index, allowing later cleanup if there weren't any page splits or such. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). Depending on what "control" means I'm doubtful: Imo there are decisions influencing prefetching that an index AM shouldn't need to know about directly, e.g. how the plan shape influences how many tuples are actually going to be consumed. Of course that determination could be made in planner/executor and handed to IAMs, for the IAM to then "control" the prefetching. Another aspect is that *long* term I think we want to be able to execute different parts of the plan tree when one part is blocked for IO. Of course that's not always possible. But particularly with partitioned queries it often is. Depending on the form of "control" that's harder if IAMs are in control, because control flow needs to return to the executor to be able to switch to a different node, so we can't wait for IO inside the AM. There probably are ways IAMs could be in "control" that would be compatible with such constraints however. Greetings, Andres Freund
Re: index prefetching
Hi, On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote: > > > The LIMIT problem is not very clear to me either. Yes, if we get close > > > to the end of the leaf page, we may need to visit the next leaf page. > > > But that's kinda the whole point of prefetching - reading stuff ahead, > > > and reading too far ahead is an inherent risk. Isn't that a problem we > > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > > the impact. > > > > Right now, the index AM doesn't know anything about LIMIT at all. That > > doesn't matter, since the index AM can only read/scan one full leaf > > page before returning control back to the executor proper. The > > executor proper can just shut down the whole index scan upon finding > > that we've already returned N tuples for a LIMIT N. > > > > We don't do prefetching right now, but we also don't risk reading a > > leaf page that'll just never be needed. Those two things are in > > tension, but I don't think that that's quite the same thing as the > > usual standard prefetching tension/problem. Here there is uncertainty > > about whether what we're prefetching will *ever* be required -- not > > uncertainty about when exactly it'll be required. (Perhaps this > > distinction doesn't mean much to you. I'm just telling you how I think > > about it, in case it helps move the discussion forward.) > > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I'm not sure that that's really true. I think the more common and more problematic case for partially executing a sub-tree of a query are nested loops (worse because that happens many times within a query). Particularly for anti-joins prefetching too aggressively could lead to a significant IO amplification. 
At the same time it's IMO more important to ramp up prefetching distance fairly aggressively for index scans than it is for sequential scans. For sequential scans it's quite likely that either the whole scan takes quite a while (thus slowly ramping doesn't affect overall time that much) or that the data is cached anyway because the tables are small and frequently used (in which case we don't need to ramp). And even if smaller tables aren't cached, because it's sequential IO, the IOs are cheaper as they're sequential. Contrast that to index scans, where it's much more likely that you have cache misses in queries that do an overall fairly small number of IOs and where that IO is largely random.

I think we'll need some awareness at ExecInitNode() time about how the results of the nodes are used. I see a few "classes":

1) All rows are needed, because the node is below an Agg, Hash, Materialize, Sort, etc. Can be determined purely by the plan shape.

2) All rows are needed, because the node is completely consumed by the top-level (i.e. no limit, anti-joins or such in between) and the top-level wants to run the whole query. Unfortunately I don't think we know this at plan time at the moment (it's just determined by what's passed to ExecutorRun()).

3) Some rows are needed, but it's hard to know the precise number. E.g. because of a LIMIT further up.

4) Only a single row is going to be needed, albeit possibly after filtering on the node level. E.g. the anti-join case.

There are different times at which we could determine how each node is consumed:

a) Determine node consumption "class" purely within ExecInit*, via different eflags. Today that couldn't deal with 2), but I think it'd not be too hard to modify callers that consume query results completely to tell that to ExecutorStart(), not just ExecutorRun(). A disadvantage would be that this prevents us from taking IO depth into account during costing. There very well might be plans that are cheaper than others because the plan shape allows more concurrent IO.

b) Determine node consumption class at plan time. This also couldn't deal with 2), but fixing that probably would be harder, because we'll often not know at plan time how the query will be executed. And in fact the same plan might be executed multiple ways, in case of prepared statements. The obvious advantage is of course that we can influence the choice of paths.

I suspect we'd eventually want a mix of both: plan time to be able to influence plan shape, ExecInit* to deal with not knowing how the query will be consumed at plan time. Which suggests that we could start with whichever is easier and extend later.

Greetings,

Andres Freund
Re: common signal handler protection
On Wed, Feb 14, 2024 at 11:55:43AM -0800, Noah Misch wrote: > I took a look over each of these. +1 for all. Thank you. Committed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: index prefetching
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman wrote: > So, a pin on the index leaf page is sufficient to keep line pointers > from being reused? If we stick to prefetching heap blocks referred to > by index tuples in a single index leaf page, and we keep that page > pinned, will we still have a problem? That's certainly one way of dealing with it. Obviously, there are questions about how you do that in a way that consistently avoids creating new problems. > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I think that I'd rather use information at execution time instead, if at all possible (perhaps in addition to a hint given by the planner). But it seems a bit premature to discuss this problem now, except to say that it might indeed be a problem. > > It's certainly possible that you could figure out various workarounds > > for each of these issues (plus the kill_prior_tuple issue) with a > > prefetching design that "de-synchronizes" the index access and the > > heap access. But it might well be better to extend the existing design > > in a way that just avoids all these problems in the first place. Maybe > > "de-synchronization" really can pay for itself (because the benefits > > will outweigh these costs), but if you go that way then I'd really > > prefer it that way. > > Forcing each index access to be synchronous and interleaved with each > table access seems like an unprincipled design constraint. While it is > true that we rely on that in our current implementation (when using > non-MVCC snapshots), it doesn't seem like a principle inherent to > accessing indexes and tables. There is nothing sacred about the way plain index scans work right now -- especially the part about buffer pins as an interlock. 
If the pin thing really was sacred, then we could never have allowed nbtree to selectively opt-out in cases where it's possible to provide an equivalent correctness guarantee without holding onto buffer pins, which, as I went into, is how it actually works in nbtree's _bt_killitems() today (see commit 2ed5b87f96 for full details). And so in principle I have no problem with the idea of revising the basic definition of plain index scans -- especially if it's to make the definition more abstract, without fundamentally changing it (e.g., to make it no longer reference buffer pins, making life easier for prefetching, while at the same time still implying the same underlying guarantees sufficient to allow nbtree to mostly work the same way as today). All I'm really saying is: 1. The sort of tricks that we can do in nbtree's _bt_killitems() are quite useful, and ought to be preserved in something like their current form, even when prefetching is in use. This seems to push things in the direction of centralizing control of the process in index scan code. For example, it has to understand that _bt_killitems() will be called at some regular cadence that is well defined and sensible from an index point of view. 2. Are you sure that the leaf-page-at-a-time thing is such a huge hindrance to effective prefetching? I suppose that it might be much more important than I imagine it is right now, but it'd be nice to have something a bit more concrete to go on. 3. Even if it is somewhat important, do you really need to get that part working in v1? Tomas' original prototype worked with the leaf-page-at-a-time thing, and that still seemed like a big improvement to me. While being less invasive, in effect. If we can agree that something like that represents a useful step in the right direction (not an evolutionary dead end), then we can make good incremental progress within a single release. 
> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching, at the end of the
> day, the table AM needs some TID or TID-equivalents whose blocks it
> can go fetch.

I wasn't really thinking of index page prefetching at all. Just the cost of applying index quals to read leaf pages that might never actually need to be read, due to the presence of a LIMIT. That is kind of a new problem created by eagerly reading (without actually prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think having so many members of the IndexScanDescriptor
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error prone.

But that's kinda how amgettuple is supposed to work -- cursors need it to work that way. Having some kind of general notion of scan order is also important t
Re: When extended query protocol ends?
>>> From [1] I think the JDBC driver sends something like below if >>> autosave=always option is specified. >>> >>> "BEGIN READ ONLY" Parse/Bind/Execute (in the extended query protocol) >>> "SAVEPOINT PGJDBC_AUTOSAVE" (in the simple query protocol) >>> >>> It seems the SAVEPOINT is sent without finishing the extended query >>> protocol (i.e. without Sync message). Is it possible for the JDBC >>> driver to issue a Sync message before sending SAVEPOINT in simple >>> query protocol? Or you can send SAVEPOINT using the extended query >>> protocol. >>> >>> [1] >>> https://www.pgpool.net/pipermail/pgpool-general/2023-December/009051.html >> >> >> Can you ask the OP what version of the driver they are using. From what I >> can tell we send BEGIN using SimpleQuery. > > Sure. I will get back once I get the JDBC version. Here it is: > JDBC driver version used: 42.5.1 Regards, Karel. Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: Add trim_trailing_whitespace to editorconfig file
> On 14 Feb 2024, at 23:06, Melanie Plageman wrote: > > On Wed, Feb 14, 2024 at 11:35 AM Jelte Fennema-Nio wrote: >> >> This brings our .gitattributes and .editorconfig files more in line. I >> had the problem that "git add" would complain often about trailing >> whitespaces when I was changing sgml files specifically. > > +1 from me. But when do we want it to be false? That is, why not > declare it true for all file types? Regression test .out files commonly have spaces at the end of the line. (Not to mention the ECPG .c files but they probably really shouldn't have.) -- Daniel Gustafsson
Re: Add trim_trailing_whitespace to editorconfig file
On Wed, Feb 14, 2024 at 11:35 AM Jelte Fennema-Nio wrote: > > This brings our .gitattributes and .editorconfig files more in line. I > had the problem that "git add" would complain often about trailing > whitespaces when I was changing sgml files specifically. +1 from me. But when do we want it to be false? That is, why not declare it true for all file types? - Melanie
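For illustration, the division of labor discussed in this thread — trim by default, but leave regression test output alone — could be expressed in an .editorconfig fragment like the following. This is a hypothetical sketch, not the committed file.

```ini
# Sketch only: trim trailing whitespace everywhere by default...
[*]
trim_trailing_whitespace = true

# ...but not in regression test expected-output files, which
# legitimately contain trailing spaces.
[*.out]
trim_trailing_whitespace = false
```

Later sections override earlier ones in EditorConfig, so the `[*.out]` exception wins for those files.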
Re: index prefetching
On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan wrote: > > On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra > wrote: > > > Another thing that argues against doing this is that we might not need > > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > > We could end up scanning a whole extra leaf page (including all of its > > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > > (that's not what happens right now, but it isn't really needed at all > > > right now). > > > > > > > I'm not quite sure I understand what is "this" that you argue against. > > Are you saying we should not separate the two scans? If yes, is there a > > better way to do this? > > What I'm concerned about is the difficulty and complexity of any > design that requires revising "63.4. Index Locking Considerations", > since that's pretty subtle stuff. In particular, if prefetching > "de-synchronizes" (to use your term) the index leaf page level scan > and the heap page scan, then we'll probably have to totally revise the > basic API. So, a pin on the index leaf page is sufficient to keep line pointers from being reused? If we stick to prefetching heap blocks referred to by index tuples in a single index leaf page, and we keep that page pinned, will we still have a problem? > > The LIMIT problem is not very clear to me either. Yes, if we get close > > to the end of the leaf page, we may need to visit the next leaf page. > > But that's kinda the whole point of prefetching - reading stuff ahead, > > and reading too far ahead is an inherent risk. Isn't that a problem we > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > the impact. > > Right now, the index AM doesn't know anything about LIMIT at all. That > doesn't matter, since the index AM can only read/scan one full leaf > page before returning control back to the executor proper. 
The > executor proper can just shut down the whole index scan upon finding > that we've already returned N tuples for a LIMIT N. > > We don't do prefetching right now, but we also don't risk reading a > leaf page that'll just never be needed. Those two things are in > tension, but I don't think that that's quite the same thing as the > usual standard prefetching tension/problem. Here there is uncertainty > about whether what we're prefetching will *ever* be required -- not > uncertainty about when exactly it'll be required. (Perhaps this > distinction doesn't mean much to you. I'm just telling you how I think > about it, in case it helps move the discussion forward.) I don't think that the LIMIT problem is too different for index scans than heap scans. We will need some advice from planner to come down to prevent over-eager prefetching in all cases. > Another factor that complicates things here is mark/restore > processing. The design for that has the idea of processing one page at > a time baked-in. Kinda like with the kill_prior_tuple issue. Yes, I mentioned this in my earlier email. I think we can resolve mark/restore by resetting the prefetch and TID queues and restoring the last used heap TID in the index scan descriptor. > It's certainly possible that you could figure out various workarounds > for each of these issues (plus the kill_prior_tuple issue) with a > prefetching design that "de-synchronizes" the index access and the > heap access. But it might well be better to extend the existing design > in a way that just avoids all these problems in the first place. Maybe > "de-synchronization" really can pay for itself (because the benefits > will outweigh these costs), but if you go that way then I'd really > prefer it that way. Forcing each index access to be synchronous and interleaved with each table access seems like an unprincipled design constraint. 
While it is true that we rely on that in our current implementation (when using non-MVCC snapshots), it doesn't seem like a principle inherent to accessing indexes and tables. > > > I think that it makes sense to put the index AM in control here -- > > > that almost follows from what I said about the index AM API. The index > > > AM already needs to be in control, in about the same way, to deal with > > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > > > > In control how? What would be the control flow - what part would be > > managed by the index AM? > > ISTM that prefetching for an index scan is about the index scan > itself, first and foremost. The heap accesses are usually the dominant > cost, of course, but sometimes the index leaf page accesses really do > make up a significant fraction of the overall cost of the index scan. > Especially with an expensive index qual. So if you just assume that > the TIDs returned by the index scan are the only thing that matters, > you might have a model that's basically correct on average, but is > occasionally very wrong. That's one reason for "putting the index AM > in control". I don't think the fac
Re: Refactoring backend fork+exec code
Hi,

On 2024-02-08 13:19:53 +0200, Heikki Linnakangas wrote:
> > > -	/*
> > > -	 * Assign the ProcSignalSlot for an auxiliary process. Since it doesn't
> > > -	 * have a BackendId, the slot is statically allocated based on the
> > > -	 * auxiliary process type (MyAuxProcType). Backends use slots indexed in
> > > -	 * the range from 1 to MaxBackends (inclusive), so we use MaxBackends +
> > > -	 * AuxProcType + 1 as the index of the slot for an auxiliary process.
> > > -	 *
> > > -	 * This will need rethinking if we ever want more than one of a particular
> > > -	 * auxiliary process type.
> > > -	 */
> > > -	ProcSignalInit(MaxBackends + MyAuxProcType + 1);
> > > +	ProcSignalInit();
> >
> > Now that we don't need the offset here, we could move ProcSignalInit() into
> > BaseInit() I think?
>
> Hmm, doesn't feel right to me. BaseInit() is mostly concerned with setting
> up backend-private structures, and it's also called for a standalone
> backend.

It already initializes a lot of shared subsystems (pgstat, replication slots, and arguably things like the buffer pool, temporary file access and WAL). And note that it already requires that MyProc is already set (but it's not yet "added" to the procarray, i.e. doesn't do visibility stuff at that stage).

I don't think that BaseInit() being called by standalone backends really poses a problem? So is InitPostgres(), which does call ProcSignalInit() in standalone processes. My mental model is that BaseInit() is for stuff that's shared between processes that do attach to databases and those that don't.
Right now the initialization flow is roughly:

standalone:     InitProcess() -> BaseInit() -> StartupXLOG() -> ProcArrayAdd() -> SharedInvalBackendInit() -> ProcSignalInit() -> pgstat_beinit() -> attach to db -> pgstat_bestart()
normal backend: InitProcess() -> BaseInit() -> ProcArrayAdd() -> SharedInvalBackendInit() -> ProcSignalInit() -> pgstat_beinit() -> attach to db -> pgstat_bestart()
aux process:    InitAuxiliaryProcess() -> BaseInit() -> ProcSignalInit() -> pgstat_beinit() -> pgstat_bestart()

The only reason ProcSignalInit() happens kinda late is that historically we used BackendIds as the index, which were only assigned in SharedInvalBackendInit() for normal processes. But that doesn't make sense anymore after your changes.

Similarly, we do pgstat_beinit() quite late, but that's again only because it uses MyBackendId, which today is only assigned during SharedInvalBackendInit(). I don't think we can do pgstat_bestart() earlier though, which is a shame, given the four calls to it inside InitPostgres().

> I feel the process initialization codepaths could use some cleanup in
> general. Not sure what exactly.

Very much agreed.

> > > +/*
> > > + * BackendIdGetProc -- get a backend's PGPROC given its backend ID
> > > + *
> > > + * The result may be out of date arbitrarily quickly, so the caller
> > > + * must be careful about how this information is used. NULL is
> > > + * returned if the backend is not active.
> > > + */
> > > +PGPROC *
> > > +BackendIdGetProc(int backendID)
> > > +{
> > > +	PGPROC *result;
> > > +
> > > +	if (backendID < 1 || backendID > ProcGlobal->allProcCount)
> > > +		return NULL;
> >
> > Hm, isn't calling BackendIdGetProc() with these values a bug? That's not
> > about being out of date or such.
>
> Perhaps. I just followed the example of the old implementation, which also
> returns NULL on bogus inputs.

Fair enough. Makes it harder to not notice bugs, but that's not on this patchset to fix...

> I think the last remaining question here is about the 0- vs 1-based indexing
> of BackendIds. Is it a good idea to switch to 0-based indexing?
> And if we do it, should we reserve PGPROC 0? I'm on the fence on this one. I lean towards it being a good idea. Having two internal indexing schemes was bad enough so far, but at least one would fairly quickly notice if one used the wrong one. If they're just offset by 1, it might end up taking longer, because that'll often also be a valid id. But I think you have the author's prerogative on this one. If we do so, I think it might be better to standardize on MyProcNumber instead of MyBackendId. That'll force looking at code where indexing shifts by 1 - and it also seems more descriptive, as inside postgres it's imo clearer what a "proc number" is than what a "backend id" is. Particularly because the latter is also used for things that aren't backends... The only exception is SQL-level users; for those I think it might make sense to keep the current 1-based indexing, there's just a few functions where we'd need to translate. > @@ -791,6 +792,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, > TransactionId latestXid) > static vo
Re: index prefetching
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman wrote: > > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > wrote: > > > > On 2/7/24 22:48, Melanie Plageman wrote: > > > ... > > > - switching scan directions > > > > > > If the index scan switches directions on a given invocation of > > > IndexNext(), heap blocks may have already been prefetched and read for > > > blocks containing tuples beyond the point at which we want to switch > > > directions. > > > > > > We could fix this by having some kind of streaming read "reset" > > > callback to drop all of the buffers which have been prefetched which > > > are now no longer needed. We'd have to go backwards from the last TID > > > which was yielded to the caller and figure out which buffers in the > > > pgsr buffer ranges are associated with all of the TIDs which were > > > prefetched after that TID. The TIDs are in the per_buffer_data > > > associated with each buffer in pgsr. The issue would be searching > > > through those efficiently. > > > > > > > Yeah, that's roughly what I envisioned in one of my previous messages > > about this issue - walking back the TIDs read from the index and added > > to the prefetch queue. > > > > > The other issue is that the streaming read API does not currently > > > support backwards scans. So, if we switch to a backwards scan from a > > > forwards scan, we would need to fallback to the non streaming read > > > method. We could do this by just setting the TID queue size to 1 > > > (which is what I have currently implemented). Or we could add > > > backwards scan support to the streaming read API. > > > > > > > What do you mean by "support for backwards scans" in the streaming read > > API? 
I imagined it naively as > > > > 1) drop all requests in the streaming read API queue > > > > 2) walk back all "future" requests in the TID queue > > > > 3) start prefetching as if from scratch > > > > Maybe there's a way to optimize this and reuse some of the work more > > efficiently, but my assumption is that the scan direction does not > > change very often, and that we process many items in between. > > Yes, the steps you mention for resetting the queues make sense. What I > meant by "backwards scan is not supported by the streaming read API" > is that Thomas/Andres had mentioned that the streaming read API does > not support backwards scans right now. Though, since the callback just > returns a block number, I don't know how it would break. > > When switching between a forwards and backwards scan, does it go > backwards from the current position or start at the end (or beginning) > of the relation? Okay, well I answered this question for myself, by, um, trying it :). FETCH backward will go backwards from the current cursor position. So, I don't see exactly why this would be an issue. > If it is the former, then the blocks would most > likely be in shared buffers -- which the streaming read API handles. > It is not obvious to me from looking at the code what the gap is, so > perhaps Thomas could weigh in. I have the same problem with the sequential scan streaming read user, so I am going to try and figure this backwards scan and switching scan direction thing there (where we don't have other issues). - Melanie
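For reference, the cursor behavior established above ("FETCH backward will go backwards from the current cursor position") can be seen in a minimal session like the following; the cursor name and query are illustrative, not taken from the patch, and this needs a running server to try:

```sql
BEGIN;
-- SCROLL guarantees the cursor supports backward fetches
DECLARE c SCROLL CURSOR FOR SELECT g FROM generate_series(1, 10) AS g;
FETCH FORWARD 5 FROM c;   -- returns rows 1..5; cursor is now positioned on row 5
FETCH BACKWARD 2 FROM c;  -- continues backwards from that position: rows 4, then 3
COMMIT;
```

This is consistent with the observation in the thread that a direction switch resumes from the current position rather than restarting at the end of the relation.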
Re: Fix a typo in pg_rotate_logfile
> On 14 Feb 2024, at 19:51, Nathan Bossart wrote: > > On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote: >> Daniel Gustafsson writes: >>> On 14 Feb 2024, at 11:35, Dave Page wrote: That said, pgAdmin III has been out of support for many years, and as far as I know, it (and similarly old versions of EDB's PEM which was based on it) were the only consumers of adminpack. I would not be sad to see it removed entirely >> >>> Searching on Github and Debian Codesearch I cannot find any reference to >>> anyone >>> using any function from adminpack. With pgAdminIII being EOL it might be time to >>> remove it now rather than be on the hook to maintain it for another 5 years >>> until v17 goes EOL. It'll still be around for years in V16->. >> >> Works for me. >> >>> Attached is a diff to show what it would look like to remove adminpack >>> (catalog >>> version bump omitted on purpose to avoid conflicts until commit). >> >> I don't see any references you missed, so +1. > > Seems reasonable to me, too. Thanks! I'll put this in the next CF to keep it open for comments a bit longer, but will close it early in the CF. -- Daniel Gustafsson
Re: common signal handler protection
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote: > rebased for cfbot I took a look over each of these. +1 for all. Thank you.
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote: > Attached is a patch set which refactors BitmapHeapScan such that it > can use the streaming read API [1]. It also resolves the long-standing > FIXME in the BitmapHeapScan code suggesting that the skip fetch > optimization should be pushed into the table AMs. Additionally, it > moves table scan initialization to after the index scan and bitmap > initialization. Thanks for working on this! While I have some quibbles with details, I think this is quite a bit of progress in the right direction. > patches 0001-0002 are assorted cleanup needed later in the set. > patches 0003 moves the table scan initialization to after bitmap creation > patch 0004 is, I think, a bug fix. see [2]. I'd not quite call it a bugfix, it's not like it leads to wrong behaviour. Seems more like an optimization. But whatever :) > The caveat is that these patches introduce breaking changes to two > table AM functions for bitmapheapscan: table_scan_bitmap_next_block() > and table_scan_bitmap_next_tuple(). That's to be expected, I don't think it's worth worrying about. Right now a bunch of TAMs can't implement bitmap scans, this goes a fair bit towards allowing that... > From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Mon, 12 Feb 2024 18:50:29 -0500 > Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup > > There is no reason for table_beginscan_bm() to begin the actual scan of > the underlying table in ExecInitBitmapHeapScan(). We can begin the > underlying table scan after the index scan has been completed and the > bitmap built. > > The one use of the scan descriptor during initialization was > ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot > with one from an array in the parallel state. This overwrote the > snapshot set in table_beginscan_bm(). 
> > By saving that worker snapshot as a member in the BitmapHeapScanState > during initialization, it can be restored in table_beginscan_bm() after > returning from the table AM specific begin scan function. I don't understand what the point of passing two different snapshots to table_beginscan_bm() is. What does that even mean? Why can't we just use the correct snapshot initially? > From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Mon, 12 Feb 2024 19:03:24 -0500 > Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip > fetch > > Previously BitmapPrefetch() used the recheck flag for the current block > to determine whether or not it could skip prefetching the proposed > prefetch block. It makes more sense for it to use the recheck flag from > the TBMIterateResult for the prefetch block instead. I'd mention the commit that introduced the current logic and link to the thread that you started about this. > From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Mon, 12 Feb 2024 19:04:48 -0500 > Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type > to BlockNumber > > BitmapAdjustPrefetchIterator() only used the blockno member of the > passed in TBMIterateResult to ensure that the prefetch iterator and > regular iterator stay in sync. Pass it the BlockNumber only. This will > allow us to move away from using the TBMIterateResult outside of table > AM specific code. Hm - I'm not convinced this is a good direction - doesn't that arguably *increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap heap scans in a TAM without blocks, but still.
> From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Mon, 12 Feb 2024 20:15:05 -0500 > Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into > table AM > > This resolves the long-standing FIXME in BitmapHeapNext() which said that > the optimization to skip fetching blocks of the underlying table when > none of the column data was needed should be pushed into the table AM > specific code. Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old enough to drink, at least in some countries. :) > The table AM agnostic functions for prefetching still need to know if > skipping fetching is permitted for this scan. However, this dependency > will be removed when that prefetching code is removed in favor of the > upcoming streaming read API. > --- > src/backend/access/heap/heapam.c | 10 +++ > src/backend/access/heap/heapam_handler.c | 29 +++ > src/backend/executor/nodeBitmapHeapscan.c | 100 ++ > src/include/access/heapam.h | 2 + > src/include/access/tableam.h | 17 ++-- > src/include/nodes/execnodes.h | 6 -- > 6 files changed, 74 insertions(+), 90 deletions(-) > > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/
Re: postgres_fdw fails to see that array type belongs to extension
I wrote: > There's one big remaining to-do item, I think: experimentation with > pg_upgrade proves that a binary upgrade fails to fix the extension > membership of arrays/rowtypes. That's because pg_dump needs to > manually reconstruct the membership dependencies, and it thinks that > it doesn't need to do anything for implicit arrays. Normally the > point of that is that we want to exactly reconstruct the extension's > contents, but I think in this case we'd really like to add the > additional pg_depend entries even if they weren't there before. > Otherwise people wouldn't get the new behavior until they do a > full dump/reload. > I can see two ways we could do that: > * add logic to pg_dump > * modify ALTER EXTENSION ADD TYPE so that it automatically recurses > from a base type to its array type (and I guess we'd need to add > something for relation rowtypes and multiranges, too). > I think I like the latter approach because it's like how we > handle ownership: pg_dump doesn't emit any commands to explicitly > change the ownership of dependent types, either. (But see [1].) > We could presumably steal some logic from ALTER TYPE OWNER. > I've not tried to code that here, though. Now that the multirange issue is fixed (3e8235ba4), here's a new version that includes the needed recursion in ALTER EXTENSION. I spent some more effort on a proper test case, too. regards, tom lane diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c index fe47be38d0..1a0460b491 100644 --- a/src/backend/catalog/pg_type.c +++ b/src/backend/catalog/pg_type.c @@ -540,7 +540,7 @@ TypeCreate(Oid newTypeOid, * etc. * * We make an extension-membership dependency if we're in an extension - * script and makeExtensionDep is true (and isDependentType isn't true). + * script and makeExtensionDep is true. * makeExtensionDep should be true when creating a new type or replacing a * shell type, but not for ALTER TYPE on an existing type. 
Passing false * causes the type's extension membership to be left alone. @@ -600,7 +600,7 @@ GenerateTypeDependencies(HeapTuple typeTuple, ObjectAddressSet(myself, TypeRelationId, typeObjectId); /* - * Make dependencies on namespace, owner, ACL, extension. + * Make dependencies on namespace, owner, ACL. * * Skip these for a dependent type, since it will have such dependencies * indirectly through its depended-on type or relation. An exception is @@ -625,11 +625,18 @@ GenerateTypeDependencies(HeapTuple typeTuple, recordDependencyOnNewAcl(TypeRelationId, typeObjectId, 0, typeForm->typowner, typacl); - - if (makeExtensionDep) - recordDependencyOnCurrentExtension(&myself, rebuild); } + /* + * Make extension dependency if requested. + * + * We used to skip this for dependent types, but it seems better to record + * their extension membership explicitly; otherwise code such as + * postgres_fdw's shippability test will be fooled. + */ + if (makeExtensionDep) + recordDependencyOnCurrentExtension(&myself, rebuild); + /* Normal dependencies on the I/O and support functions */ if (OidIsValid(typeForm->typinput)) { diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 226f85d0e3..6772d6a8d2 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -131,6 +131,9 @@ static void ApplyExtensionUpdates(Oid extensionOid, char *origSchemaName, bool cascade, bool is_create); +static void ExecAlterExtensionContentsRecurse(AlterExtensionContentsStmt *stmt, + ObjectAddress extension, + ObjectAddress object); static char *read_whole_file(const char *filename, int *length); @@ -3294,7 +3297,6 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt, ObjectAddress extension; ObjectAddress object; Relation relation; - Oid oldExtension; switch (stmt->objtype) { @@ -3349,6 +3351,38 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt, check_object_ownership(GetUserId(), stmt->objtype, object, stmt->object, 
relation); + /* Do the update, recursing to any dependent objects */ + ExecAlterExtensionContentsRecurse(stmt, extension, object); + + /* Finish up */ + InvokeObjectPostAlterHook(ExtensionRelationId, extension.objectId, 0); + + /* + * If get_object_address() opened the relation for us, we close it to keep + * the reference count correct - but we retain any locks acquired by + * get_object_address() until commit time, to guard against concurrent + * activity. + */ + if (relation != NULL) + relation_close(relation, NoLock); + + return extension; +} + +/* + * ExecAlterExtensionContentsRecurse + * Subroutine for ExecAlterExtensionContentsStmt + * + * Do the bare alteration of object's membership in extension, + * without permission checks. Recurse to dependent objects, if any. + */ +static void +ExecAlterExtensionContentsRecurse(AlterExtensio
Re: index prefetching
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman wrote: > I wasn't quite sure how we could use > index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's > suggestion. But, I'd like to understand. The point I was trying to make with that example was: a highly generic mechanism can sometimes work across disparate index AMs (that all at least support plain index scans) when it just so happens that these AMs don't actually differ in a way that could possibly matter to that mechanism. While it's true that (say) nbtree and hash are very different at a high level, it's nevertheless also true that the way things work at the level of individual index pages is much more similar than different. With index deletion, we know that the differences between each supported index AM either don't matter at all (which is what obviates the need for index_compute_xid_horizon_for_tuples() to be directly aware of which index AM the page it is passed comes from), or matter only in small, incidental ways (e.g., nbtree stores posting lists in its tuples, despite using IndexTuple structs). With prefetching, it seems reasonable to suppose that an index-AM specific approach would end up needing very little truly custom code. This is pretty strongly suggested by the fact that the rules around buffer pins (as an interlock against concurrent TID recycling by VACUUM) are standardized by the index AM API itself. Those rules might be slightly more natural with nbtree, but that's kinda beside the point. While the basic organizing principle for where each index tuple goes can vary enormously, it doesn't necessarily matter at all -- in the end, you're really just reading each index page (that has TIDs to read) exactly once per scan, in some fixed order, with interlaced inline heap accesses (that go fetch heap tuples for each individual TID read from each index page).
In general I don't accept that we need to do things outside the index AM, because software architecture encapsulation something something. I suspect that we'll need to share some limited information across different layers of abstraction, because that's just fundamentally what's required by the constraints we're operating under. Can't really prove it, though. -- Peter Geoghegan
Re: MAINTAIN privilege -- what do we need to un-revert it?
On Wed, Feb 14, 2024 at 10:20:28AM -0800, Jeff Davis wrote: > If those arguments are still unconvincing, then the next idea is to fix > the search_path for all maintenance commands[3]. I tried this during > the 16 cycle, but due to timing issues it was also reverted. I can > proceed with this approach again, but I'd like a clear endorsement, in > case there were other reasons to doubt the approach. This seemed like the approach folks were most in favor of at the developer meeting a couple weeks ago [0]. At least, that was my interpretation of the discussion. BTW I have been testing reverting commit 151c22d (i.e., un-reverting MAINTAIN) every month or two, and last I checked, it still applies pretty cleanly. The only changes I've needed to make are to the catversion and to a hard-coded version in a test (16 -> 17). [0] https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2024_Developer_Meeting#The_Path_to_un-reverting_the_MAINTAIN_privilege -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Fix a typo in pg_rotate_logfile
On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote: > Daniel Gustafsson writes: >> On 14 Feb 2024, at 11:35, Dave Page wrote: >>> That said, pgAdmin III has been out of support for many years, and as >>> far as I know, it (and similarly old versions of EDB's PEM which was >>> based on it) were the only consumers of adminpack. I would not be sad >>> to see it removed entirely > >> Searching on Github and Debian Codesearch I cannot find any reference to >> anyone >> using any function from adminpack. With pgAdminIII being EOL it might be time to >> remove it now rather than be on the hook to maintain it for another 5 years >> until v17 goes EOL. It'll still be around for years in V16->. > > Works for me. > >> Attached is a diff to show what it would look like to remove adminpack >> (catalog >> version bump omitted on purpose to avoid conflicts until commit). > > I don't see any references you missed, so +1. Seems reasonable to me, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role
On Wed, Feb 14, 2024 at 01:45:31PM -0500, Tom Lane wrote: > "Pavlo Golub" writes: >> Oh, thanks! I forgot, indeed, to update docs and catalog version! My >> bad! > > Docs, yes, but don't include catversion bumps in submitted patches. > They'll just lead to merge problems when somebody else changes the > current catversion. We rely on the committer to remember to do this > (which is an imperfect system, but...) +1, I only included it in the patch I posted so that I didn't forget it... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role
On Wed, Feb 14, 2024, 19:45 Tom Lane wrote: > "Pavlo Golub" writes: > > Oh, thanks! I forgot, indeed, to update docs and catalog version! My > > bad! > > Docs, yes, but don't include catversion bumps in submitted patches. > They'll just lead to merge problems when somebody else changes the > current catversion. We rely on the committer to remember to do this > (which is an imperfect system, but...) > Thanks for the clarification. > regards, tom lane >
Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role
"Pavlo Golub" writes: > Oh, thanks! I forgot, indeed, to update docs and catalog version! My > bad! Docs, yes, but don't include catversion bumps in submitted patches. They'll just lead to merge problems when somebody else changes the current catversion. We rely on the committer to remember to do this (which is an imperfect system, but...) regards, tom lane
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Wed, 14 Feb 2024 at 18:41, Alvaro Herrera wrote: > Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004. Yeah, you're correct. Fixed that now. v32-0005-Start-using-new-libpq-cancel-APIs.patch Description: Binary data v32-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch Description: Binary data
Re: index prefetching
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra wrote: > > Another thing that argues against doing this is that we might not need > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > We could end up scanning a whole extra leaf page (including all of its > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > (that's not what happens right now, but it isn't really needed at all > > right now). > > > > I'm not quite sure I understand what is "this" that you argue against. > Are you saying we should not separate the two scans? If yes, is there a > better way to do this? What I'm concerned about is the difficulty and complexity of any design that requires revising "63.4. Index Locking Considerations", since that's pretty subtle stuff. In particular, if prefetching "de-synchronizes" (to use your term) the index leaf page level scan and the heap page scan, then we'll probably have to totally revise the basic API. Maybe that'll actually turn out to be the right thing to do -- it could just be the only thing that can unleash the full potential of prefetching. But I'm not aware of any evidence that points in that direction. Are you? (I might have just missed it.) > The LIMIT problem is not very clear to me either. Yes, if we get close > to the end of the leaf page, we may need to visit the next leaf page. > But that's kinda the whole point of prefetching - reading stuff ahead, > and reading too far ahead is an inherent risk. Isn't that a problem we > have even without LIMIT? The prefetch distance ramp up is meant to limit > the impact. Right now, the index AM doesn't know anything about LIMIT at all. That doesn't matter, since the index AM can only read/scan one full leaf page before returning control back to the executor proper. The executor proper can just shut down the whole index scan upon finding that we've already returned N tuples for a LIMIT N. 
We don't do prefetching right now, but we also don't risk reading a leaf page that'll just never be needed. Those two things are in tension, but I don't think that that's quite the same thing as the usual standard prefetching tension/problem. Here there is uncertainty about whether what we're prefetching will *ever* be required -- not uncertainty about when exactly it'll be required. (Perhaps this distinction doesn't mean much to you. I'm just telling you how I think about it, in case it helps move the discussion forward.) > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrent TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > > > Good point. The main reason why the index AM docs require this interlock is because we need such an interlock to make non-MVCC snapshot scans safe. If you remove the interlock (the buffer pin interlock that protects against TID recycling by VACUUM), you can still avoid the same race condition by using an MVCC snapshot. This is why using an MVCC snapshot is a requirement for bitmap index scans. I believe that it's also a requirement for index-only scans, but the index AM docs don't spell that out. Another factor that complicates things here is mark/restore processing. The design for that has the idea of processing one page at a time baked-in. Kinda like with the kill_prior_tuple issue. It's certainly possible that you could figure out various workarounds for each of these issues (plus the kill_prior_tuple issue) with a prefetching design that "de-synchronizes" the index access and the heap access.
But it might well be better to extend the existing design in a way that just avoids all these problems in the first place. Maybe "de-synchronization" really can pay for itself (because the benefits will outweigh these costs), but if you go that way then I'd really prefer it that way. > > I think that it makes sense to put the index AM in control here -- > > that almost follows from what I said about the index AM API. The index > > AM already needs to be in control, in about the same way, to deal with > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > In control how? What would be the control flow - what part would be > managed by the index AM? ISTM that prefetching for an index scan is about the index scan itself, first and foremost. The heap accesses are usually the dominant cost, of course, but sometimes the index leaf page accesses really do make up a significant fraction of the overall cost of the index scan. Especially with an expensive index qual. So if you just assume that the TIDs returned by the index scan are the only thing that matters, you might have a model that's basicall
MAINTAIN privilege -- what do we need to un-revert it?
The MAINTAIN privilege was reverted during the 16 cycle because of the potential for someone to play tricks with search_path. For instance, if user foo does:

    CREATE FUNCTION mod7(INT) RETURNS INT IMMUTABLE
      LANGUAGE plpgsql
      AS $$ BEGIN RETURN mod($1, 7); END; $$;
    CREATE TABLE x(i INT);
    CREATE UNIQUE INDEX x_mod7_idx ON x (mod7(i));
    GRANT MAINTAIN ON x TO bar;

Then user bar can create their own function named "bar.mod(int, int)", and "SET search_path = bar, pg_catalog", and then issue a "REINDEX x" and cause problems. There are several factors required for that to be a problem:

1. foo hasn't used a "SET search_path" clause on their function
2. bar must have the privileges to create a function somewhere
3. bar must have privileges on table x

There's an argument that we should blame factor #1. Robert stated[1] that users should use SET search_path clauses on their functions, even SECURITY INVOKER functions. And I've added a search_path cache which improves the performance enough to make that more reasonable to do generally. There's also an argument that #2 is to blame. Given the realities of our system, best practice is that users shouldn't have the privileges to create objects, even in their own schema, unless required. (Joe made this suggestion in an offline discussion.) There's also an argument that #3 is not specific to the MAINTAIN privilege. Clearly similar risks exist for other privileges, like TRIGGER. And even the INSERT privilege, in the above example, would allow bar to violate the unique constraint and corrupt the index[2]. If those arguments are still unconvincing, then the next idea is to fix the search_path for all maintenance commands[3]. I tried this during the 16 cycle, but due to timing issues it was also reverted. I can proceed with this approach again, but I'd like a clear endorsement, in case there were other reasons to doubt the approach.
Regards, Jeff Davis [1] https://www.postgresql.org/message-id/ca+tgmoyep40ibw-a9npfdp8ahgoekpp3apdfztgburqmfcw...@mail.gmail.com [2] https://www.postgresql.org/message-id/fff566293c9165c69bb4c555da1ac02c63660664.ca...@j-davis.com [3] https://www.postgresql.org/message-id/e44327179e5c9015c8dda67351c04da552066017.ca...@j-davis.com
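For concreteness, here is what the factor-#1 mitigation looks like when applied to the example function above; the only change from the original definition is the SET clause, which pins the function's search_path so that mod() always resolves to pg_catalog.mod() regardless of the caller's setting:

```sql
CREATE FUNCTION mod7(INT) RETURNS INT IMMUTABLE
  LANGUAGE plpgsql
  SET search_path = pg_catalog
  AS $$ BEGIN RETURN mod($1, 7); END; $$;
```

With that clause in place, bar's "SET search_path = bar, pg_catalog" no longer changes what a REINDEX issued by bar evaluates.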
Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role
On Mon, Feb 12, 2024 at 09:49:45AM -0600, Nathan Bossart wrote: > Okay. I'll plan on committing this in the next few days. Here is what I have > staged for commit. Oh, thanks! I forgot, indeed, to update docs and catalog version! My bad! In my defense, I was trying to find tests but I missed regress/sql/misc_functions.sql somehow. Now I will know. Thanks again! Best regards, Pavlo Golub
Re: [PATCH] allow pg_current_logfile() execution under pg_monitor role
On Wed, Feb 14, 2024 at 08:59:06AM +0100, Daniel Gustafsson wrote: > LGTM. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On 2024-Feb-14, Jelte Fennema-Nio wrote: > Attached is a new version of the final patches, with much improved > docs (imho) and the new function names: PQcancelStart and > PQcancelBlocking. Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Sun, 4 Feb 2024 at 16:39, Alvaro Herrera wrote: > Maybe this is okay? I'll have a look at the whole final situation more > carefully later; or if somebody else wants to share an opinion, please > do so. Attached is a new version of the final patches, with much improved docs (imho) and the new function names: PQcancelStart and PQcancelBlocking. v31-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch Description: Binary data v31-0005-Start-using-new-libpq-cancel-APIs.patch Description: Binary data
Re: Document efficient self-joins / UPDATE LIMIT techniques.
On Tue, Feb 13, 2024, at 23:56, Corey Huinker wrote: > This patch came out of a discussion at the last PgCon with the person > who made the "fringe feature" quote, who seemed quite supportive of > documenting the technique. The comment may have been in regards to > actually implementing a LIMIT clause on UPDATE and DELETE, which isn't > in the SQL standard and would be difficult to implement as the two > statements have no concept of ordering. Documenting the workaround > would alleviate some interest in implementing a nonstandard feature. Thanks for sharing the background story. > As for whether it's commonplace, when I was a consultant I had a number > of customers that I had who bemoaned how large updates caused big > replica lag, basically punishing access to records they did care about > in order to properly archive or backfill records they don't care about. > I used the technique a lot, putting the update/delete in a loop, and > often running multiple copies of the same script at times when I/O > contention was low, but if load levels rose it was trivial to just kill > a few of the scripts until things calmed down. I've also used the technique quite a lot, but only using the PK, didn't know about the ctid trick, so many thanks for documenting it. /Joel
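For readers who haven't seen it, the ctid technique under discussion looks roughly like the following (table name, predicate, and batch size are all illustrative); the statement is rerun in a loop until it reports zero affected rows:

```sql
DELETE FROM big_table
WHERE ctid IN (
    SELECT ctid FROM big_table
    WHERE archived_at < now() - interval '1 year'
    LIMIT 10000
);
```

Each iteration touches a bounded number of rows, so WAL is generated in small increments and replicas get a chance to catch up between batches; the same shape works for UPDATE.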
Re: What about Perl autodie?
On 08.02.24 16:53, Tom Lane wrote: Daniel Gustafsson writes: On 8 Feb 2024, at 08:01, Peter Eisentraut wrote: I suppose we could start using it for completely new scripts. +1, it would be nice to eventually be able to move to it everywhere so starting now with new scripts may make the eventual transition smoother. I'm still concerned about people carelessly using autodie-reliant code in places where they shouldn't. I offer two safer ways forward: 1. Wait till v16 is the oldest supported branch, and then migrate both HEAD and back branches to using autodie. 2. Don't wait, migrate them all now. This would mean requiring Perl 5.10.1 or later to run the TAP tests, even in back branches. I think #2 might not be all that radical. We have nothing older than 5.14.0 in the buildfarm, so we don't really have much grounds for claiming that 5.8.3 will work today. And 5.10.1 came out in 2009, so how likely is it that anyone cares anymore? A gentler way might be to start using some perlcritic policies like InputOutput::RequireCheckedOpen or the more general InputOutput::RequireCheckedSyscalls and add explicit error checking at the sites it points out. And then if we start using autodie in the future, any inappropriate backpatching of calls lacking error checks would be caught.
Re: logical decoding and replication of sequences, take 2
On 2/13/24 17:37, Robert Haas wrote: > On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra > wrote: >> Right, locks + apply in commit order gives us this guarantee (I can't >> think of a case where it wouldn't be the case). > > I couldn't find any cases of inadequate locking other than the one I > mentioned. > >> Doesn't the whole logical replication critically depend on the commit >> order? If you decide to arbitrarily reorder/delay the transactions, all >> kinds of really bad things can happen. That's a generic problem, it >> applies to all kinds of objects, not just sequences - a parallel apply >> would need to detect this sort of dependencies (e.g. INSERT + DELETE of >> the same key), and do something about it. > > Yes, but here I'm not just talking about the commit order. I'm talking > about the order of applying non-transactional operations relative to > commits. > > Consider: > > T1: CREATE SEQUENCE s; > T2: BEGIN; > T2: SELECT nextval('s'); > T3: SELECT nextval('s'); > T2: ALTER SEQUENCE s INCREMENT 2; > T2: SELECT nextval('s'); > T2: COMMIT; > It's not clear to me if you're talking about nextval() that happens to generate WAL, or nextval() covered by WAL generated by a previous call. I'm going to assume it's the former, i.e. nextval() that generated WAL describing the *next* sequence chunk, because without WAL there's nothing to apply and therefore no issue with T3 ordering. The way I think about non-transactional sequence changes is as if they were tiny transactions that happen "fully" (including commit) at the LSN where the change is logged. > The commit order is T1 < T3 < T2, but T3 makes no transactional > changes, so the commit order is really just T1 < T2. But it's > completely wrong to say that all we need to do is apply T1 before we > apply T2. The correct order of application is: > > 1. T1. > 2. T2's first nextval > 3. T3's nextval > 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and > the subsequent nextval) > Is that quite true?
If T3 generated WAL (for the nextval call), it will be applied at that particular LSN. AFAIK that guarantees it happens after the first T2 change (which is also non-transactional) and before the transactional T2 change (because that creates a new relfilenode). > In other words, the fact that some sequence changes are > non-transactional creates ordering hazards that don't exist if there > are no non-transactional changes. So in that way, sequences are > different from table modifications, where applying the transactions in > order of commit is all we need to do. Here we need to apply the > transactions in order of commit and also apply the non-transactional > changes at the right point in the sequence. Consider the following > alternative apply sequence: > > 1. T1. > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and > the subsequent nextval) > 3. T3's nextval > 4. T2's first nextval > > That's still in commit order. It's also wrong. > Yes, this would be wrong. Thankfully the apply is not allowed to reorder the changes like this, because that's not what "non-transactional" means in this context. It does not mean we can arbitrarily reorder the changes, it only means the changes are applied as if they were independent transactions (but in the same order as they were executed originally). Both with respect to the other non-transactional changes, and to "commits" of other stuff. (for serial apply, at least) > Imagine that you commit this patch and someone later wants to do > parallel logical apply. So every time they finish decoding a > transaction, they stick it in a queue to be applied by the next > available worker. But, non-transactional changes are very simple, so > we just directly apply those in the main process. Well, kaboom! But > now this can happen with the above example. > > 1. Decode T1. Add to queue for apply. > 2. Before the (idle) apply worker has a chance to pull T1 out of the > queue, decode the first nextval and try to apply it. 
> > Oops. We're trying to apply a modification to a sequence that hasn't > been created yet. I'm not saying that this kind of hypothetical is a > reason not to commit the patch. But it seems like we're not on the > same page about what the ordering requirements are here. I'm just > making the argument that those non-transactional operations actually > act like mini-transactions. They need to happen at the right time > relative to the real transactions. A non-transactional operation needs > to be applied after any transactions that commit before it is logged, > and before any transactions that commit after it's logged. > How is this issue specific to sequences? AFAIK this is a general problem with transactions that depend on each other. Consider for example this: T1: INSERT INTO t (id) VALUES (1); T2: DELETE FROM t WHERE id = 1; If you parallelize this in a naive way, maybe T2 gets applied before T1. In which case the DELETE won't find the row yet. There's d
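The rule both sides of this exchange seem to converge on — non-transactional sequence changes act as mini-transactions applied at their own LSN, while transactional changes are buffered until their commit record — can be sketched as a toy replay loop. The record layout and LSN values are invented; this only models serial apply of Robert's T1/T2/T3 example:

```python
def replay(wal):
    """Apply a decoded WAL stream in LSN order: non-transactional changes
    apply immediately; transactional ones wait for their commit record."""
    applied, pending = [], {}
    for lsn, tx, kind, change in wal:
        if kind == "nontx":
            applied.append(change)               # mini-transaction at this LSN
        elif kind == "change":
            pending.setdefault(tx, []).append(change)
        elif kind == "commit":
            applied.extend(pending.pop(tx, []))  # commit order preserved
    return applied

wal = [
    (1, "T1", "change", "CREATE SEQUENCE s"),
    (2, "T1", "commit", None),
    (3, "T2", "nontx",  "T2 nextval (logs next chunk)"),
    (4, "T3", "nontx",  "T3 nextval (logs next chunk)"),
    (5, "T2", "change", "ALTER SEQUENCE s INCREMENT 2"),
    (6, "T2", "change", "T2 nextval after ALTER (transactional)"),
    (7, "T3", "commit", None),   # T3 made no transactional changes
    (8, "T2", "commit", None),
]
for step in replay(wal):
    print(step)
```

Run against this stream, the replay yields exactly the "correct order of application" listed in the discussion: T1, T2's first nextval, T3's nextval, then T2's transactional changes.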
Re: Constant Splitting/Refactoring
This should target PG 18, but that status is not available in the CF app yet, so just making a note here.
Re: BitmapHeapScan streaming read user and prelim refactoring
> On Feb 14, 2024, at 6:47 AM, Melanie Plageman > wrote: > > Just curious, did your table AM implement > table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(), > and, if so, did you use the TBMIterateResult? Since it is not used in > BitmapHeapNext() in my version, table AMs would have to change how > they use TBMIterateResults anyway. But I assume they could add it to a > table AM specific scan descriptor if they want access to a > TBMIterateResult of their own making in both > table_scan_bitmap_next_block() and next_tuple()? My table AM does implement those two functions and does use the TBMIterateResult *tbmres argument, yes. I would deal with the issue in very much the same way that your patches modify heapam. I don't really have any additional comments about that. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: index prefetching
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra wrote: > > On 2/7/24 22:48, Melanie Plageman wrote: > > ... Issues > > --- > > - kill prior tuple > > > > This optimization doesn't work with index prefetching with the current > > design. Kill prior tuple relies on alternating between fetching a > > single index tuple and visiting the heap. After visiting the heap we > > can potentially kill the immediately preceding index tuple. Once we > > fetch multiple index tuples, enqueue their TIDs, and later visit the > > heap, the next index page we visit may not contain all of the index > > tuples deemed killable by our visit to the heap. > > > > I admit I haven't thought about kill_prior_tuple until you pointed it out. > Yeah, prefetching separates (de-synchronizes) the two scans (index and > heap) in a way that prevents this optimization. Or at least makes it > much more complex :-( > > > In our case, we could try and fix this by prefetching only heap blocks > > referred to by index tuples on the same index page. Or we could try > > and keep a pool of index pages pinned and go back and kill index > > tuples on those pages. > > > > I think restricting the prefetching to a single index page would not be > a huge issue performance-wise - that's what the initial patch version > (implemented at the index AM level) did, pretty much. The prefetch queue > would get drained as we approach the end of the index page, but luckily > index pages tend to have a lot of entries. But it'd put an upper bound > on the prefetch distance (much lower than the e_i_c maximum 1000, but > I'd say common values are 10-100 anyway). > > But how would we know we're on the same index page? That knowledge is > not available outside the index AM - the executor or indexam.c does not > know this, right? Presumably we could expose this, somehow, but it seems > like a violation of the abstraction ...
The easiest way to do this would be to have the index AM amgettuple() functions set a new member in the IndexScanDescData which is either the index page identifier or a boolean that indicates we have moved on to the next page. Then, when filling the queue, we would stop doing so when the page switches. Now, this wouldn't really work for the first index tuple on each new page, so, perhaps we would need the index AMs to implement some kind of "peek" functionality. Or, we could provide the index AM with a max queue size and allow it to fill up the queue with the TIDs it wants (which it could keep to the same index page). And, for the index-only scan case, we could have some kind of flag which indicates if the caller is putting TIDs+HeapTuples or TIDs+IndexTuples on the queue, which might reduce the amount of space we need. I'm not sure who manages the memory here. I wasn't quite sure how we could use index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's suggestion. But, I'd like to understand. > > - switching scan directions > > > > If the index scan switches directions on a given invocation of > > IndexNext(), heap blocks may have already been prefetched and read for > > blocks containing tuples beyond the point at which we want to switch > > directions. > > > > We could fix this by having some kind of streaming read "reset" > > callback to drop all of the buffers which have been prefetched which > > are now no longer needed. We'd have to go backwards from the last TID > > which was yielded to the caller and figure out which buffers in the > > pgsr buffer ranges are associated with all of the TIDs which were > > prefetched after that TID. The TIDs are in the per_buffer_data > > associated with each buffer in pgsr. The issue would be searching > > through those efficiently. > > > > Yeah, that's roughly what I envisioned in one of my previous messages > about this issue - walking back the TIDs read from the index and added > to the prefetch queue.
> > > The other issue is that the streaming read API does not currently > > support backwards scans. So, if we switch to a backwards scan from a > > forwards scan, we would need to fallback to the non streaming read > > method. We could do this by just setting the TID queue size to 1 > > (which is what I have currently implemented). Or we could add > > backwards scan support to the streaming read API. > > > > What do you mean by "support for backwards scans" in the streaming read > API? I imagined it naively as > > 1) drop all requests in the streaming read API queue > > 2) walk back all "future" requests in the TID queue > > 3) start prefetching as if from scratch > > Maybe there's a way to optimize this and reuse some of the work more > efficiently, but my assumption is that the scan direction does not > change very often, and that we process many items in between. Yes, the steps you mention for resetting the queues make sense. What I meant by "backwards scan is not supported by the streaming read API" is that Thomas/Andres had mentioned that the streaming read API does not support backwards scans ri
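The "restrict prefetching to the current index page" idea discussed above can be sketched as a queue-filling routine that stops at a page boundary. The tuple layout and names are invented; a real implementation would live behind the index AM boundary, which is exactly the layering question raised here:

```python
def fill_prefetch_queue(index_tuples, start, max_queue):
    """Enqueue heap TIDs for prefetching, stopping at an index-page
    boundary so every queued TID comes from one (still pinned) index page.

    index_tuples: list of (index_page, heap_tid) in scan order.
    Returns (queue, next_position)."""
    queue, i = [], start
    page = index_tuples[start][0] if start < len(index_tuples) else None
    while i < len(index_tuples) and len(queue) < max_queue:
        pg, tid = index_tuples[i]
        if pg != page:
            break                # next index page: drain the queue first
        queue.append(tid)
        i += 1
    return queue, i

tuples = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e")]
print(fill_prefetch_queue(tuples, 0, 10))    # → (['a', 'b', 'c'], 3)
```

Capping the queue at the page boundary is what bounds the prefetch distance, as noted above; that is acceptable when index pages hold many entries. A direction switch would simply discard the queue and refill from the new position.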
Add trim_trailing_whitespace to editorconfig file
This brings our .gitattributes and .editorconfig files more in line. I had the problem that "git add" would often complain about trailing whitespace when I was changing sgml files specifically. v1-0001-Configure-trailing-whitespace-trimming-in-editorc.patch Description: Binary data
Re: 035_standby_logical_decoding unbounded hang
Hi, On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > Coincidentally, one of my buildfarm animals hanged several weeks in a > different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS > reduction was part of making it reproducible: > > On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > > If you look at the buildfarm's failures page and filter down to > > > > just subscriptionCheck failures, what you find is that all of the > > > > last 6 such failures are in 031_column_list.pl: > > > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > > reported a replacement BgWriterDelay value reproducing it. > > > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > > BgWriterDelay. > > I'm reusing this thread just in case there's overlap with the > 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is > a race condition arising from an event sequence like this: > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > - Test script calls pg_log_standby_snapshot() on primary. Emits > XLOG_RUNNING_XACTS. > - checkpoint_timeout makes a primary checkpoint finish. Emits > XLOG_RUNNING_XACTS. > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > Other test code already has a solution for this, so the attached patches add a > timeout and copy the existing solution. I'm also attaching the hack that > makes it 100% reproducible. Thanks! I did a few tests and confirm that the proposed solution fixes the corner case. standby-slot-test-1-timeout-v1.patch LGTM. Regarding standby-slot-test-2-race-v1.patch: > +# See corresponding create_logical_slot_on_standby() code. 
> +$node_standby->poll_query_until( > + 'postgres', qq[ > + SELECT restart_lsn IS NOT NULL > + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' > + ]) or die "timed out waiting for logical slot to calculate its > restart_lsn"; > + What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm and then make use of it in create_logical_slot_on_standby() and above? (something like wait_for_restart_lsn_calculation-v1.patch attached). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com commit 14d988c748f3e500c44d65e073c276e6d8af6156 Author: Bertrand Drouvot Date: Wed Feb 14 15:21:35 2024 + wait_for_restart_lsn_calculation diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index e2e70d0dbf..21cf179db1 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub { =pod +=item $node->wait_for_restart_lsn_calculation(self, slot_name) + +Wait until the restart_lsn of the given replication slot has been determined + +=cut + +sub wait_for_restart_lsn_calculation +{ + my ($self, $slot_name) = @_; + my ($stdout, $stderr); + + $self->poll_query_until( + 'postgres', qq[ + SELECT restart_lsn IS NOT NULL + FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' + ]) + or die + "timed out waiting for logical slot to calculate its restart_lsn"; +} + +=pod + =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname) Create logical replication slot on given standby @@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby # xl_running_xacts WAL record from the restart_lsn onwards. First wait # until the slot restart_lsn is determined.
- $self->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' - ]) - or die - "timed out waiting for logical slot to calculate its restart_lsn"; + $self->wait_for_restart_lsn_calculation($slot_name); # Then arrange for the xl_running_xacts record for which pg_recvlogical is # waiting. diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index f7b7fc7f9e..85330720c5 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n"; $psql_subscriber{run}->pump_nb(); -# See corresponding create_logical_slot_on_standby() code. -$node_standby->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' - ]) or die "timed out waiting for logical slot to calculate its restart_lsn"; +# Wait for restart_lsn calculation +$node_standby->wait_for_restart_lsn_calculation('tap_sub');
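The poll_query_until() pattern that both the fix and the proposed helper rely on is simply "re-check a condition until it holds or a timeout expires". A minimal sketch of that shape (names and timings invented, no database involved):

```python
import time

def poll_until(condition, timeout=5.0, interval=0.01):
    """Call condition() repeatedly until it returns truthy or the timeout
    expires; returns True on success, False on timeout. This mirrors the
    structure of poll_query_until() in the TAP test framework."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

calls = {"n": 0}
def restart_lsn_known():
    calls["n"] += 1
    return calls["n"] >= 3       # pretend the LSN appears on the third poll

print(poll_until(restart_lsn_known))   # → True
```

Factoring the wait into one place, as the proposed wait_for_restart_lsn_calculation() does, keeps the timeout behavior and error message consistent across callers.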
Re: Fix a typo in pg_rotate_logfile
Daniel Gustafsson writes: > On 14 Feb 2024, at 11:35, Dave Page wrote: >> That said, pgAdmin III has been out of support for many years, and as far as >> I know, it (and similarly old versions of EDB's PEM which was based on it) >> were the only consumers of adminpack. I would not be sad to see it removed >> entirely > Searching on Github and Debian Codesearch I cannot find any reference to > anyone > using any function from adminpack. With pgAdminIII being EOL it might be time to > remove it now rather than be on the hook to maintain it for another 5 years > until v17 goes EOL. It'll still be around for years in V16->. Works for me. > Attached is a diff to show what it would look like to remove adminpack > (catalog > version bump omitted on purpose to avoid conflicts until commit). I don't see any references you missed, so +1. regards, tom lane
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, Feb 13, 2024 at 11:34 PM Mark Dilger wrote: > > > On Feb 13, 2024, at 3:11 PM, Melanie Plageman > > wrote: > > Thanks for the patch... > > > Attached is a patch set which refactors BitmapHeapScan such that it > > can use the streaming read API [1]. It also resolves the long-standing > > FIXME in the BitmapHeapScan code suggesting that the skip fetch > > optimization should be pushed into the table AMs. Additionally, it > > moves table scan initialization to after the index scan and bitmap > > initialization. > > > > patches 0001-0002 are assorted cleanup needed later in the set. > > patch 0003 moves the table scan initialization to after bitmap creation > > patch 0004 is, I think, a bug fix. see [2]. > > patches 0005-0006 push the skip fetch optimization into the table AMs > > patches 0007-0009 change the control flow of BitmapHeapNext() to match > > that required by the streaming read API > > patch 0010 is the streaming read code not yet in master > > patch 0011 is the actual bitmapheapscan streaming read user. > > > > patches 0001-0009 apply on top of master but 0010 and 0011 must be > > applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased > > version of the streaming read API is on the mailing list). > > I followed your lead and applied them to > 6a8ffe812d194ba6f4f26791b6388a4837d17d6c. `make check` worked fine, though I > expect you know that already. Thanks for taking a look! > > The caveat is that these patches introduce breaking changes to two > table AM functions for bitmapheapscan: table_scan_bitmap_next_block() > > and table_scan_bitmap_next_tuple(). > > You might want an independent perspective on how much of a hassle those > breaking changes are, so I took a stab at that. Having written a custom > proprietary TAM for postgresql 15 here at EDB, and having ported it and > released it for postgresql 16, I thought I'd try porting it to the above > commit with your patches.
Even without your patches, I already see breaking > changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which > creates a similar amount of breakage for me as do your patches. Dealing > with the combined breakage might amount to a day of work, including testing, > half of which I think I've already finished. In other words, it doesn't seem > like a big deal. > > Were postgresql 17 shaping up to be compatible with TAMs written for 16, your > patch would change that qualitatively, but since things are already > incompatible, I think you're in the clear. Oh, good to know! I'm very happy to have the perspective of a table AM author. Just curious, did your table AM implement table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(), and, if so, did you use the TBMIterateResult? Since it is not used in BitmapHeapNext() in my version, table AMs would have to change how they use TBMIterateResults anyway. But I assume they could add it to a table AM specific scan descriptor if they want access to a TBMIterateResult of their own making in both table_scan_bitmap_next_block() and next_tuple()? - Melanie
Re: About a recently-added message
On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote: > Now, I am less clear about whether to quote "logical" or not in the > above message. Do you have any suggestions? The possible confusion comes from the fact that the sentence contains "must be" in the middle of a comparison expression. For pg_createsubscriber, we are using "publisher requires wal_level >= logical". I suggest using something like "slot synchronization requires wal_level >= logical". -- Euler Taveira EDB https://www.enterprisedb.com/
Re: index prefetching
On 2/14/24 08:10, Robert Haas wrote: > On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman > wrote: >> - kill prior tuple >> >> This optimization doesn't work with index prefetching with the current >> design. Kill prior tuple relies on alternating between fetching a >> single index tuple and visiting the heap. After visiting the heap we >> can potentially kill the immediately preceding index tuple. Once we >> fetch multiple index tuples, enqueue their TIDs, and later visit the >> heap, the next index page we visit may not contain all of the index >> tuples deemed killable by our visit to the heap. > > Is this maybe just a bookkeeping problem? A Boolean that says "you can > kill the prior tuple" is well-suited if and only if the prior tuple is > well-defined. But perhaps it could be replaced with something more > sophisticated that tells you which tuples are eligible to be killed. > I don't think it's just a bookkeeping problem. In a way, nbtree already does keep an array of tuples to kill (see btgettuple), but it's always for the current index page. So it's not that we immediately go and kill the prior tuple - nbtree already stashes it in an array, and kills all those tuples when moving to the next index page. The way I understand the problem is that with prefetching we're bound to determine the kill_prior_tuple flag with a delay, in which case we might have already moved to the next index page ... So to make this work, we'd need to: 1) keep index pages pinned for all "in flight" TIDs (read from the index, not yet consumed by the index scan) 2) keep a separate array of "to be killed" index tuples for each page 3) have a more sophisticated way to decide when to kill tuples and unpin the index page (instead of just doing it when moving to the next index page) Maybe that's what you meant by "more sophisticated bookkeeping", ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
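Point 2 above, a separate "to be killed" array per index page that is flushed when the page is released, can be sketched as a small bookkeeping structure. This is only an illustration of the flush-on-page-change behavior; the names are invented and page pinning itself is not modeled:

```python
class KillBookkeeper:
    """Collect killable item offsets per index page, flushing the batch
    when the scan moves to another page; loosely modeled on nbtree's
    killedItems array in btgettuple()."""
    def __init__(self):
        self.page = None
        self.killed = []
        self.flushed = []        # (page, [offsets]) batches, for inspection

    def mark_killed(self, page, offset):
        if page != self.page:
            self.flush()         # leaving the old page: kill its items now
            self.page = page
        self.killed.append(offset)

    def flush(self):
        if self.page is not None and self.killed:
            self.flushed.append((self.page, self.killed))
        self.page, self.killed = None, []

kb = KillBookkeeper()
kb.mark_killed(1, 2)
kb.mark_killed(1, 5)
kb.mark_killed(2, 1)             # page change triggers a flush of page 1
kb.flush()
print(kb.flushed)                # → [(1, [2, 5]), (2, [1])]
```

With prefetching, the open question above is precisely when flush() may run: the page must stay pinned until every in-flight TID read from it has been checked against the heap.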
Re: Synchronizing slots from primary to standby
Hi, On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote: > On Wednesday, February 14, 2024 6:05 PM Amit Kapila > wrote: > > > > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila wrote: > > > > > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > Here is V87 patch that adds test for the suggested cases. > > > > > > > > > > I have pushed this patch and it leads to a BF failure: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&d > > > t=2024-02-14%2004%3A43%3A37 > > > > > > The test failures are: > > > # Failed test 'logical decoding is not allowed on synced slot' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 272. > > > # Failed test 'synced slot on standby cannot be altered' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 281. > > > # Failed test 'synced slot on standby cannot be dropped' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 287. > > > > > > The reason is that in LOGs, we see a different ERROR message than what > > > is expected: > > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > > > replication slot "lsub1_slot" is active for PID 1760871 > > > > > > Now, we see the slot still active because a test before these tests (# > > > Test that if the synchronized slot is invalidated while the remote > > > slot is still valid, ) is not able to successfully persist the > > > slot and the synced temporary slot remains active. 
> > > > > > The reason is clear by referring to below standby LOGS: > > > > > > LOG: connection authorized: user=bf database=postgres > > > application_name=040_standby_failover_slots_sync.pl > > > LOG: statement: SELECT pg_sync_replication_slots(); > > > LOG: dropped replication slot "lsub1_slot" of dbid 5 > > > STATEMENT: SELECT pg_sync_replication_slots(); ... > > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > > > WHERE slot_name = 'lsub1_slot'; > > > > > > In the above LOGs, we should ideally see: "newly created slot > > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > > > accomplish what it was supposed to. Ideally, the same test should have > > > failed but the pass criteria for the test failed to check whether the > > > slot is persisted or not. > > > > > > The probable reason for failure is that remote_slot's restart_lsn lags > > > behind the oldest WAL segment on standby. Now, in the test, we do > > > ensure that the publisher and subscriber are caught up by following > > > steps: > > > # Enable the subscription to let it catch up to the latest wal > > > position $subscriber1->safe_psql('postgres', > > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > > > > > $primary->wait_for_catchup('regress_mysub1'); > > > > > > However, this doesn't guarantee that restart_lsn is moved to a > > > position new enough that standby has a WAL corresponding to it. > > > > > > > To ensure that restart_lsn has been moved to a recent position, we need to > > log > > XLOG_RUNNING_XACTS and make sure the same is processed as well by > > walsender. The attached patch does the required change. > > > > Hou-San can reproduce this problem by adding additional checkpoints in the > > test and after applying the attached it fixes the problem. 
Now, this patch > > is > > mostly based on the theory we formed based on LOGs on BF and a reproducer > > by Hou-San, so still, there is some chance that this doesn't fix the BF > > failures in > > which case I'll again look into those. > > I have verified that the patch can fix the issue on my machine(after adding > few > more checkpoints before slot invalidation test.) I also added one more check > in > the test to confirm the synced slot is not temp slot. Here is the v2 patch. Thanks! +# To ensure that restart_lsn has moved to a recent WAL position, we need +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well +$primary->psql('postgres', "CHECKPOINT"); Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();" be enough? Not a big deal but maybe we could do the change while modifying 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync worker". Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix a typo in pg_rotate_logfile
> On 14 Feb 2024, at 11:35, Dave Page wrote: > That said, pgAdmin III has been out of support for many years, and as far as > I know, it (and similarly old versions of EDB's PEM which was based on it) > were the only consumers of adminpack. I would not be sad to see it removed > entirely Searching on Github and Debian Codesearch I cannot find any reference to anyone using any function from adminpack. With pgAdminIII being EOL it might be time to remove it now rather than be on the hook to maintain it for another 5 years until v17 goes EOL. It'll still be around for years in V16->. If anyone still uses pgAdminIII then I have a hard time believing they are diligently updating to the latest major version of postgres. Attached is a diff to show what it would look like to remove adminpack (catalog version bump omitted on purpose to avoid conflicts until commit). -- Daniel Gustafsson v1-0001-Remove-the-adminpack-extension.patch Description: Binary data
Re: index prefetching
On 2/13/24 20:54, Peter Geoghegan wrote: > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > wrote: >> On 2/7/24 22:48, Melanie Plageman wrote: >> I admit I haven't thought about kill_prior_tuple until you pointed out. >> Yeah, prefetching separates (de-synchronizes) the two scans (index and >> heap) in a way that prevents this optimization. Or at least makes it >> much more complex :-( > > Another thing that argues against doing this is that we might not need > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > We could end up scanning a whole extra leaf page (including all of its > tuples) for want of the ability to "push down" a LIMIT to the index AM > (that's not what happens right now, but it isn't really needed at all > right now). > I'm not quite sure I understand what is "this" that you argue against. Are you saying we should not separate the two scans? If yes, is there a better way to do this? The LIMIT problem is not very clear to me either. Yes, if we get close to the end of the leaf page, we may need to visit the next leaf page. But that's kinda the whole point of prefetching - reading stuff ahead, and reading too far ahead is an inherent risk. Isn't that a problem we have even without LIMIT? The prefetch distance ramp up is meant to limit the impact. > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). > Good point. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. 
The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). > In control how? What would be the control flow - what part would be managed by the index AM? I initially did the prefetching entirely in each index AM, but it was suggested doing this in the executor would be better. So I gradually moved it to the executor. But the idea to combine this with the streaming read API seems like a move from the executor back to the lower levels ... and now you're suggesting to make the index AM responsible for this again. I'm not saying any of those layering options is wrong, but it's not clear to me which is the right one. > There doesn't necessarily need to be much code duplication to make > that work. Offhand I suspect it would be kind of similar to how > deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets > by with generic logic implemented by > index_compute_xid_horizon_for_tuples -- that's all that we need to > determine a snapshotConflictHorizon value for recovery conflict > purposes. Note that index_compute_xid_horizon_for_tuples() reads > *index* pages, despite not being aware of the caller's index AM and > index tuple format. > > (The only reason why nbtree needs a custom solution is because it has > posting list tuples to worry about, unlike GiST and unlike Hash, which > consistently use unadorned generic IndexTuple structs with heap TID > represented in the standard/generic way only. While these concepts > probably all originated in nbtree, they're still not nbtree > implementation details.) > I haven't looked at the details, but I agree the LP_DEAD deletion seems like a sensible inspiration.
It's far more likely it just worked by luck. > > The test that did fail will have only revealed that the > kill_prior_tuple wasn't operating as expected -- which isn't the same > thing as giving wrong answers. > Possible. But AFAIK it did fail for Melanie, and I don't have a very good explanation for the difference in behavior. > Note that there are various ways that concurrent TID recycling might > prevent _bt_killitems() from setting LP_DEAD bits. It's totally > unsurprising that breaking kill_prior_tuple in some way could be > missed. Andres wrote the MVCC test in question precisely because > certain aspects of kill_prior_tuple were broken for months without > anybody noticing. > > [1] https://www.postgresql.org/docs/devel/index-locking.html Yeah. There's clearly plenty of space for subtle issues. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Stack overflow issue
Hi! On Fri, Jan 12, 2024 at 11:00 PM Robert Haas wrote: > On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > > Here's one goto-free attempt. It adds a local loop to where the > > recursion was, so that if you have a chain of subtransactions that need > > to be aborted in CommitTransactionCommand, they are aborted iteratively. > > The TBLOCK_SUBCOMMIT case already had such a loop. > > > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > > to explain why I changed some things. They are to be removed before > > committing. > > > > I'm not sure if this is better than a goto. In fact, even if we commit > > this, I think I'd still prefer to replace the remaining recursive calls > > with a goto. Recursion feels weird to me, when we're unwinding the > > states from the stack as we go. > > I'm not able to quickly verify whether this version is correct, but I > do think the code looks nicer this way. > > I understand that's a question of opinion rather than fact, though. I'd like to revive this thread. The attached 0001 patch represents my attempt to remove recursion in CommitTransactionCommand()/AbortCurrentTransaction() by adding a wrapper function. This method doesn't use goto, doesn't require many code changes and subjectively provides good readability. Regarding ShowTransactionStateRec() and the memory context functions, as I gather from this thread, they are called in states where aborting can lead to a panic. So, it's preferable to change them into loops too rather than just adding check_stack_depth(). The 0002 and 0003 patches by Heikki posted in [1] look good to me. Can we accept them? Also there are a number of recursive functions, which seem not to be used in critical states where aborting can lead to a panic. I've extracted them from [2] into an attached 0002 patch. I'd like to push it if there is no objection. Links. 1. https://www.postgresql.org/message-id/6b48c746-9704-46dc-b9be-01fe4137c824%40iki.fi 2. 
https://www.postgresql.org/message-id/4530546a-3216-eaa9-4c92-92d33290a211%40mail.ru -- Regards, Alexander Korotkov 0002-Add-missing-check_stack_depth-to-some-recursive-f-v3.patch Description: Binary data 0001-Turn-tail-recursion-into-iteration-in-CommitTrans-v3.patch Description: Binary data
Re: About a recently-added message
On Wed, Feb 14, 2024 at 12:57 PM Kyotaro Horiguchi wrote: > > A recent commit added the following message: > > > "wal_level" must be >= logical. > > The use of the term "logical" here is a bit confusing, as it's unclear > whether it's meant to be a natural language word or a token. (I > believe it to be a token.) > > On the contrary, we already have the following message: > > > wal_level must be set to "replica" or "logical" at server start. > > This has a conflicting policy about quotation of variable names and > enum values. > > I suggest making the quoting policy consistent. > As per docs [1] (In messages containing configuration variable names, do not include quotes when the names are visibly not natural English words, such as when they have underscores, are all-uppercase, or have mixed case. Otherwise, quotes must be added. Do include quotes in a message where an arbitrary variable name is to be expanded.), I think in this case the GUC's name shouldn't have been quoted. I think the patch did this because it was developed in parallel with a thread where we were discussing whether to quote GUC names or not [2]. I think it is better not to quote, as per the docs, till we get any further clarification. Now, I am less clear about whether to quote "logical" or not in the above message. Do you have any suggestions? [1] - https://www.postgresql.org/docs/devel/error-style-guide.html [2] - https://www.postgresql.org/message-id/cahut+psf3newxbsfky88qn1on1_dmd6343muwdmiim2ds9a...@mail.gmail.com -- With Regards, Amit Kapila.
Re: Psql meta-command conninfo+
On 12.02.24 15:16, Maiquel Grassi wrote: > > (v14) > > v14 applies cleanly and the SSL info is now shown as previously suggested. Here is a more comprehensive test:

$ /usr/local/postgres-dev/bin/psql -x "\
host=server.uni-muenster.de
hostaddr=172.19.42.1
user=jim
dbname=postgres
sslrootcert=server-certificates/server.crt
sslcert=jim-certificates/jim.crt
sslkey=jim-certificates/jim.key"
psql (17devel)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

postgres=# SET ROLE foo;
SET
postgres=> \conninfo+
Current Connection Information
-[ RECORD 1 ]--+---
Database           | postgres
Authenticated User | jim
System User        | cert:emailAddress=j...@uni-muenster.de,CN=jim,OU=WWU IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE
Current User       | foo
Session User       | jim
Backend PID        | 278294
Server Address     | 172.19.42.1
Server Port        | 5432
Client Address     | 192.168.178.27
Client Port        | 54948
Socket Directory   |
Host               | server.uni-muenster.de
Encryption         | SSL
Protocol           | TLSv1.3
Cipher             | TLS_AES_256_GCM_SHA384
Compression        | off

postgres=> \conninfo
You are connected to database "postgres" as user "jim" on host "server.uni-muenster.de" (address "127.0.0.1") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)

It is a little odd that "Server Address" of \conninfo+ differs from "address" of \conninfo... I think the documentation could add a little more info than just this: > When no + is specified, it simply prints a textual description of a few connection options. When + is given, more complete information is displayed as a table. Perhaps briefly mentioning the returned columns or simply listing them would be IMHO a nice addition. For some users the semantics of "Authenticated User", "System User", "Current User" and "Session User" can be a little confusing. And yes, I realize the current documentation of \conninfo is already a little vague :). Thanks for working on this -- Jim
Re: Add new error_action COPY ON_ERROR "log"
On 2024-02-12 15:46, Bharath Rupireddy wrote: Thanks for your comments. On Mon, Jan 29, 2024 at 8:41 AM torikoshia wrote: On 2024-01-27 00:43, David G. Johnston wrote: >> I'd like to have a new option "log", which skips soft errors and >> logs >> information that should have resulted in errors to PostgreSQL log. > > user-specified tables in the same database. Setting up temporary > tables or unlogged tables probably is going to be a more acceptable > methodology than trying to get to the log files. OTOH not everyone thinks saving table information is the best idea[2]. The added NOTICE message gives a summary of how many rows were skipped, but not the line numbers. It's hard for the users to find the malformed data, especially when they are bulk-inserting from data files of multiple GBs in size (hard to open such files in any editor too). I understand the value of writing the info to the server log or tables of users' choice as being discussed in a couple of other threads. However, I'd prefer we do the simplest thing first before we go debate server log or table - let the users know which line numbers contain the errors that COPY ignored, something like [1], with a simple change like [2]. Agreed. Unlike my patch, it hides the error information (i.e., 22P02: invalid input syntax for type integer: ), but I feel that it's usually sufficient to know the row number and column where the error occurred. It not only helps users figure out which rows and attributes were malformed, but also helps them redirect them to server logs by setting log_min_messages = notice [3]. In the worst case scenario, a problem with this one NOTICE per malformed row is that it can overload the psql session if all the rows are malformed. I'm not sure if this is a big problem, but IMO it's better than a single summary NOTICE message and simpler than writing to tables of users' choice. Maybe we could do what you suggested for the behavior when 'log' is set to on_error? Thoughts? 
FWIW, I presented the new COPY ... ON_ERROR option feature at Hyderabad PostgreSQL User Group meetup and it was well-received by the audience. The audience felt it's going to be extremely helpful for bulk-loading tasks. Thank you all for working on this feature. Thanks for informing us! I hope it will not be reverted, for the audience's sake :)

[1]
postgres=# CREATE TABLE check_ign_err (n int, m int[], k int);
CREATE TABLE
postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
1{1} 1 a {2} 2 3 {3} 33 4 {a, 4} 4 5 {5}>> >> >> >> >> 5 \.>>
NOTICE: detected data type incompatibility at line number 2 for column check_ign_err, COPY n
NOTICE: detected data type incompatibility at line number 3 for column check_ign_err, COPY k
NOTICE: detected data type incompatibility at line number 4 for column check_ign_err, COPY m
NOTICE: detected data type incompatibility at line number 5 for column check_ign_err, COPY n
NOTICE: 4 rows were skipped due to data type incompatibility
COPY 2

[2]
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 906756362e..93ab5d4ebb 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -961,7 +961,16 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 											&values[m]))
 		{
 			cstate->num_errors++;
-			return true;
+
+			if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+			{
+				ereport(NOTICE,
+						errmsg("detected data type incompatibility at line number %llu for column %s, COPY %s",
+							   (unsigned long long) cstate->cur_lineno,
+							   cstate->cur_relname,
+							   cstate->cur_attname));
+				return true;
+			}
 		}
 		cstate->cur_attname = NULL;

[3]
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 2 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 2, column n: "a"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 3 for column check_ign_err, COPY k
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 3, column k: "33"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 4 for column check_ign_err, COPY m
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 4, column m: "{a, 4}"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 5 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CON
Re: About a recently-added message
On Wed, Feb 14, 2024 at 1:04 PM Kyotaro Horiguchi wrote: > > Just after this, I found another inconsistency regarding quotation. > > > 'dbname' must be specified in "%s". > > The use of single quotes doesn't seem to comply with our standard. > Agreed. I think we have two choices here: one is to use dbname without any quotes ("dbname must be specified in \"%s\".", ...) or use double quotes ("\"%s\" must be specified in \"%s\".", "dbname", ...). I see messages like: "host name must be specified", so if we want to follow that, the earlier option makes more sense. Any suggestions? -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada wrote: > On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian wrote: > > > > > > > > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada > wrote: > >> > >> > >> I've attached the new version patch set. > >> > >> Regards, > >> > >> > >> -- > >> Masahiko Sawada > >> Amazon Web Services: https://aws.amazon.com > > > > > > Thanks for the patch. I reviewed that patch and did minimal testing and > it seems to show the speed-up as claimed. Some minor comments: > > Thank you for the comments! > > > patch 0001: > > > > +static void > > +bh_enlarge_node_array(binaryheap *heap) > > +{ > > + if (heap->bh_size < heap->bh_space) > > + return; > > > > why not check "if (heap->bh_size >= heap->bh_space)" outside this > function to avoid calling this function when not necessary? This check was > there in the code before the patch. > > Agreed. > > > > > patch 0003: > > > > +/* > > + * The threshold of the number of transactions in the max-heap > (rb->txn_heap) > > + * to switch the state. > > + */ > > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024 > > > > Typo: I think you meant REORDER_ and not REORDE_ > > Fixed. > > These comments are addressed in the v4 patch set I just shared [1]. > > > These changes look good to me. I've done some tests with a few varying levels of subtransactions and I could see that the patch was at least 5% better in all of them. regards, Ajin Cherian Fujitsu Australia
RE: Synchronizing slots from primary to standby
On Wednesday, February 14, 2024 6:05 PM Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila wrote: > > > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > Here is V87 patch that adds test for the suggested cases. > > > > > > > I have pushed this patch and it leads to a BF failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&d > > t=2024-02-14%2004%3A43%3A37 > > > > The test failures are: > > # Failed test 'logical decoding is not allowed on synced slot' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 272. > > # Failed test 'synced slot on standby cannot be altered' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 281. > > # Failed test 'synced slot on standby cannot be dropped' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 287. > > > > The reason is that in LOGs, we see a different ERROR message than what > > is expected: > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > > replication slot "lsub1_slot" is active for PID 1760871 > > > > Now, we see the slot still active because a test before these tests (# > > Test that if the synchronized slot is invalidated while the remote > > slot is still valid, ) is not able to successfully persist the > > slot and the synced temporary slot remains active. > > > > The reason is clear by referring to below standby LOGS: > > > > LOG: connection authorized: user=bf database=postgres > > application_name=040_standby_failover_slots_sync.pl > > LOG: statement: SELECT pg_sync_replication_slots(); > > LOG: dropped replication slot "lsub1_slot" of dbid 5 > > STATEMENT: SELECT pg_sync_replication_slots(); ... 
> > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > > WHERE slot_name = 'lsub1_slot'; > > > > In the above LOGs, we should ideally see: "newly created slot > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > > accomplish what it was supposed to. Ideally, the same test should have > > failed but the pass criteria for the test failed to check whether the > > slot is persisted or not. > > > > The probable reason for failure is that remote_slot's restart_lsn lags > > behind the oldest WAL segment on standby. Now, in the test, we do > > ensure that the publisher and subscriber are caught up by following > > steps: > > # Enable the subscription to let it catch up to the latest wal > > position $subscriber1->safe_psql('postgres', > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > > > $primary->wait_for_catchup('regress_mysub1'); > > > > However, this doesn't guarantee that restart_lsn is moved to a > > position new enough that standby has a WAL corresponding to it. > > > > To ensure that restart_lsn has been moved to a recent position, we need to log > XLOG_RUNNING_XACTS and make sure the same is processed as well by > walsender. The attached patch does the required change. > > Hou-San can reproduce this problem by adding additional checkpoints in the > test and after applying the attached it fixes the problem. Now, this patch is > mostly based on the theory we formed based on LOGs on BF and a reproducer > by Hou-San, so still, there is some chance that this doesn't fix the BF > failures in > which case I'll again look into those. I have verified that the patch can fix the issue on my machine (after adding a few more checkpoints before the slot invalidation test). I also added one more check in the test to confirm the synced slot is not a temp slot. Here is the v2 patch. 
Best Regards, Hou zj v2-0001-fix-040-standby-failover-slots-sync.patch Description: v2-0001-fix-040-standby-failover-slots-sync.patch
Re: Fix a typo in pg_rotate_logfile
Hi On Mon, 12 Feb 2024 at 21:31, Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > On Tue, Feb 13, 2024 at 2:29 AM Daniel Gustafsson wrote: > > > > On that note though, we might want to consider just dropping it > altogether in > > v17 (while fixing the incorrect hint in backbranches)? I can't imagine > > adminpack 1.0 being in heavy use today, and skimming pgAdmin code it > seems it's > > only used in pgAdmin3 and not 4. Maybe it's time to simply drop old code? > > https://codesearch.debian.net/search?q=pg_logfile_rotate&literal=1 > shows no users for it though. There's pgadmin3 using it > > https://github.com/search?q=repo%3Apgadmin-org%2Fpgadmin3%20pg_logfile_rotate&type=code > , > however the repo is archived. Surprisingly, core has to maintain the > old code needed for adminpack 1.0 - pg_rotate_logfile_old SQL function > and pg_rotate_logfile function in signalfuncs.c. These things could > have been moved to adminpack.c back then and pointed CREATE FUNCTION > pg_catalog.pg_logfile_rotate() to use it from adminpack.c. If we > decide to remove adminpack 1.0 version completely, the 1.0 functions > pg_file_read, pg_file_length and pg_logfile_rotate will also go away > making adminpack code simpler. > > Having said that, it's good to hear from others, preferably from > pgadmin developers - added Dave Page (dp...@pgadmin.org) in here for > inputs. > As it happens we're currently implementing a redesigned version of that functionality from pgAdmin III in pgAdmin 4. However, we are not using adminpack for it. FWIW, the reason for the weird naming is that originally all the functionality for reading/managing files was added entirely as the adminpack extension. It was only later that some of the functionality was moved into core, and renamed along the way (everyone likes blue for their bikeshed right?). The old functions (albeit, rewritten to use the new core functions) were kept in adminpack for backwards compatibility. 
That said, pgAdmin III has been out of support for many years, and as far as I know, it (and similarly old versions of EDB's PEM which was based on it) were the only consumers of adminpack. I would not be sad to see it removed entirely - except for the fact that I fondly remember being invited to join -core immediately after a heated discussion with Tom about it! -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org EDB: https://www.enterprisedb.com
Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm
Hi, Recently there have been a few upgrade TAP test failures in the buildfarm, like in [1] & [2]. Analysing these failures requires the log files generated by src/bin/pg_upgrade at the following locations: tmp_check/*/pgdata/pg_upgrade_output.d/*/*.txt - e.g. tmp_check/t_004_subscription_new_sub1_data/pgdata/pg_upgrade_output.d/20240214T052229.045/subs_invalid.txt tmp_check/*/pgdata/pg_upgrade_output.d/*/*/*.log - e.g. tmp_check/t_004_subscription_new_sub1_data/pgdata/pg_upgrade_output.d/20240214T052229.045/log/pg_upgrade_server.log The first wildcard is the testname_clusterinstance_data directory, the second is the timestamp used for pg_upgrade, the third is for the text files generated by pg_upgrade, and the fourth is for the log files generated by pg_upgrade. Can we also include these log files in the buildfarm? [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-02-10%2007%3A03%3A10 [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-12-07%2003%3A56%3A20 Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > wrote: > > > > Here is V87 patch that adds test for the suggested cases. > > > > I have pushed this patch and it leads to a BF failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37 > > The test failures are: > # Failed test 'logical decoding is not allowed on synced slot' > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 272. > # Failed test 'synced slot on standby cannot be altered' > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 281. > # Failed test 'synced slot on standby cannot be dropped' > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 287. > > The reason is that in LOGs, we see a different ERROR message than what > is expected: > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > replication slot "lsub1_slot" is active for PID 1760871 > > Now, we see the slot still active because a test before these tests (# > Test that if the synchronized slot is invalidated while the remote > slot is still valid, ) is not able to successfully persist the > slot and the synced temporary slot remains active. > > The reason is clear by referring to below standby LOGS: > > LOG: connection authorized: user=bf database=postgres > application_name=040_standby_failover_slots_sync.pl > LOG: statement: SELECT pg_sync_replication_slots(); > LOG: dropped replication slot "lsub1_slot" of dbid 5 > STATEMENT: SELECT pg_sync_replication_slots(); > ... 
> SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > WHERE slot_name = 'lsub1_slot'; > > In the above LOGs, we should ideally see: "newly created slot > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > accomplish what it was supposed to. Ideally, the same test should have > failed but the pass criteria for the test failed to check whether the > slot is persisted or not. > > The probable reason for failure is that remote_slot's restart_lsn lags > behind the oldest WAL segment on standby. Now, in the test, we do > ensure that the publisher and subscriber are caught up by following > steps: > # Enable the subscription to let it catch up to the latest wal position > $subscriber1->safe_psql('postgres', > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > $primary->wait_for_catchup('regress_mysub1'); > > However, this doesn't guarantee that restart_lsn is moved to a > position new enough that standby has a WAL corresponding to it. > To ensure that restart_lsn has been moved to a recent position, we need to log XLOG_RUNNING_XACTS and make sure the same is processed as well by walsender. The attached patch does the required change. Hou-San can reproduce this problem by adding additional checkpoints in the test and after applying the attached it fixes the problem. Now, this patch is mostly based on the theory we formed based on LOGs on BF and a reproducer by Hou-San, so still, there is some chance that this doesn't fix the BF failures in which case I'll again look into those. -- With Regards, Amit Kapila. fix_040_standby_failover_slots_sync.1.patch Description: Binary data
Re: About a recently-added message
On Wed, Feb 14, 2024 at 1:04 PM Kyotaro Horiguchi wrote: > > At Wed, 14 Feb 2024 16:26:52 +0900 (JST), Kyotaro Horiguchi > wrote in > > > "wal_level" must be >= logical. > .. > > > wal_level must be set to "replica" or "logical" at server start. > .. > > I suggest making the quoting policy consistent. > > Just after this, I found another inconsistency regarding quotation. > > > 'dbname' must be specified in "%s". > > The use of single quotes doesn't seem to comply with our standard. > Thanks for the report. I'll look into it after analyzing the BF failure caused by the same commit. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) wrote: > > Here is V87 patch that adds test for the suggested cases. > I have pushed this patch and it leads to a BF failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37 The test failures are: # Failed test 'logical decoding is not allowed on synced slot' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272. # Failed test 'synced slot on standby cannot be altered' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281. # Failed test 'synced slot on standby cannot be dropped' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287. The reason is that in LOGs, we see a different ERROR message than what is expected: 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: replication slot "lsub1_slot" is active for PID 1760871 Now, we see the slot still active because a test before these tests (# Test that if the synchronized slot is invalidated while the remote slot is still valid, ) is not able to successfully persist the slot and the synced temporary slot remains active. The reason is clear by referring to below standby LOGS: LOG: connection authorized: user=bf database=postgres application_name=040_standby_failover_slots_sync.pl LOG: statement: SELECT pg_sync_replication_slots(); LOG: dropped replication slot "lsub1_slot" of dbid 5 STATEMENT: SELECT pg_sync_replication_slots(); ... SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots WHERE slot_name = 'lsub1_slot'; In the above LOGs, we should ideally see: "newly created slot "lsub1_slot" is sync-ready now" after the "LOG: dropped replication slot "lsub1_slot" of dbid 5" but lack of that means the test didn't accomplish what it was supposed to. 
Ideally, the same test should have failed but the pass criteria for the test failed to check whether the slot is persisted or not. The probable reason for failure is that remote_slot's restart_lsn lags behind the oldest WAL segment on standby. Now, in the test, we do ensure that the publisher and subscriber are caught up by following steps: # Enable the subscription to let it catch up to the latest wal position $subscriber1->safe_psql('postgres', "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); $primary->wait_for_catchup('regress_mysub1'); However, this doesn't guarantee that restart_lsn is moved to a position new enough that standby has a WAL corresponding to it. One easy fix is to re-create the subscription with the same slot_name after we have ensured that the slot has been invalidated on standby so that a new restart_lsn is assigned to the slot but it is better to analyze some more why the slot's restart_lsn hasn't moved enough only sometimes. -- With Regards, Amit Kapila.
RE: speed up a logical replica setup
Dear Euler, > If someone modifies data after promotion, fine; she has to deal with conflicts, if any. IMO it is solved adding one or two sentences in the documentation. > OK. I could not find issues, for now. > > > > Regarding > GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless > the > server is restarted). The ones that are not PGC_POSTMASTER, does not affect > the > pg_createsubscriber execution [1]. > > > > IIUC, primary_conninfo and primary_slot_name is PGC_SIGHUP. Ditto. > Just to confirm: even if primary_slot_name were changed during the conversion, the slot initially set would be dropped. Currently we do not find any issues. > > > > I'm just pointing out that this case is a different from pg_upgrade (from > which > this idea was taken). I'm not saying that's a bad idea. I'm just arguing that > you might be preventing some access read only access (monitoring) when it is > perfectly fine to connect to the database and execute queries. As I said > before, the current UI allows anyone to setup the standby to accept only local > connections. Of course, it is an extra step but it is possible. However, once > you apply v16-0007, there is no option but use only local connection during > the > transformation. Is it an acceptable limitation? > > > > My remaining concern is written above. If they are not problematic we may not > have > to restrict them for now. At that time, changes > > 1) overwriting a port number, > 2) setting listen_addresses = '' > It can be implemented later if people are excited by it. > are not needed, right? IIUC the inconsistency of -P may still be problematic. I still think we shouldn't have only the transformed primary_conninfo as an option. > Hmm, OK. So let me summarize the current status and discussions. Policy) Basically, we do not prohibit connections to the primary/standby. primary_slot_name may be changed during the conversion and tuples may be inserted on the target just after the promotion, but there seem to be no issues. 
API) -D (data directory) and -d (databases) are definitely needed. Regarding -P (a connection string for the source), we can require it for now. But note that it may cause an inconsistency if the node pointed to by -P is different from the node pointed to by primary_conninfo. As for the connection string for the target server, we can choose between two ways: a) accept a native connection string as -S. This can reuse the same parsing mechanism as -P, but there is room for a non-local server to be specified. b) accept username/port as -U/-p (Since the policy is like above, listen_addresses would not be overwritten. Also, the port just specifies the listening port). This can avoid connecting to a non-local server, but more options may be needed. (E.g., Is a socket directory needed? What about a password?) Other discussion points, reported issues) Points raised by me [1] are not solved yet. * What if the target version is PG16-? * What if the found executables have a different version from pg_createsubscriber? * What if the target is sending WAL to another server? I.e., there are clusters like `node1->node2->node3`, and the target is node2. * Can we really clean up the standby in case of failure? Shouldn't we suggest removing the target once? * Can we move outputs to stdout? [1]: https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/