Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb
On Wed, Jun 28, 2023 at 09:20:03PM -0700, Gurjeet Singh wrote:
> The comment on top of connect_utils.c:connectDatabase() seems pertinent:
>
>> (Callers should not pass
>> * allow_password_reuse=true unless reconnecting to the same database+user
>> * as before, else we might create password exposure hazards.)
>
> The callers of {cluster|reindex}_one_database() (which in turn call
> connectDatabase()) clearly pass different database names in successive
> calls to these functions.  So the patch seems to be in conflict with
> the recommendation in the comment.
>
> I'm not sure if the concern raised in that comment is a legitimate
> one, though.  I mean, if the password is reused to connect to a
> different database in the same cluster/instance, which I think is
> always the case with these utilities, the password will be exposed in
> the server logs (if at all).  And since the admins of the instance
> already have full control over the passwords of the user, I don't think
> this patch will give them any more information than what they can get
> anyways.
>
> It is a valid concern, though, if the utility connects to a different
> instance in the same run/invocation, and hence exposes the password
> from the first instance to the admins of the second cluster.

The same commit that added this comment (ff402ae) also set the
allow_password_reuse parameter to true in vacuumdb's connectDatabase()
calls.  I found a message from the corresponding thread that provides
some additional detail [0].  I wonder if this comment should instead
recommend against using the allow_password_reuse flag unless reconnecting
to the same host/port/user target.  Connecting to different databases
with the same host/port/user information seems okay.  Maybe I am missing
something...

> Nitpicking: The patch seems to have Windows line endings, which
> explains why my `patch` complained so loudly.
>
> $ patch -p1 < v1-0001-harmonize-patch
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/ref/reindexdb.sgml
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/ref/vacuumdb.sgml
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/bin/scripts/clusterdb.c
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/bin/scripts/reindexdb.c
>
> $ file v1-0001-harmonize-password-reuse-in-vacuumdb-clusterdb-an.patch
> v1-0001-harmonize-patch: unified diff output text, ASCII text,
> with CRLF line terminators

Huh.  I didn't write it on a Windows machine.  I'll look into it.

[0] https://postgr.es/m/15139.1447357263%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
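As a quick illustration of the reuse rule proposed above — reuse is safe
when the host/port/user triple is unchanged, even if the database differs —
here is a sketch in Python (the helper name and tuple shape are hypothetical;
this is not libpq's actual logic):

```python
def may_reuse_password(prev, new):
    """Sketch of the proposed rule: reusing a saved password is safe when
    the host/port/user triple is unchanged, regardless of dbname.
    'prev' and 'new' are (host, port, user, dbname) tuples."""
    return prev[:3] == new[:3]  # ignore the dbname component

# Same instance, different database: reuse is fine.
assert may_reuse_password(("db1.example", 5432, "postgres", "postgres"),
                          ("db1.example", 5432, "postgres", "template1"))
# Different host: do not expose the first instance's password.
assert not may_reuse_password(("db1.example", 5432, "postgres", "postgres"),
                              ("db2.example", 5432, "postgres", "postgres"))
```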
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
At Wed, 28 Jun 2023 16:24:02 -0700, Nathan Bossart wrote in
> While working on some other patches, I found myself wanting to use the
> following command to vacuum the catalogs in all databases in a cluster:
>
> 	vacuumdb --all --schema pg_catalog
>
> However, this presently fails with the following error:
>
> 	cannot vacuum specific schema(s) in all databases
>
> AFAICT there is no technical reason to block this, and the resulting
> behavior feels intuitive to me, so I wrote 0001 to allow it.  0002 allows
> specifying tables to process in all databases in clusterdb, and 0003
> allows specifying tables, indexes, schemas, or the system catalogs to
> process in all databases in reindexdb.

It seems useful.

> I debated also allowing users to specify different types of objects in the
> same command (e.g., "vacuumdb --schema myschema --table mytable"), but it
> looked like this would require a more substantial rewrite, and I didn't
> feel that the behavior was intuitive.  For the example I just gave, does
> the user expect us to process both the "myschema" schema and the "mytable"
> table, or does the user want us to process the "mytable" table in the
> "myschema" schema?  In vacuumdb, this is already blocked, but reindexdb

I think specifying the two at once is inconsistent if we maintain the
current behavior of those options.  It seems to me that that change
clearly modifies the functionality of the options: as a result, those
options look like restriction filters.  For example, "vacuumdb -s s1_*
-t t1" would vacuum all tables named "t1" in all schemas matching
"s1_*".

> accepts combinations of tables, schemas, and indexes (yet disallows
> specifying --system along with other types of objects).  Since this is
> inconsistent with vacuumdb and IMO ambiguous, I've restricted such
> combinations in 0003.
>
> Thoughts?

While I think this is useful, primarily for system catalogs, I'm not
entirely convinced of its practicality for user objects.  It's difficult
for me to imagine that a situation where all databases share the same
schema would be common.  Assuming this is used for user objects, it may
be necessary to safely skip databases that lack the specified schema or
table, provided the object is present in at least one other database,
and that exclusion should print a warning.  It could also be necessary
to safely move on to the next object when a reindex or cluster operation
fails on a single object due to a missing prerequisite.  But I don't
think we want to add such complexity to these "script" tools.

So... an alternative path might be to introduce a new option like
--syscatalog to specify system catalogs as the only option that can be
combined with --all.  In doing so, we can leave the --table and --schema
options untouched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: add \dpS to psql [16beta1]
On Thu, Jun 29, 2023 at 02:11:43AM +, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> I found while testing PostgreSQL 16 Beta 1 that the output of the \?
> metacommand did not include \dS, \dpS.
> The attached patch changes the output of the \? meta command to:

Thanks for the report!  I've committed your patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
RE: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Thursday, June 29, 2023 12:06 PM vignesh C wrote:
>
> On Wed, 28 Jun 2023 at 19:26, Ashutosh Bapat wrote:
> >
> > Hi Vignesh,
> > Thanks for working on this.
> >
> > On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote:
> > >
> > > Here is a patch having the fix for the same. I have not added any
> > > tests as the existing tests cover this scenario. The same issue is
> > > present in back branches too.
> >
> > Interesting, we have a test for this scenario and it accepts erroneous
> > output :).
> >
> > > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch
> > > can be applied on master, PG15 and PG14,
> > > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch
> > > patch can be applied on PG13, PG12 and PG11.
> > > Thoughts?
> >
> > I noticed this when looking at Tomas's patches for logical decoding of
> > sequences. The code block you have added is repeated in
> > pg_decode_change() and pg_decode_truncate(). It might be better to
> > push the conditions in pg_output_begin() itself so that any future
> > callsite of pg_output_begin() automatically takes care of these
> > conditions.
>
> Thanks for the comments, here is an updated patch handling the above issue.

Thanks for the patches.  I tried to understand the following check:

	/*
	 * If asked to skip empty transactions, we'll emit BEGIN at the point
	 * where the first operation is received for this transaction.
	 */
-	if (data->skip_empty_xacts)
+	if (!(last_write ^ data->skip_empty_xacts) || txndata->xact_wrote_changes)
		return;

I might miss something, but would you mind elaborating on why we use
"last_write" in this check?

Best Regards,
Hou zj
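For reference, the guard in question can be tabulated.  A quick sketch in
Python (the booleans mirror the C variables; `last_write=True` models the
call from pg_decode_begin_txn(), `last_write=False` the call from a change
callback — this is only an illustration, not the plugin code):

```python
def emits_begin(last_write, skip_empty_xacts, xact_wrote_changes):
    """True when the patched pg_output_begin() does NOT return early,
    i.e. when it actually writes BEGIN."""
    return not (not (last_write ^ skip_empty_xacts) or xact_wrote_changes)

# First call for the transaction (nothing written yet):
assert emits_begin(True, False, False)       # BEGIN callback, not skipping
assert not emits_begin(True, True, False)    # BEGIN callback, skipping: defer
assert emits_begin(False, True, False)       # first change, skipping: emit now
assert not emits_begin(False, False, False)  # change callback: already emitted
# Once xact_wrote_changes is set, BEGIN is never emitted again:
assert not emits_begin(False, True, True)
```

So, as far as the table shows, BEGIN is emitted exactly once per transaction
either at the BEGIN callback or at the first change, depending on
skip-empty-xacts.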
Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb
On Tue, Jun 27, 2023 at 9:57 PM Nathan Bossart wrote:
>
> While looking at something unrelated, I noticed that the vacuumdb docs
> mention the following:
>
> 	vacuumdb might need to connect several times to the PostgreSQL server,
> 	asking for a password each time.
>
> IIUC this has been fixed since 83dec5a from 2015 (which was superseded by
> ff402ae), so I think this note (originally added in e0a77f5 from 2002) can
> now be removed.
>
> I also found that neither clusterdb nor reindexdb uses the
> allow_password_reuse parameter in connectDatabase(), and the reindexdb
> documentation contains the same note about repeatedly asking for a
> password (originally added in 85e9a5a from 2005).  IMO we should allow
> password reuse for all three programs, and we should remove the
> aforementioned notes in the docs, too.  This is what the attached patch
> does.
>
> Thoughts?

The comment on top of connect_utils.c:connectDatabase() seems pertinent:

> (Callers should not pass
>  * allow_password_reuse=true unless reconnecting to the same database+user
>  * as before, else we might create password exposure hazards.)

The callers of {cluster|reindex}_one_database() (which in turn call
connectDatabase()) clearly pass different database names in successive
calls to these functions.  So the patch seems to be in conflict with the
recommendation in the comment.

I'm not sure if the concern raised in that comment is a legitimate one,
though.  I mean, if the password is reused to connect to a different
database in the same cluster/instance, which I think is always the case
with these utilities, the password will be exposed in the server logs (if
at all).  And since the admins of the instance already have full control
over the passwords of the user, I don't think this patch will give them
any more information than what they can get anyways.

It is a valid concern, though, if the utility connects to a different
instance in the same run/invocation, and hence exposes the password from
the first instance to the admins of the second cluster.

Nitpicking: The patch seems to have Windows line endings, which explains
why my `patch` complained so loudly.

$ patch -p1 < v1-0001-harmonize-patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/reindexdb.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/vacuumdb.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/scripts/clusterdb.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/scripts/reindexdb.c

$ file v1-0001-harmonize-password-reuse-in-vacuumdb-clusterdb-an.patch
v1-0001-harmonize-patch: unified diff output text, ASCII text,
with CRLF line terminators

Best regards,
Gurjeet
http://Gurje.et
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Wed, 28 Jun 2023 at 19:26, Ashutosh Bapat wrote:
>
> Hi Vignesh,
> Thanks for working on this.
>
> On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote:
> >
> > Here is a patch having the fix for the same. I have not added any
> > tests as the existing tests cover this scenario. The same issue is
> > present in back branches too.
>
> Interesting, we have a test for this scenario and it accepts erroneous
> output :).
>
> > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch
> > can be applied on master, PG15 and PG14,
> > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch
> > patch can be applied on PG13, PG12 and PG11.
> > Thoughts?
>
> I noticed this when looking at Tomas's patches for logical decoding of
> sequences. The code block you have added is repeated in
> pg_decode_change() and pg_decode_truncate(). It might be better to
> push the conditions in pg_output_begin() itself so that any future
> callsite of pg_output_begin() automatically takes care of these
> conditions.

Thanks for the comments, here is an updated patch handling the above issue.

Regards,
Vignesh

From 130715a6dd808ed88af8c894d7e27b8637bc2c3a Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Wed, 28 Jun 2023 14:01:22 +0530
Subject: [PATCH v2] Call pg_output_begin in pg_decode_message if it is the
 first change in the transaction.

Call pg_output_begin in pg_decode_message if it is the first change in
the transaction.
---
 contrib/test_decoding/expected/messages.out | 10 +--
 contrib/test_decoding/sql/messages.sql      |  2 +-
 contrib/test_decoding/test_decoding.c       | 31 +
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/contrib/test_decoding/expected/messages.out b/contrib/test_decoding/expected/messages.out
index c75d40190b..0fd70036bd 100644
--- a/contrib/test_decoding/expected/messages.out
+++ b/contrib/test_decoding/expected/messages.out
@@ -58,17 +58,23 @@ SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic');
  ignorethis
 (1 row)
 
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0');
    data
 ------------------------------------------------------------
+ BEGIN
  message: transactional: 1 prefix: test, sz: 4 content:msg1
+ COMMIT
  message: transactional: 0 prefix: test, sz: 4 content:msg2
  message: transactional: 0 prefix: test, sz: 4 content:msg4
  message: transactional: 0 prefix: test, sz: 4 content:msg6
+ BEGIN
  message: transactional: 1 prefix: test, sz: 4 content:msg5
  message: transactional: 1 prefix: test, sz: 4 content:msg7
+ COMMIT
+ BEGIN
  message: transactional: 1 prefix: test, sz: 11 content:czechtastic
-(7 rows)
+ COMMIT
+(13 rows)
 
 -- test db filtering
 \set prevdb :DBNAME
diff --git a/contrib/test_decoding/sql/messages.sql b/contrib/test_decoding/sql/messages.sql
index cf3f7738e5..3d8500f99c 100644
--- a/contrib/test_decoding/sql/messages.sql
+++ b/contrib/test_decoding/sql/messages.sql
@@ -19,7 +19,7 @@ COMMIT;
 SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic');
 
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0');
 
 -- test db filtering
 \set prevdb :DBNAME
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..86c29de40b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -211,15 +211,21 @@ pg_decode_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	TestDecodingData *data = ctx->output_plugin_private;
 
 	data->xact_wrote_changes = false;
-	if (data->skip_empty_xacts)
-		return;
 
 	pg_output_begin(ctx, data, txn, true);
 }
 
 static void
-pg_output_begin(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
+pg_output_begin(LogicalDecodingContext *ctx, TestDecodingData *data,
+				ReorderBufferTXN *txn, bool last_write)
 {
+	/*
+	 * If asked to skip empty transactions, we'll emit BEGIN at the point
+	 * where the first operation is received for this transaction.
+	 */
+	if (!(last_write ^ data->skip_empty_xacts) || data->xact_wrote_changes)
+		return;
+
 	OutputPluginPrepareWrite(ctx, last_write);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
@@ -403,10 +409,7 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	data = ctx->output_plugin_private;
 
 	/* output BEGIN if we haven't yet */
-	if (data->skip_empty_xacts && !data->xact_wrote_changes)
-	{
-
Re: Changing types of block and chunk sizes in memory contexts
On Thu, 29 Jun 2023 at 09:26, Tomas Vondra wrote:
> 	AllocSetContext    224B -> 208B (4 cachelines)
> 	GenerationContext  152B -> 136B (3 cachelines)
> 	SlabContext        200B -> 200B (no change, adds 4B hole)
>
> Nothing else changes, AFAICS.

I don't think a lack of a reduction in the number of cache lines is the
important part.  Allowing more space on the keeper block, which is at the
end of the context struct, seems more useful.

I understand that the proposal is just to shave off 12 bytes, and that's
not exactly huge when it's just once per context, but we do create quite
a large number of contexts with ALLOCSET_SMALL_SIZES, which have a 1KB
initial block size.  12 bytes in 1024 is not terrible.

It's not exactly an invasive change.  It does not add any complexity to
the code and, as far as I can see, about zero risk of it slowing anything
down.

David
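The cacheline counts quoted from pahole above follow directly from the
struct sizes, assuming the common 64-byte x86-64 cacheline and a
cacheline-aligned start.  A quick sketch:

```python
import math

CACHELINE = 64  # assumed x86-64 cacheline size

def cachelines(struct_size):
    """Cachelines a struct spans, assuming it starts cacheline-aligned."""
    return math.ceil(struct_size / CACHELINE)

# Sizes quoted from pahole, before -> after the patch
for name, before, after in [("AllocSetContext", 224, 208),
                            ("GenerationContext", 152, 136),
                            ("SlabContext", 200, 200)]:
    print(name, cachelines(before), "->", cachelines(after))
# AllocSetContext stays at 4 cachelines, GenerationContext at 3 — the
# shrinkage doesn't cross a cacheline boundary, which is David's point.
```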
Re: Another incorrect comment for pg_stat_statements
On Thu, Jun 29, 2023 at 8:19 AM Michael Paquier wrote: > Nothing much to add, so applied with the initial comment fix. Thanks for pushing it! Thanks Richard
Trivial revise for the check of parameterized partial paths
While working on the invalid parameterized join path issue [1], I noticed
that we can simplify the code for checking parameterized partial paths in
try_partial_hashjoin/mergejoin_path, with the help of the macro
PATH_REQ_OUTER.

-	if (inner_path->param_info != NULL)
-	{
-		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
-
-		if (!bms_is_empty(inner_paramrels))
-			return;
-	}
+	if (!bms_is_empty(PATH_REQ_OUTER(inner_path)))
+		return;

Also there is a comment there that is not correct.

	 * If the inner path is parameterized, the parameterization must be fully
	 * satisfied by the proposed outer path.

This is true for nestloop but not for hashjoin/mergejoin.

Besides, I wonder if it'd be better to verify that the outer input path
for a partial join path does not have any parameterization dependency.

Attached is a patch for all these changes.

[1] https://www.postgresql.org/message-id/flat/CAJKUy5g2uZRrUDZJ8p-%3DgiwcSHVUn0c9nmdxPSY0jF0Ov8VoEA%40mail.gmail.com

Thanks
Richard

v1-0001-Trivial-revise-for-the-check-of-parameterized-partial-paths.patch
Description: Binary data
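The equivalence behind the simplification is that PATH_REQ_OUTER yields the
empty set for an unparameterized path, so both the old two-step test and the
new one-liner reject exactly the same paths.  A toy model in Python (dict
keys mirror the C field names; this is an illustration, not the planner
code):

```python
def path_req_outer(path):
    """Analogue of the PATH_REQ_OUTER macro: the required outer relids,
    or the empty set when the path has no parameterization."""
    pi = path["param_info"]
    return pi["ppi_req_outer"] if pi else set()

def old_guard(inner_path):
    # Original code: explicit NULL check, then bms_is_empty().
    if inner_path["param_info"] is not None:
        if path_req_outer(inner_path):
            return True  # reject this partial-join candidate
    return False

def new_guard(inner_path):
    # Proposed code: !bms_is_empty(PATH_REQ_OUTER(inner_path))
    return bool(path_req_outer(inner_path))

for p in ({"param_info": None},
          {"param_info": {"ppi_req_outer": set()}},
          {"param_info": {"ppi_req_outer": {3}}}):
    assert old_guard(p) == new_guard(p)
```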
Re: [PGdocs] fix description for handling pf non-ASCII characters
On Fri, Jun 23, 2023 at 10:25 PM Hayato Kuroda (Fujitsu) wrote:
>
> Dear hackers,
>
> While discussing based on the article [1] with Japanese developers,
> I found inconsistencies between the code and the documents.
>
> 45b1a67a [2] changed the behavior when non-ASCII characters were set in
> application_name, cluster_name and postgres_fdw.application_name, but it
> seems this was not documented.
> Previously non-ASCII chars were replaced with question marks '?', but now
> they are replaced with a hex escape instead.
>
> What do you think?  Is my understanding correct?
>
> Acknowledgement:
> Sawada-san and Shinoda-san led the developer's discussion.
> Fujii-san confirmed my points.  Thank you for all of their work!
>
> [1]: https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
> [2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45b1a67a0fcb3f1588df596431871de4c93cb76f;hp=da5d4ea5aaac4fc02f2e2aec272efe438dd4e171
>
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

In your patch:

> printable ASCII characters will be replaced with a hex escape.

This wording is not quite right.  I think the result will be: ASCII
characters are kept as-is, and non-ASCII characters are replaced with
"a hex escape".

	set application_name to 'abc漢字Abc';
	SET
	test16=# show application_name;
	        application_name
	--------------------------------
	 abc\xe6\xbc\xa2\xe5\xad\x97Abc
	(1 row)

I see multiple escapes, so I am not sure about "a hex escape".

To properly render it back to 'abc漢字Abc', here is how I do it:

	select 'abc' || convert_from(decode('e6bca2e5ad97', 'hex'), 'UTF8') || 'Abc';

I guess it's still painful if your application_name has non-ASCII chars.
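The per-byte replacement observed above — each non-ASCII byte of the UTF-8
encoding becomes its own \xNN escape — can be reproduced outside the server.
A sketch in Python of the behavior as described in this thread (illustrative
only, not the actual C escaping code):

```python
def escape_name(name: str) -> str:
    """Replace each non-printable or non-ASCII byte of the UTF-8 encoding
    with a \\xNN escape, keeping printable ASCII as-is."""
    out = []
    for b in name.encode("utf-8"):
        if 32 <= b < 127:            # printable ASCII passes through
            out.append(chr(b))
        else:                        # each high/control byte escaped
            out.append("\\x%02x" % b)
    return "".join(out)

print(escape_name("abc漢字Abc"))
# -> abc\xe6\xbc\xa2\xe5\xad\x97Abc
```

This matches the `show application_name` output above: 漢 is the three UTF-8
bytes e6 bc a2 and 字 is e5 ad 97, hence the six escapes.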
Re: Assert !bms_overlap(joinrel->relids, required_outer)
On Wed, Jun 28, 2023 at 10:09 PM Tom Lane wrote:
> However, given that what we need is to exclude parameterization
> that depends on the currently-formed OJ, it seems to me we can do
> it more simply and without any new JoinPathExtraData field,
> as attached.  What do you think?

I think it makes sense.  At first I wondered if we should also exclude
parameterization that depends on OJs that have already been formed as
part of this joinrel.  But it seems not possible that the input paths
have a parameterization dependency on these OJs.  So it should be
sufficient to only consider the currently-formed OJ.

> > * I think we need to check the incompatible relids also in
> > try_hashjoin_path and try_mergejoin_path besides try_nestloop_path.
>
> I think this isn't necessary, at least in my formulation.
> Those cases will go through calc_non_nestloop_required_outer
> which has
>
> 	/* neither path can require rels from the other */
> 	Assert(!bms_overlap(outer_paramrels, inner_path->parent->relids));
> 	Assert(!bms_overlap(inner_paramrels, outer_path->parent->relids));
>
> In order to have a dependency on an OJ, a path would have to have
> a dependency on at least one of the OJ's base relations too, so
> I think these assertions show that the case won't arise.  (Of
> course, if someone can trip one of these assertions, I'm wrong.)

Hmm, while this holds in most cases, it does not if the joins have been
commuted according to identity 3.  If we change the t3/t4 join's qual to
't3.a = t4.a' to make hashjoin possible, we'd see the same Assert failure
through try_hashjoin_path.  I think it's also possible for merge join.

explain (costs off)
select 1 from t t1
    join lateral
      (select t1.a from (select 1) foo offset 0) s1 on true
    join
      (select 1 from t t2
         inner join t t3
           left join t t4
             left join t t5 on t4.a = 1
           on t3.a = t4.a
         on false
       where t3.a = coalesce(t5.a, 1)) as s2
    on true;
server closed the connection unexpectedly

Thanks
Richard
RE: add \dpS to psq [16beta1]
Hi,

Thank you for developing a good feature.  I found while testing
PostgreSQL 16 Beta 1 that the output of the \? metacommand did not
include \dS, \dpS.  The attached patch changes the output of the \?
meta command to:

Current output
psql=# \?
  \z     [PATTERN]      same as \dp
  \dp    [PATTERN]      list table, view, and sequence access privileges

Patched output
psql=# \?
  \dp[S] [PATTERN]      list table, view, and sequence access privileges
  \z[S]  [PATTERN]      same as \dp

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Nathan Bossart
Sent: Tuesday, January 10, 2023 2:46 AM
To: Dean Rasheed
Cc: Maxim Orlov; Andrew Dunstan; Michael Paquier; Tom Lane; Isaac Morland; Justin Pryzby; Pavel Luzanov; pgsql-hack...@postgresql.org
Subject: Re: add \dpS to psql

On Sat, Jan 07, 2023 at 11:18:59AM +, Dean Rasheed wrote:
> It might be true that temp tables aren't usually interesting from a
> permissions point of view, but it's not hard to imagine situations
> where interesting things do happen.  It's also probably the case that
> most users won't have many temp tables, so I don't think including
> them by default will be particularly intrusive.
>
> Also, from a user perspective, I think it would be something of a POLA
> violation for \dp[S] and \dt[S] to include different sets of tables,
> though I appreciate that we do that now.  There's nothing in the docs
> to indicate that that's the case.

Agreed.

> Anyway, I've pushed the v2 patch as-is.  If anyone feels strongly
> enough that we should change its behaviour for temp tables, then we
> can still discuss that.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

psql_dpS_metacommand_v1.diff
Description: psql_dpS_metacommand_v1.diff
Re: POC, WIP: OR-clause support for indexes
Em qua., 28 de jun. de 2023 às 18:45, Tomas Vondra
<tomas.von...@enterprisedb.com> escreveu:

> On 6/27/23 20:55, Ranier Vilela wrote:
> > Hi,
> >
> >> I finished writing the code patch for transformation "Or" expressions to
> >> "Any" expressions.  I didn't see any problems in regression tests, even
> >> when I changed the constant at which the minimum or expression is
> >> replaced by any at 0.  I ran my patch on sqlancer and so far the code has
> >> never fallen.
> > Thanks for working on this.
> >
> > I took the liberty of making some modifications to the patch.
> > I didn't compile or test it.
> > Please feel free to use them.
>
> I don't want to be rude, but this doesn't seem very helpful.

Sorry, it was not my intention to cause interruptions.

> - You made some changes, but you don't even attempt to explain what you
>   changed or why you changed it.

1. Reduce scope
2. Eliminate unnecessary variables
3. Eliminate unnecessary expressions

> - You haven't even tried to compile the code, nor tested it.  If it
>   happens to compile, how could others even know it actually behaves the
>   way you wanted?

Attached v2 with make check passing all tests.
Ubuntu 64 bits, gcc 64 bits.

> - You responded in a way that breaks the original thread, so it's not
>   clear which message you're responding to.

It was a pretty busy day.  Sorry for the noise; I hope I was of some help.

regards,
Ranier Vilela

P.S.
0001-Replace-clause-X-N1-OR-X-N2-.-with-X-ANY-N1-N2-on.patch fails with 4 tests.
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 346fd272b6..74c256258d 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -95,6 +95,291 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 										 A_Expr *distincta, Node *arg);
 
+typedef struct OrClauseGroupEntry
+{
+	Node	   *node;
+	List	   *consts;
+	Oid			scalar_type;
+	Oid			opno;
+	Expr	   *expr;
+} OrClauseGroupEntry;
+
+static int	const_transform_or_limit = 0;
+
+static Node *
+transformBoolExprOr(ParseState *pstate, Expr *expr_orig)
+{
+	List	   *or_list = NIL;
+	List	   *groups_list = NIL;
+	ListCell   *lc_eargs;
+	Node	   *result;
+	BoolExpr   *expr;
+	const char *opname;
+	bool		change_apply = false;
+	bool		or_statement;
+
+	Assert(IsA(expr, BoolExpr));
+
+	expr = (BoolExpr *) expr_orig;
+	if (list_length(expr->args) < const_transform_or_limit)
+		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+
+	/* If this is not expression "Or", then will do it the old way. */
+	switch (expr->boolop)
+	{
+		case AND_EXPR:
+			opname = "AND";
+			break;
+		case OR_EXPR:
+			return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+			break;
+		case NOT_EXPR:
+			opname = "NOT";
+			break;
+		default:
+			elog(ERROR, "unrecognized boolop: %d", (int) expr->boolop);
+			opname = NULL;		/* keep compiler quiet */
+			break;
+	}
+
+	/*
+	 * NOTE:
+	 * It is an OR-clause. So, rinfo->orclause is a BoolExpr node, contains
+	 * a list of sub-restrictinfo args, and rinfo->clause - which is the
+	 * same expression, made from bare clauses. To not break selectivity
+	 * caches and other optimizations, use both:
+	 * - use rinfos from orclause if no transformation needed
+	 * - use bare quals from rinfo->clause in the case of transformation,
+	 * to create new RestrictInfo: in this case we have no options to avoid
+	 * selectivity estimation procedure.
+	 */
+	or_statement = false;
+	expr = (BoolExpr *) copyObject(expr_orig);
+	foreach(lc_eargs, expr->args)
+	{
+		A_Expr	   *arg = (A_Expr *) lfirst(lc_eargs);
+		Node	   *bare_orarg;
+		Node	   *const_expr;
+		Node	   *non_const_expr;
+		ListCell   *lc_groups;
+		OrClauseGroupEntry *gentry;
+		bool		allow_transformation;
+
+		/*
+		 * The first step: checking that the expression consists only of
+		 * equality.  We can only do this here, while arg is still row data
+		 * type,
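The core idea of the patch under discussion — rewriting "x = 1 OR x = 2"
into "x = ANY('{1,2}')" — amounts to grouping the OR arms by their left-hand
expression and operator and collecting the constants.  A sketch of that
grouping step in Python (illustrative only; the real patch operates on
parser Node trees, not tuples):

```python
from collections import defaultdict

def group_or_arms(arms):
    """Group arms of an OR clause of the form (column, op, constant) by
    (column, op), as a precursor to emitting "col op ANY(consts)"."""
    groups = defaultdict(list)
    for column, op, const in arms:
        groups[(column, op)].append(const)
    return dict(groups)

arms = [("x", "=", 1), ("x", "=", 2), ("y", "=", 3)]
print(group_or_arms(arms))
# -> {('x', '='): [1, 2], ('y', '='): [3]}
```

Only groups with more than one constant are worth rewriting; a singleton
group like ('y', '=') would be left as a plain comparison.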
Re: pg_rewind WAL segments deletion pitfall
At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia wrote in
> On 2022-09-29 17:18, Polina Bungina wrote:
> > I agree with your suggestions, so here is the updated version of
> > patch. Hope I haven't missed anything.
> > Regards,
> > Polina Bungina
>
> Thanks for working on this!
> It seems like we are also facing the same issue.

Thanks for looking at this.

> I tested the v3 patch under our condition; the old primary succeeded
> in becoming the new standby.
>
> BTW, when I used pg_rewind-removes-wal-segments-reproduce.sh attached
> in [1], the old primary also failed to become a standby:
>
> FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 00020007 has already been removed
>
> However, I think this is not a problem: just adding restore_command
> like below fixed the situation.
>
> echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
> oldprim/postgresql.conf

I thought along the same lines at first, but that's not the point here.
The problem we want to address is that pg_rewind ultimately removes
certain crucial WAL files required for the new primary to start, despite
them being present previously.  In other words, that restore_command
works, but it only undoes what pg_rewind wrongly did, resulting in
unnecessary consumption of I/O and/or network bandwidth that essentially
serves no purpose.

pg_rewind already has a feature that determines how each file should be
handled, but it is currently making wrong decisions for WAL files.  The
goal here is to rectify this behavior and ensure that pg_rewind makes
the right decisions.

> Attached modified reproduction script for reference.
>
> [1] https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Another incorrect comment for pg_stat_statements
On Thu, 29 Jun 2023 at 08:19, Michael Paquier wrote:
> On Wed, Jun 28, 2023 at 09:26:02PM +0800, Japin Li wrote:
>> +1. LGTM.
>
> Nothing much to add, so applied with the initial comment fix.

Thanks!

-- 
Regards,
Japin Li.
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
On Wed, Jun 28, 2023 at 10:45:02AM +0200, Peter Eisentraut wrote:
> Right, the usual style is just to check whether an environment variable is
> set to something, not what it is.
>
> Also note that in general not all environment variables are processed by
> Perl, so I would avoid encoding Perl semantics about what is "true" or
> whatever into it.

Agreed.  I am not sure that this is worth changing to have boolean-like
checks.  Hence, I would also like to keep the patch so that it checks if
the environment variable is defined to enforce the behavior, without
checking for a specific value.

-- 
Michael
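The distinction being settled here — "defined at all" versus "set to a
truthy value" — can be made concrete with a small sketch.  The real TAP
tests are Perl; this Python version is only an illustration of the agreed
semantics:

```python
import os

def nocleanup_requested(env=os.environ):
    """PG_TEST_NOCLEAN semantics as discussed in the thread: honor the
    variable whenever it is defined, regardless of its value (even "0"
    or ""), rather than interpreting the value as a boolean."""
    return "PG_TEST_NOCLEAN" in env

assert nocleanup_requested({"PG_TEST_NOCLEAN": "1"})  # defined: keep tempdirs
assert nocleanup_requested({"PG_TEST_NOCLEAN": "0"})  # still defined: keep
assert not nocleanup_requested({})                    # undefined: clean up
```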
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 09:26:02PM +0800, Japin Li wrote:
> +1. LGTM.

Nothing much to add, so applied with the initial comment fix.

-- 
Michael
Re: Changing types of block and chunk sizes in memory contexts
Hi,

On 2023-06-28 17:56:55 -0400, Tom Lane wrote:
> Tomas Vondra writes:
> > ... 4B is tiny compared to what we waste due to the doubling.
>
> Yeah.  I've occasionally wondered if we should rethink aset.c's
> "only power-of-2 chunk sizes" rule.  Haven't had the bandwidth
> to pursue the idea though.

Me too.  It'd not be trivial to do without also incurring performance
overhead.

A somewhat easier thing we could try is to carve the "rounding up" space
into smaller chunks, similar to what we do for full blocks.  It wouldn't
make sense to do that for the smaller size classes, but above 64-256
bytes or such, I think the wins might be big enough to outweigh the
costs?  Of course that doesn't guarantee that the memory in those
smaller size classes is going to be used...

Greetings,

Andres Freund
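To illustrate the "waste due to the doubling" being discussed: rounding
every request up to the next power of 2 can waste almost half of each
allocation in the worst case.  A simplified sketch (the real aset.c also
has a minimum chunk size and per-chunk header overhead):

```python
def poweroftwo_roundup(n):
    """Round a request up to the next power of 2, as aset.c does for
    chunk sizes (simplified model)."""
    p = 1
    while p < n:
        p <<= 1
    return p

# Worst-case waste approaches 50%: a 513-byte request occupies 1024 bytes,
# wasting 511 bytes -- the "rounding up" space Andres suggests carving up.
n = 513
alloc = poweroftwo_roundup(n)
print(alloc, alloc - n)
# -> 1024 511
```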
Add more sanity checks around callers of changeDependencyFor()
Hi all,

While working on a different patch, I have noted three code paths that
call changeDependencyFor() but don't check that they do not return
errors.  In all three cases (support function, extension/schema and
object/schema), it seems to me that only one dependency update is
expected.

I am adding that to the next CF.  Thoughts or comments about the
attached?

-- 
Michael

From 2b66afedbfd27a80beae3a48e0eb362e21ef4547 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 29 Jun 2023 08:35:23 +0900
Subject: [PATCH] Add checks for values returned by changeDependencyFor()

---
 src/backend/commands/alter.c        |  6 --
 src/backend/commands/extension.c    |  6 --
 src/backend/commands/functioncmds.c | 10 +++---
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index e95dc31bde..455d8dd86a 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -848,8 +848,10 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
 	pfree(replaces);
 
 	/* update dependencies to point to the new schema */
-	changeDependencyFor(classId, objid,
-						NamespaceRelationId, oldNspOid, nspOid);
+	if (changeDependencyFor(classId, objid,
+							NamespaceRelationId, oldNspOid, nspOid) != 1)
+		elog(ERROR, "failed to change schema dependency for object %u",
+			 objid);
 
 	InvokeObjectPostAlterHook(classId, objid, 0);
 
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 0eabe18335..e95642e14f 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -2948,8 +2948,10 @@ AlterExtensionNamespace(const char *extensionName, const char *newschema, Oid *o
 	table_close(extRel, RowExclusiveLock);
 
 	/* update dependencies to point to the new schema */
-	changeDependencyFor(ExtensionRelationId, extensionOid,
-						NamespaceRelationId, oldNspOid, nspOid);
+	if (changeDependencyFor(ExtensionRelationId, extensionOid,
+							NamespaceRelationId, oldNspOid, nspOid) != 1)
+		elog(ERROR, "failed to change schema dependency for extension %s",
+			 NameStr(extForm->extname));
 
 	InvokeObjectPostAlterHook(ExtensionRelationId, extensionOid, 0);
 
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 49c7864c7c..5ec2c002d2 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -1450,9 +1450,13 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
 
 	/* Add or replace dependency on support function */
 	if (OidIsValid(procForm->prosupport))
-		changeDependencyFor(ProcedureRelationId, funcOid,
-							ProcedureRelationId, procForm->prosupport,
-							newsupport);
+	{
+		if (changeDependencyFor(ProcedureRelationId, funcOid,
+								ProcedureRelationId, procForm->prosupport,
+								newsupport) != 1)
+			elog(ERROR, "failed to change support dependency for function %s",
+				 get_func_name(funcOid));
+	}
 	else
 	{
 		ObjectAddress referenced;
-- 
2.40.1
Re: Changing types of block and chunk sizes in memory contexts
Hi,

On 2023-06-28 23:26:00 +0200, Tomas Vondra wrote:
> Yeah. FWIW I was interested what the patch does in practice, so I
> checked what pahole says about impact on struct sizes:
>
> AllocSetContext 224B -> 208B (4 cachelines)
> GenerationContext 152B -> 136B (3 cachelines)
> SlabContext 200B -> 200B (no change, adds 4B hole)
>
> Nothing else changes, AFAICS. I find it hard to believe this could have
> any sort of positive benefit - I doubt we ever have enough contexts for
> this to matter.

I don't think it's that hard to believe. We create a lot of memory contexts that we never or just barely use. Just reducing the number of cachelines touched for that can't hurt.

This doesn't quite get us to reducing the size to a lower number of cachelines, but it's a good step. There are a few other fields that we can get rid of:

- Afaics AllocSet->keeper is unnecessary these days, as it is always allocated together with the context itself. Saves 8 bytes.

- The set of memory context types isn't runtime extensible. We could replace MemoryContextData->methods with a small integer index into mcxt_methods. I think that might actually end up being as cheap as, or even cheaper than, the current approach. Saves 8 bytes. That's sufficient for going to 3 cachelines.

- We could store the power of 2 for initBlockSize, nextBlockSize, maxBlockSize, instead of the "raw" value. That'd make them one byte each. Which also would get rid of the concerns around needing a "mini_size_t" type.

- freeListIndex could be a single byte as well (saving 7 bytes, as right now we lose 4 trailing bytes due to padding).

That would save another 12 bytes, if I calculate correctly. 25% shrinkage together ain't bad.

Greetings,

Andres Freund
vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
While working on some other patches, I found myself wanting to use the following command to vacuum the catalogs in all databases in a cluster:

vacuumdb --all --schema pg_catalog

However, this presently fails with the following error:

cannot vacuum specific schema(s) in all databases

AFAICT there is no technical reason to block this, and the resulting behavior feels intuitive to me, so I wrote 0001 to allow it. 0002 allows specifying tables to process in all databases in clusterdb, and 0003 allows specifying tables, indexes, schemas, or the system catalogs to process in all databases in reindexdb.

I debated also allowing users to specify different types of objects in the same command (e.g., "vacuumdb --schema myschema --table mytable"), but it looked like this would require a more substantial rewrite, and I didn't feel that the behavior was intuitive. For the example I just gave, does the user expect us to process both the "myschema" schema and the "mytable" table, or does the user want us to process the "mytable" table in the "myschema" schema? In vacuumdb, this is already blocked, but reindexdb accepts combinations of tables, schemas, and indexes (yet disallows specifying --system along with other types of objects). Since this is inconsistent with vacuumdb and IMO ambiguous, I've restricted such combinations in 0003.

Thoughts?
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 272375cee9214da54f423241b5bee7b8a1f8faa3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 28 Jun 2023 15:12:18 -0700 Subject: [PATCH v1 1/3] vacuumdb: allow specifying tables or schemas to process in all databases --- src/bin/scripts/t/100_vacuumdb.pl | 26 +- src/bin/scripts/vacuumdb.c| 19 +-- 2 files changed, 18 insertions(+), 27 deletions(-) diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl index eccfcc54a1..43fba676f1 100644 --- a/src/bin/scripts/t/100_vacuumdb.pl +++ b/src/bin/scripts/t/100_vacuumdb.pl @@ -161,7 +161,7 @@ $node->issues_sql_like( 'vacuumdb --schema'); $node->issues_sql_like( [ 'vacuumdb', '--exclude-schema', '"Foo"', 'postgres' ], - qr/(?:(?!VACUUM "Foo".bar).)*/, + qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) "Foo".bar).)*/, 'vacuumdb --exclude-schema'); $node->command_fails_like( [ 'vacuumdb', '-N', 'pg_catalog', '-t', 'pg_class', 'postgres', ], @@ -175,18 +175,18 @@ $node->command_fails_like( [ 'vacuumdb', '-n', 'pg_catalog', '-N', '"Foo"', 'postgres' ], qr/cannot vacuum all tables in schema\(s\) and exclude schema\(s\) at the same time/, 'cannot use options -n and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-N', '"Foo"' ], - qr/cannot exclude specific schema\(s\) in all databases/, - 'cannot use options -a and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-n', '"Foo"' ], - qr/cannot vacuum specific schema\(s\) in all databases/, - 'cannot use options -a and -n at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-t', '"Foo".bar' ], - qr/cannot vacuum specific table\(s\) in all databases/, - 'cannot use options -a and -t at the same time'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-N', 'pg_catalog' ], + qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class).)*/, + 'vacuumdb -a -N'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-n', 'pg_catalog' ], + 
qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -n'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-t', 'pg_class' ], + qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -t'); $node->command_fails_like( [ 'vacuumdb', '-a', '-d', 'postgres' ], qr/cannot vacuum all databases and a specific one at the same time/, diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c index 4b17a07089..d7f4871198 100644 --- a/src/bin/scripts/vacuumdb.c +++ b/src/bin/scripts/vacuumdb.c @@ -72,6 +72,7 @@ static void vacuum_one_database(ConnParams *cparams, static void vacuum_all_databases(ConnParams *cparams, vacuumingOptions *vacopts, bool analyze_in_stages, + SimpleStringList *objects, int concurrentCons, const char *progname, bool echo, bool quiet); @@ -376,6 +377,7 @@ main(int argc, char *argv[]) vacuum_all_databases(, , analyze_in_stages, + , concurrentCons, progname, echo, quiet); } @@ -427,18 +429,6 @@ check_objfilter(void) (objfilter & OBJFILTER_DATABASE)) pg_fatal("cannot vacuum all databases and a specific one at the same time"); - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter & OBJFILTER_TABLE)) - pg_fatal("cannot vacuum specific table(s) in all databases"); - - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter & OBJFILTER_SCHEMA)) - pg_fatal("cannot vacuum specific schema(s) in all databases"); - - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter &
Re: Add GUC to tune glibc's malloc implementation.
Hi, On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote: > I see it as a way to have *some* sort of control over the malloc > implementation we use, instead of tuning our allocations pattern on top of it > while treating it entirely as a black box. As for the tuning, I proposed > earlier to replace this parameter expressed in terms of size as a "profile" > (greedy / conservative) to make it easier to pick a sensible value. I don't think that makes it very usable - we'll still have idle connections use up a lot more memory than now in some cases, and not in others, even though it doesn't help. And it will be very heavily dependent on the OS and glibc version. > Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit : > > > Except if you hinted we should write our own directly instead ? > > > > We e.g. could keep a larger number of memory blocks reserved > > > > ourselves. Possibly by delaying the release of additionally held blocks > > > > until we have been idle for a few seconds or such. > > > > > > I think keeping work_mem around after it has been used a couple times make > > > sense. This is the memory a user is willing to dedicate to operations, > > > after all. > > > > The biggest overhead of returning pages to the kernel is that that triggers > > zeroing the data during the next allocation. Particularly on multi-node > > servers that's surprisingly slow. It's most commonly not the brk() or > > mmap() themselves that are the performance issue. > > > > Indeed, with your benchmark, I see that most of the time, on my dual Xeon > > Gold 5215 workstation, is spent zeroing newly allocated pages during page > > faults. That microarchitecture is worse at this than some others, but it's > > never free (or cache friendly). > > I'm not sure I see the practical difference between those, but that's > interesting. Were you able to reproduce my results ? I see a bit smaller win than what you observed, but it is substantial. 
The runtime difference between the "default" and "cached" malloc is almost entirely in these bits:

cached:
-  8.93% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
   - __memmove_evex_unaligned_erms
      + 6.77% minimal_tuple_from_heap_tuple
      + 2.04% _int_realloc
      + 0.04% AllocSetRealloc
        0.02% 0x56281094806f
        0.02% 0x56281094e0bf

vs uncached:
- 14.52% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
     8.61% asm_exc_page_fault
   - 5.91% __memmove_evex_unaligned_erms
      + 5.78% minimal_tuple_from_heap_tuple
        0.04% 0x560130a2900f
        0.02% 0x560130a20faf
      + 0.02% AllocSetRealloc
      + 0.02% _int_realloc
+  3.81% postgres [kernel.vmlinux] [k] native_irq_return_iret
+  1.88% postgres [kernel.vmlinux] [k] __handle_mm_fault
+  1.76% postgres [kernel.vmlinux] [k] clear_page_erms
+  1.67% postgres [kernel.vmlinux] [k] get_mem_cgroup_from_mm
+  1.42% postgres [kernel.vmlinux] [k] cgroup_rstat_updated
+  1.00% postgres [kernel.vmlinux] [k] get_page_from_freelist
+  0.93% postgres [kernel.vmlinux] [k] mtree_range_walk

None of the latter are visible in a profile in the cached case. I.e. the overhead is encountering page faults and individually allocating the necessary memory in the kernel. This isn't surprising, I just wanted to make sure I entirely understand.

Part of the reason this code is a bit worse is that it's using generation.c, which doesn't cache any part of the context. Not that aset.c's level of caching would help a lot, given that it caches the context itself, not later blocks.

> > FWIW, in my experience trimming the brk()ed region doesn't work reliably
> > enough in real world postgres workloads to be worth relying on (from a
> > memory usage POV). Sooner or later you're going to have longer lived
> > allocations placed that will prevent it from happening.
> I'm not sure I follow: given our workload is clearly split at queries and
> transactions boundaries, releasing memory at that time, I've assumed (and
> noticed in practice, albeit not on a production system) that most memory at
> the top of the heap would be trimmable as we don't keep much in between
> queries / transactions.

That's true for very simple workloads, but once you're beyond that you just need some longer-lived allocation to happen. E.g. some relcache / catcache miss during the query execution, and there's no extant memory in CacheMemoryContext, so a new block is allocated.

Greetings,

Andres Freund
Re: Row pattern recognition
On 6/28/23 14:17, Tatsuo Ishii wrote:
> Small question. This query (with all the defaults made explicit):
>
> SELECT s.company, s.tdate, s.price,
>        FIRST_VALUE(s.tdate) OVER w,
>        LAST_VALUE(s.tdate) OVER w,
>        lowest OVER w
> FROM stock AS s
> WINDOW w AS (
>   PARTITION BY s.company
>   ORDER BY s.tdate
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   EXCLUDE NO OTHERS
>   MEASURES LAST(DOWN) AS lowest
>   AFTER MATCH SKIP PAST LAST ROW
>   INITIAL
>   PATTERN (START DOWN+ UP+)
>   DEFINE
>     START AS TRUE,
>     UP AS price > PREV(price),
>     DOWN AS price < PREV(price)
> );
>
> LAST(DOWN) AS lowest should be "LAST(DOWN.price) AS lowest"?

Yes, it should be. And the tdate='07-08-2023' row in the first resultset should have '07-08-2023' and '07-10-2023' as its 4th and 5th columns. Since my brain is doing the processing instead of postgres, I made some human errors. :-)
--
Vik Fearing
Re: several attstattarget-related improvements
Tomas Vondra writes: > On 6/28/23 16:52, Peter Eisentraut wrote: >> - 0001: Change type of pg_statistic_ext.stxstattarget, to match >> attstattarget. Maybe this should go into PG16, for consistency? > Not sure about 0001 vs PG16, it'd require catversion bump. Yeah, past beta1 I think we should be conservative about bumping catversion. Suggest you hold this for now, and if we find some more-compelling reason for a catversion bump in v16, we can sneak it in at that time. Otherwise, I won't cry if it waits for v17. regards, tom lane
Re: Changing types of block and chunk sizes in memory contexts
Tomas Vondra writes: > ... 4B is tiny compared to what we waste due to the doubling. Yeah. I've occasionally wondered if we should rethink aset.c's "only power-of-2 chunk sizes" rule. Haven't had the bandwidth to pursue the idea though. regards, tom lane
Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
On 6/26/23 20:15, Alena Rybakina wrote: > Hi, all! > > On 24.06.2023 14:23, Tomas Vondra wrote: >> On 6/24/23 02:08, Tom Lane wrote: >>> Tomas Vondra writes: The problem is that the selectivity for "IS NULL" is estimated using the table-level statistics. But the LEFT JOIN entirely breaks the idea that the null_frac has anything to do with NULLs in the join result. >>> Right. >>> I wonder how to improve this, say by adjusting the IS NULL selectivity when we know to operate on the outer side of the join. We're able to do this for antijoins, so maybe we could do that here, somehow? >>> This mess is part of the long-term plan around the work I've been doing >>> on outer-join-aware Vars. We now have infrastructure that can let >>> the estimator routines see "oh, this Var isn't directly from a scan >>> of its table, it's been passed through a potentially-nulling outer >>> join --- and I can see which one". I don't have more than vague ideas >>> about what happens next, but that is clearly an essential step on the >>> road to doing better. >>> >> I was wondering if that work on outer-join-aware Vars could help with >> this, but I wasn't following it very closely. I agree the ability to >> check if the Var could be NULL due to an outer join seems useful, as it >> says whether applying raw attribute statistics makes sense or not. >> >> I was thinking about what to do for the case when that's not possible, >> i.e. when the Var refers to nullable side of the join. Knowing that this >> is happening is clearly not enough - we need to know how many new NULLs >> are "injected" into the join result, and "communicate" that to the >> estimation routines. 
>> >> Attached is a very ugly experimental patch doing that, and with it the >> estimate changes to this: >> >> QUERY PLAN >> -- >>Hash Left Join (cost=3.25..18179.88 rows=00 width=16) >>(actual time=0.528..596.151 rows=00 loops=1) >> Hash Cond: (large.id = small.id) >> Filter: ((small.id IS NULL) OR >> (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))) >> Rows Removed by Filter: 100 >> -> Seq Scan on large (cost=0.00..14425.00 rows=100 width=8) >> (actual time=0.069..176.138 rows=100 loops=1) >> -> Hash (cost=2.00..2.00 rows=100 width=8) >>(actual time=0.371..0.373 rows=100 loops=1) >>Buckets: 1024 Batches: 1 Memory Usage: 12kB >>-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) >> (actual time=0.032..0.146 rows=100 loops=1) >>Planning Time: 3.845 ms >>Execution Time: 712.405 ms >> (10 rows) >> >> Seems nice, but. The patch is pretty ugly, I don't claim it works for >> other queries or that this is exactly what we should do. It calculates >> "unmatched frequency" next to eqjoinsel_inner, stashes that info into >> sjinfo and the estimator (nulltestsel) then uses that to adjust the >> nullfrac it gets from the statistics. >> >> The good thing is this helps even for IS NULL checks on non-join-key >> columns (where we don't switch to an antijoin), but there's a couple >> things that I dislike ... >> >> 1) It's not restricted to outer joins or anything like that (this is >> mostly just my laziness / interest in one particular query, but also >> something the outer-join-aware patch might help with). >> >> 2) We probably don't want to pass this kind of information through >> sjinfo. That was the simplest thing for an experimental patch, but I >> suspect it's not the only piece of information we may need to pass to >> the lower levels of estimation code. >> >> 3) I kinda doubt we actually want to move this responsibility (to >> consider fraction of unmatched rows) to the low-level estimation >> routines (e.g. nulltestsel and various others). 
AFAICS this just >> "introduces NULLs" into the relation, so maybe we could "adjust" the >> attribute statistics (in examine_variable?) by inflating null_frac and >> modifying the other frequencies in MCV/histogram. >> >> 4) But I'm not sure we actually want to do that in these low-level >> selectivity functions. The outer join essentially produces output with >> two subsets - one with matches on the outer side, one without them. But >> the side without matches has NULLs in all columns. In a way, we know >> exactly how are these columns correlated - if we do the usual estimation >> (even with the null_frac adjusted), we just throw this information away. >> And when there's a lot of rows without a match, that seems bad. >> >> So maybe we should split the join estimate into two parts, one for each >> subset of the join result. One for the rows with a match (and then we >> can just do what we do now, with the attribute stats we already have). >> And one for the "unmatched part" where we know the values on the
Re: POC, WIP: OR-clause support for indexes
On 6/27/23 20:55, Ranier Vilela wrote:
> Hi,
>
>> I finished writing the code patch for transformation "Or" expressions to
>> "Any" expressions. I didn't see any problems in regression tests, even
>> when I changed the constant at which the minimum or expression is
>> replaced by any at 0. I ran my patch on sqlancer and so far the code has
>> never fallen.
> Thanks for working on this.
>
> I took the liberty of making some modifications to the patch.
> I didn't compile or test it.
> Please feel free to use them.

I don't want to be rude, but this doesn't seem very helpful.

- You made some changes, but you don't even attempt to explain what you changed or why you changed it.

- You haven't even tried to compile the code, nor tested it. If it happens to compile, how could others even know it actually behaves the way you wanted?

- You responded in a way that breaks the original thread, so it's not clear which message you're responding to.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: several attstattarget-related improvements
On 6/28/23 16:52, Peter Eisentraut wrote: > Here are a few patches related to attstattarget: > > - 0001: Change type of pg_statistic_ext.stxstattarget, to match > attstattarget. Maybe this should go into PG16, for consistency? > > - 0002: Add macro for maximum statistics target, instead of hardcoding > it everywhere. > > - 0003: Take pg_attribute out of VacAttrStats. This simplifies some > code, especially for extended statistics, which had to have weird > workarounds. +1 to all three patches Not sure about 0001 vs PG16, it'd require catversion bump. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Changing types of block and chunk sizes in memory contexts
On 6/28/23 12:59, Tom Lane wrote: > David Rowley writes: >> Perhaps it's ok to leave the context creation functions with Size >> typed parameters and then just Assert the passed-in sizes are not >> larger than 1GB within the context creation function. > > Yes, I'm strongly opposed to not using Size/size_t in the mmgr APIs. > If we go that road, we're going to have a problem when someone > inevitably wants to pass a larger-than-GB value for some context > type. +1 > What happens in semi-private structs is a different matter, although > I'm a little dubious that shaving a couple of bytes from context > headers is a useful activity. The self-documentation argument > still has some force there, so I agree with Peter that some positive > benefit has to be shown. > Yeah. FWIW I was interested what the patch does in practice, so I checked what pahole says about impact on struct sizes: AllocSetContext 224B -> 208B (4 cachelines) GenerationContext 152B -> 136B (3 cachelines) SlabContext 200B -> 200B (no change, adds 4B hole) Nothing else changes, AFAICS. I find it hard to believe this could have any sort of positive benefit - I doubt we ever have enough contexts for this to matter. When I first saw the patch I was thinking it's probably changing how we store the per-chunk requested_size. Maybe that'd make a difference, although 4B is tiny compared to what we waste due to the doubling. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Document efficient self-joins / UPDATE LIMIT techniques.
This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctids in self joins
* A practical usage of SKIP LOCKED

The reasoning for this is a bit long, but if you're interested, keep reading.

In the past, there has been a desire to see a LIMIT clause of some sort on UPDATE and DELETE statements. The reason for this usually stems from having a large archive or backfill operation that if done in one single transaction would overwhelm normal operations, either by the transaction failing outright, locking too many rows, flooding the WAL causing replica lag, or starving other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it isn't in the SQL Standard, and UPDATE/DELETE operations are unordered operations, so updating 1000 rows randomly isn't a great idea. The people wanting the LIMIT clause were undeterred by this, because they know that they intend to keep issuing updates until they run out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
    SELECT t.id
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed );

This wouldn't interfere with any other updates, so I felt good about it running when the system was not-too-busy. I'd then write a script to run that in a loop, with sleeps to allow the replicas a chance to catch their breath. Then, when the rowcount finally dipped below 1000, I'd issue the final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that there are at most 1000 rows lingering out there, so waiting on locks for those was no big deal. But a query like this involves one scan along one index (or worse, a seq scan) followed by another scan, either index or seq.
Either way, we're taking up a lot of cache with rows we don't even care about. Then in v12, the query planner got hip to bitmap tidscans, allowing for this optimization: WITH doomed AS ( SELECT t.ctid AS tid FROM my_table AS t WHERE t.expiration_date < :'some_archive_date' FOR UPDATE SKIP LOCKED LIMIT 1000 ) DELETE FROM my_table USING doomed WHERE my_table.ctid = doomed.tid; And this works pretty well, especially if you set up a partial index to meet the quals in the CTE. But we don't document this anywhere, and until UPDATE and DELETE get a LIMIT clause, we probably should document this workaround. From 209fd8abe50603e85ca0cc07aecd72b87889e757 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Tue, 13 Jun 2023 13:00:40 -0400 Subject: [PATCH v1] Add examples that highlight the usage of the system column ctid in self-joins. Currently we do not show any examples of using ctid anywhere, nor do we address the often-requested but problematic use case of having a LIMIT clause on UPDATE and DELETE statements. These examples are a subtle way of addressing both those concerns. --- doc/src/sgml/ref/delete.sgml | 24 doc/src/sgml/ref/select.sgml | 21 + doc/src/sgml/ref/update.sgml | 23 +++ 3 files changed, 68 insertions(+) diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml index 1b81b4e7d7..cca9138843 100644 --- a/doc/src/sgml/ref/delete.sgml +++ b/doc/src/sgml/ref/delete.sgml @@ -234,6 +234,30 @@ DELETE FROM films In some cases the join style is easier to write or faster to execute than the sub-select style. + + In situations where a single operation would consume too many resources, it + may be desirable to break up a large DELETE into multiple + separate commands. 
The SQL + standard does not define a LIMIT clause for + DELETE operations, but it is possible to get the equivalent + functionality through the USING clause to a + Common Table Expression which identifies + a subset of rows to be deleted, locks those rows, and returns their system + column ctid values: + +WITH delete_batch AS ( + SELECT l.ctid + FROM user_logs AS l + WHERE l.status = 'archived' + ORDER BY l.creation_date + LIMIT 1 + FOR UPDATE ) +DELETE FROM user_logs AS ul +USING delete_batch AS del +WHERE ul.ctid = del.ctid + + This allows for flexible search criteria within the CTE and an efficient self-join. + diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 0ee0cc7e64..9d7c3d5c41 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -1676,6 +1676,27 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5; condition is not textually within the sub-query. + +In cases where a DML operation involving many rows +must be performed, and that table experiences numerous other simultaneous +DML operations, a FOR UPDATE clause +used in conjunction with SKIP
Re: Detecting use-after-free bugs using gcc's malloc() attribute
Hi, On 2023-06-28 10:40:22 +0200, Peter Eisentraut wrote: > On 26.06.23 21:54, Andres Freund wrote: > > For something like pg_list.h the malloc(free) attribute is a bit awkward to > > use, because one a) needs to list ~30 functions that can free a list and b) > > the referenced functions need to be declared. > > Hmm. Saying list_concat() "deallocates" a list is mighty confusing because > 1) it doesn't, and 2) it might actually allocate a new list. list_concat() basically behaves like realloc(), except that the "pointer is still valid" case is much more common. And the way that's modelled in the annotations is to say a function frees and allocates. Note that the free attribute references the first element for list_concat(), not the second. > So while you get the useful behavior of "you probably didn't mean to use > this variable again after passing it into list_concat()", if some other tool > actually took these allocate/deallocate decorations at face value and did a > memory leak analysis with them, they'd get completely bogus results. How would the annotations possibly lead to a bogus result? I see neither how it could lead to false negatives nor false positives? The gcc attributes are explicitly intended to track not just plain memory allocations, the example in the docs https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes is to add them for fopen() etc. So I don't think it's likely that external tools will interpret this is a much more stringent way. > > I also added such attributes to bitmapset.h and palloc() et al, but that > > didn't find existing bugs. It does find use-after-free instances if I add > > some, similarly it does find cases of mismatching palloc with free etc. > > This seems more straightforward. Even if it didn't find any bugs, I'd > imagine it would be useful during development. Agreed. Given our testing regimen (valgrind etc), I'd expect to find many such bugs before long in the tree anyway. 
But it's much nicer to catch them before they get that far. And to find paths that aren't covered by tests.

> > Do others think this would be useful enough to be worth polishing? And do you
> > agree the warnings above are bugs?
>
> I actually just played with this the other day, because I can never remember
> termPQExpBuffer() vs. destroyPQExpBuffer().

That's a pretty nasty one :(

> I couldn't quite make it work for that, but I found the feature overall
> useful, so I'd welcome support for it.

Yea, I don't think the attributes can comfortably handle initPQExpBuffer() style allocation. It's somewhat possible by moving the allocation to an inline function, and then making the thing that's allocated ->data. But it ends up pretty messy, particularly because we need ABI stability for pqexpbuffer.h.

But createPQExpBuffer() can be dealt with reasonably. Doing so points out:

[51/354 42 14%] Compiling C object src/bin/initdb/initdb.p/initdb.c.o
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c: In function ‘replace_guc_value’:
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c:566:9: warning: ‘free’ called on pointer returned from a mismatched allocation function [-Wmismatched-dealloc]
  566 |         free(newline);  /* but don't free newline->data */
      |         ^
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c:470:31: note: returned from ‘createPQExpBuffer’
  470 |         PQExpBuffer newline = createPQExpBuffer();
      |                               ^~~

which is intentional, but ... not pretty, and could very well be a bug in other cases. If we want to do stuff like that, we'd probably be better off having a dedicated version of destroyPQExpBuffer(). Although here it looks like the code should just use an on-stack PQExpBuffer.

Greetings,

Andres Freund
Re: Incremental View Maintenance, take 2
On Wed, Jun 28, 2023 at 4:06 PM Yugo NAGATA wrote:
>
> On Wed, 28 Jun 2023 00:01:02 +0800
> jian he wrote:
>
> > On Thu, Jun 1, 2023 at 2:47 AM Yugo NAGATA wrote:
> > >
> > > On Thu, 1 Jun 2023 23:59:09 +0900
> > > Yugo NAGATA wrote:
> > >
> > > > Hello hackers,
> > > >
> > > > Here's a rebased version of the patch-set adding Incremental View
> > > > Maintenance support for PostgreSQL. That was discussed in [1].
> > > >
> > > > [1] https://www.postgresql.org/message-id/flat/20181227215726.4d166b4874f8983a641123f5%40sraoss.co.jp
> > >
> > > ---
> > > * Overview
> > >
> > > Incremental View Maintenance (IVM) is a way to make materialized views
> > > up-to-date by computing only incremental changes and applying them on
> > > views. IVM is more efficient than REFRESH MATERIALIZED VIEW when
> > > only small parts of the view are changed.
> > >
> > > ** Feature
> > >
> > > The attached patchset provides a feature that allows materialized views
> > > to be updated automatically and incrementally just after an underlying
> > > table is modified.
> > >
> > > You can create an incrementally maintainable materialized view (IMMV)
> > > by using CREATE INCREMENTAL MATERIALIZED VIEW command.
> > >
> > > The followings are supported in view definition queries:
> > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins)
> > > - some built-in aggregate functions (count, sum, avg, min, max)
> > > - GROUP BY clause
> > > - DISTINCT clause
> > >
> > > Views can contain multiple tuples with the same content (duplicate tuples).
> > > > > > ** Restriction > > > > > > The following are not supported in a view definition: > > > - Outer joins > > > - Aggregates other than the above, window functions, HAVING > > > - Sub-queries, CTEs > > > - Set operations (UNION, INTERSECT, EXCEPT) > > > - DISTINCT ON, ORDER BY, LIMIT, OFFSET > > > > > > Also, a view definition query cannot contain other views, materialized > > > views, > > > foreign tables, partitioned tables, partitions, VALUES, non-immutable > > > functions, > > > system columns, or expressions that contain aggregates. > > > > > > --- > > > * Design > > > > > > An IMMV is maintained using statement-level AFTER triggers. > > > When an IMMV is created, triggers are automatically created on all base > > > tables contained in the view definition query. > > > > > > When a table is modified, changes that occurred in the table are extracted > > > as transition tables in the AFTER triggers. Then, changes that will occur > > > in > > > the view are calculated by a rewritten view query in which the modified > > > table > > > is replaced with the transition table. > > > > > > For example, if the view is defined as "SELECT * FROM R, S", and tuples > > > inserted > > > into R are stored in a transition table dR, the tuples that will be > > > inserted into > > > the view are calculated as the result of "SELECT * FROM dR, S". > > > > > > ** Multiple Tables Modification > > > > > > Multiple tables can be modified in a statement when using triggers, > > > foreign key > > > constraints, or modifying CTEs. When multiple tables are modified, we need > > > the state of tables before the modification.
> > > > > > For example, when some tuples, dR and dS, are inserted into R and S > > > respectively, > > > the tuples that will be inserted into the view are calculated by the > > > following > > > two queries: > > > > > > "SELECT * FROM dR, S_pre" > > > "SELECT * FROM R, dS" > > > > > > where S_pre is the table before the modification, and R is the current state > > > of > > > the table, that is, after the modification. This pre-update state of the table > > > is calculated by filtering inserted tuples and appending deleted tuples. > > > The subquery that represents the pre-update state is generated in > > > get_prestate_rte(). > > > Specifically, the inserted tuples are filtered by calling > > > IVM_visible_in_prestate() > > > in the WHERE clause. This function checks the visibility of tuples by using > > > the snapshot taken before the table modification. The deleted tuples are > > > contained > > > in the old transition table, and this table is appended using UNION ALL. > > > > > > Transition tables for each modification are collected in each AFTER > > > trigger > > > function call. Then, the view maintenance is performed in the last call of > > > the trigger. > > > > > > In the original PostgreSQL, tuplestores of transition tables are freed at > > > the > > > end of each nested query. However, their lifespan needs to be prolonged to > > > the end of the outermost query in order to maintain the view in the last > > > AFTER > > > trigger. For this purpose, SetTransitionTablePreserved is added in > > > trigger.c. > > > > > > ** Duplicate Tuples > > > > > > When calculating changes that will occur in the view (= delta
Re: Bytea PL/Perl transform
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?= writes: > Andrew Dunstan writes: >> On 2023-06-22 Th 16:56, Greg Sabino Mullane wrote: >>> * Do all of these transforms need to be their own contrib modules? So >>> much duplicated code across contrib/*_plperl already (and *plpython >>> too for that matter) ... >> Yeah, that's a bit of a mess. Not sure what we can do about it now. > Would it be possible to move the functions and other objects to a new > combined extension, and make the existing ones depend on that? Perhaps another way could be to accept that the packaging is what it is, but look for ways to share the repetitive source code. The .so's wouldn't get any smaller, but they're not that big anyway. regards, tom lane
Re: Assistance Needed: Issue with pg_upgrade and --link option
On 28.06.23 12:46, Laurenz Albe wrote: On Wed, 2023-06-28 at 15:40 +0530, Pradeep Kumar wrote: I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Please provide some numbers, ideally du -sk du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg That looks fine. The files exist only once, and the 41MB that only exist in the new data directory are catalog data and other stuff that is different on the new cluster. Interesting, so it actually does count files with multiple hardlinks only once.
tablecmds.c/MergeAttributes() cleanup
The MergeAttributes() and related code in and around tablecmds.c is huge and ancient, with many things bolted on over time, and difficult to deal with. I took some time to make careful incremental updates and refactorings to make the code easier to follow, more compact, and more modern in appearance. I also found several pieces of obsolete code along the way. This resulted in the attached long patch series. Each patch tries to make a single change and can be considered incrementally. At the end, the code is shorter, better factored, and I hope easier to understand. There shouldn't be any change in behavior.From 60a671aeb03293bdec65fd86f2a393c3aced6eb9 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 17:10:25 +0200 Subject: [PATCH 01/17] Remove obsolete comment about OID support --- src/backend/catalog/heap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 2a0d82aedd..9196dcd39f 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -843,9 +843,9 @@ AddNewAttributeTuples(Oid new_rel_oid, } /* -* Next we add the system attributes. Skip OID if rel has no OIDs. Skip -* all for a view or type relation. We don't bother with making datatype -* dependencies here, since presumably all these types are pinned. +* Next we add the system attributes. Skip all for a view or type +* relation. We don't bother with making datatype dependencies here, +* since presumably all these types are pinned. 
*/ if (relkind != RELKIND_VIEW && relkind != RELKIND_COMPOSITE_TYPE) { base-commit: b381d9637030c163c3b1f8a9d3de51dfc1b4ee58 -- 2.41.0 From e12a69675919fb026c008e37649518f3d52e2a90 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 22 Jun 2023 12:05:06 +0200 Subject: [PATCH 02/17] Remove ancient special case code for adding oid columns The special handling of negative attribute numbers in ATExecAddColumn() was introduced to support SET WITH OIDS (commit 6d1e361852). But that feature doesn't exist anymore, so we can revert to the previous, simpler version. In passing, also remove an obsolete comment about OID support. --- src/backend/commands/tablecmds.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index d985278ac6..a5493705aa 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -2449,8 +2449,7 @@ MergeAttributes(List *schema, List *supers, char relpersistence, /* * Scan the parents left-to-right, and merge their attributes to form a -* list of inherited attributes (inhSchema). Also check to see if we need -* to inherit an OID column. +* list of inherited attributes (inhSchema). */ child_attno = 0; foreach(entry, supers) @@ -6944,7 +6943,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, attribute.attrelid = myrelid; namestrcpy(&(attribute.attname), colDef->colname); attribute.atttypid = typeOid; - attribute.attstattarget = (newattnum > 0) ? -1 : 0; + attribute.attstattarget = -1; attribute.attlen = tform->typlen; attribute.attnum = newattnum; if (list_length(colDef->typeName->arrayBounds) > PG_INT16_MAX) @@ -7067,7 +7066,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, * is certainly not going to touch them. System attributes don't have * interesting defaults, either. 
*/ - if (RELKIND_HAS_STORAGE(relkind) && attribute.attnum > 0) + if (RELKIND_HAS_STORAGE(relkind)) { /* * For an identity column, we can't use build_column_default(), -- 2.41.0 From 58af867b3e3990e787a33c5d5023753e623dffe0 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 14:43:55 +0200 Subject: [PATCH 03/17] Remove ancient special case code for dropping oid columns The special handling of negative attribute numbers in RemoveAttributeById() was introduced to support SET WITHOUT OIDS (commit 24614a9880). But that feature doesn't exist anymore, so we can revert to the previous, simpler version. --- src/backend/catalog/heap.c | 99 +- 1 file changed, 43 insertions(+), 56 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 9196dcd39f..4c30c7d461 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1666,68 +1666,56 @@ RemoveAttributeById(Oid relid, AttrNumber attnum) attnum, relid); attStruct = (Form_pg_attribute) GETSTRUCT(tuple); - if (attnum < 0) - { - /* System attribute (probably OID) ... just delete the row */ - -
Re: eqjoinsel_semi still sucks ...
Alena Rybakina writes: > On 23.06.2023 14:28, Andrey Lepikhov wrote: >> On 2/5/2012 20:34, Tom Lane wrote: >>> On reflection I think that the idea of clamping ndistinct beforehand is >>> just wrong, and what we ought to do instead is apply a multiplier to the >>> selectivity estimate afterwards. >> I got stuck in some cases where (due to a tree of filters) the planner >> underestimates the JOIN just because the ndistinct conveys a huge >> number to the selectivity estimation formula. However, the estimation >> of both input relations is made correctly and is limited. >> I've tried to understand the logic through commits 0d3b231eebf, >> 97930cf578e and 7f3eba30c9d. But it is still not clear. >> So, why is the idea of clamping ndistinct terrible in general? Could >> you explain your reasons a bit more? > Hi, I also have an interest in understanding the same issue, and I dug > into the above commits and the topic. I hope this letter will help to > sort out this issue. Note that nothing's actually been done about the idea I expressed at the start of this thread.
The behavior is better now: if you try the example in v12 or later you get regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0); QUERY PLAN - Hash Anti Join (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1) Hash Cond: (a.thousand = b.unique2) -> Seq Scan on tenk1 a (cost=0.00..445.00 rows=1 width=244) (actual time=0.003..0.502 rows=1 loops=1) -> Hash (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1) Buckets: 8192 Batches: 1 Memory Usage: 240kB -> Seq Scan on tenk1 b (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1) Filter: (two = 0) Rows Removed by Filter: 5000 Planning Time: 0.760 ms Execution Time: 4.087 ms (10 rows) I believe this improvement just stems from commit a314c3407 ("Clamp semijoin selectivity to be not more than inner-join selectivity"), and so this example looks good only because the actual semijoin output size happens to be the same as the inner-join output size. With a slight adjustment, that's no longer true and the estimate still sucks: regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0; QUERY PLAN --- Hash Join (cost=532.50..1127.50 rows=1 width=488) (actual time=2.000..5.742 rows=1 loops=1) ... regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0); QUERY PLAN - Hash Semi Join (cost=532.50..1127.50 rows=1 width=244) (actual time=1.529..3.474 rows=5000 loops=1) ... regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0); QUERY PLAN - Hash Anti Join (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1) ... So we still have an issue to fix. I've not had time to pursue the idea I expressed at the start of the thread. 
Also, it's a bit scary to change this logic with only a few examples to look at. If you could reduce the problem cases you're looking at to something sharable, that could help move things along. regards, tom lane
several attstattarget-related improvements
Here are a few patches related to attstattarget: - 0001: Change type of pg_statistic_ext.stxstattarget, to match attstattarget. Maybe this should go into PG16, for consistency? - 0002: Add macro for maximum statistics target, instead of hardcoding it everywhere. - 0003: Take pg_attribute out of VacAttrStats. This simplifies some code, especially for extended statistics, which had to have weird workarounds.From f01d4c9aba44cc5b9028200b21a6d1df9770c5a2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 16:38:00 +0200 Subject: [PATCH 1/3] Change type of pg_statistic_ext.stxstattarget Change from int32 to int16, to match attstattarget (changed in 90189eefc1). --- doc/src/sgml/catalogs.sgml | 2 +- src/backend/commands/statscmds.c | 4 ++-- src/include/catalog/pg_statistic_ext.h | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index ed32ca0349..852cb30ae1 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7622,7 +7622,7 @@ pg_statistic_ext Columns - stxstattarget int4 + stxstattarget int2 stxstattarget controls the level of detail diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index 26ebd0819d..c2dab20bdc 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -498,7 +498,7 @@ CreateStatistics(CreateStatsStmt *stmt) values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid); values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum(); values[Anum_pg_statistic_ext_stxnamespace - 1] = ObjectIdGetDatum(namespaceId); - values[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(-1); + values[Anum_pg_statistic_ext_stxstattarget - 1] = Int16GetDatum(-1); values[Anum_pg_statistic_ext_stxowner - 1] = ObjectIdGetDatum(stxowner); values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys); values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind); @@ -676,7 +676,7 
@@ AlterStatistics(AlterStatsStmt *stmt) /* replace the stxstattarget column */ repl_repl[Anum_pg_statistic_ext_stxstattarget - 1] = true; - repl_val[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(newtarget); + repl_val[Anum_pg_statistic_ext_stxstattarget - 1] = Int16GetDatum(newtarget); newtup = heap_modify_tuple(oldtup, RelationGetDescr(rel), repl_val, repl_null, repl_repl); diff --git a/src/include/catalog/pg_statistic_ext.h b/src/include/catalog/pg_statistic_ext.h index 53eec9025a..1e64aa8f16 100644 --- a/src/include/catalog/pg_statistic_ext.h +++ b/src/include/catalog/pg_statistic_ext.h @@ -43,7 +43,7 @@ CATALOG(pg_statistic_ext,3381,StatisticExtRelationId) * object's namespace */ Oid stxowner BKI_LOOKUP(pg_authid); /* statistics object's owner */ - int32 stxstattarget BKI_DEFAULT(-1); /* statistics target */ + int16 stxstattarget BKI_DEFAULT(-1); /* statistics target */ /* * variable-length fields start here, but we allow direct access to -- 2.41.0 From a473c258577f9f9b28435feef27e8cbc25721d5c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 14:09:39 +0200 Subject: [PATCH 2/3] Add macro for maximum statistics target The number of places where 1 was hardcoded had grown a bit beyond the comfort level. Introduce a macro MAX_STATISTICS_TARGET instead. 
--- src/backend/commands/statscmds.c| 4 ++-- src/backend/commands/tablecmds.c| 5 +++-- src/backend/statistics/extended_stats.c | 2 +- src/backend/utils/misc/guc_tables.c | 2 +- src/include/catalog/pg_attribute.h | 2 +- src/include/commands/vacuum.h | 7 +++ src/include/statistics/statistics.h | 4 ++-- 7 files changed, 17 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index c2dab20bdc..36bc8c33ba 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -619,9 +619,9 @@ AlterStatistics(AlterStatsStmt *stmt) errmsg("statistics target %d is too low", newtarget))); } - else if (newtarget > 1) + else if (newtarget > MAX_STATISTICS_TARGET) { - newtarget = 1; + newtarget = MAX_STATISTICS_TARGET; ereport(WARNING, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("lowering statistics target to %d", diff --git a/src/backend/commands/tablecmds.c
Re: Assert !bms_overlap(joinrel->relids, required_outer)
Richard Guo writes: > On Wed, Jun 28, 2023 at 6:28 AM Tom Lane wrote: >> For a real fix, I'm inclined to extend the loop that calculates >> param_source_rels (in add_paths_to_joinrel) so that it also tracks >> a set of incompatible relids that *must not* be present in the >> parameterization of a proposed path. This would basically include >> OJ relids of OJs that partially overlap the target joinrel; maybe >> we should also include the min RHS of such OJs. Then we could >> check that in try_nestloop_path. I've not tried to code this yet. > I went ahead and drafted a patch based on this idea. Hmm. This patch is the opposite of what I'd been imagining, because I was thinking we needed to add OJs to param_incompatible_relids if they were *not* already in the join, rather than if they were. However, I tried it like that and while it did stop the assertion failure, it also broke a bunch of other test cases that no longer found the parameterized-nestloop plans they were supposed to find. So clearly I just didn't have my head screwed on in the correct direction yesterday. However, given that what we need is to exclude parameterization that depends on the currently-formed OJ, it seems to me we can do it more simply and without any new JoinPathExtraData field, as attached. What do you think? > * I think we need to check the incompatible relids also in > try_hashjoin_path and try_mergejoin_path besides try_nestloop_path. I think this isn't necessary, at least in my formulation. Those cases will go through calc_non_nestloop_required_outer which has /* neither path can require rels from the other */ Assert(!bms_overlap(outer_paramrels, inner_path->parent->relids)); Assert(!bms_overlap(inner_paramrels, outer_path->parent->relids)); In order to have a dependency on an OJ, a path would have to have a dependency on at least one of the OJ's base relations too, so I think these assertions show that the case won't arise. 
(Of course, if someone can trip one of these assertions, I'm wrong.) regards, tom lane diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index c2f211a60d..4b6ed6e312 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -698,6 +698,17 @@ try_nestloop_path(PlannerInfo *root, Relids inner_paramrels = PATH_REQ_OUTER(inner_path); Relids outer_paramrels = PATH_REQ_OUTER(outer_path); + /* + * If we are forming an outer join at this join, it's nonsensical to use + * an input path that uses the outer join as part of its parameterization. + * (This can happen despite our join order restrictions, since those apply + * to what is in an input relation not what its parameters are.) + */ + if (extra->sjinfo && extra->sjinfo->ojrelid != 0 && + (bms_is_member(extra->sjinfo->ojrelid, outer_paramrels) || + bms_is_member(extra->sjinfo->ojrelid, inner_paramrels))) + return; + /* * Paths are parameterized by top-level parents, so run parameterization * tests on the parent relids. 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index 6917faec14..12b828fae3 100644 --- a/src/test/regress/expected/join.out +++ b/src/test/regress/expected/join.out @@ -5063,6 +5063,37 @@ select 1 from -- (0 rows) +explain (costs off) +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on t4.unique1 = 1 +on t4.unique1 = 1 on false + where t3.unique1 = coalesce(t5.unique1,1)) as s2 + on true; +QUERY PLAN +-- + Result + One-Time Filter: false +(2 rows) + +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on t4.unique1 = 1 +on t4.unique1 = 1 on false + where t3.unique1 = coalesce(t5.unique1,1)) as s2 + on true; + ?column? +-- +(0 rows) + -- -- check a case in which a PlaceHolderVar forces join order -- diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql index 55080bec9a..38899ed3b9 100644 --- a/src/test/regress/sql/join.sql +++ b/src/test/regress/sql/join.sql @@ -1751,6 +1751,29 @@ select 1 from on false, lateral (select i4.f1, ss1.n from int8_tbl as i8 limit 1) as ss3; +explain (costs off) +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
Hi Vignesh, Thanks for working on this. On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote: > > Here is a patch having the fix for the same. I have not added any > tests as the existing tests cover this scenario. The same issue is > present in back branches too. Interesting, we have a test for this scenario and it accepts erroneous output :). > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch > can be applied on master, PG15 and PG14, > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch > patch can be applied on PG13, PG12 and PG11. > Thoughts? I noticed this when looking at Tomas's patches for logical decoding of sequences. The code block you have added is repeated in pg_decode_change() and pg_decode_truncate(). It might be better to push the conditions into pg_output_begin() itself so that any future call site of pg_output_begin() automatically takes care of these conditions. Otherwise the patches look good to me. -- Best Wishes, Ashutosh Bapat
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Hi! On 15.06.2023 11:30, Andrey Lepikhov wrote: Hi, all. Some of my clients use JOINs with three or four clauses. Quite frequently, I see complaints about an unreasonable switch of the JOIN algorithm to Merge Join instead of Hash Join. Quick research has shown one weak place - the estimation of the average bucket size in final_cost_hashjoin (see q2.sql in the attachment) uses a very conservative strategy. Unlike the estimation of groups, here we use the smallest ndistinct value across all clauses instead of multiplying them (or trying to make a multivariate analysis). It works fine for the case of one clause. But if we have many clauses, and if each has a high ndistinct value, we will overestimate the average size of a bucket and, as a result, prefer to use Merge Join. As the example in the attachment shows, this leads to a worse plan than possible, sometimes drastically worse. I assume this is done out of fear of functional dependencies between hash clause components. But in my opinion, here we should go the same way as the estimation of groups. The attached patch shows a sketch of the solution. This problem is very important. Honestly, I'm still learning your code and looking for cases where your patch changes things for the worse or for the better. But I have already found something that seemed interesting to me. I have found several other interesting cases where your patch can solve some problem and choose a more correct plan, with a focus on memory consumption. To make it easier to evaluate, I added a hook to your patch that makes it easy to switch between your way and the original way of estimating the size of buckets (diff_estimate.diff). Here are other cases where your fix improves the query plan.
First of all, I changed the way the tables are created, to look at the behavior of the query plan in terms of planning and execution time: DROP TABLE IF EXISTS a,b CASCADE; CREATE TABLE a AS SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z FROM generate_series(1,1e5) AS gs; CREATE TABLE b AS SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload FROM generate_series(1,1e5) AS gs; ANALYZE a,b; SET enable_cost_size = 'on'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; SET enable_cost_size = 'off'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; QUERY PLAN --- Hash Join (actual time=200.872..200.879 rows=0 loops=1) Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z)) -> Seq Scan on b (actual time=0.029..15.946 rows=10 loops=1) -> Hash (actual time=97.645..97.649 rows=10 loops=1) Buckets: 131072 Batches: 1 Memory Usage: 5612kB -> Seq Scan on a (actual time=0.024..17.153 rows=10 loops=1) Planning Time: 2.910 ms Execution Time: 201.949 ms (8 rows) SET QUERY PLAN --- Merge Join (actual time=687.415..687.416 rows=0 loops=1) Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z)) -> Sort (actual time=462.022..536.716 rows=10 loops=1) Sort Key: b.y, b.x, b.z Sort Method: external merge Disk: 3328kB -> Seq Scan on b (actual time=0.017..12.326 rows=10 loops=1) -> Sort (actual time=111.295..113.196 rows=16001 loops=1) Sort Key: a.y, a.x, a.z Sort Method: external sort Disk: 2840kB -> Seq Scan on a (actual time=0.020..10.129 rows=10 loops=1) Planning Time: 0.752 ms Execution Time: 688.829 ms (12 rows) Secondly, I found another case that is not about the planner preferring merge join over hash join, but where we can see that the plan has become better due to lower memory consumption and shorter planning time.
Here, with the same query, the planning time was reduced by a factor of 5, and the number of buckets by a factor of 128, so memory consumption also decreased: DROP TABLE IF EXISTS a,b CASCADE; CREATE TABLE a AS SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z FROM generate_series(1,600) AS gs; CREATE TABLE b AS SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload FROM generate_series(1,1e5) AS gs; ANALYZE a,b; SET enable_cost_size = 'on'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; SET enable_cost_size = 'off'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; QUERY PLAN Hash Join (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651 rows=0 loops=1) Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric)) -> Seq Scan on b
Re: pg_rewind WAL segments deletion pitfall
On 2022-09-29 17:18, Polina Bungina wrote: I agree with your suggestions, so here is the updated version of patch. Hope I haven't missed anything. Regards, Polina Bungina Thanks for working on this! It seems like we are also facing the same issue. I tested the v3 patch under our condition, old primary has succeeded to become new standby. BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached in [1], old primary also failed to become standby: FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00020007 has already been removed However, I think this is not a problem: just adding restore_command like below fixed the situation. echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf Attached modified reproduction script for reference. [1]https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATIONmkdir newarch oldarch initdb -k -D oldprim echo "archive_mode = 'on'">> oldprim/postgresql.conf echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start psql -p 5432 -c 'create table t(a int)' pg_basebackup -D newprim -p 5432 echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf touch newprim/standby.signal pg_ctl -D newprim -o '-p 5433' -l newprim.log start # the last common checkpoint psql -p 5432 -c 'checkpoint' # old primary cannot archive any more echo "archive_command = 'false'">> oldprim/postgresql.conf pg_ctl -D oldprim reload # advance WAL on the old primary; four WAL segments will never make it to the archive for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done # record approx. 
diverging WAL segment start_wal=`psql -p 5432 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"` pg_ctl -D newprim promote # old rprimary loses diverging WAL segment for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done psql -p 5432 -c 'checkpoint;' psql -p 5433 -c 'checkpoint;' pg_ctl -D oldprim stop # rewind the old primary, using its own archive # pg_rewind -D oldprim --source-server='port=5433' # should fail echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf pg_rewind -D oldprim --source-server='port=5433' -c # advance WAL on the old primary; new primary loses the launching WAL seg for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done psql -p 5433 -c 'checkpoint' echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf touch oldprim/standby.signal echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf postgres -D oldprim # fails with "WAL file has been removed" ## The alternative of copying-in ## echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf # ## copy-in WAL files from new primary's archive to old primary #(cd newarch; #for f in `ls`; do # if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi #done) # #postgres -D oldprim # also fails with "requested WAL segment XXX has already been removed"
Re: Another incorrect comment for pg_stat_statements
On Wed, 28 Jun 2023 at 16:27, Richard Guo wrote: > On Wed, Jun 28, 2023 at 3:36 PM Michael Paquier wrote: > >> On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: >> > +1. To nitpick, how about we remove the blank line just before removing >> > the key for top level entry? >> > >> > - /* Also remove entries for top level statements */ >> > + /* Also remove entries if exist for top level statements */ >> >key.toplevel = true; >> > - >> > - /* Remove the key if exists */ >> >entry = (pgssEntry *) hash_search(pgss_hash, , HASH_REMOVE, NULL); >> >> Why not if it improves the overall situation. Could you send a patch >> with everything you have in mind? > > > Here is the patch. I don't have too much in mind, so the patch just > removes the blank line and revises the comment a bit. > +1. LGTM. -- Regards, Japin Li.
Re: Request for new function in view update
Hi Yugo Thank you for taking a look at the paper. The key difference from PostgreSQL is that it only allows updates on single table views. My paper discusses two kinds of joins of two tables. It discusses how to update them and how to determine when they occur. The paper also discusses unioning two tables. I've done work on table intersection and table difference too, but didn't include it in the paper. Terry Brennan On Wed, Jun 28, 2023 at 2:49 AM Yugo NAGATA wrote: > On Thu, 1 Jun 2023 12:18:47 -0500 > Terry Brennan wrote: > > > Hello all, > > > > I am a researcher in databases who would like to suggest a new > function. I > > am writing to you because you have an active developer community. Your > > website said that suggestions for new functions should go to this mailing > > list. If there is another mailing list you prefer, please let me know. > > > > My research is in updating views -- the problem of translating an update > in > > a view to an update to a set of underlying base tables. This problem has > > been partially solved for many years, including in PostgreSQL, but a > > complete solution hasn't been found. > > > > Views are useful for data independence; if users only access data through > > views, then underlying databases can change without user programs. Data > > independence requires an automatic solution to the view update problem. > > > > In my research, I went back to the initial papers about the problem. The > > most promising approach was the "constant complement" approach. It > starts > > from the idea that a view shows only part of the information in a > database, > > and that view updates should never change the part of the database that > > isn't exposed in the view. (The "complement" is the unexposed part, and > > "constant" means that a view update shouldn't change the complement.) > The > > "constant complement" constraint is intuitive, that a view update > shouldn't > > have side effects on information not available through the view. 
> > > > A seminal paper showed that defining a complement is enough, because each > > complement of a view creates a unique view update. Unfortunately, there > > are limitations. Views have multiple complements, and no unique minimal > > complement exists. Because of this limitation and other practical > > difficulties, the constant complement approach was abandoned. > > > > I used a theorem in this initial paper that other researchers didn't use, > > that shows the inverse. An update method defines a unique complement. I > > used the two theorems as a saw's upstroke and downstroke to devise view > > update methods for several relational operators. Unlike other > approaches, > > these methods have a solid mathematical foundation. > > > > Some relational operators are easy (selection), others are hard > > (projection); some have several valid update methods that can be used > > interchangeably (union) and some can have several valid update methods > that > > reflect different semantics (joins). For joins, I found clues in the > > database that can determine which update method to use. I address the > > other relational operators, but not in the attached paper > > . > > I also discuss the problem of when views can't have updates, and possible > > reasons why. > > > > I have attached my arXiv paper. I would appreciate anyone's interest in > > this topic. > > I'm interested in the view update problem because we have some works on > this topic [1][2]. > > I read your paper. Although I don't understand the theoretical part enough, > I found your proposal methods to update views as for several relational > operators. > > The method for updating selection views seems same as the way of > automatically > updatable views in the current PostgreSQL, that is, deleting/updating rows > in > a view results in deletes/updates for corresponding rows in the base table. 
> Inserting rows that are not compliant with the view is checked and prevented if the view is defined WITH CHECK OPTION. > > However, the proposed method for projection is that a column not contained in the view is updated to NULL when a row is deleted. I think it is not desirable to use NULL for such a special purpose, and above all, this differs from the current PostgreSQL behavior, so I wonder whether such a change would be accepted. I think the same applies to the method for JOIN, which also uses NULL for this special purpose. > > I think it would be nice to extend the view updatability of PostgreSQL, because the SQL standard allows more than the current limited support. In that case, I wonder whether we should follow SQL:1999 or later; maybe this would be somewhat compatible with the spec in Oracle. > > [1] https://dl.acm.org/doi/10.1145/3164541.3164584 > [2] https://www.pgcon.org/2017/schedule/events/1074.en.html > > Regards, > Yugo Nagata > > > Yours > > Terry Brennan > > > -- > Yugo NAGATA >
Re: Row pattern recognition
Small question. > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, > FIRST_VALUE(s.tdate) OVER w, > LAST_VALUE(s.tdate) OVER w, > lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > Shouldn't LAST(DOWN) AS lowest be "LAST(DOWN.price) AS lowest"? Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Mon, 26 Jun 2023 at 15:51, Amit Kapila wrote: > > On Mon, Jun 26, 2023 at 3:07 PM Ashutosh Bapat > wrote: > > > > Hi All, > > Every pg_decode routine except pg_decode_message that decodes a > > transactional change, has following block > > /* output BEGIN if we haven't yet */ > > if (data->skip_empty_xacts && !txndata->xact_wrote_changes) > > { > > pg_output_begin(ctx, data, txn, false); > > } > > txndata->xact_wrote_changes = true; > > > > But pg_decode_message() doesn't call pg_output_begin(). If a WAL > > message is the first change in the transaction, it won't have a BEGIN > > before it. That looks like a bug. Why is pg_decode_message() > > exception? > > > > I can't see a reason why we shouldn't have a similar check for > transactional messages. So, agreed this is a bug. Here is a patch having the fix for the same. I have not added any tests as the existing tests cover this scenario. The same issue is present in back branches too. v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch can be applied on master, PG15 and PG14, v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch patch can be applied on PG13, PG12 and PG11. Thoughts? Regards, Vignesh From 2df5e87ec7c82daa5f17da50f770f923eb7765b4 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 28 Jun 2023 14:01:22 +0530 Subject: [PATCH] Call pg_output_begin in pg_decode_message if it is the first change in the transaction. Call pg_output_begin in pg_decode_message if it is the first change in the transaction. 
--- contrib/test_decoding/expected/messages.out | 10 -- contrib/test_decoding/sql/messages.sql | 2 +- contrib/test_decoding/test_decoding.c | 12 3 files changed, 21 insertions(+), 3 deletions(-) diff --git a/contrib/test_decoding/expected/messages.out b/contrib/test_decoding/expected/messages.out index c75d40190b..0fd70036bd 100644 --- a/contrib/test_decoding/expected/messages.out +++ b/contrib/test_decoding/expected/messages.out @@ -58,17 +58,23 @@ SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic'); ignorethis (1 row) -SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1'); +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0'); data + BEGIN message: transactional: 1 prefix: test, sz: 4 content:msg1 + COMMIT message: transactional: 0 prefix: test, sz: 4 content:msg2 message: transactional: 0 prefix: test, sz: 4 content:msg4 message: transactional: 0 prefix: test, sz: 4 content:msg6 + BEGIN message: transactional: 1 prefix: test, sz: 4 content:msg5 message: transactional: 1 prefix: test, sz: 4 content:msg7 + COMMIT + BEGIN message: transactional: 1 prefix: test, sz: 11 content:czechtastic -(7 rows) + COMMIT +(13 rows) -- test db filtering \set prevdb :DBNAME diff --git a/contrib/test_decoding/sql/messages.sql b/contrib/test_decoding/sql/messages.sql index cf3f7738e5..3d8500f99c 100644 --- a/contrib/test_decoding/sql/messages.sql +++ b/contrib/test_decoding/sql/messages.sql @@ -19,7 +19,7 @@ COMMIT; SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic'); -SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1'); +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0'); -- test db filtering \set prevdb :DBNAME diff 
--git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c index 12d1d0505d..2bbac5f6a8 100644 --- a/contrib/test_decoding/test_decoding.c +++ b/contrib/test_decoding/test_decoding.c @@ -743,6 +743,18 @@ pg_decode_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional, const char *prefix, Size sz, const char *message) { + TestDecodingData *data = ctx->output_plugin_private; + TestDecodingTxnData *txndata; + + txndata = transactional ? txn->output_plugin_private : NULL; + + /* output BEGIN if we haven't yet */ + if (transactional && data->skip_empty_xacts && !txndata->xact_wrote_changes) + pg_output_begin(ctx, data, txn, false); + + if (transactional) + txndata->xact_wrote_changes = true; + OutputPluginPrepareWrite(ctx, true); appendStringInfo(ctx->out, "message: transactional: %d prefix: %s, sz: %zu content:", transactional, prefix, sz); -- 2.34.1 From 273776c53da22ecd69f8b7d4e7712ef66ab0ea5c Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 28 Jun 2023 14:01:22 +0530 Subject: [PATCH] Call pg_output_begin in pg_decode_message if it is the first change in the transaction. Call pg_output_begin in pg_decode_message if it is the first change in
Re: Changing types of block and chunk sizes in memory contexts
David Rowley writes: > Perhaps it's ok to leave the context creation functions with Size > typed parameters and then just Assert the passed-in sizes are not > larger than 1GB within the context creation function. Yes, I'm strongly opposed to not using Size/size_t in the mmgr APIs. If we go that road, we're going to have a problem when someone inevitably wants to pass a larger-than-GB value for some context type. What happens in semi-private structs is a different matter, although I'm a little dubious that shaving a couple of bytes from context headers is a useful activity. The self-documentation argument still has some force there, so I agree with Peter that some positive benefit has to be shown. regards, tom lane
Re: Making empty Bitmapsets always be NULL
Thank you for running the profiles. On Tue, 27 Jun 2023 at 21:11, Yuya Watari wrote: > On Sat, Jun 24, 2023 at 1:15 PM David Rowley wrote: > > I think it's also important to check we don't slow anything down for > > more normal-sized sets. The vast majority of sets will contain just a > > single word, so we should probably focus on making sure we're not > > slowing anything down for those. > > I agree with you and thank you for sharing the results. I ran > installcheck with your patch. The result is as follows. The speedup > was 0.33%. At least in my environment, I did not observe any > regression with this test. So, the patch looks very good. > > Master: 2.559648 seconds > Patched: 2.551116 seconds (0.33% faster) I wondered if the common case could be made slightly faster by checking the 0th word before checking the word count before going onto check the remaining words. For bms_equal(), that's something like: if (a->words[0] != b->words[0] || a->nwords != b->nwords) return false; /* check all the remaining words match */ for (int i = 1; i < a->nwords; i++) ... I wrote the patch and tried it out, but it seems slightly slower than the v4 patch. Linux with AMD 3990x, again using the patch from [1] with make installcheck master: 1.41720145 seconds v4: 1.392969606 seconds (1.74% faster than master) v4 with 0th word check: 1.404199748 seconds (0.93% faster than master) I've attached a delta patch of what I used to test. Since it's not any faster, I don't think it's worth doing. It'll also produce slightly more compiled code. David [1] https://postgr.es/m/caaphdvo68m_0juthnehfnsdsjeb2upphk6bwxstj93u_qei...@mail.gmail.com diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 9cda3b1cc1..2ee7385f02 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -9,12 +9,7 @@ * the minimum possible number of words, i.e, there are never any trailing * zero words. Enforcing this requires that an empty set is represented as * NULL. 
Because an empty Bitmapset is represented as NULL, a non-NULL - * Bitmapset always has at least 1 Bitmapword. We can exploit this fact to - * speedup various loops over the Bitmapset's words array by using "do while" - * loops instead of "for" loops. This means the code does not waste time - * checking the loop condition before the first iteration. For Bitmapsets - * containing only a single word (likely the majority of them) this reduces - * the loop condition tests by half. + * Bitmapset always has at least 1 Bitmapword. * * * Copyright (c) 2003-2023, PostgreSQL Global Development Group @@ -96,8 +91,6 @@ bms_copy(const Bitmapset *a) bool bms_equal(const Bitmapset *a, const Bitmapset *b) { - int i; - /* Handle cases where either input is NULL */ if (a == NULL) { @@ -108,17 +101,20 @@ bms_equal(const Bitmapset *a, const Bitmapset *b) else if (b == NULL) return false; - /* can't be equal if the word counts don't match */ - if (a->nwords != b->nwords) + /* +* Most Bitmapsets will likely just have 1 word, so check for equality +* there before checking the number of words match. The sets cannot be +* equal when they don't have the same number of words. +*/ + if (a->words[0] != b->words[0] || a->nwords != b->nwords) return false; - /* check each word matches */ - i = 0; - do + /* check all the remaining words match */ + for (int i = 1; i < a->nwords; i++) { if (a->words[i] != b->words[i]) return false; - } while (++i < a->nwords); + } return true; } @@ -353,25 +349,27 @@ bms_difference(const Bitmapset *a, const Bitmapset *b) bool bms_is_subset(const Bitmapset *a, const Bitmapset *b) { - int i; - /* Handle cases where either input is NULL */ if (a == NULL) return true;/* empty set is a subset of anything */ if (b == NULL) return false; - /* 'a' can't be a subset of 'b' if it contains more words */ - if (a->nwords > b->nwords) + /* +* Check the 0th word first as many sets will only have 1 word. 
+* Otherwise, 'a' can't be a subset of 'b' if it contains more words, so +* there's no need to check remaining words if 'a' contains more words +* than 'b'. +*/ + if ((a->words[0] & ~b->words[0]) != 0 || a->nwords > b->nwords) return false; - /* Check all 'a' members are set in 'b' */ - i = 0; - do + /* Check all remaining words to see 'b' has members not set in 'a'. */ + for (int i = 1; i < a->nwords; i++) { if ((a->words[i] & ~b->words[i]) != 0) return false; - } while (++i < a->nwords); + }
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Fri, Apr 14, 2023 at 4:00 PM Hayato Kuroda (Fujitsu) wrote: > > > Sorry for the delay, I didn't have time to come back to it until this > > afternoon. > > No issues, everyone is busy:-). > > > I don't think that your analysis is correct. Slots are guaranteed to be > > stopped after all the normal backends have been stopped, exactly to avoid > > such extraneous records. > > > > What is happening here is that the slot's confirmed_flush_lsn is properly > > updated in memory and ends up being the same as the current LSN before the > > shutdown. But as it's a logical slot and those records aren't decoded, the > > slot isn't marked as dirty and therefore isn't saved to disk. You don't see > > that behavior when doing a manual checkpoint before (per your script > > comment), > > as in that case the checkpoint also tries to save the slot to disk but then > > finds a slot that was marked as dirty and therefore saves it. > > Here, why is the behavior different for manual and non-manual checkpoints? > > In your script's scenario, when you restart the server the previous slot > > data > > is restored and the confirmed_flush_lsn goes backward, which explains those > > extraneous records. > > So you meant to say that the key point was that some records which are not > sent to the subscriber do not mark slots as dirty, hence the updated confirmed_flush > was not written into the slot file. Is that right? LogicalConfirmReceivedLocation() is > called by the walsender when the process gets a reply from the worker process, so your analysis > seems correct. > Can you please explain what led to updating the confirmed_flush in memory but not on disk? BTW, have we ensured that the additional records being discarded were already sent to the subscriber? If so, why has the confirmed_flush LSN not progressed past those records? -- With Regards, Amit Kapila.
Re: Assistance Needed: Issue with pg_upgrade and --link option
On Wed, 2023-06-28 at 15:40 +0530, Pradeep Kumar wrote: > > > I was under the impression that the --link option would create hard links > > > between the > > > old and new cluster's data files, but it appears that the entire old > > > cluster data was > > > copied to the new cluster, resulting in a significant increase in the new > > > cluster's size. > > > > Please provide some numbers, ideally > > > > du -sk > > du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master > ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg > 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master > 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg That looks fine. The files exist only once, and the 41MB that only exist in the new data directory are catalog data and other stuff that is different on the new cluster. Yours, Laurenz Albe
Re: Assistance Needed: Issue with pg_upgrade and --link option
These are my numbers. df ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/nvme0n1p4_crypt 375161856 102253040 270335920 28% /home /dev/mapper/nvme0n1p4_crypt 375161856 102253040 270335920 28% /home On Wed, Jun 28, 2023 at 3:14 PM Peter Eisentraut wrote: > On 28.06.23 08:24, Laurenz Albe wrote: > > On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > >> I was under the impression that the --link option would create hard > links between the > >> old and new cluster's data files, but it appears that the entire old > cluster data was > >> copied to the new cluster, resulting in a significant increase in the > new cluster's size. > > > > Please provide some numbers, ideally > > > >du -sk > > I don't think you can observe the effects of the --link option this way. > It would just give you the full size count for both directories, even > though they point to the same underlying inodes. > > To see the effect, you could perhaps use `df` to see how much overall > disk space the upgrade step eats up. > >
Re: Assistance Needed: Issue with pg_upgrade and --link option
Sure, du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg On Wed, Jun 28, 2023 at 11:54 AM Laurenz Albe wrote: > On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > > I was under the impression that the --link option would create hard > links between the > > old and new cluster's data files, but it appears that the entire old > cluster data was > > copied to the new cluster, resulting in a significant increase in the > new cluster's size. > > Please provide some numbers, ideally > > du -sk > > Yours, > Laurenz Albe >
Show WAL write and fsync stats in pg_stat_io
Hi, This is a WIP patch to add WAL write and fsync stats to the pg_stat_io view. There is a track_io_timing variable to control pg_stat_io timings and a track_wal_io_timing variable to control WAL timings. I couldn't decide which setting should control WAL timings in pg_stat_io. For now, both track_io_timing and track_wal_io_timing need to be enabled to track WAL timings in pg_stat_io. Also, if you compare WAL stats in pg_stat_wal and pg_stat_io, you can come across differences. These differences are caused by the background writer's WAL stats not being flushed. Because of that, the background writer's WAL stats are not seen in pg_stat_wal but in pg_stat_io. I already sent a patch [1] to fix that. [1] https://www.postgresql.org/message-id/CAN55FZ2FPYngovZstr%3D3w1KSEHe6toiZwrurbhspfkXe5UDocg%40mail.gmail.com Any kind of feedback would be appreciated. Regards, Nazir Bilal Yavuz Microsoft v1-0001-WIP-Show-WAL-write-and-fsync-stats-on-pg_stat_io.patch Description: Binary data
Re: Assistance Needed: Issue with pg_upgrade and --link option
On 28.06.23 08:24, Laurenz Albe wrote: On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Please provide some numbers, ideally du -sk I don't think you can observe the effects of the --link option this way. It would just give you the full size count for both directories, even though they point to the same underlying inodes. To see the effect, you could perhaps use `df` to see how much overall disk space the upgrade step eats up.
Re: Changing types of block and chunk sizes in memory contexts
On Wed, 28 Jun 2023 at 20:13, Peter Eisentraut wrote: > size_t (= Size) is the correct type in C to store the size of an object > in memory. This is partially a self-documentation issue: If I see > size_t in a function signature, I know what is intended; if I see > uint32, I have to wonder what the intent was. Perhaps it's ok to leave the context creation functions with Size-typed parameters and then just Assert the passed-in sizes are not larger than 1GB within the context creation function. That way we could keep this change self-contained in the .c file for the given memory context. That way, readability would not suffer. If we ever wanted to lift the 1GB limit on block sizes then we'd not need to switch the function signature again. There's documentation where the struct's field is declared, so having a smaller type in the struct itself does not seem like a reduction of documentation quality. > You could make an argument that using shorter types would save space for > some internal structs, but then you'd have to show some more information > where and why that would be beneficial. I think there's not much need to go proving this speeds something up. There's just simply no point in the struct fields being changed in Melih's patch to be bigger than 32 bits as we never need to store more than 1GB in them. Reducing these down means we may have to touch fewer cache lines and we'll also have more space on the keeper blocks to store allocations. Memory allocation performance is fairly fundamental to Postgres's performance. In my view, we shouldn't have fields that are twice as large as they need to be in code as hot as this. > Absent any strong performance argument, I don't see the benefit of this > change. People might well want to experiment with MEMORYCHUNK_... settings larger than 1GB. Anyone doing so will be editing C code anyway. They can adjust these fields then. David
Re: Targetlist lost when CTE join
HI, On Jun 28, 2023, 17:26 +0800, Julien Rouhaud , wrote: > On Wed, Jun 28, 2023 at 05:17:14PM +0800, Julien Rouhaud wrote: > > > > > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > > > result target list seems to lost one’s column c1. > > > But it looks good when select cte1.* and t1.* explicitly . > > > > > > Is it a bug? > > > > This is working as intended. When using a USING clause you "merge" both > > columns so the final target list only contain one version of the merged > > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that > > what the SQL standard says, but I don't have a copy to confirm. > > I forgot to mention that this is actually documented: > > https://www.postgresql.org/docs/current/queries-table-expressions.html > > Furthermore, the output of JOIN USING suppresses redundant columns: there is > no > need to print both of the matched columns, since they must have equal values. > While JOIN ON produces all columns from T1 followed by all columns from T2, > JOIN USING produces one output column for each of the listed column pairs (in > the listed order), followed by any remaining columns from T1, followed by any > remaining columns from T2. Thanks for your help. Regards, Zhang Mingli
Re: Targetlist lost when CTE join
On Wed, Jun 28, 2023 at 05:17:14PM +0800, Julien Rouhaud wrote: > > > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > > result target list seems to lost one’s column c1. > > But it looks good when select cte1.* and t1.* explicitly . > > > > Is it a bug? > > This is working as intended. When using a USING clause you "merge" both > columns so the final target list only contain one version of the merged > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that > what the SQL standard says, but I don't have a copy to confirm. I forgot to mention that this is actually documented: https://www.postgresql.org/docs/current/queries-table-expressions.html Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
Re: Targetlist lost when CTE join
Hi Regards, Zhang Mingli On Jun 28, 2023, 17:17 +0800, Julien Rouhaud , wrote: > This is working as intended. When using a USING clause you "merge" both > columns so the final target list only contains one version of the merged > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that's > what the SQL standard says, but I don't have a copy to confirm. Thanks. You're right. Here's a test: gpadmin=# with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 on t1.c1 = cte1.c1; c1 | c2 | c1 | c2 +++ (0 rows)
Re: Targetlist lost when CTE join
Hi, On Wed, Jun 28, 2023 at 04:52:34PM +0800, Zhang Mingli wrote: > > Mini repo > > create table t1(c1 int, c2 int); > CREATE TABLE > create table t2(c1 int, c2 int); > CREATE TABLE > explain with cte1 as (insert into t2 values (1, 2) returning *) select * from > cte1 join t1 using(c1); > QUERY PLAN > > Hash Join (cost=0.04..41.23 rows=11 width=12) > Hash Cond: (t1.c1 = cte1.c1) > CTE cte1 > -> Insert on t2 (cost=0.00..0.01 rows=1 width=8) > -> Result (cost=0.00..0.01 rows=1 width=8) > -> Seq Scan on t1 (cost=0.00..32.60 rows=2260 width=8) > -> Hash (cost=0.02..0.02 rows=1 width=8) > -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) > (8 rows) > > with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 > join t1 using(c1); > c1 | c2 | c2 > ++ > (0 rows) > > truncate t2; > TRUNCATE TABLE > with cte1 as (insert into t2 values (1, 2) returning *) select cte1.*, t1.* > from cte1 join t1 using(c1); > c1 | c2 | c1 | c2 > +++ > (0 rows) > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > result target list seems to lost one’s column c1. > But it looks good when select cte1.* and t1.* explicitly . > > Is it a bug? This is working as intended. When using a USING clause you "merge" both columns so the final target list only contains one version of the merged columns, which doesn't happen if you use e.g. ON instead. I'm assuming that's what the SQL standard says, but I don't have a copy to confirm.
Re: Targetlist lost when CTE join
Hi, With EXPLAIN VERBOSE, it seems the Hash Join node drops that column. gpadmin=# explain(verbose) with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); QUERY PLAN --- Hash Join (cost=0.04..41.23 rows=11 width=12) Output: cte1.c1, cte1.c2, t1.c2 Hash Cond: (t1.c1 = cte1.c1) CTE cte1 -> Insert on public.t2 (cost=0.00..0.01 rows=1 width=8) Output: t2.c1, t2.c2 -> Result (cost=0.00..0.01 rows=1 width=8) Output: 1, 2 -> Seq Scan on public.t1 (cost=0.00..32.60 rows=2260 width=8) Output: t1.c1, t1.c2 -> Hash (cost=0.02..0.02 rows=1 width=8) Output: cte1.c1, cte1.c2 -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) Output: cte1.c1, cte1.c2 (14 rows) Regards, Zhang Mingli
Targetlist lost when CTE join
Hi, Minimal repro: create table t1(c1 int, c2 int); CREATE TABLE create table t2(c1 int, c2 int); CREATE TABLE explain with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); QUERY PLAN Hash Join (cost=0.04..41.23 rows=11 width=12) Hash Cond: (t1.c1 = cte1.c1) CTE cte1 -> Insert on t2 (cost=0.00..0.01 rows=1 width=8) -> Result (cost=0.00..0.01 rows=1 width=8) -> Seq Scan on t1 (cost=0.00..32.60 rows=2260 width=8) -> Hash (cost=0.02..0.02 rows=1 width=8) -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) (8 rows) with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); c1 | c2 | c2 ++ (0 rows) truncate t2; TRUNCATE TABLE with cte1 as (insert into t2 values (1, 2) returning *) select cte1.*, t1.* from cte1 join t1 using(c1); c1 | c2 | c1 | c2 +++ (0 rows) Tables t1 and t2 both have 2 columns: c1 and c2. When the CTE join selects *, the result target list seems to have lost one c1 column. But it looks fine when selecting cte1.* and t1.* explicitly. Is it a bug? Regards, Zhang Mingli
Re: Do we want a hashset type?
On Wed, Jun 28, 2023, at 08:26, jian he wrote: > Hi there. > I changed the function hashset_contains to strict. Changing hashset_contains to STRICT would cause it to return NULL if any of the operands are NULL, which I don't believe is correct, since: SELECT NULL = ANY('{}'::int4[]); ?column? -- f (1 row) Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE, to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023. Did you try running `make installcheck` after your change? You would then have seen one of the tests failing: test array-and-multiset-semantics ... FAILED 21 ms Check the content of `regression.diffs` to see why: % cat regression.diffs diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38 +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27 @@ -158,7 +158,7 @@ | | {NULL} | {NULL} | | |1 | {1} | {1} | | |4 | {4} | {4} | | - {} | | {NULL} | {NULL} | f| f + {} | | {NULL} | {NULL} | | f {} |1 | {1} | {1} | f| f {} |4 | {4} | {4} | f| f {NULL} | | {NULL} | {NULL,NULL} | | @@ -284,7 +284,8 @@ "= ANY(...)"; arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...) --+--+-+--+--+ -(0 rows) + {} | | {NULL} | {NULL} | | f +(1 row) > also change the way to return an empty array. Nice. I agree the `Datum d` variable was unnecessary. I also removed the unused includes. > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that > int4hashset can speed distinct aggregate and distinct counts? 
> like the following: > > explain(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 > > explain(analyze, costs off, timing off, buffers) > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 The 100k tables seems to be too small to give any meaningful results, when trying to measure individual queries: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_100k; Execution Time: 26.790 ms Execution Time: 30.616 ms Execution Time: 33.253 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_100k; Execution Time: 32.797 ms Execution Time: 27.605 ms Execution Time: 26.228 ms If we instead try the 10M tables, it looks like array_agg(DISTINCT ...) is actually faster for the `i` column where all input integers are unique: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_10M; Execution Time: 799.017 ms Execution Time: 796.008 ms Execution Time: 799.121 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_10M; Execution Time: 1204.873 ms Execution Time: 1221.822 ms Execution Time: 1216.340 ms For random integers, hashset is a win though: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1874.722 ms Execution Time: 1878.760 ms Execution Time: 1861.640 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(rnd) FROM benchmark_input_10M; Execution Time: 1253.709 ms Execution Time: 1222.651 ms Execution Time: 1237.849 ms > explain(costs off,timing off, analyze,buffers) > select count(distinct rnd) from benchmark_input_100k \watch c=3 > > explain(costs off,timing off, analyze,buffers) > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM > benchmark_input_100k) sub(x) \watch c=3 I tried these with 10M: EXPLAIN(costs off,timing off, 
analyze,buffers) SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1733.320 ms Execution Time: 1725.214 ms Execution Time: 1716.636 ms EXPLAIN(costs off,timing off, analyze,buffers) SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x); Execution Time: 1249.612 ms Execution Time: 1240.558 ms Execution Time: 1252.103 ms Not sure what I think of the current benchmark suite. I think it would be better to only include some realistic real-life examples, such as the graph query which was the reason I personally started working on this. Otherwise there is a risk we optimise for some hypothetical scenario that is not relevant in practice. It would be good to have more examples of typical workloads for which the hashset type would be
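The NULL semantics discussed earlier in this thread — that `hashset_contains` should mimic `x = ANY(array)` rather than being STRICT — can be sketched outside the extension. The `miniset` type, its field names, and the `contains` helper below are hypothetical stand-ins for `int4hashset_t`, not the actual extension code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Three-valued logic result, mirroring SQL's TRUE / FALSE / NULL (unknown) */
typedef enum { TB_FALSE, TB_TRUE, TB_NULL } tribool;

/* Hypothetical mini-set standing in for int4hashset_t: a handful of
 * elements plus a null_element flag, as in the patch description. */
typedef struct
{
    int  elems[8];
    int  nelems;
    bool null_element;
} miniset;

/*
 * contains() following "x = ANY(array)" semantics:
 *  - a truly empty set yields FALSE even for a NULL probe, since there
 *    is nothing to compare against (this is why STRICT is wrong here);
 *  - a NULL probe against a non-empty set yields NULL (unknown);
 *  - a probe not found, in a set holding a NULL element, also yields
 *    NULL (it might equal the unknown element).
 */
static tribool
contains(const miniset *s, const int *probe)
{
    if (s->nelems == 0 && !s->null_element)
        return TB_FALSE;
    if (probe == NULL)
        return TB_NULL;
    for (int i = 0; i < s->nelems; i++)
        if (s->elems[i] == *probe)
            return TB_TRUE;
    return s->null_element ? TB_NULL : TB_FALSE;
}
```

This reproduces the expected-output row from the failing regression test: the empty set probed with NULL gives FALSE, while `{NULL}` probed with anything gives NULL.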
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
On 27.06.23 17:54, Dagfinn Ilmari Mannsåker wrote: However, the docs (https://www.postgresql.org/docs/16/regress-tap.html#REGRESS-TAP-VARS) say "If the environment variable PG_TEST_NOCLEAN is set", not "is set to a true value", and the existing test in PostgreSQL::Test::Cluster's END block is: # skip clean if we are requested to retain the basedir next if defined $ENV{'PG_TEST_NOCLEAN'}; So the original `not defined` test is consistent with that. Right, the usual style is just to check whether an environment variable is set to something, not what it is. Also note that in general not all environment variables are processed by Perl, so I would avoid encoding Perl semantics about what is "true" or whatever into it.
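For comparison, the C equivalent of the `defined $ENV{'PG_TEST_NOCLEAN'}` style — where any value at all, even "" or "0", counts as set — might look like this sketch (the `keep_tempdirs` helper name is made up, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * "Is set" semantics: the variable's presence enables the behavior,
 * regardless of its value.  This matches the documented wording and
 * the existing `defined $ENV{'PG_TEST_NOCLEAN'}` test in
 * PostgreSQL::Test::Cluster.
 */
static bool
keep_tempdirs(void)
{
    return getenv("PG_TEST_NOCLEAN") != NULL;
}
```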
Re: Detecting use-after-free bugs using gcc's malloc() attribute
On 26.06.23 21:54, Andres Freund wrote: For something like pg_list.h the malloc(free) attribute is a bit awkward to use, because one a) needs to list ~30 functions that can free a list and b) the referenced functions need to be declared. Hmm. Saying list_concat() "deallocates" a list is mighty confusing because 1) it doesn't, and 2) it might actually allocate a new list. So while you get the useful behavior of "you probably didn't mean to use this variable again after passing it into list_concat()", if some other tool actually took these allocate/deallocate decorations at face value and did a memory leak analysis with them, they'd get completely bogus results. I also added such attributes to bitmapset.h and palloc() et al, but that didn't find existing bugs. It does find use-after-free instances if I add some, similarly it does find cases of mismatching palloc with free etc. This seems more straightforward. Even if it didn't find any bugs, I'd imagine it would be useful during development. Do others think this would be useful enough to be worth polishing? And do you agree the warnings above are bugs? I actually just played with this the other day, because I can never remember termPQExpBuffer() vs. destroyPQExpBuffer(). I couldn't quite make it work for that, but I found the feature overall useful, so I'd welcome support for it.
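A minimal sketch of the allocator/deallocator pairing under discussion, using a hypothetical `IntList` type rather than pg_list.h's real declarations. With GCC 11+ and -Wall, freeing the result with plain free() draws -Wmismatched-dealloc, and reusing it after intlist_free() can draw -Wuse-after-free:

```c
#include <assert.h>
#include <stdlib.h>

/* The malloc(deallocator) attribute form is GCC 11+; no-op elsewhere. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 11
#define FREED_BY(fn) __attribute__((malloc, malloc(fn)))
#else
#define FREED_BY(fn)
#endif

typedef struct IntList { int len; } IntList;

/* The deallocator must be declared before it can be referenced. */
void intlist_free(IntList *list);
IntList *intlist_make(int len) FREED_BY(intlist_free);

void
intlist_free(IntList *list)
{
    free(list);
}

IntList *
intlist_make(int len)
{
    IntList *l = malloc(sizeof(IntList));

    l->len = len;
    return l;
}
```

The pg_list.h awkwardness Andres mentions follows directly from this shape: every function that can free or repurpose a List would need to be declared before the allocator declarations, and listed as a deallocator.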
Re: [PATCH] Support % wildcard in extension upgrade filenames
> On 31 May 2023, at 21:07, Sandro Santilli wrote: > On Thu, Apr 27, 2023 at 12:49:57PM +0200, Sandro Santilli wrote: >> I'm happy to bring back the control-file switch if there's an >> agreement about that. > > I'm attaching an up-to-date version of the patch with the control-file > switch back in, so there's an explicit choice by extension developers. This version fails the extension test since the comment from the .sql file is missing from test_extensions.out. Can you fix that in a rebase such that the CFBot can have a green build of this patch? -- Daniel Gustafsson
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 3:36 PM Michael Paquier wrote: > On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: > > +1. To nitpick, how about we remove the blank line just before removing > > the key for top level entry? > > > > - /* Also remove entries for top level statements */ > > + /* Also remove entries if exist for top level statements */ > >key.toplevel = true; > > - > > - /* Remove the key if exists */ > >entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); > > Why not if it improves the overall situation. Could you send a patch > with everything you have in mind? Here is the patch. I don't have too much in mind, so the patch just removes the blank line and revises the comment a bit. Thanks Richard v1-0001-trivial-revise-for-entry_reset.patch Description: Binary data
Re: Changing types of block and chunk sizes in memory contexts
In memory contexts, block and chunk sizes are likely to be limited by some upper bounds. Some examples of those bounds are MEMORYCHUNK_MAX_BLOCKOFFSET and MEMORYCHUNK_MAX_VALUE. Both values are only 1 less than 1GB. This means memory contexts have blocks/chunks with sizes less than 1GB, and such sizes can be stored in 32 bits. Currently the "Size" type, which is 64-bit, is used, but 32-bit integers should be enough to store any value less than 1GB. size_t (= Size) is the correct type in C to store the size of an object in memory. This is partially a self-documentation issue: If I see size_t in a function signature, I know what is intended; if I see uint32, I have to wonder what the intent was. You could make an argument that using shorter types would save space for some internal structs, but then you'd have to show more concretely where and why that would be beneficial. (But again, self-documentation: If one were to do that, I would argue for introducing a custom type like pg_short_size_t.) Absent any strong performance argument, I don't see the benefit of this change. People might well want to experiment with MEMORYCHUNK_... settings larger than 1GB.
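Peter's point about demonstrating the space savings can be made concrete with a hypothetical block header (field names invented for illustration; the real MemoryContext structs differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHUNK (1UL << 30)   /* ~1GB bound, cf. MEMORYCHUNK_MAX_VALUE */

/* Hypothetical block header as it might look today, using size_t */
typedef struct
{
    void  *next;
    size_t blksize;
    size_t freeoffset;
} BlockHdrWide;

/* The same header with the size fields narrowed to 32 bits */
typedef struct
{
    void    *next;
    uint32_t blksize;
    uint32_t freeoffset;
} BlockHdrNarrow;
```

On an LP64 platform the narrow variant is 16 bytes against 24 — real, but only worth pursuing if such headers are numerous enough for the saving to show up somewhere measurable.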
Re: Incremental View Maintenance, take 2
On Wed, 28 Jun 2023 00:01:02 +0800 jian he wrote: > On Thu, Jun 1, 2023 at 2:47 AM Yugo NAGATA wrote: > > > > On Thu, 1 Jun 2023 23:59:09 +0900 > > Yugo NAGATA wrote: > > > > > Hello hackers, > > > > > > Here's a rebased version of the patch-set adding Incremental View > > > Maintenance support for PostgreSQL. That was discussed in [1]. > > > > > [1] > > > https://www.postgresql.org/message-id/flat/20181227215726.4d166b4874f8983a641123f5%40sraoss.co.jp > > > > --- > > * Overview > > > > Incremental View Maintenance (IVM) is a way to make materialized views > > up-to-date by computing only incremental changes and applying them on > > views. IVM is more efficient than REFRESH MATERIALIZED VIEW when > > only small parts of the view are changed. > > > > ** Feature > > > > The attached patchset provides a feature that allows materialized views > > to be updated automatically and incrementally just after an underlying > > table is modified. > > > > You can create an incrementally maintainable materialized view (IMMV) > > by using the CREATE INCREMENTAL MATERIALIZED VIEW command. > > > > The following are supported in view definition queries: > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins) > > - some built-in aggregate functions (count, sum, avg, min, max) > > - GROUP BY clause > > - DISTINCT clause > > > > Views can contain multiple tuples with the same content (duplicate tuples). > > > > ** Restrictions > > > > The following are not supported in a view definition: > > - Outer joins > > - Aggregates other than the above, window functions, HAVING > > - Sub-queries, CTEs > > - Set operations (UNION, INTERSECT, EXCEPT) > > - DISTINCT ON, ORDER BY, LIMIT, OFFSET > > > > Also, a view definition query cannot contain other views, materialized > > views, > > foreign tables, partitioned tables, partitions, VALUES, non-immutable > > functions, > > system columns, or expressions that contain aggregates.
> > > > --- > > * Design > > > > An IMMV is maintained using statement-level AFTER triggers. > > When an IMMV is created, triggers are automatically created on all base > > tables contained in the view definition query. > > > > When a table is modified, changes that occurred in the table are extracted > > as transition tables in the AFTER triggers. Then, changes that will occur in > > the view are calculated by a rewritten view definition query in which the modified > > table > > is replaced with the transition table. > > > > For example, if the view is defined as "SELECT * FROM R, S", and tuples > > inserted > > into R are stored in a transition table dR, the tuples that will be inserted > > into > > the view are calculated as the result of "SELECT * FROM dR, S". > > > > ** Multiple Tables Modification > > > > Multiple tables can be modified in a statement when using triggers, foreign > > key > > constraints, or modifying CTEs. When multiple tables are modified, we need > > the state of the tables before the modification. > > > > For example, when some tuples, dR and dS, are inserted into R and S > > respectively, > > the tuples that will be inserted into the view are calculated by the > > following > > two queries: > > > > "SELECT * FROM dR, S_pre" > > "SELECT * FROM R, dS" > > > > where S_pre is the table before the modification, and R is the current state of > > the table, that is, after the modification. This pre-update state of a table > > is calculated by filtering inserted tuples and appending deleted tuples. > > The subquery that represents the pre-update state is generated in > > get_prestate_rte(). > > Specifically, the inserted tuples are filtered by calling > > IVM_visible_in_prestate() > > in the WHERE clause. This function checks the visibility of tuples by using > > the snapshot taken before the table modification. The deleted tuples are > > contained > > in the old transition table, and this table is appended using UNION ALL.
> > > > Transition tables for each modification are collected in each AFTER trigger > > function call. Then, the view maintenance is performed in the last call of > > the trigger. > > > > In the original PostgreSQL, tuplestores of transition tables are freed at > > the > > end of each nested query. However, their lifespan needs to be prolonged to > > the end of the outermost query in order to maintain the view in the last > > AFTER > > trigger. For this purpose, SetTransitionTablePreserved is added in > > trigger.c. > > > > ** Duplicate Tuples > > > > When calculating the changes that will occur in the view (= delta tables), > > the multiplicity of tuples is calculated by using count(*). > > > > When deleting tuples from the view, tuples to be deleted are identified by > > joining the delta table with the view, and as many tuples as the > > specified multiplicity are deleted, numbering them using the row_number() function. > > This is
Re: Request for new function in view update
On Thu, 1 Jun 2023 12:18:47 -0500 Terry Brennan wrote: > Hello all, > > I am a researcher in databases who would like to suggest a new function. I > am writing to you because you have an active developer community. Your > website said that suggestions for new functions should go to this mailing > list. If there is another mailing list you prefer, please let me know. > > My research is in updating views -- the problem of translating an update in > a view to an update to a set of underlying base tables. This problem has > been partially solved for many years, including in PostgreSQL, but a > complete solution hasn't been found. > > Views are useful for data independence; if users only access data through > views, then underlying databases can change without user programs. Data > independence requires an automatic solution to the view update problem. > > In my research, I went back to the initial papers about the problem. The > most promising approach was the "constant complement" approach. It starts > from the idea that a view shows only part of the information in a database, > and that view updates should never change the part of the database that > isn't exposed in the view. (The "complement" is the unexposed part, and > "constant" means that a view update shouldn't change the complement.) The > "constant complement" constraint is intuitive, that a view update shouldn't > have side effects on information not available through the view. > > A seminal paper showed that defining a complement is enough, because each > complement of a view creates a unique view update. Unfortunately, there > are limitations. Views have multiple complements, and no unique minimal > complement exists. Because of this limitation and other practical > difficulties, the constant complement approach was abandoned. > > I used a theorem in this initial paper that other researchers didn't use, > that shows the inverse. An update method defines a unique complement. 
I > used the two theorems as a saw's upstroke and downstroke to devise view > update methods for several relational operators. Unlike other approaches, > these methods have a solid mathematical foundation. > > Some relational operators are easy (selection), others are hard > (projection); some have several valid update methods that can be used > interchangeably (union) and some can have several valid update methods that > reflect different semantics (joins). For joins, I found clues in the > database that can determine which update method to use. I address the > other relational operators, but not in the attached paper > . > I also discuss the problem of when views can't have updates, and possible > reasons why. > > I have attached my arXiv paper. I would appreciate anyone's interest in > this topic. I'm interested in the view update problem because we have done some work on this topic [1][2]. I read your paper. Although I don't understand the theoretical part fully, I see that you propose view update methods for several relational operators. The method for updating selection views seems to be the same as the way automatically updatable views work in current PostgreSQL, that is, deleting/updating rows in a view results in deletes/updates of the corresponding rows in the base table. Inserting rows that do not conform to the view is checked and prevented if the view is defined with WITH CHECK OPTION. However, the proposed method for projection is that a column not contained in the view is updated to NULL when a row is deleted. I think it is not desirable to use NULL for such a special purpose, and above all, this differs from the current PostgreSQL behavior, so I wonder whether such a change would be accepted. The same seems true of the method for JOIN, which also uses NULL in this special way. I think it would be nice to extend the view updatability of PostgreSQL, because the SQL standard allows more than the current limited support.
In this case, I wonder whether we should follow SQL:1999 or later; maybe this would be somewhat compatible with the spec in Oracle. [1] https://dl.acm.org/doi/10.1145/3164541.3164584 [2] https://www.pgcon.org/2017/schedule/events/1074.en.html Regards, Yugo Nagata > Yours > Terry Brennan -- Yugo NAGATA
Commitfest manager for July
A quick scan of the archives doesn't turn up anyone who has volunteered in advance to run the upcoming commitfest. Is anyone keen to try their hand at this very important community work? The July CF is a good one for anyone doing this for the first time IMHO, as it's usually less stressful than the ones later in the cycle. -- Daniel Gustafsson
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: > +1. To nitpick, how about we remove the blank line just before removing > the key for top level entry? > > - /* Also remove entries for top level statements */ > + /* Also remove entries if exist for top level statements */ >key.toplevel = true; > - > - /* Remove the key if exists */ >entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); Why not if it improves the overall situation. Could you send a patch with everything you have in mind? -- Michael signature.asc Description: PGP signature
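The entry_reset() logic being discussed — removing both the nested-level and top-level entries for the same query — can be sketched with a toy table keyed the same way as pgssHashKey (names and structure here are illustrative, not the pg_stat_statements code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pgssHashKey: the same query can have one
 * entry at the top level and one nested (toplevel = false). */
typedef struct
{
    uint64_t queryid;
    bool     toplevel;
    bool     used;
} Entry;

#define NSLOTS 8
static Entry table[NSLOTS];

static int
find(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < NSLOTS; i++)
        if (table[i].used &&
            table[i].queryid == queryid &&
            table[i].toplevel == toplevel)
            return i;
    return -1;
}

static void
put(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!table[i].used)
        {
            table[i] = (Entry) { queryid, toplevel, true };
            return;
        }
}

/* Mirrors the discussed code path: remove the nested-level entry
 * first, then also remove the top-level entry for the same query. */
static void
remove_both_levels(uint64_t queryid)
{
    int i = find(queryid, false);

    if (i >= 0)
        table[i].used = false;
    i = find(queryid, true);
    if (i >= 0)
        table[i].used = false;
}
```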
Re: Add TLI number to name of files generated by pg_waldump --save-fullpage
On Wed, Jun 28, 2023 at 09:20:27AM +0900, Kyotaro Horiguchi wrote: > At Tue, 27 Jun 2023 18:58:39 -0500, David Christensen > wrote in >>> Adjusted as per the v2 attached. >> >> +1 > > +1 Okay, cool. Both of you seem happy with it, so I have applied it. Thanks for the quick checks. -- Michael signature.asc Description: PGP signature
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
Hi, On Tue, Feb 21, 2023 at 4:12 PM Amit Langote wrote: > On Tue, Feb 21, 2023 at 12:40 AM Tom Lane wrote: > > Alvaro Herrera writes: > > > On 2023-Feb-20, Amit Langote wrote: > > >> One more thing we could try is come up with a postgres_fdw test case, > > >> because it uses the RelOptInfo.userid value for remote-costs-based > > >> path size estimation. But adding a test case to contrib module's > > >> suite test a core planner change might seem strange, ;-). > > > > > Maybe. Perhaps adding it in a separate file there is okay? > > > > There is plenty of stuff in contrib module tests that is really > > there to test core-code behavior. (You could indeed argue that > > *all* of contrib is there for that purpose.) If it's not > > convenient to test something without an extension, just do it > > and don't sweat about that. > > OK. Attached adds a test case to postgres_fdw's suite. You can see > that it fails without a316a3bc. Noticed that there's an RfC entry for this in the next CF. Here's an updated version of the patch where I updated the comments a bit and the commit message. I'm thinking of pushing this on Friday barring objections. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v2-0001-Add-a-test-case-for-a316a3bc.patch Description: Binary data
removing limitations from bitmap index scan
This email has no patch yet -- it's more of a placeholder to gather some issues into one place. During previous work on replacing vacuum's bsearch'd array for TID storage with a radix tree, Andres suggested [1] that the hash table in tidbitmap.c should also be replaced. This will hopefully illuminate how to get there. * Current limitations Page table entries are fixed-size. At PGCon, Mark Dilger and Andres discussed the problem that bitmap scans assume a hard-coded limit on the offset of a TID, one that's particular to heap AM. That's not a requirement of hash tables in general, but that's the current state of the implementation. Using a radix tree would smooth the way to allowing variable-length page table entries. I have briefly talked about it with colleagues offlist as well. The radix tree implementation that Masahiko and I have worked on does still currently assume fixed-sized values (not absolutely, but fixed at compile time for a given use), but last month I did some refactoring that would make variable-sized values fairly straightforward, at least no big challenge. There would of course also be some extra complexity in doing TBM union/intersection operations etc. Recent work I did also went in the direction of storing small-enough values in the last-level pointer, saving memory (as well as time spent accessing it). That seems important, since trees do have some space overhead compared to arrays. * Iteration/ordering There are also now some unpleasant consequences that stem from hashed blocknumbers: - To get them ready for the executor the entries need to be sorted by blocknumber, and "random" is a strenuous sorting case, because of cache misses and branch mispredicts. - Pages get lossified (when necessary) more-or-less at random Radix trees maintain logical ordering, allowing for ordered iteration, so that solves the sorting problem, and should help give a performance boost.
One hurdle is that shared iteration must work so that each worker can have a disjoint subset of the input. The radix tree does have shared memory support, but not yet shared iteration since there hasn't been a concrete use case. Also, DSA has a noticeable performance cost. A good interim development step is to use a local-mem radix tree for the index scan, and then move everything out to the current array for the executor, in shmem if the heap scan will be parallel. (I have coded some steps in that direction, not ready to share.) That keeps that part of the interface the same, simplifying testing. It's possible this much would work even for varlen bitmaps: the iteration array could use a "tall skinny" page table entry format, like { blockno; ; wordnum; bitmapword; } ...which would save space in many cases. Long term, we will want to move to shared memory for the radix tree, at least as a prerequisite for parallel bitmap index scan. The concurrency scheme is likely too coarse to make that worthwhile now, but that will hopefully change at some point. * Possible simplification Some of the above adds complexity, but I see a possible simplification: Many places in tidbitmap.c need to know if we have a single entry, to keep from creating the hash table. That was added before simplehash.h existed. I suspect most of the overhead now in creating the hash table is in zeroing the backing array (correct me if I'm wrong). The radix tree wouldn't do that, but it would create about half a dozen memory contexts, and inserting a single entry would allocate one or two context blocks. Neither of these are free either. If the single-entry case is still worth optimizing, it could be pushed down inside the radix tree as a template option that lazily creates memory contexts etc. * Multiple use Vacuum concerns started this in the first place, so it'll have to be kept in mind as we proceed.
At the very least, vacuum will need a boolean to disallow lossifying pages, but the rest should work about the same. There are some other things left out, like memory management and lossy entries to work out, but this is enough to give a sense of what's involved. [1] https://www.postgresql.org/message-id/20230216164408.bcatntzzxj3jqn3q%40awork3.anarazel.de -- John Naylor EDB: http://www.enterprisedb.com
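The "tall skinny" iteration-array format mentioned above can be illustrated with hypothetical struct layouts (the offsets-per-page count and field widths are invented for the sketch; the real TBM page table entry differs):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;

/* Illustrative current-style entry: one bitmap covering a whole heap
 * page, sized for a hypothetical 256 offsets per page. */
#define WORDS_PER_PAGE (256 / 64)

typedef struct
{
    uint32_t   blockno;
    bitmapword words[WORDS_PER_PAGE];
} WideEntry;

/* The "tall skinny" format: one bitmapword at a time, tagged with
 * which word of the page's bitmap it represents. */
typedef struct
{
    uint32_t   blockno;
    uint16_t   wordnum;
    bitmapword word;
} SkinnyEntry;
```

When most pages set bits in only one or two words, emitting a few skinny entries per page costs less than one wide entry, which is the space saving the email alludes to.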
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 3:04 PM Michael Paquier wrote: > On Wed, Jun 28, 2023 at 12:15:47PM +0800, Japin Li wrote: > > - /* Remove the key if it exists, starting with the > top-level entry */ > > + /* Remove the key if it exists, starting with the > non-top-level entry */ > > key.toplevel = false; > > entry = (pgssEntry *) hash_search(pgss_hash, &key, > HASH_REMOVE, NULL); > > if (entry) /* found */ > Nice catch. That's indeed wrong. Will fix. +1. To nitpick, how about we remove the blank line just before removing the key for top level entry? - /* Also remove entries for top level statements */ + /* Also remove entries if exist for top level statements */ key.toplevel = true; - - /* Remove the key if exists */ entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); Thanks Richard
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 12:15:47PM +0800, Japin Li wrote: > - /* Remove the key if it exists, starting with the top-level > entry */ > + /* Remove the key if it exists, starting with the non-top-level > entry */ > key.toplevel = false; > entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, > NULL); > if (entry) /* found */ Nice catch. That's indeed wrong. Will fix. -- Michael signature.asc Description: PGP signature
Re: Assert !bms_overlap(joinrel->relids, required_outer)
On Wed, Jun 28, 2023 at 6:28 AM Tom Lane wrote: > For a real fix, I'm inclined to extend the loop that calculates > param_source_rels (in add_paths_to_joinrel) so that it also tracks > a set of incompatible relids that *must not* be present in the > parameterization of a proposed path. This would basically include > OJ relids of OJs that partially overlap the target joinrel; maybe > we should also include the min RHS of such OJs. Then we could > check that in try_nestloop_path. I've not tried to code this yet. I went ahead and drafted a patch based on this idea. A few differences: * You mentioned that the incompatible relids might need to also include the min_righthand of the OJs that are part of the target joinrel. It seems to me that we may need to also include the min_lefthand of such OJs, because the parameterization of any proposed join path for the target joinrel should not overlap anything in an OJ if the OJ is part of this joinrel. * I think we need to check the incompatible relids also in try_hashjoin_path and try_mergejoin_path besides try_nestloop_path. Thanks Richard v1-0001-Draft-Avoid-invalid-parameterized-path-for-joinrel.patch Description: Binary data
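The incompatible-relids guard described above ultimately reduces to a Bitmapset overlap test. A toy one-word sketch (the real Bitmapset is variable-length, and the names below are illustrative, not the planner's API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy one-word Bitmapset: relids 0..63 */
typedef uint64_t Relids;

static Relids
add_member(Relids s, int relid)
{
    return s | ((Relids) 1 << relid);
}

static bool
overlaps(Relids a, Relids b)
{
    return (a & b) != 0;
}

/* The guard sketched for try_nestloop_path() and friends: reject a
 * proposed path whose parameterization (required_outer) overlaps the
 * incompatible set — OJ relids, and possibly their min LHS/RHS —
 * computed alongside param_source_rels in add_paths_to_joinrel(). */
static bool
param_is_allowed(Relids required_outer, Relids incompatible_relids)
{
    return !overlaps(required_outer, incompatible_relids);
}
```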
Re: Synchronizing slots from primary to standby
Hi, On 6/26/23 12:34 PM, Amit Kapila wrote: On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand wrote: On 6/20/23 12:22 PM, Amit Kapila wrote: On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand wrote: In such a case (slot valid on the primary but invalidated on the standby) then I think we could drop and recreate the invalidated slot on the standby. Will it be safe? Because after recreating the slot, it will reserve the new WAL location and build the snapshot based on that which might miss some important information in the snapshot. For example, to update the slot's position with new information from the primary, the patch uses pg_logical_replication_slot_advance() which means it will process all records and update the snapshot via DecodeCommit->SnapBuildCommitTxn(). Your concern is that the slot could have been consumed on the standby? I mean, if we suppose the "synchronized" slot can't be consumed on the standby then drop/recreate such an invalidated slot would be ok? That also may not be sufficient because as soon as the slot is invalidated/dropped, the required WAL could be removed on standby. Yeah, I think once the slot is dropped we just have to wait for the slot to be re-created on the standby according to the new synchronize_slot_names GUC. Assuming the initial slot "creation" on the standby (coming from the synchronize_slot_names usage) is working "correctly" then it should also work "correctly" once the slot is dropped. If we agree that a synchronized slot can not/should not be consumed (will implement this behavior) then I think the proposed scenario above should make sense, do you agree? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Do we want a hashset type?
On Tue, Jun 27, 2023 at 4:27 PM Joel Jacobson wrote: > > On Tue, Jun 27, 2023, at 04:35, jian he wrote: > > in SQLMultiSets.pdf(previously thread) I found a related explanation > > on page 45, 46. > > /home/jian/hashset/0001-make-int4hashset_contains-strict-and-header-file-change.patch > > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END) > > > > CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END > > > > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END) > > Thanks! This was exactly what I was looking for, I knew I've seen it but > failed to find it. > > Attached is a new incremental patch as well as a full patch, since this is a > substantial change: > > Align null semantics with SQL:2023 array and multiset standards > > * Introduced a new boolean field, null_element, in the int4hashset_t type. > > * Rename hashset_count() to hashset_cardinality(). > > * Rename hashset_merge() to hashset_union(). > > * Rename hashset_equals() to hashset_eq(). > > * Rename hashset_neq() to hashset_ne(). > > * Add hashset_to_sorted_array(). > > * Handle null semantics to work as in arrays and multisets. > > * Update int4hashset_add() to allow creating a new set if none exists. > > * Use more portable int32 typedef instead of int32_t. > > This also adds a thorough test suite in array-and-multiset-semantics.sql, > which aims to test all relevant combinations of operations and values. 
> > Makefile | 2 +- > README.md | 6 ++-- > hashset--0.0.1.sql | 37 +++- > hashset-api.c | 208 > ++-- > hashset.c | 12 ++- > hashset.h | 11 +++--- > test/expected/array-and-multiset-semantics.out | 365 > ++ > test/expected/basic.out| 12 +++ > test/expected/reported_bugs.out| 6 ++-- > test/expected/strict.out | 114 > > test/expected/table.out| 8 ++--- > test/sql/array-and-multiset-semantics.sql | 232 > + > test/sql/basic.sql | 4 +-- > test/sql/benchmark.sql | 14 > test/sql/reported_bugs.sql | 6 ++-- > test/sql/strict.sql| 32 - > test/sql/table.sql | 2 +- > 17 files changed, 823 insertions(+), 248 deletions(-) > > /Joel Hi there. I changed the function hashset_contains to strict. also change the way to return an empty array. in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that int4hashset can speed distinct aggregate and distinct counts? like the following: explain(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 explain(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 explain(costs off,timing off, analyze,buffers) select count(distinct rnd) from benchmark_input_100k \watch c=3 explain(costs off,timing off, analyze,buffers) SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_100k) sub(x) \watch c=3 From 9030adbf9e46f66812fb11849c367bbcf5b3a427 Mon Sep 17 00:00:00 2001 From: pgaddict Date: Wed, 28 Jun 2023 14:09:58 +0800 Subject: [PATCH] make int4hashset_contains strict and header file changes. 
--- hashset--0.0.1.sql | 2 +- hashset-api.c | 20 2 files changed, 5 insertions(+), 17 deletions(-) diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql index d0478ce9..d448ee69 100644 --- a/hashset--0.0.1.sql +++ b/hashset--0.0.1.sql @@ -55,7 +55,7 @@ LANGUAGE C IMMUTABLE; CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int) RETURNS boolean AS 'hashset', 'int4hashset_contains' -LANGUAGE C IMMUTABLE; +LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset) RETURNS int4hashset diff --git a/hashset-api.c b/hashset-api.c index a4beef4e..ff948b55
Re: sslinfo extension - add notbefore and notafter timestamps
> On 23 Jun 2023, at 22:10, Cary Huang wrote: >> Off the cuff that doesn't seem like a bad idea, but I wonder if we should add >> them to pg_stat_ssl (or both) instead if we deem them valuable? > > I think the same information should be available to pg_stat_ssl as well. > pg_stat_ssl can show the client certificate information for all connecting > clients, and having it show the not_before and not_after timestamps of every > client is important in my opinion. The attached patch > "v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch" adds this > support This needs to adjust the tests in src/test/ssl, which now fail due to SELECT * returning a row which doesn't match what the test was coded for. >> Re the patch, it would be nice to move the logic in ssl_client_get_notafter >> and >> the _notbefore counterpart to a static function since they are copies of >> each other. > > Yes, agreed. I have made the adjustment, attached as > "v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch" The new patchset isn't updating contrib/sslinfo/meson with the 1.3 update so it fails to build with Meson. -- Daniel Gustafsson
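The suggested refactoring — one static helper shared by the notbefore/notafter entry points — can be sketched like this (the `ClientCert` struct and string return type are stand-ins; the real sslinfo code works with OpenSSL X509 objects and returns timestamps):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical decoded certificate holding both validity bounds */
typedef struct
{
    const char *not_before;
    const char *not_after;
} ClientCert;

/* One static helper doing the shared work... */
static const char *
cert_validity(const ClientCert *cert, bool notafter)
{
    return notafter ? cert->not_after : cert->not_before;
}

/* ...wrapped by the two exported entry points, so the conversion
 * logic lives in exactly one place */
const char *
ssl_client_get_notbefore(const ClientCert *cert)
{
    return cert_validity(cert, false);
}

const char *
ssl_client_get_notafter(const ClientCert *cert)
{
    return cert_validity(cert, true);
}
```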
Re: Assistance Needed: Issue with pg_upgrade and --link option
On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > I was under the impression that the --link option would create hard links > between the > old and new cluster's data files, but it appears that the entire old cluster > data was > copied to the new cluster, resulting in a significant increase in the new > cluster's size. Please provide some numbers, ideally du -sk Yours, Laurenz Albe
Assistance Needed: Issue with pg_upgrade and --link option
Dear Postgres Hackers, I hope this email finds you well. I am currently facing an issue while performing an upgrade using the pg_upgrade utility with the --link option. I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Here are the details of my scenario: - PostgreSQL version: [Old Version: Postgres 11.4 | New Version: Postgres 14.0] - Command used for pg_upgrade: [~/pg_upgrade_testing/postgres_14/bin/pg_upgrade -b ~/pg_upgrade_testing/postgres_11.4/bin -B ~/pg_upgrade_testing/postgres_14/bin -d ~/pg_upgrade_testing/postgres_11.4/replica_db2 -D ~/pg_upgrade_testing/postgres_14/new_pg -r -k - Paths to the old and new data directories: [~/pg_upgrade_testing/postgres_11.4/replica_db2] [~/pg_upgrade_testing/postgres_14/new_pg] - OS information: [Ubuntu 22.04.2 linux] However, after executing the pg_upgrade command with the --link option, I observed that the size of the new cluster is much larger than expected. I expected the --link option to create hard links instead of duplicating the data files. I am seeking assistance to understand the following: 1. Is my understanding of the --link option correct? 2. Is there any additional configuration or step required to properly utilize the --link option? 3. Are there any limitations or considerations specific to my PostgreSQL version or file system that I should be aware of? Any guidance, clarification, or troubleshooting steps you can provide would be greatly appreciated. I want to ensure that I am utilizing the --link option correctly and optimize the upgrade process. Best regards, Pradeep Kumar
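For what it's worth, hard links can be verified directly: two names for the same inode mean the data exists once on disk. A small POSIX sketch (the file names are arbitrary):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file, hard-link it under a second name (what pg_upgrade
 * --link does per relation file), and verify both names refer to the
 * same inode: the data exists once on disk, so a correctly linked
 * upgrade should not double the space usage.
 */
static int
check_hardlink(const char *a, const char *b)
{
    struct stat sa, sb;
    int fd = open(a, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, "data", 4) != 4)
        return -1;
    close(fd);

    unlink(b);                  /* ok if b didn't exist yet */
    if (link(a, b) != 0)
        return -1;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;

    /* same inode, link count 2 => one copy of the data */
    return (sa.st_ino == sb.st_ino && sa.st_nlink == 2) ? 0 : 1;
}
```

The same check applies to the upgraded clusters: compare a relation file in both data directories with `ls -li` or `stat`; matching inode numbers mean --link worked. Note also that running `du` on each cluster directory separately counts a hard-linked file once in each directory, which can make the new cluster look like a full copy; passing both directories in a single invocation (`du -sk old_dir new_dir`) counts each inode only once.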