Re: Propagate pathkeys from CTEs up to the outer query
On 29/1/2024 10:18, Richard Guo wrote: In [1] we've reached a conclusion that for a MATERIALIZED CTE it's okay to 'allow our statistics or guesses for the sub-query to subsequently influence what the upper planner does'. Commit f7816aec23 exposes column statistics to the upper planner. In light of that, here is a patch that exposes info about the ordering of the CTE result to the upper planner. This patch was initially posted in that same thread and received some comments from Tom in [2]. The presence of multiple patches in that thread led to confusion, so I'm forking a new thread here specifically dedicated to discussing the patch that exposes pathkeys from CTEs to the upper planner. I like this approach (a sketch of the kind of query it helps is below). It looks good initially, but such features need more opinions/views/time to analyse corner cases. It goes alongside my current back-burner project: pulling parameterisation through the GROUP BY and the query-block fence up to the join-search code of the parent query. -- regards, Andrei Lepikhov Postgres Professional
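A hand-made sketch of the class of queries the patch helps (my own illustration; the table and column names are assumptions, not taken from the patch's regression tests). The CTE result is already ordered by ts, so once its pathkeys are exposed to the upper planner, the outer ORDER BY should not need a second Sort:

CREATE TABLE events (id int, ts timestamptz);
EXPLAIN
WITH w AS MATERIALIZED (
    SELECT * FROM events ORDER BY ts
)
SELECT * FROM w ORDER BY ts;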
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On 30/1/2024 12:44, Ashutosh Bapat wrote: Thanks Vignesh. PFA patches rebased on the latest HEAD. The patch addressing Amit's comments is still a separate patch for him to review. Thanks for this improvement. Working with partitions, I frequently see peaks of memory consumption during planning, so maybe one more such case can be resolved here. Patch 0001 looks good. I'm not sure about free_child_sjinfo_members: do we really need it as a separate routine? It might be better to inline this code. Patch 0002 adds valuable comments, and I'm OK with that. Also, as I remember, some extensions, such as pg_hint_plan, call build_child_join_sjinfo. It is OK to break the interface in a major version, but what if they need the child sjinfo a bit longer and keep links to this structure? I don't think it is a real stopper, but it is worth additional analysis. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 17:03, Andrei Lepikhov wrote: On 13/2/2024 07:00, jian he wrote: The time is the last result of the 10 iterations. I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. Having written that, I left it on the back burner. To close the issue, I explored get_restriction_qual_cost(). A close look shows us that "x IN (..)" is cheaper than its equivalent "x=N1 OR ...". Just the numbers: ANY: startup_cost = 0.0225, per-tuple cost = 0.005; OR: startup_cost = 0, per-tuple cost = 0.0225. The per-tuple part of an expression's cost is charged for every tuple. In your example, we have many tuples, so the low per-tuple cost dominates over the significant startup cost (see the arithmetic below). According to the above, SAOP adds 6250 to the cost of SeqScan; OR adds 13541. So, the total cost of the query with SAOP is less than with OR, and the optimizer doesn't resort to parallel workers. That is the answer. So, this example is more about the subtle balance between parallel/sequential execution, which can vary from one platform to another. -- regards, Andrei Lepikhov Postgres Professional
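A back-of-the-envelope check using the numbers above (my own arithmetic, not from the thread): the full cost of evaluating the filter expression over $N$ tuples is the startup cost plus the per-tuple cost times $N$,

\[ \mathrm{cost}_{\mathrm{ANY}}(N) = 0.0225 + 0.005\,N, \qquad \mathrm{cost}_{\mathrm{OR}}(N) = 0.0225\,N, \]

so the SAOP form wins as soon as $0.0225 + 0.005\,N < 0.0225\,N$, i.e. for $N > 1.3$ or so, which is effectively always on a table of any size.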
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 17:51, Alena Rybakina wrote: On 12.02.2024 12:01, Andrei Lepikhov wrote: On 12/2/2024 15:55, Alena Rybakina wrote: As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictinfo procedure; otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? I got it; I think I was confused by the "else" block, where we can process the index; otherwise we move on to the next element. I think maybe the "else" block, which creates the restrictinfo and builds the indexpaths list, should be moved to a separate function, or the "else" should just be removed? IMO, it is a matter of taste. But if you are really confused, maybe it will make understanding simpler for someone else. So, changed. I think we need to check that rinfo->clause is not empty, because if it is, we can miss calling the build_paths_for_OR function. We should add it there: restriction_is_saop_clause(RestrictInfo *restrictinfo) { if (IsA(restrictinfo->clause, ScalarArrayOpExpr)) I wonder if we should add an assertion here rather than a NULL check. In what case could we get a NULL clause here? But I added it for surety. By the way, I think we need to add a check that the clauseset is not empty (if (!clauseset.nonempty)), otherwise we could get an error. The same check I noticed in the build_paths_for_OR function. I don't think so. Feel free to provide a counterexample. -- regards, Andrei Lepikhov Postgres Professional From 2b088a1bd662c614ee491a797d2fcb67e2fb2391 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. --- src/backend/optimizer/path/indxpath.c | 301 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 689 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..56b04541db 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,174 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. + */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? 
+*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel->indexlist) + { + IndexOptInfo *index = (IndexOptInfo *) lfirst(lc); + + /* Take into consideration partial indexes supporting bitmap scans */ + if (!index->amhasgetbitmap || index->indpred == NIL || index->predOK) + continue; + + pd = palloc0(sizeof(PredicatesData)); + pd->id = foreach_current_index(lc); + /* The trick with reducing recursion is stolen from predicate_implied_by */ + pd->predicate = list_length(index->indpred) == 1 ? + (Node *) linitial(index->indpred) : + (Node *) index->indpred; + predicate_lists = lappend(predicate_lis
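To make the intent of build_paths_for_SAOP above concrete, here is a hand-made illustration of the class of queries it targets (my own sketch; the schema is an assumption, not taken from the patch's regression tests):

CREATE TABLE t (x int);
CREATE INDEX t_small ON t (x) WHERE x < 100;
CREATE INDEX t_large ON t (x) WHERE x >= 100;
-- With the patch, the planner may split the array elements between the
-- two partial indexes and build a BitmapOr over them:
EXPLAIN SELECT * FROM t WHERE x = ANY ('{1, 2, 150}'::int[]);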
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 07:00, jian he wrote: + newa = makeNode(ArrayExpr); + /* array_collid will be set by parse_collate.c */ + newa->element_typeid = scalar_type; + newa->array_typeid = array_type; + newa->multidims = false; + newa->elements = aexprs; + newa->location = -1; I am confused by the comment `array_collid will be set by parse_collate.c`, can you further explain it? I wonder if the second paragraph of comments on commit b310b6e will be enough to dive into details. if the OR expression's right arm is not a plain Const but has a collation specification, e.g. `where a = 'a' collate "C" or a = 'b' collate "C";`, then the rightop is not a Const but a CollateExpr, and it will not be used in the transformation. Yes, it is done for simplicity right now. I'm not sure about corner cases of merging such expressions. set enable_or_transformation to on; explain(timing off, analyze, costs off) select count(*) from test where (x = 1 or x = 2 or x = 3 or x = 4 or x = 5 or x = 6 or x = 7 or x = 8 or x = 9 ) \watch i=0.1 c=10 35.376 ms The time is the last result of the 10 iterations. The reason here is parallel workers. If you look into the plans, you will find parallel workers without the optimization and none with it: Gather (cost=1000.00..28685.37 rows=87037 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: ((x = 1) OR (x = 2) OR (x = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9)) Seq Scan on test (cost=0.02..20440.02 rows=90600 width=12) (actual rows=90363 loops=1) Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) With 90363 tuples actually returned, we estimate 87037 tuples (less precisely) without the transformation and 90600 (more precisely) with it. But if you play with parallel_tuple_cost and parallel_setup_cost, you will end up having these parallel workers: Gather (cost=0.12..11691.03 rows=90600 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) Rows Removed by Filter: 303212 And there is some profit, about 25%, on my laptop (a way to reproduce this is sketched below). I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. -- regards, Andrei Lepikhov Postgres Professional
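The experiment described above can be reproduced like this (my own sketch; the 'test' table is the same hypothetical one used in the quoted example):

-- make parallel plans artificially cheap so the SAOP scan gets workers too
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM test
WHERE x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[]);
RESET parallel_setup_cost;
RESET parallel_tuple_cost;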
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 15:55, Alena Rybakina wrote: Hi! I can't understand this part of the code: /* Time to generate index paths */ MemSet(&clauseset, 0, sizeof(clauseset)); match_clauses_to_index(root, list_make1(rinfo1), index, &clauseset); As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictinfo procedure; otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
Thanks for the review! It was the first version for discussion. Of course, refactoring and polishing cycles will be needed; even so, we can discuss the general idea now. On 10/2/2024 12:00, jian he wrote: On Thu, Feb 8, 2024 at 1:34 PM Andrei Lepikhov wrote: 1235 | PredicatesData *pd; Thanks + if (!predicate_implied_by(index->indpred, list_make1(rinfo1), true)) + elog(ERROR, "Logical mistake in OR <-> ANY transformation code"); the error message seems not clear? Yeah, I have rewritten it. static List * build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, List *other_clauses) I am not sure what `other_clauses` and `rinfo` refer to; adding some comments would be great. struct PredicatesData needs some comments, I think. Added, though not so much. +bool +saop_covered_by_predicates(ScalarArrayOpExpr *saop, List *predicate_lists) +{ + ListCell *lc; + PredIterInfoData clause_info; + bool result = false; + bool isConstArray; + + Assert(IsA(saop, ScalarArrayOpExpr)); is this Assert necessary? Not sure. I moved it into another routine. For the function build_paths_for_SAOP, I think I understand the first part of the code. But I am not 100% sure of the second part of the `foreach(lc, predicate_lists)` code. more comments in `foreach(lc, predicate_lists)` would be helpful. Done do you need to add `PredicatesData` to src/tools/pgindent/typedefs.list? Done I also did some minor refactoring of generate_saop_pathlist. Partially agree. instead of letting it go to `foreach (lc, entries)`, we can reject the Const array at `foreach(lc, expr->args)` Yeah, I just think we can go further and transform two const arrays into a new one if we have the same clause and operator. In that case, we should allow it to pass through this cycle down to the classification stage. also `foreach(lc, expr->args)` do we need to reject cases like `contain_subplans((Node *) nconst_expr)`? maybe requiring nconst_expr to be a Var node would be far easier. It's contradictory: on the one hand, we simplify the comparison procedure and can avoid expression jumbling at all; on the other hand, we restrict the feature. IMO, it would be better to unite clauses like complex_clause1 IN (..) OR complex_clause1 IN (..) into complex_clause1 IN (.., ..) and not do the duplicated work of computing complex clauses twice. In the attachment is the next version of the second patch. -- regards, Andrei Lepikhov Postgres Professional From c1e5fc35778acd01cca67e63fdf80c5605af03f2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. 
--- src/backend/optimizer/path/indxpath.c | 304 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 692 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..5383cb76dc 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,177 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. + */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? +*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel-&g
Re: pg_stat_advisor extension
On 6/2/2024 22:27, Ilia Evdokimov wrote: I welcome your insights, feedback, and evaluations regarding the necessity of integrating this new extension into PostgreSQL. Besides other issues already raised during initial review of the extension, let me emphasize two: 1. In the case of parallel workers, the plan_rows value has different semantics from the predicted number of rows: just explore get_parallel_divisor() (the relationship is sketched below). 2. The extension recommends new statistics immediately upon finding an estimation error. But what if the reason for the error is stale statistics? Or the error may be raised for only one specific set of constants, while estimation is done well in the other 99.9% of cases for the same expression. Regarding point 2, it might make sense to collect and track the clause combinations and cardinality errors found, and let the DBA make decisions on their own. -- regards, Andrei Lepikhov Postgres Professional
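For reference, the relationship get_parallel_divisor() implements, as I read costsize.c (paraphrased, assuming the default parallel_leader_participation = on): a partial path's plan_rows is a per-worker estimate,

\[ \mathrm{plan\_rows} \;=\; \frac{\text{estimated total rows}}{\,w + \max(0,\; 1 - 0.3\,w)\,}, \]

where $w$ is the number of planned workers. So an advisor comparing plan_rows against the actual row count must first multiply it back by this divisor.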
Re: POC, WIP: OR-clause support for indexes
On 3/2/2024 02:06, Alena Rybakina wrote: On 01.02.2024 08:00, jian he wrote: I added your code to the patch. Thanks Alena and Jian for the detailed scrutiny! A couple of questions: 1. As I see, transformAExprIn uses the same logic as we invented but allows composite and domain types. Could you add a comment explaining why we forbid row types in general, in contrast to the transformAExprIn routine? 2. Could you provide the tests to check issues covered by the recent (in v.15) changes? Patch 0001-* in the attachment incorporates changes induced by Jian's notes from [1]. Patch 0002-* contains a transformation of the SAOP clause, which allows the optimizer to utilize partial indexes if they cover all values in this array. Also, it is an answer to Alexander's note [2] on performance degradation. This first version may be a bit raw, but I need your opinion: does it resolve the issue? Skimming through the thread, I see that, in general, all issues have been covered for now. Only Robert's note on the lack of documentation still needs to be resolved. [1] https://www.postgresql.org/message-id/CACJufxGXhJ823cdAdp2Ho7qC-HZ3_-dtdj-myaAi_u9RQLn45g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional From 0ac511ab94959e87d1525d5ea8c4855643a6ccbe Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the preliminary stage of optimization when we are still working with the tree expression. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 327 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 ++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 876 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index b5a38aeb21..a07aefc9e5 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5
Re: POC: GROUP BY optimization
On 2/2/2024 11:06, Richard Guo wrote: On Fri, Feb 2, 2024 at 11:32 AM Richard Guo wrote: On Fri, Feb 2, 2024 at 10:02 AM Tom Lane wrote: One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-01%2021%3A28%3A04 and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, it's not a good practice to use pg_class in this place. While looking through the test cases added by this commit, I noticed some other minor issues that are not great. Such as * The table 'btg' is inserted with 10000 tuples, which seems a bit expensive for a test. I don't think we need such a big table to test what we want. * I don't see why we need to manipulate the GUCs max_parallel_workers and max_parallel_workers_per_gather. * I think we'd better write the tests with the keywords being all upper or all lower. A mixed use of upper and lower is not great. Such as in explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w; * Some comments for the test queries are not easy to read. * For this statement CREATE INDEX idx_y_x_z ON btg(y,x,w); I think the index name would cause confusion. It creates an index on columns y, x and w, but the name indicates an index on y, x and z. I'd like to write a draft patch for the fixes. Here is the draft patch that fixes the issues I complained about upthread. I looked through the patch. It looks like it doesn't break anything. Why do you prefer to use count(*) in EXPLAIN instead of a plain targetlist, like "SELECT x,y,..."? Also, according to the test mentioned by Tom: 1. I see that PG uses an IndexScan on (x,y), so column x will already be sorted before the MergeJoin. Why not use an Incremental Sort on (x,z,w) instead of a full sort? 2. For the record: IMO, this test shows the near-future potential of this feature: it is cheaper to use grouping order (w,z) instead of (z,w). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 2/2/2024 09:02, Tom Lane wrote: Alexander Korotkov writes: I'm going to push this if there are no objections. One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04 and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, It is my fault. Please, see in the attachment the patch fixing that. -- regards, Andrei Lepikhov Postgres Professional From 11a049d95ee48e38ad569aab7663d8de91f946ad Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Feb 2024 10:39:55 +0700 Subject: [PATCH] Replace the GROUP-BY optimization test with the same based on something less volatile when the pg_class relation. --- src/test/regress/expected/aggregates.out | 32 +++- src/test/regress/sql/aggregates.sql | 9 +++ 2 files changed, 18 insertions(+), 23 deletions(-) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 7a73c19314..c2e1b8c9ed 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2873,7 +2873,6 @@ SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y; (6 rows) RESET enable_incremental_sort; -DROP TABLE btg; -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -2901,32 +2900,31 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; - QUERY PLAN -- +SELECT b1.x,b1.w FROM btg b1 JOIN btg b2 ON (b1.z=b2.z AND b1.w=b2.w) +GROUP BY b1.x,b1.z,b1.w ORDER BY b1.z, b1.w, b1.x*b1.x; +QUERY PLAN +--- Incremental Sort - Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, ((b1.x * b1.x)) + Presorted Key: b1.z, b1.w -> Group - Group Key: c1.relpages, c1.relname, c1.reltuples + Group Key: b1.z, b1.w, b1.x -> Incremental Sort - Sort Key: c1.relpages, c1.relname, c1.reltuples - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, b1.x + Presorted Key: b1.z, b1.w -> Merge Join - Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + Merge Cond: ((b1.z = b2.z) AND (b1.w = b2.w)) -> Sort - Sort Key: c1.relpages, c1.relname - -> Seq Scan on pg_class c1 + Sort Key: b1.z, b1.w + -> Seq Scan on btg b1 -> Sort - Sort Key: c2.relpages, c2.relname - -> Seq Scan on pg_class c2 + Sort Key: b2.z, b2.w + -> Seq Scan on btg b2 (16 rows) RESET enable_hashjoin; RESET enable_nestloop; +DROP TABLE btg; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 916dbf908f..3548fbb8db 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1229,8 +1229,6 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT y,x,array_agg(distinct w) FROM 
btg WHERE y < 0 GROUP BY x,y; RESET enable_incremental_sort; -DROP TABLE btg; - -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -1245,13 +1243,12 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.r
Re: POC: GROUP BY optimization
Just forgotten second attachment. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 1095b73dac..b612420547 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -432,6 +432,21 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +static bool +duplicated_pathkey_combination(List *infos, List *pathkeys) +{ + ListCell *lc; + + foreach (lc, infos) + { + PathKeyInfo *info = lfirst_node(PathKeyInfo, lc); + + if (compare_pathkeys(pathkeys, info->pathkeys) == PATHKEYS_EQUAL) + return true; + } + return false; +} + /* * get_useful_group_keys_orderings * Determine which orderings of GROUP BY keys are potentially interesting. @@ -491,7 +506,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) if (n > 0 && (enable_incremental_sort || n == root->num_groupby_pathkeys) && - compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; @@ -514,8 +529,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(root->sort_pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == list_length(root->sort_pathkeys)) && + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys;
Re: POC: GROUP BY optimization
On 16/1/2024 22:05, Alexander Korotkov wrote: On Tue, Jan 16, 2024 at 4:48 AM Richard Guo wrote: * When trying to match the ordering of GROUP BY to that of ORDER BY in get_useful_group_keys_orderings, this patch checks against the length of path->pathkeys. This does not make sense. I guess this is a typo and the real intention is to check against root->sort_pathkeys. Thanks! It is really my blunder; a fresh look helps. --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) root->num_groupby_pathkeys); if (n > 0 && - (enable_incremental_sort || n == list_length(path->pathkeys))) + (enable_incremental_sort || n == list_length(root->sort_pathkeys))) However, I think this is also incorrect. When incremental sort is disabled, it is reasonable to consider reordering the GROUP BY keys only if the number of matching pathkeys is equal to the length of root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'. Hmm... Why should this be list_length(root->group_pathkeys) while we're targeting root->sort_pathkeys? I have changed that to list_length(root->sort_pathkeys). I think, in the first case, when we are trying to arrange group-by keys according to the underlying pathkeys with incremental sort = off, it makes sense to do so only if we can fetch all group-by keys, regardless of whether the incoming path contains more pathkeys or an equal number. The code and test case are in step1.txt. * When the original ordering of GROUP BY keys matches the path's pathkeys or ORDER BY keys, get_useful_group_keys_orderings would generate duplicate PathKeyInfos. Consequently, this duplication would lead to the creation and addition of identical paths multiple times. This is not great. Consider the query below: create table t (a int, b int); create index on t (a, b); set enable_hashagg to off; explain select count(*) from t group by a, b; QUERY PLAN -- GroupAggregate (cost=0.15..97.27 rows=226 width=16) Group Key: a, b -> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8) (3 rows) The same path with group keys {a, b} is created and added twice. I tried to address that. * Part of the work performed in this patch overlaps with that of preprocess_groupclause. They are both trying to adjust the ordering of the GROUP BY keys to match ORDER BY. I wonder if it would be better to perform this work only once. Andrei, could you take a look? As I see, the PathKeyInfo list often contains duplicated pathkeys, coming from either the sort_pathkeys or the path->pathkeys orderings. So, I propose to check for duplicates each time (see step2.txt in the attachment). * When reordering the GROUP BY keys, I think the following checks need some improvements. + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. +*/ + sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, + *group_clauses); + if (!sgc) + /* The grouping clause does not cover this pathkey */ + break; I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is valid before calling get_sortgroupref_clause_noerr with it. Also, how can we guarantee that the returned GROUP BY item is sortable? Should we check that with OidIsValid(sgc->sortop)? Added. Reviewed it, looks good. 
-- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 919d54dd79..1095b73dac 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -489,8 +489,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) n = group_keys_reorder_by_pathkeys(path->pathkeys, , , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(path->pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == root->num_groupby_pathkeys) && + compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; diff --git a/src/test/regress/expected/ag
Re: POC: GROUP BY optimization
On 15/1/2024 13:42, Richard Guo wrote: On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov wrote: Thank you for providing the test case relevant for this code change. The revised patch incorporating this change is attached. Now the patchset looks good to me. I'm going to push it if there are no objections. Seems I'm late for the party. Can we hold for several more days? I'd like to have a review on this patch. Get on board! It looks like this feature needs as much review as possible (like SJE). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 13/1/2024 22:00, Alexander Korotkov wrote: On Sat, Jan 13, 2024 at 11:09 AM Andrei Lepikhov wrote: On 11/1/2024 18:30, Alexander Korotkov wrote: On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. Makes sense. I've incorporated your changes into the patchset. One more improvement. To underpin the code change: - return cur_ec; /* Match! */ + { + /* +* Match! +* +* Copy the sortref if it wasn't set yet. That may happen if +* the ec was constructed from WHERE clause, i.e. it doesn't +* have a target reference at all. +*/ + if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; + return cur_ec; + } I propose the test (see attachment). It shows why we introduce this change: GROUP-BY should juggle not only pathkeys generated by explicit sort operators but also planner-made ones, as in this example, produced by a MergeJoin. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 0d46e096e5..ca38e78f21 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2879,6 +2879,37 @@ FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; (10 rows) DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. +SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; + QUERY PLAN +- + Incremental Sort + Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) + Presorted Key: c1.relpages, c1.relname + -> Group + Group Key: c1.relpages, c1.relname, c1.reltuples + -> Incremental Sort + Sort Key: c1.relpages, c1.relname, c1.reltuples + Presorted Key: c1.relpages, c1.relname + -> Merge Join + Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + -> Sort + Sort Key: c1.relpages, c1.relname + -> Seq Scan on pg_class c1 + -> Sort + Sort Key: c2.relpages, c2.relname + -> Seq Scan on pg_class c2 +(16 rows) + +RESET enable_hashjoin; +RESET enable_nestloop; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index f99167ac9e..cf87b5d5dd 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1233,6 +1233,18 @@ SELECT array_agg(c1 ORDER BY c2),c2 FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. 
+SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; +RESET enable_hashjoin; +RESET enable_nestloop; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: POC: GROUP BY optimization
On 11/1/2024 18:30, Alexander Korotkov wrote: Hi! On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index f4b7dcac21..5aac6d6677 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -397,6 +397,11 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, !list_member_ptr(*group_pathkeys, pathkey)) break; + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. +*/ sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, *group_clauses); if (!sgc) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 423c8ec3b6..0d46e096e5 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2857,6 +2857,28 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; (8 rows) DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; + QUERY PLAN + + Sort + Sort Key: c2 + -> GroupAggregate + Group Key: c1 + -> Sort + Sort Key: c1, c2 + -> Bitmap Heap Scan on t1 + Recheck Cond: (c2 < 100) + -> Bitmap Index Scan on t1_c2_idx + Index Cond: (c2 < 100) +(10 rows) + +DROP TABLE t1 CASCADE; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index b9fcceedd7..f99167ac9e 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1224,6 +1224,15 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; +DROP TABLE t1 CASCADE; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: introduce dynamic shared memory registry
On 9/1/2024 00:16, Nathan Bossart wrote: On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote: 1. I think we need to add some notes about this new way of getting shared memory for external modules in the Shared Memory and LWLocks section in xfunc.sgml? This will at least tell there's another way for external modules to get shared memory, not just with the shmem_request_hook and shmem_startup_hook. What do you think? +1. Maybe even more: in the section related to extensions, this approach to using shared data could be mentioned, too. 2. FWIW, I'd like to call this whole feature "Support for named DSM segments in Postgres". Do you see anything wrong with this? Why do you feel it should be renamed? I don't see anything wrong with it, but I also don't see any particular advantage with that name compared to "dynamic shared memory registry." It is not a big issue, I suppose. But for me personally (as a non-native English speaker), the label "Named DSM segments" seems more straightforward to understand. 3. IIUC, this feature eventually makes both shmem_request_hook and shmem_startup_hook pointless, no? Or put another way, what's the significance of shmem request and startup hooks in lieu of this new feature? I think it's quite possible to get rid of the shmem request and startup hooks (of course, not now but at some point in the future, to not break the external modules), because all the external modules can allocate and initialize the same shared memory via dsm_registry_init_or_attach and its init_callback. All the external modules will then need to call dsm_registry_init_or_attach in their _PG_init callbacks and/or in their bg worker's main functions in case the modules intend to start up bg workers. Am I right? Well, modules might need to do a number of other things (e.g., adding hooks) that can presently only be done when preloaded, in which case I doubt there's much benefit from switching to the DSM registry. I don't really intend for it to replace the existing request/startup hooks, but you're probably right that most, if not all, could use the registry instead. IMHO this is well beyond the scope of this thread, though. +1, there may be many reasons to use these hooks. >> 3. Use NAMEDATALEN instead of 64? >> + char key[64]; > I kept this the same, as I didn't see any need to tie the key size to > NAMEDATALEN. IMO, we should avoid magic numbers whenever possible. The current logic, according to which the first users of this feature will be extensions, naturally bonds this size to the size of the 'name' type. And one more point: I think the commit already deserves a more detailed commit message. -- regards, Andrei Lepikhov Postgres Professional
Re: Custom explain options
On 10/1/2024 20:27, Konstantin Knizhnik wrote: On 10/01/2024 8:46 am, Michael Paquier wrote: On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote: What do you think about this really useful feature? Do you wish to develop it further? I am biased here. This seems like a lot of code for something we've been delegating to the explain hook for ages. Even if I can see the appeal of pushing that more into explain.c to get more data on a per-node basis depending on the custom options given by the caller of an EXPLAIN entry point, I cannot get really excited about the extra maintenance this facility would involve compared to the potential gains, knowing that there's a hook. -- Michael Well, I am not sure that the proposed patch is flexible enough to handle all possible scenarios. I just wanted to make it as simple as possible to leave some chance for it to be merged. But it is easy to answer the question of why the existing explain hook is not enough: 1. It doesn't allow adding extra options to EXPLAIN. My intention was to be able to do something like "explain (analyze,buffers,prefetch) ..." (see the sketch below). That is completely impossible with the explain hook. I agree. Designing mostly planner-related extensions, I also wanted to add some information to the explain output of nodes. For example, pg_query_state could add the status of the node at the time execution was interrupted: started, stopped, or loop closed. Maybe we should gather some statistics on how developers of extensions deal with that issue ... -- regards, Andrei Lepikhov Postgres Professional
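For the record, here is the shape of the syntax Konstantin is after (the option name 'prefetch' is his example, not an existing option):

-- what the patch would let an extension register:
EXPLAIN (ANALYZE, BUFFERS, prefetch) SELECT * FROM t;
-- stock PostgreSQL rejects this:
-- ERROR:  unrecognized EXPLAIN option "prefetch"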
Re: Multidimensional Histograms
On 8/1/2024 16:21, Alexander Cheshev wrote: Hi Andrei, Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. An Equi-Depth Histogram in the 6-dimensional case requires 6 times more iterations. Postgres already uses the Equi-Depth Histogram. Even if you increase the number of buckets from 100 to 1000, there will be no overhead, as the time complexity of the Equi-Depth Histogram does not depend on the number of buckets. So, no overhead at all! Maybe. For three columns, we have 9 combinations (passes) for building dependency statistics and 4 combinations for ndistincts; for six columns, we have 186 and 57 combinations respectively (the combinatorics are sketched below). Even remembering that a dependency is just one number for one combination, building the dependency statistics is still massive work. So, in the multicolumn case, having something like a histogram may be more effective. -- regards, Andrei Lepikhov Postgres Professional
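A quick check of those combination counts (my own arithmetic, not from the thread): for $n$ columns, the dependencies statistic considers every (determinant set, dependent column) pair, and ndistinct considers every subset of at least two columns:

\[ N_{\mathrm{dep}}(n) = \sum_{k=1}^{n-1} \binom{n}{k}\,(n-k), \qquad N_{\mathrm{ndistinct}}(n) = 2^{n} - n - 1. \]

This gives $N_{\mathrm{dep}}(3) = 9$ and $N_{\mathrm{ndistinct}}(3) = 4$, versus $N_{\mathrm{dep}}(6) = 186$ and $N_{\mathrm{ndistinct}}(6) = 57$, matching the numbers above.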
Re: Custom explain options
On 30/11/2023 22:40, Konstantin Knizhnik wrote: In all these cases we are using an array of `Instrumentation`, and if it contains a varying part, then it is not clear where to place it. Yes, there is also code which serializes and sends instrumentation between worker processes, and I have updated this code in my PR to send the actual amount of custom instrumentation data. But it cannot help with the cases above. What do you think about this really useful feature? Do you wish to develop it further? -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 9/1/2024 16:45, vignesh C wrote: On Tue, 9 Jan 2024 at 14:31, Andrei Lepikhov wrote: Here is a new version of GROUP-BY optimization without sort model. On 21/12/2023 17:53, Alexander Korotkov wrote: I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Done. See patch 0001-*. Unfortunately, extraction of whole cycle isn't practical, because it blows out the interface of the routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially patching every index. This should give us an understanding on whether we need a separate GUC to control this feature. In current implementation I don't anticipate any significant overhead. GUC is needed here to allow users adhere their own ordering and to disable feature in the case of problems. 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only full match of group-by output to the required order. Hm, is it really make sense in current implementation? CFBot shows the following errors at [1] with: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: In function ‘estimate_num_groups’: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: warning: implicit declaration of function ‘estimate_num_groups_incremental’ [-Wimplicit-function-declaration] [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: At top level: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: warning: no previous prototype for ‘estimate_num_groups_incremental’ [-Wmissing-prototypes] [08:33:28.813] 3400 | estimate_num_groups_incremental(PlannerInfo *root, List *groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: error: conflicting types for ‘estimate_num_groups_incremental’ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: note: previous implicit declaration of ‘estimate_num_groups_incremental’ was here [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. -- regards, Andrei Lepikhov Postgres Professional From 45cfff5731b81c0df2af00f5e4212fc598f6a231 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 9 Jan 2024 12:32:15 +0700 Subject: [PATCH 2/2] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or sort order, which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative ordering of the keys, trying to find a cheaper one. 
The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. --- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 222 ++ src/backend/optimizer/plan/planner.c | 214 +++-- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/paths.h | 3 + src/test/regress/expected/aggregates.out | 132 +++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/aggregates.sql | 47 10 files changed, 586 insertions(+), 69 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index e86dfeaecd..7dd14d0a43 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652
Re: POC: GROUP BY optimization
Here is a new version of the GROUP-BY optimization without the sort model. On 21/12/2023 17:53, Alexander Korotkov wrote: I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Done. See patch 0001-*. Unfortunately, extraction of the whole cycle isn't practical, because it would bloat the interface of the routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially matching every index. This should give us an understanding on whether we need a separate GUC to control this feature. In the current implementation, I don't anticipate any significant overhead. The GUC is needed here to let users keep their own ordering and to disable the feature in case of problems. 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only full match of group-by output to the required order. Hm, does it really make sense in the current implementation? -- regards, Andrei Lepikhov Postgres Professional From ea068e221a754a463142575f6e0360d3118475f8 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 9 Jan 2024 12:00:27 +0700 Subject: [PATCH 1/2] Generalize common code of adding sort before generation of grouping paths. --- src/backend/optimizer/plan/planner.c | 209 --- 1 file changed, 61 insertions(+), 148 deletions(-) diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 667723b675..fcf647940e 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -6809,6 +6809,52 @@ done: return parallel_workers; } +static Path * +try_presorting(PlannerInfo *root, RelOptInfo *grouped_rel, + Path *path, Path *cheapest_path) +{ + bool is_sorted; + int presorted_keys; + + is_sorted = pathkeys_count_contained_in(root->group_pathkeys, + path->pathkeys, + &presorted_keys); + + if (!is_sorted) + { + /* +* Try at least sorting the cheapest path and also try +* incrementally sorting any path which is partially sorted +* already (no need to deal with paths which have presorted +* keys when incremental sort is disabled unless it's the +* cheapest input path). +*/ + if (path != cheapest_path && + (presorted_keys == 0 || !enable_incremental_sort)) + return NULL; + + /* +* We've no need to consider both a sort and incremental sort. +* We'll just do a sort if there are no presorted keys and an +* incremental sort when there are presorted keys. +*/ + if (presorted_keys == 0 || !enable_incremental_sort) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); + else + path = (Path *) create_incremental_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + presorted_keys, + -1.0); + } + return path; +} + /* * add_paths_to_grouping_rel * @@ -6840,45 +6886,11 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel, foreach(lc, input_rel->pathlist) {
Re: Multidimensional Histograms
On 8/1/2024 01:36, Tomas Vondra wrote: On 1/7/24 18:26, Andrei Lepikhov wrote: On 7/1/2024 17:51, Tomas Vondra wrote: On 1/7/24 11:22, Andrei Lepikhov wrote: On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram, or just a histogram, which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics machinery to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats No idea, I haven't thought about that very much. Presumably the existing hooks are insufficient because they're per-attnum? I guess it would make sense to have a hook for all the attnums of the relation, but I'm not sure it'd be enough to introduce a new extended statistics kind ... I got stuck on the same problem Alexander mentioned: we usually have large tables with many uniformly distributed values. In this case, MCV doesn't help a lot. Usually, I face problems scanning a table with a filter containing 3-6 ANDed quals. Here, Postgres multiplies selectivities and ends up with a selectivity of less than one tuple. But such scans, in reality, mostly have some physical sense and return a bunch of tuples. It looks like the set of columns represents some value of a composite type. Understood. That's a fairly common scenario, and it can easily end up with a rather terrible plan due to the underestimate. Sometimes extended statistics on dependencies help well, but they are expensive for multiple columns. Expensive in what sense? Also, how would a multidimensional histogram be any cheaper? Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. And sometimes I see that even a trivial histogram on a ROW(x1,x2,...) could predict a much more adequate value (a kind of conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND ..." if somewhere in an extension we transformed it to ROW(x1,x2,...) = ROW(N1, N2,...). Are you guessing the histogram would help, or do you know it'd help? I have no problem believing that for range queries, but I think it's far less clear for simple equalities. I'm not sure histograms would work for that ... I added Teodor Sigaev to the CC of this email; he has much more user feedback on this technique. As I see it, having a histogram over a set of columns gives us an upper-bound selectivity estimate for the filter. I'm not sure how good a simple histogram is in that case either, but in my practice it works, preventing overly optimistic query plans. 
Maybe it'd be possible to track more stuff for each bucket, not just the frequency, but also ndistinct for combinations of columns, and then use that to do better equality estimates. Or maybe we could see how the "expected" and "actual" bucket frequency compare, and use that to correct the equality estimate? Not sure. Yes, that is on my mind, too. Having such experimental stuff as extensions on GitHub, we could get some community feedback. But perhaps you have some data to demonstrate it'd actually help? That should be redirected to Teodor, but I will consider translating messy real-life reports into a clear example. For such cases we don't have an in-core solution, and introducing a hook on clause list estimation (paired with maybe a hook on statistics generation) could help invent an extension that would deal with that problem. Also, it would open a way for experiments with different types of extended statistics ... I really don't know how to do that, or what it would take. AFAICS the statistics system is not very extensible from external code. Even if we could add new types of attribute/extended stats, I'm not sure the code calculating the estimates could use that. I imagine we can add an analysis routine and directly store statistics in an extension for demonstration and experimental purposes, no problem. -- regards, Andrei Lepikhov Postgres Professional
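For reference, the underestimation pattern discussed above is easy to reproduce; a minimal sketch with perfectly correlated columns (data invented for illustration):

CREATE TABLE points (x1 int, x2 int, x3 int);
INSERT INTO points SELECT i % 100, i % 100, i % 100 FROM generate_series(1, 100000) AS i;
ANALYZE points;
-- Each qual alone has selectivity ~1/100. Multiplying them gives
-- 100000 * (1/100)^3 = 0.1 rows, clamped to 1, while the scan actually
-- returns 1000 tuples.
EXPLAIN ANALYZE SELECT * FROM points WHERE x1 = 0 AND x2 = 0 AND x3 = 0;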
Re: Multidimensional Histograms
On 7/1/2024 17:51, Tomas Vondra wrote: On 1/7/24 11:22, Andrei Lepikhov wrote: On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram or just a histogram which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats No idea, I haven't thought about that very much. Presumably the existing hooks are insufficient because they're per-attnum? I guess it would make sense to have a hook for all the attnums of the relation, but I'm not sure it'd be enough to introduce a new extended statistics kind ... I got stuck on the same problem Alexander mentioned: we usually have large tables with many uniformly distributed values. In this case, MCV doesn't help a lot. Usually, I face problems scanning a table with a filter containing 3-6 ANDed quals. Here, Postgres multiplies selectivities and ends up with a selectivity of less than one tuple. But such scans, in reality, mostly have some physical sense and return a bunch of tuples. It looks like the set of columns represents some value of a composite type. Sometimes extended statistics on dependencies help well, but they are expensive for multiple columns. And sometimes I see that even a trivial histogram on a ROW(x1,x2,...) could predict a much more adequate value (a kind of conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) = ROW(N1, N2,...). For such cases we don't have an in-core solution, and introducing a hook on clause list estimation (paired with maybe a hook on statistics generation) could help invent an extension that would deal with that problem. Also, it would open a way for experiments with different types of extended statistics ... -- regards, Andrei Lepikhov Postgres Professional
Re: Multidimensional Histograms
On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram or just a histogram which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 29/12/2023 12:00, Alexander Lakhin wrote: Hi Alexander, 23.10.2023 14:29, Alexander Korotkov wrote: Fixed all of the above. Thank you for catching this! I've discovered that starting from d3d55ce57 the following query: CREATE TABLE t(a int PRIMARY KEY); WITH tt AS (SELECT * FROM t) UPDATE t SET a = tt.a + 1 FROM tt WHERE tt.a = t.a RETURNING t.a; triggers an error "variable not found in subplan target lists". (Commits 8a8ed916f and b5fb6736e don't fix this, unfortunately.) Thanks for the report! The problem is with the resultRelation field: we forgot to replace the relid there. Could you check your issue with the patch in the attachment? Does it resolve this case? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 6c02fe8908..f79c673afd 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -1861,6 +1861,8 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, /* Replace varno in all the query structures */ query_tree_walker(root->parse, replace_varno_walker, &ctx, QTW_EXAMINE_SORTGROUP); + if (root->parse->resultRelation == toRemove->relid) + root->parse->resultRelation = toKeep->relid; /* Replace links in the planner info */ remove_rel_from_query(root, toRemove, toKeep->relid, NULL, NULL); @@ -1870,6 +1872,9 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, toRemove->relid, toKeep->relid); replace_varno((Node *) root->processed_groupClause, toRemove->relid, toKeep->relid); + replace_relid(root->all_result_relids, toRemove->relid, toKeep->relid); + replace_relid(root->leaf_result_relids, toRemove->relid, toKeep->relid); + /* * There may be references to the rel in root->fkey_list, but if so,
Re: POC: GROUP BY optimization
On 28/12/2023 18:29, Alexander Korotkov wrote: On Thu, Dec 28, 2023 at 10:22 AM Andrei Lepikhov wrote: But the arrangement with an ORDER BY clause doesn't work: DROP INDEX abc; explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w); I think the reason is that the sort_pathkeys and group_pathkeys are physically different structures, and we can't just compare pointers here. I haven't yet looked into the code. But this looks strange to me. Somehow, the optimizer currently matches index pathkeys to ORDER BY pathkeys. If GROUP BY pathkeys can be matched to index pathkeys, then it should be possible to match them to ORDER BY pathkeys too. Oh, I found the mistake: I am used to writing GROUP BY and ORDER BY on many columns with round brackets. In the case of the grouping list, it doesn't change anything. But ORDER BY treats such a list as a single ROW() expression and breaks the group-by arrangement. Look:

explain (COSTS OFF) SELECT relname,reltuples FROM pg_class GROUP BY relname,reltuples ORDER BY reltuples,relname;
 Group
   Group Key: reltuples, relname
   ->  Sort
         Sort Key: reltuples, relname
         ->  Seq Scan on pg_class

But:

explain (COSTS OFF) SELECT relname,reltuples FROM pg_class GROUP BY relname,reltuples ORDER BY (reltuples,relname);
 Sort
   Sort Key: (ROW(reltuples, relname))
   ->  Group
         Group Key: relname, reltuples
         ->  Sort
               Sort Key: relname, reltuples
               ->  Seq Scan on pg_class

So, let's continue to work. -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 27/12/2023 12:07, Tom Lane wrote: Andrei Lepikhov writes: To be clear. In [1], I mentioned we can perform micro-benchmarks and structure the costs of operators. At least for fixed-length operators, it is relatively easy. I repeat what I said: this is a fool's errand. You will not get trustworthy results even for the cases you measured, let alone all the rest. I'd go as far as to say I would not believe your microbenchmarks, because they would only apply for one platform, compiler, backend build, phase of the moon, etc. Thanks for the explanation. I removed all cost-related code. It still needs to be finished; I will smooth the code further and rewrite the regression tests - many of them look silly without cost-dependent reorderings. Also, remember Alexander's remarks, which must be implemented, too. But it already works well here. Look:

Preliminaries:
CREATE TABLE t(x int, y int, z text, w int);
INSERT INTO t SELECT gs%100,gs%100, 'abc' || gs%10, gs FROM generate_series(1,1) AS gs;
CREATE INDEX abc ON t(x,y);
ANALYZE t;
SET enable_hashagg = 'off';

This patch eliminates an unneeded Sort operation:
explain SELECT x,y FROM t GROUP BY (x,y);
explain SELECT x,y FROM t GROUP BY (y,x);

Engages incremental sort:
explain SELECT x,y FROM t GROUP BY (x,y,z,w);
explain SELECT x,y FROM t GROUP BY (z,y,w,x);
explain SELECT x,y FROM t GROUP BY (w,z,x,y);
explain SELECT x,y FROM t GROUP BY (w,x,z,y);

Works with subqueries:
explain SELECT x,y FROM (SELECT * FROM t ORDER BY x,y,w,z) AS q1 GROUP BY (w,x,z,y);
explain SELECT x,y FROM (SELECT * FROM t ORDER BY x,y,w,z LIMIT 100) AS q1 GROUP BY (w,x,z,y);

But the arrangement with an ORDER BY clause doesn't work:
DROP INDEX abc;
explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w);

I think the reason is that the sort_pathkeys and group_pathkeys are physically different structures, and we can't just compare pointers here. -- regards, Andrei Lepikhov Postgres Professional
From 6183555c2c56d7cbbc44f213f6c5bc4cbcaef1ec Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Sep 2023 11:20:03 +0700 Subject: [PATCH] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or the sort order which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative orderings of the keys, trying to find a cheaper one. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. 
--- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 224 + src/backend/optimizer/plan/planner.c | 464 ++ src/backend/utils/adt/selfuncs.c | 38 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/paths.h | 3 + src/test/regress/expected/aggregates.out | 235 + .../regress/expected/incremental_sort.out | 20 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/aggregates.sql | 99 src/test/regress/sql/incremental_sort.sql | 2 +- 13 files changed, 912 insertions(+), 210 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 7fa502d6e2..07edd4f38e 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652,7 +652,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) - return cur_ec; /* Match! */ + { + /* +* Match! +* +* Copy the sortref if it wasn't set yet. That may happen if +* the ec was constructed from WHERE clause, i.e. it doesn't +* have a
Re: POC: GROUP BY optimization
On 27/12/2023 11:15, Alexander Korotkov wrote: On Wed, Dec 27, 2023 at 5:23 AM Tom Lane wrote: Alexander Korotkov writes: 2) An accurate estimate of the sorting cost is quite a difficult task. Indeed. What if we make a simple rule of thumb that sorting integers and floats is cheaper than sorting numerics and strings with collation C, which in turn is cheaper than sorting collation-aware strings (probably more groups)? Within a group, we could keep the original order of items. I think it's a fool's errand to even try to separate different sort column orderings by cost. We simply do not have sufficiently accurate cost information. The previous patch in this thread got reverted because of that (well, also some implementation issues, but mostly that), and nothing has happened to make me think that another try will fare any better. To be clear: in [1], I mentioned we can perform micro-benchmarks and structure the costs of operators. At least for fixed-length operators, it is relatively easy. So, the main blocker here is an accurate prediction of ndistinct for different combinations of columns. Does it make sense to continue designing the feature in the direction of enabling the choice between different sort column orderings when we have extended statistics on the columns? [1] https://www.postgresql.org/message-id/e3602ccb-e643-2e79-ed2c-1175a8053...@postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
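As a reference point for that question, the ndistinct kind of extended statistics already stores exactly the per-combination numbers such a reordering would need; a small sketch (table and statistics names invented for illustration):

CREATE TABLE t (a int, b int, c int);
INSERT INTO t SELECT i % 2, i % 1000, i FROM generate_series(1, 100000) AS i;
CREATE STATISTICS t_nd (ndistinct) ON a, b, c FROM t;
ANALYZE t;
-- stxdndistinct holds ndistinct for {a,b}, {a,c}, {b,c} and {a,b,c}
SELECT d.stxdndistinct
FROM pg_statistic_ext e JOIN pg_statistic_ext_data d ON d.stxoid = e.oid
WHERE e.stxname = 't_nd';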
Re: POC: GROUP BY optimization
On 21/12/2023 17:53, Alexander Korotkov wrote: On Sun, Oct 1, 2023 at 11:45 AM Andrei Lepikhov wrote: New version of the patch. Fixed minor inconsistencies and rebased onto current master. Thank you (and other authors) for working on this subject. Indeed, GROUP BY clauses are order-agnostic. Reordering them into the most suitable order could yield significant query planning benefits. I went through the thread: I see significant work has already been done on this patch, and the code is quite polished. Maybe, but the issues mentioned in [1] are still not resolved. That is the only reason this thread hasn't been active. I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Yeah, these parts of the code are a bit different. I will try to make a common routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially matching every index. This should give us an understanding on whether we need a separate GUC to control this feature. OK. 3) I see that get_useful_group_keys_orderings() makes 3 calls to the get_cheapest_group_keys_order() function. Each time, get_cheapest_group_keys_order() performs the cost estimate and reorders the free keys. However, cost estimation implies system catalog lookups (which are quite expensive). I wonder if we could change the algorithm. Could we just sort the group-by keys by cost once, save this ordering and then just re-use it? So, every time we need to reorder a group by, we can just pull the required keys to the top and use the saved ordering for the rest. I also wonder if we could do this once for the add_paths_to_grouping_rel() and create_partial_grouping_paths() calls. So, it probably should be somewhere in create_ordinary_grouping_paths(). Thanks for the idea! 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only a full match of the group-by output to the required order. Oh, this was designed before incremental sort was invented. I will see what we can do here. [1] https://www.postgresql.org/message-id/60610df1-c32f-ebdf-e58c-7a664431f452%40enterprisedb.com -- regards, Andrei Lepikhov Postgres Professional
Specify description of the SpecialJoinInfo structure
Hi, Working on Asymmetric Join, I found a slight inconsistency in the description of SpecialJoinInfo: join type JOIN_ANTI can be accompanied by a zero value of ojrelid if the join was created by the transformation of a NOT EXISTS subquery. -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index ed85dc7414..189b59f127 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -2791,7 +2791,8 @@ typedef struct PlaceHolderVar * * ojrelid is the RT index of the join RTE representing this outer join, * if there is one. It is zero when jointype is INNER or SEMI, and can be - * zero for jointype ANTI (if the join was transformed from a SEMI join). + * zero for jointype ANTI (if the join was transformed from a SEMI join or + * converted from a sublink). * One use for this field is that when constructing the output targetlist of a * join relation that implements this OJ, we add ojrelid to the varnullingrels * and phnullingrels fields of nullable (RHS) output columns, so that the
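For illustration, the case the comment fix covers can be seen with a sublink-generated anti-join (tables invented; the plan shape shown is the typical one):

CREATE TABLE a (x int);
CREATE TABLE b (x int);
EXPLAIN (COSTS OFF)
SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x);
-- Hash Anti Join
--   Hash Cond: (a.x = b.x)
--   ->  Seq Scan on a
--   ->  Hash
--         ->  Seq Scan on b
-- No join RTE exists for this join, so its SpecialJoinInfo carries ojrelid = 0.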
Re: Optimization outcome depends on the index order
On 25/12/2023 18:36, Alexander Korotkov wrote: On Fri, Dec 22, 2023 at 7:24 PM Andrei Lepikhov wrote: On 22/12/2023 11:48, Alexander Korotkov wrote: > Because we must trust all predictions made by the planner, we just > choose the most trustworthy path. According to the planner logic, it is > a path with a smaller selectivity. We can make mistakes anyway just > because of the nature of estimation. Even if we need to take selectivity into account here, it's still not clear why this should be on top of other logic later in add_path(). I got your point now, thanks for pointing it out. In the next version of the patch, selectivity is used as a criterion only in the case of COSTS_EQUAL. It looks better now. But it's hard for me to judge these heuristics in add_path(). Tom, what do you think about this? Just food for thought: another approach I considered was to initialize the relation index list according to some consistent rule: place unique indexes at the head of the list, arrange indexes according to the number of columns involved, and sort in some lexical order. But IMO, the implemented approach looks better. -- regards, Andrei Lepikhov Postgres Professional
Re: Optimization outcome depends on the index order
On 22/12/2023 11:48, Alexander Korotkov wrote: > Because we must trust all predictions made by the planner, we just > choose the most trustworthy path. According to the planner logic, it is > a path with a smaller selectivity. We can make mistakes anyway just > because of the nature of estimation. Even if we need to take selectivity into account here, it's still not clear why this should be on top of other logic later in add_path(). I got your point now, thanks for pointing it out. In the next version of the patch, selectivity is used as a criterion only in the case of COSTS_EQUAL. -- regards, Andrei Lepikhov Postgres Professional
From 45bda9784d28dc9cec90c5b33285023a49850800 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when the optimizer predicts only one row, prefer choosing UNIQUE indexes. In other cases, if the optimizer treats indexes as equal, make a last attempt by selecting the index with the smaller selectivity - this decision removes the dependency on the order of indexes in the index list (good for reproducing some issues) and provides one more objective argument for choosing a specific index. --- src/backend/optimizer/util/pathnode.c | 32 +++ src/test/regress/expected/functional_deps.out | 39 +++ src/test/regress/sql/functional_deps.sql | 32 +++ 3 files changed, 103 insertions(+) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..984c974b57 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,38 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (costcmp == COSTS_EQUAL) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. +*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity. +* This option makes the decision more conservative and looks +* debatable. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + } + /* * If the two paths compare differently for startup and total cost, * then we want to keep both, and we can skip comparing pathkeys and diff --git a/src/test/regress/expected/functional_deps.out b/src/test/regress/expected/functional_deps.out index 32381b8ae7..7057254278 100644 --- a/src/test/regress/expected/functional_deps.out +++ b/src/test/regress/expected/functional_deps.out @@ -230,3 +230,42 @@ EXECUTE foo; ALTER TABLE articles DROP CONSTRAINT articles_pkey RESTRICT; EXECUTE foo; -- fail ERROR: column "articles.keywords" must appear in the GROUP BY clause or be used in an aggregate function +/* + * Corner case of the PostgreSQL optimizer: + * + * ANDed clauses selectivity multiplication increases total selectivity error. 
+ * If such an inaccurate selectivity is so tiny that row estimation predicts the + * absolute minimum number of tuples (1), the optimizer can't choose between + * different indexes and picks the first one from the index list (last created). + */ +CREATE TABLE t A
Optimization outcome depends on the index order
On 21/12/2023 12:10, Alexander Korotkov wrote: > I took a closer look at the patch in [9]. I should drop my argument > about breaking the model, because add_path() already considers other > aspects than just costs. But I have two more notes about that patch: > > 1) It seems that you're determining the fact that the index path > should return strictly one row by checking path->rows <= 1.0 and > indexinfo->unique. Is it really guaranteed that in this case the quals > are matching the unique constraint? path->rows <= 1.0 could be just an > estimation error. Or one row could be correctly estimated, but it's > going to be selected by some quals matching the unique constraint and > other quals in recheck. So, it seems there is a risk to select a > suboptimal index due to this condition. Operating inside the optimizer, we consider all estimations to be the truth. This patch modifies only one place: having two equal assumptions, we just choose the one that generally looks more stable. Filtered tuples should be calculated and included in the cost of the path. The decision on the equality of paths has been made in view of the estimation of these filtered tuples. > 2) Even for non-unique indexes this patch is putting new logic on top > of the subsequent code. How we can prove it's going to be a win? > That could lead, for instance, to dropping parallel-safe paths in > cases we didn't do so before. Because we must trust all predictions made by the planner, we just choose the most trustworthy path. According to the planner logic, it is a path with a smaller selectivity. We can make mistakes anyway just because of the nature of estimation. > Anyway, please start a separate thread if you're willing to put more > work into this. Done. > 9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1- > b8a4-16c66149731b%40postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
From 7b044de1449a5fdc450cb629caafb4e15ded7a93 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when the optimizer predicts only one row, prefer choosing UNIQUE indexes. In other cases, if the optimizer treats indexes as equal, make a last attempt by selecting the index with the smaller selectivity - this decision removes the dependency on the order of indexes in the index list (good for reproducing some issues) and provides one more objective argument for choosing a specific index. --- src/backend/optimizer/util/pathnode.c | 42 +++ .../expected/drop-index-concurrently-1.out| 16 +++ src/test/regress/expected/functional_deps.out | 39 + src/test/regress/expected/join.out| 40 +- src/test/regress/sql/functional_deps.sql | 32 ++ 5 files changed, 143 insertions(+), 26 deletions(-) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..4b5aedd579 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,48 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (new_path->rows <= 1.0 && old_path->rows <= 1.0) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. 
+*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity
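The instability itself is easy to demonstrate with a symmetric pair of indexes (a minimal sketch; data invented for illustration): both index paths below get a one-row estimate and fuzzily equal costs, so without an extra tie-break the winner may depend on the order of the relation's index list.

CREATE TABLE t (a int, b int, c int);
INSERT INTO t SELECT i, i, i FROM generate_series(1, 10000) AS i;
CREATE INDEX t_a_b ON t (a, b);
CREATE INDEX t_b_c ON t (b, c);
ANALYZE t;
-- Either index can serve two of the three quals; the third becomes a filter.
EXPLAIN (COSTS OFF) SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1;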
Re: Postgres picks suboptimal index after building of an extended statistics
On 18/12/2023 15:29, Alexander Korotkov wrote: Also, there is a set of patches [7], [8], and [9], which makes the optimizer consider path selectivity as well as path costs during path selection. I've rechecked that none of these patches could resolve the original problem described in [1]. It is true. We accidentally mixed two different problems in one thread. Also, I think they are quite tricky. The model of our optimizer assumes that paths in the list should be different ways of getting the same result. If we choose paths by their selectivity, that breaks this model. I don't say there is no way for this. But if we do this, that would require significant rethinking of our optimizer model and possible revision of a significant part of it. I can't understand that. In [9] we just elaborate the COSTS_EQUAL case and establish the final decision on a more stable basis than the casual order of indexes in the list. Anyway, I think if there is still interest in this, it should be moved into a separate thread to keep this thread focused on the problem described in [1]. Agreed. IMO, the problem of the optimizer's dependency on the order of indexes in the relation index list is more urgent for now. Finally, I'd like to note that the issue described in [1] is mostly a selectivity estimation problem. It could be solved by adding multi-column MCV statistics. The patches published so far look more like hacks for particular use cases rather than appropriate solutions. It still looks promising to me to use the knowledge of unique constraints during selectivity estimation [10]. Even though it's hard to implement and possibly implies some overhead, it fits the current model. I also think unique constraints could probably be used in some way to improve estimates even when there is no full match. I have tried to use the knowledge about unique indexes in the selectivity estimation routine. But it looks invasive and adds a lot of overhead. -- regards, Andrei Lepikhov Postgres Professional
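A sketch of the in-core alternative mentioned in the last paragraph - multi-column MCV statistics fixing the ANDed-quals estimate (data invented for illustration):

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 50, i % 50 FROM generate_series(1, 100000) AS i;
CREATE STATISTICS t_mcv (mcv) ON a, b FROM t;
ANALYZE t;
-- With the MCV list the planner estimates ~2000 rows; the independence
-- assumption would give 100000 * (1/50) * (1/50) = 40.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;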
Re: introduce dynamic shared memory registry
On 20/12/2023 17:33, Nathan Bossart wrote: On Wed, Dec 20, 2023 at 11:02:58AM +0200, Andrei Lepikhov wrote: In that case, maybe change the test case to make it closer to real-life usage - with locks and concurrent access (see attachment)? I'm not following why we should make this test case more complicated. It is only intended to test the DSM registry machinery, and setting/retrieving an atomic variable seems like a realistic use-case to me. I can give you at least two reasons here: 1. A more complicated example would serve as a tutorial on using the feature correctly. It would reduce the number of questions on the mailing lists. 2. Looking into existing extensions, I see that the most common case of using a shared memory segment is maintaining some hash table or state structure that needs at least one lock. Try to rewrite pg_prewarm to use this new feature, and you will realize how difficult it is. -- regards, Andrei Lepikhov Postgres Professional
Re: introduce dynamic shared memory registry
On 20/12/2023 07:04, Michael Paquier wrote: On Tue, Dec 19, 2023 at 10:14:44AM -0600, Nathan Bossart wrote: On Tue, Dec 19, 2023 at 10:49:23AM -0500, Robert Haas wrote: On Mon, Dec 18, 2023 at 3:32 AM Andrei Lepikhov wrote: 2. I think a separate file for this feature looks too expensive. According to the gist of that code, it is a part of the DSA module. -1. I think this is a totally different thing than DSA. More files aren't nearly as expensive as the confusion that comes from smushing unrelated things together. Agreed. I think there's a decent chance that more functionality will be added to this registry down the line, in which case it will be even more important that this stuff stays separate from the tools it is built with. +1 for keeping a clean separation between both. Thanks, I got the reasoning. In that case, maybe change the test case to make it closer to real-life usage - with locks and concurrent access (see attachment)? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl b/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl index 4ad6097d1c..762489f3dd 100644 --- a/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl +++ b/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl @@ -13,12 +13,27 @@ $node->init; $node->start; $node->safe_psql('postgres', "CREATE DATABASE test;"); -$node->safe_psql('postgres', "CREATE EXTENSION test_dsm_registry;"); -$node->safe_psql('postgres', "SELECT set_val_in_shmem(1236);"); - $node->safe_psql('test', "CREATE EXTENSION test_dsm_registry;"); -my $result = $node->safe_psql('test', "SELECT get_val_in_shmem();"); -is($result, "1236", "check shmem val"); + +my ($initial_value, $result, $custom_script); + +$initial_value = $node->safe_psql('test', "SELECT get_val_in_shmem();"); + +$custom_script = File::Temp->new(); +append_to_file($custom_script, q{ + \set val random(0, 1999) + SELECT increment_val_in_shmem(:val); + SELECT increment_val_in_shmem(-(:val)); +}); + +$node->command_ok([ 'pgbench', '-i', 'test' ], 'Init database'); +$node->command_ok([ 'pgbench', '-t', '31', '-c', '3', '-j', '3', + '-f', "$custom_script", 'test' ], + 'Change shared value simultaneously'); + +$result = $node->safe_psql('test', "SELECT get_val_in_shmem();"); +print("res $initial_value, $result"); +is($result, $initial_value, "Check concurrent access"); $node->stop; diff --git a/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql b/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql index 8c55b0919b..0144845afa 100644 --- a/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql +++ b/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql @@ -3,7 +3,7 @@ -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION test_dsm_registry" to load this file. 
\quit -CREATE FUNCTION set_val_in_shmem(val INT) RETURNS VOID +CREATE FUNCTION increment_val_in_shmem(val INT) RETURNS VOID AS 'MODULE_PATHNAME' LANGUAGE C; CREATE FUNCTION get_val_in_shmem() RETURNS INT diff --git a/src/test/modules/test_dsm_registry/test_dsm_registry.c b/src/test/modules/test_dsm_registry/test_dsm_registry.c index 8f78012e52..eed29a9f34 100644 --- a/src/test/modules/test_dsm_registry/test_dsm_registry.c +++ b/src/test/modules/test_dsm_registry/test_dsm_registry.c @@ -13,33 +13,48 @@ #include "postgres.h" #include "fmgr.h" +#include "common/pg_prng.h" #include "port/atomics.h" #include "storage/dsm_registry.h" +#include "storage/lwlock.h" PG_MODULE_MAGIC; -static pg_atomic_uint32 *val; +typedef struct +{ + LWLock lock; + uint32 value; +} SharedState; + +static SharedState *extstate; static void -init_val(void *ptr) +init_dsm(void *ptr) { - pg_atomic_init_u32(ptr, 0); + SharedState *state = (SharedState *) ptr; + + LWLockInitialize(&state->lock, LWLockNewTrancheId()); + state->value = pg_prng_uint32(&pg_global_prng_state); } static void dsm_registry_attach(void) { - dsm_registry_init_or_attach("test_dsm_registry", (void **) &val, - sizeof(pg_atomic_uint32), init_val); + dsm_registry_init_or_attach("test_dsm_registry", (void **) &extstate, + sizeof(SharedState), init_dsm); } -PG_FUNCTION_INFO_V1(set_val_in_shmem); +PG_FUNCTION_INFO_V1(increment_val_in_shmem); Datum -set_val_in_shmem(PG_FUNCTION_ARGS) +increment_val_in_shmem(PG_FUNCTION_ARGS) { + uint32 increment = PG_GETARG_UINT32(0); +
Re: introduce dynamic shared memory registry
On 18/12/2023 13:39, Andrei Lepikhov wrote: On 5/12/2023 10:46, Nathan Bossart wrote: I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. I am delighted that you started this thread. When designing extensions, I feel pain every time I introduce a single shared value or some global stat: the extension then must be loadable only at startup. This reduces the flexibility of even very lightweight extensions, which looks harmful for cloud usage. After looking into the code, I have some comments: 1. The code looks good; I didn't find any possible mishaps. Some proposed changes are in the attachment. 2. I think a separate file for this feature looks too expensive. According to the gist of that code, it is a part of the DSA module. 3. The dsm_registry_init_or_attach routine allocates a DSM segment. Why not create a dsa_area for the requestor and return it? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c index ea80f45716..0343ce987f 100644 --- a/src/backend/storage/ipc/dsm_registry.c +++ b/src/backend/storage/ipc/dsm_registry.c @@ -45,8 +45,8 @@ static const dshash_parameters dsh_params = { LWTRANCHE_DSM_REGISTRY_HASH }; -static dsa_area *dsm_registry_dsa; -static dshash_table *dsm_registry_table; +static dsa_area *dsm_registry_dsa = NULL; +static dshash_table *dsm_registry_table = NULL; static void init_dsm_registry(void); @@ -83,13 +83,20 @@ init_dsm_registry(void) { /* Quick exit if we already did this. */ if (dsm_registry_table) + { + Assert(dsm_registry_dsa != NULL); return; + } + + Assert(dsm_registry_dsa == NULL); /* Otherwise, use a lock to ensure only one process creates the table. */ LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE); if (DSMRegistryCtx->dshh == DSHASH_HANDLE_INVALID) { + Assert(DSMRegistryCtx->dsah == DSHASH_HANDLE_INVALID); + /* Initialize dynamic shared hash table for registry. */ dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA); dsa_pin(dsm_registry_dsa); @@ -102,6 +109,8 @@ init_dsm_registry(void) } else { + Assert(DSMRegistryCtx->dsah != DSHASH_HANDLE_INVALID); + /* Attach to existing dynamic shared hash table. */ dsm_registry_dsa = dsa_attach(DSMRegistryCtx->dsah); dsa_pin_mapping(dsm_registry_dsa); diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h index 665d471418..e0e7b3b765 100644 --- a/src/include/storage/lwlock.h +++ b/src/include/storage/lwlock.h @@ -209,7 +209,7 @@ typedef enum BuiltinTrancheIds LWTRANCHE_LAUNCHER_HASH, LWTRANCHE_DSM_REGISTRY_DSA, LWTRANCHE_DSM_REGISTRY_HASH, - LWTRANCHE_FIRST_USER_DEFINED + LWTRANCHE_FIRST_USER_DEFINED, } BuiltinTrancheIds; /*
Re: introduce dynamic shared memory registry
On 5/12/2023 10:46, Nathan Bossart wrote: I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. I am delighted that you started this thread. When designing extensions, I feel pain every time I introduce a single shared value or some global stat: the extension then must be loadable only at startup. This reduces the flexibility of even very lightweight extensions, which looks harmful for cloud usage. -- regards, Andrei Lepikhov Postgres Professional
Re: Statistics Import and Export
On 13/12/2023 17:26, Corey Huinker wrote: > 4. I don't yet have a complete vision for how these tools will be used by pg_upgrade and pg_dump/restore, the places where these will provide the biggest win for users. Some issues with the docs here: func.sgml:28465: parser error : Opening and ending tag mismatch: sect1 line 26479 and sect2 ^ Also, as I remember, we already had some attempts to invent dump/restore of statistics [1,2]. They were stopped by the problem of type verification. What if the definition of the type has changed between the dump and the restore? As I see in the code, when importing statistics you only check the column name and don't look into the type. [1] Backup and recovery of pg_statistic https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook [2] Re: Ideas about a better API for postgres_fdw remote estimates https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 6/12/2023 14:30, Richard Guo wrote: I've self-reviewed this patch again and I think it's now in a committable state. I'm wondering if we can mark it as 'Ready for Committer' now, or whether we need more review comments/feedback. To recap, this patch postpones reparameterization of paths until createplan.c, which would help avoid building the reparameterized expressions we might not use. More importantly, it makes it possible to modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos (e.g. baserestrictinfo) for reparameterization. Failing to do that can lead to executor crashes, wrong results, or planning errors, as we have already seen. As I see it, this patch contains the following modifications: 1. Changes in create_nestloop_path: here, it bases all the tests on the value of top_parent_relids, if any. That is OK with me. 2. Changes in reparameterize_path_by_child and the coupled new function path_is_reparameterizable_by_child. I compared them, and it looks good. One thing here: you use an assertion trap in the case of an inconsistency in the behaviour of these two functions. How disastrous would it be if someone hit such an inconsistency in production? Maybe it is better to use an elog(PANIC, ...) report? 3. The ADJUST_CHILD_ATTRS() macro. While understanding the necessity of this change, I have to say it looks a bit ugly. 4. The SampleScan reparameterization part. It looks OK. It can help us in the future with asymmetric join and other things. 5. Tests. All of them are related to partitioning and the SampleScan issue. Maybe that is enough, but remember, this change now impacts the parameterization feature in general. 6. I can't judge how this switch of the overall approach could impact other parts of the planner. I am only wary that, from the first reparameterization up to plan creation, we have some inconsistency in expressions (partition paths refer to expressions with the parent relid). What if an extension changes the parent expression in the middle? By and large, this patch is in a good state and may be committed. -- regards, Andrei Lepikhov Postgres Professional
Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'
On 11/12/2023 09:31, Richard Guo wrote: On Fri, Dec 8, 2023 at 3:13 PM Alexander Pyhalov mailto:a.pyha...@postgrespro.ru>> wrote: Andrei Lepikhov wrote on 2023-12-08 07:37: > I'd already clashed with Tom on copying the required_relids field and > voluntarily made unnecessary copies in the project [1]. > And ... ran into huge memory consumption. The reason was in > Bitmapsets: > When we have 1E3-1E4 partitions and try to reparameterize a join, one > bitmapset field can have a size of about 1kB. Having a bitmapset > referencing a relation with a large index value, we had a lot of (for > example, 1E4 * 1kB) copies on each reparameterization of such a field. > Alexander Pyhalov should remember that case. Yes. If it matters, this happened during reparameterization when two partitioned tables with 1000 partitions each were joined. Then asymmetric partition-wise join managed to eat lots of memory for bitmapsets (by lots of memory I mean all available on the test VM). By reparameterization did you mean the work done in reparameterize_path_by_child()? If so maybe you'd be interested in the patch [1] which postpones reparameterization of paths until createplan.c and thus can help avoid unnecessary reparameterization work. Yeah, I have already discovered it. It is a promising solution and only needs a bit more review. But here, I covered some corner cases with the idea that we may not see the other cases right now. And also, sometimes the Bitmapset field is significant - that is not a corner case. -- regards, Andrei Lepikhov Postgres Professional
Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'
On 28/11/2023 01:37, Alexander Korotkov wrote: On Mon, Nov 27, 2023 at 8:07 PM Andres Freund wrote: Sorry for the late answer, I missed this thread because of vacation. On 2023-11-27 11:29:48 +0530, Ashutosh Bapat wrote: How do we ensure that we are not making unnecessary copies of Bitmapsets? We don't - but that's not specific to this patch. Bitmapsets typically aren't very large, I doubt that it's a significant proportion of the memory usage. Adding refcounts or such would likely add more overhead than it'd save, both in time and memory. I'd already clashed with Tom on copying the required_relids field and voluntarily made unnecessary copies in the project [1]. And ... ran into huge memory consumption. The reason was in Bitmapsets: when we have 1E3-1E4 partitions and try to reparameterize a join, one bitmapset field can have a size of about 1kB. With a bitmapset referencing a relation with a large index value, we get a lot of copies (for example, 1E4 * 1kB) on each reparameterization of such a field. Alexander Pyhalov should remember that case. I don't claim we will certainly catch such an issue here, but it is a reason why we should look at this option carefully. I am a bit worried about the maintainability of remove_rel_from_query() et al. Is there any infrastructure for detecting that some PlannerInfo field that needs updating wasn't updated? There's not even a note in PlannerInfo that documents that that needs to happen. Thank you for highlighting this issue. That makes sense, thank you. We need at least a comment about this. I'll write a patch adding this comment. BTW, what do you think about the patches upthread [1]? Links 1. https://www.postgresql.org/message-id/CAPpHfdtLgCryACcrmLv=Koq9rAB3=tr5y9d84dggvuhscvj...@mail.gmail.com 0001 - Looks good and can be applied. 0002 - I am afraid the problems with expanded range table entries are similar to those described above. The patch makes sense, but it requires time to reproduce corner cases. Maybe we can do it separately from the current hotfix? 0003 - I think it is really what we need right now: SJE is quite a rare optimization and executes before the entry expansion procedure. So it looks less risky. [1] Asymmetric partition-wise JOIN https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
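A back-of-envelope check of the numbers above, following the BITMAPSET_SIZE() arithmetic (assuming 64-bit bitmapwords and an 8-byte header):

SELECT pg_size_pretty((8 + (10000 / 64 + 1) * 8)::bigint) AS one_set,
       pg_size_pretty(((8 + (10000 / 64 + 1) * 8) * 10000)::bigint) AS copies_1e4;
--    one_set   | copies_1e4
-- -------------+------------
--  1264 bytes  | 12 MB

So a single ~1 kB field copied 1E4 times really does scale to megabytes, and several such fields per structure multiply it further.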
Re: POC, WIP: OR-clause support for indexes
Here is a fresh version with the pg_dump.pl regex fixed. Now it should pass the buildfarm. Under development: 1. Explanation of the general idea in comments (Robert's note) 2. Issue with hiding some optimizations (Alexander's note and example with overlapping clauses on two partial indexes) -- regards, Andrei Lepikhov Postgres Professional
From 71746caae3eb0c771792b563fd53244fc1ac575b Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) at the preliminary stage of optimization, when we are still working with the expression tree. Here C is a constant expression, 'expr' is a non-constant expression, 'op' is an operator which returns a boolean result and has a commutator (for the case of the reverse order of the constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to a suboptimal plan. But we think it is better to have an array of elements instead of a lot of OR clauses. There is room for further optimization here, decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 310 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 862 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 0a5bdf8bcc..ff69b2cd3b 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3112,7 +3112,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER 
BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 rows) select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1; @@ -3130,7 +3130,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5) ORDER BY (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c1 % 5)) ASC NULLS LAST), (r2.c1 % 3) FROM ("S 1&q
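The transformation itself is easy to observe with a minimal sketch (table invented for illustration; the GUC controlling the feature is not shown above, so it is omitted here):

CREATE TABLE t (x int);
INSERT INTO t SELECT i FROM generate_series(1, 100000) AS i;
CREATE INDEX ON t (x);
ANALYZE t;
-- Without the patch the plan is typically a BitmapOr of three index scans;
-- with it, the quals collapse into one ScalarArrayOpExpr:
--   Index Cond: (x = ANY ('{1,2,3}'::integer[]))
EXPLAIN (COSTS OFF) SELECT * FROM t WHERE x = 1 OR x = 2 OR x = 3;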
Re: POC, WIP: OR-clause support for indexes
Hi, Here is the next version of the patch where, I think, all of Robert's claims related to the code have been fixed. For example, the expression 'x < 1 OR x < 2' is transformed to 'x < ANY (1,2)'. Here, we still need to deal with the architectural issues. I like the approach mentioned by Peter: try to transform the expression tree into some 'normal' form, which is more compact and simple, and delay the search for optimizations to the following stages. Also, it doesn't pass the pg_dump test. At first glance, it is a problem with a regular expression, which should be corrected further. -- regards, Andrei Lepikhov Postgres Professional
From 73031b7acae68494ddd0f9b1faf4c94aae3bd6b0 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) at the preliminary stage of optimization, when we are still working with the expression tree. Here C is a constant expression, 'expr' is a non-constant expression, 'op' is an operator which returns a boolean result and has a commutator (for the case of the reverse order of the constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to a suboptimal plan. But we think it is better to have an array of elements instead of a lot of OR clauses. There is room for further optimization here, decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 310 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 20 files changed, 859 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 0a5bdf8bcc..ff69b2cd3b 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3112,7 +3112,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: 
Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 rows) select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1; @@ -3130,7 +3130,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft Foreign Scan Output: (array_agg(D
Re: Custom explain options
On 30/11/2023 22:40, Konstantin Knizhnik wrote: On 30/11/2023 5:59 am, Andrei Lepikhov wrote: On 21/10/2023 19:16, Konstantin Knizhnik wrote: EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COSTS, ...) which help to provide useful details of query execution. In Neon we have added a PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for the Neon architecture because of the separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is the local file cache (an extra caching layer above shared buffers which is used to somehow replace the file system cache in standalone Postgres). I think that it will be nice to have a generic mechanism which allows extensions to add their own options to EXPLAIN. Generally, I welcome this idea: extensions can already do a lot of work, and they should have a tool to report their state, not only into the log. But I think your approach needs to be elaborated. First, it would be better to allow registering extended instruments for specific node types to avoid unneeded calls. Second, looking into the Instrumentation usage, I don't see a reason to limit the size: as far as I can see, it exists everywhere either locally or in the DSA, where its size is calculated on the fly. So, by registering an extended instrument, we can reserve a slot for the extension. The actual size of the underlying data can be provided by the extension routine. Thank you for the review. I agree that support for extended instruments is desirable. I just tried to minimize the number of changes to make this patch smaller. I got it. But with a substantial number of extensions to support, I think the extension part of instrumentation could have advantages and be worth elaborating on. In all these cases we are using an array of `Instrumentation`, and if it contains a varying part, it is not clear where to place it. Yes, there is also code which serializes and sends instrumentation between worker processes, and I have updated this code in my PR to send the actual amount of custom instrumentation data. But it cannot help with the cases above. I see the following basic instrumentation structures in the code:
- Instrumentation (which should be named NodeInstrumentation)
- MemoizeInstrumentation
- JitInstrumentation
- AggregateInstrumentation
- HashInstrumentation
- TuplesortInstrumentation
As a variant, extensibility can be designed with a parent 'AbstractInstrumentation' node containing the node type and a link to the extensible part. sizeof(Instr) calls should be replaced with getInstrSize() calls - there are not so many places in the code; memcpy() can also be replaced with a copy_instr() routine. -- regards, Andrei Lepikhov Postgres Professional
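To make the last paragraph concrete, here is a purely hypothetical C sketch (not from any patch; AbstractInstrumentation, getInstrSize and copy_instr are the names proposed above, nothing more):

/* assumes postgres.h (NodeTag, Size) and <string.h> for memcpy */
typedef struct AbstractInstrumentation
{
	NodeTag		type;	/* which concrete instrumentation kind follows */
	Size		size;	/* full struct size, filled in by its owner */
} AbstractInstrumentation;

/* would replace bare sizeof(Instrumentation) at the call sites */
static inline Size
getInstrSize(const AbstractInstrumentation *instr)
{
	return instr->size;
}

/* would replace bare memcpy() of instrumentation entries */
static inline void
copy_instr(AbstractInstrumentation *dst, const AbstractInstrumentation *src)
{
	memcpy(dst, src, src->size);
}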
Re: Report planning memory in EXPLAIN ANALYZE
On 30/11/2023 18:40, Alvaro Herrera wrote: Well, SUMMARY is enabled by default with ANALYZE, and I'd rather not have planner memory consumption displayed by default with all EXPLAIN ANALYZEs. So yeah, I still think this deserves its own option. But let's hear others' opinions on this point. I'm only one vote here. I agree; it should be disabled by default. The fact that memory consumption is reported with byte precision (quite uncomfortable to read) suggests that the primary audience is developers and that the output is intended for debugging purposes. -- regards, Andrei Lepikhov Postgres Professional
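For reference, a sketch of the interface shape being argued for - a dedicated option that must be requested explicitly; the option name MEMORY and the output line are assumptions based on this discussion, not a settled interface:

-- Hypothetical: planner memory reported only when explicitly requested.
EXPLAIN (MEMORY) SELECT * FROM pg_class;
--  Seq Scan on pg_class  (cost=0.00..18.13 rows=413 width=273)
--  Planner Memory: used=23072 bytes allocated=32768 bytes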
Re: POC, WIP: OR-clause support for indexes
On 30/11/2023 15:00, Alena Rybakina wrote: 2. The second patch is my version of the patch, where I moved the OR transformation to the index-path generation stage: So, I got the best query plan despite the possible OR to ANY transformation: If the user uses a clause like "x IN (1,2) AND y=100", it will break your 'good' solution. In my opinion, the general approach here is to stay with the OR->ANY transformation at the parsing stage and invent one more way of picking an index: look into the array and attempt to find a compound index. Having a shorter list of expressions, where uniform ORs are grouped into arrays, the optimizer will do such work with less overhead. -- regards, Andrei Lepikhov Postgres Professional
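A minimal sketch of the compound-index case mentioned above (the table and index names are illustrative):

CREATE TABLE t (x int, y int);
CREATE INDEX t_x_y_idx ON t (x, y);
-- After the parse-stage transformation the qual becomes
--   x = ANY ('{1,2}'::integer[]) AND y = 100
-- so index matching has to look inside the array to recognize that
-- the compound index t_x_y_idx fits, possibly decomposing the array.
SELECT * FROM t WHERE (x = 1 OR x = 2) AND y = 100;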
Re: Custom explain options
On 21/10/2023 19:16, Konstantin Knizhnik wrote: The EXPLAIN statement has a list of options (e.g. ANALYZE, BUFFERS, COSTS, ...) which help to provide useful details of query execution. In Neon we have added a PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for the Neon architecture because of the separation of compute and storage, so it is implemented not only for bitmap heap scans, as in vanilla Postgres, but also for sequential, index and index-only scans). Another possible candidate for EXPLAIN options is the local file cache (an extra caching layer above shared buffers which is used to somehow replace the file system cache in standalone Postgres). I think it would be nice to have a generic mechanism which allows extensions to add their own options to EXPLAIN. Generally, I welcome this idea: extensions can already do a lot of work, and they should have a tool to report their state, not only into the log. But I think your approach needs to be elaborated. First, it would be better to allow registering extended instruments for specific node types to avoid unneeded calls. Second, looking into the Instrumentation usage, I don't see a reason to limit the size: as far as I can see, it everywhere exists locally or in the DSA, where its size is calculated on the fly. So, by registering an extended instrument, we can reserve a slot for the extension. The actual size of the underlying data can be provided by the extension routine. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 28/11/2023 04:03, Robert Haas wrote: On Mon, Nov 27, 2023 at 3:02 AM Andrei Lepikhov wrote: On 25/11/2023 08:23, Alexander Korotkov wrote: I think the patch certainly gets better in this aspect. One thing I can't understand is why we use home-grown code for resolving hash collisions. You can just define custom hash and match functions in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we can still save the pre-calculated hash value into the hash entry and use custom hash and match functions. This doesn't require us to write our own collision-resolving code. Thanks, it was an insightful suggestion. I implemented it, and the code has become shorter (see attachment). Neither the code comments nor the commit message really explain the design idea here. That's unfortunate, principally because it makes review difficult. Yeah, it is still an issue. We will think about how to improve this; any suggestions are welcome. I'm very skeptical about the idea of using JumbleExpr for any part of this. It seems fairly expensive, and it might produce false matches. If expensive is OK, then why not just use equal()? If it's not, then this probably isn't really OK either. But in any case there should be comments explaining why this strategy was chosen. We used the equal() routine without hashing in earlier versions. Hashing resolves issues with many different OR clauses. Is it expensive? Maybe, but we assume this transformation should be applied only to simple enough expressions. The use of op_mergejoinable() seems pretty random to me. Why should we care about that? If somebody writes a<1 or a<2 or a<3 or a<4, you can transform that to a < ANY ('{1,2,3,4}'). You are right. The only reason was to obtain a working patch to benchmark and look for corner cases. We will rewrite that place but still live with the equivalence operator. The reader shouldn't be left to guess whether a rule like this was made for reasons of correctness or for reasons of efficiency or something else. Looking further, I see that the reason for this is likely that the operator for the transformation result is constructed using list_make1(makeString((char *) "=")), but trying to choose an operator based on the operator name is, I think, pretty clearly unacceptable. Yes, it was a big mistake. It is fixed in the new version (I guess). I am extremely dubious about the use of select_common_type() here. Why not do this only when the types already match exactly? Maybe the concern is unknown literals, but perhaps that should be handled in some other way. If you do this kind of thing, you need to justify why it can't fail or produce wrong answers. Perhaps. We implemented your approach in the next version. At least we can see the consequences. Honestly, it seems very hard to avoid the conclusion that this transformation is being done at too early a stage. Parse analysis is not the time to try to do query optimization. I can't really believe that there's a way to produce a committable patch along these lines. Ideally, a transformation like this should be done after we know what plan shape we're using (or considering using), so that we can make cost-based decisions about whether to transform or not. But at the very least it should happen somewhere in the planner. There's really no justification for parse analysis rewriting the SQL that the user entered. Here, we assume that an array operation is generally better than many ORs. 
As a result, it should be more effective to make OR->ANY transformation in the parser (it is a relatively lightweight operation here) and, as a second phase, decompose that in the optimizer. We implemented earlier prototypes in different places of the optimizer, and I'm convinced that only this approach resolves the issues we found. Does this approach look weird? Maybe. We can debate it in this thread. -- regards, Andrei Lepikhov Postgres Professional From a5f8eba522ff907f7a3fffb0635b61d38933e385 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 289 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc
Re: POC, WIP: OR-clause support for indexes
On 25/11/2023 08:23, Alexander Korotkov wrote: I think patch certainly gets better in this aspect. One thing I can't understand is why do we use home-grown code for resolving hash-collisions. You can just define custom hash and match functions in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we still can save pre-calculated hash value into hash entry and use custom hash and match. This doesn't imply us to write our own collision-resolving code. Thanks, it was an insightful suggestion. I implemented it, and the code has become shorter (see attachment). -- regards, Andrei Lepikhov Postgres Professional From 8a43f5b50be6cb431046ab352fbcb9bdd3e4376c Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 283 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 257 +--- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 830 insertions(+), 83 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 22cae37a1e..fdfecd51c7 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8537,18 +8537,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' 
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; +
Re: Postgres picks suboptimal index after building of an extended statistics
Second version of the patch - resolve non-symmetrical decision, thanks to Teodor Sigaev's review. -- regards, Andrei Lepikhov Postgres Professional From 604899b6afe70eccbbdbf69ce254f37808c598db Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when optimizer predicts only one row prefer choosing UNIQUE indexes In other cases, if optimizer treats indexes as equal, make a last attempt selecting the index with less selectivity - this decision takes away dependency on the order of indexes in an index list (good for reproduction of some issues) and proposes one more objective argument to choose specific index. --- src/backend/optimizer/util/pathnode.c | 42 ++ .../expected/drop-index-concurrently-1.out| 16 --- src/test/regress/expected/functional_deps.out | 43 +++ src/test/regress/expected/join.out| 40 + src/test/regress/sql/functional_deps.sql | 36 5 files changed, 151 insertions(+), 26 deletions(-) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..4b5aedd579 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,48 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (new_path->rows <= 1.0 && old_path->rows <= 1.0) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. +*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity. +* This option makes decision more conservative and looks +* debatable. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + else if (costcmp == COSTS_EQUAL) + /* +* The optimizer can't differ the value of two index paths. +* To avoid making a decision that is based on only an index +* order in the list, use some rational strategy based on +* selectivity: prefer touching fewer tuples on the disk to +* filtering them after. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + /* * If the two paths compare differently for startup and total cost, * then we want to keep both, and we can skip comparing pathkeys and diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out index 1cb2250891..2392cdb033 100644 --- a/src/test/isolation/expected/drop-index-concurrently-1.out +++ b/src/test/isolation/expected/drop-index-concurrently-1.out @@ -12,13 +12,15 @@ step preps: PREPARE getrow_seqscan AS SELECT * FROM test_dc WHERE data = 34 ORDE ste
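A sketch of the instability this patch addresses: two indexes cover the same one-row predicate with fuzzily equal costs, and without the heuristic the winner depends only on the order of the relation's index list (the schema is illustrative):

CREATE TABLE t (a int, b int NOT NULL);
CREATE INDEX t_a_b_idx ON t (a, b);   -- non-unique
CREATE UNIQUE INDEX t_b_idx ON t (b); -- unique
-- Both indexes can serve this qual and both are estimated to return
-- a single row; the patch makes add_path() prefer the unique index
-- instead of relying on the order of rel->indexlist.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;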
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 14/11/2023 17:10, Damir Belyalov wrote: Here is a very straw-man-level sketch of what I think might work. The option to COPY FROM looks something like ERRORS TO other_table_name (item [, item [, ...]]) I tried to implement the patch using a table and came across a number of questions. What kind of table should we use for this feature: a system catalog table, a table stored as a file, or a newly created regular table? In all these cases, security and user rights management issues arise. It is better for other users not to see error lines from another user. It is also not clear how access rights to this table should be inherited and granted. Previous reviews have given helpful ideas about storing errors in a new table. It should be trivial code - use the current table name plus an 'err' suffix, as we already do in the case of conflicting auto-generated index names. The 'errors table' must inherit any access policies from the table to which we do the copy. -- regards, Andrei Lepikhov Postgres Professional
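A sketch of the straw-man syntax in use; none of this exists yet, and the error-table name and its columns are assumptions following the naming idea above:

-- Hypothetical: failed lines go to a separate table instead of
-- aborting the whole load.
COPY orders FROM '/tmp/orders.csv' (FORMAT csv)
    ERRORS TO orders_err (lineno, rawline, errmsg);
-- The 'orders_err' table would inherit the access policies of
-- 'orders', so other users cannot see failed lines that are not theirs.
SELECT lineno, errmsg FROM orders_err;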
Re: POC, WIP: OR-clause support for indexes
On 23/11/2023 16:23, Andrei Lepikhov wrote: This code changes tests in many places. But, as I see it, it mostly demonstrates the positive effect of the transformation. I found out that the postgres_fdw tests were impacted by the feature. Fix it, because the patch is on the commitfest and passes buildfarm. Taking advantage of this, I suppressed the expression evaluation procedure to make regression test changes more clear. -- regards, Andrei Lepikhov Postgres Professional From e29b35d3445d9852f3ecb9cdaa4f231c72334973 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 284 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 215 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 810 insertions(+), 62 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 64bcc66b8d..a8442f08aa 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8537,18 +8537,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' 
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; + jstate->highest_extern_param_id = 0; + + /* Compute query ID */ + _jumbleNode(jstate, (Node *) expr); + *queryId = DatumGe
Re: POC, WIP: OR-clause support for indexes
On 21/11/2023 18:31, Alena Rybakina wrote: Sorry, I lost your changes during the revision process. I returned them. I raised the patch version just in case to run ci successfully. I think the usage of nodeToString for the generation of clause hash is too expensive and buggy. Also, in the code, you didn't resolve hash collisions. So, I've rewritten the patch a bit (see the attachment). One more thing: I propose to enable transformation by default at least for quick detection of possible issues. This code changes tests in many places. But, as I see it, it mostly demonstrates the positive effect of the transformation. -- regards, Andrei Lepikhov Postgres Professional From 5071d02426ac3430f4dd61a8ad32c2847ba6f8a5 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 285 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/create_view.out | 2 +- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 215 +++-- src/test/regress/expected/rules.out | 18 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 815 insertions(+), 66 deletions(-) diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; + jstate->highest_extern_param_id = 0; + + /* Compute query ID */ + _jumbleNode(jstate, (Node *) expr); + *queryId = DatumGetUInt64(hash_any_extended(jstate->jumble, + jstate->jumble_len, + 0)); + + if (*queryId == UINT64CONST(0)) + *queryId = UINT64CONST(1); + + return jstate; +} + /* * Enables query identifier computation. 
* diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 64c582c344..d782642771 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -22,6 +22,7 @@ #include "miscadmin.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" +#include "nodes/queryjumble.h" #include "optimizer/optimizer.h" #include "parser/analyze.h" #include "parser/parse_agg.h" @@ -43,6 +44,7 @@ /* GUC parameters */ bool Transform_null_equals = false; +bool enable_or_transformation = true; static Node *transformExprRecurse(ParseState *pstate, Node *expr); @@ -99,6 +101,287 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname, static Node *make_nulltest_from_distinct(ParseState *pstate, A_Expr *distincta, Node *arg); +typedef struct OrClauseGroupEntry +{ + Node
Re: Postgres picks suboptimal index after building of an extended statistics
Thanks for the detailed answer, On 3/11/2023 00:37, Tomas Vondra wrote: On 9/25/23 06:30, Andrey Lepikhov wrote: ... I can't stop thinking about this issue. It is bizarre when Postgres chooses a non-unique index if a unique index gives us proof of a minimum scan. That's true, but no one has implemented this heuristic. So the "proof of minimum scan" is merely hypothetical - the optimizer is unaware of it. See the simple patch in the attachment. There, I have attempted to resolve situations of uncertainty to avoid making decisions based solely on the order of indexes in the list. I don't see a reason to teach the clauselist_selectivity() routine to estimate UNIQUE indexes. We would add some cycles, but it would work with btree indexes only. I'm not sure I understand what this is meant to say. Can you elaborate? We only allow UNIQUE for btree indexes anyway, so what exactly is the problem here? Partly, you already answered yourself below: we have unique-index estimation in a few estimation calls but go through the list of indexes each time. Also, for this sake, we would have to add some input parameters, usually NULL, because many estimations don't involve indexes at all. Maybe change compare_path_costs_fuzzily() and add some heuristic, for example: "If the selectivity of both paths gives us no more than 1 row, prefer to use a unique index or an index with the least selectivity." I don't understand how this would work. What do you mean by "selectivity of a path"? AFAICS the problem here is that we estimate a path to return more rows (while we know there can only be 1, thanks to the UNIQUE index). Oops, I meant cardinality. See the patch in the attachment. So how would you know the path does not give us more than 1 row? Surely you're not proposing compare_path_costs_fuzzily() to do something expensive like analyzing the paths, or something. I solely propose to make the optimizer more consistent in its decisions: if we have one row for both path candidates, use the uniqueness of the index or the value of selectivity as one more parameter. Also, how would it work in upper levels? If you just change which path we keep, but leave the inaccurate row estimate in place, that may fix that level, but it's certainly going to confuse later planning steps. It is designed only for the scan level. IMHO the problem here is that we produce a wrong estimate, so we better fix that, instead of adding band-aids to other places. Agreed. I am looking for a solution to help users somehow resolve such problems. As an alternative solution, I can propose a selectivity hook or (maybe even better) the pathlist approach: add indexes into the index list in some predefined order - in the first positions, place unique indexes with more columns, etc. This happens because functional dependencies are a very simple type of statistics - it has some expectations about the input data and also the queries executed on it. For example, it assumes the data is reasonably homogeneous, so that we can calculate a global "degree". But the data in the example directly violates that - it has 1000 rows that are very random (so we'd find no dependencies), and 1000 rows with perfect dependencies. Hence we end up with degree=0.5, which approximates the dependencies over all the data. Not great, true, but that's the price for the simplicity of this statistics kind. So the simplest solution is to disable dependencies on such data sets. It's a bit inconvenient/unfortunate that we build dependencies by default, and people may not be aware of their assumptions. 
Perhaps we could/should make dependency_degree() more pessimistic when we find some "violations" of the rule (we intentionally are not strict about it, because data are imperfect). I don't have a good idea how to change the formulas, but I'm open to the idea in principle. Thanks for the explanation! The other thing we could do is checking for unique indexes in clauselist_selectivity, and if we find a match we can just skip the extended statistics altogether. Not sure how expensive this would be, but for typical cases (with modest number of indexes) perhaps OK. It wouldn't work without a unique index, but I don't have a better idea. It looks a bit expensive for me. But I am ready to try, if current solution doesn't look applicable. -- regards, Andrei Lepikhov Postgres Professional From 21677861e101eed22c829e0b14290a56786a12c2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 22 Nov 2023 12:01:39 +0700 Subject: [PATCH] Use an index path with the best selectivity estimation --- src/backend/optimizer/util/pathnode.c | 40 + .../expected/drop-index-concurrently-1.out| 16 --- src/test/regress/expected/functional_deps.out | 43 +++ src/test/regress/expected/join.out| 40 + src/test/regress/sql/functional_deps.sql | 36 5 files changed, 149 insertions(
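For readers who want to reproduce the degree=0.5 situation described above, a minimal sketch of such a data set (the concrete values are assumptions; only the half-dependent, half-random shape matters):

CREATE TABLE deps (a int, b int);
-- 1000 rows with a perfect dependency a -> b ...
INSERT INTO deps SELECT i, i FROM generate_series(1, 1000) i;
-- ... and 1000 rows where b is random with respect to a.
INSERT INTO deps SELECT i % 10, (random() * 100000)::int
FROM generate_series(1, 1000) i;
CREATE STATISTICS deps_stat (dependencies) ON a, b FROM deps;
ANALYZE deps;
-- dependency_degree() averages the two halves to roughly 0.5 and then
-- applies that degree to estimates over the whole table.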
Re: POC, WIP: OR-clause support for indexes
On 10/11/2023 16:20, Alena Rybakina wrote: I added a log message like this: ERROR: unrecognized node type: 0. I fixed this issue and added some cosmetic refactoring. The changes are presented in the or_patch_changes.diff file. Looking into the patch, I found some trivial improvements (see attachment). Also, it is not obvious that using a string representation of the clause as a hash table key is needed here. Besides, by making a copy of the node in get_key_nconst_node(), you replace the location field, but in the case of a complex expression, you don't do the same with the other nodes. I propose to generate an expression hash instead and prove the equality of two expressions by calling equal(). -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 25a4235dbd..de27d2646c 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -178,6 +178,10 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) HTAB *or_group_htab = NULL; int len_ors = list_length(expr_orig->args); + /* If this is not an 'OR' expression, skip the transformation */ + if (!or_transform_limit || expr_orig->boolop != OR_EXPR || len_ors < 2) + return transformBoolExpr(pstate, (BoolExpr *) expr_orig); + MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(char *); info.entrysize = sizeof(OrClauseGroupEntry); @@ -188,10 +192,6 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) &info, HASH_ELEM | HASH_FUNCTION | HASH_COMPARE); - /* If this is not an 'OR' expression, skip the transformation */ - if (expr_orig->boolop != OR_EXPR || !or_transform_limit || len_ors == 1 || !or_group_htab) - return transformBoolExpr(pstate, (BoolExpr *) expr_orig); - foreach(lc, expr_orig->args) { Node *arg = lfirst(lc);
Re: [PoC] Reducing planning time when tables have many partitions
On 27/9/2023 14:28, Yuya Watari wrote: Thank you for pointing it out. The ec_source_indexes and ec_derive_indexes are just picked up from the previous patch, and I have not changed their design. I think a similar approach to EquivalenceMembers may be applied to RestrictInfos. I will experiment with them and share a new patch. During the work on committing the SJE feature [1], Alexander Korotkov pointed out a silver lining in this work [2]: he proposed that we shouldn't remove the RelOptInfo from simple_rel_array at all but replace it with an 'Alias', which refers to the kept relation. It can simplify further optimizations that remove redundant parts of the query. [1] https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru [2] https://www.postgresql.org/message-id/capphfdsnabg8cak+nj8akig_+_tt07ecstkb1loa50f0ust...@mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
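The scale of the problem is easy to observe with a generated partition set; a quick sketch for psql (the table shape and partition count are illustrative):

CREATE TABLE pt (id int, payload text) PARTITION BY RANGE (id);
SELECT format('CREATE TABLE pt_%s PARTITION OF pt FOR VALUES FROM (%s) TO (%s)',
              i, i * 1000, (i + 1) * 1000)
FROM generate_series(0, 999) i \gexec
-- Almost all partitions are pruned, but planner bookkeeping (such as
-- the EquivalenceMember lists discussed here) still grows with the
-- total number of partitions; watch the "Planning Time" line.
EXPLAIN (SUMMARY) SELECT * FROM pt WHERE id = 42;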
Re: A performance issue with Memoize
On 30/10/2023 14:55, Richard Guo wrote: On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov wrote: Have you thought about the case fixed by commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of reusing a PARAM_EXEC slot break that case? Hm, I don't think so. The issue fixed by commit 1db5667 was caused by sharing PARAM_EXEC slots between different levels of NestLoop. AFAICS it's safe to share PARAM_EXEC slots within the same level of NestLoop. The change here is about sharing PARAM_EXEC slots between a subquery's subplan_params and outer-relation variables, which happens within the same level of NestLoop. ... Did you notice a case that the change here breaks? Hi Tom, could you share your insights on this issue and the proposed fix? I think your patch works correctly so far. I mentioned commit 1db5667 because, as I see it, the origin of the problem was parallel workers. I have thought about pushing Memoize down to a parallel worker and couldn't convince myself that such a solution would be correct. Sorry if I disturbed you in vain. -- regards, Andrei Lepikhov Postgres Professional
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 25/10/2023 20:00, Andrei Zubkov wrote: Hi Andrei, On Wed, 2023-10-25 at 13:59 +0700, Andrei Lepikhov wrote: But minmax_stats_since and the changes in the UI of the reset routine look like syntactic sugar here. I can't convince myself that it is really needed. Do you have some set of cases that can justify the proposed changes? Yes, it looks strange, but it is needed in some way. The main purpose of this patch is to provide means for sampling solutions that collect statistics data in a series of samples. The first goal here is to be sure that the statement was not evicted and inserted back between samples (especially between rare samples). This is the purpose of the stats_since field. The second goal is the ability to get all statistics increments for the interval between samples. All statistics provided by pg_stat_statements, with the exception of the min/max values, can be calculated for the interval since the last sample, knowing the values in the last and current samples. We need a way to get the min/max values too. This is achieved by a min/max reset performed by the sampling solution. The minmax_stats_since field is here for the same purpose - the sampling solution needs to be sure that no one has done a min/max reset between samples. And if such a reset was performed, at least we know when it was performed. That is the gist of my question. If needed, you can remove the record by (userid, dbOid, queryId). As I understand it, this extension is usually used by an administrator. Who can reset these parameters except you and the DBMS? -- regards, Andrei Lepikhov Postgres Professional
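A sketch of how a sampling tool would use these fields; the column names and the minmax_only reset argument follow the patch under discussion and are not a committed interface:

-- Between two samples: check that the entry survived (stats_since)
-- and that nobody else reset min/max (minmax_stats_since).
SELECT queryid, stats_since, minmax_stats_since,
       min_exec_time, max_exec_time, calls
FROM pg_stat_statements;
-- After taking a sample, reset only the min/max counters so the next
-- interval gets fresh extremes (signature as proposed by the patch).
SELECT pg_stat_statements_reset(0, 0, 0, minmax_only => true);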
Re: A performance issue with Memoize
On 20/10/2023 17:40, Richard Guo wrote: I noticed $subject with the query below. set enable_memoize to off; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN Nested Loop Left Join (actual time=0.050..59578.053 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.027..2.703 rows=1 loops=1) -> Subquery Scan on s (actual time=0.004..4.819 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.002..3.834 rows=1 loops=1) Planning Time: 0.666 ms Execution Time: 60937.899 ms (8 rows) set enable_memoize to on; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN -- Nested Loop Left Join (actual time=0.061..122684.607 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.026..3.367 rows=1 loops=1) -> Memoize (actual time=0.011..9.821 rows=5000 loops=1) Cache Key: t1.two, t1.two Cache Mode: binary Hits: 0 Misses: 1 Evictions: Overflows: 0 Memory Usage: 1368kB -> Subquery Scan on s (actual time=0.008..5.188 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.004..4.081 rows=1 loops=1) Planning Time: 0.607 ms Execution Time: 124431.388 ms (12 rows) The execution time (best of 3) is 124431.388 VS 60937.899 with and without memoize. The Memoize runtime stats 'Hits: 0 Misses: 1 Evictions: ' seems suspicious to me, so I've looked into it a little bit, and found that the MemoizeState's keyparamids and its outerPlan's chgParam are always different, and that makes us have to purge the entire cache each time we rescan the memoize node. But why are they always different? Well, for the query above, we have two NestLoopParam nodes, one (with paramno 1) is created when we replace outer-relation Vars in the scan qual 't1.two = s.two', the other one (with paramno 0) is added from the subquery's subplan_params, which was created when we replaced uplevel vars with Param nodes for the subquery. That is to say, the chgParam would be {0, 1}. When it comes to replace outer-relation Vars in the memoize keys, the two 't1.two' Vars are both replaced with the NestLoopParam with paramno 1, because it is the first NLP we see in root->curOuterParams that is equal to the Vars in memoize keys. That is to say, the memoize node's keyparamids is {1}. ... Any thoughts? Do you've thought about the case, fixed with the commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of a PARAM_EXEC slot reusing break that case? -- regards, Andrei Lepikhov Postgres Professional
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 19/10/2023 19:40, Andrei Zubkov wrote: Hi hackers, New version 23 attached. It contains a rebase onto the current master. I have examined the patch and the parameters you've proposed. In my opinion, the stats_since parameter adds valuable information and should definitely be included in the stats data, because a statement can be silently removed from the list and inserted again. But minmax_stats_since and the changes in the UI of the reset routine look like syntactic sugar here. I can't convince myself that it is really needed. Do you have some set of cases that can justify the proposed changes? Maybe we should work intensively on adding the 'stats_since' parameter and continue the discussion of the necessity of the other one? -- regards, Andrei Lepikhov Postgres Professional
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 20/10/2023 19:39, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: The only issue I worry about is the uncertainty and clutter that can be created by this feature. In the worst case, when we have a complex error stack (including the extension's CATCH sections, exceptions in stored procedures, etc.), the backend will throw the memory limit error repeatedly. I'm not seeing what additional uncertainty or clutter there is - this is, again, exactly the same as what happens today on a system with overcommit disabled, and I don't feel like we get a lot of complaints about this today. Maybe I missed something, or I see this feature from an alternate point of view (as an extension developer), but the overcommit behaviour is more convenient so far: it kills the process. It means that after a restart, the backend/background worker will have its initial internal state. With this limit enabled, we need to remember that each function call can cause an error, and we have to handle it in static PG_CATCH sections, where we must restore local variables to their initial (?) state. So, it complicates development. Of course, this limit is a good feature, but from my point of view, it would be better to kill a memory-consuming backend instead of throwing an error. At least for now, we don't have a technique to repeat query planning with a chance of building a more efficient plan. -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 22/10/2023 05:01, Alexander Korotkov wrote: On Thu, Oct 19, 2023 at 6:16 AM Andrei Lepikhov wrote: On 19/10/2023 01:50, Alexander Korotkov wrote: This query took 3778.432 ms with self-join removal disabled, and 3756.009 ms with self-join removal enabled. So, no measurable overhead. Similar to the higher number of joins. Can you imagine some extreme case when self-join removal could introduce significant overhead in comparison with other optimizer parts? If not, should we remove self_join_search_limit GUC? Thanks, It was Zhihong Yu who worried about that case [1]. And my purpose was to show a method to avoid such a problem if it would be needed. I guess the main idea here is that we have a lot of self-joins, but only few of them (or no one) can be removed. I can't imagine a practical situation when we can be stuck in the problems here. So, I vote to remove this GUC. I've removed the self_join_search_limit. Anyway there is enable_self_join_removal if the self join removal algorithm causes any problems. I also did some grammar corrections for the comments. I think the patch is getting to the committable shape. I noticed some failures on commitfest.cputube.org. I'd like to check how this version will pass it. I have observed the final patch. A couple of minor changes can be made (see attachment). Also, I see room for improvement, but it can be done later. For example, we limit the optimization to only ordinary tables in this patch. It can be extended at least with partitioned and foreign tables soon. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 6b848aadad..b84197dadb 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -23,7 +23,6 @@ #include "postgres.h" #include "catalog/pg_class.h" -#include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/joininfo.h" @@ -50,7 +49,7 @@ typedef struct UniqueRelInfo } UniqueRelInfo; /* - * The context for replace_varno_walker() containing source and target relids2 + * The context for replace_varno_walker() containing source and target relids. */ typedef struct {
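For context, the class of queries the feature removes; a minimal sketch (the schema is illustrative):

CREATE TABLE emp (id int PRIMARY KEY, dept int, salary int);
-- A self-join on a unique key: every outer row joins exactly to
-- itself, so the join can be eliminated and the quals merged into a
-- single scan of emp.
EXPLAIN SELECT e1.id, e2.salary
FROM emp e1 JOIN emp e2 ON e1.id = e2.id
WHERE e1.dept = 10;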
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 20/10/2023 05:06, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 19/10/2023 02:00, Stephen Frost wrote: * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea; I have doubts about the implementation. I have a question. I see the feature triggers an ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see it, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment, I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocate some memory) before aborting the transaction. So, the allocation error can throw us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. While it's an interesting idea to consider making exceptions to the limit, and perhaps we'll do that (or have some kind of 'reserve' for such cases), this isn't really any different than today, is it? We might have a malloc() failure in the main path, end up in PG_CATCH() and then try to do a CopyErrorData() and have another malloc() failure. If we can rearrange the code to make this less likely to happen, by doing a bit more work to free() resources used in the main path before trying to do new allocations, then, sure, let's go ahead and do that, but that's independent from this effort. I agree that rearranging efforts can be made independently. The code in the letter above was shown just as a demo of the case I'm worried about. IMO, the thing that should be implemented here is a recursion level for the memory limit. If, while processing the error, we fall into recursion with this limit, we should ignore it. I imagine custom extensions that use PG_CATCH() and allocate some data there. At least we can raise the error level to FATAL. Ignoring such would defeat much of the point of this effort- which is to get to a position where we can say with some confidence that we're not going to go over some limit that the user has set and therefore not allow ourselves to end up getting OOM killed. These are all the same issues that already exist today on systems which don't allow overcommit too, there isn't anything new here in regards to these risks, so I'm not really keen to complicate this to deal with issues that are already there. Perhaps once we've got the basics in place then we could consider reserving some space for handling such cases.. but I don't think it'll actually be very clean and what if we have an allocation that goes beyond what that reserved space is anyway? Then we're in the same spot again where we have the choice of either failing the allocation in a less elegant way than we might like to handle that error, or risk getting outright kill'd by the kernel. Of those choices, sure seems like failing the allocation is the better way to go. I've got your point. The only issue I worry about is the uncertainty and clutter that can be created by this feature. 
In the worst case, when we have a complex error stack (including the extension's CATCH sections, exceptions in stored procedures, etc.), the backend will throw the memory limit error repeatedly. Of course, one failed backend looks better than a surprisingly killed postmaster, but the mix of different error reports and details looks terrible and challenging to debug in the case of trouble. So, may we throw a FATAL error if we reach this limit while handling an exception? -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 18/10/2023 16:59, Ashutosh Bapat wrote: On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov wrote: But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolve this issue? Maybe. There are other problematic areas like EvalPlanQual, rescans and reparameterised paths, which can blow up if we use the same RelOptInfo for different scans of the same relation. It will be good to test those. Yeah, now I got it. It is already the second place where I see a reference to a kind of hidden rule that an RTE entry (or RelOptInfo) must correspond to only one plan node. I don't have a quick answer for now - maybe it is a kind of architectural agreement - and I will consider this issue during development. Also, A need not be a simple relation; it could be a join as well. For a join RelOptInfo, as well as for any subtree, we have the same logic: the pathlist of this subtree is already formed during the previous level of the search and will not be changed. The relid is also used to track the scans at the executor level. Since we have so many scans on A, each may be using a different plan, so we will need different ids for those. I don't understand this sentence. In which way does the executor use this index of RelOptInfo? See Scan::scanrelid -- regards, Andrey Lepikhov Postgres Professional
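Restating the discussed shape in SQL terms, a sketch (A is the non-partitioned side, P the partitioned one; names are illustrative):

CREATE TABLE a (id int, val int);
CREATE TABLE p (id int, val int) PARTITION BY HASH (id);
CREATE TABLE p0 PARTITION OF p FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 2, REMAINDER 1);
-- Asymmetric partition-wise join would plan this as
--   Append(JOIN(a, p0), JOIN(a, p1))
-- reusing the same RelOptInfo for "a" in every child join, which is
-- exactly where the one-RelOptInfo-per-scan concern above arises.
EXPLAIN SELECT * FROM a JOIN p USING (id);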
Re: Removing unneeded self joins
On 19/10/2023 01:50, Alexander Korotkov wrote: On Mon, Oct 16, 2023 at 11:28 AM Andrei Lepikhov wrote: On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote: On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points that infrastructure from [2] might > > be useful. The patchset from [2] seems committed mow. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding IS NOT NULL clause for the columns, that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes to achieve the desired behaviour. Some explains in regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of walker to handle such clauses. But It doesn't process the RestrictInfo node ... I'm inclined to put a solution of this issue off for a while. OK. I think it doesn't worth to eliminate IS NULL quals with this complexity (at least at this stage of work). I made improvements over the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Also, I found that the suggestion from David Rowley [1] to qsort array of relations to faster find duplicates is still unaddressed. I've implemented it. That helps to evade quadratic complexity with large number of relations. Also I've incorporated improvements from Alena Rybakina except one for skipping SJ removal when no SJ quals is found. It's not yet clear for me if this check fix some cases. But at least optimization got skipped in some useful cases (as you can see in regression tests). I would like to propose one more minor improvement (see in attachment). The idea here is that after removing a self-join and changing clauses we should re-probe the set of relids with the same Oid, because we can find more removable self-joins (see the demo test in join.sql). Thank you, I've integrated this into the patch. BTW, the patch introduces two new GUC variables: enable_self_join_removal, self_join_search_limit. enable_self_join_removal variable turns on/off optimization at all. self_join_search_limit variable limits its usage by the number of joins. AFICS, self_join_search_limit is intended to protect us from quadratic complexity self-join removal has. I tried to reproduce the extreme case. SELECT count(*) FROM pgbench_accounts a0, pgbench_accounts a1, ..., pgbench_accounts a100 WHERE a0.aid = 1 AND a1.aid = a0.aid + 1 AND ... 
AND a100.aid = a99.aid + 1; This query took 3778.432 ms with self-join removal disabled, and 3756.009 ms with self-join removal enabled. So, no measurable overhead. Similar to the higher number of joins. Can you imagine some extreme case when self-join removal could introduce significant overhead in comparison with other optimizer parts? If not, should we remove self_join_search_limit GUC? Thanks, It was Zhihong Yu who worried about that case [1]. And my purpose was to show a method to avoid such a problem if it would be needed. I guess the main idea here is that we have a lot of self-joins, but only few of them (or no one) can be removed. I can't imagine a practical situation when we can be stuck in the problems here. So, I vote to remove this GUC. [1] https://www.postgresql.org/message-id/CALNJ-vTyL-LpvSOPZxpC63Et3LJLUAFZSfRqGEhT5Rj7_EEj7w%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 19/10/2023 02:00, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea; I have doubts about the implementation. I have a question. I see the feature triggers an ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see it, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment, I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocate some memory) before aborting the transaction. So, the allocation error can throw us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. While it's an interesting idea to consider making exceptions to the limit, and perhaps we'll do that (or have some kind of 'reserve' for such cases), this isn't really any different than today, is it? We might have a malloc() failure in the main path, end up in PG_CATCH() and then try to do a CopyErrorData() and have another malloc() failure. If we can rearrange the code to make this less likely to happen, by doing a bit more work to free() resources used in the main path before trying to do new allocations, then, sure, let's go ahead and do that, but that's independent from this effort. I agree that rearranging efforts can be made independently. The code in the letter above was shown just as a demo of the case I'm worried about. IMO, the thing that should be implemented here is a recursion level for the memory limit. If, while processing the error, we fall into recursion with this limit, we should ignore it. I imagine custom extensions that use PG_CATCH() and allocate some data there. At least we can raise the error level to FATAL. -- regards, Andrey Lepikhov Postgres Professional
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 18/10/2023 13:39, Richard Guo wrote: On Fri, Oct 13, 2023 at 6:18 PM Andrei Lepikhov wrote: On 23/8/2023 12:37, Richard Guo wrote: > To fix it we may need to modify RelOptInfos for Path, BitmapHeapPath, > ForeignPath and CustomPath, and modify IndexOptInfos for IndexPath. It > seems that that is not easily done without postponing reparameterization > of paths until createplan.c. > > Attached is a patch which is v5 + fix for this new issue. Having looked into the patch for a while, I couldn't answer a few (possibly naive) questions for myself: Thanks for reviewing this patch! I think these are great questions. 1. If we postpone parameterization until plan creation, why do we still copy the path node in the FLAT_COPY_PATH macro? Do we really need it? Good point. The NestPath's original inner path should not be referenced any more after the reparameterization, so it seems safe to adjust the path itself, without the need for a flat copy. I've done that in the v8 patch. 2. I see big switches on path nodes. Might it be time to create a path_walker function? I recall some thread where such a suggestion was declined, but I don't remember why. I'm not sure. But this seems a separate topic, so maybe it's better to discuss it separately? Agree. 3. Clause changing is postponed. Why does it not influence calc_joinrel_size_estimate? We will use statistics on the parent table here. Am I wrong? Hmm, I think the reparameterization does not change the estimated size/costs. Quoting the comment ... Agree. I have looked at the code and figured it out - you're right. But it seems strange: maybe I don't understand something. Why not estimate selectivity for parameterized clauses based on the leaf partition statistics, not the parent ones? -- regards, Andrey Lepikhov Postgres Professional
Re: Allow ALTER SYSTEM SET on unrecognized custom GUCs
On 18/10/2023 12:15, Tom Lane wrote: Andrei Lepikhov writes: "SET foo.bar TO 'smth'" can immediately alter the placeholder's value. But what is the reason that "ALTER SYSTEM SET foo.bar TO 'smth'" doesn't do the same? Because it's not supposed to take effect until you issue a reload command (and maybe not even then, depending on which GUC we're talking about). I certainly think it wouldn't make sense for your own session to adopt the value ahead of others. Thanks for the answer. Introducing the assignable_custom_variable_name can be helpful. The code looks good. I think it deserves to be committed - after the indentation fix, of course. -- regards, Andrey Lepikhov Postgres Professional
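For the record, the asymmetry boils down to this sequence; pg_reload_conf() is the reload step Tom refers to:

SET foo.bar TO 'smth';               -- takes effect immediately, in this session
ALTER SYSTEM SET foo.bar TO 'smth';  -- only writes postgresql.auto.conf
SELECT pg_reload_conf();             -- other sessions pick the value up now
SHOW foo.bar;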
Re: Asymmetric partition-wise JOIN
On 17/10/2023 17:09, Ashutosh Bapat wrote: On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov wrote: On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation. Normally we do that using different Relids but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems are you afraid of here? Why do we need a range table entry for each scan of the non-partitioned relation? Not an RTE but a RelOptInfo. Using the same example as Alexander Korotkov, let's say A is the non-partitioned table and P is a partitioned table with partitions P1, P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ... APn. Each of these joins may have different properties and thus will require creating paths. In order to save these paths, we need RelOptInfos, which are identified by relids. Let's assume that the relids of these join RelOptInfos are created by the union of the relid of A and the relid of Px (the partition being joined). This is notionally misleading but doable. Ok, now I see your concern. In the current patch we build a RelOptInfo for each JOIN(A, Pi) with the build_child_join_rel() routine. And of course, they all have different sets of cheapest paths (it is one more point of optimality). At this point the RelOptInfo of relation A is fully formed and upper joins use the pathlist "as is", without changes. But the clauses of A parameterized by P will produce different translations for each of the partitions (see the sketch below). I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolve this issue? The relid is also used to track the scans at the executor level. Since we have so many scans on A, each may use a different plan, so we will need different ids for those. I don't understand this sentence. In what way does the executor use this index of the RelOptInfo? -- regards, Andrey Lepikhov Postgres Professional
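To make the point about per-partition translations concrete, a rough sketch of the loop shape; find_appinfos_by_relids(), adjust_appendrel_attrs() and build_child_join_rel() are existing planner routines (signatures abbreviated from memory), but the skeleton around them and the variable names are invented for illustration:

/* Hypothetical sketch: a separate join rel per (A, Pi) pair. */
for (int i = 0; i < nparts; i++)
{
	RelOptInfo *child = part_rels[i];
	int			nappinfos;
	AppendRelInfo **appinfos = find_appinfos_by_relids(root,
													   child->relids,
													   &nappinfos);

	/* Clauses written against the parent P are translated to child Pi... */
	List	   *child_restrictlist = (List *)
		adjust_appendrel_attrs(root, (Node *) parent_restrictlist,
							   nappinfos, appinfos);

	/* ...while the non-partitioned side A keeps its RelOptInfo as is. */
	(void) build_child_join_rel(root, rel_A, child, parent_joinrel,
								child_restrictlist, sjinfo);
	pfree(appinfos);
}

The open question in the discussion is whether the translated clauses that parameterize A also need per-partition storage on A's side, which a single RelOptInfo for A cannot provide.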
Re: Allow ALTER SYSTEM SET on unrecognized custom GUCs
On 17/10/2023 07:19, Tom Lane wrote: Currently we have this odd behavior (for a superuser): regression=# ALTER SYSTEM SET foo.bar TO 'baz'; ERROR: unrecognized configuration parameter "foo.bar" regression=# SET foo.bar TO 'baz'; SET regression=# ALTER SYSTEM SET foo.bar TO 'baz'; ALTER SYSTEM That is, you can't ALTER SYSTEM SET a random custom GUC unless there is already a placeholder GUC for it, because the find_option call in AlterSystemSetConfigFile fails. This is surely pretty inconsistent. Either the first ALTER SYSTEM SET ought to succeed, or the second one ought to fail too, because we don't have any more knowledge about the custom GUC than we did before. In the original discussion about this [1], I initially leaned towards "they should both fail", but I reconsidered: there doesn't seem to be any harm in allowing ALTER SYSTEM SET to succeed for any custom GUC name, as long as you're superuser. Hence, attached is a patch for that. Much of it is refactoring to avoid duplicating the code that checks for a reserved GUC name, which I think should still be done here --- otherwise, we're losing a lot of the typo detection that that check was intended to provide. (That is, if you have loaded an extension that defines "foo" as a prefix, we should honor the extension's opinion about whether "foo.bar" is valid.) I also fixed the code for GRANT ON PARAMETER so that it follows the same rules and throws the same errors for invalid cases. There's a chunk of AlterSystemSetConfigFile that now needs indenting one more tab stop, but I didn't do that yet for ease of review. Thoughts? I have reviewed this patch. It looks good in general. Now, we can change the placeholder value with the SET command and have one more (perhaps unusual) way to pass some data through the session. Setting aside the reason why the DBMS allows such behaviour, I have one question: "SET foo.bar TO 'smth'" can immediately alter the placeholder's value. But what is the reason that "ALTER SYSTEM SET foo.bar TO 'smth'" doesn't do the same? -- regards, Andrey Lepikhov Postgres Professional
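For readers following along, the shared name check being described boils down to something like the following sketch; the function and helper names are invented for illustration, and the committed code's shape differs:

/* Hypothetical sketch of the shared validity rule for custom GUC names. */
static bool
custom_name_is_assignable(const char *name)
{
	const char *dot = strchr(name, '.');

	/* Must be a qualified two-part name with non-empty components. */
	if (dot == NULL || dot == name || dot[1] == '\0')
		return false;

	/*
	 * If a loaded extension has reserved the prefix, only names the
	 * extension already defined are acceptable; for unreserved prefixes
	 * any name is allowed (a placeholder will be created).
	 */
	if (prefix_is_reserved_by_extension(name, dot - name))	/* invented */
		return guc_name_is_defined(name);					/* invented */
	return true;
}

The point of the refactoring is that SET, ALTER SYSTEM SET and GRANT ON PARAMETER all apply this one rule instead of three divergent checks.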
Re: Asymmetric partition-wise JOIN
On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation? Normally we do that using different Relids, but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems are you afraid of here? Why do we need a range table entry for each scan of the non-partitioned relation? -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote: On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes needed to achieve the desired behaviour. Some explains in the regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. OK. I think it isn't worth eliminating these IS NOT NULL quals with this complexity (at least at this stage of work). I made improvements to the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Also, I found that the suggestion from David Rowley [1] to qsort the array of relations to find duplicates faster is still unaddressed. I've implemented it. That helps to avoid quadratic complexity with a large number of relations (a simplified sketch of the idea follows the attached diff). Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). I would like to propose one more minor improvement (see the attachment). The idea here is that after removing a self-join and changing clauses we should re-probe the set of relids with the same Oid, because we can find more removable self-joins (see the demo test in join.sql). 
-- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 7b8dc7a2b7..f7ccda5231 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -2298,6 +2298,7 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove) { /* Create group of relation indexes with the same oid */ Relids group = NULL; + Relids removed; while (i < j) { @@ -2306,8 +2307,21 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove) } relids = bms_del_members(relids, group); - toRemove = bms_add_members(toRemove, - remove_self_joins_one_group(root, group)); + + /* +* Try to remove self-joins from a group of identical entries. +* Make next attempt iteratively - if something is deleted from +* a group, changes in clauses and equivalence classes can give +* us a chance to find more candidates. +*/ + do { + Assert(!bms_overlap(group, toRemove)); + removed = remove_self_joins_one_group(root, group); + toRemove = bms_add_members(toRemove, removed); + group = bms_del_members(group, removed); + } while (!bms_is_empty(removed) && +bms_membership(group) == BMS_MULTIPLE); + bms_free(
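As promised above, a simplified sketch of the qsort-based duplicate search; SelfJoinCandidate is the type the patch introduces (it shows up in typedefs.list), but the field names and the surrounding fragments here are guesses for illustration only:

/* Hypothetical sketch: group base rels by table Oid in O(N log N). */
typedef struct SelfJoinCandidate
{
	Oid			reloid;			/* pg_class OID of the base relation
								 * (assumed field) */
	int			relid;			/* index of the relation in the range table */
} SelfJoinCandidate;

static int
self_join_candidates_cmp(const void *a, const void *b)
{
	const SelfJoinCandidate *ca = (const SelfJoinCandidate *) a;
	const SelfJoinCandidate *cb = (const SelfJoinCandidate *) b;

	if (ca->reloid != cb->reloid)
		return (ca->reloid < cb->reloid) ? -1 : 1;
	return 0;
}

/* After filling candidates[0..n-1] from the joinlist: */
qsort(candidates, n, sizeof(SelfJoinCandidate), self_join_candidates_cmp);

/*
 * Adjacent entries with equal reloid now form the groups that feed
 * remove_self_joins_one_group(), instead of comparing every pair of
 * relations.
 */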
Re: Asymmetric partition-wise JOIN
On 15/10/2023 17:25, Alexander Korotkov wrote: On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov wrote: Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique for some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model the cases where an asymmetric join would give significant performance gains. Great! I'm looking forward to the revised patch. Before preparing a new patch, it would be better to find common ground on the following issue: so far, this optimization stands aside, proposing an alternative path for a join RelOptInfo if we have an underlying append path on the outer side. My back-burner idea is redesigning the approach: an asymmetric join doesn't change the partitioning scheme or the bounds of the partitioned side. So, it seems natural to make it a part of the partitionwise-join machinery and implement it as part of the try_partitionwise_join / generate_partitionwise_join_paths routines. -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 15/10/2023 07:18, Alexander Korotkov wrote: Hi Alexander, Hi Andrey, Thank you for your work on this subject. On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov wrote: The patch no longer applies cleanly, so I rebased it. Attaching the rebased version. Not surprising that the patch doesn't apply after 1.5 years since the last message. Could you please rebase it? I read the thread and the patch. The patch improves the joining of partitioned tables with non-partitioned relations. Let's denote the non-partitioned relation as A, and the partitions as P1 ... PN. The patch allows Append(Join(A, P1), ..., Join(A, PN)) instead of Join(A, Append(P1, ..., PN)). That could be cheaper because it's generally cheaper to join small pieces rather than do one big join. The drawback is the need to scan A multiple times. But is this really necessary and acceptable? Let's consider multiple options. 1) A is not a table. For instance, A is a function scan. In this case, doing multiple scans of A is not just expensive, but could lead to unexpected side effects. When the user includes a function once in the FROM clause, she expects this function to be evaluated once. I propose that we should materialize a scan of non-table relations (a sketch follows after this message). So, the materialized representation will be scanned multiple times, but the source only once. That would be similar to a CTE. 2) A is a table to be scanned with a parameterized path in the inner part of a nested loop join. In this case, there is no big scan of A and nothing to materialize. 3) A is a table to be used in a merge join or the outer part of a nested loop join. In this case, it would be nice to consider materialization. It's not always good to materialize, because materialization has its additional costs. I think that could be a cost-based decision. 4) A is used in a hash join. Could we re-use the hashed representation of A between multiple joins? I read upthread that it was proposed to share a hash table between multiple background workers via shared memory. But the first step would be to just share it between multiple join nodes within the same process. As we consider joining with each partition individually, different join methods could be chosen. As I understand it, the current patch considers joining with each of the partitions as a separate isolated optimization task. However, if we share resources between the multiple joins, then a need arises for some global optimization. For instance, a join type could be expensive when applied to an individual partition, but cheap when applied to all the partitions thanks to saving the common work. My idea is to consider generated common resources (such as materialized scans) as a property of the path. For instance, if the nested loop join is cheaper than the hash join, but the hash join generates a common hash map of table A, we don't drop the hash join immediately from consideration and leave it to see how it could help join other partitions. What do you think? Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. 
But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique for some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model the cases where an asymmetric join would give significant performance gains. [1] Oversight in reparameterize_path_by_child leading to executor crash https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
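As for the materialization of a non-table outer side mentioned in option 1 above, a minimal sketch of the idea; create_material_path() is the existing planner routine, while the condition and the surrounding fragment are invented for illustration:

/*
 * Hypothetical sketch: evaluate the non-partitioned side once and rescan
 * the materialized result for every per-partition join.
 */
Path	   *outer_path = outer_rel->cheapest_total_path;

if (outer_rel->rtekind == RTE_FUNCTION)		/* illustrative condition */
	outer_path = (Path *) create_material_path(outer_rel, outer_path);

/*
 * outer_path then joins with each of P1 ... Pn; the function scan itself
 * runs only once, similar to a CTE.
 */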
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 23/8/2023 12:37, Richard Guo wrote: To fix it we may need to modify RelOptInfos for Path, BitmapHeapPath, ForeignPath and CustomPath, and modify IndexOptInfos for IndexPath. It seems that that is not easily done without postponing reparameterization of paths until createplan.c. Attached is a patch which is v5 + fix for this new issue. Having looked into the patch for a while, I couldn't answer some naive questions for myself: 1. If we postpone parameterization until the plan creation, why do we still copy the path node in the FLAT_COPY_PATH macros? Do we really need it? 2. I see big switches on path nodes. Might it be time to create a path_walker function? I recall some thread where such a suggestion was declined, but I don't remember why. 3. Clause changing is postponed. Why does it not influence the calc_joinrel_size_estimate? We will use statistics on the parent table here. Am I wrong? -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 13/10/2023 15:56, a.rybakina wrote: Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). Agree. I can't say I like it either. But also, I suggest skipping some unnecessary assertions proposed in that patch: Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all the negative numbers, at least? Assert(is_opclause(orinfo->clause)); - above we skip clauses with rinfo->mergeopfamilies == NIL. Each mergejoinable clause is already checked as is_opclause. All these changes (see the attachment) are optional. I don't mind about the asserts, maybe I misunderstood something in the patch. About skipping SJ removal when no SJ quals are found, I assume it is about this: split_selfjoin_quals(root, restrictlist, , , inner->relid, outer->relid); + if (list_length(selfjoinquals) == 0) + { + /* + * XXX: + * we would detect a self-join without quals like 'x==x' if we had + * a foreign key constraint on some of the other quals and this join + * didn't have any columns from the outer in the target list. + * But it is still a complex task. + */ + continue; + } As far as I remember, this is the place where we check that the SJ list is empty, and it is logical, in my opinion, that no transformations should be performed if no elements are found for them. You forget we have the "degenerate" case, as Alexander mentioned above. What if you have something like this: SELECT ... FROM A a1, A a2 WHERE a1.id=1 AND a2.id=1; In this case, uniqueness can be achieved by the baserestrictinfo "A.id=1", if we have a unique index on this column. -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. OK. I think it isn't worth eliminating these IS NOT NULL quals with this complexity (at least at this stage of work). Yeah. I think it would be meaningful in the case of also replacing nested x IS NOT NULL with nothing. But it requires using a mutator instead of the walker and may be done more accurately next time. I made improvements to the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Great! Also, I found that the suggestion from David Rowley [1] to qsort the array of relations to find duplicates faster is still unaddressed. I've implemented it. That helps to avoid quadratic complexity with a large number of relations. I see. The thread is too long so far; thanks for the catch. Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). Agree. I can't say I like it either. But also, I suggest skipping some unnecessary assertions proposed in that patch: Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all the negative numbers, at least? Assert(is_opclause(orinfo->clause)); - above we skip clauses with rinfo->mergeopfamilies == NIL. Each mergejoinable clause is already checked as is_opclause. All these changes (see the attachment) are optional. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index f0746f35a3..7b8dc7a2b7 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -1710,8 +1710,6 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, List *binfo_candidates = NIL; ReplaceVarnoContext ctx = {.from = toRemove->relid,.to = toKeep->relid}; - Assert(toKeep->relid != -1); - /* * Replace index of removing table with the keeping one. The technique of * removing/distributing restrictinfo is used here to attach just appeared @@ -2017,8 +2015,6 @@ match_unique_clauses(PlannerInfo *root, RelOptInfo *outer, List *uclauses, /* Don't consider clauses which aren't similar to 'F(X)=G(Y)' */ continue; - Assert(is_opclause(orinfo->clause)); - oclause = bms_is_empty(orinfo->left_relids) ? get_rightop(orinfo->clause) : get_leftop(orinfo->clause); c2 = (bms_is_empty(orinfo->left_relids) ? 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 2ff4881fdf..96ebd6eed3 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -367,7 +367,6 @@ CatalogId CatalogIdMapEntry CatalogIndexState ChangeVarNodes_context -ReplaceVarnoContext CheckPoint CheckPointStmt CheckpointStatsData @@ -2341,6 +2340,7 @@ ReorderBufferUpdateProgressTxnCB ReorderTuple RepOriginId ReparameterizeForeignPathByChild_function +ReplaceVarnoContext ReplaceVarsFromTargetList_context ReplaceVarsNoMatchOption ReplicaIdentityStmt @@ -2474,6 +2474,7 @@ SeenRelsEntry SelectLimit SelectStmt Selectivity +SelfJoinCandidate SemTPadded SemiAntiJoinFactors SeqScan
Re: Removing unneeded self joins
On 11/10/2023 02:29, Alena Rybakina wrote: I have reviewed your patch and I noticed a few things. Thanks for your efforts. I have looked at the latest version of the code. I assume that the error lies in the replace_varno_walker function, especially in the place where we check the node by type Var and do not form any NullTest node. It's not a bug, it's an optimization we discussed with Alexander above. Secondly, I added some code in some places to catch erroneous cases, and added a condition for when we should not try to apply the self-join-removal transformation: when the self-join list turns out to be empty after searching for it, and in general when there are no joins in the query. Besides, I added a query for testing and wrote about it above. I have attached my diff file. Ok, I will look at this. In addition, I found a comment for myself that was not clear to me. I would be glad if you could explain it to me. You mentioned a superior outer join in the comment; unfortunately, I didn't find anything about it in the PostgreSQL code, and this explanation remained unclear to me. Could you explain in more detail what you meant? I meant here that only clauses pushed by reconsider_outer_join_clauses() can be found in the joininfo list, and they are not relevant, as you can understand. Having written that, I realized that it was a false statement. ;) - joininfo can also contain results of previous SJE iterations, look: CREATE TABLE test (oid int PRIMARY KEY); CREATE UNIQUE INDEX ON test((oid*oid)); explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c1.oid=c2.oid AND c1.oid*c2.oid=c3.oid*c3.oid; explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c1.oid=c3.oid AND c1.oid*c3.oid=c2.oid*c2.oid; explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c3.oid=c2.oid AND c3.oid*c2.oid=c1.oid*c1.oid; Having executed this SQL code, you can see that in the last query the SJE feature didn't delete one of the JOINs, for the reason I wrote above. It's not a one-minute fix - I will try to propose a solution a bit later. -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes needed to achieve the desired behaviour. Some explains in the regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index a127239d30..c12aa15fc9 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -73,7 +73,7 @@ static bool is_innerrel_unique_for(PlannerInfo *root, List *restrictlist, List **extra_clauses); static Bitmapset *replace_relid(Relids relids, int oldId, int newId); -static void replace_varno(Node *node, int from, int to); +static void replace_varno(PlannerInfo *root, Node *node, int from, int to); /* @@ -388,7 +388,7 @@ remove_rel_from_query_common(PlannerInfo *root, RelOptInfo *rel, } /* Update lateral references. */ - replace_varno((Node *) otherrel->lateral_vars, relid, subst); + replace_varno(root, (Node *) otherrel->lateral_vars, relid, subst); } /* @@ -425,7 +425,7 @@ remove_rel_from_query_common(PlannerInfo *root, RelOptInfo *rel, sjinf->commute_below_l = replace_relid(sjinf->commute_below_l, ojrelid, subst); sjinf->commute_below_r = replace_relid(sjinf->commute_below_r, ojrelid, subst); - replace_varno((Node *) sjinf->semi_rhs_exprs, relid, subst); + replace_varno(root, (Node *) sjinf->semi_rhs_exprs, relid, subst); } /* @@ -1399,6 +1399,7 @@ is_innerrel_unique_for(PlannerInfo *root, typedef struct ReplaceVarnoContext { + PlannerInfo *root; int from; int to; } ReplaceVarnoContext; @@ -1420,6 +1421,11 @@ replace_varno_walker(Node *node, ReplaceVarnoContext *ctx) } return false; } + + /* +* Expression walker doesn't know about RestrictInfo node. Do recursive pass +* into the clauses manually. 
+*/ if (IsA(node, RestrictInfo)) { RestrictInfo *rinfo = (RestrictInfo *) node; @@ -1429,20 +1435,26 @@ replace_varno_walker(Node *node, ReplaceVarnoContext *ctx) if (bms_is_member(ctx->from, rinfo->clause_relids)) { - replace_varno((Node *) rinfo->clause, ctx->from, ctx->to); - replace_varno((Node *) rinfo->orclause, ctx->from, ctx->to); - rinfo->clause_relids = replace_relid(rinfo->clause_relids, ctx->from, ctx->to); - rinfo->left_relids = replace_relid(rinfo->left_relids, ctx->from, ctx->to); - rinfo->right_relids = replace_relid(rinfo->right_relids, ctx->from, ctx->to); + replace_varno(ctx->root, (Node *) rinfo->clause, ctx->from, ctx->to); + replace_varno(ctx->root, (Node *) rinfo->orclause, ctx->from, ctx->to); + rinfo->clause_relids = + replace_relid(rinfo->clause_relids, ctx->from, ctx->to); + rinfo->left_relids = + replace_relid(rinfo->left_relids, ctx->from, ctx->to); +
Re: Removing unneeded self joins
On 4/10/2023 14:34, Alexander Korotkov wrote: Hi! On Wed, Oct 4, 2023 at 9:56 AM Andrei Lepikhov wrote: > On 4/10/2023 07:12, Alexander Korotkov wrote: > > On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov > > wrote: > >> On 5/7/2023 21:28, Andrey Lepikhov wrote: > >>> During the significant code revision in v.41 I lost some replacement > >>> operations. Here is the fix and extra tests to check this in the future. > >>> Also, Tom added the JoinDomain structure five months ago, and I added > >>> code to replace relids for that place too. > >>> One more thing, I found out that we didn't replace SJs, defined by > >>> baserestrictinfos, if not a single self-join clause existed for the join. > >>> Now, it is fixed, and the test has been added. > >>> To understand the changes readily, see the delta file in the attachment. > >> Here is a new patch in the attachment. Rebased on current master and some > >> minor gaffes are fixed. > > > > I went through the thread and I think the patch is getting into better shape. A > > couple of notes from my side. > > 1) Why does replace_relid() make a copy of relids only on insert/replace of > > a member, but perform deletion in place? > > Shortly speaking, it was done according to the 'Paranoid' strategy. > The main reason for copying before deletion was the case with the rinfo > required_relids and clause_relids. They both point to the same Bitmapset > in some cases. And we feared such things for other fields. > Right now, it may be redundant because we resolved the issue mentioned > above in replace_varno_walker. OK, but my point is still that you should be paranoid in all the cases or none of the cases. Right now the (newId < 0) branch doesn't copy the source relids, but the bms_is_member(oldId, relids) branch does copy. Also, I think whether we copy or not should be reflected in the function comment. /* * Substitute oldId with newId in relids. */ static Bitmapset * replace_relid(Relids relids, int oldId, int newId) { if (oldId < 0) return relids; if (newId < 0) /* Delete relid without substitution. */ return bms_del_member(relids, oldId); if (bms_is_member(oldId, relids)) return bms_add_member(bms_del_member(bms_copy(relids), oldId), newId); return relids; } We tried to use replace_relid() for both cases of JOIN deletion: unneeded outer join and self-join, and the relid deletion is used only in the first case. Such an approach was used there for a long time, and we just didn't change it. I am inclined to remove the copy operation in the relid replacement code. > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for the columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. Thanks for the curious issue. The new field Var::varnullingrels introduced in [2] doesn't make sense here, as I see: we operate with plain relations only, and I don't know how it can be applied to an arbitrary subtree containing OUTER JOINs. The second option, the attnotnull flag, can be used in this code. 
We haven't implemented it because the process_equivalence routine doesn't check attnotnull before creating a NullTest. In general, it is not a difficult operation - we just need to add a trivial get_attnotnull() routine to lsyscache.c, like get_attgenerated() and other functions (a sketch follows below). But replace_varno uses the walker to change the relid. The mentioned replacement, like X=X --> X IS NOT NULL, can be applied at different levels of the expression, look: A a1 JOIN A a2 ON (a1.id=a2.id) WHERE (a1.x AND (a1.y=a2.y)) Here, we can replace id=id and y=y. It may need some 'unwanted clauses' collection procedure and a second pass through the expression tree to remove them. It may add some unpredictable overhead. We can replace such a clause with a trivial 'TRUE' clause, of course. But is it the feature you have requested? -- regards, Andrey Lepikhov Postgres Professional
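For what it's worth, the lsyscache.c routine really is trivial; a sketch modeled on get_attgenerated() could look like this (an illustration, not code from any posted patch):

/*
 * get_attnotnull
 *
 *		Given the relation id and the attribute number,
 *		return the "attnotnull" field from the attribute relation.
 */
bool
get_attnotnull(Oid relid, AttrNumber attnum)
{
	HeapTuple	tp;
	Form_pg_attribute att_tup;
	bool		result;

	tp = SearchSysCache2(ATTNUM,
						 ObjectIdGetDatum(relid),
						 Int16GetDatum(attnum));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
			 attnum, relid);
	att_tup = (Form_pg_attribute) GETSTRUCT(tp);
	result = att_tup->attnotnull;
	ReleaseSysCache(tp);
	return result;
}

With that in place, the SJE code could consult attnotnull and skip generating a NullTest for columns already declared NOT NULL.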
Re: Removing unneeded self joins
On 4/10/2023 07:12, Alexander Korotkov wrote: Hi! Thanks for the review! I think this is a neat optimization. On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov wrote: On 5/7/2023 21:28, Andrey Lepikhov wrote: During the significant code revision in v.41 I lost some replacement operations. Here is the fix and extra tests to check this in the future. Also, Tom added the JoinDomain structure five months ago, and I added code to replace relids for that place too. One more thing, I found out that we didn't replace SJs, defined by baserestrictinfos, if not a single self-join clause existed for the join. Now, it is fixed, and the test has been added. To understand the changes readily, see the delta file in the attachment. Here is a new patch in the attachment. Rebased on current master and some minor gaffes are fixed. I went through the thread and I think the patch is getting into better shape. A couple of notes from my side. 1) Why does replace_relid() make a copy of relids only on insert/replace of a member, but perform deletion in place? Shortly speaking, it was done according to the 'Paranoid' strategy. The main reason for copying before deletion was the case with the rinfo required_relids and clause_relids. They both point to the same Bitmapset in some cases. And we feared such things for other fields. Right now, it may be redundant because we resolved the issue mentioned above in replace_varno_walker. Relid replacement machinery is the most contradictory code here. We used a utilitarian approach and implemented a simplistic variant (a consistently-copying alternative is sketched after this message). 2) It would be nice to skip the insertion of IS NOT NULL checks when they are not necessary. [1] points out that infrastructure from [2] might be useful. The patchset from [2] seems committed now. However, I can't see it is directly helpful in this matter. Could we just skip adding an IS NOT NULL clause for the columns that have pg_attribute.attnotnull set? Thanks for the links, I will look into that case. Links 1. https://www.postgresql.org/message-id/2375492.jE0xQCEvom%40aivenronan 2. https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us -- regards, Andrey Lepikhov Postgres Professional
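To make the "paranoid in all branches" point concrete, a sketch of the consistently-copying variant discussed above (not the patch code):

/*
 * Substitute oldId with newId in relids.
 *
 * Sketch: never modify the input set, so callers that share a Bitmapset
 * between fields (e.g. required_relids and clause_relids) stay safe in
 * every branch, not only on insert/replace.
 */
static Bitmapset *
replace_relid(Relids relids, int oldId, int newId)
{
	if (oldId < 0)
		return relids;

	/* Delete relid without substitution, on a copy. */
	if (newId < 0)
		return bms_del_member(bms_copy(relids), oldId);

	if (bms_is_member(oldId, relids))
		return bms_add_member(bms_del_member(bms_copy(relids), oldId),
							  newId);

	return relids;
}

The price is an extra bms_copy() per deletion; whether that matters is a separate question, since the join-removal path is not especially hot.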
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea, have doubts on the implementation. I have a question. I see the feature triggers ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit, too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocating some memory) before aborting the transaction. So, the allocation error can move us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index 33975687b3..3f992b8d92 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -291,10 +291,7 @@ _SPI_commit(bool chain) { ErrorData *edata; - /* Save error info in caller's context */ MemoryContextSwitchTo(oldcontext); - edata = CopyErrorData(); - FlushErrorState(); /* * Abort the failed transaction. If this fails too, we'll just @@ -302,6 +299,10 @@ _SPI_commit(bool chain) */ AbortCurrentTransaction(); + /* Save error info in caller's context */ + edata = CopyErrorData(); + FlushErrorState(); + /* ... and start a new one */ StartTransactionCommand(); if (chain) @@ -383,10 +384,7 @@ _SPI_rollback(bool chain) { ErrorData *edata; - /* Save error info in caller's context */ MemoryContextSwitchTo(oldcontext); - edata = CopyErrorData(); - FlushErrorState(); /* * Try again to abort the failed transaction. If this fails too, @@ -395,6 +393,10 @@ _SPI_rollback(bool chain) */ AbortCurrentTransaction(); + /* Save error info in caller's context */ + edata = CopyErrorData(); + FlushErrorState(); + /* ... and start a new one */ StartTransactionCommand(); if (chain) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 12edc5772a..f9cf599026 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -2565,7 +2565,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, PG_CATCH(); { MemoryContext ecxt = MemoryContextSwitchTo(ccxt); - ErrorData *errdata = CopyErrorData(); + ErrorData *errdata; /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */ if (iterstate) @@ -2579,6 +2579,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, */ AbortCurrentTransaction(); + errdata = CopyErrorData(); + /* make sure there's no cache pollution */ ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
Re: POC: GROUP BY optimization
New version of the patch. Fixed minor inconsistencies and rebased onto current master. -- regards, Andrey Lepikhov Postgres Professional From 2f5a42c8a53286f830e3376ff4d3f8b7d4215b4b Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Sep 2023 11:20:03 +0700 Subject: [PATCH] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause using sort, the cost may depend heavily on the order in which the keys are compared when building the groups. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative orderings of the keys, trying to find a cheaper one. In principle, we might generate grouping paths for all permutations of the keys and leave the rest to the optimizer. But that might get very expensive, so we try to pick only a couple of interesting orderings based on both local and global information. When planning the grouping path, we explore statistics (number of distinct values, cost of the comparison function) for the keys and reorder them to minimize comparison costs. Intuitively, it may be better to perform more expensive comparisons (for complex data types, etc.) last because maybe the cheaper comparisons will be enough. Similarly, the higher the cardinality of a key, the lower the probability we'll need to compare more keys. The patch generates and costs various orderings, picking the cheapest ones. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch generates orderings and picks those minimizing the comparison cost (for various path keys), and then adds orderings that might be useful for operations higher up in the plan (ORDER BY, etc.). Finally, it always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. 
--- .../postgres_fdw/expected/postgres_fdw.out| 36 +- src/backend/optimizer/path/costsize.c | 363 ++- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 600 ++ src/backend/optimizer/plan/planner.c | 465 -- src/backend/optimizer/util/pathnode.c | 4 +- src/backend/utils/adt/selfuncs.c | 38 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/cost.h | 4 +- src/include/optimizer/paths.h | 7 + src/include/utils/selfuncs.h | 5 + src/test/regress/expected/aggregates.out | 244 ++- .../regress/expected/incremental_sort.out | 2 +- src/test/regress/expected/join.out| 51 +- src/test/regress/expected/merge.out | 15 +- .../regress/expected/partition_aggregate.out | 44 +- src/test/regress/expected/partition_join.out | 75 +-- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/union.out | 60 +- src/test/regress/sql/aggregates.sql | 99 +++ src/test/regress/sql/incremental_sort.sql | 2 +- 23 files changed, 1769 insertions(+), 382 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 144c114d0f..63af7feabe 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -2319,18 +2319,21 @@ SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM -- join with pseudoconstant quals EXPLAIN (VERBOSE, COSTS OFF) SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1 AND CURRENT_USER = SESSION_USER) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10; - QUERY PLAN - + QUERY PLAN
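As a toy illustration of the local reordering heuristic described in the commit message above; the structures and the scoring rule here are invented, while the real patch works with PathKey lists, estimate_num_groups() and operator costs, and evaluates several candidate orderings:

/*
 * Hypothetical sketch: order GROUP BY keys so that comparisons that are
 * cheap and likely to distinguish rows come first.
 */
typedef struct GroupKeyInfo
{
	Index		sortgroupref;	/* identifies the key */
	double		ndistinct;		/* estimated number of distinct values */
	Cost		cmp_cost;		/* estimated cost of one comparison */
} GroupKeyInfo;

static int
group_key_cmp(const void *a, const void *b)
{
	const GroupKeyInfo *ka = (const GroupKeyInfo *) a;
	const GroupKeyInfo *kb = (const GroupKeyInfo *) b;
	double		sa = ka->ndistinct / ka->cmp_cost;
	double		sb = kb->ndistinct / kb->cmp_cost;

	/* Higher selectivity per unit of comparison cost sorts earlier. */
	if (sa != sb)
		return (sa > sb) ? -1 : 1;
	return 0;
}

/*
 * qsort(keys, nkeys, sizeof(GroupKeyInfo), group_key_cmp) then yields one
 * candidate ordering; the patch costs it against the query's original
 * ordering and against orderings useful for ORDER BY higher up, keeping
 * the cheapest.
 */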
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea, have doubts on the implementation. I have a question. I see the feature triggers ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit, too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? Also, the patch needs to be rebased. -- regards, Andrey Lepikhov Postgres Professional