Re: using extended statistics to improve join estimates
On 3/9/2024 14:58, Andrei Lepikhov wrote:
> On 17/6/2024 18:10, Tomas Vondra wrote:
>> x = $1 AND y = $2 AND ...
> As I see, the current patch doesn't resolve this issue.

Let me explain my previous argument with an example (see the attachment). The query is designed to be executed with a parameterised NL join:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM test t1 NATURAL JOIN test1 t2 WHERE t2.x1 < 1;

After applying the topmost patch from the patchset, we can see two different estimations (explain output trimmed a little) before and after adding extended statistics:

-- before:
Nested Loop (rows=1) (actual rows=1 loops=1)
  -> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
       Filter: (x1 < 1)
  -> Memoize (rows=1) (actual rows=100 loops=100)
       Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
       -> Index Scan using test_x1_x2_x3_x4_idx on test t1 (rows=1 width=404) (actual rows=100 loops=1)
            Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND (x3 = t2.x3) AND (x4 = t2.x4))

-- after:
Nested Loop (rows=1) (actual rows=1 loops=1)
  -> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
       Filter: (x1 < 1)
  -> Memoize (rows=1) (actual rows=100 loops=100)
       Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
       -> Index Scan using test_x1_x2_x3_x4_idx on test t1 (rows=1) (actual rows=100 loops=1)
            Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND (x3 = t2.x3) AND (x4 = t2.x4))

You can see that the index condition was treated as a join clause, and the parameterised NL join was estimated correctly by an MCV on both sides. But the scan estimation is incorrect. Moreover, sometimes we don't have an MCV at all, so the next step for this patch should be implementing a bare estimation using only the ndistinct on each side. What to do with the scan filter? Not sure so far, but it looks like logic similar to var_eq_non_const() could be used here.

-- 
regards, Andrei Lepikhov

example.sql Description: application/sql
Re: using extended statistics to improve join estimates
On 17/6/2024 18:10, Tomas Vondra wrote:
> Let me quickly go through the original parts - most of this is already in the "review" patches, but it's better to quote the main points here to start a discussion. I'll omit some of the smaller suggestions, so please look at the 'review' patches.
>
> v20240617-0001-Estimate-joins-using-extended-statistics.patch
> - rewords a couple comments, particularly for statext_find_matching_mcv
> - a couple XXX comments about possibly stale/inaccurate comments
>
> v20240617-0054-clauselist_selectivity_hook.patch
> - I believe this does not work with the earlier patch that removed the estimatedclauses bitmap from the "try" function.

This patch set is too big to eat at once - it's just challenging to invent examples and counterexamples. Can we get these two patches into master and analyse further improvements based on that?

Some thoughts: you removed varRelid. I have thought about replacing this value with RelOptInfo, which would allow extensions (remember the selectivity hook) to know about the underlying path tree.

The first patch is generally OK, and I vote for having it in master. However, the most harmful case, which I see most reports about, is a parameterised JOIN on multiple ANDed clauses. In that case, we have a scan filter like the one below:

x = $1 AND y = $2 AND ...

As I see, the current patch doesn't resolve this issue.

-- 
regards, Andrei Lepikhov
Re: using extended statistics to improve join estimates
On 5/23/24 09:04, Andy Fan wrote:
> Andrei Lepikhov writes:
>>> * c) No extended stats with MCV. If there are multiple join clauses,
>>> * we can try using ndistinct coefficients and do what eqjoinsel does.
> OK, I didn't pay enough attention to this comment before. And yes, I reach the same conclusion as you - there is no implementation of this. If so, I think we should remove the comment and do the implementation in the next patch.

I have the opposite opinion about it:
1. A distinct estimation is a more universal thing - you can use it precisely on any subset of columns.
2. A distinct estimation is faster - it is just a number; you don't need to detoast a huge array of values and compare them one by one.
So, IMO, it is an essential part of join estimation, and it should be implemented like in eqjoinsel.

> Do you think the hook proposal is closely connected with the current topic? IIUC, it seems not. So a dedicated thread to explain the problem to solve, the proposal, and the following discussion should be helpful for both topics. I'm just worried that mixing the two in one thread would make things more complex unnecessarily.

Sure.

-- 
regards, Andrei Lepikhov
Postgres Professional
Re: using extended statistics to improve join estimates
Andrei Lepikhov writes:

> On 20/5/2024 15:52, Andy Fan wrote:
>> Hi Andrei,
>>
>>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>>> Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.
>>> I'm looking at your patch now - an excellent start to an eagerly awaited feature!
>>> A couple of questions:
>>> 1. I didn't find the implementation of strategy 'c' - estimation by the number of distinct values. Did you forget it?
>> What do you mean by "strategy 'c'"?
> As described in the 0001-* patch:
> * c) No extended stats with MCV. If there are multiple join clauses,
> * we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before. And yes, I reach the same conclusion as you - there is no implementation of this. If so, I think we should remove the comment and do the implementation in the next patch.

>>> 2. Can we add a clauselist selectivity hook into the core (something similar to the code in the attachment)? It would allow the development and testing of multicolumn join estimations without patching the core.
>> The idea LGTM. But do you want
>> + if (clauselist_selectivity_hook)
>> +     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype, ...
>> rather than
>> + if (clauselist_selectivity_hook)
>> +     *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may estimate only some of the clauses; that is why I added the input/output parameter 'estimatedclauses', by analogy with statext_clauselist_selectivity.

OK.

Do you think the hook proposal is closely connected with the current topic? IIUC, it seems not. So a dedicated thread to explain the problem to solve, the proposal, and the following discussion should be helpful for both topics. I'm just worried that mixing the two in one thread would make things more complex unnecessarily.

-- 
Best Regards
Andy Fan
Re: using extended statistics to improve join estimates
On 5/20/24 16:40, Andrei Lepikhov wrote:
> On 20/5/2024 15:52, Andy Fan wrote:
>> + if (clauselist_selectivity_hook)
>> +     *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may estimate only some of the clauses; that is why I added the input/output parameter 'estimatedclauses', by analogy with statext_clauselist_selectivity.

Here is a polished and slightly modified version of the proposed hook. Additionally, I propose exporting the statext_mcv_clauselist_selectivity routine, like dependencies_clauselist_selectivity. This could potentially enhance the functionality of the PostgreSQL estimation code.

To clarify the purpose: I want an optional, more conservative estimation based on distinct statistics, loaded as a library. Let me provide a (somewhat degenerate) example:

CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct) ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:

(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

Dependencies work great if the same dependency holds for all the data in the columns. But we get underestimations if different laws hold for subsets of rows. So, if we don't have MCV statistics, sometimes we need to pass over dependency statistics and use ndistinct instead.
-- 
regards, Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..1508a1beea 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, &estimatedclauses,
+										 use_extended_stats);
+
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..b1722f5a60 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1712,7 +1712,7 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
  */
-static Selectivity
+Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
 								   RelOptInfo *rel, Bitmapset **estimatedclauses,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..436f30bdde 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,6 +104,14 @@ extern void BuildRelationExtStatistics(Relation onerel, bool inh, double totalro
 extern int	ComputeExtStatisticsRows(Relation onerel, int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char type);
+extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
+													  List *clauses,
+													  int varRelid,
+													  JoinType jointype,
+													  SpecialJoinInfo *sjinfo,
+													  RelOptInfo *rel,
+													  Bitmapset **estimatedclauses,
+													  bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 													   List *clauses,
 													   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..253f584c65 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,17 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef Se
Re: using extended statistics to improve join estimates
On 20/5/2024 15:52, Andy Fan wrote:
> Hi Andrei,
>
>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>> Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.
>> I'm looking at your patch now - an excellent start to an eagerly awaited feature!
>> A couple of questions:
>> 1. I didn't find the implementation of strategy 'c' - estimation by the number of distinct values. Did you forget it?
> What do you mean by "strategy 'c'"?

As described in the 0001-* patch:
 * c) No extended stats with MCV. If there are multiple join clauses,
 * we can try using ndistinct coefficients and do what eqjoinsel does.

>> 2. Can we add a clauselist selectivity hook into the core (something similar to the code in the attachment)? It would allow the development and testing of multicolumn join estimations without patching the core.
> The idea LGTM. But do you want
> + if (clauselist_selectivity_hook)
> +     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype, ...
> rather than
> + if (clauselist_selectivity_hook)
> +     *return* clauselist_selectivity_hook(root, clauses, ..)

Of course - the library may estimate only some of the clauses; that is why I added the input/output parameter 'estimatedclauses', by analogy with statext_clauselist_selectivity.

-- 
regards, Andrei Lepikhov
Postgres Professional
Re: using extended statistics to improve join estimates
Hi Andrei,

>> On 4/3/24 01:22, Tomas Vondra wrote:
>>> Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.
> I'm looking at your patch now - an excellent start to an eagerly awaited feature!
> A couple of questions:
> 1. I didn't find the implementation of strategy 'c' - estimation by the number of distinct values. Did you forget it?

What do you mean by "strategy 'c'"?

> 2. Can we add a clauselist selectivity hook into the core (something similar to the code in the attachment)? It would allow the development and testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want

+ if (clauselist_selectivity_hook)
+     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype, ...

rather than

+ if (clauselist_selectivity_hook)
+     *return* clauselist_selectivity_hook(root, clauses, ..)

?

-- 
Best Regards
Andy Fan
Re: using extended statistics to improve join estimates
On 4/3/24 01:22, Tomas Vondra wrote:
> Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the number of distinct values. Did you forget it?
2. Can we add a clauselist selectivity hook into the core (something similar to the code in the attachment)? It would allow the development and testing of multicolumn join estimations without patching the core.

-- 
regards, Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..271d36a522 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, &estimatedclauses);
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..ee28d3ba9b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,15 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef bool (*clauselist_selectivity_hook_type) (PlannerInfo *root,
+												  List *clauses,
+												  int varRelid,
+												  JoinType jointype,
+												  SpecialJoinInfo *sjinfo,
+												  Bitmapset **estimatedclauses);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,
Re: using extended statistics to improve join estimates
Hello Justin,

Thanks for showing interest in this!

> On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
>> The 's/estimiatedcluases/estimatedclauses/' typo in the commit message is not fixed, since I would have to regenerate all the commits
>
> Maybe you know this, but some of these patches need to be squashed. Regenerating the patches to address feedback is the usual process. When they're not squashed, it makes it hard to review the content of the patches.

You might have overlooked the fact that each individual commit exists just to make the communication effective (easy to review), and all of them will be merged into one commit at the end / during the review process. Even so, if something makes it hard to review, I am happy to regenerate the patches - but does 's/estimiatedcluases/estimatedclauses/' belong to that category? I'm pretty sure that is not the only typo or inappropriate word; if we need to regenerate the 22 patches because of that, we would have to regenerate them pretty often. Would you mind providing more feedback all at once, so I can merge it in one modification, or do you think the typo has blocked the review process?

> For example:
> [PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
> ..adds .sql regression tests, but the expected .out isn't updated until
> [PATCH v1 19/22] Fix the incorrect comment on extended stats.
>
> That fixes an elog() in Tomas' original commit, so it should probably be 002 or 003.

Which elog are you talking about?

> It might make sense to keep the first commit separate for now, since it's nice to keep Tomas' original patch "pristine" to make more apparent the changes you're proposing.

This is my goal as well. Did you find anything I did that breaks this rule? That's absolutely not my intention.

> Another:
> [PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
> ..doesn't compile without
> [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as
>
> Your 022 patch fixes a typo in your 002 patch, which means that first one reads a patch with a typo, and then later, a 10 line long patch reflowing the comment with a typo fixed.

I would be willing to regenerate the 22 patches if you think the typo makes the review process hard. I can do such things, but I am not willing to do it often.

> A good guideline is that each patch should be self-contained, compiling and passing tests. Which is more difficult with a long stack of patches.

I agree.

-- 
Best Regards
Andy Fan
Re: using extended statistics to improve join estimates
On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
> The 's/estimiatedcluases/estimatedclauses/' typo in the commit message is not fixed since I would have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed. Regenerating the patches to address feedback is the usual process. When they're not squashed, it makes it hard to review the content of the patches.

For example:

[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
..adds .sql regression tests, but the expected .out isn't updated until
[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be 002 or 003. It might make sense to keep the first commit separate for now, since it's nice to keep Tomas' original patch "pristine" to make more apparent the changes you're proposing.

Another:

[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
..doesn't compile without
[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as

Your 022 patch fixes a typo in your 002 patch, which means that one first reads a patch with a typo, and then later, a 10-line-long patch reflowing the comment with the typo fixed.

A good guideline is that each patch should be self-contained, compiling and passing tests. That is more difficult with a long stack of patches.

-- 
Justin
Re: using extended statistics to improve join estimates
Hello Justin!

Justin Pryzby writes:

> |../src/backend/statistics/extended_stats.c:3151:36: warning: 'relid' may be used uninitialized [-Wmaybe-uninitialized]
> | 3151 |		if (var->varno != relid)
> |      |		^
> |../src/backend/statistics/extended_stats.c:3104:33: note: 'relid' was declared here
> | 3104 |		int	relid;
> |      |		^
> |[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
> |../src/backend/statistics/mcv.c: In function 'mcv_combine_extended':
> |../src/backend/statistics/mcv.c:2431:49: warning: declaration of 'idx' shadows a previous local [-Wshadow=compatible-local]

Thanks for the feedback. The warnings should be fixed in the latest revision. The 's/estimiatedcluases/estimatedclauses/' typo in the commit message is not fixed, since I would have to regenerate all the commits to fix it. We are still in the discussion stage, and I think its impact on the discussion is pretty limited.

> FYI, I also ran the patch with a $large number of reports without observing any errors or crashes.

Good to know that.

> I'll try to look harder at the next patch revision.

Thank you!

-- 
Best Regards
Andy Fan
Re: using extended statistics to improve join estimates
On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:
> 0001 is your patch, I just rebased it against the current master. 0006 is not much relevant to the current patch, and I think it can be committed individually if you are OK with that.

Your 002 should also remove listidx to avoid a warning:

../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' set but not used [-Werror,-Wunused-but-set-variable]

> Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments
> @@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
>  	/* needs to happen before skipping any clauses */
>  	listidx++;
> 
> -	/* Skip clauses that were already estimated. */
> -	if (bms_is_member(listidx, estimatedclauses))
> -		continue;
> -

Your 007 could instead test if relids == NULL:

> Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
> -	if (bms_num_members(relids) == 0)
> +	if (bms_is_empty(relids))

Typos:
001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: 'relid' may be used uninitialized [-Wmaybe-uninitialized]
| 3151 |		if (var->varno != relid)
|      |		^
|../src/backend/statistics/extended_stats.c:3104:33: note: 'relid' was declared here
| 3104 |		int	relid;
|      |		^
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function 'mcv_combine_extended':
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of 'idx' shadows a previous local [-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without observing any errors or crashes.

I'll try to look harder at the next patch revision.

-- 
Justin
Re: using extended statistics to improve join estimates
Tomas Vondra writes:

> On 4/2/24 10:23, Andy Fan wrote:
>>
>> Thanks for the patch! Before I review the patch at the code level, I want to explain my understanding of this patch first.
>
> If you want to work on this patch, that'd be cool. A review would be great, but if you want to maybe take over and try moving it forward, that'd be even better. I don't know when I'll have time to work on it again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

>> Before this patch, we already use MCV information for eqjoinsel; it works by combining the MCVs on both sides to figure out the mcv_freq and then treating the rest equally, but this doesn't work for MCVs in extended statistics; this patch fills that gap. Besides that, since extended statistics mean more than one column is involved, if one or more columns are Const based on RestrictInfo, we can use such information to filter the MCVs we are interested in - that's really cool.
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that :)

>> I did some more testing, all of it inner joins so far; all of it works amazingly, and I am surprised this patch didn't draw enough attention.
> I think it didn't go forward for a bunch of reasons:
> ..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I worked on the patch so I don't recall the details, but I think I've been thinking mostly about "star join" queries, where a big "fact" table joins to small dimensions. And in that case the fact table may have an MCV, but the dimensions certainly don't have any (because the join happens on a PK).
>
> But maybe that's the wrong way to think about it - it was clearly useful to consider the case with (per-attribute) MCVs on both sides as worth special handling. So why not do that for multi-column MCVs, right?

Yes, that's my current understanding. There are cases where there are two or more join clauses between two tables AND the row estimation is bad AND the plan is not the best one. In such situations, I'd expect this patch to be helpful. The current case in hand is on PG11, where there is no MCV information for extended statistics, so I can't even verify manually whether the patch here would be useful. When I see such cases next time on a newer version of PG, I can verify manually whether the row estimation gets better.

>> As for the code level, I reviewed it in a top-down manner and am almost 40% complete. Here are some findings, just FYI. For efficiency, I provide each piece of feedback as an individual commit; after all, I want to make sure my comments are practical, and coding and testing is a good way to achieve that. I tried to make each of them as small as possible so that you can reject or accept them conveniently.
>>
>> 0001 is your patch, I just rebased it against the current master. 0006 is not much relevant to the current patch, and I think it can be committed individually if you are OK with that.
>>
>> Hope this kind of review is helpful.
>
> Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before then.

-- 
Best Regards
Andy Fan
Re: using extended statistics to improve join estimates
On 4/2/24 10:23, Andy Fan wrote:
>
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
>
> Thank you Justin for the rebase!
>
> Hello Tomas,
>
> Thanks for the patch! Before I review the patch at the code level, I want to explain my understanding of this patch first.

If you want to work on this patch, that'd be cool. A review would be great, but if you want to maybe take over and try moving it forward, that'd be even better. I don't know when I'll have time to work on it again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for eqjoinsel; it works by combining the MCVs on both sides to figure out the mcv_freq and then treating the rest equally, but this doesn't work for MCVs in extended statistics; this patch fills that gap. Besides that, since extended statistics mean more than one column is involved, if one or more columns are Const based on RestrictInfo, we can use such information to filter the MCVs we are interested in - that's really cool.

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of it inner joins so far; all of it works amazingly, and I am surprised this patch didn't draw enough attention. I will test more after I go through the code.

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I worked on the patch so I don't recall the details, but I think I've been thinking mostly about "star join" queries, where a big "fact" table joins to small dimensions. And in that case the fact table may have an MCV, but the dimensions certainly don't have any (because the join happens on a PK).

But maybe that's the wrong way to think about it - it was clearly useful to consider the case with (per-attribute) MCVs on both sides as worth special handling. So why not do that for multi-column MCVs, right?

> As for the code level, I reviewed it in a top-down manner and am almost 40% complete. Here are some findings, just FYI. For efficiency, I provide each piece of feedback as an individual commit; after all, I want to make sure my comments are practical, and coding and testing is a good way to achieve that. I tried to make each of them as small as possible so that you can reject or accept them conveniently.
>
> 0001 is your patch, I just rebased it against the current master. 0006 is not much relevant to the current patch, and I think it can be committed individually if you are OK with that.
>
> Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff to do in this CF. But I'll take a look after that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: using extended statistics to improve join estimates
> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I want to explain my understanding of this patch first.

Before this patch, we already use MCV information for eqjoinsel; it works by combining the MCVs on both sides to figure out the mcv_freq and then treating the rest equally, but this doesn't work for MCVs in extended statistics; this patch fills that gap. Besides that, since extended statistics mean more than one column is involved, if one or more columns are Const based on RestrictInfo, we can use such information to filter the MCVs we are interested in - that's really cool.

I did some more testing, all of it inner joins so far; all of it works amazingly, and I am surprised this patch didn't draw enough attention. I will test more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost 40% complete. Here are some findings, just FYI. For efficiency, I provide each piece of feedback as an individual commit; after all, I want to make sure my comments are practical, and coding and testing is a good way to achieve that. I tried to make each of them as small as possible so that you can reject or accept them conveniently.

0001 is your patch, I just rebased it against the current master. 0006 is not much relevant to the current patch, and I think it can be committed individually if you are OK with that.

Hope this kind of review is helpful.

-- 
Best Regards
Andy Fan

>From daa6c27bc7dd0631607f0f254cc15491633a9ccc Mon Sep 17 00:00:00 2001
From: Tomas Vondra
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 1/8] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this is similar to how we use regular statistics - we search for extended statistics (with MCV) covering all join clauses, and if we find such MCV on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if there are baserel conditions, we can use them to restrict the part of the MCVs combined. This means we're building a conditional probability distribution and calculating the conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCVs - we don't need extended MCVs on both sides. This helps when one of the tables does not have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++
 src/backend/statistics/mcv.c                  | 758 +
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /*
  * ROUTINES TO COMPUTE SELECTIVITIES
 */
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but 
additional baserel restrictions. So we disable it when + * estimating joins. + * + * XXX Not sure if this is the right way to do it, but more elaborate + * checks would mostly negate the whole point of the optimization. + * The (Var op Var) patch has the same issue. + * + * XXX An alternative might be making clause_selectivity_ext smarter + * and make it use the join extended stats there. But that seems kinda + * against the whole point of the optimization (skipping expensive + * stuff) and it's making other parts more complex. + * + * XXX Maybe this should check if there are at least some restrictions + * on some base relations, which seems important. But then again, that + * seems to go against the idea of this check to be cheap. Moreover, it + * won't work for OR clauses, which may have multiple parts but we still + * see them as
Re: using extended statistics to improve join estimates
On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote: > Rebased over 269b532ae and muted compiler warnings. And attached. >From 587a5e9fe87c26cdcd9602fc349f092da95cc580 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 13 Dec 2021 14:05:17 +0100 Subject: [PATCH] Estimate joins using extended statistics Use extended statistics (MCV) to improve join estimates. In general this is similar to how we use regular statistics - we search for extended statistics (with MCV) covering all join clauses, and if we find such MCV on both sides of the join, we combine those two MCVs. Extended statistics allow a couple additional improvements - e.g. if there are baserel conditions, we can use them to restrict the part of the MCVs combined. This means we're building conditional probability distribution and calculating conditional probability P(join clauses | baserel conditions) instead of just P(join clauses). The patch also allows combining regular and extended MCV - we don't need extended MCVs on both sides. This helps when one of the tables does not have extended statistics (e.g. because there are no correlations). 
--- src/backend/optimizer/path/clausesel.c| 63 +- src/backend/statistics/extended_stats.c | 805 ++ src/backend/statistics/mcv.c | 757 .../statistics/extended_stats_internal.h | 20 + src/include/statistics/statistics.h | 12 + src/test/regress/expected/stats_ext.out | 167 src/test/regress/sql/stats_ext.sql| 66 ++ 7 files changed, 1889 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 06f836308d0..1b2227321a2 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root, JoinType jointype, SpecialJoinInfo *sjinfo, bool use_extended_stats); +static inline bool treat_as_join_clause(PlannerInfo *root, + Node *clause, RestrictInfo *rinfo, + int varRelid, SpecialJoinInfo *sjinfo); / * ROUTINES TO COMPUTE SELECTIVITIES @@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root, RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; + bool single_clause_optimization = true; + + /* + * The optimization of skipping to clause_selectivity_ext for single + * clauses means we can't improve join estimates with a single join + * clause but additional baserel restrictions. So we disable it when + * estimating joins. + * + * XXX Not sure if this is the right way to do it, but more elaborate + * checks would mostly negate the whole point of the optimization. + * The (Var op Var) patch has the same issue. + * + * XXX An alternative might be making clause_selectivity_ext smarter + * and make it use the join extended stats there. But that seems kinda + * against the whole point of the optimization (skipping expensive + * stuff) and it's making other parts more complex. + * + * XXX Maybe this should check if there are at least some restrictions + * on some base relations, which seems important. But then again, that + * seems to go against the idea of this check to be cheap. 
Moreover, it + * won't work for OR clauses, which may have multiple parts but we still + * see them as a single BoolExpr clause (it doesn't work later, though). + */ + if (list_length(clauses) == 1) + { + Node *clause = linitial(clauses); + RestrictInfo *rinfo = NULL; + + if (IsA(clause, RestrictInfo)) + { + rinfo = (RestrictInfo *) clause; + clause = (Node *) rinfo->clause; + } + + single_clause_optimization + = !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo); + } /* * If there's exactly one clause, just go directly to * clause_selectivity_ext(). None of what we might do below is relevant. + * + * XXX This means we won't try using extended stats on OR-clauses (which + * are a single BoolExpr clause at this point), although we'll do that + * later (once we look at the arguments). */ - if (list_length(clauses) == 1) + if ((list_length(clauses) == 1) && single_clause_optimization) return clause_selectivity_ext(root, (Node *) linitial(clauses), varRelid, jointype, sjinfo, use_extended_stats); @@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root, &estimatedclauses, false); } + /* + * Try applying extended statistics to joins. There's not much we can + * do to detect when this makes sense, but we can check that there are + * join clauses, and that at least some of the rels have stats. + * + * XXX Isn't this mutually exclusive with the preceding block which + * calculates estimates for a single relation? + */ + if (use_extended_stats && + statext_try_join_estimat
Re: using extended statistics to improve join estimates
On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:
> On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> > On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > > Here's an updated patch, rebased and fixing a couple typos reported by
> > > Justin Pryzby directly.
> >
> > FWIW, cfbot reports a few compiler warnings:
>
> Also the patch doesn't apply anymore:
> http://cfbot.cputube.org/patch_36_3055.log
> === Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
> === applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
> patching file src/backend/optimizer/path/clausesel.c
> patching file src/backend/statistics/extended_stats.c
> Hunk #1 FAILED at 30.
> Hunk #2 succeeded at 102 (offset 1 line).
> Hunk #3 succeeded at 2619 (offset 9 lines).
> 1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15, or should it move to the next CF?

In case it's useful, I ran this on cirrus with my branch for code coverage.
https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage. statext_clauselist_join_selectivity() has poor coverage for the "stats2" case. In mcv.c: mcv_combine_extended() and mcv_combine_simple() have poor coverage for the "else if" cases (does it matter?)

Not related to this patch: build_attnums_array() isn't being hit. Same for statext_is_compatible_clause_internal() 1538 0 : *exprs = lappend(*exprs, clause); statext_mcv_[de]serialize() aren't being hit for cstrings.

-- Justin
Re: using extended statistics to improve join estimates
Hi, On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote: > On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote: > > Here's an updated patch, rebased and fixing a couple typos reported by > > Justin Pryzby directly. > > FWIW, cfbot reports a few compiler warnings: Also the patch doesn't apply anymore: http://cfbot.cputube.org/patch_36_3055.log === Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 === === applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch patching file src/backend/optimizer/path/clausesel.c patching file src/backend/statistics/extended_stats.c Hunk #1 FAILED at 30. Hunk #2 succeeded at 102 (offset 1 line). Hunk #3 succeeded at 2619 (offset 9 lines). 1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej
Re: using extended statistics to improve join estimates
On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote: > Here's an updated patch, rebased and fixing a couple typos reported by > Justin Pryzby directly. FWIW, cfbot reports a few compiler warnings: https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505 [18:52:15.132] time make -s -j${BUILD_JOBS} world-bin [18:52:22.697] mcv.c: In function ‘mcv_combine_simple’: [18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [18:52:22.697] 2787 |if (reverse) [18:52:22.697] | ^ [18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [18:52:22.697] 2766 | if (mcv->items[i].isnull[index]) [18:52:22.697] | ^ Greetings, Andres Freund
Re: using extended statistics to improve join estimates
Hi, Here's an updated patch, rebased and fixing a couple typos reported by Justin Pryzby directly.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

>From 15d0fa5b565d9ae8b4f333c1d54745397964110d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 13 Dec 2021 14:05:17 +0100 Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this is similar to how we use regular statistics - we search for extended statistics (with MCV) covering all join clauses, and if we find such MCV on both sides of the join, we combine those two MCVs. Extended statistics allow a couple additional improvements - e.g. if there are baserel conditions, we can use them to restrict the part of the MCVs combined. This means we're building conditional probability distribution and calculating conditional probability P(join clauses | baserel conditions) instead of just P(join clauses). The patch also allows combining regular and extended MCV - we don't need extended MCVs on both sides. This helps when one of the tables does not have extended statistics (e.g. because there are no correlations). 
--- src/backend/optimizer/path/clausesel.c| 63 +- src/backend/statistics/extended_stats.c | 805 ++ src/backend/statistics/mcv.c | 754 .../statistics/extended_stats_internal.h | 20 + src/include/statistics/statistics.h | 12 + src/test/regress/expected/stats_ext.out | 167 src/test/regress/sql/stats_ext.sql| 66 ++ 7 files changed, 1886 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index d263ecf0827..09f3d246c9d 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root, JoinType jointype, SpecialJoinInfo *sjinfo, bool use_extended_stats); +static inline bool treat_as_join_clause(PlannerInfo *root, + Node *clause, RestrictInfo *rinfo, + int varRelid, SpecialJoinInfo *sjinfo); / * ROUTINES TO COMPUTE SELECTIVITIES @@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root, RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; + bool single_clause_optimization = true; + + /* + * The optimization of skipping to clause_selectivity_ext for single + * clauses means we can't improve join estimates with a single join + * clause but additional baserel restrictions. So we disable it when + * estimating joins. + * + * XXX Not sure if this is the right way to do it, but more elaborate + * checks would mostly negate the whole point of the optimization. + * The (Var op Var) patch has the same issue. + * + * XXX An alternative might be making clause_selectivity_ext smarter + * and make it use the join extended stats there. But that seems kinda + * against the whole point of the optimization (skipping expensive + * stuff) and it's making other parts more complex. + * + * XXX Maybe this should check if there are at least some restrictions + * on some base relations, which seems important. But then again, that + * seems to go against the idea of this check to be cheap. 
Moreover, it + * won't work for OR clauses, which may have multiple parts but we still + * see them as a single BoolExpr clause (it doesn't work later, though). + */ + if (list_length(clauses) == 1) + { + Node *clause = linitial(clauses); + RestrictInfo *rinfo = NULL; + + if (IsA(clause, RestrictInfo)) + { + rinfo = (RestrictInfo *) clause; + clause = (Node *) rinfo->clause; + } + + single_clause_optimization + = !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo); + } /* * If there's exactly one clause, just go directly to * clause_selectivity_ext(). None of what we might do below is relevant. + * + * XXX This means we won't try using extended stats on OR-clauses (which + * are a single BoolExpr clause at this point), although we'll do that + * later (once we look at the arguments). */ - if (list_length(clauses) == 1) + if ((list_length(clauses) == 1) && single_clause_optimization) return clause_selectivity_ext(root, (Node *) linitial(clauses), varRelid, jointype, sjinfo, use_extended_stats); @@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root, &estimatedclauses, false); } + /* + * Try applying extended statistics to joins. There's not much we can + * do to detect when this makes sense, but we can check that there are + * join clauses, and that at least some of the rels have stats. + * + * XXX Isn't this mutually exclusive with the preceding block which + * calculates estimates for a single
Re: using extended statistics to improve join estimates
On 11/6/21 11:03, Andy Fan wrote: Hi Tomas: This is the exact patch I want, thanks for the patch! Good to hear. On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra wrote: 3) estimation by join pairs At the moment, the estimates are calculated for pairs of relations, so for example given a query explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t1.b = t3.b and t2.c = t3.c); we'll estimate the first join (t1,t2) just fine, but then the second join actually combines (t1,t2,t3). What the patch currently does is it splits it into (t1,t2) and (t2,t3) and estimates those. Actually I can't understand how this works even for a simpler example. Let's say we query like this (ONLY using t2's columns to join t3): select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d); Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when it comes to JoinRel(t1, t2, t3), we didn't maintain the MCV on the join rel, so it is hard for this to work. Here I see your solution is splitting it into (t1, t2) AND (t2, t3) and estimating those. But how does this help to estimate the size of JoinRel(t1, t2, t3)? Yeah, this is really confusing. The crucial thing to keep in mind is this works with clauses before running setrefs.c, so the clauses reference the original relations - *not* the join relation. Otherwise even the regular estimation would not work, because where would it get the per-column MCV lists etc. Let's use a simple case with join clauses referencing just a single attribute for each pair of relations. 
And let's talk about how many join pairs it'll extract:

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so 1 join pair (t1,t3)

Now a more complicated case, with a more complex join clause:

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use all the statistics.

I wonder if this should actually combine all three MCVs at once - we're pretty much combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on the joinrel for these matches. Take the above query I provided for example, and suppose the MCV data is as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below:

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And record the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) * (0.2 + 0.1 + 0.2 + 0.1).

Right, that's about the joint distribution of the whole join.

With this design, the nitems of the MCV on the joinrel would be less than on either baserel.

Actually, I think the number of items can grow, because the matches may duplicate some items. For example, in your example with (t1.a = t2.a), the first (1,2) item in t1 matches (1,2) and (1,3) in t2. And the same for (1,3) in t1. So that's 4 combinations. Of course, we could aggregate the MCV by ignoring columns not used in the query.

And since we handle eqjoin as well, we can even record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1

About when we should maintain the JoinRel's MCV data: rather than maintaining it just after the JoinRel size is estimated, we can estimate it only when it is needed. 
For example: select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d); we don't need to maintain the MCV on (t1, t2, t3) since nothing else needs it at all. However, I haven't checked the code too closely to see whether it (lazily computing the MCV on the joinrel) is convenient to do.

I'm not sure I understand what you're proposing here. However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2 relations at a time limits the amount of CPU we need. Of course, this assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics (regular, extended, ...), and also consider the part not represented by the MCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to build the joint MCV for the whole join relation may produce significant estimation errors. Furthermore, I think joins with clauses referencing more than just two relations are fairly uncommon. And we can always improve the feature in this direction in the future. But I haven't done that yet, as it requires the MCVs to be combined using the join clauses (overl
Re: using extended statistics to improve join estimates
On 11/22/21 02:23, Justin Pryzby wrote: Your regression tests include two errors, which appear to be accidental, and fixing the error shows that this case is being estimated poorly. +-- try combining with single-column (and single-expression) statistics +DROP STATISTICS join_test_2; +ERROR: statistics object "join_test_2" does not exist ... +ERROR: statistics object "join_stats_2" already exists

D'oh, what a silly mistake ... You're right that fixing the DROP STATISTICS results in a worse estimate, but that's actually expected, for a fairly simple reason. The join condition has expressions on both sides, and dropping the statistics means we don't have any MCV for the join_test_2 side. So the optimizer ends up not using the regular estimates, as if there were no extended stats. A couple lines later the script creates extended statistics on that expression alone, which fixes this. An expression index would do the trick too.

Attached is a patch fixing the test and also the issue reported by Zhihong Yu some time ago.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

>From 616f7f3faa818ea89c4c1cecb9aa50dd9e4fe8e7 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 13 Dec 2021 14:05:17 +0100 Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this is similar to how we use regular statistics - we search for extended statistics (with MCV) covering all join clauses, and if we find such MCV on both sides of the join, we combine those two MCVs. Extended statistics allow a couple additional improvements - e.g. if there are baserel conditions, we can use them to restrict the part of the MCVs combined. This means we're building conditional probability distribution and calculating conditional probability P(join clauses | baserel conditions) instead of just P(join clauses). 
The patch also allows combining regular and extended MCV - we don't need extended MCVs on both sides. This helps when one of the tables does not have extended statistics (e.g. because there are no correlations). --- src/backend/optimizer/path/clausesel.c| 63 +- src/backend/statistics/extended_stats.c | 805 ++ src/backend/statistics/mcv.c | 754 .../statistics/extended_stats_internal.h | 20 + src/include/statistics/statistics.h | 12 + src/test/regress/expected/stats_ext.out | 167 src/test/regress/sql/stats_ext.sql| 66 ++ 7 files changed, 1886 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index d263ecf082..709e92446b 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root, JoinType jointype, SpecialJoinInfo *sjinfo, bool use_extended_stats); +static inline bool treat_as_join_clause(PlannerInfo *root, + Node *clause, RestrictInfo *rinfo, + int varRelid, SpecialJoinInfo *sjinfo); / * ROUTINES TO COMPUTE SELECTIVITIES @@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root, RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; + bool single_clause_optimization = true; + + /* + * The optimization of skipping to clause_selectivity_ext for single + * clauses means we can't improve join estimates with a single join + * clause but additional baserel restrictions. So we disable it when + * estimating joins. + * + * XXX Not sure if this is the right way to do it, but more elaborate + * checks would mostly negate the whole point of the optimization. + * The (Var op Var) patch has the same issue. + * + * XXX An alternative might be making clause_selectivity_ext smarter + * and make it use the join extended stats there. But that seems kinda + * against the whole point of the optimization (skipping expensive + * stuff) and it's making other parts more complex. 
+ * + * XXX Maybe this should check if there are at least some restrictions + * on some base relations, which seems important. But then again, that + * seems to go against the idea of this check to be cheap. Moreover, it + * won't work for OR clauses, which may have multiple parts but we still + * see them as a single BoolExpr clause (it doesn't work later, though). + */ + if (list_length(clauses) == 1) + { + Node *clause = linitial(clauses); + RestrictInfo *rinfo = NULL; + + if (IsA(clause, RestrictInfo)) + { + rinfo = (RestrictInfo *) clause; + clause = (Node *) rinfo->clause; + } + + single_clause_optimization + = !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo); + } /* * If there's exactly one clause, just go directly to * clause_selec
Re: using extended statistics to improve join estimates
Your regression tests include two errors, which appear to be accidental, and fixing the error shows that this case is being estimated poorly. +-- try combining with single-column (and single-expression) statistics +DROP STATISTICS join_test_2; +ERROR: statistics object "join_test_2" does not exist ... +ERROR: statistics object "join_stats_2" already exists -- Justin
Re: using extended statistics to improve join estimates
Hi Tomas: This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra wrote:
> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
> explain analyze
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is it
> splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example. Let's say we query like this (ONLY using t2's columns to join t3):

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when it comes to JoinRel(t1, t2, t3), we didn't maintain the MCV on the join rel, so it is hard for this to work. Here I see your solution is splitting it into (t1, t2) AND (t2, t3) and estimating those. But how does this help to estimate the size of JoinRel(t1, t2, t3)?

> I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on the joinrel for these matches. Take the above query I provided for example, and suppose the MCV data is as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below:

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And record the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) * (0.2 + 0.1 + 0.2 + 0.1).

With this design, the nitems of the MCV on the joinrel would be less than on either baserel. 
And since we handle eqjoin as well, we can even record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1

About when we should maintain the JoinRel's MCV data: rather than maintaining it just after the JoinRel size is estimated, we can estimate it only when it is needed. For example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since nothing else needs it at all. However, I haven't checked the code too closely to see whether it (lazily computing the MCV on the joinrel) is convenient to do.

> But I haven't done that yet, as it requires the MCVs to be combined
> using the join clauses (overlap in a way), but I'm not sure how likely
> that is in practice. In the example it could help, but that's a bit
> artificial example.
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated, mostly copying and
> tweaking what eqjoinsel() does.

Overall, thanks for the feature, and I expect more cases to handle will come up during the discussion. To make the review process more efficient, I suggest that we split the patch into smaller ones and review/commit them separately once we have roughly finalized the design. For example:

Patch 1 -- require both sides to have extended statistics.
Patch 2 -- require one side to have extended statistics and the other side a per-column MCV.
Patch 3 -- handle cases like WHERE t1.a = t2.a and t1.b = Const;
Patch 4 -- handle 3+ table joins.
Patch 5 -- support outer joins.

I think we can do this if we are sure that each individual patch would work in some cases and would not make any other case worse. If you agree with this, I can do that splitting work during my review process.

-- Best Regards Andy Fan (https://www.aliyun.com/)
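The MCV-combination arithmetic from the example in this message can be sketched directly (hypothetical Python using the illustrative frequencies above; not the patch's C code): items matching on the join clause t1.a = t2.a AND t1.b = t2.b contribute the product of their frequencies, and the total MCV frequency on each side is tracked for estimating the part of the join not covered by either MCV.

```python
# Sketch: combine two (a, b) MCV lists for t1.a = t2.a AND t1.b = t2.b,
# using the illustrative frequencies from the example above.

t1_mcv = {(1, 2): 0.1, (1, 3): 0.2, (2, 3): 0.5, (2, 8): 0.1}
t2_mcv = {(1, 2): 0.2, (1, 3): 0.1, (2, 4): 0.2, (2, 10): 0.1}

# Items matching on both columns contribute freq1 * freq2; here only
# (1, 2) and (1, 3) survive.
joined_mcv = {k: f1 * t2_mcv[k] for k, f1 in t1_mcv.items() if k in t2_mcv}

# Total MCV frequency on each side, needed to estimate the remainder of
# the join not represented by the combined MCV.
total1 = sum(t1_mcv.values())   # 0.9
total2 = sum(t2_mcv.values())   # 0.6
```

Only two of the four items on each side match, so the combined MCV here is smaller than either input; as Tomas notes above, with a single-column join clause the matches can instead multiply and the item count can grow.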
Re: using extended statistics to improve join estimates
On 10/6/21 23:03, Zhihong Yu wrote: Hi, + conditions2 = statext_determine_join_restrictions(root, rel, mcv); + + /* if the new statistics covers more conditions, use it */ + if (list_length(conditions2) > list_length(conditions1)) + { + mcv = stat; It seems conditions2 is calculated using mcv, I wonder why mcv is replaced by stat (for conditions1 whose length is shorter) ? Yeah, that's wrong - it should be the other way around, i.e. if (list_length(conditions1) > list_length(conditions2)) There's no test with multiple candidate statistics yet, so this went unnoticed :-/ regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
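For completeness, the corrected selection logic can be sketched as follows (hypothetical Python with made-up names, not the patch's C code): among candidate extended-statistics objects, keep the one whose MCV covers the most baserel conditions, replacing the current choice only when the new candidate covers more.

```python
# Sketch: pick the statistics object covering the most baserel conditions,
# mirroring the corrected comparison direction discussed above.

def pick_mcv(candidates):
    """candidates: list of (stat_name, covered_conditions) pairs.
    Returns the name of the statistics object covering the most
    conditions, or None if there are no candidates."""
    best_stat, best_conds = None, []
    for stat, conds in candidates:
        # Replace the current choice only if the new statistics object
        # covers strictly more conditions.
        if len(conds) > len(best_conds):
            best_stat, best_conds = stat, conds
    return best_stat

best = pick_mcv([("s1", ["a = 1"]), ("s2", ["a = 1", "b < 10"])])
```

With the comparison inverted (as in the reported bug), "s1" would win despite covering fewer conditions, which is why a test with multiple candidate statistics would have caught it.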
Re: using extended statistics to improve join estimates
On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra wrote:
Re: using extended statistics to improve join estimates
Hi,

attached is an improved version of this patch, addressing some of the points mentioned in my last message:

1) Adds a couple of regression tests, testing various join cases with expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to reference just simple columns. So e.g. this can benefit from extended statistics, when defined on the expressions:

  -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
  -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

  SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics. The previous version required extended statistics MCV on both sides of the join, but adding extended statistics on both sides may be impractical (e.g. if one side does not have correlated columns it's silly to have to add it just to make this patch happy).

For example you may have extended stats on the dimension table but not the fact table, and the patch still can combine those two. Of course, if there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or when one side has an extended MCV and the other side a regular MCV. It might seem silly, but the extended MCV allows considering additional baserel conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some simple examples.
  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze, to keep the message short):

  explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
  Hash Join  (cost=31.00..106.00 rows=400 width=24)
             (actual time=5.426..22.678 rows=2 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
  Hash Join  (cost=28.50..160.75 rows=1 width=24)
             (actual time=5.325..19.760 rows=1 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a)
   where t1.c < 25 and t2.c > 25;

                            QUERY PLAN
  Hash Join  (cost=24.50..104.75 rows=4800 width=24)
             (actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

  create statistics s1 on a, b, c from t1;
  create statistics s2 on a, b, c from t2;
  analyze t1, t2;

and now the same queries again:

  explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
  Hash Join  (cost=31.00..106.00 rows=2 width=24)
             (actual time=5.448..22.713 rows=2 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
  Hash Join  (cost=28.50..160.75 rows=1 width=24)
             (actual time=5.317..16.680 rows=1 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a)
   where t1.c < 25 and t2.c > 25;

                            QUERY PLAN
  Hash Join  (cost=24.50..104.75 rows=1 width=24)
             (actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but the improvements are fairly clear I think.


limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the current patch that I can think of at the moment.
1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly because it simplifies the implementation. This means that to use the per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to estimate the clauses. But it'd make the selection of statistics much more complicated, because we have to pick "matching" statistics on both sides of the join. So it seems like overkill, and most joins have very few conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple equality clauses, combined with AND. So for example this is supported:

  t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

  t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

  t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.
Re: using extended statistics to improve join estimates
Hi,

Here's a slightly improved / cleaned up version of the PoC patch, removing a bunch of XXXs and FIXMEs, adding comments, etc. The approach is sound in principle, I think, although there's still a bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins at the moment, as it's based on eqjoinsel_inner. It's probably desirable to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base restrictions to statistics, etc. Those probably can be cached somehow, to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic, and maybe could be improved in some way. OTOH the number of candidate statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of assumptions similar to the original, but some of those assumptions may be wrong in the multi-column case, particularly when working with a subset of columns. For example (ndistinct - size(MCV)) may not be the number of distinct combinations outside the MCV when ignoring some columns. Same for nullfrac, and so on. I'm not sure we can do much more than pick some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 										   &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+								   estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05e818ba9..d4cbbee785 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,16 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
+
+static List *statext_mcv_get_conditions(PlannerInfo *root,
+										RelOptInfo *rel,
+										StatisticExtInfo *info);
+
+static bool *statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+										 StatisticExtInfo *stat, MCVList *mcv,
+										 Selectivity *sel);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -1154,6 +1165,89 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * find_matching_mcv
+ *		Search for a MCV covering all the attributes.
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * additional restrictinfos as conditions (but only as optional).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong.
+ * Maybe we could relax it a bit, and search for MCVs (on both
+ * sides of the join) with the largest overlap. But we don't really expect
+ * many candidate MCVs, so this simple approach seems sufficient.
+ */
+StatisticExtInfo *
+find_matching_mcv(PlannerInfo *root, RelOptInfo *rel, Bitmapset *attnums,
+				  List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List	   *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List	   *conditions1 = NIL,
+				   *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_co
Re: using extended statistics to improve join estimates
Hi,

+ * has_matching_mcv
+ *		Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially returned bool but later the return type was changed.

+	StatisticExtInfo *found = NULL;

"found" is normally associated with a bool return value. Maybe call the variable matching_mcv or something similar.

+			/* skip items eliminated by restrictions on rel2 */
+			if (conditions2 && !conditions2[j])
+				continue;

Maybe you can add a counter recording the number of non-skipped items for the inner loop. If this counter is 0 after the completion of one iteration, we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra wrote:
> [...]
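The early-exit counter suggested in the review could be sketched as follows. This is an illustrative Python model of the nested matching loops, not the actual C code in statext_compare_mcvs; the function and variable names (other than conditions2, which mirrors the patch snippet quoted above) are invented for the example:

```python
# Sketch of the reviewer's suggestion: count the inner items that survive
# the rel2 restrictions, and leave the outer loop as soon as one full pass
# finds none, since later outer items cannot match anything either.
# conditions2[j] is False when item j of mcv2 is eliminated by the
# restrictions on rel2 (mirroring the patch's conditions2 array).

def match_mcvs(mcv1, mcv2, conditions2):
    matches = []
    for i, (values1, f1) in enumerate(mcv1):
        surviving = 0  # the counter the review proposes
        for j, (values2, f2) in enumerate(mcv2):
            # skip items eliminated by restrictions on rel2
            if conditions2 and not conditions2[j]:
                continue
            surviving += 1
            if values1 == values2:
                matches.append((i, j, f1 * f2))
        if surviving == 0:
            break  # no inner item can ever match; stop scanning mcv1
    return matches
```

When every inner item is filtered out, the outer loop stops after its first iteration instead of scanning the whole mcv1 list for nothing.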
using extended statistics to improve join estimates
Hi,

So far the extended statistics are applied only at scan level, i.e. when estimating selectivity for individual tables. Which is great, but joins are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to improve join estimates in January. It's an interesting approach, but as I explained in that thread I think we should try a different approach, similar to how we use MCV lists without extended stats. We'll probably end up considering functional dependencies too, but probably only as a fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the same logic as eqjoinsel_inner and "match" them together. If the MCV list is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but the general idea is the same.
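The matching step just described can be sketched in Python. This is a toy model only - the MCV representation and function name are invented for the example, and the patch's actual C code operates on MCVList structures:

```python
# Toy model of eqjoinsel_inner-style matching for multi-column MCV lists.
# Each MCV is a list of (value_tuple, frequency) pairs, where value_tuple
# holds the join-column values and frequency is the fraction of rows with
# that combination. These names are illustrative, not the patch's API.

def mcv_join_selectivity(mcv1, mcv2):
    """Sum f1 * f2 over value combinations present in both MCV lists,
    i.e. the probability that a random row pair matches on MCV items."""
    freq2 = dict(mcv2)
    matchprodfreq = 0.0
    for values, f1 in mcv1:
        f2 = freq2.get(values)
        if f2 is not None:
            matchprodfreq += f1 * f2
    return matchprodfreq

# Two-column join on (a, b): only the (0, 0) combination appears on
# both sides, so only its frequencies contribute.
mcv1 = [((0, 0), 0.5), ((1, 1), 0.3), ((2, 2), 0.2)]
mcv2 = [((0, 0), 0.4), ((3, 3), 0.6)]
sel = mcv_join_selectivity(mcv1, mcv2)
```

The real eqjoinsel_inner additionally corrects for rows falling outside both MCV lists using ndistinct and nullfrac; the sketch covers only the MCV x MCV part, which is the part the patch generalizes to multiple columns.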
To demonstrate this, consider a very simple example with a table that has a lot of dependency between the columns:

==========================================================

  CREATE TABLE t (a INT, b INT, c INT, d INT);
  INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
    FROM generate_series(1,10) s(i);
  ANALYZE t;

  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

  CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
  ANALYZE t;

  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

  ALTER STATISTICS s SET STATISTICS 1;
  ANALYZE t;

  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==========================================================

The results look like this:

  - actual rows:            1
  - estimated (no stats):   1003638
  - estimated (stats, 100): 100247844
  - estimated (stats, 10k): 1

So, in this case the extended stats help quite a bit, even with the default statistics target.

However, there are other things we can do. For example, we can use restrictions (at relation level) as "conditions" to filter the MCV lists, and calculate conditional probability. This is useful even if there's just a single join condition (on one column), but there are dependencies between that column and the other filters. Or maybe when there are filters between conditions on the two sides.

Consider for example these two queries:

  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
   WHERE t1.c < 25 AND t2.d < 25;

  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
   WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters are somewhat redundant. The regular estimates will ignore that, but with MCV we can actually detect that - when we combine the two MCV lists, we essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                              Q1        Q2
  - actual rows:              2500      0
  - estimated (no stats):     62158     60241
  - estimated (stats, 100):   25047900  1
  - estimated (stats, 10k):   2500      1

Obviously, the accuracy depends on how representative the MCV list is (what fraction of the data it represents), and in this case it works fairly nicely.
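The "restrictions as conditions" idea above can also be sketched: filter each MCV list by the baserel clause before matching, and track how much frequency the filter retains. Again a simplified illustration with invented names, not the patch's implementation:

```python
# Toy model of filtering an MCV list by a baserel restriction (e.g.
# t1.c < 25) before the join-clause matching, so the combined estimate
# behaves like a conditional probability. Names are illustrative only.

def filter_mcv(mcv, condition):
    """Keep the MCV items passing the condition; also return the total
    frequency the filtered list retains."""
    kept = [(values, f) for values, f in mcv if condition(values)]
    retained = sum(f for _, f in kept)
    return kept, retained

# MCV on (a, b, c) where a = b = c, 100 equally likely combinations,
# mimicking the mod(i,100) example; apply the filter c < 25.
mcv = [((i, i, i), 0.01) for i in range(100)]
kept, retained = filter_mcv(mcv, lambda values: values[2] < 25)
```

Matching the filtered lists then yields the conditional estimate described above: redundant filters (with a = b = c = d) keep exactly the items that can still join, while contradictory ones like t1.c < 25 AND t2.d > 75 leave no overlapping items at all.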
A lot of the future work will be about handling cases when it represents only part of the data.

The attached PoC patch has a number of FIXMEs and XXXs, describing stuff I ignored to keep it simple, possible future improvements. And so on.

regards

[1] https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 										   &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+								   estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRel