Re: extended stats on partitioned tables
I've pushed the part adding stxdinherit flag to the catalog, so that on master we build statistics both with and without data from the child relations. I'm not going to push the ACL refactoring. We have similar code on various other places (not addressed by the proposed patch), and it'd make backpatching harder. So I'm not sure it'd be an improvement. In any case, I'm going to mark this as committed. Justin, if you think we should reconsider the ACL refactoring, please submit it as a separate patch. It seems mostly unrelated to the issue this thread was about. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On 1/15/22 19:35, Tomas Vondra wrote: On 1/15/22 06:11, Justin Pryzby wrote: On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote: 1) If the table is a separate relation (not part of an inheritance tree), this should make no difference. -> OK 2) If the table is using "old" inheritance, this reverts back to pre-regression behavior. So people will keep using the old statistics until the ANALYZE, and we need to tell them to ANALYZE or something. 3) If the table is using partitioning, it's guaranteed to be empty and there are no stats at all. Again, we should tell people to run ANALYZE. I think these can be mentioned in the commit message, which can end up in the minor release notes as a recommendation to rerun ANALYZE. Good point. I pushed the 0002 part and added a short paragraph suggesting ANALYZE might be necessary. I did not go into details about the individual cases, because that'd be too much for a commit message. Thanks for pushing 0001. Thanks for posting the patches! I've pushed the second part - attached are the two remaining parts. I'll wait a bit before pushing the rebased 0001 (which goes into master branch only). Not sure about 0002 - I'm not convinced the refactored ACL checks are an improvement, but I'll think about it. BTW when backpatching the first part, I had to decide what to do about tests. The 10 & 11 branches did not have the check_estimated_rows() function. I considered removing the tests, reworking the tests not to need the function, or adding the function. Ultimately I added the function, which seemed like the best option. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On 1/15/22 06:11, Justin Pryzby wrote: On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote: 1) If the table is a separate relation (not part of an inheritance tree), this should make no difference. -> OK 2) If the table is using "old" inheritance, this reverts back to pre-regression behavior. So people will keep using the old statistics until the ANALYZE, and we need to tell them to ANALYZE or something. 3) If the table is using partitioning, it's guaranteed to be empty and there are no stats at all. Again, we should tell people to run ANALYZE. I think these can be mentioned in the commit message, which can end up in the minor release notes as a recommendation to rerun ANALYZE. Good point. I pushed the 0002 part and added a short paragraph suggesting ANALYZE might be necessary. I did not go into details about the individual cases, because that'd be too much for a commit message. Thanks for pushing 0001. Thanks for posting the patches! I've pushed the second part - attached are the two remaining parts. I'll wait a bit before pushing the rebased 0001 (which goes into master branch only). Not sure about 0002 - I'm not convinced the refactored ACL checks are an improvement, but I'll think about it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom e065e54d1961a4e295ebe77febbe6a3307d142b5 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 12 Dec 2021 00:07:27 +0100 Subject: [PATCH 1/2] Add stxdinherit flag to pg_statistic_ext_data The support for extended statistics on inheritance trees was somewhat problematic, because the catalog pg_statistic_ext_data did not have a flag identifying whether the data was built with/without data for the child relations. For partitioned tables we've been able to work around this because the non-leaf relations can't contain data, so there's really just one set of statistics to store. But for regular inheritance trees we had to pick, and there is no clearly better option - we need both. This introduces the pg_statistic_ext_data.stxdinherit flag, analogous to pg_statistic.stainherit, and modifies analyze to build statistics for both cases. This relaxes the relationship between the two catalogs - until now we've created the pg_statistic_ext_data one when creating the statistics, and then only updated it later. There was always 1:1 relationship between rows in the two catalogs. With this change we no longer insert any data rows while creating statistics, because we don't know which flag value to create, and it seems wasteful to initialize both for every relation. The there may be 0, 1 or 2 data rows for each statistics definition. Patch by me, with extensive improvements and fixes by Justin Pryzby. Author: Tomas Vondra, Justin Pryzby Reviewed-by: Tomas Vondra, Justin Pryzby Discussion: https://postgr.es/m/20210923212624.GI831%40telsasoft.com --- doc/src/sgml/catalogs.sgml | 23 +++ src/backend/catalog/system_views.sql| 2 + src/backend/commands/analyze.c | 28 +--- src/backend/commands/statscmds.c| 72 +- src/backend/optimizer/util/plancat.c| 147 src/backend/statistics/dependencies.c | 22 ++- src/backend/statistics/extended_stats.c | 75 +- src/backend/statistics/mcv.c| 9 +- src/backend/statistics/mvdistinct.c | 5 +- src/backend/utils/adt/selfuncs.c| 37 + src/backend/utils/cache/syscache.c | 6 +- src/include/catalog/pg_statistic_ext_data.h | 4 +- src/include/commands/defrem.h | 1 + src/include/nodes/pathnodes.h | 1 + src/include/statistics/statistics.h | 11 +- src/test/regress/expected/rules.out | 2 + src/test/regress/expected/stats_ext.out | 31 +++-- src/test/regress/sql/stats_ext.sql | 13 +- 18 files changed, 254 insertions(+), 235 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 03e2537b07d..2aeb2ef346e 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7521,6 +7521,19 @@ SCRAM-SHA-256$iteration count: created with CREATE STATISTICS. + + Normally there is one entry, with stxdinherit = + false, for each statistics object that has been analyzed. + If the table has inheritance children, a second entry with + stxdinherit = true is also created. + This row represents the statistics object over the inheritance tree, i.e., + statistics for the data you'd see with + SELECT * FROM table*, + whereas the stxdinherit = false row + represents the results of + SELECT * FROM ONLY table. + + Like pg_statistic, pg_statistic_ext_data should not be @@ -7560,6 +7573,16 @@ SCRAM-SHA-256$iteration count: + + + stxdinherit bool + + + If true, the stats include inheritance child columns, not just the
Re: extended stats on partitioned tables
On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote: > 1) If the table is a separate relation (not part of an inheritance > tree), this should make no difference. -> OK > > 2) If the table is using "old" inheritance, this reverts back to > pre-regression behavior. So people will keep using the old statistics > until the ANALYZE, and we need to tell them to ANALYZE or something. > > 3) If the table is using partitioning, it's guaranteed to be empty and > there are no stats at all. Again, we should tell people to run ANALYZE. I think these can be mentioned in the commit message, which can end up in the minor release notes as a recommendation to rerun ANALYZE. Thanks for pushing 0001. -- Justin
Re: extended stats on partitioned tables
On 12/13/21 14:53, Justin Pryzby wrote: > On Sun, Dec 12, 2021 at 11:23:19PM +0100, Tomas Vondra wrote: >> On 12/12/21 22:32, Justin Pryzby wrote: >>> On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: The one thing bugging me a bit is that the regression test checks only a GROUP BY query. It'd be nice to add queries testing MCV/dependencies too, but that seems tricky because most queries will use per-partitions stats. >>> >>> You mean because the quals are pushed down to the scan node. >>> >>> Does that indicate a deficiency ? >>> >>> If extended stats are collected for a parent table, selectivity estimates >>> based >>> from the parent would be better; but instead we use uncorrected column >>> estimates from the child tables. >>> >>> From what I see, we could come up with a way to avoid the pushdown, >>> involving >>> volatile functions/foreign tables/RLS/window functions/SRF/wholerow >>> vars/etc. >>> But would it be better if extended stats objects on partitioned tables were >>> to >>> collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the >>> wrong solution, but maybe we should still document that extended stats on >>> (empty) parent tables are often themselves not used/useful for selectivity >>> estimates, and the user should instead (or in addition) create stats on >>> child >>> tables. >>> >>> Or, maybe if there's no extended stats on the child tables, stats on the >>> parent >>> table should be consulted ? >> >> Maybe, but that seems like a mostly separate improvement. At this point I'm >> interested only in testing the behavior implemented in the current patches. > > I don't want to change the scope of the patch, or this thread, but my point is > that the behaviour already changed once (the original regression) and now > we're > planning to change it again to fix that, so we ought to decide on the expected > behavior before writing tests to verify it. > OK. Makes sense. > I think it may be impossible to use the "dependencies" statistic with > inherited > stats. Normally the quals would be pushed down to the child tables. But, if > they weren't pushed down, they'd be attached to something other than a scan > node on the parent table, so the stats on that table wouldn't apply (right?). > Yeah, that's probably right. But I'm not 100% sure the whole inheritance tree can't be treated as a single relation by some queries ... > Maybe the useless stats types should have been prohibited on partitioned > tables > since v10. It's too late to change that, but perhaps now they shouldn't even > be collected during analyze. The dependencies and MCV paths are never called > with rte->inh==true, so maybe we should Assert(!inh), or add a comment to that > effect. Or the regression tests should "memorialize" the behavior. I'm still > thinking about it. > Yeah, we can't really prohibit them in backbranches - that'd mean some CREATE STATISTICS commands suddenly start failing, which would be quite annoying. Not building them for partitioned tables seems like a better option - BuildRelationExtStatistics can check relkind and pick what to ignore. But I'd choose to do that in a separate patch, probably - after all, it shouldn't really change the behavior of any tests/queries, no? This reminds me we need to consider if these patches could cause any issues. The way I see it is this: 1) If the table is a separate relation (not part of an inheritance tree), this should make no difference. -> OK 2) If the table is using "old" inheritance, this reverts back to pre-regression behavior. So people will keep using the old statistics until the ANALYZE, and we need to tell them to ANALYZE or something. 3) If the table is using partitioning, it's guaranteed to be empty and there are no stats at all. Again, we should tell people to run ANALYZE. Of course, we can't be sure query plans will change in the right direction :-( regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On 12/13/21 14:48, Justin Pryzby wrote: > On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote: >>> In your 0003 patch, the "if inh: break" isn't removed from >>> examine_variable(), >>> but the corresponding thing is removed everywhere else. >> >> Ah, you're right. And it wasn't updated in the 0002 patch either - it >> should do the relkind check too, to allow partitioned tables. Fixed. > > I think you fixed it in 0002 (thanks) but still wasn't removed from 0003? > D'oh! Those repeated rebase conflicts got me quite confused. > In these comments: > +* When dealing with regular inheritance trees, ignore extended stats > +* (which were built without data from child rels, and thus do not > +* represent them). For partitioned tables data from partitions are > +* in the stats (and there's no data in the non-leaf relations), so > +* in this case we do consider extended stats. > > I suggest to add a comment after "For partitioned tables". > I've reworded the comment a bit, hopefully it's a bit clearer now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 0601bac64c0d7a1f1b676f0433a05f2f99d7bfb1 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 25 Sep 2021 19:42:41 -0500 Subject: [PATCH 1/4] Ignore extended statistics for inheritance trees Since commit 859b3003de we only build extended statistics for relations without the child relations, so that we update the catalog row only once. However, the non-inherited statistics were still used for planning of queries on inheritance trees, which may produce bogus estimates. This is roughly the same issue 427c6b5b9 addressed ~15 years ago, and we fix it the same way - by ignoring extended statistics when calculating estimates for the inheritance tree as a whole. We still consider extended statistics when calculating estimates for individual child relations, of course. Report and patch by Justin Pryzby, minor fixes and cleanup by me. Backpatch all the way back to PostgreSQL 10, where extended statistics were introduced (same as 859b3003de). Author: Justin Pryzby Reported-by: Justin Pryzby Backpatch-through: 10 Discussion: https://postgr.es/m/20210923212624.GI831%40telsasoft.com --- src/backend/statistics/dependencies.c | 9 + src/backend/statistics/extended_stats.c | 9 + src/backend/utils/adt/selfuncs.c| 17 + src/test/regress/expected/stats_ext.out | 24 src/test/regress/sql/stats_ext.sql | 15 +++ 5 files changed, 74 insertions(+) diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index 8bf80db8e4..a41bace75a 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -24,6 +24,7 @@ #include "nodes/pathnodes.h" #include "optimizer/clauses.h" #include "optimizer/optimizer.h" +#include "parser/parsetree.h" #include "statistics/extended_stats_internal.h" #include "statistics/statistics.h" #include "utils/bytea.h" @@ -1410,11 +1411,19 @@ dependencies_clauselist_selectivity(PlannerInfo *root, int ndependencies; int i; AttrNumber attnum_offset; + RangeTblEntry *rte = planner_rt_fetch(rel->relid, root); /* unique expressions */ Node **unique_exprs; int unique_exprs_cnt; + /* + * When dealing with inheritance trees, ignore extended stats (which were + * built without data from child rels, and thus do not represent them). + */ + if (rte->inh) + return 1.0; + /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES)) return 1.0; diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 69ca52094f..5a3ba2f1c5 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -30,6 +30,7 @@ #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/optimizer.h" +#include "parser/parsetree.h" #include "pgstat.h" #include "postmaster/autovacuum.h" #include "statistics/extended_stats_internal.h" @@ -1694,6 +1695,14 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli List **list_exprs; /* expressions matched to any statistic */ int listidx; Selectivity sel = (is_or) ? 0.0 : 1.0; + RangeTblEntry *rte = planner_rt_fetch(rel->relid, root); + + /* + * When dealing with inheritance trees, ignore extended stats (which were + * built without data from child rels, and thus do not represent them). + */ + if (rte->inh) + return sel; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..1010d5caa8 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3913,6
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 11:23:19PM +0100, Tomas Vondra wrote: > On 12/12/21 22:32, Justin Pryzby wrote: > > On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: > > > The one thing bugging me a bit is that the regression test checks only a > > > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > > > too, but that seems tricky because most queries will use per-partitions > > > stats. > > > > You mean because the quals are pushed down to the scan node. > > > > Does that indicate a deficiency ? > > > > If extended stats are collected for a parent table, selectivity estimates > > based > > from the parent would be better; but instead we use uncorrected column > > estimates from the child tables. > > > > From what I see, we could come up with a way to avoid the pushdown, > > involving > > volatile functions/foreign tables/RLS/window functions/SRF/wholerow > > vars/etc. > > But would it be better if extended stats objects on partitioned tables were > > to > > collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the > > wrong solution, but maybe we should still document that extended stats on > > (empty) parent tables are often themselves not used/useful for selectivity > > estimates, and the user should instead (or in addition) create stats on > > child > > tables. > > > > Or, maybe if there's no extended stats on the child tables, stats on the > > parent > > table should be consulted ? > > Maybe, but that seems like a mostly separate improvement. At this point I'm > interested only in testing the behavior implemented in the current patches. I don't want to change the scope of the patch, or this thread, but my point is that the behaviour already changed once (the original regression) and now we're planning to change it again to fix that, so we ought to decide on the expected behavior before writing tests to verify it. I think it may be impossible to use the "dependencies" statistic with inherited stats. Normally the quals would be pushed down to the child tables. But, if they weren't pushed down, they'd be attached to something other than a scan node on the parent table, so the stats on that table wouldn't apply (right?). Maybe the useless stats types should have been prohibited on partitioned tables since v10. It's too late to change that, but perhaps now they shouldn't even be collected during analyze. The dependencies and MCV paths are never called with rte->inh==true, so maybe we should Assert(!inh), or add a comment to that effect. Or the regression tests should "memorialize" the behavior. I'm still thinking about it. -- Justin
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote: > > In your 0003 patch, the "if inh: break" isn't removed from > > examine_variable(), > > but the corresponding thing is removed everywhere else. > > Ah, you're right. And it wasn't updated in the 0002 patch either - it > should do the relkind check too, to allow partitioned tables. Fixed. I think you fixed it in 0002 (thanks) but still wasn't removed from 0003? In these comments: +* When dealing with regular inheritance trees, ignore extended stats +* (which were built without data from child rels, and thus do not +* represent them). For partitioned tables data from partitions are +* in the stats (and there's no data in the non-leaf relations), so +* in this case we do consider extended stats. I suggest to add a comment after "For partitioned tables". -- Justin
Re: extended stats on partitioned tables
On 12/12/21 22:32, Justin Pryzby wrote: On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: The one thing bugging me a bit is that the regression test checks only a GROUP BY query. It'd be nice to add queries testing MCV/dependencies too, but that seems tricky because most queries will use per-partitions stats. You mean because the quals are pushed down to the scan node. Does that indicate a deficiency ? If extended stats are collected for a parent table, selectivity estimates based from the parent would be better; but instead we use uncorrected column estimates from the child tables. From what I see, we could come up with a way to avoid the pushdown, involving volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc. > But would it be better if extended stats objects on partitioned tables were to collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the wrong solution, but maybe we should still document that extended stats on (empty) parent tables are often themselves not used/useful for selectivity estimates, and the user should instead (or in addition) create stats on child tables. Or, maybe if there's no extended stats on the child tables, stats on the parent table should be consulted ? Maybe, but that seems like a mostly separate improvement. At this point I'm interested only in testing the behavior implemented in the current patches. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote: > On 12/12/21 18:52, Justin Pryzby wrote: > That's mostly a conscious choice, so that I don't have to include > parsetree.h. But maybe that'd be better ... > > > The regression tests changed as a result of not populating stx_data; I think > > it's may be better to update like this: > > > > SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL > > FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d > > ON d.stxoid = s.oid > > WHERE s.stxname = 'ab1_a_b_stats'; > > > > Not sure I understand. Why would this be better than inner join? It shows that there's an entry in pg_statistic_ext and not one in ext_data, rather than that it's not in at least one of the catalogs. Which is nice to show since as you say it's no longer 1:1. > Thanks, fixed. Can you read through the commit messages and check the > attribution is correct for all the patches? Seems fine. -- Justin
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. You mean because the quals are pushed down to the scan node. Does that indicate a deficiency ? If extended stats are collected for a parent table, selectivity estimates based from the parent would be better; but instead we use uncorrected column estimates from the child tables. >From what I see, we could come up with a way to avoid the pushdown, involving volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc. But would it be better if extended stats objects on partitioned tables were to collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the wrong solution, but maybe we should still document that extended stats on (empty) parent tables are often themselves not used/useful for selectivity estimates, and the user should instead (or in addition) create stats on child tables. Or, maybe if there's no extended stats on the child tables, stats on the parent table should be consulted ? -- Justin
Re: extended stats on partitioned tables
On 12/12/21 18:52, Justin Pryzby wrote: > + * XXX Can't we simply look at rte->inh? > + */ > + inh = root->append_rel_array == NULL ? false : > + root->append_rel_array[onerel->relid]->parent_relid != 0; > > I think so. That's what I came up with while trying to figured this out, and > it's no great surprise that it needed to be cleaned up - thanks. > OK, fixed. > In your 0003 patch, the "if inh: break" isn't removed from examine_variable(), > but the corresponding thing is removed everywhere else. > Ah, you're right. And it wasn't updated in the 0002 patch either - it should do the relkind check too, to allow partitioned tables. Fixed. > In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than > rt_fetch. > That's mostly a conscious choice, so that I don't have to include parsetree.h. But maybe that'd be better ... > The regression tests changed as a result of not populating stx_data; I think > it's may be better to update like this: > > SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL > FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d > ON d.stxoid = s.oid > WHERE s.stxname = 'ab1_a_b_stats'; > Not sure I understand. Why would this be better than inner join? > There's this part about documentation for the changes in backbranches: > > On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: >> Also, I think in backbranches we should document what's being stored in >> pg_statistic_ext, since it's pretty unintuitive: >> - noninherted stats (FROM ONLY) for inheritence parents; >> - inherted stats (FROM *) for partitioned tables; > > spellcheck: inheritence should be inheritance. > Thanks, fixed. Can you read through the commit messages and check the attribution is correct for all the patches? > All for now. I'm going to update the regression tests for dependencies and > the > other code paths. > Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 77bd34660d15d6ba03538c6fe45a196a162c196c Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 12 Dec 2021 02:28:41 +0100 Subject: [PATCH 4/4] Refactor parent ACL check selfuncs.c is 8k lines long, and this makes it 30 LOC shorter. --- src/backend/utils/adt/selfuncs.c | 140 --- 1 file changed, 52 insertions(+), 88 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index bf4525392d..e4be9fc95d 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -187,6 +187,8 @@ static char *convert_string_datum(Datum value, Oid typid, Oid collid, bool *failure); static double convert_timevalue_to_scalar(Datum value, Oid typid, bool *failure); +static void recheck_parent_acl(PlannerInfo *root, VariableStatData *vardata, +Oid relid); static void examine_simple_variable(PlannerInfo *root, Var *var, VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, @@ -5153,51 +5155,7 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, (pg_class_aclcheck(rte->relid, userid, ACL_SELECT) == ACLCHECK_OK); -/* - * If the user doesn't have permissions to - * access an inheritance child relation, check - * the permissions of the table actually - * mentioned in the query, since most likely - * the user does have that permission. Note - * that whole-table select privilege on the - * parent doesn't quite guarantee that the - * user could read all columns of the child. - * But in practice it's unlikely that any - * interesting security violation could result - * from allowing access to the expression - * index's stats, so we allow it anyway. See - * similar code in examine_simple_variable() - * for additional comments. - */ -if (!vardata->acl_ok && - root->append_rel_array != NULL) -{ - AppendRelInfo *appinfo; - Index varno = index->rel->relid; - - appinfo = root->append_rel_array[varno]; - while (appinfo && - planner_rt_fetch(appinfo->parent_relid, - root)->rtekind == RTE_RELATION) - { - varno = appinfo->parent_relid; - appinfo = root->append_rel_array[varno]; - } - if (varno != index->rel->relid) - { - /* Repeat access check on this rel */ - rte = planner_rt_fetch(varno, root); - Assert(rte->rtekind == RTE_RELATION); - - userid = rte->checkAsUser ? rte->checkAsUser : GetUserId(); - - vardata->acl_ok = - rte->securityQuals == NIL && - (pg_class_aclcheck(rte->relid, - userid, - ACL_SELECT) == ACLCHECK_OK); - } -
Re: extended stats on partitioned tables
Tomas Vondra writes: > On 12/12/21 16:37, Zhihong Yu wrote: >> Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking >> inh, I think it would be better if the above style of checking is used >> throughout the patch (without introducing rte variable). > It's mostly a matter of personal taste, but I always found this style of > condition (i.e. dereferencing a pointer returned by a function) much > less readable. It's hard to parse what exactly is happening, what struct > type are we dealing with, etc. YMMV but the separate variable makes it > much clearer for me. And I'd expect the compilers to produce pretty much > the same code too for those cases. FWIW, I agree. Also, it's possible that future patches would create a need to touch the RTE again nearby, in which case having the variable makes it easier to write non-crummy code for that. regards, tom lane
Re: extended stats on partitioned tables
+ * XXX Can't we simply look at rte->inh? + */ + inh = root->append_rel_array == NULL ? false : + root->append_rel_array[onerel->relid]->parent_relid != 0; I think so. That's what I came up with while trying to figured this out, and it's no great surprise that it needed to be cleaned up - thanks. In your 0003 patch, the "if inh: break" isn't removed from examine_variable(), but the corresponding thing is removed everywhere else. In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than rt_fetch. The regression tests changed as a result of not populating stx_data; I think it's may be better to update like this: SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d ON d.stxoid = s.oid WHERE s.stxname = 'ab1_a_b_stats'; There's this part about documentation for the changes in backbranches: On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > Also, I think in backbranches we should document what's being stored in > pg_statistic_ext, since it's pretty unintuitive: > - noninherted stats (FROM ONLY) for inheritence parents; > - inherted stats (FROM *) for partitioned tables; spellcheck: inheritence should be inheritance. All for now. I'm going to update the regression tests for dependencies and the other code paths. -- Justin
Re: extended stats on partitioned tables
On 12/12/21 16:37, Zhihong Yu wrote: Hi, For patch 1, minor comment: + if (planner_rt_fetch(onerel->relid, root)->inh) Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh, I think it would be better if the above style of checking is used throughout the patch (without introducing rte variable). It's mostly a matter of personal taste, but I always found this style of condition (i.e. dereferencing a pointer returned by a function) much less readable. It's hard to parse what exactly is happening, what struct type are we dealing with, etc. YMMV but the separate variable makes it much clearer for me. And I'd expect the compilers to produce pretty much the same code too for those cases. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On 12/12/21 14:47, Zhihong Yu wrote: On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra mailto:tomas.von...@enterprisedb.com>> wrote: > ... > + /* skip statistics with mismatching stxdinherit value */ > + if (stat->inherit != rte->inh) > > Should a log be added for the above case ? > Why should we log this? It's an entirely expected case - there's a mismatch between inheritance for the relation and statistics, simply skipping it is the right thing to do. Hi, I agree that skipping should be fine (to avoid too much logging). I'm not sure it's related to the amount of logging, really. It'd be just noise without any practical use, even for debugging purposes. If you have an inheritance tree, it'll automatically have one set of statistics for inh=true and one for inh=false. And this condition will always skip one of those, depending on what query is being estimated. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra wrote: > Hi, > > Attached is a rebased and cleaned-up version of these patches, with more > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > 0001 - Ignore extended statistics for inheritance trees > > 0002 - Build inherited extended stats on partitioned tables > > Those are mostly just Justin's patches, with more detailed comments and > updated commit message. I've considered moving the rel->inh check to > statext_clauselist_selectivity, and then removing the check from > dependencies and MCV. But I decided no to do that, because someone might > be calling those functions directly (even if that's very unlikely). > > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > This is the patch for master, allowing to build stats for both inherits > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > how we deal with iterating both flags etc. I've adopted most of the > Justin's fixup patches, except that in plancat.c I've refactored how we > load the stats to process keys/expressions just once. > > It has the same issue with regression test using just a GROUP BY query, > but if we add a test to 0001/0002, that'll fix this too. > > > 0004 - Refactor parent ACL check > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > worth it. Maybe it is, but then maybe we should try cleaning up the > other ACL checks in this file too? Seems mostly orthogonal to this > thread, though. > > Hi, For patch 1, minor comment: + if (planner_rt_fetch(onerel->relid, root)->inh) Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh, I think it would be better if the above style of checking is used throughout the patch (without introducing rte variable). Cheers > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra wrote: > > > On 12/12/21 05:38, Zhihong Yu wrote: > > > > > > On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra > > mailto:tomas.von...@enterprisedb.com>> > > wrote: > > > > Hi, > > > > Attached is a rebased and cleaned-up version of these patches, with > more > > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > > > > 0001 - Ignore extended statistics for inheritance trees > > > > 0002 - Build inherited extended stats on partitioned tables > > > > Those are mostly just Justin's patches, with more detailed comments > and > > updated commit message. I've considered moving the rel->inh check to > > statext_clauselist_selectivity, and then removing the check from > > dependencies and MCV. But I decided no to do that, because someone > might > > be calling those functions directly (even if that's very unlikely). > > > > The one thing bugging me a bit is that the regression test checks > only a > > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > > too, but that seems tricky because most queries will use > per-partitions > > stats. > > > > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > > > This is the patch for master, allowing to build stats for both > inherits > > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > > how we deal with iterating both flags etc. I've adopted most of the > > Justin's fixup patches, except that in plancat.c I've refactored how > we > > load the stats to process keys/expressions just once. > > > > It has the same issue with regression test using just a GROUP BY > query, > > but if we add a test to 0001/0002, that'll fix this too. > > > > > > 0004 - Refactor parent ACL check > > > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > > worth it. Maybe it is, but then maybe we should try cleaning up the > > other ACL checks in this file too? Seems mostly orthogonal to this > > thread, though. > > > > > > Hi, > > For patch 3, in commit message: > > > > and there no clear winner. -> and there is no clear winner. > > > > and it seem wasteful -> and it seems wasteful > > > > The there may be -> There may be > > > > Thanks, will fix. > > > + /* skip statistics with mismatching stxdinherit value */ > > + if (stat->inherit != rte->inh) > > > > Should a log be added for the above case ? > > > > Why should we log this? It's an entirely expected case - there's a > mismatch between inheritance for the relation and statistics, simply > skipping it is the right thing to do. > Hi, I agree that skipping should be fine (to avoid too much logging). Thanks > > +* Determine if we'redealing with inheritance tree. > > > > There should be a space between re and dealing. > > > > Thanks, will fix. > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: extended stats on partitioned tables
On 12/12/21 05:38, Zhihong Yu wrote: > > > On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > > Hi, > > Attached is a rebased and cleaned-up version of these patches, with more > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > 0001 - Ignore extended statistics for inheritance trees > > 0002 - Build inherited extended stats on partitioned tables > > Those are mostly just Justin's patches, with more detailed comments and > updated commit message. I've considered moving the rel->inh check to > statext_clauselist_selectivity, and then removing the check from > dependencies and MCV. But I decided no to do that, because someone might > be calling those functions directly (even if that's very unlikely). > > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > This is the patch for master, allowing to build stats for both inherits > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > how we deal with iterating both flags etc. I've adopted most of the > Justin's fixup patches, except that in plancat.c I've refactored how we > load the stats to process keys/expressions just once. > > It has the same issue with regression test using just a GROUP BY query, > but if we add a test to 0001/0002, that'll fix this too. > > > 0004 - Refactor parent ACL check > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > worth it. Maybe it is, but then maybe we should try cleaning up the > other ACL checks in this file too? Seems mostly orthogonal to this > thread, though. > > > Hi, > For patch 3, in commit message: > > and there no clear winner. -> and there is no clear winner. > > and it seem wasteful -> and it seems wasteful > > The there may be -> There may be > Thanks, will fix. > + /* skip statistics with mismatching stxdinherit value */ > + if (stat->inherit != rte->inh) > > Should a log be added for the above case ? > Why should we log this? It's an entirely expected case - there's a mismatch between inheritance for the relation and statistics, simply skipping it is the right thing to do. > + * Determine if we'redealing with inheritance tree. > > There should be a space between re and dealing. > Thanks, will fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra wrote: > Hi, > > Attached is a rebased and cleaned-up version of these patches, with more > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > 0001 - Ignore extended statistics for inheritance trees > > 0002 - Build inherited extended stats on partitioned tables > > Those are mostly just Justin's patches, with more detailed comments and > updated commit message. I've considered moving the rel->inh check to > statext_clauselist_selectivity, and then removing the check from > dependencies and MCV. But I decided no to do that, because someone might > be calling those functions directly (even if that's very unlikely). > > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > This is the patch for master, allowing to build stats for both inherits > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > how we deal with iterating both flags etc. I've adopted most of the > Justin's fixup patches, except that in plancat.c I've refactored how we > load the stats to process keys/expressions just once. > > It has the same issue with regression test using just a GROUP BY query, > but if we add a test to 0001/0002, that'll fix this too. > > > 0004 - Refactor parent ACL check > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > worth it. Maybe it is, but then maybe we should try cleaning up the > other ACL checks in this file too? Seems mostly orthogonal to this > thread, though. > > > Hi, For patch 3, in commit message: and there no clear winner. -> and there is no clear winner. and it seem wasteful -> and it seems wasteful The there may be -> There may be + /* skip statistics with mismatching stxdinherit value */ + if (stat->inherit != rte->inh) Should a log be added for the above case ? +* Determine if we'redealing with inheritance tree. There should be a space between re and dealing. Cheers regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
Hi, Attached is a rebased and cleaned-up version of these patches, with more comments, refactorings etc. Justin and Zhihong, can you take a look? 0001 - Ignore extended statistics for inheritance trees 0002 - Build inherited extended stats on partitioned tables Those are mostly just Justin's patches, with more detailed comments and updated commit message. I've considered moving the rel->inh check to statext_clauselist_selectivity, and then removing the check from dependencies and MCV. But I decided no to do that, because someone might be calling those functions directly (even if that's very unlikely). The one thing bugging me a bit is that the regression test checks only a GROUP BY query. It'd be nice to add queries testing MCV/dependencies too, but that seems tricky because most queries will use per-partitions stats. 0003 - Add stxdinherit flag to pg_statistic_ext_data This is the patch for master, allowing to build stats for both inherits flag values. It adds the flag to pg_stats_ext_exprs view to, reworked how we deal with iterating both flags etc. I've adopted most of the Justin's fixup patches, except that in plancat.c I've refactored how we load the stats to process keys/expressions just once. It has the same issue with regression test using just a GROUP BY query, but if we add a test to 0001/0002, that'll fix this too. 0004 - Refactor parent ACL check Not sure about this - I doubt saving 30 rows in an 8kB file is really worth it. Maybe it is, but then maybe we should try cleaning up the other ACL checks in this file too? Seems mostly orthogonal to this thread, though. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 569099c5d14535a95cc21798cb041cae7eee0e36 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 12 Dec 2021 02:28:41 +0100 Subject: [PATCH 4/4] Refactor parent ACL check selfuncs.c is 8k lines long, and this makes it 30 LOC shorter. --- src/backend/utils/adt/selfuncs.c | 140 --- 1 file changed, 52 insertions(+), 88 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 886d85b7f3..9f656316bf 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -187,6 +187,8 @@ static char *convert_string_datum(Datum value, Oid typid, Oid collid, bool *failure); static double convert_timevalue_to_scalar(Datum value, Oid typid, bool *failure); +static void recheck_parent_acl(PlannerInfo *root, VariableStatData *vardata, +Oid relid); static void examine_simple_variable(PlannerInfo *root, Var *var, VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, @@ -5153,51 +5155,7 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, (pg_class_aclcheck(rte->relid, userid, ACL_SELECT) == ACLCHECK_OK); -/* - * If the user doesn't have permissions to - * access an inheritance child relation, check - * the permissions of the table actually - * mentioned in the query, since most likely - * the user does have that permission. Note - * that whole-table select privilege on the - * parent doesn't quite guarantee that the - * user could read all columns of the child. - * But in practice it's unlikely that any - * interesting security violation could result - * from allowing access to the expression - * index's stats, so we allow it anyway. See - * similar code in examine_simple_variable() - * for additional comments. - */ -if (!vardata->acl_ok && - root->append_rel_array != NULL) -{ - AppendRelInfo *appinfo; - Index varno = index->rel->relid; - - appinfo = root->append_rel_array[varno]; - while (appinfo && - planner_rt_fetch(appinfo->parent_relid, - root)->rtekind == RTE_RELATION) - { - varno = appinfo->parent_relid; - appinfo = root->append_rel_array[varno]; - } - if (varno != index->rel->relid) - { - /* Repeat access check on this rel */ - rte = planner_rt_fetch(varno, root); - Assert(rte->rtekind == RTE_RELATION); - - userid = rte->checkAsUser ? rte->checkAsUser : GetUserId(); - - vardata->acl_ok = - rte->securityQuals == NIL && - (pg_class_aclcheck(rte->relid, - userid, - ACL_SELECT) == ACLCHECK_OK); - } -} +recheck_parent_acl(root, vardata, index->rel->relid); } else { @@ -5302,49 +5260,7 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, (pg_class_aclcheck(rte->relid, userid, ACL_SELECT) == ACLCHECK_OK); - /* - * If the user doesn't have permissions to access
Re: extended stats on partitioned tables
On Thu, Dec 2, 2021 at 9:24 PM Justin Pryzby wrote: > On Thu, Nov 04, 2021 at 12:44:45AM +0100, Tomas Vondra wrote: > > >> And I'm not sure we do the right thing after removing children, for > example > > >> (that should drop the inheritance stats, I guess). > > > > Do you mean for inheritance only ? Or partitions too ? > > > I think for partitions, the stats should stay. > > > And for inheritence, they can stay, for consistency with partitions, > and since > > > it does no harm. > > > > I think the behavior should be the same as for data in pg_statistic, > > i.e. if we keep/remove those, we should do the same thing for extended > > statistics. > > That works for column stats the way I proposed for extended stats: child > stats > are never removed, neither when the only child is dropped, nor when > re-running > analyze (that part is actually a bit odd). > > Rebased, fixing an intermediate compile error, and typos in the commit > message. > > -- > Justin > Hi, + if (!HeapTupleIsValid(tup)) /* should not happen */ + // elog(ERROR, "cache lookup failed for statistics data %u", statsOid); You may want to remove commented out code. + for (i = 0; i < staForm->stxkeys.dim1; i++) + keys = bms_add_member(keys, staForm->stxkeys.values[i]); Since the above code is in a loop now, should keys be cleared across the outer loop iterations ? Cheers
Re: extended stats on partitioned tables
On Thu, Nov 04, 2021 at 12:44:45AM +0100, Tomas Vondra wrote: > >> And I'm not sure we do the right thing after removing children, for example > >> (that should drop the inheritance stats, I guess). > > Do you mean for inheritance only ? Or partitions too ? > > I think for partitions, the stats should stay. > > And for inheritence, they can stay, for consistency with partitions, and > > since > > it does no harm. > > I think the behavior should be the same as for data in pg_statistic, > i.e. if we keep/remove those, we should do the same thing for extended > statistics. That works for column stats the way I proposed for extended stats: child stats are never removed, neither when the only child is dropped, nor when re-running analyze (that part is actually a bit odd). Rebased, fixing an intermediate compile error, and typos in the commit message. -- Justin >From 7d752af4bbbded86f3c9d4b3571bdb4b8bdf5ef2 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 25 Sep 2021 19:42:41 -0500 Subject: [PATCH 01/15] Do not use extended statistics on inheritance trees.. Since 859b3003de, inherited ext stats are not built. However, the non-inherited stats stats were incorrectly used during planning of queries with inheritance heirarchies. Since the ext stats do not include child tables, they can lead to worse estimates. This is remarkably similar to 427c6b5b9, which affected column statistics 15 years ago. choose_best_statistics is handled a bit differently (in the calling function), because it isn't passed rel nor rel->inh, and it's an exported function, so avoid changing its signature in back branches. https://www.postgresql.org/message-id/flat/20210925223152.ga7...@telsasoft.com Backpatch to v10 --- src/backend/statistics/dependencies.c | 5 + src/backend/statistics/extended_stats.c | 5 + src/backend/utils/adt/selfuncs.c| 9 + src/test/regress/expected/stats_ext.out | 24 src/test/regress/sql/stats_ext.sql | 15 +++ 5 files changed, 58 insertions(+) diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index 8bf80db8e4..015b9e28f6 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -1410,11 +1410,16 @@ dependencies_clauselist_selectivity(PlannerInfo *root, int ndependencies; int i; AttrNumber attnum_offset; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; /* unique expressions */ Node **unique_exprs; int unique_exprs_cnt; + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return 1.0; + /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES)) return 1.0; diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 69ca52094f..b9e755f44e 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1694,6 +1694,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli List **list_exprs; /* expressions matched to any statistic */ int listidx; Selectivity sel = (is_or) ? 0.0 : 1.0; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return sel; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..a0932e39e1 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3913,6 +3913,11 @@ estimate_multivariate_ndistinct(PlannerInfo *root, RelOptInfo *rel, Oid statOid = InvalidOid; MVNDistinct *stats; StatisticExtInfo *matched_info = NULL; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return false; /* bail out immediately if the table has no extended statistics */ if (!rel->statlist) @@ -5232,6 +5237,10 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, if (vardata->statsTuple) break; + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (planner_rt_fetch(onerel->relid, root)->inh) +break; + /* skip stats without per-expression stats */ if (info->kind != STATS_EXT_EXPRESSIONS) continue; diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index c60ba45aba..5410f58f7f 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -176,6 +176,30 @@ CREATE STATISTICS ab1_a_b_stats ON a, b FROM ab1; ANALYZE ab1; DROP TABLE ab1 CASCADE; NOTICE: drop cascades to table ab1c
Re: extended stats on partitioned tables
On Thu, Nov 04, 2021 at 12:44:45AM +0100, Tomas Vondra wrote: > > I probably did this to make the code change small, to avoid indentin the > > whole > > block. > > But indenting the block is not necessary. It's possible to do something > like this: > > if (!rel->inh) > return 1.0; > > or whatever is the "default" result for that function. You're right. I did like that, Except in examine_variable, which already does it with "break". > > Maybe the for inh<=1 loop should instead be two calls to new functions > > factored > > out of get_relation_statistics() and RemoveStatisticsById(), which take > > "bool > > inh". I did like that in a separate patch for now. And I avoided making a !inh tuple for partitioned tables, since they're never populated. > >> And I'm not sure we do the right thing after removing children, for example > >> (that should drop the inheritance stats, I guess). > > > > Do you mean for inheritance only ? Or partitions too ? > > I think for partitions, the stats should stay. > > And for inheritence, they can stay, for consistency with partitions, and > > since > > it does no harm. > > > > I think the behavior should be the same as for data in pg_statistic, > i.e. if we keep/remove those, we should do the same thing for extended > statistics. This works for column stats the way I proposed for extended stats: child stats are never removed, neither when the only child is dropped, nor when re-running ANALYZE (actually, that part is odd). I can stop sending patches if it makes it hard to reconcile, but I wanted to put it "on paper" to see/show what the patch series would look like, for v15 and back branches. -- Justin >From 907046c93d8f20e8edcc4fb0334feed86d194b81 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 25 Sep 2021 19:42:41 -0500 Subject: [PATCH v3 1/6] Do not use extended statistics on inheritence trees.. Since 859b3003de, inherited ext stats are not built. However, the non-inherited stats stats were incorrectly used during planning of queries with inheritence heirarchies. Since the ext stats do not include child tables, they can lead to worse estimates. This is remarkably similar to 427c6b5b9, which affected column statistics 15 years ago. choose_best_statistics is handled a bit differently (in the calling function), because it isn't passed rel nor rel->inh, and it's an exported function, so avoid changing its signature in back branches. https://www.postgresql.org/message-id/flat/20210925223152.ga7...@telsasoft.com Backpatch to v10 --- src/backend/statistics/dependencies.c | 5 + src/backend/statistics/extended_stats.c | 5 + src/backend/utils/adt/selfuncs.c| 9 + src/test/regress/expected/stats_ext.out | 23 +++ src/test/regress/sql/stats_ext.sql | 14 ++ 5 files changed, 56 insertions(+) diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index 8bf80db8e4..015b9e28f6 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -1410,11 +1410,16 @@ dependencies_clauselist_selectivity(PlannerInfo *root, int ndependencies; int i; AttrNumber attnum_offset; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; /* unique expressions */ Node **unique_exprs; int unique_exprs_cnt; + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return 1.0; + /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES)) return 1.0; diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 69ca52094f..b9e755f44e 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1694,6 +1694,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli List **list_exprs; /* expressions matched to any statistic */ int listidx; Selectivity sel = (is_or) ? 0.0 : 1.0; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return sel; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..a0932e39e1 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3913,6 +3913,11 @@ estimate_multivariate_ndistinct(PlannerInfo *root, RelOptInfo *rel, Oid statOid = InvalidOid; MVNDistinct *stats; StatisticExtInfo *matched_info = NULL; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return false; /* bail out immediately if the table has
Re: extended stats on partitioned tables
On 11/4/21 12:19 AM, Justin Pryzby wrote: > On Wed, Nov 03, 2021 at 11:48:44PM +0100, Tomas Vondra wrote: >> On 10/8/21 12:45 AM, Justin Pryzby wrote: >>> On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote: On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: >> It seems like your patch should also check "inh" in examine_variable and >> statext_expressions_load. > > I tried adding that - I mostly kept my patches separate. > Hopefully this is more helpful than a complication. > I added at: https://commitfest.postgresql.org/35/3332/ > Actually, this is confusing. Which patch is the one we should be reviewing? >>> >>> It is confusing, but not as much as I first thought. Please check the >>> commit >>> messages. >>> >>> The first two patches are meant to be applied to master *and* backpatched. >>> The >>> first one intends to fixes the bug that non-inherited stats are being used >>> for >>> queries of inheritance trees. The 2nd one fixes the regression that stats >>> are >>> not collected for inheritence trees of partitioned tables (which is the only >>> type of stats they could ever possibly have). >> >> I think 0001 and 0002 seem mostly fine, but it seems a bit strange to do >> the (!rte->inh) check in the rel->statlist loops. AFAICS both places >> could do that right at the beginning, because it does not depend on the >> statistics object at all, just the RelOptInfo. > > I probably did this to make the code change small, to avoid indentin the whole > block. But indenting the block is not necessary. It's possible to do something like this: if (!rel->inh) return 1.0; or whatever is the "default" result for that function. > >>> And the 3rd+4th patches (Tomas' plus my changes) allow collecting both >>> inherited and non-inherited stats, only in master, since it requires a >>> catalog >>> change. It's a bit confusing that patch #4 removes most what I added in >>> patches 1 and 2. But that's exactly what's needed to collect and apply both >>> inherited and non-inherited stats: the first two patches avoid applying >>> stats >>> collected with the wrong inheritence. That's also what's needed for the >>> patchset to follow the normal "apply to master and backpatch" process, >>> rather >>> than 2 patches which are backpatched but not applied to master, and one >>> which >>> is applied to master and not backpatched.. >>> >> >> Yeah. Af first I was a bit confused because after applying 0003 there >> are both the fixes and the "correct" way, but then I realized 0004 >> removes the unnecessary bits. > > This was to leave your 0003 (mostly) unchanged, so you can see and/or apply my > changes. They should be squished together. > Yep. >> The one thing 0003 still needs is to rework the places that need to >> touch both inh and !inh stats. The patch simply does >> >> for (inh = 0; inh <= 1; inh++) { ... } >> >> but that feels a bit too hackish. But if we don't know which of the two >> stats exist, I'm not sure what to do about it. > > There's also this: > > On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: >> + /* create only the "stxdinherit=false", because that always exists */ >> + datavalues[Anum_pg_statistic_ext_data_stxdinherit - 1] = >> ObjectIdGetDatum(false); >> >> That'd be confusing for partitioned tables, no? >> They'd always have an row with no data. >> I guess it could be stxdinherit = BoolGetDatum(rel->rd_rel->relkind == >> RELKIND_PARTITIONED_TABLE). >> (not ObjectIdGetDatum). >> Then, that affects the loops which delete the tuples - neither inh nor !inh >> is >> guaranteed, unless you check relkind there, too. > > Maybe the for inh<=1 loop should instead be two calls to new functions > factored > out of get_relation_statistics() and RemoveStatisticsById(), which take "bool > inh". > Well, yeah. That's part of the strange 1:2 mapping between the stats definition and data. Although, even with regular stats we have such mapping, except the "definition" is the pg_attribute row. >> And I'm not sure we do the right thing after removing children, for example >> (that should drop the inheritance stats, I guess). > > Do you mean for inheritance only ? Or partitions too ? > I think for partitions, the stats should stay. > And for inheritence, they can stay, for consistency with partitions, and since > it does no harm. > I think the behavior should be the same as for data in pg_statistic, i.e. if we keep/remove those, we should do the same thing for extended statistics. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Wed, Nov 03, 2021 at 11:48:44PM +0100, Tomas Vondra wrote: > On 10/8/21 12:45 AM, Justin Pryzby wrote: > > On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote: > >> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > >>> On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: > It seems like your patch should also check "inh" in examine_variable and > statext_expressions_load. > >>> > >>> I tried adding that - I mostly kept my patches separate. > >>> Hopefully this is more helpful than a complication. > >>> I added at: https://commitfest.postgresql.org/35/3332/ > >>> > >> > >> Actually, this is confusing. Which patch is the one we should be > >> reviewing? > > > > It is confusing, but not as much as I first thought. Please check the > > commit > > messages. > > > > The first two patches are meant to be applied to master *and* backpatched. > > The > > first one intends to fixes the bug that non-inherited stats are being used > > for > > queries of inheritance trees. The 2nd one fixes the regression that stats > > are > > not collected for inheritence trees of partitioned tables (which is the only > > type of stats they could ever possibly have). > > I think 0001 and 0002 seem mostly fine, but it seems a bit strange to do > the (!rte->inh) check in the rel->statlist loops. AFAICS both places > could do that right at the beginning, because it does not depend on the > statistics object at all, just the RelOptInfo. I probably did this to make the code change small, to avoid indentin the whole block. > > And the 3rd+4th patches (Tomas' plus my changes) allow collecting both > > inherited and non-inherited stats, only in master, since it requires a > > catalog > > change. It's a bit confusing that patch #4 removes most what I added in > > patches 1 and 2. But that's exactly what's needed to collect and apply both > > inherited and non-inherited stats: the first two patches avoid applying > > stats > > collected with the wrong inheritence. That's also what's needed for the > > patchset to follow the normal "apply to master and backpatch" process, > > rather > > than 2 patches which are backpatched but not applied to master, and one > > which > > is applied to master and not backpatched.. > > > > Yeah. Af first I was a bit confused because after applying 0003 there > are both the fixes and the "correct" way, but then I realized 0004 > removes the unnecessary bits. This was to leave your 0003 (mostly) unchanged, so you can see and/or apply my changes. They should be squished together. > The one thing 0003 still needs is to rework the places that need to > touch both inh and !inh stats. The patch simply does > > for (inh = 0; inh <= 1; inh++) { ... } > > but that feels a bit too hackish. But if we don't know which of the two > stats exist, I'm not sure what to do about it. There's also this: On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > + /* create only the "stxdinherit=false", because that always exists */ > + datavalues[Anum_pg_statistic_ext_data_stxdinherit - 1] = > ObjectIdGetDatum(false); > > That'd be confusing for partitioned tables, no? > They'd always have an row with no data. > I guess it could be stxdinherit = BoolGetDatum(rel->rd_rel->relkind == > RELKIND_PARTITIONED_TABLE). > (not ObjectIdGetDatum). > Then, that affects the loops which delete the tuples - neither inh nor !inh is > guaranteed, unless you check relkind there, too. Maybe the for inh<=1 loop should instead be two calls to new functions factored out of get_relation_statistics() and RemoveStatisticsById(), which take "bool inh". > And I'm not sure we do the right thing after removing children, for example > (that should drop the inheritance stats, I guess). Do you mean for inheritance only ? Or partitions too ? I think for partitions, the stats should stay. And for inheritence, they can stay, for consistency with partitions, and since it does no harm.
Re: extended stats on partitioned tables
On 10/8/21 12:45 AM, Justin Pryzby wrote: > On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote: >> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: >>> On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: It seems like your patch should also check "inh" in examine_variable and statext_expressions_load. >>> >>> I tried adding that - I mostly kept my patches separate. >>> Hopefully this is more helpful than a complication. >>> I added at: https://commitfest.postgresql.org/35/3332/ >>> >> >> Actually, this is confusing. Which patch is the one we should be >> reviewing? > > It is confusing, but not as much as I first thought. Please check the commit > messages. > > The first two patches are meant to be applied to master *and* backpatched. > The > first one intends to fixes the bug that non-inherited stats are being used for > queries of inheritance trees. The 2nd one fixes the regression that stats are > not collected for inheritence trees of partitioned tables (which is the only > type of stats they could ever possibly have). > I think 0001 and 0002 seem mostly fine, but it seems a bit strange to do the (!rte->inh) check in the rel->statlist loops. AFAICS both places could do that right at the beginning, because it does not depend on the statistics object at all, just the RelOptInfo. > And the 3rd+4th patches (Tomas' plus my changes) allow collecting both > inherited and non-inherited stats, only in master, since it requires a catalog > change. It's a bit confusing that patch #4 removes most what I added in > patches 1 and 2. But that's exactly what's needed to collect and apply both > inherited and non-inherited stats: the first two patches avoid applying stats > collected with the wrong inheritence. That's also what's needed for the > patchset to follow the normal "apply to master and backpatch" process, rather > than 2 patches which are backpatched but not applied to master, and one which > is applied to master and not backpatched.. > Yeah. Af first I was a bit confused because after applying 0003 there are both the fixes and the "correct" way, but then I realized 0004 removes the unnecessary bits. The one thing 0003 still needs is to rework the places that need to touch both inh and !inh stats. The patch simply does for (inh = 0; inh <= 1; inh++) { ... } but that feels a bit too hackish. But if we don't know which of the two stats exist, I'm not sure what to do about it. And I'm not sure we do the right thing after removing children, for example (that should drop the inheritance stats, I guess). The 1:2 mapping between pg_statistic_ext and pg_statistic_ext_data is a bit strange, but I can't think of a better way. > @Tomas: I just found commit 427c6b5b9, which is a remarkably similar issue > affecting column stats 15 years ago. > What can I say? The history repeats itself ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote: > On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > > On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: > > > It seems like your patch should also check "inh" in examine_variable and > > > statext_expressions_load. > > > > I tried adding that - I mostly kept my patches separate. > > Hopefully this is more helpful than a complication. > > I added at: https://commitfest.postgresql.org/35/3332/ > > > > Actually, this is confusing. Which patch is the one we should be > reviewing? It is confusing, but not as much as I first thought. Please check the commit messages. The first two patches are meant to be applied to master *and* backpatched. The first one intends to fixes the bug that non-inherited stats are being used for queries of inheritance trees. The 2nd one fixes the regression that stats are not collected for inheritence trees of partitioned tables (which is the only type of stats they could ever possibly have). And the 3rd+4th patches (Tomas' plus my changes) allow collecting both inherited and non-inherited stats, only in master, since it requires a catalog change. It's a bit confusing that patch #4 removes most what I added in patches 1 and 2. But that's exactly what's needed to collect and apply both inherited and non-inherited stats: the first two patches avoid applying stats collected with the wrong inheritence. That's also what's needed for the patchset to follow the normal "apply to master and backpatch" process, rather than 2 patches which are backpatched but not applied to master, and one which is applied to master and not backpatched.. @Tomas: I just found commit 427c6b5b9, which is a remarkably similar issue affecting column stats 15 years ago. Rebased since there were conflicts with my typos fixes. -- Justin >From f6fb0e139e5b19ea288040328590a91a8526e2a6 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 25 Sep 2021 19:42:41 -0500 Subject: [PATCH v2 1/5] Do not use extended statistics on inheritence trees.. Since 859b3003de, inherited ext stats are not built. However, the non-inherited stats stats were incorrectly used during planning of queries with inheritence heirarchies. Since the ext stats do not include child tables, they can lead to worse estimates. This is remarkably similar to 427c6b5b9, which affected column statistics 15 years ago. choose_best_statistics is handled a bit differently (in the calling function), because it isn't passed rel nor rel->inh, and it's an exported function, so avoid changing its signature in back branches. https://www.postgresql.org/message-id/flat/20210925223152.ga7...@telsasoft.com Backpatch to v10 --- src/backend/statistics/dependencies.c | 5 + src/backend/statistics/extended_stats.c | 5 + src/backend/utils/adt/selfuncs.c| 9 + src/test/regress/expected/stats_ext.out | 23 +++ src/test/regress/sql/stats_ext.sql | 14 ++ 5 files changed, 56 insertions(+) diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index 8bf80db8e4..b2e33329c7 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -1593,6 +1593,11 @@ dependencies_clauselist_selectivity(PlannerInfo *root, int nexprs; int k; MVDependencies *deps; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + break; /* skip statistics that are not of the correct type */ if (stat->kind != STATS_EXT_DEPENDENCIES) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 69ca52094f..6c69e5bc96 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1744,6 +1744,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli StatisticExtInfo *stat; List *stat_clauses; Bitmapset *simple_clauses; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + break; /* find the best suited statistics object for these attnums */ stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV, diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..a0932e39e1 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3913,6 +3913,11 @@ estimate_multivariate_ndistinct(PlannerInfo *root, RelOptInfo *rel, Oid statOid = InvalidOid; MVNDistinct *stats; StatisticExtInfo *matched_info = NULL; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return false; /* bail out immediately if the
Re: extended stats on partitioned tables
On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: > > It seems like your patch should also check "inh" in examine_variable and > > statext_expressions_load. > > I tried adding that - I mostly kept my patches separate. > Hopefully this is more helpful than a complication. > I added at: https://commitfest.postgresql.org/35/3332/ > Actually, this is confusing. Which patch is the one we should be reviewing? -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: extended stats on partitioned tables
On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote: > It seems like your patch should also check "inh" in examine_variable and > statext_expressions_load. I tried adding that - I mostly kept my patches separate. Hopefully this is more helpful than a complication. I added at: https://commitfest.postgresql.org/35/3332/ + /* create only the "stxdinherit=false", because that always exists */ + datavalues[Anum_pg_statistic_ext_data_stxdinherit - 1] = ObjectIdGetDatum(false); That'd be confusing for partitioned tables, no? They'd always have an row with no data. I guess it could be stxdinherit = BoolGetDatum(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE). (not ObjectIdGetDatum). Then, that affects the loops which delete the tuples - neither inh nor !inh is guaranteed, unless you check relkind there, too. BTW, you'd need to add an "inherited" column to \dX if you added the "built" data back. Also, I think in backbranches we should document what's being stored in pg_statistic_ext, since it's pretty unintuitive: - noninherted stats (FROM ONLY) for inheritence parents; - inherted stats (FROM *) for partitioned tables; I think the !inh decision in 859b3003de was basically backwards. I think it'd be rare for someone to put extended stats on a parent for improving plans involving FROM ONLY. But it's not worth trying to fix now, since it would change plans in irreversible ways. Also, if the stx data were already populated, users would have to run a manual analyze after upgrading to populate the catalog with the data the planner would expect in the new version, or else it would end up being the opposite of the issue I mentioned: non-inherited stats (from before the upgrade) would be applied by the planner (after the upgrade) to inherited queries. >From b8f9e453b6cd05718a7a149f7e472d6c2c28c8a6 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 25 Sep 2021 19:42:41 -0500 Subject: [PATCH 1/5] Do not use extended statistics on inheritence trees.. Since 859b3003de, inherited ext stats are not built. However, the non-inherited stats stats were incorrectly used during planning of queries with inheritence heirarchies. Since the ext stats do not include child tables, they can lead to worse estimates. choose_best_statistics is handled a bit differently (in the calling function), because it isn't passed rel nor rel->inh, and it's an exported function, so avoid changing its signature in back branches. https://www.postgresql.org/message-id/flat/20210925223152.ga7...@telsasoft.com Backpatch to v10 --- src/backend/statistics/dependencies.c | 5 + src/backend/statistics/extended_stats.c | 5 + src/backend/utils/adt/selfuncs.c| 9 + src/test/regress/expected/stats_ext.out | 23 +++ src/test/regress/sql/stats_ext.sql | 14 ++ 5 files changed, 56 insertions(+) diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index 8bf80db8e4..b2e33329c7 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -1593,6 +1593,11 @@ dependencies_clauselist_selectivity(PlannerInfo *root, int nexprs; int k; MVDependencies *deps; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + break; /* skip statistics that are not of the correct type */ if (stat->kind != STATS_EXT_DEPENDENCIES) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 0a7d12d467..eea38a5bc7 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1744,6 +1744,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli StatisticExtInfo *stat; List *stat_clauses; Bitmapset *simple_clauses; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + break; /* find the best suited statistics object for these attnums */ stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV, diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index abcb628a39..7533574fdc 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3913,6 +3913,11 @@ estimate_multivariate_ndistinct(PlannerInfo *root, RelOptInfo *rel, Oid statOid = InvalidOid; MVNDistinct *stats; StatisticExtInfo *matched_info = NULL; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + + /* If it's an inheritence tree, skip statistics (which do not include child stats) */ + if (rte->inh) + return false; /* bail out immediately if the table has no extended statistics */ if (!rel->statlist) @@ -5232,6 +5237,10 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, if
Re: extended stats on partitioned tables
On 9/25/21 11:46 PM, Justin Pryzby wrote: > On Sat, Sep 25, 2021 at 11:01:21PM +0200, Tomas Vondra wrote: >>> Do you think it's possible to backpatch a fix to handle partitioned tables >>> specifically ? >>> >>> The "tuple already updated" error which I reported and which was fixed by >>> 859b3003 involved inheritence children. Since partitioned tables have no >>> data >>> themselves, the !inh check could be relaxed. It's not totally clear to me >>> if >>> the correct statistics would be used in that case. I suppose the wrong >>> (inherited) stats would be wrongly applied affect queries FROM ONLY a >>> partitioned table, which seems pointless to write and also hard for the >>> estimates to be far off :) >> >> Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure we >> never build stats with and without inheritance at the same time (for the >> same rel). The 859b3003de ensures that by only building extended stats in >> the (!inh) case, but we might tweak that based on relkind. See the attached >> patch. But I wonder if there are cases that might be hurt by this - that'd >> be a regression too, of course. > > I think we should leave the inheritance case alone, since it hasn't changed in > 2 years, and building stats on the table ONLY is a legitimate interpretation, > and it's as good as we can do without the catalog change. > > But the partitioned case used to work, and there's no utility in selecting > FROM > ONLY a partitioned table, so we might as well build the stats including its > partitions. > > I don't think anything would get worse for the partitioned case. > Obviously building inherited ext stats could change plans - that's the point. > It's weird that the stats objects which existed for 18 months before being > "built" after the patch was applied, but no so weird that the release notes > wouldn't be ample documentation. > Agreed. > If building statistics caused the plan to change undesirably, the solution > would be to drop the stats object, of course. > > + build_ext_stats = (onerel->rd_rel->relkind == > RELKIND_PARTITIONED_TABLE) ? inh : (!inh); > > > > It's weird to build inherited extended stats for partitioned tables but not > for > inheritence parents. We could be clever and say "build inherited ext stats > for > inheritence parents only if we didn't insert any stats for the table itself > (because it's empty)". But I think that's fragile: a single tuple in the > parent table could cause stats to be built there instead of on its heirarchy, > and the extended stats would be used for *both* FROM and FROM ONLY, which is > an > awful combination. > I don't think there's a good way to check if there are any rows in the parent relation. And even then, a single row might cause huge changes to query plans (essentially switching to very different stats). > Since do_analyze_rel is only called once for partitioned tables, I think you > could write that as: > > /* Do not build inherited stats (since the catalog cannot support it) except > * for partitioned tables, for which numrows==0 and have no non-inherited > stats */ > build_ext_stats = !inh || onerel->rd_rel->relkind == > RELKIND_PARTITIONED_TABLE; > Good point. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
(Resending with -hackers) It seems like your patch should also check "inh" in examine_variable and statext_expressions_load. Which leads to another issue in stable branches: ANALYZE builds only non-inherited stats, but they're incorrectly used for inherited queries - the rowcount estimate is worse on inheritence parents with extended stats than without. CREATE TABLE p(i int, j int); CREATE TABLE p1() INHERITS(p); INSERT INTO p SELECT a, a/10 FROM generate_series(1,9)a; INSERT INTO p1 SELECT a, a FROM generate_series(1,999)a; CREATE STATISTICS ps ON i,j FROM p; VACUUM ANALYZE p,p1; postgres=# explain analyze SELECT * FROM p GROUP BY 1,2; HashAggregate (cost=26.16..26.25 rows=9 width=8) (actual time=2.571..3.282 rows=1008 loops=1) postgres=# begin; DROP STATISTICS ps; explain analyze SELECT * FROM p GROUP BY 1,2; rollback; HashAggregate (cost=26.16..36.16 rows=1000 width=8) (actual time=2.167..2.872 rows=1008 loops=1) I guess examine_variable() should have corresponding logic to the hardcoded !inh in analyze.c. -- Justin
Re: extended stats on partitioned tables
On Sat, Sep 25, 2021 at 11:01:21PM +0200, Tomas Vondra wrote: > > Do you think it's possible to backpatch a fix to handle partitioned tables > > specifically ? > > > > The "tuple already updated" error which I reported and which was fixed by > > 859b3003 involved inheritence children. Since partitioned tables have no > > data > > themselves, the !inh check could be relaxed. It's not totally clear to me > > if > > the correct statistics would be used in that case. I suppose the wrong > > (inherited) stats would be wrongly applied affect queries FROM ONLY a > > partitioned table, which seems pointless to write and also hard for the > > estimates to be far off :) > > Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure we > never build stats with and without inheritance at the same time (for the > same rel). The 859b3003de ensures that by only building extended stats in > the (!inh) case, but we might tweak that based on relkind. See the attached > patch. But I wonder if there are cases that might be hurt by this - that'd > be a regression too, of course. I think we should leave the inheritance case alone, since it hasn't changed in 2 years, and building stats on the table ONLY is a legitimate interpretation, and it's as good as we can do without the catalog change. But the partitioned case used to work, and there's no utility in selecting FROM ONLY a partitioned table, so we might as well build the stats including its partitions. I don't think anything would get worse for the partitioned case. Obviously building inherited ext stats could change plans - that's the point. It's weird that the stats objects which existed for 18 months before being "built" after the patch was applied, but no so weird that the release notes wouldn't be ample documentation. If building statistics caused the plan to change undesirably, the solution would be to drop the stats object, of course. + build_ext_stats = (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ? inh : (!inh); It's weird to build inherited extended stats for partitioned tables but not for inheritence parents. We could be clever and say "build inherited ext stats for inheritence parents only if we didn't insert any stats for the table itself (because it's empty)". But I think that's fragile: a single tuple in the parent table could cause stats to be built there instead of on its heirarchy, and the extended stats would be used for *both* FROM and FROM ONLY, which is an awful combination. Since do_analyze_rel is only called once for partitioned tables, I think you could write that as: /* Do not build inherited stats (since the catalog cannot support it) except * for partitioned tables, for which numrows==0 and have no non-inherited stats */ build_ext_stats = !inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE; -- Justin
Re: extended stats on partitioned tables
On 9/25/21 9:53 PM, Justin Pryzby wrote: On Sat, Sep 25, 2021 at 09:27:10PM +0200, Tomas Vondra wrote: On 9/23/21 11:26 PM, Justin Pryzby wrote: extended stats objects are allowed on partitioned tables since v10. https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty. This was the consequence of a commit to avoid an error I reported with stats on inheritence parents (not partitioned tables). preceding 859b3003de, stats on the parent table *did* improve the estimate, so this part of the commit message seems to have been wrong? |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb |Don't build extended statistics on inheritance trees ... |Moreover, the current selectivity estimation code only works with individual |relations, so building statistics on inheritance trees would be pointless |anyway. |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i); |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100); |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a; |CREATE STATISTICS pp ON (a),(b) FROM p; |VACUUM ANALYZE p; |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass; |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort; | HashAggregate (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1) |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2; | HashAggregate (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1) So I think this is a regression, and extended stats should be populated for partitioned tables - I had actually done that for some parent tables and hadn't noticed that the stats objects no longer do anything. ... Agreed, that seems like a regression, but I don't see how to fix that without having the extra flag in the catalog. Otherwise we can store just one version for each statistics object :-( Do you think it's possible to backpatch a fix to handle partitioned tables specifically ? The "tuple already updated" error which I reported and which was fixed by 859b3003 involved inheritence children. Since partitioned tables have no data themselves, the !inh check could be relaxed. It's not totally clear to me if the correct statistics would be used in that case. I suppose the wrong (inherited) stats would be wrongly applied affect queries FROM ONLY a partitioned table, which seems pointless to write and also hard for the estimates to be far off :) Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure we never build stats with and without inheritance at the same time (for the same rel). The 859b3003de ensures that by only building extended stats in the (!inh) case, but we might tweak that based on relkind. See the attached patch. But I wonder if there are cases that might be hurt by this - that'd be a regression too, of course. Attached is a PoC that I quickly bashed together today. It's pretty raw, but it passed "make check" and I think it does most of the things right. Can you try if this fixes the estimates with partitioned tables? I think pg_stats_ext_exprs also needs to expose the inherited flag. Yeah, I only did the bare minimum to get the PoC working. I'm sure there are various other loose ends. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 8bfb2ad958..299f4893b8 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -548,6 +548,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params, { MemoryContext col_context, old_context; + bool build_ext_stats; pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, PROGRESS_ANALYZE_PHASE_COMPUTE_STATS); @@ -611,13 +612,15 @@ do_analyze_rel(Relation onerel, VacuumParams *params, thisdata->attr_cnt, thisdata->vacattrstats); } + build_ext_stats = (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ? inh : (!inh); + /* * Build extended statistics (if there are any). * * For now we only build extended statistics on individual relations, * not for relations representing inheritance trees. */ - if (!inh) + if (build_ext_stats) BuildRelationExtStatistics(onerel, totalrows, numrows, rows, attr_cnt, vacattrstats); }
Re: extended stats on partitioned tables
On Sat, Sep 25, 2021 at 09:27:10PM +0200, Tomas Vondra wrote: > On 9/23/21 11:26 PM, Justin Pryzby wrote: > > extended stats objects are allowed on partitioned tables since v10. > > https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com > > 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b > > > > But since 859b3003de they're not populated - pg_statistic_ext(_data) is > > empty. > > This was the consequence of a commit to avoid an error I reported with > > stats on > > inheritence parents (not partitioned tables). > > > > preceding 859b3003de, stats on the parent table *did* improve the estimate, > > so this part of the commit message seems to have been wrong? > > |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb > > |Don't build extended statistics on inheritance trees > > ... > > |Moreover, the current selectivity estimation code only works with > > individual > > |relations, so building statistics on inheritance trees would be > > pointless > > |anyway. > > > > |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i); > > |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100); > > |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM > > generate_series(1,999)a; > > |CREATE STATISTICS pp ON (a),(b) FROM p; > > |VACUUM ANALYZE p; > > |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass; > > > > |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p > > GROUP BY 1,2; abort; > > | HashAggregate (cost=20.98..21.98 rows=100 width=8) (actual > > time=1.088..1.093 rows=10 loops=1) > > > > |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2; > > | HashAggregate (cost=20.98..21.09 rows=10 width=8) (actual > > time=1.082..1.086 rows=10 loops=1) > > > > So I think this is a regression, and extended stats should be populated for > > partitioned tables - I had actually done that for some parent tables and > > hadn't > > noticed that the stats objects no longer do anything. ... > Agreed, that seems like a regression, but I don't see how to fix that > without having the extra flag in the catalog. Otherwise we can store just > one version for each statistics object :-( Do you think it's possible to backpatch a fix to handle partitioned tables specifically ? The "tuple already updated" error which I reported and which was fixed by 859b3003 involved inheritence children. Since partitioned tables have no data themselves, the !inh check could be relaxed. It's not totally clear to me if the correct statistics would be used in that case. I suppose the wrong (inherited) stats would be wrongly applied affect queries FROM ONLY a partitioned table, which seems pointless to write and also hard for the estimates to be far off :) > Attached is a PoC that I quickly bashed together today. It's pretty raw, but > it passed "make check" and I think it does most of the things right. Can you > try if this fixes the estimates with partitioned tables? I think pg_stats_ext_exprs also needs to expose the inherited flag. Thanks, -- Justin
Re: extended stats on partitioned tables
On 9/23/21 11:26 PM, Justin Pryzby wrote: extended stats objects are allowed on partitioned tables since v10. https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty. This was the consequence of a commit to avoid an error I reported with stats on inheritence parents (not partitioned tables). preceding 859b3003de, stats on the parent table *did* improve the estimate, so this part of the commit message seems to have been wrong? |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb |Don't build extended statistics on inheritance trees ... |Moreover, the current selectivity estimation code only works with individual |relations, so building statistics on inheritance trees would be pointless |anyway. |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i); |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100); |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a; |CREATE STATISTICS pp ON (a),(b) FROM p; |VACUUM ANALYZE p; |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass; |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort; | HashAggregate (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1) |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2; | HashAggregate (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1) So I think this is a regression, and extended stats should be populated for partitioned tables - I had actually done that for some parent tables and hadn't noticed that the stats objects no longer do anything. That begs the question if the current behavior for inheritence parents is correct.. CREATE TABLE p (i int, a int, b int); CREATE TABLE pd () INHERITS (p); INSERT INTO pd SELECT 1, a/100, a/100 FROM generate_series(1,999)a; CREATE STATISTICS pp ON (a),(b) FROM p; VACUUM ANALYZE p; explain analyze SELECT a,b FROM p GROUP BY 1,2; | HashAggregate (cost=25.99..26.99 rows=100 width=8) (actual time=3.268..3.284 rows=10 loops=1) Agreed, that seems like a regression, but I don't see how to fix that without having the extra flag in the catalog. Otherwise we can store just one version for each statistics object :-( Since child tables can be queried directly, it's a legitimate question whether we should collect stats for the table heirarchy or (since the catalog only supports one) only the table itself. I'd think that stats for the table hierarchy would be more commonly useful (but we shouldn't change the behavior in existing releases again). Anyway it seems unfortunate that statistic_ext_data still has no stxinherited. Yeah, we probably need the flag - I planned to get it into 14, but then I got distracted by something else :-/ Attached is a PoC that I quickly bashed together today. It's pretty raw, but it passed "make check" and I think it does most of the things right. Can you try if this fixes the estimates with partitioned tables? Extended statistics use two catalogs, pg_statistic_ext for definition, while pg_statistic_ext_data stores the built statistics objects - the flag needs to be in the "data" catalog, and managing the records is a bit challenging - the current PoC code mostly works, but I had to relax some error checks and I'm sure there are cases when we fail to remove a row, or something like that. Note that for partitioned tables if I enable enable_partitionwise_aggregate, then stats objects on the child tables can be helpful (but that's also confusing to the question at hand). Yeah. I think it'd be helpful to assemble a script with various test cases demonstrating how we estimate various cases. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 2f0def9b19..e256a5533c 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7441,6 +7441,19 @@ SCRAM-SHA-256$iteration count: created with CREATE STATISTICS. + + Normally there is one entry, with stxdinherit = + false, for each statistics object that has been analyzed. + If the table has inheritance children, a second entry with + stxdinherit = true is also created. + This row represents the statistics object over the inheritance tree, i.e., + statistics for the data you'd see with + SELECT * FROM table*, + whereas the stxdinherit = false row + represents the results of + SELECT * FROM ONLY table. + + Like pg_statistic, pg_statistic_ext_data should not be @@ -7480,6 +7493,16 @@ SCRAM-SHA-256$iteration count: + + + stxdinherit bool + + + If true, the stats include inheritance child columns, not just the + values in the