Re: [HACKERS] multivariate statistics (v25)
On 6 April 2017 at 10:17, Simon Riggs wrote: > On 5 April 2017 at 10:47, David Rowley wrote: > >> I've attached an updated patch to address Tomas' concerns and yours too. > > Committed, with some doc changes and additions based upon my explorations. Great. Thanks for committing! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
On 5 April 2017 at 10:47, David Rowley wrote: > I've attached an updated patch to address Tomas' concerns and yours too. Committed, with some doc changes and additions based upon my explorations. For the record, I measured the time to calc extended statistics as +800ms on a 2 million row sample. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] multivariate statistics (v25)
On 6 April 2017 at 07:19, Tels wrote: > I know I'm a bit late, but isn't the syntax backwards? > > "CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;" > > These do it the other way round: > > CREATE INDEX idx ON table (col_a); > > AND: > >CREATE TABLE t ( > id INT REFERENCES table_2 (col_b); >); > > Won't this be confusing and make things hard to remember? > > Sorry for not asking earlier, I somehow missed this. The reasoning is in [1] [1] https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=vbj78v...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] multivariate statistics (v25)
Moin, On Wed, April 5, 2017 2:52 pm, Simon Riggs wrote: > On 5 April 2017 at 10:47, David Rowley > wrote: > >>> I have some other comments. > > Me too. > > > CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE. > > This change is in line with other changes in this and earlier > releases. Comments and docs included. > > Patch ready to be applied directly barring objections. I know I'm a bit late, but isn't the syntax backwards? "CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;" These do it the other way round: CREATE INDEX idx ON table (col_a); AND: CREATE TABLE t ( id INT REFERENCES table_2 (col_b); ); Won't this be confusing and make things hard to remember? Sorry for not asking earlier, I somehow missed this. Regards, Tels
Re: [HACKERS] multivariate statistics (v25)
On 5 April 2017 at 10:47, David Rowley wrote: >> I have some other comments. Me too. CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE. This change is in line with other changes in this and earlier releases. Comments and docs included. Patch ready to be applied directly barring objections. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services create_statistics_lock_reduction.v1.patch Description: Binary data
Re: [HACKERS] multivariate statistics (v25)
On 5 April 2017 at 14:53, Kyotaro HORIGUCHI wrote: > At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra > wrote in > <56f40b20-c464-fad2-ff39-06b668fac...@2ndquadrant.com> >> Two minor comments: >> >> 1) DEPENDENCY_MIN_GROUP_SIZE >> >> I'm not sure we still need the min_group_size, when evaluating >> dependencies. It was meant to deal with 'noisy' data, but I think >> after switching to the 'degree' it might actually be a bad idea. Yeah, I'd wondered about this when I first started testing the patch. I failed to get any functional dependencies because my values were too unique. Seems I'd gotten a bit used to it, and in the end thought that if the values are unique enough then they won't suffer as much from the underestimation problem you're trying to solve here. I've removed that part of the code now. > I think the same. A large part of functional dependencies in > reality are of this kind. > >> 2) A minor detail is that instead of this >> >> if (estimatedclauses != NULL && >> bms_is_member(listidx, estimatedclauses)) >> continue; >> >> perhaps we should do just this: >> >> if (bms_is_member(listidx, estimatedclauses)) >> continue; >> >> bms_is_member does the same NULL check right at the beginning, so I >> don't think this might make a measurable difference. hmm yeah, I'd added that because I thought the estimatedclauses would be NULL in 99.9% of cases and thought that I might be able to shave a few cycles off. I see that there's an x < 0 test before the NULL test in the function. Anyway, I'm not going to put up a fight here, so I've removed it. I didn't ever benchmark anything to see if the extra test actually helped anyway... > I have some other comments. > > == > - The comment for clauselist_selectivity, > | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply > | + * selectivity estimates using any extended statistics on 'rel'.
> > The 'rel' is actually a parameter but rtekind means rel->rtekind > so this might be better be such like the following. > > | When a relation of RTE_RELATION is given as 'rel', we try > | extended statistics on the relation. > > Then the following line doesn't seem to be required. > > | + * If we identify such extended statistics exist, we try to apply them. Yes, good point. I've revised this comment a bit now. > > = > The following comment in the same function, > > | +if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) > | +{ > | +/* > | + * Try to estimate with multivariate functional dependency > statistics. > | + * > | + * The function will supply an estimate for the clauses which it > | + * estimated for. Any clauses which were unsuitable were ignored. > | + * Clauses which were estimated will have their 0-based list index > set > | + * in estimatedclauses. We must ignore these clauses when > processing > | + * the remaining clauses later. > | + */ > > (Notice that I'm not a good writer) This might better be the > following. > > | dependencies_clauselist_selectivity gives selectivity over > | clauses that functional dependencies on the given relation is > | applicable. 0-based index numbers of consumed clauses are > | returned in the bitmap set estimatedclauses so that the > | estimation hereafter can ignore them. I've changed this one too now. > = > | +s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid, > | + jointype, sjinfo, rel, > &estimatedclauses); > > The name prefix "dependency_" means "functional_dependency" here > and omitting "functional" is confusing to me. On the other hand > "functional_dependency" is quite long as prefix. Could we use > "func_dependency" or something that is shorter but meaningful? > (But this change causes renaming of many other stuff..) oh no! Many functions in dependencies.c start with dependencies_.
To me, it's a bit of an OOP thing, which if we'd been using some other language would have been dependencies->clauselist_selectivity(). Of course, not all functions in that file follow that rule, but I don't feel a pressing need to go make that any worse. Perhaps the prefix could be func_dependency, but I really don't feel very excited about having it that way, and even less so about making the change. > = > The name "dependency_compatible_clause" might be meaningful if it > were "clause_is_compatible_with_(functional_)dependency" or such. I could maybe squeeze the word "is" in there. ... OK done. > = > dependency_compatible_walker() returns true if given node is > *not* compatible. Isn't it confusing? Yeah. > > = > dependency_compatible_walker() seems implicitly expecting that > RestrictInfo will be given at the first. RestrictInfo might( > should be processed outside this function in _compatible_clause(). Actually, I don't really see a great need for this to be a recursive walker typ
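[For readers following from the sidelines: the control flow being discussed above — extended statistics consume some clauses first, recording their 0-based list indexes so the per-clause loop skips them — can be sketched in miniature. This is a hypothetical Python model of that flow, with made-up names; the real code uses a C Bitmapset and lives in clausesel.c and dependencies.c.]

```python
def clauselist_selectivity(clauses, ext_estimator=None):
    """Combine per-clause selectivities, letting an extended-statistics
    estimator consume some clauses first.

    Toy sketch of the flow discussed in this thread, not the PostgreSQL
    implementation: 'clauses' are callables returning a selectivity, and
    'ext_estimator' returns (selectivity, set-of-consumed-indexes).
    """
    s = 1.0
    estimated = set()          # 0-based indexes of consumed clauses
    if ext_estimator is not None:
        s, estimated = ext_estimator(clauses)
    for idx, clause in enumerate(clauses):
        if idx in estimated:   # already covered by extended stats
            continue
        s *= clause()          # fall back to per-clause estimation
    return s

# Two perfectly dependent clauses plus one independent clause.
deps = lambda cl: (0.1, {0, 1})     # clauses 0 and 1 estimated together
clauses = [lambda: 0.1, lambda: 0.1, lambda: 0.5]
print(clauselist_selectivity(clauses))        # naive product, ~0.005
print(clauselist_selectivity(clauses, deps))  # with dependencies, ~0.05
```

[The point of tracking indexes rather than splitting the clause list is exactly what David describes: the remaining loop stays a single pass, skipping consumed entries.]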
Re: [HACKERS] multivariate statistics (v25)
On 04/05/2017 08:41 AM, Sven R. Kunze wrote: Thanks Tomas and David for hacking on this patch. On 04.04.2017 20:19, Tomas Vondra wrote: I'm not sure we still need the min_group_size, when evaluating dependencies. It was meant to deal with 'noisy' data, but I think after switching to the 'degree' it might actually be a bad idea. Consider this: create table t (a int, b int); insert into t select 1, 1 from generate_series(1, 1) s(i); insert into t select i, i from generate_series(2, 2) s(i); create statistics s with (dependencies) on (a,b) from t; analyze t; select stadependencies from pg_statistic_ext ; stadependencies [{1 => 2 : 0.44}, {2 => 1 : 0.44}] (1 row) So the degree of the dependency is just ~0.333 although it's obviously a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason is that we discard 2/3 of rows, because those groups are only a single row each, except for the one large group (1/3 of rows). Just for me to follow the comments better. Is "dependency" roughly the same as when statisticians speak about "conditional probability"? No, it's more 'functional dependency' from relational normal forms. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
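[To connect this to Sven's question: a functional dependency a => b holds when each value of 'a' maps to a single value of 'b', and the 'degree' is the fraction of rows consistent with that. The toy model below is an illustrative approximation, under the assumption that rows in discarded small groups still count in the denominator, which is what effectively deflated the degree in Tomas's example; the real logic is in src/backend/statistics/dependencies.c.]

```python
def dependency_degree(rows, min_group_size=1):
    """Toy model of the degree of the functional dependency a => b.

    A group of rows sharing the same 'a' supports the dependency when
    every row in the group has the same 'b'.  Groups smaller than
    min_group_size never count as supporting, but their rows stay in
    the denominator -- so unique-ish data deflates the degree.
    """
    groups = {}
    for a, b in rows:
        groups.setdefault(a, []).append(b)
    supporting = 0
    for bs in groups.values():
        if len(bs) >= min_group_size and len(set(bs)) == 1:
            supporting += len(bs)
    return supporting / len(rows)

# One large group (1/3 of rows) plus singleton groups: a perfect dependency.
rows = [(1, 1)] * 10000 + [(i, i) for i in range(2, 20002)]
print(dependency_degree(rows, min_group_size=1))  # 1.0
print(dependency_degree(rows, min_group_size=3))  # ~0.333: singletons discarded
```

[With no minimum group size the perfect dependency gets degree 1.0; with a threshold, only the one large group supports it, reproducing the ~0.333 effect described above.]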
Re: [HACKERS] multivariate statistics (v25)
Thanks Tomas and David for hacking on this patch. On 04.04.2017 20:19, Tomas Vondra wrote: I'm not sure we still need the min_group_size, when evaluating dependencies. It was meant to deal with 'noisy' data, but I think after switching to the 'degree' it might actually be a bad idea. Consider this: create table t (a int, b int); insert into t select 1, 1 from generate_series(1, 1) s(i); insert into t select i, i from generate_series(2, 2) s(i); create statistics s with (dependencies) on (a,b) from t; analyze t; select stadependencies from pg_statistic_ext ; stadependencies [{1 => 2 : 0.44}, {2 => 1 : 0.44}] (1 row) So the degree of the dependency is just ~0.333 although it's obviously a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason is that we discard 2/3 of rows, because those groups are only a single row each, except for the one large group (1/3 of rows). Just for me to follow the comments better. Is "dependency" roughly the same as when statisticians speak about "conditional probability"? Sven
Re: [HACKERS] multivariate statistics (v25)
At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra wrote in <56f40b20-c464-fad2-ff39-06b668fac...@2ndquadrant.com> > On 04/04/2017 09:55 AM, David Rowley wrote: > > On 1 April 2017 at 04:25, David Rowley > > wrote: > >> I've attached an updated patch. > > > > I've made another pass at this and ended up removing the tryextstats > > variable. We now only try to use extended statistics when > > clauselist_selectivity() is given a valid RelOptInfo with rtekind == > > RTE_RELATION, and of course, it must also have some extended stats > > defined too. > > > > I've also cleaned up a few more comments, many of which I managed to > > omit updating when I refactored how the selectivity estimates tie > > into clauselist_selectivity() > > > > I'm quite happy with all of this now, and would also be happy for > > other people to take a look and comment. > > > > As a reviewer, I'd be marking this ready for committer, but I've moved > > a little way from just reviewing this now, having spent two weeks > > hacking at it. > > > > The latest patch is attached. > > > > Thanks David, I agree the reworked patch is much cleaner than the last > version I posted. Thanks for spending your time on it. > > Two minor comments: > > 1) DEPENDENCY_MIN_GROUP_SIZE > > I'm not sure we still need the min_group_size, when evaluating > dependencies. It was meant to deal with 'noisy' data, but I think > after switching to the 'degree' it might actually be a bad idea. > > Consider this: > > create table t (a int, b int); > insert into t select 1, 1 from generate_series(1, 1) s(i); > insert into t select i, i from generate_series(2, 2) s(i); > create statistics s with (dependencies) on (a,b) from t; > analyze t; > > select stadependencies from pg_statistic_ext ; > stadependencies > > [{1 => 2 : 0.44}, {2 => 1 : 0.44}] > (1 row) > > So the degree of the dependency is just ~0.333 although it's obviously > a perfect dependency, i.e. a knowledge of 'a' determines 'b'.
The > reason is that we discard 2/3 of rows, because those groups are only a > single row each, except for the one large group (1/3 of rows). > > Without the minimum group size limitation, the dependencies are: > > test=# select stadependencies from pg_statistic_ext ; > stadependencies > > [{1 => 2 : 1.00}, {2 => 1 : 1.00}] > (1 row) > > which seems way more reasonable, I think. I think the same. A large part of functional dependencies in reality are of this kind. > 2) A minor detail is that instead of this > > if (estimatedclauses != NULL && > bms_is_member(listidx, estimatedclauses)) > continue; > > perhaps we should do just this: > > if (bms_is_member(listidx, estimatedclauses)) > continue; > > bms_is_member does the same NULL check right at the beginning, so I > don't think this might make a measurable difference. I have some other comments. == - The comment for clauselist_selectivity, | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply | + * selectivity estimates using any extended statistics on 'rel'. The 'rel' is actually a parameter but rtekind means rel->rtekind so this might be better be such like the following. | When a relation of RTE_RELATION is given as 'rel', we try | extended statistics on the relation. Then the following line doesn't seem to be required. | + * If we identify such extended statistics exist, we try to apply them. = The following comment in the same function, | +if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) | +{ | +/* | + * Try to estimate with multivariate functional dependency statistics. | + * | + * The function will supply an estimate for the clauses which it | + * estimated for. Any clauses which were unsuitable were ignored. | + * Clauses which were estimated will have their 0-based list index set | + * in estimatedclauses. We must ignore these clauses when processing | + * the remaining clauses later. | + */ (Notice that I'm not a good writer) This might better be the following.
| dependencies_clauselist_selectivity gives selectivity over | clauses that functional dependencies on the given relation is | applicable. 0-based index numbers of consumed clauses are | returned in the bitmap set estimatedclauses so that the | estimation hereafter can ignore them. = | +s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid, | + jointype, sjinfo, rel, &estimatedclauses); The name prefix "dependency_" means "functional_dependency" here and omitting "functional" is confusing to me. On the other hand "functional_dependency" is quite long as prefix. Could we use "func_dependency" or something that is sho
Re: [HACKERS] multivariate statistics (v25)
On 04/04/2017 09:55 AM, David Rowley wrote: On 1 April 2017 at 04:25, David Rowley wrote: I've attached an updated patch. I've made another pass at this and ended up removing the tryextstats variable. We now only try to use extended statistics when clauselist_selectivity() is given a valid RelOptInfo with rtekind == RTE_RELATION, and of course, it must also have some extended stats defined too. I've also cleaned up a few more comments, many of which I managed to omit updating when I refactored how the selectivity estimates tie into clauselist_selectivity() I'm quite happy with all of this now, and would also be happy for other people to take a look and comment. As a reviewer, I'd be marking this ready for committer, but I've moved a little way from just reviewing this now, having spent two weeks hacking at it. The latest patch is attached. Thanks David, I agree the reworked patch is much cleaner than the last version I posted. Thanks for spending your time on it. Two minor comments: 1) DEPENDENCY_MIN_GROUP_SIZE I'm not sure we still need the min_group_size, when evaluating dependencies. It was meant to deal with 'noisy' data, but I think after switching to the 'degree' it might actually be a bad idea. Consider this: create table t (a int, b int); insert into t select 1, 1 from generate_series(1, 1) s(i); insert into t select i, i from generate_series(2, 2) s(i); create statistics s with (dependencies) on (a,b) from t; analyze t; select stadependencies from pg_statistic_ext ; stadependencies [{1 => 2 : 0.44}, {2 => 1 : 0.44}] (1 row) So the degree of the dependency is just ~0.333 although it's obviously a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason is that we discard 2/3 of rows, because those groups are only a single row each, except for the one large group (1/3 of rows).
Without the minimum group size limitation, the dependencies are: test=# select stadependencies from pg_statistic_ext ; stadependencies [{1 => 2 : 1.00}, {2 => 1 : 1.00}] (1 row) which seems way more reasonable, I think. 2) A minor detail is that instead of this if (estimatedclauses != NULL && bms_is_member(listidx, estimatedclauses)) continue; perhaps we should do just this: if (bms_is_member(listidx, estimatedclauses)) continue; bms_is_member does the same NULL check right at the beginning, so I don't think this might make a measurable difference. kind regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] multivariate statistics (v25)
On 1 April 2017 at 04:25, David Rowley wrote: > I've attached an updated patch. I've made another pass at this and ended up removing the tryextstats variable. We now only try to use extended statistics when clauselist_selectivity() is given a valid RelOptInfo with rtekind == RTE_RELATION, and of course, it must also have some extended stats defined too. I've also cleaned up a few more comments, many of which I managed to omit updating when I refactored how the selectivity estimates tie into clauselist_selectivity() I'm quite happy with all of this now, and would also be happy for other people to take a look and comment. As a reviewer, I'd be marking this ready for committer, but I've moved a little way from just reviewing this now, having spent two weeks hacking at it. The latest patch is attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services mv_functional-deps_2017-04-04.patch Description: Binary data
Re: [HACKERS] multivariate statistics (v25)
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI < horiguchi.kyot...@lab.ntt.co.jp> wrote: > > When adding these two parameters I had 2nd thoughts that the > "tryextstats" > > was required at all. We could just have this controlled by if the rel is > a > > base rel of kind RTE_RELATION. I ended up having to pass these parameters > > further, down to clauselist_selectivity's singleton counterpart, > > clause_selectivity(). This was due to clause_selectivity() calling > > clauselist_selectivity() for some clause types. I'm not entirely sure if > > this is actually required, but I can't see any reason for it to cause > > problems. > > I understand that the reason for tryextstats is that the two are > perfectly correlating but clause_selectivity requires the > RelOptInfo anyway. Some comment about that may be required in the > function comment. hmm, you could say one is functionally dependent on the other. I did consider removing it, but it seemed weird to pass a NULL relation when we don't want to attempt to use extended stats. > Some random comments by just looking on the patch: > > == > The name of the function "collect_ext_attnums", and > "clause_is_ext_compatible" seems odd since "ext" doesn't seem to > be a part of "extended statistics". Some other names look the > same, too. > I agree. I've made some changes to the patch to change how the functional dependency estimations are applied. I've removed most of the code from clausesel.c and put it into dependencies.c. In doing so I've removed some of the inefficiencies that were in the patch. For example clause_is_ext_compatible() was being called many times on the same clause at different times. I've now nailed that down to just once per clause. > Something like "collect_e(xt)stat_compatible_attnums" and > "clause_is_e(xt)stat_compatible" seem better to me. > > Changed to dependency_compatible_clause(), since this was searching for equality clauses in the form Var = Const, or Const = Var.
This seems specific to the functional dependencies checking. A multivariate histogram won't want the same. > == > The following comment seems something wrong. > > + * When applying functional dependencies, we start with the strongest ones > + * strongest dependencies. That is, we select the dependency that: > > == > dependency_is_fully_matched() is not found. Maybe some other > patches are assumed? > > == > + /* see if it actually has the right */ > + ok = (NumRelids((Node *) expr) == 1) && > + (is_pseudo_constant_clause(lsecond(expr->args)) || > +(varonleft = false, > + is_pseudo_constant_clause(linitial(expr->args)))); > + > + /* unsupported structure (two variables or so) */ > + if (!ok) > + return true; > > Ok is used only here. I don't think expressions with side > effects are a good idea here. > > I thought the same, but I happened to notice that Tomas must have taken it from clauselist_selectivity(). > == > + switch (get_oprrest(expr->opno)) > + { > + case F_EQSEL: > + > + /* equality conditions are compatible with > all statistics */ > + break; > + > + default: > + > + /* unknown estimator */ > + return true; > + } > > This seems somewhat stupid.. > I agree. Changed. I've attached an updated patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services mv_functional-deps_2017-04-01.patch Description: Binary data
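[The shape dependency_compatible_clause() is looking for — an equality expression with a Var on one side and a pseudo-constant on the other — can be illustrated with a toy clause representation. These are hypothetical Python classes, not the real PostgreSQL Node types; the real check also verifies the operator's restriction estimator is F_EQSEL and that only one relation is referenced.]

```python
from dataclasses import dataclass

@dataclass
class Var:            # toy stand-in for a column reference
    relid: int
    attnum: int

@dataclass
class Const:          # toy stand-in for a pseudo-constant
    value: object

@dataclass
class OpExpr:         # toy stand-in for an operator expression
    op: str
    left: object
    right: object

def dependency_compatible(clause):
    """Return the attnum if the clause looks like Var = Const (or the
    mirrored Const = Var), else None.  Toy model of the compatibility
    check discussed above."""
    if not isinstance(clause, OpExpr) or clause.op != "=":
        return None
    for var, other in ((clause.left, clause.right),
                       (clause.right, clause.left)):
        if isinstance(var, Var) and isinstance(other, Const):
            return var.attnum
    return None

print(dependency_compatible(OpExpr("=", Var(1, 2), Const(10))))  # 2
print(dependency_compatible(OpExpr("<", Var(1, 2), Const(10))))  # None
```

[Handling both orderings in one loop is the analogue of the `varonleft` juggling in the quoted C fragment, without the side effect inside the boolean expression that the review objected to.]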
Re: [HACKERS] multivariate statistics (v25)
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI < horiguchi.kyot...@lab.ntt.co.jp> wrote: > Hello, > > At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley < > david.row...@2ndquadrant.com> wrote in T5JLce5ynCi1vvezXxX=w...@mail.gmail.com> > > FWIW, I tried this. This cleanly applied on it but make ends with > the following error. > > $ make -s > Writing postgres.bki > Writing schemapg.h > Writing postgres.description > Writing postgres.shdescription > Writing fmgroids.h > Writing fmgrprotos.h > Writing fmgrtab.c > make[3]: *** No rule to make target `dependencies.o', needed by > `objfiles.txt'. Stop. > make[2]: *** [statistics-recursive] Error 2 > make[1]: *** [all-backend-recurse] Error 2 > make: *** [all-src-recurse] Error 2 Apologies. I was caught out by patching back on to master, then committing, and git diff'ing the last commit, where I'd of course forgotten to git add those files. I'm just in the middle of fixing up some other stuff. Hopefully I'll post a working patch soon. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] multivariate statistics (v25)
Hello, At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley wrote in > On 25 March 2017 at 07:35, Alvaro Herrera wrote: > > > As I said in another thread, I pushed parts 0002,0003,0004. Tomas said > > he would try to rebase patches 0001,0005,0006 on top of what was > > committed. My intention is to give that one a look as soon as it is > > available. So we will have n-distinct and functional dependencies in > > PG10. It sounds unlikely that we will get MCVs and histograms in, since > > they're each a lot of code. > > > > I've been working on the MV functional dependencies part of the patch to > polish it up a bit. Tomas has been busy with a few other duties. > > I've made some changes around how clauselist_selectivity() determines if it > should try to apply any extended stats. The solution I came up with was to > add two parameters to this function, one for the RelOptInfo in question, > and one a bool to control if we should try to apply any extended stats. > For clauselist_selectivity() usage involving join rels we just pass the rel > as NULL, that way we can skip all the extended stats stuff with very low > overhead. When we actually have a base relation to pass along we can do so, > along with a true tryextstats value to have the function attempt to use any > extended stats to assist with the selectivity estimation. > > When adding these two parameters I had 2nd thoughts that the "tryextstats" > was required at all. We could just have this controlled by if the rel is a > base rel of kind RTE_RELATION. I ended up having to pass these parameters > further, down to clauselist_selectivity's singleton counterpart, > clause_selectivity(). This was due to clause_selectivity() calling > clauselist_selectivity() for some clause types. I'm not entirely sure if > this is actually required, but I can't see any reason for it to cause > problems.
I understand that the reason for tryextstats is that the two are perfectly correlating but clause_selectivity requires the RelOptInfo anyway. Some comment about that may be required in the function comment. > I've also attempted to simplify some of the logic within > clauselist_selectivity and some other parts of clausesel.c to remove some > unneeded code and make it a bit more efficient. For example, we no longer > count the attributes in the clause list before calling a similar function > to retrieve the actual attnums. This is now done as a single step. > > I've not yet quite gotten as far as I'd like with this. I'd quite like to > see clauselist_ext_split() gone, and instead we could build up a bitmapset > of clause list indexes to ignore when applying the selectivity of clauses > that couldn't use any extended stats. I'm planning on having a bit more of > a look at this tomorrow. > > The attached patch should apply to master as > of f90d23d0c51895e0d7db7910538e85d3d38691f0. FWIW, I tried this. This cleanly applied on it but make ends with the following error. $ make -s Writing postgres.bki Writing schemapg.h Writing postgres.description Writing postgres.shdescription Writing fmgroids.h Writing fmgrprotos.h Writing fmgrtab.c make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'. Stop. make[2]: *** [statistics-recursive] Error 2 make[1]: *** [all-backend-recurse] Error 2 make: *** [all-src-recurse] Error 2 Some random comments by just looking on the patch: == The name of the function "collect_ext_attnums", and "clause_is_ext_compatible" seems odd since "ext" doesn't seem to be a part of "extended statistics". Some other names look the same, too. Something like "collect_e(xt)stat_compatible_attnums" and "clause_is_e(xt)stat_compatible" seem better to me. == The following comment seems something wrong. + * When applying functional dependencies, we start with the strongest ones + * strongest dependencies.
That is, we select the dependency that: == dependency_is_fully_matched() is not found. Maybe some other patches are assumed? == + /* see if it actually has the right */ + ok = (NumRelids((Node *) expr) == 1) && + (is_pseudo_constant_clause(lsecond(expr->args)) || +(varonleft = false, + is_pseudo_constant_clause(linitial(expr->args)))); + + /* unsupported structure (two variables or so) */ + if (!ok) + return true; Ok is used only here. I don't think expressions with side effects are a good idea here. == + switch (get_oprrest(expr->opno)) + { + case F_EQSEL: + + /* equality conditions are compatible with all statistics */ + break; + + default: + + /* unknown estimator */ + return true; + } This seems somewhat stupid.. regards, -- Kyotaro Horig
Re: [HACKERS] multivariate statistics (v25)
On 25 March 2017 at 07:35, Alvaro Herrera wrote: > As I said in another thread, I pushed parts 0002,0003,0004. Tomas said > he would try to rebase patches 0001,0005,0006 on top of what was > committed. My intention is to give that one a look as soon as it is > available. So we will have n-distinct and functional dependencies in > PG10. It sounds unlikely that we will get MCVs and histograms in, since > they're each a lot of code. > I've been working on the MV functional dependencies part of the patch to polish it up a bit. Tomas has been busy with a few other duties. I've made some changes around how clauselist_selectivity() determines if it should try to apply any extended stats. The solution I came up with was to add two parameters to this function, one for the RelOptInfo in question, and one a bool to control if we should try to apply any extended stats. For clauselist_selectivity() usage involving join rels we just pass the rel as NULL, that way we can skip all the extended stats stuff with very low overhead. When we actually have a base relation to pass along we can do so, along with a true tryextstats value to have the function attempt to use any extended stats to assist with the selectivity estimation. When adding these two parameters I had 2nd thoughts that the "tryextstats" was required at all. We could just have this controlled by if the rel is a base rel of kind RTE_RELATION. I ended up having to pass these parameters further, down to clauselist_selectivity's singleton counterpart, clause_selectivity(). This was due to clause_selectivity() calling clauselist_selectivity() for some clause types. I'm not entirely sure if this is actually required, but I can't see any reason for it to cause problems. I've also attempted to simplify some of the logic within clauselist_selectivity and some other parts of clausesel.c to remove some unneeded code and make it a bit more efficient.
For example, we no longer count the attributes in the clause list before calling a similar function to retrieve the actual attnums. This is now done as a single step. I've not yet quite gotten as far as I'd like with this. I'd quite like to see clauselist_ext_split() gone, and instead we could build up a bitmapset of clause list indexes to ignore when applying the selectivity of clauses that couldn't use any extended stats. I'm planning on having a bit more of a look at this tomorrow. The attached patch should apply to master as of f90d23d0c51895e0d7db7910538e85d3d38691f0. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services mv_functional-deps_2017-03-31.patch Description: Binary data
Re: [HACKERS] multivariate statistics (v25)
Alvaro Herrera wrote: > Here's a rebased series on top of today's a3eac988c267. I call this > v28. > > I put David's pg_dump and COMMENT patches as second in line, just after > the initial infrastructure patch. I suppose those three have to be > committed together, while the others (which add support for additional > statistic types) can rightly remain as separate commits. As I said in another thread, I pushed parts 0002,0003,0004. Tomas said he would try to rebase patches 0001,0005,0006 on top of what was committed. My intention is to give that one a look as soon as it is available. So we will have n-distinct and functional dependencies in PG10. It sounds unlikely that we will get MCVs and histograms in, since they're each a lot of code. I suppose we need 0011 too (psql tab completion), but that can wait. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
On 17 March 2017 at 11:20, Alvaro Herrera wrote: > (I think I lost some regression test files. I couldn't make up my mind > about putting each statistic type's tests in a separate file, or all > together in stats_ext.sql.) +1 for stats_ext.sql. I wanted to add some tests for pg_get_statisticsextdef(), but I didn't see a suitable location. stats_ext.sql would have been a good spot. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
On 16 March 2017 at 09:45, Alvaro Herrera wrote: > Here's another version of 0002 after cleaning up almost everything from > David's review. I also added tests for ALTER STATISTICS in > sql/alter_generic.sql which made me realize there were three crasher bugs > in here; fixed all those. It also made me realize that psql's \d was a > little bit too generous with dropped columns in a stats object. That > should all behave better now. > Thanks for fixing. You mentioned to me off-list that pg_dump support was missing, so I've gone and implemented that in the attached patch. I followed how pg_dump works for indexes, and created pg_get_statisticsextdef() in ruleutils.c. I was unsure if I should be naming this pg_get_statisticsdef() instead. I also noticed there's no COMMENT ON support either, so I added that too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services extstats_pg_dump_and_comment_support.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
David Rowley wrote: > + k = -1; > + while ((k = bms_next_member(attnums, k)) >= 0) > + { > + bool attr_found = false; > + for (i = 0; i < info->stakeys->dim1; i++) > + { > + if (info->stakeys->values[i] == k) > + { > + attr_found = true; > + break; > + } > + } > + > + /* found attribute not covered by this ndistinct stats, skip */ > + if (!attr_found) > + { > + matches = false; > + break; > + } > + } > > Would it be better just to stuff info->stakeys->values into a bitmapset and > check it's a subset of attnums? It would mean allocating memory in the loop, > so maybe you think otherwise, but in that case maybe StatisticExtInfo > should store the bitmapset? Yeah, I think StatisticExtInfo should have a bitmapset, not an int2vector. > + appendPQExpBuffer(&buf, "(dependencies)"); > > I think it's better practice to use appendPQExpBufferStr() when there's no > formatting. It'll perform marginally better, which might not be important > here, but it sets a better example for people to follow when performance is > more critical. FWIW this should have said "(ndistinct)" anyway :-) > + change the definition of a extended statistics > > "a" should be "an". Also, is "statistics" plural here? It's commonly mixed up > in the patch. I think it needs standardising. I personally think if you're > speaking of a single pg_statistic_ext row, then it should be singular. Yet, > I'm aware you're using plural for the CREATE STATISTICS command; to me that > feels a bit like: CREATE TABLES mytable (); am I thinking wrongly here > somehow? This was discussed upthread as I recall. 
This is what Merriam-Webster says on the topic: statistic 1 : a single term or datum in a collection of statistics 2 a : a quantity (as the mean of a sample) that is computed from a sample; specifically : estimate 3b; b : a random variable that takes on the possible values of a statistic statistics 1 : a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data 2 : a collection of quantitative data Now, I think there's room to say that a single object created by the new CREATE STATISTICS is really the latter, not the former. I find it very weird that a single one of these objects is named in the plural form, though, and it looks odd all over the place. I would rather use the term "statistics object", and then we can continue using the singular. > + If a schema name is given (for example, CREATE STATISTICS > + myschema.mystat ...) then the statistics is created in the specified > + schema. Otherwise it is created in the current schema. The name of > > What's created in the current schema? I thought this was just for naming? Well, "created in a schema" means that the object is named after that schema. So both are the same thing. Is this unclear in some way? -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
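The bitmapset subset test suggested at the top of this message replaces the nested loops over stakeys. In Python terms the equivalent is a simple set-containment check (PostgreSQL's bms_is_subset would play the role of issubset; the attribute numbers here are invented for illustration):

```python
stakeys = {1, 3}       # columns covered by the ndistinct statistic
attnums = {1, 2, 3}    # columns referenced by the clauses being estimated

# The statistic is applicable only if every key column appears
# among the clause attnums.
matches = stakeys.issubset(attnums)

# A statistic on a column the clauses never mention is not applicable.
other = {1, 4}
other_matches = other.issubset(attnums)
```

Storing the keys as a bitmapset in StatisticExtInfo, as suggested, means this check needs no per-call allocation.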
Re: [HACKERS] multivariate statistics (v25)
On 15 March 2017 at 12:18, David Fetter wrote: > > Is the plan to convert completely from "multivariate" to "extended?" > I ask because I found a "multivariate" in there. > I get the idea that Tomas would like to keep "multivariate" when it's actually referencing multivariate stats. The idea of the rename was to allow future expansion of the code to perhaps allow creation of stats on expressions, which is not multivariate. If you've found a "multivariate" reference in an area that should be generic to extended statistics, then that's a bug and should be fixed. I found a few of these and listed them during my review. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] multivariate statistics (v25)
On Tue, Mar 14, 2017 at 07:10:49PM -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > I tried patch 0002 today and again there are conflicts, so I rebased and > > fixed the merge problems. > > ... and attached the patch. Is the plan to convert completely from "multivariate" to "extended?" I ask because I found a "multivariate" in there. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
Alvaro Herrera wrote: > I tried patch 0002 today and again there are conflicts, so I rebased and > fixed the merge problems. ... and attached the patch. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 2c2da2a..b5c4129 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -296,6 +296,11 @@ + pg_statistic_ext + extended planner statistics + + + pg_subscription logical replication subscriptions @@ -4223,6 +4228,98 @@ + + pg_statistic_ext + + + pg_statistic_ext + + + + The catalog pg_statistic_ext + holds extended planner statistics. + + + + pg_statistic_ext Columns + + + + + Name + Type + References + Description + + + + + + + starelid + oid + pg_class.oid + The table that the described columns belong to + + + + staname + name + + Name of the statistic. + + + + stanamespace + oid + pg_namespace.oid + + The OID of the namespace that contains this statistic + + + + + staowner + oid + pg_authid.oid + Owner of the statistic + + + + staenabled + char[] + + +An array with the modes of the enabled statistic types, encoded as +d for ndistinct coefficients. + + + + + stakeys + int2vector + pg_attribute.attnum + + This is an array of values that indicate which table columns this + statistic covers. For example, a value of 1 3 would + mean that the first and the third table columns make up the statistic key. + + + + + standistinct + pg_ndistinct + + + Ndistinct coefficients, serialized as pg_ndistinct type. 
+ + + + + + + + pg_namespace diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml index b73c66b..76955e5 100644 --- a/doc/src/sgml/planstats.sgml +++ b/doc/src/sgml/planstats.sgml @@ -448,4 +448,145 @@ rows = (outer_cardinality * inner_cardinality) * selectivity + + Extended Statistics + + + extended statistics + planner + + + + The examples presented in used + statistics about individual columns to compute selectivity estimates. + When estimating conditions on multiple columns, the planner assumes + independence of the conditions and multiplies the selectivities. When the + columns are correlated, the independence assumption is violated, and the + estimates may be off by several orders of magnitude, resulting in poor + plan choices. + + + + The examples presented below demonstrate such estimation errors on simple + data sets, and also how to resolve them by creating extended statistics + using CREATE STATISTICS command. + + + + Let's start with a very simple data set - a table with two columns, + containing exactly the same values: + + +CREATE TABLE t (a INT, b INT); +INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i); +ANALYZE t; + + + As explained in , the planner can determine + cardinality of t using the number of pages and + rows is looked up in pg_class: + + +SELECT relpages, reltuples FROM pg_class WHERE relname = 't'; + + relpages | reltuples +--+--- + 45 | 10000 + + + The data distribution is very simple - there are only 100 distinct values + in each column, uniformly distributed. 
+ + + + The following example shows the result of estimating a WHERE + condition on the a column: + + +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1; + QUERY PLAN +- + Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual time=0.031..2.870 rows=100 loops=1) + Filter: (a = 1) + Rows Removed by Filter: 9900 + Planning time: 0.092 ms + Execution time: 3.103 ms +(5 rows) + + + The planner examines the condition and computes the estimate using + eqsel, the selectivity function for =, and + statistics stored in the pg_stats table. In this case + the planner estimates the condition matches 1% rows, and by comparing + the estimated and actual number of rows, we see that the estimate is + very accurate (in fact exact, as the table is very small). + + + + Adding a condition on the second column results in the following plan: + + +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1; + QUERY PLAN +--- + Seq Scan on t
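The misestimate this documentation example is building toward is easy to check numerically: with 10000 rows, 100 distinct values, and b always equal to a, the independence assumption is off by a factor of 100 (a quick sanity check mirroring the numbers above):

```python
rows = 10000
sel_a = 0.01    # a = 1 matches 1% of rows (100 distinct values, uniform)
sel_b = 0.01    # b = 1 likewise

# What the planner assumes: the conditions are independent, so the
# selectivities multiply.
independent_estimate = rows * sel_a * sel_b   # 1 row

# What actually happens: b always equals a, so the second condition
# filters out nothing further.
actual = rows * sel_a                         # 100 rows
```

An estimate of 1 row versus 100 actual rows is exactly the two-orders-of-magnitude error functional dependencies are meant to correct.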
Re: [HACKERS] multivariate statistics (v25)
I tried patch 0002 today and again there are conflicts, so I rebased and fixed the merge problems. I also changed a number of minor things, all AFAICS cosmetic in nature: * Moved src/backend/statistics/common.h to src/include/statistics/common.h, as previously commented. I also took out postgres.h and most of the includes; instead, put all these into each .c source file. That aligns with our established practice. I also removed two prototypes that should actually be in stats.h. I think statistics/common.h should be further renamed to statistics/stats_ext_internal.h, and statistics/stats.h to something different though I don't know what ATM. * Moved src/include/utils/stats.h to src/include/statistics, cleaning it up a bit. * Moved some structs from analyze.c into statistics/common.h, removing some duplication; have analyze.c include that file. * Renamed src/test/regress/sql/mv_ndistinct.sql to stats_ext.sql, to collect all extended-stats-related tests in a single file, instead of having a large number of them. I also added one test that drops a column, per David Rowley's reported failure, but I didn't actually fix the problem nor add it to the expected file. (I'll follow up with that tomorrow, if Tomas doesn't beat me to it). Also, put the test in an earlier parallel test group, 'cause I see no reason to put it last. * A bunch of stylistic changes. The added tests pass (or they passed before I added the drop column tests; not a surprise really that they pass, since I didn't touch anything functionally), but they aren't terribly exhaustive at the stage of the first patch in the series. I didn't get around to addressing all of David Rowley's input. Also I didn't try to rebase the remaining patches in the series on top of this one. 
-- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v25)
On 13 March 2017 at 23:00, David Rowley wrote: > > 0003: > > No more time today. Will try and get to those soon. > 0003: I've now read this patch. My main aim here was to learn what it does and how it works. I need to spend much longer understanding how you're calculating the functional dependencies. In the meantime I've pasted the notes I took while reading over the patch. + default: + elog(ERROR, "unexcpected statistics type requested: %d", type); "unexpected", but we generally use "unknown". @@ -1293,7 +1294,8 @@ get_relation_statistics(RelOptInfo *rel, Relation relation) info->rel = rel; /* built/available statistics */ - info->ndist_built = true; + info->ndist_built = stats_are_built(htup, STATS_EXT_NDISTINCT); + info->deps_built = stats_are_built(htup, STATS_EXT_DEPENDENCIES); I don't really like how this function is shaping up. You're calling stats_are_built() potentially twice for each stats type. There must be a nicer way to do this. Are non-built stats common enough to optimize building a StatisticExtInfo regardless and throwing it away if it happens to be useless? Can you also rename mvoid to become something more esoid or similar. I seem to always read it as m-void instead of mv-oid and naturally I expect a void pointer rather than an Oid. +dependencies, and for each one count the number of rows rows consistent it. duplicate word "rows" +Apllying the functional dependencies is fairly simple - given a list of Applying +In this case the default estimation based on AVIA principle happens to work hmm, maybe I should know what AVIA principles are, but I don't. Is there something I should be reading? I searched around the internet for a few minutes, but didn't come up with a great answer either. + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group 2017 + Assert(tmp <= ((char *) output + len)); Shouldn't you just Assert(tmp == ((char *) output + len)); at the end of the loop? 
+ if (dependencies->magic != STATS_DEPS_MAGIC) + elog(ERROR, "invalid dependency magic %d (expected %dd)", + dependencies->magic, STATS_DEPS_MAGIC); + + if (dependencies->type != STATS_DEPS_TYPE_BASIC) + elog(ERROR, "invalid dependency type %d (expected %dd)", + dependencies->type, STATS_DEPS_TYPE_BASIC); %dd ? + Assert(dependencies->ndeps > 0); Why Assert() and not elog()? Wouldn't this mean that a corrupt dependency could fail an Assert? + dependencies = (MVDependencies) palloc0(sizeof(MVDependenciesData)); Why palloc0() and not palloc()? Can you not just read it into a variable on the stack, then check the exact size using tempdeps.ndeps * sizeof(MVDependency), then memcpy() it over? That'll save you the realloc() + /* what minimum bytea size do we expect for those parameters */ + expected_size = offsetof(MVDependenciesData, deps) + + dependencies->ndeps * (offsetof(MVDependencyData, attributes) + + sizeof(AttrNumber) * 2); Can't quite make sense of this yet. Why * 2? + /* is the number of attributes valid? */ + Assert((k >= 2) && (k <= STATS_MAX_DIMENSIONS)); Seems like a bad idea to Assert() this. Wouldn't some bad data being deserialized cause an Assert failure? + d = (MVDependency) palloc0(offsetof(MVDependencyData, attributes) + + (k * sizeof(AttrNumber))); Why palloc0()? You seem to write out all the fields right away. Seems like a waste to zero the memory. + /* still within the bytea */ + Assert(tmp <= ((char *) data + VARSIZE_ANY(data))); Any point? You're already Asserting that you've consumed the entire array at the end anyway. + appendStringInfoString(&str, "["); appendStringInfoChar(&str, '['); would be better. + ret = pstrdup(str.data); ret = pnstrdup(str.data, str.len); +CREATE STATISTICS s1 WITH (dependencies) ON (a,a) FROM functional_dependencies; +ERROR: duplicate column name in statistics definition Is it worth mentioning which column here? I'll try to spend more time understanding 0003 soon. 
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
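The expected_size computation questioned in the review above is a minimum-size guard for deserialization. A toy Python rendering shows where the "* 2" would come from (the byte offsets below are invented stand-ins, not the real C struct layout; only the shape of the formula matters):

```python
SIZEOF_ATTRNUMBER = 2   # AttrNumber is an int16 in PostgreSQL
DEPS_HEADER = 12        # stand-in for offsetof(MVDependenciesData, deps)
DEP_HEADER = 8          # stand-in for offsetof(MVDependencyData, attributes)

def min_expected_size(ndeps):
    # A functional dependency relates at least two attributes, so the
    # smallest possible payload per dependency is its header plus two
    # AttrNumbers; hence the "sizeof(AttrNumber) * 2" lower bound.
    return DEPS_HEADER + ndeps * (DEP_HEADER + SIZEOF_ATTRNUMBER * 2)

size = min_expected_size(3)
```

Checking the actual bytea length against this minimum before walking the dependencies rejects obviously truncated input early.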
Re: [HACKERS] multivariate statistics (v25)
On 3 March 2017 at 03:53, Tomas Vondra wrote: > This time with the attachments It's been a long while since I looked at this patch, but I'm now taking another look. I've made a list of stuff I've found from making my first pass on 0001 and 0002. Some of the stuff may seem a little pedantic, so apologies about those ones. I merely SET nit_picking_threshold TO 0; and reviewed. Here goes: 0001: + RestrictInfo *rinfo = (RestrictInfo*)node; and + RestrictInfo *rinfo = (RestrictInfo *)node; + + return expression_tree_walker((Node*)rinfo->clause, + pull_varattnos_walker, + (void*) context); spacing incorrect. Please space after type name in casts and after the closing parenthesis. 0002: + dropped as well. Multivariate statistics referencing the column will + be dropped only if there would remain a single non-dropped column. I was initially confused by this. I think it should worded as: "Multivariate statistics referencing the dropped column will also be removed if the removal of the column would cause the statistics to contain data for only a single column" I had been confused as I'd been thinking of dropping multiple columns at once with the same command, and only 1 column remained in the table. So I think it's best to clarify you mean the statistic here. + OCLASS_STATISTICS /* pg_statistics_ext */ I wonder if this should be named: OCLASS_STATISTICEXT. The comment is also incorrect and should read "pg_statistic_ext" (without 's') I tried to perform a test in this area and received an error: postgres=# create table ab1 (a int, b int); CREATE TABLE postgres=# create statistics ab1_a_b_stats on (a,b) from ab1; CREATE STATISTICS postgres=# alter table ab1 drop column a; ALTER TABLE postgres=# drop table ab1; ERROR: cache lookup failed for statistics 16399 + When estimating conditions on multiple columns, the planner assumes + independence of the conditions and multiplies the selectivities. 
When the + columns are correlated, the independence assumption is violated, and the + estimates may be off by several orders of magnitude, resulting in poor + plan choices. I don't think the assumption is violated. We still assume that they're independent, which is incorrect. Nothing gets violated. Perhaps it would be more accurate to write: "When estimating the selectivity of conditions over multiple columns, the planner normally assumes each condition is independent of other conditions, and simply multiplies the selectivity estimates of each condition together to produce a final selectivity estimation for all conditions. This method can often lead to inaccurate row estimations when the conditions have dependencies on one another. Such misestimations can result in poor plan choices being made." + using CREATE STATISTICS command. using the ... + As explained in , the planner can determine + cardinality of t using the number of pages and + rows is looked up in pg_class: perhaps "rows is" should become "rows as" or "rows which are". + * delete multi-variate statistics + */ + RemoveStatisticsExt(relid, 0); I think it should be "delete extended statistics". Should this not be rejected? postgres=# create view v1 as select 1 a, 2 b; CREATE VIEW postgres=# create statistics v1_a_stats on (a,b) from v1; CREATE STATISTICS and this? postgres=# create sequence test_seq; CREATE SEQUENCE postgres=# select * from test_seq; last_value | log_cnt | is_called +-+--- 1 | 0 | f (1 row) postgres=# create statistics test_seq_stats on (last_value,log_cnt) from test_seq; CREATE STATISTICS The patch does claim: + /* extended stats are supported on tables and matviews */ So I guess it should be disallowed. + /* OBJECT_STATISTICS */ + { + "statistics", OBJECT_STATISTICS Maybe this should be changed to be OBJECT_STATISTICEXT. Doing it this way would close the door a bit on pg_depend records existing for pg_statistic. 
A quick test shows a problem here: postgres=# create table ab (a int, b int); CREATE TABLE postgres=# create statistics ab_a_b_stats on (a,b) from ab; CREATE STATISTICS postgres=# create statistics ab_a_b_stats1 on (a,b) from ab; CREATE STATISTICS postgres=# alter statistics ab_a_b_stats1 rename to ab_a_b_stats; ERROR: unsupported object class 3381 +/* + * + * QUERY : + * CREATE STATISTICS stats_name ON relname (columns) WITH (options) + * + */ Old syntax? + $$ = (Node *)n; Incorrect spacing. + * The returned list is guaranteed to be sorted in order by OID, although + * this is not currently needed. hmm, what's the tie-breaker going to be for: CREATE TABLE abc (a int, b int, c int); create statistics abc_ab_stats on (a,b) from abc; create statistics abc_bc_stats on (b,c) from abc; select * from abc where a=1 and b=1 and c=1; I've not gotten to that part of the code
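On the tie-breaker question at the end of the message above: one plausible rule (purely an illustration, not what the patch implements) is to prefer the statistics object covering the most attributes referenced by the clauses, using the guaranteed OID ordering only to make the choice deterministic:

```python
clause_attnums = {1, 2, 3}      # a = 1 AND b = 1 AND c = 1

# Candidate statistics objects as (oid, covered attnums), already
# sorted by OID as the lookup function guarantees.
stats = [
    (101, {1, 2}),              # abc_ab_stats (hypothetical OID)
    (102, {2, 3}),              # abc_bc_stats (hypothetical OID)
]

# Most overlap with the clause attnums wins; on equal overlap the
# lower OID (the earlier entry) breaks the tie.
best_oid, _keys = max(stats, key=lambda s: (len(s[1] & clause_attnums), -s[0]))
```

Here both candidates cover two of the three clause columns, so the OID tie-break picks abc_ab_stats; any deterministic rule would do, but without one the chosen plan could vary between runs.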
Re: [HACKERS] multivariate statistics (v25)
On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote: Hello, At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra wrote in OK, attached is v24 of the patch series, addressing most of the reported issues and comments (at least I believe so). The main changes are: Unfortunately, 0002 conflicts with the current master (4461a9b). Could you rebase them, or tell us which commit these patches apply on? Attached is a rebased patch series, otherwise it's the same as v24. FWIW it was based on 016c990834 from Feb 28, but apparently some recent patch caused a minor conflict. I only saw the patch files but have some comments. 1) I've mostly abandoned the "multivariate" name in favor of "extended", particularly in places referring to stats stored in the pg_statistic_ext in general. "Multivariate" is now used only in places talking about particular types (e.g. multivariate histograms). The "extended" name is more widely used for this type of statistics, and the assumption is that we'll also add other (non-multivariate) types of statistics - e.g. statistics on custom expressions, or some form of join statistics. In 0005, and @@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root, * If there are no such stats or not enough attributes, don't waste time * simply skip to estimation using the plain per-column stats. */ + if (has_stats(stats, STATS_TYPE_MCV) && ... + /* compute the multivariate stats */ + s1 *= clauselist_ext_selectivity(root, mvclauses, stat); @@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, Index relid, } /* + * estimate selectivity of clauses using multivariate statistic These comments are left unchanged; is that on purpose? 0007 adds very similar texts. Hmm, those comments should probably be changed to "extended". 2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and pg_mv_stats view renamed to pg_stats_ext). FWIW, "extended statistic" would be abbreviated as "ext_statistic" or "extended_stats". Why have you exchanged the words? 
Because this way it's clear it's a version of pg_statistic, and it will be sorted right next to it. 3) The structure of pg_statistic_ext was changed as proposed by Alvaro, i.e. the boolean flags were removed and instead we have just a single "char[]" column with the list of enabled statistics. 4) I also got rid of the "mv" part in most variable/function/constant names, replacing it by "ext" or something similar. Also mvstats.h got renamed to stats.h. 5) Moved the files from src/backend/utils/mvstats to backend/statistics. 6) Fixed the n_choose_k() overflow issues by using the algorithm proposed by Dean. Also, use the simple formula for num_combinations(). 7) I've tweaked data types for a few struct members (in stats.h). I've kept most of the uint32 fields at the top level though, because int16 might not be large enough for large statistics and the overhead is minimal (compared to the space needed e.g. for histogram buckets). Some formal proof or boundary-value test cases might be needed (to prevent future trouble), or some defined behavior on overflow might be enough. I believe all (or most) of the overflow-able data has such behavior. That is probably a good idea and I plan to do that. The renames/changes were quite widespread, but I've done my best to fix all the comments and various other places. regards, -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
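An overflow-resistant binomial coefficient in the spirit of item 6 above (a sketch only; the actual C fix proposed by Dean may differ in detail) multiplies and divides incrementally, so every intermediate value is itself a smaller binomial coefficient and never exceeds the final result:

```python
def n_choose_k(n, k):
    """Compute C(n, k) without large intermediates: after step i the
    running value equals C(n - k + i, i), which is at most C(n, k)."""
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)             # C(n, k) == C(n, n - k); shorter loop
    result = 1
    for i in range(1, k + 1):
        # result * (n - k + i) is exactly divisible by i at this point,
        # so integer division is lossless
        result = result * (n - k + i) // i
    return result
```

In C this bounds the intermediates to the magnitude of the answer, so the computation overflows only when the result itself does; a naive n!/(k!(n-k)!) overflows far earlier.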