Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread David Rowley
On 6 April 2017 at 10:17, Simon Riggs  wrote:
> On 5 April 2017 at 10:47, David Rowley  wrote:
>
>> I've attached an updated patch to address Tomas' concerns and yours too.
>
> Commited, with some doc changes and additions based upon my explorations.

Great. Thanks for committing!


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread Simon Riggs
On 5 April 2017 at 10:47, David Rowley  wrote:

> I've attached an updated patch to address Tomas' concerns and yours too.

Commited, with some doc changes and additions based upon my explorations.

For the record, I measured the time to calc extended statistics as
+800ms on 2 million row sample.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread David Rowley
On 6 April 2017 at 07:19, Tels  wrote:
> I know I'm a bit late, but isn't the syntax backwards?
>
> "CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"
>
> These do it the other way round:
>
> CREATE INDEX idx ON table (col_a);
>
> AND:
>
>CREATE TABLE t (
>  id INT  REFERENCES table_2 (col_b);
>);
>
> Won't this be confusing and make things hard to remember?
>
> Sorry for not asking earlier, I somehow missed this.

The reasoning is in [1]

[1] 
https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=vbj78v...@mail.gmail.com


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread Tels
Moin,

On Wed, April 5, 2017 2:52 pm, Simon Riggs wrote:
> On 5 April 2017 at 10:47, David Rowley 
> wrote:
>
>>> I have some other comments.
>
> Me too.
>
>
> CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.
>
> This change is in line with other changes in this and earlier
> releases. Comments and docs included.
>
> Patch ready to be applied directly barring objections.

I know I'm a bit late, but isn't the syntax backwards?

"CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"

These do it the other way round:

CREATE INDEX idx ON table (col_a);

AND:

   CREATE TABLE t (
 id INT  REFERENCES table_2 (col_b);
   );

Won't this be confusing and make things hard to remember?

Sorry for not asking earlier, I somehow missed this.

Regard,

Tels


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread Simon Riggs
On 5 April 2017 at 10:47, David Rowley  wrote:

>> I have some other comments.

Me too.


CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.

This change is in line with other changes in this and earlier
releases. Comments and docs included.

Patch ready to be applied directly barring objections.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


create_statistics_lock_reduction.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread David Rowley
On 5 April 2017 at 14:53, Kyotaro HORIGUCHI
 wrote:
> At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra 
>  wrote in 
> <56f40b20-c464-fad2-ff39-06b668fac...@2ndquadrant.com>
>> Two minor comments:
>>
>> 1) DEPENDENCY_MIN_GROUP_SIZE
>>
>> I'm not sure we still need the min_group_size, when evaluating
>> dependencies. It was meant to deal with 'noisy' data, but I think it
>> after switching to the 'degree' it might actually be a bad idea.

Yeah, I'd wondered about this when I first started testing the patch.
I failed to get any functional dependencies because my values were too
unique. Seems I'd gotten a bit used to it, and in the end thought that
if the values are unique enough then they won't suffer as much from
the underestimation problem you're trying to solve here.

I've removed that part of the code now.

> I think the same. Quite large part of functional dependency in
> reality is in this kind.
>
>> 2) A minor detail is that instead of this
>>
>> if (estimatedclauses != NULL &&
>> bms_is_member(listidx, estimatedclauses))
>> continue;
>>
>> perhaps we should do just this:
>>
>> if (bms_is_member(listidx, estimatedclauses))
>> continue;
>>
>> bms_is_member does the same NULL check right at the beginning, so I
>> don't think this might make a measurable difference.

hmm yeah, I'd added that because I thought the estimatedclauses would
be NULL in 99.9% of cases and thought that I might be able to shave a
few cycles off. I see that there's an x < 0 test before the NULL test
in the function. Anyway, I'm not going to put up a fight here, so I've
removed it. I didn't ever benchmark anything to see if the extra test
actually helped anyway...

> I have some other comments.
>
> ==
> - The comment for clauselist_selectivity,
> | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
> | + * selectivity estimates using any extended statistcs on 'rel'.
>
> The 'rel' is actually a parameter but rtekind means rel->rtekind
> so this might be better be such like the following.
>
> | When a relation of RTE_RELATION is given as 'rel', we try
> | extended statistcs on the relation.
>
> Then the following line doesn't seem to be required.
>
> | + * If we identify such extended statistics exist, we try to apply them.

Yes, good point. I've revised this comment a bit now.

>
> =
> The following comment in the same function,
>
> | +if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
> | +{
> | +/*
> | + * Try to estimate with multivariate functional dependency 
> statistics.
> | + *
> | + * The function will supply an estimate for the clauses which it
> | + * estimated for. Any clauses which were unsuitible were ignored.
> | + * Clauses which were estimated will have their 0-based list index 
> set
> | + * in estimatedclauses.  We must ignore these clauses when 
> processing
> | + * the remaining clauses later.
> | + */
>
> (Notice that I'm not a good writer) This might better be the
> following.
>
> |  dependencies_clauselist_selectivity gives selectivity over
> |  caluses that functional dependencies on the given relation is
> |  applicable. 0-based index numbers of consumed clauses are
> |  returned in the bitmap set estimatedclauses so that the
> |  estimation here after can ignore them.

I've changed this one too now.

> =
> | +s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
> | +   jointype, sjinfo, rel, 
> );
>
> The name prefix "dependency_" means "functional_dependency" here
> and omitting "functional" is confusing to me. On the other hand
> "functional_dependency" is quite long as prefix. Could we use
> "func_dependency" or something that is shorter but meaningful?
> (But this change causes renaming of many other sutff..)

oh no! Many functions in dependencies.c start with dependencies_. To
me, it's a bit of an OOP thing, which if we'd been using some other
language would have been dependencies->clauselist_selectivity(). Of
course, not all functions in that file follow that rule, but I don't
feel a pressing need to go make that any worse.  Perhaps the prefix
could be func_dependency, but I really don't feel very excited about
having it that way, and even less so about making the change.

> =
> The name "dependency_compatible_clause" might be meaningful if it
> were "clause_is_compatible_with_(functional_)dependency" or such.

I could maybe squeeze the word "is" in there.  ... OK done.

> =
> dependency_compatible_walker() returns true if given node is
> *not* compatible. Isn't it confusing?

Yeah.

>
> =
> dependency_compatible_walker() seems implicitly expecting that
> RestrictInfo will be given at the first. RestrictInfo might(
> should be processed outside this function in _compatible_clause().

Actually, I don't really see a 

Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread Tomas Vondra



On 04/05/2017 08:41 AM, Sven R. Kunze wrote:

Thanks Tomas and David for hacking on this patch.

On 04.04.2017 20:19, Tomas Vondra wrote:
I'm not sure we still need the min_group_size, when evaluating 
dependencies. It was meant to deal with 'noisy' data, but I think it 
after switching to the 'degree' it might actually be a bad idea.


Consider this:

create table t (a int, b int);
insert into t select 1, 1 from generate_series(1, 1) s(i);
insert into t select i, i from generate_series(2, 2) s(i);
create statistics s with (dependencies) on (a,b) from t;
analyze t;

select stadependencies from pg_statistic_ext ;
  stadependencies

 [{1 => 2 : 0.44}, {2 => 1 : 0.44}]
(1 row)

So the degree of the dependency is just ~0.333 although it's obviously 
a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
reason is that we discard 2/3 of rows, because those groups are only a 
single row each, except for the one large group (1/3 of rows).


Just for me to follow the comments better. Is "dependency" roughly the 
same as when statisticians speak about " conditional probability"?




No, it's more 'functional dependency' from relational normal forms.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-05 Thread Sven R. Kunze

Thanks Tomas and David for hacking on this patch.

On 04.04.2017 20:19, Tomas Vondra wrote:
I'm not sure we still need the min_group_size, when evaluating 
dependencies. It was meant to deal with 'noisy' data, but I think it 
after switching to the 'degree' it might actually be a bad idea.


Consider this:

create table t (a int, b int);
insert into t select 1, 1 from generate_series(1, 1) s(i);
insert into t select i, i from generate_series(2, 2) s(i);
create statistics s with (dependencies) on (a,b) from t;
analyze t;

select stadependencies from pg_statistic_ext ;
  stadependencies

 [{1 => 2 : 0.44}, {2 => 1 : 0.44}]
(1 row)

So the degree of the dependency is just ~0.333 although it's obviously 
a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
reason is that we discard 2/3 of rows, because those groups are only a 
single row each, except for the one large group (1/3 of rows).


Just for me to follow the comments better. Is "dependency" roughly the 
same as when statisticians speak about " conditional probability"?


Sven


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-04 Thread Kyotaro HORIGUCHI
At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra  
wrote in <56f40b20-c464-fad2-ff39-06b668fac...@2ndquadrant.com>
> On 04/04/2017 09:55 AM, David Rowley wrote:
> > On 1 April 2017 at 04:25, David Rowley 
> > wrote:
> >> I've attached an updated patch.
> >
> > I've made another pass at this and ended up removing the tryextstats
> > variable. We now only try to use extended statistics when
> > clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> > RTE_RELATION, and of course, it must also have some extended stats
> > defined too.
> >
> > I've also cleaned up a few more comments, many of which I managed to
> > omit updating when I refactored how the selectivity estimates ties
> > into clauselist_selectivity()
> >
> > I'm quite happy with all of this now, and would also be happy for
> > other people to take a look and comment.
> >
> > As a reviewer, I'd be marking this ready for committer, but I've moved
> > a little way from just reviewing this now, having spent two weeks
> > hacking at it.
> >
> > The latest patch is attached.
> >
> 
> Thanks David, I agree the reworked patch is much cleaner that the last
> version I posted. Thanks for spending your time on it.
> 
> Two minor comments:
> 
> 1) DEPENDENCY_MIN_GROUP_SIZE
> 
> I'm not sure we still need the min_group_size, when evaluating
> dependencies. It was meant to deal with 'noisy' data, but I think it
> after switching to the 'degree' it might actually be a bad idea.
> 
> Consider this:
> 
> create table t (a int, b int);
> insert into t select 1, 1 from generate_series(1, 1) s(i);
> insert into t select i, i from generate_series(2, 2) s(i);
> create statistics s with (dependencies) on (a,b) from t;
> analyze t;
> 
> select stadependencies from pg_statistic_ext ;
>   stadependencies
> 
>  [{1 => 2 : 0.44}, {2 => 1 : 0.44}]
> (1 row)
> 
> So the degree of the dependency is just ~0.333 although it's obviously
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The
> reason is that we discard 2/3 of rows, because those groups are only a
> single row each, except for the one large group (1/3 of rows).
> 
> Without the mininum group size limitation, the dependencies are:
> 
> test=# select stadependencies from pg_statistic_ext ;
>   stadependencies
> 
>  [{1 => 2 : 1.00}, {2 => 1 : 1.00}]
> (1 row)
> 
> which seems way more reasonable, I think.

I think the same. Quite large part of functional dependency in
reality is in this kind.

> 2) A minor detail is that instead of this
> 
> if (estimatedclauses != NULL &&
> bms_is_member(listidx, estimatedclauses))
> continue;
> 
> perhaps we should do just this:
> 
> if (bms_is_member(listidx, estimatedclauses))
> continue;
> 
> bms_is_member does the same NULL check right at the beginning, so I
> don't think this might make a measurable difference.


I have some other comments.

==
- The comment for clauselist_selectivity,
| + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
| + * selectivity estimates using any extended statistcs on 'rel'.

The 'rel' is actually a parameter but rtekind means rel->rtekind
so this might be better be such like the following.

| When a relation of RTE_RELATION is given as 'rel', we try
| extended statistcs on the relation.

Then the following line doesn't seem to be required.

| + * If we identify such extended statistics exist, we try to apply them.


=
The following comment in the same function,

| +if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
| +{
| +/*
| + * Try to estimate with multivariate functional dependency 
statistics.
| + *
| + * The function will supply an estimate for the clauses which it
| + * estimated for. Any clauses which were unsuitible were ignored.
| + * Clauses which were estimated will have their 0-based list index 
set
| + * in estimatedclauses.  We must ignore these clauses when processing
| + * the remaining clauses later.
| + */

(Notice that I'm not a good writer) This might better be the
following.

|  dependencies_clauselist_selectivity gives selectivity over
|  caluses that functional dependencies on the given relation is
|  applicable. 0-based index numbers of consumed clauses are
|  returned in the bitmap set estimatedclauses so that the
|  estimation here after can ignore them.

=
| +s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
| +   jointype, sjinfo, rel, );

The name prefix "dependency_" means "functional_dependency" here
and omitting "functional" is confusing to me. On the other hand
"functional_dependency" is quite long as prefix. Could we 

Re: [HACKERS] multivariate statistics (v25)

2017-04-04 Thread Tomas Vondra

On 04/04/2017 09:55 AM, David Rowley wrote:

On 1 April 2017 at 04:25, David Rowley  wrote:

I've attached an updated patch.


I've made another pass at this and ended up removing the tryextstats
variable. We now only try to use extended statistics when
clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
RTE_RELATION, and of course, it must also have some extended stats
defined too.

I've also cleaned up a few more comments, many of which I managed to
omit updating when I refactored how the selectivity estimates ties
into clauselist_selectivity()

I'm quite happy with all of this now, and would also be happy for
other people to take a look and comment.

As a reviewer, I'd be marking this ready for committer, but I've moved
a little way from just reviewing this now, having spent two weeks
hacking at it.

The latest patch is attached.



Thanks David, I agree the reworked patch is much cleaner that the last 
version I posted. Thanks for spending your time on it.


Two minor comments:

1) DEPENDENCY_MIN_GROUP_SIZE

I'm not sure we still need the min_group_size, when evaluating 
dependencies. It was meant to deal with 'noisy' data, but I think it 
after switching to the 'degree' it might actually be a bad idea.


Consider this:

create table t (a int, b int);
insert into t select 1, 1 from generate_series(1, 1) s(i);
insert into t select i, i from generate_series(2, 2) s(i);
create statistics s with (dependencies) on (a,b) from t;
analyze t;

select stadependencies from pg_statistic_ext ;
  stadependencies

 [{1 => 2 : 0.44}, {2 => 1 : 0.44}]
(1 row)

So the degree of the dependency is just ~0.333 although it's obviously a 
perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason 
is that we discard 2/3 of rows, because those groups are only a single 
row each, except for the one large group (1/3 of rows).


Without the mininum group size limitation, the dependencies are:

test=# select stadependencies from pg_statistic_ext ;
  stadependencies

 [{1 => 2 : 1.00}, {2 => 1 : 1.00}]
(1 row)

which seems way more reasonable, I think.


2) A minor detail is that instead of this

if (estimatedclauses != NULL &&
bms_is_member(listidx, estimatedclauses))
continue;

perhaps we should do just this:

if (bms_is_member(listidx, estimatedclauses))
continue;

bms_is_member does the same NULL check right at the beginning, so I 
don't think this might make a measurable difference.



kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-04-04 Thread David Rowley
On 1 April 2017 at 04:25, David Rowley  wrote:
> I've attached an updated patch.

I've made another pass at this and ended up removing the tryextstats
variable. We now only try to use extended statistics when
clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
RTE_RELATION, and of course, it must also have some extended stats
defined too.

I've also cleaned up a few more comments, many of which I managed to
omit updating when I refactored how the selectivity estimates ties
into clauselist_selectivity()

I'm quite happy with all of this now, and would also be happy for
other people to take a look and comment.

As a reviewer, I'd be marking this ready for committer, but I've moved
a little way from just reviewing this now, having spent two weeks
hacking at it.

The latest patch is attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mv_functional-deps_2017-04-04.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-31 Thread David Rowley
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> > When adding these two parameters I had 2nd thoughts that the
> "tryextstats"
> > was required at all. We could just have this controlled by if the rel is
> a
> > base rel of kind RTE_RELATION. I ended up having to pass these parameters
> > further, down to clauselist_selectivity's singleton couterpart,
> > clause_selectivity(). This was due to clause_selectivity() calling
> > clauselist_selectivity() for some clause types. I'm not entirely sure if
> > this is actually required, but I can't see any reason for it to cause
> > problems.
>
> I understand that the reason for tryextstats is that the two are
> perfectly correlating but caluse_selectivity requires the
> RelOptInfo anyway. Some comment about that may be reuiqred in the
> function comment.


hmm, you could say one is functionally dependant on the other. I did
consider removing it, but it seemed weird to pass a NULL relation when we
dont want to attempt to use extended stats.


> Some random comments by just looking on the patch:
>
> ==
> The name of the function "collect_ext_attnums", and
> "clause_is_ext_compatible" seems odd since "ext" doesn't seem to
> be a part of "extended statistics". Some other names looks the
> same, too.
>

I agree. I've made some changes to the patch to change how the functional
dependency estimations are applied. I've removed most of the code from
clausesel.c and put it into dependencies.c. In doing so I've removed some
of the inefficiencies that were in the patch.  For example
clause_is_ext_compatible() was being called many times on the same clause
at different times. I've now nailed that down to just once per clause.


> Something like "collect_e(xt)stat_compatible_attnums" and
> "clause_is_e(xt)stat_compatible" seem better to me.
>
>
Changed to dependency_compatible_clause(), since this was searching for
equality clauses in the form Var = Const, or Const = Var. This seems
specific to the functional depdencies checking. A multivariate histogram
won't want the same.


> ==
> The following comment seems something wrong.
>
> + * When applying functional dependencies, we start with the strongest ones
> + * strongest dependencies. That is, we select the dependency that:
>
> ==
> dependency_is_fully_matched() is not found. Maybe some other
> patches are assumed?
>
> ==
> +   /* see if it actually has the right */
> +   ok = (NumRelids((Node *) expr) == 1) &&
> +   (is_pseudo_constant_clause(lsecond(expr->args)) ||
> +(varonleft = false,
> + is_pseudo_constant_clause(
> linitial(expr->args;
> +
> +   /* unsupported structure (two variables or so) */
> +   if (!ok)
> +   return true;
>
> Ok is used only here. I don't think seeming-expressions with side
> effect is not good idea here.
>
>
I thought the same, but I happened to notice that Tomas must have taken it
from clauselist_selectivity().


> ==
> +   switch (get_oprrest(expr->opno))
> +   {
> +   case F_EQSEL:
> +
> +   /* equality conditions are compatible with
> all statistics */
> +   break;
> +
> +   default:
> +
> +   /* unknown estimator */
> +   return true;
> +   }
>
> This seems somewhat stupid..
>

I agree. Changed.

I've attached an updated patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mv_functional-deps_2017-04-01.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-31 Thread David Rowley
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello,
>
> At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley <
> david.row...@2ndquadrant.com> wrote in  T5JLce5ynCi1vvezXxX=w...@mail.gmail.com>
>
> FWIW, I tries this. This cleanly applied on it but make ends with
> the following error.
>
> $ make -s
> Writing postgres.bki
> Writing schemapg.h
> Writing postgres.description
> Writing postgres.shdescription
> Writing fmgroids.h
> Writing fmgrprotos.h
> Writing fmgrtab.c
> make[3]: *** No rule to make target `dependencies.o', needed by
> `objfiles.txt'.  Stop.
> make[2]: *** [statistics-recursive] Error 2
> make[1]: *** [all-backend-recurse] Error 2
> make: *** [all-src-recurse] Error 2


Apologies. I was caught out by patching back on to master, then committing,
and git diff'ing the last commit, where i'd of course forgotten to get add
those files.

I'm just in the middle of fixing up some other stuff. Hopefully I'll post a
working patch soon.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] multivariate statistics (v25)

2017-03-31 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley  
wrote in 
> On 25 March 2017 at 07:35, Alvaro Herrera  wrote:
> 
> > As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
> > he would try to rebase patches 0001,0005,0006 on top of what was
> > committed.  My intention is to give that one a look as soon as it is
> > available.  So we will have n-distinct and functional dependencies in
> > PG10.  It sounds unlikely that we will get MCVs and histograms in, since
> > they're each a lot of code.
> >
> 
> I've been working on the MV functional dependencies part of the patch to
> polish it up a bit. Tomas has been busy with a few other duties.
> 
> I've made some changes around how clauselist_selectivity() determines if it
> should try to apply any extended stats. The solution I came up with was to
> add two parameters to this function, one for the RelOptInfo in question,
> and one a bool to control if we should try to apply any extended stats.
> For clauselist_selectivity() usage involving join rels we just pass the rel
> as NULL, that way we can skip all the extended stats stuff with very low
> overhead. When we actually have a base relation to pass along we can do so,
> along with a true tryextstats value to have the function attempt to use any
> extended stats to assist with the selectivity estimation.
> 
> When adding these two parameters I had 2nd thoughts that the "tryextstats"
> was required at all. We could just have this controlled by if the rel is a
> base rel of kind RTE_RELATION. I ended up having to pass these parameters
> further, down to clauselist_selectivity's singleton couterpart,
> clause_selectivity(). This was due to clause_selectivity() calling
> clauselist_selectivity() for some clause types. I'm not entirely sure if
> this is actually required, but I can't see any reason for it to cause
> problems.

I understand that the reason for tryextstats is that the two are
perfectly correlating but caluse_selectivity requires the
RelOptInfo anyway. Some comment about that may be reuiqred in the
function comment.

> I've also attempted to simplify some of the logic within
> clauselist_selectivity and some other parts of clausesel.c to remove some
> unneeded code and make it a bit more efficient. For example, we no longer
> count the attributes in the clause list before calling a similar function
> to retrieve the actual attnums. This is now done as a single step.
> 
> I've not yet quite gotten as far as I'd like with this. I'd quite like to
> see clauselist_ext_split() gone, and instead we could build up a bitmapset
> of clause list indexes to ignore when applying the selectivity of clauses
> that couldn't use any extended stats. I'm planning on having a bit more of
> a look at this tomorrow.
> 
> The attached patch should apply to master as
> of f90d23d0c51895e0d7db7910538e85d3d38691f0.

FWIW, I tries this. This cleanly applied on it but make ends with
the following error.

$ make -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrprotos.h
Writing fmgrtab.c
make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'. 
 Stop.
make[2]: *** [statistics-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2


Some random comments by just looking on the patch:

==
The name of the function "collect_ext_attnums", and
"clause_is_ext_compatible" seems odd since "ext" doesn't seem to
be a part of "extended statistics". Some other names looks the
same, too.

Something like "collect_e(xt)stat_compatible_attnums" and
"clause_is_e(xt)stat_compatible" seem better to me.

==
The following comment seems something wrong.

+ * When applying functional dependencies, we start with the strongest ones
+ * strongest dependencies. That is, we select the dependency that:

==
dependency_is_fully_matched() is not found. Maybe some other
patches are assumed?

==
+   /* see if it actually has the right */
+   ok = (NumRelids((Node *) expr) == 1) &&
+   (is_pseudo_constant_clause(lsecond(expr->args)) ||
+(varonleft = false,
+ is_pseudo_constant_clause(linitial(expr->args;
+
+   /* unsupported structure (two variables or so) */
+   if (!ok)
+   return true;

Ok is used only here. I don't think seeming-expressions with side
effect is not good idea here.

==
+   switch (get_oprrest(expr->opno))
+   {
+   case F_EQSEL:
+
+   /* equality conditions are compatible with all 
statistics */
+   break;
+
+   default:
+
+   /* unknown estimator 

Re: [HACKERS] multivariate statistics (v25)

2017-03-30 Thread David Rowley
On 25 March 2017 at 07:35, Alvaro Herrera  wrote:

> As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
> he would try to rebase patches 0001,0005,0006 on top of what was
> committed.  My intention is to give that one a look as soon as it is
> available.  So we will have n-distinct and functional dependencies in
> PG10.  It sounds unlikely that we will get MCVs and histograms in, since
> they're each a lot of code.
>

I've been working on the MV functional dependencies part of the patch to
polish it up a bit. Tomas has been busy with a few other duties.

I've made some changes around how clauselist_selectivity() determines if it
should try to apply any extended stats. The solution I came up with was to
add two parameters to this function, one for the RelOptInfo in question,
and one a bool to control if we should try to apply any extended stats.
For clauselist_selectivity() usage involving join rels we just pass the rel
as NULL, that way we can skip all the extended stats stuff with very low
overhead. When we actually have a base relation to pass along we can do so,
along with a true tryextstats value to have the function attempt to use any
extended stats to assist with the selectivity estimation.

When adding these two parameters I had 2nd thoughts that the "tryextstats"
was required at all. We could just have this controlled by if the rel is a
base rel of kind RTE_RELATION. I ended up having to pass these parameters
further, down to clauselist_selectivity's singleton couterpart,
clause_selectivity(). This was due to clause_selectivity() calling
clauselist_selectivity() for some clause types. I'm not entirely sure if
this is actually required, but I can't see any reason for it to cause
problems.

I've also attempted to simplify some of the logic within
clauselist_selectivity and some other parts of clausesel.c to remove some
unneeded code and make it a bit more efficient. For example, we no longer
count the attributes in the clause list before calling a similar function
to retrieve the actual attnums. This is now done as a single step.

I've not yet quite gotten as far as I'd like with this. I'd quite like to
see clauselist_ext_split() gone, and instead we could build up a bitmapset
of clause list indexes to ignore when applying the selectivity of clauses
that couldn't use any extended stats. I'm planning on having a bit more of
a look at this tomorrow.

The attached patch should apply to master as
of f90d23d0c51895e0d7db7910538e85d3d38691f0.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mv_functional-deps_2017-03-31.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-24 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Here's a rebased series on top of today's a3eac988c267.  I call this
> v28.
> 
> I put David's pg_dump and COMMENT patches as second in line, just after
> the initial infrastructure patch.  I suppose those three have to be
> committed together, while the others (which add support for additional
> statistic types) can rightly remain as separate commits.

As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
he would try to rebase patches 0001,0005,0006 on top of what was
committed.  My intention is to give that one a look as soon as it is
available.  So we will have n-distinct and functional dependencies in
PG10.  It sounds unlikely that we will get MCVs and histograms in, since
they're each a lot of code.

I suppose we need 0011 too (psql tab completion), but that can wait.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-16 Thread David Rowley
On 17 March 2017 at 11:20, Alvaro Herrera  wrote:
> (I think I lost some regression test files.  I couldn't make up my mind
> about putting each statistic type's tests in a separate file, or all
> together in stats_ext.sql.)

+1 for stats_ext.sql. I wanted to add some tests for
pg_statisticsextdef(), but I didn't see a suitable location.
stats_ext.sql would have been a good spot.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-16 Thread David Rowley
On 16 March 2017 at 09:45, Alvaro Herrera  wrote:

> Here's another version of 0002 after cleaning up almost everything from
> David's review.  I also added tests for ALTER STATISTICS in
> sql/alter_generic.sql which made me realize there were three crasher bug
> in here; fixed all those.  It also made me realize that psql's \d was a
> little bit too generous with dropped columns in a stats object.  That
> should all behave better now.
>

Thanks for fixing.

As you mentioned to me off-list about missing pg_dump support, I've gone
and implemented that in the attached patch.

I followed how pg_dump works for indexes, and
created pg_get_statisticsextdef() in ruleutils.c. I was unsure if I should
be naming this pg_get_statisticsdef() instead.

I also noticed there's no COMMENT ON support either, so I added that too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


extstats_pg_dump_and_comment_support.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-15 Thread Alvaro Herrera
David Rowley wrote:

> + k = -1;
> + while ((k = bms_next_member(attnums, k)) >= 0)
> + {
> + bool attr_found = false;
> + for (i = 0; i < info->stakeys->dim1; i++)
> + {
> + if (info->stakeys->values[i] == k)
> + {
> + attr_found = true;
> + break;
> + }
> + }
> +
> + /* found attribute not covered by this ndistinct stats, skip */
> + if (!attr_found)
> + {
> + matches = false;
> + break;
> + }
> + }
> 
> Would it be better just to stuff info->stakeys->values into a bitmapset and
> check its a subset of attnums? It would mean allocating memory in the loop,
> so maybe you think otherwise, but in that case maybe StatisticExtInfo
> should store the bitmapset?

Yeah, I think StatisticExtInfo should have a bitmapset, not an
int2vector.

> + appendPQExpBuffer(, "(dependencies)");
> 
> I think it's better practice to use appendPQExpBufferStr() when there's no
> formatting. It'll perform marginally better, which might not be important
> here, but it sets a better example for people to follow when performance is
> more critical.

FWIW this should have said "(ndistinct)" anyway :-)

> +   change the definition of a extended statistics
> 
> "a" should be "an", Also is statistics plural here. It's commonly mixed up
> in the patch. I think it needs standardised. I personally think if you're
> speaking of a single pg_statatic_ext row, then it should be singular. Yet,
> I'm aware you're using plural for the CREATE STATISTICS command, to me that
> feels a bit like: CREATE TABLES mytable ();  am I somehow thinking wrongly
> somehow here?

This was discussed upthread as I recall.  This is what Merriam-Webster says on
the topic:

statistic
1   :  a single term or datum in a collection of statistics
2 a :  a quantity (as the mean of a sample) that is computed from a sample;
   specifically :  estimate 3b
  b :  a random variable that takes on the possible values of a statistic

statistics
1   :  a branch of mathematics dealing with the collection, analysis,
   interpretation, and presentation of masses of numerical data
2   :  a collection of quantitative data

Now, I think there's room to say that a single object created by the new CREATE
STATISTICS is really the latter, not the former.  I find it very weird
that a single of these objects is named in the plural form, though, and
it looks odd all over the place.  I would rather use the term
"statistics object", and then we can continue using the singular.

> +   If a schema name is given (for example, CREATE STATISTICS
> +   myschema.mystat ...) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> 
> What's created in the current schema? I thought this was just for naming?

Well, "created in a schema" means that the object is named after that
schema.  So both are the same thing.  Is this unclear in some way?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-14 Thread David Rowley
On 15 March 2017 at 12:18, David Fetter  wrote:

>
> Is the plan to convert completely from "multivariate" to "extended?"
> I ask because I found a "multivariate" in there.
>

I get the idea that Tomas would like to keep the multivariate when it's
actually referencing multivariate stats. The idea of the rename was to
allow future expansion of the code to perhaps allow creation of stats on
expressions, which is not multivariate. If you've found multivariate
reference in an area that should be generic to extended statistics then
that's a bug and should be fixed.

I found a few of these and listed them during my review.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] multivariate statistics (v25)

2017-03-14 Thread David Fetter
On Tue, Mar 14, 2017 at 07:10:49PM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > I tried patch 0002 today and again there are conflicts, so I rebased and
> > fixed the merge problems.
> 
> ... and attached the patch.

Is the plan to convert completely from "multivariate" to "extended?"
I ask because I found a "multivariate" in there.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-14 Thread Alvaro Herrera
Alvaro Herrera wrote:
> I tried patch 0002 today and again there are conflicts, so I rebased and
> fixed the merge problems.

... and attached the patch.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2c2da2a..b5c4129 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -296,6 +296,11 @@
  
 
  
+  pg_statistic_ext
+  extended planner statistics
+ 
+
+ 
   pg_subscription
   logical replication subscriptions
  
@@ -4223,6 +4228,98 @@
   
  
 
+ 
+  pg_statistic_ext
+
+  
+   pg_statistic_ext
+  
+
+  
+   The catalog pg_statistic_ext
+   holds extended planner statistics.
+  
+
+  
+   pg_statistic_ext Columns
+
+   
+
+ 
+  Name
+  Type
+  References
+  Description
+ 
+
+
+
+
+ 
+  starelid
+  oid
+  pg_class.oid
+  The table that the described columns belongs to
+ 
+
+ 
+  staname
+  name
+  
+  Name of the statistic.
+ 
+
+ 
+  stanamespace
+  oid
+  pg_namespace.oid
+  
+   The OID of the namespace that contains this statistic
+  
+ 
+
+ 
+  staowner
+  oid
+  pg_authid.oid
+  Owner of the statistic
+ 
+
+ 
+  staenabled
+  char[]
+  
+  
+An array with the modes of the enabled statistic types, encoded as
+d for ndistinct coefficients.
+  
+ 
+
+ 
+  stakeys
+  int2vector
+  pg_attribute.attnum
+  
+   This is an array of values that indicate which table columns this
+   statistic covers. For example a value of 1 3 would
+   mean that the first and the third table columns make up the statistic 
key.
+  
+ 
+
+ 
+  standistinct
+  pg_ndistinct
+  
+  
+   Ndistict coefficients, serialized as pg_ndistinct type.
+  
+ 
+
+
+   
+  
+ 
+
  
   pg_namespace
 
diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index b73c66b..76955e5 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -448,4 +448,145 @@ rows = (outer_cardinality * inner_cardinality) * 
selectivity
 
  
 
+ 
+  Extended Statistics
+
+  
+   extended statistics
+   planner
+  
+
+  
+   The examples presented in  used
+   statistics about individual columns to compute selectivity estimates.
+   When estimating conditions on multiple columns, the planner assumes
+   independence of the conditions and multiplies the selectivities. When the
+   columns are correlated, the independence assumption is violated, and the
+   estimates may be off by several orders of magnitude, resulting in poor
+   plan choices.
+  
+
+  
+   The examples presented below demonstrate such estimation errors on simple
+   data sets, and also how to resolve them by creating extended statistics
+   using CREATE STATISTICS command.
+  
+
+  
+   Let's start with a very simple data set - a table with two columns,
+   containing exactly the same values:
+
+
+CREATE TABLE t (a INT, b INT);
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 1) s(i);
+ANALYZE t;
+
+
+   As explained in , the planner can determine
+   cardinality of t using the number of pages and
+   rows is looked up in pg_class:
+
+
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+--+---
+   45 | 1
+
+
+   The data distribution is very simple - there are only 100 distinct values
+   in each column, uniformly distributed.
+  
+
+  
+   The following example shows the result of estimating a WHERE
+   condition on the a column:
+
+
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
+   QUERY PLAN  
  
+-
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual time=0.031..2.870 
rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms
+(5 rows)
+
+
+   The planner examines the condition and computes the estimate using
+   eqsel, the selectivity function for =, and
+   statistics stored in the pg_stats table. In this case
+   the planner estimates the condition matches 1% rows, and by comparing
+   the estimated and actual number of rows, we see that the estimate is
+   very accurate (in fact exact, as the table is very small).
+ 
+
+  
+   Adding a condition on the second column results in the following plan:
+
+
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
+  QUERY PLAN   

+---
+ Seq Scan on t  

Re: [HACKERS] multivariate statistics (v25)

2017-03-14 Thread Alvaro Herrera
I tried patch 0002 today and again there are conflicts, so I rebased and
fixed the merge problems.  I also changed a number of minor things, all
AFAICS cosmetic in nature:

* moved src/backend/statistics/common.h to
  src/include/statistics/common.h, as previously commented.  I also took
  out postgres.h and most of the includes; instead, put all these into
  each .c source file.  That aligns with our established practice.
  I also removed two prototypes that should actually be in stats.h.
  I think statistics/common.h should be further renamed to
  statistics/stats_ext_internal.h, and statistics/stats.h to something
  different though I don't know what ATM.

* Moved src/include/utils/stats.h to src/include/statistics, clean it up
  a bit.

* Moved some structs from analyze.c into statistics/common.h, removing
  some duplication; have analyze.c include that file.

* renamed src/test/regress/sql/mv_ndistinct.sql to stats_ext.sql, to
  collect all ext.stats. related tests in a single file, instead of
  having a large number of them.  I also added one test that drops a
  column, per David Rowley's reported failure, but I didn't actually fix
  the problem nor add it to the expected file.  (I'll follow up with
  that tomorrow, if Tomas doesn't beat me to it).  Also, put the test in
  an earlier parallel test group, 'cause I see no reason to put it last.

* A bunch of stylistic changes.

The added tests pass (or they passed before I added the drop column
tests; not a surprise really that they pass, since I didn't touch
anything functionally), but they aren't terribly exhaustive at the stage
of the first patch in the series.

I didn't get around to addressing all of David Rowley's input.  Also I
didn't try to rebase the remaining patches in the series on top of this
one.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v25)

2017-03-14 Thread David Rowley
On 13 March 2017 at 23:00, David Rowley 
wrote:
>
> 0003:
>
> No more time today. Will try and get to those soon.
>

0003:

I've now read this patch. My main aim here was to learn what it does and
how it works. I need to spend much longer understanding how your
calculating the functional dependencies.

In the meantime I've pasted the notes I took while reading over the patch.

+ default:
+ elog(ERROR, "unexcpected statistics type requested: %d", type);

"unexpected", but we generally use "unknown".

@@ -1293,7 +1294,8 @@ get_relation_statistics(RelOptInfo *rel, Relation
relation)
  info->rel = rel;

  /* built/available statistics */
- info->ndist_built = true;
+ info->ndist_built = stats_are_built(htup, STATS_EXT_NDISTINCT);
+ info->deps_built = stats_are_built(htup, STATS_EXT_DEPENDENCIES);

I don't really like how this function is shaping up. You're calling
stats_are_built() potentially twice for each stats type. There must be a
nicer way to do this. Are non-built stats common enough to optimize
building a StatisticExtInfo regardless and throwing it away if it happens
to be useless?

Can you also rename mvoid to become something more esoid or similar. I seem
to always read it as m-void instead of mv-oid and naturally I expect a void
pointer rather than an Oid.

+dependencies, and for each one count the number of rows rows consistent it.

duplicate word "rows"

+Apllying the functional dependencies is fairly simple - given a list of

Applying


+In this case the default estimation based on AVIA principle happens to work

hmm, maybe I should know what AVIA principles are, but I don't. Is there
something I should be reading?  I searched a bit around the internet for a
few minutes it didn't seem have a great idea either.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group

2017


+ Assert(tmp <= ((char *) output + len));

Shouldn't you just Assert(tmp == ((char *) output + len)); at the end of
the loop?


+ if (dependencies->magic != STATS_DEPS_MAGIC)
+ elog(ERROR, "invalid dependency magic %d (expected %dd)",
+ dependencies->magic, STATS_DEPS_MAGIC);
+
+ if (dependencies->type != STATS_DEPS_TYPE_BASIC)
+ elog(ERROR, "invalid dependency type %d (expected %dd)",
+ dependencies->type, STATS_DEPS_TYPE_BASIC);

%dd ?

+ Assert(dependencies->ndeps > 0);

Why Assert() and not elog() ? Wouldn't think mean that a corrupt dependency
could fail an Assert


+ dependencies = (MVDependencies) palloc0(sizeof(MVDependenciesData));

Why palloc0() and not palloc()?

Can you not just read it into a variable on the stack, then check the exact
size using tempdeps.ndeps * sizeof(MVDependency), then memcpy() it over?
That'll save you the realloc()


+ /* what minimum bytea size do we expect for those parameters */
+ expected_size = offsetof(MVDependenciesData, deps) +
+ dependencies->ndeps * (offsetof(MVDependencyData, attributes) +
+   sizeof(AttrNumber) * 2);

Can't quite make sense of this yet. Why * 2?


+ /* is the number of attributes valid? */
+ Assert((k >= 2) && (k <= STATS_MAX_DIMENSIONS));

Seems like a bad idea to Assert() this. Wouldn't some bad data being
deserialized cause an Assert failure?


+ d = (MVDependency) palloc0(offsetof(MVDependencyData, attributes) +
+   (k * sizeof(AttrNumber)));

Why palloc0(), you seem to write out all the fields right away. Seems like
a waste to zero the memory.

+ /* still within the bytea */
+ Assert(tmp <= ((char *) data + VARSIZE_ANY(data)));

Any point? You're already Asserting that you've consumed the entire array
at the end anyway.

+ appendStringInfoString(, "[");

appendStringInfoChar( '['); would be better.

+ ret = pstrdup(str.data);

ret = pnstrdup(str.data, str.len);


+CREATE STATISTICS s1 WITH (dependencies) ON (a,a) FROM
functional_dependencies;
+ERROR:  duplicate column name in statistics definition

Is it worth mentioning which column here?

I'll try to spend more time understanding 0003 soon.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] multivariate statistics (v25)

2017-03-13 Thread David Rowley
On 3 March 2017 at 03:53, Tomas Vondra  wrote:

> This time with the attachments 


It's been a long while since I looked at this patch, but I'm now taking
another look.

I've made a list of stuff I've found from making my first pass on 0001 and
0002. Some of the stuff may seem a little pedantic, so apologies about
those ones. I merely SET nit_picking_threshold TO 0; and reviewed.

Here goes:

0001:

+ RestrictInfo *rinfo = (RestrictInfo*)node;

and

+ RestrictInfo *rinfo = (RestrictInfo *)node;
+
+ return expression_tree_walker((Node*)rinfo->clause,
+   pull_varattnos_walker,
+   (void*) context);

spacing incorrect. Please space after type name in casts and after the
closing parenthesis.

0002:

+  dropped as well.  Multivariate statistics referencing the column will
+  be dropped only if there would remain a single non-dropped column.

I was initially confused by this. I think it should worded as:

"Multivariate statistics referencing the dropped column will also be
removed if the removal of the column would cause the statistics to contain
data for only a single column"

I had been confused as I'd been thinking of dropping multiple columns at
once with the same command, and only 1 column remained in the table. So I
think it's best to clarify you mean the statistic here.

+ OCLASS_STATISTICS /* pg_statistics_ext */

I wonder if this should be named: OCLASS_STATISTICEXT. The comment is also
incorrect and should read "pg_statistic_ext" (without 's')

I tried to perform a test in this area and received an error:

postgres=# create table ab1 (a int, b int);
CREATE TABLE
postgres=# create statistics ab1_a_b_stats on (a,b) from ab1;
CREATE STATISTICS
postgres=# alter table ab1 drop column a;
ALTER TABLE
postgres=# drop table ab1;
ERROR:  cache lookup failed for statistics 16399

+   When estimating conditions on multiple columns, the planner assumes
+   independence of the conditions and multiplies the selectivities. When
the
+   columns are correlated, the independence assumption is violated, and the
+   estimates may be off by several orders of magnitude, resulting in poor
+   plan choices.

I don't think the assumption is violated. We still assume that they're
independent, which is incorrect. Nothing gets violated.

Perhaps it would be more accurate to write:

"When estimating the selectivity of conditions over multiple columns, the
planner normally assumes each condition is independent of other conditions,
and simply multiplies the selectivity estimates of each condition together
to produce a final selectivity estimation for all conditions. This method
can often lead to inaccurate row estimations when the conditions have
dependencies on one another. Such misestimations can result poor plan
choices being made."

+   using CREATE STATISTICS command.

using the ...

+   As explained in , the planner can
determine
+   cardinality of t using the number of pages and
+   rows is looked up in pg_class:

perhaps "rows is" should become "rows as" or "rows which are".

+ * delete multi-variate statistics
+ */
+ RemoveStatisticsExt(relid, 0);

I think it should be "delete extended statistics"


Should this not be rejected?

postgres=# create view v1 as select 1 a, 2 b;
CREATE VIEW
postgres=# create statistics v1_a_stats on (a,b) from v1;
CREATE STATISTICS

and this?

postgres=# create sequence test_seq;
CREATE SEQUENCE
postgres=# select * from test_seq;
 last_value | log_cnt | is_called
+-+---
  1 |   0 | f
(1 row)
postgres=# create statistics test_seq_stats on (last_value,log_cnt) from
test_seq;
CREATE STATISTICS

The patch does claim:

+ /* extended stats are supported on tables and matviews */

So I guess it should be disallowed.


+ /* OBJECT_STATISTICS */
+ {
+ "statistics", OBJECT_STATISTICS

Maybe this should be changed to be OBJECT_STATISTICEXT */. Doing it this
way would close the door a bit on pg_depends records existing for
pg_statistic.

A quick test shows a problem here:

postgres=# create table ab (a int, b int);
CREATE TABLE
postgres=# create statistics ab_a_b_stats on (a,b) from ab;
CREATE STATISTICS
postgres=# create statistics ab_a_b_stats1 on (a,b) from ab;
CREATE STATISTICS
postgres=# alter statistics ab_a_b_stats1 rename to ab_a_b_stats;
ERROR:  unsupported object class 3381


+/*
+ *
+ * QUERY :
+ * CREATE STATISTICS stats_name ON relname (columns) WITH (options)
+ *
+
*/

Old Syntax?


+ $$ = (Node *)n;

Incorrect spacing.


+ * The returned list is guaranteed to be sorted in order by OID, although
+ * this is not currently needed.

hmm, whats the tie-breaker going to be for:

CREATE TABLE abc (a int, b int, c int);
create statistics abc_ab_stats (a,b) from abc;
create statistics abc_bc_stats (b,c) from abc;

select * from abc where a=1 and b=1 and c=1;

I've not 

Re: [HACKERS] multivariate statistics (v25)

2017-03-02 Thread Tomas Vondra

On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:

Hello,

At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra  wrote 
in 

OK,

attached is v24 of the patch series, addressing most of the reported
issues and comments (at least I believe so). The main changes are:


Unfortunately, 0002 conflicts with the current master
(4461a9b). Could you rebase them or tell us the commit where this
patches stand on?



Attached is a rebased patch series, otherwise it's the same as v24.

FWIW it was based on 016c990834 from Feb 28, but apparently some recent 
patch caused a minor conflict.



I only saw the patch files but have some comments.


1) I've mostly abandoned the "multivariate" name in favor of
"extended", particularly in places referring to stats stored in the
pg_statistic_ext in general. "Multivariate" is now used only in places
talking about particular types (e.g. multivariate histograms).

The "extended" name is more widely used for this type of statistics,
and the assumption is that we'll also add other (non-multivariate)
types of statistics - e.g. statistics on custom expressions, or some
for of join statistics.


In 0005, and

@@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root,
 * If there are no such stats or not enough attributes, don't waste time
 * simply skip to estimation using the plain per-column stats.
 */
+   if (has_stats(stats, STATS_TYPE_MCV) &&
...
+   /* compute the multivariate stats */
+   s1 *= clauselist_ext_selectivity(root, mvclauses, stat);

@@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, 
Index relid,
 }

 /*
+ * estimate selectivity of clauses using multivariate statistic

These comment is left unchanged?  or on purpose? 0007 adds very
similar texts.



Hmm, those comments should be probably changed to "extended".


2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and
pg_mv_stats view renamed to pg_stats_ext).


FWIW, "extended statistic" would be abbreviated as
"ext_statistic" or "extended_stats". Why have you exchanged the
words?



Because this way it's clear it's a version of pg_statistic, and it will 
be sorted right next to it.



3) The structure of pg_statistic_ext was changed as proposed by
Alvaro, i.e. the boolean flags were removed and instead we have just a
single "char[]" column with list of enabled statistics.

4) I also got rid of the "mv" part in most variable/function/constant
names, replacing it by "ext" or something similar. Also mvstats.h got
renamed to stats.h.

5) Moved the files from src/backend/utils/mvstats to
backend/statistics.

6) Fixed the n_choose_k() overflow issues by using the algorithm
proposed by Dean. Also, use the simple formula for num_combinations().

7) I've tweaked data types for a few struct members (in stats.h). I've
kept most of the uint32 fields at the top level though, because int16
might not be large enough for large statistics and the overhead is
minimal (compared to the space needed e.g. for histogram buckets).


Some formulated proof or boundary value test cases might be
needed (to prevent future trouble). Or any defined behavior on
overflow of them might be enough. I belive all (or most) of
overflow-able data has such behavior.



That is probably a good idea and I plan to do that.


The renames/changes were quite widespread, but I've done my best to
fix all the comments and various other places.

regards


regards,



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers