Re: [HACKERS] Strange heuristic in analyze.c

2010-02-08 Thread Greg Stark
On Fri, Feb 5, 2010 at 8:53 PM, Bruce Momjian br...@momjian.us wrote:
 Do you want a C comment to document this problem?

Well I would rather a better heuristic :)

We really need some statistics nerds in this group who can pipe up
when these kinds of issues come up. There must be a good way to
estimate the probability that we've seen all distinct values.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange heuristic in analyze.c

2010-02-05 Thread Bruce Momjian
Greg Stark wrote:
 So I never realized the consequences of this little heuristic in
 analyze.c in the handling of very low cardinality columns where we
 want to just capture the complete list of values in the mcv and throw
 away the histogram:
 
   else if (toowide_cnt == 0  nmultiple == ndistinct)
   {
   /*
* Every value in the sample appeared more than once.  
 Assume the
* column has just these values.
*/
   stats-stadistinct = ndistinct;
   }
 
 The problem with this heuristic is that if the table is small enough
 you might expect you can set the statistics target high and sample
 the entire table and get a very accurate mcv covering all the values.
 However if any of the values in the table appears only once this
 heuristic will defeat you. The following code will then throw out of
 the mcv any value which isn't 25% more common than average. Leaving
 you with a histogram for those values which often does very poorly if
 the values don't fit any pattern and are just discrete arbitrary
 values.

Do you want a C comment to document this problem?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Strange heuristic in analyze.c

2010-01-29 Thread Greg Stark
So I never realized the consequences of this little heuristic in
analyze.c in the handling of very low cardinality columns where we
want to just capture the complete list of values in the mcv and throw
away the histogram:

else if (toowide_cnt == 0  nmultiple == ndistinct)
{
/*
 * Every value in the sample appeared more than once.  
Assume the
 * column has just these values.
 */
stats-stadistinct = ndistinct;
}

The problem with this heuristic is that if the table is small enough
you might expect you can set the statistics target high and sample
the entire table and get a very accurate mcv covering all the values.
However if any of the values in the table appears only once this
heuristic will defeat you. The following code will then throw out of
the mcv any value which isn't 25% more common than average. Leaving
you with a histogram for those values which often does very poorly if
the values don't fit any pattern and are just discrete arbitrary
values.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers