>>> "Nathan Boley" <npbo...@gmail.com> wrote: 
> Can anyone suggest a good data set to test this sort of question on?
 
Where we have the biggest problem with bad estimates is on complex
searches involving many joins where the main criterion limiting the
result set is a name.  The estimate based on the histogram is often
very low (e.g. 2) when the actual result set is several hundred. 
While several hundred is far short of 1% of the table, the best plan
for a result set of that size is very different from the best plan
for two rows.
 
Some numbers follow to give an idea of the shape of the data where
current techniques sometimes do poorly.  We use a "searchName" column
which combines the name components from various columns into a
canonical format; this is what is indexed and searched.  The search is
usually a LIKE with the high order portion being six to ten characters
followed by the wild card.
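 
Roughly, the index and query look like the sketch below (the table
name "person" and the search value are made up for illustration; only
the "searchName" column is real).  An index built with
text_pattern_ops supports the prefix LIKE, and EXPLAIN ANALYZE shows
the estimated row count next to the actual one:
 
  -- hypothetical table name, real column; text_pattern_ops lets the
  -- planner turn the prefix LIKE into an index range scan
  CREATE INDEX person_searchname_idx
    ON person ("searchName" text_pattern_ops);
 
  -- typical search: six to ten leading characters plus a wild card
  EXPLAIN ANALYZE
  SELECT *
    FROM person
   WHERE "searchName" LIKE 'SMITHJOHN%';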
 
Total rows in table: 32,384,830
 
There are 9,958,969 distinct values.
 
There is one value present in over 1% of the rows, with 433,578 rows.
 
There are ten values present in over 0.1% of the rows:
 433578
 140398
 135489
 112088
  64069
  63158
  44656
  36499
  35896
  35819
 
The 100th most common value is present in 4847 rows.
 
There are 186 values present in over 0.01% of the rows.
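 
Numbers like these can be gathered with a plain GROUP BY over the
column, while the planner only sees the sampled data in pg_stats;
again using the made-up table name from the sketch above:
 
  -- exact counts of the most common values (full scan, so not cheap)
  SELECT "searchName", count(*) AS n
    FROM person
   GROUP BY "searchName"
   ORDER BY n DESC
   LIMIT 10;
 
  -- what the planner has to work with: sampled MCVs and frequencies
  SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 'person' AND attname = 'searchName';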
 
Based on my experience, we would need better estimates for ranges with
200 to 300 rows to improve our plans for the problem cases.  I'd be
happy to have it scan the whole table during our nightly VACUUM
ANALYZE if that would get me statistics which would improve the
estimates to that degree without a huge increase in plan time.
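 
The standard knob for that today is the per-column statistics target,
which enlarges both the most-common-values list and the histogram; a
sketch, again with the made-up table name:
 
  -- larger sample and MCV list for just this column, then re-analyze
  ALTER TABLE person ALTER COLUMN "searchName" SET STATISTICS 1000;
  ANALYZE person ("searchName");
 
Even at the maximum target, ANALYZE still samples rather than scanning
everything, which is part of why the tail of moderately common names
stays hard to estimate.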
 
Which raises the issue: if we could get better statistics by scanning
the whole table, why not do that when VACUUM ANALYZE is run?
 
-Kevin
