>>> "Nathan Boley" <npbo...@gmail.com> wrote: > Can anyone suggest a good data set to test this sort of question on? Where we have the biggest problem with bad estimates is on complex searches involving many joins where the main criterion limiting the result set is a name. The estimate based on the histogram is often very low (e.g. 2) when the actual result set is several hundred. While several hundred is far short of 1% of the table, the best plan for a result set of that size is very different than the best plan for two rows. Some numbers follow to give an idea of the shape of data where current techniques sometimes do poorly. We use a "searchName" column which puts the name components from various columns into a canonical format; this is what is indexed and searched. The search is usually a LIKE with the high order portion being six to ten characters followed by the wild card. Total rows in table: 32,384,830 There are 9,958,969 distinct values. There is one value present in over 1% of the rows, with 433,578 rows. There are ten values present in over 0.1% of the rows: 433578 140398 135489 112088 64069 63158 44656 36499 35896 35819 The 100th most common value is present in 4847 rows. There are 186 rows with over 0.01% of the rows. Based on my experience, we would need better estimates for ranges with 200 to 300 rows to improve our plans for the problem cases. I'd be happy to have it scan the whole table during our nightly VACUUM ANALYZE if that would get me statistics which would improve the estimates to that degree without a huge increase in plan time. Which raises the issue, if we could get better statistics by passing the whole table, why not do that when VACUUM ANALYZE is run? -Kevin