On 07/11/2015 06:32 PM, Tom Lane wrote:
> ...
> Presumably, this is happening because the numbers of rows actually
> satisfying the index predicates are so small that it's a matter of
> luck whether any of them are included in ANALYZE's sample. Given this
> bad data for the index sizes, it's not totally surprising that
> choose_bitmap_and() does something wacko. I'm not sure whether we
> should try to make it smarter, or write this off as "garbage in,
> garbage out".
I think we should make it smarter, if possible - while this example is somewhat artificial, partial indexes are often used exactly like this, i.e. to index only a very small subset of the data. A good example may be an index on "active invoices", i.e. invoices that have not been sorted out yet. There may be a lot of invoices in the table, but only a very small fraction of them will be active (and thus in the index).
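For concreteness, the setup I have in mind looks roughly like this (the table and column names are of course made up):

    CREATE TABLE invoices (
        id       serial PRIMARY KEY,
        customer int,
        amount   numeric,
        active   boolean
    );

    -- index only the invoices that have not been sorted out yet
    CREATE INDEX invoices_active_idx ON invoices (customer) WHERE active;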
So I don't think this is an artificial problem, and we should not write it off as "garbage in".
> Another idea is to not trust any individual ANALYZE's estimate of the
> index rowcount so completely. (I'd thought that the moving-average
> logic would get applied to that, but it doesn't seem to be kicking in
> for some reason.) We could probably make this smarter if we were
> willing to apply the predicate-proof machinery in more situations; in
> this example, once we know that idx001 is applicable, we really should
> disregard idx002 and idx003 altogether because their predicates are
> implied by idx001's. I've always been hesitant to do that because the
> cost of checking seemed likely to greatly outweigh the benefits. But
> since Tomas is nosing around in this territory already, maybe he'd
> like to investigate that further.
I think there are two possible approaches in general - we may improve the statistics somehow, or we may start doing the predicate proofing.
I doubt approaching this at the statistics level alone is sufficient, because even with statistics target 10k (i.e. the maximum), the sample is still of a fixed size. So there will always be some combination of a sufficiently large data set and a sufficiently selective partial index that causes trouble with the sampling.
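For illustration (sticking with the made-up invoices table from above): IIRC ANALYZE samples about 300 * statistics_target rows, i.e. ~3M rows at the maximum target, so with say a billion invoices and a few hundred active ones the expected number of active rows in the sample is below one, and the row count recorded for the partial index is mostly a matter of luck:

    -- raise the statistics target to the maximum and re-analyze
    ALTER TABLE invoices ALTER COLUMN active SET STATISTICS 10000;
    ANALYZE invoices;

    -- the number of index rows ANALYZE derived from the sample; when the
    -- sample happens to contain no active invoices, this ends up (near) zero
    SELECT relname, reltuples
      FROM pg_class
     WHERE relname = 'invoices_active_idx';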
Moreover, I can't really think of a way to fix this at the statistics level. Maybe there's a clever trick guarding against this particular issue, but my personal experience is that whenever I used such a smart hack, it eventually caused strange issues elsewhere.
So I think the predicate proofing is a better approach, but of course the planning cost may be an issue. Maybe we can make it cheaper with some clever tricks, though? For example, given two predicates A and B, it seems that if A => B, then selectivity(A) <= selectivity(B). Could we use that as a cheap filter, i.e. only attempt the expensive implication proof when the selectivities are ordered the right way? We should have the selectivities anyway, no?
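To illustrate with a made-up version of the idx001/idx002/idx003 case Tom mentioned, where the predicates are nested:

    -- hypothetical nested predicates: (val > 900) => (val > 600) => (val > 300)
    CREATE INDEX idx001 ON t (id) WHERE val > 900;
    CREATE INDEX idx002 ON t (id) WHERE val > 600;
    CREATE INDEX idx003 ON t (id) WHERE val > 300;

Here selectivity(val > 900) <= selectivity(val > 600) <= selectivity(val > 300), so e.g. we'd never have to attempt proving that idx003's predicate implies idx001's - the ordering of the already-computed selectivity estimates rules that out cheaply. Of course those are just estimates, so this could only be used to skip proof attempts, not to replace them.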
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services