Robert Haas <robertmh...@gmail.com> writes:
> On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".
> This kind of math isn't my strong point, but it seems to me that,
> while sampling noise is a problem, it's not obvious how to tell
> whether a given result is signal or noise.

I think that a single count in a 30K-row sample is noise by definition.
We really ought to be setting the threshold for "what is an MCV" high
enough that it's not drastically affected by variations that are
clearly at the sampling-error level.

> What makes me a bit cautious about this approach overall is that I've
> seen cases where the length of the MCV list turns out to be key to
> getting a good plan, and I needed to make it longer in order for
> things to work.

As long as they actually are MCVs, sure.  The problem I've got with the
current behavior is that it manufactures a spiky distribution where
there is none.  That leads directly to bad estimates, as shown in
Marko's example.  We'd be much better off, both as to planning time and
accuracy, if we'd concluded that the table had no MCVs.

> Another way to state it is: is this problem one-sided?

You know as well as I do that there's no free lunch in this area.
Anything we change at all will make things worse for somebody, if only
by accident.  But I do not think that choosing a bunch of values
entirely at random and claiming (incorrectly) that they are more common
than other values in the table can possibly lead to better results
except by accident.

> In general, I'm pretty skeptical of the idea that sampling 30,000
> rows out of an arbitrarily large table will produce a
> sufficiently-accurate MCV list.

Perhaps not, but allowing two occurrences to define an MCV surely isn't
helping with that.  More to the point, it's a behavior that doesn't go
away simply by making the sample larger.  Making the sample larger just
allows us to falsely invent more pseudo-MCVs.

I'm not by any means wedded to the proposition that we have to fix it
simply by changing the filter rule.  One idea that seems worth
considering is to keep a track list that's a bit longer than the
maximum allowed number of MCVs, and then to say that we accept only
MCVs whose counts are significantly greater than what we find at the
tail of the list.  I'm not sure what "significantly" should be exactly,
but surely a distribution as flat as the ones I was showing upthread
should be a red flag.

			regards, tom lane
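
For concreteness, here is a minimal sketch of how the two ideas above
could fit together: a hard floor on occurrence counts, so that one or
two hits in a 30K-row sample never qualify, plus a requirement that a
candidate's count clear the tail of an over-long track list by some
margin.  The struct layout, the function name, the floor of 10, and the
1.25 multiplier are all illustrative assumptions, not the actual
analyze.c logic.

/*
 * Rough sketch only -- not the actual analyze.c code.  "track" is
 * assumed to be sorted by descending count and to hold a few more
 * entries than the number of MCVs we're allowed to keep, so that the
 * count at its tail estimates the "background" frequency.
 */
typedef struct TrackItem
{
	int		count;			/* occurrences of this value in the sample */
	/* the value itself is omitted for brevity */
} TrackItem;

static int
choose_mcvs(const TrackItem *track, int num_track, int num_mcv)
{
	const int		min_occurrences = 10;	/* hypothetical hard floor */
	const double	tail_factor = 1.25;		/* hypothetical "significantly
											 * greater than the tail" margin */
	int				tail_count;
	int				n = 0;

	if (num_track <= 0)
		return 0;

	tail_count = track[num_track - 1].count;

	while (n < num_mcv && n < num_track)
	{
		int		cnt = track[n].count;

		if (cnt < min_occurrences)
			break;				/* a handful of hits in the sample is noise */
		if ((double) cnt < tail_factor * tail_count)
			break;				/* not clearly above the tail: list too flat */
		n++;
	}

	return n;
}

With a distribution as flat as the ones shown upthread, every candidate
count sits close to the tail count, so the loop stops immediately and
no MCVs are kept; a genuinely spiky distribution passes both tests.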