On Apr 2, 2016 18:38, "Tom Lane" <t...@sss.pgh.pa.us> wrote:
>
> "Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes:
> > On Apr 1, 2016 23:14, "Tom Lane" <t...@sss.pgh.pa.us> wrote:
> >> Haven't looked at 0002 yet.
>
> > [crosses fingers] hope you'll have a chance to do that before feature
> > freeze for 9.6
>
> I studied this patch for awhile after rebasing it onto yesterday's
> commits.
Fantastic! I could not hope for a better reply :-)

> I did not like the fact that the compute_scalar_stats logic
> would allow absolutely anything into the MCV list once num_hist falls
> below 2.  I think it's important that we continue to reject values that
> are only seen once in the sample, because there's no very good reason to
> think that they are MCVs and not just infrequent values that by luck
> appeared in the sample.

In my understanding, we only put a value in the track list if we've seen
it at least twice, no?

> However, after I rearranged the tests there so
> that "if (num_hist >= 2)" only controlled whether to apply the 1/K limit,
> one of the regression tests started to fail:

Uh-oh.

> there's a place in
> rowsecurity.sql that expects that if a column contains nothing but several
> instances of a single value, that value will be recorded as a lone MCV.
> Now this isn't a particularly essential thing for that test, but it still
> seems like a good property for ANALYZE to have.

No objection here.

> The reason it's failing,
> of course, is that the test as written cannot possibly accept the last
> (or only) value.

Yeah, this is what I would expect from such a change.

> Before I noticed the regression failure, I'd been thinking that maybe it'd
> be better if the decision rule were not "at least 100+x% of the average
> frequency of this value and later ones", but "at least 100+x% of the
> average frequency of values after this one".

Hm, sounds pretty similar to what I wanted to achieve, but better
formalized.

> With that formulation, we're
> not constrained as to the range of x.  Now, if there are *no* values after
> this one, then this way needs an explicit special case in order not to
> compute 0/0; but the preceding para shows that we need a special case for
> the last value anyway.
>
> So, attached is a patch rewritten along these lines.  I used 50% rather
> than 25% as the new cutoff percentage --- obviously it should be higher
> in this formulation than before, but I have no idea if that particular
> number is good or we should use something else.  Also, the rule for the
> last value is "at least 1% of the non-null samples".  That's a pure guess
> as well.
>
> I do not have any good corpuses of data to try this on.  Can folks who
> have been following this thread try it on their data and see how it
> does?  Also please try some other multipliers besides 1.5, so we can
> get a feeling for where that cutoff should be placed.

Expect me to run it on my pet db early next week. :-)

Many thanks!

--
Alex
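P.S. Just to make sure I'm reading the proposed rule correctly before I
go benchmarking it, here is a rough sketch of how I understand the new
cutoff test.  All the names and the structure below are mine, purely for
illustration -- this is not the code from your patch:

    /*
     * Decide how many leading candidates to keep as MCVs.
     *
     * counts[] holds the candidate counts in descending order;
     * nonnull_rows is the number of non-null rows in the sample;
     * ndistinct is the number of distinct values seen in the sample.
     */
    static int
    mcv_cutoff(const int counts[], int ncandidates,
               int nonnull_rows, int ndistinct)
    {
        int     rows_left = nonnull_rows;   /* rows in values i and later */
        int     distinct_left = ndistinct;  /* distinct values i and later */
        int     i;

        for (i = 0; i < ncandidates; i++)
        {
            /* exclude this value; what remains are the values after it */
            rows_left -= counts[i];
            distinct_left--;

            if (distinct_left > 0)
            {
                /* average frequency of the values *after* this one */
                double  avgcount = (double) rows_left / distinct_left;

                /* keep only if at least 150% of that average (x = 50) */
                if (counts[i] < 1.5 * avgcount)
                    break;
            }
            else
            {
                /* no values after this one: need >= 1% of non-null samples */
                if (counts[i] * 100 < nonnull_rows)
                    break;
            }
        }

        return i;
    }

If I have that right, the 0/0 hazard you mention is the distinct_left == 0
branch, which is also where the "at least 1% of the non-null samples" rule
for the last (or only) value kicks in.  Please correct me if the patch
does something different.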