On 6 March 2018 at 08:51, John Naylor <jcnay...@gmail.com> wrote: > On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote: >> Attached is an updated patch. > Nice. The results look good.
Thanks for the review. > I agree it should be in a separate function. As for that large > comment, I spent some time pouring over it to verify the math and make > sure I understand it. I see a couple points where it might be a bit > confusing to someone reading this code for the first time: > > + * This bound is at most 25, and approaches 0 as n approaches 0 or N. > The > + * case where n approaches 0 cannot happen in practice, since the > sample > + * size is at least 300. > > I think the implication is that the bound cannot dip below 10 (the > stated minimum to justify using techniques intended for a normal > distributed sample), so there's no need to code a separate clamp to > ensure 10 values. That might be worth spelling out explicitly in the > comment. > Perhaps I should really say that n can't be less than 300 unless N is too, in which case n=N. So the only case where we need to worry about the bound approaching zero is when n --> N, and we're sampling the entire table, or almost all of it. I'll see if I can re-word that to be clearer. > + * size is at least 300. The case where n approaches N corresponds to > + * sampling the whole the table, in which case it is reasonable to > keep > + * the whole MCV list (have no lower bound), so it makes sense to > apply > + * this formula for all inputs, even though the above derivation is > + * technically only valid when the right hand side is at least around > 10. > > It occurred to me that if the table cardinality is just barely above > the sample size, we could get a case where a value sampled only once > could slip into the MCV list. With the default stat target, this would > be tables with between 30,000 and 31,500 rows. (actually I think that's 31250 rows, or more generally when we sample more than around 96% of the table) > I couldn't induce this > behavior, so I looked at the logic that identifies MCV candidates, and > saw a test for > > if (dups_cnt > 1) > > If I'm reading that block correctly, than a value sampled once will > never even be considered for the MCV list, so it seems that the > defacto lower bound for these tables is two. It might be worth > mentioning that in the comment. > > It also means that this clamp on HEAD > > - if (mincount < 2) > - mincount = 2; > > is dead code. > Yes, in compute_scalar_stats(), each value is guaranteed to have been seen at least twice. However, looking at compute_distinct_stats(), that's not the case. I'm not sure if that matters. When we have sampled 96% of the table, any estimate we produce based on the number of times a value has been seen in the sample is likely to be almost exact, even if it has only been seen once or twice. So the estimates from the MCV list will all be good, but I suspect in this case the estimates for all other values that didn't fit in the MCV list will also be good, so it may not matter whether or not we keep those MCV items. My instinct is to keep them, on the grounds that the more information the planner has, the better. >> mcv_cutoff - The relative standard error cutoff percentage used (10, >> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests >> against HEAD. > > I'm not quite following the negative numbers for HEAD. They're all the > equivalent, right? Just a label for convenience to make sure you ran > the same number of tests? > Yes, they should all be equivalent. They just reflect the way I ran the tests -- HEAD vs the patch with a cutoff of 10%, HEAD vs the patch with a cutoff of 15%, and so on. So I ended up running it against HEAD 9 times, and I didn't want to throw any of that data away. It's useful for getting a feel for the scope of random variations between test runs. Regards, Dean