On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> On 01/20/2016 10:49 PM, Alvaro Herrera wrote:
>>
>> Tom, are you reviewing this for the current commitfest?
>
> While I'm not the right Tom, I've been looking at the patch recently, so
> let me post the review here ...

Thank you for the review!

> 2) mincount = 1.25 * avgcount
> -----------------------------
>
> While I share the dislike of arbitrary constants (like the 1.25 here), I
> do think we'd better keep this, otherwise we can just get rid of the
> mincount entirely, I think - the most frequent value will always be above
> the (remaining) average count, making the threshold pointless.

Correct.  It might have had an impact in the original code, but in the
new one it's quite useless (see the next point), unless I'm missing some
detail.

> 3) modifying avgcount threshold inside the loop
> -----------------------------------------------
>
> The comment was extended with this statement:
>
>  * We also decrease ndistinct in the process such that going forward
>  * it refers to the number of distinct values left for the histogram.
>
> and indeed, that's what's happening - at the end of each loop iteration,
> we do this:
>
>     /* Narrow our view of samples left for the histogram */
>     sample_cnt -= track[i].count;
>     ndistinct--;
>
> but that immediately lowers the avgcount, as that's re-evaluated within
> the same loop:
>
>     avgcount = (double) sample_cnt / (double) ndistinct;
>
> which means it's continuously decreasing and lowering the threshold,
> although that's partially mitigated by keeping the 1.25 coefficient.

I was going to write "not necessarily lowering", but this is actually
accurate.  The following holds due to track[i].count > avgcount
(= sample_cnt / ndistinct):

   sample_cnt     sample_cnt - track[i].count
  ------------ > -----------------------------
   ndistinct           ndistinct - 1

(Cross-multiplying reduces this to ndistinct * track[i].count >
sample_cnt, i.e. track[i].count > avgcount, which is exactly the
condition for entering the MCV list in the first place.)

> It however makes reasoning about the algorithm much more complicated.

Unfortunately, yes.

> 4) for (i = 0; /* i < num_mcv */; i++)
> ---------------------------------------
>
> The way the loop is coded seems rather awkward, I guess.  Not only is
> there an unexpected comment in the "for" clause, but the body also says
> this:
>
>     /* Another way to say "while (i < num_mcv)" */
>     if (i >= num_mcv)
>         break;
>
> Why not write it as a while loop, then?  Possibly including the
> (num_hist >= 2) condition, like this:
>
>     while ((i < num_mcv) && (num_hist >= 2))
>     {
>         ...
>     }
>
> In any case, the loop would deserve a comment explaining why we think
> computing the thresholds like this makes sense.

This is partially explained by a comment inside the loop:

!         for (i = 0; /* i < num_mcv */; i++)
          {
!             /*
!              * We have to put this before the loop condition, otherwise
!              * we'll have to repeat this code before the loop and after
!              * decreasing ndistinct.
!              */
!             num_hist = ndistinct;
!             if (num_hist > num_bins)
!                 num_hist = num_bins + 1;

I guess this is a case where code duplication can be traded for more
apparent control flow, i.e.:

!         num_hist = ndistinct;
!         if (num_hist > num_bins)
!             num_hist = num_bins + 1;

!         for (i = 0; i < num_mcv && num_hist >= 2; i++)
          {
              ...

+             /* Narrow our view of samples left for the histogram */
+             sample_cnt -= track[i].count;
+             ndistinct--;
+
+             num_hist = ndistinct;
+             if (num_hist > num_bins)
+                 num_hist = num_bins + 1;
          }
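To make that concrete, here is a minimal standalone sketch of the
resulting control flow.  The type and function names here are simplified
stand-ins I made up for illustration; the real (patched) code lives in
compute_scalar_stats() in analyze.c:

    /* Simplified stand-in for analyze.c's per-value tracking entry */
    typedef struct
    {
        int     count;      /* occurrences of this value in the sample */
    } TrackItem;

    static int
    choose_mcv_cutoff(TrackItem *track, int num_mcv, int num_bins,
                      int sample_cnt, int ndistinct)
    {
        int     i;
        int     num_hist;

        /*
         * num_hist is the number of distinct values available for the
         * histogram, capped at num_bins + 1; a histogram needs at least
         * two of them to be useful.
         */
        num_hist = ndistinct;
        if (num_hist > num_bins)
            num_hist = num_bins + 1;

        for (i = 0; i < num_mcv && num_hist >= 2; i++)
        {
            double  avgcount = (double) sample_cnt / (double) ndistinct;
            double  mincount = 1.25 * avgcount;

            if ((double) track[i].count < mincount)
                break;          /* not frequent enough for the MCV list */

            /* Narrow our view of samples left for the histogram */
            sample_cnt -= track[i].count;
            ndistinct--;

            num_hist = ndistinct;
            if (num_hist > num_bins)
                num_hist = num_bins + 1;
        }

        return i;           /* number of values to keep in the MCV list */
    }

With the condition hoisted out of the loop body, all the termination
criteria are visible in the for clause, at the cost of repeating the
three num_hist lines once.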
> Summary
> -------
>
> Overall, I think this is really about deciding when to cut off the MCV
> list, so that it does not grow needlessly large - as Robert pointed out,
> the larger the list, the more expensive the estimation (and thus
> planning).
>
> So even if we could fit the whole sample into the MCV list (i.e. we
> believe we've seen all the values and we can fit them into the MCV
> list), it may not make sense to do so.  The ultimate goal is to estimate
> conditions, and if we can do that reasonably well even after cutting off
> the least frequent values from the MCV list, then why not?
>
> From this point of view, the analysis deals just with the ANALYZE part
> and does not discuss the estimation counterpart at all.

True, this aspect still needs verification.  As stated, my primary
motivation was to improve plan stability for relatively short MCV lists.
Longer MCV lists might be a different story, but see the "Increasing
stats target" section of the original mail: increasing the target doesn't
give quite the expected results with unpatched code either.

> 5) ndistinct estimation vs. NULLs
> ---------------------------------
>
> While looking at the patch, I started wondering whether we're actually
> handling NULLs correctly when estimating ndistinct.  That part also uses
> samplerows directly and entirely ignores NULLs, as it does this:
>
>     numer = (double) samplerows * (double) d;
>
>     denom = (double) (samplerows - f1) +
>             (double) f1 * (double) samplerows / totalrows;
>
>     ...
>     if (stadistinct > totalrows)
>         stadistinct = totalrows;
>
> For tables with a large fraction of NULLs, this seems to significantly
> underestimate the ndistinct value - for example, consider a trivial
> table with 95% NULL values and ~10k distinct values with a skewed
> distribution:
>
>     create table t (id int);
>
>     insert into t
>     select (case when random() < 0.05 then (10000 * random() * random())
>             else null end) from generate_series(1,1000000) s(i);
>
> In practice, there are 8325 distinct values in my sample:
>
>     test=# select count(distinct id) from t;
>      count
>     -------
>       8325
>     (1 row)
>
> But after ANALYZE with the default statistics target (100), ndistinct
> is estimated to be ~1300, and with target=1000 the estimate increases
> to ~6000.
>
> After fixing the estimator to consider the fraction of NULLs, the
> estimates look like this:
>
>     statistics target | master | patched
>     ------------------+--------+--------
>                   100 |   1302 |    5356
>                  1000 |   6022 |    6791
>
> So this seems to significantly improve the ndistinct estimate (patch
> attached).

Hm... this looks correct.  And compute_distinct_stats() needs the same
treatment, obviously.
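For concreteness, here is roughly what I'd expect the fixed estimator to
look like, written as a standalone sketch.  The function and parameter
names are mine; values_cnt, d and f1 are meant to mirror the locals of
compute_scalar_stats(), and I haven't checked this against your attached
patch:

    /*
     * Sketch of a NULL-aware Haas-Stokes estimator: base the computation
     * on the non-null values in the sample and on the estimated number
     * of non-null rows in the table, instead of samplerows and
     * totalrows.  Caller is assumed to ensure values_cnt > 0 and
     * nullfrac < 1.
     */
    static double
    estimate_ndistinct(int values_cnt,  /* non-null values in sample */
                       int d,           /* distinct values in sample */
                       int f1,          /* values seen exactly once */
                       double totalrows,
                       double nullfrac) /* estimated fraction of NULLs */
    {
        double  totaln = totalrows * (1.0 - nullfrac);  /* non-null rows */
        double  numer,
                denom,
                stadistinct;

        numer = (double) values_cnt * (double) d;

        denom = (double) (values_cnt - f1) +
                (double) f1 * (double) values_cnt / totaln;

        stadistinct = numer / denom;

        /* Clamp to sane range in case of roundoff error */
        if (stadistinct < (double) d)
            stadistinct = (double) d;
        if (stadistinct > totaln)
            stadistinct = totaln;

        return stadistinct;
    }

--
Alex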