On 3 September 2018 at 00:17, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> Attached is an updated version of the patch series, adopting a couple of
> improvements - both for MCV lists and histograms.
>
>
> MCV
> ---
>
> For the MCV list part, I've adopted the approach proposed by Dean, using
> base selectivity and using it to correct the non-MCV part. I agree the
> simplicity of the approach is a nice feature, and it seems to produce
> better estimates. I'm not sure I understand the approach perfectly, but
> I've tried to add comments explaining how it works etc.
>
Cool. Looking at this afresh after some time away, it still looks like
a reasonable approach, and the test results are encouraging.

In mcv_clauselist_selectivity(), you raise the following question:

    if (matches[i] != STATS_MATCH_NONE)
    {
        /* XXX Shouldn't the basesel be outside the if condition? */
        *basesel += mcv->items[i]->base_frequency;
        s += mcv->items[i]->frequency;
    }

The reason it needs to be inside the "if" block is that what it's
computing is the base (independent) selectivity of the clauses found to
match the MCV stats, so that in statext_clauselist_selectivity() it can
then be used in the following computation:

    /* Estimated selectivity of values not covered by MCV matches */
    other_sel = simple_sel - mcv_basesel;

to give an estimate for the other clauses that aren't covered by the
MCV stats. So I think the code is correct as it stands, but if you
think it isn't clear enough, maybe a comment update is in order.

The assumption being made is that mcv_basesel will cancel out the part
of simple_sel that is due to clauses matching the MCV stats, so that we
can then just add on the MCV selectivity. Of course that's only an
approximation, and it won't be exact -- partly because simple_sel and
mcv_basesel are potentially computed using different sample rows, and
so will differ in the MCV region, and partly because of second-order
effects arising from the way that selectivities are combined in
clauselist_selectivity_simple(). Maybe that's something that can be
improved upon, but it doesn't seem like a bad initial approximation.
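
To make the intended arithmetic concrete, here is a minimal standalone
sketch of how the pieces are meant to fit together (this is not the
patch's actual code -- combine_mcv_selectivity() and the numbers in
main() are made up for illustration, and CLAMP_PROBABILITY is
redefined locally so the sketch compiles on its own):

    #include <stdio.h>

    #define CLAMP_PROBABILITY(p) \
        do { \
            if ((p) < 0.0) (p) = 0.0; \
            else if ((p) > 1.0) (p) = 1.0; \
        } while (0)

    /*
     * simple_sel  - clauselist_selectivity_simple() estimate, which
     *               assumes independent columns
     * mcv_sel     - summed frequency of the matching MCV items
     * mcv_basesel - summed base frequency of those same items
     */
    static double
    combine_mcv_selectivity(double simple_sel, double mcv_sel,
                            double mcv_basesel)
    {
        /* Estimated selectivity of values not covered by MCV matches */
        double      other_sel = simple_sel - mcv_basesel;

        CLAMP_PROBABILITY(other_sel);

        /* Add the observed MCV selectivity back on */
        double      sel = mcv_sel + other_sel;

        CLAMP_PROBABILITY(sel);
        return sel;
    }

    int
    main(void)
    {
        /*
         * Made-up numbers: independence predicts 4% for the matched
         * MCV region (mcv_basesel), but the observed frequency is 10%
         * (mcv_sel), so the overall estimate is corrected upwards,
         * from 5% to 11%.
         */
        printf("%g\n", combine_mcv_selectivity(0.05, 0.10, 0.04));
        return 0;
    }

The point is just that the independence-based estimate for the matched
MCV region is subtracted out and replaced by the observed MCV
frequencies; everything else is left to the simple estimate.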
> I've also changed how we build the MCV lists, particularly how we
> decide how many / which items to store in the MCV list. In the
> previous version I adopted the same algorithm we use for per-column
> MCV lists, but in certain cases that turned out to be too restrictive.
>
> Consider for example a table with multiple perfectly correlated
> columns, with very few combinations. That is, something like this:
>
>     CREATE TABLE t (a int, b int);
>
>     INSERT INTO t SELECT mod(i,50), mod(i,50)
>       FROM generate_series(1,1e6) s(i);
>
>     CREATE STATISTICS s (mcv) ON a,b FROM t;
>
> Now, the data distribution is very simple - uniform, with 50 distinct
> combinations, each representing 2% of the data (and the random sample
> should be pretty close to that).
>
> In these cases, analyze_mcv_list decides it does not need any MCV
> list, because the frequency of each value is pretty much 1/ndistinct.
> For a single column that's reasonable, but for multiple correlated
> columns it's rather problematic. We might use the same ndistinct
> approach (assuming we have the ndistinct coefficients), but that
> still does not allow us to decide which combinations are "valid" with
> respect to the data. For example, we can't decide that (1,10) does
> not appear in the data.
>
> So I'm not entirely sure that adopting the same analyze_mcv_list
> algorithm for both single-column and multi-column stats makes sense.
> It may make sense to keep more items in the multi-column case, for
> reasons that are not really valid for a single column.
>
> For now I've added a trivial condition to simply keep all the groups
> when possible. This probably needs more thought.
>

Ah, this is a good point. I think I see the problem here.
analyze_mcv_list() works by keeping those MCV entries that are
statistically significantly more frequent than the selectivity that
would otherwise have been assigned to their values, which is based on
ndistinct and nullfrac.

That's not really right for multivariate stats though, because the
selectivity that would be assigned to a multi-column value if it
weren't in the multivariate MCV list is actually calculated using the
product of the individual column selectivities. Fortunately we now
calculate this (base_frequency), so actually I think what's needed is
a custom version of analyze_mcv_list() that keeps MCV entries if the
observed frequency is statistically significantly larger than the base
frequency, not the ndistinct-based frequency.

It might also be worthwhile doing a little more work to make the
base_frequency values more consistent with the way individual column
selectivities are actually calculated -- currently the patch always
uses the observed single-column frequencies to calculate the base
frequencies, but actually the univariate stats would only do that for
a subset of the single-column values, and the rest would get assigned
a fixed share of the remaining selectivity-space. Factoring that into
the base frequency calculation ought to give a better base frequency
estimate (for use in mcv_clauselist_selectivity() and
statext_clauselist_selectivity()), as well as give a more principled
cutoff threshold for deciding which multivariate MCV values to keep.
It may be possible to reuse some of the existing code for that.

The initial goal of the base frequency calculation was to replicate
the univariate stats computations, so that it can be used to give the
right correction to apply to the simple_sel value. If it can also be
used to determine how many MCV entries to keep, that's an added bonus.
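
To illustrate the univariate behaviour I'm referring to, here is a
rough sketch of how the base frequencies might be computed to match it
(names and signatures are invented for this email; the real code would
work from the per-column pg_statistic data, much as var_eq_const()
does):

    #include <stdbool.h>

    /*
     * A single column's contribution to an item's base frequency,
     * mirroring how the univariate stats estimate "col = const": use
     * the per-column MCV frequency when the value is in that column's
     * MCV list, otherwise spread the remaining selectivity evenly
     * over the remaining distinct values.
     */
    static double
    column_base_sel(bool in_column_mcv, double mcv_freq,
                    double sum_mcv_freqs, double nullfrac,
                    double ndistinct, int num_mcv_values)
    {
        if (in_column_mcv)
            return mcv_freq;

        return (1.0 - sum_mcv_freqs - nullfrac) /
            (ndistinct - num_mcv_values);
    }

    /*
     * The item's base frequency is then just the product of its
     * per-column selectivities.
     */
    static double
    item_base_frequency(const double *col_sels, int ncolumns)
    {
        double      base = 1.0;

        for (int i = 0; i < ncolumns; i++)
            base *= col_sels[i];

        return base;
    }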
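
With base frequencies like that, the custom analyze_mcv_list() variant
might look something like the following -- again just a sketch,
assuming the items are sorted by descending sample count, and reusing
the same two-standard-errors hypergeometric test as the univariate
analyze_mcv_list():

    #include <math.h>

    /*
     * Hypothetical multivariate variant of analyze_mcv_list(): keep
     * an item (and all more-common items) only if its observed sample
     * count significantly exceeds the count predicted by its base
     * (independence-assumption) frequency.
     */
    static int
    analyze_mv_mcv_list(const double *counts,       /* descending */
                        const double *base_freqs,
                        int num_mcv,
                        double samplerows,
                        double totalrows)
    {
        while (num_mcv > 0)
        {
            double      count = counts[num_mcv - 1];
            double      base_count = base_freqs[num_mcv - 1] * samplerows;

            /* Hypergeometric variance of the item's sample count */
            double      N = totalrows;
            double      n = samplerows;
            double      K = N * count / n;  /* est. population count */
            double      variance = n * K * (N - K) * (N - n) /
                                   (N * N * (N - 1));
            double      stddev = sqrt(variance);

            if (count > base_count + 2 * stddev)
                break;      /* this and all preceding items are kept */

            num_mcv--;
        }

        return num_mcv;
    }

The exact significance test would need more thought, of course.

Regards,
Dean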