On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> On 01/20/2016 10:49 PM, Alvaro Herrera wrote:
>>
>> Tom, are you reviewing this for the current commitfest?
>
> While I'm not the right Tom, I've been looking at the patch recently, so
> let me post the review here ...

Thank you for the review!

> 2) mincount = 1.25 * avgcount
> -----------------------------
>
> While I share the dislike of arbitrary constants (like the 1.25 here), I
> do think we'd better keep this, otherwise we can just get rid of the
> mincount entirely, I think - the most frequent value will always be above
> the (remaining) average count, making the threshold pointless.

Correct.  It might have had an impact in the original code, but in the
new one it's quite useless (see the next point), unless I'm missing some
detail.

> 3) modifying avgcount threshold inside the loop
> -----------------------------------------------
>
> The comment was extended with this statement:
>
>  * We also decrease ndistinct in the process such that going forward
>  * it refers to the number of distinct values left for the histogram.
>
> and indeed, that's what's happening - at the end of each loop iteration,
> we do this:
>
>     /* Narrow our view of samples left for the histogram */
>     sample_cnt -= track[i].count;
>     ndistinct--;
>
> but that immediately lowers the avgcount, as that's re-evaluated within
> the same loop:
>
>     avgcount = (double) sample_cnt / (double) ndistinct;
>
> which means it's continuously decreasing and lowering the threshold,
> although that's partially mitigated by keeping the 1.25 coefficient.

I was going to write "not necessarily lowering", but this is actually
accurate.  The following holds due to track[i].count > avgcount
(= sample_cnt / ndistinct):

   sample_cnt     sample_cnt - track[i].count
  ------------ > -----------------------------
   ndistinct           ndistinct - 1

(Cross-multiplying reduces this to ndistinct * track[i].count >
sample_cnt, i.e. track[i].count > avgcount, which is exactly the
condition for entering the MCV list in the first place.)

> It however makes reasoning about the algorithm much more complicated.

Unfortunately, yes.

> 4) for (i = 0; /* i < num_mcv */; i++)
> ---------------------------------------
>
> The way the loop is coded seems rather awkward, I guess.  Not only is
> there an unexpected comment in the "for" clause, but the body also says
> this:
>
>     /* Another way to say "while (i < num_mcv)" */
>     if (i >= num_mcv)
>         break;
>
> Why not write it as a while loop, then?  Possibly including the
> (num_hist >= 2) condition, like this:
>
>     while ((i < num_mcv) && (num_hist >= 2))
>     {
>         ...
>     }
>
> In any case, the loop would deserve a comment explaining why we think
> computing the thresholds like this makes sense.

This is partially explained by a comment inside the loop:

!         for (i = 0; /* i < num_mcv */; i++)
          {
!             /*
!              * We have to put this before the loop condition, otherwise
!              * we'll have to repeat this code before the loop and after
!              * decreasing ndistinct.
!              */
!             num_hist = ndistinct;
!             if (num_hist > num_bins)
!                 num_hist = num_bins + 1;

I guess this is a case where code duplication can be traded for more
apparent control flow, i.e.:

!         num_hist = ndistinct;
!         if (num_hist > num_bins)
!             num_hist = num_bins + 1;

!         for (i = 0; i < num_mcv && num_hist >= 2; i++)
          {
              ...

+             /* Narrow our view of samples left for the histogram */
+             sample_cnt -= track[i].count;
+             ndistinct--;
+
+             num_hist = ndistinct;
+             if (num_hist > num_bins)
+                 num_hist = num_bins + 1;
          }
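To make that concrete, here is a minimal standalone sketch of the
resulting control flow.  The type and function names here are simplified
stand-ins I made up for illustration; the real (patched) code lives in
compute_scalar_stats() in analyze.c:

    /* Simplified stand-in for analyze.c's per-value tracking entry */
    typedef struct
    {
        int     count;      /* occurrences of this value in the sample */
    } TrackItem;

    static int
    choose_mcv_cutoff(TrackItem *track, int num_mcv, int num_bins,
                      int sample_cnt, int ndistinct)
    {
        int     i;
        int     num_hist;

        /*
         * num_hist is the number of distinct values available for the
         * histogram, capped at num_bins + 1; a histogram needs at least
         * two of them to be useful.
         */
        num_hist = ndistinct;
        if (num_hist > num_bins)
            num_hist = num_bins + 1;

        for (i = 0; i < num_mcv && num_hist >= 2; i++)
        {
            double  avgcount = (double) sample_cnt / (double) ndistinct;
            double  mincount = 1.25 * avgcount;

            if ((double) track[i].count < mincount)
                break;          /* not frequent enough for the MCV list */

            /* Narrow our view of samples left for the histogram */
            sample_cnt -= track[i].count;
            ndistinct--;

            num_hist = ndistinct;
            if (num_hist > num_bins)
                num_hist = num_bins + 1;
        }

        return i;           /* number of values to keep in the MCV list */
    }

With the condition hoisted out of the loop body, all the termination
criteria are visible in the for clause, at the cost of repeating the
three num_hist lines once.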
> Summary
> -------
>
> Overall, I think this is really about deciding when to cut off the MCV
> list, so that it does not grow needlessly large - as Robert pointed out,
> the larger the list, the more expensive the estimation (and thus
> planning).
>
> So even if we could fit the whole sample into the MCV list (i.e. we
> believe we've seen all the values and we can fit them into the MCV
> list), it may not make sense to do so.  The ultimate goal is to estimate
> conditions, and if we can do that reasonably well even after cutting off
> the least frequent values from the MCV list, then why not?
>
> From this point of view, the analysis deals just with the ANALYZE part
> and does not discuss the estimation counterpart at all.

True, this aspect still needs verification.  As stated, my primary
motivation was to improve plan stability for relatively short MCV lists.
Longer MCV lists might be a different story, but see the "Increasing
stats target" section of the original mail: increasing the target doesn't
give quite the expected results with unpatched code either.

> 5) ndistinct estimation vs. NULLs
> ---------------------------------
>
> While looking at the patch, I started wondering whether we're actually
> handling NULLs correctly when estimating ndistinct.  That part also uses
> samplerows directly and entirely ignores NULLs, as it does this:
>
>     numer = (double) samplerows * (double) d;
>
>     denom = (double) (samplerows - f1) +
>             (double) f1 * (double) samplerows / totalrows;
>
>     ...
>     if (stadistinct > totalrows)
>         stadistinct = totalrows;
>
> For tables with a large fraction of NULLs, this seems to significantly
> underestimate the ndistinct value - for example, consider a trivial
> table with 95% NULL values and ~10k distinct values with a skewed
> distribution:
>
>     create table t (id int);
>
>     insert into t
>     select (case when random() < 0.05 then (10000 * random() * random())
>             else null end) from generate_series(1,1000000) s(i);
>
> In practice, there are 8325 distinct values in my sample:
>
>     test=# select count(distinct id) from t;
>      count
>     -------
>       8325
>     (1 row)
>
> But after ANALYZE with the default statistics target (100), ndistinct
> is estimated to be ~1300, and with target=1000 the estimate increases
> to ~6000.
>
> After fixing the estimator to consider the fraction of NULLs, the
> estimates look like this:
>
>     statistics target | master | patched
>     ------------------+--------+--------
>                   100 |   1302 |    5356
>                  1000 |   6022 |    6791
>
> So this seems to significantly improve the ndistinct estimate (patch
> attached).

Hm... this looks correct.  And compute_distinct_stats() needs the same
treatment, obviously.
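For concreteness, here is roughly what I'd expect the fixed estimator to
look like, written as a standalone sketch.  The function and parameter
names are mine; values_cnt, d and f1 are meant to mirror the locals of
compute_scalar_stats(), and I haven't checked this against your attached
patch:

    /*
     * Sketch of a NULL-aware Haas-Stokes estimator: base the computation
     * on the non-null values in the sample and on the estimated number
     * of non-null rows in the table, instead of samplerows and
     * totalrows.  Caller is assumed to ensure values_cnt > 0 and
     * nullfrac < 1.
     */
    static double
    estimate_ndistinct(int values_cnt,  /* non-null values in sample */
                       int d,           /* distinct values in sample */
                       int f1,          /* values seen exactly once */
                       double totalrows,
                       double nullfrac) /* estimated fraction of NULLs */
    {
        double  totaln = totalrows * (1.0 - nullfrac);  /* non-null rows */
        double  numer,
                denom,
                stadistinct;

        numer = (double) values_cnt * (double) d;

        denom = (double) (values_cnt - f1) +
                (double) f1 * (double) values_cnt / totaln;

        stadistinct = numer / denom;

        /* Clamp to sane range in case of roundoff error */
        if (stadistinct < (double) d)
            stadistinct = (double) d;
        if (stadistinct > totaln)
            stadistinct = totaln;

        return stadistinct;
    }

--
Alex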