On 6/17/23 02:02, Quan Zongliang wrote:
>
>
> On 2023/6/17 06:46, Tom Lane wrote:
>> Quan Zongliang <[email protected]> writes:
>>> Perhaps we should discard this (dups cnt > 1) restriction?
>>
>> That's not going to happen on the basis of one test case that you
>> haven't even shown us. The implications of doing it are very unclear.
>> In particular, I seem to recall that there are bits of logic that
>> depend on the assumption that MCV entries always represent more than
>> one row. The nmultiple calculation Tomas referred to may be failing
>> because of that, but I'm worried about there being other places.
>>
I don't recall any logic that'd outright fail with MCVs containing
single-row groups, and I haven't noticed anything obvious in analyze.c
during a cursory search. Maybe the paper that analyze_mcv_list() builds
on makes some assumptions about that? Not sure.

However, compute_distinct_stats() doesn't seem to have such protection
against single-row MCV groups, so if that's wrong, we arguably have the
issue already (admittedly, compute_distinct_stats() is used much less
than compute_scalar_stats()).
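
FWIW the easiest way to eyeball what ANALYZE ended up with is the
pg_stats view (the human-readable wrapper over pg_statistic); the table
and column names here are just placeholders:

  SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 't_tiny' AND attname = 'v';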
>
> The statistics for the other table look like this:
> stadistinct | 6
> stanumbers1 | {0.50096667,0.49736667,0.0012}
> stavalues1 | {v22,v23,v5}
>
> The values that appear twice in the small table (v1 and v2) do not
> appear here. The true value of stadistinct is 18, not 6 (three
> values in the small table do not appear here).
>
> When calculating the selectivity:
> if (nd2 > sslot2->nvalues)
> totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);
>
> totalsel1 = 0
> nd2 = 21
> sslot2->nvalues = 2
> unmatchfreq1 = 0.99990002016420476
> otherfreq2 = 0.82608695328235626
>
> result: totalsel1 = 0.043473913749706022
> rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
>
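FWIW the arithmetic checks out: 0.99990002 * 0.82608695 / (21 - 2)
= 0.0434739..., so the bogus estimate comes almost entirely from this
one term, covering the values outside the small table's tiny MCV list.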
Attached is a script reproducing this.
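In rough outline it does something like the following (a simplified
sketch with made-up table names and only approximately the right
distributions; it won't necessarily reproduce the exact stadistinct
estimate, which depends on the sample, but the shape of the problem is
the same):

  -- big table: ~2M rows, two values covering ~99.8% of it,
  -- plus a handful of rare values the sample mostly misses
  CREATE TABLE t_big (v text);
  INSERT INTO t_big SELECT 'v22' FROM generate_series(1, 1002000);
  INSERT INTO t_big SELECT 'v23' FROM generate_series(1, 994700);
  INSERT INTO t_big SELECT 'v5'  FROM generate_series(1, 2400);
  INSERT INTO t_big SELECT 'v' || (i % 15 + 6)
    FROM generate_series(1, 900) i;

  -- tiny table: 23 rows, 21 distinct values, v1 and v2 twice each
  CREATE TABLE t_tiny (v text);
  INSERT INTO t_tiny SELECT 'v' || i FROM generate_series(1, 21) i;
  INSERT INTO t_tiny VALUES ('v1'), ('v2');

  ANALYZE t_big;
  ANALYZE t_tiny;

  -- estimated ~2M rows, actual is a few thousand
  EXPLAIN SELECT * FROM t_tiny t JOIN t_big b ON t.v = b.v;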
I think the fundamental issue here is that the most common element of
the large table, v22 (~50%), is not in the tiny one at all. IIRC the
join estimation assumes the domain of one table is a subset of the
other, and the values v22 / v23 violate that assumption, unfortunately.
Including all values of the small table in its MCV list fixes this,
because then the otherfreq on that side (otherfreq2 in the snippet
above) drops to 0.0, which simply eliminates the impact of stuff that
didn't have a match between the two MCV lists. That mitigates the
violated assumption.
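To spell it out with the numbers from the report: with all 21 distinct
values in the small table's MCV list we'd get

  sslot2->nvalues = 21 = nd2

so the (nd2 > sslot2->nvalues) branch is not taken at all, and the
0.0434739 contribution disappears (and even if it were taken, it'd be
multiplied by otherfreq2 = 0.0).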
But once the small table gets too large to fit in the MCV list
entirely, this won't work as well. It probably still helps a bit, as it
makes the otherfreq smaller. Which doesn't mean it's useless, but it's
likely a rare combination that a table is (and remains) smaller than
the MCV list, while the large table contains values without a match in
the smaller one (think foreign keys).
>
>> Basically, you're proposing a rather fundamental change in the rules
>> by which Postgres has gathered statistics for decades. You need to
>> bring some pretty substantial evidence to support that. The burden
>> of proof is on you, not on the status quo.
>>
Right. It's a good example of a "quick hack" fixing one particular
case, without considering the consequences for other cases too much.
Good as a starting point, but there's plenty of legwork left to do.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: script.sql (application/sql)