Hi Enrique,
On 5/18/26 19:09, Enrique Sánchez wrote:
Postgres only uses multi-column MCVs when the value we are looking for
is in the list. If not, it falls back into individual independent
statistics to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the
combination's frequency is strictly bounded by the frequency of the
last (least common) item in that MCV list.
LGTM. If the multicolumn MCV statistics exists and the clause
combination is absent from the MCV-list, we can use the least frequent
MCV item as an upper bound. BTW, this only applies to AND-clauses.
2. Estimate selectivity as Postgres does for single-column values not
in MCVs
=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:
> The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the MCVs:
> That is, add up all the frequencies for the MCVs and subtract them
from one, then divide by the number of other distinct values.
To achieve this, we need to store an ndistinct estimation alongside
the MCVs that can be used for partial or entire column match.
P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)
...
I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV
list itself, and the O(1) arithmetic during planning adds no
measurable overhead. I look forward to your feedback before drafting a
patch.
For this, the ndistinct extended statistics already exist. If both MCV
and ndistinct statistics are present on the same column set, the formula
is correct. There are already places in the code that compute ndistinct
for columns without extended ndistinct statistics (see
estimate_num_groups) - but it is worth thinking carefully about whether
the added complexity is justified before going down that path.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/