On Mon, Aug 9, 2021 at 8:40 PM David G. Johnston <david.g.johns...@gmail.com>
wrote:

> On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <br...@momjian.us> wrote:
>
>>
>> >         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
>> >         num_distinct1,
>> >         1/num_distinct2)
>> >                     = (1 - 0) * (1 - 0) / max(10000, 10000)
>> >                     = 0.0001
>>
>> Nice, can you provide a patch please?
>>
>>
> Change the line:
>
>
Concretely, as attached and inline.

David J.


commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <david.g.johns...@gmail.com>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose

    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).

    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>

    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):

 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1,
num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>

    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the

Attachment: v0001-doc-make-row-estimation-example-match-prose.patch
Description: Binary data

Reply via email to