Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed Wed, 24 Mar 2021 06:36:43 -0700

On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
> seem to break the code's assumptions about varnos. This fixes it for me,
> but I need to look at it more closely.
>


I think that makes sense.

Reviewing the docs, I noticed a couple of omissions, and had a few
other suggestions (attached).

Regards,
Dean

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index dadca67..382cbd7
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7377,6 +7377,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
        <literal>m</literal> for most common values (MCV) list statistics
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stxexprs</structfield> <type>pg_node_tree</type>
+      </para>
+      <para>
+       A list of any expressions covered by this statistics object.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
@@ -7474,6 +7483,16 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
        <structname>pg_mcv_list</structname> type
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stxdexpr</structfield> <type>pg_statistic[]</type>
+      </para>
+      <para>
+       Per-expression statistics, serialized as an array of
+       <structname>pg_statistic</structname> type
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
@@ -12843,7 +12862,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   <para>
    The view <structname>pg_stats_ext</structname> provides access to
-   the information stored in the <link
+   information about each extended statistics object in the database,
+   combining information stored in the <link
    linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
    and <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
    catalogs.  This view allows access only to rows of
@@ -12930,7 +12950,16 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
        (references <link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.<structfield>attname</structfield>)
       </para>
       <para>
-       Names of the columns the extended statistics is defined on
+       Names of the columns included in the extended statistics
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>exprs</structfield> <type>text[]</type>
+      </para>
+      <para>
+       Expressions included in the extended statistics
       </para></entry>
      </row>
 
@@ -13033,7 +13062,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   <para>
    The view <structname>pg_stats_ext_exprs</structname> provides access to
-   the information stored in the <link
+   information about all expressions included in extended statistics objects,
+   combining information stored in the <link
    linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
    and <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
    catalogs.  This view allows access only to rows of
@@ -13119,7 +13149,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
        <structfield>expr</structfield> <type>text</type>
       </para>
       <para>
-       Expression the extended statistics is defined on
+       Expression included in the extended statistics
       </para></entry>
      </row>
 
diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
new file mode 100644
index 5f3aefd..f561599
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -27,7 +27,7 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
 
 CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
     [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
-    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
+    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
     FROM <replaceable class="parameter">table_name</replaceable>
 </synopsis>
 
@@ -45,12 +45,15 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
 
   <para>
    The <command>CREATE STATISTICS</command> command has two basic forms. The
-   simple variant allows building statistics for a single expression, does
-   not allow specifying any statistics kinds and provides benefits similar
-   to an expression index. The full variant allows defining statistics objects
-   on multiple columns and expressions, and selecting which statistics kinds will
-   be built. The per-expression statistics are built automatically when there
-   is at least one expression.
+   first form allows univariate statistics for a single expression to be
+   collected, providing benefits similar to an expression index without the
+   overhead of index maintenance.  This form does not allow the statistics
+   kind to be specified, since the various statistics kinds refer only to
+   multivariate statistics.  The second form of the command allows
+   multivariate statistics on multiple columns and/or expressions to be
+   collected, optionally specifying which statistics kinds to include.  This
+   form will also automatically cause univariate statistics to be collected on
+   any expressions included in the list.
   </para>
 
   <para>
@@ -93,16 +96,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
     <term><replaceable class="parameter">statistics_kind</replaceable></term>
     <listitem>
      <para>
-      A statistics kind to be computed in this statistics object.
+      A multivariate statistics kind to be computed in this statistics object.
       Currently supported kinds are
       <literal>ndistinct</literal>, which enables n-distinct statistics,
       <literal>dependencies</literal>, which enables functional
       dependency statistics, and <literal>mcv</literal> which enables
       most-common values lists.
       If this clause is omitted, all supported statistics kinds are
-      included in the statistics object. Expression statistics are built
-      automatically when the statistics definition includes complex
-      expressions and not just simple column references.
+      included in the statistics object. Univariate expression statistics are
+      built automatically if the statistics definition includes any complex
+      expressions rather than just simple column references.
       For more information, see <xref linkend="planner-stats-extended"/>
       and <xref linkend="multivariate-statistics-examples"/>.
      </para>
@@ -114,8 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
     <listitem>
      <para>
       The name of a table column to be covered by the computed statistics.
-      At least two column names must be given;  the order of the column names
-      is insignificant.
+      This is only allowed when building multivariate statistics.  At least
+      two column names or expressions must be specified, and their order is
+      not significant.
      </para>
     </listitem>
    </varlistentry>
@@ -124,9 +128,11 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
     <term><replaceable class="parameter">expression</replaceable></term>
     <listitem>
      <para>
-      The expression to be covered by the computed statistics. In this case
-      only a single expression is required, in which case only statistics
-      for the expression are built.
+      An expression to be covered by the computed statistics.  This may be
+      used to build univariate statistics on a single expression, or as part
+      of a list of multiple column names and/or expressions to build
+      multivariate statistics.  In the latter case, separate univariate
+      statistics are built automatically for each expression in the list.
      </para>
     </listitem>
    </varlistentry>
@@ -156,8 +162,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <rep
   <para>
    Expression statistics are per-expression and are similar to creating an
    index on the expression, except that they avoid the overhead of index
-   maintenance. Expression statistics are built automatically when there
-   is at least one expression in the statistics object definition.
+   maintenance. Expression statistics are built automatically for each
+   expression in the statistics object definition.
   </para>
  </refsect1>
 
@@ -232,13 +238,12 @@ EXPLAIN ANALYZE SELECT * FROM t2 WHERE (
 
   <para>
    Create table <structname>t3</structname> with a single timestamp column,
-   and run a query using an expression on that column.  Without extended
-   statistics, the planner has no information about data distribution for
-   results of those expression, and uses default estimates as illustrated
-   by the first query.  The planner also does not realize that the value of
-   the second column fully determines the value of the other column, because
-   date truncated to day still identifies the month. Then expression and
-   ndistinct statistics are built on those two columns:
+   and run queries using expressions on that column.  Without extended
+   statistics, the planner has no information about the data distribution for
+   the expressions, and uses default estimates.  The planner also does not
+   realize that the value of the date truncated to the month is fully
+   determined by the value of the date truncated to the day. Then expression
+   and ndistinct statistics are built on those two expressions:
 
 <programlisting>
 CREATE TABLE t3 (
@@ -262,7 +267,8 @@ EXPLAIN ANALYZE SELECT * FROM t3
 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
    FROM t3 GROUP BY 1, 2;
 
--- per-expression statistics are built automatically
+-- build ndistinct statistics on the pair of expressions (per-expression
+-- statistics are built automatically)
 CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
 
 ANALYZE t3;

Re: PoC/WIP: Extended statistics on expressions

Reply via email to