Hi Tomas, On 2016/10/30 4:23, Tomas Vondra wrote: > Hi, > > Attached is v20 of the multivariate statistics patch series, doing mostly > the changes outlined in the preceding e-mail from October 11. > > The patch series currently has these parts: > > * 0001 : (FIX) teach pull_varno about RestrictInfo > * 0002 : (PATCH) shared infrastructure and ndistinct coefficients > * 0003 : (PATCH) functional dependencies (only the ANALYZE part) > * 0004 : (PATCH) selectivity estimation using functional dependencies > * 0005 : (PATCH) multivariate MCV lists > * 0006 : (PATCH) multivariate histograms > * 0007 : (WIP) selectivity estimation using ndistinct coefficients > * 0008 : (WIP) use multiple statistics for estimation > * 0009 : (WIP) psql tab completion basics
Unfortunately, this failed to compile because of the duplicate_oids error. Partitioning patch consumed same OIDs as used in this patch. I will try to read the patches in some more detail, but in the meantime, here are some comments/nitpicks on the documentation: No updates to doc/src/sgml/catalogs.sgml? + <para> + The examples presented in <xref linkend="row-estimation-examples"> used + statistics about individual columns to compute selectivity estimates. + When estimating conditions on multiple columns, the planner assumes + independence and multiplies the selectivities. When the columns are + correlated, the independence assumption is violated, and the estimates + may be seriously off, resulting in poor plan choices. + </para> The term independence is used in isolation - independence of what? Independence of the distributions of values in separate columns? Also, the phrase "seriously off" could perhaps be replaced by more rigorous terminology; it might be unclear to some readers. Perhaps: wildly inaccurate, :) +<programlisting> +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1; + QUERY PLAN +------------------------------------------------------------------------------------------------- + Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual time=0.031..2.870 rows=100 loops=1) + Filter: (a = 1) + Rows Removed by Filter: 9900 + Planning time: 0.092 ms + Execution time: 3.103 ms Is there a reason why examples in "67.2. Multivariate Statistics" (like the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation Examples" (also, other relevant chapters) uses just EXPLAIN. + the final 0.01% estimate. The plan however shows that this results in + a significant under-estimate, as the actual number of rows matching the s/under-estimate/underestimate/g + <para> + For additional details about multivariate statistics, see + <filename>src/backend/utils/mvstats/README.statsc</>. There are additional + <literal>README</> for each type of statistics, mentioned in the following + sections. + </para> Referring to source tree READMEs seems novel around this portion of the documentation, but I think not too far away, there are some references. This is under the VII. Internals chapter anyway, so that might be OK. In any case, s/README.statsc/README.stats/g Also, s/additional README/additional READMEs/g (tags omitted for brevity) + used in definitions of database normal forms. When simplified, saying that + <literal>b</> is functionally dependent on <literal>a</> means that Maybe, s/When simplified/In simple terms/g + In normalized databases, only functional dependencies on primary keys + and super keys are allowed. In practice however many data sets are not + fully normalized, for example thanks to intentional denormalization for + performance reasons. The table <literal>t</> is an example of a data + with functional dependencies. As <literal>a=b</> for all rows in the + table, <literal>a</> is functionally dependent on <literal>b</> and + <literal>b</> is functionally dependent on <literal>a</literal>. "super keys" sounds like a new term. s/for example thanks to/for example, thanks to/g (or due to instead of thanks to) How about: s/an example of a data with/an example of a schema with/g Perhaps, s/a=b/a = b/g (additional white space) + Similarly to per-column statistics, multivariate statistics are stored in I notice that "similar to" is used more often than "similarly to". But that might be OK. + This shows that the statistics is defined on table <structname>t</>, Perhaps: the statistics is -> the statistics are or the statistic is + lists <structfield>attnums</structfield> of the columns (references + <structname>pg_attribute</structname>). While this text may be OK on the catalog description page, it might be better to expand attnums here as "attribute numbers" dropping the parenthesized phrase altogether. +<programlisting> +SELECT pg_mv_stats_dependencies_show(stadeps) + FROM pg_mv_statistic WHERE staname = 's1'; + + pg_mv_stats_dependencies_show +------------------------------- + (1) => 2, (2) => 1 +(1 row) +</programlisting> Couldn't this somehow show actual column names, instead of attribute numbers? Will read more later. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers