Re: [HACKERS] multivariate statistics v14

Alvaro Herrera Sun, 20 Mar 2016 20:35:31 -0700

Another skim on 0002:

reference.sgml is missing a call to &alterStatistic.


ObjectProperty[] contains a comment that the ACL is "same as relation",
but is that still correct, given that now stats may be related to more
than one relation?  Do we even know what the rules for ACLs on
cross-relation stats are?  One very simple way to get around this is to
dictate that all the rels must have the same owner.  Perhaps we're not
considering the multi-relation case yet?

We have this FIXME comment in do_analyze_rel:

+        * FIXME This sample sizing is mostly OK when computing stats for
+        *       individual columns, but when computing multi-variate stats
+        *       for multivariate stats (histograms, mcv, ...) it's rather
+        *       insufficient. For stats on multiple columns / complex stats
+        *       we need larger sample sizes, because we need to build more
+        *       detailed stats (more MCV items / histogram buckets) to get
+        *       good accuracy. Maybe it'd be appropriate to use samples
+        *       proportional to the table (say, 0.5% - 1%) instead of a
+        *       fixed size might be more appropriate. Also, this should be
+        *       bound to the requested statistics size - e.g. number of MCV
+        *       items or histogram buckets should require several sample
+        *       rows per item/bucket (so the sample should be k*size).

Maybe this merits more discussion.  Right now we have an upper bound on
how much to scan for analyze; if we introduce the idea of scanning a
percentage of the relation, the time to analyze very large relations
could increase significantly.  Do we have an idea of what to do for
this?  For instance, a rule that would make me comfortable would say to
scan a sample 3x the current size when you have a mvstats on 3 columns;
then the size of fraction to scan is still bounded.  But does that
actually work?  From the wording of this comment, I assume you don't
actually know.

In this block (CreateStatistics)
+       /* look for duplicities */
+       for (i = 0; i < numcols; i++)
+               for (j = 0; j < numcols; j++)
+                       if ((i != j) && (attnums[i] == attnums[j]))
+                               ereport(ERROR,
+                                               
(errcode(ERRCODE_UNDEFINED_COLUMN),
+                                                errmsg("duplicate column name 
in statistics definition")));

isn't it easier to have the inner loop go from i+1 to numcols?


I wonder if this is sensible with multi-relation statistics:
+       /*
+        * Store a dependency too, so that statistics are dropped on DROP TABLE
+        */
+       parentobject.classId = RelationRelationId;
+       parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+       parentobject.objectSubId = 0;
+       childobject.classId = MvStatisticRelationId;
+       childobject.objectId = statoid;
+       childobject.objectSubId = 0;

I suppose the idea is to drop the stats if any of the rels they are for
is dropped.

Right after that you create a dependency on the schema.  Is that
necessary?  Since you have the dependency on the relation, the stats
would be dropped by recursion.

Why are you #include'ing builtins.h everywhere?

RelationGetMVStatList() needs a comment.

Please get rid of common.h.  It's totally unlike the way we structure
our header files.  We don't keep headers in src/backend; they're all in
src/include.  One reason is that the latter gets installed as a whole in
include/server, which this file will not be.  This file may be necessary
to build some extensions in the future, for example.

In mvstats.h, please mark function prototypes as "extern".

Many files need a pgindent pass.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics v14

Reply via email to