Hello,

Patch 0001 needed its OIDs adjusted after my patch was committed;
the attached version is compatible with current master.

I tried it as follows, and got the following error during ANALYZE.
Unfortunately, I don't have enough time to investigate it now.

postgres=# create table t1 (a int, b int, c int);
postgres=# insert into t1 (select a / 10000, a / 10000, a / 10000
           from generate_series(0, 99999) a);
postgres=# analyze t1;
ERROR:  invalid memory alloc request size 1485176862

regards,


At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra <tomas.von...@2ndquadrant.com> 
wrote in <54c3fed3.1060...@2ndquadrant.com>
> Hi,
> 
> attached is an updated version of the multivariate stats patch. This is
> going to be a bit longer mail, so I'll put here a small ToC ;-)
> 
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
> 
> The motivation and design ideas, explained in the first message of this
> thread are still valid. It might be a good idea to read it first:
> 
>   http://www.postgresql.org/message-id/flat/543afa15.4080...@fuzzy.cz
> 
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
> into the patch in person, or discuss the patch in general.
> 
> 
> 1) Patch split into 4 parts
> ---------------------------
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
> 
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
> 
>    - infrastructure, shared by all the kinds of stats added
>      in the following patches (catalog, ALTER TABLE, ANALYZE ...)
> 
>    - implementation of a simple statistics type, tracking functional
>      dependencies between columns (previously called "associative
>      rules", but that name is incorrect for several reasons)
> 
>    - this does not modify the optimizer in any way
> 2) 0002-clause-reduction-using-functional-dependencies.patch
> 
>    - applies the functional dependencies to optimizer (i.e. considers
>      the rules in clauselist_selectivity())
> 
> 3) 0003-multivariate-MCV-lists.patch
> 
>    - multivariate MCV lists (both ANALYZE and optimizer parts)
> 
> 4) 0004-multivariate-histograms.patch
> 
>    - multivariate histograms (both ANALYZE and optimizer parts)
> 
> 
> You may look at the patches at github here:
> 
>   https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
> 
> The branch is not stable, i.e. I'll rebase / squash / force-push changes
> in the future. (There's also multivariate-stats development branch with
> unsquashed changes, but you don't want to look at that, trust me.)
> 
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals and
> implementation details.
> 
> 
> 2) Where to start / documentation
> ---------------------------------
> I strove to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
> 
> Anyway, I'd suggest reading the first e-mail in this thread, explaining
> the ideas, and then these comments:
> 
> 1) functional dependencies (patch 0001)
>    - src/backend/utils/mvstats/dependencies.c
> 
> 2) MCV lists (patch 0003)
>    - src/backend/utils/mvstats/mcv.c
> 
> 3) histograms (patch 0004)
>    - src/backend/utils/mvstats/histogram.c
> 
>    - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
>    - also see clauselist_mv_selectivity_histogram() in clausesel.c
> 
> 4) selectivity estimation (patches 0002-0004)
>    - all in src/backend/optimizer/path/clausesel.c
>    - clauselist_selectivity() - overview of how the stats are applied
>    - clauselist_apply_dependencies() - functional dependencies reduction
>    - clauselist_mv_selectivity_mcvlist() - MCV list estimation
>    - clauselist_mv_selectivity_histogram() - histogram estimation
> 
> 
> 3) State of the code
> --------------------
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or similar issues, I know parts of the
> code need a bit more love.
> 
> The part most in need of improvements / comments is probably the code in
> clausesel.c - that seems a bit quirky. Reviews / comments regarding this
> part of the code are very welcome - I'm sure there are many ways to
> improve this part.
> 
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that I
> know how to address (at least I believe so).
> 
> 
> 4) Main changes/improvements
> ----------------------------
> There are many significant improvements. The previous patch version was
> in the 'proof of concept' category (missing pieces, knowingly broken in
> some areas), the current patch should 'mostly work'.
> 
> The patch fixes the three most annoying limitations of the first version:
> 
>   (a) support for all data types (not just those passed by value)
>   (b) handles NULL values properly
>   (c) adds support for IS [NOT] NULL clauses
> 
> Aside from that the code was significantly improved, there are proper
> regression tests and plenty of comments explaining the details.
> 
> 
> 5) Remaining limitations
> ------------------------
> 
>   (a) limited to stats on 8 columns
> 
>       This is mostly just a 'safeguard' restriction.
> 
>   (b) only data types with '<' operator
> 
>       I don't think this will change anytime soon, because all the
>       algorithms for building the stats rely on this. I don't see
>       this as a serious limitation though.
> 
>   (c) not handling DROP COLUMN or DROP TABLE and so on
> 
>       Currently this is not handled at all (so the regression tests
>       do an explicit DELETE from the pg_mv_statistic catalog).
> 
>       Handling DROP TABLE won't be difficult, as it's similar to the
>       current stats. Handling ALTER TABLE ... DROP COLUMN will be much
>       trickier, I guess - should we drop all the stats referencing
>       that column, just remove the column from the stats, or keep it
>       and treat it as NULL? I'm not sure what the best solution is.
> 
>   (d) limited list of compatible WHERE clauses
> 
>       The initial patch handled only simple operator clauses
> 
>           (Var op Constant)
> 
>       where the operator is one of ('<', '<=', '=', '>=', '>'). Now it
>       also handles IS [NOT] NULL clauses. Adding more clause types
>       should not be overly difficult - starting with more traditional
>       'BooleanTest' conditions, or even multi-column conditions
>
>           (Var op Var)
>
>       which are difficult to estimate using single-column stats.
> 
>   (e) optimizer uses single stats per table
> 
>       This is still true and I don't think this will change soon. I do
>       have some ideas on how to merge multiple stats etc. but it's
>       certainly complex stuff, unlikely to happen within this CF. The
>       patch makes a lot of sense even without this particular feature,
>       because you can create multiple stats, each suitable for different
>       queries.
> 
>   (f) no JOIN conditions
> 
>       Similarly to the previous point, it's on the TODO but it's not
>       going to happen in this CF.
> 
> 
> kind regards
> 
> -- 
> Tomas Vondra                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 9ebfadb5d6cd9b55dd2707cfc8c789884dafa7fa Mon Sep 17 00:00:00 2001
From: Tomas Vondra <t...@fuzzy.cz>
Date: Sun, 11 Jan 2015 19:51:48 +0100
Subject: [PATCH 1/4] shared infrastructure and functional dependencies

Basic infrastructure shared by all kinds of multivariate
stats, most importantly:

- adds a new system catalog (pg_mv_statistic)
- ALTER TABLE ... ADD STATISTICS syntax
- implementation of functional dependencies (the simplest
  type of multivariate statistics)
- building functional dependencies in ANALYZE
- updates regression tests (new catalog etc.)

This does not include any changes to the optimizer, i.e.
it does not influence the query planning.

FIX: invalid assert in lookup_var_attr_stats()

The current implementation requires a valid 'ltopr'
so that we can sort the sample rows in various ways,
and the assert verified this by checking that the
function is 'compute_scalar_stats'. That is, however,
a private function in analyze.c, so the check failed
after moving the code into common.c.

Fixed by checking the 'ltopr' operator directly.
Eventually this check will be removed, as ltopr is
only needed for histograms (functional dependencies
and MCV lists may be built without it).

FIX: improved comments about functional dependencies
FIX: add magic (MVSTAT_DEPS_MAGIC) into MVDependencies
FIX: improved analysis of functional dependencies

Changes:

- decreased minimum group size
- count contradicting rows ('not supporting' ones)

The algorithm is still rather simple and probably needs
other improvements.

FIX: add pg_mv_stats_dependencies_show() function

This function actually prints the rules, not just some basic
info (number of rules) as pg_mv_stats_dependencies_info() does.

FIX: (dependencies != NULL) in pg_mv_stats_dependencies_info()

STRICT is not a solution, because the deserialization may fail
for some reason (corrupted data, ...)

FIX: rename 'associative rules' to 'functional dependencies'

It's a more appropriate name, as functional dependencies,
as defined in relational theory (esp. normal forms), track
column-level dependencies.

Associative (more correctly 'association') rules track
dependencies between particular values, not necessarily
in different columns (market basket analysis).

Also, did a bunch of comment improvements, minor fixes.

This does not include changes in clausesel.c!

FIX: remove obsolete Assert() enforcing typbyval types
---
 src/backend/catalog/Makefile               |   1 +
 src/backend/catalog/system_views.sql       |  10 +
 src/backend/commands/analyze.c             |  17 +-
 src/backend/commands/tablecmds.c           | 149 +++++++-
 src/backend/nodes/copyfuncs.c              |  15 +-
 src/backend/parser/gram.y                  |  67 +++-
 src/backend/utils/Makefile                 |   2 +-
 src/backend/utils/cache/syscache.c         |  12 +
 src/backend/utils/mvstats/Makefile         |  17 +
 src/backend/utils/mvstats/common.c         | 272 ++++++++++++++
 src/backend/utils/mvstats/common.h         |  70 ++++
 src/backend/utils/mvstats/dependencies.c   | 554 +++++++++++++++++++++++++++++
 src/include/catalog/indexing.h             |   5 +
 src/include/catalog/pg_mv_statistic.h      |  69 ++++
 src/include/catalog/pg_proc.h              |   5 +
 src/include/catalog/toasting.h             |   1 +
 src/include/nodes/nodes.h                  |   1 +
 src/include/nodes/parsenodes.h             |  11 +-
 src/include/utils/mvstats.h                |  86 +++++
 src/include/utils/syscache.h               |   1 +
 src/test/regress/expected/rules.out        |   8 +
 src/test/regress/expected/sanity_check.out |   1 +
 22 files changed, 1365 insertions(+), 9 deletions(-)
 create mode 100644 src/backend/utils/mvstats/Makefile
 create mode 100644 src/backend/utils/mvstats/common.c
 create mode 100644 src/backend/utils/mvstats/common.h
 create mode 100644 src/backend/utils/mvstats/dependencies.c
 create mode 100644 src/include/catalog/pg_mv_statistic.h
 create mode 100644 src/include/utils/mvstats.h

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a403c64..d6c16f8 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -32,6 +32,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
 	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
+	pg_mv_statistic.h \
 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
 	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2800f73..d05a716 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -150,6 +150,16 @@ CREATE VIEW pg_indexes AS
          LEFT JOIN pg_tablespace T ON (T.oid = I.reltablespace)
     WHERE C.relkind IN ('r', 'm') AND I.relkind = 'i';
 
+CREATE VIEW pg_mv_stats AS
+    SELECT
+        N.nspname AS schemaname,
+        C.relname AS tablename,
+        S.stakeys AS attnums,
+        length(S.stadeps) as depsbytes,
+        pg_mv_stats_dependencies_info(S.stadeps) as depsinfo
+    FROM (pg_mv_statistic S JOIN pg_class C ON (C.oid = S.starelid))
+        LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace);
+
 CREATE VIEW pg_stats AS
     SELECT
         nspname AS schemaname,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 75b45f7..da98d54 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -27,6 +27,7 @@
 #include "catalog/indexing.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits_fn.h"
+#include "catalog/pg_mv_statistic.h"
 #include "catalog/pg_namespace.h"
 #include "commands/dbcommands.h"
 #include "commands/tablecmds.h"
@@ -54,7 +55,11 @@
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
+#include "utils/fmgroids.h"
+#include "utils/builtins.h"
 
+#include "utils/mvstats.h"
+#include "access/sysattr.h"
 
 /* Data structure for Algorithm S from Knuth 3.4.2 */
 typedef struct
@@ -110,7 +115,6 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
-
 /*
  *	analyze_rel() -- analyze one relation
  */
@@ -472,6 +476,13 @@ do_analyze_rel(Relation onerel, int options, List *va_cols,
 	 * all analyzable columns.  We use a lower bound of 100 rows to avoid
 	 * possible overflow in Vitter's algorithm.  (Note: that will also be the
 	 * target in the corner case where there are no analyzable columns.)
+	 *
+	 * FIXME This sample sizing is mostly OK when computing stats for
+	 *       individual columns, but when computing multivariate stats
+	 *       (histograms, MCV lists, ...) it's rather insufficient.
+	 *       For a small number of dimensions it works, but for complex
+	 *       stats it'd be nice to use a sample proportional to the
+	 *       table size (say, 0.5% - 1%) instead of a fixed size.
 	 */
 	targrows = 100;
 	for (i = 0; i < attr_cnt; i++)
@@ -574,6 +585,9 @@ do_analyze_rel(Relation onerel, int options, List *va_cols,
 			update_attstats(RelationGetRelid(Irel[ind]), false,
 							thisdata->attr_cnt, thisdata->vacattrstats);
 		}
+
+		/* Build multivariate stats (if there are any). */
+		build_mv_stats(onerel, numrows, rows, attr_cnt, vacattrstats);
 	}
 
 	/*
@@ -2825,3 +2839,4 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 623e6bf..0df7f03 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -35,6 +35,7 @@
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_inherits_fn.h"
+#include "catalog/pg_mv_statistic.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_tablespace.h"
@@ -92,7 +93,7 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 #include "utils/typcache.h"
-
+#include "utils/mvstats.h"
 
 /*
  * ON COMMIT action list
@@ -140,8 +141,9 @@ static List *on_commits = NIL;
 #define AT_PASS_ADD_COL			5		/* ADD COLUMN */
 #define AT_PASS_ADD_INDEX		6		/* ADD indexes */
 #define AT_PASS_ADD_CONSTR		7		/* ADD constraints, defaults */
-#define AT_PASS_MISC			8		/* other stuff */
-#define AT_NUM_PASSES			9
+#define AT_PASS_ADD_STATS		8		/* ADD statistics */
+#define AT_PASS_MISC			9		/* other stuff */
+#define AT_NUM_PASSES			10
 
 typedef struct AlteredTableInfo
 {
@@ -416,7 +418,8 @@ static void ATExecReplicaIdentity(Relation rel, ReplicaIdentityStmt *stmt, LOCKM
 static void ATExecGenericOptions(Relation rel, List *options);
 static void ATExecEnableRowSecurity(Relation rel);
 static void ATExecDisableRowSecurity(Relation rel);
-
+static void ATExecAddStatistics(AlteredTableInfo *tab, Relation rel,
+								StatisticsDef *def, LOCKMODE lockmode);
 static void copy_relation_data(SMgrRelation rel, SMgrRelation dst,
 				   ForkNumber forkNum, char relpersistence);
 static const char *storage_name(char c);
@@ -2989,6 +2992,7 @@ AlterTableGetLockLevel(List *cmds)
 				 * updates.
 				 */
 			case AT_SetStatistics:		/* Uses MVCC in getTableAttrs() */
+			case AT_AddStatistics:		/* XXX not sure if the right level */
 			case AT_ClusterOn:	/* Uses MVCC in getIndexes() */
 			case AT_DropCluster:		/* Uses MVCC in getIndexes() */
 			case AT_SetOptions:	/* Uses MVCC in getTableAttrs() */
@@ -3145,6 +3149,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			pass = AT_PASS_ADD_CONSTR;
 			break;
 		case AT_SetStatistics:	/* ALTER COLUMN SET STATISTICS */
+		case AT_AddStatistics:	/* XXX maybe not the right place */
 			ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode);
 			/* Performs own permission checks */
 			ATPrepSetStatistics(rel, cmd->name, cmd->def, lockmode);
@@ -3440,6 +3445,9 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		case AT_SetStatistics:	/* ALTER COLUMN SET STATISTICS */
 			ATExecSetStatistics(rel, cmd->name, cmd->def, lockmode);
 			break;
+		case AT_AddStatistics:		/* ADD STATISTICS */
+			ATExecAddStatistics(tab, rel, (StatisticsDef *) cmd->def, lockmode);
+			break;
 		case AT_SetOptions:		/* ALTER COLUMN SET ( options ) */
 			ATExecSetOptions(rel, cmd->name, cmd->def, false, lockmode);
 			break;
@@ -11638,3 +11646,136 @@ RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid, Oid oldrelid,
 
 	ReleaseSysCache(tuple);
 }
+
+/* used for sorting the attnums in ATExecAddStatistics */
+static int compare_int16(const void *a, const void *b)
+{
+	return memcmp(a, b, sizeof(int16));
+}
+
+/*
+ * Implements the ALTER TABLE ... ADD STATISTICS (options) ON (columns).
+ *
+ * The code is an unholy mix of pieces that really belong to other parts
+ * of the source tree.
+ *
+ * FIXME Check that the types are pass-by-value and support sort,
+ *       although maybe we can live without the sort (and only build
+ *       MCV list / association rules).
+ *
+ * FIXME This should probably check for duplicate stats (i.e. same
+ *       keys, same options). Although maybe it's useful to have
+ *       multiple stats on the same columns with different options
+ *       (say, a detailed MCV-only stats for some queries, histogram
+ *       for others, etc.)
+ */
+static void ATExecAddStatistics(AlteredTableInfo *tab, Relation rel,
+						StatisticsDef *def, LOCKMODE lockmode)
+{
+	int			i, j;
+	ListCell   *l;
+	int16		attnums[INDEX_MAX_KEYS];
+	int			numcols = 0;
+
+	HeapTuple	htup;
+	Datum		values[Natts_pg_mv_statistic];
+	bool		nulls[Natts_pg_mv_statistic];
+	int2vector *stakeys;
+	Relation	mvstatrel;
+
+	/* by default build everything */
+	bool 	build_dependencies = true;
+
+	Assert(IsA(def, StatisticsDef));
+
+	/* transform the column names to attnum values */
+
+	foreach(l, def->keys)
+	{
+		char	   *attname = strVal(lfirst(l));
+		HeapTuple	atttuple;
+
+		atttuple = SearchSysCacheAttName(RelationGetRelid(rel), attname);
+
+		if (!HeapTupleIsValid(atttuple))
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_COLUMN),
+					 errmsg("column \"%s\" referenced in statistics does not exist",
+							attname)));
+
+		/* more than MVSTATS_MAX_DIMENSIONS columns not allowed */
+		if (numcols >= MVSTATS_MAX_DIMENSIONS)
+			ereport(ERROR,
+					(errcode(ERRCODE_TOO_MANY_COLUMNS),
+					 errmsg("cannot have more than %d keys in a statistics",
+							MVSTATS_MAX_DIMENSIONS)));
+
+		attnums[numcols] = ((Form_pg_attribute) GETSTRUCT(atttuple))->attnum;
+		ReleaseSysCache(atttuple);
+		numcols++;
+	}
+
+	/*
+	 * Check the lower bound (at least 2 columns), the upper bound was
+	 * already checked in the loop.
+	 */
+	if (numcols < 2)
+			ereport(ERROR,
+					(errcode(ERRCODE_TOO_MANY_COLUMNS),
+					 errmsg("multivariate stats require 2 or more columns")));
+
+	/* look for duplicates */
+	for (i = 0; i < numcols; i++)
+		for (j = 0; j < numcols; j++)
+			if ((i != j) && (attnums[i] == attnums[j]))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("duplicate column name in statistics definition")));
+
+	/* parse the statistics options */
+	foreach (l, def->options)
+	{
+		DefElem *opt = (DefElem*)lfirst(l);
+
+		if (strcmp(opt->defname, "dependencies") == 0)
+			build_dependencies = defGetBoolean(opt);
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unrecognized STATISTICS option \"%s\"",
+							opt->defname)));
+	}
+
+	/* sort the attnums and build int2vector */
+	qsort(attnums, numcols, sizeof(int16), compare_int16);
+	stakeys = buildint2vector(attnums, numcols);
+
+	/*
+	 * Okay, let's create the pg_mv_statistic entry.
+	 */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+
+	/* no stats collected yet, so just the keys */
+	values[Anum_pg_mv_statistic_starelid-1] = ObjectIdGetDatum(RelationGetRelid(rel));
+
+	values[Anum_pg_mv_statistic_stakeys -1] = PointerGetDatum(stakeys);
+	values[Anum_pg_mv_statistic_deps_enabled -1] = BoolGetDatum(build_dependencies);
+
+	nulls[Anum_pg_mv_statistic_stadeps -1] = true;
+
+	/* insert the tuple into pg_mv_statistic */
+	mvstatrel = heap_open(MvStatisticRelationId, RowExclusiveLock);
+
+	htup = heap_form_tuple(mvstatrel->rd_att, values, nulls);
+
+	simple_heap_insert(mvstatrel, htup);
+
+	CatalogUpdateIndexes(mvstatrel, htup);
+
+	heap_freetuple(htup);
+
+	heap_close(mvstatrel, RowExclusiveLock);
+
+	return;
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 029761e..df230d6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3918,6 +3918,17 @@ _copyAlterPolicyStmt(const AlterPolicyStmt *from)
 	return newnode;
 }
 
+static StatisticsDef *
+_copyStatisticsDef(const StatisticsDef *from)
+{
+	StatisticsDef  *newnode = makeNode(StatisticsDef);
+
+	COPY_NODE_FIELD(keys);
+	COPY_NODE_FIELD(options);
+
+	return newnode;
+}
+
 /* ****************************************************************
  *					pg_list.h copy functions
  * ****************************************************************
@@ -4744,7 +4755,9 @@ copyObject(const void *from)
 		case T_RoleSpec:
 			retval = _copyRoleSpec(from);
 			break;
-
+		case T_StatisticsDef:
+			retval = _copyStatisticsDef(from);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(from));
 			retval = 0;			/* keep compiler quiet */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 82405b9..0346a00 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -367,6 +367,13 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				create_generic_options alter_generic_options
 				relation_expr_list dostmt_opt_list
 
+%type <list>	OptStatsOptions 
+%type <str>		stats_options_name
+%type <node>	stats_options_arg
+%type <defelt>	stats_options_elem
+%type <list>	stats_options_list
+
+
 %type <list>	opt_fdw_options fdw_options
 %type <defelt>	fdw_option
 
@@ -486,7 +493,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <keyword> unreserved_keyword type_func_name_keyword
 %type <keyword> col_name_keyword reserved_keyword
 
-%type <node>	TableConstraint TableLikeClause
+%type <node>	TableConstraint TableLikeClause TableStatistics
 %type <ival>	TableLikeOptionList TableLikeOption
 %type <list>	ColQualList
 %type <node>	ColConstraint ColConstraintElem ConstraintAttr
@@ -2311,6 +2318,14 @@ alter_table_cmd:
 					n->subtype = AT_DisableRowSecurity;
 					$$ = (Node *)n;
 				}
+			/* ALTER TABLE <name> ADD STATISTICS (options) ON (columns) ... */
+			| ADD_P TableStatistics
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_AddStatistics;
+					n->def = $2;
+					$$ = (Node *)n;
+				}
 			| alter_generic_options
 				{
 					AlterTableCmd *n = makeNode(AlterTableCmd);
@@ -3381,6 +3396,56 @@ OptConsTableSpace:   USING INDEX TABLESPACE name	{ $$ = $4; }
 ExistingIndex:   USING INDEX index_name				{ $$ = $3; }
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY :
+ *				ALTER TABLE relname ADD STATISTICS (options) ON (columns)
+ *
+ *****************************************************************************/
+
+TableStatistics:
+			STATISTICS OptStatsOptions ON '(' columnList ')'
+				{
+					StatisticsDef *n = makeNode(StatisticsDef);
+					n->keys  = $5;
+					n->options  = $2;
+					$$ = (Node *) n;
+				}
+		;
+
+OptStatsOptions:
+			'(' stats_options_list ')'		{ $$ = $2; }
+			| /*EMPTY*/						{ $$ = NIL; }
+		;
+
+stats_options_list:
+			stats_options_elem
+				{
+					$$ = list_make1($1);
+				}
+			| stats_options_list ',' stats_options_elem
+				{
+					$$ = lappend($1, $3);
+				}
+		;
+
+stats_options_elem:
+			stats_options_name stats_options_arg
+				{
+					$$ = makeDefElem($1, $2);
+				}
+		;
+
+stats_options_name:
+			NonReservedWord			{ $$ = $1; }
+		;
+
+stats_options_arg:
+			opt_boolean_or_string	{ $$ = (Node *) makeString($1); }
+			| NumericOnly			{ $$ = (Node *) $1; }
+			| /* EMPTY */			{ $$ = NULL; }
+		;
+
 
 /*****************************************************************************
  *
diff --git a/src/backend/utils/Makefile b/src/backend/utils/Makefile
index 8374533..eba0352 100644
--- a/src/backend/utils/Makefile
+++ b/src/backend/utils/Makefile
@@ -9,7 +9,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS        = fmgrtab.o
-SUBDIRS     = adt cache error fmgr hash init mb misc mmgr resowner sort time
+SUBDIRS     = adt cache error fmgr hash init mb misc mmgr mvstats resowner sort time
 
 # location of Catalog.pm
 catalogdir  = $(top_srcdir)/src/backend/catalog
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index bd27168..f61ef7e 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_language.h"
+#include "catalog/pg_mv_statistic.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_operator.h"
@@ -499,6 +500,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		4
 	},
+	{MvStatisticRelationId,		/* MVSTATOID */
+		MvStatisticOidIndexId,
+		1,
+		{
+			ObjectIdAttributeNumber,
+			0,
+			0,
+			0
+		},
+		128
+	},
 	{NamespaceRelationId,		/* NAMESPACENAME */
 		NamespaceNameIndexId,
 		1,
diff --git a/src/backend/utils/mvstats/Makefile b/src/backend/utils/mvstats/Makefile
new file mode 100644
index 0000000..099f1ed
--- /dev/null
+++ b/src/backend/utils/mvstats/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for utils/mvstats
+#
+# IDENTIFICATION
+#    src/backend/utils/mvstats/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/utils/mvstats
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = common.o dependencies.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mvstats/common.c b/src/backend/utils/mvstats/common.c
new file mode 100644
index 0000000..36757d5
--- /dev/null
+++ b/src/backend/utils/mvstats/common.c
@@ -0,0 +1,272 @@
+/*-------------------------------------------------------------------------
+ *
+ * common.c
+ *	  POSTGRES multivariate statistics
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/mvstats/common.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "common.h"
+
+/*
+ * Compute requested multivariate stats, using the rows sampled for the
+ * plain (single-column) stats.
+ *
+ * This fetches a list of stats from pg_mv_statistic, computes the stats
+ * and serializes them back into the catalog (as bytea values).
+ */
+void
+build_mv_stats(Relation onerel, int numrows, HeapTuple *rows,
+			   int natts, VacAttrStats **vacattrstats)
+{
+	int i;
+	MVStats mvstats;
+	int		nmvstats;
+
+	/*
+	 * Fetch defined MV groups from pg_mv_statistic, and then compute
+	 * the MV statistics (histograms for now).
+	 */
+	mvstats = list_mv_stats(RelationGetRelid(onerel), &nmvstats, false);
+
+	for (i = 0; i < nmvstats; i++)
+	{
+		MVDependencies	deps  = NULL;
+
+		/* int2 vector of attnums the stats should be computed on */
+		int2vector * attrs = mvstats[i].stakeys;
+
+		/* check allowed number of dimensions */
+		Assert((attrs->dim1 >= 2) && (attrs->dim1 <= MVSTATS_MAX_DIMENSIONS));
+
+		/*
+		 * Analyze functional dependencies of columns.
+		 */
+		deps = build_mv_dependencies(numrows, rows, attrs, natts, vacattrstats);
+
+		/* store the histogram / MCV list in the catalog */
+		update_mv_stats(mvstats[i].mvoid, deps);
+	}
+}
+
+/*
+ * Lookup the VacAttrStats info for the selected columns, with indexes
+ * matching the attrs vector (to make it easy to work with when
+ * computing multivariate stats).
+ */
+VacAttrStats **
+lookup_var_attr_stats(int2vector *attrs, int natts, VacAttrStats **vacattrstats)
+{
+	int i, j;
+	int numattrs = attrs->dim1;
+	VacAttrStats **stats = (VacAttrStats**)palloc0(numattrs * sizeof(VacAttrStats*));
+
+	/* lookup VacAttrStats info for the requested columns (same attnum) */
+	for (i = 0; i < numattrs; i++)
+	{
+		stats[i] = NULL;
+		for (j = 0; j < natts; j++)
+		{
+			if (attrs->values[i] == vacattrstats[j]->tupattnum)
+			{
+				stats[i] = vacattrstats[j];
+				break;
+			}
+		}
+
+		/*
+		 * Check that we found the info, that the attnum matches
+		 * and that the requested 'lt' operator is available (we
+		 * no longer require pass-by-value types).
+		 */
+		Assert(stats[i] != NULL);
+		Assert(stats[i]->tupattnum == attrs->values[i]);
+
+		/* FIXME This is rather ugly way to check for 'ltopr' (which
+		 *       is defined for 'scalar' attributes).
+		 */
+		Assert(((StdAnalyzeData *)stats[i]->extra_data)->ltopr != InvalidOid);
+	}
+
+	return stats;
+}
+
+/*
+ * Fetch list of MV stats defined on a table, without the actual data
+ * for histograms, MCV lists etc.
+ */
+MVStats
+list_mv_stats(Oid relid, int *nstats, bool built_only)
+{
+	Relation	indrel;
+	SysScanDesc indscan;
+	ScanKeyData skey;
+	HeapTuple	htup;
+	MVStats		result;
+
+	/* start with 16 items, that should be enough for most cases */
+	int maxitems = 16;
+	result = (MVStats)palloc0(sizeof(MVStatsData) * maxitems);
+	*nstats = 0;
+
+	/* Prepare to scan pg_mv_statistic for entries having indrelid = this rel. */
+	ScanKeyInit(&skey,
+				Anum_pg_mv_statistic_starelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	indrel = heap_open(MvStatisticRelationId, AccessShareLock);
+	indscan = systable_beginscan(indrel, MvStatisticRelidIndexId, true,
+								 NULL, 1, &skey);
+
+	while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+	{
+		Form_pg_mv_statistic stats = (Form_pg_mv_statistic) GETSTRUCT(htup);
+
+		/*
+		 * Skip statistics that were not computed yet (if only stats
+		 * that were already built were requested)
+		 */
+		if (built_only && (! stats->deps_built))
+			continue;
+
+		/* double the array size if needed */
+		if (*nstats == maxitems)
+		{
+			maxitems *= 2;
+			result = (MVStats)repalloc(result, sizeof(MVStatsData) * maxitems);
+		}
+
+		result[*nstats].mvoid = HeapTupleGetOid(htup);
+		result[*nstats].stakeys = buildint2vector(stats->stakeys.values, stats->stakeys.dim1);
+		result[*nstats].deps_built = stats->deps_built;
+		*nstats += 1;
+	}
+
+	systable_endscan(indscan);
+
+	heap_close(indrel, AccessShareLock);
+
+	/* TODO maybe save the list into relcache, as in RelationGetIndexList
+	 *      (which was used as an inspiration of this one)?. */
+
+	return result;
+}
+
+void
+update_mv_stats(Oid mvoid, MVDependencies dependencies)
+{
+	HeapTuple	stup,
+				oldtup;
+	Datum		values[Natts_pg_mv_statistic];
+	bool		nulls[Natts_pg_mv_statistic];
+	bool		replaces[Natts_pg_mv_statistic];
+
+	Relation	sd = heap_open(MvStatisticRelationId, RowExclusiveLock);
+
+	memset(nulls,    1, Natts_pg_mv_statistic * sizeof(bool));
+	memset(replaces, 0, Natts_pg_mv_statistic * sizeof(bool));
+	memset(values,   0, Natts_pg_mv_statistic * sizeof(Datum));
+
+	/*
+	 * Construct a new pg_mv_statistic tuple - replace only the
+	 * functional dependencies, depending on whether they were
+	 * actually computed.
+	 */
+	if (dependencies != NULL)
+	{
+		nulls[Anum_pg_mv_statistic_stadeps - 1] = false;
+		values[Anum_pg_mv_statistic_stadeps - 1]
+			= PointerGetDatum(serialize_mv_dependencies(dependencies));
+	}
+
+	/* always replace the value (either by bytea or NULL) */
+	replaces[Anum_pg_mv_statistic_stadeps - 1] = true;
+
+	/* always change the availability flag */
+	nulls[Anum_pg_mv_statistic_deps_built - 1] = false;
+	replaces[Anum_pg_mv_statistic_deps_built - 1] = true;
+	values[Anum_pg_mv_statistic_deps_built - 1] = BoolGetDatum(dependencies != NULL);
+
+	/* Is there already a pg_mv_statistic tuple for this statistics entry? */
+	oldtup = SearchSysCache1(MVSTATOID,
+							 ObjectIdGetDatum(mvoid));
+
+	if (HeapTupleIsValid(oldtup))
+	{
+		/* Yes, replace it */
+		stup = heap_modify_tuple(oldtup,
+								 RelationGetDescr(sd),
+								 values,
+								 nulls,
+								 replaces);
+		ReleaseSysCache(oldtup);
+		simple_heap_update(sd, &stup->t_self, stup);
+	}
+	else
+		elog(ERROR, "invalid pg_mv_statistic record (oid=%u)", mvoid);
+
+	/* update indexes too */
+	CatalogUpdateIndexes(sd, stup);
+
+	heap_freetuple(stup);
+
+	heap_close(sd, RowExclusiveLock);
+}
+
+/* multi-variate stats comparator */
+
+/*
+ * qsort_arg comparator for sorting Datums (MV stats)
+ *
+ * This does not maintain the tupnoLink array.
+ */
+int
+compare_scalars_simple(const void *a, const void *b, void *arg)
+{
+	Datum		da = *(Datum*)a;
+	Datum		db = *(Datum*)b;
+	SortSupport ssup = (SortSupport) arg;
+
+	return ApplySortComparator(da, false, db, false, ssup);
+}
+
+/*
+ * qsort_arg comparator for sorting data when partitioning a MV bucket
+ */
+int
+compare_scalars_partition(const void *a, const void *b, void *arg)
+{
+	Datum		da = ((ScalarItem*)a)->value;
+	Datum		db = ((ScalarItem*)b)->value;
+	SortSupport ssup = (SortSupport) arg;
+
+	return ApplySortComparator(da, false, db, false, ssup);
+}
+
+/*
+ * qsort_arg comparator for sorting Datum[] (row of Datums) when
+ * counting distinct values.
+ */
+int
+compare_scalars_memcmp(const void *a, const void *b, void *arg)
+{
+	Size		len = *(Size*)arg;
+
+	return memcmp(a, b, len);
+}
+
+int
+compare_scalars_memcmp_2(const void *a, const void *b)
+{
+	return memcmp(a, b, sizeof(Datum));
+}
diff --git a/src/backend/utils/mvstats/common.h b/src/backend/utils/mvstats/common.h
new file mode 100644
index 0000000..f511c4e
--- /dev/null
+++ b/src/backend/utils/mvstats/common.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * common.h
+ *	  POSTGRES multivariate statistics
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/mvstats/common.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/tuptoaster.h"
+#include "catalog/indexing.h"
+#include "catalog/pg_collation.h"
+#include "catalog/pg_mv_statistic.h"
+#include "foreign/fdwapi.h"
+#include "postmaster/autovacuum.h"
+#include "storage/lmgr.h"
+#include "utils/datum.h"
+#include "utils/sortsupport.h"
+#include "utils/syscache.h"
+#include "utils/fmgroids.h"
+#include "utils/builtins.h"
+#include "access/sysattr.h"
+
+#include "utils/mvstats.h"
+
+/* FIXME private structures copied from analyze.c */
+
+typedef struct
+{
+	Oid			eqopr;			/* '=' operator for datatype, if any */
+	Oid			eqfunc;			/* and associated function */
+	Oid			ltopr;			/* '<' operator for datatype, if any */
+} StdAnalyzeData;
+
+typedef struct
+{
+	Datum		value;			/* a data value */
+	int			tupno;			/* position index for tuple it came from */
+} ScalarItem;
+
+typedef struct
+{
+	int			count;			/* # of duplicates */
+	int			first;			/* values[] index of first occurrence */
+} ScalarMCVItem;
+
+typedef struct
+{
+	SortSupport ssup;
+	int		   *tupnoLink;
+} CompareScalarsContext;
+
+
+VacAttrStats ** lookup_var_attr_stats(int2vector *attrs,
+									  int natts, VacAttrStats **vacattrstats);
+
+/* comparators, used when constructing multivariate stats */
+int compare_scalars_simple(const void *a, const void *b, void *arg);
+int compare_scalars_partition(const void *a, const void *b, void *arg);
+int compare_scalars_memcmp(const void *a, const void *b, void *arg);
+int compare_scalars_memcmp_2(const void *a, const void *b);
diff --git a/src/backend/utils/mvstats/dependencies.c b/src/backend/utils/mvstats/dependencies.c
new file mode 100644
index 0000000..b900efd
--- /dev/null
+++ b/src/backend/utils/mvstats/dependencies.c
@@ -0,0 +1,554 @@
+/*-------------------------------------------------------------------------
+ *
+ * dependencies.c
+ *	  POSTGRES multivariate functional dependencies
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/mvstats/dependencies.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "common.h"
+
+/*
+ * Mine functional dependencies between columns, in the form (A => B),
+ * meaning that a value in column 'A' determines value in 'B'. A simple
+ * artificial example may be a table created like this
+ *
+ *     CREATE TABLE deptest (a INT, b INT)
+ *        AS SELECT i, i/10 FROM generate_series(1,100000) s(i);
+ *
+ * Clearly, once we know the value for 'A' we can easily determine the
+ * value of 'B' by dividing (A/10). A more practical example may be
+ * addresses, where (ZIP code => city name), i.e. once we know the ZIP,
+ * we probably know which city it belongs to. Larger cities usually have
+ * multiple ZIP codes, so the dependency can't be reversed.
+ *
+ * Functional dependencies are a concept well described in relational
+ * theory, especially in definition of normalization and "normal forms".
+ * Wikipedia has a nice definition of a functional dependency [1]:
+ *
+ *     In a given table, an attribute Y is said to have a functional
+ *     dependency on a set of attributes X (written X -> Y) if and only
+ *     if each X value is associated with precisely one Y value. For
+ *     example, in an "Employee" table that includes the attributes
+ *     "Employee ID" and "Employee Date of Birth", the functional
+ *     dependency {Employee ID} -> {Employee Date of Birth} would hold.
+ *     It follows from the previous two sentences that each {Employee ID}
+ *     is associated with precisely one {Employee Date of Birth}.
+ *
+ * [1] http://en.wikipedia.org/wiki/Database_normalization
+ *
+ * Most datasets could be normalized so that no such functional
+ * dependencies remain, but that is not always practical. In some cases
+ * it's a conscious choice to model the dataset in a denormalized way,
+ * either for performance or to make querying easier.
+ *
+ * The current implementation supports only dependencies between two
+ * columns, but that is merely a simplification of the initial patch.
+ * It would certainly be useful to also mine for dependencies with
+ * multiple columns on the 'left' side (i.e. the condition of the
+ * dependency), that is dependencies [A,B] => C and so on.
+ *
+ * Handling multiple columns on the right side is not necessary, as such
+ * dependencies may be decomposed into a set of dependencies with
+ * the same meaning, one for each column on the right side. For example
+ *
+ *     A => [B,C]
+ *
+ * is exactly the same as
+ *
+ *     (A => B) & (A => C).
+ *
+ * Of course, storing (A => [B, C]) may be more efficient than storing
+ * the two dependencies (A => B) and (A => C) separately.
+ *
+ *
+ * Dependency mining (ANALYZE)
+ * ---------------------------
+ *
+ * FIXME Add more details about how build_mv_dependencies() works
+ *       (minimum group size, supporting/contradicting etc.).
+ *
+ * Real-world datasets are imperfect - there may be errors (e.g. due to
+ * data-entry mistakes), or factually correct records, yet contradicting
+ * the dependency (e.g. when a city splits into two, but both keep the
+ * same ZIP code). A strict ANALYZE implementation (where the functional
+ * dependencies are identified) would ignore dependencies on such noisy
+ * data, making the approach unusable in practice.
+ *
+ * The proposed implementation attempts to handle such noisy cases
+ * gracefully, by tolerating a small number of contradicting cases.
+ *
+ * In the future this might also perform some sort of test and decide
+ * whether it's worth building any other kind of multivariate stats,
+ * or whether the dependencies sufficiently describe the data. Or at
+ * least not build the MCV list / histogram on the implied columns.
+ * Such reduction would however make the 'verification' (see the next
+ * section) impossible.
+ *
+ *
+ * Clause reduction (planner/optimizer)
+ * ------------------------------------
+ *
+ * FIXME Explain how reduction works.
+ *
+ * The problem with the reduction is that the query may use conditions
+ * that are not redundant but in fact contradictory - e.g. the user
+ * may search for a ZIP code and a city name not matching that ZIP code.
+ *
+ * In such cases the condition on the city name is contradictory rather
+ * than redundant (making the result empty), and removing it while
+ * estimating the cardinality will make the estimate worse.
+ *
+ * The current estimation, which assumes independence (and multiplies
+ * the selectivities), works better in this case, but only by sheer luck.
+ *
+ * In some cases this might be verified using the other multivariate
+ * statistics - MCV lists and histograms. For MCV lists the verification
+ * might be very simple - peek into the list if there are any items
+ * matching the clause on the 'A' column (e.g. ZIP code), and if such
+ * item is found, check that the 'B' column matches the other clause.
+ * If it does not, the clauses are contradictory. We can't really say
+ * if such item was not found, except maybe restricting the selectivity
+ * using the MCV data (e.g. using min/max selectivity, or something).
+ *
+ * With histograms, it might work similarly - we can't check the values
+ * directly (because histograms store buckets, unlike MCV lists, which
+ * store the actual values). So we can only observe the buckets matching
+ * the clauses - if those buckets have very low frequency, it probably
+ * means the two clauses are incompatible.
+ *
+ * It's unclear what 'low frequency' is, but if one of the clauses is
+ * implied (automatically true because of the other clause), then
+ *
+ *     selectivity[clause(A)] = selectivity[clause(A) & clause(B)]
+ *
+ * So we might compute selectivity of the first clause (on the column
+ * A in dependency [A=>B]) - for example using regular statistics.
+ * And then check if the selectivity computed from the histogram is
+ * about the same (or significantly lower).
+ *
+ * The problem is that histograms work well only when the data ordering
+ * matches the natural meaning. For values that serve as labels - like
+ * city names or ZIP codes, or even generated IDs, histograms really
+ * don't work all that well. For example sorting cities by name won't
+ * match the sorting of ZIP codes, rendering the histogram unusable.
+ *
+ * MCV lists are probably going to work much better here, because they
+ * don't assume any sort of ordering, which is more appropriate for
+ * label-like data.
+ *
+ * TODO Support dependencies with multiple columns on left/right.
+ *
+ * TODO Investigate using histogram and MCV list to confirm the
+ *      functional dependencies.
+ *
+ * TODO Investigate statistical testing of the distribution (to decide
+ *      whether it makes sense to build the histogram/MCV list).
+ *
+ * TODO Using a min/max of selectivities would probably make more sense
+ *      for the associated columns.
+ *
+ * TODO Consider eliminating the implied columns from the histogram and
+ *      MCV lists (but maybe that's not a good idea).
+ *
+ * FIXME Not sure if this handles NULL values properly (not sure how to
+ *       do that). We assume that NULL means 0 for now, handling it just
+ *       like any other value.
+ */
+MVDependencies
+build_mv_dependencies(int numrows, HeapTuple *rows, int2vector *attrs,
+					  int natts, VacAttrStats **vacattrstats)
+{
+	int i;
+	bool isNull;
+	Size len = 2 * sizeof(Datum);	/* only simple associations a => b */
+	int numattrs = attrs->dim1;
+
+	/* result */
+	int ndeps = 0;
+	MVDependencies	dependencies = NULL;
+
+	/* TODO Maybe this should be somehow related to the number of
+	 *      distinct values in the two columns we're currently analyzing.
+	 *      Assuming a uniform distribution, we could compute the average
+	 *      group size we'd expect to observe in the sample, and use that
+	 *      as a threshold. That seems better than a static value.
+	 */
+	int min_group_size = 3;
+
+	/* dimension indexes we'll check for associations [a => b] */
+	int dima, dimb;
+
+	/* info for the interesting attributes only
+	 *
+	 * TODO Compute this only once and pass it to all the methods
+	 *      that need it.
+	 */
+	VacAttrStats **stats = lookup_var_attr_stats(attrs, natts, vacattrstats);
+
+	/* We'll reuse the same array for all the combinations */
+	Datum * values = (Datum*)palloc0(numrows * 2 * sizeof(Datum));
+
+	Assert(numattrs >= 2);
+
+	/*
+	 * Evaluate all possible combinations of [A => B], using a simple algorithm:
+	 *
+	 * (a) sort the data by [A,B]
+	 * (b) split the data into groups by A (new group whenever a value changes)
+	 * (c) count different values in the B column (again, value changes)
+	 *
+	 * TODO It should be rather simple to merge [A => B] and [A => C] into
+	 *      [A => B,C]. Just keep A constant, collect all the "implied" columns
+	 *      and you're done.
+	 */
+	for (dima = 0; dima < numattrs; dima++)
+	{
+		for (dimb = 0; dimb < numattrs; dimb++)
+		{
+			Datum val_a, val_b;
+
+			/* number of groups supporting / contradicting the dependency */
+			int n_supporting = 0;
+			int n_contradicting = 0;
+
+			/* counters valid within a group */
+			int group_size = 0;
+			int n_violations = 0;
+
+			int n_supporting_rows = 0;
+			int n_contradicting_rows = 0;
+
+			/* skip identical columns (A => A is trivially true) */
+			if (dima == dimb)
+				continue;
+
+			/* accumulate all the data for both columns into an array and sort it */
+			for (i = 0; i < numrows; i++)
+			{
+				values[i*2]   = heap_getattr(rows[i], attrs->values[dima], stats[dima]->tupDesc, &isNull);
+				values[i*2+1] = heap_getattr(rows[i], attrs->values[dimb], stats[dimb]->tupDesc, &isNull);
+			}
+
+			qsort_arg((void *) values, numrows, sizeof(Datum) * 2, compare_scalars_memcmp, &len);
+
+			/*
+			 * Walk through the array, split it into groups with the
+			 * same A value, and count distinct B values within each
+			 * group. If there's a single B value for the whole group,
+			 * we count it as supporting the dependency, otherwise we
+			 * count it as contradicting.
+			 *
+			 * Furthermore we require a group to have at least a certain
+			 * number of rows to be counted as supporting the
+			 * dependency, but a contradicting group always counts.
+			 */
+
+			/* start with values from the first row */
+			val_a = values[0];
+			val_b = values[1];
+			group_size  = 1;
+
+			for (i = 1; i < numrows; i++)
+			{
+				if (values[2*i] != val_a)	/* end of the group */
+				{
+					/*
+					 * If there are no contradicting rows, count it as
+					 * supporting (otherwise contradicting), but only if
+					 * the group is large enough.
+					 *
+					 * The requirement of a minimum group size makes it
+					 * impossible to identify [unique,unique] cases, but
+					 * that's probably a different case. This is more
+					 * about [zip => city] associations etc.
+					 */
+					n_supporting += ((n_violations == 0) && (group_size >= min_group_size)) ? 1 : 0;
+					n_contradicting += (n_violations != 0) ? 1 : 0;
+
+					n_supporting_rows += ((n_violations == 0) && (group_size >= min_group_size)) ? group_size : 0;
+					n_contradicting_rows += (n_violations > 0) ? group_size : 0;
+
+					/* current values start a new group */
+					val_a = values[2*i];
+					val_b = values[2*i+1];
+					n_violations = 0;
+					group_size = 1;
+				}
+				else
+				{
+					if (values[2*i+1] != val_b)	/* mismatch of a B value is contradicting */
+					{
+						val_b = values[2*i+1];
+						n_violations += 1;
+					}
+
+					group_size += 1;
+				}
+			}
+
+			/* handle the last group */
+			n_supporting += ((n_violations == 0) && (group_size >= min_group_size)) ? 1 : 0;
+			n_contradicting += (n_violations != 0) ? 1 : 0;
+			n_supporting_rows += ((n_violations == 0) && (group_size >= min_group_size)) ? group_size : 0;
+			n_contradicting_rows += (n_violations > 0) ? group_size : 0;
+
+			/*
+			 * See if the number of rows supporting the association is at least
+			 * 10x the number of rows violating the hypothetical dependency.
+			 *
+			 * TODO This is a rather arbitrary limit - it's probably possible
+			 *      to do some math and come up with a better rule (e.g.
+			 *      testing the hypothesis 'this is due to randomness'). We
+			 *      can create a contingency table from the values and use
+			 *      it for testing. Possibly only when there are no
+			 *      contradicting rows?
+			 *
+			 * TODO Also, if (a => b) and (b => a) hold at the same time, it
+			 *      pretty much means the columns have the same values (or
+			 *      one is a 'label' for the other), making the conditions
+			 *      rather redundant. Although it's possible that the query
+			 *      uses an incompatible combination of values.
+			 */
+			if (n_supporting_rows > (n_contradicting_rows * 10))
+			{
+				if (dependencies == NULL)
+				{
+					dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData));
+					dependencies->magic = MVSTAT_DEPS_MAGIC;
+				}
+				else
+					dependencies = repalloc(dependencies, offsetof(MVDependenciesData, deps)
+											+ sizeof(MVDependency) * (dependencies->ndeps + 1));
+
+				/* add the new dependency [a => b] */
+				dependencies->deps[ndeps] = (MVDependency)palloc0(sizeof(MVDependencyData));
+				dependencies->deps[ndeps]->a = attrs->values[dima];
+				dependencies->deps[ndeps]->b = attrs->values[dimb];
+
+				dependencies->ndeps = (++ndeps);
+			}
+		}
+	}
+
+	pfree(values);
+
+	return dependencies;
+}
+
+/*
+ * Store the dependencies into a bytea, so that it can be stored in the
+ * pg_mv_statistic catalog.
+ *
+ * Currently this only supports simple two-column rules, and stores them
+ * as a sequence of attnum pairs. In the future, this needs to be made
+ * more complex to support multiple columns on both sides of the
+ * implication (using AND on left, OR on right).
+ */
+bytea *
+serialize_mv_dependencies(MVDependencies dependencies)
+{
+	int i;
+
+	/* we store the struct header, plus 2 * int16 per dependency */
+	Size len = VARHDRSZ + offsetof(MVDependenciesData, deps)
+				+ dependencies->ndeps * (sizeof(int16) * 2);
+
+	bytea * output = (bytea*)palloc0(len);
+
+	char * tmp = VARDATA(output);
+
+	SET_VARSIZE(output, len);
+
+	/* first, store the number of dimensions / items */
+	memcpy(tmp, dependencies, offsetof(MVDependenciesData, deps));
+	tmp += offsetof(MVDependenciesData, deps);
+
+	/* walk through the dependencies and copy both columns into the bytea */
+	for (i = 0; i < dependencies->ndeps; i++)
+	{
+		memcpy(tmp, &(dependencies->deps[i]->a), sizeof(int16));
+		tmp += sizeof(int16);
+
+		memcpy(tmp, &(dependencies->deps[i]->b), sizeof(int16));
+		tmp += sizeof(int16);
+	}
+
+	return output;
+}
+
+/*
+ * Read serialized dependencies back into an MVDependencies structure.
+ */
+MVDependencies
+deserialize_mv_dependencies(bytea * data)
+{
+	int		i;
+	Size	expected_size;
+	MVDependencies	dependencies;
+	char   *tmp;
+
+	if (data == NULL)
+		return NULL;
+
+	if (VARSIZE_ANY_EXHDR(data) < offsetof(MVDependenciesData,deps))
+		elog(ERROR, "invalid MVDependencies size %zu (expected at least %zu)",
+			 VARSIZE_ANY_EXHDR(data), offsetof(MVDependenciesData,deps));
+
+	/* read the MVDependencies header */
+	dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData));
+
+	/* initialize pointer to the data part (skip the varlena header) */
+	tmp = VARDATA(data);
+
+	/* get the header and perform basic sanity checks */
+	memcpy(dependencies, tmp, offsetof(MVDependenciesData, deps));
+	tmp += offsetof(MVDependenciesData, deps);
+
+	if (dependencies->magic != MVSTAT_DEPS_MAGIC)
+	{
+		pfree(dependencies);
+		elog(WARNING, "not a valid MVDependencies bytea (magic number mismatch)");
+		return NULL;
+	}
+
+	Assert(dependencies->ndeps > 0);
+
+	/* what bytea size do we expect for those parameters */
+	expected_size = offsetof(MVDependenciesData,deps) +
+					dependencies->ndeps * sizeof(int16) * 2;
+
+	if (VARSIZE_ANY_EXHDR(data) != expected_size)
+		elog(ERROR, "invalid dependencies size %zu (expected %zu)",
+			 VARSIZE_ANY_EXHDR(data), expected_size);
+
+	/* allocate space for the dependency pointers */
+	dependencies = repalloc(dependencies, offsetof(MVDependenciesData,deps)
+							+ (dependencies->ndeps * sizeof(MVDependency)));
+
+	for (i = 0; i < dependencies->ndeps; i++)
+	{
+		dependencies->deps[i] = (MVDependency)palloc0(sizeof(MVDependencyData));
+
+		memcpy(&(dependencies->deps[i]->a), tmp, sizeof(int16));
+		tmp += sizeof(int16);
+
+		memcpy(&(dependencies->deps[i]->b), tmp, sizeof(int16));
+		tmp += sizeof(int16);
+	}
+
+	return dependencies;
+}
+
+/* print some basic info about dependencies (number of dependencies) */
+Datum
+pg_mv_stats_dependencies_info(PG_FUNCTION_ARGS)
+{
+	bytea	   *data = PG_GETARG_BYTEA_P(0);
+	char	   *result;
+
+	MVDependencies dependencies = deserialize_mv_dependencies(data);
+
+	if (dependencies == NULL)
+		PG_RETURN_NULL();
+
+	result = palloc0(128);
+	snprintf(result, 128, "dependencies=%d", dependencies->ndeps);
+
+	/* FIXME free the deserialized data (pfree is not enough) */
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/* print the dependencies
+ *
+ * TODO  Would be nice if this knew the actual column names (instead of
+ *       the attnums).
+ *
+ * FIXME This is really ugly and does not really check the lengths and
+ *       strcpy/snprintf return values properly. Needs to be fixed.
+ */
+Datum
+pg_mv_stats_dependencies_show(PG_FUNCTION_ARGS)
+{
+	int			i = 0;
+	bytea	   *data = PG_GETARG_BYTEA_P(0);
+	char	   *result = NULL;
+	int			len = 0;
+
+	MVDependencies dependencies = deserialize_mv_dependencies(data);
+
+	if (dependencies == NULL)
+		PG_RETURN_NULL();
+
+	for (i = 0; i < dependencies->ndeps; i++)
+	{
+		MVDependency dependency = dependencies->deps[i];
+		char	buffer[128];
+
+		int		tmp = snprintf(buffer, 128, "%s%d => %d",
+				((i == 0) ? "" : ", "), dependency->a, dependency->b);
+
+		if (tmp < 127)
+		{
+			if (result == NULL)
+				result = palloc0(len + tmp + 1);
+			else
+				result = repalloc(result, len + tmp + 1);
+
+			strcpy(result + len, buffer);
+			len += tmp;
+		}
+	}
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+bytea *
+fetch_mv_dependencies(Oid mvoid)
+{
+	Relation	indrel;
+	SysScanDesc indscan;
+	ScanKeyData skey;
+	HeapTuple	htup;
+	bytea	   *stadeps = NULL;
+
+	/* Prepare to scan pg_mv_statistic for entries having indrelid = this rel. */
+	ScanKeyInit(&skey,
+				ObjectIdAttributeNumber,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(mvoid));
+
+	indrel = heap_open(MvStatisticRelationId, AccessShareLock);
+	indscan = systable_beginscan(indrel, MvStatisticOidIndexId, true,
+								 NULL, 1, &skey);
+
+	while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+	{
+		bool isnull = false;
+		Datum deps = SysCacheGetAttr(MVSTATOID, htup,
+								   Anum_pg_mv_statistic_stadeps, &isnull);
+
+		Assert(!isnull);
+
+		stadeps = DatumGetByteaP(deps);
+
+		break;
+	}
+
+	systable_endscan(indscan);
+
+	heap_close(indrel, AccessShareLock);
+
+	/* TODO maybe save the list into relcache, as in RelationGetIndexList
+	 *      (which was used as an inspiration of this one)?. */
+
+	return stadeps;
+}
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a680229..f69eb7c 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -173,6 +173,11 @@ DECLARE_UNIQUE_INDEX(pg_largeobject_loid_pn_index, 2683, on pg_largeobject using
 DECLARE_UNIQUE_INDEX(pg_largeobject_metadata_oid_index, 2996, on pg_largeobject_metadata using btree(oid oid_ops));
 #define LargeObjectMetadataOidIndexId	2996
 
+DECLARE_UNIQUE_INDEX(pg_mv_statistic_oid_index, 3286, on pg_mv_statistic using btree(oid oid_ops));
+#define MvStatisticOidIndexId  3286
+DECLARE_INDEX(pg_mv_statistic_relid_index, 3287, on pg_mv_statistic using btree(starelid oid_ops));
+#define MvStatisticRelidIndexId	3287
+
 DECLARE_UNIQUE_INDEX(pg_namespace_nspname_index, 2684, on pg_namespace using btree(nspname name_ops));
 #define NamespaceNameIndexId  2684
 DECLARE_UNIQUE_INDEX(pg_namespace_oid_index, 2685, on pg_namespace using btree(oid oid_ops));
diff --git a/src/include/catalog/pg_mv_statistic.h b/src/include/catalog/pg_mv_statistic.h
new file mode 100644
index 0000000..76b7db7
--- /dev/null
+++ b/src/include/catalog/pg_mv_statistic.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_mv_statistic.h
+ *	  definition of the system "multivariate statistic" relation (pg_mv_statistic)
+ *	  along with the relation's initial contents.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/pg_mv_statistic.h
+ *
+ * NOTES
+ *	  the genbki.pl script reads this file and generates .bki
+ *	  information from the DATA() statements.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_MV_STATISTIC_H
+#define PG_MV_STATISTIC_H
+
+#include "catalog/genbki.h"
+
+/* ----------------
+ *		pg_mv_statistic definition.  cpp turns this into
+ *		typedef struct FormData_pg_mv_statistic
+ * ----------------
+ */
+#define MvStatisticRelationId  3281
+
+CATALOG(pg_mv_statistic,3281)
+{
+	/* These fields form the unique key for the entry: */
+	Oid			starelid;		/* relation containing attributes */
+
+	/* statistics requested to build */
+	bool		deps_enabled;		/* analyze dependencies? */
+
+	/* statistics that are available (if requested) */
+	bool		deps_built;			/* dependencies were built */
+
+	/* variable-length fields start here, but we allow direct access to stakeys */
+	int2vector	stakeys;			/* array of column keys */
+
+#ifdef CATALOG_VARLEN
+	bytea		stadeps;			/* dependencies (serialized) */
+#endif
+
+} FormData_pg_mv_statistic;
+
+/* ----------------
+ *		Form_pg_mv_statistic corresponds to a pointer to a tuple with
+ *		the format of pg_mv_statistic relation.
+ * ----------------
+ */
+typedef FormData_pg_mv_statistic *Form_pg_mv_statistic;
+
+/* ----------------
+ *		compiler constants for pg_mv_statistic
+ * ----------------
+ */
+#define Natts_pg_mv_statistic					5
+#define Anum_pg_mv_statistic_starelid			1
+#define Anum_pg_mv_statistic_deps_enabled		2
+#define Anum_pg_mv_statistic_deps_built			3
+#define Anum_pg_mv_statistic_stakeys			4
+#define Anum_pg_mv_statistic_stadeps			5
+
+#endif   /* PG_MV_STATISTIC_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6a757f3..4b7ae1f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2693,6 +2693,11 @@ DESCR("current user privilege on any column by rel name");
 DATA(insert OID = 3029 (  has_any_column_privilege	   PGNSP PGUID 12 10 0 0 0 f f f f t f s 2 0 16 "26 25" _null_ _null_ _null_ _null_ has_any_column_privilege_id _null_ _null_ _null_ ));
 DESCR("current user privilege on any column by rel oid");
 
+DATA(insert OID = 3284 (  pg_mv_stats_dependencies_info     PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "17" _null_ _null_ _null_ _null_ pg_mv_stats_dependencies_info _null_ _null_ _null_ ));
+DESCR("multivariate stats: functional dependencies info");
+DATA(insert OID = 3285 (  pg_mv_stats_dependencies_show     PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "17" _null_ _null_ _null_ _null_ pg_mv_stats_dependencies_show _null_ _null_ _null_ ));
+DESCR("multivariate stats: functional dependencies show");
+
 DATA(insert OID = 1928 (  pg_stat_get_numscans			PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_numscans _null_ _null_ _null_ ));
 DESCR("statistics: number of scans done for table/index");
 DATA(insert OID = 1929 (  pg_stat_get_tuples_returned	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_tuples_returned _null_ _null_ _null_ ));
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index cba4ae7..45d3b5a 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -49,6 +49,7 @@ extern void BootstrapToastTable(char *relName,
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
 DECLARE_TOAST(pg_description, 2834, 2835);
+DECLARE_TOAST(pg_mv_statistic, 3288, 3289);
 DECLARE_TOAST(pg_proc, 2836, 2837);
 DECLARE_TOAST(pg_rewrite, 2838, 2839);
 DECLARE_TOAST(pg_seclabel, 3598, 3599);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3a0e7c4 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -414,6 +414,7 @@ typedef enum NodeTag
 	T_WithClause,
 	T_CommonTableExpr,
 	T_RoleSpec,
+	T_StatisticsDef,
 
 	/*
 	 * TAGS FOR REPLICATION GRAMMAR PARSE NODES (replnodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ec0d0ea..b256162 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -570,6 +570,14 @@ typedef struct ColumnDef
 	int			location;		/* parse location, or -1 if none/unknown */
 } ColumnDef;
 
+typedef struct StatisticsDef
+{
+	NodeTag		type;
+	List	   *keys;			/* String nodes naming referenced column(s) */
+	List	   *options;		/* list of DefElem nodes */
+} StatisticsDef;
+
+
 /*
  * TableLikeClause - CREATE TABLE ( ... LIKE ... ) clause
  */
@@ -1362,7 +1370,8 @@ typedef enum AlterTableType
 	AT_ReplicaIdentity,			/* REPLICA IDENTITY */
 	AT_EnableRowSecurity,		/* ENABLE ROW SECURITY */
 	AT_DisableRowSecurity,		/* DISABLE ROW SECURITY */
-	AT_GenericOptions			/* OPTIONS (...) */
+	AT_GenericOptions,			/* OPTIONS (...) */
+	AT_AddStatistics			/* add statistics */
 } AlterTableType;
 
 typedef struct ReplicaIdentityStmt
diff --git a/src/include/utils/mvstats.h b/src/include/utils/mvstats.h
new file mode 100644
index 0000000..2b59c2d
--- /dev/null
+++ b/src/include/utils/mvstats.h
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * mvstats.h
+ *	  Multivariate statistics and selectivity estimation functions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/mvstats.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MVSTATS_H
+#define MVSTATS_H
+
+#include "commands/vacuum.h"
+
+/*
+ * Basic info about the stats, used when choosing which stats to apply
+ *
+ * TODO Add info about which statistics are available (histogram, MCV,
+ *      hashed MCV, functional dependencies).
+ */
+typedef struct MVStatsData {
+	Oid			mvoid;		/* OID of the stats in pg_mv_statistic */
+	int2vector *stakeys;	/* attnums for columns in the stats */
+	bool		deps_built;	/* functional dependencies available */
+} MVStatsData;
+
+typedef struct MVStatsData *MVStats;
+
+
+#define MVSTATS_MAX_DIMENSIONS	8		/* max number of attributes */
+
+/* A functional dependency, tracking the [a => b] relationship.
+ *
+ * TODO Make this work with multiple columns on both sides.
+ */
+typedef struct MVDependencyData {
+	int16	a;
+	int16	b;
+} MVDependencyData;
+
+typedef MVDependencyData* MVDependency;
+
+typedef struct MVDependenciesData {
+	uint32			magic;		/* magic constant marker */
+	int32			ndeps;		/* number of dependencies */
+	MVDependency	deps[1];	/* XXX why not a pointer? */
+} MVDependenciesData;
+
+typedef MVDependenciesData* MVDependencies;
+
+#define MVSTAT_DEPS_MAGIC		0xB4549A2C	/* marks serialized bytea */
+#define MVSTAT_DEPS_TYPE_BASIC	1			/* basic dependencies type */
+
+/*
+ * TODO Fetching the histogram/MCV list separately may be inefficient.
+ *      Consider adding a single `fetch_stats` method, fetching all
+ *      stats specified using flags (or something like that).
+ */
+MVStats list_mv_stats(Oid relid, int *nstats, bool built_only);
+
+bytea * fetch_mv_dependencies(Oid mvoid);
+
+bytea * serialize_mv_dependencies(MVDependencies dependencies);
+
+/* deserialization of stats (the counterpart of serialize_mv_dependencies) */
+MVDependencies	deserialize_mv_dependencies(bytea * data);
+
+/* FIXME this probably belongs somewhere else (not to operations stats) */
+extern Datum pg_mv_stats_dependencies_info(PG_FUNCTION_ARGS);
+extern Datum pg_mv_stats_dependencies_show(PG_FUNCTION_ARGS);
+
+MVDependencies
+build_mv_dependencies(int numrows, HeapTuple *rows,
+								  int2vector *attrs,
+								  int natts, VacAttrStats **vacattrstats);
+
+void build_mv_stats(Relation onerel, int numrows, HeapTuple *rows,
+						   int natts, VacAttrStats **vacattrstats);
+
+void update_mv_stats(Oid relid, MVDependencies dependencies);
+
+#endif
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index ba0b090..12147ab 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -66,6 +66,7 @@ enum SysCacheIdentifier
 	INDEXRELID,
 	LANGNAME,
 	LANGOID,
+	MVSTATOID,
 	NAMESPACENAME,
 	NAMESPACEOID,
 	OPERNAMENSP,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1788270..f0117ca 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1353,6 +1353,14 @@ pg_matviews| SELECT n.nspname AS schemaname,
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
      LEFT JOIN pg_tablespace t ON ((t.oid = c.reltablespace)))
   WHERE (c.relkind = 'm'::"char");
+pg_mv_stats| SELECT n.nspname AS schemaname,
+    c.relname AS tablename,
+    s.stakeys AS attnums,
+    length(s.stadeps) AS depsbytes,
+    pg_mv_stats_dependencies_info(s.stadeps) AS depsinfo
+   FROM ((pg_mv_statistic s
+     JOIN pg_class c ON ((c.oid = s.starelid)))
+     LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
 pg_policies| SELECT n.nspname AS schemaname,
     c.relname AS tablename,
     pol.polname AS policyname,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c7be273..00f5fe7 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -113,6 +113,7 @@ pg_inherits|t
 pg_language|t
 pg_largeobject|t
 pg_largeobject_metadata|t
+pg_mv_statistic|t
 pg_namespace|t
 pg_opclass|t
 pg_operator|t
-- 
2.1.0.GIT
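As an aside on the serialized format: `mvstats.h` tags the `bytea` blob with `MVSTAT_DEPS_MAGIC`, and errors like the "invalid memory alloc request size" above are typically what you get when a length field is read from a corrupt or truncated blob and passed straight to an allocator. The sketch below is not the patch's code; the layout (`magic`, `ndeps`, then int16 `(a, b)` pairs) and the `check_deps_blob` helper are hypothetical, illustrating the kind of validation a deserializer can do before allocating:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MVSTAT_DEPS_MAGIC 0xB4549A2C    /* same marker the patch defines */

/*
 * Validate a serialized-dependencies blob assumed to be laid out as:
 *   uint32 magic, int32 ndeps, then 2 * ndeps int16 attribute numbers.
 * Returns ndeps on success, -1 if the blob is corrupt or truncated,
 * so the caller never allocates based on a garbage length.
 */
static int32_t
check_deps_blob(const char *buf, size_t len)
{
    uint32_t    magic;
    int32_t     ndeps;

    if (len < sizeof(magic) + sizeof(ndeps))
        return -1;                      /* too short to hold the header */

    memcpy(&magic, buf, sizeof(magic));
    memcpy(&ndeps, buf + sizeof(magic), sizeof(ndeps));

    if (magic != MVSTAT_DEPS_MAGIC || ndeps < 0)
        return -1;                      /* foreign or corrupt data */

    if (len < sizeof(magic) + sizeof(ndeps) +
        (size_t) ndeps * 2 * sizeof(int16_t))
        return -1;                      /* truncated: refuse to trust ndeps */

    return ndeps;
}
```

Checking the magic and the implied total length before any allocation turns a bogus blob into a clean error instead of a huge `palloc` request.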
