[HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat Wed, 15 Jun 2016 00:25:55 -0700

Amit Langote is working on supporting declarative partitioning in
PostgreSQL [1]. I have started working on supporting partition-wise join.
This mail describes very high level design and some insight into the
performance improvements.


An equi-join between two partitioned tables can be broken down into
pair-wise join between their partitions. This technique is called
partition-wise join. Partition-wise joins between similarly partitioned
tables with equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join smaller. All such joins
collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes,
their storage etc.
3. An N-way partition-wise join may have different efficient join orders
compared to the efficient join order between the parent tables.

A partition-wise join is processed in following stages [2], [3].
1. Applicability testing: This phase checks if the join conditions match
the partitioning scheme. A partition-wise join is efficient if there is an
equi-join on the partition keys. E.g. join between tables R and S
partitioned by columns a and b resp. can be broken down into partition-wise
joins if there exists a join condition is R.a = S.b. Or in other words the
number of provably non-empty partition-wise joins is O(N) where N is the
number of partitions.

2. Matching: This phase determines which joins between the partitions of R
and S can potentially produce tuples in the join and prunes empty joins
between partitions.

3. Clustering: This phase aims at reducing the number of partition-wise
joins by clubbing together partitions from joining relations. E.g. clubbing
multiple partitions from either of the partitioned relations which can join
to a single partition from the other partitioned relation.

4. Path/plan creation: This phase creates multiple paths for each
partition-wise join. It also creates Append path/s representing the union
of partition-wise joins.

The work here focuses on a subset of use-cases discussed in [2]. It only
considers partition-wise join for join between similarly partitioned tables
with same number of partitions with same properties, thus producing at most
as many partition-wise joins as there are partitions. It should be possible
to apply partition-wise join technique (with some special handling for
OUTER joins) if both relations have some extra partitions with
non-overlapping partition conditions, apart from the matching partitions.
But I am not planning to implement this optimization in the first cut.

The attached patch is a POC implementation of partition-wise join. It is is
based on the set of patches posted on 23rd May 2016 by Amit Langote for
declarative partitioning. The patch gives an idea about the approach used.
It has several TODOs, which I am working on.

Attached is a script with output which measures potential performance
improvement because of partition-wise join. The script uses a GUC
enable_partition_wise_join to disable/enable this feature for performance
measurement. The scripts measures performance improvement of a join between
two tables partitioned by range on integer column. Each table contains 50K
rows. Each table has an integer and a varchar column. It shows around
10-15% reduction in execution time when partition-wise join is used.
Accompanied with parallel query and FDWs, it opens up avenues for further
improvements for joins between partitioned tables.

[1]. https://www.postgresql.org/message-id/55d3093c.5010...@lab.ntt.co.jp
[2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
[3]. https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

partitioned_join.out
Description: Binary data

\set num_samples 100
drop table t1 cascade;
drop table t2 cascade;
create table t1 (a int, b varchar) partition by range(a);
create table t1_p1 partition of t1 for values start (1) end (10000) inclusive;
create table t1_p2 partition of t1 for values start (10001) end (20000) inclusive;
create table t1_p3 partition of t1 for values start (20001) end (30000) inclusive;
create table t1_p4 partition of t1 for values start (30001) end (40000) inclusive;
create table t1_p5 partition of t1 for values start (40001) end (50000) inclusive;
insert into t1 select i, to_char(i, 'FM000000') from generate_series(1, 50000) i;
create table t2 (a int, b varchar) partition by range(a);
create table t2_p1 partition of t2 for values start (1) end (10000) inclusive;
create table t2_p2 partition of t2 for values start (10001) end (20000) inclusive;
create table t2_p3 partition of t2 for values start (20001) end (30000) inclusive;
create table t2_p4 partition of t2 for values start (30001) end (40000) inclusive;
create table t2_p5 partition of t2 for values start (40001) end (50000) inclusive;
insert into t2 select i, to_char(i, 'FM000000') from generate_series(1, 50000) i;

drop function query_execution_stats(query text, num_samples int,
													OUT avg_exe_time float,
													OUT exec_time_dev float,
													OUT min_exe_time float,
													OUT max_exe_time float);

create function query_execution_stats(query text, num_samples int,
													OUT avg_exe_time float,
													OUT std_dev_exe_time float,
													OUT min_exe_time float,
													OUT max_exe_time float)
RETURNS record LANGUAGE plpgsql AS $$
DECLARE
	plan json;
BEGIN
	CREATE TEMPORARY TABLE query_exe_times(exe_time float); 

	-- Execute query a few times (5% of user specified runs) to warm the cache
	FOR i IN 1 .. num_samples/20 LOOP
		EXECUTE query;
	END LOOP;

	FOR i IN 1 .. num_samples LOOP
		EXECUTE 'EXPLAIN (analyze, format json) ' || query INTO plan;
		INSERT INTO query_exe_times VALUES ((plan->0->'Execution Time')::text::float);
		RAISE NOTICE 'completed % samples', i;
	END LOOP;

	SELECT avg(exe_time), stddev(exe_time), min(exe_time), max(exe_time)
		INTO avg_exe_time, std_dev_exe_time, min_exe_time, max_exe_time
		FROM query_exe_times;

	DROP TABLE query_exe_times;
END;
$$;

analyze t1;
analyze t2;
analyze t1_p1;
analyze t1_p2;
analyze t1_p3;
analyze t1_p4;
analyze t1_p5;
analyze t2_p1;
analyze t2_p2;
analyze t2_p3;
analyze t2_p4;
analyze t2_p5;

-- join between two partitioned relations
\set query 'select * from t1, t2 where t1.a = t2.a'

-- partition-wise join
set enable_partition_wise_join to true;
select avg_exe_time, std_dev_exe_time, min_exe_time, max_exe_time
		from query_execution_stats(:'query', :num_samples);
explain (verbose, analyze) :query;

-- partition-wise join off
set enable_partition_wise_join to false;
select avg_exe_time, std_dev_exe_time, min_exe_time, max_exe_time
		from query_execution_stats(:'query', :num_samples);
explain (verbose, analyze) :query;

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 873a764..e8c7c76 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -987,21 +987,59 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		if (rel->has_eclass_joins || has_useful_pathkeys(root, rel))
 			add_child_rel_equivalences(root, appinfo, rel, childrel);
 		childrel->has_eclass_joins = rel->has_eclass_joins;
 
 		/*
 		 * Note: we could compute appropriate attr_needed data for the child's
 		 * variables, by transforming the parent's attr_needed through the
 		 * translated_vars mapping.  However, currently there's no need
 		 * because attr_needed is only examined for base relations not
 		 * otherrels.  So we just leave the child's attr_needed empty.
+		 * For a partitioned tables, individual partitions can participate in
+		 * the pair-wise joins. We need attr_needed data for buiding pair-wise
+		 * join relations. Partition tables should have same layout as the
+		 * parent table and hence should not need any translation. But rest of
+		 * the code still uses inheritance mechanism. So do we here.
+		 * TODO: do we need to translate the relids as well?
 		 */
+		if (rel->part_desc && rel->part_desc->nparts > 0)
+		{
+			AttrNumber attno;
+			for (attno = rel->min_attr; attno <= rel->max_attr; attno++)
+			{
+				int	index = attno - rel->min_attr;
+				Relids	attr_needed = bms_copy(rel->attr_needed[index]);
+
+				/*
+				 * System attributes do not need translation. In such a case,
+				 * the attribute numbers of the parent and the child should
+				 * start from the same minimum attribute.
+				 */
+				if (attno <= 0)
+				{
+					Assert(rel->min_attr == childrel->min_attr);
+					childrel->attr_needed[index] = attr_needed;
+				}
+				else
+				{
+					Var *var = list_nth(appinfo->translated_vars,
+										attno - 1); 
+					int child_index;
+
+					/* Parent Var translates to child Var. */ 
+					Assert(IsA(var, Var));
+
+					child_index = var->varattno - childrel->min_attr;
+					childrel->attr_needed[child_index] = attr_needed;
+				}
+			}
+		}
 
 		/*
 		 * Compute the child's size.
 		 */
 		set_rel_size(root, childrel, childRTindex, childRTE);
 
 		/*
 		 * It is possible that constraint exclusion detected a contradiction
 		 * within a child subquery, even though we didn't prove one above. If
 		 * so, we can skip this child.
@@ -1097,34 +1135,38 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 {
 	int			parentRTindex = rti;
 	List	   *live_childrels = NIL;
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
 	List	   *partial_subpaths = NIL;
 	bool		partial_subpaths_valid = true;
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
+	PartitionDesc	part_desc = rel->part_desc;
+	int			num_parts = part_desc ? part_desc->nparts : 0;
+	Oid		   *part_oids = part_desc ? part_desc->oids : NULL;
 
 	/*
 	 * Generate access paths for each member relation, and remember the
 	 * cheapest path for each one.  Also, identify all pathkeys (orderings)
 	 * and parameterizations (required_outer sets) available for the member
 	 * relations.
 	 */
 	foreach(l, root->append_rel_list)
 	{
 		AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l);
 		int			childRTindex;
 		RangeTblEntry *childRTE;
 		RelOptInfo *childrel;
 		ListCell   *lcp;
+		int			cnt_parts; 
 
 		/* append_rel_list contains all append rels; ignore others */
 		if (appinfo->parent_relid != parentRTindex)
 			continue;
 
 		/* Re-locate the child RTE and RelOptInfo */
 		childRTindex = appinfo->child_relid;
 		childRTE = root->simple_rte_array[childRTindex];
 		childrel = root->simple_rel_array[childRTindex];
 
@@ -1132,20 +1174,37 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		 * Compute the child's access paths.
 		 */
 		set_rel_pathlist(root, childrel, childRTindex, childRTE);
 
 		/*
 		 * If child is dummy, ignore it.
 		 */
 		if (IS_DUMMY_REL(childrel))
 			continue;
 
+		/* 
+		 * Match the children to the partition to fill the partition scheme by
+		 * matching OID of the children in part_desc and RTE.
+		 * TODO: we are doing this here since we get hold of the partition
+		 * RelOptInfo here. But we should assess whether this is the right
+		 * place.
+		 */
+		for (cnt_parts = 0; cnt_parts < num_parts; cnt_parts++)
+		{
+			if (part_oids[cnt_parts] == childRTE->relid)
+			{
+				/* Every partition can be seen only once. */
+				Assert(!rel->part_rels[cnt_parts]);
+				rel->part_rels[cnt_parts] = childrel;
+			}
+		}
+
 		/*
 		 * Child is live, so add it to the live_childrels list for use below.
 		 */
 		live_childrels = lappend(live_childrels, childrel);
 
 		/*
 		 * If child has an unparameterized cheapest-total path, add that to
 		 * the unparameterized Append path we are constructing for the parent.
 		 * If not, there's no workable unparameterized path.
 		 */
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4c9d8d9..9cd0361 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -120,20 +120,21 @@ bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
 bool		enable_mergejoin = true;
 bool		enable_hashjoin = true;
 bool		enable_fkey_estimates = true;
+bool		enable_partition_wise_join = true;
 
 typedef struct
 {
 	PlannerInfo *root;
 	QualCost	total;
 } cost_qual_eval_context;
 
 static List *extract_nonindex_conditions(List *qual_clauses, List *indexquals);
 static MergeScanSelCache *cached_scansel(PlannerInfo *root,
 			   RestrictInfo *rinfo,
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 01d4fea..b0cbc1b 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -10,35 +10,42 @@
  * IDENTIFICATION
  *	  src/backend/optimizer/path/joinrels.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/cost.h"
 #include "utils/memutils.h"
 
 
 static void make_rels_by_clause_joins(PlannerInfo *root,
 						  RelOptInfo *old_rel,
 						  ListCell *other_rels);
 static void make_rels_by_clauseless_joins(PlannerInfo *root,
 							  RelOptInfo *old_rel,
 							  ListCell *other_rels);
 static bool has_join_restriction(PlannerInfo *root, RelOptInfo *rel);
 static bool has_legal_joinclause(PlannerInfo *root, RelOptInfo *rel);
 static bool is_dummy_rel(RelOptInfo *rel);
 static void mark_dummy_rel(RelOptInfo *rel);
 static bool restriction_is_constant_false(List *restrictlist,
 							  bool only_pushed_down);
+static bool are_partitions_joinable(RelOptInfo *rel1, RelOptInfo *rel2,
+								   SpecialJoinInfo *sjinfo);
+static PartitionDesc build_joinrel_partition_desc(RelOptInfo *rel1,
+												  RelOptInfo *rel2,
+												  SpecialJoinInfo *sjinfo);
+static void add_append_paths_to_joinrel(RelOptInfo *joinrel);
 
 
 /*
  * join_search_one_level
  *	  Consider ways to produce join relations containing exactly 'level'
  *	  jointree items.  (This is one step of the dynamic-programming method
  *	  embodied in standard_join_search.)  Join rel nodes for each feasible
  *	  combination of lower-level rels are created and returned in a list.
  *	  Implementation paths are created for each such joinrel, too.
  *
@@ -862,25 +869,182 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 			add_paths_to_joinrel(root, joinrel, rel1, rel2,
 								 JOIN_ANTI, sjinfo,
 								 restrictlist);
 			break;
 		default:
 			/* other values not expected here */
 			elog(ERROR, "unrecognized join type: %d", (int) sjinfo->jointype);
 			break;
 	}
 
+	/*
+	 * If both the relations are partitioned in the same way, and there is an
+	 * equi-join clause on partition key, try joining the partitions. Store the
+	 * partitioning scheme in joinrel for further joins.
+	 */
+
+	joinrel->part_desc = build_joinrel_partition_desc(rel1, rel2, sjinfo);
+
+	if (joinrel->part_desc && joinrel->part_desc->nparts > 0)
+	{
+		int	nparts = joinrel->part_desc->nparts;
+		int	cnt_parts;
+
+		/* Allocate space for holding the pair-wise join relations. */
+		joinrel->part_rels = (RelOptInfo **) palloc(sizeof(RelOptInfo *) *
+													nparts);
+		/* Create join relations for the partition relations. */
+		for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+		{
+			RelOptInfo	*part_join_rel;
+			part_join_rel = make_join_rel(root, rel1->part_rels[cnt_parts],
+										  rel2->part_rels[cnt_parts]);
+			joinrel->part_rels[cnt_parts] = part_join_rel;
+		}
+
+		/* Add append paths for pair-wise joins. */
+		add_append_paths_to_joinrel(joinrel);
+	}
+
 	bms_free(joinrelids);
 
 	return joinrel;
 }
 
+/*
+ * Given partitioin description of two joining relations, construct partition
+ * description for join between those relations.
+ *
+ * TODO find the right place for this function.
+ */
+static PartitionDesc
+build_joinrel_partition_desc(RelOptInfo *rel1, RelOptInfo *rel2,
+							 SpecialJoinInfo *sjinfo)
+{
+	PartitionDesc	part_desc;
+
+	/* Do nothing, if user doesn't want to try partition-wise join. */
+	if (!enable_partition_wise_join)
+		return NULL;
+
+	if (!are_partitions_joinable(rel1, rel2, sjinfo))
+		return NULL;
+
+	/* 
+	 * The result of join is partitioned the same way as the joining relations.
+	 * Construct the partitioning scheme from the joining relations.
+	 */
+	part_desc = (PartitionDesc) palloc0(sizeof(PartitionDescData));
+	part_desc->nparts = rel1->part_desc->nparts;
+	/* Fill up the rest of the fields. */	
+
+	return part_desc;
+}
+
+/*
+ * Assess whether the given relations are similarly partitioned and have
+ * equi-join clauses on partition keys.
+ * 
+ * Two relations are similarly partitioned if
+ * o. They have same number of partitions
+ * o. They have same number of partition keys*
+ * o. Partition keys have same types and opclasses*
+ * o. They have same upper and lower bounds (with inclusive/exclusive
+ * attributes) for all keys for range partitions. They have same list items for
+ * list partitions.
+ *
+ * Have same number of partition keys: It might be possible to join partitioned
+ * table which have different number of partition keys and suitable equi-join
+ * clauses to eliminate the possibilities. But right now, we do not consider
+ * this.
+ *
+ * Have same types and opclasses: Right now, we expect the partition keys to
+ * have exact same order of partion key types and opclasses. But it might
+ * be possible to relax this condition, if we can find which partition key
+ * matches which and also find corresponding equi-joins.
+ */
+static bool
+are_partitions_joinable(RelOptInfo *rel1, RelOptInfo *rel2,
+					   SpecialJoinInfo *sjinfo)
+{
+	PartitionDesc	part_desc1 = rel1->part_desc;
+	PartitionDesc	part_desc2 = rel2->part_desc;
+
+	/* If either of the relations is not partitioned, nothing to check here. */
+	if (!part_desc1 || part_desc1->nparts == 0 ||
+		!part_desc2 || part_desc2->nparts == 0)
+		return false;
+
+	/*
+	 * If the number of partitions on either side differs, partitioning schemes
+	 * do not match.
+	 * TODO: it should be possible to push an inner join down even if the number of
+	 * partitions differ but the common partitions match. In such a case pushing
+	 * down outer joins would be tricky, but still doable using empty relation
+	 * for non-existing partition.
+	 */
+	if (part_desc1->nparts != part_desc2->nparts)
+		return false;
+
+	/* TODO: */
+	/* All the artitions on either side should have same bounds or lists. */
+	/* Joining condition should have an equi-join on the partition key. */
+
+	/* By default, the partitions match. */
+	return true;
+}
+
+/*
+ * Add append paths for the join relation.
+ *
+ * Like set_append_rel_pathlist, this function considers pair-wise partition
+ * join paths with parameterization and pathkeys. 
+ * 
+ * TODO: right now the function just picks up the cheapest path from each of
+ * the partitions and creates an append path with those. 
+ *
+ * TODO: may be we should consider splitting set_append_rel_pathlist() so that
+ * it can be used for both inheritance and partitioning.
+ */
+static void
+add_append_paths_to_joinrel(RelOptInfo *joinrel)
+{
+	RelOptInfo **part_rels = joinrel->part_rels;
+	int		nparts;
+	int		cnt_parts;
+	List   *part_paths = NIL;
+
+	Assert(joinrel->part_desc);
+
+	nparts = joinrel->part_desc->nparts;
+
+	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+	{
+		RelOptInfo *part_rel = part_rels[cnt_parts];
+
+		/* Find the cheapest path for partition relation. */
+		set_cheapest(part_rel);
+
+		/* We don't expect any parameterization here. */
+		Assert(!part_rel->cheapest_total_path->param_info);
+		
+		/*
+		 * Instead of lappend, we should use accumulate_append_subpath() to
+		 * pull up the paths in underlying append.
+		 */
+		part_paths = lappend(part_paths, part_rel->cheapest_total_path);
+	}
+
+	add_path(joinrel, (Path *) create_append_path(joinrel, part_paths, NULL,
+												  0));
+	return;
+}
 
 /*
  * have_join_order_restriction
  *		Detect whether the two relations should be joined to satisfy
  *		a join-order restriction arising from special or lateral joins.
  *
  * In practice this is always used with have_relevant_joinclause(), and so
  * could be merged with that function, but it seems clearer to separate the
  * two concerns.  We need this test because there are degenerate cases where
  * a clauseless join must be performed to satisfy join-order restrictions.
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 1179643..602f231 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -480,20 +480,35 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	{
 		rel->serverid = GetForeignServerIdByRelId(RelationGetRelid(relation));
 		rel->fdwroutine = GetFdwRoutineForRelation(relation, true);
 	}
 	else
 	{
 		rel->serverid = InvalidOid;
 		rel->fdwroutine = NULL;
 	}
 
+	/*
+	 * Get the partitioning scheme.
+	 * TODO: this is temporary code to stick partitioning information in
+	 * RelOptInfo. We might change the way the information is stored. At every
+	 * relation, we need to match partitioning information relevant at that
+	 * level. So, storing only a single level partitioning information should
+	 * suffice, even for a multi-level partitioned table.
+	 */
+	rel->part_desc = RelationGetPartitionDesc(relation);
+	if (rel->part_desc && rel->part_desc->nparts > 0)
+	{
+		rel->part_rels = (RelOptInfo **) palloc0(sizeof(RelOptInfo *) *
+												 rel->part_desc->nparts);
+	}
+
 	heap_close(relation, NoLock);
 
 	/*
 	 * Allow a plugin to editorialize on the info we obtained from the
 	 * catalogs.  Actions might include altering the assumed relation size,
 	 * removing an index, or adding a hypothetical index to the indexlist.
 	 */
 	if (get_relation_info_hook)
 		(*get_relation_info_hook) (root, relationObjectId, inhparent, rel);
 }
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 1e87a73..6e7b1a4 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -377,25 +377,41 @@ build_join_rel(PlannerInfo *root,
 		 * pair of component relations.
 		 */
 		if (restrictlist_ptr)
 			*restrictlist_ptr = build_joinrel_restrictlist(root,
 														   joinrel,
 														   outer_rel,
 														   inner_rel);
 		return joinrel;
 	}
 
+	/* A partition relation can be joined with only partition relation. */
+	Assert(!(outer_rel->reloptkind == RELOPT_OTHER_JOINREL ||
+		    outer_rel->reloptkind == RELOPT_OTHER_MEMBER_REL) ||
+		   (inner_rel->reloptkind == RELOPT_OTHER_JOINREL ||
+			inner_rel->reloptkind == RELOPT_OTHER_MEMBER_REL));
+
 	/*
 	 * Nope, so make one.
 	 */
 	joinrel = makeNode(RelOptInfo);
-	joinrel->reloptkind = RELOPT_JOINREL;
+
+	/*
+	 * A join between partitions or child tables is different from join between
+	 * regular tables.
+	 */
+	if (outer_rel->reloptkind == RELOPT_OTHER_JOINREL ||
+		outer_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+		joinrel->reloptkind = RELOPT_OTHER_JOINREL;
+	else
+		joinrel->reloptkind = RELOPT_JOINREL;
+
 	joinrel->relids = bms_copy(joinrelids);
 	joinrel->rows = 0;
 	/* cheap startup cost is interesting iff not all tuples to be retrieved */
 	joinrel->consider_startup = (root->tuple_fraction > 0);
 	joinrel->consider_param_startup = false;
 	joinrel->consider_parallel = false;
 	joinrel->reltarget = create_empty_pathtarget();
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
 	joinrel->partial_pathlist = NIL;
@@ -539,21 +555,21 @@ build_join_rel(PlannerInfo *root,
 		Assert(!found);
 		hentry->join_rel = joinrel;
 	}
 
 	/*
 	 * Also, if dynamic-programming join search is active, add the new joinrel
 	 * to the appropriate sublist.  Note: you might think the Assert on number
 	 * of members should be for equality, but some of the level 1 rels might
 	 * have been joinrels already, so we can only assert <=.
 	 */
-	if (root->join_rel_level)
+	if (root->join_rel_level && joinrel->reloptkind != RELOPT_OTHER_JOINREL)
 	{
 		Assert(root->join_cur_level > 0);
 		Assert(root->join_cur_level <= bms_num_members(joinrel->relids));
 		root->join_rel_level[root->join_cur_level] =
 			lappend(root->join_rel_level[root->join_cur_level], joinrel);
 	}
 
 	return joinrel;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e246a9c..65e993e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -879,20 +879,29 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 	{
 		{"enable_fkey_estimates", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables use of foreign keys for estimating joins."),
 			NULL
 		},
 		&enable_fkey_estimates,
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables partition-wise join."),
+			NULL
+		},
+		&enable_partition_wise_join,
+		true,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
 						 "exhaustive searching.")
 		},
 		&enable_geqo,
 		true,
 		NULL, NULL, NULL
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 45739c3..332e83f 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -12,20 +12,21 @@
  *-------------------------------------------------------------------------
  */
 #ifndef RELATION_H
 #define RELATION_H
 
 #include "access/sdir.h"
 #include "lib/stringinfo.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "storage/block.h"
+#include "catalog/partition.h"
 
 
 /*
  * Relids
  *		Set of relation identifiers (indexes into the rangetable).
  */
 typedef Bitmapset *Relids;
 
 /*
  * When looking for a "cheapest path", this enum specifies whether we want
@@ -342,20 +343,27 @@ typedef struct PlannerInfo
  * is present in the query join tree but the members are not.  The member
  * RTEs and otherrels are used to plan the scans of the individual tables or
  * subqueries of the append set; then the parent baserel is given Append
  * and/or MergeAppend paths comprising the best paths for the individual
  * member rels.  (See comments for AppendRelInfo for more information.)
  *
  * At one time we also made otherrels to represent join RTEs, for use in
  * handling join alias Vars.  Currently this is not needed because all join
  * alias Vars are expanded to non-aliased form during preprocess_expression.
  *
+ * We also have relations representing each of the pair-wise joins between
+ * partitioned tables with same partitioning scheme. These relations are not
+ * added to join_rel_level lists as they are not joined directly by the dynamic
+ * programming algorithm. Adding these two join_rel_level list also means that
+ * top level list has more than one join relation, which is symantically
+ * incorrect.
+ *
  * There is also a RelOptKind for "upper" relations, which are RelOptInfos
  * that describe post-scan/join processing steps, such as aggregation.
  * Many of the fields in these RelOptInfos are meaningless, but their Path
  * fields always hold Paths showing ways to do that processing step.
  *
  * Lastly, there is a RelOptKind for "dead" relations, which are base rels
  * that we have proven we don't need to join after all.
  *
  * Parts of this data structure are specific to various scan and join
  * mechanisms.  It didn't seem worth creating new node types for them.
@@ -460,20 +468,21 @@ typedef struct PlannerInfo
  * We store baserestrictcost in the RelOptInfo (for base relations) because
  * we know we will need it at least once (to price the sequential scan)
  * and may need it multiple times to price index scans.
  *----------
  */
 typedef enum RelOptKind
 {
 	RELOPT_BASEREL,
 	RELOPT_JOINREL,
 	RELOPT_OTHER_MEMBER_REL,
+	RELOPT_OTHER_JOINREL,
 	RELOPT_UPPER_REL,
 	RELOPT_DEADREL
 } RelOptKind;
 
 typedef struct RelOptInfo
 {
 	NodeTag		type;
 
 	RelOptKind	reloptkind;
 
@@ -532,20 +541,35 @@ typedef struct RelOptInfo
 	struct FdwRoutine *fdwroutine;
 	void	   *fdw_private;
 
 	/* used by various scans and joins: */
 	List	   *baserestrictinfo;		/* RestrictInfo structures (if base
 										 * rel) */
 	QualCost	baserestrictcost;		/* cost of evaluating the above */
 	List	   *joininfo;		/* RestrictInfo structures for join clauses
 								 * involving this rel */
 	bool		has_eclass_joins;		/* T means joininfo is incomplete */
+
+	/* For partitioned relations, joins or base relations. */
+	/* TODO: the partition hierarchy described by the members below, may be put
+	 * into a path rather than RelOptInfo and will be handy in case we start
+	 * supporting repartitioning of the data by different partitioning scheme.
+	 *
+	 * TODO: PartitionDescData contains the array of OIDs of the partitions, but that
+	 * doesn't work for the partitions obtained by pair-wise joins of
+	 * partitioned tables. We should probably create another data structure
+	 * like AppendRelInfo for storing those, but I am not sure.
+	 *
+	 * TODO: Notice recursive usage of RelOptInfo.
+	 */
+	PartitionDesc	part_desc;	/* Partitioning scheme if partitioned */
+	struct RelOptInfo	  **part_rels;	/* RelOptInfo of the partitions. */
 } RelOptInfo;
 
 /*
  * IndexOptInfo
  *		Per-index information for planning/optimization
  *
  *		indexkeys[], indexcollations[], opfamily[], and opcintype[]
  *		each have ncolumns entries.
  *
  *		sortopfamily[], reverse_sort[], and nulls_first[] likewise have
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 58ac163..ff2e3c7 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -60,20 +60,21 @@ extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern bool enable_fkey_estimates;
+extern bool enable_partition_wise_join;
 extern int	constraint_exclusion;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Reply via email to