Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Oct 16, 2017 at 5:03 PM, Ashutosh Bapat wrote: > set_append_rel_size() crashes when it encounters a partitioned table > with a dropped column. Dropped columns do not have any translations > saved in AppendInfo::translated_vars; the corresponding entry is NULL > per make_inh_translation_list(). > 1802 att = TupleDescAttr(old_tupdesc, old_attno); > 1803 if (att->attisdropped) > 1804 { > 1805 /* Just put NULL into this list entry */ > 1806 vars = lappend(vars, NULL); > 1807 continue; > 1808 } > > In set_append_rel_size() we try to attr_needed for child tables. While > doing so we try to translate a user attribute number of parent to that > of a child and crash since the translated Var is NULL. Here's patch to > fix the crash. The patch also contains a testcase to test dropped > columns in partitioned table. > > Sorry for not noticing this problem earlier. OK, committed. This is a good example of how having good code coverage doesn't necessarily mean you've found all the bugs. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas wrote: >> >> The fix is to copy the relevant partitioning information from relcache >> into PartitionSchemeData and RelOptInfo. Here's a quick patch with >> that fix. > > Committed. I hope that makes things less red rather than more, > because I'm going to be AFK for a few hours anyway. > set_append_rel_size() crashes when it encounters a partitioned table with a dropped column. Dropped columns do not have any translations saved in AppendInfo::translated_vars; the corresponding entry is NULL per make_inh_translation_list(). 1802 att = TupleDescAttr(old_tupdesc, old_attno); 1803 if (att->attisdropped) 1804 { 1805 /* Just put NULL into this list entry */ 1806 vars = lappend(vars, NULL); 1807 continue; 1808 } In set_append_rel_size() we try to attr_needed for child tables. While doing so we try to translate a user attribute number of parent to that of a child and crash since the translated Var is NULL. Here's patch to fix the crash. The patch also contains a testcase to test dropped columns in partitioned table. Sorry for not noticing this problem earlier. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company From eca9e03410b049bf79d767657ad4d90fb974d19c Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Mon, 16 Oct 2017 13:23:49 +0530 Subject: [PATCH] Ignore dropped columns in set_append_rel_size(). A parent Var corresponding to column dropped from a parent table will not have any translation for a child. It won't have any attr_needed information since it can not be referenced from SQL. Hence ignore dropped columns while constructing attr_needed information for a child table. Ashutosh Bapat. --- src/backend/optimizer/path/allpaths.c | 12 src/test/regress/expected/alter_table.out |7 +++ src/test/regress/sql/alter_table.sql |4 3 files changed, 23 insertions(+) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 5535b63..1bc5e01 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -950,6 +950,18 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, attno - 1); int child_index; + /* + * Ignore any column dropped from the parent. + * Corresponding Var won't have any translation. It won't + * have attr_needed information, since it can not be + * referenced in the query. + */ + if (var == NULL) + { + Assert(attr_needed == NULL); + continue; + } + child_index = var->varattno - childrel->min_attr; childrel->attr_needed[child_index] = attr_needed; } diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index dbe438d..cc56651 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -3604,6 +3604,13 @@ ALTER TABLE list_parted2 DROP COLUMN b; ERROR: cannot drop column named in partition key ALTER TABLE list_parted2 ALTER COLUMN b TYPE text; ERROR: cannot alter type of column named in partition key +-- dropping non-partition key columns should be allowed on the parent table. +ALTER TABLE list_parted DROP COLUMN b; +SELECT * FROM list_parted; + a +--- +(0 rows) + -- cleanup DROP TABLE list_parted, list_parted2, range_parted; DROP TABLE fail_def_part; diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql index 0c8ae2a..fc7bd66 100644 --- a/src/test/regress/sql/alter_table.sql +++ b/src/test/regress/sql/alter_table.sql @@ -2390,6 +2390,10 @@ ALTER TABLE part_2 INHERIT inh_test; ALTER TABLE list_parted2 DROP COLUMN b; ALTER TABLE list_parted2 ALTER COLUMN b TYPE text; +-- dropping non-partition key columns should be allowed on the parent table. +ALTER TABLE list_parted DROP COLUMN b; +SELECT * FROM list_parted; + -- cleanup DROP TABLE list_parted, list_parted2, range_parted; DROP TABLE fail_def_part; -- 1.7.9.5 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas wrote: > On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat > wrote: >> You are suggesting that a dummy partitioned table be treated as an >> un-partitioned table and apply above suggested optimization. A join >> between a partitioned and unpartitioned table is partitioned by the >> keys of only partitioned table. An unpartitioned table doesn't have >> any keys, so this is fine. But a dummy partitioned table does have >> keys. Recording them as keys of the join relation helps when it joins >> to other relations. Furthermore a join between partitioned and >> unpartitioned table doesn't require any equi-join condition on >> partition keys of partitioned table but a join between partitioned >> tables is considered to be partitioned by keys on both sides only when >> there is an equi-join. So, when implementing a partitioned join >> between a partitioned and an unpartitioned table, we will have to make >> a special case to record partition keys when the unpartitioned side is >> actually a dummy partitioned table. That might be awkward. > > It seems to me that what we really need here is to move all of this > stuff into a separate struct: > > /* used for partitioned relations */ > PartitionScheme part_scheme;/* Partitioning scheme. */ > int nparts; /* number of > partitions */ > struct PartitionBoundInfoData *boundinfo; /* Partition bounds */ > struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions, > > * stored in the same order of bounds */ > List **partexprs; /* Non-nullable partition key > expressions. */ > List **nullable_partexprs; /* Nullable partition key > expressions. */ > In a very early patch I had PartitionOptInfo to hold all of this. RelOptInfo then had a pointer of PartitionOptInfo, if it was partitioned. When a relation can be partitioned in multiple ways like what you describe or because join by re-partitioning is efficient, RelOptInfo would have a list of those. But the representation needs to be thought through. I am wondering whether this should be modelled like IndexOptInfo. I am not sure. This is a topic of much larger discussion. I think we are digressing. We were discussing my patch to handle dummy partitioned relation, whose children are not marked dummy and do not have pathlists set. Do you still think that we should leave that aside? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat wrote: > You are suggesting that a dummy partitioned table be treated as an > un-partitioned table and apply above suggested optimization. A join > between a partitioned and unpartitioned table is partitioned by the > keys of only partitioned table. An unpartitioned table doesn't have > any keys, so this is fine. But a dummy partitioned table does have > keys. Recording them as keys of the join relation helps when it joins > to other relations. Furthermore a join between partitioned and > unpartitioned table doesn't require any equi-join condition on > partition keys of partitioned table but a join between partitioned > tables is considered to be partitioned by keys on both sides only when > there is an equi-join. So, when implementing a partitioned join > between a partitioned and an unpartitioned table, we will have to make > a special case to record partition keys when the unpartitioned side is > actually a dummy partitioned table. That might be awkward. It seems to me that what we really need here is to move all of this stuff into a separate struct: /* used for partitioned relations */ PartitionScheme part_scheme;/* Partitioning scheme. */ int nparts; /* number of partitions */ struct PartitionBoundInfoData *boundinfo; /* Partition bounds */ struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions, * stored in the same order of bounds */ List **partexprs; /* Non-nullable partition key expressions. */ List **nullable_partexprs; /* Nullable partition key expressions. */ ...and then have a RelOptInfo carry a pointer to a list of those structures. That lets us consider multiple possible partition schemes for the same relation. For instance, suppose that a user joins four relations, P1, P2, Q1, and Q2. P1 and P2 are compatibly partitioned. Q1 and Q2 are compatibly partitioned (but not compatible with P1 and P2). Furthermore, let's suppose that the optimal join order begins with a join between P1 and Q1. When we construct the paths for that joinrel, we can either join all of P1 to all of Q1 (giving up on partition-wise join), or we can join each partition of P1 to all of Q1 (producing a result partitioned compatibly with P1 and allowing for a future partition-wise join to P2), or we can join each partition of Q1 to all of P1 (producing a result partitioned compatibly with Q1 and allowing for a future partition-wise join to Q2). Any of those could win depending on the details. With the data structure as it is today, we'd have to choose whether to mark the joinrel as partitioned like P1 or like Q1, but that's not really what we need here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas wrote: > On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat > wrote: >> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas wrote: >>> Committed. I hope that makes things less red rather than more, >>> because I'm going to be AFK for a few hours anyway. >> >> Here's the last patch, dealing with the dummy relations, rebased. With >> this fix every join order of a partitioned join can be considered >> partitioned. (This wasn't the case earlier when dummy relation was >> involved.). So, we can allocate the child-join RelOptInfo array in >> build_joinrel_partition_info(), instead of waiting for an appropriate >> pair to arrive in try_partition_wise_join(). > > Wouldn't a far more general approach be to allow a partition-wise join > between a partitioned table and an unpartitioned table, considering > the result as partitioned? That seems like it would very often yield > much better query plans than what we have right now, and also make the > need for this particular thing go away. > You are suggesting that a dummy partitioned table be treated as an un-partitioned table and apply above suggested optimization. A join between a partitioned and unpartitioned table is partitioned by the keys of only partitioned table. An unpartitioned table doesn't have any keys, so this is fine. But a dummy partitioned table does have keys. Recording them as keys of the join relation helps when it joins to other relations. Furthermore a join between partitioned and unpartitioned table doesn't require any equi-join condition on partition keys of partitioned table but a join between partitioned tables is considered to be partitioned by keys on both sides only when there is an equi-join. So, when implementing a partitioned join between a partitioned and an unpartitioned table, we will have to make a special case to record partition keys when the unpartitioned side is actually a dummy partitioned table. That might be awkward. Because we don't have dummy children relation in all cases, we already have some awkwardness like allocating part_rels array only when we encounter a join order which has all the children. This patch removes that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat wrote: > On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas wrote: >> Committed. I hope that makes things less red rather than more, >> because I'm going to be AFK for a few hours anyway. > > Here's the last patch, dealing with the dummy relations, rebased. With > this fix every join order of a partitioned join can be considered > partitioned. (This wasn't the case earlier when dummy relation was > involved.). So, we can allocate the child-join RelOptInfo array in > build_joinrel_partition_info(), instead of waiting for an appropriate > pair to arrive in try_partition_wise_join(). Wouldn't a far more general approach be to allow a partition-wise join between a partitioned table and an unpartitioned table, considering the result as partitioned? That seems like it would very often yield much better query plans than what we have right now, and also make the need for this particular thing go away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas wrote: > > Committed. I hope that makes things less red rather than more, > because I'm going to be AFK for a few hours anyway. > Here's the last patch, dealing with the dummy relations, rebased. With this fix every join order of a partitioned join can be considered partitioned. (This wasn't the case earlier when dummy relation was involved.). So, we can allocate the child-join RelOptInfo array in build_joinrel_partition_info(), instead of waiting for an appropriate pair to arrive in try_partition_wise_join(). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company From 4bf8ca38719aee730ed57a7f14522384b1ced7b0 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Tue, 29 Aug 2017 12:37:14 +0530 Subject: [PATCH] Support partition-wise join for dummy partitioned relation. Current partition-wise join implementation treats dummy relations as unpartitioned since the child relations may not be marked dummy and may not have their pathlists set (See set_rel_size() and set_rel_pathlist()). Since children do not have paths set, they can not be used to form a child-join. This patch marks the children of dummy partitioned relations as dummy, thus allowing partition-wise join when one of the joining relations is dummy. If the dummy partitioned relation is a base relation, its children are base relations as well and we will be marking base relation dummy during join planning. But this shouldn't be a problem since populate_joinrel_with_paths() already does that to an inner relation of left join. Ashutosh Bapat. --- src/backend/optimizer/path/allpaths.c |7 +-- src/backend/optimizer/path/joinrels.c | 24 src/backend/optimizer/util/relnode.c |5 + 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 5535b63..b46fb5b 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -3261,12 +3261,7 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel) if (IS_DUMMY_REL(rel)) return; - /* - * Nothing to do if the relation is not partitioned. An outer join - * relation which had empty inner relation in every pair will have rest of - * the partitioning properties set except the child-join RelOptInfos. See - * try_partition_wise_join() for more explanation. - */ + /* Nothing to do if the relation is not partitioned. */ if (rel->nparts <= 0 || rel->part_rels == NULL) return; diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 2b868c5..1578dea 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -1321,14 +1321,19 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, /* * set_rel_pathlist() may not create paths in children of an empty - * partitioned table and so we can not add paths to child-joins. So, deem - * such a join as unpartitioned. When a partitioned relation is deemed - * empty because all its children are empty, dummy path will be set in - * each of the children. In such a case we could still consider the join - * as partitioned, but it might not help much. + * partitioned table. Mark such children as dummy so that we can add paths + * to child-joins. */ - if (IS_DUMMY_REL(rel1) || IS_DUMMY_REL(rel2)) - return; + if (IS_DUMMY_REL(rel1)) + { + for (cnt_parts = 0; cnt_parts < rel1->nparts; cnt_parts++) + mark_dummy_rel(rel1->part_rels[cnt_parts]); + } + if (IS_DUMMY_REL(rel2)) + { + for (cnt_parts = 0; cnt_parts < rel2->nparts; cnt_parts++) + mark_dummy_rel(rel2->part_rels[cnt_parts]); + } /* * Since this join relation is partitioned, all the base relations @@ -1361,11 +1366,6 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, nparts = joinrel->nparts; - /* Allocate space to hold child-joins RelOptInfos, if not already done. */ - if (!joinrel->part_rels) - joinrel->part_rels = - (RelOptInfo **) palloc0(sizeof(RelOptInfo *) * nparts); - /* * Create child-join relations for this partitioned join, if those don't * exist. Add paths to child-joins for a pair of child relations diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index 3bd1063..21fd29f 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -1746,4 +1746,9 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel, joinrel->partexprs[cnt] = partexpr; joinrel->nullable_partexprs[cnt] = nullable_partexpr; } + + /* Allocate space to hold child-joins RelOptInfos. */ + joinrel->part_rels = + (RelOptInfo **) palloc0(sizeof(RelOptInfo *) * joinrel->nparts); + } -- 1.7.9.5 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http:
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Oct 6, 2017 at 3:07 PM, Ashutosh Bapat wrote: > On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas wrote: >> I think this is very good work and I'm excited about the feature. Now >> I'll wait to see whether the buildfarm, or Tom, yell at me for >> whatever problems this may still have... > > Buildfarm animal prion turned red. Before going into that failure, > good news is that the other animals are green. So the plans are > stable. > > prion runs the regression with -DRELCACHE_FORCE_RELEASE, which > destroys a relcache entry as soon as its reference count drops down to > 0. This destroys everything that's there in corresponding relcache > entry including partition key information and partition descriptor > information. find_partition_scheme() and set_relation_partition_info() > both assume that the relcache information will survive as long as the > relation lock is held. They do not copy the relevant partitioning > information but just copy the pointers. That assumption is wrong. > Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to > zero, the data in partition scheme and partition bounds goes invalid > and various checks to see if partition wise join is possible fail. > That causes partition_join test to fail on prion. But I think, the bug > could cause crash as well. > > The fix is to copy the relevant partitioning information from relcache > into PartitionSchemeData and RelOptInfo. Here's a quick patch with > that fix. Committed. I hope that makes things less red rather than more, because I'm going to be AFK for a few hours anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas wrote: > > I think this is very good work and I'm excited about the feature. Now > I'll wait to see whether the buildfarm, or Tom, yell at me for > whatever problems this may still have... > Buildfarm animal prion turned red. Before going into that failure, good news is that the other animals are green. So the plans are stable. prion runs the regression with -DRELCACHE_FORCE_RELEASE, which destroys a relcache entry as soon as its reference count drops down to 0. This destroys everything that's there in corresponding relcache entry including partition key information and partition descriptor information. find_partition_scheme() and set_relation_partition_info() both assume that the relcache information will survive as long as the relation lock is held. They do not copy the relevant partitioning information but just copy the pointers. That assumption is wrong. Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to zero, the data in partition scheme and partition bounds goes invalid and various checks to see if partition wise join is possible fail. That causes partition_join test to fail on prion. But I think, the bug could cause crash as well. The fix is to copy the relevant partitioning information from relcache into PartitionSchemeData and RelOptInfo. Here's a quick patch with that fix. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 3a8306a..ebda85e 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -702,6 +702,74 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval, } /* + * Return a copy of given PartitionBoundInfo structure. The data types of bounds + * are described by given partition key specificiation. + */ +extern PartitionBoundInfo +partition_bounds_copy(PartitionBoundInfo src, + PartitionKey key) +{ + PartitionBoundInfo dest; + int i; + int ndatums; + int partnatts; + int num_indexes; + + dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData)); + + dest->strategy = src->strategy; + ndatums = dest->ndatums = src->ndatums; + partnatts = key->partnatts; + + /* Range partitioned table has an extra index. */ + num_indexes = key->strategy == PARTITION_STRATEGY_RANGE ? ndatums + 1 : ndatums; + + /* List partitioned tables have only a single partition key. */ + Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1); + + dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums); + + if (src->kind != NULL) + { + dest->kind = (PartitionRangeDatumKind **) palloc(ndatums * +sizeof(PartitionRangeDatumKind *)); + for (i = 0; i < ndatums; i++) + { + dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts * +sizeof(PartitionRangeDatumKind)); + + memcpy(dest->kind[i], src->kind[i], + sizeof(PartitionRangeDatumKind) * key->partnatts); + } + } + else + dest->kind = NULL; + + for (i = 0; i < ndatums; i++) + { + int j; + dest->datums[i] = (Datum *) palloc(sizeof(Datum) * partnatts); + + for (j = 0; j < partnatts; j++) + { + if (dest->kind == NULL || +dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE) +dest->datums[i][j] = datumCopy(src->datums[i][j], + key->parttypbyval[j], + key->parttyplen[j]); + } + } + + dest->indexes = (int *) palloc(sizeof(int) * num_indexes); + memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes); + + dest->null_index = src->null_index; + dest->default_index = src->default_index; + + return dest; +} + +/* * check_new_partition_bound * * Checks if the new partition's bound overlaps any of the existing partitions diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 93cc757..9d35a41 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -1825,13 +1825,15 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, Relation relation) { PartitionDesc partdesc; + PartitionKey partkey; Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); partdesc = RelationGetPartitionDesc(relation); + partkey = RelationGetPartitionKey(relation); rel->part_scheme = find_partition_scheme(root, relation); Assert(partdesc != NULL && rel->part_scheme != NULL); - rel->boundinfo = partdesc->boundinfo; + rel->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey); rel->nparts = partdesc->nparts; set_baserel_partition_key_exprs(relation, rel); } @@ -1888,18 +1890,33 @@ find_partition_scheme(PlannerInfo *root, Relation relation) /* * Did not find matching partition scheme. Create one copying relevant - * information from the relcache. Instead of copying whole arrays, copy - * the pointers in relcache. It's safe to do so since - * RelationClearRelation() wouldn't change it while planner is using it. + *
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas wrote: > On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat > wrote: >> Sorry. I sent a wrong file. Here's the real v37. > > Committed 0001-0006. I made some assorted comment and formatting > changes and two small substantive changes: > > - In try_nestloop_path, bms_free(outerrelids) before returning if we > can't reparameterize. Hmm. I missed that. > > - Moved the call to try_partition_wise_join inside > populate_joinrel_with_paths, instead of always calling it just after > that function is called. This looks good too. > > I think this is very good work and I'm excited about the feature. Thanks a lot Robert for detailed review and guidance. Thanks a lot Rafia for benchmarking the feature with TPCH and esp. very large scale database and also for testing and reported some real issues. Thanks Rajkumar for testing it with an exhaustive testset. Thanks Amit Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and Amit Khandekar for their review comments and suggestions. Thanks Jeevan Chalke, who used the patchset to implement partition-wise aggregates and provided some insights offlist. Sorry if I have missed anybody. As Robert says in the commit message, there's more to do but now that we have basic feature, improving it incrementally becomes a lot easier. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat wrote: > Sorry. I sent a wrong file. Here's the real v37. Committed 0001-0006. I made some assorted comment and formatting changes and two small substantive changes: - In try_nestloop_path, bms_free(outerrelids) before returning if we can't reparameterize. - Moved the call to try_partition_wise_join inside populate_joinrel_with_paths, instead of always calling it just after that function is called. I think this is very good work and I'm excited about the feature. Now I'll wait to see whether the buildfarm, or Tom, yell at me for whatever problems this may still have... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas wrote: > On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas wrote: >> I decided to skip over 0001 for today and spend some time looking at >> 0002-0006. > > Back to 0001. > > +Enables or disables the query planner's use of partition-wise join > +plans. When enabled, it spends time in creating paths for joins > between > +partitions and consumes memory to construct expression nodes to be > used > +for those joins, even if partition-wise join does not result in the > +cheapest path. The time and memory increase exponentially with the > +number of partitioned tables being joined and they increase linearly > +with the number of partitions. The default is off. > > I think this is too scary and too much technical detail. I think you > could just say something like: Enables or disables use of > partition-wise join, which allows a join between partitioned tables to > be performed by joining the matching partitions. Partition-wise join > currently applies only when the join conditions include all the > columns of the partition keys, which must be of the same data type and > have exactly matching sets of child partitions. Because > partition-wise join planning can use significantly increase CPU time > and memory usage during planning, the default is off. Done. With slight change. "include all the columns of the partition keys" has a different meaning when partition key is an expression, so I have used "include all the partition keys". Also changed the last sentence as "... can use significantly more CPU time and memory during planning ...". Please feel free to revert those changes, if you don't like them. > > +partitioned table. The join partners can not be found in other partitions. > This > +condition allows the join between partitioned tables to be broken into joins > +between the matching partitions. The resultant join is partitioned in the > same > > "The join partners can not be found in other partitions." is redundant > with the previous sentence. I suggest deleting it. I also suggest > "This condition allows the join between partitioned tables to be > broken" -> "Because of this, the join between partitioned tables can > be broken". Done. > > +relation" for both partitioned table as well as join between partitioned > tables > +which can use partition-wise join technique. > > for either a partitioned table or a join between compatibly partitioned tables Done. > > +Partitioning properties of a partitioned relation are stored in > +PartitionSchemeData structure. Planner maintains a list of canonical > partition > +schemes (distinct PartitionSchemeData objects) so that any two partitioned > +relations with same partitioning scheme share the same PartitionSchemeData > +object. This reduces memory consumed by PartitionSchemeData objects and makes > +it easy to compare the partition schemes of joining relations. > > Not all of the partitioning properties are stored in the > PartitionSchemeData structure any more. I think this needs some > rethinking and maybe some expansion. As written, each of the first > two sentences needs a "the" at the beginning. Changed to The partitioning properties of a partitioned relation are stored in its RelOptInfo. The information about data types of partition keys are stored in PartitionSchemeData structure. The planner maintains a list of canonical partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of any two partitioned relations with same partitioning scheme point to the same PartitionSchemeData object. This reduces memory consumed by PartitionSchemeData objects and makes it easy to compare the partition schemes of joining relations. Let me know if this looks good. > > + /* > +* Create "append" paths for > partitioned joins. Do this before > +* creating GatherPaths so that > partial "append" paths in > +* partitioned joins will be considered. > +*/ > > I think you could shorten this to a single-line comment and just keep > the first sentence. Similarly in the other location where you have > the same sort of thing. Done. > > + * child-joins. Otherwise, add_path might delete a path that some "append" > + * path has reference to. > > to which some path generated here has a reference. Done. > > Here and elsewhere, you use "append" rather than Append to refer to > the paths added. I suppose that's weasel-wording to work around the > fact that they might be either Append or MergeAppend paths, but I'm > not sure it's really going to convey that to anyone. I suggest > rephrasing those comments more generically, e.g.: > > + /* Add "append" paths containing paths from child-joins. */ > > You could say: Build additional paths for this rel from child-join paths. > > Or something. Do
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera wrote: > Robert Haas wrote: > >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > If I understand correctly, what's being used here is the "-wise" suffix, > unrelated to wisdom, which Merriam Webster lists as "adverb combining > form" here https://www.merriam-webster.com/dictionary/wise (though you > have to scroll down a lot), which is defined as > > 1 a :in the manner of * crabwise * fanwise > b :in the position or direction of * slantwise * clockwise > 2 :with regard to :in respect of * dollarwise > That's right. > According to that, the right way to write this is "partitionwise join" > (no dash), which means "join in respect of partitions", "join with > regard to partitions". Google lists mostly "partition wise" or "partition-wise" and very rarely "partitionwise". The first being used in other DBMS literature. The paper (there aren't many on this subject) I referred [1] uses "partition-wise". It made more sense to replace " " or "-" with "_" when syntax doesn't allow the first two. I am not against "partitionwise" but I don't see any real reason why we should move away from popular usage of this term. [1] https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Oct 5, 2017 at 9:48 AM, Alvaro Herrera wrote: > Robert Haas wrote: >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > If I understand correctly, what's being used here is the "-wise" suffix, > unrelated to wisdom, which Merriam Webster lists as "adverb combining > form" here https://www.merriam-webster.com/dictionary/wise (though you > have to scroll down a lot), which is defined as > > 1 a :in the manner of * crabwise * fanwise > b :in the position or direction of * slantwise * clockwise > 2 :with regard to :in respect of * dollarwise > > According to that, the right way to write this is "partitionwise join" > (no dash), which means "join in respect of partitions", "join with > regard to partitions". I'm fine with that, if others like it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Robert Haas wrote: > Regarding nomenclature and my previous griping about wisdom, I was > wondering about just calling this a "partition join" like you have in > the regression test. So the GUC would be enable_partition_join, you'd > have generate_partition_join_paths(), etc. Basically just delete > "wise" throughout. If I understand correctly, what's being used here is the "-wise" suffix, unrelated to wisdom, which Merriam Webster lists as "adverb combining form" here https://www.merriam-webster.com/dictionary/wise (though you have to scroll down a lot), which is defined as 1 a :in the manner of * crabwise * fanwise b :in the position or direction of * slantwise * clockwise 2 :with regard to :in respect of * dollarwise According to that, the right way to write this is "partitionwise join" (no dash), which means "join in respect of partitions", "join with regard to partitions". -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas wrote: > On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas wrote: >> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat >> wrote: >>> About your earlier comment of making build_joinrel_partition_info() >>> simpler. Right now, the code assumes that partexprs or >>> nullable_partexpr can be NULL when either of them is not populated. >>> That may be saves a sizeof(pointer) * (number of keys) byes of memory. >>> Saving that much memory may not be worth the complexity of code. So, >>> we may always allocate memory for those arrays and fill it with NIL >>> values when there are no key expressions to populate those. That will >>> simplify the code. I haven't done that change in this patchset. I was >>> busy debugging the Q7 regression. Let me know your comments about >>> that. >> >> Hmm, I'm not sure that's the best approach, but let me look at it more >> carefully before I express a firm opinion. > > Having studied this a bit more, I now think your proposed approach is > a good idea. Thanks. Done. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat wrote: > I am not sure whether your assumption that expression with no Vars > would have em_relids empty is correct. I wonder whether we will add > any em_is_child members with empty em_relids; looking at > process_equivalence() those come from RestrictInfo::left/right_relids > which just indicates the relids at which that particular expression > can be evaluated. Place holder vars is an example when that can > happen, but there may be others. To verify this, I tried attached > patch on master and ran make check. The assertion didn't trip. If > em_relids is not NULL, bms_is_subset() is fine. I spent some more time experimenting with this. I found that cases where an em_is_child equivalence class contains multiple relids are quite easy to generate, e.g. select * from foo, bar where foo.a + bar.a = 0, where foo and bar are partitioned. However, I wasn't able to generate a case where an em_is_child equivalence class has no relids at all, and I'm out of ideas about how such a thing could occur. I suspect it can't. I wondered whether there was some problem with the multiple-relids case, but I can't find an example where that misbehaves either. So maybe it's fine (or maybe I'm just not smart enough to find the case where it breaks). >> I don't think I believe that comment, either. In the case from which >> that comment was copied (mark_dummy_rel), it was talking about a >> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers >> to joinrels creating during a GEQO cycle. But there's no similar >> logic for ppilist, so I think what will happen here is that you'll end >> up with a freed node in the middle of the list. > > In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist > with dummy path being created in the same context as the RelOptInfo. > Same applies here. Oops. I was thinking that the ppilist was attached to some planner-global structure, but it's not; it's hanging off the RelOptInfo. So you're entirely right, and I'm just being dumb. > We need to reparameterize any path which contains further paths and/or > contains expressions that point to the parent relation. For a given > path we need to reparameterize any paths that it contains and > translate any expressions that are specific to that path. Expressions > common across the paths are translated after the switch case. I have > added this rule to the comment just above the switch case > /* > * Copy of the given path. Reparameterize any paths referenced by the > given > * path. Replace parent Vars in path specific expressions by corresponding > * child Vars. > */ > Does that look fine or we want to add explanation for every node handled here. No, I don't think we want something for every node, just a general explanation at the top of the function. Maybe something like this: Most fields from the original path can simply be flat-copied, but any expressions must be adjusted to refer to the correct varnos, and any paths must be recursively reparameterized. Other fields that refer to specific relids also need adjustment. >> I don't see much point in the T_SubqueryScanPath and T_ResultPath >> cases in reparameterize_path_by_child(). It's just falling through to >> the default case. > > I added those cases separately to explain why we should not see those > cases in that switch case. I think that explanation is important > (esp. considering your comment above) and associating those comment > with "case" statement looks better. Are you suggesting that we should > add that explanation in default case? Or leave the explanation out altogether. >> I wonder if reparameterize_path_by_child() ought to default to >> returning NULL rather than throwing an error; the caller would then >> have to be prepared for that and skip building the path. But that >> would be more like what reparameterize_path() does, and it would make >> failure to include some relevant path type here a corner-case >> performance bug rather than a correctness issue. It seems like >> someone adding a new path type could quite easily fail to realize that >> it might need to be added here, or might be unsure whether it's >> necessary to add it here. > > I am OK with that. However reparameterize_path_by_child() and > reparameterize_paths_by_child() are callers of > reparameterize_path_by_child() so they will need to deal with NULL > return. I am fine with that too, but making sure that we are on the > same page. If we do that, we could simply assert that the switch case > doesn't see T_SubqueryScanPath and T_ResultPath. Or do nothing at all about those cases. I noticed today that the version of the patchset I have here says in the header comments for reparameterize_path_by_child() that it returns NULL if it can't reparameterize, but that's not what it actually does. If you make this change, the existing comment will become correct. The problem with the NULL return conventi
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 4, 2017 at 11:34 AM, Robert Haas wrote: > +Enables or disables the query planner's use of partition-wise join > +plans. When enabled, it spends time in creating paths for joins > between > +partitions and consumes memory to construct expression nodes to be > used > +for those joins, even if partition-wise join does not result in the > +cheapest path. The time and memory increase exponentially with the > +number of partitioned tables being joined and they increase linearly > +with the number of partitions. The default is off. > > I think this is too scary and too much technical detail. I think you > could just say something like: Enables or disables use of > partition-wise join, which allows a join between partitioned tables to > be performed by joining the matching partitions. Partition-wise join > currently applies only when the join conditions include all the > columns of the partition keys, which must be of the same data type and > have exactly matching sets of child partitions. Because > partition-wise join planning can use significantly increase CPU time > and memory usage during planning, the default is off. Not enough caffeine, obviously: should have been something like -- Because partition-wise join can significantly increase the CPU and memory costs of planning... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas wrote: > I decided to skip over 0001 for today and spend some time looking at > 0002-0006. Back to 0001. +Enables or disables the query planner's use of partition-wise join +plans. When enabled, it spends time in creating paths for joins between +partitions and consumes memory to construct expression nodes to be used +for those joins, even if partition-wise join does not result in the +cheapest path. The time and memory increase exponentially with the +number of partitioned tables being joined and they increase linearly +with the number of partitions. The default is off. I think this is too scary and too much technical detail. I think you could just say something like: Enables or disables use of partition-wise join, which allows a join between partitioned tables to be performed by joining the matching partitions. Partition-wise join currently applies only when the join conditions include all the columns of the partition keys, which must be of the same data type and have exactly matching sets of child partitions. Because partition-wise join planning can use significantly increase CPU time and memory usage during planning, the default is off. +partitioned table. The join partners can not be found in other partitions. This +condition allows the join between partitioned tables to be broken into joins +between the matching partitions. The resultant join is partitioned in the same "The join partners can not be found in other partitions." is redundant with the previous sentence. I suggest deleting it. I also suggest "This condition allows the join between partitioned tables to be broken" -> "Because of this, the join between partitioned tables can be broken". +relation" for both partitioned table as well as join between partitioned tables +which can use partition-wise join technique. for either a partitioned table or a join between compatibly partitioned tables +Partitioning properties of a partitioned relation are stored in +PartitionSchemeData structure. Planner maintains a list of canonical partition +schemes (distinct PartitionSchemeData objects) so that any two partitioned +relations with same partitioning scheme share the same PartitionSchemeData +object. This reduces memory consumed by PartitionSchemeData objects and makes +it easy to compare the partition schemes of joining relations. Not all of the partitioning properties are stored in the PartitionSchemeData structure any more. I think this needs some rethinking and maybe some expansion. As written, each of the first two sentences needs a "the" at the beginning. + /* +* Create "append" paths for partitioned joins. Do this before +* creating GatherPaths so that partial "append" paths in +* partitioned joins will be considered. +*/ I think you could shorten this to a single-line comment and just keep the first sentence. Similarly in the other location where you have the same sort of thing. + * child-joins. Otherwise, add_path might delete a path that some "append" + * path has reference to. to which some path generated here has a reference. Here and elsewhere, you use "append" rather than Append to refer to the paths added. I suppose that's weasel-wording to work around the fact that they might be either Append or MergeAppend paths, but I'm not sure it's really going to convey that to anyone. I suggest rephrasing those comments more generically, e.g.: + /* Add "append" paths containing paths from child-joins. */ You could say: Build additional paths for this rel from child-join paths. Or something. + if (!REL_HAS_ALL_PART_PROPS(rel)) + return; Isn't this an unnecessarily expensive test? I mean, it shouldn't be possible for it to have some arbitrary subset. + /* +* Every pair of joining relations we see here should have an equi-join +* between partition keys if this join has been deemed as a partitioned +* join. See build_joinrel_partition_info() for reasons. +*/ + Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype, + parent_restrictlist)); I suggest removing this assertion. Seems like overkill to me. + child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo, + child_rel1->relids, + child_rel2->relids); It seems like we might end up doing this multiple times for the same child join, if there are more than 2 tables involved. Not sure if there's a good way to avoid that. Similarly for child_restrictlist. + pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks); Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack allocation is a lot faster, and then you don't need to pfree it. + /* Remove the relabel
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas wrote: > On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat > wrote: >> About your earlier comment of making build_joinrel_partition_info() >> simpler. Right now, the code assumes that partexprs or >> nullable_partexpr can be NULL when either of them is not populated. >> That may be saves a sizeof(pointer) * (number of keys) byes of memory. >> Saving that much memory may not be worth the complexity of code. So, >> we may always allocate memory for those arrays and fill it with NIL >> values when there are no key expressions to populate those. That will >> simplify the code. I haven't done that change in this patchset. I was >> busy debugging the Q7 regression. Let me know your comments about >> that. > > Hmm, I'm not sure that's the best approach, but let me look at it more > carefully before I express a firm opinion. Having studied this a bit more, I now think your proposed approach is a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas wrote: > > 0003: > > The commit message mentions estimate_num_groups but the patch doesn't touch > it. This was fixed when we converted many rel->reloptkind == RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from the commit message. > > I am concerned that this patch might introduce some problem fixed by > commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that > patch say, at one place, that "This protects against possible > incorrect matches to child expressions that contain no Vars." > However, if a child expression has no Vars, then I think em->em_relids > will be empty, so the bms_is_equal() test that is there now will fail > but your proposed bms_is_subset() test will pass. bms_is_equal() was enough when there was only a single member in relids but it doesn't work now that there can be multiple of them. bms_is_equal() was replaced with bms_is_subset() to accomodate for ec_members with only a subset of relids when we are searching for a join relation. I am not sure whether your assumption that expression with no Vars would have em_relids empty is correct. I wonder whether we will add any em_is_child members with empty em_relids; looking at process_equivalence() those come from RestrictInfo::left/right_relids which just indicates the relids at which that particular expression can be evaluated. Place holder vars is an example when that can happen, but there may be others. To verify this, I tried attached patch on master and ran make check. The assertion didn't trip. If em_relids is not NULL, bms_is_subset() is fine. If em_relids could indeed go NULL when em_is_child is true, passing NULL relids (for parent rels) to that function can cause unwanted behaviour. bms_is_equal(em->em_relids, relids) will return true turning the if (em->em_is_child && !bms_is_equal(em->em_relids, relids)) to false. This means that we will consider a child member with em_relids NULL even while matching a parent relation. What surprises me is, that commit added a bunch of testcases and none of them failed with this change. Nonetheless, I have changed "matches" with "belongs to" in the prologue of those functions since an exact match won't be possible with child-joins. > > 0004: > > I suggest renaming get_wholerow_ref_from_convert_row_type to > is_converted_whole_row_reference and making it return a bool. Done. > > The coding of that function is a little strange; why not move Var to > an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var = > castNode(Var, convexpr->arg; if (var->varattno == 0) return var; } I probably went too far to avoid indented code :). Fixed now. > > Will the statement that "In case of multi-level partitioning, we will > have as many nested ConvertRowtypeExpr as there are levels in > partition hierarchy" be falsified by Amit Khandekar's pending patch to > avoid sticking a ConvertRowTypeExpr on top of another > ConvertRowTypeExpr? Even if the answer is "no", I think it might be > better to drop this part of the comment; it would be easy for it to > become false in the future, because we might want to optimize that > case in the future and we'll probably forget to update this comment > when we do. That might keep someone wondering where the nested ConvertRowtypeExpr's came from. But may be in future those can arise from something other than multi-level partition hierarchy and in that case too the comment would be rendered inaccurate. So done. > > In fix_upper_expr_mutator(), you have an if statement whose entire > contents are another if statement. I think you should use && instead, > and maybe reverse the order of the tests, since > context->subplan_itlist->has_conv_whole_rows is probably cheaper to > test than a function call. It's also a little strange that this code > isn't adjacent too, or merged with, the existing has_non_vars case. > Maybe: > > converted_whole_row = is_converted_whole_row_reference(node); > if (context->outer_itlist && (context->outer_itlist->has_non_vars || > (context->outer_itlist->has_conv_whole_rows && converted_whole_row)) > ... > if (context->inner_itlist && (context->inner_itlist->has_non_vars || > (context->inner_itlist->has_conv_whole_rows && converted_whole_row)) I placed it with the other node types since it's for a specific node type, but I guess your suggestion avoids duplicates and looks better. Done. > ... > > 0005: > > The comment explaining why the ParamPathInfo is allocated in the same > context as the RelOptInfo is a modified copy of an existing comment > that still reads like the original, a manner of commenting I find a > bit undesirable as it leads to filling up the source base with > duplicate comments. I have pointed to mark_dummy_rel() in that comment instead of duplicating the whole paragraph. > > I don't think I believe that comment, either. In the case from which > that comment was copied (mark_dummy_rel), it was talking about a > RelOptInfo, and geq
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/10/04 4:27, Robert Haas wrote: > On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat > wrote: >>> Regarding nomenclature and my previous griping about wisdom, I was >>> wondering about just calling this a "partition join" like you have in >>> the regression test. So the GUC would be enable_partition_join, you'd >>> have generate_partition_join_paths(), etc. Basically just delete >>> "wise" throughout. >> >> Partition-wise join is standard term used in literature and in >> documentation of other popular DBMSes, so partition_wise makes more >> sense. But I am fine with partition_join as well. Do you want it >> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin >> etc.? > > Well, you're making me have second thoughts. It's really just that > partition_wise looks a little awkward to me, and maybe that's not > enough reason to change anything. I suppose if I commit it this way > and somebody really hates it, it can always be changed later. We're > not getting a lot of input from anyone else at the moment. FWIW, the name enable_partition_join seems enough to convey the core feature, that is, I see "_wise" as redundant, even though I'm now quite used to seeing "_wise" in the emails here and saying it out loud every now and then. Ashutosh may have a point though that users coming from other databases might miss the "_wise". :) Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat wrote: >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > Partition-wise join is standard term used in literature and in > documentation of other popular DBMSes, so partition_wise makes more > sense. But I am fine with partition_join as well. Do you want it > partition_join or partitionjoin like enable_mergejoin/enable_hashjoin > etc.? Well, you're making me have second thoughts. It's really just that partition_wise looks a little awkward to me, and maybe that's not enough reason to change anything. I suppose if I commit it this way and somebody really hates it, it can always be changed later. We're not getting a lot of input from anyone else at the moment. > Attached the updated patch-set. I decided to skip over 0001 for today and spend some time looking at 0002-0006. Comments below. 0002: Looks fine. 0003: The commit message mentions estimate_num_groups but the patch doesn't touch it. I am concerned that this patch might introduce some problem fixed by commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that patch say, at one place, that "This protects against possible incorrect matches to child expressions that contain no Vars." However, if a child expression has no Vars, then I think em->em_relids will be empty, so the bms_is_equal() test that is there now will fail but your proposed bms_is_subset() test will pass. 0004: I suggest renaming get_wholerow_ref_from_convert_row_type to is_converted_whole_row_reference and making it return a bool. The coding of that function is a little strange; why not move Var to an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var = castNode(Var, convexpr->arg; if (var->varattno == 0) return var; } Will the statement that "In case of multi-level partitioning, we will have as many nested ConvertRowtypeExpr as there are levels in partition hierarchy" be falsified by Amit Khandekar's pending patch to avoid sticking a ConvertRowTypeExpr on top of another ConvertRowTypeExpr? Even if the answer is "no", I think it might be better to drop this part of the comment; it would be easy for it to become false in the future, because we might want to optimize that case in the future and we'll probably forget to update this comment when we do. In fix_upper_expr_mutator(), you have an if statement whose entire contents are another if statement. I think you should use && instead, and maybe reverse the order of the tests, since context->subplan_itlist->has_conv_whole_rows is probably cheaper to test than a function call. It's also a little strange that this code isn't adjacent too, or merged with, the existing has_non_vars case. Maybe: converted_whole_row = is_converted_whole_row_reference(node); if (context->outer_itlist && (context->outer_itlist->has_non_vars || (context->outer_itlist->has_conv_whole_rows && converted_whole_row)) ... if (context->inner_itlist && (context->inner_itlist->has_non_vars || (context->inner_itlist->has_conv_whole_rows && converted_whole_row)) ... 0005: The comment explaining why the ParamPathInfo is allocated in the same context as the RelOptInfo is a modified copy of an existing comment that still reads like the original, a manner of commenting I find a bit undesirable as it leads to filling up the source base with duplicate comments. I don't think I believe that comment, either. In the case from which that comment was copied (mark_dummy_rel), it was talking about a RelOptInfo, and geqo_eval() takes care to remove any leftover pointers to joinrels creating during a GEQO cycle. But there's no similar logic for ppilist, so I think what will happen here is that you'll end up with a freed node in the middle of the list. I think reparameterize_path_by_chid() could use a helper function reparameterize_pathlist_by_child() that iterates over a list of paths and returns a list of paths. That would remove some of the loops. I think the comments for reparameterize_path_by_child() need to be expanded. They don't explain how you decided which nodes need to be handled here or which fields within those nodes need some kind of handling other than a flat-copy. I think these kinds of explanations will be important for future maintenance of this code. You know why you did it this way, I can mostly guess what you did it this way, but what about the next person who comes along who hasn't made a detailed study of partition-wise join? I don't see much point in the T_SubqueryScanPath and T_ResultPath cases in reparameterize_path_by_child(). It's just falling through to the default case. I wonder if reparameterize_path_by_child() ought to default to returning NULL rather than throwing an error; the caller would then have
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat wrote: > Here's set of rebased patches. The patch with extra tests is not for > committing. All other patches, except the last one, will need to be > committed together. The last patch may be committed along with other > patches or as a separate patch. In set_append_rel_size, is it necessary to set attr_needed = bms_copy(rel->attr_needed[index]) rather than just pointing to the existing value? If so, perhaps the comments should explain the reasons. I would have thought that the values wouldn't change after this point, in which case it might not be necessary to copy them. Regarding nomenclature and my previous griping about wisdom, I was wondering about just calling this a "partition join" like you have in the regression test. So the GUC would be enable_partition_join, you'd have generate_partition_join_paths(), etc. Basically just delete "wise" throughout. The elog(DEBUG3) in try_partition_wise_join() doesn't follow message style guidelines and I think should just be removed. It was useful for development, I'm sure, but it's time for it to go. +elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path)); I think we should use the same formulation as elsewhere, namely "unrecognized node type: %d". And likewise probably "unexpected join type: %d". partition_join_extras.sql has a bunch of whitespace damage, although it doesn't really matter since, as you say, that's not for commit. (This is not a full review, just a few thoughts.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat wrote: > >> >> Apart from these there is a regression case on a custom table, on head >> query completes in 20s and with this patch it takes 27s. Please find >> the attached .out and .sql file for the output and schema for the test >> case respectively. I have reported this case before (sometime around >> March this year) as well, but I am not sure if it was overlooked or is >> an unimportant and expected behaviour for some reason. >> > > Are you talking about [1]? I have explained about the regression in > [2] and [3]. This looks like an issue with the existing costing model. > I debugged this case further. There are two partitioned tables being joined prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with partitions prt2_p1, prt2_p2, and so on). When join is executed without partition-wise join, prt2 is used to build hash table and prt is used to probe that hash table. prt2 has lesser number of rows than prt. But when partition-wise join is used, individual partitions are joined in reverse join order i.e. partitions of prt are used to build the hash table and partitions of prt2 are used to probe. This happens because the path for the other join order (partition of prt2 used to build the hash table and partition of prt used to probe) has huge cost compared to the first one (74459 and 313109) and a portion worth 259094 comes from lines 3226/7 of final_cost_hashjoin() 3215 /* 3216 * The number of tuple comparisons needed is the number of outer 3217 * tuples times the typical number of tuples in a hash bucket, which 3218 * is the inner relation size times its bucketsize fraction. At each 3219 * one, we need to evaluate the hashjoin quals. But actually, 3220 * charging the full qual eval cost at each tuple is pessimistic, 3221 * since we don't evaluate the quals unless the hash values match 3222 * exactly. For lack of a better idea, halve the cost estimate to 3223 * allow for that. 3224 */ 3225 startup_cost += hash_qual_cost.startup; 3226 run_cost += hash_qual_cost.per_tuple * outer_path_rows * 3227 clamp_row_est(inner_path_rows * innerbucketsize) * 0.5; That's because for some reason innerbucketsize for partition of prt is 22 times more than that for partition of prt2. Looks like we have some estimation error in estimating bucket sizes. If I force partitions to be joined with the same order as partitioned tables (without partition-wise join), child-joins execute faster and in turn partition-wise join performs better than the non-partition-wise join. So, this is clearly some estimation and costing problem with regular joins. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih wrote: >> >> On completing the benchmark for all queries for the above mentioned >> setup, following performance improvement can be seen, >> Query | Patch | Head >> 3 | 1455 | 1631 >> 4 | 499 | 4344 >> 5 | 1464 | 1606 >> 10 | 1475 | 1599 >> 12 | 1465 | 1790 >> >> Note that all values of execution time are in seconds. > > I compared this experiment with non-partitioned database and following > is the result, > Query | Non-partitioned head > 3 | 1752 > 4 | 315 > 5 | 2319 > 10 | 1535 > 12 | 1739 > > In summary, the query that appears slowest in partitioned database is > not so otherwise. It is good to see that in Q4 partition-wise join > helps in achieving performance closer to it's non-partitioned case, > otherwise partitioning alone causes it to suffer greatly. Apart from > Q4 it does not looks like partitioning hurts anywhere else, though the > maximum improvement is ~35% for Q5. > Another point to note here is that the performance on partitioned and > unpartitioned heads are quite close (except Q4) which is something > atleast I wasn't expecting. It looks like we need not to partition the > tables anyway, or atleast this set of queries doesn't benefit from > partitioning. Please let me know if somebody has better ideas on how > partitioning schemes should be applied to make it more beneficial for > these queries. Just partitioning is not expected to improve query performance (but we still see some performance improvement). Partitioning + partition-wise operations, pruning is expected to show performance gains. IIUC the results you reported, Q3 takes 1752 seconds with non-partitioned head, with partitioning it completes in 1631 seconds and with partition-wise join it completes in 1455, so net improvement because of partitioning is 300 seconds is almost 16% improvement, which is a lot for very large data. So, except Q4, every query improves when the tables are partitioned. Am I interpreting the results correctly? There may be some other way of partitioning, which may give better results, but I think what we have now shows the importance of partitioning in case of very large data e.g. scale 300 TPCH. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih wrote: >> > > Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual > time=1556989.996..1556989.997 rows=1 loops=1) >-> Finalize GroupAggregate (cost=83341943.28..83342723.24 > rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1 > loops=1) > Group Key: n1.n_name, n2.n_name, (date_part('year'::text, > (lineitem_001.l_shipdate)::timestamp without time zone)) > -> Sort (cost=83341943.28..83342043.92 rows=40256 width=92) > (actual time=1556989.910..1556989.911 rows=6 loops=1) >Sort Key: n1.n_name, n2.n_name, > (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without > time zone)) >Sort Method: quicksort Memory: 27kB >-> Gather (cost=83326804.81..83338864.31 rows=40256 > width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1) > Workers Planned: 4 > Workers Launched: 4 > > AFAICU the node above sort is group-aggregate and then there is limit, > and the number of rows for sort node in explain analyse is returned > number of rows. So, what is happening here is once one group is > completed it is aggregated and fetched by limit, now there is no need > for sort to return any more rows and hence the result. Thanks for your explanation. That makes sense. I forgot about LIMIT node on top. I debugged the plans today and performed some experiments. Here are my observations The join order with and without partition-wise join changes. Without partition-wise join it is (lineitem, (suppliers, nation1)), (orders, (customer, nation2)). The join (lineitem, (suppliers, nation1)) is executed by one gather node and (orders, (customer, nation2)) is executed by other. Thus the plan has two gather nodes, which feed to the topmost join. With partition-wise join the join order is ((lineitem, orders), (supplier, nation1)), (customer, nation2). The join (lineitem, orders) uses partition-wise join. This plan executes the whole join tree along with partial group aggregation under a gather merge. The rows estimated for various nodes under Gather/GatherMerge are different from the actual rows e.g. -> Hash Join (cost=113164.47..61031454.40 rows=10789501 width=46) (actual time=3379.931..731987.943 rows=8744357 loops=5) (in non-partition-wise join plan) OR -> Append (cost=179532.36..80681785.95 rows=134868761 width=24) (actual time=9437.573..1360219.567 rows=109372134 loops=5) (in partition-wise join plan). I first thought that this is a real estimation error and spent some time investigating the estimation error. But eventually realised that this is how a parallel query plan reports, when I saw that Gather node estimated correct number of rows even though the nodes under it showed this difference. Here's the explanation of this report. There are 4 parallel workers, so, the leaders contribution would be estimated to be 0 by get_parallel_divisor(). So these estimates are per worker and so the total estimated rows produced by any of the nodes is 4 times the reported. But when the query actually runs, the leader also participates, so number of loops = 5 and the actual rows reported are (total actual rows) / (number of loops i.e. number of backends that executed the query). The total estimates rows and total actual rows are roughly equal. So there's no real estimation error, as I thought earlier. May be we want to make EXPLAIN (ANALYZE) output easier to understand. When I tried the same query on laptop with scale 20, I found that the leader is really contributing as much as other workers. So, the partial paths were really created based on an estimate which was 20% off. The cost difference between partition-wise join plan and non-partition-wise join plan is hardly 1.5%. So, it's possible that if we correct this estimation error, partition-wise join plan won't be chosen because of it will have a higher cost. Remember there are two gather nodes in non-partition-wise join plan and partition-wise join plan has one gather. So, non-partition-wise join path gets the 20% decreased estimates twice and partition-wise join gets it only once. The explain (analyze, verbose) of a parallel node looks like -> Parallel Seq Scan on public.lineitem_002 (cost=0.00..168752.99 rows=573464 width=24) (actual time=1.395..3075.485 rows=454464 loops=5) Filter: ((lineitem_002.l_shipdate >= '1995-01-01'::date) AND (lineitem_002.l_shipdate <= '1996-12-31'::date)) Rows Removed by Filter: 1045065 Worker 0: actual time=3.358..3131.426 rows=458267 loops=1 Worker 1: actual time=0.860..3146.282 rows=447231 loops=1 Worker 2: actual time=1.317..3123.646 rows=489960 loops=1 Worker 3: actual time=0.927..
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat wrote: > About your earlier comment of making build_joinrel_partition_info() > simpler. Right now, the code assumes that partexprs or > nullable_partexpr can be NULL when either of them is not populated. > That may be saves a sizeof(pointer) * (number of keys) byes of memory. > Saving that much memory may not be worth the complexity of code. So, > we may always allocate memory for those arrays and fill it with NIL > values when there are no key expressions to populate those. That will > simplify the code. I haven't done that change in this patchset. I was > busy debugging the Q7 regression. Let me know your comments about > that. Hmm, I'm not sure that's the best approach, but let me look at it more carefully before I express a firm opinion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat wrote: > Done. Committed 0001 with extensive editorialization. I did not think it was a good idea to include a partition.h a file in src/include/nodes, so I worked around that. The include of pg_inherits_fn.h was unneeded. I rewrote a lot of the comments and made some other style tweaks. Don't look now, but I think it might be about time for the main act. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 20, 2017 at 3:13 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro > wrote: > > 2. What queries in the 0008 patch are hitting lines that 0007 doesn't > hit? > > > > I thought about how to answer questions like this and came up with a > > shell script that (1) makes computers run really hot for quite a long > > time and (2) tells you which blocks of SQL hit which lines of C. > > Please find attached the shell script and its output. The .sql files > > have been annotated with "block" numbers (blocks being chunks of SQL > > stuff separated by blank lines), and the C files annotated with > > references to those block numbers where A = block > > partition_join.sql and B = block in partition_join_extras.sql. > > > > Then to find lines that B queries hit but A queries don't and know > > which particular queries hit them, you might use something like: > > > > grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \ > > grep 'SQL blocks: .*B[0-9]' > > > > Thanks for this. It generates a lot of output (970 lines over all the > coverage files). It will take some time for getting anything > meaningful out of this. May be there's some faster way by looking at > the lines that are covered by B but not A. BTW, I checked those lines > to see if there could be any bug there. But I don't see what could go > wrong with those lines. > > I have also tried to find test cases in B which hits some extra line which is not hitting by A, with the help of results attached by Thomas in coverage.tarball_FILES. It took lot of time but I am able to find some test cases. which if adding in partition_join.sql increasing no of lines hit by 14. but for hitting these 14 extra line attached patch is doing 900+ line inserts in partition_join.sql and partition_join.out file. I have used gcov-lcov to find coverage for files changed by partition-wise-join patches with and without attached patch which is below. *with existing partition_join.sql* *partition_join.sql + some test cases of partition_join_extra.sql* *Modifed Files* *Line Coverage* *Functions* *Line Coverage* *Functions* src/backend/optimizer/geqo 79.4 % 269/339 96.6 % 28/29 79.4 % 269/339 96.6 % 28/29 src/backend/optimizer/path/allpaths.c 92.3 % 787 / 853 95.5 % 42 / 44 92.6 % 790 / 853 95.5 % 42 / 44 src/backend/optimizer/path/costsize.c 96.8 % 1415 / 1462 98.4 % 61 / 62 96.9 % 1416 / 1462 98.4 % 61 / 62 src/backend/optimizer/path/joinpath.c 95.5 % 404 / 423 100.0 % 16 / 16 95.5 % 404 / 423 100.0 % 16 / 16 src/backend/optimizer/path/joinrels.c 92.5 % 422 / 456 100.0 % 16 / 16 93.0 % 424 / 456 100.0 % 16 / 16 src/backend/optimizer/plan/createplan.c 90.9 % 1928 / 2122 96.3 % 103 / 107 91.0 % 1930 / 2122 96.3 % 103 / 107 src/backend/optimizer/plan/planner.c 94.9 % 1609 / 1696 97.6 % 41 / 42 94.9 % 1609 / 1696 97.6 % 41 / 42 src/backend/optimizer/plan/setrefs.c 91.3 % 806 / 883 94.3 % 33 / 35 91.3 % 806 / 883 94.3 % 33 / 35 src/backend/optimizer/prep/prepunion.c 95.5 % 661 / 692 100.0 % 25 / 25 95.5 % 661 / 692 100.0 % 25 / 25 src/backend/optimizer/util/pathnode.c 88.7 % 1144 / 1290 98.1 % 52 / 53 88.8 % 1146 / 1290 98.1 % 52 / 53 src/backend/optimizer/util/placeholder.c 96.5 % 139 / 144 100.0 % 10 / 10 96.5 % 139 / 144 100.0 % 10 / 10 src/backend/optimizer/util/plancat.c 89.0 % 540 / 607 94.7 % 18 / 19 89.6 % 544 / 607 94.7 % 18 / 19 src/backend/optimizer/util/relnode.c 95.3 % 548 / 575 100.0 % 24 / 24 95.3 % 548 / 575 100.0 % 24 / 24 src/backend/utils/misc/guc.c 67.4 % 1536 / 2278 89.7 % 113 / 126 67.4 % 1536 / 2278 89.7 % 113 / 126 Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out index 9fec170..ab411b6 100644 --- a/src/test/regress/expected/partition_join.out +++ b/src/test/regress/expected/partition_join.out @@ -584,6 +584,215 @@ SELECT t1.a, ss.t2a, ss.t2c FROM prt1 t1 LEFT JOIN LATERAL 550 | | (12 rows) +-- join with aggregate +EXPLAIN (VERBOSE, COSTS OFF) +select t1.a, count(t2.*) from prt1 t1 left join prt1 t2 on (t1.a = t2.a) where t1.a % 25 = 0 group by t1.a; + QUERY PLAN +--- + GroupAggregate + Output: t1.a, count(((t2.*)::prt1)) + Group Key: t1.a + -> Sort + Output: t1.a, ((t2.*)::prt1) + Sort Key: t1.a + -> Append + -> Hash Right Join + Output: t1.a, ((t2.*)::prt1) + Hash Cond: (t2.a = t1.a) + -> Seq Scan on public.prt1_p1 t2 + Output: t2.*, t2.a + -> Hash + Output: t1.a + -> Seq Scan on public.prt1_p1 t1 + Output: t1.a + Filter: ((t1.a % 25) = 0) + -> Hash
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas > wrote: > > On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat > > wrote: > >> partition pruning might need partexprs look up relevant quals, but > >> nullable_partexprs doesn't have any use there. So may be we should add > >> nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join > >> implementation) instead of 0001. What do you think? > > > > +1. > > Done. > > > > >>> - I'm not entirely sure whether maintaining partexprs and > >>> nullable_partexprs is the right design. If I understand correctly, > >>> whether or not a partexpr is nullable is really a per-RTI property, > >>> not a per-expression property. You could consider something like > >>> "Relids nullable_rels". > >> > >> That's true. However in order to decide whether an expression falls on > >> nullable side of a join, we will need to call pull_varnos() on it and > >> check the output against nullable_rels. Separating the expressions > >> themselves avoids that step. > > > > Good point. Also, I'm not sure about cases like this: > > > > SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE > > a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; > > > > Suppose the relations are all partitioned by (x, y) but that the = > > operator is not strict. A partition-wise join is valid between a and > > b, but we can't regard w as partitioned any more, because w.x might > > contain nulls in partitions where the partitioning scheme wouldn't > > allow them. On the other hand, if the subquery were to select a.x, > > a.y then clearly it would be fine: there would be no possibility of a > > NULL having been substituted for a proper value. > > > > What if the subquery selected a.x, b.y? Initially, I thought that > > would be OK too, because of the fact that the a.y = b.y clause is in > > the WHERE clause rather than the join condition. But on further > > thought I think that probably doesn't work, because with = being a > > non-strict operator there's no guarantee that it would remove any > > nulls introduced by the left join. Of course, if the subselect had a > > WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT > > list mention those columns would be fine. > > > > I am actually not sure whether we can use partition-wise join for a > LEFT JOIN b when the partition key equalities are spread across ON and > WHERE clauses. I am not able to find any example against it, but I am > not able to prove it as well. The reference I used for partition-wise > join [1], mentions JOIN conditions i.e. ON clause conditions. But all > the examples used in that paper are that of INNER join. So, I am not > sure what exactly the authors meant by JOIN conditions. Right now I am > restricting the patch to work with only conditions in the ON clause. > > Practically most of the operators are strict. OUTER join's WHERE > clause has any partition key equality with strict operator, optimizer > will turn > that OUTER join into an INNER one, turning all clauses into join > clauses. That will enable partition-wise join. So, the current > restriction doesn't restrict any practical cases. > > OTOH, I have seen that treating ON and WHERE clauses as same for an > OUTER join leads to surprising results. So, I am leaning to treat them > separate for partition-wise join as well and only use ON clause > conditions for partition-wise join. If we get complaints about > partition-wise join not being picked we will fix them after proving > that it's not harmful. Lifting that restriction is not so difficult. > have_partition_key_equijoin() ignores "pushed down" quals. We have to > just change that condition. > > Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL > is interesting. If those conditions are in ON clause, we may still > have a result where b.x and b.y as NULL when no row in "a" matches a > row in "b". If those conditions are in WHERE clause, I think optimizer > will turn the join into an INNER join irrespective of whether the > equality operator is strict. > > > > >> If partition-wise join is disabled, partition-wise aggregates, > >> strength reduction of MergeAppend won't be possible on a join tree, > >> but those will be possible on a base relation. Even if partition-wise > >> join enabled, one may want to disable other partition-wise > >> optimizations individually. So, they are somewhat independent > >> switches. I don't think we should bundle all of those into one. > >> Whatever names we choose for those GUCs, I think they should have same > >> naming convention e.g. "partition_wise_xyz". I am open to suggestions > >> about the names. > > > > I think the chances of you getting multiple GUCs for different > > partition-wise optimizations past Tom are pretty low. > > We do have enable_hashjoin and enable_hashagg to control use of > hashing for aggregate and join. On similar lines we
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat wrote: >> - I'm not entirely sure whether maintaining partexprs and nullable_partexprs is the right design. If I understand correctly, whether or not a partexpr is nullable is really a per-RTI property, not a per-expression property. You could consider something like "Relids nullable_rels". >>> >>> That's true. However in order to decide whether an expression falls on >>> nullable side of a join, we will need to call pull_varnos() on it and >>> check the output against nullable_rels. Separating the expressions >>> themselves avoids that step. >> >> Good point. Also, I'm not sure about cases like this: >> >> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE >> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; >> >> Suppose the relations are all partitioned by (x, y) but that the = >> operator is not strict. A partition-wise join is valid between a and >> b, but we can't regard w as partitioned any more, because w.x might >> contain nulls in partitions where the partitioning scheme wouldn't >> allow them. On the other hand, if the subquery were to select a.x, >> a.y then clearly it would be fine: there would be no possibility of a >> NULL having been substituted for a proper value. >> >> What if the subquery selected a.x, b.y? Initially, I thought that >> would be OK too, because of the fact that the a.y = b.y clause is in >> the WHERE clause rather than the join condition. But on further >> thought I think that probably doesn't work, because with = being a >> non-strict operator there's no guarantee that it would remove any >> nulls introduced by the left join. Of course, if the subselect had a >> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT >> list mention those columns would be fine. >> > In my previous reply to this, I probably didn't answer your question while I explained the restriction on where equality conditions on partition keys can appear. Here's answer to your questions assuming those restrictions don't exist. Actually in the example you have given, optimizer flattens w as a LJ b which kind of makes the explanations below a bit complicated. 1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b but not between w and c for the reasons you have explained above. 2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b and also between w and c for the reasons you have explained above. 3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b but not w and c as you have explained. In this case b.x and b.y will appear as nullable_partexprs in w (represented as a LJ b in optimizer) and a.x and a.y will appear in partexprs. Depending upon what gets projected out of w, the join between w and c will use corresponding keys for equality conditions. Since the operator is non-strict, any expression which is part of nullable_partexprs will be discarded in match_expr_to_partition_keys(). Hope that helps. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro wrote: > > The main areas of uncovered lines are: code in > get_wholerow_ref_from_convert_row_type() and code that calls it, and > per node type cases in reparameterize_path_by_child(). It seems like > the former could use a test case, and I wonder if there is some way we > could write "flat-copy and then apply recursively to all subpaths" > code like this without having to handle these cases explicitly. There > are a couple of other tiny return cases other than just sanity check > errors which it might be interesting to hit too. Under the debugger I checked that the test in partition_join.sql -- left outer join, with whole-row reference EXPLAIN (COSTS OFF) SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; covers get_wholerow_ref_from_convert_row_type(). But it doesn't cover a couple of lines in the case of nested ConvertRowTypeExpr in that function. We can add/modify a testcase in multi-level partitioned table section to cover that. reparameterize_path_by_child() coverage is hard. It would require that many different kinds of paths survive in lower joins in the join tree. It's hard to come up with queries that would do that with limited amount of data and a reasonable number of tests. Me and Thomas discussed about his suggestion about "flat-copy and then apply recursively to all subpaths" which he sees as a path tree mutator. It won't improve the test coverage. Like expression_tree_mutator() path mutation is not that widely used phenomenon, so we do not yet know what should be the characteristics of a path mutator could be. In case we see more of path mutation code in future, it's an idea worth considering. > > 2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit? > > I thought about how to answer questions like this and came up with a > shell script that (1) makes computers run really hot for quite a long > time and (2) tells you which blocks of SQL hit which lines of C. > Please find attached the shell script and its output. The .sql files > have been annotated with "block" numbers (blocks being chunks of SQL > stuff separated by blank lines), and the C files annotated with > references to those block numbers where A = block > partition_join.sql and B = block in partition_join_extras.sql. > > Then to find lines that B queries hit but A queries don't and know > which particular queries hit them, you might use something like: > > grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \ > grep 'SQL blocks: .*B[0-9]' > Thanks for this. It generates a lot of output (970 lines over all the coverage files). It will take some time for getting anything meaningful out of this. May be there's some faster way by looking at the lines that are covered by B but not A. BTW, I checked those lines to see if there could be any bug there. But I don't see what could go wrong with those lines. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 19, 2017 at 3:50 PM, Alvaro Herrera wrote: > Rafia Sabih wrote: > >> On completing the benchmark for all queries for the above mentioned >> setup, following performance improvement can be seen, >> Query | Patch | Head >> 3 | 1455 | 1631 >> 4 | 499 | 4344 >> 5 | 1464 | 1606 >> 10 | 1475 | 1599 >> 12 | 1465 | 1790 >> >> Note that all values of execution time are in seconds. >> To summarise, apart from Q4, all other queries are showing somewhat >> 10-20% improvement. > > Saving 90% of time on the slowest query looks like a worthy improvement > on its own right. However, you're reporting execution time only, right? > What happens to planning time? In a quick look, Definitely. The planning time issue has been discussed upthread, On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih wrote: > Another minor thing to note that is planning time is almost twice with > this patch, though I understand that this is for scenarios with really > big 'big data' so this may not be a serious issue in such cases, but > it'd be good if we can keep an eye on this that it doesn't exceed the > computational bounds for a really large number of tables.. To which Robert replied as, Yes, this is definitely going to use significant additional planning time and memory. There are several possible strategies for improving that situation, but I think we need to get the basics in place first. That's why the proposal is now to have this turned off by default. People joining really big tables that happen to be equipartitioned are likely to want to turn it on, though, even before those optimizations are done. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Rafia Sabih wrote: > On completing the benchmark for all queries for the above mentioned > setup, following performance improvement can be seen, > Query | Patch | Head > 3 | 1455 | 1631 > 4 | 499 | 4344 > 5 | 1464 | 1606 > 10 | 1475 | 1599 > 12 | 1465 | 1790 > > Note that all values of execution time are in seconds. > To summarise, apart from Q4, all other queries are showing somewhat > 10-20% improvement. Saving 90% of time on the slowest query looks like a worthy improvement on its own right. However, you're reporting execution time only, right? What happens to planning time? In a quick look, $ grep 'Planning time' pg_part_*/4* pg_part_head/4_1.out: Planning time: 3390.699 ms pg_part_head/4_2.out: Planning time: 194.211 ms pg_part_head/4_3.out: Planning time: 210.964 ms pg_part_head/4_4.out: Planning time: 4150.647 ms pg_part_patch/4_1.out: Planning time: 7577.247 ms pg_part_patch/4_2.out: Planning time: 312.421 ms pg_part_patch/4_3.out: Planning time: 304.697 ms pg_part_patch/4_4.out: Planning time: 269.778 ms I think the noise in these few results is too large to draw any conclusions. Maybe a few dozen runs of EXPLAIN (w/o ANALYZE) would tell something significant? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat wrote: > partition pruning might need partexprs look up relevant quals, but > nullable_partexprs doesn't have any use there. So may be we should add > nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join > implementation) instead of 0001. What do you think? +1. >> - I'm not entirely sure whether maintaining partexprs and >> nullable_partexprs is the right design. If I understand correctly, >> whether or not a partexpr is nullable is really a per-RTI property, >> not a per-expression property. You could consider something like >> "Relids nullable_rels". > > That's true. However in order to decide whether an expression falls on > nullable side of a join, we will need to call pull_varnos() on it and > check the output against nullable_rels. Separating the expressions > themselves avoids that step. Good point. Also, I'm not sure about cases like this: SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; Suppose the relations are all partitioned by (x, y) but that the = operator is not strict. A partition-wise join is valid between a and b, but we can't regard w as partitioned any more, because w.x might contain nulls in partitions where the partitioning scheme wouldn't allow them. On the other hand, if the subquery were to select a.x, a.y then clearly it would be fine: there would be no possibility of a NULL having been substituted for a proper value. What if the subquery selected a.x, b.y? Initially, I thought that would be OK too, because of the fact that the a.y = b.y clause is in the WHERE clause rather than the join condition. But on further thought I think that probably doesn't work, because with = being a non-strict operator there's no guarantee that it would remove any nulls introduced by the left join. Of course, if the subselect had a WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT list mention those columns would be fine. >> - The naming of enable_partition_wise_join might also need some >> thought. What happens when we also have partition-wise aggregate? >> What about the proposal to strength-reduce MergeAppend to Append -- >> would that use this infrastructure? I wonder if we out to call this >> enable_partition_wise or enable_partition_wise_planning to make it a >> bit more general. Then, too, I've never really liked having >> partition_wise in the GUC name because it might make someone think >> that it makes you partitions have a lot of wisdom. Removing the >> underscore might help: partitionwise. Or maybe there is some whole >> different name that would be better. If anyone wants to bikeshed, >> now's the time. > > partitions having a lot of wisdom would be wise_partitions rather than > partition_wise ;). Well, maybe it's the joins that have a lot of wisdom, then. enable_partition_wise_join could be read to mean that we should allow partitioning of joins, but only if those joins know the secret of true happiness. > If partition-wise join is disabled, partition-wise aggregates, > strength reduction of MergeAppend won't be possible on a join tree, > but those will be possible on a base relation. Even if partition-wise > join enabled, one may want to disable other partition-wise > optimizations individually. So, they are somewhat independent > switches. I don't think we should bundle all of those into one. > Whatever names we choose for those GUCs, I think they should have same > naming convention e.g. "partition_wise_xyz". I am open to suggestions > about the names. I think the chances of you getting multiple GUCs for different partition-wise optimizations past Tom are pretty low. >> - Instead of reorganizing add_paths_to_append_rel as you did, could >> you just add an RTE_JOIN case to the switch? Not sure if there's some >> problem with that idea, but it seems like it might come out nicer. > > RTE_JOIN is created only for joins specified using JOIN clause i.e > syntactic joins. The joins created during query planner like rel1, > rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there. OK, never mind that then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas wrote: > On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat > wrote: >> Thanks a lot Robert. >> >> Here are rebased patches. > > I didn't get quite as much time to look at these today as I would have > liked, but here's what I've got so far. > > Comments on 0001: > > - In the RelOptInfo, part_oids is defined in a completely different > part of the structure than nparts, but you can't use it without nparts > because you don't know how long it is. I suggest moving the > definition to just after nparts. > > - On the other hand, maybe we should just remove it completely. I > don't see it in any of the subsequent patches. If it's used by the > advanced matching code, let's leave it out of 0001 for now and add it > back after the basic feature is committed. No, it's not used by advanced partition matching code. It was used by to match OIDs with the child rels to order those in the array. But now that we are expanding in EIBO fashion, it is not useful. Should have removed it earlier. Removed now. > > - Similarly, partsupfunc isn't used by the later patches either. It > seems it could also be removed, at least for now. It's used by advanced partition matching code to compare bounds. It will be required by partition pruning patch. But removed for now. > > - The comment for partexprs isn't very clear about how the lists > inside the array work. My understanding is that the lists have as > many members as the partition key has columns/expressions. Actually we are doing some preparation for partition-wise join here. partexprs and nullable_partexprs are used in partition-wise join implementation patch. I have updated prologue of RelOptInfo structure with the comments like below * Note: A base relation will always have only one set of partition keys. But a * join relation has as many sets of partition keys as the number of joining * relations. The number of partition keys is given by * "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays * containing part_scheme->partnatts elements. Each element of the array * contains a list of partition key expressions. For a base relation each list * contains only one expression. For a join relation each list contains at * most as many expressions as the joining relations. The expressions in a list * at a given position in the array correspond to the partition key at that * position. "partexprs" contains partition keys of non-nullable joining * relations and "nullable_partexprs" contains partition keys of nullable * joining relations. For a base relation only "partexprs" is populated. Let me know this looks fine. The logic to match the partition keys of joining relations in have_partkey_equi_join() and match_expr_to_partition_keys() becomes simpler if we arrange the partition key expressions as array indexed by position of partition key and each array element as list of partition key expressions at that position. partition pruning might need partexprs look up relevant quals, but nullable_partexprs doesn't have any use there. So may be we should add nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join implementation) instead of 0001. What do you think? > > - I'm not entirely sure whether maintaining partexprs and > nullable_partexprs is the right design. If I understand correctly, > whether or not a partexpr is nullable is really a per-RTI property, > not a per-expression property. You could consider something like > "Relids nullable_rels". That's true. However in order to decide whether an expression falls on nullable side of a join, we will need to call pull_varnos() on it and check the output against nullable_rels. Separating the expressions themselves avoids that step. > > Comments on 0002: > > - The relationship between deciding to set partition scheme and > related details and the configured value of enable_partition_wise_join > needs some consideration. If the only use of the partition scheme is > partition-wise join, there's no point in setting it even for a baserel > unless enable_partition_wise_join is set -- but if there are other > important uses for that data, such as Amit's partition pruning work, > then we might want to always set it. And similarly for a join: if the > details are only needed in the partition-wise join case, then we only > need to set them in that case, but if there are other uses, then it's > different. If it turns out that setting these details for a baserel > is useful in other cases but that it's only for a joinrel in the > partition-wise join case, then the patch gets it exactly right. But > is that correct? I'm not sure. Partition scheme contains the information about data types of partition keys, which is required to compare partition bounds. Partition pruning will need to compare constants with partition bounds and hence will need information contained in partition scheme. So, we will need to set it for base relations whether or
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas wrote: > On the overall patch set: > > - I am curious to know how this has been tested. How much of the new > code is covered by the tests in 0007-Partition-wise-join-tests.patch? > How much does coverage improve with > 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What > code, if any, is not covered by either of those test suites? Could we > do meaningful testing of this with something like Andreas > Seltenreich's sqlsmith? FWIW I'm working on an answer to both of those question, but keep getting distracted by other things catching on fire... -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat wrote: > Thanks a lot Robert. > > Here are rebased patches. I didn't get quite as much time to look at these today as I would have liked, but here's what I've got so far. Comments on 0001: - In the RelOptInfo, part_oids is defined in a completely different part of the structure than nparts, but you can't use it without nparts because you don't know how long it is. I suggest moving the definition to just after nparts. - On the other hand, maybe we should just remove it completely. I don't see it in any of the subsequent patches. If it's used by the advanced matching code, let's leave it out of 0001 for now and add it back after the basic feature is committed. - Similarly, partsupfunc isn't used by the later patches either. It seems it could also be removed, at least for now. - The comment for partexprs isn't very clear about how the lists inside the array work. My understanding is that the lists have as many members as the partition key has columns/expressions. - I'm not entirely sure whether maintaining partexprs and nullable_partexprs is the right design. If I understand correctly, whether or not a partexpr is nullable is really a per-RTI property, not a per-expression property. You could consider something like "Relids nullable_rels". Comments on 0002: - The relationship between deciding to set partition scheme and related details and the configured value of enable_partition_wise_join needs some consideration. If the only use of the partition scheme is partition-wise join, there's no point in setting it even for a baserel unless enable_partition_wise_join is set -- but if there are other important uses for that data, such as Amit's partition pruning work, then we might want to always set it. And similarly for a join: if the details are only needed in the partition-wise join case, then we only need to set them in that case, but if there are other uses, then it's different. If it turns out that setting these details for a baserel is useful in other cases but that it's only for a joinrel in the partition-wise join case, then the patch gets it exactly right. But is that correct? I'm not sure. - The naming of enable_partition_wise_join might also need some thought. What happens when we also have partition-wise aggregate? What about the proposal to strength-reduce MergeAppend to Append -- would that use this infrastructure? I wonder if we out to call this enable_partition_wise or enable_partition_wise_planning to make it a bit more general. Then, too, I've never really liked having partition_wise in the GUC name because it might make someone think that it makes you partitions have a lot of wisdom. Removing the underscore might help: partitionwise. Or maybe there is some whole different name that would be better. If anyone wants to bikeshed, now's the time. - It seems to me that build_joinrel_partition_info() could be simplified a bit. One thing is that list_copy() is perfectly capable of handling a NIL input, so there's no need to test for that before calling it. Comments on 0003: - Instead of reorganizing add_paths_to_append_rel as you did, could you just add an RTE_JOIN case to the switch? Not sure if there's some problem with that idea, but it seems like it might come out nicer. On the overall patch set: - I am curious to know how this has been tested. How much of the new code is covered by the tests in 0007-Partition-wise-join-tests.patch? How much does coverage improve with 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What code, if any, is not covered by either of those test suites? Could we do meaningful testing of this with something like Andreas Seltenreich's sqlsmith? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih wrote: > On TPC-H benchmarking of this patch, I found a regression in Q7. It > was taking some 1500s with the patch and some 900s without the patch. > Please find the attached pwd_reg.zip for the output of explain analyse > on head and with patch. > > The experimental settings used were, > commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 > patch-version = v26 > > Server settings: > work_mem = 1GB > shared_buffers = 10GB > effective_cache_size = 10GB > max_parallel_workers_per_gather = 4 > > Partitioning information: > Partitioning scheme = by range > Number of partitions in lineitem and orders table = 106 > partition key for lineitem = l_orderkey > partition key for orders = o_orderkey I observe that with partition-wise join patch the planner is using GatherMerge along-with partition-wise join and on head its not using GatherMerge. Just to make sure that its partition-wise join which is causing regression and not GatherMerge, can you please run the query with enable_gathermerge = false? I see following lines explain analyze output 7_1.out without the patch -> Sort (cost=84634030.40..84638520.55 rows=1796063 width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1) Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone)) Sort Method: quicksort Memory: 308912kB -> Hash Join (cost=16080591.94..84447451.72 rows=1796063 width=72) (actual time=252745.701..1057447.219 rows=1749956 loops=1) Since Sort doesn't filter any rows, we would expect it to output the same number of rows as hash join underneath it. But the number of rows differ in this case. I am wondering whether there's some problem with the explain analyze output itself. > > Apart from these there is a regression case on a custom table, on head > query completes in 20s and with this patch it takes 27s. Please find > the attached .out and .sql file for the output and schema for the test > case respectively. I have reported this case before (sometime around > March this year) as well, but I am not sure if it was overlooked or is > an unimportant and expected behaviour for some reason. > Are you talking about [1]? I have explained about the regression in [2] and [3]. This looks like an issue with the existing costing model. [1] https://www.postgresql.org/message-id/caogqiimwcjnrunj_fcdbscrtlej-clp7exfzzipe2ut71n4...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eyfb7ieg5d0d1p...@mail.gmail.com [3] https://www.postgresql.org/message-id/cafjfprejksdcfaeuzjgd79hoetzpz5bkdxljgxr7qznrxx+...@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/15 4:43, Robert Haas wrote: > On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat > wrote: >> I have few changes to multi-level expansion patch as per discussion in >> earlier mails > > OK, I have committed > 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic > changes. > > Phew, getting that sorted out has been an astonishing amount of work. Yeah, thanks to both of you. Now on to other complicated stuff. :) Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat wrote: > I have few changes to multi-level expansion patch as per discussion in > earlier mails OK, I have committed 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic changes. Phew, getting that sorted out has been an astonishing amount of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 10:57 PM, Amit Langote wrote: > I very much like pcinfo-for-subquery.patch, although I'm not sure if we > need to create PartitionedChildRelInfo for the sub-query parent RTE as the > patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL > subqueries are flattened way before we get to add_paths_to_append_rel(); > if it could not be flattened, there wouldn't be a call to > add_paths_to_append_rel() in the first place, because no AppendRelInfos > would be generated. See what happens when is_simple_union_all_recurse() > returns false to flatten_simple_union_all() -- no AppendRelInfos will be > generated and added to root->append_rel_list in that case. > > IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries > like we're setting out to build for multi-level partitioned tables. > > So, as things stand today, there can at most be one recursive call of > add_path_to_append_rel() for a sub-query parent RTE, that is, if its child > sub-queries contain partitioned tables, but not more. The other patch > (multi-level expansion of partitioned tables) will change that, but even > then we won't need sub-query's own PartitioendChildRelInfo. OK, let's assume you're correct unless some contrary evidence emerges. Committed without that part; thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/14 7:43, Robert Haas wrote: > On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat > wrote: >> I debugged what happens in case of query "select 1 from t1 union all >> select 2 from t1;" with the current HEAD (without multi-level >> expansion patch attached). It doesn't set partitioned_rels in Append >> path that gets converted into Append plan. Remember t1 is a >> multi-level partitioned table here with t1p1 as its immediate >> partition and t1p1p1 as partition of t1p1. So, the >> set_append_rel_pathlist() recurses once as shown in the following >> stack trace. > > Nice debugging. +1. > I spent some time today looking at this and I think > it's a bug in v10, and specifically in add_paths_to_append_rel(), > which only sets partitioned_rels correctly when the appendrel is a > partitioned rel, and not when it's a subquery RTE with one or more > partitioned queries beneath it. > > Attached are two patches either one of which will fix it. First, I > wrote mechanical-partrels-fix.patch, which just mechanically > propagates partitioned_rels lists from accumulated subpaths into the > list used to construct the parent (Merge)AppendPath. I wasn't entire > happy with that, because it ends up building multiple partitioned_rels > lists for the same RelOptInfo. That seems silly, but there's no > principled way to avoid it; avoiding it amounts to hoping that all the > paths for the same relation carry the same partitioned_rels list, > which is uncomfortable. > > So then I wrote pcinfo-for-subquery.patch. That patch notices when an > RTE_SUBQUERY appendrel is processed and accumulates the > partitioned_rels of its immediate children; in case there can be > multiple nested levels of subqueries before we get down to the actual > partitioned rel, it also adds a PartitionedChildRelInfo for the > subquery RTE, so that there's no need to walk the whole tree to build > the partitioned_rels list at higher levels, just the immediate > children. I find this fix a lot more satisfying. It adds less code > and does no extra work in the common case. I very much like pcinfo-for-subquery.patch, although I'm not sure if we need to create PartitionedChildRelInfo for the sub-query parent RTE as the patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL subqueries are flattened way before we get to add_paths_to_append_rel(); if it could not be flattened, there wouldn't be a call to add_paths_to_append_rel() in the first place, because no AppendRelInfos would be generated. See what happens when is_simple_union_all_recurse() returns false to flatten_simple_union_all() -- no AppendRelInfos will be generated and added to root->append_rel_list in that case. IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries like we're setting out to build for multi-level partitioned tables. So, as things stand today, there can at most be one recursive call of add_path_to_append_rel() for a sub-query parent RTE, that is, if its child sub-queries contain partitioned tables, but not more. The other patch (multi-level expansion of partitioned tables) will change that, but even then we won't need sub-query's own PartitioendChildRelInfo. > Notice that the choice of fix we adopt has consequences for your > 0001-Multi-level-partitioned-table-expansion.patch -- with > mechanical-partrels-fix.patch, that patch could either associated all > partitioned_rels with the top-parent or it could work level by level > and everything would get properly assembled later. But with > pcinfo-for-subquery.patch, we need everything associated with the > top-parent. That doesn't seem like a problem to me, but it's > something to note. I think it's fine. With 0001-Multi-level-partitioned-table-expansion.patch, get_partitioned_child_rels() will get called even for non-root partitioned tables, for which it won't find a valid pcinfo. I think that patch must also change its callers to stop Asserting that a valid pcinfo is returned. Spotted a typo in pcinfo-for-subquery.patch: + * A plain relation will alread have Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat wrote: > I debugged what happens in case of query "select 1 from t1 union all > select 2 from t1;" with the current HEAD (without multi-level > expansion patch attached). It doesn't set partitioned_rels in Append > path that gets converted into Append plan. Remember t1 is a > multi-level partitioned table here with t1p1 as its immediate > partition and t1p1p1 as partition of t1p1. So, the > set_append_rel_pathlist() recurses once as shown in the following > stack trace. Nice debugging. I spent some time today looking at this and I think it's a bug in v10, and specifically in add_paths_to_append_rel(), which only sets partitioned_rels correctly when the appendrel is a partitioned rel, and not when it's a subquery RTE with one or more partitioned queries beneath it. Attached are two patches either one of which will fix it. First, I wrote mechanical-partrels-fix.patch, which just mechanically propagates partitioned_rels lists from accumulated subpaths into the list used to construct the parent (Merge)AppendPath. I wasn't entire happy with that, because it ends up building multiple partitioned_rels lists for the same RelOptInfo. That seems silly, but there's no principled way to avoid it; avoiding it amounts to hoping that all the paths for the same relation carry the same partitioned_rels list, which is uncomfortable. So then I wrote pcinfo-for-subquery.patch. That patch notices when an RTE_SUBQUERY appendrel is processed and accumulates the partitioned_rels of its immediate children; in case there can be multiple nested levels of subqueries before we get down to the actual partitioned rel, it also adds a PartitionedChildRelInfo for the subquery RTE, so that there's no need to walk the whole tree to build the partitioned_rels list at higher levels, just the immediate children. I find this fix a lot more satisfying. It adds less code and does no extra work in the common case. Notice that the choice of fix we adopt has consequences for your 0001-Multi-level-partitioned-table-expansion.patch -- with mechanical-partrels-fix.patch, that patch could either associated all partitioned_rels with the top-parent or it could work level by level and everything would get properly assembled later. But with pcinfo-for-subquery.patch, we need everything associated with the top-parent. That doesn't seem like a problem to me, but it's something to note. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company mechanical-partrels-fix.patch Description: Binary data pcinfo-for-subquery.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas wrote: >> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote >> wrote: >>> In this case, AcquireExecutorLocks will lock all the relations in >>> PlannedStmt.rtable, which must include all partitioned tables of all >>> partition trees involved in the query. Of those, it will lock the tables >>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >>> list of all partitioned table RT indexes obtained by concatenating >>> partitioned_rels lists of all ModifyTable nodes involved in the query >>> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >>> because we need to take the stronger lock on a given table before any >>> weaker one if it happens to appear in the query as a non-result relation >>> too, to avoid lock strength upgrade deadlock hazard. >> >> Hmm. The problem with this theory in my view is that it doesn't >> explain why InitPlan() and ExecOpenScanRelation() lock the relations >> instead of just assuming that they are already locked either by >> AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() >> doesn't really need to take locks, then ExecOpenScanRelation() must >> not need to do it either. We invented ExecLockNonLeafAppendTables() >> on the occasion of removing the scans of those tables which would >> previously have caused ExecOpenScanRelation() to be invoked, so as to >> keep the locking behavior unchanged. >> >> AcquireExecutorLocks() looks like an odd bit of code to me. The >> executor itself locks result tables in InitPlan() and then everything >> else during InitPlan() and all of the others later on while walking >> the plan tree -- comments in InitPlan() say that this is to avoid a >> lock upgrade hazard if a result rel is also a source rel. But >> AcquireExecutorLocks() has no such provision; it just locks everything >> in RTE order. In theory, that's a deadlock hazard of another kind, as >> we just talked about in the context of EIBO. In fact, expanding in >> bound order has made the situation worse: before, expansion order and >> locking order were the same, so maybe having AcquireExecutorLocks() >> work in RTE order coincidentally happened to give the same result as >> the executor code itself as long as there are no result relations. >> But this is certainly not true any more. I'm not sure it's worth >> expending a lot of time on this -- it's evidently not a problem in >> practice, or somebody probably would've complained before now. >> >> But that having been said, I don't think we should assume that all the >> locks taken from the executor are worthless because plancache.c will >> always do the job for us. I don't know of a case where we execute a >> saved plan without going through the plan cache, but that doesn't mean >> that there isn't one or that there couldn't be one in the future. >> It's not the job of these partitioning patches to whack around the way >> we do locking in general -- they should preserve the existing behavior >> as much as possible. If we want to get rid of the locking in the >> executor altogether, that's a separate discussion where, I have a >> feeling, there will prove to be better reasons for the way things are >> than we are right now supposing. >> > > I agree that it's not the job of these patches to change the locking > or even get rid of partitioned_rels. In order to continue returning > partitioned_rels in Append paths esp. in the case of queries involving > set operations and partitioned table e.g "select 1 from t1 union all > select 2 from t1;" in which t1 is multi-level partitioned table, we > need a fix in add_paths_to_append_rels(). The fix provided in [1] is > correct but we will need a longer explanation of why we have to > involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation > is complicated. If we get rid of partitioned_rels, we don't need to > fix that code in add_paths_to_append_rel(). > > I suggested that [2] > -- (excerpt from [2]) > > Actually, the original problem that caused this discussion started > with an assertion failure in get_partitioned_child_rels() as > Assert(list_length(result) >= 1); > > This assertion fails if result is NIL when an intermediate partitioned > table is passed. May be we should assert (result == NIL || > list_length(result) == 1) and allow that function to be called even > for intermediate partitioned partitions for which the function will > return NIL. That will leave the code in add_paths_to_append_rel() > simple. Thoughts? > -- > > Amit Langote agrees with this. It kind of makes the assertion lame but > keeps the code sane. What do you think? I debugged what happens in case of query "select 1 from t1 union all select 2 from t1;" with the current HEAD (without multi-level expansion patch attached). It doesn't set partitioned_rels in Append path th
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 13 September 2017 at 13:05, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar > wrote: >> Hi, >> >> Rafia had done some testing on TPCH queries using Partition-wise join >> patch along with Parallel Append patch. >> >> There, we had observed that for query 4, even though the partition >> wise joins are under a Parallel Append, the join are all non-partial. >> >> Specifically, the partition-wise join has non-partial nested loop >> joins when actually it was expected to have partial nested loop joins. >> (The difference can be seen by the observation that the outer relation >> of that join is scanned by non-parallel Bitmap Heap scan when it >> should have used Parallel Bitmap Heap Scan). >> >> Here is the detailed analysis , including where I think is the issue : >> >> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com >> >> All the TPCH results are posted in the same above mail thread. > > Can you please check if the attached patch fixes the issue. Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins are generated now. If I see any unexpected differences in the estimated or actual costs, I will report that in the Parallel Append thread. As far as Partition-wise join is concerned, this issue is solved, because Partial nested loop join does get created. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/13 16:21, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas wrote: >> locks taken from the executor are worthless because plancache.c will >> always do the job for us. I don't know of a case where we execute a >> saved plan without going through the plan cache, but that doesn't mean >> that there isn't one or that there couldn't be one in the future. >> It's not the job of these partitioning patches to whack around the way >> we do locking in general -- they should preserve the existing behavior >> as much as possible. If we want to get rid of the locking in the >> executor altogether, that's a separate discussion where, I have a >> feeling, there will prove to be better reasons for the way things are >> than we are right now supposing. >> > > I agree that it's not the job of these patches to change the locking > or even get rid of partitioned_rels. In order to continue returning > partitioned_rels in Append paths esp. in the case of queries involving > set operations and partitioned table e.g "select 1 from t1 union all > select 2 from t1;" in which t1 is multi-level partitioned table, we > need a fix in add_paths_to_append_rels(). The fix provided in [1] is > correct but we will need a longer explanation of why we have to > involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation > is complicated. If we get rid of partitioned_rels, we don't need to > fix that code in add_paths_to_append_rel(). Yeah, let's get on with setting partitioned_rels in AppendPath correctly in this patch. Ashutosh's suggested approach seems fine, although it needlessly requires to scan root->pcinfo_list. But it shouldn't be longer than the number of partitioned tables in the query, so maybe that's fine too. At least, it doesn't require us to add code to add_paths_to_append_rel() that can be pretty hard to wrap one's head around. That said, we might someday need to look carefully at some things that Robert mentioned carefully, especially around the order of locks taken by AcquireExecutorLocks() in light of the EIBO patch getting committed. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar wrote: > Hi, > > Rafia had done some testing on TPCH queries using Partition-wise join > patch along with Parallel Append patch. > > There, we had observed that for query 4, even though the partition > wise joins are under a Parallel Append, the join are all non-partial. > > Specifically, the partition-wise join has non-partial nested loop > joins when actually it was expected to have partial nested loop joins. > (The difference can be seen by the observation that the outer relation > of that join is scanned by non-parallel Bitmap Heap scan when it > should have used Parallel Bitmap Heap Scan). > > Here is the detailed analysis , including where I think is the issue : > > https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com > > All the TPCH results are posted in the same above mail thread. Can you please check if the attached patch fixes the issue. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company commit 203b3083318e9da41ad614a2ccec532025877c3b Author: Ashutosh Bapat Date: Tue Sep 12 17:41:54 2017 +0530 Reparamterize partial nestloop paths. We do not create partial nested looop paths if the inner path's parameterization is not fully covered by the outer relation. For partition-wise join, the test fails since the inner path is parameterized by the parent of the outer relation. Fix the test to check the parent relids instead of the child relids and also reparameterize the inner path to be parameterized by the outer child similar to try_nestloop_path(). TODO: squash this patch with the reparameterization patch. Ashutosh Bapat, per report from Rafia and analysis by Amit Khandekar diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index 91f0b1c..c8da19c 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -496,8 +496,20 @@ try_partial_nestloop_path(PlannerInfo *root, if (inner_path->param_info != NULL) { Relids inner_paramrels = inner_path->param_info->ppi_req_outer; + RelOptInfo *outerrel = outer_path->parent; + Relids outerrelids; - if (!bms_is_subset(inner_paramrels, outer_path->parent->relids)) + /* + * Paths are parameterized by top-level parent(s). Any paths parameterized + * by the child relations, are not added to the pathlist. Hence run + * parameterization tests on the parent relids. + */ + if (outerrel->top_parent_relids) + outerrelids = outerrel->top_parent_relids; + else + outerrelids = outerrel->relids; + + if (!bms_is_subset(inner_paramrels, outerrelids)) return; } @@ -510,6 +522,32 @@ try_partial_nestloop_path(PlannerInfo *root, if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys)) return; + /* + * Since result produced by a child is part of the result produced by + * its topmost parent and has same properties, the parameters + * representing that parent may be substituted by values from a child. + * Hence expressions and hence paths using those expressions, + * parameterized by a parent can be said to be parameterized by any of + * its child. For a join between child relations, if the inner path + * is parameterized by the parent of the outer relation, translate + * the inner path to be parameterized by the outer child relation and + * create a nestloop join path. The translated path should have the + * same costs as the original path, so cost check above should still + * hold. + */ + if (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)) + { + inner_path = reparameterize_path_by_child(root, inner_path, + outer_path->parent); + + /* + * If we could not translate the path, we can't create nest loop + * path. + */ + if (!inner_path) + return; + } + /* Might be good enough to be worth trying, so let's try it. */ add_partial_path(joinrel, (Path *) create_nestloop_path(root, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote wrote: > On 2017/09/12 19:56, Ashutosh Bapat wrote: >> I think the code here expects the original parent_rte and not the one >> we set around line 1169. >> >> This isn't a bug right now, since both the parent_rte s have same >> content. But I am not sure if that will remain to be so. Here's patch >> to fix the thinko. > > Instead of the new bool is_parent_partitioned, why not move the code to > set partitioned_rels to the block where you're now setting > is_parent_partitioned. > > Also, since we know this isn't a bug at the moment but will turn into one > once we have step-wise expansion, why not include this fix in that patch > itself? It won't turn into a bug with step-wise expansion since every parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top parent, which is used to extract the partitioned_rels. But I guess, it's better to fix the thinko in step-wise expansion since parent_rte itself changes. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas wrote: > On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote > wrote: >> In this case, AcquireExecutorLocks will lock all the relations in >> PlannedStmt.rtable, which must include all partitioned tables of all >> partition trees involved in the query. Of those, it will lock the tables >> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >> list of all partitioned table RT indexes obtained by concatenating >> partitioned_rels lists of all ModifyTable nodes involved in the query >> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >> because we need to take the stronger lock on a given table before any >> weaker one if it happens to appear in the query as a non-result relation >> too, to avoid lock strength upgrade deadlock hazard. > > Hmm. The problem with this theory in my view is that it doesn't > explain why InitPlan() and ExecOpenScanRelation() lock the relations > instead of just assuming that they are already locked either by > AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() > doesn't really need to take locks, then ExecOpenScanRelation() must > not need to do it either. We invented ExecLockNonLeafAppendTables() > on the occasion of removing the scans of those tables which would > previously have caused ExecOpenScanRelation() to be invoked, so as to > keep the locking behavior unchanged. > > AcquireExecutorLocks() looks like an odd bit of code to me. The > executor itself locks result tables in InitPlan() and then everything > else during InitPlan() and all of the others later on while walking > the plan tree -- comments in InitPlan() say that this is to avoid a > lock upgrade hazard if a result rel is also a source rel. But > AcquireExecutorLocks() has no such provision; it just locks everything > in RTE order. In theory, that's a deadlock hazard of another kind, as > we just talked about in the context of EIBO. In fact, expanding in > bound order has made the situation worse: before, expansion order and > locking order were the same, so maybe having AcquireExecutorLocks() > work in RTE order coincidentally happened to give the same result as > the executor code itself as long as there are no result relations. > But this is certainly not true any more. I'm not sure it's worth > expending a lot of time on this -- it's evidently not a problem in > practice, or somebody probably would've complained before now. > > But that having been said, I don't think we should assume that all the > locks taken from the executor are worthless because plancache.c will > always do the job for us. I don't know of a case where we execute a > saved plan without going through the plan cache, but that doesn't mean > that there isn't one or that there couldn't be one in the future. > It's not the job of these partitioning patches to whack around the way > we do locking in general -- they should preserve the existing behavior > as much as possible. If we want to get rid of the locking in the > executor altogether, that's a separate discussion where, I have a > feeling, there will prove to be better reasons for the way things are > than we are right now supposing. > I agree that it's not the job of these patches to change the locking or even get rid of partitioned_rels. In order to continue returning partitioned_rels in Append paths esp. in the case of queries involving set operations and partitioned table e.g "select 1 from t1 union all select 2 from t1;" in which t1 is multi-level partitioned table, we need a fix in add_paths_to_append_rels(). The fix provided in [1] is correct but we will need a longer explanation of why we have to involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation is complicated. If we get rid of partitioned_rels, we don't need to fix that code in add_paths_to_append_rel(). I suggested that [2] -- (excerpt from [2]) Actually, the original problem that caused this discussion started with an assertion failure in get_partitioned_child_rels() as Assert(list_length(result) >= 1); This assertion fails if result is NIL when an intermediate partitioned table is passed. May be we should assert (result == NIL || list_length(result) == 1) and allow that function to be called even for intermediate partitioned partitions for which the function will return NIL. That will leave the code in add_paths_to_append_rel() simple. Thoughts? -- Amit Langote agrees with this. It kind of makes the assertion lame but keeps the code sane. What do you think? [1] https://www.postgresql.org/message-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d...@lab.ntt.co.jp [2] https://www.postgresql.org/message-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5p6yjz_c6a6hdrdqqtgx...@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Hi, Rafia had done some testing on TPCH queries using Partition-wise join patch along with Parallel Append patch. There, we had observed that for query 4, even though the partition wise joins are under a Parallel Append, the join are all non-partial. Specifically, the partition-wise join has non-partial nested loop joins when actually it was expected to have partial nested loop joins. (The difference can be seen by the observation that the outer relation of that join is scanned by non-parallel Bitmap Heap scan when it should have used Parallel Bitmap Heap Scan). Here is the detailed analysis , including where I think is the issue : https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com All the TPCH results are posted in the same above mail thread. Thanks -Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 19:56, Ashutosh Bapat wrote: > I think the code here expects the original parent_rte and not the one > we set around line 1169. > > This isn't a bug right now, since both the parent_rte s have same > content. But I am not sure if that will remain to be so. Here's patch > to fix the thinko. Instead of the new bool is_parent_partitioned, why not move the code to set partitioned_rels to the block where you're now setting is_parent_partitioned. Also, since we know this isn't a bug at the moment but will turn into one once we have step-wise expansion, why not include this fix in that patch itself? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote wrote: > In this case, AcquireExecutorLocks will lock all the relations in > PlannedStmt.rtable, which must include all partitioned tables of all > partition trees involved in the query. Of those, it will lock the tables > whose RT indexes appear in PlannedStmt.nonleafResultRelations with > RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global > list of all partitioned table RT indexes obtained by concatenating > partitioned_rels lists of all ModifyTable nodes involved in the query > (set_plan_refs does that). We need to distinguish nonleafResultRelations, > because we need to take the stronger lock on a given table before any > weaker one if it happens to appear in the query as a non-result relation > too, to avoid lock strength upgrade deadlock hazard. Hmm. The problem with this theory in my view is that it doesn't explain why InitPlan() and ExecOpenScanRelation() lock the relations instead of just assuming that they are already locked either by AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() doesn't really need to take locks, then ExecOpenScanRelation() must not need to do it either. We invented ExecLockNonLeafAppendTables() on the occasion of removing the scans of those tables which would previously have caused ExecOpenScanRelation() to be invoked, so as to keep the locking behavior unchanged. AcquireExecutorLocks() looks like an odd bit of code to me. The executor itself locks result tables in InitPlan() and then everything else during InitPlan() and all of the others later on while walking the plan tree -- comments in InitPlan() say that this is to avoid a lock upgrade hazard if a result rel is also a source rel. But AcquireExecutorLocks() has no such provision; it just locks everything in RTE order. In theory, that's a deadlock hazard of another kind, as we just talked about in the context of EIBO. In fact, expanding in bound order has made the situation worse: before, expansion order and locking order were the same, so maybe having AcquireExecutorLocks() work in RTE order coincidentally happened to give the same result as the executor code itself as long as there are no result relations. But this is certainly not true any more. I'm not sure it's worth expending a lot of time on this -- it's evidently not a problem in practice, or somebody probably would've complained before now. But that having been said, I don't think we should assume that all the locks taken from the executor are worthless because plancache.c will always do the job for us. I don't know of a case where we execute a saved plan without going through the plan cache, but that doesn't mean that there isn't one or that there couldn't be one in the future. It's not the job of these partitioning patches to whack around the way we do locking in general -- they should preserve the existing behavior as much as possible. If we want to get rid of the locking in the executor altogether, that's a separate discussion where, I have a feeling, there will prove to be better reasons for the way things are than we are right now supposing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote wrote: > On 2017/09/12 17:53, Ashutosh Bapat wrote: >> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: >>> So, we can remove partitioned_rels from (Merge)AppendPath and >>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). >> >> Don't we need partitioned_rels from Append paths to be transferred to >> ModifyTable node or we have a different way of calculating >> nonleafResultRelations? > > No, we don't transfer partitioned_rels from Append path to ModifyTable > node. inheritance_planner(), that builds the ModifyTable path for > UPDATE/DELETE on a partitioned table, fetches partitioned_rels from > root->pcinfo_list itself and passes it to create_modifytable_path. No > Append path is involved in that case. PlannedStmt.nonleafResultRelations > is built by concatenating the partitioned_rels lists of all ModifyTable > nodes appearing in the query. It does not depend on Append's or > AppendPath's partitioned_rels. Ok. Thanks for the explanation. This make me examine inheritance_planner() closely and I think I have spotted a thinko there. In inheritance_planner() parent_rte is set to the RTE of parent to start with and then in the loop 1132 /* 1133 * And now we can get on with generating a plan for each child table. 1134 */ 1135 foreach(lc, root->append_rel_list) 1136 { ... code clipped 1165 /* 1166 * If there are securityQuals attached to the parent, move them to the 1167 * child rel (they've already been transformed properly for that). 1168 */ 1169 parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable); 1170 child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable); 1171 child_rte->securityQuals = parent_rte->securityQuals; 1172 parent_rte->securityQuals = NIL; we set parent_rte to the one obtained from subroot->parse, which happens to be the same (at least in contents) as original parent_rte. Later we use this parent_rte to pull partitioned_rels outside that loop 1371 if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) 1372 { 1373 partitioned_rels = get_partitioned_child_rels(root, parentRTindex); 1374 /* The root partitioned table is included as a child rel */ 1375 Assert(list_length(partitioned_rels) >= 1); 1376 } I think the code here expects the original parent_rte and not the one we set around line 1169. This isn't a bug right now, since both the parent_rte s have same content. But I am not sure if that will remain to be so. Here's patch to fix the thinko. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company inh_planner_prte.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 18:49, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote > wrote: >> >> That said, I noticed that we might need to be careful about what the value >> of the root parent's PlanRowMark's allMarkType field gets set to. We need >> to make sure that it reflects markType of all partitions in the tree, >> including those that are not root parent's direct children. Is that true >> with the proposed implementation? > > Yes. We include child's allMarkTypes into parent's allMarkTypes. So, > top parent's PlanRowMarks should have all descendant's allMarkTypes, > which is not happening in the patch right now. There are two ways to > fix that. > > 1. Pass top parent's PlanRowMark all the way down to the leaf > partitions, so that current expand_single_inheritance_child() collects > allMarkTypes of all children correctly. But this way, PlanRowMarks of > intermediate parent does not reflect allMarkTypes of its children, > only top root records that. > 2. Pass immediate parent's PlanRowMark to > expand_single_inheritance_child(), so that it records allMarkTypes of > its children. In expand_partitioned_rtentry() have following sequence > > expand_single_inheritance_child(root, parentrte, parentRTindex, > parentrel, parentrc, childrel, > appinfos, &childrte, &childRTindex, > &childrc); > > /* If this child is itself partitioned, recurse */ > if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) >{ > expand_partitioned_rtentry(root, childrte, childRTindex, >childrel, childrc, lockmode, appinfos, >partitioned_child_rels); > > /* Include child's rowmark type in parent's allMarkTypes */ > parentrc->allMarkTypes |= childrc->allMarkTypes; >} > so that we push allMarkTypes up the hierarchy. > > I like the second way, since every intermediate parent records > allMarkTypes of its descendants. I like the second way, too. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote wrote: > > That said, I noticed that we might need to be careful about what the value > of the root parent's PlanRowMark's allMarkType field gets set to. We need > to make sure that it reflects markType of all partitions in the tree, > including those that are not root parent's direct children. Is that true > with the proposed implementation? Yes. We include child's allMarkTypes into parent's allMarkTypes. So, top parent's PlanRowMarks should have all descendant's allMarkTypes, which is not happening in the patch right now. There are two ways to fix that. 1. Pass top parent's PlanRowMark all the way down to the leaf partitions, so that current expand_single_inheritance_child() collects allMarkTypes of all children correctly. But this way, PlanRowMarks of intermediate parent does not reflect allMarkTypes of its children, only top root records that. 2. Pass immediate parent's PlanRowMark to expand_single_inheritance_child(), so that it records allMarkTypes of its children. In expand_partitioned_rtentry() have following sequence expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel, parentrc, childrel, appinfos, &childrte, &childRTindex, &childrc); /* If this child is itself partitioned, recurse */ if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { expand_partitioned_rtentry(root, childrte, childRTindex, childrel, childrc, lockmode, appinfos, partitioned_child_rels); /* Include child's rowmark type in parent's allMarkTypes */ parentrc->allMarkTypes |= childrc->allMarkTypes; } so that we push allMarkTypes up the hierarchy. I like the second way, since every intermediate parent records allMarkTypes of its descendants. Thoughts? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 17:53, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: >> So, we can remove partitioned_rels from (Merge)AppendPath and >> (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). > > Don't we need partitioned_rels from Append paths to be transferred to > ModifyTable node or we have a different way of calculating > nonleafResultRelations? No, we don't transfer partitioned_rels from Append path to ModifyTable node. inheritance_planner(), that builds the ModifyTable path for UPDATE/DELETE on a partitioned table, fetches partitioned_rels from root->pcinfo_list itself and passes it to create_modifytable_path. No Append path is involved in that case. PlannedStmt.nonleafResultRelations is built by concatenating the partitioned_rels lists of all ModifyTable nodes appearing in the query. It does not depend on Append's or AppendPath's partitioned_rels. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: > On 2017/09/12 16:55, Ashutosh Bapat wrote: >> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: >>> So I looked at this a bit closely and came to the conclusion that we may >>> not need to keep partitioned table RT indexes in the >>> (Merge)Append.partitioned_rels after all, as far as execution-time locking >>> is concerned. >>> >>> Consider two cases: >>> >>> 1. Plan is created and executed in the same transaction >>> >>> In this case, locks taken on the partitioned tables by the planner will >>> suffice. >>> >>> 2. Plan is executed in a different transaction from the one in which it >>>was created (a cached plan) >>> >>> In this case, AcquireExecutorLocks will lock all the relations in >>> PlannedStmt.rtable, which must include all partitioned tables of all >>> partition trees involved in the query. Of those, it will lock the tables >>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >>> list of all partitioned table RT indexes obtained by concatenating >>> partitioned_rels lists of all ModifyTable nodes involved in the query >>> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >>> because we need to take the stronger lock on a given table before any >>> weaker one if it happens to appear in the query as a non-result relation >>> too, to avoid lock strength upgrade deadlock hazard. >>> >>> Moreover, because all the tables from plannedstmt->rtable, including the >>> partitioned tables, will be added to PlannedStmt.relationsOids, any >>> invalidation events affecting the partitioned tables (for example, >>> add/remove a partition) will cause the plan involving partitioned tables >>> to be recreated. >>> >>> In none of this do we rely on the partitioned table RT indexes appearing >>> in the (Merge)Append node itself. Maybe, we should just remove >>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a >>> separate patch and move on. >>> >>> Thoughts? >> >> Yes, I did the same analysis (to which I refer in my earlier reply to >> you). I too think we should just remove partitioned_rels from Append >> paths. But then the question is those are then transferred to >> ModifyTable node in create_modifytable_plan() and use it for something >> else. What should we do about that code? I don't think we are really >> using that list from ModifyTable node as well, so may be we could >> remove it from there as well. What do you think? Does that mean >> partitioned_rels isn't used at all in the code? > > No, we cannot simply get rid of partitioned_rels altogether. We'll need > to keep it in the ModifyTable node, because we *do* need the > nonleafResultRelations list in PlannedStmt to distinguish partitioned > table result relations, which set_plan_refs builds by concatenating > partitioned_rels lists of various ModifyTable nodes of the query. The > PlannedStmt.nonleafResultRelations list actually has some use (which > parallels PlannedStmt.resultRelations), but partitioned_rels list in the > individual (Merge)Append, as it turns out, doesn't. > > So, we can remove partitioned_rels from (Merge)AppendPath and > (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). Don't we need partitioned_rels from Append paths to be transferred to ModifyTable node or we have a different way of calculating nonleafResultRelations? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 16:39, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote > wrote: >> On 2017/09/11 19:45, Ashutosh Bapat wrote: >>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: IMHO, we should make it the responsibility of the future patch to set a child PlanRowMark's prti to the direct parent's RT index, when we actually know that it's needed for something. We clearly know today why we need to pass the other objects like child RT entry, RT index, and Relation, so we should limit this patch to pass only those objects to the recursive call. That makes this patch a relatively easy to understand change. >>> >>> I think you are mixing two issues here 1. setting parent RTI in child >>> PlanRowMark and 2. passing immediate parent's PlanRowMark to >>> expand_single_inheritance_child(). >>> >>> I have discussed 1 in my reply to Robert. >>> >>> About 2 you haven't given any particular comments to my reply. To me >>> it looks like it's this patch that introduces the notion of >>> multi-level expansion, so it's natural for this patch to pass >>> PlanRowMark in cascaded fashion similar to other structures. >> >> You patch does 2 to be able to do 1, doesn't it? That is, to be able to >> set the child PlanRowMark's prti to the direct parent's RT index, you pass >> the immediate parent's PlanRowMark to the recursive call of >> expand_single_inheritance_child(). > > No. child PlanRowMark's prti is set to parentRTIndex, which is a > separate argument and is used to also set parent_relid in > AppendRelInfo. OK. So, to keep the old behavior (if at all), we'd actually need a new argument rootParentRTindex. Old behavior being that all child PlanRowMarks has the rootParentRTindex as their prti. It seems though that the new behavior where prti will now be set to the direct parent's RT index is more or less harmless, because whatever we set prti to, as long as it's different from rti, we can consider it a child PlanRowMark. So it might be fine to set prti to direct parent's RT index. That said, I noticed that we might need to be careful about what the value of the root parent's PlanRowMark's allMarkType field gets set to. We need to make sure that it reflects markType of all partitions in the tree, including those that are not root parent's direct children. Is that true with the proposed implementation? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 16:55, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: >> So I looked at this a bit closely and came to the conclusion that we may >> not need to keep partitioned table RT indexes in the >> (Merge)Append.partitioned_rels after all, as far as execution-time locking >> is concerned. >> >> Consider two cases: >> >> 1. Plan is created and executed in the same transaction >> >> In this case, locks taken on the partitioned tables by the planner will >> suffice. >> >> 2. Plan is executed in a different transaction from the one in which it >>was created (a cached plan) >> >> In this case, AcquireExecutorLocks will lock all the relations in >> PlannedStmt.rtable, which must include all partitioned tables of all >> partition trees involved in the query. Of those, it will lock the tables >> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >> list of all partitioned table RT indexes obtained by concatenating >> partitioned_rels lists of all ModifyTable nodes involved in the query >> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >> because we need to take the stronger lock on a given table before any >> weaker one if it happens to appear in the query as a non-result relation >> too, to avoid lock strength upgrade deadlock hazard. >> >> Moreover, because all the tables from plannedstmt->rtable, including the >> partitioned tables, will be added to PlannedStmt.relationsOids, any >> invalidation events affecting the partitioned tables (for example, >> add/remove a partition) will cause the plan involving partitioned tables >> to be recreated. >> >> In none of this do we rely on the partitioned table RT indexes appearing >> in the (Merge)Append node itself. Maybe, we should just remove >> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a >> separate patch and move on. >> >> Thoughts? > > Yes, I did the same analysis (to which I refer in my earlier reply to > you). I too think we should just remove partitioned_rels from Append > paths. But then the question is those are then transferred to > ModifyTable node in create_modifytable_plan() and use it for something > else. What should we do about that code? I don't think we are really > using that list from ModifyTable node as well, so may be we could > remove it from there as well. What do you think? Does that mean > partitioned_rels isn't used at all in the code? No, we cannot simply get rid of partitioned_rels altogether. We'll need to keep it in the ModifyTable node, because we *do* need the nonleafResultRelations list in PlannedStmt to distinguish partitioned table result relations, which set_plan_refs builds by concatenating partitioned_rels lists of various ModifyTable nodes of the query. The PlannedStmt.nonleafResultRelations list actually has some use (which parallels PlannedStmt.resultRelations), but partitioned_rels list in the individual (Merge)Append, as it turns out, doesn't. So, we can remove partitioned_rels from (Merge)AppendPath and (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: > On 2017/09/11 21:07, Ashutosh Bapat wrote: >> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas wrote: >>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat >>> wrote: So, all partitioned partitions are getting locked correctly. Am I missing something? >>> >>> That's not a valid test. In that scenario, you're going to hold all >>> the locks acquired by the planner, all the locks acquired by the >>> rewriter, and all the locks acquired by the executor, but when using >>> prepared queries, it's possible to execute the plan after the planner >>> and rewriter locks are no longer held. >> >> I see the same thing when I use prepare and execute > > So I looked at this a bit closely and came to the conclusion that we may > not need to keep partitioned table RT indexes in the > (Merge)Append.partitioned_rels after all, as far as execution-time locking > is concerned. > > Consider two cases: > > 1. Plan is created and executed in the same transaction > > In this case, locks taken on the partitioned tables by the planner will > suffice. > > 2. Plan is executed in a different transaction from the one in which it >was created (a cached plan) > > In this case, AcquireExecutorLocks will lock all the relations in > PlannedStmt.rtable, which must include all partitioned tables of all > partition trees involved in the query. Of those, it will lock the tables > whose RT indexes appear in PlannedStmt.nonleafResultRelations with > RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global > list of all partitioned table RT indexes obtained by concatenating > partitioned_rels lists of all ModifyTable nodes involved in the query > (set_plan_refs does that). We need to distinguish nonleafResultRelations, > because we need to take the stronger lock on a given table before any > weaker one if it happens to appear in the query as a non-result relation > too, to avoid lock strength upgrade deadlock hazard. > > Moreover, because all the tables from plannedstmt->rtable, including the > partitioned tables, will be added to PlannedStmt.relationsOids, any > invalidation events affecting the partitioned tables (for example, > add/remove a partition) will cause the plan involving partitioned tables > to be recreated. > > In none of this do we rely on the partitioned table RT indexes appearing > in the (Merge)Append node itself. Maybe, we should just remove > partitioned_rels from (Merge)AppendPath and (Merge)Append node in a > separate patch and move on. > > Thoughts? Yes, I did the same analysis (to which I refer in my earlier reply to you). I too think we should just remove partitioned_rels from Append paths. But then the question is those are then transferred to ModifyTable node in create_modifytable_plan() and use it for something else. What should we do about that code? I don't think we are really using that list from ModifyTable node as well, so may be we could remove it from there as well. What do you think? Does that mean partitioned_rels isn't used at all in the code? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 21:07, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas wrote: >> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat >> wrote: >>> So, all partitioned partitions are getting locked correctly. Am I >>> missing something? >> >> That's not a valid test. In that scenario, you're going to hold all >> the locks acquired by the planner, all the locks acquired by the >> rewriter, and all the locks acquired by the executor, but when using >> prepared queries, it's possible to execute the plan after the planner >> and rewriter locks are no longer held. > > I see the same thing when I use prepare and execute So I looked at this a bit closely and came to the conclusion that we may not need to keep partitioned table RT indexes in the (Merge)Append.partitioned_rels after all, as far as execution-time locking is concerned. Consider two cases: 1. Plan is created and executed in the same transaction In this case, locks taken on the partitioned tables by the planner will suffice. 2. Plan is executed in a different transaction from the one in which it was created (a cached plan) In this case, AcquireExecutorLocks will lock all the relations in PlannedStmt.rtable, which must include all partitioned tables of all partition trees involved in the query. Of those, it will lock the tables whose RT indexes appear in PlannedStmt.nonleafResultRelations with RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global list of all partitioned table RT indexes obtained by concatenating partitioned_rels lists of all ModifyTable nodes involved in the query (set_plan_refs does that). We need to distinguish nonleafResultRelations, because we need to take the stronger lock on a given table before any weaker one if it happens to appear in the query as a non-result relation too, to avoid lock strength upgrade deadlock hazard. Moreover, because all the tables from plannedstmt->rtable, including the partitioned tables, will be added to PlannedStmt.relationsOids, any invalidation events affecting the partitioned tables (for example, add/remove a partition) will cause the plan involving partitioned tables to be recreated. In none of this do we rely on the partitioned table RT indexes appearing in the (Merge)Append node itself. Maybe, we should just remove partitioned_rels from (Merge)AppendPath and (Merge)Append node in a separate patch and move on. Thoughts? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote wrote: > On 2017/09/11 19:45, Ashutosh Bapat wrote: >> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >>> IMHO, we should make it the responsibility of the future patch to set a >>> child PlanRowMark's prti to the direct parent's RT index, when we actually >>> know that it's needed for something. We clearly know today why we need to >>> pass the other objects like child RT entry, RT index, and Relation, so we >>> should limit this patch to pass only those objects to the recursive call. >>> That makes this patch a relatively easy to understand change. >> >> I think you are mixing two issues here 1. setting parent RTI in child >> PlanRowMark and 2. passing immediate parent's PlanRowMark to >> expand_single_inheritance_child(). >> >> I have discussed 1 in my reply to Robert. >> >> About 2 you haven't given any particular comments to my reply. To me >> it looks like it's this patch that introduces the notion of >> multi-level expansion, so it's natural for this patch to pass >> PlanRowMark in cascaded fashion similar to other structures. > > You patch does 2 to be able to do 1, doesn't it? That is, to be able to > set the child PlanRowMark's prti to the direct parent's RT index, you pass > the immediate parent's PlanRowMark to the recursive call of > expand_single_inheritance_child(). No. child PlanRowMark's prti is set to parentRTIndex, which is a separate argument and is used to also set parent_relid in AppendRelInfo. > >> Actually, the original problem that caused this discussion started >> with an assertion failure in get_partitioned_child_rels() as >> Assert(list_length(result) >= 1); >> >> This assertion fails if result is NIL when an intermediate partitioned >> table is passed. May be we should assert (result == NIL || >> list_length(result) == 1) and allow that function to be called even >> for intermediate partitioned partitions for which the function will >> return NIL. That will leave the code in add_paths_to_append_rel() >> simple. Thoughts? > > Yeah, I guess that could work. We'll just have to write comments to > describe why the Assert is written that way. > >>> By the way, when we call expand_single_inheritance_child() in the >>> non-partitioned inheritance case, we should pass NULL for childrte_p, >>> childRTindex_p, childrc_p, instead of declaring variables that won't be >>> used. Hence, expand_single_inheritance_child() should make those >>> arguments optional. >> >> That introduces an extra "if" condition, which is costlier than an >> assignment. We have used both the styles in the code. Previously, I >> have got comments otherwise. So, I am not sure. > > OK. expand_single_inheritance_child's header comment does not mention the > new result fields. Maybe add a comment describing what their role is and > that they're not optional arguments. > >> I will update the patches once we have some resolution about 1. prti >> in PlanRowMarks and 2. detection of root partitioned table in >> add_paths_to_append_rel(). > > OK. > > About 2, I somewhat agree with your proposed solution above, which might > be simpler to explain in comments than the code I proposed. After testing a few queries I am getting a feeling that ExecLockNonLeafAppendTables isn't really locking anything. I will write more about that in my reply to Robert's mail. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 19:45, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >> IMHO, we should make it the responsibility of the future patch to set a >> child PlanRowMark's prti to the direct parent's RT index, when we actually >> know that it's needed for something. We clearly know today why we need to >> pass the other objects like child RT entry, RT index, and Relation, so we >> should limit this patch to pass only those objects to the recursive call. >> That makes this patch a relatively easy to understand change. > > I think you are mixing two issues here 1. setting parent RTI in child > PlanRowMark and 2. passing immediate parent's PlanRowMark to > expand_single_inheritance_child(). > > I have discussed 1 in my reply to Robert. > > About 2 you haven't given any particular comments to my reply. To me > it looks like it's this patch that introduces the notion of > multi-level expansion, so it's natural for this patch to pass > PlanRowMark in cascaded fashion similar to other structures. You patch does 2 to be able to do 1, doesn't it? That is, to be able to set the child PlanRowMark's prti to the direct parent's RT index, you pass the immediate parent's PlanRowMark to the recursive call of expand_single_inheritance_child(). All I am trying to say is that this patch's mission is to expand inheritance step-wise to be able to do certain things in the *planner* that weren't possible before. The patch accomplishes that by creating child AppendRelInfos such that its parent_relid field is set to the immediate parent's RT index. It's quite clear why we're doing so. It's not clear why we should do so for PlanRowMarks too. Maybe it's fine as long as nothing breaks. >> If we go with your patch, partitioned tables won't get locked, for >> example, in case of the following query (p is a partitioned table): >> >> select 1 from p union all select 2 from p; >> >> That's because the RelOptInfos for the two instances of p in the above >> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children >> of the Append corresponding to the UNION ALL subquery RTE. So, >> partitioned_rels does not get set per your proposed code. > [...] > So, all partitioned partitions are getting locked correctly. Am I > missing something? Will reply to this separately to your other email. > Actually, the original problem that caused this discussion started > with an assertion failure in get_partitioned_child_rels() as > Assert(list_length(result) >= 1); > > This assertion fails if result is NIL when an intermediate partitioned > table is passed. May be we should assert (result == NIL || > list_length(result) == 1) and allow that function to be called even > for intermediate partitioned partitions for which the function will > return NIL. That will leave the code in add_paths_to_append_rel() > simple. Thoughts? Yeah, I guess that could work. We'll just have to write comments to describe why the Assert is written that way. >> In create_lateral_join_info(): >> >> +Assert(IS_SIMPLE_REL(brel)); >> +Assert(brte); >> >> The second Assert is either unnecessary or should be placed first. > > simple_rte_array[] may have some NULL entries. Second assert makes > sure that we aren't dealing with a NULL entry. Any particular reason > to reorder the asserts? Sorry, I missed that the 2nd Assert has b"rte". I thought it's b"rel". >> The following comment could be made a bit clearer. >> >> + * In the case of table inheritance, the parent RTE is directly >> linked >> + * to every child table via an AppendRelInfo. In the case of table >> + * partitioning, the inheritance hierarchy is expanded one level at >> a >> + * time rather than flattened. Therefore, an other member rel >> that is >> + * a partitioned table may have children of its own, and must >> + * therefore be marked with the appropriate lateral info so that >> those >> + * children eventually get marked also. >> >> How about: In the case of partitioned table inheritance, the original >> parent RTE is linked, via AppendRelInfo, only to its immediate partitions. >> Partitions below the first level are accessible only via their immediate >> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so >> consider those as well. > > I don't see much difference between those two. We usually do not use > macros in comments, so usually comments mention "other member" rel. > Let's leave this for the committer to judge. Sure. >> In expand_inherited_rtentry(), the following comment fragment is obsolete, >> because we *do* now create AppendRelInfo's for partitioned children: >> >> +/* >> + * We keep a list of objects in root, each of which maps a >> partitioned >> + * parent RT index to the list of RT indexes of its partitioned >> child >> + * tables which do not have AppendRelInfos associated with those. > > Good catch. I have
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 11, 2017 at 8:07 AM, Ashutosh Bapat wrote: > I see the same thing when I use prepare and execute Hmm. Well, that's good, but it doesn't prove there's no bug. We have to understand where and why it's getting locked to know whether the behavior will be correct in all cases. I haven't had time to look at Amit's comments in detail yet so I don't know whether I agree with his analysis or not, but we have to look at what's going on under the hood to know whether the engine is working -- not just listen to the noise it makes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas wrote: > On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat > wrote: >> So, all partitioned partitions are getting locked correctly. Am I >> missing something? > > That's not a valid test. In that scenario, you're going to hold all > the locks acquired by the planner, all the locks acquired by the > rewriter, and all the locks acquired by the executor, but when using > prepared queries, it's possible to execute the plan after the planner > and rewriter locks are no longer held. > I see the same thing when I use prepare and execute Session 1 postgres=# prepare stmt as select 1 from t1 union all select 2 from t1; PREPARE postgres=# select pg_backend_pid(); pg_backend_pid 50912 (1 row) postgres=# begin; BEGIN postgres=# execute stmt; ?column? -- (0 rows) Session 2 postgres=# select locktype, relation::regclass, virtualxid, virtualtransaction, pid, mode, granted, fastpath from pg_locks; locktype | relation | virtualxid | virtualtransaction | pid | mode | granted | fastpath +--+++---+-+-+-- relation | pg_locks || 4/4| 50914 | AccessShareLock | t | t virtualxid | | 4/4| 4/4| 50914 | ExclusiveLock | t | t relation | t1p1p1 || 3/12 | 50912 | AccessShareLock | t | t relation | t1p1 || 3/12 | 50912 | AccessShareLock | t | t relation | t1 || 3/12 | 50912 | AccessShareLock | t | t virtualxid | | 3/12 | 3/12 | 50912 | ExclusiveLock | t | t (6 rows) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat wrote: > So, all partitioned partitions are getting locked correctly. Am I > missing something? That's not a valid test. In that scenario, you're going to hold all the locks acquired by the planner, all the locks acquired by the rewriter, and all the locks acquired by the executor, but when using prepared queries, it's possible to execute the plan after the planner and rewriter locks are no longer held. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote wrote: > On 2017/09/11 16:23, Ashutosh Bapat wrote: >> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas wrote: >>> I'm a bit suspicious about the fact that there are now executor >>> changes related to the PlanRowMarks. If the rowmark's prti is now the >>> intermediate parent's RT index rather than the top-parent's RT index, >>> it'd seem like that'd matter somehow. Maybe it doesn't, because the >>> code that cares about prti seems to only care about whether it's >>> different from rti. But if that's true everywhere, then why even >>> change this? I think we might be well off not to tinker with things >>> that don't need to be changed. >> >> In the definition of ExecRowMark, I see >> Index prti; /* parent range table index, if child */ >> It just says parent, by which I take as direct parent. For >> inheritance, which earlier flattened inheritance hierarchy, direct >> parent was top parent. So, probably nobody thought whether a parent is >> direct parent or top parent. But now that we have introduced that >> concept we need to interpret this comment anew. And I think >> interpreting it as direct parent is non-lossy. > > But then we also don't have anything to say about why we're making that > change. If you could describe what non-lossy is in this context, then > fine. By setting prti to the topmost parent RTI we are loosing information that this child may be an intermediate child similar to what we did earlier to AppendRelInfo. That's the lossy-ness in this context. > But that we'd like to match with what we're going to do for > AppendRelInfos does not seem to be a sufficient explanation for this change. The purpose of this patch is to change the parent-child linkages for partitioned table and prti is one of them. So, in fact, I am wondering why not to change that along with AppendRelInfo. > >> If we set top parent's >> index, parent RTI in AppendRelInfo and PlanRowMark would not agree. >> So, it looks quite natural that we set the direct parent's index in >> PlanRowMark. > > They would not agree, yes, but aren't they unrelated? If we have a reason > for them to agree, (for example, row-locking breaks in the inherited table > case if we didn't), then we should definitely make them agree. > > Updating the comment for prti definition might be something that this > patch could (should?) do, but I'm not quite sure about that too. > To me that looks backwards again for the reasons described above. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: > On 2017/09/09 2:38, Ashutosh Bapat wrote: >> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote: >>> I updated the patch to include just those changes. I'm not sure about >>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed >>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think >>> the child RTE, child RT index and child Relation are fine, because they >>> are necessary for creating AppendRelInfos in a desired way for later >>> planning steps. But PlanRowMarks are not processed within the planner >>> afterwards and do not need to be marked with the immediate parent-child >>> association in the same way that AppendRelInfos need to be. >> >> Passing top parent's row mark works today, since there is no >> parent-child specific translation happening there. But if in future, >> we introduce such a translation, row marks for indirect children in a >> multi-level partitioned hierarchy won't be accurate. So, I think it's >> better to pass row marks of the direct parent. > > IMHO, we should make it the responsibility of the future patch to set a > child PlanRowMark's prti to the direct parent's RT index, when we actually > know that it's needed for something. We clearly know today why we need to > pass the other objects like child RT entry, RT index, and Relation, so we > should limit this patch to pass only those objects to the recursive call. > That makes this patch a relatively easy to understand change. I think you are mixing two issues here 1. setting parent RTI in child PlanRowMark and 2. passing immediate parent's PlanRowMark to expand_single_inheritance_child(). I have discussed 1 in my reply to Robert. About 2 you haven't given any particular comments to my reply. To me it looks like it's this patch that introduces the notion of multi-level expansion, so it's natural for this patch to pass PlanRowMark in cascaded fashion similar to other structures. > >>> I also included the changes to add_paths_to_append_rel() from my patch on >>> the "path toward faster partition pruning" thread. We'd need that change, >>> because while add_paths_to_append_rel() is called for all partitioned >>> table RTEs in a given partition tree, expand_inherited_rtentry() would >>> have set up a PartitionedChildRelInfo only for the root parent, so >>> get_partitioned_child_rels() would not find the same for non-root >>> partitioned table rels and crash failing the Assert. The changes I made >>> are such that we call get_partitioned_child_rels() only for the parent >>> rels that are known to correspond root partitioned tables (or as you >>> pointed out on the thread, "the table named in the query" as opposed those >>> added to the query as result of inheritance expansion). In addition to >>> the relkind check on the input RTE, it would seem that checking that the >>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because >>> if a partitioned table is accessed in a UNION ALL query, reloptkind even >>> for the root partitioned table (the table named in the query) would be >>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is >>> actually the root partitioned table is to check whether its parent rel is >>> not RTE_RELATION, because the parent in case of UNION ALL Append is a >>> RTE_SUBQUERY RT entry. >>> >> >> There was a change in my 0003 patch, which fixed the crash. It checked >> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in >> my 0001 patch. It no more crashes. I tried various queries involving >> set operations and bare multi-level partitioned table scan with my >> patch, but none of them showed any anomaly. Do you have a testcase >> which shows problem with my patch? May be your suggestion is correct, >> but corresponding code implementation is slightly longer than I would >> expect. So, we should go with it, if there is corresponding testcase >> which shows why it's needed. > > If we go with your patch, partitioned tables won't get locked, for > example, in case of the following query (p is a partitioned table): > > select 1 from p union all select 2 from p; > > That's because the RelOptInfos for the two instances of p in the above > query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children > of the Append corresponding to the UNION ALL subquery RTE. So, > partitioned_rels does not get set per your proposed code. Session 1: postgres=# begin; BEGIN postgres=# select 1 from t1 union all select 2 from t1; ?column? -- (0 rows) postgres=# select pg_backend_pid(); pg_backend_pid 28843 (1 row) Session 2 postgres=# select locktype, relation::regclass, virtualxid, virtualtransaction, pid, mode, granted, fastpath from pg_locks; locktype | relation | virtualxid | virtualtransaction | pid | mode | granted | fastpath +--+++---+-
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 16:23, Ashutosh Bapat wrote: > On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas wrote: >> I'm a bit suspicious about the fact that there are now executor >> changes related to the PlanRowMarks. If the rowmark's prti is now the >> intermediate parent's RT index rather than the top-parent's RT index, >> it'd seem like that'd matter somehow. Maybe it doesn't, because the >> code that cares about prti seems to only care about whether it's >> different from rti. But if that's true everywhere, then why even >> change this? I think we might be well off not to tinker with things >> that don't need to be changed. > > In the definition of ExecRowMark, I see > Index prti; /* parent range table index, if child */ > It just says parent, by which I take as direct parent. For > inheritance, which earlier flattened inheritance hierarchy, direct > parent was top parent. So, probably nobody thought whether a parent is > direct parent or top parent. But now that we have introduced that > concept we need to interpret this comment anew. And I think > interpreting it as direct parent is non-lossy. But then we also don't have anything to say about why we're making that change. If you could describe what non-lossy is in this context, then fine. But that we'd like to match with what we're going to do for AppendRelInfos does not seem to be a sufficient explanation for this change. > If we set top parent's > index, parent RTI in AppendRelInfo and PlanRowMark would not agree. > So, it looks quite natural that we set the direct parent's index in > PlanRowMark. They would not agree, yes, but aren't they unrelated? If we have a reason for them to agree, (for example, row-locking breaks in the inherited table case if we didn't), then we should definitely make them agree. Updating the comment for prti definition might be something that this patch could (should?) do, but I'm not quite sure about that too. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas wrote: > On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat > wrote: >>> I also confirmed that the partition-pruning patch set works fine with this >>> patch instead of the patch on that thread with the same functionality, >>> which I will now drop from that patch set. Sorry about the wasted time. >> >> Thanks a lot. Please review the patch in the updated patchset. > > In set_append_rel_size(), I don't find the comment too clear (and this > part was taken from Amit's patch, right?). I suggest: No, I didn't take it from Amit's patch. Both of us have different wordings. But yours is better than both of us. Included it in the attached patches. > > +/* > + * Associate the partitioned tables which are descendents of the table > + * named in the query with the topmost append path (i.e. the one where > + * rel->reloptkind is RELOPT_BASEREL). This ensures that they get > properly > + * locked at execution time. > + */ > > I'm a bit suspicious about the fact that there are now executor > changes related to the PlanRowMarks. If the rowmark's prti is now the > intermediate parent's RT index rather than the top-parent's RT index, > it'd seem like that'd matter somehow. Maybe it doesn't, because the > code that cares about prti seems to only care about whether it's > different from rti. But if that's true everywhere, then why even > change this? I think we might be well off not to tinker with things > that don't need to be changed. In the definition of ExecRowMark, I see Index prti; /* parent range table index, if child */ It just says parent, by which I take as direct parent. For inheritance, which earlier flattened inheritance hierarchy, direct parent was top parent. So, probably nobody thought whether a parent is direct parent or top parent. But now that we have introduced that concept we need to interpret this comment anew. And I think interpreting it as direct parent is non-lossy. If we set top parent's index, parent RTI in AppendRelInfo and PlanRowMark would not agree. So, it looks quite natural that we set the direct parent's index in PlanRowMark. From that POV, we aren't changing anything, we are setting the same parent RTI in AppendRelInfo and PlanRowMark. Chaning different parent RTIs in those two structure would be a real change. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/09 9:58, Robert Haas wrote: > I'm a bit suspicious about the fact that there are now executor > changes related to the PlanRowMarks. If the rowmark's prti is now the > intermediate parent's RT index rather than the top-parent's RT index, > it'd seem like that'd matter somehow. Maybe it doesn't, because the > code that cares about prti seems to only care about whether it's > different from rti. Yes, it doesn't matter. The important point though is that nothing we want to do in the short term requires us to set a child PlanRowMark's prti to its immediate parent's RT index, as I also mentioned in reply to Ashutosh. > But if that's true everywhere, then why even > change this? I think we might be well off not to tinker with things > that don't need to be changed. +1. > Apart from that concern, now that I understand (from my own failed > attempt and some off-list discussion) why this patch works the way it > does, I think this is in fairly good shape. I too think so, except we still need to incorporate changes to add_paths_to_append_rel() necessary to correctly set partitioned_rels, as I explained in reply Ashutosh. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/09 2:38, Ashutosh Bapat wrote: > On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote: >> I updated the patch to include just those changes. I'm not sure about >> one of the Ashutosh's changes whereby the child PlanRowMark is also passed >> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think >> the child RTE, child RT index and child Relation are fine, because they >> are necessary for creating AppendRelInfos in a desired way for later >> planning steps. But PlanRowMarks are not processed within the planner >> afterwards and do not need to be marked with the immediate parent-child >> association in the same way that AppendRelInfos need to be. > > Passing top parent's row mark works today, since there is no > parent-child specific translation happening there. But if in future, > we introduce such a translation, row marks for indirect children in a > multi-level partitioned hierarchy won't be accurate. So, I think it's > better to pass row marks of the direct parent. IMHO, we should make it the responsibility of the future patch to set a child PlanRowMark's prti to the direct parent's RT index, when we actually know that it's needed for something. We clearly know today why we need to pass the other objects like child RT entry, RT index, and Relation, so we should limit this patch to pass only those objects to the recursive call. That makes this patch a relatively easy to understand change. >> I also included the changes to add_paths_to_append_rel() from my patch on >> the "path toward faster partition pruning" thread. We'd need that change, >> because while add_paths_to_append_rel() is called for all partitioned >> table RTEs in a given partition tree, expand_inherited_rtentry() would >> have set up a PartitionedChildRelInfo only for the root parent, so >> get_partitioned_child_rels() would not find the same for non-root >> partitioned table rels and crash failing the Assert. The changes I made >> are such that we call get_partitioned_child_rels() only for the parent >> rels that are known to correspond root partitioned tables (or as you >> pointed out on the thread, "the table named in the query" as opposed those >> added to the query as result of inheritance expansion). In addition to >> the relkind check on the input RTE, it would seem that checking that the >> reloptkind is RELOPT_BASEREL would be enough. But actually not, because >> if a partitioned table is accessed in a UNION ALL query, reloptkind even >> for the root partitioned table (the table named in the query) would be >> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is >> actually the root partitioned table is to check whether its parent rel is >> not RTE_RELATION, because the parent in case of UNION ALL Append is a >> RTE_SUBQUERY RT entry. >> > > There was a change in my 0003 patch, which fixed the crash. It checked > for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in > my 0001 patch. It no more crashes. I tried various queries involving > set operations and bare multi-level partitioned table scan with my > patch, but none of them showed any anomaly. Do you have a testcase > which shows problem with my patch? May be your suggestion is correct, > but corresponding code implementation is slightly longer than I would > expect. So, we should go with it, if there is corresponding testcase > which shows why it's needed. If we go with your patch, partitioned tables won't get locked, for example, in case of the following query (p is a partitioned table): select 1 from p union all select 2 from p; That's because the RelOptInfos for the two instances of p in the above query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children of the Append corresponding to the UNION ALL subquery RTE. So, partitioned_rels does not get set per your proposed code. > > In your patch > > +parent_rel = root->simple_rel_array[parent_relid]; > +get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY); > > Do you mean RTE_RELATION as you explained above? No, I mean RTE_SUBQUERY. If the partitioned table RTE in question corresponds to one named in the query, we should be able to find its pcinfo in root->pcinfo_list. If the partitioned table RTE is one added as result of inheritance expansion, it won't have an associated pcinfo. So, we should find a way to distinguish them from one another. The first idea that had occurred to me was the same as yours -- RelOptInfo of the partitioned table RTE named in the query would be of reloptkind RELOPT_BASEREL and those of the partitioned table RTE added as result of inheritance expansion will be of reloptkind RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former is not. If the partitioned table named in the query appears under UNION ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That means we have to use some other means to distinguish partitioned table RTEs that have an asso
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat wrote: >> I also confirmed that the partition-pruning patch set works fine with this >> patch instead of the patch on that thread with the same functionality, >> which I will now drop from that patch set. Sorry about the wasted time. > > Thanks a lot. Please review the patch in the updated patchset. In set_append_rel_size(), I don't find the comment too clear (and this part was taken from Amit's patch, right?). I suggest: +/* + * Associate the partitioned tables which are descendents of the table + * named in the query with the topmost append path (i.e. the one where + * rel->reloptkind is RELOPT_BASEREL). This ensures that they get properly + * locked at execution time. + */ I'm a bit suspicious about the fact that there are now executor changes related to the PlanRowMarks. If the rowmark's prti is now the intermediate parent's RT index rather than the top-parent's RT index, it'd seem like that'd matter somehow. Maybe it doesn't, because the code that cares about prti seems to only care about whether it's different from rti. But if that's true everywhere, then why even change this? I think we might be well off not to tinker with things that don't need to be changed. Apart from that concern, now that I understand (from my own failed attempt and some off-list discussion) why this patch works the way it does, I think this is in fairly good shape. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 8, 2017 at 1:47 AM, Amit Langote wrote: > When I tried the attached patch, it doesn't seem to expand partitioning > inheritance in step-wise manner as the patch's title says. I think the > rewritten patch forgot to include Ashutosh's changes to > expand_single_inheritance_child() whereby the AppendRelInfo of the child > will be marked with the direct parent instead of always the root parent. Woops. > I updated the patch to include just those changes. I'm not sure about > one of the Ashutosh's changes whereby the child PlanRowMark is also passed > to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think > the child RTE, child RT index and child Relation are fine, because they > are necessary for creating AppendRelInfos in a desired way for later > planning steps. But PlanRowMarks are not processed within the planner > afterwards and do not need to be marked with the immediate parent-child > association in the same way that AppendRelInfos need to be. We probably need some better comments to explain which things need to be marked using the immediate parent and which need to be marked using the baserel, and why. > I also included the changes to add_paths_to_append_rel() from my patch on > the "path toward faster partition pruning" thread. We'd need that change, > because while add_paths_to_append_rel() is called for all partitioned > table RTEs in a given partition tree, expand_inherited_rtentry() would > have set up a PartitionedChildRelInfo only for the root parent, so > get_partitioned_child_rels() would not find the same for non-root > partitioned table rels and crash failing the Assert. The changes I made > are such that we call get_partitioned_child_rels() only for the parent > rels that are known to correspond root partitioned tables (or as you > pointed out on the thread, "the table named in the query" as opposed those > added to the query as result of inheritance expansion). In addition to > the relkind check on the input RTE, it would seem that checking that the > reloptkind is RELOPT_BASEREL would be enough. But actually not, because > if a partitioned table is accessed in a UNION ALL query, reloptkind even > for the root partitioned table (the table named in the query) would be > RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is > actually the root partitioned table is to check whether its parent rel is > not RTE_RELATION, because the parent in case of UNION ALL Append is a > RTE_SUBQUERY RT entry. OK, so this needs some good comments, too... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas wrote: > On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat > wrote: >> accumulate_append_subpath() is executed for every path instead of >> every relation, so changing it would collect the same list multiple >> times. Instead, I found the old way of associating all intermediate >> partitions with the root partitioned relation work better. Here's the >> updated patch set. > > When I tried out patch 0001, it crashed repeatedly during 'make check' > because of an assertion failure in get_partitioned_child_rels. Running "make check" on the whole patchset doesn't give that failure. So I didn't notice the crash since I was running regression on the whole patchset. Sorry for that. Fortunately git rebase -i allows us to execute shell commands while applying patches, so I have set it up to compile each patch and run regression. Hopefully that will catch such errors in future. That process showed me that patch 0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes that crash by calling get_partitioned_child_rels() only on the root partitioned table for which we have set up child_rels list. Amit Langote has a similar fix reported in his reply. So, we will discuss it there. > It > seemed to me that the way the patch was refactoring > expand_inherited_rtentry involved more code rearrangement than > necessary, so I reverted all the code rearrangement and just kept the > functional changes, and all the crashes went away. (That refactoring > also failed to initialize has_child properly.) In so doing, I > reintroduced the problem that the PartitionedChildRelInfo lists > weren't getting set up correctly, but after some thought I realized > that was just because expand_single_inheritance_child() was choosing > between adding an RTE and adding the OID to partitioned_child_rels, > whereas for an intermediate partitioned table it needs to do both. So > I inserted a trivial fix for that problem (replacing "else" with a new > "if"-test), basically: > > -else > + > +if (childrte->relkind == RELKIND_PARTITIONED_TABLE) > > Please check out the attached version of the patch. In addition to > the above simplifications, I did some adjustments to the comments in > various places - some just grammar and others a bit more substantive. > And I think I broke a long line in one place, too. > > One thing I notice is that if I rip out the changes to initsplan.c, > the new regression test still passes. If it's possible to write a > test that fails without those changes, I think it would be a good idea > to include one in the patch. That's certainly one of the subtler > parts of this patch, IMHO. Amit Langote has replied on these points as well. So, I will comment in a reply to his reply. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/08 14:47, Amit Langote wrote: > When I tried the attached patch, it doesn't seem to expand partitioning > inheritance in step-wise manner as the patch's title says. Oops. By "attached patch", I had meant to say the Robert's patch, that is, expand-stepwise-rmh.patch. Not expand-stepwise-rmh-2.patch, which is the updated version of the patch attached with the quoted message. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/08 4:04, Robert Haas wrote: > On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat > wrote: >> accumulate_append_subpath() is executed for every path instead of >> every relation, so changing it would collect the same list multiple >> times. Instead, I found the old way of associating all intermediate >> partitions with the root partitioned relation work better. Here's the >> updated patch set. > > When I tried out patch 0001, it crashed repeatedly during 'make check' > because of an assertion failure in get_partitioned_child_rels. It > seemed to me that the way the patch was refactoring > expand_inherited_rtentry involved more code rearrangement than > necessary, so I reverted all the code rearrangement and just kept the > functional changes, and all the crashes went away. (That refactoring > also failed to initialize has_child properly.) When I tried the attached patch, it doesn't seem to expand partitioning inheritance in step-wise manner as the patch's title says. I think the rewritten patch forgot to include Ashutosh's changes to expand_single_inheritance_child() whereby the AppendRelInfo of the child will be marked with the direct parent instead of always the root parent. I updated the patch to include just those changes. I'm not sure about one of the Ashutosh's changes whereby the child PlanRowMark is also passed to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think the child RTE, child RT index and child Relation are fine, because they are necessary for creating AppendRelInfos in a desired way for later planning steps. But PlanRowMarks are not processed within the planner afterwards and do not need to be marked with the immediate parent-child association in the same way that AppendRelInfos need to be. I also included the changes to add_paths_to_append_rel() from my patch on the "path toward faster partition pruning" thread. We'd need that change, because while add_paths_to_append_rel() is called for all partitioned table RTEs in a given partition tree, expand_inherited_rtentry() would have set up a PartitionedChildRelInfo only for the root parent, so get_partitioned_child_rels() would not find the same for non-root partitioned table rels and crash failing the Assert. The changes I made are such that we call get_partitioned_child_rels() only for the parent rels that are known to correspond root partitioned tables (or as you pointed out on the thread, "the table named in the query" as opposed those added to the query as result of inheritance expansion). In addition to the relkind check on the input RTE, it would seem that checking that the reloptkind is RELOPT_BASEREL would be enough. But actually not, because if a partitioned table is accessed in a UNION ALL query, reloptkind even for the root partitioned table (the table named in the query) would be RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is actually the root partitioned table is to check whether its parent rel is not RTE_RELATION, because the parent in case of UNION ALL Append is a RTE_SUBQUERY RT entry. > One thing I notice is that if I rip out the changes to initsplan.c, > the new regression test still passes. If it's possible to write a > test that fails without those changes, I think it would be a good idea > to include one in the patch. That's certainly one of the subtler > parts of this patch, IMHO. Back when this (step-wise expansion of partition inheritance) used to be a patch in the original declarative partitioning patch series, Ashutosh had reported a test query [1] that would fail getting a plan, for which we came up with the initsplan.c changes in this patch as the solution: ERROR: could not devise a query plan for the given query I tried that query again without the initsplan.c changes and somehow the same error does not occur anymore. It's strange because without the initsplan.c changes, there is no way for partitions lower in the tree than the first level to get the direct_lateral_relids and lateral_relids from the root parent rel. Maybe, Ashutosh has a way to devise the failing query again. I also confirmed that the partition-pruning patch set works fine with this patch instead of the patch on that thread with the same functionality, which I will now drop from that patch set. Sorry about the wasted time. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj%3Diqg%40mail.gmail.com diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 2d7e1d84d0..71b5bdf95e 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -24,6 +24,7 @@ #include "catalog/pg_operator.h" #include "catalog/pg_proc.h" #include "foreign/fdwapi.h" +#include "miscadmin.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #ifdef OPTIMIZER_DEBUG @@ -867,6 +868,9 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, int
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat wrote: > accumulate_append_subpath() is executed for every path instead of > every relation, so changing it would collect the same list multiple > times. Instead, I found the old way of associating all intermediate > partitions with the root partitioned relation work better. Here's the > updated patch set. When I tried out patch 0001, it crashed repeatedly during 'make check' because of an assertion failure in get_partitioned_child_rels. It seemed to me that the way the patch was refactoring expand_inherited_rtentry involved more code rearrangement than necessary, so I reverted all the code rearrangement and just kept the functional changes, and all the crashes went away. (That refactoring also failed to initialize has_child properly.) In so doing, I reintroduced the problem that the PartitionedChildRelInfo lists weren't getting set up correctly, but after some thought I realized that was just because expand_single_inheritance_child() was choosing between adding an RTE and adding the OID to partitioned_child_rels, whereas for an intermediate partitioned table it needs to do both. So I inserted a trivial fix for that problem (replacing "else" with a new "if"-test), basically: -else + +if (childrte->relkind == RELKIND_PARTITIONED_TABLE) Please check out the attached version of the patch. In addition to the above simplifications, I did some adjustments to the comments in various places - some just grammar and others a bit more substantive. And I think I broke a long line in one place, too. One thing I notice is that if I rip out the changes to initsplan.c, the new regression test still passes. If it's possible to write a test that fails without those changes, I think it would be a good idea to include one in the patch. That's certainly one of the subtler parts of this patch, IMHO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company expand-stepwise-rmh.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska wrote: > Ashutosh Bapat wrote: > >> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska wrote: >> > >> > >> > >> > * get_partitioned_child_rels_for_join() >> > >> > I think the Assert() statement is easier to understand inside the loop, see >> > the assert.diff attachment. > >> The assert at the end of function also checks that we have got >> child_rels lists for all the parents passed in. > > Really? I can imagine that some instances of PartitionedChildRelInfo have the > child_rels list empty, while other ones have these lists long enough to > compensate for the empty lists. > That isn't true. Each child_rels list will at least have one entry. Please see get_partitioned_child_rels(). >> > >> > >> > * have_partkey_equi_join() >> > >> > As the function handles generic join, this comment doesn't seem to me >> > relevant: >> > >> > /* >> > * The equi-join between partition keys is strict if equi-join between >> > * at least one partition key is using a strict operator. See >> > * explanation about outer join reordering identity 3 in >> > * optimizer/README >> > */ >> > strict_op = op_strict(opexpr->opno); >> >> What in that comment is not exactly relevant? > > Basically I don't understand why you mention join reordering here. The join > ordering questions must all have been resolved by the time > have_partkey_equi_join() is called. I am referring to a particular section in README which talks about the relation between strict operator and legal join order. > >> > >> > And I think the function can return true even if strict_op is false for all >> > the operators evaluated in the loop. >> >> I think it does that. Do you have a case where it doesn't? > > Here I refer to this part of the comment above: > > "... if equi-join between at least one partition key is using a strict > operator." > > My understanding of the code (especially match_expr_to_partition_keys) is that > no operator actually needs to be strict as long as each operator involved in > the join matches at least one non-nullable expression on both sides of the > join. I don't think so. A strict operator returns NULL when either of the inputs is NULL. We can not say so for non-strict operators, which may deem NULL and non-NULL arguments as equal, even though that looks insane. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Ashutosh Bapat wrote: > On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska wrote: > > > > > > > > * get_partitioned_child_rels_for_join() > > > > I think the Assert() statement is easier to understand inside the loop, see > > the assert.diff attachment. > The assert at the end of function also checks that we have got > child_rels lists for all the parents passed in. Really? I can imagine that some instances of PartitionedChildRelInfo have the child_rels list empty, while other ones have these lists long enough to compensate for the empty lists. > > > > > > * have_partkey_equi_join() > > > > As the function handles generic join, this comment doesn't seem to me > > relevant: > > > > /* > > * The equi-join between partition keys is strict if equi-join between > > * at least one partition key is using a strict operator. See > > * explanation about outer join reordering identity 3 in > > * optimizer/README > > */ > > strict_op = op_strict(opexpr->opno); > > What in that comment is not exactly relevant? Basically I don't understand why you mention join reordering here. The join ordering questions must all have been resolved by the time have_partkey_equi_join() is called. > > > > And I think the function can return true even if strict_op is false for all > > the operators evaluated in the loop. > > I think it does that. Do you have a case where it doesn't? Here I refer to this part of the comment above: "... if equi-join between at least one partition key is using a strict operator." My understanding of the code (especially match_expr_to_partition_keys) is that no operator actually needs to be strict as long as each operator involved in the join matches at least one non-nullable expression on both sides of the join. > > * match_expr_to_partition_keys() > > > > I'm not sure this comment is clear enough: > > > > /* > > * If it's a strict equi-join a NULL partition key on one side will > > * not join a NULL partition key on the other side. So, rows with NULL > > * partition key from a partition on one side can not join with those > > * from a non-matching partition on the other side. So, search the > > * nullable partition keys as well. > > */ > > if (!strict_op) > > continue; > > > > My understanding of the problem of NULL values generated by outer join is: > > these NULL values --- if evaluated by non-strict expression --- can make row > > of N-th partition on one side of the join match row(s) of *other than* N-th > > partition(s) on the other side. Thus the nullable input expressions may only > > be evaluated by strict operators. I think it'd be clearer if you stressed > > that > > (undesired) *match* of partition keys can be a problem, rather than mismatch > > Sorry, I am not able to understand this. To me it looks like my > wording conveys what you are saying. I just tried to expreess the idea in a way that is clearer to me. I think we both mean the same. Not sure I should spend more effort on another version of the comment. > > If you insist on your wording, then I think you should at least move the > > comment below to the part that only deals with strict operators. > > Done. o.k. > > > > * map_and_merge_partitions() > > > > Besides a few changes proposed in map_and_merge_partitions.diff (a few of > > them > > to suppress compiler warnings) I think that this part needs more thought: > > > > { > > Assert(mergemap1[index1] != mergemap2[index2] && > >mergemap1[index1] >= 0 && mergemap2[index2] >= 0); > > > > /* > > * Both the partitions map to different merged partitions. This > > * means that multiple partitions from one relation matches to > > one > > * partition from the other relation. Partition-wise join does > > not > > * handle this case right now, since it requires ganging > > multiple > > * partitions together (into one RelOptInfo). > > */ > > merged_index = -1; > > } > > > > I could hit this path with the following test: > > > > CREATE TABLE a(i int) PARTITION BY LIST(i); > > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2); > > CREATE TABLE b(j int) PARTITION BY LIST(j); > > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2); > > > > SET enable_partition_wise_join TO on; > > > > SELECT * > > FROMa > > FULL JOIN > > b ON i = j; > > > > I don't think there's a reason not to join a_0 partition to b_0, is there? > > With the latest patchset I am seeing that partition-wise join is used > in this case. I have started a new thread [1] for advanced partition > matching patches. What plan do you get, with the patches from https://www.postgresql.org/message-id/cafjfprfdxpusu0pxon3dkcr8wndjkaxlzhuvax_laod0tgc...@mail.gmail.com I still see the join above Append, not below: QUERY PLAN
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote wrote: > On 2017/09/05 15:43, Ashutosh Bapat wrote: >> Ok. Can you please answer my previous questions? >> >> AFAIU, the list contained RTIs of the relations, which didnt' have >> corresponding AppendRelInfos to lock those relations. Now that we >> create AppendRelInfos even for partitioned partitions with my 0001 >> patch, I don't think >> we need the list to take care of the locks. Is there any other reason >> why we maintain that list (apart from the trigger case I have raised >> and Fujita-san says that the list is not required in that case as >> well.)? > > AppendRelInfos exist within the planner (they come to be and go away > within the planner). Once we leave the planner, that information is gone. > > Executor will receive a plan node that won't contain that information: > > 1. Append has an appendplans field, which contains one plan tree for every >leaf partition. None of its fields, other than partitined_rels, >contains the RT indexes of the partitioned tables. Similarly in the >case of MergeAppend. > > 2. ModifyTable has a resultRelations fields which contains a list of leaf >partition RT indexes and a plans field which contains one plan tree for >every RT index in the resultRelations list (that is a plan tree that >will scan the particular leaf partition). None of its fields, other >than partitined_rels, contains the RT indexes of the partitioned >tables. > > I learned over the course of developing the patch that added this > partitioned_rels field [1] that the executor needs to identify all the > affected tables by a given plan tree so that it could lock them. Executor > needs to lock them separately even if the plan itself was built after > locking all the relevant tables [2]. For example, see > ExecLockNonLeafAppendTables(), which will lock the tables in the > (Merge)Append.partitioned_rels list. > > While I've been thinking all along that the same thing must be happening > for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of > times on this thread), it's actually not. Instead, > ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are > merged into PlannedStmt.nonleafResultRelations (which happens in > set_plan_refs) and that's where the executor finds them to lock them > (which happens in InitPlan). > > So, it appears that ModifyTable.partitioned_rels is indeed unused in the > executor. But we still can't get rid of it from the ModifyTable node > itself without figuring out a way (a channel) to transfer that information > into PlannedStmt.nonleafResultRelations. Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am not adding any partitioned partition other than the parent itself. But since every partitioned partition in turn acts as parent, it appears its own list. The list obtained by concatenating all such lists together contains all the partitioned partition RTIs. In my patch, I need to teach accumulate_append_subpath() to accumulate partitioned_rels as well. > >> Having asked that, I think my patch shouldn't deal with removing >> partitioned_rels lists and related structures and code. It should be> done >> as a separate patch. > > Going back to your original email which started this discussion, it seems > that we agree on that the PartitionedChildRelInfo node can be removed, and > I agree that it shouldn't be done in the partitionwise-join patch series > but as a separate patch. As described above, we shouldn't try yet to get > rid of the partitioned_rels list that appears in some plan nodes. Thanks. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/05 15:43, Ashutosh Bapat wrote: > Ok. Can you please answer my previous questions? > > AFAIU, the list contained RTIs of the relations, which didnt' have > corresponding AppendRelInfos to lock those relations. Now that we > create AppendRelInfos even for partitioned partitions with my 0001 > patch, I don't think > we need the list to take care of the locks. Is there any other reason > why we maintain that list (apart from the trigger case I have raised > and Fujita-san says that the list is not required in that case as > well.)? AppendRelInfos exist within the planner (they come to be and go away within the planner). Once we leave the planner, that information is gone. Executor will receive a plan node that won't contain that information: 1. Append has an appendplans field, which contains one plan tree for every leaf partition. None of its fields, other than partitined_rels, contains the RT indexes of the partitioned tables. Similarly in the case of MergeAppend. 2. ModifyTable has a resultRelations fields which contains a list of leaf partition RT indexes and a plans field which contains one plan tree for every RT index in the resultRelations list (that is a plan tree that will scan the particular leaf partition). None of its fields, other than partitined_rels, contains the RT indexes of the partitioned tables. I learned over the course of developing the patch that added this partitioned_rels field [1] that the executor needs to identify all the affected tables by a given plan tree so that it could lock them. Executor needs to lock them separately even if the plan itself was built after locking all the relevant tables [2]. For example, see ExecLockNonLeafAppendTables(), which will lock the tables in the (Merge)Append.partitioned_rels list. While I've been thinking all along that the same thing must be happening for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of times on this thread), it's actually not. Instead, ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are merged into PlannedStmt.nonleafResultRelations (which happens in set_plan_refs) and that's where the executor finds them to lock them (which happens in InitPlan). So, it appears that ModifyTable.partitioned_rels is indeed unused in the executor. But we still can't get rid of it from the ModifyTable node itself without figuring out a way (a channel) to transfer that information into PlannedStmt.nonleafResultRelations. > Having asked that, I think my patch shouldn't deal with removing > partitioned_rels lists and related structures and code. It should be> done > as a separate patch. Going back to your original email which started this discussion, it seems that we agree on that the PartitionedChildRelInfo node can be removed, and I agree that it shouldn't be done in the partitionwise-join patch series but as a separate patch. As described above, we shouldn't try yet to get rid of the partitioned_rels list that appears in some plan nodes. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8 [2] https://www.postgresql.org/message-id/CA%2BTgmoYiwviCDRi3Zk%2BQuXj1r7uMu9T_kDNq%2B17PCWgzrbzw8A%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote wrote: > On 2017/09/05 15:30, Ashutosh Bapat wrote: >> Those changes are already part of my updated 0001 patch. Aren't they? >> May be you should just review those and see if those are suitable for >> you? > > Yeah, I think it's going to be the same patch, functionality-wise. > > And sorry, I didn't realize you were talking about the case after applying > your patch on HEAD. > Ok. Can you please answer my previous questions? AFAIU, the list contained RTIs of the relations, which didnt' have corresponding AppendRelInfos to lock those relations. Now that we create AppendRelInfos even for partitioned partitions with my 0001 patch, I don't think we need the list to take care of the locks. Is there any other reason why we maintain that list (apart from the trigger case I have raised and Fujita-san says that the list is not required in that case as well.)? Having asked that, I think my patch shouldn't deal with removing partitioned_rels lists and related structures and code. It should be done as a separate patch. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/05 15:30, Ashutosh Bapat wrote: > Those changes are already part of my updated 0001 patch. Aren't they? > May be you should just review those and see if those are suitable for > you? Yeah, I think it's going to be the same patch, functionality-wise. And sorry, I didn't realize you were talking about the case after applying your patch on HEAD. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/05 13:20, Amit Langote wrote: On 2017/09/04 21:32, Ashutosh Bapat wrote: +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't say that partitioned_rels itself is useless, just that its particular usage in ExecInitModifyTable is. That's right. (I thought there would probably be no need to create that list if we created AppendRelInfos even for partitioned partitions.) We still need that list for planner to tell the executor that there are some RT entries the latter would need to lock before executing a given plan. Without that dedicated list, the executor cannot know at all that certain tables in the partition tree (viz. the partitioned ones) need to be locked. I mentioned the reason - (Merge)Append.subplans, ModifyTable.resultRelations does not contain respective entries corresponding to the partitioned tables, and traditionally, the executor looks at those lists to figure out the tables to lock. I think so too. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote wrote: > On 2017/09/05 13:20, Amit Langote wrote: >> The later phase can >> build that list from the AppendRelInfos that you mention we now [1] build. >> >> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 > > Looking at that commit again, AppendRelInfos are still not created for > partitioned child tables. Looking at the code in > expand_single_inheritance_child(), which exists as of 30833ba154: > > > /* > * Build an AppendRelInfo for this parent and child, unless the child is a > * partitioned table. > */ > if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh) > { > ...code that builds AppendRelInfo... > } > else > *partitioned_child_rels = lappend_int(*partitioned_child_rels, > childRTindex); > > you can see that an AppendRelInfo won't get built for partitioned child > tables. > > Also, even if the commit changed things so that the child RT entries (and > AppendRelInfos) now get built in an order determined by depth-first > traversal of the partition tree, the same original parent RT index is used > to mark all AppendRelInfos, so the expansion essentially flattens the > hierarchy. In the updated patch I will post on the "path toward faster > partition pruning" thread [1], I am planning to rejigger things so that > two things start to happen: > > 1. For partitioned child tables, build the child RT entry with inh = true >and also build an AppendRelInfos > > 2. When recursively expanding a partitioned child table in >expand_partitioned_rtentry(), pass its new RT index as the >parentRTindex to the recursive call of expand_partitioned_rtentry(), so >that the resulting AppendRelInfos reflect immediate parent-child >relationship > > With 1 in place, build_simple_rel() will build RelOptInfos even for > partitioned child tables, so that for each one, we can recursively build > an Append path. So, instead of just one Append path for the root > partitioned table, there is one for each partitioned table in the tree. > > I will be including the above described change in the partition-pruning > patch, because the other code in that patch relies on the same and I know > Ashuotsh has wanted that for a long time. :) Those changes are already part of my updated 0001 patch. Aren't they? May be you should just review those and see if those are suitable for you? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/05 13:20, Amit Langote wrote: > The later phase can > build that list from the AppendRelInfos that you mention we now [1] build. > > [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 Looking at that commit again, AppendRelInfos are still not created for partitioned child tables. Looking at the code in expand_single_inheritance_child(), which exists as of 30833ba154: /* * Build an AppendRelInfo for this parent and child, unless the child is a * partitioned table. */ if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh) { ...code that builds AppendRelInfo... } else *partitioned_child_rels = lappend_int(*partitioned_child_rels, childRTindex); you can see that an AppendRelInfo won't get built for partitioned child tables. Also, even if the commit changed things so that the child RT entries (and AppendRelInfos) now get built in an order determined by depth-first traversal of the partition tree, the same original parent RT index is used to mark all AppendRelInfos, so the expansion essentially flattens the hierarchy. In the updated patch I will post on the "path toward faster partition pruning" thread [1], I am planning to rejigger things so that two things start to happen: 1. For partitioned child tables, build the child RT entry with inh = true and also build an AppendRelInfos 2. When recursively expanding a partitioned child table in expand_partitioned_rtentry(), pass its new RT index as the parentRTindex to the recursive call of expand_partitioned_rtentry(), so that the resulting AppendRelInfos reflect immediate parent-child relationship With 1 in place, build_simple_rel() will build RelOptInfos even for partitioned child tables, so that for each one, we can recursively build an Append path. So, instead of just one Append path for the root partitioned table, there is one for each partitioned table in the tree. I will be including the above described change in the partition-pruning patch, because the other code in that patch relies on the same and I know Ashuotsh has wanted that for a long time. :) Thanks, Amit [1] https://www.postgresql.org/message-id/044e2e09-9690-7aff-1749-2d318da38a11%40lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/04 21:32, Ashutosh Bapat wrote: > On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote: >> By the way, if you want to get rid of PartitionedChildRelInfo, you can do >> that as long as you find some other way of putting together the >> partitioned_rels list to add into the ModifyTable (Append/MergeAppend) >> node created for the root partitioned table. Currently, >> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for >> expand_inherited_rtentry() to pass that information to the planner's >> path-generating code. We may be able to generate that list when actually >> creating the path using set_append_rel_pathlist() or >> inheritance_planner(), without having created a PartitionedChildRelInfo >> node beforehand. > > AFAIU, the list contained RTIs of the relations, which didnt' have > corresponding AppendRelInfos to lock those relations. Now that we > create AppendRelInfos even for partitioned partitions, I don't think > we need the list to take care of the locks. Is there any other reason > why we maintain that list (apart from the trigger case I have raised > and Fujita-san says that the list is not required in that case as > well.) We do *need* the list in ModifyTable (Append/MergeAppend) node itself. We can, however, get rid of the PartitionedChildRelInfo node that carries the partitioned child RT indexes from an earlier planning phase (expand_inherited_rtentry) to a later phase (create_{modifytable|append|merge_append}_path). The later phase can build that list from the AppendRelInfos that you mention we now [1] build. >>> Though I haven't read the patch yet, I think the above code is useless. >>> And I proposed a patch to clean it up before [1]. I'll add that patch to >>> the next commitfest. >> >> +1. > +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't say that partitioned_rels itself is useless, just that its particular usage in ExecInitModifyTable is. We still need that list for planner to tell the executor that there are some RT entries the latter would need to lock before executing a given plan. Without that dedicated list, the executor cannot know at all that certain tables in the partition tree (viz. the partitioned ones) need to be locked. I mentioned the reason - (Merge)Append.subplans, ModifyTable.resultRelations does not contain respective entries corresponding to the partitioned tables, and traditionally, the executor looks at those lists to figure out the tables to lock. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/04 21:32, Ashutosh Bapat wrote: On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote By the way, if you want to get rid of PartitionedChildRelInfo, you can do that as long as you find some other way of putting together the partitioned_rels list to add into the ModifyTable (Append/MergeAppend) node created for the root partitioned table. Currently, PartitionedChildRelInfo (and the root->pcinfo_list) is the way for expand_inherited_rtentry() to pass that information to the planner's path-generating code. We may be able to generate that list when actually creating the path using set_append_rel_pathlist() or inheritance_planner(), without having created a PartitionedChildRelInfo node beforehand. AFAIU, the list contained RTIs of the relations, which didnt' have corresponding AppendRelInfos to lock those relations. Now that we create AppendRelInfos even for partitioned partitions, I don't think we need the list to take care of the locks. I don't think so either. (Since I haven't followed discussions on this thread in detail yet, I don't understand the idea/need of creating AppendRelInfos for partitioned partitions, though.) Though I haven't read the patch yet, I think the above code is useless. And I proposed a patch to clean it up before [1]. I'll add that patch to the next commitfest. +1. +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? No. The patch just removes the partitioned_rels list from nodeModifyTable.c. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote: > On 2017/09/04 12:38, Etsuro Fujita wrote: >> On 2017/09/02 4:10, Ashutosh Bapat wrote: >>> This rebase mainly changes patch 0001, which translates partition >>> hierarchy into inheritance hierarchy creating AppendRelInfos and >>> RelOptInfos for partitioned partitions. Because of that, it's not >>> necessary to record the partitioned partitions in a >>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there >>> is the RTI of child RTE which is same as the parent RTE except inh >>> flag. I tried removing that with a series of changes but it seems that >>> following code in ExecInitModifyTable() requires it. >>> 1897 /* The root table RT index is at the head of the >>> partitioned_rels list */ >>> 1898 if (node->partitioned_rels) >>> 1899 { >>> 1900 Index root_rti; >>> 1901 Oid root_oid; >>> 1902 >>> 1903 root_rti = linitial_int(node->partitioned_rels); >>> 1904 root_oid = getrelid(root_rti, estate->es_range_table); >>> 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ >>> 1906 } >>> 1907 else >>> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; >>> >>> I don't know whether we could change this code not to use >>> PartitionedChildRelInfos::child_rels. > > For a root partitioned tables, ModifyTable.partitioned_rels comes from > PartitionedChildRelInfo.child_rels recorded for the table by > expand_inherited_rtnentry(). In fact, the latter is copied verbatim to > ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same. > The only point of keeping those RT indexes around in the ModifyTable node > is for the executor to be able to locate partitioned table RT entries and > lock them. Without them, the executor wouldn't know about those tables at > all, because there won't be subplans corresponding to partitioned tables > in the tree and hence their RT indexes won't appear in the > ModifyTable.resultRelations list. If your patch adds partitioned child > rel AppendRelInfos back for whatever reason, you should also make sure in > inheritance_planner() that their RT indexes don't end up the > resultRelations list. See this piece of code in inheritance_planner(): > > 1351 /* Build list of sub-paths */ > 1352 subpaths = lappend(subpaths, subpath); > 1353 > 1354 /* Build list of modified subroots, too */ > 1355 subroots = lappend(subroots, subroot); > 1356 > 1357 /* Build list of target-relation RT indexes */ > 1358 resultRelations = lappend_int(resultRelations, > appinfo->child_relid); > > Maybe it won't happen, because if this appinfo corresponds to a > partitioned child table, recursion would occur and we'll get to this piece > of code for only the leaf children. You are right. We don't execute above lines for partitioned partitions. > > By the way, if you want to get rid of PartitionedChildRelInfo, you can do > that as long as you find some other way of putting together the > partitioned_rels list to add into the ModifyTable (Append/MergeAppend) > node created for the root partitioned table. Currently, > PartitionedChildRelInfo (and the root->pcinfo_list) is the way for > expand_inherited_rtentry() to pass that information to the planner's > path-generating code. We may be able to generate that list when actually > creating the path using set_append_rel_pathlist() or > inheritance_planner(), without having created a PartitionedChildRelInfo > node beforehand. AFAIU, the list contained RTIs of the relations, which didnt' have corresponding AppendRelInfos to lock those relations. Now that we create AppendRelInfos even for partitioned partitions, I don't think we need the list to take care of the locks. Is there any other reason why we maintain that list (apart from the trigger case I have raised and Fujita-san says that the list is not required in that case as well.) > >> Though I haven't read the patch yet, I think the above code is useless. >> And I proposed a patch to clean it up before [1]. I'll add that patch to >> the next commitfest. > > +1. +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/04 12:38, Etsuro Fujita wrote: > On 2017/09/02 4:10, Ashutosh Bapat wrote: >> This rebase mainly changes patch 0001, which translates partition >> hierarchy into inheritance hierarchy creating AppendRelInfos and >> RelOptInfos for partitioned partitions. Because of that, it's not >> necessary to record the partitioned partitions in a >> PartitionedChildRelInfos::child_rels. The only RTI that goes in there >> is the RTI of child RTE which is same as the parent RTE except inh >> flag. I tried removing that with a series of changes but it seems that >> following code in ExecInitModifyTable() requires it. >> 1897 /* The root table RT index is at the head of the >> partitioned_rels list */ >> 1898 if (node->partitioned_rels) >> 1899 { >> 1900 Index root_rti; >> 1901 Oid root_oid; >> 1902 >> 1903 root_rti = linitial_int(node->partitioned_rels); >> 1904 root_oid = getrelid(root_rti, estate->es_range_table); >> 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ >> 1906 } >> 1907 else >> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; >> >> I don't know whether we could change this code not to use >> PartitionedChildRelInfos::child_rels. For a root partitioned tables, ModifyTable.partitioned_rels comes from PartitionedChildRelInfo.child_rels recorded for the table by expand_inherited_rtnentry(). In fact, the latter is copied verbatim to ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same. The only point of keeping those RT indexes around in the ModifyTable node is for the executor to be able to locate partitioned table RT entries and lock them. Without them, the executor wouldn't know about those tables at all, because there won't be subplans corresponding to partitioned tables in the tree and hence their RT indexes won't appear in the ModifyTable.resultRelations list. If your patch adds partitioned child rel AppendRelInfos back for whatever reason, you should also make sure in inheritance_planner() that their RT indexes don't end up the resultRelations list. See this piece of code in inheritance_planner(): 1351 /* Build list of sub-paths */ 1352 subpaths = lappend(subpaths, subpath); 1353 1354 /* Build list of modified subroots, too */ 1355 subroots = lappend(subroots, subroot); 1356 1357 /* Build list of target-relation RT indexes */ 1358 resultRelations = lappend_int(resultRelations, appinfo->child_relid); Maybe it won't happen, because if this appinfo corresponds to a partitioned child table, recursion would occur and we'll get to this piece of code for only the leaf children. By the way, if you want to get rid of PartitionedChildRelInfo, you can do that as long as you find some other way of putting together the partitioned_rels list to add into the ModifyTable (Append/MergeAppend) node created for the root partitioned table. Currently, PartitionedChildRelInfo (and the root->pcinfo_list) is the way for expand_inherited_rtentry() to pass that information to the planner's path-generating code. We may be able to generate that list when actually creating the path using set_append_rel_pathlist() or inheritance_planner(), without having created a PartitionedChildRelInfo node beforehand. > Though I haven't read the patch yet, I think the above code is useless. > And I proposed a patch to clean it up before [1]. I'll add that patch to > the next commitfest. +1. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/02 4:10, Ashutosh Bapat wrote: This rebase mainly changes patch 0001, which translates partition hierarchy into inheritance hierarchy creating AppendRelInfos and RelOptInfos for partitioned partitions. Because of that, it's not necessary to record the partitioned partitions in a PartitionedChildRelInfos::child_rels. The only RTI that goes in there is the RTI of child RTE which is same as the parent RTE except inh flag. I tried removing that with a series of changes but it seems that following code in ExecInitModifyTable() requires it. 1897 /* The root table RT index is at the head of the partitioned_rels list */ 1898 if (node->partitioned_rels) 1899 { 1900 Index root_rti; 1901 Oid root_oid; 1902 1903 root_rti = linitial_int(node->partitioned_rels); 1904 root_oid = getrelid(root_rti, estate->es_range_table); 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ 1906 } 1907 else 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; I don't know whether we could change this code not to use PartitionedChildRelInfos::child_rels. Though I haven't read the patch yet, I think the above code is useless. And I proposed a patch to clean it up before [1]. I'll add that patch to the next commitfest. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8%40lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Ashutosh Bapat wrote: > I originally thought to provide it along-with the changes to > expand_inherited_rtentry(), but that thread is taking longer. Jeevan > Chalke needs rebased patches for his work on aggregate pushdown and > Thomas might need them for further review. So, here they are. Since I have related patch in the current commitfest (https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your patch: * generate_partition_wise_join_paths() Right parenthesis is missing in the prologue. * get_partitioned_child_rels_for_join() I think the Assert() statement is easier to understand inside the loop, see the assert.diff attachment. * have_partkey_equi_join() As the function handles generic join, this comment doesn't seem to me relevant: /* * The equi-join between partition keys is strict if equi-join between * at least one partition key is using a strict operator. See * explanation about outer join reordering identity 3 in * optimizer/README */ strict_op = op_strict(opexpr->opno); And I think the function can return true even if strict_op is false for all the operators evaluated in the loop. * match_expr_to_partition_keys() I'm not sure this comment is clear enough: /* * If it's a strict equi-join a NULL partition key on one side will * not join a NULL partition key on the other side. So, rows with NULL * partition key from a partition on one side can not join with those * from a non-matching partition on the other side. So, search the * nullable partition keys as well. */ if (!strict_op) continue; My understanding of the problem of NULL values generated by outer join is: these NULL values --- if evaluated by non-strict expression --- can make row of N-th partition on one side of the join match row(s) of *other than* N-th partition(s) on the other side. Thus the nullable input expressions may only be evaluated by strict operators. I think it'd be clearer if you stressed that (undesired) *match* of partition keys can be a problem, rather than mismatch. If you insist on your wording, then I think you should at least move the comment below to the part that only deals with strict operators. * There are several places where lfirst_node() macro should be used. For example rel = lfirst_node(RelOptInfo, lc); instead of rel = (RelOptInfo *) lfirst(lc); * map_and_merge_partitions() Besides a few changes proposed in map_and_merge_partitions.diff (a few of them to suppress compiler warnings) I think that this part needs more thought: { Assert(mergemap1[index1] != mergemap2[index2] && mergemap1[index1] >= 0 && mergemap2[index2] >= 0); /* * Both the partitions map to different merged partitions. This * means that multiple partitions from one relation matches to one * partition from the other relation. Partition-wise join does not * handle this case right now, since it requires ganging multiple * partitions together (into one RelOptInfo). */ merged_index = -1; } I could hit this path with the following test: CREATE TABLE a(i int) PARTITION BY LIST(i); CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2); CREATE TABLE b(j int) PARTITION BY LIST(j); CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2); SET enable_partition_wise_join TO on; SELECT * FROMa FULL JOIN b ON i = j; I don't think there's a reason not to join a_0 partition to b_0, is there? -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c new file mode 100644 index a75b1a3..3094b56 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *** get_partitioned_child_rels_for_join(Plan *** 6160,6170 PartitionedChildRelInfo *pc = lfirst(l); if (bms_is_member(pc->parent_relid, join_relids)) result = list_concat(result, list_copy(pc->child_rels)); } - /* The root partitioned table is included as a child rel */ - Assert(list_length(result) >= bms_num_members(join_relids)); - return result; } --- 6160,6172 PartitionedChildRelInfo *pc = lfirst(l); if (bms_is_member(pc->parent_relid, join_relids)) + { + /* The root partitioned table is included as a child rel */ + Assert(list_length(pc->child_rels) >= 1); + result = list_concat(result, list_copy(pc->child_rels)); + } } return result; } diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c new file mode 100644 index eb35fab..aa9c70d *** a/src/backend/catalog/partition.c --- b/src/backend/catalog/partition.c *** partition_list_bounds_merge(int partnatt *** 3110,3116 --- 3110,3119
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat wrote: > There are two ways we can do this > 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex) > or just childRTIndex (using this we can fetch childRTE calling > rtfetch()) of intermediate partitioned tables. Once we are done > expanding immediate children, call expand_inherited_rtentry() > recursively on this list. > > 2. expand_inherited_tables() scans root->parse->rtable only upto the > end of original range table list. Make it go beyond that end, > expanding any new entries added for intermediate partitions. > > FWIW, the first option allows us to keep all AppendRelInfos > corresponding to one partitioned relation together and also expands > the whole partition hierarchy in one go. Second will require minimal > changes to expand_inherited_rtentry(). Both approaches will spend time > scanning same number of RTE; the first will have them in different > lists, and second will have them in root->parse->rtable. I don't see > one being more attractive than the other. Do you have any opinion? I don't like option (2). I'm not sure about option (1). I think maybe we should have two nested loops in expanded_inherited_rtentry(), the outer one iterating over partitioned tables (or just the original parent RTE if partitioning is not involved) and then inner one looping over individual leaf partitions for each partitioned table. Probably we'd end up wanting to move at least some of the logic inside the existing loop into a subroutine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Aug 15, 2017 at 10:15 PM, Robert Haas wrote: > On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat > wrote: >> Attached patches with the comments addressed. > > I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4 > and e139f1953f29db245f60a7acb72fccb1e05d2442. Thanks a lot Robert. Some less patches to maintain :). > > 0004 doesn't apply any more, probably due to commit > d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along > these lines could be separately committed prior to the main patch, and > I think that would be a good idea just to flush out any bugs in this > part independently of the rest. However, I also think that we > probably ought to try to get Amit Langote's changes to this function > to repair the locking order and expand in bound order committed before > proceeding with these changes. I am reviewing those changes and contribute to that thread if necessary. > > In fact, I think there's a certain amount of conflict between what's > being discussed over there and what you're trying to do here. In that > thread, we propose to move partitioned tables at any level to the > front of the inheritance expansion. Here, however, you want to expand > level by level. I think partitioned-tables-first is the right > approach for the reasons discussed on the other thread; namely, we > want to be able to prune leaf partitions before expanding them, but > that requires us to expand all the non-leaf tables first to maintain a > consistent locking order in all scenarios. So the approach you've > taken in this patch may need to be re-thought somewhat. > There are two ways we can do this 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex) or just childRTIndex (using this we can fetch childRTE calling rtfetch()) of intermediate partitioned tables. Once we are done expanding immediate children, call expand_inherited_rtentry() recursively on this list. 2. expand_inherited_tables() scans root->parse->rtable only upto the end of original range table list. Make it go beyond that end, expanding any new entries added for intermediate partitions. FWIW, the first option allows us to keep all AppendRelInfos corresponding to one partitioned relation together and also expands the whole partition hierarchy in one go. Second will require minimal changes to expand_inherited_rtentry(). Both approaches will spend time scanning same number of RTE; the first will have them in different lists, and second will have them in root->parse->rtable. I don't see one being more attractive than the other. Do you have any opinion? I will submit the rebased patches after reviewing/adjusting Amit's changes and also the changes in expand_inherited_rtentry() after we have concluded the approach to be taken. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat wrote: > Attached patches with the comments addressed. I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4 and e139f1953f29db245f60a7acb72fccb1e05d2442. 0004 doesn't apply any more, probably due to commit d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along these lines could be separately committed prior to the main patch, and I think that would be a good idea just to flush out any bugs in this part independently of the rest. However, I also think that we probably ought to try to get Amit Langote's changes to this function to repair the locking order and expand in bound order committed before proceeding with these changes. In fact, I think there's a certain amount of conflict between what's being discussed over there and what you're trying to do here. In that thread, we propose to move partitioned tables at any level to the front of the inheritance expansion. Here, however, you want to expand level by level. I think partitioned-tables-first is the right approach for the reasons discussed on the other thread; namely, we want to be able to prune leaf partitions before expanding them, but that requires us to expand all the non-leaf tables first to maintain a consistent locking order in all scenarios. So the approach you've taken in this patch may need to be re-thought somewhat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro wrote: >> Do you think we solving this problem is a prerequisite for >> partition-wise join? Or should we propose that patch as a separate >> enhancement? > > No, I'm not proposing anything yet. For now I just wanted to share > this observation about where hot CPU time goes in simple tests, and > since it turned out to be a loop in a loop that I could see an easy to > way to fix for singleton sets and sets with a small range, I couldn't > help trying it... But I'm still trying to understand the bigger > picture. I'll be interested to compare profiles with the ordered > append_rel_list version you have mentioned, to see how that moves the > hot spots. Perhaps this is stating the obvious, but it's often better to optimize things like this at a higher level, rather than by tinkering with stuff like Bitmapset. On the other hand, sometimes micro-optimizations are the way to go, because optimizing find_ec_member_for_tle(), for example, might involve a much broader rethink of the planner code than we want to undertake right now. > I guess one very practical question to ask is: can we plan queries > with realistic numbers of partitioned tables and partitions in > reasonable times? Well, it certainly looks very good for hundreds of > partitions so far... My own experience of partitioning with other > RDBMSs has been on that order, 'monthly partitions covering the past > 10 years' and similar, but on the other hand it wouldn't be surprising > to learn that people want to go to many thousands, especially for > schemas which just keep adding partitions over time and don't want to > drop them. I've been thinking that it would be good if this feature - and other new partitioning features - could scale to about 1000 partitions without too much trouble. Eventually, it might be nice to scale higher, but there's not much point in making partition-wise join scale to 100,000 partitions if we've got some other part of the system that runs into trouble beyond 250. > Curious: would you consider joins between partitioned tables and > non-partitioned tables where the join is pushed down to be a kind of > "partition-wise join", or something else? If so, would that be a > special case, or just the logical extreme case for > 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where > one single "partition" on the non-partitioned side maps to all the > partitions on the partitioned size? I think this is actually a really important case which we've just excluded from the initial scope because the problem is hard enough already. But it's quite possible that if you are joining partitioned tables A and B with unpartitioned table X, the right join order could be A-X-B; the A-X join might knock out a lot of rows. It's not great to have to pick between doing the A-B join partitionwise and doing the A-X join first; you want to get both things. But we can't do everything at once. Further down the road, there's more than one way of doing the A-X join. You could join each partition of A to all of X, which is likely optimal if for example you are doing a nested loop with an inner index scan on X. But you could also partition X on the fly using A's partitioning scheme and then join partitions of A against the on-the-fly-partitioned version of X. That's likely to be a lot better for a merge join with an underlying sort on X. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro wrote: > On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat > wrote: >> Your patch didn't improve planning time without partition-wise join, >> so it's something good to have along-with partition-wise join. Given >> that Bitmapsets are used in other parts of code as well, the >> optimization may affect those parts as well, esp. the overhead of >> maintaining first_non_empty_wordnum. > > Maybe, but if you consider that this container already deals with the > upper bound moving up by reallocating and copying the whole thing, > adjusting an int when the lower bound moves down doesn't seem like > anything to worry about... Yeah. May be we should check whether that makes any difference to planning times of TPC-H queries. If it shows any difference. > >> Do you think we solving this problem is a prerequisite for >> partition-wise join? Or should we propose that patch as a separate >> enhancement? > > No, I'm not proposing anything yet. For now I just wanted to share > this observation about where hot CPU time goes in simple tests, and > since it turned out to be a loop in a loop that I could see an easy to > way to fix for singleton sets and sets with a small range, I couldn't > help trying it... But I'm still trying to understand the bigger > picture. I'll be interested to compare profiles with the ordered > append_rel_list version you have mentioned, to see how that moves the > hot spots. build_simple_rel() which contains that loop takes only .23% of planning time. So, I doubt if that's going to change much. + 0.23% postgres postgres[.] build_simple_rel ▒ > > I guess one very practical question to ask is: can we plan queries > with realistic numbers of partitioned tables and partitions in > reasonable times? Well, it certainly looks very good for hundreds of > partitions so far... My own experience of partitioning with other > RDBMSs has been on that order, 'monthly partitions covering the past > 10 years' and similar, but on the other hand it wouldn't be surprising > to learn that people want to go to many thousands, especially for > schemas which just keep adding partitions over time and don't want to > drop them. As for hash partitioning, that seems to be typically done > with numbers like 16, 32 or 64 in other products from what I can > glean. Speculation: perhaps hash partitioning is more motivated by > parallelism than data maintenance and thus somehow anchored to the > ground by core counts; if so no planning performance worries there I > guess (until core counts double quite a few more times). Agreed. > > One nice thing about the planning time is that restrictions on the > partition key cut down planning time; so where I measure ~7 seconds to > plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions, > if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so > if someone has a keep-adding-more-partitions-over-time model then at > least their prunable current day/week/whatever queries will not suffer > quite so badly. (Yeah my computer seems to be a lot slower than yours > for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @ > 2.2Ghz). That's interesting observation. Thanks for sharing it. > > Curious: would you consider joins between partitioned tables and > non-partitioned tables where the join is pushed down to be a kind of > "partition-wise join", or something else? If so, would that be a > special case, or just the logical extreme case for > 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where > one single "partition" on the non-partitioned side maps to all the > partitions on the partitioned size? > Parameterized nest loop joins with partition key as parameters simulate something like that. Apart from that case, I don't see any case where such a join would be more efficient compared to the current method of ganging all partitions and joining them to the unpartitioned table. But oh wait, that could be useful in sharding, when the unpartitioned table is replicated and partitioned table is distributed across shards. So, yes, that's a useful case. I am not sure whether it's some kind of partition-wise join; it doesn't matter, it looks useful. Said that, I am not planning to handle it in the near future. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers