Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
> 
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance.  Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
> 
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it.  The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep the locking
order consistent with other sites that may want to lock tables in the
same partition tree, possibly with a conflicting lock mode.  If we remove
the find_all_inheritors call in ExecSetupPartitionTupleRouting (like your
0002 does), we may end up locking partitions in arbitrary order within a
given transaction, because input tuples will be routed to the various
partitions in an order that's not predetermined.

But maybe it's not necessary to be that paranoid.  If we've got a lock on
the parent, any concurrent lockers would have to wait for that lock
anyway, so it shouldn't matter in which order tuple routing locks the
partitions.

> The locking is not the only slow thing. I found the following to also be slow:
> 
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
> 
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
> 
> I've written fixes for items 1-6 above.
> 
> I did:
> 
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list all seem legitimate to me, as do your proposed fixes
for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains a mapping from local partition indexes (0
to partdesc->nparts - 1) to those that span *all* leaf partitions and
*all* partitioned tables (0 to proute->num_partitions - 1 or
proute->num_dispatch - 1) in the partition tree, but we can change that.

The idea I had was inspired by looking at the partitions_init stuff in
your patch.  We could allocate the proute->partition_dispatch_info and
proute->partitions arrays with a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over the whole partition tree.  Instead, we create
just one PartitionDispatch object, for the root parent table, whose
indexes array is initialized with -1, meaning none of the partitions has
been encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch (if partitioned) for it and store it in the next free
slot of proute->partitions or proute->partition_dispatch_info,
respectively.  Also, we update the indexes array in the parent's
PartitionDispatch to replace the -1 with that slot's index, so that
future tuples routed to the same partition don't allocate it again.  The
process is repeated if the tuple needs to be routed one more level down.
If the query needs more ResultRelInfos and/or PartitionDispatch objects
than we initially allocated space for, we expand those arrays.  Finally,
during ExecCleanupTupleRouting, we only "clean up" the partitions for
which we allocated ResultRelInfos and PartitionDispatch objects, which is
very similar to the partitions_init idea in your patch.
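
To illustrate the scheme, here is a minimal standalone sketch in C.  It
is not code from the patch; the identifiers (LazyRouting,
lookup_or_create, make_partition_object) are made up for illustration:

#include <stdlib.h>

typedef struct LazyRouting
{
    void  **partitions;        /* lazily filled; grows by doubling */
    int     num_partitions;    /* slots used so far */
    int     allocsize;         /* slots allocated so far */
    int    *indexes;           /* per-partition map; -1 = not seen yet */
} LazyRouting;

static void *
make_partition_object(int cur_index)
{
    /* stand-in for building a ResultRelInfo or PartitionDispatch */
    return calloc(1, sizeof(int));
}

/*
 * Return the routing slot for local partition 'cur_index', creating the
 * partition's object only on first use.
 */
static int
lookup_or_create(LazyRouting *pr, int cur_index)
{
    if (pr->indexes[cur_index] >= 0)
        return pr->indexes[cur_index];    /* seen before, just reuse */

    if (pr->num_partitions == pr->allocsize)
    {
        /* ran out of space, double the array */
        pr->allocsize *= 2;
        pr->partitions = realloc(pr->partitions,
                                 pr->allocsize * sizeof(void *));
    }
    pr->indexes[cur_index] = pr->num_partitions;
    pr->partitions[pr->num_partitions] = make_partition_object(cur_index);
    return pr->num_partitions++;
}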

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, though I'd say it's too big to be called just a delta.
I was able to get the performance numbers below using the following
pgbench test:

pgbench -n -T 180 -f insert-ht.sql
cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);
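
The post doesn't include the table setup itself, so the DDL below is a
hypothetical reconstruction matching the "2500 hash parts (4 hash
subparts each)" case tested below; for the "no subpart" case, each ht%s
would instead be created as a plain leaf partition:

create table ht (b int, a int) partition by hash (b);
do $$
begin
  for i in 0..2499 loop
    execute format('create table ht%s partition of ht
                    for values with (modulus 2500, remainder %s)
                    partition by hash (a)', i, i);
    for j in 0..3 loop
      execute format('create table ht%s_%s partition of ht%s
                      for values with (modulus 4, remainder %s)',
                     i, j, i, j);
    end loop;
  end loop;
end $$;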

Note that pgbench is run 3 times and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch,
respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 329.887008 (including connections establishing)
tps = 327.428103 (including connections establishing)
tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time hog that is
the find_all_inheritors() call in ExecSetupPartitionTupleRouting (which
locks all partitions), it seems that without my patch we end up spending
most of the time in RelationGetPartitionDispatchInfo(), because it calls
get_partition_dispatch_recurse() for each of the 2500 first-level
partitions, which are themselves partitioned.  With my patch, we won't
do that and won't end up generating 2499 PartitionDispatch objects that
are not needed for a single-row insert statement.

Thanks,
Amit
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 25bec76c1d..44cf3bba12 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
                         * will get us the ResultRelInfo and TupleConversionMap for the
                         * partition, respectively.
                         */
-                       leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                                               proute->partition_dispatch_info,
-                                                                                               slot,
-                                                                                               estate);
+                       leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+                                                                                               proute, slot, estate);
                        Assert(leaf_part_index >= 0 &&
                                   leaf_part_index < proute->num_partitions);
 
@@ -2644,10 +2642,8 @@ CopyFrom(CopyState cstate)
                         * to the selected partition.
                         */
                        saved_resultRelInfo = resultRelInfo;
-                       resultRelInfo = ExecGetPartitionInfo(mtstate,
-                                                                                                saved_resultRelInfo,
-                                                                                                proute, estate,
-                                                                                                leaf_part_index);
+                       Assert(proute->partitions[leaf_part_index] != NULL);
+                       resultRelInfo = proute->partitions[leaf_part_index];
 
                        /*
                         * For ExecInsertIndexTuples() to work on the partition's indexes
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a3a67dd0d..250c2cd53e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,17 +31,19 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
-static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+#define PARTITION_ROUTING_INITSIZE     8
+#define PARTITION_ROUTING_MAXSIZE              65536
+
+static void ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+                                                                PartitionTupleRouting *proute,
+                                                                PartitionDispatch pd);
+static void ExecInitPartitionInfo(ModifyTableState *mtstate,
                                          ResultRelInfo *resultRelInfo,
                                          PartitionTupleRouting *proute,
-                                         EState *estate, int partidx);
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int *num_parted, Oid **leaf_part_oids,
-                                                                int *n_leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-                                                          List **pds, Oid **leaf_part_oids,
-                                                          int *n_leaf_part_oids,
-                                                          int *leaf_part_oid_size);
+                                         EState *estate, Oid partoid,
+                                         int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+                                               Oid partoid, Relation parent, int dispatchidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
                                          TupleTableSlot *slot,
                                          EState *estate,
@@ -68,127 +70,58 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for UPDATE
- * tuple routing, the caller will have already initialized ResultRelInfo's for
- * each partition present in the ModifyTable's subplans. These are reused and
- * assigned to their respective slot in the aforementioned array.  For such
- * partitions, we delay setting up objects such as TupleConversionMap until
- * those are actually chosen as the partitions to route tuples to.  See
- * ExecPrepareTupleRouting.
+ * This is called during the initialization of a COPY FROM command or of an
+ * INSERT/UPDATE query.  We provisionally allocate space to hold
+ * PARTITION_ROUTING_INITSIZE number of PartitionDispatch and ResultRelInfo
+ * pointers in their respective arrays.  The arrays will be doubled in
+ * size via repalloc (subject to the limit of PARTITION_ROUTING_MAXSIZE
+ * entries at most) if and when we run out of space, as more partitions need
+ * to be added.  Since we already have the root parent open, its
+ * PartitionDispatch is created here.
+ *
+ * PartitionDispatch object of a non-root partitioned table or ResultRelInfo
+ * of a leaf partition is allocated and added to the respective array when
+ * it is encountered for the first time in ExecFindPartition.  As mentioned
+ * above, we might need to expand the respective array before storing it.
+ *
+ * Tuple conversion maps (either child to parent and/or vice versa) and the
+ * array(s) to hold them are allocated only if needed.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-       int                     i;
        PartitionTupleRouting *proute;
-       int                     nparts;
        ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-       /*
-        * Get the information about the partition tree after locking all the
-        * partitions.
-        */
+       /* Lock all the partitions. */
        (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-       proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
-       proute->partition_dispatch_info =
-               RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-                                                                                &proute->partition_oids, &nparts);
 
-       proute->num_partitions = nparts;
-       proute->partitions =
-               (ResultRelInfo **) palloc0(nparts * sizeof(ResultRelInfo *));
+       proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+       proute->partition_root = rel;
+       proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+       proute->partition_dispatch_info = (PartitionDispatchData **)
+                       palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+
+       /* Initialize this table's PartitionDispatch object. */
+       (void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+                                                                                0);
+       proute->num_dispatch = 1;
+       proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+       proute->partitions = (ResultRelInfo **)
+                       palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+       proute->num_partitions = 0;
 
        /*
-        * Allocate an array to store ResultRelInfos that we'll later allocate.
-        * It is common that not all partitions will have tuples routed to them,
-        * so we'll refrain from allocating enough space for all partitions here.
-        * Let's just start with something small and make it bigger only when
-        * needed.  Storing these separately rather than relying on the
-        * 'partitions' array allows us to quickly identify which ResultRelInfos we
-        * must teardown at the end.
+        * Check if we can use ResultRelInfos set up by ExecInitModifyTable as
+        * target result rels of an UPDATE as also the target result rels of tuple
+        * routing.  Note that we consider for now only the root parent's own leaf
+        * partitions, that is, leaf partitions of level 1 and none of the leaf
+        * partitions of the levels below.
         */
-       proute->partitions_init_size = Min(nparts, 8);
-
-       proute->partitions_init = (ResultRelInfo **)
-               palloc(proute->partitions_init_size * sizeof(ResultRelInfo *));
-
-       proute->num_partitions_init = 0;
-
-       /* We only allocate this when we need to store the first non-NULL map */
-       proute->parent_child_tupconv_maps = NULL;
-
-       proute->child_parent_tupconv_maps = NULL;
-
-
-       /*
-        * Initialize an empty slot that will be used to manipulate tuples of any
-        * given partition's rowtype.  It is attached to the caller-specified node
-        * (such as ModifyTableState) and released when the node finishes
-        * processing.
-        */
-       proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
-
-       /* Set up details specific to the type of tuple routing we are doing. */
        if (node && node->operation == CMD_UPDATE)
-       {
-               ResultRelInfo *update_rri = NULL;
-               int                     num_update_rri = 0,
-                                       update_rri_index = 0;
-
-               update_rri = mtstate->resultRelInfo;
-               num_update_rri = list_length(node->plans);
-               proute->subplan_partition_offsets =
-                       palloc(num_update_rri * sizeof(int));
-               proute->num_subplan_partition_offsets = num_update_rri;
-
-               proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-
-               for (i = 0; i < nparts; i++)
-               {
-                       Oid                     leaf_oid = proute->partition_oids[i];
-
-                       /*
-                        * If the leaf partition is already present in the per-subplan
-                        * result rels, we re-use that rather than initialize a new result
-                        * rel. The per-subplan resultrels and the resultrels of the leaf
-                        * partitions are both in the same canonical order. So while going
-                        * through the leaf partition oids, we need to keep track of the
-                        * next per-subplan result rel to be looked for in the leaf
-                        * partition resultrels.
-                        */
-                       if (update_rri_index < num_update_rri &&
-                               RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-                       {
-                               ResultRelInfo *leaf_part_rri;
-
-                               leaf_part_rri = &update_rri[update_rri_index];
-
-                               /*
-                                * This is required in order to convert the partition's tuple
-                                * to be compatible with the root partitioned table's tuple
-                                * descriptor.  When generating the per-subplan result rels,
-                                * this was not set.
-                                */
-                               leaf_part_rri->ri_PartitionRoot = rel;
-
-                               /* Remember the subplan offset for this ResultRelInfo */
-                               proute->subplan_partition_offsets[update_rri_index] = i;
-
-                               update_rri_index++;
-
-                               proute->partitions[i] = leaf_part_rri;
-                       }
-               }
-
-               /*
-                * We should have found all the per-subplan resultrels in the leaf
-                * partitions.
-                */
-               Assert(update_rri_index == num_update_rri);
-       }
+               ExecUseUpdateResultRelForRouting(mtstate,
+                                                                                proute,
+                                                                                proute->partition_dispatch_info[0]);
        else
        {
                proute->root_tuple_slot = NULL;
@@ -196,26 +129,38 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
                proute->num_subplan_partition_offsets = 0;
        }
 
+       /* We only allocate this when we need to store the first non-NULL map */
+       proute->parent_child_tupconv_maps = NULL;
+       proute->child_parent_tupconv_maps = NULL;
+
+       /*
+        * Initialize an empty slot that will be used to manipulate tuples of any
+        * given partition's rowtype.
+        */
+       proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+
        return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+                                 ResultRelInfo *resultRelInfo,
+                                 PartitionTupleRouting *proute,
                                  TupleTableSlot *slot, EState *estate)
 {
-       int                     result;
+       ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+       PartitionDispatch *pd = proute->partition_dispatch_info;
+       int                     result = -1;
        Datum           values[PARTITION_MAX_KEYS];
        bool            isnull[PARTITION_MAX_KEYS];
        Relation        rel;
@@ -272,10 +217,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
                 * partitions to begin with.
                 */
                if (partdesc->nparts == 0)
-               {
-                       result = -1;
                        break;
-               }
 
                cur_index = get_partition_for_tuple(rel, values, isnull);
 
@@ -285,17 +227,71 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
                 * next parent to find a partition of.
                 */
                if (cur_index < 0)
-               {
-                       result = -1;
                        break;
-               }
-               else if (parent->indexes[cur_index] >= 0)
+
+               if (partdesc->is_leaf[cur_index])
                {
-                       result = parent->indexes[cur_index];
+                       /* Get the ResultRelInfo of this leaf partition. */
+                       if (parent->indexes[cur_index] >= 0)
+                       {
+                               /*
+                                * Already assigned (either created fresh or reused from the
+                                * set of UPDATE result rels.)
+                                */
+                               Assert(parent->indexes[cur_index] < proute->num_partitions);
+                               result = parent->indexes[cur_index];
+                       }
+                       else if (node && node->operation == CMD_UPDATE &&
+                                        !parent->scanned_update_result_rels)
+                       {
+                               /* Try to assign UPDATE result rels for tuple routing. */
+                               ExecUseUpdateResultRelForRouting(mtstate, proute, parent);
+
+                               /* Check if we really found one. */
+                               if (parent->indexes[cur_index] >= 0)
+                               {
+                                       Assert(parent->indexes[cur_index] < proute->num_partitions);
+                                       result = parent->indexes[cur_index];
+                               }
+                       }
+
+                       /* We need to create one afresh. */
+                       if (result < 0)
+                       {
+                               result = proute->num_partitions++;
+                               parent->indexes[cur_index] = result;
+                               if (parent->indexes[cur_index] >= PARTITION_ROUTING_MAXSIZE)
+                                       elog(ERROR, "invalid partition index: %u",
+                                                parent->indexes[cur_index]);
+                               ExecInitPartitionInfo(mtstate, resultRelInfo,
+                                                                         proute, estate,
+                                                                         partdesc->oids[cur_index], result);
+                       }
                        break;
                }
                else
-                       parent = pd[-parent->indexes[cur_index]];
+               {
+                       /* Get the PartitionDispatch of this parent. */
+                       if (parent->indexes[cur_index] >= 0)
+                       {
+                               /* Already allocated. */
+                               Assert(parent->indexes[cur_index] < proute->num_dispatch);
+                               parent = pd[parent->indexes[cur_index]];
+                       }
+                       else
+                       {
+                               /* Not yet, allocate one. */
+                               parent->indexes[cur_index] = proute->num_dispatch++;
+                               if (parent->indexes[cur_index] >= PARTITION_ROUTING_MAXSIZE)
+                                       elog(ERROR, "invalid partition index: %u",
+                                                parent->indexes[cur_index]);
+                               parent =
+                                       ExecInitPartitionDispatchInfo(proute,
+                                                                                                 partdesc->oids[cur_index],
+                                                                                                 rel,
+                                                                                                 parent->indexes[cur_index]);
+                       }
+               }
        }
 
        /* A partition was not found. */
@@ -318,64 +314,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
- * ExecGetPartitionInfo
- *             Fetch ResultRelInfo for partidx
- *
- * Sets up ResultRelInfo, if not done already.
+ * ExecUseUpdateResultRelForRouting
+ *             Checks if any of the ResultRelInfo's created by ExecInitModifyTable
+ *             as target result rels for an UPDATE belong to a given parent table's
+ *             partitions, and if so, stores their pointers in proute so that they
+ *             can be used hereon as targets of tuple routing
  */
-ResultRelInfo *
-ExecGetPartitionInfo(ModifyTableState *mtstate,
-                                        ResultRelInfo *resultRelInfo,
-                                        PartitionTupleRouting *proute,
-                                        EState *estate, int partidx)
+static void
+ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+                                                                PartitionTupleRouting *proute,
+                                                                PartitionDispatch pd)
 {
-       ResultRelInfo *result = proute->partitions[partidx];
+       ModifyTable        *node = (ModifyTable *) mtstate->ps.plan;
+       Relation                rootrel  = proute->partition_root;
+       PartitionDesc   partdesc = RelationGetPartitionDesc(pd->reldesc);
+       ResultRelInfo  *update_rri = NULL;
+       int                             num_update_rri = 0,
+                                       my_part_index = 0;
+       int                             i;
 
-       if (result)
-               return result;
+       /* We should be here only once for a given parent table. */
+       Assert(!pd->scanned_update_result_rels);
 
-       result = ExecInitPartitionInfo(mtstate,
-                                                                  resultRelInfo,
-                                                                  proute,
-                                                                  estate,
-                                                                  partidx);
-       Assert(result);
+       update_rri = mtstate->resultRelInfo;
+       num_update_rri = list_length(node->plans);
 
-       proute->partitions[partidx] = result;
-
-       /*
-        * Record the ones setup so far in setup order.  This makes the cleanup
-        * operation more efficient when very few have been setup.
-        */
-       if (proute->num_partitions_init == proute->partitions_init_size)
+       /* If here for the first time, initialize necessary data structures. */
+       if (proute->subplan_partition_offsets == NULL)
        {
-               /* First allocate more space if the array is not large enough */
-               proute->partitions_init_size =
-                       Min(proute->partitions_init_size * 2, proute->num_partitions);
-
-               proute->partitions_init = (ResultRelInfo **)
-                               repalloc(proute->partitions_init,
-                               proute->partitions_init_size * sizeof(ResultRelInfo *));
+               proute->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+               memset(proute->subplan_partition_offsets, -1,
+                          num_update_rri * sizeof(int));
+               proute->num_subplan_partition_offsets = num_update_rri;
+               proute->root_tuple_slot = MakeTupleTableSlot(NULL);
        }
 
-       proute->partitions_init[proute->num_partitions_init++] = result;
+       /*
+        * Go through UPDATE result rels and note down those that belong to
+        * this table's partitions.
+        */
+       for (i = 0; i < num_update_rri; i++)
+       {
+               Relation        update_rel = update_rri[i].ri_RelationDesc;
+               Oid                     leaf_oid = partdesc->oids[my_part_index];
 
-       Assert(proute->num_partitions_init <= proute->num_partitions);
+               /*
+                * Skip UPDATE result rels that correspond to leaf partitions of lower
+                * levels.  They will be acquired via PartitionDispatch of their own
+                * parents, if needed.
+                */
+               while (RelationGetRelid(update_rel) != leaf_oid &&
+                          my_part_index < partdesc->nparts)
+                       leaf_oid = partdesc->oids[++my_part_index];
 
-       return result;
+               if (RelationGetRelid(update_rel) == leaf_oid)
+               {
+                       ResultRelInfo *leaf_part_rri;
+
+                       leaf_part_rri = &update_rri[i];
+
+                       /*
+                        * This is required in order to convert the partition's tuple
+                        * to be compatible with the root partitioned table's tuple
+                        * descriptor.  When generating the per-subplan result rels,
+                        * this was not set.
+                        */
+                       leaf_part_rri->ri_PartitionRoot = rootrel;
+
+                       /*
+                        * Remember the index of this UPDATE result rel in the tuple
+                        * routing partition array.
+                        */
+                       proute->subplan_partition_offsets[i] = proute->num_partitions;
+
+                       /*
+                        * Also, record in PartitionDispatch that we have a valid
+                        * ResultRelInfo for this partition.
+                        */
+
+                       Assert(pd->indexes[my_part_index] == -1);
+                       pd->indexes[my_part_index] = proute->num_partitions;
+                       proute->partitions[proute->num_partitions++] = leaf_part_rri;
+                       my_part_index++;
+               }
+
+               if (my_part_index >= partdesc->nparts)
+                       break;
+       }
+
+       /*
+        * Set that we have checked and reused all UPDATE result rels that we
+        * found for this parent.
+        */
+       pd->scanned_update_result_rels = true;
 }
 
 /*
  * ExecInitPartitionInfo
  *             Initialize ResultRelInfo and other information for a partition
  *
- * Returns the ResultRelInfo
+ * This also stores it in the proute->partitions array at the specified index
+ * ('partidx'), possibly expanding the array if there isn't space left in it.
  */
-static ResultRelInfo *
+static void
 ExecInitPartitionInfo(ModifyTableState *mtstate,
                                          ResultRelInfo *resultRelInfo,
                                          PartitionTupleRouting *proute,
-                                         EState *estate, int partidx)
+                                         EState *estate, Oid partoid,
+                                         int partidx)
 {
        ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
        Relation        rootrel = resultRelInfo->ri_RelationDesc,
@@ -390,7 +436,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
         * We locked all the partitions in ExecSetupPartitionTupleRouting
         * including the leaf partitions.
         */
-       partrel = heap_open(proute->partition_oids[partidx], NoLock);
+       partrel = heap_open(partoid, NoLock);
 
        /*
         * Keep ResultRelInfo and other information for this partition in the
@@ -729,12 +775,20 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
                }
        }
 
-       Assert(proute->partitions[partidx] == NULL);
+       if (partidx >= proute->partitions_allocsize)
+       {
+               /* Expand allocated space. */
+               proute->partitions_allocsize =
+                       Min(proute->partitions_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+               proute->partitions = (ResultRelInfo **)
+                       repalloc(proute->partitions,
+                                        sizeof(ResultRelInfo *) * proute->partitions_allocsize);
+       }
+
+       /* Save here for later use. */
        proute->partitions[partidx] = leaf_part_rri;
 
        MemoryContextSwitchTo(oldContext);
-
-       return leaf_part_rri;
 }
 
 /*
@@ -766,10 +820,26 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
        if (map)
        {
+               int             new_size;
+
                /* Allocate parent child map array only if we need to store a map */
-               if (!proute->parent_child_tupconv_maps)
+               if (proute->parent_child_tupconv_maps == NULL)
+               {
+                       proute->parent_child_tupconv_maps_allocsize = new_size =
+                               PARTITION_ROUTING_INITSIZE;
                        proute->parent_child_tupconv_maps = (TupleConversionMap 
**)
-                               palloc0(proute->num_partitions * 
sizeof(TupleConversionMap *));
+                               palloc0(sizeof(TupleConversionMap *) * 
new_size);
+               }
+               /* We may have ran out of the initially allocated space. */
+               else if (partidx >= proute->parent_child_tupconv_maps_allocsize)
+               {
+                       proute->parent_child_tupconv_maps_allocsize = new_size =
+                               Min(proute->parent_child_tupconv_maps_allocsize * 2,
+                                       PARTITION_ROUTING_MAXSIZE);
+                       proute->parent_child_tupconv_maps = (TupleConversionMap **)
+                               repalloc(proute->parent_child_tupconv_maps,
+                                                sizeof(TupleConversionMap *) * new_size);
+               }
 
                proute->parent_child_tupconv_maps[partidx] = map;
        }
@@ -788,6 +858,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
+ * ExecInitPartitionDispatchInfo
+ *             Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+                                                         Oid partoid,
+                                                         Relation parent,
+                                                         int dispatchidx)
+{
+       Relation        rel;
+       TupleDesc       tupdesc;
+       PartitionDesc partdesc;
+       PartitionKey partkey;
+       PartitionDispatch pd;
+
+       if (partoid != RelationGetRelid(proute->partition_root))
+               rel = heap_open(partoid, NoLock);
+       else
+               rel = proute->partition_root;
+       tupdesc = RelationGetDescr(rel);
+       partdesc = RelationGetPartitionDesc(rel);
+       partkey = RelationGetPartitionKey(rel);
+
+       pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+       pd->reldesc = rel;
+       pd->key = partkey;
+       pd->keystate = NIL;
+       pd->partdesc = partdesc;
+       if (parent != NULL)
+       {
+               /*
+                * For every partitioned table other than the root, we must store a
+                * tuple table slot initialized with its tuple descriptor and a tuple
+                * conversion map to convert a tuple from its parent's rowtype to its
+                * own. That is to make sure that we are looking at the correct row
+                * using the correct tuple descriptor when computing its partition key
+                * for tuple routing.
+                */
+               pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+               pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+                                                                                       tupdesc,
+                                                                                       gettext_noop("could not convert row type"));
+       }
+       else
+       {
+               /* Not required for the root partitioned table */
+               pd->tupslot = NULL;
+               pd->tupmap = NULL;
+       }
+
+       pd->indexes = (int *) palloc(sizeof(int) * partdesc->nparts);
+
+       /*
+        * Initialize with -1 to signify that the corresponding partition's
+        * ResultRelInfo or PartitionDispatch has not been created yet.
+        */
+       memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+       pd->scanned_update_result_rels = false;
+
+       if (dispatchidx >= proute->dispatch_allocsize)
+       {
+               /* Expand allocated space. */
+               proute->dispatch_allocsize =
+                       Min(proute->dispatch_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+               proute->partition_dispatch_info = (PartitionDispatchData **)
+                       repalloc(proute->partition_dispatch_info,
+                                        sizeof(PartitionDispatchData *) *
+                                        proute->dispatch_allocsize);
+       }
+
+       /* Save here for later use. */
+       proute->partition_dispatch_info[dispatchidx] = pd;
+
+       return pd;
+}
+
+/*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
  *
@@ -805,13 +957,14 @@ ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
         * These array elements get filled up with maps on an on-demand basis.
         * Initially just set all of them to NULL.
         */
+       proute->child_parent_tupconv_maps_allocsize = PARTITION_ROUTING_INITSIZE;
        proute->child_parent_tupconv_maps =
                (TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-                                                                               proute->num_partitions);
+                                                                               PARTITION_ROUTING_INITSIZE);
 
        /* Same is the case for this array. All the values are set to false */
        proute->child_parent_map_not_required =
-               (bool *) palloc0(sizeof(bool) * proute->num_partitions);
+               (bool *) palloc0(sizeof(bool) * PARTITION_ROUTING_INITSIZE);
 }
 
 /*
@@ -826,8 +979,9 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
        TupleConversionMap **map;
        TupleDesc       tupdesc;
 
-       /* Don't call this if we're not supposed to be using this type of map. */
-       Assert(proute->child_parent_tupconv_maps != NULL);
+       /* If nobody else set up the per-leaf maps array, do so ourselves. */
+       if (proute->child_parent_tupconv_maps == NULL)
+               ExecSetupChildParentMapForLeaf(proute);
 
        /* If it's already known that we don't need a map, return NULL. */
        if (proute->child_parent_map_not_required[leaf_index])
@@ -846,6 +1000,30 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
                                                           gettext_noop("could not convert row type"));
 
        /* If it turns out no map is needed, remember for next time. */
+
+       /* We may have run out of the initially allocated space. */
+       if (leaf_index >= proute->child_parent_tupconv_maps_allocsize)
+       {
+               int             new_size,
+                               old_size;
+
+               old_size = proute->child_parent_tupconv_maps_allocsize;
+               proute->child_parent_tupconv_maps_allocsize = new_size =
+                       Min(proute->child_parent_tupconv_maps_allocsize * 2,
+                               PARTITION_ROUTING_MAXSIZE);
+               proute->child_parent_tupconv_maps = (TupleConversionMap **)
+                       repalloc(proute->child_parent_tupconv_maps,
+                                        sizeof(TupleConversionMap *) * new_size);
+               memset(proute->child_parent_tupconv_maps + old_size, 0,
+                          sizeof(TupleConversionMap *) * (new_size - old_size));
+
+               proute->child_parent_map_not_required = (bool *)
+                       repalloc(proute->child_parent_map_not_required,
+                                        sizeof(bool) * new_size);
+               memset(proute->child_parent_map_not_required + old_size, false,
+                          sizeof(bool) * (new_size - old_size));
+       }
+
        proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
        return *map;
@@ -909,9 +1087,9 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
                ExecDropSingleTupleTableSlot(pd->tupslot);
        }
 
-       for (i = 0; i < proute->num_partitions_init; i++)
+       for (i = 0; i < proute->num_partitions; i++)
        {
-               ResultRelInfo *resultRelInfo = proute->partitions_init[i];
+               ResultRelInfo *resultRelInfo = proute->partitions[i];
 
                /* Allow any FDWs to shut down if they've been exercised */
                if (resultRelInfo->ri_PartitionReadyForRouting &&
@@ -920,6 +1098,28 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
                        resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
                                                                                                                   resultRelInfo);
 
+               /*
+                * Check if this result rel is one of the UPDATE subplan result rels,
+                * and if so, let ExecEndPlan() close it.
+                */
+               if (proute->subplan_partition_offsets)
+               {
+                       int             j;
+                       int             found = false;
+
+                       for (j = 0; j < proute->num_subplan_partition_offsets; j++)
+                       {
+                               if (proute->subplan_partition_offsets[j] == i)
+                               {
+                                       found = true;
+                                       break;
+                               }
+                       }
+
+                       if (found)
+                               continue;
+               }
+
                ExecCloseIndices(resultRelInfo);
                heap_close(resultRelInfo->ri_RelationDesc, NoLock);
        }
@@ -931,211 +1131,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
                ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *             Returns an array of PartitionDispatch as is required for routing
- *             tuples to the correct partition.
- *
- * 'num_parted' is set to the size of the returned array and the
- * 'leaf_part_oids' array is allocated and populated with each leaf partition
- * Oid in the hierarchy. 'n_leaf_part_oids' is set to the size of that array.
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int *num_parted, Oid **leaf_part_oids,
-                                                                int *n_leaf_part_oids)
-{
-       List       *pdlist = NIL;
-       PartitionDispatchData **pd;
-       ListCell   *lc;
-       int                     i;
-       int                     leaf_part_oid_size;
-
-       Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-       *num_parted = 0;
-       *n_leaf_part_oids = 0;
-
-       leaf_part_oid_size = 0;
-       *leaf_part_oids = NULL;
-
-       get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids,
-                                                                  n_leaf_part_oids, &leaf_part_oid_size);
-       *num_parted = list_length(pdlist);
-       pd = (PartitionDispatchData **) palloc(*num_parted *
-                                                                                  sizeof(PartitionDispatchData *));
-       i = 0;
-       foreach(lc, pdlist)
-       {
-               pd[i++] = lfirst(lc);
-       }
-
-       return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *             Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we populate
- * '*pds' with PartitionDispatch objects of each partitioned table we find,
- * and populate leaf_part_oids with each leaf partition OID found.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- *
- * Note: Callers must not attempt to pfree the 'leaf_part_oids' array.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-                                                          List **pds, Oid **leaf_part_oids,
-                                                          int *n_leaf_part_oids,
-                                                          int *leaf_part_oid_size)
-{
-       TupleDesc       tupdesc = RelationGetDescr(rel);
-       PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-       PartitionKey partkey = RelationGetPartitionKey(rel);
-       PartitionDispatch pd;
-       int                     i;
-       int                     nparts;
-       int                     oid_array_used;
-       int                     oid_array_size;
-       Oid                *oid_array;
-       Oid                *partdesc_oids;
-       bool       *partdesc_subpartitions;
-       int                *indexes;
-
-       check_stack_depth();
-
-       /* Build a PartitionDispatch for this table and add it to *pds. */
-       pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-       *pds = lappend(*pds, pd);
-       pd->reldesc = rel;
-       pd->key = partkey;
-       pd->keystate = NIL;
-       pd->partdesc = partdesc;
-       if (parent != NULL)
-       {
-               /*
-                * For every partitioned table other than the root, we must store a
-                * tuple table slot initialized with its tuple descriptor and a tuple
-                * conversion map to convert a tuple from its parent's rowtype to its
-                * own. That is to make sure that we are looking at the correct row
-                * using the correct tuple descriptor when computing its partition key
-                * for tuple routing.
-                */
-               pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-               pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-                                                                                       tupdesc,
-                                                                                       gettext_noop("could not convert row type"));
-       }
-       else
-       {
-               /* Not required for the root partitioned table */
-               pd->tupslot = NULL;
-               pd->tupmap = NULL;
-
-               /*
-                * If the parent has no sub partitions then we can skip calculating
-                * all the leaf partitions and just return all the oids at this level.
-                * In this case, the indexes were also pre-calculated for us by the
-                * syscache code.
-                */
-               if (!partdesc->hassubpart)
-               {
-                       *leaf_part_oids = partdesc->oids;
-                       /* XXX or should we memcpy this out of syscache? */
-                       pd->indexes = partdesc->indexes;
-                       *n_leaf_part_oids = partdesc->nparts;
-                       return;
-               }
-       }
-
-       /*
-        * Go look at each partition of this table.  If it's a leaf partition,
-        * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-        * recursively call get_partition_dispatch_recurse(), so that its
-        * partitions are processed as well and a corresponding PartitionDispatch
-        * object gets added to *pds.
-        *
-        * The 'indexes' array is used when searching for a partition matching a
-        * given tuple.  The actual value we store here depends on whether the
-        * array element belongs to a leaf partition or a subpartitioned table.
-        * For leaf partitions we store the index into *leaf_part_oids, and for
-        * sub-partitioned tables we store a negative version of the index into
-        * the *pds list.  Both indexes are 0-based, but the first element of 
the
-        * *pds list is the root partition, so 0 always means the first leaf. 
When
-        * searching, if we see a negative value, the search must continue in 
the
-        * corresponding sub-partition; otherwise, we've identified the correct
-        * partition.
-        */
-       oid_array_used = *n_leaf_part_oids;
-       oid_array_size = *leaf_part_oid_size;
-       oid_array = *leaf_part_oids;
-       nparts = partdesc->nparts;
-
-       if (!oid_array)
-       {
-               oid_array_size = *leaf_part_oid_size = nparts;
-               *leaf_part_oids = (Oid *) palloc(sizeof(Oid) * nparts);
-               oid_array = *leaf_part_oids;
-       }
-
-       partdesc_oids = partdesc->oids;
-       partdesc_subpartitions = partdesc->subpartitions;
-
-       pd->indexes = indexes = (int *) palloc(nparts * sizeof(int));
-
-       for (i = 0; i < nparts; i++)
-       {
-               Oid                     partrelid = partdesc_oids[i];
-
-               if (!partdesc_subpartitions[i])
-               {
-                       if (oid_array_size <= oid_array_used)
-                       {
-                               oid_array_size *= 2;
-                               oid_array = (Oid *) repalloc(oid_array,
-                                                                                        sizeof(Oid) * oid_array_size);
-                       }
-
-                       oid_array[oid_array_used] = partrelid;
-                       indexes[i] = oid_array_used++;
-               }
-               else
-               {
-                       /*
-                        * We assume all tables in the partition tree were already locked
-                        * by the caller.
-                        */
-                       Relation        partrel = heap_open(partrelid, NoLock);
-
-                       *n_leaf_part_oids = oid_array_used;
-                       *leaf_part_oid_size = oid_array_size;
-                       *leaf_part_oids = oid_array;
-
-                       indexes[i] = -list_length(*pds);
-                       get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids,
-                                                                                  n_leaf_part_oids, leaf_part_oid_size);
-
-                       oid_array_used = *n_leaf_part_oids;
-                       oid_array_size = *leaf_part_oid_size;
-                       oid_array = *leaf_part_oids;
-               }
-       }
-
-       *n_leaf_part_oids = oid_array_used;
-       *leaf_part_oid_size = oid_array_size;
-       *leaf_part_oids = oid_array;
-}
-
 /* ----------------
  *             FormPartitionKeyDatum
  *                     Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 07b5f968aa..8b671c6426 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
                                                ResultRelInfo *targetRelInfo,
                                                TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
                                                int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
        if (mtstate->mt_transition_capture != NULL ||
                mtstate->mt_oc_transition_capture != NULL)
        {
-               ExecSetupChildParentMapForTcs(mtstate);
+               ExecSetupChildParentMapForSubplan(mtstate);
 
                /*
                 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,15 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
         * value is to be used as an index into the arrays for the ResultRelInfo
         * and TupleConversionMap for the partition.
         */
-       partidx = ExecFindPartition(targetRelInfo,
-                                                               proute->partition_dispatch_info,
-                                                               slot,
-                                                               estate);
+       partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
        Assert(partidx >= 0 && partidx < proute->num_partitions);
 
        /* Get the ResultRelInfo corresponding to the selected partition. */
-       partrel = ExecGetPartitionInfo(mtstate, targetRelInfo, proute, estate,
-                                                                  partidx);
+       Assert(proute->partitions[partidx] != NULL);
+       partrel = proute->partitions[partidx];
 
        /*
         * Check whether the partition is routable if we didn't yet
@@ -1825,17 +1821,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
        int                     i;
 
        /*
-        * First check if there is already a per-subplan array allocated. Even if
-        * there is already a per-leaf map array, we won't require a per-subplan
-        * one, since we will use the subplan offset array to convert the subplan
-        * index to per-leaf index.
-        */
-       if (mtstate->mt_per_subplan_tupconv_maps ||
-               (mtstate->mt_partition_tuple_routing &&
-                mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-               return;
-
-       /*
         * Build array of conversion maps from each child's TupleDesc to the one
         * used in the target relation.  The map pointers may be NULL when no
         * conversion is necessary, which is hopefully a common case.
@@ -1857,78 +1842,17 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-       PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-       /*
-        * If partition tuple routing is set up, we will require partition-indexed
-        * access. In that case, create the map array indexed by partition; we
-        * will still be able to access the maps using a subplan index by
-        * converting the subplan index to a partition index using
-        * subplan_partition_offsets. If tuple routing is not set up, it means we
-        * don't require partition-indexed access. In that case, create just a
-        * subplan-indexed map.
-        */
-       if (proute)
-       {
-               /*
-                * If a partition-indexed map array is to be created, the subplan map
-                * array has to be NULL.  If the subplan map array is already created,
-                * we won't be able to access the map using a partition index.
-                */
-               Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-               ExecSetupChildParentMapForLeaf(proute);
-       }
-       else
-               ExecSetupChildParentMapForSubplan(mtstate);
-}
-
-/*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-       /*
-        * If a partition-index tuple conversion map array is allocated, we need
-        * to first get the index into the partition array. Exactly *one* of the
-        * two arrays is allocated. This is because if there is a partition array
-        * required, we don't require subplan-indexed array since we can translate
-        * subplan index into partition index. And, we create a subplan-indexed
-        * array *only* if partition-indexed array is not required.
-        */
+       /* If nobody else set the per-subplan array of maps, do so ourselves. */
        if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-       {
-               int                     leaf_index;
-               PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+               ExecSetupChildParentMapForSubplan(mtstate);
 
-               /*
-                * If subplan-indexed array is NULL, things should have been arranged
-                * to convert the subplan index to partition index.
-                */
-               Assert(proute && proute->subplan_partition_offsets != NULL &&
-                          whichplan < proute->num_subplan_partition_offsets);
-
-               leaf_index = proute->subplan_partition_offsets[whichplan];
-
-               return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-                                                                leaf_index);
-       }
-       else
-       {
-               Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-               return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-       }
+       Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+       return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
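
For clarity, the per-subplan map array that ExecSetupChildParentMapForSubplan()
now builds on first use boils down to the following pattern; this is a
simplified sketch with hypothetical names, not the function itself:

#include "postgres.h"
#include "access/tupconvert.h"

/*
 * Simplified sketch: build child-to-root conversion maps for 'n' child
 * TupleDescs.  convert_tuples_by_name() returns NULL when no conversion
 * is necessary, which is hopefully the common case.
 */
static TupleConversionMap **
build_child_parent_maps(TupleDesc *childdescs, TupleDesc rootdesc, int n)
{
    TupleConversionMap **maps;
    int         i;

    maps = (TupleConversionMap **) palloc(n * sizeof(TupleConversionMap *));
    for (i = 0; i < n; i++)
        maps[i] = convert_tuples_by_name(childdescs[i], rootdesc,
                                         gettext_noop("could not convert row type"));
    return maps;
}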
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index b36b7366e5..aa82aa52eb 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,7 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
                int                     next_index = 0;
 
                result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
-               result->subpartitions = (bool *) palloc(nparts * sizeof(bool));
+               result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
                boundinfo = (PartitionBoundInfoData *)
                        palloc0(sizeof(PartitionBoundInfoData));
@@ -775,7 +775,6 @@ RelationBuildPartitionDesc(Relation rel)
                }
 
                result->boundinfo = boundinfo;
-               result->hassubpart = false; /* unless we discover otherwise below */
 
                /*
                 * Now assign OIDs from the original array into mapped indexes of the
@@ -786,33 +785,13 @@ RelationBuildPartitionDesc(Relation rel)
                for (i = 0; i < nparts; i++)
                {
                        int                     index = mapping[i];
-                       bool            subpart;
 
                        result->oids[index] = oids[i];
-
-                       subpart = (get_rel_relkind(oids[i]) == RELKIND_PARTITIONED_TABLE);
                        /* Record if the partition is a subpartitioned table */
-                       result->subpartitions[index] = subpart;
-                       result->hassubpart |= subpart;
+                       result->is_leaf[index] =
+                               (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
                }
 
-               /*
-                * If there are no subpartitions then we can pre-calculate the
-                * PartitionDispatch->indexes array.  Doing this here saves quite a
-                * bit of overhead on simple queries which perform INSERTs or UPDATEs
-                * on partitioned tables with many partitions.  The pre-calculation is
-                * very simple.  All we need to store is a sequence of numbers from 0
-                * to nparts - 1.
-                */
-               if (!result->hassubpart)
-               {
-                       result->indexes = (int *) palloc(nparts * sizeof(int));
-                       for (i = 0; i < nparts; i++)
-                               result->indexes[i] = i;
-               }
-               else
-                       result->indexes = NULL;
-
                pfree(mapping);
        }
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a8c69ff224..8d20469c98 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,18 +26,11 @@
 typedef struct PartitionDescData
 {
        int                     nparts;                 /* Number of partitions */
-       Oid                *oids;                       /* OIDs array of 'nparts' of partitions in
-                                                                * partbound order */
-       int                *indexes;            /* Stores index for corresponding 'oids'
-                                                                * element for use in tuple routing, or NULL
-                                                                * if hassubpart is true.
-                                                                */
-       bool       *subpartitions;      /* Array of 'nparts' set to true if the
-                                                                * corresponding 'oids' element belongs to a
-                                                                * sub-partitioned table.
-                                                                */
-       bool            hassubpart;             /* true if any oid belongs to a
-                                                                * sub-partitioned table */
+       Oid                *oids;                       /* Array of length 'nparts' containing
+                                                                * partition OIDs in order of their
+                                                                * bounds */
+       bool       *is_leaf;            /* Array of length 'nparts' containing whether
+                                                                * a partition is a leaf partition */
        PartitionBoundInfo boundinfo;   /* collection of partition bounds */
 } PartitionDescData;
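
With the simplified PartitionDescData above, code that used to consult
'hassubpart' or the old 'subpartitions' flags can simply scan 'is_leaf'.
For example, a hypothetical helper (not part of the patch) that counts the
leaf partitions directly attached to a parent might look like:

#include "postgres.h"
#include "catalog/partition.h"

/* Hypothetical helper: count leaf partitions directly under 'partdesc'. */
static int
count_direct_leaves(PartitionDesc partdesc)
{
    int         i;
    int         nleaves = 0;

    for (i = 0; i < partdesc->nparts; i++)
    {
        if (partdesc->is_leaf[i])
            nleaves++;
    }
    return nleaves;
}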
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 822f66f5e2..f284bc2d81 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -45,77 +45,98 @@ typedef struct PartitionDispatchData
        TupleTableSlot *tupslot;
        TupleConversionMap *tupmap;
        int                *indexes;
+       bool            scanned_update_result_rels;
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info             Array of PartitionDispatch objects with one
- *                                                             entry for every partitioned table in the
- *                                                             partition tree.
- * num_dispatch                                        number of partitioned tables in the partition
- *                                                             tree (= length of partition_dispatch_info[])
- * partition_oids                              Array of leaf partitions OIDs with one entry
- *                                                             for every leaf partition in the partition tree,
- *                                                             initialized in full by
- *                                                             ExecSetupPartitionTupleRouting.
- * partitions                                  Array of ResultRelInfo* objects with one entry
- *                                                             for every leaf partition in the partition tree,
- *                                                             initialized lazily by ExecInitPartitionInfo.
- * partitions_init                             Array of ResultRelInfo* objects in the order
- *                                                             that they were lazily initialized.
- * num_partitions                              Number of leaf partitions in the partition tree
- *                                                             (= 'partitions_oid'/'partitions' array length)
- * num_partitions_init                 Number of leaf partition lazily setup so far.
- * partitions_init_size                        Size of partitions_init array.
- * parent_child_tupconv_maps   Array of TupleConversionMap objects with one
- *                                                             entry for every leaf partition (required to
- *                                                             convert tuple from the root table's rowtype to
- *                                                             a leaf partition's rowtype after tuple routing
- *                                                             is done). Remains NULL if no maps to store.
- * child_parent_tupconv_maps   Array of TupleConversionMap objects with one
- *                                                             entry for every leaf partition (required to
- *                                                             convert an updated tuple from the leaf
- *                                                             partition's rowtype to the root table's rowtype
- *                                                             so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *                                                             determined to be not required for the given
- *                                                             partition. False means either we haven't yet
- *                                                             checked if a map is required, or it was
- *                                                             determined to be required.
- * subplan_partition_offsets   Integer array ordered by UPDATE subplans. Each
- *                                                             element of this array has the index into the
- *                                                             corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot                        TupleTableSlot to be used to manipulate any
- *                                                             given leaf partition's rowtype after that
- *                                                             partition is chosen for insertion by
- *                                                             tuple-routing.
- * root_tuple_slot                             TupleTableSlot to be used to transiently hold
- *                                                             copy of a tuple that's being moved across
- *                                                             partitions in the root partitioned table's
- *                                                             rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+       /*
+        * Root table, that is, the table mentioned in the INSERT/UPDATE query or
+        * COPY FROM command.
+        */
+       Relation        partition_root;
+
+       /*
+        * Contains PartitionDispatch objects for every partitioned table touched
+        * by tuple routing.  The entry for the root partitioned table is *always*
+        * present as the first entry of this array.  'num_dispatch' is the
+        * number of existing entries and also serves as the index of the next
+        * entry to be allocated.  'dispatch_allocsize' (>= 'num_dispatch') is the
+        * number of entries that can be stored in the array, before needing to
+        * reallocate more space.  See ExecInitPartitionDispatchInfo().
+        */
        PartitionDispatch *partition_dispatch_info;
        int                     num_dispatch;
-       Oid                *partition_oids;
+       int                     dispatch_allocsize;
+
+       /*
+        * Contains pointers to the ResultRelInfos of all leaf partitions touched
+        * by tuple routing.  Some of these are pointers to "reused"
+        * ResultRelInfos, that is, those that are created and destroyed outside
+        * execPartition.c, for example, when tuple routing is used for UPDATE
+        * queries that modify the partition key; see
+        * ExecUseUpdateResultRelForRouting().  The rest are pointers to
+        * ResultRelInfos managed by execPartition.c itself; see
+        * ExecInitPartitionInfo() and ExecCleanupTupleRouting().
+        */
        ResultRelInfo **partitions;
-       ResultRelInfo **partitions_init;
        int                     num_partitions;
-       int                     num_partitions_init;
-       int                     partitions_init_size;
+       int                     partitions_allocsize;
+
+       /*
+        * Contains information to convert tuples of the root parent's rowtype to
+        * those of the leaf partitions' rowtype, but only for those partitions
+        * whose TupleDescs are physically different from the root parent's.  If
+        * none of the partitions has such a differing TupleDesc, then the
+        * following array is NULL.  If non-NULL, it is of the same size as the
+        * 'partitions' array above, to be able to use the same array index.
+        * Also, there need not be more of these maps than there are partitions
+        * touched.
+        */
        TupleConversionMap **parent_child_tupconv_maps;
+       int                     parent_child_tupconv_maps_allocsize;
+
+       /*
+        * This is a tuple slot used for a partition after tuple routing.
+        * Maintained separately because each partition may have a different
+        * rowtype.
+        */
+       TupleTableSlot *partition_tuple_slot;
+
+       /*
+        * Note: The following fields are used only when UPDATE ends up needing to
+        * do tuple routing.
+        */
+
+       /*
+        * Information to convert tuples of the leaf partitions' rowtype to the
+        * root parent's rowtype.  These are needed by the transition table
+        * machinery when storing tuples of a partition's rowtype into the
+        * transition table, which can only store tuples of the root parent's
+        * rowtype.
+        */
        TupleConversionMap **child_parent_tupconv_maps;
        bool       *child_parent_map_not_required;
+       int                     child_parent_tupconv_maps_allocsize;
+
+       /*
+        * The following maps indexes of UPDATE result rels in the per-subplan
+        * array to indexes of their pointers in the 'partitions' array above.
+        */
        int                *subplan_partition_offsets;
        int                     num_subplan_partition_offsets;
-       TupleTableSlot *partition_tuple_slot;
+
+       /*
+        * During UPDATE tuple routing, this tuple slot is used to transiently
+        * store a tuple using the root table's rowtype after converting it from
+        * the tuple's source leaf partition's rowtype, that is, when the leaf
+        * partition's rowtype differs from the root's.
+        */
        TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
@@ -193,8 +214,9 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
                                                           Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-                                 PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+                                 ResultRelInfo *resultRelInfo,
+                                 PartitionTupleRouting *proute,
                                  TupleTableSlot *slot,
                                  EState *estate);
 extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,

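Finally, to make the new ExecFindPartition() contract concrete: a caller uses
the returned index into proute->partitions[], which ExecFindPartition() has by
then populated.  A condensed sketch, assuming the declarations above (the
wrapper function name is hypothetical):

#include "postgres.h"
#include "executor/execPartition.h"

/*
 * Condensed from the ExecPrepareTupleRouting() hunk earlier in the patch:
 * route a tuple and fetch the ResultRelInfo that ExecFindPartition()
 * guarantees to exist by the time it returns.
 */
static ResultRelInfo *
route_tuple(ModifyTableState *mtstate, ResultRelInfo *rootRelInfo,
            PartitionTupleRouting *proute, TupleTableSlot *slot,
            EState *estate)
{
    int         partidx;

    partidx = ExecFindPartition(mtstate, rootRelInfo, proute, slot, estate);
    Assert(partidx >= 0 && partidx < proute->num_partitions);
    Assert(proute->partitions[partidx] != NULL);
    return proute->partitions[partidx];
}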