Hi Amit,

On 2017/08/17 21:18, Amit Khandekar wrote:
> Anyways, some more comments :
> 
> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
> array of pointers. Why can't it be an array of
> PartitionTupleRoutingInfo structure  rather than pointer to that
> structure ?

AFAIK, assigning pointers is less expensive than assigning struct and we
end up doing a lot of assigning of the members of that array to a local
variable in get_partition_for_tuple(), for example.  Perhaps, we could
avoid those assignments and implement it the way you suggest.

> diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> + * Close all the leaf partitions and their indices.
> *
> Above comment needs to be shifted a bit down to the subsequent "for"
> loop where it's actually applicable.

That's right, done.

> * node->mt_partition_dispatch_info[0] corresponds to the root partitioned
> * table, for which we didn't create tupslot.
> Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]

Oops, fixed.

> /*
>  * XXX- do we need a pinning mechanism for partition descriptors
>  * so that there references can be managed independently of
>  * the parent relcache entry? Like PinPartitionDesc(partdesc)?
>  */
> pd->partdesc = partdesc;
> 
> Any idea if the above can be handled ? I am not too sure.

A similar mechanism exists for TupleDesc ref-counting (see the usage of
PinTupleDesc and ReleaseTupleDesc across the backend code.)  I too am
currently unsure if such an elaborate mechanism is actually *necessary*
for rd_partdesc.


Attached updated patches.

Thanks,
Amit
From 0cf8ab795fd3a8db462e8c692cfaa73f19e71ed6 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code.  In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots.  That makes it harder to use in places other
than where it's currently being used.

After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry() no
longer needs to do some things that it used to.
---
 src/backend/catalog/partition.c        | 309 +++++++++++++++++----------------
 src/backend/commands/copy.c            |  37 ++--
 src/backend/executor/execMain.c        | 145 ++++++++++++++--
 src/backend/executor/nodeModifyTable.c |  35 ++--
 src/include/catalog/partition.h        |  52 +++---
 src/include/executor/executor.h        |   4 +-
 src/include/nodes/execnodes.h          |  53 +++++-
 7 files changed, 398 insertions(+), 237 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..7618e4cb31 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
        bool            lower;                  /* this is the lower (vs upper) 
bound */
 } PartitionRangeBound;
 
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ *                                                in a partition tree
+ *
+ *     partkey         Partition key of the table
+ *     partdesc        Partition descriptor of the table
+ *     indexes         Array with partdesc->nparts members (for details on 
what the
+ *                             individual value represents, see the comments in
+ *                             RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+       PartitionKey    partkey;        /* Points into the table's relcache 
entry */
+       PartitionDesc   partdesc;       /* Ditto */
+       int                        *indexes;
+} PartitionDispatchData;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
                                                           void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -981,181 +999,165 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-       do\
-       {\
-               int             i;\
-               for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-               {\
-                       (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-                       (parents) = lappend((parents), (rel));\
-               }\
-       } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
- *             Returns information necessary to route tuples down a partition 
tree
+ *             Returns necessary information for each partition in the 
partition
+ *             tree rooted at rel
  *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of of all the leaf partitions.
  *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in the
+ * partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to get PartitionKey and
+ * PartitionDesc.
  */
-PartitionDispatch *
+void
 RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int 
*num_parted, List **leaf_part_oids)
+                                                                List 
**ptinfos, List **leaf_part_oids)
 {
-       PartitionDispatchData **pd;
-       List       *all_parts = NIL,
-                          *all_parents = NIL,
-                          *parted_rels,
-                          *parted_rel_parents;
+       List       *all_parts,
+                          *all_parents;
        ListCell   *lc1,
                           *lc2;
        int                     i,
-                               k,
                                offset;
 
        /*
         * We rely on the relcache to traverse the partition tree to build both
-        * the leaf partition OIDs list and the array of PartitionDispatch 
objects
-        * for the partitioned tables in the tree.  That means every partitioned
-        * table in the tree must be locked, which is fine since we require the
-        * caller to lock all the partitions anyway.
+        * the leaf partition OIDs list and the list of PartitionedTableInfo
+        * objects for partitioned tables.  That means every partitioned table 
in
+        * the tree must be locked, which is fine since the callers must have 
done
+        * that already.
         *
         * For every partitioned table in the tree, starting with the root
         * partitioned table, add its relcache entry to parted_rels, while also
         * queuing its partitions (in the order in which they appear in the
         * partition descriptor) to be looked at later in the same loop.  This 
is
         * a bit tricky but works because the foreach() macro doesn't fetch the
-        * next list element until the bottom of the loop.
+        * next list element until the bottom of the loop.  Non-partitioned 
tables
+        * are simply added to the leaf partitions list.
         */
-       *num_parted = 1;
-       parted_rels = list_make1(rel);
-       /* Root partitioned table has no parent, so NULL for parent */
-       parted_rel_parents = list_make1(NULL);
-       APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+       i = offset = 0;
+       *ptinfos = *leaf_part_oids = NIL;
+
+       /* Start with the root table. */
+       all_parts = list_make1_oid(RelationGetRelid(rel));
+       all_parents = list_make1_oid(InvalidOid);
        forboth(lc1, all_parts, lc2, all_parents)
        {
-               Oid                     partrelid = lfirst_oid(lc1);
-               Relation        parent = lfirst(lc2);
+               Oid             partrelid = lfirst_oid(lc1);
+               Oid             parentrelid = lfirst_oid(lc2);
 
                if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
                {
-                       /*
-                        * Already locked by the caller.  Note that it is the
-                        * responsibility of the caller to close the below 
relcache entry,
-                        * once done using the information being collected here 
(for
-                        * example, in ExecEndModifyTable).
-                        */
-                       Relation        partrel = heap_open(partrelid, NoLock);
+                       int             j,
+                                       k;
+                       Relation                partrel;
+                       PartitionKey    partkey;
+                       PartitionDesc   partdesc;
+                       PartitionedTableInfo   *ptinfo;
+                       PartitionDispatch               pd;
+
+                       if (partrelid != RelationGetRelid(rel))
+                               partrel = heap_open(partrelid, NoLock);
+                       else
+                               partrel = rel;
 
-                       (*num_parted)++;
-                       parted_rels = lappend(parted_rels, partrel);
-                       parted_rel_parents = lappend(parted_rel_parents, 
parent);
-                       APPEND_REL_PARTITION_OIDS(partrel, all_parts, 
all_parents);
-               }
-       }
+                       partkey = RelationGetPartitionKey(partrel);
+                       partdesc = RelationGetPartitionDesc(partrel);
+
+                       ptinfo = (PartitionedTableInfo *)
+                                                                       
palloc0(sizeof(PartitionedTableInfo));
+                       ptinfo->relid = partrelid;
+                       ptinfo->parentid = parentrelid;
+
+                       ptinfo->pd = pd = (PartitionDispatchData *)
+                                                                       
palloc0(sizeof(PartitionDispatchData));
+                       pd->partkey = partkey;
 
-       /*
-        * We want to create two arrays - one for leaf partitions and another 
for
-        * partitioned tables (including the root table and internal 
partitions).
-        * While we only create the latter here, leaf partition array of 
suitable
-        * objects (such as, ResultRelInfo) is created by the caller using the
-        * list of OIDs we return.  Indexes into these arrays get assigned in a
-        * breadth-first manner, whereby partitions of any given level are 
placed
-        * consecutively in the respective arrays.
-        */
-       pd = (PartitionDispatchData **) palloc(*num_parted *
-                                                                               
   sizeof(PartitionDispatchData *));
-       *leaf_part_oids = NIL;
-       i = k = offset = 0;
-       forboth(lc1, parted_rels, lc2, parted_rel_parents)
-       {
-               Relation        partrel = lfirst(lc1);
-               Relation        parent = lfirst(lc2);
-               PartitionKey partkey = RelationGetPartitionKey(partrel);
-               TupleDesc       tupdesc = RelationGetDescr(partrel);
-               PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
-               int                     j,
-                                       m;
-
-               pd[i] = (PartitionDispatch) 
palloc(sizeof(PartitionDispatchData));
-               pd[i]->reldesc = partrel;
-               pd[i]->key = partkey;
-               pd[i]->keystate = NIL;
-               pd[i]->partdesc = partdesc;
-               if (parent != NULL)
-               {
                        /*
-                        * For every partitioned table other than root, we must 
store a
-                        * tuple table slot initialized with its tuple 
descriptor and a
-                        * tuple conversion map to convert a tuple from its 
parent's
-                        * rowtype to its own. That is to make sure that we are 
looking at
-                        * the correct row using the correct tuple descriptor 
when
-                        * computing its partition key for tuple routing.
+                        * XXX- do we need a pinning mechanism for partition 
descriptors
+                        * so that there references can be managed 
independently of
+                        * the parent relcache entry? Like 
PinPartitionDesc(partdesc)?
                         */
-                       pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-                       pd[i]->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
-                                                                               
                   tupdesc,
-                                                                               
                   gettext_noop("could not convert row type"));
-               }
-               else
-               {
-                       /* Not required for the root partitioned table */
-                       pd[i]->tupslot = NULL;
-                       pd[i]->tupmap = NULL;
-               }
-               pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+                       pd->partdesc = partdesc;
 
-               /*
-                * Indexes corresponding to the internal partitions are 
multiplied by
-                * -1 to distinguish them from those of leaf partitions.  
Encountering
-                * an index >= 0 means we found a leaf partition, which is 
immediately
-                * returned as the partition we are looking for.  A negative 
index
-                * means we found a partitioned table, whose PartitionDispatch 
object
-                * is located at the above index multiplied back by -1.  Using 
the
-                * PartitionDispatch object, search is continued further down 
the
-                * partition tree.
-                */
-               m = 0;
-               for (j = 0; j < partdesc->nparts; j++)
-               {
-                       Oid                     partrelid = partdesc->oids[j];
+                       /*
+                        * The values contained in the following array 
correspond to
+                        * indexes of this table's partitions in the global 
sequence of
+                        * all the partitions contained in the partition tree 
rooted at
+                        * rel, traversed in a breadh-first manner.  The values 
should be
+                        * such that we will be able to distinguish the leaf 
partitions
+                        * from the non-leaf partitions, because they are 
returned to
+                        * to the caller in separate structures from where they 
will be
+                        * accessed.  The way that's done is described below:
+                        *
+                        * Leaf partition OIDs are put into the global 
leaf_part_oids list,
+                        * and for each one, the value stored is its ordinal 
position in
+                        * the list minus 1.
+                        *
+                        * PartitionedTableInfo objects corresponding to 
partitions that
+                        * are partitioned tables are put into the global 
ptinfos[] list,
+                        * and for each one, the value stored is its ordinal 
position in
+                        * the list multiplied by -1.
+                        *
+                        * So while looking at the values in the indexes array, 
if one
+                        * gets zero or a positive value, then it's a leaf 
partition,
+                        * Otherwise, it's a partitioned table.
+                        */
+                       pd->indexes = (int *) palloc(partdesc->nparts * 
sizeof(int));
 
-                       if (get_rel_relkind(partrelid) != 
RELKIND_PARTITIONED_TABLE)
-                       {
-                               *leaf_part_oids = lappend_oid(*leaf_part_oids, 
partrelid);
-                               pd[i]->indexes[j] = k++;
-                       }
-                       else
+                       k = 0;
+                       for (j = 0; j < partdesc->nparts; j++)
                        {
+                               Oid                     partrelid = 
partdesc->oids[j];
+
                                /*
-                                * offset denotes the number of partitioned 
tables of upper
-                                * levels including those of the current level. 
 Any partition
-                                * of this table must belong to the next level 
and hence will
-                                * be placed after the last partitioned table 
of this level.
+                                * Queue this partition so that it will be 
processed later
+                                * by the outer loop.
                                 */
-                               pd[i]->indexes[j] = -(1 + offset + m);
-                               m++;
+                               all_parts = lappend_oid(all_parts, partrelid);
+                               all_parents = lappend_oid(all_parents,
+                                                                               
  RelationGetRelid(partrel));
+
+                               if (get_rel_relkind(partrelid) != 
RELKIND_PARTITIONED_TABLE)
+                               {
+                                       *leaf_part_oids = 
lappend_oid(*leaf_part_oids, partrelid);
+                                       pd->indexes[j] = i++;
+                               }
+                               else
+                               {
+                                       /*
+                                        * offset denotes the number of 
partitioned tables that
+                                        * we have already processed.  k counts 
the number of
+                                        * partitions of this table that were 
found to be
+                                        * partitioned tables.
+                                        */
+                                       pd->indexes[j] = -(1 + offset + k);
+                                       k++;
+                               }
                        }
-               }
-               i++;
 
-               /*
-                * This counts the number of partitioned tables at upper levels
-                * including those of the current level.
-                */
-               offset += m;
+                       offset += k;
+
+                       /*
+                        * Release the relation descriptor.  Lock that we have 
on the
+                        * table will keep the PartitionDesc that is pointing 
into
+                        * RelationData intact, a pointer to which hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+
+                       *ptinfos = lappend(*ptinfos, ptinfo);
+               }
        }
 
-       return pd;
+       Assert(i == list_length(*leaf_part_oids));
+       Assert((offset + 1) == list_length(*ptinfos));
 }
 
 /* Module-local functions */
@@ -1872,7 +1874,7 @@ generate_partition_qual(Relation rel)
  * ----------------
  */
 void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
@@ -1881,20 +1883,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
        ListCell   *partexpr_item;
        int                     i;
 
-       if (pd->key->partexprs != NIL && pd->keystate == NIL)
+       if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
        {
                /* Check caller has set up context correctly */
                Assert(estate != NULL &&
                           GetPerTupleExprContext(estate)->ecxt_scantuple == 
slot);
 
                /* First time through, set up expression evaluation state */
-               pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+               keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+                                                                               
                estate);
        }
 
-       partexpr_item = list_head(pd->keystate);
-       for (i = 0; i < pd->key->partnatts; i++)
+       partexpr_item = list_head(keyinfo->keystate);
+       for (i = 0; i < keyinfo->key->partnatts; i++)
        {
-               AttrNumber      keycol = pd->key->partattrs[i];
+               AttrNumber      keycol = keyinfo->key->partattrs[i];
                Datum           datum;
                bool            isNull;
 
@@ -1931,13 +1934,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  * the latter case.
  */
 int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot)
 {
-       PartitionDispatch parent;
+       PartitionTupleRoutingInfo *parent;
        Datum           values[PARTITION_MAX_KEYS];
        bool            isnull[PARTITION_MAX_KEYS];
        int                     cur_offset,
@@ -1948,11 +1951,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
        TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
        /* start with the root partitioned table */
-       parent = pd[0];
+       parent = ptrinfos[0];
        while (true)
        {
-               PartitionKey key = parent->key;
-               PartitionDesc partdesc = parent->partdesc;
+               PartitionKey  key = parent->pd->partkey;
+               PartitionDesc partdesc = parent->pd->partdesc;
                TupleTableSlot *myslot = parent->tupslot;
                TupleConversionMap *map = parent->tupmap;
 
@@ -1984,7 +1987,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
                 * So update ecxt_scantuple accordingly.
                 */
                ecxt->ecxt_scantuple = slot;
-               FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+               FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, 
isnull);
 
                if (key->strategy == PARTITION_STRATEGY_RANGE)
                {
@@ -2055,13 +2058,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
                        *failed_slot = slot;
                        break;
                }
-               else if (parent->indexes[cur_index] >= 0)
+               else if (parent->pd->indexes[cur_index] >= 0)
                {
-                       result = parent->indexes[cur_index];
+                       result = parent->pd->indexes[cur_index];
                        break;
                }
                else
-                       parent = pd[-parent->indexes[cur_index]];
+                       parent = ptrinfos[-parent->pd->indexes[cur_index]];
        }
 
 error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..b0c596345b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
        bool            volatile_defexprs;      /* is any of defexprs volatile? 
*/
        List       *range_table;
 
-       PartitionDispatch *partition_dispatch_info;
-       int                     num_dispatch;   /* Number of entries in the 
above array */
+       PartitionTupleRoutingInfo **ptrinfos;
+       int                     num_parted;             /* Number of entries in 
the above array */
        int                     num_partitions; /* Number of members in the 
following arrays */
        ResultRelInfo *partitions;      /* Per partition result relation */
        TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
         */
        if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
-               PartitionDispatch *partition_dispatch_info;
+               PartitionTupleRoutingInfo **ptrinfos;
                ResultRelInfo *partitions;
                TupleConversionMap **partition_tupconv_maps;
                TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
                ExecSetupPartitionTupleRouting(cstate->rel,
                                                                           1,
                                                                           
estate,
-                                                                          
&partition_dispatch_info,
+                                                                          
&ptrinfos,
                                                                           
&partitions,
                                                                           
&partition_tupconv_maps,
                                                                           
&partition_tuple_slot,
                                                                           
&num_parted, &num_partitions);
-               cstate->partition_dispatch_info = partition_dispatch_info;
-               cstate->num_dispatch = num_parted;
+               cstate->ptrinfos = ptrinfos;
+               cstate->num_parted = num_parted;
                cstate->partitions = partitions;
                cstate->num_partitions = num_partitions;
                cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
        if ((resultRelInfo->ri_TrigDesc != NULL &&
                 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
                  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-               cstate->partition_dispatch_info != NULL ||
+               cstate->ptrinfos != NULL ||
                cstate->volatile_defexprs)
        {
                useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
                ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
                /* Determine the partition to heap_insert the tuple into */
-               if (cstate->partition_dispatch_info)
+               if (cstate->ptrinfos)
                {
                        int                     leaf_part_index;
                        TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
                         * partition, respectively.
                         */
                        leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
                cstate->partition_dispatch_info,
+                                                                               
                cstate->ptrinfos,
                                                                                
                slot,
                                                                                
                estate);
                        Assert(leaf_part_index >= 0 &&
@@ -2826,24 +2826,23 @@ CopyFrom(CopyState cstate)
 
        ExecCloseIndices(resultRelInfo);
 
-       /* Close all the partitioned tables, leaf partitions, and their indices 
*/
-       if (cstate->partition_dispatch_info)
+       /* Release some resources that we acquired for tuple-routing. */
+       if (cstate->ptrinfos)
        {
                int                     i;
 
                /*
-                * Remember cstate->partition_dispatch_info[0] corresponds to 
the root
-                * partitioned table, which we must not try to close, because 
it is
-                * the main target table of COPY that will be closed eventually 
by
-                * DoCopy().  Also, tupslot is NULL for the root partitioned 
table.
+                * cstate->ptrinfos[0] corresponds to the root partitioned 
table, for
+                * which we didn't create tupslot.
                 */
-               for (i = 1; i < cstate->num_dispatch; i++)
+               for (i = 1; i < cstate->num_parted; i++)
                {
-                       PartitionDispatch pd = 
cstate->partition_dispatch_info[i];
+                       PartitionTupleRoutingInfo *ptrinfo = 
cstate->ptrinfos[i];
 
-                       heap_close(pd->reldesc, NoLock);
-                       ExecDropSingleTupleTableSlot(pd->tupslot);
+                       ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
                }
+
+               /* Close all the leaf partitions and their indices */
                for (i = 0; i < cstate->num_partitions; i++)
                {
                        ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0edee..a03188aba3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3236,8 +3236,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *             every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ *             entry for each partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo objects with one entry for
  *             every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3260,7 +3260,7 @@ void
 ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
                                                           EState *estate,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
@@ -3268,16 +3268,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
 {
        TupleDesc       tupDesc = RelationGetDescr(rel);
        List       *leaf_parts;
+       List       *ptinfos = NIL;
        ListCell   *cell;
        int                     i;
        ResultRelInfo *leaf_part_rri;
+       Relation        parent;
 
        /*
         * Get the information about the partition tree after locking all the
         * partitions.
         */
        (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, 
NULL);
-       *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+       RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
+
+       /*
+        * The ptinfos list contains PartitionedTableInfo objects for all the
+        * partitioned tables in the partition tree.  Using the information
+        * therein, we construct an array of PartitionTupleRoutingInfo objects
+        * to be used during tuple-routing.
+        */
+       *num_parted = list_length(ptinfos);
+       *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+                                                                               
sizeof(PartitionTupleRoutingInfo *));
+       /*
+        * Free the ptinfos List structure itself as we go through (open-coded
+        * list_free).
+        */
+       i = 0;
+       cell = list_head(ptinfos);
+       parent = NULL;
+       while (cell)
+       {
+               ListCell   *tmp = cell;
+               PartitionedTableInfo *ptinfo = lfirst(tmp),
+                                                        *next_ptinfo = NULL;
+               Relation                partrel;
+               PartitionTupleRoutingInfo *ptrinfo;
+
+               if (lnext(tmp))
+                       next_ptinfo = lfirst(lnext(tmp));
+
+               /* As mentioned above, the partitioned tables have been locked. 
*/
+               if (ptinfo->relid != RelationGetRelid(rel))
+                       partrel = heap_open(ptinfo->relid, NoLock);
+               else
+                       partrel = rel;
+
+               ptrinfo = (PartitionTupleRoutingInfo *)
+                                                       
palloc0(sizeof(PartitionTupleRoutingInfo));
+               ptrinfo->relid = ptinfo->relid;
+
+               /* Stash a reference to this PartitionDispatch. */
+               ptrinfo->pd = ptinfo->pd;
+
+               /* State for extracting partition key from tuples will go here. 
*/
+               ptrinfo->keyinfo = (PartitionKeyInfo *)
+                                                               
palloc0(sizeof(PartitionKeyInfo));
+               ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+               ptrinfo->keyinfo->keystate = NIL;
+
+               /*
+                * For every partitioned table other than root, we must store a 
tuple
+                * table slot initialized with its tuple descriptor and a tuple
+                * conversion map to convert a tuple from its parent's rowtype 
to its
+                * own.  That is to make sure that we are looking at the 
correct row
+                * using the correct tuple descriptor when computing its 
partition key
+                * for tuple routing.
+                */
+               if (ptinfo->parentid != InvalidOid)
+               {
+                       TupleDesc       tupdesc = RelationGetDescr(partrel);
+
+                       /* Open the parent relation descriptor if not already 
done. */
+                       if (ptinfo->parentid == RelationGetRelid(rel))
+                       {
+                               parent = rel;
+                       }
+                       else if (parent == NULL)
+                       {
+                               /* Locked by 
RelationGetPartitionDispatchInfo(). */
+                               parent = heap_open(ptinfo->parentid, NoLock);
+                       }
+
+                       ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+                       ptrinfo->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
+                                                                               
                         tupdesc,
+                                                                 
gettext_noop("could not convert row type"));
+
+                       /*
+                        * Close the parent descriptor, if the next partitioned 
table in
+                        * the list is not a sibling, because it will have a 
different
+                        * parent if so.
+                        */
+                       if (parent != NULL && parent != rel &&
+                               next_ptinfo != NULL &&
+                               next_ptinfo->parentid != ptinfo->parentid)
+                       {
+                               heap_close(parent, NoLock);
+                               parent = NULL;
+                       }
+
+                       /*
+                        * Release the relation descriptor.  Lock that we have 
on the
+                        * table will keep the PartitionDesc that is pointing 
into
+                        * RelationData intact, a pointer to which hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+               }
+               else
+               {
+                       /* Not required for the root partitioned table */
+                       ptrinfo->tupslot = NULL;
+                       ptrinfo->tupmap = NULL;
+               }
+
+               (*ptrinfos)[i++] = ptrinfo;
+
+               /* Free the ListCell. */
+               cell = lnext(cell);
+               pfree(tmp);
+       }
+
+       /* Free the List itself. */
+       if (ptinfos)
+               pfree(ptinfos);
+
+       /* For leaf partitions, we build ResultRelInfos and 
TupleConversionMaps. */
        *num_partitions = list_length(leaf_parts);
        *partitions = (ResultRelInfo *) palloc(*num_partitions *
                                                                                
   sizeof(ResultRelInfo));
@@ -3304,7 +3423,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
                 * Note that each of the relations in *partitions are eventually
                 * closed by the caller.
                 */
-               partrel = heap_open(lfirst_oid(cell), NoLock);
+               partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
                part_tupdesc = RelationGetDescr(partrel);
 
                /*
@@ -3317,7 +3436,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
                 * partition from the parent's type to the partition's.
                 */
                (*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, 
part_tupdesc,
-                                                                               
                         gettext_noop("could not convert row type"));
+                                                                
gettext_noop("could not convert row type"));
 
                InitResultRelInfo(leaf_part_rri,
                                                  partrel,
@@ -3354,11 +3473,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
  * by get_partition_for_tuple() unchanged.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
-                                 TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+                                 PartitionTupleRoutingInfo **ptrinfos,
+                                 TupleTableSlot *slot,
+                                 EState *estate)
 {
        int                     result;
-       PartitionDispatchData *failed_at;
+       PartitionTupleRoutingInfo *failed_at;
        TupleTableSlot *failed_slot;
 
        /*
@@ -3368,7 +3489,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
        if (resultRelInfo->ri_PartitionCheck)
                ExecPartitionCheck(resultRelInfo, slot, estate);
 
-       result = get_partition_for_tuple(pd, slot, estate,
+       result = get_partition_for_tuple(ptrinfos, slot, estate,
                                                                         
&failed_at, &failed_slot);
        if (result < 0)
        {
@@ -3378,9 +3499,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
                char       *val_desc;
                ExprContext *ecxt = GetPerTupleExprContext(estate);
 
-               failed_rel = failed_at->reldesc;
+               failed_rel = heap_open(failed_at->relid, NoLock);
                ecxt->ecxt_scantuple = failed_slot;
-               FormPartitionKeyDatum(failed_at, failed_slot, estate,
+               FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
                                                          key_values, 
key_isnull);
                val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
                                                                                
                                key_values,
diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c
index e12721a9b6..c5deed4685 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
        resultRelInfo = estate->es_result_relation_info;
 
        /* Determine the partition to heap_insert the tuple into */
-       if (mtstate->mt_partition_dispatch_info)
+       if (mtstate->mt_ptrinfos)
        {
                int                     leaf_part_index;
                TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
                 * respectively.
                 */
                leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
        mtstate->mt_partition_dispatch_info,
+                                                                               
        mtstate->mt_ptrinfos,
                                                                                
        slot,
                                                                                
        estate);
                Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState 
*mtstate, EState *estate)
                int                     numResultRelInfos;
 
                /* Find the set of partitions so that we can find their 
TupleDescs. */
-               if (mtstate->mt_partition_dispatch_info != NULL)
+               if (mtstate->mt_ptrinfos != NULL)
                {
                        /*
                         * For INSERT via partitioned table, so we need 
TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
        if (operation == CMD_INSERT &&
                rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
-               PartitionDispatch *partition_dispatch_info;
+               PartitionTupleRoutingInfo **ptrinfos;
                ResultRelInfo *partitions;
                TupleConversionMap **partition_tupconv_maps;
                TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
                ExecSetupPartitionTupleRouting(rel,
                                                                           
node->nominalRelation,
                                                                           
estate,
-                                                                          
&partition_dispatch_info,
+                                                                          
&ptrinfos,
                                                                           
&partitions,
                                                                           
&partition_tupconv_maps,
                                                                           
&partition_tuple_slot,
                                                                           
&num_parted, &num_partitions);
-               mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-               mtstate->mt_num_dispatch = num_parted;
+               mtstate->mt_ptrinfos = ptrinfos;
+               mtstate->mt_num_parted = num_parted;
                mtstate->mt_partitions = partitions;
                mtstate->mt_num_partitions = num_partitions;
                mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2336,21 +2336,22 @@ ExecEndModifyTable(ModifyTableState *node)
                                                                                
                                   resultRelInfo);
        }
 
+       /* Release some resources that we acquired for tuple-routing. */
+
        /*
-        * Close all the partitioned tables, leaf partitions, and their indices
-        *
-        * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-        * partitioned table, which we must not try to close, because it is the
-        * main target table of the query that will be closed by ExecEndPlan().
-        * Also, tupslot is NULL for the root partitioned table.
+        * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+        * which we didn't create tupslot.
         */
-       for (i = 1; i < node->mt_num_dispatch; i++)
+       for (i = 1; i < node->mt_num_parted; i++)
        {
-               PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+               PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
 
-               heap_close(pd->reldesc, NoLock);
-               ExecDropSingleTupleTableSlot(pd->tupslot);
+               ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
        }
+
+       /*
+        * Close all the leaf partitions and their indices.
+        */
        for (i = 0; i < node->mt_num_partitions; i++)
        {
                ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..7b53baf847 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
 
 typedef struct PartitionDescData *PartitionDesc;
 
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- *     reldesc         Relation descriptor of the table
- *     key                     Partition key information of the table
- *     keystate        Execution state required for expressions in the 
partition key
- *     partdesc        Partition descriptor of the table
- *     tupslot         A standalone TupleTableSlot initialized with this 
table's tuple
- *                             descriptor
- *     tupmap          TupleConversionMap to convert from the parent's rowtype 
to
- *                             this table's rowtype (when extracting the 
partition key of a
- *                             tuple just before routing it through this table)
- *     indexes         Array with partdesc->nparts members (for details on what
- *                             individual members represent, see how they are 
set in
- *                             RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
  */
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
 {
-       Relation        reldesc;
-       PartitionKey key;
-       List       *keystate;           /* list of ExprState */
-       PartitionDesc partdesc;
-       TupleTableSlot *tupslot;
-       TupleConversionMap *tupmap;
-       int                *indexes;
-} PartitionDispatchData;
+       Oid                             relid;
+       Oid                             parentid;
 
-typedef struct PartitionDispatchData *PartitionDispatch;
+       /*
+        * This contains information about bounds of the partitions of this
+        * table and about where individual partitions are placed in the global
+        * partition tree.
+        */
+       PartitionDispatch pd;
+} PartitionedTableInfo;
 
 extern void RelationBuildPartitionDesc(Relation relation);
 extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
@@ -86,17 +73,18 @@ extern List *map_partition_varattnos(List *expr, int 
target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void RelationGetPartitionDispatchInfo(Relation rel,
+                                                                List 
**ptinfos, List **leaf_part_oids);
+
 /* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int 
*num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
                                          bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot);
 #endif                                                 /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eacbea3c36..44b7cd0fd6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, 
Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
                                                           EState *estate,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
                                                           int *num_parted, int 
*num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-                                 PartitionDispatch *pd,
+                                 PartitionTupleRoutingInfo **ptrinfos,
                                  TupleTableSlot *slot,
                                  EState *estate);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3272c4b315..2dcbb139fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
        Relation        ri_PartitionRoot;
 } ResultRelInfo;
 
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfoData - execution state for the partition key of a
+ *                                               partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key.  It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+       PartitionKey    key;            /* Points into the table's relcache 
entry */
+       List               *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ *                                                        through one 
partitioned table in a partition
+ *                                                        tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+       /* OID of the table */
+       Oid                             relid;
+
+       /* Information about the table's partitions */
+       PartitionDispatch       pd;
+
+       /* See comment above the definition of PartitionKeyInfo */
+       PartitionKeyInfo   *keyinfo;
+
+       /*
+        * A standalone TupleTableSlot initialized with this table's tuple
+        * descriptor
+        */
+       TupleTableSlot *tupslot;
+
+       /*
+        * TupleConversionMap to convert from the parent's rowtype to this 
table's
+        * rowtype (when extracting the partition key of a tuple just before
+        * routing it through this table)
+        */
+       TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
 /* ----------------
  *       EState information
  *
@@ -973,9 +1022,9 @@ typedef struct ModifyTableState
        TupleTableSlot *mt_existing;    /* slot to store existing target tuple 
in */
        List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist 
 */
        TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection 
target */
-       struct PartitionDispatchData **mt_partition_dispatch_info;
        /* Tuple-routing support info */
-       int                     mt_num_dispatch;        /* Number of entries in 
the above array */
+       struct PartitionTupleRoutingInfo **mt_ptrinfos;
+       int                     mt_num_parted;          /* Number of entries in 
the above array */
        int                     mt_num_partitions;      /* Number of members in 
the following
                                                                         * 
arrays */
        ResultRelInfo *mt_partitions;   /* Per partition result relation */
-- 
2.11.0

From df7b8c780fad226b9635b7530018f391acfe7055 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound
 order

After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
 src/backend/optimizer/prep/prepunion.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/src/backend/optimizer/prep/prepunion.c 
b/src/backend/optimizer/prep/prepunion.c
index e73c819901..68d0d8efa3 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -1452,6 +1453,38 @@ expand_inherited_rtentry(PlannerInfo *root, 
RangeTblEntry *rte, Index rti)
         */
        oldrelation = heap_open(parentOID, NoLock);
 
+       /*
+        * For partitioned tables, we arrange the child table OIDs such that 
they
+        * appear in the partition bound order.
+        */
+       if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+       {
+               List   *leaf_part_oids,
+                          *ptinfos;
+
+               /* Discard the original list. */
+               list_free(inhOIDs);
+               inhOIDs = NIL;
+
+               /* Request partitioning information. */
+               RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+                                                                               
 &leaf_part_oids);
+
+               /*
+                * First collect the partitioned child table OIDs, which 
includes the
+                * root parent at the head.
+                */
+               foreach(l, ptinfos)
+               {
+                       PartitionedTableInfo *ptinfo = lfirst(l);
+
+                       inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
+               }
+
+               /* Concatenate the leaf partition OIDs. */
+               inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+       }
+
        /* Scan the inheritance set and expand it */
        appinfos = NIL;
        has_child = false;
-- 
2.11.0

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to