Re: [HACKERS] Declarative partitioning - another take

Robert Haas Thu, 17 Nov 2016 11:15:06 -0800

On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Meanwhile, here are updated patch that address most of the following comments.


OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
there, and I think the entire comment block that starts with "NOTE:
SQL specifies that a NULL" and ends with "it's unlikely that NULL
would result." should be changed to say something like /* As for
catalogued constraints, we treat a NULL result as a success, not a
failure. */ rather than duplicating an existing comment that doesn't
quite apply here.  Finally, ExecConstraints() contains a new if-block
whose sole contents are another if-block.  Perhaps if (this && that)
would be better.

Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
you've chosen is awfully convoluted and probably not that efficient.
For example, get_partition_for_tuple() contains this loop:

+       prev = parent;
+       node = parent->downlink;
+       while (node != NULL)
+       {
+               if (node->index >= cur_idx)
+                       break;
+
+               prev = node;
+               node = node->next;
+       }

Well, it looks to me like that's an O(n) way to find the n'th
partition, which seems like a pretty bad idea in performance-critical
code, which this is.  I think this whole structure needs a pretty
heavy overhaul.  Here's a proposal:

1. Forget the idea of a tree.  Instead, let the total number of tables
in the partitioning hierarchy be N and let the number of those that
are partitioned be K.  Assign each partitioned table in the hierarchy
an index between 0 and K-1.  Make your top level data structure (in
lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
objects, with the partitioning root in entry 0 and the rest in the
remaining entries.

2. Within each PartitionDispatch object, store (a) a pointer to a
PartitionDesc and (b) an array of integers of length equal to the
PartitionDesc's nparts value.  Each integer i, if non-negative, is the
final return value for get_partition_for_tuple.  If i == -1, tuple
routing fails.  If i < -1, we must next route using the subpartition
whose PartitionDesc is at index -(i+1).  Arrange for the array to be
in the same order the PartitionDesc's OID list.

3. Now get_partition_for_tuple looks something like this:

K = 0
loop:
    pd = PartitionDispatch[K]
    idx = list/range_partition_for_tuple(pd->partdesc, ...)
    if (idx >= -1)
        return idx
    K = -(idx + 1)

No recursion, minimal pointer chasing, no linked lists.  The whole
thing is basically trivial aside from the cost of
list/range_partition_for_tuple itself; optimizing that is a different
project.  I might have some details slightly off here, but hopefully
you can see what I'm going for: you want to keep the computation that
happens in get_partition_for_tuple() to an absolute minimum, and
instead set things up in advance so that getting the partition for a
tuple is FAST.  And you want the data structures that you are using in
that process to be very compact, hence arrays instead of linked lists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Declarative partitioning - another take

Reply via email to