Hi, everyone! I'd like to propose a table partitioning implementation for
PostgreSQL. First, I would like to present the design and discuss it. If we
agree that the design is reasonable and could be committed to the PostgreSQL
code base, then I will post the code to the community.
(Note: my English is not very good.)

Table Partition Design
======================
In this design, partitions are normal tables in inheritance hierarchies, with
the same table structure as the partitioned table.

In pg_class we have an additional relpartition field, which takes one of the
following values:
's'        /* single regular table */
'r'        /* partitioned table by range */
'l'        /* partitioned table by list */
'h'        /* partitioned table by hash */
'c'        /* child partition table */
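
As a rough illustration, the relpartition values could be written as character
constants in the style of the relkind macros. This is only a sketch: the macro
names and the helper function are hypothetical; only the character values come
from the list above.

```c
#include <stdbool.h>

/* Hypothetical macro names; only the character values are from the design. */
#define RELPARTITION_SINGLE  's'   /* single regular table */
#define RELPARTITION_RANGE   'r'   /* partitioned table by range */
#define RELPARTITION_LIST    'l'   /* partitioned table by list */
#define RELPARTITION_HASH    'h'   /* partitioned table by hash */
#define RELPARTITION_CHILD   'c'   /* child partition table */

/* True if the relation is itself a partitioned (parent) table. */
static bool
relpartition_is_partitioned(char relpartition)
{
    return relpartition == RELPARTITION_RANGE ||
           relpartition == RELPARTITION_LIST ||
           relpartition == RELPARTITION_HASH;
}
```

With such a helper, code that needs to know whether a relation has a partition
catalog could test the field once instead of comparing against each value.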

We add a new system schema named 'pg_partition', just like 'pg_toast', in which
we can create a partition catalog table to store the partition entries. Let's
assume the partition catalog's name is pg_partition_2586 (2586 is the
partitioned table's OID in pg_class).
The structure of a range or interval partition catalog is as follows:
column      data type                      comment
partname    name                           the partition's name (primary key)
partid      oid                            the partition's OID in pg_class
interval    text                           an interval partition's interval
                                           (may be an expression)
partkey1    depends on partitioned table
...
partkeyN    depends on partitioned table
partkey1, ..., partkeyN together form the partition's upper bound.
Finally, we create a unique constraint on partkey1, ..., partkeyN.
Every time we create a new partition, we insert a new tuple into this partition 
catalog.
Every time we drop an old partition, we delete the related tuple in this 
partition catalog.
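
To make the catalog maintenance concrete, here is a minimal in-memory sketch
of these rules. It is not the proposed implementation: the struct and function
names are hypothetical, a single int key stands in for partkey1..partkeyN, and
a fixed-size array stands in for the catalog table.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef unsigned int Oid;          /* stand-in for PostgreSQL's Oid */
#define MAX_PARTS 16               /* arbitrary limit for this sketch */

/* One catalog row; a single int key stands in for partkey1..partkeyN. */
typedef struct PartEntry
{
    char partname[64];             /* primary key */
    Oid  partid;                   /* partition's OID in pg_class */
    int  upper;                    /* upper bound */
} PartEntry;

typedef struct PartCatalog
{
    char      name[64];            /* e.g. "pg_partition_2586" */
    int       nparts;
    PartEntry parts[MAX_PARTS];
} PartCatalog;

/* Name the catalog after the partitioned table's OID. */
static void
catalog_init(PartCatalog *cat, Oid parent_oid)
{
    snprintf(cat->name, sizeof(cat->name), "pg_partition_%u", parent_oid);
    cat->nparts = 0;
}

/* Creating a partition inserts a row; a duplicate name or bound is rejected. */
static bool
catalog_add(PartCatalog *cat, const char *partname, Oid partid, int upper)
{
    if (cat->nparts == MAX_PARTS)
        return false;
    for (int i = 0; i < cat->nparts; i++)
        if (cat->parts[i].upper == upper ||
            strcmp(cat->parts[i].partname, partname) == 0)
            return false;

    PartEntry *e = &cat->parts[cat->nparts++];
    snprintf(e->partname, sizeof(e->partname), "%s", partname);
    e->partid = partid;
    e->upper = upper;
    return true;
}

/* Dropping a partition deletes its row (the last row fills the gap). */
static bool
catalog_drop(PartCatalog *cat, const char *partname)
{
    for (int i = 0; i < cat->nparts; i++)
        if (strcmp(cat->parts[i].partname, partname) == 0)
        {
            cat->parts[i] = cat->parts[--cat->nparts];
            return true;
        }
    return false;
}
```

In the real design the uniqueness checks would be enforced by the primary key
and the unique constraint on the catalog table rather than by a linear scan.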

For the CREATE action on a partitioned table, we should transform the action
into CREATE actions for the partitioned table and its partitions, plus INSERT
actions into the partition catalog.

For the INSERT action, we implement a RelationGetTuplePartid method, which
finds the partition a tuple belongs to. It does an index scan on the partition
catalog table (assume it is pg_partition_2586) to find the partition. We also
implement an ExecGetPartitionResultRel method, which returns the partition's
ResultRelInfo used to execute the INSERT action.
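
The lookup step can be sketched as a binary search over upper bounds. This is
an illustration only, not RelationGetTuplePartid itself: the names are
hypothetical, a sorted array stands in for the index scan on the partition
catalog, the key is a single int, and upper bounds are assumed to be exclusive
(the design above does not say whether they are).

```c
#include <stddef.h>

typedef unsigned int Oid;          /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)

typedef struct RangeBound
{
    int upper;                     /* exclusive upper bound (assumption) */
    Oid partid;                    /* partition's OID in pg_class */
} RangeBound;

/*
 * Return the OID of the first partition whose upper bound is greater than
 * key, or InvalidOid if key lies beyond the last partition.
 * bounds[] must be sorted by upper in ascending order.
 */
static Oid
find_partition(const RangeBound *bounds, size_t n, int key)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (key < bounds[mid].upper)
            hi = mid;              /* key fits in mid or an earlier partition */
        else
            lo = mid + 1;          /* key is at or above mid's upper bound */
    }
    return lo < n ? bounds[lo].partid : InvalidOid;
}
```

With bounds 100, 200 and 300, a key of 100 routes to the second partition under
the exclusive-bound assumption, and a key of 300 finds no partition.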

For scan and JOIN actions on a partitioned table, we implemented a plan node
named 'PartitionExpand'. This plan node expands the partitioned table's scan
node into a list of partition scan nodes according to the filter and
conditions, and it expands a partitioned table's JOIN node into a list of
partition-wise JOIN nodes.
We implemented a DynamicPrunePartition method, which expands the partitioned
table's scan node into a list of partition scan nodes.
We implemented a DynamicPrunePartitionJoin method, which expands the
partitioned table's JOIN node into a list of partition JOIN nodes.
These expansions happen in the ExecInitPartitionExpand function, when the
executor is initialized, and all of them are based on the partition catalog.
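
The pruning idea for a range filter can be sketched as follows. This is not
DynamicPrunePartition: the names are hypothetical, the key is a single int,
upper bounds are assumed exclusive, and each partition is assumed to start at
the previous partition's upper bound.

```c
#include <stddef.h>

typedef struct PartRange
{
    int          upper;            /* exclusive upper bound (assumption) */
    unsigned int partid;           /* partition's OID in pg_class */
} PartRange;

/*
 * Select the partitions that can hold keys in [qlo, qhi] (both inclusive).
 * Partition i covers [bounds[i-1].upper, bounds[i].upper); the first
 * partition has no lower bound. bounds[] must be sorted by upper.
 * Matching OIDs are written to out[], which needs room for n entries.
 * Returns the number of partitions kept.
 */
static size_t
prune_partitions(const PartRange *bounds, size_t n,
                 int qlo, int qhi, unsigned int *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* Skip partitions that end at or below the start of the filter. */
        if (bounds[i].upper <= qlo)
            continue;
        /* Stop once a partition starts above the end of the filter. */
        if (i > 0 && bounds[i - 1].upper > qhi)
            break;
        out[nout++] = bounds[i].partid;
    }
    return nout;
}
```

Only the surviving partitions would then get scan nodes; a join expansion could
apply the same filter to both sides before pairing partitions.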

For UPDATE and DELETE actions, we simply set the actual partition as the
ResultRelInfo while ExecPartitionExpand is running.

For the pg_dump backup action, we should dump the partition catalog and the
relpartition field in pg_class.

So these are the main points of the design, and I can explain any details you
are interested in later.
