Hi Amit,

I will be traveling until next Tuesday and will have no access to email so don't be surprised if I don't follow up this week. The overall approach seems sound. The metadata table should help also for DDL to find out overlapping ranges or duplicate list entries. So far, I have not tried to use the SPI interface from a C trigger so I don't see any disadvantage yet. We would have to assess the performance to make sure it's not going to be a show stopper. I think that the main issue of the trigger approach is that other triggers might interfere. The 'partition trigger' must be the last of the 'before insert' triggers and if the destination (child) table has a trigger, we must ensure that this trigger is not going to require a new routing. Another issue is the result that will be returned by insert/copy commands if all tuples are moved to other tables, the result will be 0. We might want to have stats that would collect where tuples where moved for a particular command (I don't know yet what would be the best place to collect these stats but they could probably be updated by the trigger). Also would the trigger be attached to all tables in the hierarchy or only to the top parent? What kind of query would you use with more than 1 level of inheritance (e.g. parent=year, child=month, grand-child=day)? It looks like we have to parse the leaves of the graph but intermediate nodes would help accelerating the search.

An alternative approach (I haven't assessed the feasibility yet) would be to try to call the query planner. If we ask to select the partition value of the tuple, the query planner should return the table it is going to scan (as in EXPLAIN SELECT * FROM t WHERE key=$1).

Let me know what you think,
Emmanuel

We are considering to following approach:
1. metadata table pg_partitions is defined as follows:
CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
{
    Oid         partrelid;     // partition table Oid
    Oid         parentrelid;  // Parent table Oid
    int4        parttype;   // Type of partition, list, hash, range
    Oid        partkey;    // partition key Oid
    Oid        keytype;  ///  type of partition key.
    int4        keyorder  /// order of the key in multi-key partitions.
    text        min;
    text        max;  // min and max for range parti
    text[]      list;
    int        hash;  // hash value
} FormData_pg_partitions;


2. C triggers will fire a query on this table to get the relevant
partition of the inserted/updated data using SPI interface. The query
will look something like (for range partitioning)

select min(partrelid)
from pg_partitions
where parentrelid = 2934  // we know this value
and (
( $1 between to_int(min ) and to_int(max) and
  keyorder  = 1) OR
($2 between to_date (min) and to_date (max) and
keyorder =2 )
....
)
group by
parentrelid
having
count(*) = <number of partition keys>

$1, $2, ... are the placeholders of the actual partition key values of
trigger tuple.

Since we know the type of partition keys, and the parentrelid, this
kind of query string can be saved in another table say, pg_part_map.
And its plan can be parsed once and saved in cache to be reused.
Do you see any issue with using SPI interface within triggers?

The advantage of this kind of approah is that trigger code can be made
genric for any kind of partition table.

Thanks,
Amit
Persistent Systems,
www.persistentsys.com





On 1/23/09, Emmanuel Cecchet <m...@frogthinker.org> wrote:
Amit,

 You might want to put this on the
http://wiki.postgresql.org/wiki/Table_partitioning wiki
page.
 How does your timeline look like for this implementation?
 I would be happy to contribute C triggers to your implementation. From what
I understood in
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
you already have an implementation that parses the grammar and generates
rules as if someone had written them. Is this code available?

 Regarding the use of triggers to push/move data to partitions, what if
someone declares triggers on partitions? Especially if you have
subpartitions, let's consider the case where there is a trigger on the
parent, child and grandchild. If I do an insert in the parent, the user
trigger on the parent will be executed, then the partition trigger that
decides to move to the grandchild. Are we going to bypass the child trigger?
 If we also want fast COPY operations on partitioned table, we could have an
optimized implementation that could bypass triggers and move the tuple
directly to the appropriate child table.

 Thanks for this big contribution,
 Emmanuel


Hi,

We are implementing table partitioning feature to support
- the attached commands. The syntax conforms to most of the suggestion
mentioned in
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
barring the following:
-- Specification of partition names is optional. System will be able to
generate partition names in such cases.
-- sub partitioning
 We are using pgsql triggers to push/move data to appropriate partitions,
but we will definitely consider moving to C language triggers as suggested
by manu.
- Global non-partitioned indexes (that will extend all the partitions).
- Foreign key support for tables referring to partitioned tables.

Please feel free to post your comments and suggestions.

Thanks,
Amit
Persistent Systems




------------------------------------------------------------------------



--
Emmanuel Cecchet
FTO @ Frog Thinker Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to