On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: >> Personally, I would be more inclined to make this a CREATE TABLE statement, >> like >> >> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... >> CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... > > I guess the first of which would actually be: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... > PARTITION BY ... > > Some might think that writing potentially the same PARTITION BY clause 100 > times for 100 level-1 partitions could be cumbersome. That is what > SUBPARTITION BY notation may be good as a shorthand for.
I think that anybody who is doing that much partitioning is going to use a tool to generate the DDL anyway, so it doesn't really matter. >> I think if you've got SUBPARTITION as a keyword in the syntax >> anywhere, you're doing it wrong. The toplevel object shouldn't really >> care whether its children are themselves partitioned or not. > > This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It > might as well be just cascaded PARTITION BYs. The point is to specify as > many of those with CREATE TABLE toplevel as the number of levels of > partitioning we want. That does however fix the number of levels in advance. Which I think is not good. If we say that a table can be partitioned, and a table that is a partition can also be partitioned, we've got a nice general system. Fixing the number of partitioning levels for the sake of a little syntactic sugar is, IMHO, getting our priorities backwards. > In the patch I have posted, here are some details of the tuple routing > implementation for instance - the top-level parent knows only of its > immediate partitions. Once a level-1 partition from that list is > determined using a tuple's level-1 key, the tuple is passed down to choose > one of its own partitions using the level-2 key. The key descriptor list > is associated with the top-level parent alone and the recursive code knows > to iterate the key list to apply nth key for nth level. The recursion > happens in the partition module. I haven't looked at the code, but that sounds like the right approach. > Now there are also functions to let obtain, say *all* or only *leaf* > partitions (OIDs) of a top-level parent but they are used for certain DDL > scenarios. As far as DML is concerned, the level-at-a-time recursive > approach as described above is used. Queries not yet because the plan is a > flattened append of leaf partitions anyway. OK. > If such notation convenience at the expense of loss of generality is not > worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax. > > CREATE TABLE toplevel() PARTITION BY > CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY > CREATE TABLE subpartition PARTITION OF partition FOR VALUES > ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE > ALTER TABLE partitioned DETACH PARTITION name [ USING newname ] > > While we are on the syntax story, how about FOR VALUES syntax for range > partitions (sorry for piggybacking it here in this message). At least some > people dislike LESS THAN notation. Corey Huinker says we should be using > range type literals for that. It's not clear though that using range type > literals directly is a way ahead. For one, range type API expects there to > exist a range type with given element type. Whereas all we require for > range partitioning proper is for column type to have a btree operator > class. Should we require it to have an associated range type as well? > Don't think that there exists an API to check that either. All in all, > range types are good to implement things in applications but not so much > within the backend (unless I'm missing something). I know reinventing the > wheel is disliked as well but perhaps we could offer something like the > following because Corey offered some examples which would help from the > flexibility: > > START [ EXCL ] (startval) END [ INCL ] (endval) I don't think using range type literals is gonna work. There's no guarantee that the necessary range types exist. However, we could possibly use a syntax inspired by the syntax for range types. I'm a little nervous about asking people to type commands with mismatching braces: CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); ...but maybe it's a good idea. It certainly has the advantage of being more compact than a lot of there ways we might choose to write that. And I think LESS THAN sucks. It's just clunky and awkward syntax. > Some people said something akin to interval partitioning would be good like: > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START > WITH value; > > But that could be just a UI addition to the design where each partition > has [startval, endval) bounds. In any case, not a version 1 material I'd > think. Agreed; not right now. > I see the trade-off. I agree that considering the significance for attach > partition case is quite important. > > So, the tuple routing code should be ready to use projection if there > happens to be a partition with differing tuple descriptor. In the code I > posted, a ResultRelInfo is lazily built afresh for each inserted tuple in > ExecInsert's case and for each tuple where the chosen partition is > different from the previous tuple's in CopyFrom's case. One can feel that > there is a certain overhead to that approach for the bulk-loading case > (almost every CopyFrom invocation). Now if projection enters this picture, > we have to consider that too. Should we initialize ResultRelInfo's and > corresponding ProjectionInfo's for all partitions beforehand? Consider how > ModifyTable currently handles update on inheritance set, for example. That > would incur unnecessary overhead if only a single tuple is inserted. But > it would certainly help bulk-loading case. Am I missing something? I'd say you are missing benchmarking data. :-) Why should the single-tuple case be harmed here? > Sorry for rather incoherent articulation in the previous message. Let me > try to say something that's perhaps more specific. > > Consider that we create partitions as inheritance children, that is, > creating a partition using: > > CREATE TABLE partition PARTITION OF parent FOR VALUES ... > > is equivalent to saying > > CREATE TABLE partition (...) INHERITS (parent) > > except that the latter allows partition to have its own column > definitions, table constraints and multiple parents as long as things are > conflict-free. The former requires specifying partition bounding values. > The common thing between the two then is StoreCatalogInheritance(partOid, > parentOid) that will mark partition as inheritance child of parent in > pg_inherits. So, our new partitions are really inheritance children but > that's not apparent to users (where this last bit is important). So far, this sounds good to me. > Then consider ALTER TABLE partition - should we need to handle it in way > different from existing inheritance code would do - > > * Prevent something whereas regular inheritance wouldn't? > * Do something instead of/in addition to whatever regular inheritance does? > > Consider adding a column to partition - regular inheritance wouldn't > prevent it because adding a column to child table doesn't get in the way > of how inheritance works. Now we can make it so that if table is a > partition created with CRATE TABLE PARTITION OF (which we can tell because > the table has pg_partition entry), we should *actively* prevent adding a > column. Inheritance already prevents dropping an inherited attribute in > child but says "cannot drop inherited column". IMO in a partition's case, > it should say "cannot drop columns of a partition". Agreed. So maybe pg_inherits gets an additional column that indicates whether this is "regular" inheritance or "partitioning" inheritance, and in the latter case there are additional restrictions. > OTOH, adding a column to parent propagates to all partitions by way of > ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is > necessary for partitioned tables (which we can tell from relkind or maybe > because table has pg_partitioned_rel entry) - especially no need to > propagate column default and check constraint, etc. Because one cannot > apply insert/update on partitions anyway. Hmm. I feel like maybe that should be allowed. If the user wants to bulk-load a bunch of data and knows that it goes in a particular partition, why should they have to pay the overhead of determining that anew for each row? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers