Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 27-01-2015 AM 05:46, Jim Nasby wrote: > On 1/25/15 7:42 PM, Amit Langote wrote: >> On 21-01-2015 PM 07:26, Amit Langote wrote: >>> Ok, I will limit myself to focusing on following things at the moment: >>> >>> * Provide syntax in CREATE TABLE to declare partition key >> >> While working on this, I stumbled upon the question of how we deal with >> any index definitions following from constraints defined in a CREATE >> statement. I think we do not want to have a physical index created for a >> table that is partitioned (in other words, has no heap of itself). As >> the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE, >> EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really >> two decisions to make here: >> >> 1) how do we deal with any index definitions (either explicit or >> implicit following from constraints defined on it) - do we allow them by >> marking them specially, say, in pg_index, as being mere >> placeholders/templates or invent some other mechanism? >> >> 2) As a short-term solution, do we simply reject creating any indexes >> (/any constraints that require them) on a table whose definition also >> includes PARTITION ON clause? Instead define them on its partitions (or >> any relations in hierarchy that are not further partitioned). >> >> Or maybe I'm missing something... > > Wasn't the idea that the parent table in a partitioned table wouldn't > actually have a heap of it's own? If there's no heap there can't be an > index. > Yes, that's right. Perhaps, we should look at heap-less partitioned relation thingy not so soon as you say below. > That said, I think this is premature optimization that could be done later. It seems so. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 1/25/15 7:42 PM, Amit Langote wrote: On 21-01-2015 PM 07:26, Amit Langote wrote: Ok, I will limit myself to focusing on following things at the moment: * Provide syntax in CREATE TABLE to declare partition key While working on this, I stumbled upon the question of how we deal with any index definitions following from constraints defined in a CREATE statement. I think we do not want to have a physical index created for a table that is partitioned (in other words, has no heap of itself). As the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE, EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really two decisions to make here: 1) how do we deal with any index definitions (either explicit or implicit following from constraints defined on it) - do we allow them by marking them specially, say, in pg_index, as being mere placeholders/templates or invent some other mechanism? 2) As a short-term solution, do we simply reject creating any indexes (/any constraints that require them) on a table whose definition also includes PARTITION ON clause? Instead define them on its partitions (or any relations in hierarchy that are not further partitioned). Or maybe I'm missing something... Wasn't the idea that the parent table in a partitioned table wouldn't actually have a heap of it's own? If there's no heap there can't be an index. That said, I think this is premature optimization that could be done later. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 21-01-2015 PM 07:26, Amit Langote wrote: > Ok, I will limit myself to focusing on following things at the moment: > > * Provide syntax in CREATE TABLE to declare partition key While working on this, I stumbled upon the question of how we deal with any index definitions following from constraints defined in a CREATE statement. I think we do not want to have a physical index created for a table that is partitioned (in other words, has no heap of itself). As the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE, EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really two decisions to make here: 1) how do we deal with any index definitions (either explicit or implicit following from constraints defined on it) - do we allow them by marking them specially, say, in pg_index, as being mere placeholders/templates or invent some other mechanism? 2) As a short-term solution, do we simply reject creating any indexes (/any constraints that require them) on a table whose definition also includes PARTITION ON clause? Instead define them on its partitions (or any relations in hierarchy that are not further partitioned). Or maybe I'm missing something... Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 21-01-2015 AM 01:42, Robert Haas wrote: > On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote > wrote: Specifically, do we regard a partitions as pg_inherits children of its partitioning parent? >>> >>> I don't think this is totally an all-or-nothing decision. I think >>> everyone is agreed that we need to not break things that work today -- >>> e.g. Merge Append. What that implies for pg_inherits isn't altogether >>> clear. >> >> One point is that an implementation may end up establishing the >> parent-partition hierarchy somewhere other than (or in addition to) >> pg_inherits. One intention would be to avoid tying partitioning scheme >> to certain inheritance features that use pg_inherits. For example, >> consider call sites of find_all_inheritors(). One notable example is >> Append/MergeAppend which would be of interest to partitioning. We would >> want to reuse that part of the infrastructure but we could might as well >> write an equivalent, say find_all_partitions() which scans something >> other than pg_inherits to get all partitions. > > IMHO, there's little reason to avoid putting pg_inherits entries in > for the partitions, and then this just works. We can find other ways > to make it work if that turns out to be better, but if we don't have > one, there's no reason to complicate things. > Ok, I will go forward and stick to pg_inherits approach for now. Perhaps the concerns I am expressing have other solutions that don't require abandoning pg_inherits approach altogether. >> Agree that some concrete idea of internal representation should help >> guide the catalog structure. If we are going to cache the partitioning >> info in relcache (which we most definitely will), then we should try to >> make sure to consider the scenario of having a lot of partitioned tables >> with a lot of individual partitions. It looks like it would be similar >> to a scenarios where there are a lot of inheritance hierarchies. But, >> availability of partitioning feature would definitely cause these >> numbers to grow larger. Perhaps this is an important point driving this >> discussion. > > Yeah, it would be good if the costs of supporting, say, 1000 > partitions were negligible. > >> A primary question for me about partition-pruning is when do we do it? >> Should we model it after relation_excluded_by_constraints() and hence >> totally plan-time? But, the tone of the discussion is that we postpone >> partition-pruning to execution-time and hence my perhaps misdirected >> attempts to inject it into some executor machinery. > > It's useful to prune partitions at plan time, because then you only > have to do the work once. But sometimes you don't know enough to do > it at plan time, so it's useful to do it at execution time, too. > Then, you can do it differently for every tuple based on the actual > value you have. There's no point in doing 999 unnecessary relation > scans if we can tell which partition the actual run-time value must be > in. But I think execution-time pruning can be a follow-on patch. If > you don't restrict the scope of the first patch as much as possible, > you're not going to have much luck getting this committed. > Ok, I will limit myself to focusing on following things at the moment: * Provide syntax in CREATE TABLE to declare partition key * Provide syntax in CREATE TABLE to declare a table as partition of a partitioned table and values it contains * Arrange to have partition key and values stored in appropriate catalogs (existing or new) * Arrange to cache partitioning info of partitioned tables in relcache Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote wrote: >>> Specifically, do we regard a partitions as pg_inherits children of its >>> partitioning parent? >> >> I don't think this is totally an all-or-nothing decision. I think >> everyone is agreed that we need to not break things that work today -- >> e.g. Merge Append. What that implies for pg_inherits isn't altogether >> clear. > > One point is that an implementation may end up establishing the > parent-partition hierarchy somewhere other than (or in addition to) > pg_inherits. One intention would be to avoid tying partitioning scheme > to certain inheritance features that use pg_inherits. For example, > consider call sites of find_all_inheritors(). One notable example is > Append/MergeAppend which would be of interest to partitioning. We would > want to reuse that part of the infrastructure but we could might as well > write an equivalent, say find_all_partitions() which scans something > other than pg_inherits to get all partitions. IMHO, there's little reason to avoid putting pg_inherits entries in for the partitions, and then this just works. We can find other ways to make it work if that turns out to be better, but if we don't have one, there's no reason to complicate things. > Agree that some concrete idea of internal representation should help > guide the catalog structure. If we are going to cache the partitioning > info in relcache (which we most definitely will), then we should try to > make sure to consider the scenario of having a lot of partitioned tables > with a lot of individual partitions. It looks like it would be similar > to a scenarios where there are a lot of inheritance hierarchies. But, > availability of partitioning feature would definitely cause these > numbers to grow larger. Perhaps this is an important point driving this > discussion. Yeah, it would be good if the costs of supporting, say, 1000 partitions were negligible. > A primary question for me about partition-pruning is when do we do it? > Should we model it after relation_excluded_by_constraints() and hence > totally plan-time? But, the tone of the discussion is that we postpone > partition-pruning to execution-time and hence my perhaps misdirected > attempts to inject it into some executor machinery. It's useful to prune partitions at plan time, because then you only have to do the work once. But sometimes you don't know enough to do it at plan time, so it's useful to do it at execution time, too. Then, you can do it differently for every tuple based on the actual value you have. There's no point in doing 999 unnecessary relation scans if we can tell which partition the actual run-time value must be in. But I think execution-time pruning can be a follow-on patch. If you don't restrict the scope of the first patch as much as possible, you're not going to have much luck getting this committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 20-01-2015 AM 10:48, Amit Langote wrote: > On 17-01-2015 AM 02:34, Robert Haas wrote: >> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote >> wrote: >>> * It is desirable to treat partitions as pg_class relations with perhaps >>> a new relkind(s). We may want to choose an implementation where only the >>> lowest level relations in a partitioning hierarchy have storage; those >>> at the upper layers are mere placeholder relations though of course with >>> associated constraints determined by partitioning criteria (with >>> appropriate metadata entered into the additional catalogs). >> >> I think the storage-less parents need a new relkind precisely to >> denote that they have no storage; I am not convinced that there's any >> reason to change the relkind for the leaf nodes. But that's been >> proposed, so evidently someone thinks there's a reason to do it. >> > > Again, this remains partly tied to decisions we make regarding catalog > structure. > > I am not sure but wouldn't we ever need to tell from a pg_class entry > that a leaf relation has partition bounds associated with it? One reason > I can see that we may not need it is that we would rather use > relispartitioned of a non-leaf relation to trigger finding all its > partitions and their associated bounds; we don't need to know (or > reserve a field for) that a relation has partition bounds associated > with it. The bounds can be stored in pg_partition indexed by relid. > Maybe relkind is not the right field for this anyway. > > With that said, would we be comfortable with putting partition key into > pg_class (maybe as a pg_node_tree also encapsulating opclass) so that if > relispartitioned, also look for relpartkey? > This paints a picture that our leaf relations would be plain old relations. They are almost similar in all respects (how they are planned, modified, maintained, ...). They just have an additional property that the values they can contain are restricted by, say, pg_partition.values; but it doesn't concern how they are planned. Planning related changes are confined to upper layers of the hierarchy instead. Kinda like saying instead of doing relation_excluded_by_constraints(childrel), we'd instead say prune_useless_partitions(&partitionedrel) possibly at some other site than its counterpart. Guess that illustrates the point. I am not sure again if we want to limit access to individual partitions unless via some special syntax, then what that means for the above. We have been discussing that. Such access limiting could (only) be facilitated by a new relkind. On the other hand, the non-leaf relations are slightly new kind of relations in that they do not have storage (they could have a tablespace which would be the default tablespace for its underlying partitions). Obviously they do not have indexes pointing at them. Because they are further partitioned, they are differently planned - most probably Append with partition-pruning (almost like Append with constraint-exclusion but supposedly quicker because of the explicit access to partition definitions and perhaps execution-time). INSERT/COPY on these involve routing tuple to the appropriate leaf relation. Not surprisingly, this is almost similar to the picture that Alvaro had presented modulo some differences. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 17-01-2015 AM 02:34, Robert Haas wrote: > On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote > wrote: >> * It has been repeatedly pointed out that we may want to decouple >> partitioning from inheritance because implementing partitioning as an >> extension of inheritance mechanism means that we have to keep all the >> existing semantics which might limit what we want to do with the special >> case of it which is partitioning; in other words, we would find >> ourselves in difficult position where we have to inject a special case >> code into a very generalized mechanism that is inheritance. >> Specifically, do we regard a partitions as pg_inherits children of its >> partitioning parent? > > I don't think this is totally an all-or-nothing decision. I think > everyone is agreed that we need to not break things that work today -- > e.g. Merge Append. What that implies for pg_inherits isn't altogether > clear. > One point is that an implementation may end up establishing the parent-partition hierarchy somewhere other than (or in addition to) pg_inherits. One intention would be to avoid tying partitioning scheme to certain inheritance features that use pg_inherits. For example, consider call sites of find_all_inheritors(). One notable example is Append/MergeAppend which would be of interest to partitioning. We would want to reuse that part of the infrastructure but we could might as well write an equivalent, say find_all_partitions() which scans something other than pg_inherits to get all partitions. Now, we may not want to do that and instead add special case code to prevent partitioning from fiddling with unnecessary inheritance features in the code paths of inheritance. This seems like an important decision to make. >> * Syntax: do we want to make it similar to one of the many other >> databases out there? Or we could invent our own? > > Well, what I think we don't want is something that is *almost* like > some other database but not quite. I lean toward inventing our own > since I'm not aware of something that we'd want to copy exactly. > >> I wonder if we could add a clause like DISTRIBUTED BY to complement >> PARTITION ON that represents a hash distributed/partitioned table (that >> could be a syntax to support sharded tables maybe; we would definitely >> want to move ahead in that direction I guess) > > Maybe eventually, but let's not complicate things by worrying too much > about that now. > Agree that we may not want to mix the two too much at this point. >> * Catalog: We would like to have a catalog structure suitable to >> implement capabilities like multi-column partitioning, sub-partitioning >> (with arbitrary number of levels in the hierarchy). I had suggested >> that we create two new catalogs viz. pg_partitioned_rel, >> pg_partition_def to store metadata about a partition key of a >> partitioned relation and partition bound info of a partition, >> respectively. Also, see the point about on-disk representation of >> partition bounds > > I'm not convinced that there is any benefit in spreading this > information across two tables neither of which exist today. If the > representation of the partitioning scheme is going to be a node tree, > then there's no point in taking what would otherwise have been a List > and storing each element of it in a separate tuple. The overarching > point here is that the system catalog structure should be whatever is > most convenient for the system internals; I'm not sure we understand > what that is yet. > Agree that some concrete idea of internal representation should help guide the catalog structure. If we are going to cache the partitioning info in relcache (which we most definitely will), then we should try to make sure to consider the scenario of having a lot of partitioned tables with a lot of individual partitions. It looks like it would be similar to a scenarios where there are a lot of inheritance hierarchies. But, availability of partitioning feature would definitely cause these numbers to grow larger. Perhaps this is an important point driving this discussion. I guess this remains tied to the decision we would like make regarding inheritance (pg_inherits, etc.) >> * It is desirable to treat partitions as pg_class relations with perhaps >> a new relkind(s). We may want to choose an implementation where only the >> lowest level relations in a partitioning hierarchy have storage; those >> at the upper layers are mere placeholder relations though of course with >> associated constraints determined by partitioning criteria (with >> appropriate metadata entered into the additional catalogs). > > I think the storage-less parents need a new relkind precisely to > denote that they have no storage; I am not convinced that there's any > reason to change the relkind for the leaf nodes. But that's been > proposed, so evidently someone thinks there's a reason to do it. > Again, this remains partly tied to decisions we make regarding catalog struct
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 19-01-2015 PM 12:37, Ashutosh Bapat wrote: > On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas wrote: > >> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote >> wrote: >> >>> I wonder if we could add a clause like DISTRIBUTED BY to complement >>> PARTITION ON that represents a hash distributed/partitioned table (that >>> could be a syntax to support sharded tables maybe; we would definitely >>> want to move ahead in that direction I guess) >> >> Maybe eventually, but let's not complicate things by worrying too much >> about that now. >> > > Instead we might want to specify which server (foreign or local) each of > the partition go to, something like LOCATED ON clause for each of the > partitions with default as local server. > Given how things stand today, we do not allow DDL with the FDW interface, unless I'm missing something. So, we are restricted to only going the other way around, say, CREATE FOREIGN TABLE partXX PARTITION OF parent SERVER ...; assuming we like the proposed syntax - CREATE TABLE child PARTITION OF parent; I think this is also assuming we are relying on foreign table inheritance. That is, both that partitioning is based on inheritance and foreign tables support inheritance (which should be the case soon) Still, I think Robert may be correct in that it would not be sooner that we integrate foreign tables with partitioning scheme (I guess mostly the syntax aspect of it). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas wrote: > On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote > wrote: > > * It has been repeatedly pointed out that we may want to decouple > > partitioning from inheritance because implementing partitioning as an > > extension of inheritance mechanism means that we have to keep all the > > existing semantics which might limit what we want to do with the special > > case of it which is partitioning; in other words, we would find > > ourselves in difficult position where we have to inject a special case > > code into a very generalized mechanism that is inheritance. > > Specifically, do we regard a partitions as pg_inherits children of its > > partitioning parent? > > I don't think this is totally an all-or-nothing decision. I think > everyone is agreed that we need to not break things that work today -- > e.g. Merge Append. What that implies for pg_inherits isn't altogether > clear. > > > * Syntax: do we want to make it similar to one of the many other > > databases out there? Or we could invent our own? > > Well, what I think we don't want is something that is *almost* like > some other database but not quite. I lean toward inventing our own > since I'm not aware of something that we'd want to copy exactly. > > > I wonder if we could add a clause like DISTRIBUTED BY to complement > > PARTITION ON that represents a hash distributed/partitioned table (that > > could be a syntax to support sharded tables maybe; we would definitely > > want to move ahead in that direction I guess) > > Maybe eventually, but let's not complicate things by worrying too much > about that now. > Instead we might want to specify which server (foreign or local) each of the partition go to, something like LOCATED ON clause for each of the partitions with default as local server. > > > * Catalog: We would like to have a catalog structure suitable to > > implement capabilities like multi-column partitioning, sub-partitioning > > (with arbitrary number of levels in the hierarchy). I had suggested > > that we create two new catalogs viz. pg_partitioned_rel, > > pg_partition_def to store metadata about a partition key of a > > partitioned relation and partition bound info of a partition, > > respectively. Also, see the point about on-disk representation of > > partition bounds > > I'm not convinced that there is any benefit in spreading this > information across two tables neither of which exist today. If the > representation of the partitioning scheme is going to be a node tree, > then there's no point in taking what would otherwise have been a List > and storing each element of it in a separate tuple. The overarching > point here is that the system catalog structure should be whatever is > most convenient for the system internals; I'm not sure we understand > what that is yet. > > > * It is desirable to treat partitions as pg_class relations with perhaps > > a new relkind(s). We may want to choose an implementation where only the > > lowest level relations in a partitioning hierarchy have storage; those > > at the upper layers are mere placeholder relations though of course with > > associated constraints determined by partitioning criteria (with > > appropriate metadata entered into the additional catalogs). > > I think the storage-less parents need a new relkind precisely to > denote that they have no storage; I am not convinced that there's any > reason to change the relkind for the leaf nodes. But that's been > proposed, so evidently someone thinks there's a reason to do it. > > > I am not > > quite sure if each kind of the relations involved in the partitioning > > scheme have separate namespaces and, if they are, how we implement that > > I am in favor of having all of the nodes in the hierarchy have names > just as relations do today -- pg_class.relname. Anything else seems > to me to be complex to implement and of very marginal benefit. But > again, it's been proposed. > > > * In the initial implementation, we could just live with partitioning on > > a set of columns (and not arbitrary expressions of them) > > Seems quite fair. > > > * We perhaps do not need multi-column LIST partitions as they are not > > very widely used and may complicate the implementation > > I agree that the use case is marginal; but I'm not sure it needs to > complicate the implementation much. Depending on how the > implementation shakes out, prohibiting it might come to seem like more > of a wart than allowing it. > > > * There are a number of suggestions about how we represent partition > > bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype > > associated with the relation itself), etc. Important point to consider > > here may be that partition key may contain more than one column > > Yep. > > > * How we represent partition definition in memory (for a given > > partitioned relation) - important point to remember is that such a > > representation should be efficient to iterate t
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote wrote: > * It has been repeatedly pointed out that we may want to decouple > partitioning from inheritance because implementing partitioning as an > extension of inheritance mechanism means that we have to keep all the > existing semantics which might limit what we want to do with the special > case of it which is partitioning; in other words, we would find > ourselves in difficult position where we have to inject a special case > code into a very generalized mechanism that is inheritance. > Specifically, do we regard a partitions as pg_inherits children of its > partitioning parent? I don't think this is totally an all-or-nothing decision. I think everyone is agreed that we need to not break things that work today -- e.g. Merge Append. What that implies for pg_inherits isn't altogether clear. > * Syntax: do we want to make it similar to one of the many other > databases out there? Or we could invent our own? Well, what I think we don't want is something that is *almost* like some other database but not quite. I lean toward inventing our own since I'm not aware of something that we'd want to copy exactly. > I wonder if we could add a clause like DISTRIBUTED BY to complement > PARTITION ON that represents a hash distributed/partitioned table (that > could be a syntax to support sharded tables maybe; we would definitely > want to move ahead in that direction I guess) Maybe eventually, but let's not complicate things by worrying too much about that now. > * Catalog: We would like to have a catalog structure suitable to > implement capabilities like multi-column partitioning, sub-partitioning > (with arbitrary number of levels in the hierarchy). I had suggested > that we create two new catalogs viz. pg_partitioned_rel, > pg_partition_def to store metadata about a partition key of a > partitioned relation and partition bound info of a partition, > respectively. Also, see the point about on-disk representation of > partition bounds I'm not convinced that there is any benefit in spreading this information across two tables neither of which exist today. If the representation of the partitioning scheme is going to be a node tree, then there's no point in taking what would otherwise have been a List and storing each element of it in a separate tuple. The overarching point here is that the system catalog structure should be whatever is most convenient for the system internals; I'm not sure we understand what that is yet. > * It is desirable to treat partitions as pg_class relations with perhaps > a new relkind(s). We may want to choose an implementation where only the > lowest level relations in a partitioning hierarchy have storage; those > at the upper layers are mere placeholder relations though of course with > associated constraints determined by partitioning criteria (with > appropriate metadata entered into the additional catalogs). I think the storage-less parents need a new relkind precisely to denote that they have no storage; I am not convinced that there's any reason to change the relkind for the leaf nodes. But that's been proposed, so evidently someone thinks there's a reason to do it. > I am not > quite sure if each kind of the relations involved in the partitioning > scheme have separate namespaces and, if they are, how we implement that I am in favor of having all of the nodes in the hierarchy have names just as relations do today -- pg_class.relname. Anything else seems to me to be complex to implement and of very marginal benefit. But again, it's been proposed. > * In the initial implementation, we could just live with partitioning on > a set of columns (and not arbitrary expressions of them) Seems quite fair. > * We perhaps do not need multi-column LIST partitions as they are not > very widely used and may complicate the implementation I agree that the use case is marginal; but I'm not sure it needs to complicate the implementation much. Depending on how the implementation shakes out, prohibiting it might come to seem like more of a wart than allowing it. > * There are a number of suggestions about how we represent partition > bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype > associated with the relation itself), etc. Important point to consider > here may be that partition key may contain more than one column Yep. > * How we represent partition definition in memory (for a given > partitioned relation) - important point to remember is that such a > representation should be efficient to iterate through or > binary-searchable. Also see the points about tuple-routing and > partition-pruning Yep. > * Overflow/catchall partition: it seems we do not want/need them. It > might seem desirable for example in cases where a big transaction enters > a large number of tuples all but one of which find a defined partition; > we may not want to error out in such case but instead enter that erring > tuple into the o
Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On 06-01-2015 PM 03:40, Amit Langote wrote: > > I agree that while we are discussing these points, we could also be > discussing how we solve problems of existing partitioning implementation > using whatever the above things end up being. Proposed approaches to > solve those problems might be useful to drive the first step as well or > perhaps that's how it should be done anyway. > I realize the discussion has not quite brought us to *conclusions* so far though surely there has been valuable input from people. Anyway, starting a new thread with the summary of what has been (please note that the order of listing the points does not necessarily connote the priority): * It has been repeatedly pointed out that we may want to decouple partitioning from inheritance because implementing partitioning as an extension of inheritance mechanism means that we have to keep all the existing semantics which might limit what we want to do with the special case of it which is partitioning; in other words, we would find ourselves in difficult position where we have to inject a special case code into a very generalized mechanism that is inheritance. Specifically, do we regard a partitions as pg_inherits children of its partitioning parent? * Syntax: do we want to make it similar to one of the many other databases out there? Or we could invent our own? I like the syntax that Robert suggested that covers the cases of RANGE and LIST partitioning without actually having to use those keywords explicitly; something like the following: CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]); CREATE TABLE child PARTITION OF parent_name FOR VALUES { (value, ...) [ TO (value, ...) ] } So instead of making a hard distinction between range and list partitioning, you can say: CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7); wherein, child is effectively a LIST partition CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12); wherein, child is effectively a RANGE partition on one column CREATE TABLE child PARTITION OF parent_name FOR VALUES(20, 120) TO (30, 130); wherein, child is effectively a RANGE partition on two columns I wonder if we could add a clause like DISTRIBUTED BY to complement PARTITION ON that represents a hash distributed/partitioned table (that could be a syntax to support sharded tables maybe; we would definitely want to move ahead in that direction I guess) * Catalog: We would like to have a catalog structure suitable to implement capabilities like multi-column partitioning, sub-partitioning (with arbitrary number of levels in the hierarchy). I had suggested that we create two new catalogs viz. pg_partitioned_rel, pg_partition_def to store metadata about a partition key of a partitioned relation and partition bound info of a partition, respectively. Also, see the point about on-disk representation of partition bounds * It is desirable to treat partitions as pg_class relations with perhaps a new relkind(s). We may want to choose an implementation where only the lowest level relations in a partitioning hierarchy have storage; those at the upper layers are mere placeholder relations though of course with associated constraints determined by partitioning criteria (with appropriate metadata entered into the additional catalogs). I am not quite sure if each kind of the relations involved in the partitioning scheme have separate namespaces and, if they are, how we implement that * In the initial implementation, we could just live with partitioning on a set of columns (and not arbitrary expressions of them) * We perhaps do not need multi-column LIST partitions as they are not very widely used and may complicate the implementation * There are a number of suggestions about how we represent partition bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype associated with the relation itself), etc. Important point to consider here may be that partition key may contain more than one column * How we represent partition definition in memory (for a given partitioned relation) - important point to remember is that such a representation should be efficient to iterate through or binary-searchable. Also see the points about tuple-routing and partition-pruning * Overflow/catchall partition: it seems we do not want/need them. It might seem desirable for example in cases where a big transaction enters a large number of tuples all but one of which find a defined partition; we may not want to error out in such case but instead enter that erring tuple into the overflow partition instead. If we choose to implement that, we would like to also implement the capability to move the tuples into the appropriate partition once it's defined. Related is the notion of automatically creating partitions if one is not already defined for a just entered tuple; but there may be locking troubles if many concurrent sessions try to do that * Tuple-routing: based on the intern
Re: [HACKERS] On partitioning
On 18-12-2014 AM 04:52, Robert Haas wrote: > On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus wrote: >> >> Sure. But there's a big difference between "we're going to take these >> steps and that problem will be fixable eventually" and "we're going to >> retain features of the current partitioning system which make that >> problem impossible to fix." The drift of discussion on this thread >> *sounded* like the latter, and I've been calling attention to the issue >> in an effort to make sure that it's not. >> >> Last week, I wrote a longish email listing out the common problems users >> have with our current partitioning as a way of benchmarking the plan for >> new partitioning. While some people responded to that post, absolutely >> nobody discussed the list of issues I gave. Is that because there's >> universal agreement that I got the major issues right? Seems doubtful. > > I agreed with many of the things you listed but not all of them. > However, I don't think it's realistic to burden whatever patch Amit > writes with the duty of, for example, making global indexes work. > That's a huge problem all of its own. Now, conceivably, we could try > to solve that as part of the next patch by insisting that the > "partitions" have to really be block number ranges within a single > relfilenode rather than separate relfilenodes as they are today ... > but I think that's a bad design which we would likely regret bitterly. > I also think that it would likely make what's being talked about here > so complicated that it will never go anywhere. I think it's better > that we focus on solving one problem really well - storing metadata > for partition boundaries in the catalog so that we can do efficient > tuple routing and partition pruning - and leave the other problems for > later. > Yes, I think partitioning as a whole is a BIG enough project that we need to tackle it as a series of steps each of which is a discussion of its own. The first step might as well be discussing how we represent a partitioned table. We have a number of design decisions to make during this step itself and we would definitely want to reach a consensus on these points. Things like where we indicate if a table is partitioned (pg_class), what the partition key looks like, where it is stored, what the partition definition looks like, where it is stored, how we represent arbitrary number of levels in partitioning hierarchy, how we implement that only leaf level relations in a hierarchy have storage, what are implications of all these choices, etc. Some of these points are being discussed. I agree that while we are discussing these points, we could also be discussing how we solve problems of existing partitioning implementation using whatever the above things end up being. Proposed approaches to solve those problems might be useful to drive the first step as well or perhaps that's how it should be done anyway. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus wrote: > On 12/16/2014 07:35 PM, Robert Haas wrote: >> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus wrote: >>> Anyway, what I'm saying is that I personally regard the inability to >>> handle even moderately complex expressions a major failing of our >>> existing partitioning scheme (possibly its worst single failing), and I >>> would regard any new partitioning feature which didn't address that >>> issue as suspect. >> >> I understand, but I think you need to be careful not to stonewall all >> progress in the name of getting what you want. Getting the >> partitioning metadata into the system catalogs in a suitable format >> will be a huge step forward regardless of whether it solves this >> particular problem right away or not, because it will make it possible >> to solve this problem in a highly-efficient way, which is quite hard >> to do right now. > > Sure. But there's a big difference between "we're going to take these > steps and that problem will be fixable eventually" and "we're going to > retain features of the current partitioning system which make that > problem impossible to fix." The drift of discussion on this thread > *sounded* like the latter, and I've been calling attention to the issue > in an effort to make sure that it's not. > > Last week, I wrote a longish email listing out the common problems users > have with our current partitioning as a way of benchmarking the plan for > new partitioning. While some people responded to that post, absolutely > nobody discussed the list of issues I gave. Is that because there's > universal agreement that I got the major issues right? Seems doubtful. I agreed with many of the things you listed but not all of them. However, I don't think it's realistic to burden whatever patch Amit writes with the duty of, for example, making global indexes work. That's a huge problem all of its own. Now, conceivably, we could try to solve that as part of the next patch by insisting that the "partitions" have to really be block number ranges within a single relfilenode rather than separate relfilenodes as they are today ... but I think that's a bad design which we would likely regret bitterly. I also think that it would likely make what's being talked about here so complicated that it will never go anywhere. I think it's better that we focus on solving one problem really well - storing metadata for partition boundaries in the catalog so that we can do efficient tuple routing and partition pruning - and leave the other problems for later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/17/2014 11:19 AM, Heikki Linnakangas wrote: > On 12/17/2014 08:53 PM, Josh Berkus wrote: >> Last week, I wrote a longish email listing out the common problems users >> have with our current partitioning as a way of benchmarking the plan for >> new partitioning. While some people responded to that post, absolutely >> nobody discussed the list of issues I gave. Is that because there's >> universal agreement that I got the major issues right? Seems doubtful. > > That was a good list. ;-) Ok, that made my morning. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/17/2014 08:53 PM, Josh Berkus wrote: Last week, I wrote a longish email listing out the common problems users have with our current partitioning as a way of benchmarking the plan for new partitioning. While some people responded to that post, absolutely nobody discussed the list of issues I gave. Is that because there's universal agreement that I got the major issues right? Seems doubtful. That was a good list. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/16/2014 07:35 PM, Robert Haas wrote: > On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus wrote: >> Anyway, what I'm saying is that I personally regard the inability to >> handle even moderately complex expressions a major failing of our >> existing partitioning scheme (possibly its worst single failing), and I >> would regard any new partitioning feature which didn't address that >> issue as suspect. > > I understand, but I think you need to be careful not to stonewall all > progress in the name of getting what you want. Getting the > partitioning metadata into the system catalogs in a suitable format > will be a huge step forward regardless of whether it solves this > particular problem right away or not, because it will make it possible > to solve this problem in a highly-efficient way, which is quite hard > to do right now. Sure. But there's a big difference between "we're going to take these steps and that problem will be fixable eventually" and "we're going to retain features of the current partitioning system which make that problem impossible to fix." The drift of discussion on this thread *sounded* like the latter, and I've been calling attention to the issue in an effort to make sure that it's not. Last week, I wrote a longish email listing out the common problems users have with our current partitioning as a way of benchmarking the plan for new partitioning. While some people responded to that post, absolutely nobody discussed the list of issues I gave. Is that because there's universal agreement that I got the major issues right? Seems doubtful. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus wrote: > On 12/16/2014 05:52 PM, Robert Haas wrote: >> But in a more complicated case where the value there isn't known until >> runtime, yeah, it scans everything. I'm not sure what the best way to >> fix that is. If the partition bounds were stored in a structured way, >> as we've been discussing, then the Append or Merge Append node could, >> when initialized, check which partition the id = X qual routes to and >> ignore the rest. But that's more iffy with the current >> representation, I think. > > Huh. I was just testing: > > WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz > '2014-12-01' + interval '1 month') > > In that case, the expression above got folded to constants by the time > Postgres did the index scans, but it scanned all partitions. So somehow > (timestamptz + interval) doesn't get constant-folded until after > planning, at least not on 9.3. > > And of course this leaves out common patterns like "now() - interval '30 > days'" or "to_timestamp('20141201','MMDD')" > > Anyway, what I'm saying is that I personally regard the inability to > handle even moderately complex expressions a major failing of our > existing partitioning scheme (possibly its worst single failing), and I > would regard any new partitioning feature which didn't address that > issue as suspect. I understand, but I think you need to be careful not to stonewall all progress in the name of getting what you want. Getting the partitioning metadata into the system catalogs in a suitable format will be a huge step forward regardless of whether it solves this particular problem right away or not, because it will make it possible to solve this problem in a highly-efficient way, which is quite hard to do right now. For example, we could (right now) write code that would do run-time partition pruning by taking the final filter clause, with all values substituted in, and re-checking for partitions that can be pruned via constraint exclusion. But that would be expensive and would often fail to find anything useful. Even in the best case where it works out it's O(n) in the number of partitions, and will therefore perform badly for large numbers of partitions (even, say, 1000). But once the partitioning metadata is stored in the catalog, we can implement this as a binary search -- O(lg n) time -- and the constant factor should be lower -- and it will be pretty easy to skip it in cases where it's useless so that we don't waste cycles spinning our wheels. Whether the initial patch covers all the cases you care about or not, and it probably won't, it will be a really big step towards making it POSSIBLE to handle those cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/16/2014 05:52 PM, Robert Haas wrote: > But in a more complicated case where the value there isn't known until > runtime, yeah, it scans everything. I'm not sure what the best way to > fix that is. If the partition bounds were stored in a structured way, > as we've been discussing, then the Append or Merge Append node could, > when initialized, check which partition the id = X qual routes to and > ignore the rest. But that's more iffy with the current > representation, I think. Huh. I was just testing: WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz '2014-12-01' + interval '1 month') In that case, the expression above got folded to constants by the time Postgres did the index scans, but it scanned all partitions. So somehow (timestamptz + interval) doesn't get constant-folded until after planning, at least not on 9.3. And of course this leaves out common patterns like "now() - interval '30 days'" or "to_timestamp('20141201','MMDD')" Anyway, what I'm saying is that I personally regard the inability to handle even moderately complex expressions a major failing of our existing partitioning scheme (possibly its worst single failing), and I would regard any new partitioning feature which didn't address that issue as suspect. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 16, 2014 at 1:45 PM, Josh Berkus wrote: > Yes, I wasn't saying that expressions should be used when *creating* the > partitions, which strikes me as a bad idea for several reasons. > Expressions should be usable when SELECTing data from the partitions. > Right now, they aren't, because the planner picks parttiions well before > the rewrite phase which would reduce "extract (month from current_date)" > to a constant. > > Right now, if you partition by an integer ID even, and do: > > SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 ) > > ... postgres will scan all partitions because ( 3 + 4 ) is an expression > and isn't evaluated until after CE is done. Well, actually, that case works fine: rhaas=# create table partitioned_table (id integer, data text); CREATE TABLE rhaas=# create table child1 (check (id < 1000)) inherits (partitioned_table); CREATE TABLE rhaas=# create table child2 (check (id >= 1000)) inherits (partitioned_table); CREATE TABLE rhaas=# explain select * from partitioned_table where id = (3 + 4); QUERY PLAN Append (cost=0.00..25.38 rows=7 width=36) -> Seq Scan on partitioned_table (cost=0.00..0.00 rows=1 width=36) Filter: (id = 7) -> Seq Scan on child1 (cost=0.00..25.38 rows=6 width=36) Filter: (id = 7) (5 rows) The reason is that 3 + 4 gets constant-folded pretty early on in the process. But in a more complicated case where the value there isn't known until runtime, yeah, it scans everything. I'm not sure what the best way to fix that is. If the partition bounds were stored in a structured way, as we've been discussing, then the Append or Merge Append node could, when initialized, check which partition the id = X qual routes to and ignore the rest. But that's more iffy with the current representation, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 17-12-2014 AM 12:28, Claudio Freire wrote: > On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas wrote: >> I'm not really sure what you are getting here. An "otherwise-good >> expression" basically means a constant. Index expressions have to be >> things that always produce the same result given the same input, >> because otherwise you might get a different result when searching the >> index than you did when building it, and then you would fail to find >> keys that are actually present. > > I think the point is partitioning based on the result of an expression > over row columns. Actually, in this case, I was thinking about a partition definition not partition key definition. That is, using an expression as partition value which has problems that I see. > Or if it's not, it should be made anyway: > > PARTITION BY LIST (extract(month from date_created) VALUES (1, 3, 6, 9, 12); > > Or something like that. > Such a thing seems very desirable though there are some tradeoffs compared to having partitioning key be just attrnums. Or at least we can start with that. An arbitrary expression as partitioning key means that we have to recompute such an expression for each input row. Think how inefficient that may be when bulk-loading into a partitioned table during, say, a COPY. Though there may be ways to fix that. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 17-12-2014 AM 12:15, Robert Haas wrote: > On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote > wrote: >> Robert wrote: >>> I would expect that to fail, just as it would fail if you tried to >>> build an index using a volatile expression. >> >> Oops, wrong example, sorry. In case of an otherwise good expression? > > I'm not really sure what you are getting here. An "otherwise-good > expression" basically means a constant. Index expressions have to be > things that always produce the same result given the same input, > because otherwise you might get a different result when searching the > index than you did when building it, and then you would fail to find > keys that are actually present. In the same way, partition boundaries > also need to be constants. Maybe you could allow expressions that can > be constant-folded, but that's about it. Yeah, this is what I meant. Expressions that can be constant-folded. Sorry, the example I chose was pretty lame. I was just thinking about kind of stuff that something like pg_node_tree would be a good choice for as on-disk representation of partition values. Though definitely it wouldn't be to store arbitrary expressions that evaluate to different values at different times. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/15/2014 10:55 AM, Robert Haas wrote: >> This means if a user puts arbitrary expressions in a partition definition, >> say, >> > >> > ... FOR VALUES extract(month from current_date) TO extract(month from >> > current_date + interval '3 months'), >> > >> > we make sure that those expressions are pre-computed to literal values. > I would expect that to fail, just as it would fail if you tried to > build an index using a volatile expression. Yes, I wasn't saying that expressions should be used when *creating* the partitions, which strikes me as a bad idea for several reasons. Expressions should be usable when SELECTing data from the partitions. Right now, they aren't, because the planner picks parttiions well before the rewrite phase which would reduce "extract (month from current_date)" to a constant. Right now, if you partition by an integer ID even, and do: SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 ) ... postgres will scan all partitions because ( 3 + 4 ) is an expression and isn't evaluated until after CE is done. I don't think there's an easy way to do the expression rewrite while we're still in planning, is there? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas wrote: > wrote: >> Robert wrote: >>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote >>> wrote: >>> > This means if a user puts arbitrary expressions in a partition >>> > definition, say, >>> > >>> > ... FOR VALUES extract(month from current_date) TO extract(month from >>> current_date + interval '3 months'), >>> > >>> > we make sure that those expressions are pre-computed to literal values. >>> >>> I would expect that to fail, just as it would fail if you tried to >>> build an index using a volatile expression. >> >> Oops, wrong example, sorry. In case of an otherwise good expression? > > I'm not really sure what you are getting here. An "otherwise-good > expression" basically means a constant. Index expressions have to be > things that always produce the same result given the same input, > because otherwise you might get a different result when searching the > index than you did when building it, and then you would fail to find > keys that are actually present. I think the point is partitioning based on the result of an expression over row columns. Or if it's not, it should be made anyway: PARTITION BY LIST (extract(month from date_created) VALUES (1, 3, 6, 9, 12); Or something like that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote wrote: > Robert wrote: >> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote >> wrote: >> > This means if a user puts arbitrary expressions in a partition definition, >> > say, >> > >> > ... FOR VALUES extract(month from current_date) TO extract(month from >> current_date + interval '3 months'), >> > >> > we make sure that those expressions are pre-computed to literal values. >> >> I would expect that to fail, just as it would fail if you tried to >> build an index using a volatile expression. > > Oops, wrong example, sorry. In case of an otherwise good expression? I'm not really sure what you are getting here. An "otherwise-good expression" basically means a constant. Index expressions have to be things that always produce the same result given the same input, because otherwise you might get a different result when searching the index than you did when building it, and then you would fail to find keys that are actually present. In the same way, partition boundaries also need to be constants. Maybe you could allow expressions that can be constant-folded, but that's about it. If you allow anything else, then the partition boundary might "move" once it's been established and then some of the data will be in the wrong partition. What possible use case is there for defining partitions with non-constant boundaries? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Robert wrote: > On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote > wrote: > > This means if a user puts arbitrary expressions in a partition definition, > > say, > > > > ... FOR VALUES extract(month from current_date) TO extract(month from > current_date + interval '3 months'), > > > > we make sure that those expressions are pre-computed to literal values. > > I would expect that to fail, just as it would fail if you tried to > build an index using a volatile expression. Oops, wrong example, sorry. In case of an otherwise good expression? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote wrote: > This means if a user puts arbitrary expressions in a partition definition, > say, > > ... FOR VALUES extract(month from current_date) TO extract(month from > current_date + interval '3 months'), > > we make sure that those expressions are pre-computed to literal values. I would expect that to fail, just as it would fail if you tried to build an index using a volatile expression. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 15, 2014 at 8:09 AM, José Luis Tallón wrote: > On 12/15/2014 07:42 AM, Claudio Freire wrote: >> >> [snip] > > >> If you do that, you start with empty partitions, and each insert updates >> the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but >> in theory this could allow very general partition exclusion. In fact it >> could even work with constraint exclusion right now: you'd have a >> single-tuple BRIN index for each partition and benefit from it. But you >> don't need to pay the price of updating BRIN indexes, as min-max tuples for >> each partition can be produced while creating the partitions if the syntax >> already provides the information. Then, it's just a matter of querying this >> meta-data which just happens to have the form of a BRIN tuple for each >> partition. > > > Yup. Indeed this is the way I outlined in my previous e-mail. > > The only point being: Why bother with BRIN when we already have the range > machinery, and it's trivial to add pointers to partitions from each range? The part of BRIN I find useful is not its on-disk structure, but all the execution machinery that checks quals against BRIN tuples. It's not a trivial part of code, and is especially useful since it's generalizable. New BRIN operator classes can be created and that's an interesting power to have in partitioning as well. Casting from ranges into min-max BRIN tuples seems quite doable, so both range and list notation should work fine. But BRIN works also for the generic "routing expression" some people seem to really want, and dynamically updated BRIN meta-indexes seem to be the only efficient solution for that. BRIN lacks some features, as you noted, so it does need some love before it's usable for this. But they're features BRIN itself would find useful so you take out two ducks in one shot. > I suggested that BRIN would solve a situation when the amount of partitions > is huge (say, thousands) and we might need to be able to efficiently locate > the appropriate partition. In this situation, a linear search might become > prohibitive, or the data structure (a simple B-Tree, maybe) become too big > to be worth keeping in memory. This is where being able to store the > "partition index" on disk would be interesting. BRIN also does a linear search, so it doesn't solve that. BRIN's only power is that it can answer very fast whether some quals rule out a partition. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/15/2014 07:42 AM, Claudio Freire wrote: [snip] If you do that, you start with empty partitions, and each insert updates the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but in theory this could allow very general partition exclusion. In fact it could even work with constraint exclusion right now: you'd have a single-tuple BRIN index for each partition and benefit from it. But you don't need to pay the price of updating BRIN indexes, as min-max tuples for each partition can be produced while creating the partitions if the syntax already provides the information. Then, it's just a matter of querying this meta-data which just happens to have the form of a BRIN tuple for each partition. Yup. Indeed this is the way I outlined in my previous e-mail. The only point being: Why bother with BRIN when we already have the range machinery, and it's trivial to add pointers to partitions from each range? I suggested that BRIN would solve a situation when the amount of partitions is huge (say, thousands) and we might need to be able to efficiently locate the appropriate partition. In this situation, a linear search might become prohibitive, or the data structure (a simple B-Tree, maybe) become too big to be worth keeping in memory. This is where being able to store the "partition index" on disk would be interesting. Moreover, I guess that ---by using this approach (B-Tree[range]->partition_id and/or BRIN)--- we could efficiently answer the question "do we have any tuple with this key in some partition?" which AFAICS is pretty close to us having "global indexes". Regards, / J.L. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Claudio Freire wrote: > On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote > wrote: > >> On egress you need some direct way to compare the scan quals with the > >> partitioning values. I would imagine this to be similar to how scan > >> quals are compared to the values stored in a BRIN index: each scan qual > >> has a corresponding operator strategy and a scan key, and you can say > >> "aye" or "nay" based on a small set of operations that can be run > >> cheaply, again without any proof or running arbitrary expressions. > >> > > > > My knowledge of this is far from being perfect, though to clear any > confusions - > > > > As far as planning is concerned, I could not imagine how index access > method way of pruning partitions could be made to work. Of course, I may > be missing something. > > Let me be overly verbose, don't take it as patronizing, just answering > in lots of detail why this could be a good idea to try. > Thanks for explaining. It helps. > Normal indexes store a pointer for each key value of sorts. So B-Tree > gets you a set of tids for each key, and so does GIN and hash. > > But BRIN is different in that it does the mapping differently. BRIN > stores a compact, approximate representation of the set of keys within > a page range. It can tell with some degree of (in)accuracy whether a > key or key range could be part of that page range or not. The way it > does this is abstracted out, but at its core, it stores a "compressed" > representation of the key set that takes a constant amount of bits to > store, and no more, no matter how many elements. What changes as the > element it represents grows, is its accuracy. > > Currently, BRIN only supports min-max representations. It will store, > for each page range, the minimum and maximum of some columns, and > when > you query it, you can compare range vs range, and discard whole page > ranges. > > Now, what are partitions, if not page ranges? Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this is a rang partition, list of values in case this is a list partition and hash value in case this is a hash partition. There is debate on the topic but each of these partitions also happens to be a separate relation. IIUC, BRIN is an access method for a relation (say, top-level partitioned relation) that comes into play in executor if that access method survives as preferred access method by the planner. I cannot see a way to generalize it further and make it support each block range as a separate relation and then use it for partition pruning in planner. This is assuming a partitioned relation is planned as an Append node which contains a list of plans for surviving partition relations based on, say, restrict quals. I may be thinking of BRIN as a whole as not being generalized enough but I may be wrong. Could you point out if so? > A BRIN tuple is a min-max pair. But BRIN in more general, it could use > other data structures to hold that "compressed representation", if > someone implemented them. Like bloom filters [0]. > > A BRIN index is a complex data structure because it has to account for > physically growing tables, but all the complexities vanish when you > fix a "block range" (the BR in BRIN) to a partition. Then, a mere > array of BRIN tuples would suffice. > > BRIN already contains the machinery to turn quals into something that > filters out entire partitions, if you provide the BRIN tuples. > IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right? > And you could even effectively matain a BRIN index for the partitions > (just a BRIN tuple per partition, dynamically updated with every > insertion). > > If you do that, you start with empty partitions, and each insert > updates the BRIN tuple. Avoiding concurrency loss in this case would > be tricky, but in theory this could allow very general partition > exclusion. In fact it could even work with constraint exclusion right > now: you'd have a single-tuple BRIN index for each partition and > benefit from it. > > But you don't need to pay the price of updating BRIN indexes, as > min-max tuples for each partition can be produced while creating the > partitions if the syntax already provides the information. Then, it's > just a matter of querying this meta-data which just happens to have > the form of a BRIN tuple for each partition. > Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote wrote: >> On egress you need some direct way to compare the scan quals with the >> partitioning values. I would imagine this to be similar to how scan >> quals are compared to the values stored in a BRIN index: each scan qual >> has a corresponding operator strategy and a scan key, and you can say >> "aye" or "nay" based on a small set of operations that can be run >> cheaply, again without any proof or running arbitrary expressions. >> > > My knowledge of this is far from being perfect, though to clear any > confusions - > > As far as planning is concerned, I could not imagine how index access method > way of pruning partitions could be made to work. Of course, I may be missing > something. Let me be overly verbose, don't take it as patronizing, just answering in lots of detail why this could be a good idea to try. Normal indexes store a pointer for each key value of sorts. So B-Tree gets you a set of tids for each key, and so does GIN and hash. But BRIN is different in that it does the mapping differently. BRIN stores a compact, approximate representation of the set of keys within a page range. It can tell with some degree of (in)accuracy whether a key or key range could be part of that page range or not. The way it does this is abstracted out, but at its core, it stores a "compressed" representation of the key set that takes a constant amount of bits to store, and no more, no matter how many elements. What changes as the element it represents grows, is its accuracy. Currently, BRIN only supports min-max representations. It will store, for each page range, the minimum and maximum of some columns, and when you query it, you can compare range vs range, and discard whole page ranges. Now, what are partitions, if not page ranges? A BRIN tuple is a min-max pair. But BRIN in more general, it could use other data structures to hold that "compressed representation", if someone implemented them. Like bloom filters [0]. A BRIN index is a complex data structure because it has to account for physically growing tables, but all the complexities vanish when you fix a "block range" (the BR in BRIN) to a partition. Then, a mere array of BRIN tuples would suffice. BRIN already contains the machinery to turn quals into something that filters out entire partitions, if you provide the BRIN tuples. And you could even effectively matain a BRIN index for the partitions (just a BRIN tuple per partition, dynamically updated with every insertion). If you do that, you start with empty partitions, and each insert updates the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but in theory this could allow very general partition exclusion. In fact it could even work with constraint exclusion right now: you'd have a single-tuple BRIN index for each partition and benefit from it. But you don't need to pay the price of updating BRIN indexes, as min-max tuples for each partition can be produced while creating the partitions if the syntax already provides the information. Then, it's just a matter of querying this meta-data which just happens to have the form of a BRIN tuple for each partition. [0] http://en.wikipedia.org/wiki/Bloom_filter -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Alvaro wrote: > Claudio Freire wrote: > > > Fair enough, but that's not the same as not requiring easy proofs. The > > planner might not the one doing the proofs, but you still need proofs. > > > > Even if the proving method is hardcoded into the partitioning method, > > as in the case of list or range partitioning, it's still a proof. With > > arbitrary functions (which is what prompted me to mention proofs) you > > can't do that. A function works very well for inserting, but not for > > selecting. > > > > I could be wrong though. Maybe there's a way to turn SQL functions > > into analyzable things? But it would still be very easy to shoot > > yourself in the foot by writing one that is too complex. > > Arbitrary SQL expressions (including functions) are not the thing to use > for partitioning -- at least that's how I understand this whole > discussion. I don't think you want to do "proofs" as such -- they are > expensive. > This means if a user puts arbitrary expressions in a partition definition, say, ... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'), we make sure that those expressions are pre-computed to literal values. The exact time when that happens is open for discussion I guess. It could be either DDL time or, if feasible, during relation cache building when we compute the value from pg_node_tree of this expression which we may choose to store in the partition definition catalog. The former entails an obvious challenge of figuring out how we store the computed value into catalog (pg_node_tree of a Const?). > To make this discussion a bit clearer, there are two things to > distinguish: one is routing tuples, when an INSERT or COPY command > references the partitioned table, into the individual partitions > (ingress); the other is deciding which partitions to read when a SELECT > query wants to read tuples from the partitioned table (egress). > > On ingress, what you want is something like being able to do something > on the tuple that tells you which partition it belongs into. Ideally > this is something much lighter than running an expression; if you can > just apply an operator to the partitioning column values, that should be > plenty fast. This requires no proof. > And I am thinking this's all executor stuff. > On egress you need some direct way to compare the scan quals with the > partitioning values. I would imagine this to be similar to how scan > quals are compared to the values stored in a BRIN index: each scan qual > has a corresponding operator strategy and a scan key, and you can say > "aye" or "nay" based on a small set of operations that can be run > cheaply, again without any proof or running arbitrary expressions. > My knowledge of this is far from being perfect, though to clear any confusions - As far as planning is concerned, I could not imagine how index access method way of pruning partitions could be made to work. Of course, I may be missing something. When you say "scan qual has a corresponding operator strategy", I'd think that is a part of scan key in executor, no? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sun, Dec 14, 2014 at 1:40 AM, José Luis Tallón wrote: > On 12/12/2014 05:43 AM, Amit Langote wrote: > > Amit: mind if I add the DB2 syntax for partitioning to the wiki, too? > > This might as well help with deciding the final form of partitioning > (and define the first implementation boundaries, too) > Please go ahead. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sun, Dec 14, 2014 at 1:57 AM, José Luis Tallón wrote: > On 12/13/2014 03:09 AM, Alvaro Herrera wrote: >> >> [snip] >> Arbitrary SQL expressions (including functions) are not the thing to use >> for partitioning -- at least that's how I understand this whole >> discussion. I don't think you want to do "proofs" as such -- they are >> expensive. > > > Yup. Plus, it looks like (from reading Oracle's documentation) they end up > converting the LESS THAN clauses into range lists internally. > Anyone that can attest to this? (or just disprove it, if I'm wrong) > > I just suggested using the existing RangeType infrastructure for this ( <<, >>> and && operators, specifically, might do the trick) before reading your > mail citing BRIN. > ... which might as well allow some interesting runtime optimizations > when range partitioning is used and *a huge* number of partitions get > defined --- I'm specifically thinking about massive OLTP with very deep > (say, 5 years' worth) archival partitioning where it would be inconvenient > to have the tuple routing information always in memory. > I'm specifically suggesting some ( range_value -> partitionOID) mapping > using a BRIN index for this --- it could be auto-created just like we do for > primary keys. > > Just my 2c Since we are keen on being able to reuse existing infrastructure, I think this and RangeType, ArrayType stuff is worth thinking about though I am afraid we may lose a certain level of generality of expression we might very well be able to afford. Though that's something difficult to definitely say without actually studying it a little more detail which I haven't quite yet. We may be able to go somewhere with it perhaps. And of course the original designers of the infrastructure in question would be better able to vouch for it I think. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/13/2014 05:57 PM, José Luis Tallón wrote: On 12/13/2014 03:09 AM, Alvaro Herrera wrote: [snip] Arbitrary SQL expressions (including functions) are not the thing to use for partitioning -- at least that's how I understand this whole discussion. I don't think you want to do "proofs" as such -- they are expensive. Yup. Plus, it looks like (from reading Oracle's documentation) they end up converting the LESS THAN clauses into range lists internally. Anyone that can attest to this? (or just disprove it, if I'm wrong) I just suggested using the existing RangeType infrastructure for this ( <<, >> and && operators, specifically, might do the trick) before reading your mail citing BRIN. ... which might as well allow some interesting runtime optimizations when range partitioning is used and *a huge* number of partitions get defined --- I'm specifically thinking about massive OLTP with very deep (say, 5 years' worth) archival partitioning where it would be inconvenient to have the tuple routing information always in memory. I'm specifically suggesting some ( range_value -> partitionOID) mapping using a BRIN index for this --- it could be auto-created just like we do for primary keys. Reviewing the existing documentation on this topic I have stumbled on an e-mail by Simon Riggs from almost seven years ago http://www.postgresql.org/message-id/1199296574.7260.149.ca...@ebony.site where he suggested a way of physically partitioning tables by using segments in a way that sounds to be quite close to what we are proposing here. ISTM that the partitioning meta-data might very well be augmented a bit in the direction Simon pointed to, adding support for "effectively read-only" and/or "explicitly marked read-only" PARTITIONS (not segments in this case) for an additional optimization. We would need some syntax additions (ALTER PARTITION SET READONLY) in this case. This feature can be added later on, of course. I'd like to explicitly remark the potentially performance-enhancing effect of fillfactor=100 (cfr. http://www.postgresql.org/docs/9.3/static/sql-createtable.html) and partitions marked "effectively read-only" (cfr. Simon's proposal) when coupled with "fullscan analyze" vs. the regular sample-based analyze that autovacuum performs. When a partition consists of multiple *segments*, a generalization of the proposed BRIN index (to cover segments in addition to partitions) will further speed up scans. Just for the record, allowing some partitions to be moved to foreign tables (i.e. foreign servers, via postgres_fdw) will multiply the usefullness of this "partitioned table wide" BRIN index now becoming a real "global index". Just my 2c Thanks, / J.L. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 12, 2014 at 09:03:12AM -0500, Robert Haas wrote: > Yeah, range and list partition definitions are very similar, but > hash partition definitions are a different kettle of fish. I don't > think we really need hash partitioning for anything right away - > it's pretty useless unless you've got, say, a way for the partitions > to be foreign tables living on remote servers - There's a patch enabling exactly this feature in the queue for 9.5. https://commitfest.postgresql.org/action/patch_view?id=1386 > but we shouldn't pick a design that will make it really hard to add > later. Indeed not :) Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/12/14, 3:48 PM, Robert Haas wrote: On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby wrote: Sure. Mind you, I'm not proposing that the syntax I just mooted is actually for the best. What I'm saying is that we need to talk about it. Frankly, if we're going to require users to explicitly define each partition then I think the most appropriate API would be a function. Users will be writing code to create new partitions as needed, and it's generally easier to write code that calls a function as opposed to glomming a text string together and passing that to EXECUTE. I have very little idea what the API you're imagining would actually look like from this description, but it sounds like a terrible idea. We don't want to make this infinitely general. We need a *fast* way to go from a value (or list of values, one per partitioning column) to a partition OID, and the way to get there is not to call arbitrary user code. You were talking about the syntax for partition creation/definition; that's the API I was referring to. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/13/2014 03:09 AM, Alvaro Herrera wrote: [snip] Arbitrary SQL expressions (including functions) are not the thing to use for partitioning -- at least that's how I understand this whole discussion. I don't think you want to do "proofs" as such -- they are expensive. Yup. Plus, it looks like (from reading Oracle's documentation) they end up converting the LESS THAN clauses into range lists internally. Anyone that can attest to this? (or just disprove it, if I'm wrong) I just suggested using the existing RangeType infrastructure for this ( <<, >> and && operators, specifically, might do the trick) before reading your mail citing BRIN. ... which might as well allow some interesting runtime optimizations when range partitioning is used and *a huge* number of partitions get defined --- I'm specifically thinking about massive OLTP with very deep (say, 5 years' worth) archival partitioning where it would be inconvenient to have the tuple routing information always in memory. I'm specifically suggesting some ( range_value -> partitionOID) mapping using a BRIN index for this --- it could be auto-created just like we do for primary keys. Just my 2c Thanks, / J.L. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/12/2014 05:43 AM, Amit Langote wrote: [snip] In case of what we would have called a 'LIST' partition, this could look like ... FOR VALUES (val1, val2, val3, ...) Assuming we only support partition key to contain only one column in such a case. Hmmm…. [...] PARTITION BY LIST(col1 [, col2, ...]) just like we do for indexes would do. and CREATE PARTITION child_name OF parent_name FOR [VALUES] (val1a,val2a), (val1b,val2b), (val1c,val2c) [IN tblspc_name] just like we do for multi-valued inserts. In case of what we would have called a 'RANGE' partition, this could look like ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...) How about BETWEEN ... AND ... ? Unless I'm missing something obvious, we already have range types for this, don't we? ... PARTITION BY RANGE (col) CREATE PARTITION child_name OF parent_name FOR [VALUES] '[val1min,val1max)', '[val2min,val2max)', '[val3min,val3max)' [IN tblspc_name] and I guess this should simplify a fully flexible implementation (if you can construct a RangeType for it, you can use that for partitioning). This would substitute the ugly (IMHO) "VALUES LESS THAN" syntax with a more flexible one (even though it might end up being converted into "less than" boundaries internally for implementation/optimization purposes) In both cases we would need to allow for overflows / default partition different from the parent table. Plus some ALTER PARTITION part_name TABLESPACE=tblspc_name The main problem being that we are assuming named partitions here, which might not be that practical at all. [snip] I would include the noise keyword VALUES just for readability if anything. +1 FWIW, deviating from already "standard" syntax (Oracle-like --as implemented by PPAS for example-- or DB2-like) is quite counter-productive unless we have very good reasons for it... which doesn't mean that we have to do it exactly like they do (specially if we would like to go the incremental implementation route). Amit: mind if I add the DB2 syntax for partitioning to the wiki, too? This might as well help with deciding the final form of partitioning (and define the first implementation boundaries, too) Thanks, / J.L. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
El 12/12/2014 23:09, "Alvaro Herrera" escribió: > > Claudio Freire wrote: > > > Fair enough, but that's not the same as not requiring easy proofs. The > > planner might not the one doing the proofs, but you still need proofs. > > > > Even if the proving method is hardcoded into the partitioning method, > > as in the case of list or range partitioning, it's still a proof. With > > arbitrary functions (which is what prompted me to mention proofs) you > > can't do that. A function works very well for inserting, but not for > > selecting. > > > > I could be wrong though. Maybe there's a way to turn SQL functions > > into analyzable things? But it would still be very easy to shoot > > yourself in the foot by writing one that is too complex. > > Arbitrary SQL expressions (including functions) are not the thing to use > for partitioning -- at least that's how I understand this whole > discussion. I don't think you want to do "proofs" as such -- they are > expensive. > > To make this discussion a bit clearer, there are two things to > distinguish: one is routing tuples, when an INSERT or COPY command > references the partitioned table, into the individual partitions > (ingress); the other is deciding which partitions to read when a SELECT > query wants to read tuples from the partitioned table (egress). > > On ingress, what you want is something like being able to do something > on the tuple that tells you which partition it belongs into. Ideally > this is something much lighter than running an expression; if you can > just apply an operator to the partitioning column values, that should be > plenty fast. This requires no proof. > > On egress you need some direct way to compare the scan quals with the > partitioning values. I would imagine this to be similar to how scan > quals are compared to the values stored in a BRIN index: each scan qual > has a corresponding operator strategy and a scan key, and you can say > "aye" or "nay" based on a small set of operations that can be run > cheaply, again without any proof or running arbitrary expressions. Interesting that you mention BRIN. It does seem that it could be made to work with BRIN's operator classes. In fact, a partition-wide brin tuple could be stored per partition and that in itself could be the definition for the partition. Either preinitialized or dynamically updated. Would work even for arbitrary routing functions, especially if the operator class to use is customizable. I stand corrected.
Re: [HACKERS] On partitioning
Claudio Freire wrote: > Fair enough, but that's not the same as not requiring easy proofs. The > planner might not the one doing the proofs, but you still need proofs. > > Even if the proving method is hardcoded into the partitioning method, > as in the case of list or range partitioning, it's still a proof. With > arbitrary functions (which is what prompted me to mention proofs) you > can't do that. A function works very well for inserting, but not for > selecting. > > I could be wrong though. Maybe there's a way to turn SQL functions > into analyzable things? But it would still be very easy to shoot > yourself in the foot by writing one that is too complex. Arbitrary SQL expressions (including functions) are not the thing to use for partitioning -- at least that's how I understand this whole discussion. I don't think you want to do "proofs" as such -- they are expensive. To make this discussion a bit clearer, there are two things to distinguish: one is routing tuples, when an INSERT or COPY command references the partitioned table, into the individual partitions (ingress); the other is deciding which partitions to read when a SELECT query wants to read tuples from the partitioned table (egress). On ingress, what you want is something like being able to do something on the tuple that tells you which partition it belongs into. Ideally this is something much lighter than running an expression; if you can just apply an operator to the partitioning column values, that should be plenty fast. This requires no proof. On egress you need some direct way to compare the scan quals with the partitioning values. I would imagine this to be similar to how scan quals are compared to the values stored in a BRIN index: each scan qual has a corresponding operator strategy and a scan key, and you can say "aye" or "nay" based on a small set of operations that can be run cheaply, again without any proof or running arbitrary expressions. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 12, 2014 at 7:40 PM, Josh Berkus wrote: > On 12/12/2014 02:10 PM, Tom Lane wrote: >> Actually, I'm not sure that's what we want. I thought what we really >> wanted here was to postpone partition-routing decisions to runtime, >> so that the behavior would be efficient whether or not the decision >> could be predetermined at plan time. >> >> This still leads to the same point Robert is making: the routing >> decisions have to be cheap and fast. But it's wrong to think of it >> in terms of planner proofs. > > The other reason I'd really like to have the new partitioning taken out > of the planner: expressions. > > Currently, if you have partitions with constraints on, day, > "event_date", the following WHERE clause will NOT use CE and will scan > all partitions: > > WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and > '2014-12-11'. > > This is despite the fact that the expression above gets rewritten to a > constant by the time the query is executed; by then it's too late. To > say nothing of functions like to_timestamp(), now(), etc. > > As long as partitions need to be chosen at plan time, I don't see a good > way to fix the expression problem. Fair enough, but that's not the same as not requiring easy proofs. The planner might not the one doing the proofs, but you still need proofs. Even if the proving method is hardcoded into the partitioning method, as in the case of list or range partitioning, it's still a proof. With arbitrary functions (which is what prompted me to mention proofs) you can't do that. A function works very well for inserting, but not for selecting. I could be wrong though. Maybe there's a way to turn SQL functions into analyzable things? But it would still be very easy to shoot yourself in the foot by writing one that is too complex. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/12/2014 02:10 PM, Tom Lane wrote: > Actually, I'm not sure that's what we want. I thought what we really > wanted here was to postpone partition-routing decisions to runtime, > so that the behavior would be efficient whether or not the decision > could be predetermined at plan time. > > This still leads to the same point Robert is making: the routing > decisions have to be cheap and fast. But it's wrong to think of it > in terms of planner proofs. The other reason I'd really like to have the new partitioning taken out of the planner: expressions. Currently, if you have partitions with constraints on, day, "event_date", the following WHERE clause will NOT use CE and will scan all partitions: WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and '2014-12-11'. This is despite the fact that the expression above gets rewritten to a constant by the time the query is executed; by then it's too late. To say nothing of functions like to_timestamp(), now(), etc. As long as partitions need to be chosen at plan time, I don't see a good way to fix the expression problem. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 12, 2014 at 7:10 PM, Tom Lane wrote: > Claudio Freire writes: >> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas wrote: >>> I have very little idea what the API you're imagining would actually >>> look like from this description, but it sounds like a terrible idea. >>> We don't want to make this infinitely general. We need a *fast* way >>> to go from a value (or list of values, one per partitioning column) to >>> a partition OID, and the way to get there is not to call arbitrary >>> user code. > >> I think this was mentioned upthread, but I'll repeat it anyway since >> it seems to need repeating. > >> More than fast, you want it analyzable (by the planner). Ie: it has to >> be easy to prove partition exclusion against a where clause. > > Actually, I'm not sure that's what we want. I thought what we really > wanted here was to postpone partition-routing decisions to runtime, > so that the behavior would be efficient whether or not the decision > could be predetermined at plan time. > > This still leads to the same point Robert is making: the routing > decisions have to be cheap and fast. But it's wrong to think of it > in terms of planner proofs. You'll need proofs whether at the planner or at the execution engine. A sequential scan over a partition with a query like select * from foo where date between X and Y Would be ripe for that but at some point you need to prove that the where clause excludes whole partitions. Be it at runtime (while executing the sequential scan node) or planning time. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Claudio Freire writes: > On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas wrote: >> I have very little idea what the API you're imagining would actually >> look like from this description, but it sounds like a terrible idea. >> We don't want to make this infinitely general. We need a *fast* way >> to go from a value (or list of values, one per partitioning column) to >> a partition OID, and the way to get there is not to call arbitrary >> user code. > I think this was mentioned upthread, but I'll repeat it anyway since > it seems to need repeating. > More than fast, you want it analyzable (by the planner). Ie: it has to > be easy to prove partition exclusion against a where clause. Actually, I'm not sure that's what we want. I thought what we really wanted here was to postpone partition-routing decisions to runtime, so that the behavior would be efficient whether or not the decision could be predetermined at plan time. This still leads to the same point Robert is making: the routing decisions have to be cheap and fast. But it's wrong to think of it in terms of planner proofs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas wrote: > On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby wrote: >>> Sure. Mind you, I'm not proposing that the syntax I just mooted is >>> actually for the best. What I'm saying is that we need to talk about >>> it. >> >> Frankly, if we're going to require users to explicitly define each partition >> then I think the most appropriate API would be a function. Users will be >> writing code to create new partitions as needed, and it's generally easier >> to write code that calls a function as opposed to glomming a text string >> together and passing that to EXECUTE. > > I have very little idea what the API you're imagining would actually > look like from this description, but it sounds like a terrible idea. > We don't want to make this infinitely general. We need a *fast* way > to go from a value (or list of values, one per partitioning column) to > a partition OID, and the way to get there is not to call arbitrary > user code. I think this was mentioned upthread, but I'll repeat it anyway since it seems to need repeating. More than fast, you want it analyzable (by the planner). Ie: it has to be easy to prove partition exclusion against a where clause. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby wrote: >> Sure. Mind you, I'm not proposing that the syntax I just mooted is >> actually for the best. What I'm saying is that we need to talk about >> it. > > Frankly, if we're going to require users to explicitly define each partition > then I think the most appropriate API would be a function. Users will be > writing code to create new partitions as needed, and it's generally easier > to write code that calls a function as opposed to glomming a text string > together and passing that to EXECUTE. I have very little idea what the API you're imagining would actually look like from this description, but it sounds like a terrible idea. We don't want to make this infinitely general. We need a *fast* way to go from a value (or list of values, one per partitioning column) to a partition OID, and the way to get there is not to call arbitrary user code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/12/14, 8:03 AM, Robert Haas wrote: On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote wrote: >In case of what we would have called a 'LIST' partition, this could look like > >... FOR VALUES (val1, val2, val3, ...) > >Assuming we only support partition key to contain only one column in such a case. > >In case of what we would have called a 'RANGE' partition, this could look like > >... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...) > >How about BETWEEN ... AND ... ? Sure. Mind you, I'm not proposing that the syntax I just mooted is actually for the best. What I'm saying is that we need to talk about it. Frankly, if we're going to require users to explicitly define each partition then I think the most appropriate API would be a function. Users will be writing code to create new partitions as needed, and it's generally easier to write code that calls a function as opposed to glomming a text string together and passing that to EXECUTE. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote wrote: > In case of what we would have called a 'LIST' partition, this could look like > > ... FOR VALUES (val1, val2, val3, ...) > > Assuming we only support partition key to contain only one column in such a > case. > > In case of what we would have called a 'RANGE' partition, this could look like > > ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...) > > How about BETWEEN ... AND ... ? Sure. Mind you, I'm not proposing that the syntax I just mooted is actually for the best. What I'm saying is that we need to talk about it. > I am not sure but perhaps RANGE and LIST as partitioning kinds may as well > just be noise keywords. We can parse those values into a parse node such that > we don’t have to care about whether they describe partition as being one kind > or the other. Say a List of something like, > > typedef struct PartitionColumnValue > { > NodeTagtype, > Oid*partitionid, > char *partcolname, > Node *partrangelower, > Node *partrangeupper, > List *partlistvalues > }; > > Or we could still add a (char) partkind just to say which of the fields > matter. > > We don't need any defining values here for hash partitions if and when we add > support for the same. We would either be using a system-wide common hash > function or we could add something with partitioning key definition. Yeah, range and list partition definitions are very similar, but hash partition definitions are a different kettle of fish. I don't think we really need hash partitioning for anything right away - it's pretty useless unless you've got, say, a way for the partitions to be foreign tables living on remote servers - but we shouldn't pick a design that will make it really hard to add later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Thu, Dec 11, 2014 at 8:42 PM, Robert Haas wrote: > > On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila wrote: > > Yeah either this way or what Josh has suggested upthread, the main > > point was that if at all we want to support multi-column list partitioning > > then we need to have slightly different syntax, however I feel that we > > can leave multi-column list partitioning for first version. > > Yeah, possibly. > > I think we could stand to have a lot more discussion about the syntax > here. So far the idea seems to be to copy what Oracle has, but it's > not clear if we're going to have exactly what Oracle has or something > subtly different. I personally don't find the Oracle syntax very > PostgreSQL-ish. I share your concern w.r.t the difficulties it can create if we don't do it carefully (one of the issue you have mentioned upthread about pg_dump, other such things could cause problems, if not thought of carefully from the beginning). One more thing, on a quick check it seems to me even DB2 uses some-thing similar to Oracle for defining partitions CREATE TABLE orders(id INT, shipdate DATE, …) PARTITION BY RANGE(shipdate) ( PARTITION q4_05 STARTING MINVALUE, PARTITION q1_06 STARTING '1/1/2006', PARTITION q2_06 STARTING '4/1/2006', PARTITION q3_06 STARTING '7/1/2006', PARTITION q4_06 STARTING '10/1/2006' ENDING ‘12/31/2006' ) I don't think there is any pressing need for PostgreSQL to use syntax similar to what some of the other databases use, however it has an advantage for ease of migration and ease of use (as people are already familiar with using such syntax). > Stuff like "VALUES LESS THAN 500" doesn't sit > especially well with me - less than according to which opclass? Are > we going to insist that partitioning must use the default btree > opclass so that we can use that syntax? That seems kind of lame. > Can't we simply specify the opclass along with column name while specifying partition clause which I feel is something similar to we already do in CREATE INDEX syntax. CREATE TABLE sales ( invoice_no NUMBER, sale_year INT NOT NULL, sale_month INT NOT NULL, sale_day INT NOT NULL ) PARTITION BY RANGE ( sale_year ) ( PARTITION sales_q1 VALUES LESS THAN (1999) Isn't the default operator class for a partition column would fit the bill for this particular case as the operators required in this syntax will be quite simple? > There are lots of interesting things we could do here, e.g.: > > CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]); > CREATE TABLE child_name PARTITION OF parent_name >FOR { (value, ...) [ TO (value, ...) ] } [, ...]; > The only thing which slightly bothers me about this syntax is that it makes apparent that partitions are separate tables and it would be inconvenient if we choose to disallow some operations on partitions. I think it might be better we treat partitions as a way to divide the large amount of data and users be only given the option to specify boundaries to divide this data and storage mechanism of partitions should be an internal detail (something like we do in TOAST table case). I am not sure which syntax users will be more comfortable to use as I am seeing and using Oracle type syntax from long time so my opinion could be biased in this case. It would be really helpful if others who need or use partitioning scheme can share their inputs. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
> -Original Message- > From: Robert Haas [mailto:robertmh...@gmail.com] > On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila > wrote: > > Yeah either this way or what Josh has suggested upthread, the main > > point was that if at all we want to support multi-column list partitioning > > then we need to have slightly different syntax, however I feel that we > > can leave multi-column list partitioning for first version. > > Yeah, possibly. > > I think we could stand to have a lot more discussion about the syntax > here. So far the idea seems to be to copy what Oracle has, but it's > not clear if we're going to have exactly what Oracle has or something > subtly different. I personally don't find the Oracle syntax very > PostgreSQL-ish. Stuff like "VALUES LESS THAN 500" doesn't sit > especially well with me - less than according to which opclass? Are > we going to insist that partitioning must use the default btree > opclass so that we can use that syntax? That seems kind of lame. > Syntax like VALUES LESS THAN 500 also means, we then have to go figure out what's that partition's lower bound based on upper bound of the previous one. Forget holes in the range if they matter. I expressed that concern elsewhere in favour of having available both a range's lower and upper bounds. > There are lots of interesting things we could do here, e.g.: > > CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]); So, no PARTITION BY [RANGE | LIST] clause huh? What we are calling pg_partitioned_rel would obtain following bits of information from such a definition of a partitioned relation: * column(s) to partition on and respective opclass(es) * the level this partitioned relation lies in the partitioning hierarchy (determining its relkind and storage qualification) By the way, I am not sure how we define a partitioning key on a partition (in other words, a subpartitioning key on the corresponding partitioned relation). Perhaps (only) via ALTER TABLE on a partition relation? > CREATE TABLE child_name PARTITION OF parent_name >FOR { (value, ...) [ TO (value, ...) ] } [, ...]; > So it's still a CREATE "TABLE" but the part 'PARTITION OF' turns this "table" into something having characteristics of a partition relation getting all kinds of new treatments at various places. It appears there is a redistribution of table-characteristics between a partitioned relation and its partition. We take away storage from the former and instead give it to the latter. On the other hand, the latter's data is only accessible through the former perhaps with escape routes for direct access via some special syntax attached to various access commands. We also stand to lose certain abilities with a partitioned relation such as not able to define a unique constraint (other than what partition key could potentially help ensure) or use it as target of foreign key constraint (just reiterating). What we call pg_partition_def obtains following bits of information from such a definition of a partition relation: * parent relation (partitioned relation this is partition of) * partition kind (do we even want to keep carrying this around as a separate field in catalog?) * values this partition holds The last part being the most important. In case of what we would have called a 'LIST' partition, this could look like ... FOR VALUES (val1, val2, val3, ...) Assuming we only support partition key to contain only one column in such a case. In case of what we would have called a 'RANGE' partition, this could look like ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...) How about BETWEEN ... AND ... ? Here we allow a partition key to contain more than one column. > So instead of making a hard distinction between range and list > partitioning, you can say: > > CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7); > CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12); > CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30), > (120) TO (130); > I would include the noise keyword VALUES just for readability if anything. > Now that might be a crappy idea for various reasons, but the point is > there are a lot of details to be hammered out with the syntax, and > there are several ways we can go wrong. If we choose an > overly-limiting syntax, we're needlessly restricting what can be done. > If we choose an overly-permissive syntax, we'll restrict the > optimization opportunities. > I am not sure but perhaps RANGE and LIST as partitioning kinds may as well just be noise keywords. We can parse those values into a parse node such that we don’t have to care about whether they describe partition as being one kind or the other. Say a List of something like, typedef struct PartitionColumnValue { NodeTagtype, Oid*partitionid, char *partcolname, Node *partrangelower, Node *partrangeup
Re: [HACKERS] On partitioning
On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila wrote: > Yeah either this way or what Josh has suggested upthread, the main > point was that if at all we want to support multi-column list partitioning > then we need to have slightly different syntax, however I feel that we > can leave multi-column list partitioning for first version. Yeah, possibly. I think we could stand to have a lot more discussion about the syntax here. So far the idea seems to be to copy what Oracle has, but it's not clear if we're going to have exactly what Oracle has or something subtly different. I personally don't find the Oracle syntax very PostgreSQL-ish. Stuff like "VALUES LESS THAN 500" doesn't sit especially well with me - less than according to which opclass? Are we going to insist that partitioning must use the default btree opclass so that we can use that syntax? That seems kind of lame. There are lots of interesting things we could do here, e.g.: CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]); CREATE TABLE child_name PARTITION OF parent_name FOR { (value, ...) [ TO (value, ...) ] } [, ...]; So instead of making a hard distinction between range and list partitioning, you can say: CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7); CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12); CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30), (120) TO (130); Now that might be a crappy idea for various reasons, but the point is there are a lot of details to be hammered out with the syntax, and there are several ways we can go wrong. If we choose an overly-limiting syntax, we're needlessly restricting what can be done. If we choose an overly-permissive syntax, we'll restrict the optimization opportunities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas wrote: > > On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila wrote: > > Yeah and also how would user specify the values, as an example > > assume that table is partitioned on monthly_salary, so partition > > definition would look: > > > > PARTITION BY LIST(monthly_salary) > > ( > > PARTITION salary_less_than_thousand VALUES(300, 900), > > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500), > > ... > > ) > > > > Now if user wants to define multi-column Partition based on > > monthly_salary and annual_salary, how do we want him to > > specify the values. Basically how to distinguish which values > > belong to first column key and which one's belong to second > > column key. > > I assume you just add some parentheses. > > PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1), > (valA2, valB2), (valA3, valB3)) > > Multi-column list partitioning may or may not be worth implementing, > but the syntax is not a real problem. > Yeah either this way or what Josh has suggested upthread, the main point was that if at all we want to support multi-column list partitioning then we need to have slightly different syntax, however I feel that we can leave multi-column list partitioning for first version. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera wrote: > > Amit Langote wrote: > > > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila wrote: > > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera < alvhe...@2ndquadrant.com> > > > wrote: > > > >> FWIW in my original proposal I was rejecting some things that after > > >> further consideration turn out to be possible to allow; for instance > > >> directly referencing individual partitions in COPY. We could allow > > >> something like > > >> > > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > > >> or maybe > > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT > > >> > > > or > > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > > > COPY [TABLE] lineitems PARTITION TO STDOUT > > > > > > I think we should try to support operations on partitions via main > > > table whereever it is required. > > Um, I think the only difference is that you added the noise word TABLE > which we currently don't allow in COPY, Yeah, we could eliminate TABLE keyword from this syntax, the reason I have kept was for easier understanding of syntax, currently we don't have concept of PARTITION in COPY syntax, but now if we want to introduce such a concept, then it might be better to have TABLE keyword for the purpose of syntax clarity. > and that you added the > possibility of using named partitions, about which see below. > > > We can also allow to explicitly name a partition > > > > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT; > > The problem with naming partitions is that the user has to pick names > for every partition, which is tedious and doesn't provide any > significant benefit. The input I had from users of other partitioning > systems was that they very much preferred not to name the partitions at > all, It seems to me both Oracle and DB2 supports named partitions, so even though there are user's which don't prefer named partitions, I suspect equal or more number of users will be there who will prefer for the sake of migration and because they are already used to such a syntax. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote wrote: > In heap_create(), do we create storage for a top level partitioned table > (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further > sub-partitioned? We might allocate storage for a partition at some point and > then later choose to sub-partition it. In such a case, perhaps, we would have > to move existing data to the storage of subpartitions and deallocate the > partition's storage. In other words only leaf relations in a partition > hierarchy would have storage. Is there such a notion within code for some > other purpose or we'd have to invent it for partitioning scheme? I think it would be advantageous to have storage only for the leaf partitions, because then you don't need to waste time doing a zero-block sequential scan of the root as part of the append-plan, an annoyance of the current system. We have no concept for this right now; in fact, right now, the relkind fully determines whether a given relation has storage. One idea is to make the leaves relkind = 'r' and the interior notes some new relkind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
> From: Robert Haas [mailto:robertmh...@gmail.com] > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund > wrote: > >> I don't think that's mutually exclusive with the idea of > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > >> wherever you want. > > > > That'll be a lot of places you'll need to touch. More fundamentally: Why > > should we name something a table that's not one? > > Well, I'm not convinced that it isn't one. And adding a new relkind > will involve a bunch of code churn, too. But I don't much care to > pre-litigate this: when someone has got a patch, we can either agree > that the approach is OK or argue that it is problematic because X. I > think we need to hammer down the design in broad strokes first, and > I'm not sure we're totally there yet. > In heap_create(), do we create storage for a top level partitioned table (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further sub-partitioned? We might allocate storage for a partition at some point and then later choose to sub-partition it. In such a case, perhaps, we would have to move existing data to the storage of subpartitions and deallocate the partition's storage. In other words only leaf relations in a partition hierarchy would have storage. Is there such a notion within code for some other purpose or we'd have to invent it for partitioning scheme? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila wrote: > Yeah and also how would user specify the values, as an example > assume that table is partitioned on monthly_salary, so partition > definition would look: > > PARTITION BY LIST(monthly_salary) > ( > PARTITION salary_less_than_thousand VALUES(300, 900), > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500), > ... > ) > > Now if user wants to define multi-column Partition based on > monthly_salary and annual_salary, how do we want him to > specify the values. Basically how to distinguish which values > belong to first column key and which one's belong to second > column key. I assume you just add some parentheses. PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1), (valA2, valB2), (valA3, valB3)) Multi-column list partitioning may or may not be worth implementing, but the syntax is not a real problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby wrote: > Agreed, but it's possible to keep a block/CTID interface while doing > something different on the disk. Objection: hand-waving. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera wrote: > The problem with naming partitions is that the user has to pick names > for every partition, which is tedious and doesn't provide any > significant benefit. The input I had from users of other partitioning > systems was that they very much preferred not to name the partitions at > all, which is why I chose the PARTITION FOR VALUE syntax (not sure if > this syntax is exactly what other systems use; it just seemed the > natural choice.) FWIW, Oracle does name partitions. It generates the names automatically if you don't care to specify them, and the partition names for a given table live in their own namespace that is separate from the toplevel object namespace. For example: CREATE TABLE sales ( invoice_no NUMBER, sale_year INT NOT NULL, sale_month INT NOT NULL, sale_day INT NOT NULL ) STORAGE (INITIAL 100K NEXT 50K) LOGGING PARTITION BY RANGE ( sale_year, sale_month, sale_day) ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 ) TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K), PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 ) TABLESPACE tsb, PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 ) TABLESPACE tsc, PARTITION sales q4 VALUES LESS THAN ( 2000, 01, 01 ) TABLESPACE tsd) ENABLE ROW MOVEMENT; I don't think this practice has much to recommend it. We're going to need a way to refer to individual partitions by name, and I don't see much benefit in making that name something other than what is stored in pg_class.relname. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Amit Langote wrote: > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila wrote: > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera > > wrote: > >> FWIW in my original proposal I was rejecting some things that after > >> further consideration turn out to be possible to allow; for instance > >> directly referencing individual partitions in COPY. We could allow > >> something like > >> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > >> or maybe > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT > >> > > or > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > > COPY [TABLE] lineitems PARTITION TO STDOUT > > > > I think we should try to support operations on partitions via main > > table whereever it is required. Um, I think the only difference is that you added the noise word TABLE which we currently don't allow in COPY, and that you added the possibility of using named partitions, about which see below. > We can also allow to explicitly name a partition > > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT; The problem with naming partitions is that the user has to pick names for every partition, which is tedious and doesn't provide any significant benefit. The input I had from users of other partitioning systems was that they very much preferred not to name the partitions at all, which is why I chose the PARTITION FOR VALUE syntax (not sure if this syntax is exactly what other systems use; it just seemed the natural choice.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila wrote: > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera > wrote: >> >> Amit Kapila wrote: >> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas >> > wrote: >> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund >> > wrote: >> > > >> I don't think that's mutually exclusive with the idea of >> > > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE >> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, >> > > >> ...) >> > > >> wherever you want. >> > > > >> > > > That'll be a lot of places you'll need to touch. More fundamentally: >> > > > Why >> > > > should we name something a table that's not one? >> > > >> > > Well, I'm not convinced that it isn't one. And adding a new relkind >> > > will involve a bunch of code churn, too. But I don't much care to >> > > pre-litigate this: when someone has got a patch, we can either agree >> > > that the approach is OK or argue that it is problematic because X. I >> > > think we need to hammer down the design in broad strokes first, and >> > > I'm not sure we're totally there yet. >> > >> > That's right, I think at this point defining the top level >> > behaviour/design >> > is very important to proceed, we can decide about the better >> > implementation approach afterwards (may be once initial patch is ready, >> > because it might not be a major work to do it either way). So here's >> > where >> > we are on this point till now as per my understanding, I think that >> > direct >> > operations should be prohibited on partitions, you think that they >> > should be >> > allowed and Andres think that it might be better to allow direct >> > operations >> > on partitions for Read. >> >> FWIW in my original proposal I was rejecting some things that after >> further consideration turn out to be possible to allow; for instance >> directly referencing individual partitions in COPY. We could allow >> something like >> >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT >> or maybe >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT >> > or > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > COPY [TABLE] lineitems PARTITION TO STDOUT > > I think we should try to support operations on partitions via main > table whereever it is required. > We can also allow to explicitly name a partition COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT; Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 12:33 PM, Amit Kapila wrote: > On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus wrote: >> On 12/09/2014 12:17 AM, Amit Langote wrote: >> >> Now if user wants to define multi-column Partition based on >> >> > monthly_salary and annual_salary, how do we want him to >> >> > specify the values. Basically how to distinguish which values >> >> > belong to first column key and which one's belong to second >> >> > column key. >> >> > >> > Perhaps you are talking about "syntactic" difficulties that I totally >> > missed in my other reply to this mail? >> > >> > Can we represent the same data by rather using a subpartitioning scheme? >> > ISTM, semantics would remain the same. >> > >> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)? >> > > Using SUBPARTITION is not the answer for multi-column partition, > I think if we have to support it for List partitioning then something > on lines what Josh has mentioned below could workout, but I don't > think it is important to support multi-column partition for List at this > stage. > Yeah, I realize multicolumn list partitioning and list-list composite partitioning are different things in many respects. And given how awkward multicolumn list partitioning is looking to implement, I also think we only allow single column in a list partition key. >> ... or just use arrays. >> >> PARTITION BY LIST ( monthly_salary, annual_salary ) >> PARTITION salary_small VALUES ({[300,400],[5000,6000]}) >> ) >> >> ... but that begs the question of how partition by list over two columns >> (or more) would even work? You'd need an a*b number of partitions, and >> the user would be pretty much certain to miss a few value combinations. >> Maybe we should just restrict list partitioning to a single column for >> a first release, and wait and see if people ask for more? >> > > I also think we should not support multi-column list partition in first > release. > Yes. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera wrote: > > Amit Kapila wrote: > > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas wrote: > > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund > > wrote: > > > >> I don't think that's mutually exclusive with the idea of > > > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE > > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > > > >> wherever you want. > > > > > > > > That'll be a lot of places you'll need to touch. More fundamentally: Why > > > > should we name something a table that's not one? > > > > > > Well, I'm not convinced that it isn't one. And adding a new relkind > > > will involve a bunch of code churn, too. But I don't much care to > > > pre-litigate this: when someone has got a patch, we can either agree > > > that the approach is OK or argue that it is problematic because X. I > > > think we need to hammer down the design in broad strokes first, and > > > I'm not sure we're totally there yet. > > > > That's right, I think at this point defining the top level behaviour/design > > is very important to proceed, we can decide about the better > > implementation approach afterwards (may be once initial patch is ready, > > because it might not be a major work to do it either way). So here's where > > we are on this point till now as per my understanding, I think that direct > > operations should be prohibited on partitions, you think that they should be > > allowed and Andres think that it might be better to allow direct operations > > on partitions for Read. > > FWIW in my original proposal I was rejecting some things that after > further consideration turn out to be possible to allow; for instance > directly referencing individual partitions in COPY. We could allow > something like > > COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > or maybe > COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT > or COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT COPY [TABLE] lineitems PARTITION TO STDOUT I think we should try to support operations on partitions via main table whereever it is required. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus wrote: > On 12/09/2014 12:17 AM, Amit Langote wrote: > >> Now if user wants to define multi-column Partition based on > >> > monthly_salary and annual_salary, how do we want him to > >> > specify the values. Basically how to distinguish which values > >> > belong to first column key and which one's belong to second > >> > column key. > >> > > > Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail? > > > > Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same. > > > > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)? > Using SUBPARTITION is not the answer for multi-column partition, I think if we have to support it for List partitioning then something on lines what Josh has mentioned below could workout, but I don't think it is important to support multi-column partition for List at this stage. > ... or just use arrays. > > PARTITION BY LIST ( monthly_salary, annual_salary ) > PARTITION salary_small VALUES ({[300,400],[5000,6000]}) > ) > > ... but that begs the question of how partition by list over two columns > (or more) would even work? You'd need an a*b number of partitions, and > the user would be pretty much certain to miss a few value combinations. > Maybe we should just restrict list partitioning to a single column for > a first release, and wait and see if people ask for more? > I also think we should not support multi-column list partition in first release. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On 12/8/14, 5:19 PM, Josh Berkus wrote: On 12/08/2014 02:12 PM, Jim Nasby wrote: On 12/8/14, 12:26 PM, Josh Berkus wrote: 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? Do users actually try and create new partitions during DML? That sounds doomed to failure in pretty much any system... There is no question that it would be easier for users to create partitions on demand automatically. Particularly if you're partitioning by something other than time. For a particular case, consider users on RDS, which has no cron jobs for creating new partitons; it's on demand or manually. It's quite possible that there is no good way to work out the locking for on-demand partitions though, but *if* we're going to have a 2nd partition system, I think it's important to at least discuss the problems with on-demand creation. Yeah, we should discuss it. Perhaps the right answer here may be our own job scheduler, something a lot of folks want anyway. 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. Though, you should be able to do that in either system if you bother to define your own hash in a BEFORE trigger... That doesn't do you any good with the SELECT query, unless you change your middleware to add a hash(column) to every query. Which would be really hard to do for joins. A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then "attach" it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. How much of the desire for this is because our current "row routing" solutions are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal mostly eliminates it. That doesn't always work, though. In some cases the partition is being built using some fairly complex logic (think of partitions which are based on matviews) and there's no fast way to create the new data. Again, this is an acceptable casualty of an improved design, but if it will be so, we should consciously decide that. Is there an example you can give here? If the scheme is that complicated I'm failing to see how you're supposed to do things like partition elimination. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/09/2014 12:17 AM, Amit Langote wrote: >> Now if user wants to define multi-column Partition based on >> > monthly_salary and annual_salary, how do we want him to >> > specify the values. Basically how to distinguish which values >> > belong to first column key and which one's belong to second >> > column key. >> > > Perhaps you are talking about "syntactic" difficulties that I totally missed > in my other reply to this mail? > > Can we represent the same data by rather using a subpartitioning scheme? > ISTM, semantics would remain the same. > > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)? ... or just use arrays. PARTITION BY LIST ( monthly_salary, annual_salary ) PARTITION salary_small VALUES ({[300,400],[5000,6000]}) ) ... but that begs the question of how partition by list over two columns (or more) would even work? You'd need an a*b number of partitions, and the user would be pretty much certain to miss a few value combinations. Maybe we should just restrict list partitioning to a single column for a first release, and wait and see if people ask for more? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Amit Kapila wrote: > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas wrote: > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund > wrote: > > >> I don't think that's mutually exclusive with the idea of > > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > > >> wherever you want. > > > > > > That'll be a lot of places you'll need to touch. More fundamentally: Why > > > should we name something a table that's not one? > > > > Well, I'm not convinced that it isn't one. And adding a new relkind > > will involve a bunch of code churn, too. But I don't much care to > > pre-litigate this: when someone has got a patch, we can either agree > > that the approach is OK or argue that it is problematic because X. I > > think we need to hammer down the design in broad strokes first, and > > I'm not sure we're totally there yet. > > That's right, I think at this point defining the top level behaviour/design > is very important to proceed, we can decide about the better > implementation approach afterwards (may be once initial patch is ready, > because it might not be a major work to do it either way). So here's where > we are on this point till now as per my understanding, I think that direct > operations should be prohibited on partitions, you think that they should be > allowed and Andres think that it might be better to allow direct operations > on partitions for Read. FWIW in my original proposal I was rejecting some things that after further consideration turn out to be possible to allow; for instance directly referencing individual partitions in COPY. We could allow something like COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT or maybe COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT and this would emit the whole partition for year 2000 of table lineitems, and only that (the value is just computed on the fly to fit the partitioning constraints for that individual partition). Then pg_dump would be able to dump each and every partition separately. In a similar way we could have COPY FROM allow input into individual partitions so that such a dump can be restored in parallel for each partition. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Josh Berkus wrote: Hi, > Pardon me for jumping into this late. In general, I like Alvaro's > approach. Please don't call this "Alvaro's approach" as I'm not involved in this anymore. Amit Langote has taken ownership of it now. While some resemblance to what I originally proposed might remain, I haven't kept track of how this has evolved and this might be a totally different thing now. Or not. Anyway I just wanted to comment on a single point: > 6. Unique Index Problem > Cannot create a unique index across multiple partitions, which prevents > the partitioned table from being FK'd. > Not Addressed > (but could be addressed in the future) I think it's unlikely that we will ever create a unique index that spans all the partitions, actually. Even if there are some wild ideas on how to implement such a thing, the number of difficult issues that no one knows how to attack seems too large. I would perhaps be thinking in allowing foreign keys to be defined on column sets that are prefixed by partition keys; unique indexes must exist on all partitions on the same columns including the partition keys. (Perhaps make an extra exception that if a partition allows a single value for the partition column, that column need not be part of the unique index.) > 10. Scaling Problem > Inheritance partitioning becomes prohibitively slow for the planner at > somewhere between 100 and 500 partitions depending on various factors. > No idea? At least it was my intention to make the system scale to huge number of partitions, but this requires some forward thinking (such as avoiding loading the index list of all of them or evern opening all of them at the planner stage) and I think would be defeated if we want to keep all the generality of the inheritance-based approach. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila wrote: > On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote > wrote: >> > From: Robert Haas [mailto:robertmh...@gmail.com] >> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila >> > wrote: >> > >> I guess you could list or hash partition on multiple columns, too. >> > > >> > > How would you distinguish values in list partition for multiple >> > > columns? I mean for range partition, we are sure there will >> > > be either one value for each column, but for list it could >> > > be multiple and not fixed for each partition, so I think it will not >> > > be easy to support the multicolumn partition key for list >> > > partitions. >> > >> > I don't understand. If you want to range partition on columns (a, b), >> > you say that, say, tuples with (a, b) values less than (100, 200) go >> > here and the rest go elsewhere. For list partitioning, you say that, >> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the >> > rest go elsewhere. I'm not sure how useful that is but it's not >> > illogical. >> > >> >> In case of list partitioning, 100 and 200 would respectively be one of the >> values in lists of allowed values for a and b. I thought his concern is >> whether this "list of values for each column in partkey" is as convenient to >> store and manipulate as range partvalues. >> > > Yeah and also how would user specify the values, as an example > assume that table is partitioned on monthly_salary, so partition > definition would look: > > PARTITION BY LIST(monthly_salary) > ( > PARTITION salary_less_than_thousand VALUES(300, 900), > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500), > ... > ) > > Now if user wants to define multi-column Partition based on > monthly_salary and annual_salary, how do we want him to > specify the values. Basically how to distinguish which values > belong to first column key and which one's belong to second > column key. > Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail? Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same. ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila wrote: > On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote > wrote: >> > From: Robert Haas [mailto:robertmh...@gmail.com] >> > I don't understand. If you want to range partition on columns (a, b), >> > you say that, say, tuples with (a, b) values less than (100, 200) go >> > here and the rest go elsewhere. For list partitioning, you say that, >> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the >> > rest go elsewhere. I'm not sure how useful that is but it's not >> > illogical. >> > >> >> In case of list partitioning, 100 and 200 would respectively be one of the >> values in lists of allowed values for a and b. I thought his concern is >> whether this "list of values for each column in partkey" is as convenient to >> store and manipulate as range partvalues. >> > > Yeah and also how would user specify the values, as an example > assume that table is partitioned on monthly_salary, so partition > definition would look: > > PARTITION BY LIST(monthly_salary) > ( > PARTITION salary_less_than_thousand VALUES(300, 900), > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500), > ... > ) > > Now if user wants to define multi-column Partition based on > monthly_salary and annual_salary, how do we want him to > specify the values. Basically how to distinguish which values > belong to first column key and which one's belong to second > column key. > Amit, in one of my earlier replies to your question of why we may not want to implement multi-column list partitioning (lack of user interest in the feature or possible complexity of the code), I tried to explain how that may work if we do choose to go that way. Basically, something we may call PartitionColumnValue should be such that above issue can be suitably sorted out. For example, a partition defining/bounding value would be a pg_node_tree representation of List of one of the (say) following parse nodes as appropriate - typedef struct PartitionColumnValue { NodeTag type, Oid *partitionid, char*partcolname, charpartkind, Node*partrangelower, Node*partrangeupper, List*partlistvalues }; OR separately, typedef struct RangePartitionColumnValue { NodeTag type, Oid *partitionid, char*partcolname, Node*partrangelower, Node*partrangeupper }; & typedef struct ListPartitionColumnValue { NodeTag type, Oid *partitionid, char*partcolname, List*partlistvalues }; Where a partition definition would look like typedef struct PartitionDef { NodeTag type, RangeVarpartition, RangeVarparentrel, char*kind, Node*values, List*options, char*tablespacename }; PartitionDef.values is an (ordered) List of PartitionColumnValue each of which corresponds to one column in the partition key in that order. We should be able to devise a way to load the pg_node_tree representation of PartitionDef.values (on-disk pg_partition_def.partvalues) into relcache using a "suitable data structure" so that it becomes readily usable in variety of contexts that we are interested in using this information. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas wrote: > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund wrote: > >> I don't think that's mutually exclusive with the idea of > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > >> wherever you want. > > > > That'll be a lot of places you'll need to touch. More fundamentally: Why > > should we name something a table that's not one? > > Well, I'm not convinced that it isn't one. And adding a new relkind > will involve a bunch of code churn, too. But I don't much care to > pre-litigate this: when someone has got a patch, we can either agree > that the approach is OK or argue that it is problematic because X. I > think we need to hammer down the design in broad strokes first, and > I'm not sure we're totally there yet. That's right, I think at this point defining the top level behaviour/design is very important to proceed, we can decide about the better implementation approach afterwards (may be once initial patch is ready, because it might not be a major work to do it either way). So here's where we are on this point till now as per my understanding, I think that direct operations should be prohibited on partitions, you think that they should be allowed and Andres think that it might be better to allow direct operations on partitions for Read. > > >> - Direct access to individual partitions to bypass > >> tuple-routing/query-planning overhead. > > > > I think that might be ok in some cases, but in general I'd be very wary > > to allow that. I think it might be ok to allow direct read access, but > > everything else I'd be opposed. I'd much rather go the route of allowing > > to few things and then gradually opening up if required than the other > > way round (as that pretty much will never happen because it'll break > > deployed systems). > > Why? > Because I think it will be difficult for users to write/maintain more of such code, which is one of the complaints with previous system where user needs to write triggers to route the tuple to appropriate partition. I think in first step we should try to improve the tuple routing algorithm so that it is not pain for users or atleast it should be at par with some of the other competitive database systems and if we are not able to come up with such an implementation, then may be we can think of providing it as a special way for users to improve performance. Another reason is that fundamentally partitions are managed internally to divide the user data in a way so that access could be cheaper and we take the specifications for defining the partitions from users and allowing operations on internally managed objects could lead to user writing quite some code to do what database actually does internally. If we see that TOAST table are internally used to manage large tuples, however we don't want users to directly perform dml on those tables. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote wrote: > > From: Robert Haas [mailto:robertmh...@gmail.com] > > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila > > wrote: > > >> I guess you could list or hash partition on multiple columns, too. > > > > > > How would you distinguish values in list partition for multiple > > > columns? I mean for range partition, we are sure there will > > > be either one value for each column, but for list it could > > > be multiple and not fixed for each partition, so I think it will not > > > be easy to support the multicolumn partition key for list > > > partitions. > > > > I don't understand. If you want to range partition on columns (a, b), > > you say that, say, tuples with (a, b) values less than (100, 200) go > > here and the rest go elsewhere. For list partitioning, you say that, > > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the > > rest go elsewhere. I'm not sure how useful that is but it's not > > illogical. > > > > In case of list partitioning, 100 and 200 would respectively be one of the values in lists of allowed values for a and b. I thought his concern is whether this "list of values for each column in partkey" is as convenient to store and manipulate as range partvalues. > Yeah and also how would user specify the values, as an example assume that table is partitioned on monthly_salary, so partition definition would look: PARTITION BY LIST(monthly_salary) ( PARTITION salary_less_than_thousand VALUES(300, 900), PARTITION salary_less_than_two_thousand VALUES (500,1000,1500), ... ) Now if user wants to define multi-column Partition based on monthly_salary and annual_salary, how do we want him to specify the values. Basically how to distinguish which values belong to first column key and which one's belong to second column key. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
> From: Robert Haas [mailto:robertmh...@gmail.com] > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila > wrote: > >> I guess you could list or hash partition on multiple columns, too. > > > > How would you distinguish values in list partition for multiple > > columns? I mean for range partition, we are sure there will > > be either one value for each column, but for list it could > > be multiple and not fixed for each partition, so I think it will not > > be easy to support the multicolumn partition key for list > > partitions. > > I don't understand. If you want to range partition on columns (a, b), > you say that, say, tuples with (a, b) values less than (100, 200) go > here and the rest go elsewhere. For list partitioning, you say that, > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the > rest go elsewhere. I'm not sure how useful that is but it's not > illogical. > In case of list partitioning, 100 and 200 would respectively be one of the values in lists of allowed values for a and b. I thought his concern is whether this "list of values for each column in partkey" is as convenient to store and manipulate as range partvalues. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/08/2014 02:12 PM, Jim Nasby wrote: > On 12/8/14, 12:26 PM, Josh Berkus wrote: >> 4. Creation Locking Problem >> high probability of lock pile-ups whenever a new partition is created on >> demand due to multiple backends trying to create the partition at the >> same time. >> Not Addressed? > > Do users actually try and create new partitions during DML? That sounds > doomed to failure in pretty much any system... There is no question that it would be easier for users to create partitions on demand automatically. Particularly if you're partitioning by something other than time. For a particular case, consider users on RDS, which has no cron jobs for creating new partitons; it's on demand or manually. It's quite possible that there is no good way to work out the locking for on-demand partitions though, but *if* we're going to have a 2nd partition system, I think it's important to at least discuss the problems with on-demand creation. >> 11. Hash Partitioning >> Some users would prefer to partition into a fixed number of >> hash-allocated partitions. >> Not Addressed. > > Though, you should be able to do that in either system if you bother to > define your own hash in a BEFORE trigger... That doesn't do you any good with the SELECT query, unless you change your middleware to add a hash(column) to every query. Which would be really hard to do for joins. >> A. COPY/ETL then attach >> In inheritance partitioning, you can easily build a partition outside >> the master and then "attach" it, allowing for minimal disturbance of >> concurrent users. Could be addressed in the future. > > How much of the desire for this is because our current "row routing" > solutions are very slow? I suspect that's the biggest reason, and > hopefully Alvaro's proposal mostly eliminates it. That doesn't always work, though. In some cases the partition is being built using some fairly complex logic (think of partitions which are based on matviews) and there's no fast way to create the new data. Again, this is an acceptable casualty of an improved design, but if it will be so, we should consciously decide that. >> B. Catchall Partition >> Many partitioning schemes currently contain a "catchall" partition which >> accepts rows outside of the range of the partitioning scheme, due to bad >> input data. Probably not handled on purpose; Alvaro is proposing that >> we reject these instead, or create the partitions on demand, which is a >> legitimate approach. >> >> C. Asymmetric Partitioning / NULLs in partition column >> This is the classic Active/Inactive By Month setup for partitions. >> Could be addressed via special handling for NULL/infinity in the >> partitioned column. > > If we allowed for a "catchall partition" and supported normal > inheritance/triggers on that partition then users could continue to do > whatever they needed with data that didn't fit the "normal" partitioning > pattern. That sounds to me like it would fall under the heading of "impossible levels of backwards-compatibility". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/8/14, 12:26 PM, Josh Berkus wrote: 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? Do users actually try and create new partitions during DML? That sounds doomed to failure in pretty much any system... 6. Unique Index Problem Cannot create a unique index across multiple partitions, which prevents the partitioned table from being FK'd. Not Addressed (but could be addressed in the future) And would be extremely useful even with simple inheritance, let alone partitioning... 9. Hibernate Problem When using the trigger method, inserts into the master partition return 0, which Hibernate and some other ORMs regard as an insert failure. Addressed. It would be really nice to address this with regular inheritance too... 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. Though, you should be able to do that in either system if you bother to define your own hash in a BEFORE trigger... A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then "attach" it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. How much of the desire for this is because our current "row routing" solutions are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal mostly eliminates it. B. Catchall Partition Many partitioning schemes currently contain a "catchall" partition which accepts rows outside of the range of the partitioning scheme, due to bad input data. Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach. C. Asymmetric Partitioning / NULLs in partition column This is the classic Active/Inactive By Month setup for partitions. Could be addressed via special handling for NULL/infinity in the partitioned column. If we allowed for a "catchall partition" and supported normal inheritance/triggers on that partition then users could continue to do whatever they needed with data that didn't fit the "normal" partitioning pattern. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/8/14, 1:05 PM, Robert Haas wrote: Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks number 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. Agreed, but it's possible to keep a block/CTID interface while doing something different on the disk. If you think about it, partitioning is really a hack anyway. It clutters up your logical set implementation with a bunch of physical details. What most people really want when they implement partitioning is simply data locality. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus wrote: >> I think any new partitioning system should keep the good things about >> the existing system, of which there are some, and not try to reinvent >> the wheel. The yard stick for a new system shouldn't be "is this >> different enough?" but "does this solve the problems without creating >> new ones?". > > It's unrealistic to assume that a new system would support all of the > features of the existing inheritance partitioning without restriction. > In fact, I'd say that such a requirement amounts to saying "don't > bother trying". > > For example, inheritance allows us to have different indexes, > constraints, and even columns on partitions. We can have overlapping > partitions, and heterogenous multilevel partitioning (partition this > customer by month but partition that customer by week). We can even add > triggers on individual partitions to reroute data away from a specific > partition. A requirement to support all of these peculiar uses of > inheritance partitioning would doom any new partitioning project. I don't think it has to be possible to support every use case that we can support today; clearly, a part of the goal here is to be LESS general so that we can be more performant. But I think the urge to change too many things at once had better be tempered by a clear-eyed vision of what can reasonably be accomplished in one patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund wrote: >> I don't think that's mutually exclusive with the idea of >> partitions-as-tables. I mean, you can add code to the ALTER TABLE >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) >> wherever you want. > > That'll be a lot of places you'll need to touch. More fundamentally: Why > should we name something a table that's not one? Well, I'm not convinced that it isn't one. And adding a new relkind will involve a bunch of code churn, too. But I don't much care to pre-litigate this: when someone has got a patch, we can either agree that the approach is OK or argue that it is problematic because X. I think we need to hammer down the design in broad strokes first, and I'm not sure we're totally there yet. >> - Direct access to individual partitions to bypass >> tuple-routing/query-planning overhead. > > I think that might be ok in some cases, but in general I'd be very wary > to allow that. I think it might be ok to allow direct read access, but > everything else I'd be opposed. I'd much rather go the route of allowing > to few things and then gradually opening up if required than the other > way round (as that pretty much will never happen because it'll break > deployed systems). Why? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/08/2014 11:40 AM, Robert Haas wrote: >> I don't thing its feasible to drop inheritance partitioning at this >> point; too many user exploit a lot of peculiarities of that system which >> wouldn't be supported by any other system. So any new partitioning >> system we're talking about would be *in addition* to the existing >> system. Hence my prior email trying to make sure that a new proposed >> system is sufficiently different from the existing one to be worthwhile. > > I think any new partitioning system should keep the good things about > the existing system, of which there are some, and not try to reinvent > the wheel. The yard stick for a new system shouldn't be "is this > different enough?" but "does this solve the problems without creating > new ones?". It's unrealistic to assume that a new system would support all of the features of the existing inheritance partitioning without restriction. In fact, I'd say that such a requirement amounts to saying "don't bother trying". For example, inheritance allows us to have different indexes, constraints, and even columns on partitions. We can have overlapping partitions, and heterogenous multilevel partitioning (partition this customer by month but partition that customer by week). We can even add triggers on individual partitions to reroute data away from a specific partition. A requirement to support all of these peculiar uses of inheritance partitioning would doom any new partitioning project. >>> Besides, I haven't really seen anyone propose something that sounds >>> like a credible alternative. If we could make partition objects >>> things that the storage layer needs to know about but the query >>> planner doesn't need to understand, that'd be maybe worth considering. >>> But I don't see any way that that's remotely feasible. >> >> On the other hand, as long as partitions exist exclusively at the >> planner layer, we can't fix the existing major shortcomings of >> inheritance partitioning, such as its inability to handle expressions. >> Again, see previous. > > Huh? Explained in the other email I posted on this thread. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 2014-12-08 14:48:50 -0500, Robert Haas wrote: > On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund wrote: > >> I guess I'm in disagreement with you - and, perhaps - the majority on > >> this point. I think that ship has already sailed: partitions ARE > >> tables. We can try to make it less necessary for users to ever look > >> at those tables as separate objects, and I think that's a good idea. > >> But trying to go from a system where partitions are tables, which is > >> what we have today, to a system where they are not seems like a bad > >> idea to me. If we make a major break from how things work today, > >> we're going to end up having to reimplement stuff that already works. > > > > I don't think this makes much sense. That'd severely restrict our > > ability to do stuff for a long time. Unless we can absolutely rely on > > the fact that partitions have the same schema and such we'll rob > > ourselves of significant optimization opportunities. > > I don't think that's mutually exclusive with the idea of > partitions-as-tables. I mean, you can add code to the ALTER TABLE > path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > wherever you want. That'll be a lot of places you'll need to touch. More fundamentally: Why should we name something a table that's not one? > >> Besides, I haven't really seen anyone propose something that sounds > >> like a credible alternative. If we could make partition objects > >> things that the storage layer needs to know about but the query > >> planner doesn't need to understand, that'd be maybe worth considering. > >> But I don't see any way that that's remotely feasible. There are lots > >> of places that we assume that a heap consists of blocks number 0 up > >> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits > >> and pieces of the way index vacuuming is handled, which in turn bleeds > >> into Hot Standby. You can't just decide that now block numbers are > >> going to be replaced by some more complex structure, or even that > >> they're now going to be nonlinear, without breaking a huge amount of > >> stuff. > > > > I think you're making a wrong fundamental assumption here. Just because > > we define partitions to not be full relations doesn't mean we have to > > treat them entirely separate. I don't see why a pg_class.relkind = 'p' > > entry would be something actually problematic. That'd easily allow to > > treat them differently in all the relevant places (all of ALTER TABLE, > > DML et al) and still allow all of the current planner/executor > > infrastructure. We can even allow direct SELECTs from individual > > partitions if we want to - that's trivial to achieve. > > We may just be using different words to talk about more-or-less the > same thing, then. That might be > What I'm saying is that I want these things to keep working: > - Indexes. Nobody argued against that I think. > - Merge append and any other inheritance-aware query planning > techniques. Same here. > - Direct access to individual partitions to bypass > tuple-routing/query-planning overhead. I think that might be ok in some cases, but in general I'd be very wary to allow that. I think it might be ok to allow direct read access, but everything else I'd be opposed. I'd much rather go the route of allowing to few things and then gradually opening up if required than the other way round (as that pretty much will never happen because it'll break deployed systems). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund wrote: >> I guess I'm in disagreement with you - and, perhaps - the majority on >> this point. I think that ship has already sailed: partitions ARE >> tables. We can try to make it less necessary for users to ever look >> at those tables as separate objects, and I think that's a good idea. >> But trying to go from a system where partitions are tables, which is >> what we have today, to a system where they are not seems like a bad >> idea to me. If we make a major break from how things work today, >> we're going to end up having to reimplement stuff that already works. > > I don't think this makes much sense. That'd severely restrict our > ability to do stuff for a long time. Unless we can absolutely rely on > the fact that partitions have the same schema and such we'll rob > ourselves of significant optimization opportunities. I don't think that's mutually exclusive with the idea of partitions-as-tables. I mean, you can add code to the ALTER TABLE path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) wherever you want. >> Besides, I haven't really seen anyone propose something that sounds >> like a credible alternative. If we could make partition objects >> things that the storage layer needs to know about but the query >> planner doesn't need to understand, that'd be maybe worth considering. >> But I don't see any way that that's remotely feasible. There are lots >> of places that we assume that a heap consists of blocks number 0 up >> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits >> and pieces of the way index vacuuming is handled, which in turn bleeds >> into Hot Standby. You can't just decide that now block numbers are >> going to be replaced by some more complex structure, or even that >> they're now going to be nonlinear, without breaking a huge amount of >> stuff. > > I think you're making a wrong fundamental assumption here. Just because > we define partitions to not be full relations doesn't mean we have to > treat them entirely separate. I don't see why a pg_class.relkind = 'p' > entry would be something actually problematic. That'd easily allow to > treat them differently in all the relevant places (all of ALTER TABLE, > DML et al) and still allow all of the current planner/executor > infrastructure. We can even allow direct SELECTs from individual > partitions if we want to - that's trivial to achieve. We may just be using different words to talk about more-or-less the same thing, then. What I'm saying is that I want these things to keep working: - Indexes. - Merge append and any other inheritance-aware query planning techniques. - Direct access to individual partitions to bypass tuple-routing/query-planning overhead. I am not necessarily saying that I have a problem with putting other restrictions on partitions, like requiring them to have the same tuple descriptor or the same ACLs as their parents. Those kinds of details bear discussion, but I'm not intrinsically opposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus wrote: > On 12/08/2014 11:05 AM, Robert Haas wrote: >> I guess I'm in disagreement with you - and, perhaps - the majority on >> this point. I think that ship has already sailed: partitions ARE >> tables. We can try to make it less necessary for users to ever look >> at those tables as separate objects, and I think that's a good idea. >> But trying to go from a system where partitions are tables, which is >> what we have today, to a system where they are not seems like a bad >> idea to me. If we make a major break from how things work today, >> we're going to end up having to reimplement stuff that already works. > > I don't thing its feasible to drop inheritance partitioning at this > point; too many user exploit a lot of peculiarities of that system which > wouldn't be supported by any other system. So any new partitioning > system we're talking about would be *in addition* to the existing > system. Hence my prior email trying to make sure that a new proposed > system is sufficiently different from the existing one to be worthwhile. I think any new partitioning system should keep the good things about the existing system, of which there are some, and not try to reinvent the wheel. The yard stick for a new system shouldn't be "is this different enough?" but "does this solve the problems without creating new ones?". >> Besides, I haven't really seen anyone propose something that sounds >> like a credible alternative. If we could make partition objects >> things that the storage layer needs to know about but the query >> planner doesn't need to understand, that'd be maybe worth considering. >> But I don't see any way that that's remotely feasible. > > On the other hand, as long as partitions exist exclusively at the > planner layer, we can't fix the existing major shortcomings of > inheritance partitioning, such as its inability to handle expressions. > Again, see previous. Huh? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 2014-12-08 14:05:52 -0500, Robert Haas wrote: > On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila wrote: > > Sure, I don't feel we should not provide anyway to take dump > > for individual partition but not at level of independent table. > > May be something like --table > > --partition . > > > > In general, I think we should try to avoid exposing that partitions are > > individual tables as that might hinder any future enhancement in that > > area (example if we someone finds a different and better way to > > arrange the partition data, then due to the currently exposed syntax, > > we might feel blocked). > > I guess I'm in disagreement with you - and, perhaps - the majority on > this point. I think that ship has already sailed: partitions ARE > tables. We can try to make it less necessary for users to ever look > at those tables as separate objects, and I think that's a good idea. > But trying to go from a system where partitions are tables, which is > what we have today, to a system where they are not seems like a bad > idea to me. If we make a major break from how things work today, > we're going to end up having to reimplement stuff that already works. I don't think this makes much sense. That'd severely restrict our ability to do stuff for a long time. Unless we can absolutely rely on the fact that partitions have the same schema and such we'll rob ourselves of significant optimization opportunities. > Besides, I haven't really seen anyone propose something that sounds > like a credible alternative. If we could make partition objects > things that the storage layer needs to know about but the query > planner doesn't need to understand, that'd be maybe worth considering. > But I don't see any way that that's remotely feasible. There are lots > of places that we assume that a heap consists of blocks number 0 up > through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits > and pieces of the way index vacuuming is handled, which in turn bleeds > into Hot Standby. You can't just decide that now block numbers are > going to be replaced by some more complex structure, or even that > they're now going to be nonlinear, without breaking a huge amount of > stuff. I think you're making a wrong fundamental assumption here. Just because we define partitions to not be full relations doesn't mean we have to treat them entirely separate. I don't see why a pg_class.relkind = 'p' entry would be something actually problematic. That'd easily allow to treat them differently in all the relevant places (all of ALTER TABLE, DML et al) and still allow all of the current planner/executor infrastructure. We can even allow direct SELECTs from individual partitions if we want to - that's trivial to achieve. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/08/2014 11:05 AM, Robert Haas wrote: > I guess I'm in disagreement with you - and, perhaps - the majority on > this point. I think that ship has already sailed: partitions ARE > tables. We can try to make it less necessary for users to ever look > at those tables as separate objects, and I think that's a good idea. > But trying to go from a system where partitions are tables, which is > what we have today, to a system where they are not seems like a bad > idea to me. If we make a major break from how things work today, > we're going to end up having to reimplement stuff that already works. I don't thing its feasible to drop inheritance partitioning at this point; too many user exploit a lot of peculiarities of that system which wouldn't be supported by any other system. So any new partitioning system we're talking about would be *in addition* to the existing system. Hence my prior email trying to make sure that a new proposed system is sufficiently different from the existing one to be worthwhile. > Besides, I haven't really seen anyone propose something that sounds > like a credible alternative. If we could make partition objects > things that the storage layer needs to know about but the query > planner doesn't need to understand, that'd be maybe worth considering. > But I don't see any way that that's remotely feasible. On the other hand, as long as partitions exist exclusively at the planner layer, we can't fix the existing major shortcomings of inheritance partitioning, such as its inability to handle expressions. Again, see previous. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila wrote: > Sure, I don't feel we should not provide anyway to take dump > for individual partition but not at level of independent table. > May be something like --table > --partition . > > In general, I think we should try to avoid exposing that partitions are > individual tables as that might hinder any future enhancement in that > area (example if we someone finds a different and better way to > arrange the partition data, then due to the currently exposed syntax, > we might feel blocked). I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks number 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote wrote: > So just to clarify, first and last destinations are considered "defined" if > you have something like: > > ... > PARTITION p1 VALUES LESS THAN 10 > PARTITION p2 VALUES BETWEEN 10 AND 20 > PARTITION p3 VALUES GREATER THAN 20 > ... > > And "not defined" if: > > ... > PARTITION p1 VALUES BETWEEN 10 AND 20 > ... Yes. >> For pg_dump --binary-upgrade, you need a statement like SELECT >> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for >> each pg_class entry. So you can't easily have a single SQL statement >> creating multiple such entries. > > Hmm, do you mean "pg_dump cannot emit" such a SQL or there shouldn't be one > in the first place? I mean that the binary upgrade script needs to set the OID for every pg_class object being restored, and it does that by stashing away up to one (1) pg_class OID before each CREATE statement. If a single CREATE statement generates multiple pg_class entries, this method doesn't work. > Makes sense. This would double as a way to create subpartitions too? And that > would have to play well with any choice we end up making about how we treat > subpartitioning key (one of the points discussed above) Yes, I think so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila wrote: >> I guess you could list or hash partition on multiple columns, too. > > How would you distinguish values in list partition for multiple > columns? I mean for range partition, we are sure there will > be either one value for each column, but for list it could > be multiple and not fixed for each partition, so I think it will not > be easy to support the multicolumn partition key for list > partitions. I don't understand. If you want to range partition on columns (a, b), you say that, say, tuples with (a, b) values less than (100, 200) go here and the rest go elsewhere. For list partitioning, you say that, say, tuples with (a, b) values of EXACTLY (100, 200) go here and the rest go elsewhere. I'm not sure how useful that is but it's not illogical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
All, Pardon me for jumping into this late. In general, I like Alvaro's approach. However, I wanted to list the major shortcomings of the existing replication system (based on complaints by PGX's users and on IRC) and compare them to Alvaro's proposed implementation to make sure that enough of them are addressed, and that the ones which aren't addressed are not being addressed as a clear decision. We can't address *all* of the limitations of the current system, but let's make sure that we're addressing enough of them to make implementing a 2nd partitioning system worthwhile. Where I have ? is because I'm not clear from Alvaro's proposal whether they're addressed or not. 1.The Trigger Problem the need to write triggers for INSERT/UPDATE/DELETE. Addressed. 2. The Clutter Problem cluttering up system views and dumps with hundreds of partitioned tables Addressed. 3. Creation Problem the need two write triggers and/or cron jobs to create new partitions Addressed. 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? 5. Constant Problem Since current partitioned query planning happens before the rewrite phase, SELECTs do not use partition logic to evaluate even simple expressions, let alone IMMUTABLE or STABLE functions. Addressed?? 6. Unique Index Problem Cannot create a unique index across multiple partitions, which prevents the partitioned table from being FK'd. Not Addressed (but could be addressed in the future) 7. JOIN Problem Two partitioned tables being JOINed need to append and materialize before the join, causing a very slow join under some circumstances, even if both tables are partitioned on the same ranges. Not Addressed? (but could be addressed in the future) 8. COPY Problem Cannot bulk-load into the Master, just into individual partitions. Addressed. 9. Hibernate Problem When using the trigger method, inserts into the master partition return 0, which Hibernate and some other ORMs regard as an insert failure. Addressed. 10. Scaling Problem Inheritance partitioning becomes prohibitively slow for the planner at somewhere between 100 and 500 partitions depending on various factors. No idea? 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. 12. Extra Constraint Evaluation Inheritance partitioning evaluates *all* constraints on the partitions, whether they are part of the partitioning scheme or not. This is way expensive if those are, say, polygon comparisons. Addressed. Additionally, I believe that Alvaro's proposal will make the following activities which are supported by partition-by-inheritance more difficult or impossible. Again, these are probably acceptable because inheritance partitioning isn't going away. However, we should consciously decide that: A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then "attach" it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. B. Catchall Partition Many partitioning schemes currently contain a "catchall" partition which accepts rows outside of the range of the partitioning scheme, due to bad input data. Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach. C. Asymmetric Partitioning / NULLs in partition column This is the classic Active/Inactive By Month setup for partitions. Could be addressed via special handling for NULL/infinity in the partitioned column. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
From: Amit Kapila [mailto:amit.kapil...@gmail.com] > > > How would you distinguish values in list partition for multiple > > > columns? I mean for range partition, we are sure there will > > > be either one value for each column, but for list it could > > > be multiple and not fixed for each partition, so I think it will not > > > be easy to support the multicolumn partition key for list > > > partitions. > > >Irrespective of difficulties of representing it using pg_node_tree, it seems > >to me that multicolumn list partitioning is not widely used. > > So I think it is better to be clear why we are not planning to > support it, is it that because it is not required by users or > is it due to the reason that code seems to be tricky or is it due > to both of the reasons. It might help us if anyone raises this > during the development of this patch or in general if someone > requests such a feature. Coming back to the how pg_node_tree representation for list partitions - For each column in a multicolumn list partition key, a value would look like a dumped Node for List of Consts (all allowed values in a given list partition). And the whole key would then be a List of such Nodes (a dump thereof). That's perhaps pretty verbose but I guess that's supposed to be only a catalog representation. During relcache building, we turn this back into a collection of structs to efficiently locate the partition of interest whatever the method of doing that ends up being (based on partition type). The relcache step ensures that we have decoupled the concern of quickly locating an interesting partition from its catalog representation. Of course, there may be flaws in this picture and would only reveal themselves when actually trying to implement it or they can be pointed out in advance. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 11:01 AM, Amit Langote wrote: > From: Amit Kapila [mailto:amit.kapil...@gmail.com] > Sent: Saturday, December 06, 2014 5:00 PM > To: Robert Haas > Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers > Subject: Re: [HACKERS] On partitioning > > On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas wrote: > > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > > wrote: > > > > > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) > > > > I guess you could list or hash partition on multiple columns, too. > > > > How would you distinguish values in list partition for multiple > > columns? I mean for range partition, we are sure there will > > be either one value for each column, but for list it could > > be multiple and not fixed for each partition, so I think it will not > > be easy to support the multicolumn partition key for list > > partitions. > > Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used. So I think it is better to be clear why we are not planning to support it, is it that because it is not required by users or is it due to the reason that code seems to be tricky or is it due to both of the reasons. It might help us if anyone raises this during the development of this patch or in general if someone requests such a feature. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila Sent: Saturday, December 06, 2014 5:06 PM To: Robert Haas Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers Subject: Re: [HACKERS] On partitioning On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas wrote: > On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila wrote: > > Do we really need to support dml or pg_dump for individual partitions? > > I think we do. It's quite reasonable for a DBA (or developer or > whatever) to want to dump all the data that's in a single partition; > for example, maybe they have the table partitioned, but also spread > across several servers. When the data on one machine grows too big, > they want to dump that partition, move it to a new machine, and drop > the partition from the old machine. That needs to be easy and > efficient. > > More generally, with inheritance, I've seen the ability to reference > individual inheritance children be a real life-saver on any number of > occasions. Now, a new partitioning system that is not as clunky as > constraint exclusion will hopefully be fast enough that people don't > need to do it very often any more. But I would be really cautious > about removing the option. That is the equivalent of installing a new > fire suppression system and then boarding up the emergency exit. > Yeah, you *hope* the new fire suppression system is good enough that > nobody will ever need to go out that way any more. But if you're > wrong, people will die, so getting rid of it isn't prudent. The > stakes are not quite so high here, but the principle is the same. > > > Sure, I don't feel we should not provide anyway to take dump > for individual partition but not at level of independent table. > May be something like --table > --partition . > This does sound cleaner. > In general, I think we should try to avoid exposing that partitions are > individual tables as that might hinder any future enhancement in that > area (example if we someone finds a different and better way to > arrange the partition data, then due to the currently exposed syntax, > we might feel blocked). Sounds like a concern. I guess you are referring to whether we allow a partition relation to be included in the range table and then some other cases. In the former case we could allow referring to individual partitions by some additional syntax if it doesn’t end up looking too ugly or invite a bunch of other issues. This seems to have been discussed a little bit upthread (for example, see "Open Questions" in Alvaro's original proposal and Hannu Krosing's reply). Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
From: Amit Kapila [mailto:amit.kapil...@gmail.com] Sent: Saturday, December 06, 2014 5:00 PM To: Robert Haas Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers Subject: Re: [HACKERS] On partitioning On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas wrote: > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > wrote: > > > I wonder if your suggestion of pg_node_tree plays well here. This then > > could be a list of CONSTs or some such... And I am thinking it's a concern > > only for range partitions, no? (that is, a multicolumn partition key) > > I guess you could list or hash partition on multiple columns, too. > > How would you distinguish values in list partition for multiple > columns? I mean for range partition, we are sure there will > be either one value for each column, but for list it could > be multiple and not fixed for each partition, so I think it will not > be easy to support the multicolumn partition key for list > partitions. Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used. It is used in combination with range or hash partitioning as composite partitioning. So, perhaps we need not worry about that. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Hi Robert, > From: Robert Haas [mailto:robertmh...@gmail.com] > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > wrote: > >> So, we're going to support exactly two levels of partitioning? > >> partitions with partissub=false and subpartitions with partissub=true? > >> Why not support only one level of partitioning here but then let the > >> children have their own pg_partitioned_rel entries if they are > >> subpartitioned? That seems like a cleaner design and lets us support > >> an arbitrary number of partitioning levels if we ever need them. > > > > Yeah, that's what I thought at some point in favour of dropping partissub > altogether. However, not that this design solves it, there is one question - > if > we would want to support defining for a table both partition key and sub- > partition key in advance? That is, without having defined a first level > partition > yet; in that case, what level do we associate sub-(sub-) partitioning key with > or more to the point where do we keep it? > > Do we really need to allow that? I think you let people partition a > toplevel table, and then partition its partitions once they've been > created. I'm not sure there's a good reason to associate the > subpartitioning scheme with the toplevel table. For one thing, that > forces all subpartitions to be partitioned the same way - do we want > to insist on that? If we do, then I agree that we need to think a > little harder here. > To me, it sounds better if we insist on a uniform subpartitioning scheme across all partitions. It seems that's how it's done elsewhere. It would be interesting to hear what others think though. > > That would be a default partition. That is, where the tuples that don't > belong elsewhere (other defined partitions) go. VALUES clause of the > definition for such a partition would look like: > > > > (a range partition) ... VALUES LESS THAN MAXVALUE > > (a list partition) ... VALUES DEFAULT > > > > There has been discussion about whether there shouldn't be such a place > for tuples to go. That is, it should generate an error if a tuple can't go > anywhere (or support auto-creating a new one like in interval partitioning?) > > I think Alvaro's response further down the thread is right on target. > But to go into a bit more detail, let's consider the three possible > cases: > > - Hash partitioning. Every key value gets hashed to some partition. > The concept of an overflow or default partition doesn't even make > sense. > > - List partitioning. Each key for which the user has defined a > mapping gets sent to the corresponding partition. The keys that > aren't mapped anywhere can either (a) cause an error or (b) get mapped > to some default partition. It's probably useful to offer both > behaviors. But I don't think it requires a partitionisoverflow > column, because you can represent it some other way, such as by making > partitionvalues NULL, which is otherwise meaningless. > > - Range partitioning. In this case, what you've basically got is a > list of partition bounds and a list of target partitions. Suppose > there are N partition bounds; then there will be N+1 targets. Some of > those targets can be undefined, meaning an attempt to insert a key > with that value will error out. For example, suppose the user defines > a partition for values 1-3 and 10-13. Then your list of partition > bounds looks like this: > > 1,3,10,13 > > And your list of destinations looks like this: > > undefined,firstpartition,undefined,secondpartition,undefined > > More commonly, the ranges will be contiguous, so that there are no > gaps. If you have everything <10 in the first partition, everything > 10-20 in the second partition, and everything else in a third > partition, then you have bounds 10,20 and destinations > firstpartition,secondpartition,thirdpartition. If you want values > greater than 20 to error out, then you have bounds 10,20 and > destinations firstpartition,secondpartition,undefined. > > In none of this do you really have "an overflow partition". Rather, > the first and last destinations, if defined, catch everything that has > a key lower than the lowest key or higher than the highest key. If > not defined, you error out. So just to clarify, first and last destinations are considered "defined" if you have something like: ... PARTITION p1 VALUES LESS THAN 10 PARTITION p2 VALUES BETWEEN 10 AND 20 PARTITION p3 VALUES GREATER THAN 20 ... And "not defined" if: ... PARTITION p1 VALUES BETWEEN 10 AND 20 ... In the second case, because no explicit definitions for values less than 10 and greater than 20 are in place, rows with that value error out? If so, that makes sense. > > > I wonder if your suggestion of pg_node_tree plays well here. This then > could be a list of CONSTs or some such... And I am thinking it's a concern > only > for range partitions, no? (that is, a multicolumn partition key) > > I guess you could list or hash partition on multiple colum
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas wrote: > On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila wrote: > > Do we really need to support dml or pg_dump for individual partitions? > > I think we do. It's quite reasonable for a DBA (or developer or > whatever) to want to dump all the data that's in a single partition; > for example, maybe they have the table partitioned, but also spread > across several servers. When the data on one machine grows too big, > they want to dump that partition, move it to a new machine, and drop > the partition from the old machine. That needs to be easy and > efficient. > > More generally, with inheritance, I've seen the ability to reference > individual inheritance children be a real life-saver on any number of > occasions. Now, a new partitioning system that is not as clunky as > constraint exclusion will hopefully be fast enough that people don't > need to do it very often any more. But I would be really cautious > about removing the option. That is the equivalent of installing a new > fire suppression system and then boarding up the emergency exit. > Yeah, you *hope* the new fire suppression system is good enough that > nobody will ever need to go out that way any more. But if you're > wrong, people will die, so getting rid of it isn't prudent. The > stakes are not quite so high here, but the principle is the same. > Sure, I don't feel we should not provide anyway to take dump for individual partition but not at level of independent table. May be something like --table --partition . In general, I think we should try to avoid exposing that partitions are individual tables as that might hinder any future enhancement in that area (example if we someone finds a different and better way to arrange the partition data, then due to the currently exposed syntax, we might feel blocked). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas wrote: > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > wrote: > > > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) > > I guess you could list or hash partition on multiple columns, too. How would you distinguish values in list partition for multiple columns? I mean for range partition, we are sure there will be either one value for each column, but for list it could be multiple and not fixed for each partition, so I think it will not be easy to support the multicolumn partition key for list partitions. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 3:05 PM, Jim Nasby wrote: >> On what basis do you expect that? Every time you use a view, you're >> using a pg_node_tree. Nobody's ever complained that having to reload >> the pg_node_tree column was too slow, and I see no reason to suppose >> that things would be any different here. >> >> I mean, we can certainly invent something new if there is a reason to >> do so. But you (and a few other people) seem to be trying pretty hard >> to avoid using the massive amount of infrastructure that we already >> have to do almost this exact thing, which puzzles the heck out of me. > > My concern is how to do the routing of incoming tuples. I'm assuming it'd be > significantly faster to compare two tuples than to run each tuple through a > bunch of nodetrees. As I said before, that's a completely unrelated problem. To quickly route tuples for range or list partitioning, you're going to want to have an array of Datums in memory and bseach it. That says nothing about how they should be stored on disk. Whatever the on-disk representation looks like, the relcache is going to need to reassemble it into an array that can be binary-searched. As long as that's not hard to do - and none of the proposals here would make it hard to do - there's no reason to care about this from that point of view. At least, not that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/5/14, 2:02 PM, Robert Haas wrote: On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby wrote: The other option would be to use some custom rowtype to store boundary values and have a method that can form a boundary tuple from a real one. Either way, I suspect this is better than frequently evaluating pg_node_trees. On what basis do you expect that? Every time you use a view, you're using a pg_node_tree. Nobody's ever complained that having to reload the pg_node_tree column was too slow, and I see no reason to suppose that things would be any different here. I mean, we can certainly invent something new if there is a reason to do so. But you (and a few other people) seem to be trying pretty hard to avoid using the massive amount of infrastructure that we already have to do almost this exact thing, which puzzles the heck out of me. My concern is how to do the routing of incoming tuples. I'm assuming it'd be significantly faster to compare two tuples than to run each tuple through a bunch of nodetrees. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby wrote: > The other option would be to use some custom rowtype to store boundary > values and have a method that can form a boundary tuple from a real one. > Either way, I suspect this is better than frequently evaluating > pg_node_trees. On what basis do you expect that? Every time you use a view, you're using a pg_node_tree. Nobody's ever complained that having to reload the pg_node_tree column was too slow, and I see no reason to suppose that things would be any different here. I mean, we can certainly invent something new if there is a reason to do so. But you (and a few other people) seem to be trying pretty hard to avoid using the massive amount of infrastructure that we already have to do almost this exact thing, which puzzles the heck out of me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/5/14, 1:22 PM, Jim Nasby wrote: On 12/5/14, 3:42 AM, Amit Langote wrote: > I think you are right. I think in this case we need something similar >to column pg_index.indexprs which is of type pg_node_tree(which >seems to be already suggested by Robert). So may be we can proceed >with this type and see if any one else has better idea. One point raised about/against pg_node_tree was the values represented therein would turn out to be too generalized to be used with advantage during planning. But, it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in case of partitioned tables. Perhaps this is what Robert suggested elsewhere. In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader. On further thought; if we disallow NULL as a partition boundary, we don't need a separate rowtype; we could just use the one associated with the relation itself. Presumably that would make comparing tuples to the relation list a lot easier. I was hung up on how that would work in the case of ALTER TABLE, but we'd have the same problem with using pg_node_tree: if you alter a table in such a way that *might* affect your partitioning, you have to do some kind of revalidation anyway. The other option would be to use some custom rowtype to store boundary values and have a method that can form a boundary tuple from a real one. Either way, I suspect this is better than frequently evaluating pg_node_trees. There may be one other option. If range partitions are defined in terms of an expression that is different for every partition (ie: (substr(product_key, 1, 4), date_trunc('month', sales_date))) then we could use a hash of that expression to identify a partition. In other words, range partitioning becomes a special case of hash partitioning. I do think we need a programmatic means to identify the range of an individual partition and hash won't solve that, but the performance of that case isn't critical so we could use pretty much whatever we wanted to there. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 12/5/14, 3:42 AM, Amit Langote wrote: > I think you are right. I think in this case we need something similar >to column pg_index.indexprs which is of type pg_node_tree(which >seems to be already suggested by Robert). So may be we can proceed >with this type and see if any one else has better idea. One point raised about/against pg_node_tree was the values represented therein would turn out to be too generalized to be used with advantage during planning. But, it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in case of partitioned tables. Perhaps this is what Robert suggested elsewhere. In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 3:11 AM, Amit Langote wrote: >> I think you are right. I think in this case we need something similar >> to column pg_index.indexprs which is of type pg_node_tree(which >> seems to be already suggested by Robert). So may be we can proceed >> with this type and see if any one else has better idea. > > Yeah, with that, I was thinking we may be able to do something like dump a > Node that describes the range partition bounds or list of allowed values > (say, RangePartitionValues, ListPartitionValues). That's exactly what the kind of thing I was thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers