Re: [HACKERS] Declarative partitioning - another take
Hi Maksim,

On 2017/04/07 19:52, Maksim Milyutin wrote:
> On 07.04.2017 13:05, Etsuro Fujita wrote:
>> On 2016/12/09 19:46, Maksim Milyutin wrote:
>>> I would like to work on two tasks:
>>> - insert (and eventually update) tuple routing for foreign partition.
>>> - the ability to create an index on the parent and have all of the children inherit it;
>>
>> There seems to be no work on the first one, so I'd like to work on that.
>
> Yes, you can start to work on this, I'll join later as a reviewer.

Great! I added the patch to the next commitfest:
https://commitfest.postgresql.org/14/1184/

Sorry for the delay.

Best regards,
Etsuro Fujita
Re: [HACKERS] Declarative partitioning - another take
On Wed, May 10, 2017 at 6:50 AM, Etsuro Fujita wrote:
> I started working on this. I agree that the changes made in
> make_modifytable would be unacceptable, but I'd vote for Amit's idea of
> passing a modified PlannerInfo to PlanForeignModify so that the FDW can do
> query planning for INSERT into a foreign partition in the same way as for
> INSERT into a non-partition foreign table. (Though, I think we should
> generate a more-valid-looking working-copy of the PlannerInfo which has
> Query with the foreign partition as target.) I'm not sure it's a good idea
> to add a new FDW API or modify the existing one such as PlanForeignModify
> for this purpose.

Thanks for working on this, but I think it would be better to start a new
thread for this discussion. And probably also for any other issues that
come up. This thread has gotten so long that between the original
discussion and commit of the patch and discussion of multiple follow-on
commits and patches that it's very hard to follow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2016/11/18 1:43, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote wrote:
>>> - The code in make_modifytable() to swap out the rte_array for a fake
>>> one looks like an unacceptable kludge. I don't know offhand what a
>>> better design would look like, but what you've got is really ugly.
>>
>> Agree that it looks horrible. The problem is we don't add partition
>> (child table) RTEs when planning an insert on the parent and FDW
>> partitions can't do without some planner handling - planForeignModify()
>> expects a valid PlannerInfo for deparsing target lists (basically, to be
>> able to use planner_rt_fetch()).
>
> If it's only needed for foreign tables, how about for v1 we just throw an
> error and say errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot
> route inserted tuples to a foreign table") for now. We can come back and
> fix it later. Doing more inheritance expansion
>
> Coming up with some new FDW API or some modification to the existing one
> is probably better, but I don't really want to get hung up on that right
> now.

I started working on this. I agree that the changes made in
make_modifytable would be unacceptable, but I'd vote for Amit's idea of
passing a modified PlannerInfo to PlanForeignModify so that the FDW can do
query planning for INSERT into a foreign partition in the same way as for
INSERT into a non-partition foreign table. (Though, I think we should
generate a more-valid-looking working-copy of the PlannerInfo which has
Query with the foreign partition as target.) I'm not sure it's a good idea
to add a new FDW API or modify the existing one such as PlanForeignModify
for this purpose.

Best regards,
Etsuro Fujita
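
For context, the case under discussion is an INSERT on a partitioned parent
whose matching partition is a postgres_fdw foreign table. The following is a
minimal sketch only; the server, user mapping, and table names are made up,
and it assumes a remote table of the same shape already exists. In v10 as
committed, the insert through the parent is rejected rather than routed
through the FDW, which is what the proposed PlanForeignModify change would
enable.

CREATE EXTENSION postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'shard1');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;

CREATE TABLE measurement (logdate date, value int) PARTITION BY RANGE (logdate);

-- A foreign table can itself be a partition.
CREATE FOREIGN TABLE measurement_y2017
    PARTITION OF measurement FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    SERVER shard1 OPTIONS (table_name 'measurement_y2017');

-- Inserting into the foreign partition directly goes through the FDW as usual,
-- but routing an insert on the parent to that partition is not supported yet.
INSERT INTO measurement_y2017 VALUES ('2017-06-01', 1);
INSERT INTO measurement VALUES ('2017-06-01', 1);  -- rejected in v10
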
Re: [HACKERS] Declarative partitioning - another take
On 2017/05/10 12:59, Robert Haas wrote:
> On Tue, May 9, 2017 at 11:54 PM, Thomas Munro wrote:
>> +A statement that targets a parent table in a inheritance or partitioning
>>
>> A tiny typo: s/a inheritance/an inheritance/
>
> Now he tells me.

Thanks both.

Regards,
Amit
Re: [HACKERS] Declarative partitioning - another take
On Tue, May 9, 2017 at 11:54 PM, Thomas Munro wrote:
> +A statement that targets a parent table in a inheritance or partitioning
>
> A tiny typo: s/a inheritance/an inheritance/

Now he tells me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Wed, May 10, 2017 at 3:51 PM, Robert Haas wrote:
> On Sun, May 7, 2017 at 9:44 PM, Amit Langote wrote:
>> I think that makes sense. Modified it to read: "A statement that targets
>> a parent table in a inheritance or partitioning hierarchy..." in the
>> attached updated patch.
>
> LGTM. Committed.

+A statement that targets a parent table in a inheritance or partitioning

A tiny typo: s/a inheritance/an inheritance/

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning - another take
On Sun, May 7, 2017 at 9:44 PM, Amit Langote wrote:
> I think that makes sense. Modified it to read: "A statement that targets
> a parent table in a inheritance or partitioning hierarchy..." in the
> attached updated patch.

LGTM. Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2017/05/08 10:22, Thomas Munro wrote:
> On Mon, May 8, 2017 at 12:47 PM, Amit Langote wrote:
>> On 2017/05/03 2:48, Robert Haas wrote:
>>> On Tue, May 2, 2017 at 3:30 AM, Amit Langote wrote:
>>>> You're right. I agree that whatever text we add here should be pointing
>>>> out that statement-level triggers of affected child tables are not fired,
>>>> when root parent is specified in the command.
>>>>
>>>> Since there was least some talk of changing that behavior for regular
>>>> inheritance so that statement triggers of any affected children are fired
>>>> [1], I thought we shouldn't say something general that applies to both
>>>> inheritance and partitioning. But since nothing has happened in that
>>>> regard, we might as well.
>>>>
>>>> How about the attached?
>>>
>>> Looks better, but I think we should say "statement" instead of
>>> "operation" for consistency with the previous paragraph, and it
>>> certainly shouldn't be capitalized.
>>
>> Agreed, done. Attached updated patch.
>
> +A statement that targets the root table in a inheritance or partitioning
> +hierarchy does not cause the statement-level triggers of affected child
> +tables to be fired; only the root table's statement-level triggers are
> +fired. However, row-level triggers of any affected child tables will be
> +fired.
> +
> +
> +
>
> Why talk specifically about the "root" table? Wouldn't we describe
> the situation more generally if we said [a,the] "parent"?

I think that makes sense. Modified it to read: "A statement that targets
a parent table in a inheritance or partitioning hierarchy..." in the
attached updated patch.

Thanks,
Amit

>From 5c2c453235d5eedf857a5e7123337aae5aedd761 Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH] Clarify statement trigger behavior with inheritance

---
 doc/src/sgml/trigger.sgml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 6f8416dda7..ce76a1f042 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -123,6 +123,14 @@
+A statement that targets a parent table in a inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the parent table's statement-level triggers are
+fired. However, row-level triggers of any affected child tables will be
+fired.
+
+
+
 If an INSERT contains an ON CONFLICT
 DO UPDATE clause, it is possible that the effects of
 all row-level BEFORE INSERT triggers
--
2.11.0
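
A concrete illustration of the paragraph being added above; the object and
trigger names here are made up for the sketch and do not come from the
thread. A DELETE that targets the parent fires only the parent's
statement-level trigger, while row-level triggers on the affected partition
still fire.

CREATE TABLE parent (a int) PARTITION BY LIST (a);
CREATE TABLE child PARTITION OF parent FOR VALUES IN (1);
INSERT INTO parent VALUES (1);

CREATE FUNCTION note_trigger() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE '% % trigger on %', TG_WHEN, TG_LEVEL, TG_TABLE_NAME;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parent_stmt AFTER DELETE ON parent
    FOR EACH STATEMENT EXECUTE PROCEDURE note_trigger();
CREATE TRIGGER child_stmt AFTER DELETE ON child
    FOR EACH STATEMENT EXECUTE PROCEDURE note_trigger();
CREATE TRIGGER child_row AFTER DELETE ON child
    FOR EACH ROW EXECUTE PROCEDURE note_trigger();

-- Targets the parent: parent_stmt and child_row fire, child_stmt does not.
DELETE FROM parent WHERE a = 1;
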
Re: [HACKERS] Declarative partitioning - another take
On Mon, May 8, 2017 at 12:47 PM, Amit Langote wrote:
> On 2017/05/03 2:48, Robert Haas wrote:
>> On Tue, May 2, 2017 at 3:30 AM, Amit Langote wrote:
>>> You're right. I agree that whatever text we add here should be pointing
>>> out that statement-level triggers of affected child tables are not fired,
>>> when root parent is specified in the command.
>>>
>>> Since there was least some talk of changing that behavior for regular
>>> inheritance so that statement triggers of any affected children are fired
>>> [1], I thought we shouldn't say something general that applies to both
>>> inheritance and partitioning. But since nothing has happened in that
>>> regard, we might as well.
>>>
>>> How about the attached?
>>
>> Looks better, but I think we should say "statement" instead of
>> "operation" for consistency with the previous paragraph, and it
>> certainly shouldn't be capitalized.
>
> Agreed, done. Attached updated patch.

+A statement that targets the root table in a inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the root table's statement-level triggers are
+fired. However, row-level triggers of any affected child tables will be
+fired.
+
+
+

Why talk specifically about the "root" table? Wouldn't we describe
the situation more generally if we said [a,the] "parent"?

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning - another take
On 2017/05/03 2:48, Robert Haas wrote:
> On Tue, May 2, 2017 at 3:30 AM, Amit Langote wrote:
>> You're right. I agree that whatever text we add here should be pointing
>> out that statement-level triggers of affected child tables are not fired,
>> when root parent is specified in the command.
>>
>> Since there was least some talk of changing that behavior for regular
>> inheritance so that statement triggers of any affected children are fired
>> [1], I thought we shouldn't say something general that applies to both
>> inheritance and partitioning. But since nothing has happened in that
>> regard, we might as well.
>>
>> How about the attached?
>
> Looks better, but I think we should say "statement" instead of
> "operation" for consistency with the previous paragraph, and it
> certainly shouldn't be capitalized.

Agreed, done. Attached updated patch.

Thanks,
Amit

>From 1d7e383c6d89ebabacc7aa3f7d9987779daaa4fb Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH] Clarify statement trigger behavior with inheritance

---
 doc/src/sgml/trigger.sgml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 6f8416dda7..89e8c01a71 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -123,6 +123,14 @@
+A statement that targets the root table in a inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the root table's statement-level triggers are
+fired. However, row-level triggers of any affected child tables will be
+fired.
+
+
+
 If an INSERT contains an ON CONFLICT
 DO UPDATE clause, it is possible that the effects of
 all row-level BEFORE INSERT triggers
--
2.11.0
Re: [HACKERS] Declarative partitioning - another take
On Tue, May 2, 2017 at 3:30 AM, Amit Langote wrote:
> You're right. I agree that whatever text we add here should be pointing
> out that statement-level triggers of affected child tables are not fired,
> when root parent is specified in the command.
>
> Since there was least some talk of changing that behavior for regular
> inheritance so that statement triggers of any affected children are fired
> [1], I thought we shouldn't say something general that applies to both
> inheritance and partitioning. But since nothing has happened in that
> regard, we might as well.
>
> How about the attached?

Looks better, but I think we should say "statement" instead of
"operation" for consistency with the previous paragraph, and it
certainly shouldn't be capitalized.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2017/05/01 21:30, Robert Haas wrote:
> On Mon, May 1, 2017 at 12:18 AM, Amit Langote wrote:
>> Attached updated patch.
>
> Committed, except for this bit:

Thanks.

> +A statement-level trigger defined on partitioned tables is fired only
> +once for the table itself, not once for every table in the partitioning
> +hierarchy. However, row-level triggers of any affected leaf partitions
> +will be fired.
>
> The first sentence there has a number of issues. Grammatically, there
> is an agreement problem: trigger is singular, but partitioned table is
> plural, and one trigger isn't defined across multiple tables. It
> would have to say something like "A statement-level trigger defined on
> a partitioned table". But even with that correction, it's not really
> saying what you want it to say. Nobody would expect that the same
> statement-level trigger would be fired multiple times. The issue is
> whether statement-level triggers on the partitions themselves will be
> fired, not the statement-level trigger on the partitioned table.
> Also, if this applies to inheritance as well as partitioning, then why
> mention only partitioning? I think we might want to document
> something more here, but not like this.

You're right. I agree that whatever text we add here should be pointing
out that statement-level triggers of affected child tables are not fired,
when root parent is specified in the command.

Since there was least some talk of changing that behavior for regular
inheritance so that statement triggers of any affected children are fired
[1], I thought we shouldn't say something general that applies to both
inheritance and partitioning. But since nothing has happened in that
regard, we might as well.

How about the attached?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/cd282adde5b70b20c57f53bb9ab75...@biglumber.com

>From d2d27ca458fbd341efd5ae231f977385a5e3c951 Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH] Clarify statement trigger behavior with inheritance

---
 doc/src/sgml/trigger.sgml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 6f8416dda7..ef41085744 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -123,6 +123,14 @@
+An Operation that targets the root table in a inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the root table's statement-level triggers are
+fired. However, row-level triggers of any affected child tables will be
+fired.
+
+
+
 If an INSERT contains an ON CONFLICT
 DO UPDATE clause, it is possible that the effects of
 all row-level BEFORE INSERT triggers
--
2.11.0
Re: [HACKERS] Declarative partitioning - another take
On Mon, May 1, 2017 at 12:18 AM, Amit Langote wrote:
> Attached updated patch.

Committed, except for this bit:

+A statement-level trigger defined on partitioned tables is fired only
+once for the table itself, not once for every table in the partitioning
+hierarchy. However, row-level triggers of any affected leaf partitions
+will be fired.

The first sentence there has a number of issues. Grammatically, there
is an agreement problem: trigger is singular, but partitioned table is
plural, and one trigger isn't defined across multiple tables. It
would have to say something like "A statement-level trigger defined on
a partitioned table". But even with that correction, it's not really
saying what you want it to say. Nobody would expect that the same
statement-level trigger would be fired multiple times. The issue is
whether statement-level triggers on the partitions themselves will be
fired, not the statement-level trigger on the partitioned table.
Also, if this applies to inheritance as well as partitioning, then why
mention only partitioning? I think we might want to document
something more here, but not like this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2017/04/29 6:56, David Fetter wrote:
> On Fri, Apr 28, 2017 at 06:29:48PM +0900, Amit Langote wrote:
>> On 2017/04/28 7:36, David Fetter wrote:
>>> Did I notice correctly that there's no way to handle transition tables
>>> for partitions in AFTER STATEMENT triggers?
>>
>> Did you mean to ask about AFTER STATEMENT triggers defined on
>> "partitioned" tables?  Specifying transition table for them is disallowed
>> at all.
>>
>> ERROR:  "p" is a partitioned table
>> DETAIL:  Triggers on partitioned tables cannot have transition tables.
>
> OK, I suppose.  It wasn't clear from the documentation.
>
>> Triggers created on (leaf) partitions *do* allow specifying transition table.
>
> That includes the upcoming "default" tables, I presume.

If a "default" table is also a "leaf" table, then yes.  A default
table/partition can also be itself a partitioned table, in which case, it
won't allow triggers that require transition tables.  AFAICS, it's the
table's being partitioned that stops it from supporting transition tables,
not whether it is a "default" partition or not.

>> Or are you asking something else altogether?
>
> I was just fuzzy on the interactions among these features.
>
>>> If not, I'm not suggesting that this be added at this late date, but
>>> we might want to document that.
>>
>> I don't see mentioned in the documentation that such triggers cannot be
>> defined on partitioned tables.  Is that what you are saying should be
>> documented?
>
> Yes, but I bias toward documenting a lot, and this restriction could
> go away in some future version, which would make things more confusing
> in the long run.

Yeah, it would be a good idea to document this.

> I'm picturing a conversation in 2020 that goes
> something like this:
>
> "On 10, you could have AFTER STATEMENT triggers on tables, foreigh
> tables, and leaf partition tables which referenced transition tables,
> but not on DEFAULT partitions.  On 11, you could on DEFAULT partition
> tables.  From 12 onward, you can have transition tables on any
> relation."

What we could document now is that partitioned tables don't allow
specifying triggers that reference transition tables.  Although, I am
wondering where this note really belongs - the partitioning chapter, the
triggers chapter or the CREATE TRIGGER reference page?  Maybe, Kevin and
Thomas have something to say about that.

If it turns out that the partitioning chapter is a good place, here is a
documentation patch.

Thanks,
Amit

>From 289b4d906abb50451392d0efe13926f710952ca0 Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 1 May 2017 13:46:58 +0900
Subject: [PATCH] Document transition table trigger limitation of partitioned
 tables

---
 doc/src/sgml/ddl.sgml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 84c4f20990..5f5a956d41 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3293,7 +3293,9 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 Row triggers, if necessary, must be defined on individual partitions,
-not the partitioned table.
+not the partitioned tables.  Also, triggers that require accessing
+transition tables are not allowed to be defined on the partitioned
+tables.
--
2.11.0
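
For reference, a sketch of the v10 limitation the ddl.sgml text above
describes; the names loosely echo the documentation's measurement example
and are otherwise made up. A row-level trigger has to be created on each
leaf partition, because creating one on the partitioned table itself is
rejected in this release.

CREATE TABLE measurement (logdate date, value int) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2008m02 PARTITION OF measurement
    FOR VALUES FROM ('2008-02-01') TO ('2008-03-01');

CREATE FUNCTION audit_row() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE 'row inserted into %', TG_TABLE_NAME;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Rejected in v10: row-level triggers cannot be defined on the partitioned table.
CREATE TRIGGER measurement_ins AFTER INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE audit_row();

-- Works: the row-level trigger goes on the leaf partition instead.
CREATE TRIGGER measurement_y2008m02_ins AFTER INSERT ON measurement_y2008m02
    FOR EACH ROW EXECUTE PROCEDURE audit_row();
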
Re: [HACKERS] Declarative partitioning - another take
Thanks for the review.

On 2017/04/29 1:25, Robert Haas wrote:
> On Fri, Apr 28, 2017 at 2:13 AM, Amit Langote wrote:
>>>> By the way, code changes I made in the attached are such that a subsequent
>>>> patch could implement firing statement-level triggers of all the tables in
>>>> a partition hierarchy, which it seems we don't want to do.  Should then
>>>> the code be changed to not create ResultRelInfos of all the tables but
>>>> only the root table (the one mentioned in the command)?  You will see that
>>>> the patch adds fields named es_nonleaf_result_relations and
>>>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>>>> would perhaps do, for example.
>>>
>>> It seems better not to create any ResultRelInfos that we don't
>>> actually need, so +1 for such a revision to the patch.
>>
>> OK, done.  It took a bit more work than I thought.
>
> So, this seems weird, because rootResultRelIndex is initialized even
> when splan->partitioned_rels == NIL, but isn't actually valid in that
> case.  ExecInitModifyTable seems to think it's always valid, though.

OK, rootResultRelIndex is now set to a value >= 0 only when the node
modifies a partitioned table.  And then ExecInitModifyTable() checks if
the index is valid before initializing the root table info.

> I think the way that you've refactored fireBSTriggers and
> fireASTriggers is a bit confusing.  Instead of splitting out a
> separate function, how about just having the existing function begin
> with if (node->rootResultRelInfo) resultRelInfo =
> node->rootResultRelInfo; else resultRelInfo = node->resultRelInfo; ?
> I think the way you've coded it is a holdover from the earlier design
> where you were going to call it multiple times, but now that's not
> needed.

OK, done that way.

> Looks OK, otherwise.

Attached updated patch.

Thanks,
Amit

>From 53c964f9f1fd3e27824080b0e9e5b672fa53cdf0 Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH] Fire per-statement triggers of partitioned tables

Current implementation completely misses them, because it assumes that no
ResultRelInfos need to be created for partitioned tables, whereas they
*are* needed to fire the per-statment triggers, which we *do* allow to be
defined on the partitioned tables.

Reported By: Rajkumar Raghuwanshi
Patch by: Amit Langote
Report: https://www.postgresql.org/message-id/CAKcux6%3DwYospCRY2J4XEFuVy0L41S%3Dfic7rmkbsU-GXhhSbmBg%40mail.gmail.com
---
 doc/src/sgml/trigger.sgml               | 26 +++
 src/backend/executor/execMain.c         | 55 --
 src/backend/executor/nodeModifyTable.c  | 42 +
 src/backend/nodes/copyfuncs.c           |  2 +
 src/backend/nodes/outfuncs.c            |  3 ++
 src/backend/nodes/readfuncs.c           |  2 +
 src/backend/optimizer/plan/createplan.c |  1 +
 src/backend/optimizer/plan/planner.c    |  3 ++
 src/backend/optimizer/plan/setrefs.c    | 19 ++--
 src/include/nodes/execnodes.h           | 12 +
 src/include/nodes/plannodes.h           | 13 +-
 src/include/nodes/relation.h            |  1 +
 src/test/regress/expected/triggers.out  | 81 +
 src/test/regress/sql/triggers.sql       | 70
 14 files changed, 303 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 2a718d7f47..5771bd5fdc 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -33,7 +33,8 @@
 A trigger is a specification that the database should automatically
 execute a particular function whenever a certain type of operation is
-performed. Triggers can be attached to tables, views, and foreign tables.
+performed. Triggers can be attached to tables (partitioned or not),
+views, and foreign tables.
@@ -111,14 +112,21 @@
 Statement-level BEFORE triggers naturally fire before the
 statement starts to do anything, while statement-level AFTER
 triggers fire at the very end of the statement. These types of
-triggers may be defined on tables or views. Row-level BEFORE
-triggers fire immediately before a particular row is operated on,
-while row-level AFTER triggers fire at the end of the
-statement (but before any statement-level AFTER triggers).
-These types of triggers may only be defined on tables and foreign tables.
-Row-level INSTEAD OF triggers may only be defined on views,
-and fire immediately as each row in the view is identified as needing to
-be operated on.
+triggers may be defined on tables, views, or foreign tables. Row-level
+BEFORE triggers fire immediately before a particular row is
+operated on, while row-level AFTER triggers fire at the end of
+the statement (but before any statement-level AFTER triggers).
+These types of triggers may only be defined on non-partitioned tables and
+foreign tables. Row-level INSTEAD OF triggers may o
Re: [HACKERS] Declarative partitioning - another take
On Mon, Apr 24, 2017 at 07:43:29PM +0900, Amit Langote wrote:
> On 2017/04/21 17:00, Rajkumar Raghuwanshi wrote:
> > I am able to create statement triggers at root partition, but these
> > triggers, not getting fired on updating partition.
>
> It would be great if you could check if the patches fix the issues.
>
> Also, adding this to the PostgreSQL 10 open items list.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this
open item.  If some other commit is more relevant or if this does not
belong as a v10 open item, please let us know.  Otherwise, please observe
the policy on open item ownership[1] and send a status update within three
calendar days of this message.  Include a date for your subsequent status
update.  Testers may discover new open items at any time, and I want to
plan to get them all fixed well in advance of shipping v10.  Consequently,
I will appreciate your efforts toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
Re: [HACKERS] Declarative partitioning - another take
On Fri, Apr 28, 2017 at 06:29:48PM +0900, Amit Langote wrote:
> On 2017/04/28 7:36, David Fetter wrote:
> > On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
> >> On 2017/04/27 1:52, Robert Haas wrote:
> >>> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote wrote:
> >>>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> >>>> In that case, applying only the patch 0001 will do.
> >>>
> >>> Do we need to update the documentation?
> >>
> >> Yes, I think we should.  How about as in the attached?
> >>
> >> By the way, code changes I made in the attached are such that a subsequent
> >> patch could implement firing statement-level triggers of all the tables in
> >> a partition hierarchy, which it seems we don't want to do.  Should then
> >> the code be changed to not create ResultRelInfos of all the tables but
> >> only the root table (the one mentioned in the command)?  You will see that
> >> the patch adds fields named es_nonleaf_result_relations and
> >> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> >> would perhaps do, for example.
> >
> > Did I notice correctly that there's no way to handle transition tables
> > for partitions in AFTER STATEMENT triggers?
>
> Did you mean to ask about AFTER STATEMENT triggers defined on
> "partitioned" tables?  Specifying transition table for them is disallowed
> at all.
>
> ERROR:  "p" is a partitioned table
> DETAIL:  Triggers on partitioned tables cannot have transition tables.

OK, I suppose.  It wasn't clear from the documentation.

> Triggers created on (leaf) partitions *do* allow specifying transition table.

That includes the upcoming "default" tables, I presume.

> Or are you asking something else altogether?

I was just fuzzy on the interactions among these features.

> > If not, I'm not suggesting that this be added at this late date, but
> > we might want to document that.
>
> I don't see mentioned in the documentation that such triggers cannot be
> defined on partitioned tables.  Is that what you are saying should be
> documented?

Yes, but I bias toward documenting a lot, and this restriction could
go away in some future version, which would make things more confusing
in the long run.  I'm picturing a conversation in 2020 that goes
something like this:

"On 10, you could have AFTER STATEMENT triggers on tables, foreigh
tables, and leaf partition tables which referenced transition tables,
but not on DEFAULT partitions.  On 11, you could on DEFAULT partition
tables.  From 12 onward, you can have transition tables on any
relation."

Kevin?  Thomas?

Best,
David.

--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Declarative partitioning - another take
On Fri, Apr 28, 2017 at 2:13 AM, Amit Langote wrote:
> It seems to me that there is no difference in behavior between
> inheritance-based and declarative partitioning as far as statement-level
> triggers are concerned (at least currently).  In both the cases, we fire
> statement-level triggers only for the table specified in the command.

OK.

>>> By the way, code changes I made in the attached are such that a subsequent
>>> patch could implement firing statement-level triggers of all the tables in
>>> a partition hierarchy, which it seems we don't want to do.  Should then
>>> the code be changed to not create ResultRelInfos of all the tables but
>>> only the root table (the one mentioned in the command)?  You will see that
>>> the patch adds fields named es_nonleaf_result_relations and
>>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>>> would perhaps do, for example.
>>
>> It seems better not to create any ResultRelInfos that we don't
>> actually need, so +1 for such a revision to the patch.
>
> OK, done.  It took a bit more work than I thought.

So, this seems weird, because rootResultRelIndex is initialized even
when splan->partitioned_rels == NIL, but isn't actually valid in that
case.  ExecInitModifyTable seems to think it's always valid, though.

I think the way that you've refactored fireBSTriggers and
fireASTriggers is a bit confusing.  Instead of splitting out a
separate function, how about just having the existing function begin
with if (node->rootResultRelInfo) resultRelInfo =
node->rootResultRelInfo; else resultRelInfo = node->resultRelInfo; ?
I think the way you've coded it is a holdover from the earlier design
where you were going to call it multiple times, but now that's not
needed.

Looks OK, otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2017/04/28 7:36, David Fetter wrote:
> On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
>> On 2017/04/27 1:52, Robert Haas wrote:
>>> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote wrote:
>>>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
>>>> In that case, applying only the patch 0001 will do.
>>>
>>> Do we need to update the documentation?
>>
>> Yes, I think we should.  How about as in the attached?
>>
>> By the way, code changes I made in the attached are such that a subsequent
>> patch could implement firing statement-level triggers of all the tables in
>> a partition hierarchy, which it seems we don't want to do.  Should then
>> the code be changed to not create ResultRelInfos of all the tables but
>> only the root table (the one mentioned in the command)?  You will see that
>> the patch adds fields named es_nonleaf_result_relations and
>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>> would perhaps do, for example.
>
> Did I notice correctly that there's no way to handle transition tables
> for partitions in AFTER STATEMENT triggers?

Did you mean to ask about AFTER STATEMENT triggers defined on
"partitioned" tables?  Specifying transition table for them is disallowed
at all.

ERROR:  "p" is a partitioned table
DETAIL:  Triggers on partitioned tables cannot have transition tables.

Triggers created on (leaf) partitions *do* allow specifying transition table.

Or are you asking something else altogether?

> If not, I'm not suggesting that this be added at this late date, but
> we might want to document that.

I don't see mentioned in the documentation that such triggers cannot be
defined on partitioned tables.  Is that what you are saying should be
documented?

Thanks,
Amit
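
A minimal sketch of the restriction described above, with made-up object
names (the transition-table syntax shown is the v10 REFERENCING clause):
the same trigger definition is rejected on the partitioned table but
accepted on a leaf partition.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);

CREATE FUNCTION count_new_rows() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE 'inserted % row(s)', (SELECT count(*) FROM new_rows);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Rejected: a partitioned table cannot have triggers with transition tables.
CREATE TRIGGER p_after_ins AFTER INSERT ON p
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE count_new_rows();

-- Accepted: the equivalent trigger on the leaf partition.
CREATE TRIGGER p1_after_ins AFTER INSERT ON p1
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE count_new_rows();
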
Re: [HACKERS] Declarative partitioning - another take
Hi Rajkumar,

On 2017/04/28 17:11, Rajkumar Raghuwanshi wrote:
> On Fri, Apr 28, 2017 at 11:43 AM, Amit Langote <
>> Updated patch attached.
>
> I have applied given patch, could see below behaviour with statement
> trigger.
>
> When trying to delete value within partition range, triggers got fired
> (with zero row affected) as expected, but if trying to delete value which
> is outside of partition range (with zero row affected), No trigger fired.
> is this expected??

Good catch.  I'm afraid though that this is not a defect of this patch,
but some unrelated (maybe) bug, which affects not only the partitioned
tables but inheritance in general.

Problem is that the plan created is such that the executor has no
opportunity to fire the trigger in question, because the plan contains no
information about which table is affected by the statement.  You can see
that with inheritance.  See below:

create table foo ();
create table bar () inherits (foo);

create or replace function trig_notice() returns trigger as $$
  begin raise notice 'trigger fired'; return null; end;
$$ language plpgsql;

create trigger foo_del_before before delete on foo
  for each statement execute procedure trig_notice();

explain delete from foo where false;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=0)
   One-Time Filter: false
(2 rows)

-- no trigger fired
delete from foo where false;
DELETE 0

Trigger *is* fired though, if inheritance is not used.

explain delete from only foo where false;
                    QUERY PLAN
--------------------------------------------------
 Delete on foo  (cost=0.00..0.00 rows=0 width=0)
   ->  Result  (cost=0.00..0.00 rows=0 width=6)
         One-Time Filter: false
(3 rows)

delete from only foo where false;
NOTICE:  trigger fired
DELETE 0

I'm not sure how to go about fixing this, if at all.  Or to fix it at
least for partitioned tables.  Would like to hear what others think.

Thanks,
Amit
Re: [HACKERS] Declarative partitioning - another take
On Fri, Apr 28, 2017 at 11:43 AM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> Updated patch attached.

Hi Amit,

I have applied given patch, could see below behaviour with statement trigger.

When trying to delete value within partition range, triggers got fired
(with zero row affected) as expected, but if trying to delete value which
is outside of partition range (with zero row affected), No trigger fired.
is this expected??

Below are steps to reproduce.

CREATE TABLE trigger_test_table (a INT, b INT) PARTITION BY RANGE(a);
CREATE TABLE trigger_test_table1 PARTITION OF trigger_test_table FOR VALUES FROM (0) to (6);
INSERT INTO trigger_test_table (a,b) SELECT i,i FROM generate_series(1,3)i;

CREATE TABLE trigger_test_statatics(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL varchar,TG_WHEN varchar, TG_OP varchar);

CREATE FUNCTION trigger_test_procedure() RETURNS TRIGGER AS $ttp$
BEGIN
  INSERT INTO trigger_test_statatics SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN,TG_OP;
  RETURN OLD;
END;
$ttp$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_test11 AFTER DELETE ON trigger_test_table FOR EACH STATEMENT EXECUTE PROCEDURE trigger_test_procedure();
CREATE TRIGGER trigger_test12 AFTER DELETE ON trigger_test_table1 FOR EACH STATEMENT EXECUTE PROCEDURE trigger_test_procedure();

postgres=# DELETE FROM trigger_test_table WHERE a = 5;
DELETE 0
postgres=# SELECT * FROM trigger_test_statatics;
    tg_name     |   tg_table_name    | tg_level  | tg_when | tg_op
----------------+--------------------+-----------+---------+--------
 trigger_test11 | trigger_test_table | STATEMENT | AFTER   | DELETE
(1 row)

TRUNCATE TABLE trigger_test_statatics;

--statement trigger NOT fired, when trying to delete data outside partition range.
postgres=# DELETE FROM trigger_test_table WHERE a = 10;
DELETE 0
postgres=# SELECT * FROM trigger_test_statatics;
 tg_name | tg_table_name | tg_level | tg_when | tg_op
---------+---------------+----------+---------+-------
(0 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Re: [HACKERS] Declarative partitioning - another take
Thanks for taking a look.

On 2017/04/28 5:24, Robert Haas wrote:
> On Wed, Apr 26, 2017 at 9:30 PM, Amit Langote wrote:
>>> Do we need to update the documentation?
>>
>> Yes, I think we should.  How about as in the attached?
>
> Looks reasonable, but I was thinking you might also update the section
> which contrasts inheritance-based partitioning with declarative
> partitioning.

It seems to me that there is no difference in behavior between
inheritance-based and declarative partitioning as far as statement-level
triggers are concerned (at least currently).  In both the cases, we fire
statement-level triggers only for the table specified in the command.

Maybe, we will fix things someday so that statement-level triggers will be
fired for all the tables in a inheritance hierarchy when the root parent
is updated or deleted, but that's currently not true.  We may never
implement that behavior for declarative partitioned tables though, so
there will be a difference if and when we implement the former.  Am I
missing something?

>> By the way, code changes I made in the attached are such that a subsequent
>> patch could implement firing statement-level triggers of all the tables in
>> a partition hierarchy, which it seems we don't want to do.  Should then
>> the code be changed to not create ResultRelInfos of all the tables but
>> only the root table (the one mentioned in the command)?  You will see that
>> the patch adds fields named es_nonleaf_result_relations and
>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>> would perhaps do, for example.
>
> It seems better not to create any ResultRelInfos that we don't
> actually need, so +1 for such a revision to the patch.

OK, done.  It took a bit more work than I thought.

Updated patch attached.

Thanks,
Amit

>From a8b584f09bc02b6c16c33e816fc12c7e443dd65c Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH] Fire per-statement triggers of partitioned tables

Current implementation completely misses them, because it assumes that no
ResultRelInfos need to be created for partitioned tables, whereas they
*are* needed to fire the per-statment triggers, which we *do* allow to be
defined on the partitioned tables.

Reported By: Rajkumar Raghuwanshi
Patch by: Amit Langote
Report: https://www.postgresql.org/message-id/CAKcux6%3DwYospCRY2J4XEFuVy0L41S%3Dfic7rmkbsU-GXhhSbmBg%40mail.gmail.com
---
 doc/src/sgml/trigger.sgml               | 26 +++
 src/backend/executor/execMain.c         | 55 --
 src/backend/executor/nodeModifyTable.c  | 64 ++
 src/backend/nodes/copyfuncs.c           |  2 +
 src/backend/nodes/outfuncs.c            |  3 ++
 src/backend/nodes/readfuncs.c           |  2 +
 src/backend/optimizer/plan/createplan.c |  1 +
 src/backend/optimizer/plan/planner.c    |  3 ++
 src/backend/optimizer/plan/setrefs.c    | 19 ++--
 src/include/nodes/execnodes.h           | 12 +
 src/include/nodes/plannodes.h           | 13 +-
 src/include/nodes/relation.h            |  1 +
 src/test/regress/expected/triggers.out  | 81 +
 src/test/regress/sql/triggers.sql       | 70
 14 files changed, 323 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 2a718d7f47..5771bd5fdc 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -33,7 +33,8 @@
 A trigger is a specification that the database should automatically
 execute a particular function whenever a certain type of operation is
-performed. Triggers can be attached to tables, views, and foreign tables.
+performed. Triggers can be attached to tables (partitioned or not),
+views, and foreign tables.
@@ -111,14 +112,21 @@
 Statement-level BEFORE triggers naturally fire before the
 statement starts to do anything, while statement-level AFTER
 triggers fire at the very end of the statement. These types of
-triggers may be defined on tables or views. Row-level BEFORE
-triggers fire immediately before a particular row is operated on,
-while row-level AFTER triggers fire at the end of the
-statement (but before any statement-level AFTER triggers).
-These types of triggers may only be defined on tables and foreign tables.
-Row-level INSTEAD OF triggers may only be defined on views,
-and fire immediately as each row in the view is identified as needing to
-be operated on.
+triggers may be defined on tables, views, or foreign tables. Row-level
+BEFORE triggers fire immediately before a particular row is
+operated on, while row-level AFTER triggers fire at the end of
+the statement (but before any statement-level AFTER triggers).
+These types of triggers may only be defined on non-partitioned tables and
+foreign tables. Row-level INSTEAD OF triggers may only be
+defined on views,
Re: [HACKERS] Declarative partitioning - another take
On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
> On 2017/04/27 1:52, Robert Haas wrote:
> > On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote wrote:
> >> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> >> In that case, applying only the patch 0001 will do.
> >
> > Do we need to update the documentation?
>
> Yes, I think we should.  How about as in the attached?
>
> By the way, code changes I made in the attached are such that a subsequent
> patch could implement firing statement-level triggers of all the tables in
> a partition hierarchy, which it seems we don't want to do.  Should then
> the code be changed to not create ResultRelInfos of all the tables but
> only the root table (the one mentioned in the command)?  You will see that
> the patch adds fields named es_nonleaf_result_relations and
> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> would perhaps do, for example.

Did I notice correctly that there's no way to handle transition tables
for partitions in AFTER STATEMENT triggers?

If not, I'm not suggesting that this be added at this late date, but
we might want to document that.

Best,
David.

--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Declarative partitioning - another take
On Wed, Apr 26, 2017 at 9:30 PM, Amit Langote wrote:
>> Do we need to update the documentation?
>
> Yes, I think we should.  How about as in the attached?

Looks reasonable, but I was thinking you might also update the section
which contrasts inheritance-based partitioning with declarative
partitioning.

> By the way, code changes I made in the attached are such that a subsequent
> patch could implement firing statement-level triggers of all the tables in
> a partition hierarchy, which it seems we don't want to do.  Should then
> the code be changed to not create ResultRelInfos of all the tables but
> only the root table (the one mentioned in the command)?  You will see that
> the patch adds fields named es_nonleaf_result_relations and
> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> would perhaps do, for example.

It seems better not to create any ResultRelInfos that we don't
actually need, so +1 for such a revision to the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 2017/04/27 1:52, Robert Haas wrote:
> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote wrote:
>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
>> In that case, applying only the patch 0001 will do.
>
> Do we need to update the documentation?

Yes, I think we should.  How about as in the attached?

By the way, code changes I made in the attached are such that a subsequent
patch could implement firing statement-level triggers of all the tables in
a partition hierarchy, which it seems we don't want to do.  Should then
the code be changed to not create ResultRelInfos of all the tables but
only the root table (the one mentioned in the command)?  You will see that
the patch adds fields named es_nonleaf_result_relations and
es_num_nonleaf_result_relations, whereas just es_root_result_relation
would perhaps do, for example.

Thanks,
Amit

>From c8b785a77a2a193906967762066e03e09de80442 Mon Sep 17 00:00:00 2001
From: amit
Date: Mon, 24 Apr 2017 14:55:08 +0900
Subject: [PATCH 1/2] Fire per-statement triggers of partitioned tables

Current implementation completely misses them, because it assumes that no
ResultRelInfos need to be created for partitioned tables, whereas they
*are* needed to fire the per-statment triggers, which we *do* allow to be
defined on the partitioned tables.

Reported By: Rajkumar Raghuwanshi
Patch by: Amit Langote
Report: https://www.postgresql.org/message-id/CAKcux6%3DwYospCRY2J4XEFuVy0L41S%3Dfic7rmkbsU-GXhhSbmBg%40mail.gmail.com
---
 doc/src/sgml/trigger.sgml               | 26 -
 src/backend/executor/execMain.c         | 33 -
 src/backend/executor/nodeModifyTable.c  | 66 -
 src/backend/nodes/copyfuncs.c           |  1 +
 src/backend/nodes/outfuncs.c            |  1 +
 src/backend/nodes/readfuncs.c           |  1 +
 src/backend/optimizer/plan/createplan.c |  1 +
 src/backend/optimizer/plan/setrefs.c    |  4 ++
 src/include/nodes/execnodes.h           | 11 ++
 src/include/nodes/plannodes.h           |  2 +
 src/test/regress/expected/triggers.out  | 54 +++
 src/test/regress/sql/triggers.sql       | 48
 12 files changed, 227 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 2a718d7f47..5771bd5fdc 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -33,7 +33,8 @@
 A trigger is a specification that the database should automatically
 execute a particular function whenever a certain type of operation is
-performed. Triggers can be attached to tables, views, and foreign tables.
+performed. Triggers can be attached to tables (partitioned or not),
+views, and foreign tables.
@@ -111,14 +112,21 @@
 Statement-level BEFORE triggers naturally fire before the
 statement starts to do anything, while statement-level AFTER
 triggers fire at the very end of the statement. These types of
-triggers may be defined on tables or views. Row-level BEFORE
-triggers fire immediately before a particular row is operated on,
-while row-level AFTER triggers fire at the end of the
-statement (but before any statement-level AFTER triggers).
-These types of triggers may only be defined on tables and foreign tables.
-Row-level INSTEAD OF triggers may only be defined on views,
-and fire immediately as each row in the view is identified as needing to
-be operated on.
+triggers may be defined on tables, views, or foreign tables. Row-level
+BEFORE triggers fire immediately before a particular row is
+operated on, while row-level AFTER triggers fire at the end of
+the statement (but before any statement-level AFTER triggers).
+These types of triggers may only be defined on non-partitioned tables and
+foreign tables. Row-level INSTEAD OF triggers may only be
+defined on views, and fire immediately as each row in the view is
+identified as needing to be operated on.
+
+
+
+A statement-level trigger defined on partitioned tables is fired only
+once for the table itself, not once for every table in the partitioning
+hierarchy. However, row-level triggers of any affected leaf partitions
+will be fired.

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 5c12fb457d..586b396b3e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -861,18 +861,35 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * In the partitioned result relation case, lock the non-leaf result
-	 * relations too.  We don't however need ResultRelInfos for them.
+	 * relations too.  We also need ResultRelInfos so that per-statement
+	 * triggers defined on these relations can be fired.
 	 */
 	if (plannedstmt->nonleafResultRelations)
 	{
+		numResultRelations =
+			list_length(plannedstmt->nonleafResultRelations);
+
+		res
Re: [HACKERS] Declarative partitioning - another take
On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote wrote:
> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> In that case, applying only the patch 0001 will do.

Do we need to update the documentation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On 26 April 2017 at 00:28, Robert Haas wrote:
> So what I'd prefer -- on
> the totally unprincipled basis that it would let us improve
> performance in the future -- if you operate on a partition directly,
> you fire the partition's triggers, but if you operate on the parent,
> only the parent's triggers fire.

I would also opt for this behaviour.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Declarative partitioning - another take
Hi,

Thanks for testing.

On 2017/04/25 19:03, Rajkumar Raghuwanshi wrote:
> Thanks for looking into it.  I have applied fixes and checked for triggers.
> I could see difference in behaviour of statement triggers for INSERT and
> UPDATE, for insert only root partition triggers are getting fired but for
> update root as well as child partition table triggers both getting fired.
> is this expected??

Yes, because I didn't implement anything for the insert case yet.

I posed a question whether to fire partitions' per-statement triggers when
inserting data through the root table.  Robert replied [1] that it would
be desirable to not fire partitions' per-statement triggers if the root
table is mentioned in the query; only fire their per-row triggers if any.
It already works that way for inserts, and applying only 0001 will get you
the same for update/delete.

Patch 0002 is to enable firing partition's per-statement triggers even if
the root table is mentioned in the query, but it implemented the same only
for the update/delete cases.  If we decide that that's the right thing to
do, then I will implement the same behavior for the insert case too.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoatBYy8Hyi3cYR1rFrCkD2NM4ZLZcck4QDGvH%3DHddfDwA%40mail.gmail.com
Re: [HACKERS] Declarative partitioning - another take
On 2017/04/26 3:58, Robert Haas wrote:
> On Mon, Apr 24, 2017 at 6:43 AM, Amit Langote wrote:
>> The reason it doesn't work is that we do not allocate ResultRelInfos for
>> partitioned tables (not even for the root partitioned table in the
>> update/delete cases), because the current implementation assumes they are
>> not required.  That's fine only so long as we consider that no rows are
>> inserted into them, no indexes need to be updated, and that no row-level
>> triggers are to be fired.  But it misses the fact that we do allow
>> statement-level triggers on partitioned tables.  So, we should fix things
>> such that ResultRelInfos are allocated so that any statement-level
>> triggers are fired.  But there are following questions to consider:
>>
>> 1. Should we consider only the root partitioned table or all partitioned
>>    tables in a given hierarchy, including the child partitioned tables?
>>    Note that this is related to a current limitation of updating/deleting
>>    inheritance hierarchies that we do not currently fire per-statement
>>    triggers of the child tables.  See the TODO item in wiki:
>>    https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When
>>    statement-level triggers are defined on a parent table, have them fire
>>    only on the parent table, and fire child table triggers only where
>>    appropriate"
>>
>> 2. Do we apply the same to inserts on the partitioned tables?  Since
>>    insert on a partitioned table implicitly affects its partitions, one
>>    might say that we would need to fire per-statement insert triggers of
>>    all the partitions.
>
> It seems to me that it doesn't make a lot of sense to fire the
> triggers for some tables involved in the hierarchy and not others.  I
> suppose the consistent thing to do here is to make sure that we fire
> the statement triggers for all tables in the partitioning hierarchy
> for all operations (insert, update, delete, etc.).
>
> TBH, I don't like that very much.  I'd rather fire the triggers only
> for the table actually named in the query and skip all the others,
> mostly because it seems like it would be faster and less likely to
> block future optimizations; eventually, I'd like to consider not even
> locking the children we're not touching, but that's not going to work
> if we have to check them all for triggers.  So what I'd prefer -- on
> the totally unprincipled basis that it would let us improve
> performance in the future -- if you operate on a partition directly,
> you fire the partition's triggers, but if you operate on the parent,
> only the parent's triggers fire.

FWIW, I too prefer the latter, that is, fire only the parent's triggers.
In that case, applying only the patch 0001 will do.

Thanks,
Amit
Re: [HACKERS] Declarative partitioning - another take
On Mon, Apr 24, 2017 at 6:43 AM, Amit Langote wrote:
> The reason it doesn't work is that we do not allocate ResultRelInfos for
> partitioned tables (not even for the root partitioned table in the
> update/delete cases), because the current implementation assumes they are
> not required.  That's fine only so long as we consider that no rows are
> inserted into them, no indexes need to be updated, and that no row-level
> triggers are to be fired.  But it misses the fact that we do allow
> statement-level triggers on partitioned tables.  So, we should fix things
> such that ResultRelInfos are allocated so that any statement-level
> triggers are fired.  But there are following questions to consider:
>
> 1. Should we consider only the root partitioned table or all partitioned
>    tables in a given hierarchy, including the child partitioned tables?
>    Note that this is related to a current limitation of updating/deleting
>    inheritance hierarchies that we do not currently fire per-statement
>    triggers of the child tables.  See the TODO item in wiki:
>    https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When
>    statement-level triggers are defined on a parent table, have them fire
>    only on the parent table, and fire child table triggers only where
>    appropriate"
>
> 2. Do we apply the same to inserts on the partitioned tables?  Since
>    insert on a partitioned table implicitly affects its partitions, one
>    might say that we would need to fire per-statement insert triggers of
>    all the partitions.

It seems to me that it doesn't make a lot of sense to fire the
triggers for some tables involved in the hierarchy and not others.  I
suppose the consistent thing to do here is to make sure that we fire
the statement triggers for all tables in the partitioning hierarchy
for all operations (insert, update, delete, etc.).

TBH, I don't like that very much.  I'd rather fire the triggers only
for the table actually named in the query and skip all the others,
mostly because it seems like it would be faster and less likely to
block future optimizations; eventually, I'd like to consider not even
locking the children we're not touching, but that's not going to work
if we have to check them all for triggers.  So what I'd prefer -- on
the totally unprincipled basis that it would let us improve
performance in the future -- if you operate on a partition directly,
you fire the partition's triggers, but if you operate on the parent,
only the parent's triggers fire.

How heretical is that idea?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
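
A sketch of the behavior argued for above (and ultimately adopted in the
thread), with table and trigger names made up for the example: with
statement-level triggers on both the parent and a partition, a statement
naming the parent fires only the parent's statement-level trigger, while a
statement naming the partition directly fires only that partition's.

CREATE TABLE parted (a int) PARTITION BY LIST (a);
CREATE TABLE parted_1 PARTITION OF parted FOR VALUES IN (1);

CREATE FUNCTION report_stmt() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE 'statement-level trigger fired on %', TG_TABLE_NAME;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parted_del AFTER DELETE ON parted
    FOR EACH STATEMENT EXECUTE PROCEDURE report_stmt();
CREATE TRIGGER parted_1_del AFTER DELETE ON parted_1
    FOR EACH STATEMENT EXECUTE PROCEDURE report_stmt();

DELETE FROM parted;     -- under the proposal: only parted's trigger fires
DELETE FROM parted_1;   -- only parted_1's trigger fires
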
Re: [HACKERS] Declarative partitioning - another take
On Mon, Apr 24, 2017 at 4:13 PM, Amit Langote wrote: > Hi Rajkumar, > > It would be great if you could check if the patches fix the issues. > Hi Amit, Thanks for looking into it. I have applied fixes and checked for triggers. I could see difference in behaviour of statement triggers for INSERT and UPDATE, for insert only root partition triggers are getting fired but for update root as well as child partition table triggers both getting fired. is this expected?? Below are steps to reproduce. CREATE TABLE pt (a INT, b INT) PARTITION BY RANGE(a); CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1) to (6); CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (6) to (11); INSERT INTO pt (a,b) SELECT i,i FROM generate_series(1,7)i; CREATE TABLE pt_trigger(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL varchar,TG_WHEN varchar,a_old int,a_new int,b_old int,b_new int); CREATE FUNCTION process_pt_trigger() RETURNS TRIGGER AS $ttp$ BEGIN IF (TG_OP = 'INSERT') THEN IF (TG_LEVEL = 'STATEMENT') THEN INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN,NULL,NULL,NULL,NULL; END IF; IF (TG_LEVEL = 'ROW') THEN INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN,NULL,NEW.a,NULL,NEW.b; END IF; RETURN NEW; END IF; IF (TG_OP = 'UPDATE') THEN IF (TG_LEVEL = 'STATEMENT') THEN INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN,NULL,NULL,NULL,NULL; END IF; IF (TG_LEVEL = 'ROW') THEN INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN,OLD.a,NEW.a,OLD.b,NEW.b; END IF; RETURN NEW; END IF; END; $ttp$ LANGUAGE plpgsql; CREATE TRIGGER trigger_test11 AFTER INSERT OR UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test12 AFTER INSERT OR UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test13 AFTER INSERT OR UPDATE ON pt2 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test21 BEFORE INSERT OR UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test22 BEFORE INSERT OR UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test23 BEFORE INSERT OR UPDATE ON pt2 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test32 AFTER INSERT OR UPDATE ON pt1 FOR EACH ROW EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test33 AFTER INSERT OR UPDATE ON pt2 FOR EACH ROW EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test42 BEFORE INSERT OR UPDATE ON pt1 FOR EACH ROW EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER trigger_test43 BEFORE INSERT OR UPDATE ON pt2 FOR EACH ROW EXECUTE PROCEDURE process_pt_trigger(); postgres=# INSERT INTO pt (a,b) VALUES (8,8); INSERT 0 1 postgres=# SELECT * FROM pt_trigger; tg_name | tg_table_name | tg_level | tg_when | a_old | a_new | b_old | b_new +---+---+-+---+---+---+--- trigger_test21 | pt| STATEMENT | BEFORE | | | | trigger_test43 | pt2 | ROW | BEFORE | | 8 | | 8 trigger_test33 | pt2 | ROW | AFTER | | 8 | | 8 trigger_test11 | pt| STATEMENT | AFTER | | | | (4 rows) postgres=# TRUNCATE TABLE pt_trigger; TRUNCATE TABLE postgres=# UPDATE pt SET a = 2 WHERE a = 1; UPDATE 1 postgres=# SELECT * FROM pt_trigger; tg_name | tg_table_name | tg_level | tg_when | a_old | a_new | b_old | b_new +---+---+-+---+---+---+--- trigger_test21 | pt| STATEMENT | BEFORE | | | | trigger_test22 | pt1 | STATEMENT | BEFORE | | | | trigger_test42 | pt1 | ROW | BEFORE | 1 | 2 | 1 | 1 
trigger_test32 | pt1 | ROW | AFTER | 1 | 2 | 1 | 1 trigger_test11 | pt| STATEMENT | AFTER | | | | trigger_test12 | pt1 | STATEMENT | AFTER | | | | (6 rows) Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB
Re: [HACKERS] Declarative partitioning - another take
Hi Rajkumar, Thanks for testing and the report. On 2017/04/21 17:00, Rajkumar Raghuwanshi wrote: > Hi, > > I have observed below with the statement triggers. > > I am able to create statement triggers at root partition, but these > triggers, not getting fired on updating partition. > > CREATE TABLE pt (a INT, b INT) PARTITION BY RANGE(a); > CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1) to (7); > CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (7) to (11); > INSERT INTO pt (a,b) SELECT i,i FROM generate_series(1,10)i; > CREATE TABLE pt_trigger(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL > varchar,TG_WHEN varchar); > CREATE OR REPLACE FUNCTION process_pt_trigger() RETURNS TRIGGER AS $pttg$ > BEGIN > IF (TG_OP = 'UPDATE') THEN > INSERT INTO pt_trigger SELECT > TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN; > RETURN NEW; > END IF; > RETURN NULL; > END; > $pttg$ LANGUAGE plpgsql; > CREATE TRIGGER pt_trigger_after_p0 AFTER UPDATE ON pt FOR EACH STATEMENT > EXECUTE PROCEDURE process_pt_trigger(); > CREATE TRIGGER pt_trigger_before_p0 BEFORE UPDATE ON pt FOR EACH STATEMENT > EXECUTE PROCEDURE process_pt_trigger(); > > postgres=# UPDATE pt SET a = 2 WHERE a = 1; > UPDATE 1 > postgres=# SELECT * FROM pt_trigger ORDER BY 1; > tg_name | tg_table_name | tg_level | tg_when > -+---+--+- > (0 rows) > > no statement level trigger fired in this case, is this expected behaviour?? I think we discussed this during the development and decided to allow per-statement triggers on partitioned tables [1], but it seems it doesn't quite work for update/delete as your example shows. You can however see that per-statement triggers of the root partitioned table *are* fired in the case of insert. The reason it doesn't work is that we do not allocate ResultRelInfos for partitioned tables (not even for the root partitioned table in the update/delete cases), because the current implementation assumes they are not required. That's fine only so long as we consider that no rows are inserted into them, no indexes need to be updated, and that no row-level triggers are to be fired. But it misses the fact that we do allow statement-level triggers on partitioned tables. So, we should fix things such that ResultRelInfos are allocated so that any statement-level triggers are fired. But there are following questions to consider: 1. Should we consider only the root partitioned table or all partitioned tables in a given hierarchy, including the child partitioned tables? Note that this is related to a current limitation of updating/deleting inheritance hierarchies that we do not currently fire per-statement triggers of the child tables. See the TODO item in wiki: https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When statement-level triggers are defined on a parent table, have them fire only on the parent table, and fire child table triggers only where appropriate" 2. Do we apply the same to inserts on the partitioned tables? Since insert on a partitioned table implicitly affects its partitions, one might say that we would need to fire per-statement insert triggers of all the partitions. Meanwhile, attached is a set of patches to fix this. 
Descriptions follow: 0001: fire per-statement triggers of the partitioned table mentioned in a given update or delete statement 0002: fire per-statement triggers of the child tables during update/delete of inheritance hierarchies (including the partitioned table case) Depending on the answer to 2 above, we can arrange for the per-statement triggers of all the partitions to be fired upon insert into the partitioned table. > but if i am creating triggers on leaf partition, trigger is getting fired. > > CREATE TRIGGER pt_trigger_after_p1 AFTER UPDATE ON pt1 FOR EACH STATEMENT > EXECUTE PROCEDURE process_pt_trigger(); > CREATE TRIGGER pt_trigger_before_p1 BEFORE UPDATE ON pt1 FOR EACH STATEMENT > EXECUTE PROCEDURE process_pt_trigger(); > > postgres=# UPDATE pt SET a = 5 WHERE a = 4; > UPDATE 1 > postgres=# SELECT * FROM pt_trigger ORDER BY 1; >tg_name| tg_table_name | tg_level | tg_when > --+---+---+- > pt_trigger_after_p1 | pt1 | STATEMENT | AFTER > pt_trigger_before_p1 | pt1 | STATEMENT | BEFORE > (2 rows) Actually, this works only by accident (with the current implementation, the *first* partition's triggers will get fired). If you create another partition, its per-statement triggers won't get fired. The patches described above fixes that. It would be great if you could check if the patches fix the issues. Also, adding this to the PostgreSQL 10 open items list. Thanks, Amit [1] https://www.postgresql.org/message-id/8e05817e-14c8-13e4-502b-e440adb24258%40lab.ntt.co.jp >From 9c7543615ccb05c004591a9176f818413df7ea9d Mon Sep 17 00:00:00 2001 From: amit Da
Re: [HACKERS] Declarative partitioning - another take
Hi, I have observed below with the statement triggers. I am able to create statement triggers at root partition, but these triggers, not getting fired on updating partition. CREATE TABLE pt (a INT, b INT) PARTITION BY RANGE(a); CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1) to (7); CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (7) to (11); INSERT INTO pt (a,b) SELECT i,i FROM generate_series(1,10)i; CREATE TABLE pt_trigger(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL varchar,TG_WHEN varchar); CREATE OR REPLACE FUNCTION process_pt_trigger() RETURNS TRIGGER AS $pttg$ BEGIN IF (TG_OP = 'UPDATE') THEN INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN; RETURN NEW; END IF; RETURN NULL; END; $pttg$ LANGUAGE plpgsql; CREATE TRIGGER pt_trigger_after_p0 AFTER UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER pt_trigger_before_p0 BEFORE UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); postgres=# UPDATE pt SET a = 2 WHERE a = 1; UPDATE 1 postgres=# SELECT * FROM pt_trigger ORDER BY 1; tg_name | tg_table_name | tg_level | tg_when -+---+--+- (0 rows) no statement level trigger fired in this case, is this expected behaviour?? but if i am creating triggers on leaf partition, trigger is getting fired. CREATE TRIGGER pt_trigger_after_p1 AFTER UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); CREATE TRIGGER pt_trigger_before_p1 BEFORE UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger(); postgres=# UPDATE pt SET a = 5 WHERE a = 4; UPDATE 1 postgres=# SELECT * FROM pt_trigger ORDER BY 1; tg_name| tg_table_name | tg_level | tg_when --+---+---+- pt_trigger_after_p1 | pt1 | STATEMENT | AFTER pt_trigger_before_p1 | pt1 | STATEMENT | BEFORE (2 rows) Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Re: [HACKERS] Declarative partitioning vs. information_schema
On 2017/04/14 5:28, Robert Haas wrote: > On Thu, Apr 6, 2017 at 3:14 AM, Amit Langote > wrote: >>> The bulk of operations that work on traditional tables also work on >>> partitions >>> and partitioned tables. The next closest kind of relation, a materialized >>> view, is far less table-like. Therefore, I recommend showing both >>> partitions >>> and partitioned tables in those views. This is also consistent with the >>> decision to use words like "partition" and "partitioned" in messages only >>> when >>> partitioning is relevant to the error. For example, ATWrongRelkindError() >>> distinguishes materialized views from tables, but it does not distinguish >>> tables based on their participation in partitioning. >> >> +1 > > OK, whoever wants to write the patch, please step forward. Sorry, perhaps I'm missing something, but I thought there was no patch left to be written, because the original patch (this thread) implemented what Noah recommended. As of HEAD (6cfaffc0ddc): create table p (a int, b char) partition by list (a); create table p1 partition of p for values in (1) partition by list (b); create table p1a partition of p1 for values in ('a'); \d List of relations Schema | Name | Type | Owner +--+---+--- public | p| table | amit public | p1 | table | amit public | p1a | table | amit (3 rows) select tablename from pg_tables where schemaname = 'public'; tablename --- p p1 p1a (3 rows) select table_name from information_schema.tables where table_schema = 'public'; table_name p p1 p1a (3 rows) Also, it seems that this open item has been listed under Non-bugs, with remark "firm support for status quo, lack of firm support for alternatives". Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On Thu, Apr 6, 2017 at 3:14 AM, Amit Langote wrote: >> The bulk of operations that work on traditional tables also work on >> partitions >> and partitioned tables. The next closest kind of relation, a materialized >> view, is far less table-like. Therefore, I recommend showing both partitions >> and partitioned tables in those views. This is also consistent with the >> decision to use words like "partition" and "partitioned" in messages only >> when >> partitioning is relevant to the error. For example, ATWrongRelkindError() >> distinguishes materialized views from tables, but it does not distinguish >> tables based on their participation in partitioning. > > +1 OK, whoever wants to write the patch, please step forward. /me steps backward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 07.04.2017 13:05, Etsuro Fujita wrote: On 2016/12/14 16:20, Etsuro Fujita wrote: On 2016/12/09 19:46, Maksim Milyutin wrote: I would like to work on two tasks: - insert (and eventually update) tuple routing for foreign partition. - the ability to create an index on the parent and have all of the children inherit it; The first one has been implemented in pg_pathman somehow, but the code relies on dirty hacks, so the FDW API has to be improved. As for the extended index support, it doesn't look like a super-hard task. That would be great! I'd like to help review the first one. There seems to be no work on the first one, so I'd like to work on that. Yes, you can start to work on this, I'll join later as a reviewer. -- Maksim Milyutin Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/14 16:20, Etsuro Fujita wrote: On 2016/12/09 19:46, Maksim Milyutin wrote: I would like to work on two tasks: - insert (and eventually update) tuple routing for foreign partition. - the ability to create an index on the parent and have all of the children inherit it; The first one has been implemented in pg_pathman somehow, but the code relies on dirty hacks, so the FDW API has to be improved. As for the extended index support, it doesn't look like a super-hard task. That would be great! I'd like to help review the first one. There seems to be no work on the first one, so I'd like to work on that. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On 2017/04/06 16:02, Noah Misch wrote: > On Wed, Jan 25, 2017 at 01:19:00PM -0500, Robert Haas wrote: >> On Wed, Jan 25, 2017 at 1:04 PM, Peter Eisentraut >> wrote: >>> On 1/18/17 2:32 PM, Robert Haas wrote: Unless we can find something official, I suppose we should just display BASE TABLE in that case as we do in other cases. I wonder if the schema needs some broader revision; for example, are there information_schema elements intended to show information about partitions? >>> >>> Is it intentional that we show the partitions by default in \d, >>> pg_tables, information_schema.tables? Or should we treat those as >>> somewhat-hidden details? >> >> I'm not really sure what the right thing to do is there. I was hoping >> you had an opinion. > > The bulk of operations that work on traditional tables also work on partitions > and partitioned tables. The next closest kind of relation, a materialized > view, is far less table-like. Therefore, I recommend showing both partitions > and partitioned tables in those views. This is also consistent with the > decision to use words like "partition" and "partitioned" in messages only when > partitioning is relevant to the error. For example, ATWrongRelkindError() > distinguishes materialized views from tables, but it does not distinguish > tables based on their participation in partitioning. +1 Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On Wed, Jan 25, 2017 at 01:19:00PM -0500, Robert Haas wrote: > On Wed, Jan 25, 2017 at 1:04 PM, Peter Eisentraut > wrote: > > On 1/18/17 2:32 PM, Robert Haas wrote: > >> Unless we can find something official, I suppose we should just > >> display BASE TABLE in that case as we do in other cases. I wonder if > >> the schema needs some broader revision; for example, are there > >> information_schema elements intended to show information about > >> partitions? > > > > Is it intentional that we show the partitions by default in \d, > > pg_tables, information_schema.tables? Or should we treat those as > > somewhat-hidden details? > > I'm not really sure what the right thing to do is there. I was hoping > you had an opinion. The bulk of operations that work on traditional tables also work on partitions and partitioned tables. The next closest kind of relation, a materialized view, is far less table-like. Therefore, I recommend showing both partitions and partitioned tables in those views. This is also consistent with the decision to use words like "partition" and "partitioned" in messages only when partitioning is relevant to the error. For example, ATWrongRelkindError() distinguishes materialized views from tables, but it does not distinguish tables based on their participation in partitioning. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On 2017/01/26 3:19, Robert Haas wrote: > On Wed, Jan 25, 2017 at 1:04 PM, Peter Eisentraut > wrote: >> On 1/18/17 2:32 PM, Robert Haas wrote: >>> Unless we can find something official, I suppose we should just >>> display BASE TABLE in that case as we do in other cases. I wonder if >>> the schema needs some broader revision; for example, are there >>> information_schema elements intended to show information about >>> partitions? >> >> Is it intentional that we show the partitions by default in \d, >> pg_tables, information_schema.tables? Or should we treat those as >> somewhat-hidden details? > > I'm not really sure what the right thing to do is there. I was hoping > you had an opinion. I guess this is an open item then. I think Greg Stark brought this up too on the original partitioning thread [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/CAM-w4HOZ5fPS7GoCTTrW42q01%2BwPrOWFCnr9H0iDyVTZP2H1CA%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
On Fri, Mar 24, 2017 at 11:06 PM, Simon Riggs wrote: > On 1 March 2017 at 01:36, Amit Langote wrote: > >> I don't know which way you're thinking of fixing this, but a planner patch >> to implement faster partition-pruning will have taken care of this, I >> think. As you may know, even declarative partitioned tables currently >> depend on constraint exclusion for partition-pruning and planner's current >> approach of handling inheritance requires to open all the child tables >> (partitions), whereas the new approach hopefully shouldn't need to do >> that. I am not sure if looking for a more localized fix for this would be >> worthwhile, although I may be wrong. > > What "new approach" are we discussing? > > Is there a patch or design discussion? Neither at the moment. As Aleksander said in his reply I was referring to Dmitry Ivanov's plan of porting pg_pathman's planner functionality to core that he mentioned on the declarative partitioning thread back in December [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Simon, > > I don't know which way you're thinking of fixing this, but a planner patch > > to implement faster partition-pruning will have taken care of this, I > > think. As you may know, even declarative partitioned tables currently > > depend on constraint exclusion for partition-pruning and planner's current > > approach of handling inheritance requires to open all the child tables > > (partitions), whereas the new approach hopefully shouldn't need to do > > that. I am not sure if looking for a more localized fix for this would be > > worthwhile, although I may be wrong. > > What "new approach" are we discussing? > Is there a patch or design discussion? I think what was meant was plans of my colleague Dmitry Ivanov to implement partition-pruning. I've just spoke with Dmitry about this matter. Unless there is anyone else who is already working on this optimization we would like to work on it together. Unfortunately there is no patch or design discussion of partition-pruning on this commitfest. -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
On 1 March 2017 at 01:36, Amit Langote wrote: > I don't know which way you're thinking of fixing this, but a planner patch > to implement faster partition-pruning will have taken care of this, I > think. As you may know, even declarative partitioned tables currently > depend on constraint exclusion for partition-pruning and planner's current > approach of handling inheritance requires to open all the child tables > (partitions), whereas the new approach hopefully shouldn't need to do > that. I am not sure if looking for a more localized fix for this would be > worthwhile, although I may be wrong. What "new approach" are we discussing? Is there a patch or design discussion? -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Tels, Thanks a lot for the review! > "corresponding" Fixed. > Also a question: Some one-line comments are > > /* Comment. */ > > while others are > > /* > * Comment. > */ > > Why the difference? I'm trying to follow a code stile used in a code I'm modifying. In this case I got an impression that second style of one-line comments is used more often an I tried to follow it but apparently I didn't succeed :) This is fixed as well. On Thu, Mar 09, 2017 at 06:40:39PM -0500, Tels wrote: > Hi Aleksander, > > noticed this in your patch: > > > * Add a corespinding entry to pgStatTabHash. > > "corresponding" > > Also a question: Some one-line comments are > > /* Comment. */ > > while others are > > /* > * Comment. > */ > > Why the difference? > > Hope this helps, > > Tels -- Best regards, Aleksander Alekseev diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 7cacb1e9b2..a22a5a37c8 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -161,6 +161,20 @@ typedef struct TabStatusArray static TabStatusArray *pgStatTabList = NULL; /* + * pgStatTabHash entry + */ +typedef struct TabStatHashEntry +{ + Oid t_id; + PgStat_TableStatus* tsa_entry; +} TabStatHashEntry; + +/* + * Hash table for O(1) t_id -> tsa_entry lookup + */ +static HTAB *pgStatTabHash = NULL; + +/* * Backends store per-function info that's waiting to be sent to the collector * in this hash table (indexed by function OID). */ @@ -815,7 +829,13 @@ pgstat_report_stat(bool force) pgstat_send_tabstat(this_msg); this_msg->m_nentries = 0; } + + /* + * Entry will be zeroed soon so we need to remove it from the lookup table. + */ + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL); } + /* zero out TableStatus structs after use */ MemSet(tsa->tsa_entries, 0, tsa->tsa_used * sizeof(PgStat_TableStatus)); @@ -1667,59 +1687,77 @@ pgstat_initstats(Relation rel) } /* + * Make sure pgStatTabList and pgStatTabHash are initialized. + */ +static void +make_sure_stat_tab_initialized() +{ + HASHCTL ctl; + + if (pgStatTabList) + return; + + Assert(!pgStatTabHash); + + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(TabStatHashEntry); + ctl.hcxt = TopMemoryContext; + + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup table", + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + + pgStatTabList = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, +sizeof(TabStatusArray)); +} + +/* * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel */ static PgStat_TableStatus * get_tabstat_entry(Oid rel_id, bool isshared) { + TabStatHashEntry* hash_entry; PgStat_TableStatus *entry; TabStatusArray *tsa; - TabStatusArray *prev_tsa; - int i; + bool found; + + make_sure_stat_tab_initialized(); + + /* + * Find an entry or create a new one. + */ + hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found); + if(found) + return hash_entry->tsa_entry; /* - * Search the already-used tabstat slots for this relation. + * `hash_entry` was just created and now we have to fill it. + * First make sure there is a free space in a last element of pgStatTabList. 
*/ - prev_tsa = NULL; - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = tsa->tsa_next) + tsa = pgStatTabList; + while(tsa->tsa_used == TABSTAT_QUANTUM) { - for (i = 0; i < tsa->tsa_used; i++) + if(tsa->tsa_next == NULL) { - entry = &tsa->tsa_entries[i]; - if (entry->t_id == rel_id) -return entry; + tsa->tsa_next = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, + sizeof(TabStatusArray)); } - if (tsa->tsa_used < TABSTAT_QUANTUM) - { - /* - * It must not be present, but we found a free slot instead. Fine, - * let's use this one. We assume the entry was already zeroed, - * either at creation or after last use. - */ - entry = &tsa->tsa_entries[tsa->tsa_used++]; - entry->t_id = rel_id; - entry->t_shared = isshared; - return entry; - } + tsa = tsa->tsa_next; } /* - * We ran out of tabstat slots, so allocate more. Be sure they're zeroed. - */ - tsa = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, - sizeof(TabStatusArray)); - if (prev_tsa) - prev_tsa->tsa_next = tsa; - else - pgStatTabList = tsa; - - /* - * Use the first entry of the new TabStatusArray. + * Add an entry. */ entry = &tsa->tsa_entries[tsa->tsa_used++]; entry->t_id = rel_id; entry->t_shared = isshared; + + /* + * Add a corresponding entry to pgStatTabHash. + */ + hash_entry->tsa_entry = entry; return entry; } @@ -1731,22 +1769,19 @@ get_tabstat_entry(Oid rel_id, bool isshared) PgStat_TableStatus * find_tabstat_entry(Oid rel_id) { - PgStat_TableStatus *entry; - TabStatusArray *tsa; - int i; + TabStatHashEntry* hash_entry; - for (tsa = pgStatTabList; tsa
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, noticed this in your patch: > * Add a corespinding entry to pgStatTabHash. "corresponding" Also a question: Some one-line comments are /* Comment. */ while others are /* * Comment. */ Why the difference? Hope this helps, Tels PS: Sorry if this appears twice, I fumbled the From: ... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, noticed this in your patch: > * Add a corespinding entry to pgStatTabHash. "corresponding" Also a question: Some one-line comments are /* Comment. */ while others are /* * Comment. */ Why the difference? Hope this helps, Tels -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Amit, > Sorry, I didn't mean to dissuade you from trying those > micro-optimizations. If general inheritance cases can benefit from it > (which, until we have a different method, will be used by partitioned > tables as well), I think we should try it. OK, I'll see what could be done here as well then. On Tue, Mar 07, 2017 at 10:55:12AM +0900, Amit Langote wrote: > Hi Aleksander, > > On 2017/03/07 0:22, Aleksander Alekseev wrote: > > Hello. > > > > OK, here is a patch. > > > > Benchmark, before: > > > > ``` > > number of transactions actually processed: 1823 > > latency average = 1153.495 ms > > latency stddev = 154.366 ms > > tps = 6.061104 (including connections establishing) > > tps = 6.061211 (excluding connections establishing) > > ``` > > > > Benchmark, after: > > > > ``` > > number of transactions actually processed: 2462 > > latency average = 853.862 ms > > latency stddev = 112.270 ms > > tps = 8.191861 (including connections establishing) > > tps = 8.192028 (excluding connections establishing) > > ``` > > > > +35% TPS, just as expected. Feel free to run your own benchmarks on > > different datasets and workloads. `perf top` shows that first bottleneck > > is completely eliminated. > > That seems like a good gain. > > > I did nothing about the second bottleneck > > since as Amit mentioned partition-pruning should solve this anyway and > > apparently any micro-optimizations don't worth an effort. > > Sorry, I didn't mean to dissuade you from trying those > micro-optimizations. If general inheritance cases can benefit from it > (which, until we have a different method, will be used by partitioned > tables as well), I think we should try it. > > Thanks, > Amit > > -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, Andres Thanks a lot for the review! > Why are we keeping that list / the "batch" allocation pattern? It > doesn't actually seem useful to me after these changes. Given that we > don't shrink hash-tables, we could just use the hash-table memory for > this, no? I don't think we can do that. According to comments: ``` * NOTE: once allocated, TabStatusArray structures are never moved or deleted [...] * This allows relcache pgstat_info pointers to be treated as long-lived data, * avoiding repeated searches in pgstat_initstats() when a relation is * repeatedly opened during a transaction. ``` It is my understanding that dynahash can't give same guarantees. > 'faster ... lookup' doesn't strike me as a useful comment, because it's > only useful in relation of the current code to the new code. Agree, fixed to 'O(1) ... lookup'. > It's not really freed... Right, it's actually zeroed. Fixed. > Arguably it'd be better to HASH_ENTER directly here, instead of doing > two lookups. Good point. Fixed. > Think it'd be better to move the hash creation into its own function. Agree, fixed. On Mon, Mar 06, 2017 at 12:01:50PM -0800, Andres Freund wrote: > Hi, > > This issue has bothered me in non-partitioned use-cases recently, so > thanks for taking it up. > > > On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote: > > diff --git a/src/backend/postmaster/pgstat.c > > b/src/backend/postmaster/pgstat.c > > index 2fb9a8bf58..fa906e7930 100644 > > --- a/src/backend/postmaster/pgstat.c > > +++ b/src/backend/postmaster/pgstat.c > > @@ -160,6 +160,16 @@ typedef struct TabStatusArray > > > > static TabStatusArray *pgStatTabList = NULL; > > Why are we keeping that list / the "batch" allocation pattern? It > doesn't actually seem useful to me after these changes. Given that we > don't shrink hash-tables, we could just use the hash-table memory for > this, no? > > I think a separate list that only contains things that are "active" in > the current transaction makes sense, because the current logic requires > iterating over a potentially very large array at transaction commit. > > > > +/* pgStatTabHash entry */ > > +typedef struct TabStatHashEntry > > +{ > > + Oid t_id; > > + PgStat_TableStatus* tsa_entry; > > +} TabStatHashEntry; > > + > > +/* Hash table for faster t_id -> tsa_entry lookup */ > > +static HTAB *pgStatTabHash = NULL; > > + > > 'faster ... lookup' doesn't strike me as a useful comment, because it's > only useful in relation of the current code to the new code. > > > > > /* > > * Backends store per-function info that's waiting to be sent to the > > collector > > * in this hash table (indexed by function OID). > > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) > > pgstat_send_tabstat(this_msg); > > this_msg->m_nentries = 0; > > } > > + > > + /* > > +* Entry will be freed soon so we need to remove it > > from the lookup table. > > +*/ > > + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, > > NULL); > > } > > It's not really freed... > > > > static PgStat_TableStatus * > > get_tabstat_entry(Oid rel_id, bool isshared) > > { > > + TabStatHashEntry* hash_entry; > > PgStat_TableStatus *entry; > > TabStatusArray *tsa; > > - TabStatusArray *prev_tsa; > > - int i; > > + > > + /* Try to find an entry */ > > + entry = find_tabstat_entry(rel_id); > > + if(entry != NULL) > > + return entry; > > Arguably it'd be better to HASH_ENTER directly here, instead of doing > two lookups. > > > > /* > > -* Search the already-used tabstat slots for this relation. > > +* Entry doesn't exist - creating one. 
> > +* First make sure there is a free space in a last element of > > pgStatTabList. > > */ > > - prev_tsa = NULL; > > - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = > > tsa->tsa_next) > > + if (!pgStatTabList) > > { > > - for (i = 0; i < tsa->tsa_used; i++) > > - { > > - entry = &tsa->tsa_entries[i]; > > - if (entry->t_id == rel_id) > > - return entry; > > - } > > + HASHCTL ctl; > > > > - if (tsa->tsa_used < TABSTAT_QUANTUM) > > + Assert(!pgStatTabHash); > > + > > + memset(&ctl, 0, sizeof(ctl)); > > + ctl.keysize = sizeof(Oid); > > + ctl.entrysize = sizeof(TabStatHashEntry); > > + ctl.hcxt = TopMemoryContext; > > + > > + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup > > table", > > + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | > > HASH_CONTEXT); > > Think it'd be better to move the hash creation into its own function. > > > - Andres -- Best regards, Aleksander Alekseev diff --git a/src/backend/postma
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, On 2017/03/07 0:22, Aleksander Alekseev wrote: > Hello. > > OK, here is a patch. > > Benchmark, before: > > ``` > number of transactions actually processed: 1823 > latency average = 1153.495 ms > latency stddev = 154.366 ms > tps = 6.061104 (including connections establishing) > tps = 6.061211 (excluding connections establishing) > ``` > > Benchmark, after: > > ``` > number of transactions actually processed: 2462 > latency average = 853.862 ms > latency stddev = 112.270 ms > tps = 8.191861 (including connections establishing) > tps = 8.192028 (excluding connections establishing) > ``` > > +35% TPS, just as expected. Feel free to run your own benchmarks on > different datasets and workloads. `perf top` shows that first bottleneck > is completely eliminated. That seems like a good gain. > I did nothing about the second bottleneck > since as Amit mentioned partition-pruning should solve this anyway and > apparently any micro-optimizations don't worth an effort. Sorry, I didn't mean to dissuade you from trying those micro-optimizations. If general inheritance cases can benefit from it (which, until we have a different method, will be used by partitioned tables as well), I think we should try it. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, This issue has bothered me in non-partitioned use-cases recently, so thanks for taking it up. On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote: > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c > index 2fb9a8bf58..fa906e7930 100644 > --- a/src/backend/postmaster/pgstat.c > +++ b/src/backend/postmaster/pgstat.c > @@ -160,6 +160,16 @@ typedef struct TabStatusArray > > static TabStatusArray *pgStatTabList = NULL; Why are we keeping that list / the "batch" allocation pattern? It doesn't actually seem useful to me after these changes. Given that we don't shrink hash-tables, we could just use the hash-table memory for this, no? I think a separate list that only contains things that are "active" in the current transaction makes sense, because the current logic requires iterating over a potentially very large array at transaction commit. > +/* pgStatTabHash entry */ > +typedef struct TabStatHashEntry > +{ > + Oid t_id; > + PgStat_TableStatus* tsa_entry; > +} TabStatHashEntry; > + > +/* Hash table for faster t_id -> tsa_entry lookup */ > +static HTAB *pgStatTabHash = NULL; > + 'faster ... lookup' doesn't strike me as a useful comment, because it's only useful in relation of the current code to the new code. > /* > * Backends store per-function info that's waiting to be sent to the > collector > * in this hash table (indexed by function OID). > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) > pgstat_send_tabstat(this_msg); > this_msg->m_nentries = 0; > } > + > + /* > + * Entry will be freed soon so we need to remove it > from the lookup table. > + */ > + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, > NULL); > } It's not really freed... > static PgStat_TableStatus * > get_tabstat_entry(Oid rel_id, bool isshared) > { > + TabStatHashEntry* hash_entry; > PgStat_TableStatus *entry; > TabStatusArray *tsa; > - TabStatusArray *prev_tsa; > - int i; > + > + /* Try to find an entry */ > + entry = find_tabstat_entry(rel_id); > + if(entry != NULL) > + return entry; Arguably it'd be better to HASH_ENTER directly here, instead of doing two lookups. > /* > - * Search the already-used tabstat slots for this relation. > + * Entry doesn't exist - creating one. > + * First make sure there is a free space in a last element of > pgStatTabList. >*/ > - prev_tsa = NULL; > - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = > tsa->tsa_next) > + if (!pgStatTabList) > { > - for (i = 0; i < tsa->tsa_used; i++) > - { > - entry = &tsa->tsa_entries[i]; > - if (entry->t_id == rel_id) > - return entry; > - } > + HASHCTL ctl; > > - if (tsa->tsa_used < TABSTAT_QUANTUM) > + Assert(!pgStatTabHash); > + > + memset(&ctl, 0, sizeof(ctl)); > + ctl.keysize = sizeof(Oid); > + ctl.entrysize = sizeof(TabStatHashEntry); > + ctl.hcxt = TopMemoryContext; > + > + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup > table", > + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | > HASH_CONTEXT); Think it'd be better to move the hash creation into its own function. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hello. OK, here is a patch. Benchmark, before: ``` number of transactions actually processed: 1823 latency average = 1153.495 ms latency stddev = 154.366 ms tps = 6.061104 (including connections establishing) tps = 6.061211 (excluding connections establishing) ``` Benchmark, after: ``` number of transactions actually processed: 2462 latency average = 853.862 ms latency stddev = 112.270 ms tps = 8.191861 (including connections establishing) tps = 8.192028 (excluding connections establishing) ``` +35% TPS, just as expected. Feel free to run your own benchmarks on different datasets and workloads. `perf top` shows that first bottleneck is completely eliminated. I did nothing about the second bottleneck since as Amit mentioned partition-pruning should solve this anyway and apparently any micro-optimizations don't worth an effort. Also I checked test coverage using lcov to make sure that this code is test covered. An exact script I'm using could be found here [1]. [1] https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh On Wed, Mar 01, 2017 at 10:36:24AM +0900, Amit Langote wrote: > Hi, > > On 2017/02/28 23:25, Aleksander Alekseev wrote: > > Hello. > > > > I decided to figure out whether current implementation of declarative > > partitioning has any bottlenecks when there is a lot of partitions. Here > > is what I did [1]. > > Thanks for sharing. > > > Then: > > > > ``` > > # 2580 is some pk that exists > > echo 'select * from part_test where pk = 2580;' > t.sql > > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax > > ``` > > > > `perf top` showed to bottlenecks [2]. A stacktrace for the first one > > looks like this [3]: > > > > ``` > > 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') > > at pgstat.c:1689 > > 1689if (entry->t_id == rel_id) > > #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 > > '\000') at pgstat.c:1689 > > #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at > > pgstat.c:1666 > > #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at > > heapam.c:1137 > > #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at > > heapam.c:1291 > > (skipped) > > ``` > > > > And here is a stacktrace for the second bottleneck [4]: > > > > ``` > > 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > > numparents=0x0) at pg_inherits.c:199 > > 199 forboth(lo, rels_list, li, rel_numparents) > > #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, > > lockmode=1, numparents=0x0) at pg_inherits.c:199 > > #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, > > rte=0x1b630b8, rti=1) at prepunion.c:1408 > > #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at > > prepunion.c:1335 > > #3 0x00767526 in subquery_planner (glob=0x1b63cc0, > > parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) > > at planner.c:568 > > (skipped) > > ``` > > > > The first one could be easily fixed by introducing a hash table > > (rel_id -> pgStatList entry). Perhaps hash table should be used only > > after some threshold. Unless there are any objections I will send a > > corresponding patch shortly. > > I have never thought about this one really. > > > I didn't explored the second bottleneck closely yet but at first glance > > it doesn't look much more complicated. > > I don't know which way you're thinking of fixing this, but a planner patch > to implement faster partition-pruning will have taken care of this, I > think. 
As you may know, even declarative partitioned tables currently > depend on constraint exclusion for partition-pruning and planner's current > approach of handling inheritance requires to open all the child tables > (partitions), whereas the new approach hopefully shouldn't need to do > that. I am not sure if looking for a more localized fix for this would be > worthwhile, although I may be wrong. > > Thanks, > Amit > > -- Best regards, Aleksander Alekseev diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 2fb9a8bf58..fa906e7930 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -160,6 +160,16 @@ typedef struct TabStatusArray static TabStatusArray *pgStatTabList = NULL; +/* pgStatTabHash entry */ +typedef struct TabStatHashEntry +{ + Oid t_id; + PgStat_TableStatus* tsa_entry; +} TabStatHashEntry; + +/* Hash table for faster t_id -> tsa_entry lookup */ +static HTAB *pgStatTabHash = NULL; + /* * Backends store per-function info that's waiting to be sent to the collector * in this hash table (indexed by function OID). @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) pgstat_send_tabstat(this_msg); this_msg->m_nentries = 0; } + + /* + * Entry will be freed soon so we need to remove it from the lookup table. + */ + hash_search(pgStatTabH
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, On 2017/02/28 23:25, Aleksander Alekseev wrote: > Hello. > > I decided to figure out whether current implementation of declarative > partitioning has any bottlenecks when there is a lot of partitions. Here > is what I did [1]. Thanks for sharing. > Then: > > ``` > # 2580 is some pk that exists > echo 'select * from part_test where pk = 2580;' > t.sql > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax > ``` > > `perf top` showed to bottlenecks [2]. A stacktrace for the first one > looks like this [3]: > > ``` > 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at > pgstat.c:1689 > 1689 if (entry->t_id == rel_id) > #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') > at pgstat.c:1689 > #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at > pgstat.c:1666 > #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at > heapam.c:1137 > #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at > heapam.c:1291 > (skipped) > ``` > > And here is a stacktrace for the second bottleneck [4]: > > ``` > 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > numparents=0x0) at pg_inherits.c:199 > 199 forboth(lo, rels_list, li, rel_numparents) > #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > numparents=0x0) at pg_inherits.c:199 > #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, > rte=0x1b630b8, rti=1) at prepunion.c:1408 > #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at > prepunion.c:1335 > #3 0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, > parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568 > (skipped) > ``` > > The first one could be easily fixed by introducing a hash table > (rel_id -> pgStatList entry). Perhaps hash table should be used only > after some threshold. Unless there are any objections I will send a > corresponding patch shortly. I have never thought about this one really. > I didn't explored the second bottleneck closely yet but at first glance > it doesn't look much more complicated. I don't know which way you're thinking of fixing this, but a planner patch to implement faster partition-pruning will have taken care of this, I think. As you may know, even declarative partitioned tables currently depend on constraint exclusion for partition-pruning and planner's current approach of handling inheritance requires to open all the child tables (partitions), whereas the new approach hopefully shouldn't need to do that. I am not sure if looking for a more localized fix for this would be worthwhile, although I may be wrong. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Declarative partitioning optimization for large amount of partitions
Hello. I decided to figure out whether current implementation of declarative partitioning has any bottlenecks when there is a lot of partitions. Here is what I did [1]. ``` -- init schema \timing on CREATE TABLE part_test (pk int not null, k int, v varchar(128)) PARTITION BY RANGE(pk); do $$ declare i integer; begin for i in 1 .. 1 loop raise notice 'i = %', i; execute ('CREATE TABLE part_test_' || i || ' PARTITION OF part_test FOR VALUES FROM (' || (1 + (i-1)*1000) || ') to (' || ( (i * 1000) + 1) || ');' ); end loop; end $$; -- fill tables with some data do $$ declare i integer; begin for i in 1 .. 100*1000 loop raise notice 'i = %', i; execute ('insert into part_test values ( ceil(random()*(1-1)*1000), ceil(random()*1*1000), || ceil(random()*1*1000) );'); end loop; end $$; ``` Then: ``` # 2580 is some pk that exists echo 'select * from part_test where pk = 2580;' > t.sql pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax ``` `perf top` showed to bottlenecks [2]. A stacktrace for the first one looks like this [3]: ``` 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689 1689if (entry->t_id == rel_id) #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689 #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666 #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at heapam.c:1137 #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at heapam.c:1291 (skipped) ``` And here is a stacktrace for the second bottleneck [4]: ``` 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199 199 forboth(lo, rels_list, li, rel_numparents) #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199 #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, rte=0x1b630b8, rti=1) at prepunion.c:1408 #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at prepunion.c:1335 #3 0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568 (skipped) ``` The first one could be easily fixed by introducing a hash table (rel_id -> pgStatList entry). Perhaps hash table should be used only after some threshold. Unless there are any objections I will send a corresponding patch shortly. I didn't explored the second bottleneck closely yet but at first glance it doesn't look much more complicated. Please don't hesitate to share your thoughts regarding this matter. [1] http://afiskon.ru/s/e3/5f47af9102_benchmark.txt [2] http://afiskon.ru/s/00/2008c4ae66_temp.png [3] http://afiskon.ru/s/23/650f0afc89_stack.txt [4] http://afiskon.ru/s/03/a7e685a4db_stack2.txt -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Declarative partitioning - another take
Hi Amit, On Thu, 23 Feb 2017 16:29:32 +0900 Amit Langote wrote: > Thanks for taking care of that. > > + * PartitionRoot relation descriptor for parent > relation > > Maybe: relation descriptor for root parent relation This seems better. Patch is updated. Thanks, -- Yugo Nagata diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 6332ea0..af3367a 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -326,6 +326,7 @@ typedef struct JunkFilter * onConflictSetWhere list of ON CONFLICT DO UPDATE exprs (qual) * PartitionCheck partition check expression * PartitionCheckExpr partition check expression state + * PartitionRoot relation descriptor for root parent relation * */ typedef struct ResultRelInfo -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Nagata-san, On 2017/02/23 16:17, Yugo Nagata wrote: > Hi, > > I found that a comment for PartitionRoot in ResultRelInfo is missing. > Although this is trivial, since all other members have comments, I > think it is needed. Attached is the patch to fix it. Thanks for taking care of that. + * PartitionRoot relation descriptor for parent relation Maybe: relation descriptor for root parent relation Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Hi, I found that a comment for PartitionRoot in ResultRelInfo is missing. Although this is trivial, since all other members have comments, I think it is needed. Attached is the patch to fix it. Regards, Yugo Nagata On Tue, 27 Dec 2016 17:59:05 +0900 Amit Langote wrote: > On 2016/12/26 19:46, Amit Langote wrote: > > (Perhaps, the following should be its own new thread) > > > > I noticed that ExecProcessReturning() doesn't work properly after tuple > > routing (example shows how returning tableoid currently fails but I > > mention some other issues below): > > > > create table p (a int, b int) partition by range (a); > > create table p1 partition of p for values from (1) to (10); > > insert into p values (1) returning tableoid::regclass, *; > > tableoid | a | b > > --+---+--- > > -| 1 | > > (1 row) > > > > INSERT 0 1 > > > > I tried to fix that in 0007 to get: > > > > insert into p values (1) returning tableoid::regclass, *; > > tableoid | a | b > > --+---+--- > > p| 1 | > > (1 row) > > > > INSERT 0 1 > > > > But I think it *may* be wrong to return the root table OID for tuples > > inserted into leaf partitions, because with select we get partition OIDs: > > > > select tableoid::regclass, * from p; > > tableoid | a | b > > --+---+--- > > p1 | 1 | > > (1 row) > > > > If so, that means we should build the projection info (corresponding to > > the returning list) for each target partition somehow. ISTM, that's going > > to have to be done within the planner by appropriate inheritance > > translation of the original returning targetlist. > > Turns out getting the 2nd result may not require planner tweaks after all. > Unless I'm missing something, translation of varattnos of the RETURNING > target list can be done as late as ExecInitModifyTable() for the insert > case, unlike update/delete (which do require planner's attention). > > I updated the patch 0007 to implement the same, including the test. While > doing that, I realized map_partition_varattnos introduced in 0003 is > rather restrictive in its applicability, because it assumes varno = 1 for > the expressions it accepts as input for the mapping. Mapping returning > (target) list required modifying map_partition_varattnos to accept > target_varno as additional argument. That way, we can map arbitrary > expressions from the parent attributes numbers to partition attribute > numbers for expressions not limited to partition constraints. > > Patches 0001 to 0006 unchanged. > > Thanks, > Amit -- Yugo Nagata diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 6332ea0..845bdf2 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -326,6 +326,7 @@ typedef struct JunkFilter * onConflictSetWhere list of ON CONFLICT DO UPDATE exprs (qual) * PartitionCheck partition check expression * PartitionCheckExpr partition check expression state + * PartitionRoot relation descriptor for parent relation * */ typedef struct ResultRelInfo -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Fri, Feb 10, 2017 at 12:54 AM, Amit Langote wrote: > On 2017/02/09 15:22, amul sul wrote: >> About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch, >> following test is already covered in alter_table.sql @ Line # 1918, >> instead of this kindly add test for list_partition: >> >> 77 +-- cannot drop NOT NULL constraint of a range partition key column >> 78 +ALTER TABLE range_parted ALTER a DROP NOT NULL; >> 79 + > > Thanks for the review! You're right. Updated patch attached. Committed, thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/02/09 15:22, amul sul wrote: > About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch, > following test is already covered in alter_table.sql @ Line # 1918, > instead of this kindly add test for list_partition: > > 77 +-- cannot drop NOT NULL constraint of a range partition key column > 78 +ALTER TABLE range_parted ALTER a DROP NOT NULL; > 79 + Thanks for the review! You're right. Updated patch attached. Thanks, Amit >From a4335048e92462fb55fa6db0d830c7c066cd7011 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 10 Feb 2017 14:52:18 +0900 Subject: [PATCH] Check partition strategy in ATExecDropNotNull We should prevent dropping the NOT NULL constraint on a column only if the column is in the *range* partition key (as the error message also says). Add a regression test while at it. Reported by: Amul Sul Patch by: Amit Langote Report: https://postgr.es/m/CAAJ_b95g5AgkpJKbLajAt8ovKub874-B9X08PiOqHvVfMO2mLw%40mail.gmail.com --- src/backend/commands/tablecmds.c | 22 +- src/test/regress/expected/alter_table.out | 4 src/test/regress/sql/alter_table.sql | 5 + 3 files changed, 22 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 37a4c4a3d6..f33aa70da6 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -5593,18 +5593,22 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode) if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { PartitionKey key = RelationGetPartitionKey(rel); - int partnatts = get_partition_natts(key), - i; - for (i = 0; i < partnatts; i++) + if (get_partition_strategy(key) == PARTITION_STRATEGY_RANGE) { - AttrNumber partattnum = get_partition_col_attnum(key, i); + int partnatts = get_partition_natts(key), + i; - if (partattnum == attnum) -ereport(ERROR, - (errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("column \"%s\" is in range partition key", -colName))); + for (i = 0; i < partnatts; i++) + { +AttrNumber partattnum = get_partition_col_attnum(key, i); + +if (partattnum == attnum) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TABLE_DEFINITION), + errmsg("column \"%s\" is in range partition key", + colName))); + } } } diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index d8e7b61294..b0e80a7788 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -3029,6 +3029,10 @@ ERROR: cannot alter type of column referenced in partition key expression -- cannot drop NOT NULL on columns in the range partition key ALTER TABLE partitioned ALTER COLUMN a DROP NOT NULL; ERROR: column "a" is in range partition key +-- it's fine however to drop one on the list partition key column +CREATE TABLE list_partitioned (a int not null) partition by list (a); +ALTER TABLE list_partitioned ALTER a DROP NOT NULL; +DROP TABLE list_partitioned; -- partitioned table cannot participate in regular inheritance CREATE TABLE foo ( a int, diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql index 1f551ec53c..7513769359 100644 --- a/src/test/regress/sql/alter_table.sql +++ b/src/test/regress/sql/alter_table.sql @@ -1917,6 +1917,11 @@ ALTER TABLE partitioned ALTER COLUMN b TYPE char(5); -- cannot drop NOT NULL on columns in the range partition key ALTER TABLE partitioned ALTER COLUMN a DROP NOT NULL; +-- it's fine however to drop one on the list partition key column +CREATE TABLE list_partitioned (a int not null) partition by list (a); +ALTER TABLE 
list_partitioned ALTER a DROP NOT NULL; +DROP TABLE list_partitioned; + -- partitioned table cannot participate in regular inheritance CREATE TABLE foo ( a int, -- 2.11.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
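A minimal sketch of the behavior the patch above settles on (table names here are illustrative, mirroring the regression test): dropping NOT NULL is rejected only for a column that is in a range partition key, while a list partition key column is unaffected.

CREATE TABLE range_parted (a int NOT NULL) PARTITION BY RANGE (a);
ALTER TABLE range_parted ALTER a DROP NOT NULL;   -- ERROR: column "a" is in range partition key
CREATE TABLE list_parted (a int NOT NULL) PARTITION BY LIST (a);
ALTER TABLE list_parted ALTER a DROP NOT NULL;    -- allowed
DROP TABLE range_parted, list_parted;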
Re: [HACKERS] Declarative partitioning - another take
About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch, following test is already covered in alter_table.sql @ Line # 1918, instead of this kindly add test for list_partition: 77 +-- cannot drop NOT NULL constraint of a range partition key column 78 +ALTER TABLE range_parted ALTER a DROP NOT NULL; 79 + Regards, Amul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/02/08 21:20, amul sul wrote: > Regarding following code in ATExecDropNotNull function, I don't see > any special check for RANGE partitioned, is it intended to have same > restriction for LIST partitioned too, I guess not? > > /* > * If the table is a range partitioned table, check that the column is not > * in the partition key. > */ > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > PartitionKey key = RelationGetPartitionKey(rel); > int partnatts = get_partition_natts(key), > i; > > for (i = 0; i < partnatts; i++) > { > AttrNumber partattnum = get_partition_col_attnum(key, i); > > if (partattnum == attnum) > ereport(ERROR, > (errcode(ERRCODE_INVALID_TABLE_DEFINITION), > errmsg("column \"%s\" is in range partition key", > colName))); > } > } Good catch! Seems like an oversight, attached fixes it. Thanks, Amit >From b84ac63b6b4acd19a09e8507cf199db127c06d71 Mon Sep 17 00:00:00 2001 From: amit Date: Thu, 9 Feb 2017 10:31:58 +0900 Subject: [PATCH] Check partition strategy in ATExecDropNotNull We should prevent dropping the NOT NULL constraint on a column only if the column is in the *range* partition key (as the error message also says). Add a regression test while at it. Reported by: Amul Sul Patch by: Amit Langote Report: https://postgr.es/m/CAAJ_b95g5AgkpJKbLajAt8ovKub874-B9X08PiOqHvVfMO2mLw%40mail.gmail.com --- src/backend/commands/tablecmds.c | 22 +- src/test/regress/expected/alter_table.out | 3 +++ src/test/regress/sql/alter_table.sql | 3 +++ 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 37a4c4a3d6..f33aa70da6 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -5593,18 +5593,22 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode) if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { PartitionKey key = RelationGetPartitionKey(rel); - int partnatts = get_partition_natts(key), - i; - for (i = 0; i < partnatts; i++) + if (get_partition_strategy(key) == PARTITION_STRATEGY_RANGE) { - AttrNumber partattnum = get_partition_col_attnum(key, i); + int partnatts = get_partition_natts(key), + i; - if (partattnum == attnum) -ereport(ERROR, - (errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("column \"%s\" is in range partition key", -colName))); + for (i = 0; i < partnatts; i++) + { +AttrNumber partattnum = get_partition_col_attnum(key, i); + +if (partattnum == attnum) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TABLE_DEFINITION), + errmsg("column \"%s\" is in range partition key", + colName))); + } } } diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index d8e7b61294..e6d45fbf9c 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -3330,6 +3330,9 @@ ALTER TABLE list_parted2 DROP COLUMN b; ERROR: cannot drop column named in partition key ALTER TABLE list_parted2 ALTER COLUMN b TYPE text; ERROR: cannot alter type of column named in partition key +-- cannot drop NOT NULL constraint of a range partition key column +ALTER TABLE range_parted ALTER a DROP NOT NULL; +ERROR: column "a" is in range partition key -- cleanup: avoid using CASCADE DROP TABLE list_parted, part_1; DROP TABLE list_parted2, part_2, part_5, part_5_a; diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql index 1f551ec53c..12edb8b3ba 100644 --- a/src/test/regress/sql/alter_table.sql +++ b/src/test/regress/sql/alter_table.sql @@ 
-2189,6 +2189,9 @@ ALTER TABLE part_2 INHERIT inh_test; ALTER TABLE list_parted2 DROP COLUMN b; ALTER TABLE list_parted2 ALTER COLUMN b TYPE text; +-- cannot drop NOT NULL constraint of a range partition key column +ALTER TABLE range_parted ALTER a DROP NOT NULL; + -- cleanup: avoid using CASCADE DROP TABLE list_parted, part_1; DROP TABLE list_parted2, part_2, part_5, part_5_a; -- 2.11.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Hi Amit, Regarding following code in ATExecDropNotNull function, I don't see any special check for RANGE partitioned, is it intended to have same restriction for LIST partitioned too, I guess not? /* * If the table is a range partitioned table, check that the column is not * in the partition key. */ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { PartitionKey key = RelationGetPartitionKey(rel); int partnatts = get_partition_natts(key), i; for (i = 0; i < partnatts; i++) { AttrNumber partattnum = get_partition_col_attnum(key, i); if (partattnum == attnum) ereport(ERROR, (errcode(ERRCODE_INVALID_TABLE_DEFINITION), errmsg("column \"%s\" is in range partition key", colName))); } } Regards, Amul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Mon, Jan 30, 2017 at 4:42 PM, Peter Eisentraut wrote: > On 1/25/17 12:54 AM, Ashutosh Bapat wrote: >> The documentation available at >> https://www.postgresql.org/docs/devel/static/sql-createtable.html, >> does not make it clear that the lower bound of a range partition is >> always inclusive and the higher one is exclusive. I think a note in >> section " PARTITION OF parent_table FOR VALUES partition_bound_spec" >> would be helpful. > > Hmm. I see the practical use of that, but I think this is going to be a > source of endless confusion. Can we make that a bit clearer in the > syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)? I am not in favor of adding mandatory noise words to the syntax; it's fairly verbose already. I think it would be reasonable to eventually consider supporting optional keywords to allow multiple behaviors, but I don't think that the usefulness of that has been so firmly established that we should do it right this minute. I think there are a heck of a lot of other things about this partitioning implementation that are more urgently in need of improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/31 6:42, Peter Eisentraut wrote: > On 1/25/17 12:54 AM, Ashutosh Bapat wrote: >> The documentation available at >> https://www.postgresql.org/docs/devel/static/sql-createtable.html, >> does not make it clear that the lower bound of a range partition is >> always inclusive and the higher one is exclusive. I think a note in >> section " PARTITION OF parent_table FOR VALUES partition_bound_spec" >> would be helpful. > > Hmm. I see the practical use of that, but I think this is going to be a > source of endless confusion. Can we make that a bit clearer in the > syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)? The decision not to make that configurable with INCLUSIVE/EXCLUSIVE syntax was deliberate. To summarize, we can start with a default configuration catering to most practical cases (that is, inclusive lower and exclusive upper bounds) and documenting so (not done yet, which I will post a doc patch today for). If it turns out that there is some demand for making that configurable, we can later add the code to handle that internally plus the syntax. But *starting* with that syntax means we have to potentially needlessly carry the code to handle seldom used cases that could not be made as efficient as it is now with all lower bounds being inclusive and upper bounds exclusive. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoZou4ApEvC_nfhOxsi5G4SoD_evwNaiYn60ZcJ4XB_-QQ%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
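A quick sketch of the default being described (illustrative names): the lower bound of a range partition is inclusive and the upper bound is exclusive, so consecutive partitions can share a boundary value without overlapping.

CREATE TABLE measurement (logdate date NOT NULL) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2016 PARTITION OF measurement
  FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE measurement_y2017 PARTITION OF measurement
  FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
INSERT INTO measurement VALUES ('2016-12-31');  -- routed to measurement_y2016
INSERT INTO measurement VALUES ('2017-01-01');  -- routed to measurement_y2017: FROM is inclusive, TO is exclusive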
Re: [HACKERS] Declarative partitioning - another take
On 1/25/17 12:54 AM, Ashutosh Bapat wrote: > The documentation available at > https://www.postgresql.org/docs/devel/static/sql-createtable.html, > does not make it clear that the lower bound of a range partition is > always inclusive and the higher one is exclusive. I think a note in > section " PARTITION OF parent_table FOR VALUES partition_bound_spec" > would be helpful. Hmm. I see the practical use of that, but I think this is going to be a source of endless confusion. Can we make that a bit clearer in the syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On Wed, Jan 25, 2017 at 1:04 PM, Peter Eisentraut wrote: > On 1/18/17 2:32 PM, Robert Haas wrote: >> Unless we can find something official, I suppose we should just >> display BASE TABLE in that case as we do in other cases. I wonder if >> the schema needs some broader revision; for example, are there >> information_schema elements intended to show information about >> partitions? > > Is it intentional that we show the partitions by default in \d, > pg_tables, information_schema.tables? Or should we treat those as > somewhat-hidden details? I'm not really sure what the right thing to do is there. I was hoping you had an opinion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On 1/18/17 2:32 PM, Robert Haas wrote: > Unless we can find something official, I suppose we should just > display BASE TABLE in that case as we do in other cases. I wonder if > the schema needs some broader revision; for example, are there > information_schema elements intended to show information about > partitions? Is it intentional that we show the partitions by default in \d, pg_tables, information_schema.tables? Or should we treat those as somewhat-hidden details? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Hi Ashutosh, On 2017/01/25 14:54, Ashutosh Bapat wrote: > The documentation available at > https://www.postgresql.org/docs/devel/static/sql-createtable.html, > does not make it clear that the lower bound of a range partition is > always inclusive and the higher one is exclusive. I think a note in > section " PARTITION OF parent_table FOR VALUES partition_bound_spec" > would be helpful. Yes, I'm working on a documentation patch for that and a few other things such as revising the Partitioning chapter. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
The documentation available at https://www.postgresql.org/docs/devel/static/sql-createtable.html, does not make it clear that the lower bound of a range partition is always inclusive and the higher one is exclusive. I think a note in section " PARTITION OF parent_table FOR VALUES partition_bound_spec" would be helpful. On Wed, Jan 25, 2017 at 7:11 AM, Amit Langote wrote: > On 2017/01/25 5:55, Robert Haas wrote: >> On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote >> wrote: >>> [ new patches ] >> >> Committed 0001 and 0002. See my earlier email for comments on 0003. > > It seems patches for all the issues mentioned in this thread so far, > except 0003 (I just sent an updated version in another email), have been > committed. Thanks a lot for your time. I will create new threads for any > more patches from here on. > > Regards, > Amit > > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/25 5:55, Robert Haas wrote: > On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote > wrote: >> [ new patches ] > > Committed 0001 and 0002. See my earlier email for comments on 0003. It seems patches for all the issues mentioned in this thread so far, except 0003 (I just sent an updated version in another email), have been committed. Thanks a lot for your time. I will create new threads for any more patches from here on. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/25 2:56, Robert Haas wrote: > On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote > wrote: >>> But I wonder why we don't instead just change this function to >>> consider tdhasoid rather than tdtypeid. I mean, if the only point of >>> comparing the type OIDs is to find out whether the table-has-OIDs >>> setting matches, we could instead test that directly and avoid needing >>> to pass an extra argument. I wonder if there's some other reason this >>> code is there which is not documented in the comment... >> >> With the following patch, regression tests run fine: >> >> if (indesc->natts == outdesc->natts && >> - indesc->tdtypeid == outdesc->tdtypeid) >> + indesc->tdhasoid != outdesc->tdhasoid) >> { >> >> If checking tdtypeid (instead of tdhasoid directly) has some other >> consideration, I'd would have seen at least some tests broken by this >> change. So, if we are to go with this, I too prefer it over my previous >> proposal to add an argument to convert_tuples_by_name(). Attached 0003 >> implements the above approach. > > I think this is not quite right. First, the patch compares the > tdhasoid status with != rather than ==, which would have the effect of > saying that we can skip conversion of the has-OID statuses do NOT > match. That can't be right. You're right. > Second, I believe that the comments > imply that conversion should be done if *either* tuple has OIDs. I > believe that's because whoever wrote this comment thought that we > needed to replace the OID if the tuple already had one, which is what > do_convert_tuple would do. I'm not sure whether that's really > necessary, but we're less likely to break anything if we preserve the > existing behavior, and I don't think we lose much from doing so > because few user tables will have OIDs. So I would change this test > to if (indesc->natts == outdesc->natts && !indesc->tdhasoid && > !outdesc->tdhasoid), and I'd revise the one in > convert_tuples_by_position() to match. Then I think it's much clearer > that we're just optimizing what's there already, not changing the > behavior. Agreed. Updated patch attached. Thanks, Amit >From c1fa4b9f04f328b8df54ef487ee9d46f5978e0de Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 26 Dec 2016 17:44:14 +0900 Subject: [PATCH] Avoid tuple coversion in common partitioning cases Currently, the tuple conversion is performed after a tuple is routed, even if the attributes of a target leaf partition map one-to-one with those of the root table, which is wasteful. Avoid that by making convert_tuples_by_name() return a NULL map for such cases. Reported by: n/a Patch by: Amit Langote Reports: n/a --- src/backend/access/common/tupconvert.c | 18 ++ src/backend/catalog/partition.c| 3 +-- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c index b17ceafa6e..a4012525d8 100644 --- a/src/backend/access/common/tupconvert.c +++ b/src/backend/access/common/tupconvert.c @@ -138,12 +138,13 @@ convert_tuples_by_position(TupleDesc indesc, nincols, noutcols))); /* - * Check to see if the map is one-to-one and the tuple types are the same. - * (We check the latter because if they're not, we want to do conversion - * to inject the right OID into the tuple datum.) + * Check to see if the map is one-to-one, in which case we need not do + * the tuple conversion. That's not enough though if either source or + * destination (tuples) contains OIDs; we'd need conversion in that case + * to inject the right OID into the tuple datum. 
*/ if (indesc->natts == outdesc->natts && - indesc->tdtypeid == outdesc->tdtypeid) + !indesc->tdhasoid && !outdesc->tdhasoid) { for (i = 0; i < n; i++) { @@ -214,12 +215,13 @@ convert_tuples_by_name(TupleDesc indesc, attrMap = convert_tuples_by_name_map(indesc, outdesc, msg); /* - * Check to see if the map is one-to-one and the tuple types are the same. - * (We check the latter because if they're not, we want to do conversion - * to inject the right OID into the tuple datum.) + * Check to see if the map is one-to-one, in which case we need not do + * the tuple conversion. That's not enough though if either source or + * destination (tuples) contains OIDs; we'd need conversion in that case + * to inject the right OID into the tuple datum. */ if (indesc->natts == outdesc->natts && - indesc->tdtypeid == outdesc->tdtypeid) + !indesc->tdhasoid && !outdesc->tdhasoid) { same = true; for (i = 0; i < n; i++) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 7be60529c5..4bcef58763 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1700,12 +1700,11 @@ get_partition_for_tuple(PartitionDispatch *pd, return -1; } - if (myslot != NULL) + if (myslot != NULL && map != NULL) { HeapTuple tuple = ExecFetchSlotTuple(slot); Ex
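For reference, a sketch of the two cases this optimization distinguishes (illustrative tables without OIDs; the attribute map itself is internal): a partition created with the same attribute layout as the root needs no conversion after routing, whereas an attached table whose columns are in a different physical order still does.

CREATE TABLE root (a int, b text) PARTITION BY LIST (a);
-- Same attribute layout as the parent: the name-based map is one-to-one,
-- so convert_tuples_by_name() can return NULL and routing skips conversion.
CREATE TABLE part1 PARTITION OF root FOR VALUES IN (1);
-- Columns declared in a different order and then attached: the attributes
-- no longer line up by position, so a conversion map is still needed.
CREATE TABLE part2 (b text, a int);
ALTER TABLE root ATTACH PARTITION part2 FOR VALUES IN (2);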
Re: [HACKERS] Declarative partitioning - another take
Hi Keith, On 2017/01/20 12:40, Keith Fiske wrote: > So testing things out in pg_partman for native sub-partitioning and ran > into what is a bug for me that I know I have to fix, but I'm curious if > this can be prevented in the first place within the native partitioning > code itself. The below shows a sub-partitioning set where the sub-partition > has a constraint range that is outside of the range of its parent. If the > columns were different I could see where this would be allowed, but the > same column is used throughout the levels of sub-partitioning. > Understandable if that may be too complex to check for, but figured I'd > bring it up as something I accidentally ran into in case you see an easy > way to prevent it. This was discussed. See Robert's response (2nd part of): https://www.postgresql.org/message-id/CA%2BTgmoaQABrsLQK4ms_4NiyavyJGS-b6ZFkZBBNC%2B-P5DjJNFA%40mail.gmail.com In short, if partitions at different levels are defined such that some of the data the user intended to insert into the table (the part of the sub-partition's range that doesn't overlap with its parent's) can never actually be inserted, that's an operator error. It's like adding contradictory check constraints to the table: create table foo (a int check (a > 0 and a < 0)); insert into foo values (1); ERROR: new row for relation "foo" violates check constraint "foo_a_check" DETAIL: Failing row contains (1). One (perhaps the only) thing that could be done is to warn users to prevent this kind of mistake through documentation. Trying to do anything else in the core partitioning code would make it too complicated for not much benefit (also see Robert's last line, :)). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
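A stripped-down sketch of the scenario (made-up names; behavior as described above): the DDL is accepted, but rows falling in the part of the sub-partition's range that lies outside its parent's range can never reach it through the root table.

CREATE TABLE p (c timestamptz NOT NULL) PARTITION BY RANGE (c);
CREATE TABLE p_2015 PARTITION OF p
  FOR VALUES FROM ('2015-01-01') TO ('2016-01-01')
  PARTITION BY RANGE (c);
-- Accepted, even though the range lies entirely outside the parent's range:
CREATE TABLE p_2015_2017_03 PARTITION OF p_2015
  FOR VALUES FROM ('2017-03-01') TO ('2017-04-01');
-- Fails: no top-level partition of p accepts this value, so it can never
-- be routed to p_2015_2017_03 through p.
INSERT INTO p VALUES ('2017-03-15');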
Re: [HACKERS] Declarative partitioning - another take
On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote wrote: > [ new patches ] Committed 0001 and 0002. See my earlier email for comments on 0003. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote wrote: >> But I wonder why we don't instead just change this function to >> consider tdhasoid rather than tdtypeid. I mean, if the only point of >> comparing the type OIDs is to find out whether the table-has-OIDs >> setting matches, we could instead test that directly and avoid needing >> to pass an extra argument. I wonder if there's some other reason this >> code is there which is not documented in the comment... > > With the following patch, regression tests run fine: > > if (indesc->natts == outdesc->natts && > - indesc->tdtypeid == outdesc->tdtypeid) > + indesc->tdhasoid != outdesc->tdhasoid) > { > > If checking tdtypeid (instead of tdhasoid directly) has some other > consideration, I'd would have seen at least some tests broken by this > change. So, if we are to go with this, I too prefer it over my previous > proposal to add an argument to convert_tuples_by_name(). Attached 0003 > implements the above approach. I think this is not quite right. First, the patch compares the tdhasoid status with != rather than ==, which would have the effect of saying that we can skip conversion of the has-OID statuses do NOT match. That can't be right. Second, I believe that the comments imply that conversion should be done if *either* tuple has OIDs. I believe that's because whoever wrote this comment thought that we needed to replace the OID if the tuple already had one, which is what do_convert_tuple would do. I'm not sure whether that's really necessary, but we're less likely to break anything if we preserve the existing behavior, and I don't think we lose much from doing so because few user tables will have OIDs. So I would change this test to if (indesc->natts == outdesc->natts && !indesc->tdhasoid && !outdesc->tdhasoid), and I'd revise the one in convert_tuples_by_position() to match. Then I think it's much clearer that we're just optimizing what's there already, not changing the behavior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. BulkInsertState
On Mon, Jan 23, 2017 at 5:25 AM, Amit Langote wrote: > I tried implementing the second idea in the attached patch. It fixes the > bug (multiple reports as mentioned in the commit message) that tuples may > be inserted into the wrong partition. Looks good to me, thanks. Committed with a few tweaks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. BulkInsertState
On 2017/01/19 5:25, Robert Haas wrote: > On Wed, Jan 11, 2017 at 10:53 PM, Amit Langote wrote: >> On 2017/01/06 20:23, Amit Langote wrote: >>> >>> If a single BulkInsertState object is passed to >>> heap_insert()/heap_multi_insert() for different heaps corresponding to >>> different partitions (from one input tuple to next), tuples might end up >>> going into wrong heaps (like demonstrated in one of the reports [1]). A >>> simple solution is to disable bulk-insert in case of partitioned tables. >>> >>> But my patch (or its motivations) was slightly wrongheaded, wherein I >>> conflated multi-insert stuff and bulk-insert considerations. I revised >>> 0002 to not do that. >> >> Ragnar Ouchterlony pointed out [1] on pgsql-bugs that 0002 wasn't correct. >> Attaching updated 0002 along with rebased 0001 and 0003. > > The BulkInsertState is not there only to improve performance. It's > also there to make sure we use a BufferAccessStrategy, so that we > don't trash the whole buffer arena. See commit > 85e2cedf985bfecaf43a18ca17433070f439fb0e. If a partitioned table uses > a separate BulkInsertState for each partition, I believe it will also > end up using a separate ring of buffers for every partition. That may > well be faster than copying into an unpartitioned table in some cases, > because dirtying everything in the buffer arena without actually > writing any of those buffers is a lot faster than actually doing the > writes. But it is also anti-social behavior; we have > BufferAccessStrategy objects for a reason. Thanks for the explanation. I agree that creating thousands of BufferAccessStrategy objects would be a bad idea. > One idea would be to have each partition use a separate > BulkInsertState but have them point to the same underlying > BufferAccessStrategy, but even that's problematic, because it could > result in us holding a gigantic number of pins (one per partition). I > think maybe a better idea would be to add an additional function > ReleaseBulkInsertStatePin() which gets called whenever we switch > relations, and then just use the same BulkInsertState throughout. I tried implementing the second idea in the attached patch. It fixes the bug (multiple reports as mentioned in the commit message) that tuples may be inserted into the wrong partition. Thanks, Amit >From 2e31e97389b0911f749fcf59c0a24c0208a5e2b3 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 23 Jan 2017 17:00:19 +0900 Subject: [PATCH] Make partitioned tables play nicely with bulk-insert MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When COPY'ing (or bulk-inserting) into a partitioned table, the target heap may change from one tuple to next. We better ask ReadBufferBI() to get a new buffer every time such change occurs. We do that by having CopyFrom() call the new ReleaseBulkInsertStatePin() which unpins the BulkInsertState.cur_buffer and sets it to InvalidBuffer, which is enough a signal for ReadBufferBI() to do the right thing. This also fixes the bug that tuples ended up being inserted into the wrong partition, which occurred exactly because the wrong buffer was used. 
Reported by: 高增琦, Venkata B Nagothi, Ragnar Ouchterlony Patch by: Amit Langote (idea by Robert Haas) Reports: https://www.postgresql.org/message-id/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com https://www.postgresql.org/message-id/CAEyp7J9WiX0L3DoiNcRrY-9iyw%3DqP%2Bj%3DDLsAnNFF1xT2J1ggfQ%40mail.gmail.com https://www.postgresql.org/message-id/16d73804-c9cd-14c5-463e-5caad563ff77%40agama.tv --- src/backend/access/heap/heapam.c | 11 +++ src/backend/commands/copy.c | 12 src/include/access/heapam.h | 1 + 3 files changed, 24 insertions(+) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 1ce42ea970..5fd7f1e1a2 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2324,6 +2324,17 @@ FreeBulkInsertState(BulkInsertState bistate) pfree(bistate); } +/* + * ReleaseBulkInsertStatePin - release a buffer currently held in bistate + */ +void +ReleaseBulkInsertStatePin(BulkInsertState bistate) +{ + if (bistate->current_buf != InvalidBuffer) + ReleaseBuffer(bistate->current_buf); + bistate->current_buf = InvalidBuffer; +} + /* * heap_insert - insert tuple into a heap diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index c05e14e26f..86bd53c179 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2307,6 +2307,7 @@ CopyFrom(CopyState cstate) uint64 processed = 0; bool useHeapMultiInsert; int nBufferedTuples = 0; + int prev_leaf_part_index = -1; #define MAX_BUFFERED_TUPLES 1000 HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */ @@ -2562,6 +2563,17 @@ CopyFrom(CopyState cstate) leaf_part_index < cstate->num_partitions); /* + * If the current tuple is mapped to a
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/21 6:29, Robert Haas wrote: > On Fri, Jan 20, 2017 at 1:15 AM, Andres Freund wrote: >> On 2017-01-19 14:18:23 -0500, Robert Haas wrote: >>> Committed. >> >> One of the patches around this topic committed recently seems to cause >> valgrind failures like >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02 >> : > > Tom Lane independently reported this same issue. I believe I've fixed > it. Sorry for not noticing and crediting this report also. Thanks Robert for looking at this and committing the fix. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Fri, Jan 20, 2017 at 1:15 AM, Andres Freund wrote: > On 2017-01-19 14:18:23 -0500, Robert Haas wrote: >> Committed. > > One of the patches around this topic committed recently seems to cause > valgrind failures like > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02 > : Tom Lane independently reported this same issue. I believe I've fixed it. Sorry for not noticing and crediting this report also. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Hi Andres, On 2017/01/20 15:15, Andres Freund wrote: > On 2017-01-19 14:18:23 -0500, Robert Haas wrote: >> Committed. > > One of the patches around this topic committed recently seems to cause > valgrind failures like > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02 > : > ==24969== Conditional jump or move depends on uninitialised value(s) > ==24969==at 0x4C38DA: btint4cmp (nbtcompare.c:97) > ==24969==by 0x83860C: FunctionCall2Coll (fmgr.c:1318) > ==24969==by 0x536643: partition_bounds_equal (partition.c:627) > ==24969==by 0x820864: equalPartitionDescs (relcache.c:1203) > ==24969==by 0x82423A: RelationClearRelation (relcache.c:2553) > ==24969==by 0x8248BA: RelationFlushRelation (relcache.c:2662) > ==24969==by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714) > ==24969==by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568) > ==24969==by 0x81CB0D: ProcessInvalidationMessages (inval.c:444) > ==24969==by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056) > ==24969==by 0x4F6735: AtCCI_LocalCache (xact.c:1374) > ==24969==by 0x4F8249: CommandCounterIncrement (xact.c:957) > ==24969== Uninitialised value was created by a heap allocation > ==24969==at 0x85AA83: palloc (mcxt.c:914) > ==24969==by 0x53648E: RelationBuildPartitionDesc (partition.c:528) > ==24969==by 0x823F93: RelationBuildDesc (relcache.c:1348) > ==24969==by 0x8241DB: RelationClearRelation (relcache.c:2524) > ==24969==by 0x8248BA: RelationFlushRelation (relcache.c:2662) > ==24969==by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714) > ==24969==by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568) > ==24969==by 0x81CB0D: ProcessInvalidationMessages (inval.c:444) > ==24969==by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056) > ==24969==by 0x4F6735: AtCCI_LocalCache (xact.c:1374) > ==24969==by 0x4F8249: CommandCounterIncrement (xact.c:957) > ==24969==by 0x82538B: RelationSetNewRelfilenode (relcache.c:3490) > ==24969== > ==24969== VALGRINDERROR-END Thanks for the report. This being my first time reading a valgrind report on buildfarm, is it correct to assume that the command immediately preceding VALGRINDERROR-BEGIN is what triggered the failure? ... LOG: statement: truncate list_parted; ==24969== VALGRINDERROR-BEGIN ==24969== Conditional jump or move depends on uninitialised value(s) ==24969==at 0x4C38DA: btint4cmp (nbtcompare.c:97) So in this case: truncate list_parted? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017-01-19 14:18:23 -0500, Robert Haas wrote: > Committed. One of the patches around this topic committed recently seems to cause valgrind failures like https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02 : ==24969== Conditional jump or move depends on uninitialised value(s) ==24969==at 0x4C38DA: btint4cmp (nbtcompare.c:97) ==24969==by 0x83860C: FunctionCall2Coll (fmgr.c:1318) ==24969==by 0x536643: partition_bounds_equal (partition.c:627) ==24969==by 0x820864: equalPartitionDescs (relcache.c:1203) ==24969==by 0x82423A: RelationClearRelation (relcache.c:2553) ==24969==by 0x8248BA: RelationFlushRelation (relcache.c:2662) ==24969==by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714) ==24969==by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568) ==24969==by 0x81CB0D: ProcessInvalidationMessages (inval.c:444) ==24969==by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056) ==24969==by 0x4F6735: AtCCI_LocalCache (xact.c:1374) ==24969==by 0x4F8249: CommandCounterIncrement (xact.c:957) ==24969== Uninitialised value was created by a heap allocation ==24969==at 0x85AA83: palloc (mcxt.c:914) ==24969==by 0x53648E: RelationBuildPartitionDesc (partition.c:528) ==24969==by 0x823F93: RelationBuildDesc (relcache.c:1348) ==24969==by 0x8241DB: RelationClearRelation (relcache.c:2524) ==24969==by 0x8248BA: RelationFlushRelation (relcache.c:2662) ==24969==by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714) ==24969==by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568) ==24969==by 0x81CB0D: ProcessInvalidationMessages (inval.c:444) ==24969==by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056) ==24969==by 0x4F6735: AtCCI_LocalCache (xact.c:1374) ==24969==by 0x4F8249: CommandCounterIncrement (xact.c:957) ==24969==by 0x82538B: RelationSetNewRelfilenode (relcache.c:3490) ==24969== ==24969== VALGRINDERROR-END Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
So testing things out in pg_partman for native sub-partitioning and ran into what is a bug for me that I know I have to fix, but I'm curious if this can be prevented in the first place within the native partitioning code itself. The below shows a sub-partitioning set where the sub-partition has a constraint range that is outside of the range of its parent. If the columns were different I could see where this would be allowed, but the same column is used throughout the levels of sub-partitioning. Understandable if that may be too complex to check for, but figured I'd bring it up as something I accidentally ran into in case you see an easy way to prevent it. This was encountered as of commit ba61a04bc7fefeee03416d9911eb825c4897c223. keith@keith=# \d+ partman_test.time_taptest_table Table "partman_test.time_taptest_table" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +--+---+--+-+--+--+- col1 | integer | | | | plain| | col2 | text | | | | extended | | col3 | timestamp with time zone | | not null | now() | plain| | Partition key: RANGE (col3) Partitions: partman_test.time_taptest_table_p2015 FOR VALUES FROM ('2015-01-01 00:00:00-05') TO ('2016-01-01 00:00:00-05'), partman_test.time_taptest_table_p2016 FOR VALUES FROM ('2016-01-01 00:00:00-05') TO ('2017-01-01 00:00:00-05'), partman_test.time_taptest_table_p2017 FOR VALUES FROM ('2017-01-01 00:00:00-05') TO ('2018-01-01 00:00:00-05'), partman_test.time_taptest_table_p2018 FOR VALUES FROM ('2018-01-01 00:00:00-05') TO ('2019-01-01 00:00:00-05'), partman_test.time_taptest_table_p2019 FOR VALUES FROM ('2019-01-01 00:00:00-05') TO ('2020-01-01 00:00:00-05') keith@keith=# \d+ partman_test.time_taptest_table_p2015 Table "partman_test.time_taptest_table_p2015" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +--+---+--+-+--+--+- col1 | integer | | | | plain| | col2 | text | | | | extended | | col3 | timestamp with time zone | | not null | now() | plain| | Partition of: partman_test.time_taptest_table FOR VALUES FROM ('2015-01-01 00:00:00-05') TO ('2016-01-01 00:00:00-05') Partition key: RANGE (col3) Partitions: partman_test.time_taptest_table_p2015_p2016_11 FOR VALUES FROM ('2016-11-01 00:00:00-04') TO ('2016-12-01 00:00:00-05'), partman_test.time_taptest_table_p2015_p2016_12 FOR VALUES FROM ('2016-12-01 00:00:00-05') TO ('2017-01-01 00:00:00-05'), partman_test.time_taptest_table_p2015_p2017_01 FOR VALUES FROM ('2017-01-01 00:00:00-05') TO ('2017-02-01 00:00:00-05'), partman_test.time_taptest_table_p2015_p2017_02 FOR VALUES FROM ('2017-02-01 00:00:00-05') TO ('2017-03-01 00:00:00-05'), partman_test.time_taptest_table_p2015_p2017_03 FOR VALUES FROM ('2017-03-01 00:00:00-05') TO ('2017-04-01 00:00:00-04') keith@keith=# \d+ partman_test.time_taptest_table_p2015_p2017_03 Table "partman_test.time_taptest_table_p2015_p2017_03" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +--+---+--+-+--+--+- col1 | integer | | | | plain| | col2 | text | | | | extended | | col3 | timestamp with time zone | | not null | now() | plain| | Partition of: partman_test.time_taptest_table_p2015 FOR VALUES FROM ('2017-03-01 00:00:00-05') TO ('2017-04-01 00:00:00-04') -- Keith Fiske Database Administrator OmniTI Computer Consulting, Inc. http://www.keithf4.com
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/20 4:18, Robert Haas wrote: > On Thu, Jan 19, 2017 at 12:15 AM, Amit Langote wrote: >> 0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch >> >> In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that >> it's possible for a different TupleTableSlot to be used for partitioned >> tables at successively lower levels. If we do end up changing the slot >> from the original, we must update ecxt_scantuple to point to the new one >> for partition key of the tuple to be computed correctly. >> >> Last posted here: >> https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp > > Why does FormPartitionKeyDatum need this? Could that be documented in > a comment in here someplace, perhaps the header comment to > FormPartitionKeyDatum? FormPartitionKeyDatum() computes partition key expressions (if any) and the expression evaluation machinery expects ecxt_scantuple to point to the tuple to extract attribute values from. FormPartitionKeyDatum() already has a tiny comment, which it seems is the only thing we could say here about this there: * the ecxt_scantuple slot of estate's per-tuple expr context must point to * the heap tuple passed in. In get_partition_for_tuple() which calls FormPartitionKeyDatum(), the patch adds this comment (changed it a little from the last version): + /* + * Extract partition key from tuple. Expression evaluation machinery + * that FormPartitionKeyDatum() invokes expects ecxt_scantuple to + * point to the correct tuple slot. The slot might have changed from + * what was used for the parent table if the table of the current + * partitioning level has different tuple descriptor from the parent. + * So update ecxt_scantuple accordingly. + */ + ecxt->ecxt_scantuple = slot; FormPartitionKeyDatum(parent, slot, estate, values, isnull); It says why we need to change which slot ecxt_scantuple points to. >> 0004-Fix-some-issues-with-views-and-partitioned-tables.patch >> >> Automatically updatable views failed to handle partitioned tables. >> Once that's fixed, WITH CHECK OPTIONS wouldn't work correctly without >> the WCO expressions having been suitably converted for each partition >> (think applying map_partition_varattnos to Vars in the WCO expressions >> just like with partition constraint expressions). > > The changes to execMain.c contain a hunk which has essentially the > same code twice. That looks like a mistake. Also, the patch doesn't > compile because convert_tuples_by_name() takes 3 arguments, not 4. Actually, I realized that and sent the updated version [1] of this patch that fixed this issue. In the updated version, I removed that code block (the 2 copies of it), because we are still discussing what to do about showing tuples in constraint violation (in this case, WITH CHECK OPTION violation) messages. Anyway, attached here again. >> 0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch >> >> Currently, the tuple conversion is performed after a tuple is routed, >> even if the attributes of a target leaf partition map one-to-one with >> those of the root table, which is wasteful. Avoid that by making >> convert_tuples_by_name() return a NULL map for such cases. > > +Assert(!consider_typeid && indesc->tdhasoid == outdesc->tdhasoid); > > I think you mean Assert(consider_typeid || indesc->tdhasoid == > outdesc->tdhasoid); Ah, you're right. > But I wonder why we don't instead just change this function to > consider tdhasoid rather than tdtypeid. 
I mean, if the only point of > comparing the type OIDs is to find out whether the table-has-OIDs > setting matches, we could instead test that directly and avoid needing > to pass an extra argument. I wonder if there's some other reason this > code is there which is not documented in the comment... With the following patch, regression tests run fine: if (indesc->natts == outdesc->natts && - indesc->tdtypeid == outdesc->tdtypeid) + indesc->tdhasoid != outdesc->tdhasoid) { If checking tdtypeid (instead of tdhasoid directly) has some other consideration, I'd would have seen at least some tests broken by this change. So, if we are to go with this, I too prefer it over my previous proposal to add an argument to convert_tuples_by_name(). Attached 0003 implements the above approach. > Phew. Thanks for all the patches, sorry I'm having trouble keeping up. Thanks a lot for taking time to look through them and committing. Thanks, Amit [1] https://www.postgresql.org/message-id/92fe2a71-5eb7-ee8d-53ef-cfd5a65dfc3d%40lab.ntt.co.jp >From c7088290221a9fe0818139145b7bdf6731bc419a Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 28 Dec 2016 10:10:26 +0900 Subject: [PATCH 1/3] Set ecxt_scantuple correctly for tuple-routing In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that it's possible for a different TupleTableSlot to be used for partitioned tables at successively lower levels. If we do end up changing the slot fro
Re: [HACKERS] Declarative partitioning - another take
On Thu, Jan 19, 2017 at 12:15 AM, Amit Langote wrote: > 0001-Fix-a-bug-of-insertion-into-an-internal-partition.patch > > Since implicit partition constraints are not inherited, an internal > partition's constraint was not being enforced when targeted directly. > So, include such constraint when setting up leaf partition result > relations for tuple-routing. Committed. > 0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch > > In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that > it's possible for a different TupleTableSlot to be used for partitioned > tables at successively lower levels. If we do end up changing the slot > from the original, we must update ecxt_scantuple to point to the new one > for partition key of the tuple to be computed correctly. > > Last posted here: > https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp Why does FormPartitionKeyDatum need this? Could that be documented in a comment in here someplace, perhaps the header comment to FormPartitionKeyDatum? > 0003-Fix-RETURNING-to-work-correctly-after-tuple-routing.patch > > In ExecInsert(), do not switch back to the root partitioned table > ResultRelInfo until after we finish ExecProcessReturning(), so that > RETURNING projection is done using the partition's descriptor. For > the projection to work correctly, we must initialize the same for > each leaf partition during ModifyTableState initialization. Committed. > 0004-Fix-some-issues-with-views-and-partitioned-tables.patch > > Automatically updatable views failed to handle partitioned tables. > Once that's fixed, WITH CHECK OPTIONS wouldn't work correctly without > the WCO expressions having been suitably converted for each partition > (think applying map_partition_varattnos to Vars in the WCO expressions > just like with partition constraint expressions). The changes to execMain.c contain a hunk which has essentially the same code twice. That looks like a mistake. Also, the patch doesn't compile because convert_tuples_by_name() takes 3 arguments, not 4. > 0005-Fix-some-wrong-thinking-in-check_new_partition_bound.patch > > Because a given range bound in the PartitionBoundInfo.datums array > is sometimes a range lower bound and upper bound at other times, we > must be careful when assuming which, especially when interpreting > the result of partition_bound_bsearch which returns the index of the > greatest bound that is less than or equal to probe. Due to an error > in thinking about the same, the relevant code in > check_new_partition_bound() caused invalid partition (index = -1) > to be chosen as the partition being overlapped. > > Last posted here: > https://www.postgresql.org/message-id/603acb8b-5dec-31e8-29b0-609a68aac50f%40lab.ntt.co.jp } +/* + * If equal has been set to true or if there is no "gap" + * between the bound at off1 and that at off1 + 1, the new + * partition will overlap some partition. In the former + * case, the new lower bound is found to be equal to the + * bound at off1, which could only ever be true if the + * latter is the lower bound of some partition. It's + * clear in such a case that the new partition overlaps + * that partition, whose index we get using its upper + * bound (that is, using the bound at off1 + 1). + */ else Stylistically, we usually avoid this, or at least I do. The comment should go inside the "else" block. But it looks OK apart from that, so committed with a little rephrasing and reformatting of the comment. 
> 0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch > > Currently, the tuple conversion is performed after a tuple is routed, > even if the attributes of a target leaf partition map one-to-one with > those of the root table, which is wasteful. Avoid that by making > convert_tuples_by_name() return a NULL map for such cases. +Assert(!consider_typeid && indesc->tdhasoid == outdesc->tdhasoid); I think you mean Assert(consider_typeid || indesc->tdhasoid == outdesc->tdhasoid); But I wonder why we don't instead just change this function to consider tdhasoid rather than tdtypeid. I mean, if the only point of comparing the type OIDs is to find out whether the table-has-OIDs setting matches, we could instead test that directly and avoid needing to pass an extra argument. I wonder if there's some other reason this code is there which is not documented in the comment... > 0007-Avoid-code-duplication-in-map_partition_varattnos.patch > > Code to map attribute numbers in map_partition_varattnos() duplicates > what convert_tuples_by_name_map() does. Avoid that as pointed out by > Álvaro Herrera. > > Last posted here: > https://www.postgresql.org/message-id/9ce97382-54c8-deb3-9ee9-a
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/19 14:15, Amit Langote wrote: > So, here are all the patches I posted to date (and one new at the bottom) > for reported and unreported bugs, excluding the one involving > BulkInsertState for which you replied in a new thread. > > I'll describe the attached patches in brief: Sorry, I forgot to mention that I have skipped the patch I proposed to modify the committed approach [1] to get the correct tuple to show in the constraint violation messages. It might be better to continue that discussion at [2]. And because I skipped that patch, I should have removed the related logic in ExecWithCheckOptions() that I added in one of the later patches (patch 0004), which I forgot to do. So, here are all the patches again with the correct 0004 this time. Sigh. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1b4c771ea74f42447dccaed42ffcdcccf3aa694 [2] https://www.postgresql.org/message-id/CA%2BTgmoZjGzSM5WwnyapFaw3GxnDLWh7pm8Xiz8_QWQnUQy%3DSCA%40mail.gmail.com >From 37c861d5eae1c8f11b3fae93cb209da262a13fbc Mon Sep 17 00:00:00 2001 From: amit Date: Tue, 13 Dec 2016 15:07:41 +0900 Subject: [PATCH 1/8] Fix a bug of insertion into an internal partition. Since implicit partition constraints are not inherited, an internal partition's constraint was not being enforced when targeted directly. So, include such constraint when setting up leaf partition result relations for tuple-routing. Reported by: n/a Patch by: Amit Langote Reports: n/a --- src/backend/commands/copy.c | 1 - src/backend/commands/tablecmds.c | 1 - src/backend/executor/execMain.c | 42 src/include/executor/executor.h | 1 - src/test/regress/expected/insert.out | 6 ++ src/test/regress/sql/insert.sql | 5 + 6 files changed, 44 insertions(+), 12 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 1fd2162794..75386212e0 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2432,7 +2432,6 @@ CopyFrom(CopyState cstate) InitResultRelInfo(resultRelInfo, cstate->rel, 1, /* dummy rangetable index */ - true, /* do load partition check expression */ NULL, 0); diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index e633a50dd2..06e43cbb3a 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -1324,7 +1324,6 @@ ExecuteTruncate(TruncateStmt *stmt) InitResultRelInfo(resultRelInfo, rel, 0, /* dummy rangetable index */ - false, NULL, 0); resultRelInfo++; diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index ff277d300a..5457f8fbde 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -824,10 +824,10 @@ InitPlan(QueryDesc *queryDesc, int eflags) resultRelationOid = getrelid(resultRelationIndex, rangeTable); resultRelation = heap_open(resultRelationOid, RowExclusiveLock); + InitResultRelInfo(resultRelInfo, resultRelation, resultRelationIndex, - true, NULL, estate->es_instrument); resultRelInfo++; @@ -1218,10 +1218,11 @@ void InitResultRelInfo(ResultRelInfo *resultRelInfo, Relation resultRelationDesc, Index resultRelationIndex, - bool load_partition_check, Relation partition_root, int instrument_options) { + List *partition_check = NIL; + MemSet(resultRelInfo, 0, sizeof(ResultRelInfo)); resultRelInfo->type = T_ResultRelInfo; resultRelInfo->ri_RangeTableIndex = resultRelationIndex; @@ -1257,13 +1258,38 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, resultRelInfo->ri_ConstraintExprs = NULL; resultRelInfo->ri_junkFilter = NULL; 
resultRelInfo->ri_projectReturning = NULL; - if (load_partition_check) - resultRelInfo->ri_PartitionCheck = - RelationGetPartitionQual(resultRelationDesc); + /* - * The following gets set to NULL unless we are initializing leaf - * partitions for tuple-routing. + * If partition_root has been specified, that means we are builiding the + * ResultRelationInfo for one of its leaf partitions. In that case, we + * need *not* initialize the leaf partition's constraint, but rather the + * the partition_root's (if any). We must do that explicitly like this, + * because implicit partition constraints are not inherited like user- + * defined constraints and would fail to be enforced by ExecConstraints() + * after a tuple is routed to a leaf partition. */ + if (partition_root) + { + /* + * Root table itself may or may not be a partition; partition_check + * would be NIL in the latter case. + */ + partition_check = RelationGetPartitionQual(partition_root); + + /* + * This is not our own partition constraint, but rather an ancestor's. + * So any Vars in it bear the ancestor's attribute numbers. We must + * sw
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/19 5:29, Robert Haas wrote:
> On Wed, Jan 18, 2017 at 3:12 PM, Robert Haas wrote:
>> On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote wrote:
>>> [ updated patches ]
>>
>> I committed 0004 and also fixed the related regression test not to
>> rely on DROP .. CASCADE, which isn't always stable.

Thanks!

>> The remainder of this patch set needs a rebase, and perhaps you could
>> also fold in other pending partitioning fixes so I have everything to
>> look at it in one place.
>
> Just to be a little more clear, I don't mind multiple threads each
> with a patch or patch set so much, but multiple patch sets on the same
> thread gets hard for me to track. Sorry for the inconvenience.

OK, I agree that having multiple patch sets on the same thread is
cumbersome. So, here are all the patches I posted to date (and one new at
the bottom) for reported and unreported bugs, excluding the one involving
BulkInsertState for which you replied in a new thread. I'll describe the
attached patches in brief:

0001-Fix-a-bug-of-insertion-into-an-internal-partition.patch

Since implicit partition constraints are not inherited, an internal
partition's constraint was not being enforced when targeted directly.
So, include such constraint when setting up leaf partition result
relations for tuple-routing.

0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch

In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that
it's possible for a different TupleTableSlot to be used for partitioned
tables at successively lower levels. If we do end up changing the slot
from the original, we must update ecxt_scantuple to point to the new one
for partition key of the tuple to be computed correctly.

Last posted here:
https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp

0003-Fix-RETURNING-to-work-correctly-after-tuple-routing.patch

In ExecInsert(), do not switch back to the root partitioned table
ResultRelInfo until after we finish ExecProcessReturning(), so that
RETURNING projection is done using the partition's descriptor. For the
projection to work correctly, we must initialize the same for each leaf
partition during ModifyTableState initialization.

0004-Fix-some-issues-with-views-and-partitioned-tables.patch

Automatically updatable views failed to handle partitioned tables. Once
that's fixed, WITH CHECK OPTIONS wouldn't work correctly without the WCO
expressions having been suitably converted for each partition (think
applying map_partition_varattnos to Vars in the WCO expressions just
like with partition constraint expressions).

0005-Fix-some-wrong-thinking-in-check_new_partition_bound.patch

Because a given range bound in the PartitionBoundInfo.datums array is
sometimes a range lower bound and upper bound at other times, we must be
careful when assuming which, especially when interpreting the result of
partition_bound_bsearch which returns the index of the greatest bound
that is less than or equal to probe. Due to an error in thinking about
the same, the relevant code in check_new_partition_bound() caused
invalid partition (index = -1) to be chosen as the partition being
overlapped.

Last posted here:
https://www.postgresql.org/message-id/603acb8b-5dec-31e8-29b0-609a68aac50f%40lab.ntt.co.jp

0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch

Currently, the tuple conversion is performed after a tuple is routed,
even if the attributes of a target leaf partition map one-to-one with
those of the root table, which is wasteful. Avoid that by making
convert_tuples_by_name() return a NULL map for such cases (a rough
sketch of the identity-map check appears at the end of this message).

0007-Avoid-code-duplication-in-map_partition_varattnos.patch

Code to map attribute numbers in map_partition_varattnos() duplicates
what convert_tuples_by_name_map() does. Avoid that as pointed out by
Álvaro Herrera.

Last posted here:
https://www.postgresql.org/message-id/9ce97382-54c8-deb3-9ee9-a2ec271d866b%40lab.ntt.co.jp

0008-Avoid-DROP-TABLE-.-CASCADE-in-more-partitioning-test.patch

This is the new one. There were quite a few commits recently to fix the
breakage in regression tests due to not using ORDER BY in queries on
system catalogs and using DROP TABLE ... CASCADE. There were still some
instances of the latter in create_table.sql and alter_table.sql.

Thanks,
Amit

>From 37c861d5eae1c8f11b3fae93cb209da262a13fbc Mon Sep 17 00:00:00 2001
From: amit
Date: Tue, 13 Dec 2016 15:07:41 +0900
Subject: [PATCH 1/8] Fix a bug of insertion into an internal partition.

Since implicit partition constraints are not inherited, an internal
partition's constraint was not being enforced when targeted directly.
So, include such constraint when setting up leaf partition result
relations for tuple-routing.

Reported by: n/a
Patch by: Amit Langote
Reports: n/a
---
 src/backend/commands/copy.c | 1 -
 src/backend/commands/tablecmds.c | 1 -
 src/backend/executor/execMain.c | 42
 src/include/executor/executor.h
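Returning to 0006: a rough sketch of the kind of identity-map check that
convert_tuples_by_name() could perform before deciding to build a
conversion map. This is simplified and only illustrative (the real test
would also have to compare attribute counts and account for dropped
columns); the helper name is made up for the example:

#include "postgres.h"
#include "access/attnum.h"

/*
 * Illustrative only: true if the attribute map produced by
 * convert_tuples_by_name_map() is the identity mapping, i.e. output
 * attribute i is sourced from input attribute i.  When that holds (and
 * the descriptors are otherwise compatible), no per-tuple conversion is
 * needed and a NULL map can be returned to the caller.
 */
static bool
attrmap_is_identity(const AttrNumber *attrMap, int nattrs)
{
    int         i;

    for (i = 0; i < nattrs; i++)
    {
        if (attrMap[i] != (AttrNumber) (i + 1))
            return false;
    }
    return true;
}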
Re: [HACKERS] Declarative partitioning - another take
On Wed, Jan 18, 2017 at 3:12 PM, Robert Haas wrote: > On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote > wrote: >> [ updated patches ] > > I committed 0004 and also fixed the related regression test not to > rely on DROP .. CASCADE, which isn't always stable. The remainder of > this patch set needs a rebase, and perhaps you could also fold in > other pending partitioning fixes so I have everything to look at it in > one place. Just to be a little more clear, I don't mind multiple threads each with a patch or patch set so much, but multiple patch sets on the same thread gets hard for me to track. Sorry for the inconvenience. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. BulkInsertState
On Wed, Jan 11, 2017 at 10:53 PM, Amit Langote wrote: > On 2017/01/06 20:23, Amit Langote wrote: >> On 2017/01/05 3:26, Robert Haas wrote: >>> It's unclear to me why we need to do 0002. It doesn't seem like it >>> should be necessary, it doesn't seem like a good idea, and the commit >>> message you proposed is uninformative. >> >> If a single BulkInsertState object is passed to >> heap_insert()/heap_multi_insert() for different heaps corresponding to >> different partitions (from one input tuple to next), tuples might end up >> going into wrong heaps (like demonstrated in one of the reports [1]). A >> simple solution is to disable bulk-insert in case of partitioned tables. >> >> But my patch (or its motivations) was slightly wrongheaded, wherein I >> conflated multi-insert stuff and bulk-insert considerations. I revised >> 0002 to not do that. > > Ragnar Ouchterlony pointed out [1] on pgsql-bugs that 0002 wasn't correct. > Attaching updated 0002 along with rebased 0001 and 0003. The BulkInsertState is not there only to improve performance. It's also there to make sure we use a BufferAccessStrategy, so that we don't trash the whole buffer arena. See commit 85e2cedf985bfecaf43a18ca17433070f439fb0e. If a partitioned table uses a separate BulkInsertState for each partition, I believe it will also end up using a separate ring of buffers for every partition. That may well be faster than copying into an unpartitioned table in some cases, because dirtying everything in the buffer arena without actually writing any of those buffers is a lot faster than actually doing the writes. But it is also anti-social behavior; we have BufferAccessStrategy objects for a reason. One idea would be to have each partition use a separate BulkInsertState but have them point to the same underlying BufferAccessStrategy, but even that's problematic, because it could result in us holding a gigantic number of pins (one per partition). I think maybe a better idea would be to add an additional function ReleaseBulkInsertStatePin() which gets called whenever we switch relations, and then just use the same BulkInsertState throughout. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
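For concreteness, a minimal sketch of what a ReleaseBulkInsertStatePin()
helper might look like, assuming BulkInsertState keeps its currently
pinned buffer in a current_buf field as it does today; this is only an
illustration of the idea, not a committed API:

#include "postgres.h"
#include "access/heapam.h"
#include "access/hio.h"
#include "storage/bufmgr.h"

/*
 * Drop the buffer pin held by a BulkInsertState, without freeing the
 * state or its BufferAccessStrategy, so the same ring of buffers keeps
 * being reused.  The caller would invoke this whenever tuple routing
 * switches to a different partition, ensuring we never hold more than
 * one pinned buffer at a time.
 */
void
ReleaseBulkInsertStatePin(BulkInsertState bistate)
{
    if (bistate->current_buf != InvalidBuffer)
        ReleaseBuffer(bistate->current_buf);
    bistate->current_buf = InvalidBuffer;
}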
Re: [HACKERS] Declarative partitioning - another take
On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote wrote: > [ updated patches ] I committed 0004 and also fixed the related regression test not to rely on DROP .. CASCADE, which isn't always stable. The remainder of this patch set needs a rebase, and perhaps you could also fold in other pending partitioning fixes so I have everything to look at it in one place. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning vs. information_schema
On Tue, Jan 10, 2017 at 4:17 AM, Amit Langote wrote:
> On 2017/01/10 14:44, Keith Fiske wrote:
>> Is there any reason for the exclusion of parent tables from the pg_tables
>> system catalog view? They do not show up in information_schema.tables as
>> well. I believe I found where to make the changes and I tested to make sure
>> it works for my simple case. Attached is my first attempt at patching
>> anything in core. Not sure if there's anywhere else this would need to be
>> fixed.
>
> That's an oversight. The original partitioning patch didn't touch
> information_schema.sql and system_views.sql at all. I added the relkind =
> 'P' check in some other views as well, including what your patch considered.

I took a look at this patch:

* The SQL for pg_seclabels contained an obvious syntax error.

* pg_seclabels should only return object types that could be used in the
SECURITY LABEL command, so it needs to return 'table' not 'partitioned
table' for the new relkind.

* There's a pointless change from v.relkind = 'v' to v.relkind IN ('v').

* I don't see any indication on the Internet that "PARTITIONED TABLE" is a
legal value for the table type in information_schema.tables. Unless we can
find something official, I suppose we should just display BASE TABLE in
that case as we do in other cases. I wonder if the schema needs some
broader revision; for example, are there information_schema elements
intended to show information about partitions?

* Even though unique/primary key/foreign key constraints can't currently be
defined on partitioned tables, it seems best to change the related
reference symmetrically with the others, because we might support it in the
future and then this would break again. Similarly for key_column_usage.

Committed with fixes for those issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Mon, Jan 16, 2017 at 4:09 AM, Amit Langote wrote: > The problem is that whereas the SlotValueDescription that we build to show > in the error message should be based on the tuple that was passed to > ExecInsert() or whatever NextCopyFrom() returned for CopyFrom() to > process, it might fail to be the case if the tuple was needed to be > converted after tuple routing. slot (the tuple in it and its tuple > descriptor) and resultRelInfo that ExecConstraint() receives *do* > correspond with each other, even after possible tuple conversion following > tuple-routing, and hence constraint checking itself works fine (since > commit 2ac3ef7a01 [1]). As said, it's the val_desc built to show in the > error message being based on the converted-for-partition tuple that could > be seen as a problem - is it acceptable if we showed in the error message > whatever the converted-for-partition tuple looks like which might have > columns ordered differently from the root table? If so, we could simply > forget the whole thing, including reverting f1b4c771 [2]. > > An example: > > create table p (a int, b char, c int) partition by list (a); > create table p1 (b char, c int, a int);-- note the column order > alter table p attach partition p1 for values in (1); > alter table p add constraint check_b check (b = 'x'); > > insert into p values (1, 'y', 1); > ERROR: new row for relation "p1" violates check constraint "check_b" > DETAIL: Failing row contains (y, 1, 1). > > Note that "(y, 1, 1)" results from using p1's descriptor on the converted > tuple. As long that's clear and acceptable, I think we need not worry > about this patch and revert the previously committed patch for this "problem". Hmm. It would be fine, IMHO, if the detail message looked like the one that BuildIndexValueDescription produces. Without the column names, the clarity is somewhat lessened. Anybody else have an opinion on this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/14 6:24, Robert Haas wrote: > On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote wrote: >> >> Thanks! I realized however that the approach I used in 0002 of passing >> the original slot to ExecConstraints() fails in certain situations. For >> example, if a BR trigger changes the tuple, the original slot would not >> receive those changes, so it will be wrong to use such a tuple anymore. >> In attached 0001, I switched back to the approach of converting the >> partition-tupdesc-based tuple back to the root partitioned table's format. >> The converted tuple contains the changes by BR triggers, if any. Sorry >> about some unnecessary work. > > Hmm. Even with this patch, I wonder if this is really correct. I > mean, isn't the root of the problem here that ExecConstraints() is > expecting that resultRelInfo matches slot, and it doesn't? The problem is that whereas the SlotValueDescription that we build to show in the error message should be based on the tuple that was passed to ExecInsert() or whatever NextCopyFrom() returned for CopyFrom() to process, it might fail to be the case if the tuple was needed to be converted after tuple routing. slot (the tuple in it and its tuple descriptor) and resultRelInfo that ExecConstraint() receives *do* correspond with each other, even after possible tuple conversion following tuple-routing, and hence constraint checking itself works fine (since commit 2ac3ef7a01 [1]). As said, it's the val_desc built to show in the error message being based on the converted-for-partition tuple that could be seen as a problem - is it acceptable if we showed in the error message whatever the converted-for-partition tuple looks like which might have columns ordered differently from the root table? If so, we could simply forget the whole thing, including reverting f1b4c771 [2]. An example: create table p (a int, b char, c int) partition by list (a); create table p1 (b char, c int, a int);-- note the column order alter table p attach partition p1 for values in (1); alter table p add constraint check_b check (b = 'x'); insert into p values (1, 'y', 1); ERROR: new row for relation "p1" violates check constraint "check_b" DETAIL: Failing row contains (y, 1, 1). Note that "(y, 1, 1)" results from using p1's descriptor on the converted tuple. As long that's clear and acceptable, I think we need not worry about this patch and revert the previously committed patch for this "problem". > And why > isn't that also a problem for the things that get passed resultRelInfo > and slot after tuple routing and before ExecConstraints? In > particular, in copy.c, ExecBRInsertTriggers. If I explained the problem (as I'm seeing it) well enough above, you may see why that's not an issue in other cases. Other sub-routines viz. ExecBRInsertTriggers, ExecWithCheckOptions for RLS INSERT WITH CHECK policies, ExecARInsertTriggers, etc. do receive the correct slot and resultRelInfo for whatever they do with a given tuple and the relation (partition) it is being inserted into. However, if we do consider the above to be a problem, then we would need to fix ExecWithCheckOptions() to do whatever we decide ExecConstraints() should do, because it can also report WITH CHECK violation for a tuple. > Committed 0002. 
Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ac3ef7 [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1b4c77 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
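To illustrate the approach under discussion (converting the routed,
partition-format tuple back to the root table's row type before building
val_desc, so that the DETAIL column order matches what the user supplied),
here is a hypothetical sketch; the helper name, and the assumption that a
partition-to-root TupleConversionMap plus a spare root-format slot are
available at the call site, are purely illustrative:

#include "postgres.h"
#include "access/tupconvert.h"
#include "executor/tuptable.h"

/*
 * Illustrative only: pick the slot whose contents should be shown in a
 * constraint-violation error message.  If a partition-to-root conversion
 * map exists, convert the tuple back to the root table's format so that
 * the reported column order matches the INSERT/COPY input.
 */
static TupleTableSlot *
slot_for_error_report(TupleTableSlot *slot, TupleTableSlot *root_slot,
                      TupleConversionMap *map)
{
    if (map != NULL)
    {
        HeapTuple   tuple = ExecFetchSlotTuple(slot);

        tuple = do_convert_tuple(tuple, map);
        ExecStoreTuple(tuple, root_slot, InvalidBuffer, true);
        return root_slot;
    }

    return slot;
}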
Re: [HACKERS] Declarative partitioning - another take
On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote wrote: > On 2017/01/05 5:50, Robert Haas wrote: >> On Tue, Dec 27, 2016 at 3:59 AM, Amit Langote >> wrote: >>> Patches 0001 to 0006 unchanged. >> >> Committed 0001 earlier, as mentioned in a separate email. Committed >> 0002 and part of 0003. > > Thanks! I realized however that the approach I used in 0002 of passing > the original slot to ExecConstraints() fails in certain situations. For > example, if a BR trigger changes the tuple, the original slot would not > receive those changes, so it will be wrong to use such a tuple anymore. > In attached 0001, I switched back to the approach of converting the > partition-tupdesc-based tuple back to the root partitioned table's format. > The converted tuple contains the changes by BR triggers, if any. Sorry > about some unnecessary work. Hmm. Even with this patch, I wonder if this is really correct. I mean, isn't the root of the problem here that ExecConstraints() is expecting that resultRelInfo matches slot, and it doesn't? And why isn't that also a problem for the things that get passed resultRelInfo and slot after tuple routing and before ExecConstraints? In particular, in copy.c, ExecBRInsertTriggers. Committed 0002. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/06 20:23, Amit Langote wrote: > On 2017/01/05 3:26, Robert Haas wrote: >> It's unclear to me why we need to do 0002. It doesn't seem like it >> should be necessary, it doesn't seem like a good idea, and the commit >> message you proposed is uninformative. > > If a single BulkInsertState object is passed to > heap_insert()/heap_multi_insert() for different heaps corresponding to > different partitions (from one input tuple to next), tuples might end up > going into wrong heaps (like demonstrated in one of the reports [1]). A > simple solution is to disable bulk-insert in case of partitioned tables. > > But my patch (or its motivations) was slightly wrongheaded, wherein I > conflated multi-insert stuff and bulk-insert considerations. I revised > 0002 to not do that. Ragnar Ouchterlony pointed out [1] on pgsql-bugs that 0002 wasn't correct. Attaching updated 0002 along with rebased 0001 and 0003. Thanks, Amit [1] https://www.postgresql.org/message-id/732dfc84-25f5-413c-1eee-0bfa7a370093%40agama.tv >From f2a64348021c7dba1f96d0c8b4e3e253f635b019 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 28 Dec 2016 10:10:26 +0900 Subject: [PATCH 1/3] Set ecxt_scantuple correctly for tuple-routing In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that it's possible for a different TupleTableSlot to be used for partitioned tables at successively lower levels. If we do end up changing the slot from the original, we must update ecxt_scantuple to point to the new one for partition key of the tuple to be computed correctly. Also update the regression tests so that the code manipulating ecxt_scantuple is covered. Reported by: Rajkumar Raghuwanshi Patch by: Amit Langote Reports: https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com --- src/backend/catalog/partition.c | 29 ++--- src/backend/executor/execMain.c | 2 -- src/test/regress/expected/insert.out | 2 +- src/test/regress/sql/insert.sql | 2 +- 4 files changed, 24 insertions(+), 11 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index f54e1bdf3f..0de1cf245a 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1643,7 +1643,10 @@ get_partition_for_tuple(PartitionDispatch *pd, bool isnull[PARTITION_MAX_KEYS]; int cur_offset, cur_index; - int i; + int i, +result; + ExprContext *ecxt = GetPerTupleExprContext(estate); + TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple; /* start with the root partitioned table */ parent = pd[0]; @@ -1672,7 +1675,14 @@ get_partition_for_tuple(PartitionDispatch *pd, slot = myslot; } - /* Extract partition key from tuple */ + /* + * Extract partition key from tuple; FormPartitionKeyDatum() expects + * ecxt_scantuple to point to the correct tuple slot (which might be + * different from the slot we received from the caller if the + * partitioned table of the current level has different tuple + * descriptor from its parent). 
+ */ + ecxt->ecxt_scantuple = slot; FormPartitionKeyDatum(parent, slot, estate, values, isnull); if (key->strategy == PARTITION_STRATEGY_RANGE) @@ -1727,16 +1737,21 @@ get_partition_for_tuple(PartitionDispatch *pd, */ if (cur_index < 0) { + result = -1; *failed_at = RelationGetRelid(parent->reldesc); - return -1; + break; } - else if (parent->indexes[cur_index] < 0) - parent = pd[-parent->indexes[cur_index]]; - else + else if (parent->indexes[cur_index] >= 0) + { + result = parent->indexes[cur_index]; break; + } + else + parent = pd[-parent->indexes[cur_index]]; } - return parent->indexes[cur_index]; + ecxt->ecxt_scantuple = ecxt_scantuple_old; + return result; } /* diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index ff277d300a..6a9bc8372f 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -3167,9 +3167,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd, { int result; Oid failed_at; - ExprContext *econtext = GetPerTupleExprContext(estate); - econtext->ecxt_scantuple = slot; result = get_partition_for_tuple(pd, slot, estate, &failed_at); if (result < 0) { diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out index ca3134c34c..1c7b8047ee 100644 --- a/src/test/regress/expected/insert.out +++ b/src/test/regress/expected/insert.out @@ -302,7 +302,7 @@ drop cascades to table part_ee_ff1 drop cascades to table part_ee_ff2 -- more tests for certain multi-level partitioning scenarios create table p (a int, b int) partition by range (a, b); -create table p1 (b int, a int not null) partition by range (b); +create table p1 (b int not null, a int not null) partition by range ((b+0)); create table p11 (like p1); alter table p11 drop a; alter table p11 add a int; diff --git a/src
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/05 5:50, Robert Haas wrote: > On Tue, Dec 27, 2016 at 3:59 AM, Amit Langote > wrote: >> Patches 0001 to 0006 unchanged. > > Committed 0001 earlier, as mentioned in a separate email. Committed > 0002 and part of 0003. Thanks! I realized however that the approach I used in 0002 of passing the original slot to ExecConstraints() fails in certain situations. For example, if a BR trigger changes the tuple, the original slot would not receive those changes, so it will be wrong to use such a tuple anymore. In attached 0001, I switched back to the approach of converting the partition-tupdesc-based tuple back to the root partitioned table's format. The converted tuple contains the changes by BR triggers, if any. Sorry about some unnecessary work. > But I'm skeptical that the as-patched-by-0003 > logic in generate_partition_qual() makes sense. You do this: > > result = list_concat(generate_partition_qual(parent), > copyObject(rel->rd_partcheck)); > > /* Mark Vars with correct attnos */ > result = map_partition_varattnos(result, rel, parent); > > But that has the effect of applying map_partition_varattnos to > everything in rel->rd_partcheck in addition to applying it to > everything returned by generate_partition_qual() on the parent, which > doesn't seem right. I've replaced this portion of the code with (as also mentioned below): /* Quick copy */ if (rel->rd_partcheck != NIL) return copyObject(rel->rd_partcheck); Down below (for the case when the partition qual is not cached, we now do this: my_qual = get_qual_from_partbound(rel, parent, bound); /* Add the parent's quals to the list (if any) */ if (parent->rd_rel->relispartition) result = list_concat(generate_partition_qual(parent), my_qual); else result = my_qual; /* * Change Vars to have partition's attnos instead of the parent's. * We do this after we concatenate the parent's quals, because * we want every Var in it to bear this relation's attnos. */ result = map_partition_varattnos(result, rel, parent); Which is then cached wholly in rd_partcheck. As for your concern whether it's correct to do so, consider that doing generate_partition_qual() on the parent returns qual with Vars that bear the parent's attnos (which is OK as far parent as partition is concerned). To apply the qual to the current relation as partition, we must change the Vars to have this relation's attnos. > Also, don't we want to do map_partition_varattnos() just ONCE, rather > than on every call to this function? I think maybe your concern is > that the parent might be changed without a relcache flush on the > child, but I don't quite see how that could happen. If the parent's > tuple descriptor changes, surely the child's tuple descriptor would > have to be altered at the same time... Makes sense. I fixed so that we return copyObject(rel->rd_partcheck), if it's non-NIL, instead of generating parent's qual and doing the mapping again. For some reason, I thought we couldn't save the mapped version in the relcache. By the way, in addition to the previously mentioned bug of RETURNING, I found that WITH CHECK OPTION didn't work correctly as well. In fact automatically updatable views failed to consider partitioned tables at all. Patch 0007 is addressed towards fixing that. 
Thanks, Amit >From e408234633c01817d6a2313fdbdccdb4f0057c1e Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 6 Jan 2017 15:53:10 +0900 Subject: [PATCH 1/7] Fix reporting of violation in ExecConstraints, again We decided in f1b4c771ea74f42447dccaed42ffcdcccf3aa694 that passing the original slot (one containing the tuple formatted per root partitioned table's tupdesc) to ExecConstraints(), but that breaks certain cases. Imagine what would happen if a BR trigger changed the tuple - the original slot would not contain those changes. So, it seems better to convert (if necessary) the tuple formatted per partition tupdesc after tuple-routing back to the root table's format and use the converted tuple to make val_desc shown in the message if an error occurs. --- src/backend/commands/copy.c| 6 ++-- src/backend/executor/execMain.c| 53 +- src/backend/executor/nodeModifyTable.c | 5 ++-- src/include/executor/executor.h| 3 +- src/test/regress/expected/insert.out | 18 ++-- src/test/regress/sql/insert.sql| 17 ++- 6 files changed, 82 insertions(+), 20 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index f56b2ac49b..65eb167087 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2491,8 +2491,7 @@ CopyFrom(CopyState cstate) for (;;) { - TupleTableSlot *slot, - *oldslot; + TupleTableSlot *slot; bool skip_tuple; Oid loaded_oid = InvalidOid; @@ -2534,7 +2533,6 @@ CopyFrom(CopyState cstate) ExecStore
Re: [HACKERS] Declarative partitioning - another take
Hi Kieth, On 2017/01/10 14:44, Keith Fiske wrote: > Is there any reason for the exclusion of parent tables from the pg_tables > system catalog view? They do not show up in information_schema.tables as > well. I believe I found where to make the changes and I tested to make sure > it works for my simple case. Attached is my first attempt at patching > anything in core. Not sure if there's anywhere else this would need to be > fixed. That's an oversight. The original partitioning patch didn't touch information_schema.sql and system_views.sql at all. I added the relkind = 'P' check in some other views as well, including what your patch considered. Thanks, Amit >From 9eef3e87b9a025d233aa4b935b50bb0c7633efbb Mon Sep 17 00:00:00 2001 From: amit Date: Tue, 10 Jan 2017 18:12:25 +0900 Subject: [PATCH] information_schema/system_views.sql and relkind 'P' Currently, partitioned table are not taken into account in various information_schema and system views. Reported by: Keith Fiske Patch by: Kieth Fiske, Amit Langote Reports: https://www.postgresql.org/message-id/CAG1_KcDJiZB=l6youo_bvufj2q2851_xdkfhw0jdcd_2vtk...@mail.gmail.com --- src/backend/catalog/information_schema.sql | 37 +++--- src/backend/catalog/system_views.sql | 3 ++- 2 files changed, 21 insertions(+), 19 deletions(-) diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql index 4df390a763..318f195b81 100644 --- a/src/backend/catalog/information_schema.sql +++ b/src/backend/catalog/information_schema.sql @@ -453,7 +453,7 @@ CREATE VIEW check_constraints AS AND a.attnum > 0 AND NOT a.attisdropped AND a.attnotnull - AND r.relkind = 'r' + AND r.relkind IN ('r', 'P') AND pg_has_role(r.relowner, 'USAGE'); GRANT SELECT ON check_constraints TO PUBLIC; @@ -525,7 +525,7 @@ CREATE VIEW column_domain_usage AS AND a.attrelid = c.oid AND a.atttypid = t.oid AND t.typtype = 'd' - AND c.relkind IN ('r', 'v', 'f') + AND c.relkind IN ('r', 'v', 'f', 'P') AND a.attnum > 0 AND NOT a.attisdropped AND pg_has_role(t.typowner, 'USAGE'); @@ -564,7 +564,7 @@ CREATE VIEW column_privileges AS pr_c.relowner FROM (SELECT oid, relname, relnamespace, relowner, (aclexplode(coalesce(relacl, acldefault('r', relowner.* FROM pg_class - WHERE relkind IN ('r', 'v', 'f') + WHERE relkind IN ('r', 'v', 'f', 'P') ) pr_c (oid, relname, relnamespace, relowner, grantor, grantee, prtype, grantable), pg_attribute a WHERE a.attrelid = pr_c.oid @@ -586,7 +586,7 @@ CREATE VIEW column_privileges AS ) pr_a (attrelid, attname, grantor, grantee, prtype, grantable), pg_class c WHERE pr_a.attrelid = c.oid - AND relkind IN ('r', 'v', 'f') + AND relkind IN ('r', 'v', 'f', 'P') ) x, pg_namespace nc, pg_authid u_grantor, @@ -629,7 +629,7 @@ CREATE VIEW column_udt_usage AS WHERE a.attrelid = c.oid AND a.atttypid = t.oid AND nc.oid = c.relnamespace - AND a.attnum > 0 AND NOT a.attisdropped AND c.relkind in ('r', 'v', 'f') + AND a.attnum > 0 AND NOT a.attisdropped AND c.relkind in ('r', 'v', 'f', 'P') AND pg_has_role(coalesce(bt.typowner, t.typowner), 'USAGE'); GRANT SELECT ON column_udt_usage TO PUBLIC; @@ -738,7 +738,7 @@ CREATE VIEW columns AS CAST('NEVER' AS character_data) AS is_generated, CAST(null AS character_data) AS generation_expression, - CAST(CASE WHEN c.relkind = 'r' OR + CAST(CASE WHEN c.relkind IN ('r', 'P') OR (c.relkind IN ('v', 'f') AND pg_column_is_updatable(c.oid, a.attnum, false)) THEN 'YES' ELSE 'NO' END AS yes_or_no) AS is_updatable @@ -753,7 +753,7 @@ CREATE VIEW columns AS WHERE (NOT pg_is_other_temp_schema(nc.oid)) - AND a.attnum 
> 0 AND NOT a.attisdropped AND c.relkind in ('r', 'v', 'f') + AND a.attnum > 0 AND NOT a.attisdropped AND c.relkind in ('r', 'v', 'f', 'P') AND (pg_has_role(c.relowner, 'USAGE') OR has_column_privilege(c.oid, a.attnum, @@ -789,7 +789,7 @@ CREATE VIEW constraint_column_usage AS AND d.objid = c.oid AND c.connamespace = nc.oid AND c.contype = 'c' -AND r.relkind = 'r' +AND r.relkind IN ('r', 'P') AND NOT a.attisdropped UNION ALL @@ -841,7 +841,7 @@ CREATE VIEW constraint_table_usage AS WHERE c.connamespace = nc.oid AND r.relnamespace = nr.oid AND ( (c.contype = 'f' AND c.confrelid = r.oid) OR (c.contype IN ('p', 'u') AND c.conrelid = r.oid) ) - AND r.relkind = 'r' +
Re: [HACKERS] Declarative partitioning - another take
Hi Amul, On 2017/01/09 17:29, amul sul wrote: > I got server crash due to assert failure at ATTACHing overlap rang > partition, here is test case to reproduce this: > > CREATE TABLE test_parent(a int) PARTITION BY RANGE (a); > CREATE TABLE test_parent_part2 PARTITION OF test_parent FOR VALUES > FROM(100) TO(200); > CREATE TABLE test_parent_part1(a int NOT NULL); > ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES > FROM(1) TO(200); > > I think, this bug exists in the following code of check_new_partition_bound(): > > 767 if (equal || off1 != off2) > 768 { > 769 overlap = true; > 770 with = boundinfo->indexes[off2 + 1]; > 771 } > > When equal is true array index should not be 'off2 + 1'. Good catch. Attached patch should fix that. I observed crash with the following command as well: ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES FROM (1) TO (300); That's because there is one more case when the array index shouldn't be off2 + 1 - the case where the bound at off2 is an upper bound (I'd wrongly assumed that it's always a lower bound). Anyway, I rewrote the surrounding comments to clarify the logic a bit. > While reading code related to this, I wondered why > partition_bound_bsearch is not immediately returns when cmpval==0? partition_bound_bsearch() is meant to return the *greatest* index of the bound less than or equal to the input bound ("probe"). But it seems to me now that we would always return the first index at which we get 0 for cmpval, albeit after wasting cycles to try to find even greater index. Because we don't have duplicates in the datums array, once we encounter a bound that is equal to probe, we are only going to find bounds that are *greater than* probe if we continue looking right, only to turn back again to return the equal index (which is wasted cycles in invoking the partition key comparison function(s)). So, it perhaps makes sense to do this per your suggestion: @@ -1988,8 +2018,11 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo, if (cmpval <= 0) { lo = mid; *is_equal = (cmpval == 0); + +if (*is_equal) +break; } Thanks, Amit >From 7fe537f8e8efeac51c3b0cc91ac51a1aa39399cd Mon Sep 17 00:00:00 2001 From: amit Date: Tue, 10 Jan 2017 17:43:36 +0900 Subject: [PATCH] Fix some wrong thinking in check_new_partition_bound() Because a given range bound in the PartitionBoundInfo.datums array is sometimes a lower range bound and at other times an upper range bound, we must be careful when assuming which, especially when interpreting the result of partition_bound_bsearch which returns the index of the greatest bound that is less than or equal to probe. Due to an error in thinking about the same, the relevant code in check_new_partition_bound() caused invalid partition (index == -1) to be chosen as the partition being overlapped. Also, we need not continue searching for even greater bound in partition_bound_bsearch() once we find the first bound that is *equal* to the probe, because we don't have duplicate datums. That spends cycles needlessly. Per suggestion from Amul Sul. 
Reported by: Amul Sul Patch by: Amit Langote Reports: https://www.postgresql.org/message-id/CAAJ_b94XgbqVoXMyxxs63CaqWoMS1o2gpHiU0F7yGnJBnvDc_A%40mail.gmail.com --- src/backend/catalog/partition.c| 62 ++ src/test/regress/expected/create_table.out | 10 - src/test/regress/sql/create_table.sql | 4 ++ 3 files changed, 60 insertions(+), 16 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index f54e1bdf3f..df5652de4c 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -741,35 +741,64 @@ check_new_partition_bound(char *relname, Relation parent, Node *bound) boundinfo->strategy == PARTITION_STRATEGY_RANGE); /* - * Find the greatest index of a range bound that is less - * than or equal with the new lower bound. + * Firstly, find the greatest range bound that is less + * than or equal to the new lower bound. */ off1 = partition_bound_bsearch(key, boundinfo, lower, true, &equal); /* - * If equal has been set to true, that means the new lower - * bound is found to be equal with the bound at off1, - * which clearly means an overlap with the partition at - * index off1+1). - * - * Otherwise, check if there is a "gap" that could be - * occupied by the new partition. In case of a gap, the - * new upper bound should not cross past the upper - * boundary of the gap, that is, off2 == off1 should be - * true. + * off1 == -1 means that all existing bounds are greater + * than the new lower
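As a standalone illustration of the search convention involved here (find
the greatest index whose bound is less than or equal to the probe, with an
early exit on equality that is safe only because the datums array contains
no duplicates), here is a small self-contained sketch in plain C; the names
and the use of plain integers instead of partition bounds are illustrative,
and this is not the actual partition_bound_bsearch() code:

#include <stdbool.h>

/*
 * Return the greatest index i such that arr[i] <= probe, or -1 if there
 * is no such element.  Because the sorted array has no duplicates, we can
 * stop as soon as we hit an exact match: every greater index holds a
 * strictly greater value, so the answer cannot change by searching on.
 */
static int
bsearch_le(const int *arr, int nelems, int probe, bool *is_equal)
{
    int     lo = -1;
    int     hi = nelems - 1;

    *is_equal = false;
    while (lo < hi)
    {
        int     mid = (lo + hi + 1) / 2;    /* round up so the loop advances */

        if (arr[mid] <= probe)
        {
            lo = mid;
            *is_equal = (arr[mid] == probe);
            if (*is_equal)
                break;          /* no duplicates, so this index is final */
        }
        else
            hi = mid - 1;
    }

    return lo;
}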
Re: [HACKERS] Declarative partitioning - another take
Is there any reason for the exclusion of parent tables from the pg_tables system catalog view? They do not show up in information_schema.tables as well. I believe I found where to make the changes and I tested to make sure it works for my simple case. Attached is my first attempt at patching anything in core. Not sure if there's anywhere else this would need to be fixed. -- Keith Fiske Database Administrator OmniTI Computer Consulting, Inc. http://www.keithf4.com diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql index 4df390a..c31d0d8 100644 --- a/src/backend/catalog/information_schema.sql +++ b/src/backend/catalog/information_schema.sql @@ -1901,6 +1901,7 @@ CREATE VIEW tables AS WHEN c.relkind = 'r' THEN 'BASE TABLE' WHEN c.relkind = 'v' THEN 'VIEW' WHEN c.relkind = 'f' THEN 'FOREIGN TABLE' + WHEN c.relkind = 'P' THEN 'PARTITIONED TABLE' ELSE null END AS character_data) AS table_type, @@ -1912,7 +1913,7 @@ CREATE VIEW tables AS CAST(t.typname AS sql_identifier) AS user_defined_type_name, CAST(CASE WHEN c.relkind = 'r' OR - (c.relkind IN ('v', 'f') AND + (c.relkind IN ('v', 'f', 'P') AND -- 1 << CMD_INSERT pg_relation_is_updatable(c.oid, false) & 8 = 8) THEN 'YES' ELSE 'NO' END AS yes_or_no) AS is_insertable_into, @@ -1923,7 +1924,7 @@ CREATE VIEW tables AS FROM pg_namespace nc JOIN pg_class c ON (nc.oid = c.relnamespace) LEFT JOIN (pg_type t JOIN pg_namespace nt ON (t.typnamespace = nt.oid)) ON (c.reloftype = t.oid) -WHERE c.relkind IN ('r', 'v', 'f') +WHERE c.relkind IN ('r', 'v', 'f', 'P') AND (NOT pg_is_other_temp_schema(nc.oid)) AND (pg_has_role(c.relowner, 'USAGE') OR has_table_privilege(c.oid, 'SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER') diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 31aade1..f4dc460 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -136,7 +136,7 @@ CREATE VIEW pg_tables AS C.relrowsecurity AS rowsecurity FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) LEFT JOIN pg_tablespace T ON (T.oid = C.reltablespace) -WHERE C.relkind = 'r'; +WHERE C.relkind = 'r' OR C.relkind = 'P'; CREATE VIEW pg_matviews AS SELECT -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Hi,

I got a server crash due to an assert failure when ATTACHing an overlapping
range partition; here is a test case to reproduce it:

CREATE TABLE test_parent(a int) PARTITION BY RANGE (a);
CREATE TABLE test_parent_part2 PARTITION OF test_parent FOR VALUES FROM(100) TO(200);
CREATE TABLE test_parent_part1(a int NOT NULL);
ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES FROM(1) TO(200);

I think this bug exists in the following code of check_new_partition_bound():

 767            if (equal || off1 != off2)
 768            {
 769                overlap = true;
 770                with = boundinfo->indexes[off2 + 1];
 771            }

When equal is true, the array index should not be 'off2 + 1'.

While reading the code related to this, I wondered why
partition_bound_bsearch does not return immediately when cmpval==0?

Apologies if this has already been reported.

Regards,
Amul

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2017/01/05 3:26, Robert Haas wrote: > On Tue, Dec 27, 2016 at 8:41 PM, Amit Langote > wrote: >> On 2016/12/27 19:07, Amit Langote wrote: >>> Attached should fix that. >> >> Here are the last two patches with additional information like other >> patches. Forgot to do that yesterday. > > 0001 has the disadvantage that get_partition_for_tuple() acquires a > side effect. That seems undesirable. At the least, it needs to be > documented in the function's header comment. That's true. How about we save away the original ecxt_scantuple at entry and restore the same just before returning from the function? That way there would be no side effect. 0001 implements that. > It's unclear to me why we need to do 0002. It doesn't seem like it > should be necessary, it doesn't seem like a good idea, and the commit > message you proposed is uninformative. If a single BulkInsertState object is passed to heap_insert()/heap_multi_insert() for different heaps corresponding to different partitions (from one input tuple to next), tuples might end up going into wrong heaps (like demonstrated in one of the reports [1]). A simple solution is to disable bulk-insert in case of partitioned tables. But my patch (or its motivations) was slightly wrongheaded, wherein I conflated multi-insert stuff and bulk-insert considerations. I revised 0002 to not do that. However if we disable bulk-insert mode, COPY's purported performance benefit compared with INSERT is naught. Patch 0003 is a proposal to implement bulk-insert mode even for partitioned tables. Basically, allocate separate BulkInsertState objects for each partition and switch to the appropriate one just before calling heap_insert()/heap_multi_insert(). Then to be able to use heap_multi_insert(), we must also manage buffered tuples separately for each partition. Although, I didn't modify the limit on number of buffered tuples and/or size of buffered tuples which controls when we pause buffering and do heap_multi_insert() on buffered tuples. Maybe, it should work slightly differently for the partitioned table case, like for example, increase the overall limit on both the number of tuples and tuple size in the partitioning case (I observed that increasing it 10x or 100x helped to some degree). Thoughts on this? Thanks, Amit [1] https://www.postgresql.org/message-id/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com >From 332c67c258a0f25f76c29ced23199fe0ee8e153e Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 28 Dec 2016 10:10:26 +0900 Subject: [PATCH 1/3] Set ecxt_scantuple correctly for tuple-routing In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that it's possible for a different TupleTableSlot to be used for partitioned tables at successively lower levels. If we do end up changing the slot from the original, we must update ecxt_scantuple to point to the new one for partition key of the tuple to be computed correctly. Also update the regression tests so that the code manipulating ecxt_scantuple is covered. 
Reported by: Rajkumar Raghuwanshi Patch by: Amit Langote Reports: https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com --- src/backend/catalog/partition.c | 29 ++--- src/backend/executor/execMain.c | 2 -- src/test/regress/expected/insert.out | 2 +- src/test/regress/sql/insert.sql | 2 +- 4 files changed, 24 insertions(+), 11 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index f54e1bdf3f..0de1cf245a 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1643,7 +1643,10 @@ get_partition_for_tuple(PartitionDispatch *pd, bool isnull[PARTITION_MAX_KEYS]; int cur_offset, cur_index; - int i; + int i, +result; + ExprContext *ecxt = GetPerTupleExprContext(estate); + TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple; /* start with the root partitioned table */ parent = pd[0]; @@ -1672,7 +1675,14 @@ get_partition_for_tuple(PartitionDispatch *pd, slot = myslot; } - /* Extract partition key from tuple */ + /* + * Extract partition key from tuple; FormPartitionKeyDatum() expects + * ecxt_scantuple to point to the correct tuple slot (which might be + * different from the slot we received from the caller if the + * partitioned table of the current level has different tuple + * descriptor from its parent). + */ + ecxt->ecxt_scantuple = slot; FormPartitionKeyDatum(parent, slot, estate, values, isnull); if (key->strategy == PARTITION_STRATEGY_RANGE) @@ -1727,16 +1737,21 @@ get_partition_for_tuple(PartitionDispatch *pd, */ if (cur_index < 0) { + result = -1; *failed_at = RelationGetRelid(parent->reldesc); - return -1; + break; } - else if (parent->indexes[cur_index] < 0) - parent = pd[-parent->indexes[cur_index]]; - else + else if (parent->indexes[cur_index] >= 0)
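As a rough sketch of the per-partition bulk-insert idea in 0003, assuming
leaf partitions are identified by their index in the tuple-routing array
and that the caller (e.g. CopyFrom) allocates a zero-initialized array with
one entry per leaf partition up front; the helper name and array layout are
illustrative only:

#include "postgres.h"
#include "access/heapam.h"

/*
 * Illustrative only: lazily create one BulkInsertState per leaf partition
 * and return the one for the partition a tuple was just routed to.  The
 * returned state is then passed to heap_insert()/heap_multi_insert() for
 * that partition; each allocated state must eventually be released with
 * FreeBulkInsertState() once the copy finishes.
 */
static BulkInsertState
get_partition_bulk_insert_state(BulkInsertState *bistates, int leaf_index)
{
    if (bistates[leaf_index] == NULL)
        bistates[leaf_index] = GetBulkInsertState();

    return bistates[leaf_index];
}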
Re: [HACKERS] Declarative partitioning - another take
Hi Keith, On 2017/01/06 2:16, Keith Fiske wrote: > Could we get some clarification on the partition_bound_spec portion of the > PARTITION OF clause? Just doing some testing it seems it's inclusive of the > FROM value but exclusive of the TO value. I don't see mention of this in > the docs as of commit 18fc5192a631441a73e6a3b911ecb14765140389 yesterday. > It does mention that the values aren't allowed to overlap, but looking at > the schema below, without the clarification of which side is > inclusive/exclusive it seems confusing because 2016-08-01 is in both. Even > the child table does not clarify this. Not sure if there's a way to do this > in the \d+ display which would be ideal, but it should at least be > mentioned in the docs. I agree that needs highlighting. I'm planning to write a doc patch for that (among other documentation improvements). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
Could we get some clarification on the partition_bound_spec portion of the
PARTITION OF clause? Just doing some testing it seems it's inclusive of the
FROM value but exclusive of the TO value. I don't see mention of this in
the docs as of commit 18fc5192a631441a73e6a3b911ecb14765140389 yesterday.
It does mention that the values aren't allowed to overlap, but looking at
the schema below, without the clarification of which side is
inclusive/exclusive it seems confusing because 2016-08-01 is in both. Even
the child table does not clarify this. Not sure if there's a way to do this
in the \d+ display which would be ideal, but it should at least be
mentioned in the docs.

keith@keith=# \d+ measurement
                               Table "public.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 logdate   | date    |           | not null |         | plain   |              |
 peaktemp  | integer |           |          | 1       | plain   |              |
 unitsales | integer |           |          |         | plain   |              |
Partition key: RANGE (logdate)
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)
Partitions: measurement_y2016m07 FOR VALUES FROM ('2016-07-01') TO ('2016-08-01'),
            measurement_y2016m08 FOR VALUES FROM ('2016-08-01') TO ('2016-09-01')

keith@keith=# \d+ measurement_y2016m07
                           Table "public.measurement_y2016m07"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 logdate   | date    |           | not null |         | plain   |              |
 peaktemp  | integer |           |          | 1       | plain   |              |
 unitsales | integer |           |          | 0       | plain   |              |
Partition of: measurement FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)

keith@keith=# insert into measurement (logdate) values ('2016-08-01');
INSERT 0 1
Time: 2.848 ms

keith@keith=# select * from measurement_y2016m07;
 logdate | peaktemp | unitsales
---------+----------+-----------
(0 rows)

Time: 0.273 ms

keith@keith=# select * from measurement_y2016m08;
  logdate   | peaktemp | unitsales
------------+----------+-----------
 2016-08-01 |        1 |    «NULL»
(1 row)

Time: 0.272 ms

keith@keith=# drop table measurement_y2016m08;
DROP TABLE
Time: 5.919 ms

keith@keith=# select * from only measurement;
 logdate | peaktemp | unitsales
---------+----------+-----------
(0 rows)

Time: 0.307 ms

keith@keith=# insert into measurement (logdate) values ('2016-08-01');
ERROR:  no partition of relation "measurement" found for row
DETAIL:  Failing row contains (2016-08-01, 1, null).
Time: 0.622 ms