Re: [HACKERS] Parallel Append implementation
Thanks a lot Robert for the patch. I will have a look. Quickly tried to test some aggregate queries with a partitioned pgbench_accounts table, and it is crashing. Will get back with the fix, and any other review comments. Thanks -Amit Khandekar On 9 November 2017 at 23:44, Robert Haas wrote: > On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas wrote: >> No, because the Append node is *NOT* getting copied into shared >> memory. I have pushed a comment update to the existing functions; you >> can use the same comment for this patch. > > I spent the last several days working on this patch, which had a > number of problems both cosmetic and functional. I think the attached > is in better shape now, but it could certainly use some more review > and testing since I only just finished modifying it, and I modified it > pretty heavily. Changes: > > - I fixed the "morerows" entries in the documentation. If you had > built the documentation the way you had it and loaded up in a web > browser, you would have seen that the way you had it was not correct. > > - I moved T_AppendState to a different position in the switch inside > ExecParallelReInitializeDSM, so as to keep that switch in the same > order as all of the other switch statements in that file. > > - I rewrote the comment for pa_finished. It previously began with > "workers currently executing the subplan", which is not an accurate > description. I suspect this was a holdover from a previous version of > the patch in which this was an array of integers rather than an array > of type bool. I also fixed the comment in ExecAppendEstimate and > added, removed, or rewrote various other comments as well. > > - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is > more clear and allows for the possibility that this sentinel value > might someday be used for non-parallel-aware Append plans. > > - I largely rewrote the code for picking the next subplan. 
A > superficial problem with the way that you had it is that you had > renamed exec_append_initialize_next to exec_append_seq_next but not > updated the header comment to match. Also, the logic was spread out > all over the file. There are three cases: not parallel aware, leader, > worker. You had the code for the first case at the top of the file > and the other two cases at the bottom of the file and used multiple > "if" statements to pick the right one in each case. I replaced all > that with a function pointer stored in the AppendState, moved the code > so it's all together, and rewrote it in a way that I find easier to > understand. I also changed the naming convention. > > - I renamed pappend_len to pstate_len and ParallelAppendDescData to > ParallelAppendState. I think the use of the word "descriptor" is a > carryover from the concept of a scan descriptor. There's nothing > really wrong with inventing the concept of an "append descriptor", but > it seems more clear to just refer to shared state. > > - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan. > Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong; > instead, local state should be reset in ExecReScanAppend. I installed > what I believe to be the correct logic in that function instead. > > - I fixed list_qsort() so that it copies the type of the old list into > the new list. Otherwise, sorting a list of type T_IntList or > T_OidList would turn it into just plain T_List, which is wrong. > > - I removed get_append_num_workers and integrated the logic into the > callers. This function was coded quite strangely: it assigned the > return value of fls() to a double and then eventually rounded the > result back to an integer. But fls() returns an integer, so this > doesn't make much sense. On a related note, I made it use fls(# of > subpaths) instead of fls(# of subpaths)+1. 
Adding 1 doesn't make > sense to me here because it leads to a decision to use 2 workers for a > single, non-partial subpath. I suspect both of these mistakes stem > from thinking that fls() returns the base-2 logarithm, but in fact it > doesn't, quite: log2(1) = 0.0 but fls(1) = 1. > > - In the process of making the changes described in the previous > point, I added a couple of assertions, one of which promptly failed. > It turns out the reason is that your patch didn't update > accumulate_append_subpaths(), which can result in flattening > non-partial paths from a Parallel Append into a parent Append's list > of partial paths, which is bad. The easiest way to fix that would be > to just teach accumulate_append_subpaths() not to flatten a Parallel > Append into a parent Append or MergeAppend node,
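[Editorial note: the fls() point above is easy to check. Python's int.bit_length() has exactly the semantics of the BSD fls(3) C function for non-negative integers, so a quick sketch — the worker-count loop is an illustration of the heuristic being discussed, not the patch's actual code:]

```python
import math

def fls(x: int) -> int:
    """Find-last-set, as in the BSD fls(3) C function: the 1-based index
    of the most significant set bit, or 0 for x == 0.  For non-negative
    ints this is exactly Python's int.bit_length()."""
    return x.bit_length()

# fls() is not the base-2 logarithm: log2(1) = 0.0 but fls(1) = 1.
assert fls(1) == 1 and math.log2(1) == 0.0
assert fls(2) == 2 and fls(3) == 2 and fls(4) == 3

# Worker-count heuristic sketch: fls(# of subpaths) rather than
# fls(# of subpaths) + 1, so a single non-partial subpath gets
# 1 worker, not 2.
for nsubpaths, nworkers in [(1, 1), (2, 2), (4, 3), (8, 4)]:
    assert fls(nsubpaths) == nworkers
```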
Re: [HACKERS] UPDATE of partition key
On 9 November 2017 at 09:27, Thomas Munro wrote: > On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar wrote: >> On 8 November 2017 at 07:55, Thomas Munro >> wrote: >>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas wrote: >>>> The changes to trigger.c still make me super-nervous. Hey THOMAS >>>> MUNRO, any chance you could review that part? > > At first, it seemed quite strange to me that row triggers and > statement triggers fire different events for the same modification. > Row triggers see DELETE + INSERT (necessarily because different > tables are involved), but this fact is hidden from the target table's > statement triggers. > > The alternative would be for all triggers to see consistent events and > transitions. Instead of having your special case code in ExecInsert > and ExecDelete that creates the two halves of a 'synthetic' UPDATE for > the transition tables, you'd just let the existing ExecInsert and > ExecDelete code do its thing, and you'd need a flag to record that you > should also fire INSERT/DELETE after statement triggers if any rows > moved. Yeah, I had also thought about that, but felt that change was too invasive. For example, letting ExecARInsertTriggers() do the transition capture even when transition_capture->tcs_update_new_table is set. I was also thinking of having a separate function to *only* add the transition table rows. So in ExecInsert, call this one instead of ExecARUpdateTriggers(). But I realized that the existing ExecARUpdateTriggers() looks like a better, robust interface with all its checks. Just that calling ExecARUpdateTriggers() sounds like we are also firing a trigger; we are not firing any trigger or saving any event, we are just adding the transition row. > > After sleeping on this question, I am coming around to the view that > the way you have it is right. The distinction isn't really between > row triggers and statement triggers, it's between triggers at > different levels in the hierarchy. 
It just so happens that we > currently only fire target table statement triggers and leaf table row > triggers. Yes. And rows are present only in the leaf partitions. So we have to simulate as though the target table has these rows. Like you mentioned, the user has to get the impression of a normal table. So we have to do something extra to capture the rows. > Future development ideas that seem consistent with your choice: > > 1. If we ever allow row triggers with transition tables on child > tables, then I think *their* transition tables should certainly see > the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be > inconsistent with the OLD and NEW variables in a single trigger > invocation. (These were prohibited mainly due to lack of time and > (AFAIK) limited usefulness; I think they would probably need > their own separate tuplestores, or possibly some kind of filtering.) As we know, for row triggers on leaf partitions, we treat them as normal tables, so a trigger written on a leaf partition sees only the local changes. The trigger is unaware whether the insert is part of an UPDATE row movement. Similarly, the transition table referenced by that row trigger function should see only the NEW table, not the old table. > > 2. If we ever allow row triggers on partitioned tables (ie that fire > when its children are modified), then I think their UPDATE trigger > should probably fire when a row moves between any two (grand-)*child > tables, just as you have it for target table statement triggers. Yes, I agree. > It doesn't matter that the view from parent tables' triggers is > inconsistent with the view from leaf table triggers: it's a feature > that we 'hide' partitioning from the user to the extent we can so that > you can treat the partitioned table just like a table. > > Any other views? 
I think because there is no provision for a row trigger on a partitioned table, users who want a common trigger on a partition subtree have no choice but to create the same trigger individually on the leaf partitions. And that's the reason we cannot handle update row movement with triggers without anomalies. Thanks -Amit Khandekar -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
On 8 November 2017 at 07:55, Thomas Munro wrote: > On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas wrote: >> The changes to trigger.c still make me super-nervous. Hey THOMAS >> MUNRO, any chance you could review that part? > > Looking, but here's one silly thing that jumped out at me while > getting started with this patch. I cannot seem to convince my macOS > system to agree with the expected sort order from :show_data, where > underscores precede numbers: > > part_a_10_a_20 | a | 10 | 200 | 1 | > part_a_1_a_10 | a | 1 | 1 | 1 | > - part_d_1_15| b | 15 | 146 | 1 | > - part_d_1_15| b | 16 | 147 | 2 | > part_d_15_20 | b | 17 | 155 | 16 | > part_d_15_20 | b | 19 | 155 | 19 | > + part_d_1_15| b | 15 | 146 | 1 | > + part_d_1_15| b | 16 | 147 | 2 | > > It seems that macOS (like older BSDs) just doesn't know how to sort > Unicode and falls back to sorting the bits. I expect that means that > the test will also fail on any other OS with "make check > LC_COLLATE=C". I believe our regression tests are supposed to pass > with a wide range of collations including C, so I wonder if this means > we should stick a leading zero on those single digit numbers, or > something, to stabilise the output. I would prefer to retain the partition names. I have now added a COLLATE "C" clause for partname like this: -\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6' +\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6' Thomas, can you please try the attached incremental patch regress_locale_changes.patch and check if the test passes? The patch is to be applied on the main v22 patch. If the test passes, I will include these changes (also for list_parted) in the upcoming v23 patch. 
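[Editorial note: the macOS/C-collation ordering in the diff above is plain byte-wise ordering, which any language can reproduce. A quick Python check, using the partition names from the test — '5' (0x35) sorts before '_' (0x5F) in byte order, so part_d_15_20 precedes part_d_1_15, the opposite of what a linguistic collation such as en_US produces:]

```python
# Byte-wise ("C" collation) sort of the partition names from :show_data.
names = ["part_a_10_a_20", "part_a_1_a_10", "part_d_1_15", "part_d_15_20"]

# Matches the macOS output in the regression diff: part_d_15_20 comes
# before part_d_1_15, because '5' < '_' in byte order.
assert sorted(names) == [
    "part_a_10_a_20",
    "part_a_1_a_10",
    "part_d_15_20",
    "part_d_1_15",
]
```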
Thanks -Amit Khandekar

Attachment: regress_locale_changes.patch
Re: [HACKERS] UPDATE of partition key
On 7 November 2017 at 00:33, Robert Haas wrote: > Also, +1 for Amit Langote's idea of trying to merge > mt_perleaf_childparent_maps with mt_persubplan_childparent_maps. Currently I am trying to see if it simplifies things if we do that. We will be merging these arrays into one, but we are adding a new int[] array that maps subplans to leaf partitions. Will get back with how it finally looks. Robert, Amit, I will get back to you on your other review comments. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 13 October 2017 at 00:29, Robert Haas wrote: > On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar > wrote: >> [ new patch ] > > + parallel_append > + Waiting to choose the next subplan during Parallel Append > plan > + execution. > + > + > > Probably needs to update a morerows values of some earlier entry. From what I observed in the other places, the morerows value is one less than the number of following entries. I have changed it to 10 since it has 11 entries. > > + enable_parallelappend configuration > parameter > > How about enable_parallel_append? Done. > > + * pa_finished : workers currently executing the subplan. A worker which > > The way the colon is used here is not a standard comment style for PostgreSQL. Changed it to "pa_finished:". > > + * Go on to the "next" subplan. If no more subplans, return the empty > + * slot set up for us by ExecInitAppend. > + * Note: Parallel-aware Append follows different logic for choosing > the > + * next subplan. > > Formatting looks wrong, and moreover I don't think this is the right > way of handling this comment anyway. Move the existing comment inside > the if (!node->padesc) block and leave it unchanged; the else block > explains the differences for parallel append. I think the first couple of lines do apply to both parallel-append and sequential append plans. I have moved the remaining couple of lines inside the else block. > > + *ExecAppendEstimate > + > + *estimates the space required to serialize Append node. > > Ugh, this is wrong, but I notice it follows various other > equally-wrong comments for other parallel-aware node types. I guess > I'll go fix that. We are not serializing the Append node here. I didn't clearly get this. Do you think it should be "space required to copy the Append node into the shared memory"? > > I do not think that it's a good idea to call > exec_append_parallel_next() from ExecAppendInitializeDSM, > ExecAppendReInitializeDSM, and ExecAppendInitializeWorker. 
We want to > postpone selecting which plan to run until we're actually ready to run > that plan. Otherwise, for example, the leader might seize a > non-partial plan (if only such plans are included in the Parallel > Append) when it isn't really necessary for it to do so. If the > workers would've reached the plans and started returning tuples to the > leader before it grabbed a plan, oh well, too bad. The leader's still > claimed that plan and must now run it. > > I concede that's not a high-probability scenario, but I still maintain > that it is better for processes not to claim a subplan until the last > possible moment. I think we need to initialize as_whichplan as > PA_INVALID plan and then fix it when ExecProcNode() is called for the > first time. Done. Set as_whichplan to PA_INVALID_PLAN in ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and ExecAppendInitializeWorker(). Then when ExecAppend() is called for the first time, we notice that as_whichplan is PA_INVALID_PLAN, that means we need to choose the plan. > > +if (!IsParallelWorker()) > > This is not a great test, because it would do the wrong thing if we > ever allowed an SQL function called from a parallel worker to run a > parallel query of its own. Currently that's not allowed but we might > want to allow it someday. What we really want to test is whether > we're the leader for *this* query. Maybe use a flag in the > AppendState for that, and set it correctly in > ExecAppendInitializeWorker. Done. Set a new AppendState->is_parallel_worker field to true in ExecAppendInitializeWorker(). > > I think maybe the loop in exec_append_parallel_next should look more like > this: > > /* Pick the next plan. */ > state->as_whichplan = padesc->pa_nextplan; > if (state->as_whichplan != PA_INVALID_PLAN) > { > int nextplan = state->as_whichplan; > > /* Mark non-partial plans done immediately so that they can't be > picked again. 
*/ > if (nextplan < first_partial_plan) > padesc->pa_finished[nextplan] = true; > > /* Figure out what plan the next worker should pick. */ > do > { > /* If we've run through all the plans, loop back through > partial plans only. */ > if (++nextplan >= state->as_nplans) > nextplan = first_partial_plan; > > /* No plans remaining or tried them all? Then give up. */ > if (nextplan == state->as_whichplan || nextplan >= state->as_nplans) > { > nextpla
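[Editorial note: the quoted C loop is cut off in the archive. As a rough behavioral model only — names and data structures here are hypothetical Python stand-ins, not the actual executor code — the shared-memory subplan handoff it sketches works like this:]

```python
INVALID_SUBPLAN_INDEX = -1

def choose_next_subplan(shared, nplans, first_partial_plan):
    """Model of the handoff: this process takes shared['next'], marks a
    non-partial plan finished so no one picks it twice, then advances
    'next' to the next unfinished plan, wrapping back to the first
    partial plan.  Returns the subplan index this process should run."""
    mine = shared["next"]
    if mine == INVALID_SUBPLAN_INDEX:
        return mine
    if mine < first_partial_plan:
        shared["finished"][mine] = True   # non-partial: one process only
    nxt = mine
    while True:
        nxt += 1
        if nxt >= nplans:
            nxt = first_partial_plan      # loop back over partial plans
        if nxt == mine or nxt >= nplans:
            nxt = INVALID_SUBPLAN_INDEX   # nothing left to hand out
            break
        if not shared["finished"][nxt]:
            break
    shared["next"] = nxt
    return mine

# With 3 subplans, subplan 0 non-partial: the non-partial plan is handed
# out once, then the partial plans are shared round-robin.
shared = {"next": 0, "finished": [False, False, False]}
picks = [choose_next_subplan(shared, 3, 1) for _ in range(4)]
assert picks == [0, 1, 2, 1]
```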
Re: [HACKERS] pgsql: Avoid coercing a whole-row variable that is already coerced
Bringing here the mail thread from pgsql-committers regarding this commit : commit 1c497fa72df7593d8976653538da3d0ab033207f Author: Robert Haas Date: Thu Oct 12 17:10:48 2017 -0400 Avoid coercing a whole-row variable that is already coerced. Marginal efficiency and beautification hack. I'm not sure whether this case ever arises currently, but the pending patch for update tuple routing will cause it to arise. Amit Khandekar Discussion: http://postgr.es/m/caj3gd9cazfppe7-wwubabpcq4_0subkipfd1+0r5_dkvnwo...@mail.gmail.com Tom Lane wrote: > Robert Haas wrote: > > Avoid coercing a whole-row variable that is already coerced. > > This logic seems very strange and more than likely buggy: the > added coerced_var flag feels like the sort of action at a distance > that is likely to have unforeseen side effects. > > I'm not entirely sure what the issue is here, but if you're concerned > about not applying two ConvertRowtypeExprs in a row, why not have the > upper one just strip off the lower one? We handle, eg, nested > RelabelTypes that way. We kind of do a similar thing. When a ConvertRowtypeExpr node is encountered, we create a new ConvertRowtypeExpr that points to a new var, and return this new ConvertRowtypeExpr instead of the existing one. So we actually replace the old with a new one. But additionally, we also want to change the vartype of the new var to context->to_rowtype. I think you are worried specifically about coerced_var causing unexpected regression in existing scenarios, such as : context->coerced_var getting set and prematurely unset in recursive scenarios. But note that, when we call map_variable_attnos_mutator() just after setting context->coerced_var = true, map_variable_attnos_mutator() won't recurse further, because it is always called with a Var, which does not have any further arguments to process. So coerced_var won't be again changed until we return from map_variable_attnos_mutator(). 
The only reason why we chose to call map_variable_attnos_mutator() with a Var is so that we can re-use the code that converts the whole row var. One thing we can do is: instead of calling map_variable_attnos_mutator(), convert the var inside the if block for "if (IsA(node, ConvertRowtypeExpr))". Please check the attached patch. There, I have avoided the coerced_var context variable. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company

Attachment: handle-redundant-ConvertRowtypeExpr-node.patch
Re: [HACKERS] Parallel Append implementation
On 9 October 2017 at 16:03, Amit Kapila wrote: > On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar > wrote: >> On 6 October 2017 at 08:49, Amit Kapila wrote: >>> >>> Okay, but why not cheapest partial path? >> >> I gave some thought on this point. Overall I feel it does not matter >> which partial path it should pick up. RIght now the partial paths are >> not ordered. But for non-partial paths sake, we are just choosing the >> very last path. So in case of mixed paths, leader will get a partial >> path, but that partial path would not be the cheapest path. But if we >> also order the partial paths, the same logic would then pick up >> cheapest partial path. The question is, should we also order the >> partial paths for the leader ? >> >> The only scenario I see where leader choosing cheapest partial path >> *might* show some benefit, is if there are some partial paths that >> need to do some startup work using only one worker. I think currently, >> parallel hash join is one case where it builds the hash table, but I >> guess here also, we support parallel hash build, but not sure about >> the status. >> > > You also need to consider how merge join is currently work in parallel > (each worker need to perform the whole of work of right side). Yes, here if the leader happens to take the right side, it may slow down the overall merge join. But this seems to be a different case than the case of high startup costs. > I think there could be more scenario's where the startup cost is high > and parallel worker needs to do that work independently. True. > > For such plan, if leader starts it, it would be slow, and >> no other worker would be able to help it, so its actual startup cost >> would be drastically increased. (Another path is parallel bitmap heap >> scan where the leader has to do something and the other workers wait. >> But here, I think it's not much work for the leader to do). 
So >> overall, to handle such cases, it's better for leader to choose a >> cheapest path, or may be, a path with cheapest startup cost. We can >> also consider sorting partial paths with decreasing startup cost. >> > > Yeah, that sounds reasonable. Attached patch sorts partial paths by descending startup cost. On 6 October 2017 at 08:49, Amit Kapila wrote: > On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar wrote: >> >> Ok. How about removing pa_all_partial_subpaths altogether , and >> instead of the below condition : >> >> /* >> * If all the child rels have partial paths, and if the above Parallel >> * Append path has a mix of partial and non-partial subpaths, then consider >> * another Parallel Append path which will have *all* partial subpaths. >> * If enable_parallelappend is off, make this one non-parallel-aware. >> */ >> if (partial_subpaths_valid && !pa_all_partial_subpaths) >> .. >> >> Use this condition : >> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL) >> .. >> > > Sounds good to me. Did this. Here is the new condition I used along with the comments explaining it : +* If parallel append has not been added above, or the added one has a mix +* of partial and non-partial subpaths, then consider another Parallel +* Append path which will have *all* partial subpaths. We can add such a +* path only if all childrels have partial paths in the first place. This +* new path will be parallel-aware unless enable_parallelappend is off. */ - if (partial_subpaths_valid && !pa_all_partial_subpaths) + if (partial_subpaths_valid && + (!pa_subpaths_valid || pa_nonpartial_subpaths != NIL)) Also added some test scenarios. 
On 6 October 2017 at 12:03, Amit Khandekar wrote: > On 6 October 2017 at 08:49, Amit Kapila wrote: >> >> One minor point: >> >> + if (!node->as_padesc) >> + { >> + /* >> + */ >> + if (!exec_append_seq_next(node)) >> + return ExecClearTuple(node->ps.ps_ResultTupleSlot); >> + } >> >> It seems either you want to add a comment in above part of patch or >> you just left /**/ mistakenly. > > Oops. Yeah, the comment wrapper remained there when I moved its > content "This is Parallel-aware append. Follow its own logic ..." out > of the if block. Removed the comment wrapper. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company

Attachment: ParallelAppend_v17.patch
Re: [HACKERS] Parallel Append implementation
On 6 October 2017 at 08:49, Amit Kapila wrote: > On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar wrote: >> >> Ok. How about removing pa_all_partial_subpaths altogether , and >> instead of the below condition : >> >> /* >> * If all the child rels have partial paths, and if the above Parallel >> * Append path has a mix of partial and non-partial subpaths, then consider >> * another Parallel Append path which will have *all* partial subpaths. >> * If enable_parallelappend is off, make this one non-parallel-aware. >> */ >> if (partial_subpaths_valid && !pa_all_partial_subpaths) >> .. >> >> Use this condition : >> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL) >> .. >> > > Sounds good to me. > > One minor point: > > + if (!node->as_padesc) > + { > + /* > + */ > + if (!exec_append_seq_next(node)) > + return ExecClearTuple(node->ps.ps_ResultTupleSlot); > + } > > It seems either you want to add a comment in above part of patch or > you just left /**/ mistakenly. Oops. Yeah, the comment wrapper remained there when I moved its content "This is Parallel-aware append. Follow its own logic ..." out of the if block. Since this is too small a change for an updated patch, I will do this along with any other changes that would be required as the review progresses. > >> >> >> >> Regarding a mix of partial and non-partial paths, I feel it always >> makes sense for the leader to choose the partial path. >> > > Okay, but why not cheapest partial path? I gave some thought on this point. Overall I feel it does not matter which partial path it should pick up. Right now the partial paths are not ordered. But for the non-partial paths' sake, we are just choosing the very last path. So in case of mixed paths, the leader will get a partial path, but that partial path would not be the cheapest path. But if we also order the partial paths, the same logic would then pick up the cheapest partial path. The question is, should we also order the partial paths for the leader? 
The only scenario I see where the leader choosing the cheapest partial path *might* show some benefit is if there are some partial paths that need to do some startup work using only one worker. I think currently, parallel hash join is one case where it builds the hash table, though I guess work to support a parallel hash build is also going on; I am not sure about its status. For such a plan, if the leader starts it, it would be slow, and no other worker would be able to help it, so its actual startup cost would be drastically increased. (Another path is parallel bitmap heap scan, where the leader has to do something and the other workers wait. But here, I think it's not much work for the leader to do.) So overall, to handle such cases, it's better for the leader to choose the cheapest path, or maybe a path with the cheapest startup cost. We can also consider sorting partial paths by decreasing startup cost. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 30 September 2017 at 19:21, Amit Kapila wrote: > On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar > wrote: >> On 16 September 2017 at 10:42, Amit Kapila wrote: >>> >>> At a broader level, the idea is good, but I think it won't turn out >>> exactly like that considering your below paragraph which indicates >>> that it is okay if leader picks a partial path that is costly among >>> other partial paths as a leader won't be locked with that. >>> Considering this is a good design for parallel append, the question is >>> do we really need worker and leader to follow separate strategy for >>> choosing next path. I think the patch will be simpler if we can come >>> up with a way for the worker and leader to use the same strategy to >>> pick next path to process. How about we arrange the list of paths >>> such that first, all partial paths will be there and then non-partial >>> paths and probably both in decreasing order of cost. Now, both leader >>> and worker can start from the beginning of the list. In most cases, >>> the leader will start at the first partial path and will only ever >>> need to scan non-partial path if there is no other partial path left. >>> This is not bulletproof as it is possible that some worker starts >>> before leader in which case leader might scan non-partial path before >>> all partial paths are finished, but I think we can avoid that as well >>> if we are too worried about such cases. >> >> If there are no partial subpaths, then again the leader is likely to >> take up the expensive subpath. And this scenario would not be >> uncommon. >> > > While thinking about how common the case of no partial subpaths would > be, it occurred to me that as of now we always create a partial path > for the inheritance child if it is parallel-safe and the user has not > explicitly set the value of parallel_workers to zero (refer > compute_parallel_worker). 
So, unless you are planning to change that, > I think it will be quite uncommon to have no partial subpaths. There are still some cases where non-partial paths are cheaper than the partial paths. Also, there can be UNION ALL queries which could have non-partial subpaths. I guess this has already been discussed in the other replies. > > Few nitpicks in your latest patch: > 1. > @@ -298,6 +366,292 @@ ExecReScanAppend(AppendState *node) > if (subnode->chgParam == NULL) > ExecReScan(subnode); > } > + > > Looks like a spurious line. > > 2. > @@ -1285,7 +1291,11 @@ add_paths_to_append_rel(PlannerInfo *root, > RelOptInfo *rel, > .. > + if (chosen_path && chosen_path != cheapest_partial_path) > + pa_all_partial_subpaths = false; > > It will keep on setting pa_all_partial_subpaths as false for > non-partial paths which don't seem to be the purpose of this variable. > I think you want it to be set even when there is one non-partial path, > so isn't it better to write as below or something similar: > if (pa_nonpartial_subpaths && pa_all_partial_subpaths) > pa_all_partial_subpaths = false; Ok. How about removing pa_all_partial_subpaths altogether, and instead of the below condition: /* * If all the child rels have partial paths, and if the above Parallel * Append path has a mix of partial and non-partial subpaths, then consider * another Parallel Append path which will have *all* partial subpaths. * If enable_parallelappend is off, make this one non-parallel-aware. */ if (partial_subpaths_valid && !pa_all_partial_subpaths) .. Use this condition: if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL) .. Regarding a mix of partial and non-partial paths, I feel it always makes sense for the leader to choose the partial path. If it chooses a non-partial path, no other worker would be able to help finish that path. Among the partial paths, whether it chooses the cheapest one or expensive one does not matter, I think. We have the partial paths unordered. 
So whether it starts from the last partial path or the first partial path should not matter. Regarding scenario where all paths are non-partial, here is an e.g. : Suppose we have 4 child paths with costs : 10 5 5 3, and with 2 workers plus one leader. And suppose the leader takes additionally 1/4th of these costs to process the returned tuples. If leader takes least expensive one (3) : 2 workers will finish 10, 5, 5 in 10 units, and leader simultaneously chooses the plan with cost 3, and so it takes 3 + (1/4)(10 + 5 + 5 + 3) = 9 units. So the total time taken by Append is : 10. Whereas if leader takes most expensive one (10) : 10 + .25 (total) = 10 + 6 = 16 The 2 workers will finish 2nd, 3rd and 4th plan
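[Editorial note: the arithmetic in the example above can be double-checked with a toy model. This is a sketch only; the 1/4 leader overhead and the greedy longest-first assignment of subplans to workers are assumptions taken from the example, not from the patch:]

```python
import heapq

def append_total_time(costs, leader_pick, nworkers=2, leader_frac=0.25):
    """Toy model of the example: the leader runs one subplan plus
    leader_frac of the total cost (processing returned tuples); the
    workers greedily take the remaining subplans, longest first.
    Returns the overall elapsed time of the Append."""
    rest = list(costs)
    rest.remove(leader_pick)
    rest.sort(reverse=True)           # hand out the longest subplans first
    workers = [0.0] * nworkers        # per-worker accumulated load (a heap)
    for c in rest:                    # each subplan goes to the least-loaded worker
        heapq.heappush(workers, heapq.heappop(workers) + c)
    leader = leader_pick + leader_frac * sum(costs)
    return max(max(workers), leader)

costs = [10, 5, 5, 3]
# Leader picks the cheapest subplan (3): workers need 10 units, leader ~8.75.
assert append_total_time(costs, 3) == 10
# Leader picks the most expensive (10): the leader alone needs 15.75 units.
assert append_total_time(costs, 10) == 15.75
```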
Re: [HACKERS] UPDATE of partition key
On 30 September 2017 at 01:26, Robert Haas wrote: > On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas wrote: >> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar >> wrote: >>> The patch for the above change is : >>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch >> >> Thinking about this a little more, I'm wondering about how this case >> arises. I think that for this patch to avoid multiple conversions, >> we'd have to be calling map_variable_attnos on an expression and then >> calling map_variable_attnos on that expression again. We are not calling map_variable_attnos() twice. The first time it calls, there is already the ConvertRowtypeExpr node if the expression is a whole row var. This node is already added from adjust_appendrel_attrs(). So the conversion is done by two different functions. For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with coerced_var=true. > > I guess I didn't quite finish this thought, sorry. Maybe it's > obvious, but the point I was going for is: why would we do that, vs. > just converting once? The first time ConvertRowtypeExpr node gets added in the expression is when adjust_appendrel_attrs() is called for each of the child tables. Here, for each of the child table, when the parent parse tree is converted into the child parse tree, the whole row var (in RETURNING or WITH CHECK OPTIONS expr) is wrapped with ConvertRowtypeExpr(), so child parse tree (or the child WCO expr) has this ConvertRowtypeExpr node. The second time this node is added is during update-tuple-routing in ExecInitModifyTable(), when map_partition_varattnos() is called for each of the partitions to convert from the first per-subplan RETURNING/WCO expression to the RETURNING/WCO expression belonging to the leaf partition. This second conversion happens for the leaf partitions which are not already present in per-subplan UPDATE result rels. 
So the first conversion is from parent to child while building per-subplan plans, and the second is from the first per-subplan child to another child, for building the expressions of the leaf partitions. So suppose the root partitioned table's RETURNING expression is a whole-row Var wr(r), where r is its composite type representing the root table type. Then, one of its UPDATE child tables will have its RETURNING expression converted like this: wr(r) ===> CRE(r) -> wr(c1) where CRE(r) represents a ConvertRowtypeExpr of result type r, whose arg points to wr(c1), a whole-row Var of composite type c1 for the child table c1. So this node converts from the composite type of the child table to the composite type of the root table. Now, when the second conversion occurs for the leaf partition (i.e. during update tuple routing), the conversion looks like this: CRE(r) -> wr(c1) ===> CRE(r) -> wr(c2) But without the 0002*ConvertRowtypeExpr*.patch, the conversion would have looked like this: CRE(r) -> wr(c1) ===> CRE(r) -> CRE(c1) -> wr(c2) In short, we omit the intermediate CRE(c1) node. While writing this down, I observed that after multi-level partition tree expansion was introduced, the child table expressions are not converted directly from the root. Instead, they are converted from their immediate parent. So there is a chain of conversions: to the leaf from its parent, to that parent from its parent, and so on from the root. Effectively, during the first conversion, that many ConvertRowtypeExpr nodes, stacked one above the other, are already present in the UPDATE result rel expressions. But my patch handles the optimization only for the leaf partition conversions.
If the parent already has a CRE: CRE(rr) -> wr(r) Parent-to-child conversion: CRE(rr) -> wr(r) ===> CRE(rr) -> CRE(r) -> wr(c1) With patch: CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Append implementation
On 20 September 2017 at 11:32, Amit Khandekar wrote: > There is still the open point being > discussed : whether to have non-parallel-aware partial Append path, or > always have parallel-aware Append paths. Attached is the revised patch v16. In previous versions, we used to add a non-parallel-aware partial Append path containing all partial subpaths if the already-added Parallel Append path did not itself consist of all-partial subpaths. Now in the patch, when we add such an Append path containing all partial subpaths, we make it parallel-aware (unless enable_parallelappend is false). So in this case, there will be a parallel-aware Append path containing one or more non-partial subpaths, and there will be another parallel-aware Append path containing all-partial subpaths. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company ParallelAppend_v16.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
Below are some performance figures. Overall, there does not appear to be a noticeable difference between partition-key updates with and without row movement (which is surprising), or between non-partition-key updates with and without the patch. All the values are in milliseconds. Configuration : shared_buffers = 8GB maintenance_work_mem = 4GB synchronous_commit = off checkpoint_timeout = 15min checkpoint_completion_target = 0.9 log_line_prefix = '%t [%p] ' max_wal_size = 5GB max_connections = 200 The attached files were used to create a partition tree made up of 16 partitioned tables, each containing 125 partitions. The first half of the 2000 partitions is filled with 10 million rows. Update row movement moves the data to the other half of the partitions. gen.sql : Creates the partitions. insert.data : This data file is uploaded here [1]. Used "COPY ptab from '$PWD/insert.data' " index.sql : Optionally, creates an index on column d. The schema looks like this : CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b); CREATE TABLE ptab_1_1 PARTITION OF ptab for values from ('1900-01-01', 1) to ('1900-01-01', 7501) PARTITION BY range (c); CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1 for values from (1) to (81); CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1 for values from (81) to (161); .. .. CREATE TABLE ptab_1_2 PARTITION OF ptab for values from ('1900-01-01', 7501) to ('1900-01-01', 15001) PARTITION BY range (c); .. .. On 20 September 2017 at 00:06, Robert Haas wrote: > I wonder how much more expensive it > is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with > 1000 subpartitions with this patch than without, assuming the update > succeeds in both cases. UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d is not a partition key of any of the partitions. This query updates 8 rows out of 10 million rows.
With HEAD : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876) With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834) (All the values are in milliseconds.) > suppose you make a table with 1000 partitions each containing > 10,000 tuples and update them all, and consider three scenarios: (1) > partition key not updated but all tuples subject to non-HOT updates > because the updated column is indexed, (2) partition key updated but > no tuple movement required as a result, (3) partition key updated and > all tuples move to a different partition. Note that the following figures are not perfectly reproducible; they keep varying across runs. For example, even though the partition-key update without row movement appears to have taken a bit more time with the patch than with HEAD, a new set of test runs might even end up the other way round.
NPK : 42089 (patch)
NPKI : 81593 (patch)
PK : 45250 (patch) , 44944 (HEAD)
PKR : 46701 (patch)
The above figures are in milliseconds. The explanations of the above short-forms : NPK : Update of a column that is not a partition key. UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows. NPKI : Update of a column that is not a partition key, and this column is indexed (used attached file index.sql). UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows. PK : Update of partition key, but row movement does not occur. There are no indexed columns. UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ; PKR : Update of partition key, with all rows moved to other partitions. There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ; [1] https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company gen.tar.gz Description: GNU Zip compressed data index.tar.gz Description: GNU Zip compressed data
Re: [HACKERS] UPDATE of partition key
On 21 September 2017 at 19:52, amul sul wrote: > On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar > wrote: >> >> On 20 September 2017 at 00:06, Robert Haas wrote: >> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar >> > wrote: >> >> [ new patch ] > > > 86 - (event == TRIGGER_EVENT_UPDATE && > !trigdesc->trig_update_after_row)) > 87 + (event == TRIGGER_EVENT_UPDATE && > !trigdesc->trig_update_after_row) || > 88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup > == NULL))) > 89 return; > 90 } > > > Either of oldtup or newtup will be valid at a time & vice versa. Can we > improve > this check accordingly? > > For e.g.: > (event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^ > ItemPointerIsValid(newtup))) Ok, I will be doing this as below : - (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL))) + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) At other places in the function, oldtup and newtup are checked for NULL, so to be consistent, I haven't used HeapTupleIsValid. Actually, it won't happen that both oldtup and newtup are NULL in any of delete, insert, or update, but I haven't added an Assert for this, because that has been true even on HEAD. Will include the above minor change in the next patch when more changes come in. > > > 247 > 248 + /* > 249 +* EDB: In case this is part of update tuple routing, put this row > into the > 250 +* transition NEW TABLE if we are capturing transition tables. We > need to > 251 +* do this separately for DELETE and INSERT because they happen on > 252 +* different tables.
> 253 +*/ > 254 + if (mtstate->operation == CMD_UPDATE && > mtstate->mt_transition_capture) > 255 + ExecARUpdateTriggers(estate, resultRelInfo, NULL, > 256 +NULL, > 257 +tuple, > 258 +NULL, > 259 +mtstate->mt_transition_capture); > 260 + > 261 list_free(recheckIndexes); > > 267 > 268 + /* > 269 +* EDB: In case this is part of update tuple routing, put this row > into the > 270 +* transition OLD TABLE if we are capturing transition tables. We > need to > 271 +* do this separately for DELETE and INSERT because they happen on > 272 +* different tables. > 273 +*/ > 274 + if (mtstate->operation == CMD_UPDATE && > mtstate->mt_transition_capture) > 275 + ExecARUpdateTriggers(estate, resultRelInfo, tupleid, > 276 +oldtuple, > 277 +NULL, > 278 +NULL, > 279 +mtstate->mt_transition_capture); > 280 + > > Initially, I wondered why we can't have the above code right after > ExecInsert() & ExecDelete() in ExecUpdate respectively? > > We can do that for ExecDelete() but not easily in the ExecInsert() case, > because ExecInsert() internally searches the correct partition's > resultRelInfo > for an insert, and before returning to ExecUpdate resultRelInfo is restored > to the old one. That's why the current logic seems to be reasonable for now. > Is there anything that we can do? Yes, resultRelInfo is different when we return from ExecInsert(). Also, I think the trigger and transition capture should be done immediately after the rows are inserted. This is true for the existing code also. Furthermore, there is a dependency of ExecARUpdateTriggers() on ExecARInsertTriggers(): transition_capture is passed as NULL if we have already captured the tuple in ExecARUpdateTriggers(). It looks simpler to do all this at a single place. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 20 September 2017 at 00:06, Robert Haas wrote: > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar > wrote: >> [ new patch ] > > This already fails to apply again. In general, I think it would be a > good idea to break this up into a patch series rather than have it as > a single patch. That would allow some bits to be applied earlier. > The main patch will probably still be pretty big, but at least we can > make things a little easier by getting some of the cleanup out of the > way first. Specific suggestions on what to break out below. > > If the changes to rewriteManip.c are a marginal efficiency hack and > nothing more, then let's commit this part separately before the main > patch. If they're necessary for correctness, then please add a > comment explaining why they are necessary. Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes one over the other. But that was not causing any correctness issue. Will extract these changes into a separate patch. > > There appears to be no reason why the definitions of > GetInsertedColumns() and GetUpdatedColumns() need to be moved to a > header file as a result of this patch. GetUpdatedColumns() was > previously defined in trigger.c and execMain.c and, post-patch, is > still called from only those files. GetInsertedColumns() was, and > remains, called only from execMain.c. If this were needed I'd suggest > doing it as a preparatory patch before the main patch, but it seems we > don't need it at all. In earlier versions of the patch, these functions were used in nodeModifyTable.c as well. Now that those calls are not there in this file, I will revert the changes done for moving the definitions into a header file. > > If I understand correctly, the reason for changing mt_partitions from > ResultRelInfo * to ResultRelInfo ** is that, currently, all of the > ResultRelInfos for a partitioning hierarchy are allocated as a single > chunk, but we can't do that and also reuse the ResultRelInfos created > during InitPlan.
I suggest that we do this as a preparatory patch. Ok, will prepare a separate patch. Do you mean to include in that patch the changes I did in ExecSetupPartitionTupleRouting() that re-use the ResultRelInfo structures of per-subplan update result rels ? > Someone could argue that this is going the wrong way and that we ought > to instead make InitPlan() create all of the necessarily > ResultRelInfos, but it seems to me that eventually we probably want to > allow setting up ResultRelInfos on the fly for only those partitions > for which we end up needing them. The code already has some provision > for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel. > I don't think it's this patch's job to try to apply that kind of thing > to tuple routing, but it seems like in the long run if we're inserting > 1 tuple into a table with 1000 partitions, or performing 1 update that > touches the partition key, it would be best not to create > ResultRelInfos for all 1000 partitions just for fun. Yes makes sense. > But this sort of > thing seems much easier of mt_partitions is ResultRelInfo ** rather > than ResultRelInfo *, so I think what you have is going in the right > direction. Ok. > > + * mtstate->resultRelInfo[]. Note: We assume that if the > resultRelInfo > + * does not belong to subplans, then it already matches the root > tuple > + * descriptor; although there is no such known scenario where this > + * could happen. > + */ > +if (rootResultRelInfo != resultRelInfo && > +mtstate->mt_persubplan_childparent_maps != NULL && > +resultRelInfo >= mtstate->resultRelInfo && > +resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1) > +{ > +int map_index = resultRelInfo - mtstate->resultRelInfo; > > I think you should Assert() that it doesn't happen instead of assuming > that it doesn't happen. IOW, remove the last two branches of the > if-condition, and then add an Assert() that map_index is sane. Ok. 
> > It is not clear to me why we need both mt_perleaf_childparent_maps and > mt_persubplan_childparent_maps. mt_perleaf_childparent_maps : This is used for converting transition-captured inserted/modified/deleted tuples from leaf to root partition, because we need to have all the ROWS in the root partition attribute order. This map is used only for tuples that are routed from root to leaf partition during INSERT, or when tuples are routed from one leaf partition to another leaf partition during update row movement. For both of these operations, we need per-leaf maps, because during tuple conversion, the
Re: [HACKERS] Parallel Append implementation
On 11 September 2017 at 18:55, Amit Kapila wrote: >>> 1. >>> + else if (IsA(subpath, MergeAppendPath)) >>> + { >>> + MergeAppendPath *mpath = (MergeAppendPath *) subpath; >>> + >>> + /* >>> + * If at all MergeAppend is partial, all its child plans have to be >>> + * partial : we don't currently support a mix of partial and >>> + * non-partial MergeAppend subpaths. >>> + */ >>> + if (is_partial) >>> + return list_concat(partial_subpaths, list_copy(mpath->subpaths)); >>> >>> In which situation partial MergeAppendPath is generated? Can you >>> provide one example of such path? >> >> Actually, we don't currently support partial paths for MergeAppendPath. >> That code just has that if condition (is_partial), but currently that >> condition won't be true for MergeAppendPath. >> > > I think then it is better to have Assert for MergeAppend. It is > generally not a good idea to add code which we can never hit. Done. > One more thing, I think you might want to update comment atop > add_paths_to_append_rel to reflect the new type of parallel-aware > append path being generated. In particular, I am referring to below > part of the comment: > > * Similarly it collects partial paths from > * non-dummy children to create partial append paths. > */ > static void > add_paths_to_append_rel() > Added comments. Attached revised patch v15. There is still the open point being discussed : whether to have a non-parallel-aware partial Append path, or to always have parallel-aware Append paths. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company ParallelAppend_v15.patch Description: Binary data
Re: [HACKERS] Parallel Append implementation
On 16 September 2017 at 10:42, Amit Kapila wrote: > On Thu, Sep 14, 2017 at 9:41 PM, Robert Haas wrote: >> On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila wrote: >>> I think the patch stores only non-partial paths in decreasing order, >>> what if partial paths having more costs follows those paths? >> >> The general picture here is that we don't want the leader to get stuck >> inside some long-running operation because then it won't be available >> to read tuples from the workers. On the other hand, we don't want to >> just have the leader do no work because that might be slow. And in >> most cases, the leader will be the first participant to arrive at >> the Append node, because of the worker startup time. So the idea is >> that the workers should pick expensive things first, and the leader >> should pick cheap things first. >> > > At a broader level, the idea is good, but I think it won't turn out > exactly like that considering your below paragraph which indicates > that it is okay if leader picks a partial path that is costly among > other partial paths as a leader won't be locked with that. > Considering this is a good design for parallel append, the question is > do we really need worker and leader to follow separate strategy for > choosing next path. I think the patch will be simpler if we can come > up with a way for the worker and leader to use the same strategy to > pick next path to process. How about we arrange the list of paths > such that first, all partial paths will be there and then non-partial > paths and probably both in decreasing order of cost. Now, both leader > and worker can start from the beginning of the list. In most cases, > the leader will start at the first partial path and will only ever > need to scan non-partial path if there is no other partial path left.
> This is not bulletproof as it is possible that some worker starts > before leader in which case leader might scan non-partial path before > all partial paths are finished, but I think we can avoid that as well > if we are too worried about such cases. If there are no partial subpaths, then again the leader is likely to take up the expensive subpath, and this scenario would not be uncommon. For this scenario at least, we anyway have to make the leader start from the cheapest one, so we will have to maintain code for that logic. Since we have to maintain that logic anyway, why not use the same logic for the leader in all cases. Otherwise, if we try to come up with a common logic that conditionally chooses a different next plan for leader or worker, that logic would most probably get more complicated than the current state. Also, I don't see any performance issue if the leader is running backwards while the others are going forwards. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 18 September 2017 at 20:45, Dilip Kumar wrote: > Please find few more comments. > > + * in which they appear in the PartitionDesc. Also, extract the > + * partition key columns of the root partitioned table. Those of the > + * child partitions would be collected during recursive expansion. > */ > + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation); > expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, > lockmode, &root->append_rel_list, > + &all_part_cols, > > pcinfo->all_part_cols is only used in case of update, I think we can > call pull_child_partition_columns > only if rte has updateCols? > > @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo > > Index parent_relid; > List *child_rels; > + Bitmapset *all_part_cols; > } PartitionedChildRelInfo; > > I might be missing something, but do we really need to store > all_part_cols inside the > PartitionedChildRelInfo, can't we call pull_child_partition_columns > directly inside > inheritance_planner whenever we realize that RTE has some updateCols > and we want to > check the overlap? One thing we will have to do extra is : Open and close the partitioned rels again. The idea was that we collect the bitmap *while* we are already expanding through the tree and the rel is open. Will check if this is feasible. > > +extern void partition_walker_init(PartitionWalker *walker, Relation rel); > +extern Relation partition_walker_next(PartitionWalker *walker, > + Relation *parent); > + > > I don't see these functions are used anywhere? > > +typedef struct PartitionWalker > +{ > + List *rels_list; > + ListCell *cur_cell; > +} PartitionWalker; > + > > Same as above Yes, this was left out from the earlier implementation. Will have this removed in the next updated patch. 
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 16 September 2017 at 11:45, Amit Kapila wrote: > On Thu, Sep 14, 2017 at 8:30 PM, Amit Khandekar > wrote: >> On 11 September 2017 at 18:55, Amit Kapila wrote: >>>> >>> >>> How? See, if you have four partial subpaths and two non-partial >>> subpaths, then for parallel-aware append it considers all six paths in >>> parallel path whereas for non-parallel-aware append it will consider >>> just four paths and that too with sub-optimal strategy. Can you >>> please try to give me some example so that it will be clear. >> >> Suppose 4 appendrel children have costs for their cheapest partial (p) >> and non-partial paths (np) as shown below : >> >> p1=5000 np1=100 >> p2=200 np2=1000 >> p3=80 np3=2000 >> p4=3000 np4=50 >> >> Here, following two Append paths will be generated : >> >> 1. a parallel-aware Append path with subpaths : >> np1, p2, p3, np4 >> >> 2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths: >> p1,p2,p3,p4 >> >> Now, one thing we can do above is : Make the path#2 parallel-aware as >> well; so both Append paths would be parallel-aware. >> > > Yes, we can do that and that is what I think is probably better. So, > the question remains that in which case non-parallel-aware partial > append will be required? Basically, it is not clear to me why after > having parallel-aware partial append we need non-parallel-aware > version? Are you keeping it for the sake of backward-compatibility or > something like for cases if someone has disabled parallel append with > the help of new guc in this patch? Yes one case is the enable_parallelappend GUC case. If a user disables it, we do want to add the usual non-parallel-aware append partial path. About backward compatibility, the concern we discussed in [1] was that we better continue to have the usual non-parallel-aware partial Append path, plus we should have an additional parallel-aware Append path containing mix of partial and non-partial subpaths. 
But thinking again about the example above, Amit, I tend to agree that we don't have to worry about the existing behaviour, and so we can make path #2 parallel-aware as well. Robert, could you please give your opinion on the paths chosen in the above example? [1] https://www.postgresql.org/message-id/CA%2BTgmoaLRtaWdJVHfhHej2s7w1spbr6gZiZXJrM5bsz1KQ54Rw%40mail.gmail.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 11 September 2017 at 18:55, Amit Kapila wrote: >>> Do you think non-parallel-aware Append >>> will be better in any case when there is a parallel-aware append? I >>> mean to say let's try to create non-parallel-aware append only when >>> parallel-aware append is not possible. >> >> By non-parallel-aware append, I am assuming you meant partial >> non-parallel-aware Append. Yes, if the parallel-aware Append path has >> *all* partial subpaths chosen, then we do omit a partial non-parallel >> Append path, as seen in this code in the patch : >> >> /* >> * Consider non-parallel partial append path. But if the parallel append >> * path is made out of all partial subpaths, don't create another partial >> * path; we will keep only the parallel append path in that case. >> */ >> if (partial_subpaths_valid && !pa_all_partial_subpaths) >> { >> .. >> } >> >> But if the parallel-Append path has a mix of partial and non-partial >> subpaths, then we can't really tell which of the two could be cheapest >> until we calculate the cost. It can be that the non-parallel-aware >> partial Append can be cheaper as well. >> > > How? See, if you have four partial subpaths and two non-partial > subpaths, then for parallel-aware append it considers all six paths in > parallel path whereas for non-parallel-aware append it will consider > just four paths and that too with sub-optimal strategy. Can you > please try to give me some example so that it will be clear. Suppose 4 appendrel children have costs for their cheapest partial (p) and non-partial paths (np) as shown below : p1=5000 np1=100 p2=200 np2=1000 p3=80 np3=2000 p4=3000 np4=50 Here, following two Append paths will be generated : 1. a parallel-aware Append path with subpaths : np1, p2, p3, np4 2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths: p1,p2,p3,p4 Now, one thing we can do above is : Make the path#2 parallel-aware as well; so both Append paths would be parallel-aware. Are you suggesting exactly this ? 
So above, what I am saying is, we can't tell which of paths #1 and #2 is cheaper until we calculate the total cost. I didn't understand what you meant by "non-parallel-aware append will consider only the partial subpaths and that too with sub-optimal strategy" in the above example. I guess you were considering a different scenario than the above one. Whereas, if one or more children of the Append do not have a partial subpath in the first place, then a non-parallel-aware partial Append is out of the question, which we both agree on. And the other case where we skip the non-parallel-aware partial Append is when all the cheapest subpaths of the parallel-aware Append path are partial paths: we do not want parallel-aware and non-parallel-aware Append paths both having exactly the same partial subpaths. - I will be addressing your other comments separately. Thanks -Amit Khandekar
Re: [HACKERS] expanding inheritance in partition bound order
On 14 September 2017 at 06:43, Amit Langote > langote_amit...@lab.ntt.co.jp> wrote: > Attached updated patch. @@ -1222,151 +1209,130 @@ PartitionDispatch * RelationGetPartitionDispatchInfo(Relation rel, int *num_parted, List **leaf_part_oids) { + List *pdlist; PartitionDispatchData **pd; + get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids); Above, pdlist is passed uninitialized, and then inside get_partition_dispatch_recurse(), it is used here : *pds = lappend(*pds, pd); pgindent says more alignment is needed. For example, the gettext_noop() call below needs to be aligned: pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent), tupdesc, gettext_noop("could not convert row type")); Other than that, the patch looks good to me. I verified that the leaf oids are ordered exactly in the order of the UPDATE subplans' output. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] expanding inheritance in partition bound order
On 13 September 2017 at 15:32, Amit Langote wrote: > On 2017/09/11 18:56, Amit Langote wrote: >> Attached updated patch does it that way for both partitioned table indexes >> and leaf partition indexes. Thanks for pointing it out. > > It seems to me we don't really need the first patch all that much. That > is, let's keep PartitionDispatchData the way it is for now, since we don't > really have any need for it beside tuple-routing (EIBO as committed didn't > need it for one). So, let's forget about "decoupling > RelationGetPartitionDispatchInfo() from the executor" thing for now and > move on. > > So, attached is just the patch to make RelationGetPartitionDispatchInfo() > traverse the partition tree in depth-first manner to be applied on HEAD. > > Thoughts? +1. If at all we need the decoupling later for some reason, we can do that incrementally. Will review your latest patch by tomorrow. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 13 September 2017 at 13:05, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar > wrote: >> Hi, >> >> Rafia had done some testing on TPCH queries using Partition-wise join >> patch along with Parallel Append patch. >> >> There, we had observed that for query 4, even though the partition >> wise joins are under a Parallel Append, the join are all non-partial. >> >> Specifically, the partition-wise join has non-partial nested loop >> joins when actually it was expected to have partial nested loop joins. >> (The difference can be seen by the observation that the outer relation >> of that join is scanned by non-parallel Bitmap Heap scan when it >> should have used Parallel Bitmap Heap Scan). >> >> Here is the detailed analysis , including where I think is the issue : >> >> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com >> >> All the TPCH results are posted in the same above mail thread. > > Can you please check if the attached patch fixes the issue. Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins are generated now. If I see any unexpected differences in the estimated or actual costs, I will report that in the Parallel Append thread. As far as Partition-wise join is concerned, this issue is solved, because Partial nested loop join does get created. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Hi,

Rafia had done some testing on TPCH queries using the Partition-wise join patch along with the Parallel Append patch.

There, we had observed that for query 4, even though the partition-wise joins are under a Parallel Append, the joins are all non-partial.

Specifically, the partition-wise join has non-partial nested loop joins when it was actually expected to have partial nested loop joins. (The difference can be seen from the observation that the outer relation of that join is scanned by a non-parallel Bitmap Heap Scan when it should have used a Parallel Bitmap Heap Scan.)

Here is the detailed analysis, including where I think the issue is:

https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com

All the TPCH results are posted in the same mail thread above.

Thanks
-Amit
Re: [HACKERS] why not parallel seq scan for slow functions
On 5 September 2017 at 14:04, Amit Kapila wrote: > On Fri, Aug 25, 2017 at 10:08 PM, Robert Haas wrote: >> On Mon, Aug 21, 2017 at 5:08 AM, Amit Kapila wrote: >>> (b) I have changed the costing of gather path for path target in >>> generate_gather_paths which I am not sure is the best way. Another >>> possibility could have been that I change the code in >>> apply_projection_to_path as done in the previous patch and just call >>> it from generate_gather_paths. I have not done that because of your >>> comment above thread ("is probably unsafe, because it might confuse >>> code that reaches the modified-in-place path through some other >>> pointer (e.g. code which expects the RelOptInfo's paths to still be >>> sorted by cost)."). It is not clear to me what exactly is bothering >>> you if we directly change costing in apply_projection_to_path. >> >> The point I was trying to make is that if you retroactively change the >> cost of a path after you've already done add_path(), it's too late to >> change your mind about whether to keep the path. At least according >> to my current understanding, that's the root of this problem in the >> first place. On top of that, add_path() and other routines that deal >> with RelOptInfo path lists expect surviving paths to be ordered by >> descending cost; if you frob the cost, they might not be properly >> ordered any more. >> > > Okay, now I understand your point, but I think we already change the > cost of paths in apply_projection_to_path which is done after add_path > for top level scan/join paths. I think this matters a lot in case of > Gather because the cost of computing target list can be divided among > workers. I have changed the patch such that parallel paths for top > level scan/join node will be generated after pathtarget is ready. I > had kept the costing of path targets local to > apply_projection_to_path() as that makes the patch simple. I started with a quick review ... 
A couple of comments below:

- * If this is a baserel, consider gathering any partial paths we may have
- * created for it. (If we tried to gather inheritance children, we could
+ * If this is a baserel and not the only rel, consider gathering any
+ * partial paths we may have created for it. (If we tried to gather

  /* Create GatherPaths for any useful partial paths for rel */
- generate_gather_paths(root, rel);
+ if (lev < levels_needed)
+     generate_gather_paths(root, rel, NULL);

I think at the above two places, and maybe in other places also, it's better to mention the reason why we should generate the gather path only if it's not the only rel.

--

- if (rel->reloptkind == RELOPT_BASEREL)
-     generate_gather_paths(root, rel);
+ if (rel->reloptkind == RELOPT_BASEREL && root->simple_rel_array_size > 2)
+     generate_gather_paths(root, rel, NULL);

Above, in case it's a partitioned table, root->simple_rel_array_size includes the child rels. So even if it's a simple select without a join rel, simple_rel_array_size would be > 2, and so a gather path would be generated here for the root table, and again in grouping_planner().

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 12 September 2017 at 11:57, Dilip Kumar wrote:
> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar wrote:
>
>> But the statement-level trigger function can refer to OLD TABLE and
>> NEW TABLE, which will contain all the OLD rows and NEW rows
>> respectively. So the updated rows of the partitions (including the
>> moved ones) need to be captured. So for OLD TABLE, we need to capture
>> the deleted row, and for NEW TABLE, we need to capture the inserted
>> row.
>
> Yes, I agree. So in ExecDelete, for OLD TABLE we only need to call
> ExecARUpdateTriggers, which will make the entry in OLD TABLE only if
> a transition table is there, otherwise nothing, and I guess this part
> already exists in your patch. And, we are also calling
> ExecARDeleteTriggers, and I guess that is to fire the ROW-LEVEL delete
> trigger, and that is also fine. What I don't understand is that if
> there is no ROW-LEVEL delete trigger and there is only a statement-level
> delete trigger with a transition table, still we are making the entry
> in the transition table of the delete trigger, and that will never be
> used.

Hmm, ok, that might be happening, since we are calling ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL, and so the deleted tuple gets captured even when there is no UPDATE statement trigger defined, which looks redundant. Will check this. Thanks.

> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 11 September 2017 at 21:12, Dilip Kumar wrote:
> On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar wrote:
>> On 6 September 2017 at 21:47, Dilip Kumar wrote:
>
>> Actually, since transition tables came in, the functions like
>> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
>> purpose of capturing transition table rows, so that the images of the
>> tables are visible when statement triggers are fired that refer to
>> these transition tables. So in the above code, these functions only
>> capture rows, they do not add any event for firing any ROW triggers.
>> AfterTriggerSaveEvent() returns without adding any event if it's
>> called only for transition capture. So even if UPDATE row triggers are
>> defined, they won't get fired in case of row movement, although the
>> updated rows would be captured if transition tables are referenced in
>> these triggers or in the statement triggers.
>
> Ok, then I have one more question.
>
> With transition tables, we can only support statement-level triggers

Yes, we don't support row triggers with transition tables if the table is a partition.

> and for an update statement, we are only going to execute the UPDATE
> statement-level trigger? So is there any point in making a transition
> table entry for the DELETE/INSERT trigger, as those transition tables
> will never be used?

But the statement-level trigger function can refer to OLD TABLE and NEW TABLE, which will contain all the OLD rows and NEW rows respectively. So the updated rows of the partitions (including the moved ones) need to be captured. So for OLD TABLE, we need to capture the deleted row, and for NEW TABLE, we need to capture the inserted row.

In the regression test update.sql, check how the statement trigger trans_updatetrig prints all the updated rows, including the moved ones.
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 8 September 2017 at 19:17, Amit Kapila wrote:
> On Fri, Sep 8, 2017 at 3:59 PM, Amit Khandekar wrote:
>> On 7 September 2017 at 11:05, Amit Kapila wrote:
>>> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar wrote:
>>> 3.
>>> +/*
>>> + * exec_append_leader_next
>>> + *
>>> + * To be used only if it's a parallel leader. The backend should scan
>>> + * backwards from the last plan. This is to prevent it from taking up
>>> + * the most expensive non-partial plan, i.e. the first subplan.
>>> + *
>>> + */
>>> +static bool
>>> +exec_append_leader_next(AppendState *state)
>>>
>>> From the above explanation, it is clear that you don't want the backend to
>>> pick an expensive plan for a leader, but the reason for this different
>>> treatment is not clear.
>>
>> Explained it, saying that with more workers, a leader spends more work
>> on processing the worker tuples, and less work contributing to
>> parallel processing. So it should not take expensive plans, otherwise
>> it will affect the total time to finish the Append plan.
>
> In that case, why can't we keep the workers also processing in the same
> order? What is the harm in that?

Because of the way the logic of queuing works, the workers finish earlier if they start with the expensive plans first. For example, with 3 plans with costs 8, 4, 4 and with 2 workers w1 and w2, they will finish in 8 time units (w1 will finish plan 1 in 8 units, while in parallel w2 will finish the remaining 2 plans in 8 units). Whereas if the plans are ordered like 4, 4, 8, then the workers will finish in 12 time units (w1 and w2 will finish each of the first two plans in 4 units, and then w1 or w2 will take up plan 3 and finish it in 8 more units, while the other worker remains idle).

> Also, the leader will always scan the subplans from the last subplan
> even if all the subplans are partial plans.

Since we already need to have two different code paths, I think we can use the same code paths for any subplans.
> I think this will be an unnecessary difference in the
> strategy of leader and worker, especially when all paths are partial.
> I think the selection of the next subplan might become simpler if we use
> the same strategy for worker and leader.

Yeah, if we had a common method for both, it would have been better. But anyway, we have different logic to maintain.

> Few more comments:
>
> 1.
> + else if (IsA(subpath, MergeAppendPath))
> + {
> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
> +
> + /*
> + * If at all MergeAppend is partial, all its child plans have to be
> + * partial : we don't currently support a mix of partial and
> + * non-partial MergeAppend subpaths.
> + */
> + if (is_partial)
> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>
> In which situation is a partial MergeAppendPath generated? Can you
> provide one example of such a path?

Actually, we don't currently support partial paths for MergeAppendPath. That code just has that if condition (is_partial), but currently that condition won't be true for MergeAppendPath.

> 2.
> add_paths_to_append_rel()
> {
> ..
> + /* Consider parallel append path. */
> + if (pa_subpaths_valid)
> + {
> + AppendPath *appendpath;
> + int parallel_workers;
> +
> + parallel_workers = get_append_num_workers(pa_partial_subpaths,
> + pa_nonpartial_subpaths);
> + appendpath = create_append_path(rel, pa_nonpartial_subpaths,
> + pa_partial_subpaths,
> + NULL, parallel_workers, true,
> + partitioned_rels);
> + add_partial_path(rel, (Path *) appendpath);
> + }
> +
> /*
> - * Consider an append of partial unordered, unparameterized partial paths.
> + * Consider non-parallel partial append path. But if the parallel append
> + * path is made out of all partial subpaths, don't create another partial
> + * path; we will keep only the parallel append path in that case.
> */
> - if (partial_subpaths_valid)
> + if (partial_subpaths_valid && !pa_all_partial_subpaths)
> {
> AppendPath *appendpath;
> ListCell *lc;
> int parallel_workers = 0;
>
> /*
> - * Decide on the number of workers to request for this append path.
> - * For now, we just use the maximum value from among the members. It
> - * might be useful to use a higher number if the Append node were
> - * smart enough to spread out the workers, but it currently isn't.
> + * To decide the number of workers, just use the maximum value from
> + * among the children.
> */
> foreach(lc, partial_
Re: [HACKERS] expanding inheritance in partition bound order
Thanks Amit for the patch. I am still reviewing it, but meanwhile below are a few comments so far ...

On 8 September 2017 at 15:53, Amit Langote wrote:
> [PATCH 2/2] Make RelationGetPartitionDispatch expansion order
> depth-first
>
> This is so that it matches what the planner is doing with partitioning
> inheritance expansion. Matching the planner order helps because
> it eases matching the executor's per-partition objects with
> planner-created per-partition nodes.
>
> + next_parted_idx += (list_length(*pds) - next_parted_idx - 1);

I think this can be replaced just by:

+ next_parted_idx = list_length(*pds) - 1;

Or, how about removing the variable next_parted_idx altogether? Instead, we can just do this:

pd->indexes[i] = -(1 + list_length(*pds));

If that is not possible, I may be missing something.

---

+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);

Didn't understand why next_leaf_idx needs to be updated in the case when the current partrelid is partitioned. I think it should be incremented only for leaf partitions, no? Or, for that matter, again, how about removing the variable 'next_leaf_idx' and doing this:

*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;

---

* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.

I think the above comment needs to be modified with something explaining the relevant changed code. For example, there is no parted_rels, and the "tricky" part was there earlier because of the list being iterated and at the same time being appended.
I couldn't see the existing comments like "Indexes corresponding to the internal partitions are multiplied by" anywhere in the patch. I think those comments are still valid, and important.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 7 September 2017 at 11:05, Amit Kapila wrote:
> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar wrote:
>> The last updated patch needs a rebase. Attached is the rebased version.
>
> Few comments on the first read of the patch:

Thanks!

> 1.
> @@ -279,6 +347,7 @@ void
> ExecReScanAppend(AppendState *node)
> {
> int i;
> + ParallelAppendDesc padesc = node->as_padesc;
>
> for (i = 0; i < node->as_nplans; i++)
> {
> @@ -298,6 +367,276 @@ ExecReScanAppend(AppendState *node)
> if (subnode->chgParam == NULL)
> ExecReScan(subnode);
> }
> +
> + if (padesc)
> + {
> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> + }
> +
>
> For rescan purposes, the parallel state should be reinitialized via
> ExecParallelReInitializeDSM. We need to do it that way mainly to avoid
> cases where sometimes in the rescan machinery we don't perform a rescan
> of all the nodes. See commit 41b0dd987d44089dc48e9c70024277e253b396b7.

Right. I didn't notice this while I rebased my patch over that commit. Fixed it. Also added an exec_append_parallel_next() call in ExecAppendReInitializeDSM(), otherwise the next ExecAppend() in the leader will get an uninitialized as_whichplan.

> 2.
> + * shared next_subplan counter index to start looking for unfinished plan,

Done.

> Here using "counter index" seems slightly confusing. I think using one
> of those will be better.

Re-worded it a bit. See whether that's what you wanted.

> 3.
> +/*
> + * exec_append_leader_next
> + *
> + * To be used only if it's a parallel leader. The backend should scan
> + * backwards from the last plan. This is to prevent it from taking up
> + * the most expensive non-partial plan, i.e. the first subplan.
> + *
> + */
> +static bool
> +exec_append_leader_next(AppendState *state)
>
> From the above explanation, it is clear that you don't want the backend to
> pick an expensive plan for a leader, but the reason for this different
> treatment is not clear.
Explained it, saying that with more workers, a leader spends more work on processing the worker tuples, and less work contributing to parallel processing. So it should not take expensive plans, otherwise it will affect the total time to finish the Append plan.

> 4.
> accumulate_partialappend_subpath()
> {
> ..
> + /* Add partial subpaths, if any. */
> + return list_concat(partial_subpaths, apath_partial_paths);
> ..
> + return partial_subpaths;
> ..
> + if (is_partial)
> + return lappend(partial_subpaths, subpath);
> ..
> }
>
> In this function, instead of returning the partial_subpaths list from
> multiple places, you can just return it at the end, and in all other
> places just append to it if required. That way the code will look
> clearer and simpler.

Agreed. Did it that way.

> 5.
> * is created to represent the case that a relation is provably empty.
> + *
> */
> typedef struct AppendPath
>
> Spurious line addition.

Removed.

> 6.
> if (!node->as_padesc)
> {
> /*
> * This is Parallel-aware append. Follow it's own logic of choosing
> * the next subplan.
> */
> if (!exec_append_seq_next(node))
>
> I think this is the case of non-parallel-aware appends, but the
> comments are indicating the opposite.

Yeah, this comment got left over there when the relevant code got changed. Shifted that comment upwards to where it is appropriate.

Attached is the updated patch v14.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

ParallelAppend_v14.patch
Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 3 September 2017 at 17:10, Amit Khandekar wrote: > After recent commit 30833ba154, now the partitions are expanded in > depth-first order. It didn't seem worthwhile rebasing my partition > walker changes onto the latest code. So in the attached patch, I have > removed all the partition walker changes. But > RelationGetPartitionDispatchInfo() traverses in breadth-first order, > which is different than the update result rels order (because > inheritance expansion order is depth-first). So, in order to make the > tuple-routing-related leaf partitions in the same order as that of the > update result rels, we would have to make changes in > RelationGetPartitionDispatchInfo(), which I am not sure whether it is > going to be done as part of the thread "expanding inheritance in > partition bound order" [1]. For now, in the attached patch, I have > reverted back to the hash table method to find the leaf partitions in > the update result rels. > > [1] > https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to do changes for making RelationGetPartitionDispatchInfo() return the leaf partitions in depth-first order. Once that is done, I will then remove the hash table method for finding leaf partitions in update result rels, and instead use the earlier efficient method that takes advantage of the fact that update result rels and leaf partitions are in the same order.

> Thanks
> -Amit Khandekar

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 7 September 2017 at 13:40, Rafia Sabih wrote:
> On Wed, Aug 30, 2017 at 5:32 PM, Amit Khandekar wrote:
>> Hi Rafia,
>>
>> On 17 August 2017 at 14:12, Amit Khandekar wrote:
>>> But for all of the cases here, partial subplans seem possible, and so
>>> even on HEAD it executed Partial Append. So between a Parallel Append
>>> having partial subplans and a Partial Append having partial subplans,
>>> the cost difference would not be significant. Even if we assume that
>>> Parallel Append was chosen because its cost turned out to be a bit
>>> cheaper, the actual performance gain seems quite large compared to the
>>> expected cost difference. So it is even possible that the performance
>>> gain is due to some other reasons. I will investigate this, and the
>>> other queries.
>>
>> I ran all the queries that were showing performance benefits in your
>> run. But for me, the ParallelAppend benefits show up only for plans
>> that use Partition-Wise-Join.
>>
>> For all the queries that use only PA plans but not PWJ plans, I got
>> the exact same plan on HEAD as with the PA+PWJ patch, except that for
>> the latter, the Append is a ParallelAppend. Whereas, for you, the plans
>> have the join order changed.
>>
>> Regarding actual costs: consequently, for me the actual costs are more
>> or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
>> quite different costs, naturally, because the plans themselves are
>> different on HEAD versus PA+PWJ.
>>
>> My PA+PWJ plan outputs (and actual costs) match exactly what you get
>> with the PA+PWJ patch. But like I said, I get the same join order and
>> same plans (and actual costs) for HEAD as well (except
>> ParallelAppend => Append).
>>
>> Maybe, if you have the latest HEAD code with your setup, you can
>> yourself check some of the queries again to see if they are still
>> showing higher costs as compared to PA? I suspect that some changes in
>> the latest code might be causing this discrepancy, because when I
>> tested some of the explains with a HEAD-branch server running with
>> your database, I got results matching the PA figures.
>>
>> Attached are my explain-analyze outputs.
>
> Strange. Please let me know what was the commit-id you were
> experimenting on. I think we need to investigate this a little
> further.

Sure. I think the commit was b5c75fec. It was sometime around Aug 30 when I ran the tests. But you may try on the latest head.

> Additionally I want to point out that I also applied patch [1],
> which I forgot to mention before.

Yes, I also had applied that patch over PA+PWJ.

> [1]
> https://www.postgresql.org/message-id/CAEepm%3D3%3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 30 August 2017 at 17:32, Amit Khandekar wrote:
> On 16 August 2017 at 18:34, Robert Haas wrote:
>> Thanks for the benchmarking results!
>>
>> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih wrote:
>>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>>
>> 12 seconds instead of 244? Whoa. I find it curious that we picked a
>> Parallel Append with a bunch of non-partial plans when we could've
>> just as easily picked partial plans, or so it seems to me. To put
>> that another way, why did we end up with a bunch of Bitmap Heap Scans
>> here instead of Parallel Bitmap Heap Scans?
>
> Actually, the cost difference would be quite low for Parallel Append
> with partial plans and Parallel Append with non-partial plans with 2
> workers. But yes, I should take a look at why it is consistently
> taking non-partial Bitmap Heap Scan.

Here, I checked that the Partial Bitmap Heap Scan path is not getting created in the first place; but I think it should. As you can see from the below plan snippet, the inner path of the join is a parameterized Index Scan:

-> Parallel Append
   -> Nested Loop Semi Join
      -> Bitmap Heap Scan on orders_004
         Recheck Cond: ((o_orderdate >= '1994-01-01'::date) AND (o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
         -> Bitmap Index Scan on idx_orders_orderdate_004
            Index Cond: ((o_orderdate >= '1994-01-01'::date) AND (o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
      -> Index Scan using idx_lineitem_orderkey_004 on lineitem_004
         Index Cond: (l_orderkey = orders_004.o_orderkey)
         Filter: (l_commitdate < l_receiptdate)

In the index condition of the inner Index Scan path, it is referencing partition orders_004, which is used by the outer path. So this should satisfy the partial join path restriction concerning a parameterized inner path: "the inner path should not refer to relations *outside* the join path". Here, it is referring to relations *inside* the join path.
But still this join path gets rejected by try_partial_nestloop_path(), here:

if (inner_path->param_info != NULL)
{
    Relids inner_paramrels = inner_path->param_info->ppi_req_outer;

    if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
        return;
}

Actually, bms_is_subset() above should return true, because inner_paramrels and the outer path's relids should both have orders_004. But that's not happening. inner_paramrels is referring to orders, not orders_004. And hence bms_is_subset() returns false (thereby rejecting the partial nestloop path).

I suspect this is because the inner path is not getting reparameterized so as to refer to the child relations. In the PWJ patch, I saw that reparameterize_path_by_child() is called by try_nestloop_path(), but not by try_partial_nestloop_path().

Now, for Parallel Append, if this partial nestloop subpath gets created, it may or may not get chosen, depending upon the number of workers. For example, if the number of workers is 6, and ParallelAppend+PWJ runs with only 2 partitions, then the partial nested loop join would definitely win, because we can put all 6 workers to work, whereas for a ParallelAppend with all non-partial nested loop join subpaths, at most only 2 workers could be allotted, one for each child. But if there are more partitions and fewer available workers, then I think the cost difference between partial and non-partial join paths would not be significant.

But here the issue is, partial nested loop subpaths don't get created in the first place. Looking at the above analysis, this issue should be worked on in a different thread, not this one.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] expanding inheritance in partition bound order
On 4 September 2017 at 06:34, Amit Langote wrote: > Hi Amit, > > On 2017/09/03 16:07, Amit Khandekar wrote: >> On 31 August 2017 at 13:06, Amit Langote >> wrote: >>>> Mind you, that idea has some problems anyway in the face of default >>>> partitions, null partitions, and list partitions which accept >>>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, >>>> 4, 6). We might need to mark the PartitionDesc to indicate whether >>>> PartitionDesc-order is in fact bound-order in a particular instance, >>>> or something like that. >>> >>> ISTM, the primary motivation for the EIBO patch at this point is to get >>> the partitions ordered in a predictable manner so that the partition-wise >>> join patch and update partition key patches could implement certain logic >>> using O (n) algorithm rather than an O (n^2) one. Neither of them depend >>> on the actual order in the sense of, say, sticking a PathKey to the >>> resulting Append. >> >> Now that the inheritance hierarchy is expanded in depth-first order, >> RelationGetPartitionDispatchInfo() needs to be changed to arrange the >> PartitionDispatch array and the leaf partitions in depth-first order >> (as we know this is a requirement for the update-partition-key patch >> for efficiently determining which of the leaf partitions are already >> present in the update result rels). > > I was thinking the same. > >> Amit, I am not sure if you are >> already doing this as part of the patches in this mail thread. Please >> let me know. > > Actually, I had thought of changing the expansion order in > RelationGetPartitionDispatchInfo to depth-first after Robert committed his > patch the other day, but haven't got around to doing that yet. Will do > that in the updated patch (the refactoring patch) I will post sometime > later today or tomorrow on a differently titled thread, because the EIBO > work seems to be done. Great, thanks. 
Just wanted to make sure someone is working on that, because, as you said, it is no longer an EIBO patch. Since you are doing that, I won't work on that. > >> Also, let me know if you think there will be any loss of >> efficiency in tuple routing code if we arrange the Partition Dispatch >> indexes in depth-first order. > > I don't think there will be any loss in the efficiency of the tuple > routing code itself. It's just that the position of the ResultRelInfos > (of leaf partitions) and PartitionDispatch objects (of partitioned tables) > will be different in their respective arrays, that is, pd->indexes will > now have different values than formerly. Ok. Good to hear that. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 31 August 2017 at 14:15, Amit Khandekar wrote: > Thanks Dilip. I am working on rebasing the patch. Particularly, the > partition walker in my patch depended on the fact that all the tables > get opened (and then closed) while creating the tuple routing info. > But in HEAD, now only the partitioned tables get opened. So need some > changes in my patch. > > The partition walker related changes are going to be inapplicable once > the other thread [1] commits the changes for expansion of inheritance > in bound-order, but till then I would have to rebase the partition > walker changes over HEAD. > > [1] > https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp

After recent commit 30833ba154, now the partitions are expanded in depth-first order. It didn't seem worthwhile rebasing my partition walker changes onto the latest code. So in the attached patch, I have removed all the partition walker changes. But RelationGetPartitionDispatchInfo() traverses in breadth-first order, which is different than the update result rels order (because inheritance expansion order is depth-first). So, in order to make the tuple-routing-related leaf partitions in the same order as that of the update result rels, we would have to make changes in RelationGetPartitionDispatchInfo(), which I am not sure whether it is going to be done as part of the thread "expanding inheritance in partition bound order" [1]. For now, in the attached patch, I have reverted back to the hash table method to find the leaf partitions in the update result rels.

[1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

Thanks
-Amit Khandekar
Re: [HACKERS] expanding inheritance in partition bound order
On 31 August 2017 at 13:06, Amit Langote wrote: >> Mind you, that idea has some problems anyway in the face of default >> partitions, null partitions, and list partitions which accept >> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, >> 4, 6). We might need to mark the PartitionDesc to indicate whether >> PartitionDesc-order is in fact bound-order in a particular instance, >> or something like that. > > ISTM, the primary motivation for the EIBO patch at this point is to get > the partitions ordered in a predictable manner so that the partition-wise > join patch and update partition key patches could implement certain logic > using O (n) algorithm rather than an O (n^2) one. Neither of them depend > on the actual order in the sense of, say, sticking a PathKey to the > resulting Append. Now that the inheritance hierarchy is expanded in depth-first order, RelationGetPartitionDispatchInfo() needs to be changed to arrange the PartitionDispatch array and the leaf partitions in depth-first order (as we know this is a requirement for the update-partition-key patch for efficiently determining which of the leaf partitions are already present in the update result rels). Amit, I am not sure if you are already doing this as part of the patches in this mail thread. Please let me know. Also, let me know if you think there will be any loss of efficiency in tuple routing code if we arrange the Partition Dispatch indexes in depth-first order. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
Thanks Dilip. I am working on rebasing the patch. Particularly, the partition walker in my patch depended on the fact that all the tables get opened (and then closed) while creating the tuple routing info. But in HEAD, now only the partitioned tables get opened. So some changes are needed in my patch. The partition walker related changes are going to be inapplicable once the other thread [1] commits the changes for expansion of inheritance in bound order, but till then I would have to rebase the partition walker changes over HEAD. [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp On 31 August 2017 at 12:09, Dilip Kumar wrote: > On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar > wrote: >> On 4 August 2017 at 22:28, Amit Khandekar wrote: >>>> > > I am planning to review and test this patch; it seems this patch > needs to be rebased. > > [dilip@localhost postgresql]$ patch -p1 < > ../patches/update-partition-key_v15.patch > patching file doc/src/sgml/ddl.sgml > patching file doc/src/sgml/ref/update.sgml > patching file doc/src/sgml/trigger.sgml > patching file src/backend/catalog/partition.c > Hunk #3 succeeded at 910 (offset -1 lines). > Hunk #4 succeeded at 924 (offset -1 lines). > Hunk #5 succeeded at 934 (offset -1 lines). > Hunk #6 succeeded at 994 (offset -1 lines). > Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines). > Hunk #8 FAILED at 1023. > Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines). > Hunk #10 succeeded at 2069 (offset 2 lines). > Hunk #11 succeeded at 2406 (offset 2 lines). > 1 out of 11 hunks FAILED -- saving rejects to file > src/backend/catalog/partition.c.rej > patching file src/backend/commands/copy.c > Hunk #2 FAILED at 1426. > Hunk #3 FAILED at 1462. > Hunk #4 succeeded at 2616 (offset 7 lines). > Hunk #5 succeeded at 2726 (offset 8 lines). > Hunk #6 succeeded at 2846 (offset 8 lines). 
> 2 out of 6 hunks FAILED -- saving rejects to file > src/backend/commands/copy.c.rej > patching file src/backend/commands/trigger.c > Hunk #4 succeeded at 5261 with fuzz 2. > patching file src/backend/executor/execMain.c > Hunk #1 succeeded at 65 (offset 1 line). > Hunk #2 succeeded at 103 (offset 1 line). > Hunk #3 succeeded at 1829 (offset 20 lines). > Hunk #4 succeeded at 1860 (offset 20 lines). > Hunk #5 succeeded at 1927 (offset 20 lines). > Hunk #6 succeeded at 2044 (offset 21 lines). > Hunk #7 FAILED at 3210. > Hunk #8 FAILED at 3244. > Hunk #9 succeeded at 3289 (offset 26 lines). > Hunk #10 FAILED at 3340. > Hunk #11 succeeded at 3387 (offset 29 lines). > Hunk #12 succeeded at 3424 (offset 29 lines). > 3 out of 12 hunks FAILED -- saving rejects to file > src/backend/executor/execMain.c.rej > patching file src/backend/executor/execReplication.c > patching file src/backend/executor/nodeModifyTable.c > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Append implementation
The last updated patch needs a rebase. Attached is the rebased version. Thanks -Amit Khandekar ParallelAppend_v13_rebased_3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Append implementation
Hi Rafia, On 17 August 2017 at 14:12, Amit Khandekar wrote: > But for all of the cases here, partial > subplans seem possible, and so even on HEAD it executed Partial > Append. So between a Parallel Append having partial subplans and a > Partial Append having partial subplans, the cost difference would not > be significant. Even if we assume that Parallel Append was chosen > because its cost turned out to be a bit cheaper, the actual > performance gain seems quite large as compared to the expected cost > difference. So it might be even possible that the performance gain > might be due to some other reasons. I will investigate this, and the > other queries. > I ran all the queries that were showing performance benefits in your run. But for me, the ParallelAppend benefits are shown only for plans that use Partition-Wise-Join. For all the queries that use only PA plans but not PWJ plans, I got the exact same plan on HEAD as with the PA+PWJ patch, except that for the latter, the Append is a ParallelAppend. Whereas for you, the join order of the plans has changed. Regarding actual costs: consequently, for me the actual costs are more or less the same for HEAD and PA+PWJ. Whereas for your runs, you have quite different costs, naturally, because the plans themselves are different on HEAD versus PA+PWJ. My PA+PWJ plan outputs (and actual costs) match exactly what you get with the PA+PWJ patch. But like I said, I get the same join order and same plans (and actual costs) on HEAD as well (except ParallelAppend=>Append). Maybe, if you have the latest HEAD code with your setup, you can yourself check some of the queries again to see if they are still showing higher costs as compared to PA? I suspect that some changes in the latest code might be causing this discrepancy, because when I tested some of the explains with a HEAD-branch server running with your database, I got results matching the PA figures. Attached are my explain-analyze outputs. 
On 16 August 2017 at 18:34, Robert Haas wrote: > Thanks for the benchmarking results! > > On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih > wrote: >> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41 > > 12 seconds instead of 244? Whoa. I find it curious that we picked a > Parallel Append with a bunch of non-partial plans when we could've > just as easily picked partial plans, or so it seems to me. To put > that another way, why did we end up with a bunch of Bitmap Heap Scans > here instead of Parallel Bitmap Heap Scans? Actually, the cost difference would be quite low for Parallel Append with partial plans and Parallel Append with non-partial plans with 2 workers. But yes, I should take a look at why it is consistently taking non-partial Bitmap Heap Scan. > Q6 | 29 | 12 | PA only This one needs to be analysed, because here, the plan cost is the same, but actual cost for PA is almost half the cost for HEAD. This is the same observation for my run also. Thanks -Amit PA-test-AmitKh.tar.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] expanding inheritance in partition bound order
On 25 August 2017 at 23:58, Robert Haas wrote: > > That just leaves indexes. In a world where keystate, tupslot, and > tupmap are removed from the PartitionDispatchData, you must need > indexes or there would be no point in constructing a > PartitionDispatchData object in the first place; any application that > needs neither indexes nor the executor-specific stuff could just use > the Relation directly. But there is expand_inherited_rtentry(), which requires neither indexes nor any executor stuff, but still has to call RelationGetPartitionDispatchInfo(), and so these indexes get built unnecessarily. Looking at the latest patch, I can see that those indexes can be separately built in ExecSetupPartitionTupleRouting(), where they are required, instead of in RelationGetPartitionDispatchInfo(). In the loop which iterates through the pd[] returned from RelationGetPartitionDispatchInfo(), we can build them using the exact code currently written to build them in RelationGetPartitionDispatchInfo(). In the future, if we have applications where indexes are also required, we can have a separate function just to build the indexes, and then ExecSetupPartitionTupleRouting() would also call that function. On 21 August 2017 at 11:40, Amit Langote wrote: >> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an >> array of pointers. Why can't it be an array of >> PartitionTupleRoutingInfo structure rather than pointer to that >> structure ? > > AFAIK, assigning pointers is less expensive than assigning struct and we > end up doing a lot of assigning of the members of that array to a local > variable I didn't get why exactly we would have to copy the structures. We could just pass the address of ptrinfos[index], no? My only point for this was: we would not have to call palloc0() for each element of ptrinfos. Instead, we could just allocate memory for all of the elements in a single palloc0(). We anyway have to allocate memory for *each* element. 
> in get_partition_for_tuple(), for example. Perhaps, we could > avoid those assignments and implement it the way you suggest. You mean at these 2 places in get_partition_for_tuple() , right ? : 1. /* start with the root partitioned table */ parent = ptrinfos[0]; 2. else parent = ptrinfos[-parent->pd->indexes[cur_index]]; Both of the above places, we could just use &ptrinfos[...] instead of ptrinfos[...]. But I guess you meant something else. RelationGetPartitionDispatchInfo() opens all the partitioned tables. But in ExecSetupPartitionTupleRouting(), it again opens all the parents, that is all the partitioned tables, and closes them back. Instead, would it be possible to do this : Instead of the PartitionDispatch->parentoid field, PartitionDispatch can have parentrel Relation field, which will point to reldesc field of one of the pds[] elements. For me, the CopyStateData->ptrinfos and ModifyTableState.mt_ptrinfos field names sound confusing. How about part_tuple_routing_info or just tuple_routing_info ? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] expanding inheritance in partition bound order
On 18 August 2017 at 10:55, Amit Langote wrote: > On 2017/08/18 4:54, Robert Haas wrote: >> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat >> wrote: >>> [2] had a patch with some changes to the original patch you posted. I >>> didn't describe those changes in my mail, since they rearranged the >>> comments. Those changes are not part of this patch and you haven't >>> comments about those changes as well. If you have intentionally >>> excluded those changes, it's fine. In case, you haven't reviewed them, >>> please see if they are good to be incorporated. >> >> I took a quick look at your version but I think I like Amit's fine the >> way it is, so committed that and back-patched it to v10. > > Thanks for committing. > >> I find 0002 pretty ugly as things stand. We get a bunch of tuple maps >> that we don't really need, only to turn around and free them. We get >> a bunch of tuple slots that we don't need, only to turn around and >> drop them. We don't really need the PartitionDispatch objects either, >> except for the OIDs they contain. There's a lot of extra stuff being >> computed here that is really irrelevant for this purpose. I think we >> should try to clean that up somehow. > > One way to do that might be to reverse the order of the remaining patches > and put the patch to refactor RelationGetPartitionDispatchInfo() first. > With that refactoring, PartitionDispatch itself has become much simpler in > that it does not contain a relcache reference to be closed eventually by > the caller, the tuple map, and the tuple table slot. Since those things > are required for tuple-routing, the refactoring makes > ExecSetupPartitionTupleRouting itself create them from the (minimal) > information returned by RelationGetPartitionDispatchInfo and ultimately > destroy when done using it. 
I kept the indexes field in > PartitionDispatchData though, because it's essentially free to create > while we are walking the partition tree in > RelationGetPartitionDispatchInfo() and it seems undesirable to make the > caller compute that information (indexes) by traversing the partition tree > all over again, if it doesn't otherwise have to. I am still considering > some counter-arguments raised by Amit Khandekar about this last assertion. > > Thoughts? Another approach (which I have used in the update-partition-key patch) is to *not* generate the OIDs beforehand and, instead, call a partition_walker_next() function to traverse the tree. Each next() call would return a ChildInfo that includes the child OID, parent OID, etc. This would guarantee a fixed order of OIDs for all callers. In the update-partition-key patch, I am opening and closing each of the children, which, of course, we need to avoid. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] expanding inheritance in partition bound order
On 18 August 2017 at 01:24, Robert Haas wrote: > On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat > wrote: >> [2] had a patch with some changes to the original patch you posted. I >> didn't describe those changes in my mail, since they rearranged the >> comments. Those changes are not part of this patch and you haven't >> comments about those changes as well. If you have intentionally >> excluded those changes, it's fine. In case, you haven't reviewed them, >> please see if they are good to be incorporated. > > I took a quick look at your version but I think I like Amit's fine the > way it is, so committed that and back-patched it to v10. > > I find 0002 pretty ugly as things stand. We get a bunch of tuple maps > that we don't really need, only to turn around and free them. We get > a bunch of tuple slots that we don't need, only to turn around and > drop them. I think in the final changes after applying all 3 patches, the redundant tuple slot is no longer present. But ... > We don't really need the PartitionDispatch objects either, > except for the OIDs they contain. There's a lot of extra stuff being > computed here that is really irrelevant for this purpose. I think we > should try to clean that up somehow. ... I am of the same opinion. That's why - as I mentioned upthread - I was thinking why not have a separate lightweight function to just generate OIDs, and keep RelationGetPartitionDispatchInfo() intact, to be used only for tuple routing. But I haven't yet checked Ashutosh's requirements, which suggest that it does not help to even get the OID list. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] expanding inheritance in partition bound order
On 17 August 2017 at 06:39, Amit Langote wrote: > Hi Amit, > > Thanks for the comments. > > On 2017/08/16 20:30, Amit Khandekar wrote: >> On 16 August 2017 at 11:06, Amit Langote >> wrote: >> >>> Attached updated patches. >> >> Thanks Amit for the patches. >> >> I too agree with the overall approach taken for keeping the locking >> order consistent: it's best to do the locking with the existing >> find_all_inheritors() since it is much cheaper than to lock them in >> partition-bound order, the later being expensive since it requires >> opening partitioned tables. > > Yeah. Per the Robert's design idea, we will always do the *locking* in > the order determined by find_all_inheritors/find_inheritance_children. > >>> I haven't yet done anything about changing the timing of opening and >>> locking leaf partitions, because it will require some more thinking about >>> the required planner changes. But the above set of patches will get us >>> far enough to get leaf partition sub-plans appear in the partition bound >>> order (same order as what partition tuple-routing uses in the executor). >> >> So, I believe none of the changes done in pg_inherits.c are essential >> for expanding the inheritence tree in bound order, right ? > > Right. > > The changes to pg_inherits.c are only about recognizing partitioned tables > in an inheritance hierarchy and putting them ahead in the returned list. > Now that I think of it, the title of the patch (teach pg_inherits.c about > "partitioning") sounds a bit confusing. In particular, the patch does not > teach it things like partition bound order, just that some tables in the > hierarchy could be partitioned tables. > >> I am not >> sure whether we are planning to commit these two things together or >> incrementally : >> 1. Expand in bound order >> 2. Allow for locking only the partitioned tables first. 
>> >> For #1, the changes in pg_inherits.c are not required (viz, keeping >> partitioned tables at the head of the list, adding inhchildparted >> column, etc). > > Yes. Changes to pg_inherits.c can be committed completely independently > of either 1 or 2, although then there would be nobody using that capability. > > About 2: I think the capability to lock only the partitioned tables in > expand_inherited_rtentry() will only be used once we have the patch to do > the necessary planner restructuring that will allow us to defer child > table locking to some place that is not expand_inherited_rtentry(). I am > working on that patch now and should be able to show something soon. > >> If we are going to do #2 together with #1, then in the patch set there >> is no one using the capability introduced by #2. That is, there are no >> callers of find_all_inheritors() that make use of the new >> num_partitioned_children parameter. Also, there is no boolean >> parameter for find_all_inheritors() to be used to lock only the >> partitioned tables. >> >> I feel we should think about >> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and >> first get the review done for the other patches. > > I think that makes sense. > >> I see that RelationGetPartitionDispatchInfo() now returns quite a >> small subset of what it used to return, which is good. But I feel for >> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is >> still a bit heavy. We only require the oids, so the >> PartitionedTableInfo data structure (including the pd->indexes array) >> gets discarded. > > Maybe we could make the output argument optional, but I don't see much > point in being too conservative here. Building the indexes array does not > cost us that much and if a not-too-distant-in-future patch could use that > information somehow, it could do so for free. Ok, so these changes are mostly kept keeping in mind some future use-cases. 
Otherwise, I was thinking we could just keep a light-weight function to generate the oids, and keep the current RelationGetPartitionDispatchInfo() intact. Anyways, some more comments : In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an array of pointers. Why can't it be an array of PartitionTupleRoutingInfo structure rather than pointer to that structure ? diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c + * Close all the leaf partitions and their indices. * Above comment needs to be shifted a bit down to the subsequent "for" loop where it's actually applicable. * node->mt_partition_dispatch_info[0] corresponds to the root partitioned * table, for which w
Re: [HACKERS] Parallel Append implementation
On 16 August 2017 at 18:34, Robert Haas wrote: > Thanks for the benchmarking results! > > On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih > wrote: >> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41 > > 12 seconds instead of 244? Whoa. I find it curious that we picked a > Parallel Append with a bunch of non-partial plans when we could've > just as easily picked partial plans, or so it seems to me. To put > that another way, why did we end up with a bunch of Bitmap Heap Scans > here instead of Parallel Bitmap Heap Scans? > >> Q7 | 134 | 88 | PA only >> Q18 | 508 | 489 | PA only > > What's interesting in these results is that the join order changes > quite a lot when PA is in the mix, and I don't really see why that > should happen. Yes, it seems hard to determine why exactly the join order changes. Parallel Append is expected to give the benefit especially if there are no partial subplans. But for all of the cases here, partial subplans seem possible, and so even on HEAD it executed Partial Append. So between a Parallel Append having partial subplans and a Partial Append having partial subplans , the cost difference would not be significant. Even if we assume that Parallel Append was chosen because its cost turned out to be a bit cheaper, the actual performance gain seems quite large as compared to the expected cost difference. So it might be even possible that the performance gain might be due to some other reasons. I will investigate this, and the other queries. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] expanding inheritance in partition bound order
On 16 August 2017 at 11:06, Amit Langote wrote: > Attached updated patches. Thanks Amit for the patches. I too agree with the overall approach taken for keeping the locking order consistent: it's best to do the locking with the existing find_all_inheritors(), since it is much cheaper than locking the tables in partition-bound order, the latter being expensive since it requires opening partitioned tables. > I haven't yet done anything about changing the timing of opening and > locking leaf partitions, because it will require some more thinking about > the required planner changes. But the above set of patches will get us > far enough to get leaf partition sub-plans appear in the partition bound > order (same order as what partition tuple-routing uses in the executor). So, I believe none of the changes done in pg_inherits.c are essential for expanding the inheritance tree in bound order, right? I am not sure whether we are planning to commit these two things together or incrementally: 1. Expand in bound order 2. Allow for locking only the partitioned tables first. For #1, the changes in pg_inherits.c are not required (viz, keeping partitioned tables at the head of the list, adding inhchildparted column, etc). If we are going to do #2 together with #1, then in the patch set nothing uses the capability introduced by #2. That is, there are no callers of find_all_inheritors() that make use of the new num_partitioned_children parameter. Also, there is no boolean parameter for find_all_inheritors() to be used to lock only the partitioned tables. I feel we should think about 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and first get the review done for the other patches. --- I see that RelationGetPartitionDispatchInfo() now returns quite a small subset of what it used to return, which is good. But I feel for expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is still a bit heavy. 
We only require the oids, so the PartitionedTableInfo data structure (including the pd->indexes array) gets discarded. Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind() for each child. In expand_inherited_rtentry(), we anyway have to open all the child tables, so we get the partition descriptors for each of the children for free. So how about, in expand_inherited_rtentry(), we traverse the partition tree using these descriptors similar to how it is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid code duplication for traversing, we can have a common API. Still looking at RelationGetPartitionDispatchInfo() changes ... -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Append implementation
On 9 August 2017 at 19:05, Robert Haas wrote: > On Wed, Jul 5, 2017 at 7:53 AM, Amit Khandekar wrote: >>> This is not applicable on the latest head i.e. commit -- >>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing. >> >> Thanks for notifying. Attached is the rebased version of the patch. > > This again needs a rebase. Attached rebased version of the patch. Thanks. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company ParallelAppend_v13_rebased_2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] expanding inheritance in partition bound order
On 4 August 2017 at 22:55, Robert Haas wrote: > > 1. Before calling RelationGetPartitionDispatchInfo, the calling code > should use find_all_inheritors to lock all the relevant relations (or > the planner could use find_all_inheritors to get a list of relation > OIDs, store it in the plan in order, and then at execution time we > visit them in that order and lock them). > > 2. RelationGetPartitionDispatchInfo assumes the relations are already locked. I agree. I think overall, we should keep RelationGetPartitionDispatchInfo() only for preparing the dispatch info in the planner, and generate the locked oids (using find_all_inheritors() or get_partitioned_oids() or whatever) *without* using RelationGetPartitionDispatchInfo(), since RelationGetPartitionDispatchInfo() is generating the pd structure which we don't want in every expansion. > > 3. While we're optimizing, in the first loop inside of > RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, > use get_rel_relkind() to see whether we've got a partitioned table; if > so, open it. If not, there's no need. Yes, this way we need to open only the partitioned tables. > P.S. While I haven't reviewed 0002 in detail, I think the concept of > minimizing what needs to be built in RelationGetPartitionDispatchInfo > is a very good idea. True. I also think, RelationGetPartitionDispatchInfo () should be called while preparing the ModifyTable plan; the PartitionDispatch data structure returned by RelationGetPartitionDispatchInfo() should be stored in that plan, and then the execution-time fields in PartitionDispatch would be populated in ExecInitModifyTable(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
> > Below are the TODOS at this point : > > Fix for bug reported by Rajkumar about update with join. I had explained the root issue of this bug here: [1] The attached patch includes the fix, which is explained below. Currently in the patch, there is a check for whether the tuple is concurrently deleted by another session, i.e. when heap_update() returns HeapTupleUpdated. In that case we set the concurrently_deleted output param to true. We should also do the same for the HeapTupleSelfUpdated return value. In fact, there are other places in ExecDelete() where it can return without doing anything. For example, if a BR DELETE trigger prevents the delete from happening, ExecBRDeleteTriggers() returns false, in which case ExecDelete() returns. So what the fix does is: rename the concurrently_deleted parameter to delete_skipped so as to indicate a more general status: whether the delete actually happened or was skipped. And we set this param to true only after the delete happens. This allows us to avoid adding a new row for the trigger case also. Added test scenarios for the UPDATE with JOIN case, and also the TRIGGER case. > Do something about two separate mapping tables for Transition tables > and update tuple-routing. On 1 July 2017 at 03:15, Thomas Munro wrote: > Would make sense to have a set of functions with names like > GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays > m_convertors_{from,to}_by_{subplan,leaf} the first time they need > them? This was discussed here: [2]. I think even if we have them built when needed, still in the presence of both tuple routing and transition tables, we do need separate arrays. So I think rather than dynamic arrays, we can have static arrays, but their elements will point to a shared TupleConversionMap structure whenever possible. 
As already in the patch, in case of insert/update tuple routing, there is a per-leaf-partition mt_transition_tupconv_maps array for transition tables, and a separate per-subplan array mt_resultrel_maps for update tuple routing. *But*, what I am proposing is: for each mt_transition_tupconv_maps[] element whose leaf partition also exists as a per-subplan result rel, that array element and the mt_resultrel_maps[] element will point to the same TupleConversionMap structure. This is quite similar to how we are re-using the per-subplan resultrels for the per-leaf result rels. We will re-use the per-subplan TupleConversionMap for the per-leaf mt_transition_tupconv_maps[] elements. This is not yet implemented. > GetUpdatedColumns() to be moved to header file. Done. I have moved it to execnodes.h > More test scenarios in regression tests. > Need to check/test whether we are correctly applying insert policies > (and not update) while inserting a routed tuple. The above two are yet to be done. > Use getASTriggerResultRelInfo() for attrno mapping, rather than first > resultrel, for generating child WCO/RETURNING expression. > Regarding generating child WithCheckOption and Returning expressions using those of the root result relation, ModifyTablePath and ModifyTable should have new fields rootReturningList (and rootWithCheckOptions) which would be derived from root->parse->returningList in inheritance_planner(). But then, similar to the per-subplan returningList, rootReturningList would have to pass through set_plan_refs()=>set_returning_clause_references(), which requires the subplan targetlist to be passed. Because of this, for rootReturningList, we require a subplan for the root partition, which is not there currently; we have subplans only for child rels. That means we would have to create such a plan only for the sake of generating rootReturningList. 
The other option is what the patch currently does in the executor: use the returningList of the first per-subplan result rel to generate the other child returningLists (and WithCheckOptions). This works by applying map_partition_varattnos() to the first returningList. But now that we have realized we have to specially handle whole-row vars, map_partition_varattnos() would need some changes to convert whole-row vars differently for child-rel-to-child-rel mapping. For childrel-to-childrel conversion, the whole-row var is already wrapped in a ConvertRowtypeExpr, but we need to change its Var->vartype to the new child vartype. I think the second option looks easier, but I am open to suggestions, and I am myself still checking the first one. > Address Robert's review comments on make_resultrel_ordered.patch. > > +typedef struct ParentChild > > This is a pretty generic name. Pick something more specific and informative. I have used ChildPartitionInfo. But suggestions welcome. > > +static List *append_rel_partition_oids(List *rel_list, Relation rel); > > One could be forgiven for thinking that this function was just going > to append OIDs, but it actually appends ParentChild structures, so I > think the name needs work. Renamed it to append_child_partitions(). > > +List *append_rel_partition_oids(List *rel_list,
Re: [HACKERS] map_partition_varattnos() and whole-row vars
On 3 August 2017 at 11:00, Amit Langote wrote: > Thanks for the review. > > On 2017/08/03 13:54, Amit Khandekar wrote: >> On 2 August 2017 at 11:51, Amit Langote wrote: >>> On 2017/08/02 1:33, Amit Khandekar wrote: >>>> Instead of callers of map_partition_varattnos() reporting error, we >>>> can have map_partition_varattnos() itself report error. Instead of the >>>> found_whole_row parameter of map_partition_varattnos(), we can have >>>> error_on_whole_row parameter. So callers who don't expect whole row, >>>> would pass error_on_whole_row=true to map_partition_varattnos(). This >>>> will simplify the resultant code a bit. >>> >>> So, I think it's only the callers of >>> map_partition_varattnos() who know whether finding whole-row vars is an >>> error and *what kind* of error. >> >> Ok. Got it. Thanks for the explanation. >> >> How about making found_whole_row as an optional parameter ? >> map_partition_varattnos() would populate it only if its not NULL. This >> way, callers who don't bother about whether it is found or not don't >> have to declare a variable and pass its address. In current code, >> ExecInitModifyTable() is the one who has to declare found_whole_row >> without needing it. > > Sure, done. Looks good. > But while implementing that, I avoided changing anything in > map_variable_attnos(), such as adding a check for not-NULLness of > found_whole_row. The only check added is in map_partition_varattnos(). Yes, I agree. That is anyway not relevant to the fix. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] map_partition_varattnos() and whole-row vars
On 2 August 2017 at 11:51, Amit Langote wrote: > Thanks Fuita-san and Amit for reviewing. > > On 2017/08/02 1:33, Amit Khandekar wrote: >> On 1 August 2017 at 15:11, Etsuro Fujita wrote: >>> On 2017/07/31 18:56, Amit Langote wrote: >>>> Yes, that's what's needed here. So we need to teach >>>> map_variable_attnos_mutator() to convert whole-row vars just like it's >>>> done in adjust_appendrel_attrs_mutator(). >>> >>> >>> Seems reasonable. (Though I think it might be better to do this kind of >>> conversion in the planner, not the executor, because that would increase the >>> efficiency of cached plans.) > > That's a good point, although it sounds like a bigger project that, IMHO, > should be undertaken separately, because that would involve designing for > considerations of expanding inheritance even in the INSERT case. > >> I think the work of shifting to planner should be taken as a different >> task when we shift the preparation of DispatchInfo to the planner. > > Yeah, I think it'd be a good idea to do those projects together. That is, > doing what Fujita-san suggested and expanding partitioned tables in > partition bound order in the planner. > >>>> Attached 2 patches: >>>> >>>> 0001: Addresses the bug that Rajkumar reported (handling whole-row vars in >>>>WITH CHECK and RETURNING expressions at all) >>>> >>>> 0002: Addressed the bug that Amit reported (converting whole-row vars >>>>that could occur in WITH CHECK and RETURNING expressions) >>> >>> >>> I took a quick look at the patches. 
One thing I noticed is this: >>> >>> map_variable_attnos(Node *node, >>> int target_varno, int sublevels_up, >>> const AttrNumber *attno_map, int >>> map_length, >>> + Oid from_rowtype, Oid to_rowtype, >>> bool *found_whole_row) >>> { >>> map_variable_attnos_context context; >>> @@ -1291,6 +1318,8 @@ map_variable_attnos(Node *node, >>> context.sublevels_up = sublevels_up; >>> context.attno_map = attno_map; >>> context.map_length = map_length; >>> + context.from_rowtype = from_rowtype; >>> + context.to_rowtype = to_rowtype; >>> context.found_whole_row = found_whole_row; >>> >>> You added two arguments to pass to map_variable_attnos(): from_rowtype and >>> to_rowtype. But I'm not sure that we really need the from_rowtype argument >>> because it's possible to get the Oid from the vartype of target whole-row >>> Vars. And in this: >>> >>> + /* >>> +* If the callers expects us to convert the >>> same, do so if >>> +* necessary. >>> +*/ >>> + if (OidIsValid(context->to_rowtype) && >>> + OidIsValid(context->from_rowtype) && >>> + context->to_rowtype != >>> context->from_rowtype) >>> + { >>> + ConvertRowtypeExpr *r = >>> makeNode(ConvertRowtypeExpr); >>> + >>> + r->arg = (Expr *) newvar; >>> + r->resulttype = >>> context->from_rowtype; >>> + r->convertformat = >>> COERCE_IMPLICIT_CAST; >>> + r->location = -1; >>> + /* Make sure the Var node has the >>> right type ID, too */ >>> + newvar->vartype = >>> context->to_rowtype; >>> + return (Node *) r; >>> + } >>> >>> I think we could set r->resulttype to the vartype (ie, "r->resulttype = >>> newvar->vartype" instead of "r->resulttype = context->from_rowtype"). >> >> I agree. > > You're right, from_rowtype is unnecessary. > >> --- >> >> Few more comments : >> >> @@ -1240,7 +1247,7 @@ map_variable_
Re: [HACKERS] UPDATE of partition key
On 2 August 2017 at 14:38, Amit Langote wrote: > On 2017/07/29 2:45, Amit Khandekar wrote: >> On 28 July 2017 at 20:10, Robert Haas wrote: >>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote: >>>> I checked that we get the same result relation order with both the >>>> patches, but I would like to highlight a notable difference here between >>>> the approaches taken by our patches. In my patch, I have now taught >>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables >>>> in the tree, because we need to look at its partition descriptor to >>>> collect partition OIDs and bounds. We can defer locking (and opening the >>>> relation descriptor of) leaf partitions to a point where planner has >>>> determined that the partition will be accessed after all (not pruned), >>>> which will be done in a separate patch of course. >>> >>> That's very desirable, but I believe it introduces a deadlock risk >>> which Amit's patch avoids. A transaction using the code you've >>> written here is eventually going to lock all partitions, BUT it's >>> going to move the partitioned ones to the front of the locking order >>> vs. what find_all_inheritors would do. So, when multi-level >>> partitioning is in use, I think it could happen that some other >>> transaction is accessing the table using a different code path that >>> uses the find_all_inheritors order without modification. If those >>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk. >> >> Yes, I agree. Even with single-level partitioning, the leaf partitions >> ordered by find_all_inheritors() is by oid values, so that's also >> going to be differently ordered. > > We do require to lock the parent first in any case. Doesn't that prevent > deadlocks by imparting an implicit order on locking by operations whose > locks conflict. Yes, maybe, but I am not too sure at this point. find_all_inheritors() locks only the children; the parent is already locked separately. 
find_all_inheritors() does not require locking the children with the same lockmode as the parent. > Having said that, I think it would be desirable for all code paths to > manipulate partitions in the same order. For partitioned tables, I think > we can make it the partition bound order by replacing all calls to > find_all_inheritors and find_inheritance_children on partitioned table > parents with something else that reads partition OIDs from the relcache > (PartitionDesc) and traverses the partition tree breadth-first manner. > >>> Unfortunately I don't see any easy way around that problem, but maybe >>> somebody else has an idea. >> >> One approach I had considered was to have find_inheritance_children() >> itself lock the children in bound order, so that everyone will have >> bound-ordered oids, but that would be too expensive since it requires >> opening all partitioned tables to initialize partition descriptors. In >> find_inheritance_children(), we get all oids without opening any >> tables. But now that I think more of it, it's only the partitioned >> tables that we have to open, not the leaf partitions; and furthermore, >> I didn't see calls to find_inheritance_children() and >> find_all_inheritors() in performance-critical code, except in >> expand_inherited_rtentry(). All of them are in DDL commands; but yes, >> that can change in the future. > > This approach more or less amounts to calling the new > RelationGetPartitionDispatchInfo() (per my proposed patch, a version of > which I posted upthread.) Maybe we can add a wrapper on top, say, > get_all_partition_oids() which throws away other things that > RelationGetPartitionDispatchInfo() returned. In addition it locks all the > partitions that are returned, unlike only the partitioned ones, which is > what RelationGetPartitionDispatchInfo() has been taught to do. So there are three different task items here: 1. Arrange the oids in a consistent order everywhere. 2. 
Prepare the Partition Dispatch Info data structure in the planner rather than during execution. 3. For update tuple routing, assume that the result rels are ordered consistently to make the searching efficient. #3 depends on #1. So for that, I have come up with a minimal set of changes to have expand_inherited_rtentry() generate the rels in bound order. When we do #2, we may need to redo my changes in expand_inherited_rtentry(), but those are minimal. We may even end up having the walker function being used at multiple places, but right now it is not certain. So, I thi
Re: [HACKERS] map_partition_varattnos() and whole-row vars
var->vartype = > context->to_rowtype; > + return (Node *) r; > + } > > I think we could set r->resulttype to the vartype (ie, "r->resulttype = > newvar->vartype" instead of "r->resulttype = context->from_rowtype"). I agree. --- Few more comments: @@ -1240,7 +1247,7 @@ map_variable_attnos_mutator(Node *node, var->varlevelsup == context->sublevels_up) { /* Found a matching variable, make the substitution */ - Var*newvar = (Var *) palloc(sizeof(Var)); + Var*newvar = copyObject(var); int attno = var->varattno; *newvar = *var; Here, "*newvar = *var" should be removed. --- - result = map_partition_varattnos(result, 1, rel, parent); + result = map_partition_varattnos(result, 1, rel, parent, + &found_whole_row); + /* There can never be a whole-row reference here */ + if (found_whole_row) + elog(ERROR, "unexpected whole-row reference found in partition key"); Instead of the callers of map_partition_varattnos() reporting the error, we can have map_partition_varattnos() itself report the error. Instead of the found_whole_row parameter of map_partition_varattnos(), we can have an error_on_whole_row parameter. So callers who don't expect a whole row would pass error_on_whole_row=true to map_partition_varattnos(). This will simplify the resultant code a bit. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
s might be automatically solved. Below are the TODOS at this point : Fix for bug reported by Rajkumar about update with join. Do something about two separate mapping tables for Transition tables and update tuple-routing. GetUpdatedColumns() to be moved to header file. More test scenarios in regression tests. Need to check/test whether we are correctly applying insert policies (and not update) while inserting a routed tuple. Use getASTriggerResultRelInfo() for attrno mapping, rather than first resultrel, for generating child WCO/RETURNING expression. Address Robert's review comments on make_resultrel_ordered.patch. pgindent. [1] https://www.postgresql.org/message-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983%40lab.ntt.co.jp Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] map_partition_varattnos() and whole-row vars
On 28 July 2017 at 11:17, Amit Langote wrote: > On 2017/07/26 16:58, Amit Langote wrote: >> Rajkumar Raghuwanshi reported [1] on the "UPDATE partition key" thread >> that whole-row vars don't play nicely with partitioned table operations. >> >> For example, when used to convert WITH CHECK OPTION constraint expressions >> and RETURNING target list expressions, it will error out if the >> expressions contained whole-row vars. That's a bug, because whole-row >> vars are legal in those expressions. I think the decision to output error >> upon encountering a whole-row in the input expression was based on the >> assumption that it will only ever be used to convert partition constraint >> expressions, which cannot contain those. So, we can relax that >> requirement so that its other users are not bitten by it. >> >> Attached fixes that. > > Attached a revised version of the patch. > > Updated patch moves the if (found_whole_row) elog(ERROR) bit in > map_partition_varattnos to the callers. Only the callers know if > whole-row vars are not expected in the expression it's getting converted > from map_partition_varattnos. Given the current message text (mentioning > "partition key"), it didn't seem appropriate to have the elog inside this > function, since it's callable from places where it wouldn't make sense. create table foo (a int, b text) partition by list (a); create table foo1 partition of foo for values in (1); create table foo2(b text, a int) ; alter table foo attach partition foo2 for values in (2); postgres=# insert into foo values (1, 'hi there'); INSERT 0 1 postgres=# insert into foo values (2, 'bi there'); INSERT 0 1 postgres=# insert into foo values (2, 'error there') returning foo; ERROR: table row type and query-specified row type do not match DETAIL: Table has type text at ordinal position 1, but query expects integer. 
This is because the composite type tuple descriptor of the leaf partition doesn't match the RETURNING slot tuple descriptor. We probably need to do what inheritance_planner()=>adjust_appendrel_attrs() does for adjusting the attrs in the child rel parse tree; i.e., handle some specific nodes other than just simple var nodes. In adjust_appendrel_attrs_mutator(), for a whole-row node, it updates var->vartype to the child rel's rowtype. I suspect that the other nodes adjust_appendrel_attrs_mutator() handles (like JoinExpr, CurrentOfExpr, etc.) may also require similar adjustments for our case, because WithCheckOptions (and I think even RETURNING) can have a subquery. > > Thanks, > Amit -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 26 July 2017 at 02:37, Robert Haas wrote: > Is there any real benefit in this "walker" interface? It looks to me > like it might be simpler to just change things around so that it > returns a list of OIDs, like find_all_inheritors, but generated > differently. Then if you want bound-ordering rather than > OID-ordering, you just do this: > > list_free(inhOids); > inhOids = get_partition_oids_in_bound_order(rel); > > That'd remove the need for some if/then logic as you've currently got > in get_next_child(). Yes, I had considered that; i.e., first generating just a list of bound-ordered oids. But that consequently requires all the child tables to be opened and closed twice: once during the list generation, and then while expanding the partitioned table. Agreed, the second time heap_open() would not be that expensive because the tables would be cached, but it would still require getting the cached relation handle from the hash table. Since we anyway want to open the tables, it is better to have a *next() function to fetch the next partition in a fixed order. Actually, there isn't much that the walker next() function does. Any code that wants to traverse bound-wise can do that on its own. The walker function is just a convenient way to make sure everyone traverses in the same order. Yet to go over other things including your review comments, and Amit Langote's patch on refactoring RelationGetPartitionDispatchInfo(). > On another note, did you do anything about the suggestion Thomas made > in > http://postgr.es/m/CAEepm=3sc_j1zwqdyrbu4dtfx5rhcamnnuaxrkwzfgt9m23...@mail.gmail.com > ? This is still pending on me; plus I think there are some more points. I need to go over those and consolidate a list of todos. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi wrote: > On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar > wrote: >> >> >> Attached update-partition-key_v13.patch now contains this >> make_resultrels_ordered.patch changes. >> > > I have applied attach patch and got below observation. > > Observation : if join producing multiple output rows for a given row to be > modified. I am seeing here it is updating a row and also inserting rows in > target table. hence after update total count of table got incremented. Thanks for catching this, Rajkumar. After the row to be updated has already been moved to another partition, when the next join output row corresponds to that same moved row, the row is found to be deleted, so ExecDelete()=>heap_delete() gets HeapTupleSelfUpdated, and this is not handled. So even when ExecDelete() finds that the row is already deleted, we still call ExecInsert(), and a new row is inserted. In ExecDelete(), we should indicate that the row is already deleted. In the existing patch, there is a parameter concurrently_deleted for ExecDelete() which indicates that the row is concurrently deleted. I think we can use this parameter for both of these purposes so as to avoid ExecInsert() in both these scenarios. Will work on a patch. 
Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables
On 24 July 2017 at 12:11, Amit Langote wrote: > Hi Amit, > > On 2017/07/24 14:09, Amit Khandekar wrote: >>>> On 2017/07/10 14:15, Etsuro Fujita wrote: >>>> Another thing I noticed is the error handling in ExecWithCheckOptions; it >>>> doesn't take any care for partition tables, so the column description in >>>> the error message for WCO_VIEW_CHECK is built using the partition table's >>>> rowtype, not the root table's. But I think it'd be better to build that >>>> using the root table's rowtype, like ExecConstraints, because that would >>>> make the column description easier to understand since the parent view(s) >>>> (from which WithCheckOptions evaluated there are created) would have been >>>> defined on the root table. This seems independent from the above issue, >>>> so I created a separate patch, which I'm attaching. What do you think >>>> about that? >> >> + if (map != NULL) >> + { >> + tuple = do_convert_tuple(tuple, map); >> + ExecStoreTuple(tuple, slot, InvalidBuffer, false); >> + } >> >> Above, the tuple descriptor also needs to be set, since the parent and >> child partitions can have different column ordering. >> >> Due to this, the following testcase crashes : >> >> CREATE TABLE range_parted (a text,b int, c int) partition by range (b); >> CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > >> 120) WITH CHECK OPTION; >> create table part_a_1_a_10(b int, c int, a text); >> alter table range_parted attach partition part_a_1_a_10 for values >> from (1) to (10); >> insert into upview values ('a', 2, 100); > > Good catch. Thanks for creating the patch. > > There are other places with similar code viz. those in ExecConstraints() > that would need fixing. Ah ok. I should have noticed that. Thanks. > Attached is the updated version of your patch. Now that this is done, any particular reason it is not done in ExecPartitionCheck() ? I see that there is a do_convert_tuple() in that function, again without changing the slot descriptor. 
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 13 July 2017 at 22:39, Amit Khandekar wrote: > Attached is a WIP patch (make_resultrels_ordered.patch) that generates > the result rels in canonical order. This patch is kept separate from > the update-partition-key patch, and can be applied on master branch. The attached update-partition-key_v13.patch now contains the make_resultrels_ordered.patch changes. So now that the per-subplan result rels and the leaf partition oids that are generated for tuple routing are both known to have the same (canonical) order, in ExecSetupPartitionTupleRouting() we look for the per-subplan results without the need for a hash table. Instead of the hash table, we iterate over the leaf partition oids and at the same time keep shifting a position over the per-subplan resultrels whenever the resultrel at the position is found to be present in the leaf partitions list. Since the two lists are in the same order, we never have to rescan the portion of the lists that has already been scanned. I considered whether the issue behind this recent commit might be relevant for update tuple-routing as well: commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec Author: Robert Haas Date: Mon Jul 17 21:29:45 2017 -0400 Use a real RT index when setting up partition tuple routing. Since we know that using a dummy 1 value for tuple routing result rels is not correct, I am checking another possibility: Now in the latest patch, the tuple routing partitions would have a mix of a) existing update result-rels, and b) new partition resultrels. 'b' resultrels would have the RT index of nominalRelation, but the existing 'a' resultrels would have their own different RT indexes. I suspect this might surface a similar issue to the one fixed by the above commit, e.g. with a WITH query having UPDATE subqueries doing tuple routing. Will check that. This patch also has Robert's changes in the planner to decide whether to do update tuple routing. 
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company update-partition-key_v13.patch Description: Binary data
Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables
>> On 2017/07/10 14:15, Etsuro Fujita wrote: >> Another thing I noticed is the error handling in ExecWithCheckOptions; it >> doesn't take any care for partition tables, so the column description in >> the error message for WCO_VIEW_CHECK is built using the partition table's >> rowtype, not the root table's. But I think it'd be better to build that >> using the root table's rowtype, like ExecConstraints, because that would >> make the column description easier to understand since the parent view(s) >> (from which WithCheckOptions evaluated there are created) would have been >> defined on the root table. This seems independent from the above issue, >> so I created a separate patch, which I'm attaching. What do you think >> about that? + if (map != NULL) + { + tuple = do_convert_tuple(tuple, map); + ExecStoreTuple(tuple, slot, InvalidBuffer, false); + } Above, the tuple descriptor also needs to be set, since the parent and child partitions can have different column ordering. Due to this, the following testcase crashes : CREATE TABLE range_parted (a text,b int, c int) partition by range (b); CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > 120) WITH CHECK OPTION; create table part_a_1_a_10(b int, c int, a text); alter table range_parted attach partition part_a_1_a_10 for values from (1) to (10); insert into upview values ('a', 2, 100); Attached is a patch that sets the tuple descriptor. Also in the patch, in test updatable_view.sql, I have added a varchar column in one of the partitioned tables used in updatable_views.sql, so as to cover this scenario. Without setting the tuple descriptor, the output of the patched updatable_views.sql shows junk value in one of the columns of the row in the error message emitted for the WithCheckOption violation. 
Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company set_slot_descriptor.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 5 July 2017 at 15:12, Amit Khandekar wrote: > Like I mentioned upthread... in expand_inherited_rtentry(), if we > replace find_all_inheritors() with something else that returns oids in > canonical order, that will change the order in which children tables > get locked, which increases the chance of deadlock. Because, then the > callers of find_all_inheritors() will lock them in one order, while > callers of expand_inherited_rtentry() will lock them in a different > order. Even in the current code, I think there is a chance of > deadlocks because RelationGetPartitionDispatchInfo() and > find_all_inheritors() have different lock ordering. > > Now, to get the oids of a partitioned table children sorted by > canonical ordering, (i.e. using the partition bound values) we need to > either use the partition bounds to sort the oids like the way it is > done in RelationBuildPartitionDesc() or, open the parent table and get > it's Relation->rd_partdesc->oids[] which are already sorted in > canonical order. So if we generate oids using this way in > find_all_inheritors() and find_inheritance_children(), that will > generate consistent ordering everywhere. But this method is quite > expensive as compared to the way oids are generated and sorted using > oid values in find_inheritance_children(). > > In both expand_inherited_rtentry() and > RelationGetPartitionDispatchInfo(), each of the child tables are > opened. > > So, in both of these functions, what we can do is : call a new > function partition_tree_walker() which does following : > 1. Lock the children using the existing order (i.e. sorted by oid > values) using the same function find_all_inheritors(). Rename > find_all_inheritors() to lock_all_inheritors(... , bool return_oids) > which returns the oid list only if requested. > 2. 
And then scan through each of the partitions in canonical order, by > opening the parent table, then opening the partition descriptor oids, > and then doing whatever needs to be done with that partition rel. > > partition_tree_walker() will look something like this : > > void partition_tree_walker(Oid parentOid, LOCKMODE lockmode, >void (*walker_func) (), void *context) > { > Relation parentrel; > List *rels_list; > ListCell *cell; > > (void) lock_all_inheritors(parentOid, lockmode, >false /* don't generate oids */); > > parentrel = heap_open(parentOid, NoLock); > rels_list = append_rel_partition_oids(NIL, parentrel); > > /* Scan through all partitioned rels, and at the > * same time append their children. */ > foreach(cell, rels_list) > { > /* Open partrel without locking; lock_all_inheritors() has locked it > */ > Relationpartrel = heap_open(lfirst_oid(cell), NoLock); > > /* Append the children of a partitioned rel to the same list > * that we are iterating on */ > if (RelationGetPartitionDesc(partrel)) > rels_list = append_rel_partition_oids(rels_list, partrel); > > /* > * Do whatever processing needs to be done on this partel. > * The walker function is free to either close the partel > * or keep it opened, but it needs to make sure the opened > * ones are closed later > */ > walker_func(partrel, context); > } > } > > List *append_rel_partition_oids(List *rel_list, Relation rel) > { > int i; > for (i = 0; i < rel->rd_partdesc->nparts; i++) > rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]); > > return rel_list; > } > > > So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be > replaced by partition_tree_walker(parentOid, expand_rte_walker_func) > where expand_rte_walker_func() will do all the work done in the for > loop for each of the partition rels. 
> > Similarly, in RelationGetPartitionDispatchInfo() the initial part > where it uses APPEND_REL_PARTITION_OIDS() can be replaced by > partition_tree_walker(rel, dispatch_info_walkerfunc) where > dispatch_info_walkerfunc() will generate the oids, or may be populate > the complete PartitionDispatchData structure. 'pd' variable can be > passed as context to the partition_tree_walker(..., context) > > Generating the resultrels in canonical order by opening the tables > using the above way wouldn't be more expensive than the existing code, > because even currently we anyways have to open all the tables in both > of these functions. > Attached is a WIP patch (make_resultrels_ordered.patch) that generates the result rels in canonical order. This patch is kept separate from the updat
Re: [HACKERS] Parallel Append implementation
On 30 June 2017 at 15:10, Rafia Sabih wrote: > > On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar > wrote: >> >> Attached is an updated patch v13 that has some comments changed as per >> your review, and also rebased on latest master. > > > This is not applicable on the latest head i.e. commit -- > 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing. Thanks for notifying. Attached is the rebased version of the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company ParallelAppend_v13_rebased.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 4 July 2017 at 15:23, Amit Khandekar wrote: > On 4 July 2017 at 14:48, Amit Khandekar wrote: >> On 4 July 2017 at 14:38, Amit Langote wrote: >>> On 2017/07/04 17:25, Etsuro Fujita wrote: >>>> On 2017/07/03 18:54, Amit Langote wrote: >>>>> On 2017/07/02 20:10, Robert Haas wrote: >>>>>> That >>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>>>> that it's got a partitioned table and call >>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>>>> produce the list of OIDs. >>>> Seems like a good idea. >>>> >>>>> Interesting idea. >>>>> >>>>> If we are going to do this, I think we may need to modify >>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>>>> does not do as much work. Currently, it assumes that it's only ever >>>>> called by ExecSetupPartitionTupleRouting() and hence also generates >>>>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>>>> that if called from within the planner. >>>>> >>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>>>> with its usage within the executor, because there is this comment: >>>>> >>>>> /* >>>>> * We keep the partitioned ones open until we're done using the >>>>> * information being collected here (for example, see >>>>> * ExecEndModifyTable). >>>>> */ >>>> >>>> Yeah, we need some refactoring work. Is anyone working on that? >>> >>> I would like to take a shot at that if someone else hasn't already cooked >>> up a patch. Working on making RelationGetPartitionDispatchInfo() a >>> routine callable from both within the planner and the executor should be a >>> worthwhile effort. >> >> What I am currently working on is to see if we can call >> find_all_inheritors() or find_inheritance_children() instead of >> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). >> Possibly we don't have to refactor it completely. 
>> find_inheritance_children() needs to return the oids in canonical >> order. So in find_inheritance_children () need to re-use part of >> RelationBuildPartitionDesc() where it generates those oids in that >> order. I am checking this part, and am going to come up with an >> approach based on findings. > > The other approach is to make canonical ordering only in > find_all_inheritors() by replacing call to find_inheritance_children() > with the refactored/modified RelationGetPartitionDispatchInfo(). But > that would mean that the callers of find_inheritance_children() would > have one ordering, while the callers of find_all_inheritors() would > have a different ordering; that brings up chances of deadlocks. That's > why I think, we need to think about modifying the common function > find_inheritance_children(), so that we would be consistent with the > ordering. And then use find_inheritance_children() or > find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes, > there would be some modifications to > RelationGetPartitionDispatchInfo(). > >> >> Also, need to investigate whether *always* sorting the oids in >> canonical order is going to be much expensive than the current sorting >> using oids. But I guess it won't be that expensive. Like I mentioned upthread... in expand_inherited_rtentry(), if we replace find_all_inheritors() with something else that returns oids in canonical order, that will change the order in which children tables get locked, which increases the chance of deadlock. Because, then the callers of find_all_inheritors() will lock them in one order, while callers of expand_inherited_rtentry() will lock them in a different order. Even in the current code, I think there is a chance of deadlocks because RelationGetPartitionDispatchInfo() and find_all_inheritors() have different lock ordering. Now, to get the oids of a partitioned table children sorted by canonical ordering, (i.e. 
using the partition bound values) we need to either use the partition bounds to sort the oids the way it is done in RelationBuildPartitionDesc(), or open the parent table and get its Relation->rd_partdesc->oids[], which are already sorted in canonical order. So if we generate oids this way in find_all_inheritors() and find_inheritance_children(), that will generate consistent ordering e
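The deadlock concern above comes down to one invariant: every code path must lock children in the same order. A minimal C sketch of that invariant, with `ToyPartDesc` and `children_in_canonical_order` as hypothetical stand-ins (not PostgreSQL APIs): both lookup paths copy the one canonical ordering stored in the parent's descriptor instead of producing their own, so two sessions can never lock the same pair of partitions in opposite orders.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int Oid;

/*
 * Hypothetical stand-in for a parent's partition descriptor, whose oids[]
 * are already sorted in canonical (partition bound) order -- note that
 * canonical order need not be numeric Oid order.
 */
typedef struct ToyPartDesc
{
    int nparts;
    Oid oids[8];
} ToyPartDesc;

/*
 * Every caller that needs the children -- whether it plays the role of
 * find_inheritance_children() or RelationGetPartitionDispatchInfo() --
 * copies the single canonical ordering.  Locking children strictly in
 * this order is what rules out the ABBA deadlock the thread worries
 * about when two paths use different orderings.
 */
static int
children_in_canonical_order(const ToyPartDesc *pd, Oid *out)
{
    memcpy(out, pd->oids, pd->nparts * sizeof(Oid));
    return pd->nparts;
}
```

Two independent callers are guaranteed identical sequences, which is exactly the consistency property proposed for find_inheritance_children() and find_all_inheritors().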
Re: [HACKERS] UPDATE of partition key
On 4 July 2017 at 14:48, Amit Khandekar wrote: > On 4 July 2017 at 14:38, Amit Langote wrote: >> On 2017/07/04 17:25, Etsuro Fujita wrote: >>> On 2017/07/03 18:54, Amit Langote wrote: >>>> On 2017/07/02 20:10, Robert Haas wrote: >>>>> That >>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>>> that it's got a partitioned table and call >>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>>> produce the list of OIDs. >>> Seems like a good idea. >>> >>>> Interesting idea. >>>> >>>> If we are going to do this, I think we may need to modify >>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>>> does not do as much work. Currently, it assumes that it's only ever >>>> called by ExecSetupPartitionTupleRouting() and hence also generates >>>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>>> that if called from within the planner. >>>> >>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>>> with its usage within the executor, because there is this comment: >>>> >>>> /* >>>> * We keep the partitioned ones open until we're done using the >>>> * information being collected here (for example, see >>>> * ExecEndModifyTable). >>>> */ >>> >>> Yeah, we need some refactoring work. Is anyone working on that? >> >> I would like to take a shot at that if someone else hasn't already cooked >> up a patch. Working on making RelationGetPartitionDispatchInfo() a >> routine callable from both within the planner and the executor should be a >> worthwhile effort. > > What I am currently working on is to see if we can call > find_all_inheritors() or find_inheritance_children() instead of > generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). > Possibly we don't have to refactor it completely. > find_inheritance_children() needs to return the oids in canonical > order. 
So in find_inheritance_children () need to re-use part of > RelationBuildPartitionDesc() where it generates those oids in that > order. I am checking this part, and am going to come up with an > approach based on findings. The other approach is to make canonical ordering only in find_all_inheritors() by replacing call to find_inheritance_children() with the refactored/modified RelationGetPartitionDispatchInfo(). But that would mean that the callers of find_inheritance_children() would have one ordering, while the callers of find_all_inheritors() would have a different ordering; that brings up chances of deadlocks. That's why I think, we need to think about modifying the common function find_inheritance_children(), so that we would be consistent with the ordering. And then use find_inheritance_children() or find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes, there would be some modifications to RelationGetPartitionDispatchInfo(). > > Also, need to investigate whether *always* sorting the oids in > canonical order is going to be much expensive than the current sorting > using oids. But I guess it won't be that expensive. > > > -- > Thanks, > -Amit Khandekar > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 4 July 2017 at 14:38, Amit Langote wrote: > On 2017/07/04 17:25, Etsuro Fujita wrote: >> On 2017/07/03 18:54, Amit Langote wrote: >>> On 2017/07/02 20:10, Robert Haas wrote: >>>> That >>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>> that it's got a partitioned table and call >>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>> produce the list of OIDs. >> Seems like a good idea. >> >>> Interesting idea. >>> >>> If we are going to do this, I think we may need to modify >>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>> does not do as much work. Currently, it assumes that it's only ever >>> called by ExecSetupPartitionTupleRouting() and hence also generates >>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>> that if called from within the planner. >>> >>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>> with its usage within the executor, because there is this comment: >>> >>> /* >>> * We keep the partitioned ones open until we're done using the >>> * information being collected here (for example, see >>> * ExecEndModifyTable). >>> */ >> >> Yeah, we need some refactoring work. Is anyone working on that? > > I would like to take a shot at that if someone else hasn't already cooked > up a patch. Working on making RelationGetPartitionDispatchInfo() a > routine callable from both within the planner and the executor should be a > worthwhile effort. What I am currently working on is to see if we can call find_all_inheritors() or find_inheritance_children() instead of generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). Possibly we don't have to refactor it completely. find_inheritance_children() needs to return the oids in canonical order. So in find_inheritance_children () need to re-use part of RelationBuildPartitionDesc() where it generates those oids in that order. 
I am checking this part, and am going to come up with an approach based on the findings. Also, I need to investigate whether *always* sorting the oids in canonical order is going to be much more expensive than the current sorting using oids. But I guess it won't be that expensive. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 22 June 2017 at 01:41, Robert Haas wrote:
>>> +    for (i = 0; i < num_rels; i++)
>>> +    {
>>> +        ResultRelInfo *resultRelInfo = &result_rels[i];
>>> +        Relation    rel = resultRelInfo->ri_RelationDesc;
>>> +        Bitmapset  *expr_attrs = NULL;
>>> +
>>> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>>> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>>> +            return true;
>>> +    }
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true. So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar. But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not open (except root), so we would need to open them in order to get the partition key of the parents of update resultrels (or get only the partition key atts and exprs from pg_partitioned_table). There can be multiple approaches to finding partition key columns.

Approach 1 : When there are a few update result rels and a large partition tree, we traverse from each of the result rels to their ancestors, and open their ancestors (get_partition_parent()) to get the partition key columns.
For result rels having common parents, do this only once.

Approach 2 : If there are only a few partitioned tables and a large number of update result rels, it would be easier to just open all the partitioned tables and form the partition key column bitmap out of all their partition keys. If the bitmap does not have updated columns, that's not a partition-key update. So for typical non-partition-key updates, just opening the partitioned tables will suffice, and that would not affect performance of normal updates. But if the bitmap does have updated columns, we can't yet conclude that it's a partition-key update; that could be a false positive. We again need to further check whether the update result rels belong to ancestors that have updated partition keys.

Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs ?), store the partition key attrs that are used in rd_partcheck. Populate this field during generate_partition_qual().

So to conclude, I think we can do this :

Scenario 1 : Only one partitioned table, the root; all the rest are leaf partitions. In this case, it is definitely efficient to just check the root partition key, which will be sufficient.

Scenario 2 : There are a few non-leaf partitioned tables (3-4) : Open those tables, and follow the 2nd approach above. If we don't find any updated partition keys in any of them, well and good. If we do find some, fall back to approach 3 : for each of the update resultrels, use the new rd_partcheckattrs bitmap to know whether it uses any of the updated columns. This would be faster than pulling up attrs from the quals the way it was done in the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
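The bms_overlap() test that all of the approaches above reduce to can be shown in miniature. Below is a hedged, self-contained C sketch; `AttrBitmap` is a toy 64-bit stand-in for PostgreSQL's Bitmapset, and the helper names are invented for illustration. The point is only the shape of the check: an UPDATE can move a row between partitions only if some updated column is also a partition-key (or partition-constraint) column, so one bitmap overlap decides whether routing might be needed.

```c
#include <assert.h>

/* Toy 64-bit stand-in for a Bitmapset; bit n represents attribute n. */
typedef unsigned long long AttrBitmap;

static AttrBitmap
attr_add(AttrBitmap set, int attno)
{
    return set | (1ULL << attno);
}

static int
attr_overlap(AttrBitmap a, AttrBitmap b)
{
    return (a & b) != 0ULL;     /* mirrors bms_overlap() */
}

/*
 * Approaches 2/3 in miniature: collect the partition-key attrs into one
 * bitmap, collect the updated attrs into another, and test for overlap.
 * No overlap means this is definitely not a partition-key update; an
 * overlap means routing *might* be needed (a possible false positive
 * that the finer per-resultrel check then resolves).
 */
static int
may_need_tuple_routing(AttrBitmap partkey_attrs, AttrBitmap updated_attrs)
{
    return attr_overlap(partkey_attrs, updated_attrs);
}
```

This also illustrates why approach 2 is cheap for the common case: for a non-partition-key update the overlap is empty and no per-leaf work happens at all.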
Re: [HACKERS] UPDATE of partition key
On 29 June 2017 at 07:42, Amit Langote wrote:
> Hi Amit,
>
> On 2017/06/28 20:43, Amit Khandekar wrote:
>> In attached patch v12
>
> The patch no longer applies and fails to compile after the following
> commit was made yesterday:
>
> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
> Author: Andrew Gierth
> Date: Wed Jun 28 18:55:03 2017 +0100
>
> Fix transition tables for partition/inheritance.

Thanks for informing, Amit. As Thomas mentioned upthread, the above commit already uses a tuple conversion mapping from leaf partition to root partitioned table (mt_transition_tupconv_maps), which serves the same purpose as the mapping used in the update-partition-key patch during update tuple routing (mt_resultrel_maps). We need to try to merge these two into a general-purpose mapping array such as mt_leaf_root_maps. I haven't done that in the rebased patch (attached), so currently it has both these mapping fields. For transition tables, this map is per-leaf-partition in case of inserts, whereas it is per-subplan result rel for updates. For update-tuple routing, the mapping is required to be per-subplan. Now, for update-row movement in the presence of transition tables, we would require both the per-subplan mapping as well as the per-leaf-partition mapping, which can't be done if we have a single mapping field, unless we have some way to identify which of the per-leaf partition mapping elements belong to per-subplan rels. So, it's not immediately possible to merge them. update-partition-key_v12_rebased.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 22 June 2017 at 01:57, Robert Haas wrote: > On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar > wrote: >>>> Yep, it's more appropriate to use >>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That >>>> is, if answer to the question I raised above is positive. >> >> From what I had checked earlier when coding that part, >> rootResultRelInfo is NULL in case of inserts, unless something has >> changed in later commits. That's the reason I decided to use the first >> resultRelInfo. > > We're just going around in circles here. Saying that you decided to > use the first child's resultRelInfo because you didn't have a > resultRelInfo for the parent is an explanation of why you wrote the > code the way you did, but that doesn't make it correct. I want to > know why you think it's correct. Yeah, that was just an FYI on how I decided to use the first resultRelInfo; it was not for explaining why using first resultRelInfo is correct. So upthread, I have tried to explain. > > I think it's probably wrong, because it seems to me that if the INSERT > code needs to use the parent's ResultRelInfo rather than the first > child's ResultRelInfo, the UPDATE code probably needs to do the same. > Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of > resultRelInfos for non-leaf partitions, and commit > e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back > for the topmost parent, because otherwise it didn't work correctly. Regarding rootResultRelInfo , it would have been good if rootResultRelInfo was set for both insert and update, but it isn't set for inserts. 
For inserts : In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains NULL because ModifyTable->rootResultRelIndex is -1 :

    /* If modifying a partitioned table, initialize the root table info */
    if (node->rootResultRelIndex >= 0)
        mtstate->rootResultRelInfo = estate->es_root_result_relations +
            node->rootResultRelIndex;

ModifyTable->rootResultRelIndex is -1 because it does not get set, since ModifyTable->partitioned_rels is NULL :

    /*
     * If the main target relation is a partitioned table, the
     * following list contains the RT indexes of partitioned child
     * relations including the root, which are not included in the
     * above list.  We also keep RT indexes of the roots
     * separately to be identified as such during the executor
     * initialization.
     */
    if (splan->partitioned_rels != NIL)
    {
        root->glob->nonleafResultRelations =
            list_concat(root->glob->nonleafResultRelations,
                        list_copy(splan->partitioned_rels));
        /* Remember where this root will be in the global list. */
        splan->rootResultRelIndex =
            list_length(root->glob->rootResultRelations);
        root->glob->rootResultRelations =
            lappend_int(root->glob->rootResultRelations,
                        linitial_int(splan->partitioned_rels));
    }

ModifyTable->partitioned_rels is NULL because inheritance_planner() does not get called for INSERTs; instead, grouping_planner() gets called :

    subquery_planner()
    {
        /*
         * Do the main planning.  If we have an inherited target relation,
         * that needs special processing, else go straight to
         * grouping_planner.
         */
        if (parse->resultRelation &&
            rt_fetch(parse->resultRelation, parse->rtable)->inh)
            inheritance_planner(root);
        else
            grouping_planner(root, false, tuple_fraction);
    }

Above, inh is false in case of inserts.
Re: [HACKERS] UPDATE of partition key
On 26 June 2017 at 08:37, Amit Khandekar wrote: > On 22 June 2017 at 01:41, Robert Haas wrote: >>>> Second, it will amount to a functional bug if you get a >>>> different answer than the planner did. >>> >>> Actually, the per-leaf WCOs are meant to be executed on the >>> destination partitions where the tuple is moved, while the WCOs >>> belonging to the per-subplan resultRelInfo are meant for the >>> resultRelinfo used for the UPDATE plans. So actually it should not >>> matter whether they look same or different, because they are fired at >>> different objects. Now these objects can happen to be the same >>> relations though. >>> >>> But in any case, it's not clear to me how the mapped WCO and the >>> planner's WCO would yield a different answer if they are both the same >>> relation. I am possibly missing something. The planner has already >>> generated the withCheckOptions for each of the resultRelInfo. And then >>> we are using one of those to re-generate the WCO for a leaf partition >>> by only adjusting the attnos. If there is already a WCO generated in >>> the planner for that leaf partition (because that partition was >>> present in mtstate->resultRelInfo), then the re-built WCO should be >>> exactly look same as the earlier one, because they are the same >>> relations, and so the attnos generated in them would be same since the >>> Relation TupleDesc is the same. >> >> If the planner's WCOs and mapped WCOs are always the same, then I >> think we should try to avoid generating both. If they can be >> different, but that's intentional and correct, then there's no >> substantive problem with the patch but the comments need to make it >> clear why we are generating both. >> >>> Actually I meant, "above works for only local updates. 
For >>> row-movement-updates, we need per-leaf partition WCOs, because when >>> the row is inserted into target partition, that partition may be not >>> be included in the above planner resultRelInfo, so we need WCOs for >>> all partitions". I think this said comment should be sufficient if I >>> add this in the code ? >> >> Let's not get too focused on updating the comment until we are in >> agreement about what the code ought to be doing. I'm not clear >> whether you accept the point that the patch needs to be changed to >> avoid generating the same WCOs and returning lists in both the planner >> and the executor. > > Yes, we can re-use the WCOs generated in the planner, as an > optimization, since those we re-generate for the same relations will > look exactly the same. The WCOs generated by planner (in > inheritance_planner) are generated when (in adjust_appendrel_attrs()) > we change attnos used in the query to refer to the child RTEs and this > adjusts the attnos of the WCOs of the child RTEs. So the WCOs of > subplan resultRelInfo are actually the parent table WCOs, but only the > attnos changed. And in ExecInitModifyTable() we do the same thing for > leaf partitions, although using different function > map_variable_attnos(). In attached patch v12, during UPDATE tuple routing setup, for each leaf partition, we now check if it is present already in one of the UPDATE per-subplan resultrels. If present, we re-use them rather than creating a new one and opening the table again. So the mtstate->mt_partitions is now an array of ResultRelInfo pointers. That pointer points to either the UPDATE per-subplan result rel, or a newly allocated ResultRelInfo. For each of the leaf partitions, we have to search through the per-subplan resultRelInfo oids to check if there is a match. 
To do this, I have created a temporary hash table which stores oids and the ResultRelInfo pointers of mtstate->resultRelInfo array, and which can be used to search the oid for each of the leaf partitions. This patch version has handled only the above discussion point. I will follow up with the other points separately. update-partition-key_v12.patch Description: Binary data
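The reuse logic described above has a simple contract that can be sketched in self-contained C. `ToyResultRelInfo` and `find_subplan_result_rel` are hypothetical stand-ins, not executor APIs, and a linear scan stands in for the temporary hash table (the hash only changes the lookup cost, not the contract): given a leaf partition's OID, either return the existing per-subplan result rel, or return NULL so the caller opens the table and builds a fresh one.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical, pared-down result-rel entry: only the relation's OID. */
typedef struct ToyResultRelInfo
{
    Oid relid;
} ToyResultRelInfo;

/*
 * During UPDATE tuple-routing setup, check whether a leaf partition
 * already appears among the per-subplan result rels; if so, reuse that
 * entry instead of opening the table a second time.
 */
static ToyResultRelInfo *
find_subplan_result_rel(ToyResultRelInfo *subplan_rels, int n, Oid leaf_oid)
{
    for (int i = 0; i < n; i++)
        if (subplan_rels[i].relid == leaf_oid)
            return &subplan_rels[i];
    return NULL;    /* caller builds a fresh ResultRelInfo for this leaf */
}
```

Under this contract, mtstate->mt_partitions becomes an array of pointers, each pointing either into the subplan array or at a newly allocated entry, which is exactly the arrangement the patch describes.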
Re: [HACKERS] UPDATE of partition key
On 22 June 2017 at 01:41, Robert Haas wrote: >>> Second, it will amount to a functional bug if you get a >>> different answer than the planner did. >> >> Actually, the per-leaf WCOs are meant to be executed on the >> destination partitions where the tuple is moved, while the WCOs >> belonging to the per-subplan resultRelInfo are meant for the >> resultRelinfo used for the UPDATE plans. So actually it should not >> matter whether they look same or different, because they are fired at >> different objects. Now these objects can happen to be the same >> relations though. >> >> But in any case, it's not clear to me how the mapped WCO and the >> planner's WCO would yield a different answer if they are both the same >> relation. I am possibly missing something. The planner has already >> generated the withCheckOptions for each of the resultRelInfo. And then >> we are using one of those to re-generate the WCO for a leaf partition >> by only adjusting the attnos. If there is already a WCO generated in >> the planner for that leaf partition (because that partition was >> present in mtstate->resultRelInfo), then the re-built WCO should be >> exactly look same as the earlier one, because they are the same >> relations, and so the attnos generated in them would be same since the >> Relation TupleDesc is the same. > > If the planner's WCOs and mapped WCOs are always the same, then I > think we should try to avoid generating both. If they can be > different, but that's intentional and correct, then there's no > substantive problem with the patch but the comments need to make it > clear why we are generating both. > >> Actually I meant, "above works for only local updates. For >> row-movement-updates, we need per-leaf partition WCOs, because when >> the row is inserted into target partition, that partition may be not >> be included in the above planner resultRelInfo, so we need WCOs for >> all partitions". 
I think this said comment should be sufficient if I >> add this in the code ? > > Let's not get too focused on updating the comment until we are in > agreement about what the code ought to be doing. I'm not clear > whether you accept the point that the patch needs to be changed to > avoid generating the same WCOs and returning lists in both the planner > and the executor. Yes, we can re-use the WCOs generated in the planner, as an optimization, since those we re-generate for the same relations will look exactly the same. The WCOs generated by planner (in inheritance_planner) are generated when (in adjust_appendrel_attrs()) we change attnos used in the query to refer to the child RTEs and this adjusts the attnos of the WCOs of the child RTEs. So the WCOs of subplan resultRelInfo are actually the parent table WCOs, but only the attnos changed. And in ExecInitModifyTable() we do the same thing for leaf partitions, although using different function map_variable_attnos(). > >>> Also, I feel like it's probably not correct to use the first result >>> relation as the nominal relation for building WCOs and returning lists >>> anyway. I mean, if the first result relation has a different column >>> order than the parent relation, isn't this just broken? If it works >>> for some reason, the comments don't explain what that reason is. One thing I didn't mention earlier about the WCOs, is that for child rels, we don't use the WCOs defined for the child rels. We only inherit the WCO expressions defined for the root rel. That's the reason they are the same expressions, only the attnos changed to match the respective relation tupledesc. If the WCOs of each of the subplan resultRelInfo() were different, then definitely it was not possible to use the first resultRelinfo to generate other leaf partition WCOs, because the WCO defined for relation A is independent of that defined for relation B. 
So, since the WCOs of all the relations are actually those of the parent, we only need to adjust the attnos of any of these resultRelInfos. For e.g., if the root rel WCO is defined as "col > 5" where col is the 4th column, the expression will look like "var_1.attno_4 > 5". And the WCO that is generated for a subplan resultRelInfo will look something like "var_n.attno_2 > 5" if col is the 2nd column in this table. All of the above logic assumes that we never use the WCO defined for the child relation. At least that's how it looks by looking at the way we generate WCOs in ExecInitModifyTable() for INSERTs as well looking at the code in inheritance_planner() for UPDATEs. At both these places, we never use the WCOs defined for child tables. So suppose we define the tables and their WCOs like this : CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b); ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY; GRANT ALL ON range_parted TO PUBLIC ; create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true); create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition
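The attno translation described above ("var_1.attno_4 > 5" becoming "var_n.attno_2 > 5") can be shown with a toy remap. This is a hedged, self-contained C sketch; `ToyVar` and `remap_var_attnos` are invented stand-ins that only mimic the attno-rewriting aspect of map_variable_attnos(), nothing else about expression trees.

```c
#include <assert.h>

/* Toy Var: just an attribute number into a relation's tuple descriptor. */
typedef struct ToyVar
{
    int attno;
} ToyVar;

/*
 * Rewrite the attnos in a WCO-like expression for a child relation:
 * attno_map is indexed by the parent's attno and yields the child's
 * attno for the same column.  The expression itself (e.g. "col > 5")
 * is unchanged; only where "col" lives in the tuple differs per child.
 */
static void
remap_var_attnos(ToyVar *vars, int nvars, const int *attno_map)
{
    for (int i = 0; i < nvars; i++)
        vars[i].attno = attno_map[vars[i].attno];
}
```

This is why a single parent-level WCO suffices for every child: the per-relation difference is purely this mechanical attno substitution, so regenerating a child's WCO from any resultRelInfo with matching column mapping yields an identical expression.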
Re: [HACKERS] UPDATE of partition key
On 21 June 2017 at 20:14, Robert Haas wrote: > On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote > wrote:>> The comment "UPDATE/DELETE > cases are handled above" is referring to >>> the code that initializes the WCOs generated by the planner. You've >>> modified the comment in your patch, but the associated code: your >>> updated comment says that only "DELETEs and local UPDATES are handled >>> above", but in reality, *all* updates are still handled above. And >>> then they are handled again here. Similarly for returning lists. >>> It's certainly not OK for the comment to be inaccurate, but I think >>> it's also bad to redo the work which the planner has already done, >>> even if it makes the patch smaller. >> >> I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it >> seems like WCOs are being initialized for the leaf partitions >> (ResultRelInfos in the mt_partitions array) that are in turn are >> initialized for the aforementioned INSERT. That's why the term "...local >> UPDATEs" in the new comment text. >> >> If that's true, I wonder if it makes sense to apply what would be >> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into >> by calling ExecInsert()? > > I think we probably should apply the insert policy, just as we're > executing the insert trigger. Yes, the RLS quals should execute during tuple routing according to whether it is a update or whether it has been converted to insert. I think the tests don't quite test the insert part. Will check. > >>> Also, I feel like it's probably not correct to use the first result >>> relation as the nominal relation for building WCOs and returning lists >>> anyway. I mean, if the first result relation has a different column >>> order than the parent relation, isn't this just broken? If it works >>> for some reason, the comments don't explain what that reason is. >> >> Yep, it's more appropriate to use >> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. 
That >> is, if answer to the question I raised above is positive. From what I had checked earlier when coding that part, rootResultRelInfo is NULL in case of inserts, unless something has changed in later commits. That's the reason I decided to use the first resultRelInfo. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 21 June 2017 at 00:23, Robert Haas wrote: > On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar > wrote: >>> I guess I don't see why it should work like this. In the INSERT case, >>> we must build withCheckOption objects for each partition because those >>> partitions don't appear in the plan otherwise -- but in the UPDATE >>> case, they're already there, so why do we need to build anything at >>> all? Similarly for RETURNING projections. How are the things we need >>> for those cases not already getting built, associated with the >>> relevant resultRelInfos? Maybe there's a concern if some children got >>> pruned - they could turn out later to be the children into which >>> tuples need to be routed. But the patch makes no distinction >>> between possibly-pruned children and any others. >> >> Yes, only a subset of the partitions appear in the UPDATE subplans. I >> think typically for updates, a very small subset of the total leaf >> partitions will be there in the plans, others would get pruned. IMHO, >> it would not be worth having an optimization where it opens only those >> leaf partitions which are not already there in the subplans. Without >> the optimization, we are able to re-use the INSERT infrastructure >> without additional changes. > > Well, that is possible, but certainly not guaranteed. I mean, > somebody could do a whole-table UPDATE, or an UPDATE that hits a > smattering of rows in every partition; I am not saying that it's guaranteed to be a small subset. I am saying that it would be typically a small subset for update-of-partitioned-key case. Seems weird if a user causes an update-row-movement for multiple partitions at the same time. 
Generally it would be an administrative task where some/all of the rows of a partition need to have their partition key updated, causing them to change their partition, and so there would probably be a WHERE clause that narrows down the update to that particular partition, because without the WHERE clause the update is anyway slower and it's redundant to scan all other partitions. But, point taken, there can always be certain cases involving partition-key updates that span multiple partitions. > e.g. the table is partitioned on order number, and you do UPDATE > lineitem SET product_code = 'K372B' WHERE product_code = 'K372'. This query does not update order number, so here there is no partition-key-update. Are you thinking that the patch is generating the per-leaf-partition WCO expressions even for an update not involving a partition key? > > Leaving that aside, the point here is that you're rebuilding > withCheckOptions and returningLists that have already been built in > the planner. That's bad for two reasons. First, it's inefficient, > especially if there are many partitions. Yeah, I agree that this becomes more and more redundant if the update involves more partitions. > Second, it will amount to a functional bug if you get a > different answer than the planner did. Actually, the per-leaf WCOs are meant to be executed on the destination partitions where the tuple is moved, while the WCOs belonging to the per-subplan resultRelInfo are meant for the resultRelInfo used for the UPDATE plans. So actually it should not matter whether they look the same or different, because they are fired at different objects. Now these objects can happen to be the same relations though. But in any case, it's not clear to me how the mapped WCO and the planner's WCO would yield a different answer if they are both the same relation. I am possibly missing something. The planner has already generated the withCheckOptions for each of the resultRelInfos. 
And then we are using one of those to re-generate the WCO for a leaf partition by only adjusting the attnos. If there is already a WCO generated in the planner for that leaf partition (because that partition was present in mtstate->resultRelInfo), then the re-built WCO should look exactly the same as the earlier one, because they are the same relations, and so the attnos generated in them would be the same since the Relation TupleDesc is the same. > Note this comment in the existing code: > > /* > * Build WITH CHECK OPTION constraints for each leaf partition rel. Note > * that we didn't build the withCheckOptionList for each partition within > * the planner, but simple translation of the varattnos for each partition > * will suffice. This only occurs for the INSERT case; UPDATE/DELETE > * cases are handled above. > */ > > The comment "UPDATE/DELETE cases are handled above" is referring to > the code that initializes the WCOs generated by the planner. You've > modified the co
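The attno translation being discussed (what map_partition_varattnos() does between two relations that have the same columns in different physical order) can be modeled outside the executor. The sketch below is a toy illustration, not PostgreSQL's actual API; it also shows the point above that translating between two identical descriptors is a no-op:

```python
# Toy model of translating Var attnos between two relations that have the
# same columns in different physical order (the situation
# map_partition_varattnos() handles); illustrative only, not the real API.

def translate_attnos(expr_attnos, src_cols, dst_cols):
    """Remap 1-based attnos in an expression from src numbering to dst."""
    src_name = {i + 1: name for i, name in enumerate(src_cols)}
    dst_pos = {name: i + 1 for i, name in enumerate(dst_cols)}
    return [dst_pos[src_name[a]] for a in expr_attnos]

root = ["a", "b", "c"]   # root partition's column order
leaf = ["c", "a", "b"]   # a leaf storing the same columns reordered

# A WCO qual referencing root attnos 1 and 3 ("a" and "c") references
# leaf attnos 2 and 1 after translation.
print(translate_attnos([1, 3], root, leaf))     # [2, 1]

# If source and target descriptors match, translation is the identity --
# which is why a re-built WCO looks exactly like the planner's version.
print(translate_attnos([1, 2, 3], root, root))  # [1, 2, 3]
```

This mirrors the argument in the mail: for a relation that already appears in mtstate->resultRelInfo, re-deriving the WCO through the map reproduces the planner's expression unchanged.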
Re: [HACKERS] UPDATE of partition key
On 20 June 2017 at 03:46, Robert Haas wrote: > On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar > wrote: >> Attached patch v10 fixes the above. In the existing code, where it >> builds WCO constraints for each leaf partition; with the patch, that >> code now is applicable to row-movement-updates as well. > > I guess I don't see why it should work like this. In the INSERT case, > we must build withCheckOption objects for each partition because those > partitions don't appear in the plan otherwise -- but in the UPDATE > case, they're already there, so why do we need to build anything at > all? Similarly for RETURNING projections. How are the things we need > for those cases not already getting built, associated with the > relevant resultRelInfos? Maybe there's a concern if some children got > pruned - they could turn out later to be the children into which > tuples need to be routed. But the patch makes no distinction > between possibly-pruned children and any others. Yes, only a subset of the partitions appear in the UPDATE subplans. I think typically for updates, a very small subset of the total leaf partitions will be there in the plans, others would get pruned. IMHO, it would not be worth having an optimization where it opens only those leaf partitions which are not already there in the subplans. Without the optimization, we are able to re-use the INSERT infrastructure without additional changes. > >> There is another issue I discovered. The row-movement works fine if >> the destination leaf partition has different attribute ordering than >> the root : the existing insert-tuple-routing mapping handles that. But >> if the source partition has different ordering w.r.t. the root, it has >> a problem : there is no mapping in the opposite direction, i.e. from >> the leaf to root. And we require that because the tuple of source leaf >> partition needs to be converted to root partition tuple descriptor, >> since ExecFindPartition() starts with root. 
> > Seems reasonable, but... > >> To fix this, I have introduced another mapping array >> mtstate->mt_resultrel_maps[]. This corresponds to the >> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, >> because the update result relations are a pruned subset of the total >> leaf partitions. > > ... I don't understand how you can *not* need a per-leaf-partition > mapping. I mean, maybe you only need the mapping for the *unpruned* > leaf partitions Yes, we need the mapping only for the unpruned leaf partitions, and those partitions are available in the per-subplan resultRelInfos. > but you certainly need a separate mapping for each one of those. You mean *each* of the leaf partitions? I didn't get why we would need it for each one. The tuple targeted for update belongs to one of the per-subplan resultRelInfos. And this tuple is to be routed to another leaf partition. So the reverse mapping is for conversion from the source resultRelInfo to the root partition. I am unable to figure out a scenario where we would require this reverse mapping for partitions on which UPDATE is *not* going to be executed. > > It's possible to imagine driving the tuple routing off of just the > partition key attributes, extracted from wherever they are inside the > tuple at the current level, rather than converting to the root's tuple > format. However, that's not totally straightforward because there > could be multiple levels of partitioning throughout the tree and > different attributes might be needed at different levels. Yes, the conversion anyway occurs at each of these levels even for insert, specifically because there can be different partition attributes each time. For update, it's only one additional conversion. But yes, this new mapping would be required for this one single conversion. 
> Moreover, > in most cases, the mappings are going to end up being no-ops because > the column order will be the same, so it's probably not worth > complicating the code to try to avoid a double conversion that usually > won't happen. I agree. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
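To illustrate why the leaf-to-root conversion is needed before rerouting, here is a toy model (hypothetical names, not the executor's real data structures): the source leaf stores columns in a different order than the root, and partition routing inspects the tuple in root order, as ExecFindPartition() starts from the root:

```python
# Toy model of update row movement. A "conversion map" is just the attno
# permutation between two column orders; routing sees the root-order tuple.
# Illustrative sketch only, not PostgreSQL's actual data structures.

def build_map(src_cols, dst_cols):
    """dst position -> src position (0-based), like a TupleConversionMap."""
    src_pos = {c: i for i, c in enumerate(src_cols)}
    return [src_pos[c] for c in dst_cols]

def convert(tup, conv_map):
    """Reorder a tuple's attributes through the map."""
    return tuple(tup[i] for i in conv_map)

def find_partition(root_tuple, key_index, partitions):
    """Route by the partition-key value, as ExecFindPartition() would."""
    for name, accepts in partitions:
        if accepts(root_tuple[key_index]):
            return name
    raise ValueError("no partition found")

root_cols = ["id", "category"]   # table partitioned on "category"
leaf_cols = ["category", "id"]   # source leaf stores columns swapped

partitions = [
    ("p_low", lambda v: v < 100),
    ("p_high", lambda v: v >= 100),
]

leaf_tuple = (150, 7)                             # leaf order: (category, id)
leaf_to_root = build_map(leaf_cols, root_cols)    # the "reverse" mapping
root_tuple = convert(leaf_tuple, leaf_to_root)    # (7, 150) in root order
print(find_partition(root_tuple, 1, partitions))  # p_high
```

Without the leaf-to-root step, routing would read attribute 1 of the leaf-order tuple and pick the wrong partition.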
Re: [HACKERS] UPDATE of partition key
On 20 June 2017 at 03:42, Thomas Munro wrote: > Just a thought: If I understand correctly this new array of tuple > conversion maps is the same as mtstate->mt_transition_tupconv_maps in > my patch transition-tuples-from-child-tables-v11.patch (hopefully soon > to be committed to close a PG10 open item). In my patch I bounce > transition tuples from child relations up to the named relation's > triggers, and in this patch you bounce child tuples up to the named > relation for rerouting, so the conversion requirement is the same. > Perhaps we could consider refactoring to build a common struct member > on demand for the row movement patch at some point in the future if it > makes the code cleaner. I agree; thanks for bringing this to my attention. The conversion maps in my patch and yours do sound like they are exactly the same. And even in the case where both update-row-movement and transition tables are playing together, the same map should serve the purpose of both. I will keep a watch on your patch, and check how I can adjust my patch so that I don't have to refactor the mapping. One difference I see is: in your patch, in ExecModifyTable() we advance the current map position for each successive subplan, whereas in my patch, in ExecInsert() we deduce the position of the right map to be fetched using the position of the current resultRelInfo in the mtstate->resultRelInfo[] array. I think your way is more consistent with the existing code. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
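The indexing difference mentioned above can be sketched as follows. Both styles end up selecting the same conversion map for each subplan; the names here are purely illustrative, not the actual executor fields:

```python
# Two equivalent ways to pick the tuple-conversion map for the current
# subplan (hypothetical sketch; not the real executor code).

tupconv_maps = ["map_for_rel0", "map_for_rel1", "map_for_rel2"]
result_rel_infos = ["rel0", "rel1", "rel2"]   # one per unpruned subplan

# Style 1 (transition-tables patch): keep a cursor and advance it each
# time ExecModifyTable() moves on to the next subplan.
cursor = 0
maps_in_plan_order = []
for _ in result_rel_infos:
    maps_in_plan_order.append(tupconv_maps[cursor])
    cursor += 1

# Style 2 (row-movement patch): deduce the index from the position of
# the current resultRelInfo within the resultRelInfo array.
def map_for(rel):
    return tupconv_maps[result_rel_infos.index(rel)]

# Both strategies select identical maps for every subplan.
print(maps_in_plan_order == [map_for(r) for r in result_rel_infos])  # True
```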
Re: [HACKERS] UPDATE of partition key
When I tested partition-key-update on a partitioned table having no child partitions, it crashed. This is because there is an Assert(mtstate->mt_num_partitions > 0) for creating the partition-to-root map, which fails if there are no partitions under the partitioned table. Actually we should skip creating this map if there are no partitions under the partitioned table on which UPDATE is run. So the attached patch has this new change to fix it (and an appropriate additional test case added):

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
  * descriptor of a source partition does not match the root partition
  * descriptor. In such case we need to convert tuples to the root partition
  * tuple descriptor, because the search for destination partition starts
- * from the root.
+ * from the root. Skip this setup if it's not a partition key update or if
+ * there are no partitions below this partitioned table.
  */
-	if (is_partitionkey_update)
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
 	{
 		TupleConversionMap **tup_conv_maps;
 		TupleDesc	outdesc;

-		Assert(mtstate->mt_num_partitions > 0);
-
 		mtstate->mt_resultrel_maps = (TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap*) * nplans);

On 15 June 2017 at 23:06, Amit Khandekar wrote: > On 13 June 2017 at 15:40, Amit Khandekar wrote: >> While rebasing my patch for the below recent commit, I realized that a >> similar issue exists for the update-tuple-routing patch as well : >> >> commit 78a030a441966d91bc7e932ef84da39c3ea7d970 >> Author: Tom Lane >> Date: Mon Jun 12 23:29:44 2017 -0400 >> >> Fix confusion about number of subplans in partitioned INSERT setup. >> >> The above issue was about incorrectly using 'i' in >> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in >> ExecInitModifyTable(), where 'i' was actually meant to refer to the >> positions in mtstate->mt_num_partitions. 
Actually for INSERT, there is >> only a single plan element in mtstate->mt_plans[] array. >> >> Similarly, for update-tuple routing, we cannot use >> mtstate->mt_plans[i], because 'i' refers to position in >> mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in >> order of mtstate->mt_partitions; in fact mt_plans has only the plans >> that are to be scanned on the unpruned partitions; so it can well be a small >> subset of total partitions. >> >> I am working on an updated patch to fix the above. > > Attached patch v10 fixes the above. In the existing code, where it > builds WCO constraints for each leaf partition; with the patch, that > code is now applicable to row-movement-updates as well. So the > assertions in the code are now updated to allow the same. Secondly, > the mapping for each of the leaf partitions was constructed using the > root partition attributes. Now in the patch, the > mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as > reference. So effectively, map_partition_varattnos() now represents > not just parent-to-partition mapping, but rather, mapping between any > two partitions/partitioned_tables. It's done this way, so that we can > have a common WCO building code for inserts as well as updates. For > e.g. for inserts, the first (and only) WCO belongs to > node->nominalRelation so nominalRelation is used for > map_partition_varattnos(), whereas for updates, first WCO belongs to > the first resultRelInfo which is not the same as nominalRelation. So in > the patch, in both cases, we use the first resultRelInfo and the WCO > of the first resultRelInfo for map_partition_varattnos(). > > Similar thing is done for Returning expressions. > > - > > Another change in the patch is : for ExecInitQual() for WCO quals, > mtstate->ps is used as parent, rather than first plan. For updates, > first plan does not belong to the parent partition. In fact, I think > in all cases, we should use mtstate->ps as the parent. 
> mtstate->mt_plans[0] doesn't look like it should be considered the parent > of these expressions. Maybe it does not matter to which parent we > link these quals, because there is no ReScan for ExecModifyTable(). > > Note that for RETURNING projection expressions, we do use mtstate->ps. > > > > There is another issue I discovered. The row-movement works fine if > the destination leaf partition has different attribute ordering than > the root : the
Re: [HACKERS] UPDATE of partition key
On 13 June 2017 at 15:40, Amit Khandekar wrote: > While rebasing my patch for the below recent commit, I realized that a > similar issue exists for the update-tuple-routing patch as well : > > commit 78a030a441966d91bc7e932ef84da39c3ea7d970 > Author: Tom Lane > Date: Mon Jun 12 23:29:44 2017 -0400 > > Fix confusion about number of subplans in partitioned INSERT setup. > > The above issue was about incorrectly using 'i' in > mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in > ExecInitModifyTable(), where 'i' was actually meant to refer to the > positions in mtstate->mt_num_partitions. Actually for INSERT, there is > only a single plan element in mtstate->mt_plans[] array. > > Similarly, for update-tuple routing, we cannot use > mtstate->mt_plans[i], because 'i' refers to position in > mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in > order of mtstate->mt_partitions; in fact mt_plans has only the plans > that are to be scanned on the unpruned partitions; so it can well be a small > subset of total partitions. > > I am working on an updated patch to fix the above. Attached patch v10 fixes the above. The existing code builds WCO constraints for each leaf partition; with the patch, that code is now applicable to row-movement-updates as well. So the assertions in the code are now updated to allow the same. Secondly, the mapping for each of the leaf partitions was constructed using the root partition attributes. Now in the patch, the mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as the reference. So effectively, map_partition_varattnos() now represents not just parent-to-partition mapping, but rather, mapping between any two partitions/partitioned_tables. It's done this way, so that we can have a common WCO building code for inserts as well as updates. For e.g. 
for inserts, the first (and only) WCO belongs to node->nominalRelation so nominalRelation is used for map_partition_varattnos(), whereas for updates, the first WCO belongs to the first resultRelInfo, which is not the same as nominalRelation. So in the patch, in both cases, we use the first resultRelInfo and the WCO of the first resultRelInfo for map_partition_varattnos(). A similar thing is done for Returning expressions. - Another change in the patch is: for ExecInitQual() for WCO quals, mtstate->ps is used as the parent, rather than the first plan. For updates, the first plan does not belong to the parent partition. In fact, I think in all cases, we should use mtstate->ps as the parent. mtstate->mt_plans[0] doesn't look like it should be considered the parent of these expressions. Maybe it does not matter to which parent we link these quals, because there is no ReScan for ExecModifyTable(). Note that for RETURNING projection expressions, we do use mtstate->ps. There is another issue I discovered. The row-movement works fine if the destination leaf partition has different attribute ordering than the root : the existing insert-tuple-routing mapping handles that. But if the source partition has different ordering w.r.t. the root, it has a problem : there is no mapping in the opposite direction, i.e. from the leaf to root. And we require that because the tuple of the source leaf partition needs to be converted to the root partition tuple descriptor, since ExecFindPartition() starts with root. To fix this, I have introduced another mapping array mtstate->mt_resultrel_maps[]. This corresponds to the mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, because the update result relations are a pruned subset of the total leaf partitions. So in ExecInsert, before calling ExecFindPartition(), we need to convert the leaf partition tuple to the root using this reverse mapping. 
Since we need to convert the tuple here, and again after ExecFindPartition() for the found leaf partition, I have replaced the common code with a new function ConvertPartitionTupleSlot(). --- Used a new flag is_partitionkey_update in ExecInitModifyTable(), which can be re-used in subsequent sections, rather than calling the IsPartitionKeyUpdate() function again. --- Some more test scenarios added that cover the above changes. Basically partitions that have different tuple descriptors than their parents. update-partition-key_v10.patch Description: Binary data
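The ConvertPartitionTupleSlot() refactoring described above boils down to one helper that applies a conversion map when one exists. A minimal sketch of the idea (illustrative only; the real function works on TupleTableSlots and TupleConversionMaps):

```python
# Sketch of the ConvertPartitionTupleSlot() idea: a single helper that
# reorders a tuple through a conversion map, and is a no-op when the
# descriptors already match (no map was built). Names are illustrative.

def convert_partition_tuple(tup, conv_map):
    """Return tup reordered through conv_map, or unchanged if no map."""
    if conv_map is None:
        return tup
    return tuple(tup[i] for i in conv_map)

# Same column order on both sides: no map is built, tuple passes through.
print(convert_partition_tuple((1, "x"), None))    # (1, 'x')

# Different order: the map [1, 0] swaps the two attributes.
print(convert_partition_tuple((1, "x"), [1, 0]))  # ('x', 1)
```

Centralizing this check is what lets both call sites (before and after ExecFindPartition()) share one code path.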
Re: [HACKERS] UPDATE of partition key
While rebasing my patch for the below recent commit, I realized that a similar issue exists for the update-tuple-routing patch as well : commit 78a030a441966d91bc7e932ef84da39c3ea7d970 Author: Tom Lane Date: Mon Jun 12 23:29:44 2017 -0400 Fix confusion about number of subplans in partitioned INSERT setup. The above issue was about incorrectly using 'i' in mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in ExecInitModifyTable(), where 'i' was actually meant to refer to the positions in mtstate->mt_num_partitions. Actually for INSERT, there is only a single plan element in mtstate->mt_plans[] array. Similarly, for update-tuple routing, we cannot use mtstate->mt_plans[i], because 'i' refers to the position in mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in order of mtstate->mt_partitions; in fact mt_plans has only the plans that are to be scanned on the unpruned partitions; so it can well be a small subset of the total partitions. I am working on an updated patch to fix the above.
Re: [HACKERS] UPDATE of partition key
On 9 June 2017 at 19:10, Amit Kapila wrote: > On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas wrote: >> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila wrote: >>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas wrote: >>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila >>>> wrote: >>>>> As far as I understand, it is to ensure that for deleted rows, nothing >>>>> more needs to be done. For example, see the below check in >>>>> ExecUpdate/ExecDelete. >>>>> if (!ItemPointerEquals(tupleid, &hufd.ctid)) >>>>> { >>>>> .. >>>>> } >>>>> .. >>>>> >>>>> Also a similar check in ExecLockRows. Now for deleted rows, if the >>>>> t_ctid wouldn't point to itself, then in the mentioned functions, we >>>>> were not in a position to conclude that the row is deleted. >>>> >>>> Right, so we would have to find all such checks and change them to use >>>> some other method to conclude that the row is deleted. What method >>>> would we use? >>> >>> I think before doing above check we can simply check if ctid.ip_blkid >>> contains InvalidBlockNumber, then return an error. >> >> Hmm, OK. That case never happens today? >> > > As per my understanding that case doesn't exist. I will verify again > once the patch is available. I can take a crack at it if Amit > Khandekar is busy with something else or is not comfortable in this > area. Amit, I was going to have a look at this, once I finish with the other part. I was busy getting that done first. But your comments/help are always welcome. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 2 June 2017 at 23:52, Peter Geoghegan wrote: > On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar > wrote: >> Ok. I was thinking we are doing the tie-breaker because specifically >> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it, >> that we do that to be compatible with texteq(). > > Both of these explanations are correct, in a way. See commit 656beff. > >> Secondly, I was also considering if ICU especially has a way to >> customize an ICU locale by setting some attributes which dictate >> comparison or sorting rules for a set of characters. I mean, if there >> is such customized ICU locale defined in the system, and we use that >> to create PG collation, I thought we might have to strictly follow >> those rules without a tie-breaker, so as to be 100% conformant to ICU. >> I can't come up with an example, or may there isn't one, but , say , >> there is a locale which is supposed to sort only by lowest comparison >> strength (de@strength=1 ?? ). In that case, there might be many >> characters considered equal, but PG < operator or > operator would >> still return true for those chars. > > In the terminology of the Unicode collation algorithm, PostgreSQL > "forces deterministic comparisons" [1]. There is a lot of information > on the details of that within the UCA spec. > > If we ever wanted to offer a case insensitive collation feature, then > we wouldn't necessarily have to do the equivalent of a full strxfrm() > when hashing, at least with collations controlled by ICU. Perhaps we > could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY > to build binary sort keys, and leave the rest to a ucol_equal() call > (within texteq()) that has the usual UCOL_STRENGTH for the underlying > PostgreSQL collation. > > I don't think it would be possible to implement case insensitive > collations by using some pre-existing ICU collation that is case > insensitive. 
Instead, an implementation might directly vary collation > strength of any given collation to achieve case insensitivity. > PostgreSQL would know that this collation was case insensitive, so > regular collations wouldn't need to change their > behavior/implementation (to use ucol_equal() within texteq(), and so > on). Ah ok. Understood, thanks. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
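The "forced deterministic comparison" behaviour discussed in this subthread can be modeled with a toy case-insensitive collation: the collation supplies the primary ordering, and a binary comparison breaks ties so that distinct strings never compare equal, keeping the sort order consistent with a bytewise equality operator. This is only a conceptual sketch, not ICU's API:

```python
from functools import cmp_to_key

# Toy model of PostgreSQL-style "deterministic" comparison: a
# case-insensitive "collation" gives the primary ordering, and a binary
# (bytewise) comparison breaks ties. Conceptual sketch only; the real
# code uses strcoll_l()/ICU, not lower().

def collate(a, b):
    """Stand-in for a collation-aware comparison (case-insensitive)."""
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)

def deterministic_cmp(a, b):
    c = collate(a, b)
    if c != 0:
        return c
    # Tie-breaker: binary comparison, so the ordering stays consistent
    # with a bytewise equality operator like texteq().
    ea, eb = a.encode(), b.encode()
    return (ea > eb) - (ea < eb)

sorted_words = sorted(["b", "A", "a", "B"], key=cmp_to_key(deterministic_cmp))
print(sorted_words)   # ['A', 'a', 'B', 'b']
```

Without the tie-breaker, "A" and "a" would compare equal under this collation while comparing unequal bytewise, which is exactly the inconsistency the thread describes.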
Re: [HACKERS] UPDATE of partition key
On 7 June 2017 at 20:19, Amit Khandekar wrote: > On 7 June 2017 at 16:42, Amit Khandekar wrote: >> The column bitmap set returned by GetUpdatedColumns() refers to >> attribute numbers w.r.t. the root partition. And the >> mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So >> we need to do something similar to map_partition_varattnos() to change >> the updated columns attnos to the leaf partitions > > I was wrong about this. Each of the mtstate->resultRelInfo[] has its > own corresponding RangeTblEntry with its own updatedCols having attnos > accordingly adjusted to refer to its own table attributes. So we don't > have to do the mapping; we need to get modifiedCols separately for each > of the ResultRelInfo, rather than the root relinfo. > >> and walk down the >> partition constraint expressions to find if the attnos are present >> there. > > But this we will need to do. Attached is the v9 patch. This covers the two parts discussed upthread: 1. Prevent triggers from causing the row movement. 2. Setup the tuple routing in ExecInitModifyTable(), but only if a partition key is modified. Check the new function IsPartitionKeyUpdate(). I have rebased the patch to consider changes done in commit 15ce775faa428dc9 to prevent triggers from violating partition constraints. There, for the call to ExecFindPartition() in ExecInsert, we need to fetch the mtstate->rootResultRelInfo in case the operation is part of update row movement. This is because the root partition is not available in the resultRelInfo for UPDATE. Added many more test scenarios in update.sql that cover the above. I am yet to test the concurrency part using the isolation tester. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company update-partition-key_v9.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 7 June 2017 at 16:42, Amit Khandekar wrote: > The column bitmap set returned by GetUpdatedColumns() refers to > attribute numbers w.r.t. the root partition. And the > mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So > we need to do something similar to map_partition_varattnos() to change > the updated columns attnos to the leaf partitions I was wrong about this. Each of the mtstate->resultRelInfo[] has its own corresponding RangeTblEntry with its own updatedCols having attnos accordingly adjusted to refer to its own table attributes. So we don't have to do the mapping; we need to get modifiedCols separately for each of the ResultRelInfo, rather than the root relinfo. > and walk down the > partition constraint expressions to find if the attnos are present > there. But this we will need to do. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 6 June 2017 at 23:52, Robert Haas wrote: > On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar wrote: >> So, according to that, below would be the logic :
>>
>> Run partition constraint check on the original NEW row.
>> If it succeeds :
>> {
>>     Fire BR UPDATE trigger on the original partition.
>>     Run partition constraint check again with the modified NEW row
>>     (maybe do this only if the trigger modified the partition key)
>>     If it fails,
>>         abort.
>>     Else
>>         proceed with the usual local update.
>> }
>> else
>> {
>>     Fire BR UPDATE trigger on original partition.
>>     Find the right partition for the modified NEW row.
>>     If it is the same partition,
>>         proceed with the usual local update.
>>     else
>>         do the row movement.
>> }
>
> Sure, that sounds about right, although the "Fire BR UPDATE trigger on > the original partition." is the same in both branches, so I'm not > quite sure why you have that in the "if" block.

Actually after coding this logic, it looks a bit different. See ExecUpdate() in the attached file trigger_related_changes.patch. Now that we are making sure the trigger won't change the partition of the tuple, the next thing we need to do is make sure the tuple routing setup is done *only* if the UPDATE is modifying partition keys. Otherwise, this will degrade normal update performance. Below is the logic I am implementing for determining whether the UPDATE is modifying partition keys.

In ExecInitModifyTable() :

    Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get
    updated_columns.

    For each of the updated_columns :
    {
        Check if the column is part of partition key quals of any of the
        relations in mtstate->resultRelInfo[] array.
        /*
         * mtstate->resultRelInfo[] contains exactly those leaf partitions
         * which qualify the update quals.
         */
        If (it is part of partition key quals of at least one of the relations)
        {
            Do ExecSetupPartitionTupleRouting() for the root partition.
            break;
        }
    }

A few things need to be considered:

Use Relation->rd_partcheck to get the partition check quals of each of the relations in mtstate->resultRelInfo[]. The Relation->rd_partcheck of the leaf partitions would include the ancestors' partition quals as well. So we are good: we don't have to explicitly get the upper partition constraints. Note that an UPDATE can modify a column which is not used in the partition constraint expressions of any of the partitions or partitioned tables in the subtree, but that column may have been used in the partition constraint of a partitioned table belonging to an upper subtree.

All of the relations in mtstate->resultRelInfo are already open, so we don't need to re-open any more relations to get the partition quals.

The column bitmap set returned by GetUpdatedColumns() refers to attribute numbers w.r.t. the root partition, and the mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So we need to do something similar to map_partition_varattnos() to change the updated columns' attnos to the leaf partitions and walk down the partition constraint expressions to find if the attnos are present there.

Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company trigger_related_changes.patch Description: Binary data
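The check outlined above (set up tuple routing only if some updated column participates in a partition constraint of at least one result relation) can be sketched with plain sets standing in for attno bitmapsets. This is a hypothetical model, not the executor code:

```python
# Hypothetical sketch of the partition-key-update check: set up tuple
# routing only if an updated column appears in the partition constraint
# (rd_partcheck) columns of at least one result relation. Python sets
# stand in for bitmapsets of attnos; not PostgreSQL's actual code.

def needs_tuple_routing(updated_cols, result_rel_partcheck_cols):
    """result_rel_partcheck_cols: one constraint-column set per relation."""
    for partcheck_cols in result_rel_partcheck_cols:
        if updated_cols & partcheck_cols:
            return True   # an updated column is part of a partition qual
    return False

# Table partitioned on "category"; each leaf's constraint-column set
# includes the ancestors' quals, per the rd_partcheck note above.
result_rels = [{"category"}, {"category", "subkey"}]

print(needs_tuple_routing({"amount"}, result_rels))    # False: skip setup
print(needs_tuple_routing({"category"}, result_rels))  # True: set up routing
```

This also shows why a plain UPDATE of a non-key column pays no tuple-routing setup cost, which is the performance concern raised in the mail.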
Re: [HACKERS] UPDATE of partition key
On 5 June 2017 at 11:27, Amit Kapila wrote: > On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar wrote: >> On 2 June 2017 at 01:17, Robert Haas wrote: >>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar >>> wrote: >>>>> Regarding the trigger issue, I can't claim to have a terribly strong >>>>> opinion on this. I think that practically anything we do here might >>>>> upset somebody, but probably any halfway-reasonable thing we choose to >>>>> do will be OK for most people. However, there seems to be a >>>>> discrepancy between the approach that got the most votes and the one >>>>> that is implemented by the v8 patch, so that seems like something to >>>>> fix. >>>> >>>> Yes, I have started working on updating the patch to use that approach >>>> (BR and AR update triggers on source and destination partition >>>> respectively, instead of delete+insert) The approach taken by the >>>> patch (BR update + delete+insert triggers) didn't require any changes >>>> in the way ExecDelete() and ExecInsert() were called. Now we would >>>> require to skip the delete/insert triggers, so some flags need to be >>>> passed to these functions, >>>> > > I thought you already need to pass an additional flag for special > handling of ctid in Delete case. Yeah, that was unavoidable. > For Insert, a new flag needs to be > passed and need to have a check for that in few places. For skipping the delete and insert triggers, we need to include yet another flag, and checks in both ExecDelete() and ExecInsert() for skipping both the BR and AR triggers, and then in ExecUpdate(), a call to ExecARUpdateTriggers() before quitting. > >> or else have stripped down versions of >>>> ExecDelete() and ExecInsert() which don't do other things like >>>> RETURNING handling and firing triggers. >>> >>> See, that strikes me as a pretty good argument for firing the >>> DELETE+INSERT triggers... >>> >>> I'm not wedded to that approach, but "what makes the code simplest?" 
>>> is not a bad tiebreak, other things being equal. >> >> Yes, that sounds good to me. >> > > I am okay if we want to go ahead with firing BR UPDATE + DELETE + > INSERT triggers for an Update statement (when row movement happens) on > the argument of code simplicity, but it sounds slightly odd behavior. Ok. Will keep this behaviour, which is already present in the patch. I also feel that code simplicity can be used as a tie-breaker when no single behaviour can be agreed upon that satisfies all aspects. > >> But I think we want to wait for other's >> opinion because it is quite understandable that two triggers firing on >> the same partition sounds odd. >> > > Yeah, but I think we have to rely on docs in this case as behavior is > not intuitive. Agreed. The doc changes in the patch have already explained this behaviour in detail. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 2 June 2017 at 03:18, Thomas Munro wrote: > On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan wrote: >> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro >> wrote: >>> Why should ICU be any different than the system provider in this >>> respect? In both cases, we have a two-level comparison: first we use >>> the collation-aware comparison, and then as a tie breaker, we use a >>> binary comparison. If we didn't do a binary comparison as a >>> tie-breaker, wouldn't the result be logically incompatible with the = >>> operator, which does a binary comparison? Ok. I was thinking we do the tie-breaker specifically because strcoll_l() was unexpectedly returning 0 for some cases. Now I get it: we do that to be compatible with texteq(). Secondly, I was also considering whether ICU in particular has a way to customize a locale by setting attributes which dictate comparison or sorting rules for a set of characters. I mean, if such a customized ICU locale is defined in the system, and we use it to create a PG collation, I thought we might have to strictly follow those rules without a tie-breaker, so as to be 100% conformant to ICU. I can't come up with an example, and maybe there isn't one, but say there is a locale which is supposed to sort only by the lowest comparison strength (de@strength=1 ??). In that case, many characters might be considered equal, but the PG < or > operator would still return true for those characters. -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 2 June 2017 at 01:17, Robert Haas wrote: > On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar wrote: >>> Regarding the trigger issue, I can't claim to have a terribly strong >>> opinion on this. I think that practically anything we do here might >>> upset somebody, but probably any halfway-reasonable thing we choose to >>> do will be OK for most people. However, there seems to be a >>> discrepancy between the approach that got the most votes and the one >>> that is implemented by the v8 patch, so that seems like something to >>> fix. >> >> Yes, I have started working on updating the patch to use that approach >> (BR and AR update triggers on source and destination partition >> respectively, instead of delete+insert) The approach taken by the >> patch (BR update + delete+insert triggers) didn't require any changes >> in the way ExecDelete() and ExecInsert() were called. Now we would >> require to skip the delete/insert triggers, so some flags need to be >> passed to these functions, or else have stripped down versions of >> ExecDelete() and ExecInsert() which don't do other things like >> RETURNING handling and firing triggers. > > See, that strikes me as a pretty good argument for firing the > DELETE+INSERT triggers... > > I'm not wedded to that approach, but "what makes the code simplest?" > is not a bad tiebreak, other things being equal. Yes, that sounds good to me. But I think we want to wait for other's opinion because it is quite understandable that two triggers firing on the same partition sounds odd. > >>> In terms of the approach taken by the patch itself, it seems >>> surprising to me that the patch only calls >>> ExecSetupPartitionTupleRouting when an update fails the partition >>> constraint. Note that in the insert case, we call that function at >>> the start of execution; >> >>> calling it in the middle seems to involve additional hazards; >>> for example, is it really safe to add additional >>> ResultRelInfos midway through the operation? 
>> >> I thought since the additional ResultRelInfos go into >> mtstate->mt_partitions which is independent of >> estate->es_result_relations, that should be safe. > > I don't know. That sounds scary to me, but it might be OK. Probably > needs more study. > >>> Is it safe to take more locks midway through the operation? >> >> I can imagine some rows already updated, when other tasks like ALTER >> TABLE or CREATE INDEX happen on other partitions which are still >> unlocked, and then for row movement we try to lock these other >> partitions and wait for the DDL tasks to complete. But I didn't see >> any particular issues with that. But correct me if you suspect a >> possible issue. One issue can be if we were able to modify the table >> attributes, but I believe we cannot do that for inherited columns. > > It's just that it's very unlike what we do anywhere else. I don't > have a real specific idea in mind about what might totally break, but > at a minimum it could certainly cause behavior that can't happen > today. Today, if you run a query on some tables, it will block > waiting for any locks at the beginning of the query, and the query > won't begin executing until it has all of the locks. With this > approach, you might block midway through; you might even deadlock > midway through. Maybe that's not overtly broken, but it's at least > got the possibility of being surprising. > > Now, I'd actually kind of like to have behavior like this for other > cases, too. If we're inserting one row, can't we just lock the one > partition into which it needs to get inserted, rather than all of > them? But I'm wary of introducing such behavior incidentally in a > patch whose main goal is to allow UPDATE row movement. Figuring out > what could go wrong and fixing it seems like a substantial project all > of its own. Yes, I agree it makes sense trying to avoid introducing something we haven't tried before, in this patch, as far as possible. 
> >> The reason I thought it cannot be done at the start of the execution, >> is because even if we know that update is not modifying the partition >> key column, we are not certain that the final NEW row has its >> partition key column unchanged, because of triggers. I understand it >> might be weird for a user requiring to modify a partition key value, >> but if a user does that, it will result in crash because we won't have >> the partition routing setup, thinking that there is no partition
[HACKERS] strcmp() tie-breaker for identical ICU-collated strings
While comparing two text strings using varstr_cmp(), if the strcoll*() call returns 0, we do a strcmp() tie-breaker to get a binary comparison, because strcoll() can return 0 for non-identical strings:

varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal. Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}

But is this supposed to apply to ICU collations as well? If the collation provider is icu, the comparison is done using ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally reports some characters as identical, so doing strcmp() may not make sense. For example, if the two characters below are compared using ucol_strcollUTF8(), it returns 0, meaning the strings are identical:

Greek Oxia : UTF-16 encoding : 0x1FFD (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384 (http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this:

postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
----------+----------
 ´        | ΄

(Although this example has similar-looking characters, that might not be a factor behind treating them as equal.)

Now, since ucol_strcoll*() returns 0, these strings are always compared using strcmp(), so 1FFD > 0384 returns true:

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 t

Whereas, if strcmp() is skipped for ICU collations:

if (result == 0 &&
    !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
    result = strcmp(a1p, a2p);

then the comparison using the ICU collation says they are identical strings:

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
----------
 t

Now, I have verified that strcoll() does report 1FFD > 0384. So it looks like the ICU API function ucol_strcoll() reports them as equal by intention. That's the reason I feel the strcmp-if-strcoll-returns-0 tie-breaker might not be applicable for ICU. But I may be wrong; please correct me if I am missing something. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
t function at > the start of execution; > calling it in the middle seems to involve additional hazards; > for example, is it really safe to add additional > ResultRelInfos midway through the operation? I thought since the additional ResultRelInfos go into mtstate->mt_partitions which is independent of estate->es_result_relations, that should be safe. > Is it safe to take more locks midway through the operation? I can imagine some rows already updated, when other tasks like ALTER TABLE or CREATE INDEX happen on other partitions which are still unlocked, and then for row movement we try to lock these other partitions and wait for the DDL tasks to complete. But I didn't see any particular issues with that. But correct me if you suspect a possible issue. One issue can be if we were able to modify the table attributes, but I believe we cannot do that for inherited columns. > It seems like it might be a lot > safer to decide at the beginning of the operation whether this is > needed -- we can skip it if none of the columns involved in the > partition key (or partition key expressions) are mentioned in the > update. > (There's also the issue of triggers, The reason I thought it cannot be done at the start of the execution, is because even if we know that update is not modifying the partition key column, we are not certain that the final NEW row has its partition key column unchanged, because of triggers. I understand it might be weird for a user requiring to modify a partition key value, but if a user does that, it will result in crash because we won't have the partition routing setup, thinking that there is no partition key column in the UPDATE. And we also cannot unconditionally setup the partition routing on all updates, for performance reasons. > I'm not sure that it's sensible to allow a trigger on an > individual partition to reroute an update to another partition > what if we get an infinite loop?) 
You mean, if the other table has another trigger that will again route to the original partition? But this infinite-loop problem could occur even for two normal tables. > > +if (concurrently_deleted) > +return NULL; > > I don't understand the motivation for this change, and there are no > comments explaining it that I can see. Yeah, comments are missing, I think. I thought they were there in ExecDelete(), but they are not. If a concurrent delete has already deleted the row, we should not bother moving the row; hence the above code. > Perhaps the concurrency-related (i.e. EPQ) behavior here could be > tested via the isolation tester. Will check. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 24 May 2017 at 20:16, Amit Kapila wrote: > On Wed, May 24, 2017 at 8:14 PM, Amit Kapila wrote: >> Apart from above, there is one open issue [1] >> > > Forget to mention the link, doing it now. > > [1] - > https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com I am not sure right now whether setting the t_ctid of such tuples to invalid would be the right option, especially because an invalid t_ctid may already carry some other meaning. But maybe we can look into this more. If we decide to error out in some way, I would be inclined towards re-using some combination of infomask bits (like HEAP_MOVED_OFF, as suggested upthread) rather than an invalid t_ctid value. But I think we can also take a step-by-step approach, even for v11. If we agree that it is OK to silently do the updates as long as we document the behaviour, we can go ahead and do this, and then as a second step implement error handling as a separate patch. If that patch does not materialize, we at least have the current behaviour documented. Ideally, I think we would have liked to somehow make the row-movement UPDATE itself abort if it finds any normal updates waiting for it to finish, rather than making the normal updates fail because a row movement occurred. But I think we will have to live with it.
Re: [HACKERS] UPDATE of partition key
On 12 May 2017 at 09:27, Amit Kapila wrote: > > + is_partitioned_table = > + root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE; > + > + if (is_partitioned_table) > + ExecSetupPartitionTupleRouting( > + root_rel, > + /* Build WITH CHECK OPTION constraints for leaf partitions */ > + ExecInitPartitionWithCheckOptions(mtstate, root_rel); > + /* Build a projection for each leaf partition rel. */ > + ExecInitPartitionReturningProjection(mtstate, root_rel); > .. > + /* It's not a partitioned table after all; error out. */ > + ExecPartitionCheckEmitError(resultRelInfo, slot, estate); > > When we are anyway going to give error if table is not a partitioned > table, then isn't it better to give it early when we first identify > that. Yeah, that's right; fixed. Moved the partitioned-table check earlier. This also showed that there is no need for the is_partitioned_table variable, so the code is adjusted accordingly. > - > +static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, > + Relation root_rel); > Spurious line delete. Done. Also rebased the patch over the latest code. Attached is the v8 patch. update-partition-key_v8.patch Description: Binary data
Re: [HACKERS] UPDATE of partition key
On 18 May 2017 at 16:52, Amit Kapila wrote: > On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar wrote: >> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila wrote: >>>> Earlier I thought that option1 is better but later I think that this >>>> can complicate the situation as we are firing first BR update then BR >>>> delete and can change the row multiple time and defining such >>>> behaviour can be complicated. >>>> >>> >>> If we have to go by this theory, then the option you have preferred >>> will still execute BR triggers for both delete and insert, so input >>> row can still be changed twice. >> >> Yeah, right as per my theory above Option3 have the same problem. >> >> But after putting some more thought I realised that only for "Before >> Update" or the "Before Insert" trigger row can be changed. >> > > Before Row Delete triggers can suppress the delete operation itself > which is kind of unintended in this case. I think without the user > being aware it doesn't seem advisable to execute multiple BR triggers. By now, the majority of opinions are against two triggers getting fired for a single update. Amit, do you consider option 2 a valid option? That is, fire only UPDATE triggers: BR on the source partition, and AR on the destination partition. Do you agree that firing the BR update trigger is essential, since it can modify the row and even prevent the update from happening? Also, since a user does not have a provision to install a common UPDATE row trigger, (s)he installs it on each of the leaf partitions. And then, when an update causes row movement, option 3 would end up not firing update triggers on any of the partitions. So I prefer option 2 over option 3, i.e. make sure to fire the BR and AR update triggers. Actually, option 2 is what Robert had proposed in the beginning.
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 17 May 2017 at 17:29, Rushabh Lathia wrote: > > > On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar wrote: >> >> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar >> wrote: >> > Option 3 >> > >> > >> > BR, AR delete triggers on source partition >> > BR, AR insert triggers on destination partition. >> > >> > Rationale : >> > Since the update is converted to delete+insert, just skip the update >> > triggers completely. >> >> +1 to option3 >> Generally, BR triggers are used for updating the ROW value and AR >> triggers to VALIDATE the row or to modify some other tables. So it >> seems that we can fire the triggers what is actual operation is >> happening at the partition level. >> >> For source partition, it's only the delete operation (no update >> happened) so we fire delete triggers and for the destination only >> insert operations so fire only inserts triggers. That will keep the >> things simple. And, it will also be in sync with the actual partition >> level delete/insert operations. >> >> We may argue that user might have declared only update triggers and as >> he has executed the update operation he may expect those triggers to >> get fired. But, I think this behaviour can be documented with the >> proper logic that if the user is updating the partition key then he >> must be ready with the Delete/Insert triggers also, he can not rely >> only upon update level triggers. >> > > Right, that is even my concern. The user might have declared only update > triggers, and when the user executes an UPDATE he expects them to get called - but > with option 3 that's not happening. Yes, that's the issue with option 3. A user wants to make sure update triggers run, and here we are skipping the BEFORE update triggers, which might even modify rows. Now regarding the AR update triggers: the user might be more concerned with the non-partition-key columns, and an UPDATE of the partition key would typically update only the partition key and not the other columns.
So for the typical case, it makes sense to skip the AR update trigger. But if the UPDATE contains both partition-key and other column updates, it makes sense to fire the AR update trigger. One thing we can do is disallow an UPDATE from having both partition-key and non-partition-key column updates. That way we can always skip the AR update triggers for row-movement updates, unless maybe we fire AR update triggers *only* if they are created using "BEFORE UPDATE OF " and the column is the partition key. Between skipping the delete/insert triggers versus skipping the update triggers, I would go for skipping the delete/insert triggers. I think we cannot skip BR update triggers, because that would be a correctness issue. From a user's perspective, I think the user would like to install a trigger that fires if any of the child tables get modified. But because there is no provision to install a common trigger, the user has to install the same trigger on every child table. In that sense, it might not matter whether we fire the AR update trigger on the old partition or the new partition. > > In terms of consistency, option 1 looks better. It does the same as what's > been implemented for UPSERT - so the user might already be > aware of the trigger behaviour. Plus if we document the behaviour then it > sounds correct - > > - Original command was UPDATE so BR update - Later found that it's ROW movement - so BR delete followed by AR delete > - Then Insert in new partition - so BR INSERT followed by AR Insert. > > But again I am not quite sure how good it will be to compare the partition > behaviour with the UPSERT. > > > >> >> Earlier I thought that option1 is better but later I think that this >> can complicate the situation as we are firing first BR update then BR >> delete and can change the row multiple time and defining such >> behaviour can be complicated.
>> >> -- >> Regards, >> Dilip Kumar >> EnterpriseDB: http://www.enterprisedb.com > > -- > Rushabh Lathia -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 12 May 2017 at 14:56, Amit Kapila wrote: > I think it might be better to summarize all the options discussed > including what the patch has and see what most people consider as > sensible. Yes, makes sense. Here are the options that were discussed so far for ROW triggers:

Option 1 (the patch follows this option):
BR update trigger on source partition.
BR, AR delete triggers on source partition.
BR, AR insert triggers on destination partition.
No AR update trigger.
Rationale: The BR update trigger should be fired because that trigger can even modify the rows, and that can result in a partition-key update even though the UPDATE statement itself does not update the partition key. Also, fire the delete/insert triggers on the respective partitions, since rows are about to be deleted/inserted. The AR update trigger should not be fired because that requires an actual update to have happened.

Option 2:
BR update trigger on source partition.
AR update trigger on destination partition.
No insert/delete triggers.
Rationale: Since it's an UPDATE statement, only update triggers should be fired. The update ends up moving the row into another partition, so the AR update trigger should be fired on that partition rather than the original one.

Option 3:
BR, AR delete triggers on source partition.
BR, AR insert triggers on destination partition.
Rationale: Since the update is converted to delete+insert, just skip the update triggers completely.

Option 4:
BR, AR update triggers on source partition.
BR, AR insert triggers on destination partition.
Rationale: Since it is an UPDATE statement, both BR and AR update triggers should be fired on the original partition. Since the update is converted to delete+insert, the corresponding triggers should be fired; but since we are already firing the update triggers on the original partition, skip the delete triggers, otherwise both update and delete triggers would get fired on the same partition.
For statement triggers, I think it should be based on the documentation recently checked in for partitions in general. +A statement that targets a parent table in an inheritance or partitioning +hierarchy does not cause the statement-level triggers of affected child +tables to be fired; only the parent table's statement-level triggers are +fired. However, row-level triggers of any affected child tables will be +fired. Based on that, for row movement as well, the statement trigger should be fired only for the table referred to in the UPDATE statement, and not for any child tables, or for any partitions to which rows were moved. The doc in this row-movement patch also matches this behaviour.
Re: [HACKERS] UPDATE of partition key
>> 3. >> + longer satisfy the partition constraint of the containing partition. In >> that >> + case, if there is some other partition in the partition tree for which >> this >> + row satisfies its partition constraint, then the row is moved to that >> + partition. If there isn't such a partition, an error will occur. >> >> Doesn't this error case indicate that this needs to be integrated with >> Default partition patch of Rahila or that patch needs to take care >> this error case? >> Basically, if there is no matching partition, then move it to default >> partition. > > Will have a look at this. Thanks for pointing this out. I tried update row movement with both my patch and the default-partition patch applied, and it looks like it works as expected: 1. When an update changes the partition key such that the row does not fit into any of the non-default partitions, the row is moved to the default partition. 2. If the row does fit into a non-default partition, the row moves into that partition. 3. If a row from the default partition is updated such that it fits into a non-default partition, it moves into that partition. We can debate whether the row should stay in the default partition or move; I think it should be moved, since the row now has a suitable partition. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 12 May 2017 at 08:30, Amit Kapila wrote: > On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar > wrote: >> On 11 May 2017 at 17:23, Amit Kapila wrote: >>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar >>> wrote: >>>> On 4 March 2017 at 12:49, Robert Haas wrote: >>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar >>>>> wrote: >>>>>> I think it does not make sense running after row triggers in case of >>>>>> row-movement. There is no update happened on that leaf partition. This >>>>>> reasoning can also apply to BR update triggers. But the reasons for >>>>>> having a BR trigger and AR triggers are quite different. Generally, a >>>>>> user needs to do some modifications to the row before getting the >>>>>> final NEW row into the database, and hence [s]he defines a BR trigger >>>>>> for that. And we can't just silently skip this step only because the >>>>>> final row went into some other partition; in fact the row-movement >>>>>> itself might depend on what the BR trigger did with the row. Whereas, >>>>>> AR triggers are typically written for doing some other operation once >>>>>> it is made sure the row is actually updated. In case of row-movement, >>>>>> it is not actually updated. >>>>> >>>>> How about running the BR update triggers for the old partition and the >>>>> AR update triggers for the new partition? It seems weird to run BR >>>>> update triggers but not AR update triggers. Another option would be >>>>> to run BR and AR delete triggers and then BR and AR insert triggers, >>>>> emphasizing the choice to treat this update as a delete + insert, but >>>>> (as Amit Kh. pointed out to me when we were in a room together this >>>>> week) that precludes using the BEFORE trigger to modify the row. >>>>> >>> >>> I also find the current behavior with respect to triggers quite odd. >>> The two points that appears odd are (a) Executing both before row >>> update and delete triggers on original partition sounds quite odd. 
>> Note that *before* trigger gets fired *before* the update happens. The >> actual update may not even happen, depending upon what the trigger >> does. And then in our case, the update does not happen; not just that, >> it is transformed into delete-insert. So then we should fire >> before-delete trigger. >> > > Sure, I am aware of that point, but it doesn't seem obvious that both > update and delete BR triggers get fired for original partition. As a > developer, it might be obvious to you that as you have used delete and > insert interface, it is okay that corresponding BR/AR triggers get > fired, however, it is not so obvious for others, rather it appears > quite odd. I agree that it seems a bit odd that we are firing both update and delete triggers on the same partition. But if you look at the perspective that the update=>delete+insert is a user-aware operation, it does not seem that odd. > If we try to compare it with the non-partitioned update, > there also it is internally a delete and insert operation, but we > don't fire triggers for those. For a non-partitioned table, the delete+insert is internal, whereas for partitioned table, it is completely visible to the user. > >>> (b) It seems inconsistent to consider behavior for row and statement >>> triggers differently >> >> I am not sure whether we should compare row and statement triggers. >> Statement triggers are anyway fired only per-statement, depending upon >> whether it is update or insert or delete. It has nothing to do with >> how the rows are modified. >> > > Okay. The broader point I was trying to convey was that the way this > patch defines the behavior of triggers doesn't sound good to me. It > appears to me that in this thread multiple people have raised points > around trigger behavior and you should try to consider those. I understand that there is no single solution which will provide completely intuitive trigger behaviour. Skipping BR delete trigger should be fine. 
But then, for consistency, we should skip the BR insert trigger as well, the theory being that the delete+insert is not issued by the user, so we should not fire those triggers. But I feel both should be fired, to avoid any consequences unexpected to the user who has installed those triggers. The only specific concern of yours is that of firing *both* update and insert triggers on the same table, right? My explanation for this was: we ha
Re: [HACKERS] UPDATE of partition key
On 12 May 2017 at 10:01, Amit Kapila wrote: > On Fri, May 12, 2017 at 9:27 AM, Amit Kapila wrote: >> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar >> wrote: >>> On 11 May 2017 at 17:24, Amit Kapila wrote: >>>> Few comments: >>>> 1. >>>> Operating directly on partition doesn't allow update to move row. >>>> Refer below example: >>>> create table t1(c1 int) partition by range(c1); >>>> create table t1_part_1 partition of t1 for values from (1) to (100); >>>> create table t1_part_2 partition of t1 for values from (100) to (200); >>>> insert into t1 values(generate_series(1,11)); >>>> insert into t1 values(generate_series(110,120)); >>>> >>>> postgres=# update t1_part_1 set c1=122 where c1=11; >>>> ERROR: new row for relation "t1_part_1" violates partition constraint >>>> DETAIL: Failing row contains (122). >>> >>> Yes, as Robert said, this is expected behaviour. We move the row only >>> within the partition subtree that has the update table as its root. In >>> this case, it's the leaf partition. >>> >> >> Okay, but what is the technical reason behind it? Is it because the >> current design doesn't support it or is it because of something very >> fundamental to partitions? No, we can do that if we decide to update some table outside the partition subtree. The reason is more a matter of semantics. I think the user who runs an UPDATE on a partitioned table should not necessarily have to be aware of the structure of the complete partition tree outside the current subtree, and it is always safer to return an error than to silently move the data outside of that subtree. >> > > One plausible theory is that as Select's on partitions just returns > the rows of that partition, the update should also behave in same way. Yes, right. And even inserts fail if we try to insert data that does not fit into the current subtree.
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
On 11 May 2017 at 17:24, Amit Kapila wrote: > Few comments: > 1. > Operating directly on a partition doesn't allow an update to move a row. > Refer below example: > create table t1(c1 int) partition by range(c1); > create table t1_part_1 partition of t1 for values from (1) to (100); > create table t1_part_2 partition of t1 for values from (100) to (200); > insert into t1 values(generate_series(1,11)); > insert into t1 values(generate_series(110,120)); > > postgres=# update t1_part_1 set c1=122 where c1=11; > ERROR: new row for relation "t1_part_1" violates partition constraint > DETAIL: Failing row contains (122). Yes, as Robert said, this is expected behaviour. We move the row only within the partition subtree that has the updated table as its root. In this case, it's the leaf partition. > > 3. > + longer satisfy the partition constraint of the containing partition. In > that > + case, if there is some other partition in the partition tree for which > this > + row satisfies its partition constraint, then the row is moved to that > + partition. If there isn't such a partition, an error will occur. > > Doesn't this error case indicate that this needs to be integrated with > the default partition patch of Rahila, or that that patch needs to take care > of this error case? > Basically, if there is no matching partition, then move it to the default > partition. Will have a look at this. Thanks for pointing this out. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
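The interaction with a default partition can be pictured abstractly: a row that fits no explicit partition's bounds either raises an error or, if a default partition exists, falls through to it. This is a hypothetical sketch of the routing rule under discussion, not PostgreSQL's tuple-routing code; all names are illustrative.

```python
def route_row(value, partitions, default=None):
    """Pick the partition whose half-open range [lo, hi) accepts value.

    `partitions` maps partition name -> (lo, hi), mirroring range
    partitioning. With no match and no default partition, raise the
    equivalent of the 'no partition found' error; with a default
    partition, the row lands there instead.
    """
    for name, (lo, hi) in partitions.items():
        if lo <= value < hi:
            return name
    if default is not None:
        return default
    raise ValueError("no partition found for row (%r)" % value)


# Mirrors the example in the quoted message: t1_part_1 is [1, 100),
# t1_part_2 is [100, 200).
parts = {"t1_part_1": (1, 100), "t1_part_2": (100, 200)}
```

Under this rule, an UPDATE moving c1 from 11 to 122 would re-route the row from t1_part_1 to t1_part_2, and a value outside all bounds would either error out or land in the default partition, which is exactly the integration question raised above.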
Re: [HACKERS] UPDATE of partition key
On 11 May 2017 at 17:23, Amit Kapila wrote: > On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar > wrote: >> On 4 March 2017 at 12:49, Robert Haas wrote: >>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar >>> wrote: >>>> I think it does not make sense running after row triggers in case of >>>> row-movement. No update has happened on that leaf partition. This >>>> reasoning can also apply to BR update triggers. But the reasons for >>>> having a BR trigger and AR triggers are quite different. Generally, a >>>> user needs to do some modifications to the row before getting the >>>> final NEW row into the database, and hence [s]he defines a BR trigger >>>> for that. And we can't just silently skip this step only because the >>>> final row went into some other partition; in fact the row-movement >>>> itself might depend on what the BR trigger did with the row. Whereas, >>>> AR triggers are typically written for doing some other operation once >>>> it is made sure the row is actually updated. In case of row-movement, >>>> it is not actually updated. >>> >>> How about running the BR update triggers for the old partition and the >>> AR update triggers for the new partition? It seems weird to run BR >>> update triggers but not AR update triggers. Another option would be >>> to run BR and AR delete triggers and then BR and AR insert triggers, >>> emphasizing the choice to treat this update as a delete + insert, but >>> (as Amit Kh. pointed out to me when we were in a room together this >>> week) that precludes using the BEFORE trigger to modify the row. >>> > > I also find the current behavior with respect to triggers quite odd. > The two points that appear odd are (a) Executing both before row > update and delete triggers on the original partition sounds quite odd. Note that a *before* trigger gets fired *before* the update happens. The actual update may not even happen, depending upon what the trigger does. 
And then in our case, the update does not happen; not just that, it is transformed into delete+insert. So then we should fire the before-delete trigger. > (b) It seems inconsistent to consider behavior for row and statement > triggers differently I am not sure whether we should compare row and statement triggers. Statement triggers are anyway fired only per-statement, depending upon whether it is an update, insert, or delete. It has nothing to do with how the rows are modified. > >> >> I checked the trigger behaviour in case of UPSERT. Here, when a >> conflict is found, ExecOnConflictUpdate() is called, and then the >> function returns immediately, which means the AR INSERT trigger will not >> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR >> and AR UPDATE triggers will be fired. So in short, when an INSERT >> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE >> and AR UPDATE also get fired. On the same lines, it makes sense in >> case of the UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on >> the original table, and then the BR and AR DELETE/INSERT triggers on >> the respective tables. >> > > I am not sure if it is a good idea to compare it with "Insert On > Conflict Do Update", but even if we want that way, I think Insert On > Conflict is consistent in statement level triggers, which means it will > fire both Insert and Update statement level triggers (as per below > note in docs), whereas the documentation in the patch indicates that > this patch will only fire Update statement level triggers, which is > odd > > Note in docs about Insert On Conflict > "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both > INSERT and UPDATE statement level trigger will be fired. I guess the reason this behaviour is kept for UPSERT is that the statement itself suggests both: insert/update. 
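The two row-level trigger orderings being compared above can be laid out side by side. This is only an illustrative enumeration of the sequences described in the messages (the observed UPSERT behaviour and the behaviour proposed for row movement), not PostgreSQL code; the labels are descriptive only.

```python
def row_triggers_for(operation):
    """Return the row-level trigger sequence described in the thread.

    "upsert_update_path": observed behaviour when INSERT ... ON CONFLICT
    DO UPDATE takes the update path -- BR INSERT fires, then the insert
    is turned into an update, firing BR and AR UPDATE triggers.

    "update_row_movement": proposed behaviour when an UPDATE moves a row
    across partitions -- BR UPDATE fires on the original partition, then
    the operation becomes DELETE (source) + INSERT (destination).
    """
    sequences = {
        "upsert_update_path": [
            "BR INSERT (target)",
            "BR UPDATE (target)",
            "AR UPDATE (target)",
        ],
        "update_row_movement": [
            "BR UPDATE (source partition)",
            "BR DELETE (source partition)",
            "AR DELETE (source partition)",
            "BR INSERT (destination partition)",
            "AR INSERT (destination partition)",
        ],
    }
    return sequences[operation]
```

The analogy drawn above is that both cases fire the BR trigger of the statement the user actually issued before the operation is transformed into something else.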
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
On 5 May 2017 at 07:50, David Rowley wrote: > On 3 May 2017 at 07:13, Robert Haas wrote: >> It is of course possible that the Parallel Seq Scan could run into >> contention problems if the number of workers is large, but in my >> experience there are bigger problems here. The non-parallel Seq Scan >> can also contend -- not of course over the shared mutex because there >> isn't one, but over access to the blocks themselves. Every one of >> those blocks has a content lock and a buffer header and so on, and >> having multiple processes accessing those things at the same time >> scales well, but not perfectly. The Hash node can also contend: if >> the hash join spills to disk, you've got multiple processes reading >> and writing to the temp directory at the same time and, of course, >> that can be worse than just one process doing it -- sometimes much >> worse. It can also be better, depending on how much I/O gets >> generated and how much I/O bandwidth you have. > > Yeah, I did get some time to look over the contention in Parallel Seq > Scan a while back and I discovered that on the machine that I was > testing on, the lock obtained in heap_parallelscan_nextpage() was > causing workers to have to wait for other workers to fetch their next > task to work on. > > I ended up writing the attached (which I'd not intended to post until > some time closer to when the doors open for PG11). At the moment it's > basically just a test patch to see how it affects things when we give > workers a bit more to do before they come back to look for more work. > In this case, I've just given them 10 pages to work on, instead of the > 1 that's allocated in 9.6 and v10. 
> > A quick test on a pretty large table on a large machine shows: > > Unpatched: > > postgres=# select count(*) from a; >count > > 187400 > (1 row) > > Time: 5211.485 ms (00:05.211) > > Patched: > > postgres=# select count(*) from a; >count > > 187400 > (1 row) > > Time: 2523.983 ms (00:02.524) The result is quite impressive. > > So it seems worth looking into. "a" was just a table with a single int > column. I'm unsure as yet if there are more gains to be had for tables > with wider tuples. I do suspect the patch will be a bigger win in > those cases, since there's less work to do for each page, e.g less > advance aggregate calls, so likely they'll be looking for their next > page a bit sooner. > > Now I'm not going to pretend that this patch is ready for the > prime-time. I've not yet worked out how to properly report sync-scan > locations without risking reporting later pages after reporting the > end of the scan. What I have at the moment could cause a report to be > missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch > size. I'm also not sure how batching like this affect read-aheads, but > at least the numbers above speak for something. Although none of the > pages in this case came from disk. > > I'd had thoughts that the 10 pages wouldn't be constant, but the > batching size would depend on the size of the relation to be scanned. > I'd rough ideas to just try to make about 1 million batches. Something > like batch_pages = Max(parallel_scan->phs_nblocks / 100, 1); so > that we only take more than 1 page if there's some decent amount to > process. We don't want to make the batches too big as we might end up > having to wait on slow workers at the end of a scan. I was wondering : if we keep increasing the batch size, that might lead to I/O contention. I mean, the higher the batch size, the higher is the chance to cause more random I/O, because all workers would be accessing disk blocks far away from each other in parallel. 
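The batching idea can be sketched with a shared counter from which each worker claims a chunk of pages instead of a single page, so the mutex is taken once per batch rather than once per page. This is an illustrative model, not the patch's actual code; the class and function names, and the batch-size heuristic, are assumptions based on the description above.

```python
import threading


class ParallelScanState:
    """Illustrative stand-in for the shared parallel-scan state."""

    def __init__(self, nblocks, batch_pages):
        self.nblocks = nblocks
        self.batch_pages = batch_pages
        self.next_block = 0
        self.lock = threading.Lock()


def claim_next_batch(state):
    """Claim a contiguous run of pages under the mutex.

    Workers return to the shared state once per batch instead of once
    per page, which is what reduces the contention observed on the lock
    in heap_parallelscan_nextpage(). Returns None when the scan is done.
    """
    with state.lock:
        start = state.next_block
        if start >= state.nblocks:
            return None
        end = min(start + state.batch_pages, state.nblocks)
        state.next_block = end
    return range(start, end)


def choose_batch_size(nblocks, target_batches=1_000_000):
    """A heuristic like the one floated above: aim for a large number of
    batches so slow workers don't stall the end of the scan, but never
    claim less than one page."""
    return max(nblocks // target_batches, 1)
```

The trade-off mentioned in the thread shows up directly here: a larger `batch_pages` means fewer lock acquisitions, but also a bigger chunk a straggling worker can be left holding at the end of the scan.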
So there might be a trade-off here. (Separately, I/O contention testing needs to be done in general, for many scenarios.) I believe there are certain parallel scans (parallel bitmap heap scan ?) where the logic to find the next block itself consumes time, causing more waits. What if we supply each worker with a sequence of blocks to be scanned, rather than a range of blocks? Each worker would have a list of its next few blocks, say : w1 : 1, 5, 9, 13 w2 : 2, 6, 10, 14 w3 : 3, 7, 11, 15. w4 : . Maybe the leader would do the accounting and store the instructions for each of the workers at individual locations in shared memory, so there won't be any contention while accessing them. This may be simple/applicable for a sequential scan, but not for other scans, for some of which it may not even be possible. But basically I was thinking of a way to tackle both shared memory contention and random I/O. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
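The strided assignment sketched above (w1: 1, 5, 9, 13; w2: 2, 6, 10, 14; ...) is easy to precompute, so a leader could write each worker's list into that worker's own slot in shared memory and avoid contention on a shared counter. This is a hypothetical illustration of the idea, not patch code.

```python
def strided_block_lists(nblocks, nworkers):
    """Precompute a per-worker list of block numbers, interleaved so
    that consecutive blocks go to different workers: worker w scans
    blocks w, w + nworkers, w + 2*nworkers, ..."""
    return [list(range(w, nblocks, nworkers)) for w in range(nworkers)]
```

Each worker then walks only its own list without touching shared state, at the cost of each worker's individual accesses being strided rather than sequential, which is exactly the random-I/O concern raised above.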
Re: [HACKERS] UPDATE of partition key
On 2 May 2017 at 18:17, Robert Haas wrote: > On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar wrote: >> Attached updated patch v7 has the above changes. > > This no longer applies. Please rebase. Thanks Robert for informing me about this. My patch has a separate function for emitting an error message when a partition constraint fails. And the recent commit c0a8ae7be3 has changes to correct the way the tuple is formed for displaying in the error message. Hence there were some code-level conflicts. Attached is the rebased patch, which resolves the above conflicts. update-partition-key_v7_rebased.patch Description: Binary data
Re: [HACKERS] Declarative partitioning - another take
On 26 April 2017 at 00:28, Robert Haas wrote: > So what I'd prefer -- on > the totally unprincipled basis that it would let us improve > performance in the future -- if you operate on a partition directly, > you fire the partition's triggers, but if you operate on the parent, > only the parent's triggers fire. I would also opt for this behaviour. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Parallel Append implementation
On 7 April 2017 at 20:35, Andres Freund wrote: >> But for costs such as (4, 4, 4, 20 times), the logic would give >> us 20 workers because we want to finish the Append in 4 time units; >> and this is what we want to avoid when we go with the >> don't-allocate-too-many-workers approach. > > I guess, my problem is that I don't agree with that as a goal in and of > itself. If you actually want to run your query quickly, you *want* 20 > workers here. The issues of backend startup overhead (already via > parallel_setup_cost), concurrency and such cost should be modelled, not > buried in a formula the user can't change. If we want to make it less > and less likely to start more workers we should make that configurable, > not the default. > I think there's some precedent taken from the parallel seqscan case, > that's not actually applicable here. Parallel seqscans have a good > amount of shared state, both on the kernel and pg side, and that shared > state reduces gains of increasing the number of workers. But with > non-partial scans such shared state largely doesn't exist. After searching through earlier mails about parallel scan, I am not sure whether the shared state was considered to be a potential factor that might reduce parallel query gains when deciding how to calculate the number of workers for a parallel seq scan. I mean, even today, if we allocate 10 workers as against a calculated count of 4 workers for a parallel seq scan, they might help. I think it's just that we don't know if they would *always* help, or whether they would sometimes cause a regression. 
Re: [HACKERS] Parallel Append implementation
On 6 April 2017 at 07:33, Andres Freund wrote: > On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote: >> This is what the earlier versions of my patch had done : just add up >> per-subplan parallel_workers (1 for non-partial subplan and >> subpath->parallel_workers for partial subplans) and set this total as >> the Append parallel_workers. > > I don't think that's great, consider e.g. the case that you have one > very expensive query, and a bunch of cheaper ones. Most of those workers > wouldn't do much while waiting for the expensive query. What I'm > basically thinking we should do is something like the following > pythonesque pseudocode:
>
> best_nonpartial_cost = -1
> best_nonpartial_nworkers = -1
>
> for numworkers in 1...#max workers:
>     worker_work = [0 for x in range(0, numworkers)]
>
>     nonpartial_cost += startup_cost * numworkers
>
>     # distribute all nonpartial tasks over workers. Assign tasks to the
>     # worker with the least amount of work already performed.
>     for task in all_nonpartial_subqueries:
>         least_busy_worker = worker_work.smallest()
>         least_busy_worker += task.total_nonpartial_cost
>
>     # the nonpartial cost here is the largest amount any single worker
>     # has to perform.
>     nonpartial_cost += worker_work.largest()
>
>     total_partial_cost = 0
>     for task in all_partial_subqueries:
>         total_partial_cost += task.total_nonpartial_cost
>
>     # Compute resources needed by partial tasks. First compute how much
>     # cost we can distribute to workers that take shorter than the
>     # "busiest" worker doing non-partial tasks.
>     remaining_avail_work = 0
>     for i in range(0, numworkers):
>         remaining_avail_work += worker_work.largest() - worker_work[i]
>
>     # Equally divide up remaining work over all workers
>     if remaining_avail_work < total_partial_cost:
>         nonpartial_cost += (worker_work.largest - remaining_avail_work) / numworkers
>
>     # check if this is the best number of workers
>     if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
>         best_nonpartial_cost = worker_work.largest
>         best_nonpartial_nworkers = nworkers
>
> Does that make sense?
Yeah, I gather what you are trying to achieve is : allocate a number of workers such that the total cost does not exceed the cost of the first non-partial plan (i.e. the costliest one, because the plans are sorted by descending cost). So for non-partial costs such as (20, 10, 5, 2), allocate only 2 workers, because the 2nd worker will execute (10, 5, 2) while the 1st worker executes (20). But for costs such as (4, 4, 4, 20 times), the logic would give us 20 workers because we want to finish the Append in 4 time units; and this is what we want to avoid when we go with the don't-allocate-too-many-workers approach. > > >> BTW all of the above points apply only for non-partial plans. > > Indeed. But I think that's going to be a pretty common type of plan, Yes it is. > especially if we get partitionwise joins. About that I am not sure, because we already have support for parallel joins, so wouldn't the join subpaths corresponding to all of the partitions be partial paths ? I may be wrong about that. But if the subplans are foreign scans, then yes, all would be non-partial plans. This may provoke off-topic discussion, but here, instead of assigning so many workers to all these foreign plans and having all those workers wait for the results, a single asynchronous execution node (which is still in the making) would be desirable, because it would do the job of all these workers. 
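The quoted pseudocode can be turned into a runnable sketch with the obvious fixes (actually mutating the per-worker totals, summing the partial subplans' own costs, and parenthesizing the method calls). This is only an illustration of the heuristic being discussed, not planner code; all names are made up.

```python
import heapq


def choose_append_workers(nonpartial_costs, partial_costs,
                          startup_cost, max_workers):
    """For each candidate worker count, greedily spread the non-partial
    subplans over the workers (always loading the least busy worker),
    add the leftover partial work divided evenly, and keep the worker
    count with the lowest estimated completion cost."""
    best_cost = None
    best_nworkers = 0
    total_partial = sum(partial_costs)
    for numworkers in range(1, max_workers + 1):
        worker_work = [0.0] * numworkers
        heapq.heapify(worker_work)
        # Assign each non-partial subplan to the least busy worker,
        # biggest subplans first for a better greedy packing.
        for cost in sorted(nonpartial_costs, reverse=True):
            least = heapq.heappop(worker_work)
            heapq.heappush(worker_work, least + cost)
        busiest = max(worker_work)
        # Partial work that workers finishing before the busiest one
        # can absorb "for free".
        remaining_avail = sum(busiest - w for w in worker_work)
        cost = busiest + startup_cost * numworkers
        if total_partial > remaining_avail:
            cost += (total_partial - remaining_avail) / numworkers
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_nworkers = numworkers
    return best_nworkers, best_cost
```

On the example from the reply above, non-partial costs (20, 10, 5, 2) with zero startup cost settle on 2 workers, since a third worker cannot beat the 20-unit plan that bounds the total anyway; a nonzero `startup_cost` is what would push back against the (4, 4, 4, ... 20 times) case.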
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company