Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Sat, 17 Nov 2018 at 07:28, Alvaro Herrera wrote: > > The 0002 patch is included again, this time with a new proposed commit > > message. There was some discussion over on [1] where nobody seemed to > > have any concerns about delaying the locking until we route the first > > tuple to the partition. > > Please get a new commitfest entry for this patch. Added to Jan-fest in: https://commitfest.postgresql.org/21/1887/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Thu, Nov 22, 2018 at 11:32:04AM +0900, Amit Langote wrote: > I noticed that there's a "be" missing in the comment above > ExecFindPartition. Fixed in the attached. Thanks Amit, I have committed this one. -- Michael signature.asc Description: PGP signature
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi, On Thu, Nov 22, 2018 at 7:25 AM David Rowley wrote: > > On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera wrote: > > On 2018-Nov-21, David Rowley wrote: > > > If I wasn't on leave late last week and early this week then the only > > > thing I'd have mentioned was the lack of empty comment line in the > > > header comment for PartitionDispatchData. It looks a bit messy > > > without. > > > > Absolutely. Pushed a few newlines -- I hope I understood you correctly. > > Thanks, you did. That looks better now. I noticed that there's a "be" missing in the comment above ExecFindPartition. Fixed in the attached. Thanks, Amit ExecFindPartition-typo.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera wrote: > On 2018-Nov-21, David Rowley wrote: > > If I wasn't on leave late last week and early this week then the only > > thing I'd have mentioned was the lack of empty comment line in the > > header comment for PartitionDispatchData. It looks a bit messy > > without. > > Absolutely. Pushed a few newlines -- I hope I understood you correctly. Thanks, you did. That looks better now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Nov-21, David Rowley wrote: > If I wasn't on leave late last week and early this week then the only > thing I'd have mentioned was the lack of empty comment line in the > header comment for PartitionDispatchData. It looks a bit messy > without. Absolutely. Pushed a few newlines -- I hope I understood you correctly. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Sat, 17 Nov 2018 at 04:14, Alvaro Herrera wrote: > I'll now see about the commit message and push shortly. Many thanks for making the required adjustments and pushing this. If I wasn't on leave late last week and early this week then the only thing I'd have mentioned was the lack of empty comment line in the header comment for PartitionDispatchData. It looks a bit messy without. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Nov-13, David Rowley wrote: > The 0002 patch is included again, this time with a new proposed commit > message. There was some discussion over on [1] where nobody seemed to > have any concerns about delaying the locking until we route the first > tuple to the partition. Please get a new commitfest entry for this patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
I repeated David's original tests not terribly rigorously[*] and got these numbers: * Unpatched:72.396196 * 0001 :77.279404 * 0001+0002: 20409.415094 * 0002: 816.606613 * control : 22969.140596 (insertion into unpartitioned table) So while this patch by itself gives a pretty lame increase in tps, it removes bottlenecks that will appear once we change the locking scheme. [*] On my laptop, running each test only once for 60s. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Nov-16, Alvaro Herrera wrote: > One thing I don't quite like is the inconsistency in handling memory > context switches in the various function allocating stuff. It seems > rather haphazard. I'd rather have a memcxt member in > PartitionTupleRouting, which is set when the struct is created, and then > have all the other functions allocating stuff use that one. So while researching this I finally realized that there was a "lexical disconnect" between setting a ResultRelInfo's ri_PartitionInfo struct/pointer and adding it to the PartitionTupleRoute arrays. However, if you think about it, these things are one and the same, so we don't need to do them separately; just merge the new function I wrote into the existing ExecInitRoutingInfo(). Patch attached. (This version also rebases across Andres' recent conflicting TupleTableSlot changes.) I'll now see about the commit message and push shortly. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index e62e3d8fba..6588ebd6dc 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate) bool *nulls; ResultRelInfo *resultRelInfo; ResultRelInfo *target_resultRelInfo; + ResultRelInfo *prevResultRelInfo = NULL; EState *estate = CreateExecutorState(); /* for ExecConstraints() */ ModifyTableState *mtstate; ExprContext *econtext; @@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate) CopyInsertMethod insertMethod; uint64 processed = 0; int nBufferedTuples = 0; - int prev_leaf_part_index = -1; bool has_before_insert_row_trig; bool has_instead_insert_row_trig; bool leafpart_use_multi_insert = false; @@ -2515,8 +2515,12 @@ CopyFrom(CopyState cstate) /* * If there are any triggers with transition tables on the named relation, * we need to be prepared to capture transition tuples. + * + * Because partition tuple routing would like to know about whether + * transition capture is active, we also set it in mtstate, which is + * passed to ExecFindPartition() below. */ - cstate->transition_capture = + cstate->transition_capture = mtstate->mt_transition_capture = MakeTransitionCaptureState(cstate->rel->trigdesc, RelationGetRelid(cstate->rel), CMD_INSERT); @@ -2526,19 +2530,8 @@ CopyFrom(CopyState cstate) * CopyFrom tuple routing. */ if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - { proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel); - /* - * If we are capturing transition tuples, they may need to be - * converted from partition format back to partitioned table format - * (this is only ever necessary if a BEFORE trigger modifies the - * tuple). - */ - if (cstate->transition_capture != NULL) - ExecSetupChildParentMapForLeaf(proute); - } - /* * It's more efficient to prepare a bunch of tuples for insertion, and * insert them in one heap_multi_insert() call, than call heap_insert() @@ -2694,25 +2687,17 @@ CopyFrom(CopyState cstate) /* Determine the partition to heap_insert the tuple into */ if (proute) { - int leaf_part_index; TupleConversionMap *map; /* - * Away we go ... If we end up not finding a partition after all, - * ExecFindPartition() does not return and errors out instead. - * Otherwise, the returned value is to be used as an index into - * arrays mt_partitions[] and mt_partition_tupconv_maps[] that - * will get us the ResultRelInfo and TupleConversionMap for the - * partition, respectively. + * Attempt to find a partition suitable for this tuple. + * ExecFindPartition() will raise an error if none can be found or + * if the found partition is not suitable for INSERTs. */ - leaf_part_index = ExecFindPartition(target_resultRelInfo, -proute->partition_dispatch_info, -slot, -estate); - Assert(leaf_part_index >= 0 && - leaf_part_index < proute->num_partitions); + resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo, + proute, slot, estate); - if (prev_leaf_part_index != leaf_part_index) + if (prevResultRelInfo != resultRelInfo) { /* Check if we can multi-insert into this partition */ if (insertMethod == CIM_MULTI_CONDITIONAL) @@ -2725,12 +2710,9 @@ CopyFrom(CopyState cstate) if (nBufferedTuples > 0) { ExprContext *swapcontext; - ResultRelInfo *presultRelInfo; - - presultRelInfo = proute->partitions[prev_leaf_part_index]; CopyFromInsertBatch(cstate, estate, mycid, hi_options, - presultRelInfo, myslot, bistate, + prevResultRelInfo, myslot, bistate, nBufferedTuples, bufferedTuples, firstBufferedLineNo); nBufferedTuples = 0; @@ -2787,21 +2769,6 @@ CopyFrom(CopyState cstate) } } -
Re: Speeding up INSERTs and UPDATEs to partitioned tables
One thing I don't quite like is the inconsistency in handling memory context switches in the various function allocating stuff. It seems rather haphazard. I'd rather have a memcxt member in PartitionTupleRouting, which is set when the struct is created, and then have all the other functions allocating stuff use that one. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Fri, Nov 16, 2018 at 11:40 AM Alvaro Herrera wrote: > On 2018-Nov-15, Amit Langote wrote: > > > Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or > > PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? > > Here's a proposed delta on v17 0001. Most importantly, I noticed that > the hashed subplans stuff didn't actually work, because the hash API was > not being used correctly. So the search in the hash would never return > a hit, and we'd create RRIs for those partitions again. To fix this, I > added a new struct to hold hash entries. I'm a bit surprised that you found that the hash table didn't work, because I remember having checked by attaching gdb that it works when I was hacking on my own delta patch, but I may have been looking at too many things. > I think this merits that the performance tests be redone. (Unless I > misunderstand, this shouldn't change the performance of INSERT, only > that of UPDATE.) Actually, I don't remember seeing performance tests done with UPDATEs on this thread. Since we don't needlessly scan *all* subplan result rels anymore, maybe this removes a good deal of overhead for UPDATEs that update partition key. > On the subject of the ArraySpace routines, I decided to drop them and > instead do the re-allocations on the places where they were needed. > In the original code there were two places for the partitions array, but > both did the same thing so it made sense to create a separate routine to > do it instead (ExecRememberPartitionRel), and do the allocation there. > Just out of custom I moved the palloc to appear at the same place as the > repalloc, and after doing so it became obvious that we were > over-allocating memory for the PartitionDispatchData pointer -- > allocating the size for the whole struct instead of just the pointer. > > (I renamed the "allocsize" struct members to "max", as is customary.) These changes look good to me. > I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It > shouldn't be useful if the code is correct, but if there are bugs it's > better to be able to interrupt infinite loops :-) Good measure. :) > I reindented the comment atop PartitionTupleRouting. The other way was > just too unwieldy. > > Let me know what you think. Regression tests still pass for me. Overall, it looks good to me. Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Nov-15, Amit Langote wrote: > Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or > PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? Here's a proposed delta on v17 0001. Most importantly, I noticed that the hashed subplans stuff didn't actually work, because the hash API was not being used correctly. So the search in the hash would never return a hit, and we'd create RRIs for those partitions again. To fix this, I added a new struct to hold hash entries. I think this merits that the performance tests be redone. (Unless I misunderstand, this shouldn't change the performance of INSERT, only that of UPDATE.) On the subject of the ArraySpace routines, I decided to drop them and instead do the re-allocations on the places where they were needed. In the original code there were two places for the partitions array, but both did the same thing so it made sense to create a separate routine to do it instead (ExecRememberPartitionRel), and do the allocation there. Just out of custom I moved the palloc to appear at the same place as the repalloc, and after doing so it became obvious that we were over-allocating memory for the PartitionDispatchData pointer -- allocating the size for the whole struct instead of just the pointer. (I renamed the "allocsize" struct members to "max", as is customary.) I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It shouldn't be useful if the code is correct, but if there are bugs it's better to be able to interrupt infinite loops :-) I reindented the comment atop PartitionTupleRouting. The other way was just too unwieldy. Let me know what you think. Regression tests still pass for me. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 32d2461528..22a814bcbe 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, resultRelInfo->ri_PartitionCheck = partition_check; resultRelInfo->ri_PartitionRoot = partition_root; - resultRelInfo->ri_PartitionInfo = NULL; /* May be set later */ + resultRelInfo->ri_PartitionInfo = NULL; /* may be set later */ } /* diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index b2d394676f..592daab1be 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -38,48 +38,46 @@ * route a tuple inserted into a partitioned table to one of its leaf * partitions. * - * partition_root The partitioned table that's the target of the - * command. + * partition_root + * The partitioned table that's the target of the command. * - * partition_dispatch_info Array of 'dispatch_allocsize' elements containing - * a pointer to a PartitionDispatch object for every - * partitioned table touched by tuple routing. The - * entry for the target partitioned table is *always* - * present in the 0th element of this array. See - * comment for PartitionDispatchData->indexes for - * details on how this array is indexed. + * partition_dispatch_info + * Array of 'max_dispatch' elements containing a pointer to a + * PartitionDispatch object for every partitioned table touched by tuple + * routing. The entry for the target partitioned table is *always* + * present in the 0th element of this array. See comment for + * PartitionDispatchData->indexes for details on how this array is + * indexed. * - * num_dispatchThe current number of items stored in the - * 'partition_dispatch_info' array. Also serves as - * the index of the next free array element for new - * PartitionDispatch objects that need to be stored. + * num_dispatch + * The current number of items stored in the 'partition_dispatch_info' + * array. Also serves as the index of the next free array element for + * new PartitionDispatch objects that need to be stored. * - * dispatch_allocsize The current allocated size of the - * 'partition_dispatch_info' array. + * max_dispatch + * The current allocated size of the 'partition_dispatch_info' array. * - * partitionsArray of 'partitions_allocsize' elements - * containing a pointer to a ResultRelInfo for every - * leaf partitions touched by tuple routing. Some of - * these are pointers to ResultRelInfos which are - * borrowed out of 'subplan_resultrel_hash'. The - * remainder have been built especially for tuple - * routing. See comment for - * PartitionDispatchData->indexes for details on how - * this array is indexed. + * partitions + * Array of 'max_partitions' elements containing a pointer to a + * ResultRelInfo for every leaf partitions touched by tuple routing. + * Some of these are pointers to ResultRelInfos whi
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/15 8:58, Alvaro Herrera wrote: > On 2018-Nov-15, David Rowley wrote: > >> On 15 November 2018 at 07:10, Alvaro Herrera >> wrote: >>> What's with this comment? >>> >>> * Initially we must only set up 1 PartitionDispatch object; the >>> one for >>> * the partitioned table that's the target of the command. If we >>> must >>> * route a tuple via some sub-partitioned table, then its >>> * PartitionDispatch is only built the first time it's required. >>> >>> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at >>> odds with the '1' mentioned in the comment. Which is wrong? >> >> I don't think either is wrong, but I guess something must be >> misleading, so could perhaps be improved. > > Ah, that makes sense. Yeah, it seems a bit misleading to me. No > worries. Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Nov-15, David Rowley wrote: > On 15 November 2018 at 07:10, Alvaro Herrera wrote: > > What's with this comment? > > > > * Initially we must only set up 1 PartitionDispatch object; the > > one for > > * the partitioned table that's the target of the command. If we > > must > > * route a tuple via some sub-partitioned table, then its > > * PartitionDispatch is only built the first time it's required. > > > > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at > > odds with the '1' mentioned in the comment. Which is wrong? > > I don't think either is wrong, but I guess something must be > misleading, so could perhaps be improved. Ah, that makes sense. Yeah, it seems a bit misleading to me. No worries. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Thanks for picking this up. On 15 November 2018 at 07:10, Alvaro Herrera wrote: > What's with this comment? > > * Initially we must only set up 1 PartitionDispatch object; the one > for > * the partitioned table that's the target of the command. If we must > * route a tuple via some sub-partitioned table, then its > * PartitionDispatch is only built the first time it's required. > > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at > odds with the '1' mentioned in the comment. Which is wrong? I don't think either is wrong, but I guess something must be misleading, so could perhaps be improved. We're simply allocating enough space for PARTITION_ROUTING_INITSIZE but we're only initialising 1 item. That leaves space for PARTITION_ROUTING_INITSIZE - 1 more items before we'd need to reallocate the array. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
What's with this comment? * Initially we must only set up 1 PartitionDispatch object; the one for * the partitioned table that's the target of the command. If we must * route a tuple via some sub-partitioned table, then its * PartitionDispatch is only built the first time it's required. You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at odds with the '1' mentioned in the comment. Which is wrong? (I have a few edits on the patch, so please don't send a full v18 -- a delta patch would be welcome, if you have further changes to propose.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Thanks for updating the patch. On 2018/11/14 13:16, David Rowley wrote: > Thanks for looking at this again. > > On 14 November 2018 at 13:47, Amit Langote > wrote: >> +if (dispatchidx >= proute->dispatch_allocsize) >> +{ >> +/* Expand allocated space. */ >> +proute->dispatch_allocsize *= 2; >> +proute->partition_dispatch_info = (PartitionDispatchData **) >> +repalloc(proute->partition_dispatch_info, >> + sizeof(PartitionDispatchData *) * >> + proute->dispatch_allocsize); >> +} >> >> Sorry, I forgot to point this out before, but can this code in >> ExecInitPartitionDispatchInfo be accommodated in >> ExecCheckPartitionArraySpace() for consistency? > > I don't really want to put that code in ExecCheckPartitionArraySpace() > as the way the function is now, it makes quite a lot of sense for the > compiler to inline it. If we add redundant work in there, then it > makes less sense. There's never any need to check both arrays at once > as we're only adding the new item to one array at a time. > > Instead, I've written a new function named > ExecCheckDispatchArraySpace() and put the resize code inside that. Okay, seems fine. > I've fixed the typos you mentioned. The only other thing I changed was > to only allocate the PartitionDispatch->tupslot if a conversion is > required. The previous code allocated this regardless if it was going > to be used or not. This saves both the redundant allocation and also > very slightly reduces the cost of the if test in ExecFindPartition(). > There's now no need to check if the map ! NULL as if the slot is there Also makes sense. Although it seems that Alvaro has already started at looking at this, I'll mark the CF entry as Ready for Committer anyway, because I don't have any more comments. :) Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Thanks for looking at this again. On 14 November 2018 at 13:47, Amit Langote wrote: > +if (dispatchidx >= proute->dispatch_allocsize) > +{ > +/* Expand allocated space. */ > +proute->dispatch_allocsize *= 2; > +proute->partition_dispatch_info = (PartitionDispatchData **) > +repalloc(proute->partition_dispatch_info, > + sizeof(PartitionDispatchData *) * > + proute->dispatch_allocsize); > +} > > Sorry, I forgot to point this out before, but can this code in > ExecInitPartitionDispatchInfo be accommodated in > ExecCheckPartitionArraySpace() for consistency? I don't really want to put that code in ExecCheckPartitionArraySpace() as the way the function is now, it makes quite a lot of sense for the compiler to inline it. If we add redundant work in there, then it makes less sense. There's never any need to check both arrays at once as we're only adding the new item to one array at a time. Instead, I've written a new function named ExecCheckDispatchArraySpace() and put the resize code inside that. I've fixed the typos you mentioned. The only other thing I changed was to only allocate the PartitionDispatch->tupslot if a conversion is required. The previous code allocated this regardless if it was going to be used or not. This saves both the redundant allocation and also very slightly reduces the cost of the if test in ExecFindPartition(). There's now no need to check if the map ! NULL as if the slot is there then the map must exist too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v17-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data v17-0002-Delay-locking-of-partitions-during-INSERT-and-UP.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/14 0:32, Jesper Pedersen wrote: > Hi, > > On 11/12/18 6:17 PM, David Rowley wrote: >> On 9 November 2018 at 19:18, Amit Langote >> wrote: >>> I have a comment regarding how you chose to make >>> PartitionTupleRouting private. >>> >>> Using the v14_to_v15 diff, I could quickly see that there are many diffs >>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they >>> would be unnecessary if you had added the following in execPartition.h, as >>> my upthread had done. >>> >>> -/* See execPartition.c for the definition. */ >>> +/* See execPartition.c for the definitions. */ >>> typedef struct PartitionDispatchData *PartitionDispatch; >>> +typedef struct PartitionTupleRouting PartitionTupleRouting; >> >> Okay, done that way. v16 attached. Thank you. > Still passes check-world. I looked at v16 and noticed a few typos: + * partition_dispatch_info Array of 'dispatch_allocsize' elements containing + * a pointer to a PartitionDispatch objects for a PartitionDispatch objects -> a PartitionDispatch object + * partitions Array of 'partitions_allocsize' elements + * containing pointers to a ResultRelInfos of all + * leaf partitions touched by tuple routing. a ResultRelInfos -> ResultRelInfos + * PartitionDispatch and ResultRelInfo pointers the 'partitions' array. "in" the 'partitions' array. +/* Setup the PartitionRoutingInfo for it */ Setup -> Set up + * Ensure there's enough space in the 'partitions' array of 'proute' + * and store it in the next empty slot in proute's partitions array. Not a typo, but maybe just write proute->partitions instead of "partitions array of proute" and "proute's partition array". + * the next available slot in the 'proute' partition_dispatch_info[] + * array. Also, record the index into this array in the 'parent_pd' Similarly, here: proute->partition_dipatch_info array + * array. Also, record the index into this array in the 'parent_pd' + * indexes[] array in the partidx element so that we can properly Similarly, parent_pd->indexes array +if (dispatchidx >= proute->dispatch_allocsize) +{ +/* Expand allocated space. */ +proute->dispatch_allocsize *= 2; +proute->partition_dispatch_info = (PartitionDispatchData **) +repalloc(proute->partition_dispatch_info, + sizeof(PartitionDispatchData *) * + proute->dispatch_allocsize); +} Sorry, I forgot to point this out before, but can this code in ExecInitPartitionDispatchInfo be accommodated in ExecCheckPartitionArraySpace() for consistency? Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi, On 11/12/18 6:17 PM, David Rowley wrote: On 9 November 2018 at 19:18, Amit Langote wrote: I have a comment regarding how you chose to make PartitionTupleRouting private. Using the v14_to_v15 diff, I could quickly see that there are many diffs changing PartitionTupleRouting to struct PartitionTupleRouting, but they would be unnecessary if you had added the following in execPartition.h, as my upthread had done. -/* See execPartition.c for the definition. */ +/* See execPartition.c for the definitions. */ typedef struct PartitionDispatchData *PartitionDispatch; +typedef struct PartitionTupleRouting PartitionTupleRouting; Okay, done that way. v16 attached. Still passes check-world. Best regards, Jesper
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 9 November 2018 at 19:18, Amit Langote wrote: > I have a comment regarding how you chose to make > PartitionTupleRouting private. > > Using the v14_to_v15 diff, I could quickly see that there are many diffs > changing PartitionTupleRouting to struct PartitionTupleRouting, but they > would be unnecessary if you had added the following in execPartition.h, as > my upthread had done. > > -/* See execPartition.c for the definition. */ > +/* See execPartition.c for the definitions. */ > typedef struct PartitionDispatchData *PartitionDispatch; > +typedef struct PartitionTupleRouting PartitionTupleRouting; Okay, done that way. v16 attached. The 0002 patch is included again, this time with a new proposed commit message. There was some discussion over on [1] where nobody seemed to have any concerns about delaying the locking until we route the first tuple to the partition. [1] https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v16-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data v16-0002-Delay-locking-of-partitions-during-INSERT-and-UP.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/08 20:28, David Rowley wrote: > On 8 November 2018 at 20:15, Amit Langote > wrote: >> Actually, as I also proposed upthread, we should move root_tuple_slot from >> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because >> it's part of the first step described above that has nothing to do with >> partition tuple routing proper. We can make PartitionTupleRouting private >> to execPartition.c if we do that. > > okay. Makes sense. I've changed things around to PartitionTupleRouting > is now private to execPartition.c Thank you. I have a comment regarding how you chose to make PartitionTupleRouting private. Using the v14_to_v15 diff, I could quickly see that there are many diffs changing PartitionTupleRouting to struct PartitionTupleRouting, but they would be unnecessary if you had added the following in execPartition.h, as my upthread had done. -/* See execPartition.c for the definition. */ +/* See execPartition.c for the definitions. */ typedef struct PartitionDispatchData *PartitionDispatch; +typedef struct PartitionTupleRouting PartitionTupleRouting; Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 9 November 2018 at 00:28, David Rowley wrote: > I've attached v15 and a delta from v14 to ease re-review. I just revived the 0002 patch, which is still just for testing at this stage. There was mention over on [1] about removing the find_all_inheritors() call. Also some benchmarks from v14 with 0001+0002. Setup: DROP TABLE hashp; CREATE TABLE hashp (a INT) PARTITION BY HASH (a); SELECT 'CREATE TABLE hashp'||x::Text || ' PARTITION OF hashp FOR VALUES WITH (modulus 1, remainder ' || x::text || ');' from generate_Series(0,) x; \gexec (0 partitions is a non-partitioned table) fsync=off Partitions Patched Unpatched 0 23672 23785 10 22794 18385 100 22392 8541 1000 22419 1159 1 22195 101 [1] https://www.postgresql.org/message-id/ca+tgmozgjsy-nrfnzurhzqjthddh5fzjkvbvhs0byn6_46p...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v14-0002-Unsafe-locking-reduction-for-partitioned-INSERT-.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Thu, Nov 8, 2018 at 6:28 AM David Rowley wrote: > I've attached v15 and a delta from v14 to ease re-review. > > I also ran pgindent on this version. That's not part of the delta but > is in the main patch. Did you notice http://postgr.es/m/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03 ? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 8 November 2018 at 20:15, Amit Langote wrote: > Actually, as I also proposed upthread, we should move root_tuple_slot from > PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because > it's part of the first step described above that has nothing to do with > partition tuple routing proper. We can make PartitionTupleRouting private > to execPartition.c if we do that. okay. Makes sense. I've changed things around to PartitionTupleRouting is now private to execPartition.c > I didn't find anything quite significant to complain about, except just > one line: > > + * Initially we must only setup 1 PartitionDispatch object; the one for > > setup -> set up Changed too. I've attached v15 and a delta from v14 to ease re-review. I also ran pgindent on this version. That's not part of the delta but is in the main patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v14_v15.diff Description: Binary data v15-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/07 20:46, David Rowley wrote: > On 5 November 2018 at 20:17, Amit Langote > wrote: >> On 2018/11/04 19:07, David Rowley wrote: >>> Perhaps a better design would be to instead of having random special >>> partitioned-table-only fields in ResultRelInfo, just have an extra >>> struct there that contains the extra partition information which would >>> include the translation maps and then just return the ResultRelInfo >>> and allow callers to lookup any extra details they require. >> >> IIUC, you're saying that we could introduce a new struct that contains >> auxiliary information needed by partition pruning (maps, slot, etc. for >> tuple conversion) and add a new ResultRelInfo member of that struct type. >> That way, there is no need to return them separately or return an index to >> access them from their arrays. I guess we won't even need the arrays we >> have now. I think that might be a good idea and simplifies things >> significantly. > > I've attached a patch which does this. Thank you for updating the patch this way. > It adds a new struct named > PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays > out of PartitionTupleRouting. There are fields for each of what these > arrays used to store inside the PartitionRoutingInfo struct. > > While doing this I kept glancing back over at ModifyTableState and at > the mt_per_subplan_tupconv_maps array. I wondered if it would be > better to make the PartitionRoutingInfo a bit more generic, perhaps > call it TupleConversionInfo and have fields like ti_ToGeneric and > ti_FromGeneric, with the idea that "generic" would be the root > partition or the first subplan for inheritance updates. This would > allow us to get rid of a good chunk of code inside nodeModifyTable.c. > However, I ended up not doing this and left PartitionRoutingInfo to be > specifically for Root to Partition conversion. I think it's good that you left mt_per_subplan_tupconv_maps out of this. UPDATE tuple routing can be said to have two steps: first step, a tiny one, converts the tuple that needs to be routed from the source partition's rowtype to the root's rowtype so that tuple routing proper can begin, and second is the actual tuple routing carried out using PartitionTupleRouting. The first step is handled by nodeModifyTable.c and so any data structures related to it should be in ModifyTableState. Actually, as I also proposed upthread, we should move root_tuple_slot from PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because it's part of the first step described above that has nothing to do with partition tuple routing proper. We can make PartitionTupleRouting private to execPartition.c if we do that. > Also, on the topic about what to name the conversion maps from a few > days ago; After looking at this a bit more I decided that having them > named ParentToChild and ChildToParent is misleading. If the child is > the child of some sub-partitioned table then the parent that the map > is talking about is not the partition's parent, but the root > partitioned table. So really RootToPartition and PartitionToRoot seem > like much more accurate names for the maps. Makes sense. :) > I made a few other changes along the way; I thought that > ExecFindPartition() would be a good candidate to take on the > responsibility of validating the partition is valid for INSERTs when > it uses a partition out of the subplan_resultrel_hash. I thought it > was better to check this once when we're in the code path of grabbing > the ResultRelInfo out that hash table rather than in a code path that > must check if it's been done already each time we route a tuple into a > partition. It also allowed me to get rid of > ri_PartitionReadyForRouting. Ah, I too had tried to remove ri_PartitionReadyForRouting, but had to give up on that idea because I didn't think of moving steps that are needed to be performed before setting it to true to that block of code in ExecFindPartition. > I also moved the call to > ExecInitRoutingInfo() into ExecFindPartition() which allowed me to > make that function static, which could result in the generation of > slightly more optimal compiled code. > > Please find attached the v14 patch. > > Rather nicely git --stat reports a net negative additional code (with > the 48 lines of new tests included) > > 11 files changed, 632 insertions(+), 660 deletions(-) That's nice! I didn't find anything quite significant to complain about, except just one line: + * Initially we must only setup 1 PartitionDispatch object; the one for setup -> set up Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi David, On 11/7/18 6:46 AM, David Rowley wrote: I've attached a patch which does this. It adds a new struct named PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays out of PartitionTupleRouting. There are fields for each of what these arrays used to store inside the PartitionRoutingInfo struct. While doing this I kept glancing back over at ModifyTableState and at the mt_per_subplan_tupconv_maps array. I wondered if it would be better to make the PartitionRoutingInfo a bit more generic, perhaps call it TupleConversionInfo and have fields like ti_ToGeneric and ti_FromGeneric, with the idea that "generic" would be the root partition or the first subplan for inheritance updates. This would allow us to get rid of a good chunk of code inside nodeModifyTable.c. However, I ended up not doing this and left PartitionRoutingInfo to be specifically for Root to Partition conversion. Yeah, it doesn't necessarily have to be part of this patch. Also, on the topic about what to name the conversion maps from a few days ago; After looking at this a bit more I decided that having them named ParentToChild and ChildToParent is misleading. If the child is the child of some sub-partitioned table then the parent that the map is talking about is not the partition's parent, but the root partitioned table. So really RootToPartition and PartitionToRoot seem like much more accurate names for the maps. Agreed. I made a few other changes along the way; I thought that ExecFindPartition() would be a good candidate to take on the responsibility of validating the partition is valid for INSERTs when it uses a partition out of the subplan_resultrel_hash. I thought it was better to check this once when we're in the code path of grabbing the ResultRelInfo out that hash table rather than in a code path that must check if it's been done already each time we route a tuple into a partition. It also allowed me to get rid of ri_PartitionReadyForRouting. I also moved the call to ExecInitRoutingInfo() into ExecFindPartition() which allowed me to make that function static, which could result in the generation of slightly more optimal compiled code. Please find attached the v14 patch. Passes check-world, and has detailed documentation about the changes :) Best regards, Jesper
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 5 November 2018 at 20:17, Amit Langote wrote: > On 2018/11/04 19:07, David Rowley wrote: >> Perhaps a better design would be to instead of having random special >> partitioned-table-only fields in ResultRelInfo, just have an extra >> struct there that contains the extra partition information which would >> include the translation maps and then just return the ResultRelInfo >> and allow callers to lookup any extra details they require. > > IIUC, you're saying that we could introduce a new struct that contains > auxiliary information needed by partition pruning (maps, slot, etc. for > tuple conversion) and add a new ResultRelInfo member of that struct type. > That way, there is no need to return them separately or return an index to > access them from their arrays. I guess we won't even need the arrays we > have now. I think that might be a good idea and simplifies things > significantly. I've attached a patch which does this. It adds a new struct named PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays out of PartitionTupleRouting. There are fields for each of what these arrays used to store inside the PartitionRoutingInfo struct. While doing this I kept glancing back over at ModifyTableState and at the mt_per_subplan_tupconv_maps array. I wondered if it would be better to make the PartitionRoutingInfo a bit more generic, perhaps call it TupleConversionInfo and have fields like ti_ToGeneric and ti_FromGeneric, with the idea that "generic" would be the root partition or the first subplan for inheritance updates. This would allow us to get rid of a good chunk of code inside nodeModifyTable.c. However, I ended up not doing this and left PartitionRoutingInfo to be specifically for Root to Partition conversion. Also, on the topic about what to name the conversion maps from a few days ago; After looking at this a bit more I decided that having them named ParentToChild and ChildToParent is misleading. If the child is the child of some sub-partitioned table then the parent that the map is talking about is not the partition's parent, but the root partitioned table. So really RootToPartition and PartitionToRoot seem like much more accurate names for the maps. I made a few other changes along the way; I thought that ExecFindPartition() would be a good candidate to take on the responsibility of validating the partition is valid for INSERTs when it uses a partition out of the subplan_resultrel_hash. I thought it was better to check this once when we're in the code path of grabbing the ResultRelInfo out that hash table rather than in a code path that must check if it's been done already each time we route a tuple into a partition. It also allowed me to get rid of ri_PartitionReadyForRouting. I also moved the call to ExecInitRoutingInfo() into ExecFindPartition() which allowed me to make that function static, which could result in the generation of slightly more optimal compiled code. Please find attached the v14 patch. Rather nicely git --stat reports a net negative additional code (with the 48 lines of new tests included) 11 files changed, 632 insertions(+), 660 deletions(-) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v14-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 11/4/18 5:07 AM, David Rowley wrote: I've attached v13 which hopefully addresses these. I ran a test against the INSERT case using a 64 hash partition. Non-partitioned table: ~73k TPS Master: ~25k TPS 0001: ~26k TPS 0001 + 0002: ~68k TPS The profile of 0001 shows that almost all of ExecSetupPartitionTupleRouting() is find_all_inheritors(), hence the last test with a rebase of the original v1-0002 patch. Best regards, Jesper
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/04 19:07, David Rowley wrote: > On 1 November 2018 at 22:39, Amit Langote > wrote: > I've attached v13 which hopefully addresses these. Thank you for updating the patch. >> The macro naming discussion got me thinking today about the macro itself. >> It encapsulates access to the various PartitionTupleRouting arrays >> containing the maps, but maybe we've got the interface of tuple routing a >> bit (maybe a lot given this thread!) wrong to begin with. Instead of >> ExecFindPartition returning indexes into various PartitionTupleRouting >> arrays and its callers then using those indexes to fetch various objects >> from those arrays, why doesn't it return those objects itself? Although >> we made sure that the callers don't need to worry about the meaning of >> these indexes changing with this patch, it still seems a bit odd for them >> to have to go back to those arrays to get various objects. >> >> How about we change ExecFindPartition's interface so that it returns the >> ResultRelInfo, the two maps, and the partition slot? So, the arrays >> simply become a cache for ExecFindPartition et al and are no longer >> accessed outside execPartition.c. Although that makes the interface of >> ExecFindPartition longer, I think it reduces overall complexity. > > I don't really think stuffing values into a bunch of output parameters > is much of an improvement. I'd rather allow callers to fetch what they > need using the index we return. Most callers don't need to know about > the child to parent maps, so it seems nicer for those places not to > have to invent a dummy variable to pass along to ExecFindPartition() > so it can needlessly populate it for them. Well, if a caller finds a partition using ExecFindPartition, it's going to need to fetch those other objects anyway. Both of its callers that exist today, CopyFrom and ExecPrepareTupleRouting, fetch both maps and the slot in the same code block as ExecFindPartition. > Perhaps a better design would be to instead of having random special > partitioned-table-only fields in ResultRelInfo, just have an extra > struct there that contains the extra partition information which would > include the translation maps and then just return the ResultRelInfo > and allow callers to lookup any extra details they require. IIUC, you're saying that we could introduce a new struct that contains auxiliary information needed by partition pruning (maps, slot, etc. for tuple conversion) and add a new ResultRelInfo member of that struct type. That way, there is no need to return them separately or return an index to access them from their arrays. I guess we won't even need the arrays we have now. I think that might be a good idea and simplifies things significantly. It reminds me of how ResultRelInfo grew a ri_onConflict member of type OnConflictSetState [1]. We decided to go that way, as opposed to the earlier approach of having arrays of num_partitions length in ModifyTableState or PartitionTupleRouting that contained ON CONFLICT related objects for individual partitions. > I've not > looked into this in detail, but I think the committer work that's > required for the patch as it is today is already quite significant. > I'm not keen on warding any willing one off by making the commit job > any harder. I agree that it would be good to stabilise the API for > all this partitioning code sometime, but I don't believe it needs to > be done all in one commit. My intent here is to improve performance or > INSERT and UPDATE on partitioned tables. Perhaps we can shape some API > redesign later in the release cycle. What do you think? I do suspect that simplifying ExecFindPartition's interface as part of patch will make a committer's life easier, as the resulting interface is simpler, especially if we revise it like you suggest above. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=555ee77a9
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 1 November 2018 at 22:39, Amit Langote wrote: > On 2018/11/01 10:30, David Rowley wrote: >> It's great to know the patch is now so perfect that we've only the >> macro naming left to debate ;-) > > I looked over v12 again and noticed a couple minor issues. > > + * table then we store the index into parenting > + * PartitionTupleRouting 'partition_dispatch_info' array. An > > s/PartitionTupleRouting/PartitionTupleRouting's/g > > Also, I got a bit concerned about "parenting". Does it mean something > like "enclosing", because the PartitionDispatch is a member of > PartitionTupleRouting? I got concerned because using "parent" like this > may be confusing as this is the partitioning code we're talking about, > where "parent" is generally used to mean "parent" table. > > + * the partitioned table that's the target of the command. If we must > + * route tuple via some sub-partitioned table, then the PartitionDispatch > + * for those is only built the first time it's required. > > ... via some sub-partitioned table"s" > > Or perhaps rewrite a bit as: > > If we must route the tuple via some sub-partitioned table, then its > PartitionDispatch is built the first time it's required. I've attached v13 which hopefully addresses these. > The macro naming discussion got me thinking today about the macro itself. > It encapsulates access to the various PartitionTupleRouting arrays > containing the maps, but maybe we've got the interface of tuple routing a > bit (maybe a lot given this thread!) wrong to begin with. Instead of > ExecFindPartition returning indexes into various PartitionTupleRouting > arrays and its callers then using those indexes to fetch various objects > from those arrays, why doesn't it return those objects itself? Although > we made sure that the callers don't need to worry about the meaning of > these indexes changing with this patch, it still seems a bit odd for them > to have to go back to those arrays to get various objects. > > How about we change ExecFindPartition's interface so that it returns the > ResultRelInfo, the two maps, and the partition slot? So, the arrays > simply become a cache for ExecFindPartition et al and are no longer > accessed outside execPartition.c. Although that makes the interface of > ExecFindPartition longer, I think it reduces overall complexity. I don't really think stuffing values into a bunch of output parameters is much of an improvement. I'd rather allow callers to fetch what they need using the index we return. Most callers don't need to know about the child to parent maps, so it seems nicer for those places not to have to invent a dummy variable to pass along to ExecFindPartition() so it can needlessly populate it for them. Perhaps a better design would be to instead of having random special partitioned-table-only fields in ResultRelInfo, just have an extra struct there that contains the extra partition information which would include the translation maps and then just return the ResultRelInfo and allow callers to lookup any extra details they require. I've not looked into this in detail, but I think the committer work that's required for the patch as it is today is already quite significant. I'm not keen on warding any willing one off by making the commit job any harder. I agree that it would be good to stabilise the API for all this partitioning code sometime, but I don't believe it needs to be done all in one commit. My intent here is to improve performance or INSERT and UPDATE on partitioned tables. Perhaps we can shape some API redesign later in the release cycle. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v13-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/01 10:30, David Rowley wrote: > It's great to know the patch is now so perfect that we've only the > macro naming left to debate ;-) I looked over v12 again and noticed a couple minor issues. + * table then we store the index into parenting + * PartitionTupleRouting 'partition_dispatch_info' array. An s/PartitionTupleRouting/PartitionTupleRouting's/g Also, I got a bit concerned about "parenting". Does it mean something like "enclosing", because the PartitionDispatch is a member of PartitionTupleRouting? I got concerned because using "parent" like this may be confusing as this is the partitioning code we're talking about, where "parent" is generally used to mean "parent" table. + * the partitioned table that's the target of the command. If we must + * route tuple via some sub-partitioned table, then the PartitionDispatch + * for those is only built the first time it's required. ... via some sub-partitioned table"s" Or perhaps rewrite a bit as: If we must route the tuple via some sub-partitioned table, then its PartitionDispatch is built the first time it's required. The macro naming discussion got me thinking today about the macro itself. It encapsulates access to the various PartitionTupleRouting arrays containing the maps, but maybe we've got the interface of tuple routing a bit (maybe a lot given this thread!) wrong to begin with. Instead of ExecFindPartition returning indexes into various PartitionTupleRouting arrays and its callers then using those indexes to fetch various objects from those arrays, why doesn't it return those objects itself? Although we made sure that the callers don't need to worry about the meaning of these indexes changing with this patch, it still seems a bit odd for them to have to go back to those arrays to get various objects. How about we change ExecFindPartition's interface so that it returns the ResultRelInfo, the two maps, and the partition slot? So, the arrays simply become a cache for ExecFindPartition et al and are no longer accessed outside execPartition.c. Although that makes the interface of ExecFindPartition longer, I think it reduces overall complexity. I've implemented that in the attached patch 1-revise-ExecFindPartition-interface.patch. Also, since all members of PartitionTupleRouting are only accessed within execPartition.c save root_tuple_slot, we can move it to execPartition.c to make its internals private, after doing something about root_tuple_slot. Looking at the code related to root_tuple_slot, it seems the field really belongs in ModifyTableState, because it got nothing to do with routing. Attached 2-make-PartitionTupleRouting-private.patch does that. These patches 1 and 2 apply on top of v12-0001.. patch. Thanks, Amit diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 0b0696e61e..b45972682f 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate) bool *nulls; ResultRelInfo *resultRelInfo; ResultRelInfo *target_resultRelInfo; + ResultRelInfo *prev_part_rel = NULL; EState *estate = CreateExecutorState(); /* for ExecConstraints() */ ModifyTableState *mtstate; ExprContext *econtext; @@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate) CopyInsertMethod insertMethod; uint64 processed = 0; int nBufferedTuples = 0; - int prev_leaf_part_index = -1; boolhas_before_insert_row_trig; boolhas_instead_insert_row_trig; boolleafpart_use_multi_insert = false; @@ -2685,19 +2685,24 @@ CopyFrom(CopyState cstate) /* Determine the partition to heap_insert the tuple into */ if (proute) { - int leaf_part_index; - TupleConversionMap *map; + TupleTableSlot *partition_slot = NULL; + TupleConversionMap *child_to_parent_map, + *parent_to_child_map; /* * Attempt to find a partition suitable for this tuple. * ExecFindPartition() will raise an error if none can be found. +* This replaces the original target ResultRelInfo with +* partition's. */ - leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo, - proute, slot, estate); - Assert(leaf_part_index >= 0 && - leaf_part_index < proute->num_partitions); + resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo, +
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 01/11/2018 14:30, David Rowley wrote: On 1 November 2018 at 13:35, Amit Langote wrote: On 2018/11/01 8:58, David Rowley wrote: [...] I agree. I don't think "TupRouting" really needs to be in the name. Probably "To" can also just become "2" and we can put back the Parent/Child before that. Agree that "TupRouting" can go, but "To" is not too long for using "2" instead of it. I think that while '2' may only be one character less than 'to', the character '2' stands out more. However, can't say I could claim this was of the utmost importance! Okay. Here's a version with "2" put back to "To"... [...]
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 1 November 2018 at 13:35, Amit Langote wrote: > On 2018/11/01 8:58, David Rowley wrote: >> On 1 November 2018 at 06:45, Robert Haas wrote: >>> I think a better way to shorten the name would be to truncate the >>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. >> >> Thanks for chipping in on this. >> >> I agree. I don't think "TupRouting" really needs to be in the name. >> Probably "To" can also just become "2" and we can put back the >> Parent/Child before that. > > Agree that "TupRouting" can go, but "To" is not too long for using "2" > instead of it. Okay. Here's a version with "2" put back to "To"... It's great to know the patch is now so perfect that we've only the macro naming left to debate ;-) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v12-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/11/01 8:58, David Rowley wrote: > On 1 November 2018 at 06:45, Robert Haas wrote: >> On Wed, Aug 22, 2018 at 8:30 AM David Rowley >> wrote: >>> On 22 August 2018 at 19:08, Amit Langote >>> wrote: +#define PartitionTupRoutingGetToParentMap(p, i) \ +#define PartitionTupRoutingGetToChildMap(p, i) \ If the "Get" could be replaced by "Child" and "Parent", respectively, they'd sound more meaningful, imho. >>> >>> I did that to save 3 chars. I think putting the additional >>> Child/Parent in the name is not really required. It's not as if we're >>> going to have a ParentToParent or a ChildToChild map, so I thought it >>> might be okay to assume that if it's "ToParent", that it's being >>> converted from the child and "ToChild" seems safe to assume it's being >>> converted from the parent. I can change it though if you feel very >>> strongly that what I've got is no good. >> >> I'm not sure exactly what is best here, but it seems to unlikely to me >> that somebody is going to read that macro name and think, oh, that >> means "get the to-parent map". They are more like be confused by the >> juxtaposition of "get" and "to". >> >> I think a better way to shorten the name would be to truncate the >> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. > > Thanks for chipping in on this. > > I agree. I don't think "TupRouting" really needs to be in the name. > Probably "To" can also just become "2" and we can put back the > Parent/Child before that. Agree that "TupRouting" can go, but "To" is not too long for using "2" instead of it. Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 1 November 2018 at 06:45, Robert Haas wrote: > On Wed, Aug 22, 2018 at 8:30 AM David Rowley > wrote: >> On 22 August 2018 at 19:08, Amit Langote >> wrote: >> > +#define PartitionTupRoutingGetToParentMap(p, i) \ >> > +#define PartitionTupRoutingGetToChildMap(p, i) \ >> > >> > If the "Get" could be replaced by "Child" and "Parent", respectively, >> > they'd sound more meaningful, imho. >> >> I did that to save 3 chars. I think putting the additional >> Child/Parent in the name is not really required. It's not as if we're >> going to have a ParentToParent or a ChildToChild map, so I thought it >> might be okay to assume that if it's "ToParent", that it's being >> converted from the child and "ToChild" seems safe to assume it's being >> converted from the parent. I can change it though if you feel very >> strongly that what I've got is no good. > > I'm not sure exactly what is best here, but it seems to unlikely to me > that somebody is going to read that macro name and think, oh, that > means "get the to-parent map". They are more like be confused by the > juxtaposition of "get" and "to". > > I think a better way to shorten the name would be to truncate the > PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. Thanks for chipping in on this. I agree. I don't think "TupRouting" really needs to be in the name. Probably "To" can also just become "2" and we can put back the Parent/Child before that. I've attached v11, which does this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v11-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Wed, Aug 22, 2018 at 8:30 AM David Rowley wrote: > On 22 August 2018 at 19:08, Amit Langote > wrote: > > +#define PartitionTupRoutingGetToParentMap(p, i) \ > > +#define PartitionTupRoutingGetToChildMap(p, i) \ > > > > If the "Get" could be replaced by "Child" and "Parent", respectively, > > they'd sound more meaningful, imho. > > I did that to save 3 chars. I think putting the additional > Child/Parent in the name is not really required. It's not as if we're > going to have a ParentToParent or a ChildToChild map, so I thought it > might be okay to assume that if it's "ToParent", that it's being > converted from the child and "ToChild" seems safe to assume it's being > converted from the parent. I can change it though if you feel very > strongly that what I've got is no good. I'm not sure exactly what is best here, but it seems to unlikely to me that somebody is going to read that macro name and think, oh, that means "get the to-parent map". They are more like be confused by the juxtaposition of "get" and "to". I think a better way to shorten the name would be to truncate the PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Thanks for both clarifications! I skimmed through the commits related to Inserts with partitioning since 10 and indeed - while not impossible it seems like quite some work to merge them into PG 10 codebase. We might consider preparing the patch in-house as otherwise PG 10 based partitioning is a major regression and we'd have to go back to inheritance based one - which seems the best option for now. Regards, Krzysztof On Tue, Oct 30, 2018 at 1:54 AM Amit Langote wrote: > > On 2018/10/30 8:41, Krzysztof Nienartowicz wrote: > > On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz > > wrote: > >> On Tue, Oct 23, 2018 at 4:02 AM David Rowley > >> wrote: > >>> > >>> I more meant that it might be 0002 that fixes the issue for you. I > >>> just wanted to check if you'd tried 0001 and found that the problem > >>> was fixed with that alone. > >> > >> Will it apply on PG10? (In fact the code base is PG XL10 but > >> src/backend/executor/nodeModifyTable.c is pure PG) > > > > To complement the info: number of columns varies from 20 to 100 but > > some of the columns are composite types or arrays of composite types. > > > > The flamegraph after applying changes from patch 0002 can be seen > > here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP > > shows most of the time is spent in the > > > > convert_tuples_by_name (PG10 version). > > As David mentioned, the patches on this thread are meant to be applied > against latest PG 12 HEAD. The insert tuple routing code has undergone > quite a bit of refactoring in PG 11, which itself should have gotten rid > of at least some of the hot-spots that are seen in the flame graph you shared. > > What happens in PG 10 (as seen in the flame graph) is that > ExecSetupPartitionTupleRouting initializes information for *all* > partitions, which happens even before the 1st tuple processed. So if > there are many partitions and with many columns, a lot of processing > happens in ExecSetupPartitionTupleRouting. PG 11 changes it such that the > partition info is only initialized after the 1st tuple is processed and > only of the partition that's targeted, but some overheads still remain in > that code. The patches on this thread are meant to address those overheads. > > Unfortunately, I don't think the community will agree to back-porting the > changes in PG 11 and the patches being discussed here to PG 10. > > Thanks, > Amit >
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/10/30 8:41, Krzysztof Nienartowicz wrote: > On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz > wrote: >> On Tue, Oct 23, 2018 at 4:02 AM David Rowley >> wrote: >>> >>> I more meant that it might be 0002 that fixes the issue for you. I >>> just wanted to check if you'd tried 0001 and found that the problem >>> was fixed with that alone. >> >> Will it apply on PG10? (In fact the code base is PG XL10 but >> src/backend/executor/nodeModifyTable.c is pure PG) > > To complement the info: number of columns varies from 20 to 100 but > some of the columns are composite types or arrays of composite types. > > The flamegraph after applying changes from patch 0002 can be seen > here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP > shows most of the time is spent in the > > convert_tuples_by_name (PG10 version). As David mentioned, the patches on this thread are meant to be applied against latest PG 12 HEAD. The insert tuple routing code has undergone quite a bit of refactoring in PG 11, which itself should have gotten rid of at least some of the hot-spots that are seen in the flame graph you shared. What happens in PG 10 (as seen in the flame graph) is that ExecSetupPartitionTupleRouting initializes information for *all* partitions, which happens even before the 1st tuple processed. So if there are many partitions and with many columns, a lot of processing happens in ExecSetupPartitionTupleRouting. PG 11 changes it such that the partition info is only initialized after the 1st tuple is processed and only of the partition that's targeted, but some overheads still remain in that code. The patches on this thread are meant to address those overheads. Unfortunately, I don't think the community will agree to back-porting the changes in PG 11 and the patches being discussed here to PG 10. Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
To complement the info: number of columns varies from 20 to 100 but some of the columns are composite types or arrays of composite types. The flamegraph after applying changes from patch 0002 can be seen here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP shows most of the time is spent in the convert_tuples_by_name (PG10 version). Thanks On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz wrote: > > On Tue, Oct 23, 2018 at 4:02 AM David Rowley > wrote: > > > > On 23 October 2018 at 11:55, Krzysztof Nienartowicz > > wrote: > > > In the end we hacked the code to re-enable triggers on partitioned > > > tables and switch off native insert code on partitioned tables. Quite > > > hackish and would be nice to have it fixed in a more natural manner. > > > Yes, it looked like locking but not only - > > > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > > > looked like dominant and also convert_tuples_by_name but not sure if > > > the last one was not an artifact of perf sampling. > > > > The ExecOpenIndices was likely fixed in edd44738bc8 (PG11). > > find_all_inheritors does obtain the partition locks during the call, > > so the slowness there most likely down to the locking rather than the > > scanning of pg_inherits. > > > > 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > > > > > Will check the patch 0001, thanks. > > > > I more meant that it might be 0002 that fixes the issue for you. I > > just wanted to check if you'd tried 0001 and found that the problem > > was fixed with that alone. > > Will it apply on PG10? (In fact the code base is PG XL10 but > src/backend/executor/nodeModifyTable.c is pure PG) > > > > > Do you mind sharing how many partitions you have and how many columns > > the partitioned table has? > > We have 2 level partitioning: 10 (possibly changing, up to say 20-30) > range partitions at first level and 20 range at the second level. We > have potentially hundreds processes inserting at the same time. > > > > > > > -- > > David Rowley http://www.2ndQuadrant.com/ > > PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On Tue, Oct 23, 2018 at 4:02 AM David Rowley wrote: > > On 23 October 2018 at 11:55, Krzysztof Nienartowicz > wrote: > > In the end we hacked the code to re-enable triggers on partitioned > > tables and switch off native insert code on partitioned tables. Quite > > hackish and would be nice to have it fixed in a more natural manner. > > Yes, it looked like locking but not only - > > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > > looked like dominant and also convert_tuples_by_name but not sure if > > the last one was not an artifact of perf sampling. > > The ExecOpenIndices was likely fixed in edd44738bc8 (PG11). > find_all_inheritors does obtain the partition locks during the call, > so the slowness there most likely down to the locking rather than the > scanning of pg_inherits. > > 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > > > Will check the patch 0001, thanks. > > I more meant that it might be 0002 that fixes the issue for you. I > just wanted to check if you'd tried 0001 and found that the problem > was fixed with that alone. Will it apply on PG10? (In fact the code base is PG XL10 but src/backend/executor/nodeModifyTable.c is pure PG) > > Do you mind sharing how many partitions you have and how many columns > the partitioned table has? We have 2 level partitioning: 10 (possibly changing, up to say 20-30) range partitions at first level and 20 range at the second level. We have potentially hundreds processes inserting at the same time. > > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 23 October 2018 at 11:55, Krzysztof Nienartowicz wrote: > In the end we hacked the code to re-enable triggers on partitioned > tables and switch off native insert code on partitioned tables. Quite > hackish and would be nice to have it fixed in a more natural manner. > Yes, it looked like locking but not only - > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > looked like dominant and also convert_tuples_by_name but not sure if > the last one was not an artifact of perf sampling. The ExecOpenIndices was likely fixed in edd44738bc8 (PG11). find_all_inheritors does obtain the partition locks during the call, so the slowness there most likely down to the locking rather than the scanning of pg_inherits. 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > Will check the patch 0001, thanks. I more meant that it might be 0002 that fixes the issue for you. I just wanted to check if you'd tried 0001 and found that the problem was fixed with that alone. Do you mind sharing how many partitions you have and how many columns the partitioned table has? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
In the end we hacked the code to re-enable triggers on partitioned tables and switch off native insert code on partitioned tables. Quite hackish and would be nice to have it fixed in a more natural manner. Yes, it looked like locking but not only - ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors looked like dominant and also convert_tuples_by_name but not sure if the last one was not an artifact of perf sampling. Will check the patch 0001, thanks. On Tue, Oct 23, 2018 at 12:36 AM David Rowley wrote: > > On 15 October 2018 at 23:04, Krzysztof Nienartowicz > wrote: > > We see quite prohibitive 5-6x slowdown with native partitioning on in > > comparison to trigger based in PG9.5. > > This is clearly visible with highly parallel inserts (Can share > > flamediagrams comparing the two). > > Does the 0001 patch here fix the problem? I imagined that it would be > the locking of all partitions that would have killed the performance. > > > This basically excludes native partitioning from being used by us. Do you > > think your changes could be backported to PG10? - we checked and this would > > need quite a number of changes but given the weight of this change maybe it > > could be considered? > > It's very unlikely to happen, especially so with the 0002 patch, which > I've so far just attached as a demonstration of where the performance > could end up. > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 15 October 2018 at 23:04, Krzysztof Nienartowicz wrote: > We see quite prohibitive 5-6x slowdown with native partitioning on in > comparison to trigger based in PG9.5. > This is clearly visible with highly parallel inserts (Can share > flamediagrams comparing the two). Does the 0001 patch here fix the problem? I imagined that it would be the locking of all partitions that would have killed the performance. > This basically excludes native partitioning from being used by us. Do you > think your changes could be backported to PG10? - we checked and this would > need quite a number of changes but given the weight of this change maybe it > could be considered? It's very unlikely to happen, especially so with the 0002 patch, which I've so far just attached as a demonstration of where the performance could end up. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi, On 2018/10/15 19:04, Krzysztof Nienartowicz wrote: > > We see quite prohibitive 5-6x slowdown with native partitioning on in > comparison to trigger based in PG9.5. > This is clearly visible with highly parallel inserts (Can share > flamediagrams comparing the two). > > This basically excludes native partitioning from being used by us. Do you > think your changes could be backported to PG10? - we checked and this would > need quite a number of changes but given the weight of this change maybe it > could be considered? It's unfortunate that PG 10's partitioning cannot be used for your use case, but I don't think such a major refactoring will be back-ported to 10 or 11. :-( Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
We see quite prohibitive 5-6x slowdown with native partitioning on in comparison to trigger based in PG9.5. This is clearly visible with highly parallel inserts (Can share flamediagrams comparing the two). This basically excludes native partitioning from being used by us. Do you think your changes could be backported to PG10? - we checked and this would need quite a number of changes but given the weight of this change maybe it could be considered? Thanks Krzysztof -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 9 October 2018 at 15:49, Amit Langote wrote: > On 2018/10/05 21:55, David Rowley wrote: >> I'm not so sure we need to zero the partition_tuple_slots[] array at >> all since we always set a value there is there's a corresponding map >> stored. I considered pulling that out, but in the end, I didn't as I >> saw some Asserts checking it's been properly set by checking the >> element != NULL in nodeModifyTable.c and copy.c. Perhaps I should >> have just gotten rid of those Asserts along with the palloc0 and >> subsequent memset after the expansion of the array. I'm undecided. > > Maybe it's a good thing that it's doing the same thing as with the > child_to_parent_maps array, which is to zero-init it when allocated. Perhaps, but the maps do need to be zeroed, the partition_tuple_slots array does not since we only access it when the parent to child map is set. In any case, since PARTITION_ROUTING_INITSIZE is just 8, it'll likely not save much since it's really just saving a memset(..., 0, 64), for single-row INSERTs on a 64-bit machine. So likely it won't save more than a bunch of nanoseconds. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi David, On 2018/10/05 21:55, David Rowley wrote: > On 17 September 2018 at 21:15, David Rowley > wrote: >> v9 patch attached. Fixes conflict with 6b78231d. > > v10 patch attached. Fixes conflict with cc2905e9. Thanks for rebasing. > I'm not so sure we need to zero the partition_tuple_slots[] array at > all since we always set a value there is there's a corresponding map > stored. I considered pulling that out, but in the end, I didn't as I > saw some Asserts checking it's been properly set by checking the > element != NULL in nodeModifyTable.c and copy.c. Perhaps I should > have just gotten rid of those Asserts along with the palloc0 and > subsequent memset after the expansion of the array. I'm undecided. Maybe it's a good thing that it's doing the same thing as with the child_to_parent_maps array, which is to zero-init it when allocated. Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 17 September 2018 at 21:15, David Rowley wrote: > v9 patch attached. Fixes conflict with 6b78231d. v10 patch attached. Fixes conflict with cc2905e9. I'm not so sure we need to zero the partition_tuple_slots[] array at all since we always set a value there is there's a corresponding map stored. I considered pulling that out, but in the end, I didn't as I saw some Asserts checking it's been properly set by checking the element != NULL in nodeModifyTable.c and copy.c. Perhaps I should have just gotten rid of those Asserts along with the palloc0 and subsequent memset after the expansion of the array. I'm undecided. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v10-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 23 August 2018 at 00:30, David Rowley wrote: > I've attached a v8. The only change from your v7 is in the "go making" > comment. > v9 patch attached. Fixes conflict with 6b78231d. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v9-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/08/22 21:30, David Rowley wrote: > On 22 August 2018 at 19:08, Amit Langote > wrote: >> +#define PartitionTupRoutingGetToParentMap(p, i) \ >> +#define PartitionTupRoutingGetToChildMap(p, i) \ >> >> If the "Get" could be replaced by "Child" and "Parent", respectively, >> they'd sound more meaningful, imho. > > I did that to save 3 chars. I think putting the additional > Child/Parent in the name is not really required. It's not as if we're > going to have a ParentToParent or a ChildToChild map, so I thought it > might be okay to assume that if it's "ToParent", that it's being > converted from the child and "ToChild" seems safe to assume it's being > converted from the parent. I can change it though if you feel very > strongly that what I've got is no good. No strong preference as such. Maybe, let's defer to committer. >> I've looked at v6 and spotted some minor typos. >> >> + * ResultRelInfo for, before we go making one, we check for a >> pre-made one >> >> s/making/make/g > > I disagree, but perhaps we can just reword it a bit. I've now got: > > + * Every time a tuple is routed to a partition that we've yet to set the > + * ResultRelInfo for, before we go to the trouble of making one, we check > + * for a pre-made one in the hash table. Sure. I guess "to the trouble of" was missing then. :) >> +/* If nobody else set the per-subplan array of maps, do so ouselves. */ >> >> I guess I'm the one to blame here for misspelling "ourselves". > > Thanks for noticing. > >> Since the above two are minor issues, fixed them myself in the attached >> updated version; didn't touch the macro though. > > I've attached a v8. The only change from your v7 is in the "go making" > comment. Thanks. >> Do you agree to setting this patch to "Ready for Committer" in the >> September CF? > > I read through the entire patch a couple of times yesterday and saw > nothing else, so yeah, I think now is a good time for someone with > more authority to have a look at it. Okay, doing it now. Thanks, Amit
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 22 August 2018 at 19:08, Amit Langote wrote: > +#define PartitionTupRoutingGetToParentMap(p, i) \ > +#define PartitionTupRoutingGetToChildMap(p, i) \ > > If the "Get" could be replaced by "Child" and "Parent", respectively, > they'd sound more meaningful, imho. I did that to save 3 chars. I think putting the additional Child/Parent in the name is not really required. It's not as if we're going to have a ParentToParent or a ChildToChild map, so I thought it might be okay to assume that if it's "ToParent", that it's being converted from the child and "ToChild" seems safe to assume it's being converted from the parent. I can change it though if you feel very strongly that what I've got is no good. >> I also fixed a bug where >> the child to parent map was not being initialised when on conflict >> transition capture was required. I added a test which was crashing the >> backend but fixed the code so it works correctly. > > Oops, I guess you mean my omission of checking if > mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo. Yeah. > I've looked at v6 and spotted some minor typos. > > + * ResultRelInfo for, before we go making one, we check for a > pre-made one > > s/making/make/g I disagree, but perhaps we can just reword it a bit. I've now got: + * Every time a tuple is routed to a partition that we've yet to set the + * ResultRelInfo for, before we go to the trouble of making one, we check + * for a pre-made one in the hash table. > +/* If nobody else set the per-subplan array of maps, do so ouselves. */ > > I guess I'm the one to blame here for misspelling "ourselves". Thanks for noticing. > Since the above two are minor issues, fixed them myself in the attached > updated version; didn't touch the macro though. I've attached a v8. The only change from your v7 is in the "go making" comment. > Do you agree to setting this patch to "Ready for Committer" in the > September CF? I read through the entire patch a couple of times yesterday and saw nothing else, so yeah, I think now is a good time for someone with more authority to have a look at it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v8-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/08/21 14:44, David Rowley wrote: > On 3 August 2018 at 17:58, Amit Langote wrote: >> On 2018/07/31 16:03, David Rowley wrote: >>> Maybe we can do that as a follow-on patch. >> >> We probably could, but I think it would be a good idea get rid of *all* >> redundant allocations due to tuple routing in one patch, if that's the >> mission of this thread and the patch anyway. > > I started looking at this patch today and I now agree that it should > be included in the main patch. Great, thanks. > I changed a few things with the patch. For example, the map access > macros you'd defined were not in CamelCase. In the updated patch: +#define PartitionTupRoutingGetToParentMap(p, i) \ +#define PartitionTupRoutingGetToChildMap(p, i) \ If the "Get" could be replaced by "Child" and "Parent", respectively, they'd sound more meaningful, imho. > I also fixed a bug where > the child to parent map was not being initialised when on conflict > transition capture was required. I added a test which was crashing the > backend but fixed the code so it works correctly. Oops, I guess you mean my omission of checking if mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo. Thanks for fixing it and adding the test case. > I also got rid of > the child_parent_map_not_required array since we now no longer need > it. The code now always initialises the maps in cases where they're > going to be required. Yes, thought I had removed the field in my patch, but looks like I had just removed the comment about it. > I've attached a v3 version of your patch and also v6 of the main patch > which includes the v3 patch. I've looked at v6 and spotted some minor typos. + * ResultRelInfo for, before we go making one, we check for a pre-made one s/making/make/g +/* If nobody else set the per-subplan array of maps, do so ouselves. */ I guess I'm the one to blame here for misspelling "ourselves". Since the above two are minor issues, fixed them myself in the attached updated version; didn't touch the macro though. Do you agree to setting this patch to "Ready for Committer" in the September CF? Thanks, Amit From 79c906997d80dc426530dea0b75363ef20286001 Mon Sep 17 00:00:00 2001 From: "dgrow...@gmail.com" Date: Thu, 26 Jul 2018 19:54:55 +1200 Subject: [PATCH v7] Speed up INSERT and UPDATE on partitioned tables This is more or less a complete redesign of PartitionTupleRouting. The aim here is to get rid of all the possibly large arrays that were being allocated during ExecSetupPartitionTupleRouting(). We now allocate small arrays to store the partition's ResultRelInfo and only enlarge these when we run out of space. The partitions array is now ordered by the order in which the partition's ResultRelInfos are initialized rather than in same order as partdesc. The slowest part of ExecSetupPartitionTupleRouting still remains. The find_all_inheritors call still remains by far the slowest part of the function. This patch just removes the other slow parts. Initialization of the parent to child and child to parent translation maps arrays are now only performed when we need to store the first translation map. If the column order between the parent and its child are the same, then no map ever needs to be stored, these (possibly large) arrays did nothing. The fact that we now always initialize the child to parent map whenever transition capture is required, we no longer need the child_parent_map_not_required array. Previously this was only required so we could determine if no map was required or if the map had not yet been initialized. In simple INSERTs hitting a single partition to a partitioned table with many partitions, the shutdown of the executor was also slow in comparison to the actual execution. This was down to the loop which cleans up each ResultRelInfo having to loop over an array which contained mostly NULLs which had to be skipped. Performance of this has now improved as the array we loop over now no longer has to skip possibly many NULL values. David Rowley and Amit Langote --- src/backend/commands/copy.c | 48 +- src/backend/executor/execPartition.c | 798 ++ src/backend/executor/nodeModifyTable.c| 109 +--- src/backend/utils/cache/partcache.c | 11 +- src/include/catalog/partition.h | 6 +- src/include/executor/execPartition.h | 171 -- src/test/regress/expected/insert_conflict.out | 22 + src/test/regress/sql/insert_conflict.sql | 26 + 8 files changed, 626 insertions(+), 565 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 9bc67ce60f..0dfb9e2e95 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate) /* * If there are any triggers with transition tables on the named relation, * we need to be prepared to capture transition tuples. +* +
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 3 August 2018 at 17:58, Amit Langote wrote: > On 2018/07/31 16:03, David Rowley wrote: >> Maybe we can do that as a follow-on patch. > > We probably could, but I think it would be a good idea get rid of *all* > redundant allocations due to tuple routing in one patch, if that's the > mission of this thread and the patch anyway. I started looking at this patch today and I now agree that it should be included in the main patch. I changed a few things with the patch. For example, the map access macros you'd defined were not in CamelCase. I also fixed a bug where the child to parent map was not being initialised when on conflict transition capture was required. I added a test which was crashing the backend but fixed the code so it works correctly. I also got rid of the child_parent_map_not_required array since we now no longer need it. The code now always initialises the maps in cases where they're going to be required. I've attached a v3 version of your patch and also v6 of the main patch which includes the v3 patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v3_Refactor-handling-of-child_parent_tupconv_maps.patch Description: Binary data v6-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
(looking at the v5 patch but replying to an older email) On 2018/07/31 16:03, David Rowley wrote: > I've attached a complete v4 patch. > >> By the way, when going over the updated code, I noticed that the code >> around child_parent_tupconv_maps could use some refactoring too. >> Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates >> child-to-parent map array needed for transition tuple capture even if not >> needed by any of the leaf partitions. I'm attaching here a patch that >> applies on top of your v3 to show what I'm thinking we could do. > > Maybe we can do that as a follow-on patch. We probably could, but I think it would be a good idea get rid of *all* redundant allocations due to tuple routing in one patch, if that's the mission of this thread and the patch anyway. > I think what we have so far > is already ended up quite complex to review. What do you think? Yeah, it's kind of complex, but at least it seems that we're clear on the point that what we're trying to do here is to try to get rid of redundant allocations. Parts of the patch that appear complex seems to be around the allocation of various maps. Especially the child-to-parent maps, which as things stand today, come from two arrays -- a per-update-subplan array that's needed by update tuple routing proper and per-leaf partition array (one in PartitionTupleRouting) that's needed by transition capture machinery. The original coding was such the update tuple routing handling code would try to avoid allocating the per-update-subplan array if it saw that per-leaf partition array was already set up in PartitionTupleRouting, because transition capture is active in the query. For update-tuple-routing code to be able to use maps from the per-leaf array, it would have to know which update-subplans mapped to which tuple-routing-initialized partitions. That was maintained in the subplan_partition_offset array that's now gone with this patch, because we no longer want to fix the tuple-routing-initialized-partition offsets in advance. So, it's better to dissociate per-subplan maps which are initialized during ExecInitModifyTable from per-leaf maps which are initialized lazily when tuple routing initializes a partition, which is what my portion of the patch did. As mentioned in my last email, I still think it would be a good idea to simplify the handling of child-to-parent maps in PartitionTupleRouting even further, while we're at improving the code in this area. I revised the patch such that it makes the handling of maps in PartitionTupleRouting even more uniform. With that patch, we no longer have two completely unrelated places in the code managing parent-to-child and child-to-parent maps, even though both arrays are in the same PartitionTupleRouting. Please find the updated patch attached with this email. Thanks, Amit From 1a814f5a40774a51bf702757ec91e02f418a5aba Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 3 Aug 2018 14:09:51 +0900 Subject: [PATCH v2 2/2] Refactor handling of child_parent_tupconv_maps They're currently created and handed out by TupConvMapForLeaf, which makes them look somewhat different from parent_to_child_tupconv_maps. In fact, both contain conversion maps possibly needed between a partition initialized by tuple routing and the root parent in one or the other direction, so it seems odd that parent-to-child ones are created in ExecInitRoutingInfo, whereas child-to-parent ones in TupConvMapForLeaf. The child-to-parent ones are only needed if transition capture is active, but we can already check that in ExecInitRoutingInfo via the incoming ModifyTableState (sure, we need to teach CopyFrom to add the necessary information into its dummy ModifyTableState, but that doesn't seem too bad). This way, we can manage both parent-to-child and child-to-parent maps in similar ways, and more importantly, use the same criterion of checking whether a partition's slot in the respective array is NULL or not to conclude if tuple conversion is necessary or not. --- src/backend/commands/copy.c| 37 +--- src/backend/executor/execPartition.c | 102 + src/backend/executor/nodeModifyTable.c | 11 ++-- src/include/executor/execPartition.h | 33 ++- 4 files changed, 64 insertions(+), 119 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 752ba3d767..6f4069d321 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate) /* * If there are any triggers with transition tables on the named relation, * we need to be prepared to capture transition tuples. +* +* Because partition tuple routing would like to know about whether +* transition capture is active, we also set it in mtstate, which is +* passed to ExecFindPartition below. */ - cstate->transition_capture = + cstate->t
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 31 July 2018 at 19:03, David Rowley wrote: > I've attached a complete v4 patch. I've attached v5 of the patch which is based on top of today's master (@ 579b985b22) A couple of recent patches conflict with v4. I've also made another tidy up pass, which was mostly just rewording comments in execPartition.h -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v5-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 30 July 2018 at 20:26, Amit Langote wrote: > I couldn't find much to complain about in the latest v3, except I noticed > a few instances of the word "setup" where I think what's really meant is > "set up". > > + * must be setup, but any sub-partitioned tables can be setup lazily as > > + * A ResultRelInfo has not been setup for this partition yet, > Great. I've fixed those and also fixed a few other comments. I found the comments on PartitionTupleRouting didn't really explain how the arrays were indexed. I've made an attempt to make that clear. I've attached a complete v4 patch. > By the way, when going over the updated code, I noticed that the code > around child_parent_tupconv_maps could use some refactoring too. > Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates > child-to-parent map array needed for transition tuple capture even if not > needed by any of the leaf partitions. I'm attaching here a patch that > applies on top of your v3 to show what I'm thinking we could do. Maybe we can do that as a follow-on patch. I think what we have so far is already ended up quite complex to review. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v4-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/07/28 10:54, David Rowley wrote: > On 27 July 2018 at 19:11, Amit Langote wrote: >> I've attached a delta patch to make the above changes. I'm leaving the >> hash table rename up to you though. > > Thanks for the delta patch. I took all of it, just rewrote a comment slightly. > > I also renamed the hash table to your suggestion and changed a few more > things. > > Attached a delta based on v2 and the full v3 patch. > > This includes another small change to make > PartitionDispatchData->indexes an array that's allocated in the same > memory as the PartitionDispatchData. This will save a palloc() call > and also should be a bit more cache friendly. > > This also required a rebase on master again due to 3e32109049. Thanks for the updated patch. I couldn't find much to complain about in the latest v3, except I noticed a few instances of the word "setup" where I think what's really meant is "set up". + * must be setup, but any sub-partitioned tables can be setup lazily as + * A ResultRelInfo has not been setup for this partition yet, By the way, when going over the updated code, I noticed that the code around child_parent_tupconv_maps could use some refactoring too. Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates child-to-parent map array needed for transition tuple capture even if not needed by any of the leaf partitions. I'm attaching here a patch that applies on top of your v3 to show what I'm thinking we could do. Thanks, Amit From 6ce1654aa929c7f8112c430914af7f464474ed31 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 30 Jul 2018 14:05:17 +0900 Subject: [PATCH 2/2] Some refactoring around child_parent_tupconv_maps Just like parent_child_tupconv_maps, we should allocate it only if needed. Also, if none of the partitions ended up needing a map, we should not have allocated the child_parent_tupconv_maps array, only the child_parent_map_not_required one. So, get rid of ExecSetupChildParentMapForLeaf(), which currently does initial, possibly useless, allocation of both of the above mentioned arrays. Instead, have TupConvMapForLeaf() allocate the needed array, just like ExecInitRoutingInfo does, when it needs to store a parent-to-child map. Finally, rename the function TupConvMapForLeaf to LeafToParentTupConvMapForTC for clarity; TC stands for "Transition Capture". --- src/backend/commands/copy.c| 19 +- src/backend/executor/execPartition.c | 102 - src/backend/executor/nodeModifyTable.c | 4 +- src/include/executor/execPartition.h | 12 ++-- 4 files changed, 59 insertions(+), 78 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 6fc1e2b41c..6d0e9229e0 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2503,22 +2503,9 @@ CopyFrom(CopyState cstate) * CopyFrom tuple routing. */ if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - { - PartitionTupleRouting *proute; - - proute = cstate->partition_tuple_routing = + cstate->partition_tuple_routing = ExecSetupPartitionTupleRouting(NULL, cstate->rel); - /* -* If we are capturing transition tuples, they may need to be -* converted from partition format back to partitioned table format -* (this is only ever necessary if a BEFORE trigger modifies the -* tuple). -*/ - if (cstate->transition_capture != NULL) - ExecSetupChildParentMapForLeaf(proute); - } - /* * It's more efficient to prepare a bunch of tuples for insertion, and * insert them in one heap_multi_insert() call, than call heap_insert() @@ -2666,8 +2653,8 @@ CopyFrom(CopyState cstate) */ cstate->transition_capture->tcs_original_insert_tuple = NULL; cstate->transition_capture->tcs_map = - TupConvMapForLeaf(proute, saved_resultRelInfo, - leaf_part_index); + LeafToParentTupConvMapForTC(proute, saved_resultRelInfo, + leaf_part_index); } else { diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 2a18a30b3e..d183e8b758 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -400,7 +400,7 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute) sizeof(TupleConversi
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 27 July 2018 at 19:11, Amit Langote wrote: > I've attached a delta patch to make the above changes. I'm leaving the > hash table rename up to you though. Thanks for the delta patch. I took all of it, just rewrote a comment slightly. I also renamed the hash table to your suggestion and changed a few more things. Attached a delta based on v2 and the full v3 patch. This includes another small change to make PartitionDispatchData->indexes an array that's allocated in the same memory as the PartitionDispatchData. This will save a palloc() call and also should be a bit more cache friendly. This also required a rebase on master again due to 3e32109049. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v2-delta2.patch Description: Binary data v3-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/07/27 1:19, David Rowley wrote: > On 18 July 2018 at 20:29, Amit Langote wrote: >> Let me know what you think of the code in the updated patch. > > Thanks for sending the updated patch. > > I looked over it tonight and made a number of changes: > > 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was > useless since the int would have wrapped long before it reached > UINT_MAX. Oops, you're right. > There's no shortage of other code doubling the size of an > array by multiplying it by 2 unconditionally without considering > overflowing an int. Unsure why you considered this more risky. Just ill-informed paranoia on my part. Let's just drop it as you say, given also the Tom's comment that repalloc would fail anyway for requests over 1GB. > 2) Fixed a series of bugs regarding the size of the arrays in > PartitionTupleRouting. The map arrays and the partitions array could > differ in size despite your comment that claimed > child_parent_tupconv_maps was the same size as 'partitions' when > non-NULL. The map arrays being a different size than the partitions > array caused the following two cases to segfault. I've included two > cases as it was two seperate bugs that caused them. > > -- case 1 [ ] > -- case 2 Indeed, there were some holes in the logic that led to me to come up with that code. > 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change > this to remove references to UPDATE in order to make it more friendly > towards other possible future node types that it would get used for > (aka MERGE). In the end, I found that performance could regress when > in cases like: > > drop table listp; > create table listp (a int) partition by list(a); > \o /dev/null > \timing off > select 'create table listp'||x::Text||' partition of listp for values > in('||x::Text||');' from generate_series(1,1000) x; > \gexec > \o > insert into listp select x from generate_series(1,999) x; > \timing on > update listp set a = a+1; > > It's true that UPDATEs with a large number of subplans performance is > quite terrible today in the planner, but this code made the > performance of planning+execution a bit worse. If we get around to > fixing the inheritance planner then I think > ExecUseUpdateResultRelForRouting() could easily appear in profiles. > > I ended up rewriting it to just get called once and build a hash table > by Oid storing a ResultRelInfo pointer. This also gets rid of the > slow nested loop in the cleanup operation inside > ExecCleanupTupleRouting(). OK, looks neat, although I'd name the hash table subplan_resultrel_hash (like join_rel_hash in PlannerInfo), instead of subplan_partition_table. > 4) Did some tuning work in ExecFindPartition() getting rid of a > redundant check after the loop completion. Also added some likely() > and unlikely() decorations around some conditions. All changes seem good. > 5) Updated some newly out-dated comments since your patch in execPartition.h. > > 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a > palloc() updating the few fields that were not initialised. This might > save a few TPS (at least once we get rid of the all partition locking) > in the single-row INSERT case, but I've not tested the performance of > this yet. > > 7) Also moved and edited some comments above > ExecSetupPartitionTupleRouting() that I felt explained a little too > much about some internal implementation details. Thanks, changes look good. > One thing that I thought of, but didn't do was just having > ExecFindPartition() return the ResultRelInfo. I think it would be much > nicer in both call sites to not have to check the ->partitions array > to get that. The copy.c call site would need a few modifications > around the detection code to see if the partition has changed, but it > all looks quite possible to change. I left this for now as I have > another patch which touches all that code that I feel is closer to > commit than this patch is. I had wondered about that too, but gave up on doing anything about it because the callers of ExecFindPartition want to access other fields of PartitionTupleRouting using the returned index. Maybe, we could make it return a ResultRelInfo * and also return the index itself using a separate output argument. Seems like a cosmetic improvement that can be made later. > I've attached a delta of the changes I made since your v2 delta and > also a complete updated patch. Thanks. Here are some other minor comments on the complete v2 patch. -tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index], +tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ? + proute->parent_child_tupconv_maps[leaf_part_index] : +NULL, This piece of code that's present in both ExecPrepareTupleRouting and CopyFrom can be written as: if (proute->parent_child_tupconv_maps) ConvertPartitionTupleSlot(pr
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 27 July 2018 at 04:19, David Rowley wrote: > I've attached a delta of the changes I made since your v2 delta and > also a complete updated patch. I did a very quick performance test of this patch on an AWS m5d.large instance with fsync=off. The test setup is the same as is described in my initial email on this thread. The test compares the performance of INSERTs into a partitioned table with 10k partitions compared to a non-partitioned table. Patched with v2 patch on master@39d51fe87 -- partitioned $ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1063764 latency average = 0.056 ms tps = 17729.375930 (including connections establishing) tps = 17729.855215 (excluding connections establishing) -- non-partitioned $ pgbench -n -T 60 -f partbench__insert.sql postgres transaction type: partbench__insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1147273 latency average = 0.052 ms tps = 19121.194366 (including connections establishing) tps = 19121.695469 (excluding connections establishing) Here we're within 92% of the non-partitioned performance. Looking back at the first email in this thread where I tested the v1 patch, we were within 82% with: -- partitioned tps = 11001.602377 (excluding connections establishing) -- non-partitioned tps = 13354.656163 (excluding connections establishing) Again, same as with the v1 test, the v2 test was done with the locking of all partitions removed with: diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index d7b18f52ed..6223c62094 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -80,9 +80,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel) PartitionTupleRouting *proute; ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL; - /* Lock all the partitions. */ - (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL); - /* * Here we attempt to expend as little effort as possible in setting up * the PartitionTupleRouting. Each partition's ResultRelInfo is built @@ -442,7 +439,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, * We locked all the partitions in ExecSetupPartitionTupleRouting * including the leaf partitions. */ - partrel = heap_open(partoid, NoLock); + partrel = heap_open(partoid, RowExclusiveLock); /* * Keep ResultRelInfo and other information for this partition in the Again, the reduce locking is not meant for commit for this patch. Changing the locking will require a discussion on its own thread. And just for fun, the unpatched performance on the partitioned table: ubuntu@ip-10-0-0-33:~$ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 5751 latency average = 10.434 ms tps = 95.836052 (including connections establishing) tps = 95.838490 (excluding connections establishing) (185x increase) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
David Rowley writes: > 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was > useless since the int would have wrapped long before it reached > UINT_MAX. There's no shortage of other code doubling the size of an > array by multiplying it by 2 unconditionally without considering > overflowing an int. Unsure why you considered this more risky. As long as you're re-palloc'ing the array each time, and not increasing its size more than 2X, this is perfectly safe because of the 1GB size limit on palloc requests. You'll fail because of that in the iteration where the request is between 1GB and 2GB, just before integer overflow can occur. (Yes, this is intentional.) regards, tom lane
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 18 July 2018 at 20:29, Amit Langote wrote: > Let me know what you think of the code in the updated patch. Thanks for sending the updated patch. I looked over it tonight and made a number of changes: 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was useless since the int would have wrapped long before it reached UINT_MAX. There's no shortage of other code doubling the size of an array by multiplying it by 2 unconditionally without considering overflowing an int. Unsure why you considered this more risky. 2) Fixed a series of bugs regarding the size of the arrays in PartitionTupleRouting. The map arrays and the partitions array could differ in size despite your comment that claimed child_parent_tupconv_maps was the same size as 'partitions' when non-NULL. The map arrays being a different size than the partitions array caused the following two cases to segfault. I've included two cases as it was two seperate bugs that caused them. -- case 1 drop table listp; create table listp (a int, b int) partition by list (a); create table listp1 partition of listp for values in (1); create table listp2 partition of listp for values in (2); create table listp3 partition of listp for values in (3); create table listp4 partition of listp for values in (4); create table listp5 partition of listp for values in (5); create table listp6 partition of listp for values in (6); create table listp7 partition of listp for values in (7); create table listp8 partition of listp for values in (8); create table listp9 (b int, a int); alter table listp attach partition listp9 for values in(9); insert into listp select generate_series(1,9); -- case 2 drop table listp; create table listp (a int, b int) partition by list (a); create table listp1 (b int, a int); alter table listp attach partition listp1 for values in(1); create table listp1 partition of listp for values in (1); create table listp2 partition of listp for values in (2); create table listp3 partition of listp for values in (3); create table listp4 partition of listp for values in (4); create table listp5 partition of listp for values in (5); create table listp6 partition of listp for values in (6); create table listp7 partition of listp for values in (7); create table listp8 partition of listp for values in (8); create table listp9 partition of listp for values in (9); insert into listp select generate_series(1,9); 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change this to remove references to UPDATE in order to make it more friendly towards other possible future node types that it would get used for (aka MERGE). In the end, I found that performance could regress when in cases like: drop table listp; create table listp (a int) partition by list(a); \o /dev/null \timing off select 'create table listp'||x::Text||' partition of listp for values in('||x::Text||');' from generate_series(1,1000) x; \gexec \o insert into listp select x from generate_series(1,999) x; \timing on update listp set a = a+1; It's true that UPDATEs with a large number of subplans performance is quite terrible today in the planner, but this code made the performance of planning+execution a bit worse. If we get around to fixing the inheritance planner then I think ExecUseUpdateResultRelForRouting() could easily appear in profiles. I ended up rewriting it to just get called once and build a hash table by Oid storing a ResultRelInfo pointer. This also gets rid of the slow nested loop in the cleanup operation inside ExecCleanupTupleRouting(). 4) Did some tuning work in ExecFindPartition() getting rid of a redundant check after the loop completion. Also added some likely() and unlikely() decorations around some conditions. 5) Updated some newly out-dated comments since your patch in execPartition.h. 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a palloc() updating the few fields that were not initialised. This might save a few TPS (at least once we get rid of the all partition locking) in the single-row INSERT case, but I've not tested the performance of this yet. 7) Also moved and edited some comments above ExecSetupPartitionTupleRouting() that I felt explained a little too much about some internal implementation details. One thing that I thought of, but didn't do was just having ExecFindPartition() return the ResultRelInfo. I think it would be much nicer in both call sites to not have to check the ->partitions array to get that. The copy.c call site would need a few modifications around the detection code to see if the partition has changed, but it all looks quite possible to change. I left this for now as I have another patch which touches all that code that I feel is closer to commit than this patch is. I've attached a delta of the changes I made since your v2 delta and also a complete updated patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v2-0001-Sp
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 18 July 2018 at 21:44, Kato, Sho wrote: > part_num | latency_avg | tps_ex | update_latency | select_latency | > insert_latency > --+-++++ > 100 |2.09 | 478.379516 | 1.407 | 0.36 | > 0.159 > 200 | 5.871 | 170.322179 | 4.621 | 0.732 | > 0.285 > 400 | 39.029 | 25.622384 | 35.542 | 2.273 | > 0.758 > 800 | 142.624 | 7.011494 |135.447 | 5.04 | > 1.388 > 1600 | 559.872 | 1.786138 |534.301 | 20.318 | > 3.122 > 3200 |2161.834 | 0.462574 | 2077.737 | 72.804 | > 7.037 > 6400 | 8282.38 | 0.120739 | 7996.212 |259.406 | >14.514 Thanks for testing. It's fairly customary to include before/after, unpatched/patched results. I don't think your patched results really mean much by themselves. It's pretty well known that adding more partitions slows down the planner and the executor, to a lesser extent. This patch only aims to reduce some of the executor startup overheads for INSERT and UPDATE. Also, the 0001 patch is not really aiming to break any performance records. I posted results already and there is only a very small improvement. The main aim with the 0001 patch is to remove the bottlenecks so that the performance drop between partitioned and non-partitioned is primarily due to the partition locking. I'd like to fix that too, but it's more work and I see no reason that we shouldn't fix up the other slow parts first. I imagine this will increase the motivation to resolve the locking all partitions issue too. I'd also recommend that if you're testing this, that you do so with a recent master. The patch is not intended for pg11. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Jul-11, Alvaro Herrera wrote: > That commit is also in pg11, though -- just not in beta2. So we still don't > know how much of an improvement patch2 is by itself :-) Oops! I benchmarked with 11beta2 + 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch. Results are as follows. Performance seems to be improved. part_num | latency_avg | tps_ex | update_latency | select_latency | insert_latency --+-++++ 100 |2.09 | 478.379516 | 1.407 | 0.36 | 0.159 200 | 5.871 | 170.322179 | 4.621 | 0.732 | 0.285 400 | 39.029 | 25.622384 | 35.542 | 2.273 | 0.758 800 | 142.624 | 7.011494 |135.447 | 5.04 | 1.388 1600 | 559.872 | 1.786138 |534.301 | 20.318 | 3.122 3200 |2161.834 | 0.462574 | 2077.737 | 72.804 | 7.037 6400 | 8282.38 | 0.120739 | 7996.212 |259.406 | 14.514 Thanks Kato Sho -Original Message- From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] Sent: Wednesday, July 11, 2018 10:30 PM To: David Rowley Cc: Kato, Sho/加藤 翔 ; PostgreSQL Hackers Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables On 2018-Jul-11, David Rowley wrote: > On 6 July 2018 at 21:25, Kato, Sho wrote: > > 2. 11beta2 + patch1 + patch2 > > > > patch1: Allow direct lookups of AppendRelInfo by child relid > > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687 > > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch > > > > part_num | tps_ex| latency_avg | update_latency | select_latency | > > insert_latency > > --+-+-+++ > > 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | > >0.048 > > 200 | 689.567511 |1.45 | 1.12 | 0.119 | > > 0.05 > > 400 | 347.876616 | 2.875 | 2.419 | 0.185 | > >0.052 > > 800 | 140.489269 | 7.118 | 6.393 | 0.329 | > >0.059 > > 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | > >0.147 > > 3200 |7.021957 | 142.412 | 136.4 | 4.033 | > >0.214 > > 6400 |1.462949 | 683.557 |669.187 | 7.677 | > >0.264 > Just a note to say that the "Allow direct lookups of AppendRelInfo by > child relid" patch is already in master. It's much more relevant to be > testing with master than pg11. This patch is not intended for pg11. That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi David, Thanks for taking a look. On 2018/07/15 17:34, David Rowley wrote: > I've looked over the code and the ExecUseUpdateResultRelForRouting() > function is broken. Your while loop only skips partitions for the > current partitioned table, it does not skip ModifyTable subnodes that > belong to other partitioned tables. > > You can use the following. The code does not find the t1_a2 subnode. > > create table t1 (a int, b int) partition by list(a); > create table t1_a1 partition of t1 for values in(1) partition by list(b); > create table t1_a2 partition of t1 for values in(2); > create table t1_a1_b1 partition of t1_a1 for values in(1); > create table t1_a1_b2 partition of t1_a1 for values in(2); > insert into t1 values(2,2); > > update t1 set a = a; Hmm, it indeed is broken. > I think there might not be enough information to make this work > correctly, as if you change the loop to skip subnodes, then it won't > work in cases where the partition[0] was pruned. > > I've another patch sitting here, partly done, that changes > pg_class.relispartition into pg_class.relpartitionparent. If we had > that then we could code your loop to work correctly.> Alternatively, > I guess we could just ignore the UPDATE's ResultRelInfos and just > build new ones. Unsure if there's actually a reason we need to reuse > the existing ones, is there? We try to use the existing ones because we thought back when the patch was written (not by me though) that redoing all the work that InitResultRelInfo does for each partition, for which we could have instead used an existing one, would cumulatively end up being more expensive than figuring out which ones we could reuse by a linear scan of partition and result rels arrays in parallel. I don't remember seeing a benchmark to demonstrate the benefit of doing this though. Maybe it was posted, but I don't remember having looked at it closely. > I think you'd need to know the owning partition and skip subnodes that > don't belong to pd->reldesc. Alternatively, a hashtable could be built > with all the oids belonging to pd->reldesc, then we could loop over > the update_rris finding subnodes that can be found in the hashtable. > Likely this will be much slower than the sort of merge lookup that the > previous code did. I think one option is to simply give up on the idea of matching *all* UPDATE result rels that belong to a given partitioned table (pd->reldesc) in one call of ExecUseUpdateResultRelForRouting. Instead, pass the index of the partition (in pd->partdesc->oids) to find the ResultRelInfo for, loop over all UPDATE result rels looking for one, and return immediately on finding one after having stored its pointer in proute->partitions. In the worst case, we'll end up scanning UPDATE result rels array for every partition that gets touched, but maybe such an UPDATE query is less common or even if such a query occurs, tuple routing might be the last of its bottlenecks. I have implemented that approach in the updated patch. That means I also needed to change things so that ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition, because with the new arrangement, it's useless to call it from ExecSetupPartitionTupleRouting. Moreover, an UPDATE may not use tuple routing at all, even if the fact that partition key is being updated results in calling ExecSetupPartitionTupleRouting. > Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code. > The code seems to assume that there can only be at the most 65536 > partitions, but I don't think there's any code which restricts us to > that. There is code in the planner that will bork when trying to > create a RangeTblEntry up that high, but as far as I know that won't > be noticed on the INSERT path. I don't think this code has any > business knowing what the special varnos are set to either. It would > be better to just remove the limit and suffer the small wasted array > space. I understand you've probably coded it like this due to the > similar code that was in my patch, but with mine I knew the total > number of partitions. Your patch does not. OK, I changed it to UINT_MAX. > Other thoughts on the patch: > > I wonder if it's worth having syscache keep a count on the number of > sub-partitioned tables a partition has. If there are none in the root > partition then the partition_dispatch_info can be initialized with > just 1 element to store the root details. Although, maybe it's not > worth it to reduce the array size by 7 elements. Hmm yes. Allocating space for 8 pointers when we really need 1 is not too bad, if the alternative is to modify partcache.c. > Also, I'm a bit confused why you change the comments in > execPartition.h for PartitionTupleRouting to be inline again. I > brought those out of line as I thought the complexity of the code > warranted that. You're inlining them again goes against what all the > other structs do in that file. It was out-of-line to begin with but it
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 13 July 2018 at 20:20, Amit Langote wrote: > Why don't we abandon the notion altogether that > ExecSetupPartitionTupleRouting *has to* process the whole partition tree? [...] > I implemented that idea in the attached patch, which applies on top of > your 0001 patch, but I'd say it's too big to be just called a delta. I > was able to get following performance numbers using the following pgbench > test: Thanks for looking at this. I like that your idea gets rid of the indexes cache in syscache. I was not very happy with that part. I've looked over the code and the ExecUseUpdateResultRelForRouting() function is broken. Your while loop only skips partitions for the current partitioned table, it does not skip ModifyTable subnodes that belong to other partitioned tables. You can use the following. The code does not find the t1_a2 subnode. create table t1 (a int, b int) partition by list(a); create table t1_a1 partition of t1 for values in(1) partition by list(b); create table t1_a2 partition of t1 for values in(2); create table t1_a1_b1 partition of t1_a1 for values in(1); create table t1_a1_b2 partition of t1_a1 for values in(2); insert into t1 values(2,2); update t1 set a = a; I think there might not be enough information to make this work correctly, as if you change the loop to skip subnodes, then it won't work in cases where the partition[0] was pruned. I've another patch sitting here, partly done, that changes pg_class.relispartition into pg_class.relpartitionparent. If we had that then we could code your loop to work correctly. Alternatively, I guess we could just ignore the UPDATE's ResultRelInfos and just build new ones. Unsure if there's actually a reason we need to reuse the existing ones, is there? I think you'd need to know the owning partition and skip subnodes that don't belong to pd->reldesc. Alternatively, a hashtable could be built with all the oids belonging to pd->reldesc, then we could loop over the update_rris finding subnodes that can be found in the hashtable. Likely this will be much slower than the sort of merge lookup that the previous code did. Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code. The code seems to assume that there can only be at the most 65536 partitions, but I don't think there's any code which restricts us to that. There is code in the planner that will bork when trying to create a RangeTblEntry up that high, but as far as I know that won't be noticed on the INSERT path. I don't think this code has any business knowing what the special varnos are set to either. It would be better to just remove the limit and suffer the small wasted array space. I understand you've probably coded it like this due to the similar code that was in my patch, but with mine I knew the total number of partitions. Your patch does not. Other thoughts on the patch: I wonder if it's worth having syscache keep a count on the number of sub-partitioned tables a partition has. If there are none in the root partition then the partition_dispatch_info can be initialized with just 1 element to store the root details. Although, maybe it's not worth it to reduce the array size by 7 elements. Also, I'm a bit confused why you change the comments in execPartition.h for PartitionTupleRouting to be inline again. I brought those out of line as I thought the complexity of the code warranted that. You're inlining them again goes against what all the other structs do in that file. Apart from that, I think the idea is promising. We'll just need to find a way to make ExecUseUpdateResultRelForRouting work correctly. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi David. On 2018/06/22 15:28, David Rowley wrote: > Hi, > > As part of my efforts to make partitioning scale better for larger > numbers of partitions, I've been looking at primarily INSERT VALUES > performance. Here the overheads are almost completely in the > executor. Planning of this type of statement is very simple since > there is no FROM clause to process. Thanks for this effort. > My benchmarks have been around a RANGE partitioned table with 10k leaf > partitions and no sub-partitioned tables. The partition key is a > timestamp column. > > I've found that ExecSetupPartitionTupleRouting() is very slow indeed > and there are a number of things slow about it. The biggest culprit > for the slowness is the locking of each partition inside of > find_all_inheritors(). Yes. :-( > For now, this needs to remain as we must hold > locks on each partition while performing RelationBuildPartitionDesc(), > otherwise, one of the partitions may get dropped out from under us. We lock all partitions using find_all_inheritors to keep locking order consistent with other sites that may want to lock tables in the same partition tree but with a possibly conflicting lock mode. If we remove the find_all_inheritors call in ExecSetupPartitionPruneState (like your 0002 does), we may end up locking partitions in arbitrary order in a given transaction, because input tuples will be routed to various partitions in an order that's not predetermined. But, maybe it's not necessary to be that paranoid. If we've locked on the parent, any concurrent lockers would have to wait for the lock on the parent anyway, so it doesn't matter which order tuple routing locks the partitions. > The locking is not the only slow thing. I found the following to also be slow: > > 1. RelationGetPartitionDispatchInfo uses a List and lappend() must > perform a palloc() each time a partition is added to the list. > 2. A foreach loop is performed over leaf_parts to search for subplans > belonging to this partition. This seems pointless to do for INSERTs as > there's never any to find. > 3. ExecCleanupTupleRouting() loops through the entire partitions > array. If a single tuple was inserted then all but one of the elements > will be NULL. > 4. Tuple conversion map allocates an empty array thinking there might > be something to put into it. This is costly when the array is large > and pointless when there are no maps to store. > 5. During get_partition_dispatch_recurse(), get_rel_relkind() is > called to determine if the partition is a partitioned table or a leaf > partition. This results in a slow relcache hashtable lookup. > 6. get_partition_dispatch_recurse() also ends up just building the > indexes array with a sequence of numbers from 0 to nparts - 1 if there > are no sub-partitioned tables. Doing this is slow when there are many > partitions. > > Besides the locking, the only thing that remains slow now is the > palloc0() for the 'partitions' array. In my test, it takes 0.6% of > execution time. I don't see any pretty ways to fix that. > > I've written fixes for items 1-6 above. > > I did: > > 1. Use an array instead of a List. > 2. Don't do this loop. palloc0() the partitions array instead. Let > UPDATE add whatever subplans exist to the zeroed array. > 3. Track what we initialize in a gapless array and cleanup just those > ones. Make this array small and increase it only when we need more > space. > 4. Only allocate the map array when we need to store a map. > 5. Work that out in relcache beforehand. > 6. ditto The issues you list seem all legitimate to me and also your proposed fixes for each, except I think we could go a bit further. Why don't we abandon the notion altogether that ExecSetupPartitionTupleRouting *has to* process the whole partition tree? ISTM, there is no need to determine the exact number of leaf partitions and partitioned tables in the partition tree and allocate the arrays in PartitionTupleRouting based on that. I know that the indexes array in PartitionDispatchData contains mapping from local partition indexes (0 to partdesc->nparts - 1) to those that span *all* leaf partitions and *all* partitioned tables (0 to proute->num_partitions or proute->num_dispatch) in a partition tree, but we can change that. The idea I had was inspired by looking at partitions_init stuff in your patch. We could allocate proute->partition_dispatch_info and proute->partitions arrays to be of a predetermined size, which doesn't require us to calculate the exact number of leaf partitions and partitioned tables beforehand. So, RelationGetPartitionDispatchInfo need not recursively go over all of the partition tree. Instead we create just one PartitionDispatch object of the root parent table, whose indexes array is initialized with -1 meaning none of the partitions has not been encountered yet. In ExecFindPartition, once tuple routing chooses a partition, we create either a ResultRelInfo (if leaf) or a PartitionDispatch for it and sto
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Jul-11, David Rowley wrote: > On 6 July 2018 at 21:25, Kato, Sho wrote: > > 2. 11beta2 + patch1 + patch2 > > > > patch1: Allow direct lookups of AppendRelInfo by child relid > > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687 > > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch > > > > part_num | tps_ex| latency_avg | update_latency | select_latency | > > insert_latency > > --+-+-+++ > > 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | > >0.048 > > 200 | 689.567511 |1.45 | 1.12 | 0.119 | > > 0.05 > > 400 | 347.876616 | 2.875 | 2.419 | 0.185 | > >0.052 > > 800 | 140.489269 | 7.118 | 6.393 | 0.329 | > >0.059 > > 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | > >0.147 > > 3200 |7.021957 | 142.412 | 136.4 | 4.033 | > >0.214 > > 6400 |1.462949 | 683.557 |669.187 | 7.677 | > >0.264 > Just a note to say that the "Allow direct lookups of AppendRelInfo by > child relid" patch is already in master. It's much more relevant to be > testing with master than pg11. This patch is not intended for pg11. That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 6 July 2018 at 21:25, Kato, Sho wrote: > 2. 11beta2 + patch1 + patch2 > > patch1: Allow direct lookups of AppendRelInfo by child relid > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687 > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch > > part_num | tps_ex| latency_avg | update_latency | select_latency | > insert_latency > --+-+-+++ > 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | > 0.048 > 200 | 689.567511 |1.45 | 1.12 | 0.119 | > 0.05 > 400 | 347.876616 | 2.875 | 2.419 | 0.185 | > 0.052 > 800 | 140.489269 | 7.118 | 6.393 | 0.329 | > 0.059 > 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | > 0.147 > 3200 |7.021957 | 142.412 | 136.4 | 4.033 | > 0.214 > 6400 |1.462949 | 683.557 |669.187 | 7.677 | > 0.264 Hi, Thanks a lot for benchmarking this. Just a note to say that the "Allow direct lookups of AppendRelInfo by child relid" patch is already in master. It's much more relevant to be testing with master than pg11. This patch is not intended for pg11. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speeding up INSERTs and UPDATEs to partitioned tables
Thanks David! I did benchmark with pgbench, and see a speedup for INSERT / UPDATE scenarios. I used range partition. Benchmark results are as follows. 1. 11beta2 result part_num | tps_ex | latency_avg | update_latency | select_latency | insert_latency --++-+++ 100 | 479.456278 | 2.086 | 1.382 | 0.365 | 0.168 200 | 169.155411 | 5.912 | 4.628 | 0.737 | 0.299 400 | 24.857495 | 40.23 | 36.606 | 2.252 | 0.881 800 | 6.718104 | 148.853 |141.471 | 5.253 | 1.433 1600 | 1.934908 | 516.825 |489.982 | 21.102 | 3.701 3200 | 0.456967 |2188.362 | 2101.247 | 72.784 | 8.833 6400 | 0.116643 |8573.224 |8286.79 |257.904 | 14.949 2. 11beta2 + patch1 + patch2 patch1: Allow direct lookups of AppendRelInfo by child relid commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687 patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch part_num | tps_ex| latency_avg | update_latency | select_latency | insert_latency --+-+-+++ 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | 0.048 200 | 689.567511 |1.45 | 1.12 | 0.119 | 0.05 400 | 347.876616 | 2.875 | 2.419 | 0.185 | 0.052 800 | 140.489269 | 7.118 | 6.393 | 0.329 | 0.059 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | 0.147 3200 |7.021957 | 142.412 | 136.4 | 4.033 | 0.214 6400 |1.462949 | 683.557 |669.187 | 7.677 | 0.264 benchmark script: \set aid random(1, 100 * 1) \set delta random(-5000, 5000) BEGIN; UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid; SELECT abalance FROM test.accounts WHERE aid = :aid; INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP); END; partition key is aid. -Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Thursday, July 05, 2018 6:19 PM To: Kato, Sho/加藤 翔 Cc: PostgreSQL Hackers Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables On 5 July 2018 at 18:39, Kato, Sho wrote: > postgres=# create table a(i int) partition by range(i); CREATE TABLE > postgres=# create table a_1 partition of a for values from(1) to > (200); CREATE TABLE postgres=# create table a_2 partition of a for > values from(200) to (400); server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. Hi, Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16. Did you try make clean then building again? The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 6 July 2018 at 01:18, Jesper Pedersen wrote: > With 0002 INSERTs get close to the TPS of the non-partitioned case. However, > UPDATEs doesn't see the same speedup. But, as you said, a discussion for > another thread. Hi Jesper, Thanks for testing this out. It was only really the locking I didn't want to discuss here due to the risk of the discussion of removing the other overheads getting lost in discussions about locking. It's most likely that the UPDATE is slower due to the planner still being quite slow when dealing with partitioned tables. It still builds RangeTblEntry and RelOptInfo objects for each partition even if the partition is pruned. With INSERT with a VALUES clause, the planner does not build these objects, in fact, the planner bearly does any work at all, so this can be speeded up just by removing the executor overheads. (I do have other WIP patches to speed up the planner. After delaying building the RelOptInfo and RangeTblEntry, with my 10k partition setup, the planner SELECT became 1600 times faster. UPDATE did not finish in the unpatched version, so gains there are harder to measure. There's still much work to do on these patches, and there's still more performance to squeeze out too. Hopefully, I'll be discussing this on another thread soon.) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
Hi David, On 06/22/2018 02:28 AM, David Rowley wrote: I've attached 2 patches: 0001: implements items 1-6 0002: Is not intended for commit. It just a demo of where we could get the performance if we were smarter about locking partitions. I've just included this to show 0001's worth. I did some tests with a 64 hash partition setup, and see a speedup for INSERT / UPDATE scenarios. I don't want to discuss the locking on this thread. That discussion will detract from discussing what I'm proposing here... Which is not to change anything relating to locks. I'm still working on that and will post elsewhere. With 0002 INSERTs get close to the TPS of the non-partitioned case. However, UPDATEs doesn't see the same speedup. But, as you said, a discussion for another thread. Best regards, Jesper
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 5 July 2018 at 18:39, Kato, Sho wrote: > postgres=# create table a(i int) partition by range(i); > CREATE TABLE > postgres=# create table a_1 partition of a for values from(1) to (200); > CREATE TABLE > postgres=# create table a_2 partition of a for values from(200) to (400); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. Hi, Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16. Did you try make clean then building again? The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speeding up INSERTs and UPDATEs to partitioned tables
(%rax),%rax 0x00596e46 <+60>:mov0x24(%rax),%eax 0x00596e49 <+63>:cltq 0x00596e4b <+65>:shl$0x2,%rax 0x00596e4f <+69>:add%rdx,%rax => 0x00596e52 <+72>:mov(%rax),%eax 0x00596e54 <+74>:jmp0x596e5b 0x00596e56 <+76>:mov$0x0,%eax 0x00596e5b <+81>:pop%rbp 0x00596e5c <+82>:retq End of assembler dump. (gdb) i r rax0x20057e77c 8595695484 rbx0x72 114 rcx0x7f50ce90e0e8 139985039712488 rdx0x259e98039446912 rsi0x7f50ce90e0a8 139985039712424 rdi0x259e92839446824 rbp0x7ffc05931890 0x7ffc05931890 rsp0x7ffc05931890 0x7ffc05931890 r8 0x7ffc059317bf 140720402012095 r9 0x0 0 r100x6b 107 r110x7f50cdbc3f10 139985025777424 r120x70 112 r130x0 0 r140x0 0 r150x0 0 rip0x596e52 0x596e52 eflags 0x10202 [ IF RF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 (gdb) list *0x596e52 0x596e52 is in get_default_oid_from_partdesc (partition.c:269). 264 Oid 265 get_default_oid_from_partdesc(PartitionDesc partdesc) 266 { 267 if (partdesc && partdesc->boundinfo && 268 partition_bound_has_default(partdesc->boundinfo)) 269 return partdesc->oids[partdesc->boundinfo->default_index]; 270 271 return InvalidOid; 272 } 273 regards, -Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Saturday, June 23, 2018 7:19 AM To: PostgreSQL Hackers Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables On 22 June 2018 at 18:28, David Rowley wrote: > I've written fixes for items 1-6 above. > > I did: > > 1. Use an array instead of a List. > 2. Don't do this loop. palloc0() the partitions array instead. Let > UPDATE add whatever subplans exist to the zeroed array. > 3. Track what we initialize in a gapless array and cleanup just those > ones. Make this array small and increase it only when we need more > space. > 4. Only allocate the map array when we need to store a map. > 5. Work that out in relcache beforehand. > 6. ditto I've added this to the July 'fest: https://commitfest.postgresql.org/18/1690/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 22 June 2018 at 18:28, David Rowley wrote: > I've written fixes for items 1-6 above. > > I did: > > 1. Use an array instead of a List. > 2. Don't do this loop. palloc0() the partitions array instead. Let > UPDATE add whatever subplans exist to the zeroed array. > 3. Track what we initialize in a gapless array and cleanup just those > ones. Make this array small and increase it only when we need more > space. > 4. Only allocate the map array when we need to store a map. > 5. Work that out in relcache beforehand. > 6. ditto I've added this to the July 'fest: https://commitfest.postgresql.org/18/1690/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Speeding up INSERTs and UPDATEs to partitioned tables
Hi, As part of my efforts to make partitioning scale better for larger numbers of partitions, I've been looking at primarily INSERT VALUES performance. Here the overheads are almost completely in the executor. Planning of this type of statement is very simple since there is no FROM clause to process. My benchmarks have been around a RANGE partitioned table with 10k leaf partitions and no sub-partitioned tables. The partition key is a timestamp column. I've found that ExecSetupPartitionTupleRouting() is very slow indeed and there are a number of things slow about it. The biggest culprit for the slowness is the locking of each partition inside of find_all_inheritors(). For now, this needs to remain as we must hold locks on each partition while performing RelationBuildPartitionDesc(), otherwise, one of the partitions may get dropped out from under us. There might be other valid reasons too, but please see my note at the bottom of this email. The locking is not the only slow thing. I found the following to also be slow: 1. RelationGetPartitionDispatchInfo uses a List and lappend() must perform a palloc() each time a partition is added to the list. 2. A foreach loop is performed over leaf_parts to search for subplans belonging to this partition. This seems pointless to do for INSERTs as there's never any to find. 3. ExecCleanupTupleRouting() loops through the entire partitions array. If a single tuple was inserted then all but one of the elements will be NULL. 4. Tuple conversion map allocates an empty array thinking there might be something to put into it. This is costly when the array is large and pointless when there are no maps to store. 5. During get_partition_dispatch_recurse(), get_rel_relkind() is called to determine if the partition is a partitioned table or a leaf partition. This results in a slow relcache hashtable lookup. 6. get_partition_dispatch_recurse() also ends up just building the indexes array with a sequence of numbers from 0 to nparts - 1 if there are no sub-partitioned tables. Doing this is slow when there are many partitions. Besides the locking, the only thing that remains slow now is the palloc0() for the 'partitions' array. In my test, it takes 0.6% of execution time. I don't see any pretty ways to fix that. I've written fixes for items 1-6 above. I did: 1. Use an array instead of a List. 2. Don't do this loop. palloc0() the partitions array instead. Let UPDATE add whatever subplans exist to the zeroed array. 3. Track what we initialize in a gapless array and cleanup just those ones. Make this array small and increase it only when we need more space. 4. Only allocate the map array when we need to store a map. 5. Work that out in relcache beforehand. 6. ditto The only questionable thing I see is what I did for 6. In partcache.c I'm basically building an array of nparts containing 0 to nparts -1. It seems a bit pointless, so perhaps there's a better way. I was also a bit too tight to memcpy() that out of relcache, and just pointed directly to it. That might be a no-go area. I've attached 2 patches: 0001: implements items 1-6 0002: Is not intended for commit. It just a demo of where we could get the performance if we were smarter about locking partitions. I've just included this to show 0001's worth. Performance AWS: m5d.large fsync=off Unpatched: $ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 2836 latency average = 21.162 ms tps = 47.254409 (including connections establishing) tps = 47.255756 (excluding connections establishing) (yes, it's bad) 0001: $ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 3235 latency average = 18.548 ms tps = 53.913121 (including connections establishing) tps = 53.914629 (excluding connections establishing) (a small improvement from 0001) 0001+0002: $ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 660079 latency average = 0.091 ms tps = 11001.303764 (including connections establishing) tps = 11001.602377 (excluding connections establishing) (something to aspire towards) 0002 (only): $ pgbench -n -T 60 -f partbench_insert.sql postgres transaction type: partbench_insert.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 27682 latency average = 2.168 ms tps = 461.350885 (including connections establishing) tps = 461.363327 (excluding connections establishing) (shows that doing 0002 alone does not fix all our problem