Re: hashagg slowdown due to spill changes
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis wrote:
> I saw less of an improvement than you or Andres (perhaps just more
> noise). But given that both you and Andres are reporting a measurable
> improvement, I went ahead and committed it. Thank you.

Thanks!
-- 
Peter Geoghegan
Re: hashagg slowdown due to spill changes
On Sat, 2020-07-25 at 15:08 -0700, Peter Geoghegan wrote:
> I find that Andres' original "SELECT cat, count(*) FROM
> fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
> the patch. Without the patch, v13 takes ~11.46 seconds. With the
> patch, it takes only ~10.64 seconds.

I saw less of an improvement than you or Andres (perhaps just more
noise). But given that both you and Andres are reporting a measurable
improvement, I went ahead and committed it. Thank you.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan wrote:
> I have added a new open item for this separate
> LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

Attached is a rebased version of Andres' now-bitrotted 2020-06-12 patch
("aggspeed.diff").

I find that Andres' original "SELECT cat, count(*) FROM
fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by the
patch. Without the patch, v13 takes ~11.46 seconds. With the patch, it
takes only ~10.64 seconds.

I didn't test it against v12 yet, but I have no reason to doubt Andres'
explanation. I gather that if we can get this patch committed, we can
close the relevant LookupTupleHashEntryHash() open item.

Can you take this off my hands, Jeff?

Thanks
-- 
Peter Geoghegan

0001-Fix-LookupTupleHashEntryHash-pipeline-stall-issue.patch
Description: Binary data
Re: hashagg slowdown due to spill changes
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund wrote:
> This is still not resolved. We're right now slower than 12. It's
> effectively not possible to do performance comparisons right now. This
> was nearly two months ago.

I have added a new open item for this separate
LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

(For the record, I mistakenly believed that commit 23023022 resolved all
of the concerns raised on this thread, which is why I closed out the
open item associated with this thread. Evidently work remains to fix a
remaining regression that affects simple in-memory hash aggregation,
though.)
-- 
Peter Geoghegan
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-12 14:37:15 -0700, Andres Freund wrote:
> On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > > Did you run any performance tests?
> >
> > Yes, I reproduced your ~12% regression from V12, and this patch nearly
> > eliminated it for me.
>
> I spent a fair bit of time looking at the difference. Jeff had let me
> know on chat that he was still seeing some difference, but couldn't
> quite figure out where that was.
>
> Trying it out myself, I observed that the patch helped, but not that
> much. After a bit I found one major reason why:
> LookupTupleHashEntryHash() assigned the hash to the pointer provided by
> the caller before doing the insertion. That ended up causing a pipeline
> stall (I assume it's store forwarding, but not sure). Moving the
> assignment to the caller's variable to after the insertion got rid of
> that.

This is still not resolved. We're right now slower than 12. It's
effectively not possible to do performance comparisons right now. This
was nearly two months ago.

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 10:06 AM Andres Freund wrote:

Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund wrote:
> > It's not this patch's fault, but none, really none, of this stuff should
> > be in the executor.
>
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and create_groupingsets_plan().
But even just adding a new extract_*_cols() call to each of those would
probably be ok.

So, my summary of this point in the context of the other discussion
upthread is: Planner should extract the columns that hashagg will need
later during planning. Planner should not have HashAgg/MixedAgg nodes
request smaller targetlists from their children with CP_SMALL_TLIST to
avoid unneeded projection overhead. Also, even this extraction should
only be done when the number of groups is large enough to suspect a
spill.

IMO we should extract the columns irrespective of the estimates,
otherwise we won't be able to handle underestimates efficiently.

Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches.
I wondered what was different about that vs hashagg (i.e. why it is okay
to do that there).

Yeah. That means that if we have to start batching during execution, we
may need to spill much more data. I'd say that's a hashjoin issue that
we should fix too (in v14).

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hashagg slowdown due to spill changes
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund wrote:
> Hi,
>
> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund wrote:
> > > It's not this patch's fault, but none, really none, of this stuff
> > > should be in the executor.
> >
> > Were you thinking it could be done in grouping_planner() and then the
> > bitmaps could be saved in the PlannedStmt?
> > Or would you have to wait until query_planner()? Or are you imagining
> > somewhere else entirely?
>
> I haven't thought about it in too much detail, but I would say
> create_agg_plan() et al. I guess there's some argument to be made to do
> it in setrefs.c, because we already do convert_combining_aggrefs() there
> (but I don't like that much).
>
> There's no reason to do it before we actually decided on one specific
> path, so doing it earlier than create_plan() seems unnecessary. And
> having it in agg specific code seems better than putting it into global
> routines.
>
> There's probably an argument for having a bit more shared code between
> create_agg_plan(), create_group_plan() and
> create_groupingsets_plan(). But even just adding a new extract_*_cols()
> call to each of those would probably be ok.

So, my summary of this point in the context of the other discussion
upthread is: Planner should extract the columns that hashagg will need
later during planning. Planner should not have HashAgg/MixedAgg nodes
request smaller targetlists from their children with CP_SMALL_TLIST to
avoid unneeded projection overhead. Also, even this extraction should
only be done when the number of groups is large enough to suspect a
spill.

So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references(). Then, I wrote a patch that does
this in set_upper_references(), and it seems to work. I've attached that
one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg, but I moved the functions over to setrefs.c
and gave them a different name. It's not very elegant. I shoved it in at
the end of set_upper_references(), but I think there should be a nicer
way to do it while setting the references for each var instead of
walking over the nodes again.

Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place. I tried putting both bitmaps in an
array of two bitmaps in the Agg node (since there will always be two) to
make it look a bit neater, but it was pretty confusing and error-prone
to remember which one was aggregated and which one was unaggregated.

Note that I didn't do anything with costing, like only extracting the
columns if there are a lot of groups. Also, I didn't revert the
CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().

Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).

-- 
Melanie Plageman

From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner

Find the columns the executor will need to know about for hashagg and
bifurcate them into aggregated and unaggregated column bitmapsets. Save
them in the plan tree for use during execution. This is done after
set_plan_refs() so all the references are correct.

---
 src/backend/executor/nodeAgg.c       | 117 +--
 src/backend/nodes/copyfuncs.c        |   2 +
 src/backend/nodes/outfuncs.c         |   2 +
 src/backend/nodes/readfuncs.c        |   2 +
 src/backend/optimizer/plan/setrefs.c |  69
 src/include/nodes/execnodes.h        |   8 +-
 src/include/nodes/plannodes.h        |   2 +
 7 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool		is_aggref;		/* is under an aggref */
+	Bitmapset  *aggregated;		/* column references under an aggref */
+	Bitmapset  *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void finalize_aggregates(AggSta
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund wrote:
> > It's not this patch's fault, but none, really none, of this stuff
> > should be in the executor.
>
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and create_groupingsets_plan().
But even just adding a new extract_*_cols() call to each of those would
probably be ok.

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund wrote:
> > /*
> > - * find_unaggregated_cols
> > - *	  Construct a bitmapset of the column numbers of un-aggregated Vars
> > - *	  appearing in our targetlist and qual (HAVING clause)
> > + * Walk tlist and qual to find referenced colnos, dividing them into
> > + * aggregated and unaggregated sets.
> >  */
> > -static Bitmapset *
> > -find_unaggregated_cols(AggState *aggstate)
> > +static void
> > +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
> > {
>
> It's not this patch's fault, but none, really none, of this stuff should
> be in the executor.

Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?

-- 
Melanie Plageman
Re: hashagg slowdown due to spill changes
Hi,

I think it'd be good to get the changes that aren't related to
projection merged. As far as I can tell there are performance
regressions both because of the things I'd listed upthread, and due to
the projection issue. That's not obvious because we first won
performance and then lost it again in several incremental steps.

Test prep:

COPY (SELECT (random() * 4)::int cat, (random()*1)::int val FROM generate_series(1, 1)) TO '/tmp/data' WITH BINARY;
BEGIN;
DROP TABLE IF EXISTS fewgroups_many_rows;
CREATE TABLE fewgroups_many_rows(cat int4 not null, val int4 not null);
COPY fewgroups_many_rows FROM '/tmp/data' WITH (FORMAT BINARY, FREEZE);
COMMIT;
VACUUM FREEZE fewgroups_many_rows;

Test query:

SET seed=0; SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

(the seed seems to reduce noise due to hashtable iv being the same)

best of six:

9e1c9f959422192bbe1b842a2a1ffaf76b080196	12031.906 ms
d52eaa094847d395f942827a6f413904e516994c	12045.487 ms
ac88807f9b227ddcd92b8be9a053094837c1b99a	11950.006 ms
36d22dd95bc87ca68e742da91f47f8826f8758c9	11769.991 ms
5ac4e9a12c6543414891cd8972b2cd36a08e40cc	11551.932 ms
1fdb7f9789c4550204cd62d1746a7deed1dc4c29	11706.948 ms
4eaea3db150af56aa2e40efe91997fd25f3b6d73	11999.908 ms
11de6c903da99a4b2220acfa776fc26c7f384ccc	11999.054 ms
b7fabe80df9a65010bfe5e5d0a979bacebfec382	12165.463 ms
2742c45080077ed3b08b810bb96341499b86d530	12137.505 ms
1f39bce021540fde00990af55b4432c55ef4b3c7	12501.764 ms
9b60c4b979bce060495e2b05ba01d1cc6bffdd2d	12389.047 ms
4cad2534da6d17067d98cf04be2dfc1bda8f2cd0	13319.786 ms
1b2c29469a58cd9086bd86e20c708eb437564a80	13330.616 ms

There's certainly some noise in here, but I think the trends are valid.

> /*
> - * find_unaggregated_cols
> - *	  Construct a bitmapset of the column numbers of un-aggregated Vars
> - *	  appearing in our targetlist and qual (HAVING clause)
> + * Walk tlist and qual to find referenced colnos, dividing them into
> + * aggregated and unaggregated sets.
>  */
> -static Bitmapset *
> -find_unaggregated_cols(AggState *aggstate)
> +static void
> +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset
> **unaggregated)
> {

It's not this patch's fault, but none, really none, of this stuff should
be in the executor.

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Mon, Jun 15, 2020 at 07:38:45PM -0700, Jeff Davis wrote:
> On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
> > On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra wrote:
> > > But just reverting 4cad2534d will make this much worse, I think, as
> > > illustrated by the benchmarks I did in [1].
> >
> > I share this concern, although I do not know what we should do about
> > it.
>
> I attached an updated version of Melanie's patch, combined with the
> changes to copy only the necessary attributes to a new slot before
> spilling. There are a couple changes:
>
> * I didn't see a reason to descend into a GroupingFunc node, so I
>   removed that.
>
> * I used a flag in the context rather than two separate callbacks to
>   the expression walker.
>
> This patch gives the space benefits that we see on master, without the
> regression for small numbers of tuples. I saw a little bit of noise in
> my test results, but I'm pretty sure it's a win all around. It could
> use some review/cleanup though.

Looks reasonable. I can't redo my tests at the moment, the machine is
busy with something else. I'll give it a try over the weekend.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hashagg slowdown due to spill changes
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra wrote:
> > But just reverting 4cad2534d will make this much worse, I think, as
> > illustrated by the benchmarks I did in [1].
>
> I share this concern, although I do not know what we should do about
> it.

I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
  removed that.

* I used a flag in the context rather than two separate callbacks to the
  expression walker.

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could use
some review/cleanup though.

Regards,
	Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..5b5aa96c517 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -358,6 +358,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool		is_aggref;		/* is under an aggref */
+	Bitmapset  *aggregated;		/* column references under an aggref */
+	Bitmapset  *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -390,8 +398,9 @@ static void finalize_aggregates(AggState *aggstate,
 								AggStatePerAgg peragg,
 								AggStatePerGroup pergroup);
 static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node,
-										  Bitmapset **colnos);
+static void find_cols(AggState *aggstate, Bitmapset **aggregated,
+					  Bitmapset **unaggregated);
+static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
 static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
@@ -424,8 +433,8 @@ static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
 							   int used_bits, uint64 input_tuples,
 							   double hashentrysize);
-static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
-								uint32 hash);
+static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+								TupleTableSlot *slot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
 								 int setno);
 static void hashagg_tapeinfo_init(AggState *aggstate);
@@ -1374,26 +1383,28 @@ project_aggregates(AggState *aggstate)
 }
 
 /*
- * find_unaggregated_cols
- *		Construct a bitmapset of the column numbers of un-aggregated Vars
- *		appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
  */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
 {
-	Agg		   *node = (Agg *) aggstate->ss.ps.plan;
-	Bitmapset  *colnos;
-
-	colnos = NULL;
-	(void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
-										 &colnos);
-	(void) find_unaggregated_cols_walker((Node *) node->plan.qual,
-										 &colnos);
-	return colnos;
+	Agg		   *agg = (Agg *) aggstate->ss.ps.plan;
+	FindColsContext context;
+
+	context.is_aggref = false;
+	context.aggregated = NULL;
+	context.unaggregated = NULL;
+
+	(void) find_cols_walker((Node *) agg->plan.targetlist, &context);
+	(void) find_cols_walker((Node *) agg->plan.qual, &context);
+
+	*aggregated = context.aggregated;
+	*unaggregated = context.unaggregated;
 }
 
 static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
+find_cols_walker(Node *node, FindColsContext *context)
 {
 	if (node == NULL)
 		return false;
@@ -1404,16 +1415,24 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
 		/* setrefs.c should have set the varno to OUTER_VAR */
 		Assert(var->varno == OUTER_VAR);
 		Assert(var->varlevelsup == 0);
-		*colnos = bms_add_member(*colnos, var->varattno);
+		if (context->is_aggref)
+			context->aggregated = bms_add_member(context->aggregated,
+												 var->varattno);
+		else
+			context->unaggregated = bms_add_member(context->unaggregated,
+												   var->varattno);
 		return false;
 	}
-	if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+	if (IsA(node, Aggref))
 	{
Re: hashagg slowdown due to spill changes
Robert Haas writes:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra wrote:
>> But just reverting 4cad2534d will make this much worse, I think, as
>> illustrated by the benchmarks I did in [1].

> I share this concern, although I do not know what we should do about it.

Well, it's only June. Let's put it on the open issues list for v13 and
continue to think about it. I concur that the hashagg spill patch has
made this something that we should worry about for v13, so just
reverting without a better answer isn't very appetizing.

			regards, tom lane
Re: hashagg slowdown due to spill changes
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra wrote:
> But just reverting 4cad2534d will make this much worse, I think, as
> illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: hashagg slowdown due to spill changes
On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote:
> On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
> > I'm somewhat inclined to think that we should revert 4cad2534da6 and
> > then look at how precisely to tackle this in 14.
>
> I'm fine with that.

I don't see how we could just revert 4cad2534d and leave this for v14.
The hashagg spilling is IMHO almost guaranteed to be a pain point for
some users, as it will force some queries to serialize large amounts of
data. Yes, some of this is a cost of hashagg enforcing work_mem at
runtime, and I'm fine with that. We'd get reports about that too, but we
can justify that cost ...

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1]. And no, this is not really
fixable by tweaking the cost parameters - even with the current code
(i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on
the temp tablespace (on SATA RAID) to get good plans with parallelism
enabled. I haven't tried, but I presume without 4cad2534d I'd have to
push r_p_c even further ...

[1] https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development

> > It'd probably make sense to request small tlists when the number of
> > estimated groups is large, and not otherwise.
>
> That seems like a nice compromise that would be non-invasive, at least
> for create_agg_plan().

Maybe. It'd certainly be better than nothing. It's not clear to me what
a good threshold would be, though. And it's not going to handle cases of
under-estimates.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hashagg slowdown due to spill changes
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
> I'm somewhat inclined to think that we should revert 4cad2534da6 and
> then look at how precisely to tackle this in 14.

I'm fine with that.

> It'd probably make sense to request small tlists when the number of
> estimated groups is large, and not otherwise.

That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-12 15:29:08 -0700, Jeff Davis wrote:
> On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> > I don't see why it's ok to force an additional projection in the very
> > common case of hashaggs over a few rows. So I think we need to
> > rethink 4cad2534da6.
>
> One possibility is to project only spilled tuples, more similar to
> Melanie's patch from a while ago:
>
> https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com
>
> Which makes sense, but it's also more code.

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.

It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Sat, Jun 13, 2020 at 11:48:09AM -0700, Jeff Davis wrote:
> On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:
> > Do you think we should tackle this for 13? To me 4cad2534da seems
> > like a somewhat independent improvement to spillable hashaggs.
>
> We've gone back and forth on this issue a few times, so let's try to
> get some agreement before we revert 4cad2534da. I added Robert because
> he also seemed to think it was a reasonable idea.

I can't speak for Robert, but I hadn't expected the extra projection
cost to be this high. And I agree with Andres that it's not very nice
that we have to do this even for aggregates with just a handful of
groups that don't need to spill.

In any case, I think we need to address this somehow for v13 - either we
keep the 4cad2534da patch in, or we tweak the cost model to reflect the
extra I/O costs, or we project only when spilling. I'm not in a position
to whip up a patch soon, though :-(

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hashagg slowdown due to spill changes
On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:
> Do you think we should tackle this for 13? To me 4cad2534da seems like
> a somewhat independent improvement to spillable hashaggs.

We've gone back and forth on this issue a few times, so let's try to get
some agreement before we revert 4cad2534da. I added Robert because he
also seemed to think it was a reasonable idea.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote:
> I agree, we should revert 4cad2534da and only project tuples when we
> actually need to spill them.

There are cases where projecting helps for non-spilling aggregates too,
but only for the representative tuple. It doesn't help in the case at
hand, because there are just 5 hashtable entries but millions of rows,
so we're unnecessarily projecting all of those rows. But when there are
many different groups, it'd be different, because then the size of the
representative tuple can matter substantially.

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote:
> On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> > I don't see why it's ok to force an additional projection in the very
> > common case of hashaggs over a few rows. So I think we need to
> > rethink 4cad2534da6.
>
> One possibility is to project only spilled tuples, more similar to
> Melanie's patch from a while ago:
>
> https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com
>
> Which makes sense, but it's also more code.

I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them. Did any of the WIP patches actually
implement that, or do we need to write that patch from scratch?

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hashagg slowdown due to spill changes
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> I don't see why it's ok to force an additional projection in the very
> common case of hashaggs over a few rows. So I think we need to rethink
> 4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:

https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com

Which makes sense, but it's also more code.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > Did you run any performance tests?
>
> Yes, I reproduced your ~12% regression from V12, and this patch nearly
> eliminated it for me.

I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.

Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason why:
LookupTupleHashEntryHash() assigned the hash to the pointer provided by
the caller before doing the insertion. That ended up causing a pipeline
stall (I assume it's store forwarding, but not sure). Moving the
assignment to the caller's variable to after the insertion got rid of
that.

It got within 3-4% after that change. I did a number of small
microoptimizations that each helped, but didn't quite get to the level
of 12. Finally I figured out that that's due to an issue outside of
nodeAgg.c itself:

commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0
Author: Tomas Vondra
Date:   2020-05-31 14:43:13 +0200

    Use CP_SMALL_TLIST for hash aggregate

Due to this change we end up with an additional projection in queries
like this:

postgres[212666][1]=# \d fewgroups_many_rows
        Table "public.fewgroups_many_rows"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │  Type   │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ cat    │ integer │           │ not null │         │
│ val    │ integer │           │ not null │         │
└────────┴─────────┴───────────┴──────────┴─────────┘

postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                    QUERY PLAN                                     │
├───────────────────────────────────────────────────────────────────────────────────┤
│ HashAggregate  (cost=1942478.48..1942478.53 rows=5 width=12)                      │
│   Group Key: cat                                                                  │
│   ->  Seq Scan on fewgroups_many_rows  (cost=0.00..1442478.32 rows=10032 width=4) │
└───────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

as 'val' is "projected away"..

After neutering the tlist change, Jeff's patch and my changes to it
yield performance *above* v12.
I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

Greetings,

Andres Freund

diff --git i/src/include/executor/executor.h w/src/include/executor/executor.h
index c7deeac662f..415e117407c 100644
--- i/src/include/executor/executor.h
+++ w/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
 											 MemoryContext tempcxt,
 											 bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 										   TupleTableSlot *slot,
-										   bool *isnew);
+										   bool *isnew, uint32 *hash);
 extern uint32 TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot);
 extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
diff --git i/src/backend/executor/execGrouping.c w/src/backend/executor/execGrouping.c
index 8be36ca7634..1e582832ea0 100644
--- i/src/backend/executor/execGrouping.c
+++ w/src/backend/executor/execGrouping.c
@@ -22,11 +22,12 @@
 #include "utils/memutils.h"
 
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
-static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
+static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 										  const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-													TupleTableSlot *slot,
-													bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+	TupleHashTable hashtable,
+	TupleTableSlot *slot,
+	bool *isnew, uint32 hash);
 
 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches. On return, *isnew is true if the entry is newly created,
  * false if it existed already.  ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 LookupTupleH
Re: hashagg slowdown due to spill changes
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> Did you run any performance tests?

Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
Hi,

On June 10, 2020 6:15:39 PM PDT, Jeff Davis wrote:
>On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
>> Hi,
>>
>> While preparing my pgcon talk I noticed that our hash-agg performance
>> degraded noticeably. Looks to me like it's due to the spilling-hashagg
>> changes.
>
>Attached a proposed fix. (Might require some minor cleanup).
>
>The only awkward part is that LookupTupleHashEntry() needs a new out
>parameter to pass the hash value back to the caller. Ordinarily, the
>caller can get that from the returned entry, but if isnew==NULL, then
>the function might return NULL (and the caller wouldn't have an entry
>from which to read the hash value).

Great! Did you run any performance tests?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: hashagg slowdown due to spill changes
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Hi,
>
> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-hashagg
> changes.

Attached a proposed fix. (Might require some minor cleanup).

The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).

Regards,
	Jeff Davis

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 8be36ca7634..27dbf264766 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -24,9 +24,10 @@ static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
 static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
                                           const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-                                                    TupleTableSlot *slot,
-                                                    bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+    TupleHashTable hashtable,
+    TupleTableSlot *slot,
+    bool *isnew, uint32 hash);

 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches. On return, *isnew is true if the entry is newly created,
  * false if it existed already. ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
-                     bool *isnew)
+                     bool *isnew, uint32 *hash)
 {
     TupleHashEntry entry;
     MemoryContext oldContext;
-    uint32      hash;
+    uint32      local_hash;

     /* Need to run the hash functions in short-lived context */
     oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +316,12 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
     hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
     hashtable->cur_eq_func = hashtable->tab_eq_func;

-    hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
-    entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+    local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+    if (hash != NULL)
+        *hash = local_hash;
+
+    entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+    Assert(entry == NULL || entry->hash == local_hash);

     MemoryContextSwitchTo(oldContext);

@@ -362,6 +370,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
     hashtable->cur_eq_func = hashtable->tab_eq_func;

     entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+    Assert(entry == NULL || entry->hash == hash);

     MemoryContextSwitchTo(oldContext);

@@ -480,7 +489,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
  * NB: This function may or may not change the memory context. Caller is
  * expected to change it back.
  */
-static TupleHashEntry
+static inline TupleHashEntry
 LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
                               bool *isnew, uint32 hash)
 {
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..0447d473c82 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -403,8 +403,9 @@ static int hash_choose_num_partitions(uint64 input_groups,
                                            double hashentrysize,
                                            int used_bits,
                                            int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
-                                          bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+                                  TupleHashTable hashtable,
+                                  TupleHashEntry entry);
 static void lookup_hash_entries(AggState *aggstate);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
 static void agg_fill_hash_table(AggState *aggstate);
@@ -1979,75 +1980,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
 }

 /*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *
-
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-08 13:41:29 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> > Before there was basically one call from nodeAgg.c to execGrouping.c
> > for each tuple and hash table. Now it's a lot more complicated:
> > 1) nodeAgg.c: prepare_hash_slot()
> > 2) execGrouping.c: TupleHashTableHash()
> > 3) nodeAgg.c: lookup_hash_entry()
> > 4) execGrouping.c: LookupTupleHashEntryHash()
>
> The reason that I did it that way was to be able to store the hash
> along with the saved tuple (similar to what HashJoin does), which
> avoids recalculation.

That makes sense. But then you can just use a separate call into
execGrouping for that purpose.

> > Why isn't the flow more like this:
> > 1) prepare_hash_slot()
> > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> > 3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
> > 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)
>
> I'll work up a patch to refactor this. I'd still like to see if we can
> preserve the calculate-hash-once behavior somehow.

Cool!

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
> 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

I see, you are suggesting that I change around the execGrouping.c
signatures to return the hash, which will avoid the extra call. That
makes more sense.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Before there was basically one call from nodeAgg.c to execGrouping.c
> for each tuple and hash table. Now it's a lot more complicated:
> 1) nodeAgg.c: prepare_hash_slot()
> 2) execGrouping.c: TupleHashTableHash()
> 3) nodeAgg.c: lookup_hash_entry()
> 4) execGrouping.c: LookupTupleHashEntryHash()

The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation. That could be a nice savings for some cases, like
when work_mem is small but the data still fits in system memory, which
I expect to be fairly common. But based on your numbers, it might be a
bad trade-off overall.

> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
> 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.

Regards,
	Jeff Davis
Re: hashagg slowdown due to spill changes
On 2020-Jun-05, Andres Freund wrote:

> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-hashagg
> changes.

Jeff, what are your thoughts on this?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
hashagg slowdown due to spill changes
Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0
-c max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS
    SELECT (random() * 4)::int cat, (random()*1)::int val
    FROM generate_series(1, 1);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

12      12221.031 ms
master  13855.129 ms

While not the end of the world, that's a definitely noticeable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized.

Here's a perf diff of profiles for both versions:

# Baseline  Delta Abs  Shared Object  Symbol
# ........  .........  .............  ..........................................
              +6.70%   postgres       [.] LookupTupleHashEntryHash
              +6.37%   postgres       [.] prepare_hash_slot
              +4.74%   postgres       [.] TupleHashTableHash_internal.isra.0
    20.36%    -2.89%   postgres       [.] ExecInterpExpr
     6.31%    -2.73%   postgres       [.] lookup_hash_entries
              +2.36%   postgres       [.] lookup_hash_entry
              +2.14%   postgres       [.] ExecJustAssignScanVar
     2.28%    +1.97%   postgres       [.] ExecScan
     2.54%    +1.93%   postgres       [.] MemoryContextReset
     3.84%    -1.86%   postgres       [.] SeqNext
    10.19%    -1.50%   postgres       [.] tts_buffer_heap_getsomeattrs
              +1.42%   postgres       [.] hash_bytes_uint32
              +1.39%   postgres       [.] TupleHashTableHash
              +1.10%   postgres       [.] tts_virtual_clear
     3.36%    -0.74%   postgres       [.] ExecAgg
              +0.45%   postgres       [.] CheckForSerializableConflictOutNeeded
     0.25%    +0.44%   postgres       [.] hashint4
     5.80%    -0.35%   postgres       [.] tts_minimal_getsomeattrs
     1.91%    -0.33%   postgres       [.] heap_getnextslot
     4.86%    -0.32%   postgres       [.] heapgettup_pagemode
     1.46%    -0.32%   postgres       [.] tts_minimal_clear

While some of this is likely just noise, it's pretty clear that we spend
a substantial amount of additional time below lookup_hash_entries().

And looking at the code, I'm not too surprised. Before there was
basically one call from nodeAgg.c to execGrouping.c for each tuple and
hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these, data needs to be peeled out of one or more of
AggState / AggStatePerHashData / TupleHashTable. There's no way the
compiler can know that nothing inside those structures changes,
therefore it has to reload the contents repeatedly. By my look at the
profiles, that's where most of the time is going.

There's also the issue that the signalling whether to insert / not to
insert got unnecessarily complicated. There's several checks:
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;)
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() (if (!in_hash_table))

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?

Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, and there are far
fewer branches...

Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.

Greetings,

Andres Freund