Re: hashagg slowdown due to spill changes

2020-07-27 Thread Peter Geoghegan
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis  wrote:
> I saw less of an improvement than you or Andres (perhaps just more
> noise). But given that both you and Andres are reporting a measurable
> improvement, then I went ahead and committed it. Thank you.

Thanks!

-- 
Peter Geoghegan




Re: hashagg slowdown due to spill changes

2020-07-26 Thread Jeff Davis
On Sat, 2020-07-25 at 15:08 -0700, Peter Geoghegan wrote:
> I find that Andres original "SELECT cat, count(*) FROM
> fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
> the patch. Without the patch, v13 takes ~11.46 seconds. With the
> patch, it takes only ~10.64 seconds.

I saw less of an improvement than you or Andres (perhaps just more
noise). But given that both you and Andres are reporting a measurable
improvement, then I went ahead and committed it. Thank you.

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan  wrote:
> I have added a new open item for this separate
> LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

Attached is a rebased version of Andres' now-bitrot 2020-06-12 patch
("aggspeed.diff").

I find that Andres original "SELECT cat, count(*) FROM
fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
the patch. Without the patch, v13 takes ~11.46 seconds. With the
patch, it takes only ~10.64 seconds.

Didn't test it against v12 yet, but I have no reason to doubt Andres'
explanation. I gather that if we can get this patch committed, we can
close the relevant LookupTupleHashEntryHash() open item.

Can you take this off my hands, Jeff?

Thanks
-- 
Peter Geoghegan


0001-Fix-LookupTupleHashEntryHash-pipeline-stall-issue.patch
Description: Binary data


Re: hashagg slowdown due to spill changes

2020-07-25 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund  wrote:
> This is still not resolved. We're right now slower than 12. It's
> effectively not possible to do performance comparisons right now. This
> was nearly two months ago.

I have added a new open item for this separate
LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

(For the record I mistakenly believed that commit 23023022 resolved
all of the concerns raised on this thread, which is why I closed out
the open item associated with this thread. Evidently work remains to
fix a remaining regression that affects simple in-memory hash
aggregation, though.)

-- 
Peter Geoghegan




Re: hashagg slowdown due to spill changes

2020-07-24 Thread Andres Freund
Hi,

On 2020-06-12 14:37:15 -0700, Andres Freund wrote:
> On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > > Did you run any performance tests?
> >
> > Yes, I reproduced your ~12% regression from V12, and this patch nearly
> > eliminated it for me.
> 
> I spent a fair bit of time looking at the difference. Jeff had let me
> know on chat that he was still seeing some difference, but couldn't
> quite figure out where that was.
> 
> Trying it out myself, I observed that the patch helped, but not that
> much. After a bit I found one major reason for why:
> LookupTupleHashEntryHash() assigned the hash to pointer provided by the
> caller's before doing the insertion. That ended up causing a pipeline
> stall (I assume it's store forwarding, but not sure). Moving the
> assignment to the caller variable to after the insertion got rid of
> that.

This is still not resolved. We're right now slower than 12. It's
effectively not possible to do performance comparisons right now. This
was nearly two months ago.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-25 Thread Tomas Vondra

On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 10:06 AM Andres Freund  wrote:


Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund 
wrote:
> > It's not this patch's fault, but none, really none, of this stuff
should
> > be in the executor.
> >
> >
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.



So, my summary of this point in the context of the other discussion
upthread is:

Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.



IMO we should extract the columns irrespectedly of the estimates,
otherwise we won't be able to handle underestimates efficiently.



Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).



Yeah. That means that if we have to start batching during execution, we
may need to spill much more datai. I'd say that's a hashjoin issue that
we should fix too (in v14).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: hashagg slowdown due to spill changes

2020-06-24 Thread Melanie Plageman
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund  wrote:

> Hi,
>
> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund 
> wrote:
> > > It's not this patch's fault, but none, really none, of this stuff
> should
> > > be in the executor.
> > >
> > >
> > Were you thinking it could be done in grouping_planner() and then the
> > bitmaps could be saved in the PlannedStmt?
> > Or would you have to wait until query_planner()? Or are you imagining
> > somewhere else entirely?
>
> I haven't thought about it in too much detail, but I would say
> create_agg_plan() et al. I guess there's some argument to be made to do
> it in setrefs.c, because we already do convert_combining_aggrefs() there
> (but I don't like that much).
>
> There's no reason to do it before we actually decided on one specific
> path, so doing it earlier than create_plan() seems unnecessary. And
> having it in agg specific code seems better than putting it into global
> routines.
>
> There's probably an argument for having a bit more shared code between
> create_agg_plan(), create_group_plan() and
> create_groupingsets_plan(). But even just adding a new extract_*_cols()
> call to each of those would probably be ok.
>
>
So, my summary of this point in the context of the other discussion
upthread is:

Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.

So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references().

Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.

It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.

Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.

Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().

Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).

-- 
Melanie Plageman
From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner

Find the columns needed the executor will need to know about for hashagg
and bifurcate them into aggregated and unaggregated column bitmapsets.
Save them in the plan tree for use during execution.

This is done after set_plan_refs() so all the references are correct.
---
 src/backend/executor/nodeAgg.c   | 117 +--
 src/backend/nodes/copyfuncs.c|   2 +
 src/backend/nodes/outfuncs.c |   2 +
 src/backend/nodes/readfuncs.c|   2 +
 src/backend/optimizer/plan/setrefs.c |  69 
 src/include/nodes/execnodes.h|   8 +-
 src/include/nodes/plannodes.h|   2 +
 7 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool	   is_aggref;		/* is under an aggref */
+	Bitmapset *aggregated;		/* column references under an aggref */
+	Bitmapset *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void 

Re: hashagg slowdown due to spill changes

2020-06-23 Thread Andres Freund
Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund  wrote:
> > It's not this patch's fault, but none, really none, of this stuff should
> > be in the executor.
> >
> >
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-23 Thread Melanie Plageman
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund  wrote:

>
> >  /*
> > - * find_unaggregated_cols
> > - * Construct a bitmapset of the column numbers of un-aggregated Vars
> > - * appearing in our targetlist and qual (HAVING clause)
> > + * Walk tlist and qual to find referenced colnos, dividing them into
> > + * aggregated and unaggregated sets.
> >   */
> > -static Bitmapset *
> > -find_unaggregated_cols(AggState *aggstate)
> > +static void
> > +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset
> **unaggregated)
> >  {
>
> It's not this patch's fault, but none, really none, of this stuff should
> be in the executor.
>
>
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?

-- 
Melanie Plageman


Re: hashagg slowdown due to spill changes

2020-06-22 Thread Andres Freund
Hi,

I think it'd be good to get the changes that aren't related to
projection merged. As far as I can tell there's performance regressions
both because of the things I'd listed upthread, and due to the
projection issue.  That's not obvious because because we first won
performance and then lost it again in several incremental steps.

COPY (SELECT (random() * 4)::int cat, (random()*1)::int val FROM 
generate_series(1, 1)) TO '/tmp/data' WITH BINARY;
BEGIN;
DROP TABLE IF EXISTS fewgroups_many_rows;
CREATE TABLE fewgroups_many_rows(cat int4 not null, val int4 not null);
COPY fewgroups_many_rows FROM '/tmp/data' WITH (FORMAT BINARY, FREEZE);
COMMIT;
VACUUM FREEZE fewgroups_many_rows;

Test prep:

Test query:
SET seed=0;SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
(the seed seems to reduce noise due to hashtable iv being the same)

best of six:
9e1c9f959422192bbe1b842a2a1ffaf76b08019612031.906 ms
d52eaa094847d395f942827a6f413904e516994c12045.487 ms
ac88807f9b227ddcd92b8be9a053094837c1b99a11950.006 ms
36d22dd95bc87ca68e742da91f47f8826f8758c911769.991 ms
5ac4e9a12c6543414891cd8972b2cd36a08e40cc11551.932 ms
1fdb7f9789c4550204cd62d1746a7deed1dc4c2911706.948 ms
4eaea3db150af56aa2e40efe91997fd25f3b6d7311999.908 ms
11de6c903da99a4b2220acfa776fc26c7f384ccc11999.054 ms
b7fabe80df9a65010bfe5e5d0a979bacebfec38212165.463 ms
2742c45080077ed3b08b810bb96341499b86d53012137.505 ms
1f39bce021540fde00990af55b4432c55ef4b3c712501.764 ms
9b60c4b979bce060495e2b05ba01d1cc6bffdd2d12389.047 ms
4cad2534da6d17067d98cf04be2dfc1bda8f2cd013319.786 ms
1b2c29469a58cd9086bd86e20c708eb437564a8013330.616 ms

There's certainly some noise in here, but I think the trends are valid.


>  /*
> - * find_unaggregated_cols
> - * Construct a bitmapset of the column numbers of un-aggregated Vars
> - * appearing in our targetlist and qual (HAVING clause)
> + * Walk tlist and qual to find referenced colnos, dividing them into
> + * aggregated and unaggregated sets.
>   */
> -static Bitmapset *
> -find_unaggregated_cols(AggState *aggstate)
> +static void
> +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset 
> **unaggregated)
>  {

It's not this patch's fault, but none, really none, of this stuff should
be in the executor.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-16 Thread Tomas Vondra

On Mon, Jun 15, 2020 at 07:38:45PM -0700, Jeff Davis wrote:

On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:

On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
 wrote:
> But just reverting 4cad2534d will make this much worse, I think, as
> illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about
it.


I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.

* I used a flag in the context rather than two separate callbacks to
the expression walker.

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.



Looks reasonable. I can't redo my tests at the moment, the machine is
busy with something else. I'll give it a try over the weekend.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Jeff Davis
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
>  wrote:
> > But just reverting 4cad2534d will make this much worse, I think, as
> > illustrated by the benchmarks I did in [1].
> 
> I share this concern, although I do not know what we should do about
> it.

I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.

* I used a flag in the context rather than two separate callbacks to
the expression walker.

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..5b5aa96c517 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -358,6 +358,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool	   is_aggref;		/* is under an aggref */
+	Bitmapset *aggregated;		/* column references under an aggref */
+	Bitmapset *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -390,8 +398,9 @@ static void finalize_aggregates(AggState *aggstate,
 AggStatePerAgg peragg,
 AggStatePerGroup pergroup);
 static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void find_cols(AggState *aggstate, Bitmapset **aggregated,
+	  Bitmapset **unaggregated);
+static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
 static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
@@ -424,8 +433,8 @@ static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
 			   int used_bits, uint64 input_tuples,
 			   double hashentrysize);
-static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
-uint32 hash);
+static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+TupleTableSlot *slot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
  int setno);
 static void hashagg_tapeinfo_init(AggState *aggstate);
@@ -1374,26 +1383,28 @@ project_aggregates(AggState *aggstate)
 }
 
 /*
- * find_unaggregated_cols
- *	  Construct a bitmapset of the column numbers of un-aggregated Vars
- *	  appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
  */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
 {
-	Agg		   *node = (Agg *) aggstate->ss.ps.plan;
-	Bitmapset  *colnos;
-
-	colnos = NULL;
-	(void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
-		 );
-	(void) find_unaggregated_cols_walker((Node *) node->plan.qual,
-		 );
-	return colnos;
+	Agg *agg = (Agg *) aggstate->ss.ps.plan;
+	FindColsContext context;
+
+	context.is_aggref = false;
+	context.aggregated = NULL;
+	context.unaggregated = NULL;
+
+	(void) find_cols_walker((Node *) agg->plan.targetlist, );
+	(void) find_cols_walker((Node *) agg->plan.qual, );
+
+	*aggregated = context.aggregated;
+	*unaggregated = context.unaggregated;
 }
 
 static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
+find_cols_walker(Node *node, FindColsContext *context)
 {
 	if (node == NULL)
 		return false;
@@ -1404,16 +1415,24 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
 		/* setrefs.c should have set the varno to OUTER_VAR */
 		Assert(var->varno == OUTER_VAR);
 		Assert(var->varlevelsup == 0);
-		*colnos = bms_add_member(*colnos, var->varattno);
+		if (context->is_aggref)
+			context->aggregated = bms_add_member(context->aggregated,
+ var->varattno);
+		else
+			context->unaggregated = bms_add_member(context->unaggregated,
+   var->varattno);
 		return false;
 	}
-	if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+	if (IsA(node, Aggref))
 	{
-		/* do not descend into 

Re: hashagg slowdown due to spill changes

2020-06-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
>  wrote:
>> But just reverting 4cad2534d will make this much worse, I think, as
>> illustrated by the benchmarks I did in [1].

> I share this concern, although I do not know what we should do about it.

Well, it's only June.  Let's put it on the open issues list for v13
and continue to think about it.  I concur that the hashagg spill patch
has made this something that we should worry about for v13, so just
reverting without a better answer isn't very appetizing.

regards, tom lane




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Robert Haas
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
 wrote:
> But just reverting 4cad2534d will make this much worse, I think, as
> illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Tomas Vondra

On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote:

On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.


I'm fine with that.



I don't see how we could just revert 4cad2534d and leave this for v14.

The hashagg spilling is IMHO almost guaranteed to be a pain point for
some users, as it will force some queries to serialize large amounts of
data. Yes, some of this is a cost for hashagg enforcing work_mem at
runtime, I'm fine with that. We'd get reports about that too, but we can
justify that cost ...

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1]. And no, this is not really
fixable by tweaking the cost parameters - even with the current code
(i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on
the temp tablespace (on SATA RAID) to get good plans with parallelism
enabled. I haven't tried, but I presume without 4cad2534d I'd have to
push r_p_c even further ...

[1] 
https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development


It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.


That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().



Maybe. It'd certainly better than nothing. It's not clear to me what
would a good threshold be, though. And it's not going to handle cases of
under-estimates.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: hashagg slowdown due to spill changes

2020-06-15 Thread Jeff Davis
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
> I'm somewhat inclined to think that we should revert 4cad2534da6 and
> then look at how precisely to tackle this in 14.

I'm fine with that.

> It'd probably make sense to request small tlists when the number of
> estimated groups is large, and not otherwise.

That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-06-14 Thread Andres Freund
Hi,

On 2020-06-12 15:29:08 -0700, Jeff Davis wrote:
> On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> > I don't see why it's ok to force an additional projection in the very
> > common case of hashaggs over a few rows. So I think we need to
> > rethink
> > 4cad2534da6.
> 
> One possibility is to project only spilled tuples, more similar to
> Melanie's patch from a while ago:
> 
> 
> https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com
> 
> Which makes sense, but it's also more code.

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.

It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-13 Thread Tomas Vondra

On Sat, Jun 13, 2020 at 11:48:09AM -0700, Jeff Davis wrote:

On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:

Do you think we should tackle this for 13? To me 4cad2534da seems
like a
somewhat independent improvement to spillable hashaggs.


We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.



I can't speak for Robert, but I haven't expected the extra projection
would be this high. And I agree with Andres it's not very nice we have
to do this even for aggregates with just a handful of groups that don't
need to spill.

In any case, I think we need to address this somehow for v13 - either we
keep the 4cad2534da patch in, or we tweak the cost model to reflect the
extra I/O costs, or we project only when spilling.

I'm not in a position to whip up a patch soon, though :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: hashagg slowdown due to spill changes

2020-06-13 Thread Jeff Davis
On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:
> Do you think we should tackle this for 13? To me 4cad2534da seems
> like a
> somewhat independent improvement to spillable hashaggs.

We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-06-12 Thread Andres Freund
Hi,

On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote:
> I agree, we should revert 4cad2534da and only project tuples when we
> actually need to spill them.

There are cases where projecting helps for non-spilling aggregates too,
but only for the representative tuple. It doesn't help in the case at
hand, because there's just 5 hashtable entries but millions of rows. So
we're unnecessarily projecting all-5 rows. But when there are many
different groups, it'd be different, because then the size of the
representative tuple can matter substantially.

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-12 Thread Tomas Vondra

On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote:

On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:

I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to
rethink
4cad2534da6.


One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:


https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com

Which makes sense, but it's also more code.



I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them. Did any of the WIP patches actually
implement that, or do we need to write that patch from scratch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: hashagg slowdown due to spill changes

2020-06-12 Thread Jeff Davis
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> I don't see why it's ok to force an additional projection in the very
> common case of hashaggs over a few rows. So I think we need to
> rethink
> 4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:


https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com

Which makes sense, but it's also more code.

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-06-12 Thread Andres Freund
Hi,

On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > Did you run any performance tests?
>
> Yes, I reproduced your ~12% regression from V12, and this patch nearly
> eliminated it for me.

I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.

Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason for why:
LookupTupleHashEntryHash() assigned the hash to pointer provided by the
caller's before doing the insertion. That ended up causing a pipeline
stall (I assume it's store forwarding, but not sure). Moving the
assignment to the caller variable to after the insertion got rid of
that.

It got within 3-4% after that change. I did a number of small
microoptimizations that each helped, but didn't get quite get to the
level of 12.

Finally I figured out that that's due to an issue outside of nodeAgg.c
itself:

commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0
Author: Tomas Vondra 
Date:   2020-05-31 14:43:13 +0200

Use CP_SMALL_TLIST for hash aggregate

Due to this change we end up with an additional projection in queries
like this:

postgres[212666][1]=# \d fewgroups_many_rows
 Table "public.fewgroups_many_rows"
┌┬─┬───┬──┬─┐
│ Column │  Type   │ Collation │ Nullable │ Default │
├┼─┼───┼──┼─┤
│ cat│ integer │   │ not null │ │
│ val│ integer │   │ not null │ │
└┴─┴───┴──┴─┘

postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows 
GROUP BY 1;
┌───┐
│  QUERY PLAN   
│
├───┤
│ HashAggregate  (cost=1942478.48..1942478.53 rows=5 width=12)  
│
│   Group Key: cat  
│
│   ->  Seq Scan on fewgroups_many_rows  (cost=0.00..1442478.32 rows=10032 
width=4) │
└───┘
(3 rows)

as 'val' is "projected away"..


After neutering the tlist change, Jeff's patch and my changes to it
yield performance *above* v12.


I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

Greetings,

Andres Freund
diff --git i/src/include/executor/executor.h w/src/include/executor/executor.h
index c7deeac662f..415e117407c 100644
--- i/src/include/executor/executor.h
+++ w/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
 			 MemoryContext tempcxt, bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 		   TupleTableSlot *slot,
-		   bool *isnew);
+		   bool *isnew, uint32 *hash);
 extern uint32 TupleHashTableHash(TupleHashTable hashtable,
  TupleTableSlot *slot);
 extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
diff --git i/src/backend/executor/execGrouping.c w/src/backend/executor/execGrouping.c
index 8be36ca7634..1e582832ea0 100644
--- i/src/backend/executor/execGrouping.c
+++ w/src/backend/executor/execGrouping.c
@@ -22,11 +22,12 @@
 #include "utils/memutils.h"
 
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
-static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
+static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 		  const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-	TupleTableSlot *slot,
-	bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+	TupleHashTable hashtable,
+	TupleTableSlot *slot,
+	bool *isnew, uint32 hash);
 
 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches.  On return, *isnew is true if the entry is newly created,
  * false if it existed already.  ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 

Re: hashagg slowdown due to spill changes

2020-06-11 Thread Jeff Davis
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> Did you run any performance tests?

Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-06-11 Thread Andres Freund
Hi, 

On June 10, 2020 6:15:39 PM PDT, Jeff Davis  wrote:
>On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
>> Hi,
>> 
>> While preparing my pgcon talk I noticed that our hash-agg performance
>> degraded noticeably. Looks to me like it's due to the spilling-
>> hashagg
>> changes.
>
>Attached a proposed fix. (Might require some minor cleanup).
>
>The only awkward part is that LookupTupleHashEntry() needs a new out
>parameter to pass the hash value back to the caller. Ordinarily, the
>caller can get that from the returned entry, but if isnew==NULL, then
>the function might return NULL (and the caller wouldn't have an entry
>from which to read the hash value).

Great!

Did you run any performance tests?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: hashagg slowdown due to spill changes

2020-06-10 Thread Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Hi,
> 
> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-
> hashagg
> changes.

Attached a proposed fix. (Might require some minor cleanup).

The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).

Regards,
Jeff Davis

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 8be36ca7634..27dbf264766 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -24,9 +24,10 @@
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
 static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 		  const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-	TupleTableSlot *slot,
-	bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+	TupleHashTable hashtable,
+	TupleTableSlot *slot,
+	bool *isnew, uint32 hash);
 
 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches.  On return, *isnew is true if the entry is newly created,
  * false if it existed already.  ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
-	 bool *isnew)
+	 bool *isnew, uint32 *hash)
 {
 	TupleHashEntry entry;
 	MemoryContext oldContext;
-	uint32		hash;
+	uint32		local_hash;
 
 	/* Need to run the hash functions in short-lived context */
 	oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +316,12 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
 	hashtable->cur_eq_func = hashtable->tab_eq_func;
 
-	hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
-	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+	local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+	if (hash != NULL)
+		*hash = local_hash;
+
+	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+	Assert(entry == NULL || entry->hash == local_hash);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -362,6 +370,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->cur_eq_func = hashtable->tab_eq_func;
 
 	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+	Assert(entry == NULL || entry->hash == hash);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -480,7 +489,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
  * NB: This function may or may not change the memory context. Caller is
  * expected to change it back.
  */
-static TupleHashEntry
+static inline TupleHashEntry
 LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
 			  bool *isnew, uint32 hash)
 {
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..0447d473c82 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -403,8 +403,9 @@ static int	hash_choose_num_partitions(uint64 input_groups,
 	   double hashentrysize,
 	   int used_bits,
 	   int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
-		  bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+  TupleHashTable hashtable,
+  TupleHashEntry entry);
 static void lookup_hash_entries(AggState *aggstate);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
 static void agg_fill_hash_table(AggState *aggstate);
@@ -1979,75 +1980,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
 }
 
 /*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *

Re: hashagg slowdown due to spill changes

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-08 13:41:29 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> > Before there was basically one call from nodeAgg.c to execGrouping.c
> > for
> > each tuple and hash table. Now it's a lot more complicated:
> > 1) nodeAgg.c: prepare_hash_slot()
> > 2) execGrouping.c: TupleHashTableHash()
> > 3) nodeAgg.c: lookup_hash_entry()
> > 4) execGrouping.c: LookupTupleHashEntryHash()
> 
> The reason that I did it that way was to be able to store the hash
> along with the saved tuple (similar to what HashJoin does), which
> avoids recalculation.

That makes sense. But then you can just use a separate call into
execGrouping for that purpose.


> > Why isn't the flow more like this:
> > 1) prepare_hash_slot()
> > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> > 3) entry = LookupTupleHashEntry(); if (!entry)
> > hashagg_spill_tuple();
> > 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)
> 
> I'll work up a patch to refactor this. I'd still like to see if we can
> preserve the calculate-hash-once behavior somehow.

Cool!

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-08 Thread Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(); if (!entry)
> hashagg_spill_tuple();
> 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

I see, you are suggesting that I change around the execGrouping.c
signatures to return the hash, which will avoid the extra call. That
makes more sense.

Regards,
Jeff Davis








Re: hashagg slowdown due to spill changes

2020-06-08 Thread Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Before there was basically one call from nodeAgg.c to execGrouping.c
> for
> each tuple and hash table. Now it's a lot more complicated:
> 1) nodeAgg.c: prepare_hash_slot()
> 2) execGrouping.c: TupleHashTableHash()
> 3) nodeAgg.c: lookup_hash_entry()
> 4) execGrouping.c: LookupTupleHashEntryHash()

The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.

That could be a nice savings for some cases, like when work_mem is
small but the data still fits in system memory, which I expect to be
fairly common. But based on your numbers, it might be a bad trade-off
overall.

> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(); if (!entry)
> hashagg_spill_tuple();
> 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.

Regards,
Jeff Davis
 





Re: hashagg slowdown due to spill changes

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-05, Andres Freund wrote:

> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-hashagg
> changes.

Jeff, what are your thoughts on this?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




hashagg slowdown due to spill changes

2020-06-05 Thread Andres Freund
Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0 -c 
max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS SELECT (random() * 4)::int cat, 
(random()*1)::int val FROM generate_series(1, 1);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;SELECT 
pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

12  12221.031 ms
master  13855.129 ms

While not the end of the world, that's a definitely noticable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:

# Baseline  Delta Abs  Shared Object Symbol
#   .    
.
#
   +6.70%  postgres  [.] LookupTupleHashEntryHash
   +6.37%  postgres  [.] prepare_hash_slot
   +4.74%  postgres  [.] TupleHashTableHash_internal.isra.0
20.36% -2.89%  postgres  [.] ExecInterpExpr
 6.31% -2.73%  postgres  [.] lookup_hash_entries
   +2.36%  postgres  [.] lookup_hash_entry
   +2.14%  postgres  [.] ExecJustAssignScanVar
 2.28% +1.97%  postgres  [.] ExecScan
 2.54% +1.93%  postgres  [.] MemoryContextReset
 3.84% -1.86%  postgres  [.] SeqNext
10.19% -1.50%  postgres  [.] tts_buffer_heap_getsomeattrs
   +1.42%  postgres  [.] hash_bytes_uint32
   +1.39%  postgres  [.] TupleHashTableHash
   +1.10%  postgres  [.] tts_virtual_clear
 3.36% -0.74%  postgres  [.] ExecAgg
   +0.45%  postgres  [.] 
CheckForSerializableConflictOutNeeded
 0.25% +0.44%  postgres  [.] hashint4
 5.80% -0.35%  postgres  [.] tts_minimal_getsomeattrs
 1.91% -0.33%  postgres  [.] heap_getnextslot
 4.86% -0.32%  postgres  [.] heapgettup_pagemode
 1.46% -0.32%  postgres  [.] tts_minimal_clear

While some of this is likely is just noise, it's pretty clear that we
spend a substantial amount of additional time below
lookup_hash_entries().

And looking at the code, I'm not too surprised:

Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these data needs to be peeled out of one or more of AggState
/ AggStatePerHashData / TupleHashTable. There's no way the compiler can
know that nothing inside those changes, therefore it has to reload the
contents repeatedly.  By my look at the profiles, that's where most of
the time is going.

There's also the issue that the signalling whether to insert / not to
insert got unnecessarily complicated. There's several checks:
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : )
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() if (!in_hash_table)

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?


Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, there's far fewer
branches...

Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.

Greetings,

Andres Freund