significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Pavel Stehule
Hi

One Czech Postgres user reported a performance issue related to the speed
of HashAggregate in a nested loop.

The speed of 9.6

HashAggregate  (cost=27586.10..27728.66 rows=14256 width=24)
(actual time=0.003..0.049 rows=39 loops=599203)

The speed of 10.7

HashAggregate  (cost=27336.78..27552.78 rows=21600 width=24)
(actual time=0.011..0.156 rows=38 loops=597137)

So it looks like HashAgg is about 3x slower - with a brutal nested loop that is
a problem.

I wrote a simple benchmark, and it really looks like our hash aggregate has
been getting slower and slower.

create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()*2,
random()*100, random()*100 from generate_series(1,2000000);

analyze foo;

9.6.7
postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.276 ms
 Execution time: 3365.758 ms
(14 rows)

10.9

postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.829 ms
 Execution time: 4920.800 ms
(14 rows)

master


postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select c


Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Tomas Vondra

Hi,

Not sure what the root cause is, but I can reproduce it. Timings for 9.6,
10 and master (all built from git with the same options) without explain
analyze look like this:

9.6
-
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
-
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
-
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

  select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right it has
something to do with the looping.

Profiles from those versions look like this:

9.6
-
Samples
Overhead  Shared Object  Symbol
  14.19%  postgres  [.] ExecMakeFunctionResultNoSets
  13.65%  postgres  [.] finalize_aggregates
  12.54%  postgres  [.] hash_seq_search
   6.70%  postgres  [.] finalize_aggregate.isra.0
   5.71%  postgres  [.] ExecEvalParamExec
   5.54%  postgres  [.] ExecEvalAggref
   5.00%  postgres  [.] ExecStoreMinimalTuple
   4.34%  postgres  [.] ExecAgg
   4.08%  postgres  [.] ExecQual
   2.67%  postgres  [.] slot_deform_tuple
   2.24%  postgres  [.] pgstat_init_function_usage
   2.22%  postgres  [.] check_stack_depth
   2.14%  postgres  [.] MemoryContextReset
   1.89%  postgres  [.] hash_search_with_hash_value
   1.72%  postgres  [.] project_aggregates
   1.68%  postgres  [.] pgstat_end_function_usage
   1.59%  postgres  [.] slot_getattr


10

Samples
Overhead  Shared Object   Symbol
  15.18%  postgres[.] slot_deform_tuple
  13.09%  postgres[.] agg_retrieve_hash_table
  12.02%  postgres[.] ExecInterpExpr
   7.47%  postgres[.] finalize_aggregates
   7.38%  postgres[.] tuplehash_iterate
   5.13%  postgres[.] prepare_projection_slot
   4.86%  postgres[.] finalize_aggregate.isra.0
   4.05%  postgres[.] bms_is_member
   3.97%  postgres[.] slot_getallattrs
   3.59%  postgres[.] ExecStoreMinimalTuple
   2.85%  postgres[.] project_aggregates
   1.95%  postgres[.] ExecClearTuple
   1.71%  libc-2.30.so[.] __memset_avx2_unaligned_erms
   1.69%  postgres[.] ExecEvalParamExec
   1.58%  postgres[.] MemoryContextReset
   1.17%  postgres[.] slot_getattr
   1.03%  postgres[.] slot_getsomeattrs


master
--
Samples
Overhead  Shared Object   Symbol
  17.07%  postgres[.] agg_retrieve_hash_table
  15.46%  postgres[.] tuplehash_iterate
  11.83%  postgres[.] tts_minimal_getsomeattrs
   9.39%  postgres[.] ExecInterpExpr
   6.94%  postgres[.] prepare_projection_slot
   4.85%  postgres[.] finalize_aggregates
   4.27%  postgres[.] bms_is_member
   3.80%  postgres[.] finalize_aggregate.isra.0
   3.80%  postgres[.] tts_minimal_store_tuple
   2.22%  postgres[.] project_aggregates
   2.07%  postgres[.] tts_virtual_clear
   2.07%  postgres[.] MemoryContextReset
   1.78%  postgres[.] tts_minimal_clear
   1.61%  postgres[.] ExecEvalParamExec
   1.46%  postgres[.] slot_getsomeattrs_int
   1.34%  libc-2.30.so[.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason. 



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> So there seems to be +40% between 9.6 and 10, and further +25% between
> 10 and master. However, plain hashagg, measured e.g. like this:

Ugh.

Since I am a likely culprit, I'm taking a look.


> Not sure what to think about this. Seems slot_deform_tuple got way more
> expensive between 9.6 and 10, for some reason.

Might indicate too many calls instead. Or it could just be the fact that
we have expressions deform all columns once, instead of deforming
columns one-by-one now.

Greetings,

Andres Freund




Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-04 Thread Andres Freund
Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:
> On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> > So there seems to be +40% between 9.6 and 10, and further +25% between
> > 10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell the 10->master difference comes largely from the
difference of the number of buckets in the hashtable.

In 10 it is:
Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75, 
private_data=0x5628251952f0)
and in master it is:
Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256, 
private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.

The reason the hashtable got bigger in 13 is

commit 1f39bce021540fde00990af55b4432c55ef4b3c7
Author: Jeff Davis 
Date:   2020-03-18 15:42:02 -0700

Disk-based Hash Aggregation.

which introduced

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256


I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?


I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but
I also see very little reason to just always use 256 buckets? It's
pretty darn common to end up with 1-2 groups, what's the point of this?


I'll look into 9.6->10 after buying groceries... But I'd wish there were
a relevant benchmark, I don't think it's worth optimizing for this case.

Greetings,

Andres Freund




Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-04 Thread Jeff Davis
On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> +/* minimum number of initial hash table buckets */
> +#define HASHAGG_MIN_BUCKETS 256
> 
> 
> I don't really see much explanation for that part in the commit,
> perhaps
> Jeff can chime in?

I did this in response to a review comment (point #5):


https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

Regards,
Jeff Davis






Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-04 Thread Andres Freund
Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
> On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> > +/* minimum number of initial hash table buckets */
> > +#define HASHAGG_MIN_BUCKETS 256
> > 
> > 
> > I don't really see much explanation for that part in the commit,
> > perhaps
> > Jeff can chime in?
> 
> I did this in response to a review comment (point #5):
> 
> 
> https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development
> 
> Tomas suggested a min of 1024, and I thought I was being more
> conservative choosing 256. Still too high, I guess?

> I can lower it. What do you think is a reasonable minimum?

I don't really see why there needs to be a minimum bigger than 1?

Greetings,

Andres Freund




Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-05 Thread Tomas Vondra

On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote:
>Hi,
>
>On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
>>On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
>>> +/* minimum number of initial hash table buckets */
>>> +#define HASHAGG_MIN_BUCKETS 256
>>>
>>>
>>> I don't really see much explanation for that part in the commit,
>>> perhaps
>>> Jeff can chime in?
>>
>>I did this in response to a review comment (point #5):
>>
>>https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development
>>
>>Tomas suggested a min of 1024, and I thought I was being more
>>conservative choosing 256. Still too high, I guess?
>>
>>I can lower it. What do you think is a reasonable minimum?
>
>I don't really see why there needs to be a minimum bigger than 1?



I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that. But this example shows we may need
to scan the hash table sequentially, which means it's not just about
memory consumption. So in hindsight we either don't need the limit at
all, or maybe it could be much lower (IIRC it reduces probability of
collision, but maybe dynahash does that anyway internally).

I wonder if hashjoin has the same issue, but probably not - I don't
think we'll ever scan that internal hash table sequentially.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:
> I think you're right. I think I was worried about having to resize the
> hash table in case of an under-estimate, and it seemed fine to waste a
> tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm not
worried about that. It's also how it has worked for a *long* time, so I think
unless we have some good reason to change that, I wouldn't.


> But this example shows we may need to scan the hash table
> sequentially, which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way to
get at the aggregated data.


> So in hindsight we either don't need the limit at all, or maybe it
> could be much lower (IIRC it reduces probability of collision, but
> maybe dynahash does that anyway internally).

This is code using simplehash, which resizes at a load factor of 0.9.


> I wonder if hashjoin has the same issue, but probably not - I don't
> think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (c.f. ExecPrepHashTableForUnmatched()), but
it's probably not relevant performance-wise.

Greetings,

Andres Freund