> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree = false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple we can avoid reallocating memory
> * for a new MinimalTuple (ExecHashJoinGetSavedTuple reuses the StringInfo).
>
> But my point was that I don't think the palloc/repalloc should be very
> expensive, once the AllocSet warms up a bit.
Avoiding the memory palloc/repalloc is just a side effect of avoiding reforming the tuple.
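To make the idea concrete, here is a minimal sketch of the pattern, not the
actual patch: the StringInfo type of hj_outerTupleBuffer and the function name
are my assumptions based on the comment quoted above, and the BufFile read
calls follow what ExecHashJoinGetSavedTuple() does on master. The saved tuple
is read into a buffer owned by hjstate, so the slot can take the pointer with
shouldFree = false and nothing is palloc'd or freed per tuple:

#include "postgres.h"

#include "executor/executor.h"
#include "executor/hashjoin.h"
#include "lib/stringinfo.h"
#include "storage/buffile.h"

/*
 * Sketch: like ExecHashJoinGetSavedTuple(), but reads the saved tuple
 * into a StringInfo kept in hjstate (the hj_outerTupleBuffer field is
 * assumed here) instead of palloc'ing a fresh MinimalTuple each time.
 */
static TupleTableSlot *
GetSavedTupleReusingBuffer(HashJoinState *hjstate, BufFile *file,
                           uint32 *hashvalue, TupleTableSlot *tupleSlot)
{
    StringInfo  buf = hjstate->hj_outerTupleBuffer; /* assumed field */
    uint32      header[2];
    size_t      nread;
    MinimalTuple tuple;

    /* header[0] is the hash value, header[1] the tuple length */
    nread = BufFileReadMaybeEOF(file, header, sizeof(header), true);
    if (nread == 0)             /* end of this batch file */
    {
        ExecClearTuple(tupleSlot);
        return NULL;
    }
    *hashvalue = header[0];

    /* grow the reusable buffer as needed, then read into it */
    resetStringInfo(buf);
    enlargeStringInfo(buf, header[1]);
    tuple = (MinimalTuple) buf->data;
    tuple->t_len = header[1];
    BufFileReadExact(file, (char *) tuple + sizeof(uint32),
                     header[1] - sizeof(uint32));

    /* shouldFree = false: the buffer stays owned by hjstate */
    ExecForceStoreMinimalTuple(tuple, tupleSlot, false);
    return tupleSlot;
}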
> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * reforming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing the
> memory? Only because materialization is expensive, is that a good reason to
> abandon the memory management too?
Currently, I haven't thought of a better way to avoid the reform.
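For reference, this is roughly what ExecForceStoreMinimalTuple() does; it is
paraphrased from execTuples.c and the details may differ from the actual
source, but it shows why the non-TTSOpsMinimalTuple path is the expensive one:

void
ExecForceStoreMinimalTuple(MinimalTuple mtup,
                           TupleTableSlot *slot,
                           bool shouldFree)
{
    if (TTS_IS_MINIMALTUPLE(slot))
    {
        /* cheap path: the minimal slot can simply take over the pointer */
        tts_minimal_store_tuple(slot, mtup, shouldFree);
    }
    else
    {
        /*
         * Expensive path: build a "counterfeit" heap tuple view of the
         * minimal tuple, deform it attribute by attribute into the slot's
         * virtual representation - this is the reform the patch avoids.
         */
        HeapTupleData htup;

        ExecClearTuple(slot);

        htup.t_len = mtup->t_len + MINIMAL_TUPLE_OFFSET;
        htup.t_data = (HeapTupleHeader) ((char *) mtup - MINIMAL_TUPLE_OFFSET);

        heap_deform_tuple(&htup, slot->tts_tupleDescriptor,
                          slot->tts_values, slot->tts_isnull);
        ExecStoreVirtualTuple(slot);

        if (shouldFree)
        {
            /* copy out any datums still pointing into mtup, then free it */
            ExecMaterializeSlot(slot);
            pfree(mtup);
        }
    }
}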
> >
> >> Can you provide more information about the benchmark you did? What
> >> hardware, what scale, PostgreSQL configuration, which of the 22
> >> queries are improved, etc.
> >>
> >> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
> >> pretty much no difference compared to master. However, it occurred to
> >> me the patch only ever helps if we increase the number of batches
> >> during execution, in which case we need to move tuples to the right batch.
> >
> > Only the parallel HashJoin node sped up, to ~2x (with all data cached
> > in memory) - not the full query, and not non-parallel HashJoin.
> > Non-parallel HashJoin only speeds up when the number of batches is
> > greater than one, because this patch only optimizes reading batch
> > tuples back into memory.
> >
>
> I'm sorry, but this does not answer *any* of the questions I asked.
>
> Please provide enough info to reproduce the benefit - benchmark scale, which
> query, which parameters, etc. Show explain / explain analyze of the query
> without / with the patch, stuff like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a benefit
> of this scale.
After further testing, it turns out that the parallel hashjoin did not improve
performance; I might have compared it against a debug build at the time. I
apologize for that.
However, the non-parallel hashjoin indeed showed about a 10% performance
improvement.
Here is the testing information:
CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8
SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose,analyze)
select count(*)
from lineitem, orders
where lineitem.l_orderkey = orders.o_orderkey;
Before the patch:
Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
  Output: count(*)
  ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
        Inner Unique: true
        Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
        ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
              Output: lineitem.l_orderkey
              Heap Fetches: 0
        ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
              Output: orders.o_orderkey
              Buckets: 262144  Batches: 128  Memory Usage: 5335kB
              ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
                    Output: orders.o_orderkey
                    Heap Fetches: 0
Planning Time: 0.141 ms
Execution Time: 10591.730 ms
(16 rows)
After the patch:
Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
  Output: count(*)
  ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
        Inner Unique: true
        Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
        ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
              Output: lineitem.l_orderkey
              Heap Fetches: 0
        ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
              Output: orders.o_orderkey
              Buckets: 262144  Batches: 128  Memory Usage: 5335kB
              ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
                    Output: orders.o_orderkey
                    Heap Fetches: 0
Planning Time: 0.065 ms
Execution Time: 9826.135 ms