Hi,
On 2026-02-19 18:16:57 +0200, Ants Aasma wrote:
> > So it's a parallel aggregate? Partial + Finalize? I wonder if that might
> > be "correlating" the data in a way that makes it more likely to hit
> > SH_GROW_MAX_MOVE. But If that was the case, wouldn't we see this issue
> > more often?
>
> Interestingly, the plan doesn't have Partial and Finalize stages on those hash agg nodes:
>
>   ->  HashAggregate  (cost=142400.87..142800.87 rows=40000 width=16) (actual time=7978.262..9591.682 rows=3698243 loops=1)
>         Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
>         Batches: 21  Memory Usage: 65593kB  Disk Usage: 118256kB
>         ->  Gather  (cost=133600.87..142000.87 rows=80000 width=16) (actual time=1898.473..4772.296 rows=3698243 loops=1)
>               Workers Planned: 2
>               Workers Launched: 2
>               ->  HashAggregate  (cost=132600.87..133000.87 rows=40000 width=16) (actual time=1586.697..2040.368 rows=1232748 loops=3)
>                     Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
>                     Batches: 1  Memory Usage: 5137kB
>                     Worker 0:  Batches: 5  Memory Usage: 79921kB  Disk Usage: 40024kB
>                     Worker 1:  Batches: 5  Memory Usage: 81969kB  Disk Usage: 36112kB
>
> There are timescale tables involved in the plan, so I think timescale
> might be behind that.
Hm, so timescale creates a plan that we would not?
> There is this comment above the simplehash growing logic:
> * To avoid negative consequences from overly imbalanced
> * hashtables, grow the hashtable if collisions would require
> * us to move a lot of entries. The most likely cause of such
> * imbalance is filling a (currently) small table, from a
> * currently big one, in hash-table order.
>
> The problem disappears if I set a breakpoint on tuplehash_grow, so
> apparently the lower hashtable scans have to interleave in a
> particular manner to trigger the excess growth of the upper node.
>
> I'm wondering if some way to decorrelate the hashtables would help.
> For example a hashtable specific (pseudo)random salt.
We do try to add a hash-IV that's different for each worker:
    /*
     * If parallelism is in use, even if the leader backend is performing the
     * scan itself, we don't want to create the hashtable exactly the same way
     * in all workers. As hashtables are iterated over in keyspace-order,
     * doing so in all processes in the same way is likely to lead to
     * "unbalanced" hashtables when the table size initially is
     * underestimated.
     */
    if (use_variable_hash_iv)
        hash_iv = murmurhash32(ParallelWorkerNumber);
I don't remember enough of how the parallel aggregate stuff works. Perhaps the
issue is that the leader is also building a hashtable, and its output is being
inserted into the post-Gather hashtable using the same IV?
In which case parallel_leader_participation=off should make a difference.
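A quick way to test that hypothesis would be something like this sketch (the GUC can be set per session; substitute the original query):

```sql
-- If the leader's partial hashtable sharing an IV with the post-Gather
-- hashtable is the culprit, taking the leader out of the scan should
-- make the excess growth disappear.
SET parallel_leader_participation = off;
EXPLAIN (ANALYZE) /* original query here */;
```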
Greetings,
Andres Freund