Investigating a customer complaint I ran into an issue with the hash
aggregate code. The problem was that a query that usually completes in
less than a minute sometimes gets stuck indefinitely (hours+). I
tracked it down to a hash aggregate node returning one tuple from a
batch and spilling the rest.

The reason for the behavior is that aggstate->hash_metacxt was 100M,
which is larger than work_mem*hash_mem_multiplier of 64M. This makes
hash_agg_check_limits() always spill after the first tuple. I think
that ends up having a n² overhead, with n being almost 4M here.

I don't have a simple reproducer yet, because the live problem was on
a parallel query where looking at the backend wrong caused the problem
to disappear. After some retries I was able to catch an instance of
growing past work_mem with gdb. After that growth the simplehash was
{size = 4194304, members = 409839, ..}, i.e. the table was only 20%
full before growing. So the cause seems to be a run of hash collisions
bigger than SH_GROW_MAX_MOVE (150).

AFAICT there is nothing in simplehash that would stop it growing past
work_mem, and once it does the spilling logic in
agg_refill_hash_table() enters this degenerate state until the end of
the plan node.

I think the correct fix would be to have a way to insert into
simplehash with a limit on size, which means that the insert might
fail. I haven't yet looked at how complicated this would be to
implement.

I also haven't checked what is the cause for such a long run of
collisions. But I think it's related to it being a HashAggregate on
top of Gather on top of HashAggregate.

Regards,
Ants Aasma


Reply via email to