On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine a tuple that belongs to a group G in grouping
> set A had to be spilled because it also belonged to an evicted group
> from grouping set B. Then group G would remain in set A’s hash table
> at the end of the phase, but it wouldn’t have aggregated the values
> from the spilled tuple. Of course, work-arounds are possible, but
> the current approach would not work as is.
I was imagining that (in the simplest approach) each hash table would
have its own set of spilled tuples. If grouping set A can be advanced
(because it's present in the hash table) but B cannot (because it's
not in the hash table and we are at the memory limit), then the tuple
goes into the spill file for B but not A. That means that A is still
finished after the first pass. (A rough sketch of what I mean is at
the end of this mail.)

But I am worried that I am missing something, because it appears that
for AGG_MIXED, we wait until the end of the last phase to emit the
contents of the hash tables. Wouldn't they be complete after the
first phase?

I realize that this simple approach could be inefficient if multiple
hash tables are spilling, because we'd duplicate the spilled tuples.
But let me know if it won't work at all.

> But as you mentioned, this solution relies on all grouping keys
> being sortable, so we would need a fallback plan. For this, we could
> use your hash-based approach, but we would have to make
> modifications. One idea would be to split each grouping set into its
> own aggregate phase, so that only one hash table is in memory at a
> time. This would eliminate the performance benefits of grouping sets
> when keys aren’t sortable, but at least they would still work.

I am having a hard time trying to satisfy all of the constraints that
have been raised so far:

* handle data types with hash ops but no btree ops
* handle many different group size distributions and input orders
  efficiently
* avoid regressions or inefficiencies for grouping sets
* handle variable-size (particularly O(n)-space) transition states

without taking on a lot of complexity. I like to work toward the
right design, but I think simplicity also factors into the right
design. nodeAgg.c is already one of the larger and more complicated
files that we have.

A lot of the complexity we already have is that grouping sets are
performed in one giant executor node rather than exposing the
individual phases as separate executor nodes. I suppose the reason
(correct me if I'm wrong) is that there are two outputs from a phase:
the aggregated groups, and the original input tuples in a new sort
order. That seems solvable -- as a precedent, the executor already
passes bitmaps around until BitmapHeapScan turns them into tuples.

Regards,
	Jeff Davis
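P.S. To make the per-hash-table spilling idea above a bit more
concrete, here is a toy, self-contained sketch. None of these names or
structures come from nodeAgg.c; the fixed-size group arrays and the
per-set group limit are just stand-ins for the real hash tables and
the shared memory limit. The only point is that the spill decision is
made independently for each grouping set, so a set whose group is
already in memory still advances, and only the sets that couldn't
absorb the tuple see it again in a later pass.

/*
 * Toy model of per-grouping-set spilling (hypothetical, not nodeAgg.c):
 * each grouping set keeps its own group table and its own spill list.
 * A tuple that can be absorbed by set A but not by set B is written
 * only to B's spill list, so A's groups stay complete after one pass.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS 2                      /* stand-in for the memory limit */

typedef struct Tuple { int keys[2]; long value; } Tuple;
typedef struct Group { int key; long sum; } Group;

typedef struct SetState
{
    int     keycol;                       /* column this set groups by */
    Group   groups[MAX_GROUPS];           /* "hash table" for this set */
    int     ngroups;
    Tuple  *spilled;                      /* tuples to reprocess later */
    int     nspilled;
} SetState;

/* Find this tuple's group in one set's table, or NULL if absent. */
static Group *
lookup_group(SetState *set, const Tuple *tup)
{
    for (int i = 0; i < set->ngroups; i++)
        if (set->groups[i].key == tup->keys[set->keycol])
            return &set->groups[i];
    return NULL;
}

/* Advance one tuple against every grouping set independently. */
static void
process_tuple(SetState *sets, int nsets, const Tuple *tup)
{
    for (int i = 0; i < nsets; i++)
    {
        SetState *set = &sets[i];
        Group    *grp = lookup_group(set, tup);

        if (grp == NULL && set->ngroups < MAX_GROUPS)
        {
            grp = &set->groups[set->ngroups++];
            grp->key = tup->keys[set->keycol];
            grp->sum = 0;
        }

        if (grp != NULL)
            grp->sum += tup->value;       /* this set stays complete */
        else
        {
            /* spill for this set only; the other sets already advanced */
            set->spilled = realloc(set->spilled,
                                   (set->nspilled + 1) * sizeof(Tuple));
            set->spilled[set->nspilled++] = *tup;
        }
    }
}

int
main(void)
{
    SetState sets[2] = { {.keycol = 0}, {.keycol = 1} };
    Tuple    input[] = { {{1, 10}, 5}, {{2, 20}, 7}, {{1, 30}, 3}, {{3, 10}, 2} };

    for (int i = 0; i < 4; i++)
        process_tuple(sets, 2, &input[i]);

    for (int s = 0; s < 2; s++)
        printf("set %d: %d groups in memory, %d spilled\n",
               s, sets[s].ngroups, sets[s].nspilled);
    return 0;
}

Running it, the third input tuple advances the first grouping set but
is spilled only for the second -- the scenario from the quoted example
above.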