On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine a tuple that belongs to a group G in grouping
> set A had to be spilled because it also belonged to an evicted group
> from grouping set B. Then group G would remain in set A’s hash table
> at the end of the phase, but it wouldn’t have aggregated the values
> from the spilled tuple. Of course, work-arounds are possible, but
> the current approach would not work as is.
I was imagining that (in the simplest approach) each hash table would
have its own set of spilled tuples. If grouping set A can be advanced
(because it's present in the hash table) but B cannot (because it's
not in the hash table and we are at the memory limit), then the tuple
goes into the spill file for B but not A. That means that A is still
finished after the first pass. (A rough sketch of what I mean is at
the end of this mail.)

But I am worried that I am missing something, because it appears that
for AGG_MIXED, we wait until the end of the last phase to emit the
contents of the hash tables. Wouldn't they be complete after the
first phase?

I realize that this simple approach could be inefficient if multiple
hash tables are spilling, because we'd duplicate the spilled tuples.
But let me know if it won't work at all.

> But as you mentioned, this solution relies on all grouping keys
> being sortable, so we would need a fallback plan. For this, we could
> use your hash-based approach, but we would have to make
> modifications. One idea would be to split each grouping set into its
> own aggregate phase, so that only one hash table is in memory at a
> time. This would eliminate the performance benefits of grouping sets
> when keys aren’t sortable, but at least they would still work.

I am having a hard time trying to satisfy all of the constraints that
have been raised so far:

* handle data types with hash ops but no btree ops
* handle many different group size distributions and input orders
  efficiently
* avoid regressions or inefficiencies for grouping sets
* handle variable-size (particularly O(n)-space) transition states

without taking on a lot of complexity. I like to work toward the
right design, but I think simplicity also factors into the right
design. nodeAgg.c is already one of the larger and more complicated
files that we have.

A lot of the complexity we already have is that grouping sets are
performed in one giant executor node rather than exposing the
individual phases as separate executor nodes. I suppose the reason
(correct me if I'm wrong) is that there are two outputs from a phase:
the aggregated groups, and the original input tuples in a new sort
order. That seems solvable -- as a precedent, the executor already
passes bitmaps around until BitmapHeapScan turns them into tuples.

Regards,
	Jeff Davis
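P.S. To make the per-hash-table spilling idea above a bit more
concrete, here is a toy, self-contained sketch. None of these names or
structures come from nodeAgg.c; the fixed-size group arrays and the
per-set group limit are just stand-ins for the real hash tables and
the shared memory limit. The only point is that the spill decision is
made independently for each grouping set, so a set whose group is
already in memory still advances, and only the sets that couldn't
absorb the tuple see it again in a later pass.

/*
 * Toy model of per-grouping-set spilling (hypothetical, not nodeAgg.c):
 * each grouping set keeps its own group table and its own spill list.
 * A tuple that can be absorbed by set A but not by set B is written
 * only to B's spill list, so A's groups stay complete after one pass.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS 2                      /* stand-in for the memory limit */

typedef struct Tuple { int keys[2]; long value; } Tuple;
typedef struct Group { int key; long sum; } Group;

typedef struct SetState
{
    int     keycol;                       /* column this set groups by */
    Group   groups[MAX_GROUPS];           /* "hash table" for this set */
    int     ngroups;
    Tuple  *spilled;                      /* tuples to reprocess later */
    int     nspilled;
} SetState;

/* Find this tuple's group in one set's table, or NULL if absent. */
static Group *
lookup_group(SetState *set, const Tuple *tup)
{
    for (int i = 0; i < set->ngroups; i++)
        if (set->groups[i].key == tup->keys[set->keycol])
            return &set->groups[i];
    return NULL;
}

/* Advance one tuple against every grouping set independently. */
static void
process_tuple(SetState *sets, int nsets, const Tuple *tup)
{
    for (int i = 0; i < nsets; i++)
    {
        SetState *set = &sets[i];
        Group    *grp = lookup_group(set, tup);

        if (grp == NULL && set->ngroups < MAX_GROUPS)
        {
            grp = &set->groups[set->ngroups++];
            grp->key = tup->keys[set->keycol];
            grp->sum = 0;
        }

        if (grp != NULL)
            grp->sum += tup->value;       /* this set stays complete */
        else
        {
            /* spill for this set only; the other sets already advanced */
            set->spilled = realloc(set->spilled,
                                   (set->nspilled + 1) * sizeof(Tuple));
            set->spilled[set->nspilled++] = *tup;
        }
    }
}

int
main(void)
{
    SetState sets[2] = { {.keycol = 0}, {.keycol = 1} };
    Tuple    input[] = { {{1, 10}, 5}, {{2, 20}, 7}, {{1, 30}, 3}, {{3, 10}, 2} };

    for (int i = 0; i < 4; i++)
        process_tuple(sets, 2, &input[i]);

    for (int s = 0; s < 2; s++)
        printf("set %d: %d groups in memory, %d spilled\n",
               s, sets[s].ngroups, sets[s].nspilled);
    return 0;
}

Running it, the third input tuple advances the first grouping set but
is spilled only for the second -- the scenario from the quoted example
above.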