Github user ilooner commented on the issue:
https://github.com/apache/drill/pull/1101
@Ben-Zvi I have responded to comments and implemented the requested changes.
Please see the latest commits. After noticing that some of the batch sizing
calculations were incorrect, I made some additional fixes; I have also made
various documentation and naming changes to improve the readability of the
code and to document what needs to be improved in the future:
- The size of varchar columns was not properly accounted for in the
outgoing batch when varchars are aggregated. I have added logic to the
updateEst... method to account for this.
- I have added docs to the updateEst method explaining what the various
batch sizing estimates do.
- I have changed the variable names for the batch size estimates to more
accurately reflect what they do.
- Previously, if a batch size estimate was smaller than the actual amount of
memory allocated, the estimate was updated to the larger size. I think this
was actually the wrong behavior: it makes the HashAgg operator's memory
estimates extremely aggressive and, in the worst case, can cause the
operator to run out of memory prematurely. Ideally a good estimate will
overestimate 50% of the time and underestimate 50% of the time.
- I have changed the HashAgg defaults. Although tests passed with the
previous defaults, I felt they were too aggressive. I have changed the default
number of partitions to 16 and the minimum number of batches to 1.
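To illustrate the estimate-update point above, here is a minimal sketch of the two policies. All names here are hypothetical, for illustration only, and are not Drill's actual code: a "ratcheting" update only ever grows the estimate toward the worst case ever observed, while a smoothed update can move both ways and settles near the typical batch size.

```java
// Hypothetical sketch of two batch-size estimate update policies;
// names are illustrative and not taken from Drill's codebase.
public class BatchSizeEstimate {
    long estimate;

    BatchSizeEstimate(long initialBytes) {
        estimate = initialBytes;
    }

    // Old behavior: ratchet the estimate up to the largest batch ever seen.
    // The estimate can never shrink, so over time it converges to the worst
    // case and makes memory planning overly aggressive.
    void updateRatcheting(long observedBytes) {
        estimate = Math.max(estimate, observedBytes);
    }

    // Alternative: a simple moving average that can both grow and shrink,
    // so it overestimates and underestimates roughly half the time once it
    // converges on the typical batch size.
    void updateSmoothed(long observedBytes) {
        estimate = (estimate + observedBytes) / 2;
    }
}
```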
Let me know if you have any more comments. Thanks.
---