Github user ilooner commented on the issue:
https://github.com/apache/drill/pull/1101
@Ben-Zvi I have responded to comments and implemented the requested changes.
Please see the latest commits. After noticing that some of the batch sizing
calculations were incorrect, I made some additional fixes; I have also made
various documentation and naming changes to improve the readability of the
code and to document what needs to be improved in the future:
- The size of varchar columns was not properly accounted for in the
outgoing batch when varchars are aggregated. I have added logic to the
updateEst... method to account for this.
- I have added docs to the updateEst method explaining what the various
batch sizing estimates do.
- I have changed the variable names for the batch size estimates to more
accurately reflect what they do.
- Previously, if a batch size estimate was smaller than the actual amount of
memory allocated, the estimate was updated to the larger size. I think this
was actually the wrong behavior: it makes the HashAgg operator's memory
estimates extremely aggressive and, in the worst case, can cause the
operator to run out of memory prematurely. Ideally a good estimate will
overestimate 50% of the time and underestimate 50% of the time.
- I have changed the HashAgg defaults. Although tests passed with the
previous defaults, I felt they were too aggressive. I have changed the default
number of partitions to 16 and the minimum number of batches to 1.
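To illustrate the estimate-update point above, here is a minimal sketch of the two policies. All names here are hypothetical, for illustration only, and are not Drill's actual code: a "ratcheting" update only ever grows the estimate toward the worst case ever observed, while a smoothed update can move both ways and settles near the typical batch size.

```java
// Hypothetical sketch of two batch-size estimate update policies;
// names are illustrative and not taken from Drill's codebase.
public class BatchSizeEstimate {
    long estimate;

    BatchSizeEstimate(long initialBytes) {
        estimate = initialBytes;
    }

    // Old behavior: ratchet the estimate up to the largest batch ever seen.
    // The estimate can never shrink, so over time it converges to the worst
    // case and makes memory planning overly aggressive.
    void updateRatcheting(long observedBytes) {
        estimate = Math.max(estimate, observedBytes);
    }

    // Alternative: a simple moving average that can both grow and shrink,
    // so it overestimates and underestimates roughly half the time once it
    // converges on the typical batch size.
    void updateSmoothed(long observedBytes) {
        estimate = (estimate + observedBytes) / 2;
    }
}
```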
Let me know if you have any more comments. Thanks.
---