[ 
https://issues.apache.org/jira/browse/DRILL-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447473#comment-16447473
 ] 

ASF GitHub Bot commented on DRILL-6307:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r183264629
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
    @@ -277,18 +286,29 @@ public boolean isRepeatedList() {
         /**
          * This is the average per entry width, used for vector allocation.
          */
    -    public int getEntryWidth() {
    +    private int getEntryWidthForAlloc() {
           int width = 0;
           if (isVariableWidth) {
    -        width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
    +        width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
     
             // Subtract out the bits (is-set) vector width
    -        if (metadata.getDataMode() == DataMode.OPTIONAL) {
    +        if (isOptional) {
               width -= BIT_VECTOR_WIDTH;
             }
    +
    +        if (isRepeated && getValueCount() == 0) {
    +          return (safeDivide(width, STD_REPETITION_FACTOR));
    --- End diff --
    
    If the value count is zero, but the row count is non-zero, then a very low 
repetition rate is more realistic than 10. In earlier drafts, I found the 
repetition rate had to be a float since, in some data, the rate is something 
like 0.7 or 1.4. Rounding to an integer caused quite an error when multiplying 
by, say, 50K rows.
    
    Any reason we can't use the actual computed amount here? If we really have 
1 or 2 rows, then a guess of 10 is fine. But, if we have 60K rows, with an 
actual estimate of 0, then guessing 10 will allocate 600K values when we 
probably needed close to 0. (Unless I'm missing something.)


> Handle empty batches in record batch sizer correctly
> ----------------------------------------------------
>
>                 Key: DRILL-6307
>                 URL: https://issues.apache.org/jira/browse/DRILL-6307
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.13.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.14.0
>
>
> when we get empty batch, record batch sizer calculates row width as zero. In 
> that case, we do not do accounting and memory allocation correctly for 
> outgoing batches. 
> For example, in merge join, for outer left join, if right side batch is 
> empty, we still have to include the right side columns as null in outgoing 
> batch. 
> Say first batch is empty. Then, for outgoing, we allocate empty vectors with 
> zero capacity.  When we read the next batch with data, we will end up going 
> through realloc loop. If we use right side row width as 0 in outgoing row 
> width calculation, number of rows we will calculate will be higher and later 
> when we get a non empty batch, we might exceed the memory limits. 
> One possible workaround/solution : Allocate memory based on std size for 
> empty input batch. Use allocation width as width of the batch in number of 
> rows calculation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to