[jira] [Commented] (DRILL-6307) Handle empty batches in record batch sizer correctly

ASF GitHub Bot (JIRA) Wed, 25 Apr 2018 14:39:24 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453122#comment-16453122
 ]


ASF GitHub Bot commented on DRILL-6307:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r184200281
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
    @@ -277,18 +286,29 @@ public boolean isRepeatedList() {
         /**
          * This is the average per entry width, used for vector allocation.
          */
    -    public int getEntryWidth() {
    +    private int getEntryWidthForAlloc() {
           int width = 0;
           if (isVariableWidth) {
    -        width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
    +        width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
     
             // Subtract out the bits (is-set) vector width
    -        if (metadata.getDataMode() == DataMode.OPTIONAL) {
    +        if (isOptional) {
               width -= BIT_VECTOR_WIDTH;
             }
    +
    +        if (isRepeated && getValueCount() == 0) {
    +          return (safeDivide(width, STD_REPETITION_FACTOR));
    +        }
           }
     
    -      return (safeDivide(width, cardinality));
    +      return (safeDivide(width, getEntryCardinalityForAlloc()));
    +    }
    +
    +    /**
    +     * This is the average per entry cardinality, used for vector allocation.
    +     */
    +    private float getEntryCardinalityForAlloc() {
    +      return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 1) :getCardinality();
    --- End diff --
    
    This is for joins. We allocate vectors based on the sizing information of 
the first batch, and if that first batch is empty, we allocate vectors with 
zero capacity. When we then read the next batch, which has data, we end up 
going through the realloc loop as we write values. For example, in a left 
outer join, if the right side batch is empty, we still have to include the 
right side columns as null in the outgoing batch. With the new lateral join 
operator, we see the same problem if the input has an empty array as the 
first record in the unnest column. 
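
To make the fallback in the diff concrete, here is a minimal standalone sketch of the same idea. It is not the RecordBatchSizer code: the class name, the value 10 for STD_REPETITION_FACTOR, and this safeDivide helper (round up, return 0 for a zero denominator) are assumptions made for illustration only.

```java
class EmptyBatchAllocSketch {
  // Assumed value for illustration; the real constant is defined in Drill.
  static final int STD_REPETITION_FACTOR = 10;

  // Hypothetical stand-in for safeDivide: rounds up, returns 0 when denom is 0.
  static int safeDivide(int num, float denom) {
    if (denom == 0) {
      return 0;
    }
    return (int) Math.ceil(num / denom);
  }

  // Mirrors the fallback in the diff: an empty batch reports cardinality 0,
  // so use a standard factor for repeated columns and 1 otherwise.
  static float entryCardinalityForAlloc(float cardinality, boolean isRepeated) {
    if (cardinality == 0) {
      return isRepeated ? STD_REPETITION_FACTOR : 1;
    }
    return cardinality;
  }

  public static void main(String[] args) {
    int allocWidth = 50; // bytes per row reserved for this column (made-up number)

    // Empty first batch, repeated column: divide by the standard factor.
    System.out.println(safeDivide(allocWidth, entryCardinalityForAlloc(0f, true)));
    // Empty first batch, non-repeated column: divide by 1.
    System.out.println(safeDivide(allocWidth, entryCardinalityForAlloc(0f, false)));
    // Non-empty batch: the observed cardinality is used unchanged.
    System.out.println(safeDivide(allocWidth, entryCardinalityForAlloc(2.5f, true)));
  }
}
```

Without the fallback, dividing by a cardinality of 0 yields an entry width of 0 in this sketch, which corresponds to the zero-capacity allocation described above.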


> Handle empty batches in record batch sizer correctly
> ----------------------------------------------------
>
>                 Key: DRILL-6307
>                 URL: https://issues.apache.org/jira/browse/DRILL-6307
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.13.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.14.0
>
>
> When we get an empty batch, the record batch sizer calculates the row width 
> as zero. In that case, we do not do accounting and memory allocation 
> correctly for outgoing batches. 
> For example, in a merge join doing a left outer join, if the right side 
> batch is empty, we still have to include the right side columns as null in 
> the outgoing batch. 
> Say the first batch is empty. Then, for the outgoing batch, we allocate 
> empty vectors with zero capacity. When we read the next batch with data, we 
> end up going through the realloc loop. If we use a right side row width of 0 
> in the outgoing row width calculation, the number of rows we calculate will 
> be higher, and later, when we get a non-empty batch, we might exceed the 
> memory limits. 
> One possible workaround/solution: allocate memory based on the standard 
> size for an empty input batch, and use the allocation width as the width of 
> the batch in the number-of-rows calculation. 
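
The row-count effect described in the issue can be checked with a little arithmetic. The sketch below is illustrative only: the 16 MiB budget, the row widths, and the rowCount helper are made-up assumptions, not values or methods taken from Drill.

```java
class OutgoingRowCountSketch {
  // Hypothetical helper: how many outgoing rows fit in the budget when each
  // row is as wide as the left and right side rows combined.
  static long rowCount(long budgetBytes, int leftRowWidth, int rightRowWidth) {
    int rowWidth = Math.max(1, leftRowWidth + rightRowWidth); // avoid dividing by zero
    return budgetBytes / rowWidth;
  }

  public static void main(String[] args) {
    long budget = 16L * 1024 * 1024; // assumed outgoing batch memory limit
    int leftWidth = 100;             // assumed left side row width in bytes
    int actualRightWidth = 50;       // assumed right side width once data arrives

    // Right side batch is empty, so its row width is reported as 0.
    long rowsWithZeroWidth = rowCount(budget, leftWidth, 0);
    // Same case, but a standard allocation width is used for the empty side.
    long rowsWithStdWidth = rowCount(budget, leftWidth, actualRightWidth);

    // Memory needed if the larger row count is kept when real data shows up.
    long bytesNeeded = rowsWithZeroWidth * (leftWidth + actualRightWidth);

    System.out.println("rows with zero right width: " + rowsWithZeroWidth);
    System.out.println("rows with standard width:   " + rowsWithStdWidth);
    System.out.println("over budget: " + (bytesNeeded > budget));
  }
}
```

With these assumed numbers, the zero-width estimate admits about 50% more rows than the budget can hold once the right side produces real data, which is the over-allocation the issue warns about.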



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
