[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-----------------------------

    Description: 
The nextBatch() method that computes the batchSize is only aware of stripe
boundaries. This does not work when predicate pushdown (PPD) is enabled in ORC,
because PPD operates at the row group level (a stripe contains multiple row
groups). By default, the row group stride is 10000 rows. With PPD enabled, some
row groups may be eliminated, and the disk ranges to read are then computed
from the selected row groups only. If the batchSize computation is not aware of
this, it can read beyond the end of a disk range and throw
BufferUnderflowException. The following scenario illustrates the problem:

{code}
|--------------------------------- STRIPE 1 ------------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                             (marker)
{code}

Here diskrange 1 holds 20000 rows and diskrange 2 holds 10000 rows. Because
nextBatch() is not aware of row groups, and hence of the disk ranges, it tries
to read 1024 values at the end of diskrange 1 when it should read only
20000 % 1024 = 544 values. This results in a BufferUnderflowException.
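
To make the arithmetic concrete (an illustration only, not Hive code; the
numbers come from the scenario above):
{code}
// Disk range 1 covers 20000 rows, read in batches of
// VectorizedRowBatch.DEFAULT_SIZE (1024) rows.
long rowsInRange = 20000;
int batchSize = 1024;
long fullBatches = rowsInRange / batchSize;                  // 19 full batches = 19456 rows
long lastBatchSize = rowsInRange - fullBatches * batchSize;  // 544 rows, i.e. 20000 % 1024
// Requesting 1024 rows for that last batch asks the decoder for 480 values that
// are not present in the range's buffers, hence java.nio.BufferUnderflowException.
{code}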

To fix this, a marker is placed at the end of each disk range and batchSize is
computed accordingly:
{code}
batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
{code}
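
A minimal sketch of how this marker-aware sizing could look in the record
reader (the helper name computeBatchSize is illustrative, not necessarily the
exact shape of the patch):
{code}
// Sketch only: size the next batch so it never crosses the end of the current
// disk range. markerPosition is the row offset within the stripe at which the
// disk range containing rowInStripe ends.
private int computeBatchSize(long rowInStripe, long markerPosition) {
  long rowsLeftInRange = markerPosition - rowInStripe;
  return (int) Math.min(VectorizedRowBatch.DEFAULT_SIZE, rowsLeftInRange);
}
{code}
In the scenario above, reading diskrange 1 starts at stripe row 10000 and its
marker sits at row 30000; after 19 full batches rowInStripe is 29456, so the
final call returns min(1024, 30000 - 29456) = 544 instead of 1024.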

The stack trace looks like this:
{code}
Caused by: java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:492)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:207)
        at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readFloat(SerializationUtils.java:70)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$FloatTreeReader.nextVector(RecordReaderImpl.java:673)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.nextVector(RecordReaderImpl.java:1615)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:2883)
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:94)
        ... 15 more
{code}

  was:
The nextBatch() method that computes the batchSize is only aware of stripe
boundaries. This does not work when predicate pushdown (PPD) is enabled in ORC,
because PPD operates at the row group level (a stripe contains multiple row
groups). By default, the row group stride is 10000 rows. With PPD enabled, some
row groups may be eliminated, and the disk ranges to read are then computed
from the selected row groups only. If the batchSize computation is not aware of
this, it can read beyond the end of a disk range and throw
BufferUnderflowException. The following scenario illustrates the problem:

{code}
|--------------------------------- STRIPE 1 ------------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                             (marker)
{code}

Here diskrange 1 holds 20000 rows and diskrange 2 holds 10000 rows. Because
nextBatch() is not aware of row groups, and hence of the disk ranges, it tries
to read 1024 values at the end of diskrange 1 when it should read only
20000 % 1024 = 544 values. This results in a BufferUnderflowException.

To fix this, a marker is placed at the end of each disk range and batchSize is
computed accordingly:
{code}
batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
{code}


> batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6287
>                 URL: https://issues.apache.org/jira/browse/HIVE-6287
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, vectorization
>             Fix For: 0.13.0
>
>         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, HIVE-6287.3.patch, HIVE-6287.4.patch, HIVE-6287.WIP.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
