[
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prasanth J updated HIVE-6287:
-----------------------------
Attachment: HIVE-6287.1.patch
Added q file tests.
> batchSize computation in Vectorized ORC reader can cause
> BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
> Key: HIVE-6287
> URL: https://issues.apache.org/jira/browse/HIVE-6287
> Project: Hive
> Issue Type: Bug
> Components: Vectorization
> Affects Versions: 0.13.0
> Reporter: Prasanth J
> Assignee: Prasanth J
> Labels: orcfile, vectorization
> Attachments: HIVE-6287.1.patch, HIVE-6287.WIP.patch
>
>
> nextBatch() method that computes the batchSize is only aware of stripe
> boundaries. This will not work when PPD in ORC is enabled as PPD works at row
> group level (stripe contains multiple row groups). By default, row group
> stride is 10000. When PPD is enabled, some row groups may get eliminated.
> After row group elimination, disk ranges are computed based on the selected
> row groups. If batchSize computation is not aware of this, it will lead to
> BufferUnderFlowException (reading beyond disk range). Following scenario
> should illustrate it more clearly
> {code}
> |--------------------------------- STRIPE 1
> ------------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5
> --|
> |--------- diskrange 1 ---------| |- diskrange
> 2 -|
> ^
> (marker)
> {code}
> diskrange1 will have 20000 rows and diskrange 2 will have 10000 rows. Since
> nextBatch() was not aware of row groups and hence the diskranges, it tries to
> read 1024 values from the end of diskrange 1 where it should only read 20000
> % 1024 = 544 values. This will result in BufferUnderFlowException.
> To fix this, a marker is placed at the end of each range and batchSize is
> computed accordingly. {code}batchSize =
> Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition -
> rowInStripe));{code}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)