[
https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Turton resolved DRILL-8458.
---------------------------------
Resolution: Fixed
> Reading Parquet v2 data page with repetition levels larger than column data
> throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
> Key: DRILL-8458
> URL: https://issues.apache.org/jira/browse/DRILL-8458
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.21.1
> Reporter: Peter Franzen
> Assignee: James Turton
> Priority: Major
> Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is
> larger than the size of the column data bytes,
> {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
> throws an {{{}IllegalArgumentException{}}}. This is caused by trying to set
> the limit of a ByteBuffer to a value large than its capacity.
>
> The offending code is at line 226 in {{{}ColumnChunkIncReadStore.java{}}}:
>
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220 (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226 (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize; {code}
>
> The buffer {{pageBuf}} contains the repetition level bytes followed by the
> definition level bytes followed by the column data bytes.
>
> The code at lines 217-221 reads the repetition level bytes, and then updates
> the position of the {{pageBuf}} buffer to the start of the definition level
> bytes (lines 222 and 224).
>
> The code at lines 225-227 reads the definition level bytes, and when creating
> a slice of the \{{pageBuf }}buffer containing the definition level bytes, the
> slice's limit is set as if the position was at the beginning of the
> repetition level bytes (line 226), i.e as if it not had been updated.
>
> This means that if the capacity of the pageBuf buffer (which is the size of
> the repetition level bytes + the size of the definition level bytes + the
> size of the column data bytes) is less than (repLevelSize + repLevelSize +
> defLevelSize), the call to limit() will throw.
>
> The fix is to change line 226 to
> {code:java}
> (ByteBuffer) bb.slice().limit(defLevelSize){code}
>
> For symmetry, line 220 could also be changed to
> {code:java}
> (ByteBuffer) bb.slice().limit(repLevelSize){code}
>
> although {{pageBufOffset}} is always 0 there and will not cause the limit to
> exceed the capacity.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)