[
https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Turton updated DRILL-8458:
--------------------------------
Fix Version/s: 1.22.0
Description:
When the size of the repetition level bytes in a Parquet v2 data page is larger
than the size of the column data bytes,
{{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
throws an {{IllegalArgumentException}}. This is caused by trying to set
the limit of a {{ByteBuffer}} to a value larger than its capacity.
The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
{code:java}
217 int pageBufOffset = 0;
218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
219 BytesInput repLevelBytes = BytesInput.from(
220 (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
221 );
222 pageBufOffset += repLevelSize;
223
224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
225 final BytesInput defLevelBytes = BytesInput.from(
226 (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
227 );
228 pageBufOffset += defLevelSize; {code}
The buffer {{pageBuf}} contains the repetition level bytes followed by the
definition level bytes followed by the column data bytes.
The code at lines 217-221 reads the repetition level bytes, and then updates
the position of the {{pageBuf}} buffer to the start of the definition level
bytes (lines 222 and 224).
The code at lines 225-227 reads the definition level bytes, and when creating a
slice of the {{pageBuf}} buffer containing the definition level bytes, the
slice's limit is set as if the position were still at the beginning of the
repetition level bytes (line 226), i.e. as if it had not been updated.
This means that if the capacity of the {{pageBuf}} buffer (which is the size of the
repetition level bytes + the size of the definition level bytes + the size of
the column data bytes) is less than (repLevelSize + repLevelSize +
defLevelSize), i.e. whenever the column data bytes are smaller than the
repetition level bytes, the call to {{limit()}} will throw.
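A minimal, self-contained sketch (with hypothetical sizes) reproduces this
failure mode outside of Drill:
{code:java}
import java.nio.ByteBuffer;

public class SliceLimitDemo {
  public static void main(String[] args) {
    int repLevelSize = 100; // repetition level bytes
    int defLevelSize = 10;  // definition level bytes
    int dataSize = 50;      // column data bytes, smaller than repLevelSize

    // pageBuf holds the rep levels, def levels and column data: capacity 160.
    ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);

    // Mirror lines 224-226: advance past the repetition level bytes...
    int pageBufOffset = repLevelSize;
    ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);

    // ...then slice and set a limit that still includes pageBufOffset.
    // The slice's capacity is defLevelSize + dataSize = 60, but the requested
    // limit is pageBufOffset + defLevelSize = 110, so this line throws
    // java.lang.IllegalArgumentException.
    bb.slice().limit(pageBufOffset + defLevelSize);
  }
}
{code}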
The fix is to change line 226 to
{code:java}
(ByteBuffer) bb.slice().limit(defLevelSize){code}
For symmetry, line 220 could also be changed to
{code:java}
(ByteBuffer) bb.slice().limit(repLevelSize){code}
although {{pageBufOffset}} is always 0 there, so the limit cannot exceed the
capacity.
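For reference, here is the corrected block with both changes applied (a
sketch, not a tested patch):
{code:java}
int pageBufOffset = 0;
ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
BytesInput repLevelBytes = BytesInput.from(
    // A fresh slice starts at pageBuf's current position, so its limit is
    // relative to the slice itself rather than to pageBuf.
    (ByteBuffer) bb.slice().limit(repLevelSize)
);
pageBufOffset += repLevelSize;

bb = (ByteBuffer) pageBuf.position(pageBufOffset);
final BytesInput defLevelBytes = BytesInput.from(
    // Previously pageBufOffset + defLevelSize, which over-ran the slice's
    // capacity whenever the column data was smaller than the repetition levels.
    (ByteBuffer) bb.slice().limit(defLevelSize)
);
pageBufOffset += defLevelSize;
{code}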
was:
When the size of the repetition level bytes in a Parquet data page is larger
than the size of the column data bytes,
{{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
throws an {{IllegalArgumentException}}. This is caused by trying to set
the limit of a {{ByteBuffer}} to a value larger than its capacity.
The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
{code:java}
217 int pageBufOffset = 0;
218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
219 BytesInput repLevelBytes = BytesInput.from(
220 (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
221 );
222 pageBufOffset += repLevelSize;
223
224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
225 final BytesInput defLevelBytes = BytesInput.from(
226 (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
227 );
228 pageBufOffset += defLevelSize; {code}
The buffer {{pageBuf}} contains the repetition level bytes followed by the
definition level bytes followed by the column data bytes.
The code at lines 217-221 reads the repetition level bytes, and then updates
the position of the {{pageBuf}} buffer to the start of the definition level
bytes (lines 222 and 224).
The code at lines 225-227 reads the definition level bytes, and when creating a
slice of the {{pageBuf}} buffer containing the definition level bytes, the
slice's limit is set as if the position were still at the beginning of the
repetition level bytes (line 226), i.e. as if it had not been updated.
This means that if the capacity of the {{pageBuf}} buffer (which is the size of the
repetition level bytes + the size of the definition level bytes + the size of
the column data bytes) is less than (repLevelSize + repLevelSize +
defLevelSize), i.e. whenever the column data bytes are smaller than the
repetition level bytes, the call to {{limit()}} will throw.
The fix is to change line 226 to
{code:java}
(ByteBuffer) bb.slice().limit(defLevelSize){code}
For symmetry, line 220 could also be changed to
{code:java}
(ByteBuffer) bb.slice().limit(repLevelSize){code}
although {{pageBufOffset}} is always 0 there, so the limit cannot exceed the
capacity.
Summary: Reading Parquet v2 data page with repetition levels larger
than column data throws IllegalArgumentException (was: Reading Parquet data
page with repetition levels larger than column data throws
IllegalArgumentException)
> Reading Parquet v2 data page with repetition levels larger than column data
> throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
> Key: DRILL-8458
> URL: https://issues.apache.org/jira/browse/DRILL-8458
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.21.1
> Reporter: Peter Franzen
> Assignee: James Turton
> Priority: Major
> Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is
> larger than the size of the column data bytes,
> {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
> throws an {{IllegalArgumentException}}. This is caused by trying to set
> the limit of a {{ByteBuffer}} to a value larger than its capacity.
>
> The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
>
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220 (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226 (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize; {code}
>
> The buffer {{pageBuf}} contains the repetition level bytes followed by the
> definition level bytes followed by the column data bytes.
>
> The code at lines 217-221 reads the repetition level bytes, and then updates
> the position of the {{pageBuf}} buffer to the start of the definition level
> bytes (lines 222 and 224).
>
> The code at lines 225-227 reads the definition level bytes, and when creating
> a slice of the {{pageBuf}} buffer containing the definition level bytes, the
> slice's limit is set as if the position were still at the beginning of the
> repetition level bytes (line 226), i.e. as if it had not been updated.
>
> This means that if the capacity of the {{pageBuf}} buffer (which is the size of
> the repetition level bytes + the size of the definition level bytes + the
> size of the column data bytes) is less than (repLevelSize + repLevelSize +
> defLevelSize), i.e. whenever the column data bytes are smaller than the
> repetition level bytes, the call to {{limit()}} will throw.
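> As a concrete illustration with hypothetical sizes: if repLevelSize = 100,
> defLevelSize = 10 and there are 50 bytes of column data, the slice created at
> line 226 has a capacity of 60 (defLevelSize + column data), while the
> requested limit is pageBufOffset + defLevelSize = 100 + 10 = 110, so
> {{limit()}} throws.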
>
> The fix is to change line 226 to
> {code:java}
> (ByteBuffer) bb.slice().limit(defLevelSize){code}
>
> For symmetry, line 220 could also be changed to
> {code:java}
> (ByteBuffer) bb.slice().limit(repLevelSize){code}
>
> although {{pageBufOffset}} is always 0 there, so the limit cannot exceed the
> capacity.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)