[
https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780610#comment-17780610
]
ASF GitHub Bot commented on DRILL-8458:
---------------------------------------
jnturton commented on PR #2838:
URL: https://github.com/apache/drill/pull/2838#issuecomment-1783778303
A test data file name like that would be great, thank you. Yes, please could
you move the test itself inside TestParquetComplex and see if you can integrate
your test data file generation code into ParquetSimpleTestFileGenerator? It
already takes a Parquet version number parameter IIRC (up until now, only ever
set to v1). I think we may need to come back and organise our tests better for
Parquet v2, but I don't want to saddle this PR with that reorganisation.
> Reading Parquet v2 data page with repetition levels larger than column data
> throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
> Key: DRILL-8458
> URL: https://issues.apache.org/jira/browse/DRILL-8458
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.21.1
> Reporter: Peter Franzen
> Assignee: James Turton
> Priority: Major
> Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is
> larger than the size of the column data bytes,
> {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
> throws an {{IllegalArgumentException}}. This is caused by trying to set
> the limit of a ByteBuffer to a value larger than its capacity.
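>
> As a standalone illustration of that ByteBuffer behaviour (plain JDK code,
> not Drill code; the class name is made up), setting a limit beyond a
> buffer's capacity throws:
> {code:java}
> import java.nio.ByteBuffer;
>
> public class LimitDemo {
>     public static void main(String[] args) {
>         ByteBuffer buf = ByteBuffer.allocate(10); // capacity 10
>         buf.limit(5);  // fine: 5 <= capacity
>         buf.limit(11); // throws IllegalArgumentException: 11 > capacity
>     }
> }
> {code}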
>
> The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
>
> {code:java}
> 217    int pageBufOffset = 0;
> 218    ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219    BytesInput repLevelBytes = BytesInput.from(
> 220        (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221    );
> 222    pageBufOffset += repLevelSize;
> 223
> 224    bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225    final BytesInput defLevelBytes = BytesInput.from(
> 226        (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227    );
> 228    pageBufOffset += defLevelSize;
> {code}
>
> The buffer {{pageBuf}} contains the repetition level bytes followed by the
> definition level bytes followed by the column data bytes.
>
> The code at lines 217-221 reads the repetition level bytes, and then updates
> the position of the {{pageBuf}} buffer to the start of the definition level
> bytes (lines 222 and 224).
>
> The code at lines 225-227 reads the definition level bytes, and when creating
> a slice of the {{pageBuf}} buffer containing the definition level bytes, the
> slice's limit is set as if the position were still at the beginning of the
> repetition level bytes (line 226), i.e. as if it had not been updated.
>
> This means that if the capacity of the {{pageBuf}} buffer (which is the size
> of the repetition level bytes + the size of the definition level bytes + the
> size of the column data bytes) is less than 2 * repLevelSize + defLevelSize,
> i.e. whenever the column data bytes are fewer than the repetition level
> bytes, the call to limit() will throw.
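>
> The failure can be reproduced outside Drill with the same slicing pattern.
> The sizes below are invented for illustration, chosen so that the column
> data is smaller than the repetition levels:
> {code:java}
> import java.nio.ByteBuffer;
>
> public class SliceLimitDemo {
>     public static void main(String[] args) {
>         int repLevelSize = 6, defLevelSize = 2, dataSize = 1; // data < rep levels
>         // pageBuf layout: [rep levels][def levels][column data], capacity 9
>         ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);
>
>         int pageBufOffset = 0;
>         ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
>         bb.slice().limit(pageBufOffset + repLevelSize); // OK: limit 6 <= slice capacity 9
>         pageBufOffset += repLevelSize; // now 6
>
>         bb = (ByteBuffer) pageBuf.position(pageBufOffset);
>         // slice() starts at position 6, so its capacity is 9 - 6 = 3, but the
>         // requested limit is pageBufOffset + defLevelSize = 8 -> IllegalArgumentException
>         bb.slice().limit(pageBufOffset + defLevelSize);
>     }
> }
> {code}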
>
> The fix is to change line 226 to
> {code:java}
> (ByteBuffer) bb.slice().limit(defLevelSize){code}
>
> For symmetry, line 220 could also be changed to
> {code:java}
> (ByteBuffer) bb.slice().limit(repLevelSize){code}
>
> although {{pageBufOffset}} is always 0 at that point, so it cannot cause the
> limit to exceed the capacity.
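>
> Putting both changes together, the corrected passage would read (a sketch
> only, keeping the variable names from the original snippet):
> {code:java}
> int pageBufOffset = 0;
> ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> BytesInput repLevelBytes = BytesInput.from(
>     (ByteBuffer) bb.slice().limit(repLevelSize) // limit is relative to the slice
> );
> pageBufOffset += repLevelSize;
>
> bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> final BytesInput defLevelBytes = BytesInput.from(
>     (ByteBuffer) bb.slice().limit(defLevelSize) // was: pageBufOffset + defLevelSize
> );
> pageBufOffset += defLevelSize;
> {code}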
>