[
https://issues.apache.org/jira/browse/DRILL-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831924#comment-17831924
]
ASF GitHub Bot commented on DRILL-8486:
---------------------------------------
rymarm opened a new pull request, #2898:
URL: https://github.com/apache/drill/pull/2898
# [DRILL-8486](https://issues.apache.org/jira/browse/DRILL-8486): fix
handling of long variable length entries during bulk parquet reading
## Description
During bulk reading of a parquet file, Drill improperly handles a long
variable-length entry. Drill reads the value, but when it finds that the
current batch can’t hold it, it simply moves on without persisting the value
it has read. Because the value isn’t pushed back to the reader object, the
counts of records read and records left to read end up in an inconsistent
state, which causes a later read to fail.
This issue hasn’t been encountered before because the conditions that lead to
this state are rare.
**Solution**
If the current batch can’t hold the value, push it back to the reader object
so that it is read in the next iteration.
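
The push-back idea can be illustrated with a minimal, self-contained sketch.
This is not the code from the pull request and does not use Drill's classes:
`PushBackSketch`, `PushBackReader`, `readInBatches` and the character-count
"capacity" are hypothetical names and simplifications chosen only for
illustration. The point it shows is that an entry which was read but did not
fit is handed back to the reader, so no entry is lost between batches.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of the push-back pattern; not Drill's reader. */
final class PushBackSketch {

    /** Wraps a value source and lets the caller return one already-read entry. */
    static final class PushBackReader {
        private final Iterator<String> source;
        private String pushedBack; // entry that was read but not stored in a batch

        PushBackReader(List<String> values) {
            this.source = values.iterator();
        }

        boolean hasNext() {
            return pushedBack != null || source.hasNext();
        }

        String next() {
            if (pushedBack != null) {
                String value = pushedBack;
                pushedBack = null;
                return value;
            }
            return source.next();
        }

        void pushBack(String value) {
            pushedBack = value;
        }
    }

    /** Splits values into batches whose total length stays within batchCapacity. */
    static List<List<String>> readInBatches(List<String> values, int batchCapacity) {
        PushBackReader reader = new PushBackReader(values);
        List<List<String>> batches = new ArrayList<>();
        while (reader.hasNext()) {
            List<String> batch = new ArrayList<>();
            int used = 0;
            while (reader.hasNext()) {
                String entry = reader.next();
                // An oversized entry is accepted into an empty batch so that
                // the loop always makes progress (an assumption of this sketch).
                if (!batch.isEmpty() && used + entry.length() > batchCapacity) {
                    reader.pushBack(entry); // keep it for the next batch
                    break;
                }
                batch.add(entry);
                used += entry.length();
            }
            batches.add(batch);
        }
        return batches;
    }
}
```

With the `pushBack` call removed, the entry that did not fit would be silently
dropped and the number of entries delivered would no longer match the number
the source contains, which is the kind of count mismatch described above.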
## Documentation
\-
## Testing
Manual testing with a parquet file from the Jira ticket:
[DRILL-8486](https://issues.apache.org/jira/browse/DRILL-8486). It's hard to
reproduce this particular issue with random data.
> ParquetDecodingException: could not read bytes at offset
> ---------------------------------------------------------
>
> Key: DRILL-8486
> URL: https://issues.apache.org/jira/browse/DRILL-8486
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.21.1
> Reporter: Maksym Rymar
> Assignee: Maksym Rymar
> Priority: Major
> Attachments: test.parquet
>
>
> Drill fails to read a parquet file with the following exception:
>
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> bytes at offset 591804
> at
> org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:42)
> at
> org.apache.drill.exec.store.parquet.columnreaders.VarLenColumnBulkInput$ValuesReaderWrapper.getNextEntry(VarLenColumnBulkInput.java:754)
> ... 43 common frames omitted
> Caused by: java.io.EOFException: null
> at
> org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
> at
> org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
> at
> org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
> ... 44 common frames omitted {code}
>
>
> This issue only affects queries with {{store.parquet.flat.reader.bulk}} set
> to {{true}} (the default).
> Attaching the parquet file to reproduce the issue: [^test.parquet].
> Query: {{select log, app_name from dfs.tmp.`test.parquet`}}