[ https://issues.apache.org/jira/browse/PARQUET-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776453#comment-17776453 ]
Gang Wu commented on PARQUET-2367:
----------------------------------
Thanks for reporting this! I see that the configs involve the write path.
Have you tried reproducing this with a newer version of parquet-mr (e.g. 1.13.1)?
> NegativeArraySizeException on read for parquet files written with large strings in some cases
> ---------------------------------------------------------------------------------------------
>
> Key: PARQUET-2367
> URL: https://issues.apache.org/jira/browse/PARQUET-2367
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.12.2
> Reporter: Atul Felix Payapilly
> Priority: Major
>
> On Spark 3.3.1, which uses parquet 1.12.2, parquet files were successfully
> created using the default parquet configs. Note: the write succeeded, so this
> is not the same as: https://issues.apache.org/jira/browse/PARQUET-1632
>
> The payload had large strings, and reading the files back failed with the
> following exception:
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
>     at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:262)
>     at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:214)
>     at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:223)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:592)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.access$300(ColumnReaderImpl.java:57)
>     at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:536)
>     at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:533)
>     at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:95)
> {code}
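For context on the exception type: the JVM throws NegativeArraySizeException when an array is allocated with a negative length, and one common way to get there is a byte count larger than Integer.MAX_VALUE being narrowed to an int. The sketch below illustrates only that general mechanism. Whether it is the actual cause at BytesInput.java:262 is an assumption, and the class and method names are hypothetical, not parquet-mr code.

```java
class PageSizeOverflowDemo {
    // Narrow a 64-bit byte count to a 32-bit int, discarding the high bits.
    static int narrowToInt(long byteCount) {
        return (int) byteCount;
    }

    // Returns true if allocating a byte[] of the narrowed size throws
    // NegativeArraySizeException. Non-negative sizes are not allocated here,
    // so the demo never requests a large buffer.
    static boolean allocationThrows(long byteCount) {
        int size = narrowToInt(byteCount);
        if (size >= 0) {
            return false;
        }
        try {
            byte[] buffer = new byte[size];
            return buffer.length < 0; // unreachable: allocation throws first
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long justOverLimit = Integer.MAX_VALUE + 1L; // 2 GiB
        System.out.println(narrowToInt(justOverLimit));
        System.out.println(allocationThrows(justOverLimit));
        System.out.println(allocationThrows(1024L));
    }
}
```

If something like this is in play, the write can succeed while the read fails, because the failure only appears when the reader tries to materialize the oversized page as a single array.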
> The issue could be addressed with the following configs:
> {code:java}
> parquet.page.size.row.check.min=1
> parquet.page.size.row.check.max=1000
> parquet.page.size.check.estimate=false
> spark.sql.parquet.columnarReaderBatchSize=2098
> {code}
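One way to pass these settings to a Spark job is as --conf arguments. The sketch below builds that argument list. It assumes the parquet.* keys are forwarded to the Hadoop configuration through Spark's spark.hadoop. prefix, which is the usual route for Hadoop properties but has not been verified against this particular job. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParquetWorkaroundConf {
    // The four settings quoted in the issue, in their original order.
    static Map<String, String> workaroundSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("parquet.page.size.row.check.min", "1");
        settings.put("parquet.page.size.row.check.max", "1000");
        settings.put("parquet.page.size.check.estimate", "false");
        settings.put("spark.sql.parquet.columnarReaderBatchSize", "2098");
        return settings;
    }

    // Assumption: non-Spark keys reach the Parquet writer via the Hadoop
    // configuration, so they get the "spark.hadoop." prefix.
    static String sparkKey(String key) {
        return key.startsWith("spark.") ? key : "spark.hadoop." + key;
    }

    // Turn the settings into "--conf key=value" pairs for spark-submit.
    static List<String> toSubmitArgs(Map<String, String> settings) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            args.add("--conf");
            args.add(sparkKey(entry.getKey()) + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toSubmitArgs(workaroundSettings())));
    }
}
```

The three parquet.page.size.* settings take effect when the files are written, so files that already fail on read would need to be rewritten for them to matter.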
>