Chao Sun created SPARK-55968:
--------------------------------
Summary: Spark should not treat all RuntimeException as corrupted
file when `spark.sql.files.ignoreCorruptFiles` is enabled
Key: SPARK-55968
URL: https://issues.apache.org/jira/browse/SPARK-55968
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.1.1, 3.5.8, 4.0.2
Reporter: Chao Sun
When {{spark.sql.files.ignoreCorruptFiles}} is enabled, Spark currently
catches any {{RuntimeException}} and skips the data file as if it were corrupted. See
[FileScanRDD|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L271-L276]
and
[DataSourceUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L202-L205].
In particular, the vectorized Parquet reader can sometimes fail with the
following error:
{code:java}
java.lang.RuntimeException: Cannot reserve additional contiguous bytes in the
vectorized reader (integer overflow). As a workaround, you can reduce the
vectorized reader batch size, or disable the vectorized reader, or disable
spark.sql.sources.bucketing.enabled if you read from bucket table. For Parquet
file format, refer to spark.sql.parquet.columnarReaderBatchSize (default 4096)
and spark.sql.parquet.enableVectorizedReader; for ORC file format, refer to
spark.sql.orc.columnarReaderBatchSize (default 4096) and
spark.sql.orc.enableVectorizedReader.
{code}
This failure has nothing to do with file corruption, yet it is still treated
as such: the file is silently skipped, causing data loss.
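The problem can be illustrated with a minimal sketch (hypothetical class and
method names, not Spark's actual code): the vectorized reader's overflow
failure is a plain {{RuntimeException}}, so a type-based catch cannot
distinguish it from genuine corruption, while a selective allow-list of
corruption-indicating exception types could.

{code:java}
// Hypothetical sketch of why a blanket RuntimeException catch is too broad.
public class CorruptFileClassifier {

    // Stand-in for the vectorized reader's integer-overflow failure; in
    // Spark it surfaces as a plain RuntimeException, carrying no
    // corruption-specific type information.
    public static class VectorizedReaderOverflow extends RuntimeException {
        public VectorizedReaderOverflow(String msg) { super(msg); }
    }

    // Simplified current behavior: any RuntimeException is treated as a
    // corrupt file, so the file is silently skipped.
    public static boolean shouldIgnoreCurrent(Throwable e) {
        return e instanceof RuntimeException;
    }

    // A more selective alternative (hypothetical allow-list): only ignore
    // exception types that genuinely indicate corrupt or truncated input,
    // and let reader-side failures propagate to the user.
    public static boolean shouldIgnoreSelective(Throwable e) {
        if (e instanceof VectorizedReaderOverflow) {
            return false; // reader limitation, must surface
        }
        return e instanceof java.util.zip.ZipException
            || e instanceof java.io.EOFException; // truncated/corrupt data
    }
}
{code}

Under the current logic the overflow error above is swallowed together with
real corruption; under the selective variant it would fail the query, which
is the behavior this issue argues for.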