Chao Sun created SPARK-55968:
--------------------------------

             Summary: Spark should not treat all RuntimeException as corrupted 
file when `spark.sql.files.ignoreCorruptFiles` is enabled
                 Key: SPARK-55968
                 URL: https://issues.apache.org/jira/browse/SPARK-55968
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.1.1, 3.5.8, 4.0.2
            Reporter: Chao Sun


When {{spark.sql.files.ignoreCorruptFiles}} is enabled, Spark currently catches any {{RuntimeException}} thrown while reading a data file, treats it as corruption, and skips the file. See 
[FileScanRDD|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L271-L276]
 and 
[DataSourceUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L202-L205]
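The classification logic is roughly equivalent to the following sketch (a simplified illustration, not the actual Spark code; the real code paths also handle {{FileNotFoundException}} via {{spark.sql.files.ignoreMissingFiles}}):

{code:java}
public class IgnoreCorruptFilesSketch {
    // Simplified sketch of the current behavior: when ignoreCorruptFiles
    // is enabled, *any* RuntimeException from the reader is treated as a
    // sign of a corrupt file, and the remainder of the file is skipped.
    static boolean treatedAsCorruption(Throwable e, boolean ignoreCorruptFiles) {
        return ignoreCorruptFiles && e instanceof RuntimeException;
    }

    public static void main(String[] args) {
        // A buffer-reservation failure is a plain RuntimeException, so it
        // is swallowed even though the file itself is perfectly fine.
        System.out.println(treatedAsCorruption(
            new RuntimeException("Cannot reserve additional contiguous bytes"), true));
    }
}
{code}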

In particular, the vectorized Parquet reader can fail with the following error:


{code:java}
java.lang.RuntimeException: Cannot reserve additional contiguous bytes in the 
vectorized reader (integer overflow). As a workaround, you can reduce the 
vectorized reader batch size, or disable the vectorized reader, or disable 
spark.sql.sources.bucketing.enabled if you read from bucket table. For Parquet 
file format, refer to spark.sql.parquet.columnarReaderBatchSize (default 4096) 
and spark.sql.parquet.enableVectorizedReader; for ORC file format, refer to 
spark.sql.orc.columnarReaderBatchSize (default 4096) and 
spark.sql.orc.enableVectorizedReader.
{code}

This error has nothing to do with file corruption, yet it is still treated as such, so the affected file is silently skipped, causing data loss.
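One possible direction (purely a sketch, not a committed design) is to ignore only exception types that actually signal a corrupt file, so that unrelated {{RuntimeException}}s like the one above still fail the task. The {{CorruptFileException}} marker below is hypothetical; in real Spark the allowlist could name concrete types such as Parquet decoding errors:

{code:java}
import java.io.EOFException;

public class CorruptionClassifierSketch {
    // Hypothetical marker for exceptions known to indicate a corrupt file.
    static class CorruptFileException extends RuntimeException {
        CorruptFileException(String msg) { super(msg); }
    }

    static boolean shouldIgnore(Throwable e, boolean ignoreCorruptFiles) {
        if (!ignoreCorruptFiles) return false;
        // Only allowlisted corruption signals are skipped; a generic
        // RuntimeException (e.g. the vectorized reader's buffer-reservation
        // failure) propagates and fails the task instead of dropping data.
        return e instanceof CorruptFileException || e instanceof EOFException;
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnore(new CorruptFileException("bad page"), true));
        System.out.println(shouldIgnore(
            new RuntimeException("Cannot reserve additional contiguous bytes"), true));
    }
}
{code}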



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
