[
https://issues.apache.org/jira/browse/SPARK-40591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kent Yao resolved SPARK-40591.
------------------------------
Resolution: Duplicate
> ignoreCorruptFiles results data loss
> ------------------------------------
>
> Key: SPARK-40591
> URL: https://issues.apache.org/jira/browse/SPARK-40591
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0, 3.3.0, 3.2.2, 3.4.0, 4.1.0, 3.5.6, 4.0.1
> Reporter: Kent Yao 2
> Priority: Critical
> Labels: correctness
> Attachments: image-2022-09-28-09-20-21-693.png
>
>
> Consider the case below. The left and right sides read the same table and
> the same partitions, both with ignoreCorruptFiles=true. On the right, a task
> skips part of the data it reads because it encounters 'corrupt data', while
> the left side reads the same file correctly. Because ignoreCorruptFiles
> coarsely catches RuntimeException and IOException, a caught exception does
> not always indicate data corruption.
> !image-2022-09-28-09-20-21-693.png!
>
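The behaviour described above can be illustrated with a small simulation. This
is only a sketch, not Spark's implementation: the names rows_of_file and
read_ignoring_corrupt are hypothetical, and the "transient read failure" is an
assumed example of an exception that does not stem from real corruption. It
shows how catching a broad exception class mid-file silently drops the
remaining rows.

```python
# Illustrative simulation only; this is not Spark's actual implementation.
from typing import Iterator, List, Optional


def rows_of_file(fail_at: Optional[int] = None) -> Iterator[int]:
    """Yield rows 0..4 of a hypothetical file; optionally fail part-way."""
    for i in range(5):
        if fail_at is not None and i == fail_at:
            # Assumption: stands in for a non-corruption failure
            # (for example a timeout), not for damaged file contents.
            raise IOError("transient read failure")
        yield i


def read_ignoring_corrupt(rows: Iterator[int]) -> List[int]:
    """Collect rows, treating any IOError/RuntimeError as a 'corrupt file'."""
    out: List[int] = []
    try:
        for row in rows:
            out.append(row)
    except (IOError, RuntimeError):
        # Coarse handling: the rest of the file is silently skipped.
        pass
    return out


left = read_ignoring_corrupt(rows_of_file())            # no failure
right = read_ignoring_corrupt(rows_of_file(fail_at=3))  # failure at row 3
print(left)   # [0, 1, 2, 3, 4]
print(right)  # [0, 1, 2]
```

Both reads target the same hypothetical file, yet the second returns fewer rows
without reporting any error, which mirrors the left/right discrepancy in the
attached screenshot.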
--
This message was sent by Atlassian Jira
(v8.20.10#820010)