[GitHub] [spark] yaooqinn opened a new pull request, #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

GitBox Tue, 27 Sep 2022 18:55:30 -0700


yaooqinn opened a new pull request, #38024:
URL: https://github.com/apache/spark/pull/38024


### What changes were proposed in this pull request?

Consider the case below. The left and right sides read the same table and
partitions, both with ignoreCorruptFiles=true. On the right, a task skips part
of the data it reads after encountering 'corrupt data', while the left reads
the same file correctly. Because ignoreCorruptFiles coarsely treats any
RuntimeException or IOException as corruption, such an exception does not
always indicate that the data is actually corrupt.

![image](https://user-images.githubusercontent.com/8326978/192667546-30d20739-a322-4618-8fb7-b0fa24301bcc.png)

What's worse, such tasks are always marked as successful on the web UI. The
same query reading the same snapshot of data can therefore silently return
inconsistent results.

In this PR, ignoreCorruptFiles works together with taskAttemptNumber: only the
last attempt ignores the possibly corrupted file. Users may want fewer retries
to avoid performance regressions, so ignoreCorruptFilesAfterRetries is
introduced, which can be set to a value less than `spark.task.maxFailures`.
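
The decision described above can be sketched roughly as follows. This is a standalone Python illustration, not the code in this PR: the function names, the 0-based attempt numbering, and the exact meaning of ignoreCorruptFilesAfterRetries (treated here as the attempt number from which skipping is allowed, capped at the last attempt) are all assumptions made for the example.

```python
def should_ignore_corrupt_file(ignore_corrupt_files, attempt_number,
                               max_failures=4, ignore_after_retries=None):
    """Hypothetical helper: may this attempt skip a file that failed to read?

    Assumes attempt_number is 0-based, so the last attempt is max_failures - 1.
    ignore_after_retries is an assumed reading of ignoreCorruptFilesAfterRetries.
    """
    if not ignore_corrupt_files:
        return False
    last_attempt = max_failures - 1
    if ignore_after_retries is None:
        threshold = last_attempt
    else:
        threshold = min(ignore_after_retries, last_attempt)
    return attempt_number >= threshold


def run_task(read_file, max_failures=4, ignore_after_retries=None):
    """Simulate task retries; returns (rows, number_of_failed_attempts)."""
    failed = 0
    for attempt in range(max_failures):
        try:
            return read_file(attempt), failed
        except (IOError, RuntimeError):
            if should_ignore_corrupt_file(True, attempt, max_failures,
                                          ignore_after_retries):
                return [], failed  # file is skipped, task succeeds
            failed += 1  # attempt fails visibly and the task is retried
    raise RuntimeError("task failed")


def flaky_read(attempt):
    # A transient error on the first attempt, not real corruption.
    if attempt == 0:
        raise IOError("transient read error")
    return ["row1", "row2"]


def corrupt_read(attempt):
    # A file that fails on every attempt.
    raise RuntimeError("cannot decode file")
```

Under these assumptions, `run_task(flaky_read)` retries once and returns both rows instead of silently dropping them, while `run_task(corrupt_read)` records three failed attempts before the last attempt skips the file. Passing `ignore_after_retries=1` skips the file on the second attempt instead.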

### Why are the changes needed?

To fix data loss.

Also, the UI now shows failed tasks for both genuine and falsely detected data
corruption, which helps with bug hunting.

### Does this PR introduce _any_ user-facing change?

No, it's a bug fix (apart from a possible UI change, as described above).

### How was this patch tested?

Tested locally and with the existing tests for ignoreCorruptFiles.
