JoshRosen commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1148136217

   Does checking for filesystem closed exceptions completely fix this issue or 
are we vulnerable to race conditions? Skimming through the [Hadoop DFSClient 
code](https://github.com/apache/hadoop/blame/2dfa928a201560bc739ff5e4fe29f8bb19188927/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java),
 it looks like it only throws `Filesystem closed` IOExceptions from a 
`checkOpen()` call, but what happens if the FS is closed while a thread has 
proceeded past `checkOpen()` and is in the middle of some other operation? In 
that case we might get a different IOException but would presumably want to 
treat it the same way (because it is a side-effect of another task being 
interrupted rather than a side-effect of a corrupt file).
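To make the race concrete, here is a minimal Java sketch of why a message-based check only covers the `checkOpen()` path. The helper name is hypothetical (this is not Spark's or Hadoop's actual code); the point is that `DFSClient.checkOpen()` throws an `IOException` whose message is exactly `Filesystem closed`, while a stream torn down mid-read can surface a different `IOException` subclass with a different (or null) message:

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class FsClosedCheck {
    // Hypothetical detector mirroring a message-based check:
    // DFSClient.checkOpen() throws new IOException("Filesystem closed").
    static boolean isFilesystemClosed(IOException e) {
        return "Filesystem closed".equals(e.getMessage());
    }

    public static void main(String[] args) {
        // Path 1: a thread hits checkOpen() after close() -> detected.
        IOException fromCheckOpen = new IOException("Filesystem closed");
        System.out.println(isFilesystemClosed(fromCheckOpen)); // true

        // Path 2: the FS is closed while a read is already past checkOpen();
        // the underlying channel surfaces ClosedChannelException (an
        // IOException subclass with a null message), which the check misses.
        IOException midRead = new ClosedChannelException();
        System.out.println(isFilesystemClosed(midRead)); // false
    }
}
```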
   
   It seems like the core problem is that we have an overly broad `catch` block: it is meant to catch corrupt files, but it also catches many other potential failures. For example, say we get an IOException caused by a transient network connectivity issue to external storage: that doesn't meet the intuitive definition of "corrupt file" but would still get caught in the dragnet of the current exception handler.
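The shape of that problem, sketched in Java (names and messages are hypothetical stand-ins, not the actual FileScanRDD code):

```java
import java.io.IOException;

public class BroadCatch {
    // Hypothetical stand-in for the per-file read loop. The catch below
    // treats *every* IOException as file corruption, so a transient
    // network failure is silently skipped instead of failing the task.
    static String readNextFile(boolean ignoreCorruptFiles) {
        try {
            // Simulated transient network error -- not corruption.
            throw new IOException("connection reset by peer");
        } catch (IOException e) {
            if (ignoreCorruptFiles) {
                return "skipped as 'corrupt': " + e.getMessage();
            }
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readNextFile(true));
    }
}
```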
   
   The current code seems biased towards "false positive" identifications of corruption (which can lead to correctness issues, since non-corruption failures get silently skipped). If we instead wanted to bias towards false negatives (i.e. mis-identifying true corruption as a transient failure, thereby failing the user's job), then maybe we could have the code in `readFunc` wrap and re-throw corruption errors in some sort of `CorruptFileException` wrapper and then modify the `catch` here to ignore only that new exception. This would require changes in a number of data sources, though. The FileScanRDD code might simply lack the information needed to identify true corruption cases, so pushing part of that decision one layer lower might make sense.
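One way the wrapper idea could look, sketched in Java (the class, the method names, and the magic-number check are all hypothetical, not an actual Spark API):

```java
public class CorruptFileWrapper {
    // Hypothetical marker exception a data source would throw when it can
    // positively identify corruption (bad magic number, checksum mismatch, ...).
    static class CorruptFileException extends java.io.IOException {
        CorruptFileException(String file) {
            super("Corrupt file: " + file);
        }
    }

    // Data-source side: wrap only positively-identified corruption.
    static void checkMagic(String file, byte[] magic) throws CorruptFileException {
        byte[] expected = {'P', 'A', 'R', '1'}; // illustrative Parquet-style magic
        if (!java.util.Arrays.equals(magic, expected)) {
            throw new CorruptFileException(file);
        }
    }

    // Scan side: the narrowed catch ignores only the marker type; any other
    // error (network failure, "Filesystem closed", ...) would propagate.
    static boolean tryRead(String file, byte[] magic) {
        try {
            checkMagic(file, magic);
            return true;
        } catch (CorruptFileException e) {
            return false; // log and skip, as ignoreCorruptFiles intends
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRead("good.parquet", new byte[]{'P', 'A', 'R', '1'})); // true
        System.out.println(tryRead("bad.parquet", new byte[]{0, 0, 0, 0}));          // false
    }
}
```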
   
   I think that could be a breaking change for certain users, though, so we'd need to treat it as a user-facing change and document it appropriately (and we might need to add escape-hatch flags in case users need to revert to the old (arguably buggy) behavior).
   
   ----
   
   I'm not sure what the right course of action is here; I just wanted to flag that there's a potentially broader issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

