JoshRosen commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1148136217

   Does checking for filesystem closed exceptions completely fix this issue or 
are we vulnerable to race conditions? Skimming through the [Hadoop DFSClient 
code](https://github.com/apache/hadoop/blame/2dfa928a201560bc739ff5e4fe29f8bb19188927/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java),
 it looks like it only throws `Filesystem closed` IOExceptions from a 
`checkOpen()` call, but what happens if the FS is closed while a thread has 
proceeded past `checkOpen()` and is in the middle of some other operation? In 
that case we might get a different IOException but would presumably want to 
treat it the same way (because it is a side-effect of another task being 
interrupted rather than a side-effect of a corrupt file).
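To make the race concrete, here is a minimal Java sketch of why a message-based check only covers the `checkOpen()` path. The helper name is hypothetical (this is not Spark's or Hadoop's actual code); the point is that `DFSClient.checkOpen()` throws an `IOException` whose message is exactly `Filesystem closed`, while a stream torn down mid-read can surface a different `IOException` subclass with a different (or null) message:

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class FsClosedCheck {
    // Hypothetical detector mirroring a message-based check:
    // DFSClient.checkOpen() throws new IOException("Filesystem closed").
    static boolean isFilesystemClosed(IOException e) {
        return "Filesystem closed".equals(e.getMessage());
    }

    public static void main(String[] args) {
        // Path 1: a thread hits checkOpen() after close() -> detected.
        IOException fromCheckOpen = new IOException("Filesystem closed");
        System.out.println(isFilesystemClosed(fromCheckOpen)); // true

        // Path 2: the FS is closed while a read is already past checkOpen();
        // the underlying channel surfaces ClosedChannelException (an
        // IOException subclass with a null message), which the check misses.
        IOException midRead = new ClosedChannelException();
        System.out.println(isFilesystemClosed(midRead)); // false
    }
}
```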
   
   It seems like the core problem is that we have an overly broad `catch` block: it is meant to catch corrupt files, but it also catches many other potential failures. For example, say we get an IOException caused by a transient network connectivity issue to external storage: that doesn't meet the intuitive definition of "corrupt file" but would still get caught in the dragnet of the current exception handler.
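The shape of that problem, sketched in Java (names and messages are hypothetical stand-ins, not the actual FileScanRDD code):

```java
import java.io.IOException;

public class BroadCatch {
    // Hypothetical stand-in for the per-file read loop. The catch below
    // treats *every* IOException as file corruption, so a transient
    // network failure is silently skipped instead of failing the task.
    static String readNextFile(boolean ignoreCorruptFiles) {
        try {
            // Simulated transient network error -- not corruption.
            throw new IOException("connection reset by peer");
        } catch (IOException e) {
            if (ignoreCorruptFiles) {
                return "skipped as 'corrupt': " + e.getMessage();
            }
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readNextFile(true));
    }
}
```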
   
   The current code seems biased towards "false positive" identifications of corruption (which can lead to correctness issues, since non-corruption failures get silently skipped). If we instead wanted to bias towards false negatives (i.e. mis-identifying true corruption as a transient failure, thereby failing the user's job), then maybe we could have the code in `readFunc` wrap and re-throw corruption errors in some sort of `CorruptFileException` wrapper and then modify the `catch` here to ignore only that new exception. This would require changes in a number of data sources, though. The FileScanRDD code might simply lack the information needed to identify true corruption cases, so pushing part of that decision one layer lower might make sense.
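One way the wrapper idea could look, sketched in Java (the class, the method names, and the magic-number check are all hypothetical, not an actual Spark API):

```java
public class CorruptFileWrapper {
    // Hypothetical marker exception a data source would throw when it can
    // positively identify corruption (bad magic number, checksum mismatch, ...).
    static class CorruptFileException extends java.io.IOException {
        CorruptFileException(String file) {
            super("Corrupt file: " + file);
        }
    }

    // Data-source side: wrap only positively-identified corruption.
    static void checkMagic(String file, byte[] magic) throws CorruptFileException {
        byte[] expected = {'P', 'A', 'R', '1'}; // illustrative Parquet-style magic
        if (!java.util.Arrays.equals(magic, expected)) {
            throw new CorruptFileException(file);
        }
    }

    // Scan side: the narrowed catch ignores only the marker type; any other
    // error (network failure, "Filesystem closed", ...) would propagate.
    static boolean tryRead(String file, byte[] magic) {
        try {
            checkMagic(file, magic);
            return true;
        } catch (CorruptFileException e) {
            return false; // log and skip, as ignoreCorruptFiles intends
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRead("good.parquet", new byte[]{'P', 'A', 'R', '1'})); // true
        System.out.println(tryRead("bad.parquet", new byte[]{0, 0, 0, 0}));          // false
    }
}
```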
   
   I think that could be a breaking change for certain users, though, so we'd need to treat it as a user-facing change and document it appropriately (and we might need to add escape-hatch flags in case users need to revert to the old (arguably buggy) behavior).
   
   ----
   
   I'm not sure what the right course of action is here; I just wanted to flag that there's a potentially broader issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

