HeartSaVioR commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-556964599

@zsxwing Ah OK, got it. That's a good point - reading files in a FileStreamSink output directory without the metadata information is unsafe anyway.

Btw, @gaborgsomogyi and I actually considered the edge cases where the query reads `sub-directory(-ies)` or an `ancestor with recursive option` of the FileStreamSink output directory, because the actual impact here is a kind of "side-effect" which "affects" other queries. It would be less problematic if the query only read the directory "incorrectly" and produced incorrect output. The thing is, the query will also mess up the output directory, since processed files will be cleaned up, and that will break other queries too.

So I feel we still have to make a decision that takes the possible side-effect into account: 1) try our best to prevent all known cases at (high?) cost, or 2) treat these edge cases as bad input that we don't handle at all (maybe we could document it instead).

What do you think?
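To make the two edge cases concrete, here is a minimal sketch in plain Python (not Spark code) of a purely lexical check that classifies how a source directory relates to a sink output directory. The function name `relation_to_sink_output` and the example paths are hypothetical, introduced only for illustration; the sketch assumes POSIX-style paths and does not handle glob patterns, symlinks, or differing filesystem schemes, which is part of why option 1) could be costly.

```python
from pathlib import PurePosixPath
from typing import Optional


def relation_to_sink_output(
    source_dir: str, sink_output_dir: str, recursive: bool = False
) -> Optional[str]:
    """Classify how a source directory overlaps a sink output directory.

    Hypothetical helper for illustration only; it compares path components
    lexically and never touches the filesystem.
    """
    source = PurePosixPath(source_dir)
    sink = PurePosixPath(sink_output_dir)

    if source == sink:
        # The source reads the sink output directory itself.
        return "same"
    if sink in source.parents:
        # The source reads a sub-directory of the sink output directory.
        return "sub-directory"
    if recursive and source in sink.parents:
        # The source reads an ancestor, and recursive lookup would descend
        # into the sink output directory.
        return "ancestor"
    # No lexical overlap (an ancestor without recursion is treated as safe here).
    return None
```

For example, `relation_to_sink_output("/data/out/part=1", "/data/out")` returns `"sub-directory"`, and `relation_to_sink_output("/data", "/data/out", recursive=True)` returns `"ancestor"`. Any non-`None` result is a case where cleaning up processed source files could delete files that other queries expect to find in the sink output.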