HeartSaVioR commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851058399


   This was already proposed before as part of #31638, though I'm not sure 
whether you've seen it.
   
   Quoting my comment 
https://github.com/apache/spark/pull/31638#issuecomment-787221374 :
   
   > The behavioral change is bound to the file data source, right? I prefer 
adding a source option instead of a config, because 1) Spark already has a 
bunch of configurations, and 2) I'd prefer a smaller range of impact: per 
source instead of session-wide.
   > 
   > Also, the option should be used carefully (at the least, we should 
indicate this to end users), as the metadata in the file stream sink output 
ensures that only the "correctly written" files are read. What would happen if 
they ignore it?
   > 
   > Files not listed in the metadata could be corrupted, and those files would 
now be read. Users may need to turn on the "ignore corrupted files" flag as 
well.
   > 
   > They are no longer able to consider the output directory as "exactly once" 
- it becomes "at least once", meaning output rows can be written multiple 
times. Without deduplication or corresponding logic on the read side, this may 
result in incorrect output.
   > 
   > Technically this is an existing issue when a batch query reads from 
multiple directories or a glob path that includes the output directory of a 
file stream sink. (That said, users could use a glob path as a workaround 
without adding the new configuration, though I agree an explicit config is more 
intuitive.) I'd love to see a proper notice for that case as well, since we are 
here.
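
   To make the trade-off concrete, here is a rough sketch of the listing 
behavior being discussed (plain Python, not Spark code; the file names and the 
simplified model of the `_spark_metadata` log are illustrative assumptions). 
The file stream sink records successfully committed files in its metadata log; 
a metadata-aware reader lists only those, while a plain file source that 
ignores the log also picks up partial or duplicated files:

   ```python
   # Sketch only: a simplified model of how metadata-aware listing differs
   # from plain directory listing. Not Spark's actual implementation.

   def list_committed(all_files, metadata_log):
       """Metadata-aware listing: only files recorded in the sink's log."""
       return [f for f in all_files if f in metadata_log]

   def list_ignoring_metadata(all_files):
       """Plain file-source listing: every file in the directory."""
       return list(all_files)

   # Hypothetical directory contents: one file is a retried duplicate that
   # was never committed to the metadata log.
   directory = ["part-0000.parquet", "part-0001.parquet",
                "part-0001-retry.parquet"]
   committed = {"part-0000.parquet", "part-0001.parquet"}

   print(list_committed(directory, committed))
   print(list_ignoring_metadata(directory))
   ```

   With metadata honored, the duplicate never surfaces; with metadata ignored, 
it is read alongside the committed files, which is the "at least once" 
behavior described above.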


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


