HeartSaVioR edited a comment on pull request #32702: URL: https://github.com/apache/spark/pull/32702#issuecomment-851062762
And one more: I think having the file stream sink ignore the metadata directory when reading existing metadata, yet still write to that directory, is odd and error-prone. The metadata is no longer valid once Spark starts writing new metadata to the same directory, and the option must be set to true for such a directory to be read properly even though Spark itself wrote the metadata. There is no indication of this; end users simply have to remember it. The ideal approach is to write metadata into the directory recording whether it is set to at-least-once (multi-writer) or exactly-once (single-writer) the first time the directory is written, and then honor that marker consistently, instead of changing behavior depending on the query's config/option. This would make the directory's semantics consistent.

By the way, I've made further improvements to the file stream source and file stream sink, but I had to concede that the effort largely duplicates what data lake solutions already provide. (See the discussion in #27694.) Once you start addressing the issues one by one, you realize they are exactly the things the data lake solutions have already fixed. That's why I stopped working on the file stream source and sink, though I think ETL into data lake solutions is still a valid use case, in which case the long-standing issue in the file stream source should be fixed: https://github.com/apache/spark/pull/28422
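To make the proposal concrete, here is a minimal sketch of a per-directory write-mode marker: record the mode in a marker file the first time the metadata directory is written, then consult that marker on every later read or write instead of a per-query option. The `SinkModeMarker` class, the `_write_mode` file name, and the mode strings are all hypothetical illustrations, not anything Spark actually implements:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class SinkModeMarker {
    static final String MARKER_FILE = "_write_mode";      // hypothetical file name
    static final String EXACTLY_ONCE = "exactly-once";    // single-writer directory
    static final String AT_LEAST_ONCE = "at-least-once";  // multi-writer directory

    // The first writer records the mode; later writers must agree with it,
    // so the directory's semantics never silently change.
    static void writeMode(Path metadataDir, String mode) throws IOException {
        Path marker = metadataDir.resolve(MARKER_FILE);
        if (Files.exists(marker)) {
            String existing = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
            if (!existing.equals(mode)) {
                throw new IllegalStateException("Directory already initialized as '"
                    + existing + "'; refusing to write as '" + mode + "'");
            }
        } else {
            Files.createDirectories(metadataDir);
            Files.write(marker, mode.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Readers consult the marker instead of requiring end users to remember
    // to set an option for this particular directory.
    static Optional<String> readMode(Path metadataDir) throws IOException {
        Path marker = metadataDir.resolve(MARKER_FILE);
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        return Optional.of(new String(Files.readAllBytes(marker), StandardCharsets.UTF_8));
    }
}
```

With this shape, a second query that attaches to the directory with a conflicting mode fails fast instead of silently producing metadata that later reads cannot interpret.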