HeartSaVioR edited a comment on pull request #32702: URL: https://github.com/apache/spark/pull/32702#issuecomment-851062762
And one more: I think having the file stream sink ignore the metadata directory when reading existing metadata, yet still write to that directory, is odd and error-prone. The metadata is no longer valid once Spark starts writing new metadata to the same directory, and the option must be set to true for such a directory to be read properly even though Spark itself wrote the metadata. There is no indication of this; end users simply have to remember it. The ideal approach is to write metadata into the directory recording whether it is set to at-least-once (multi-writer) or exactly-once (single-writer) the first time the directory is written, and then honor that marker consistently, instead of changing behavior depending on the query's config/option. This would make the directory's semantics consistent.

By the way, I've made further improvements to the file stream source and file stream sink, but I had to concede that the effort largely duplicates what data lake solutions already provide. (See the discussion in #27694.) Once you start addressing the issues one by one, you realize they are exactly the things the data lake solutions have already fixed. That's why I stopped working on the file stream source and sink, though I think ETL into data lake solutions is still a valid use case, in which case the long-standing issue in the file stream source should be fixed: https://github.com/apache/spark/pull/28422
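To make the proposal concrete, here is a minimal sketch of a per-directory write-mode marker: record the mode in a marker file the first time the metadata directory is written, then consult that marker on every later read or write instead of a per-query option. The `SinkModeMarker` class, the `_write_mode` file name, and the mode strings are all hypothetical illustrations, not anything Spark actually implements:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class SinkModeMarker {
    static final String MARKER_FILE = "_write_mode";      // hypothetical file name
    static final String EXACTLY_ONCE = "exactly-once";    // single-writer directory
    static final String AT_LEAST_ONCE = "at-least-once";  // multi-writer directory

    // The first writer records the mode; later writers must agree with it,
    // so the directory's semantics never silently change.
    static void writeMode(Path metadataDir, String mode) throws IOException {
        Path marker = metadataDir.resolve(MARKER_FILE);
        if (Files.exists(marker)) {
            String existing = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
            if (!existing.equals(mode)) {
                throw new IllegalStateException("Directory already initialized as '"
                    + existing + "'; refusing to write as '" + mode + "'");
            }
        } else {
            Files.createDirectories(metadataDir);
            Files.write(marker, mode.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Readers consult the marker instead of requiring end users to remember
    // to set an option for this particular directory.
    static Optional<String> readMode(Path metadataDir) throws IOException {
        Path marker = metadataDir.resolve(MARKER_FILE);
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        return Optional.of(new String(Files.readAllBytes(marker), StandardCharsets.UTF_8));
    }
}
```

With this shape, a second query that attaches to the directory with a conflicting mode fails fast instead of silently producing metadata that later reads cannot interpret.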