HeartSaVioR commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851062762


   And one more: I think letting the file stream sink ignore the metadata 
directory on read while still writing to that directory is odd and 
error-prone. The existing metadata is no longer valid once Spark starts 
writing new metadata to the same directory, and the option must be set to 
true for such a directory to be read properly, even though Spark itself 
wrote the metadata. There's no indication of this, so end users simply have 
to memorize it.
   
   The ideal approach would be to write metadata into the directory, when it 
is first written, indicating whether the directory is at-least-once (i.e. 
multi-writer) or exactly-once (i.e. single-writer), and then honor that 
marker all the time instead of changing behavior depending on the query's 
config/option. This would bring consistency to the directory.
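   A minimal sketch of what such a marker-file protocol could look like. All names here (`_write_mode`, the function names) are hypothetical illustrations, not actual Spark APIs: the write mode is persisted on first write and consulted by every reader, so behavior no longer depends on a per-query option.

   ```python
   # Hypothetical sketch: persist the directory's write mode once, at first
   # write, and let readers consult it instead of a per-query option.
   import json
   from pathlib import Path

   MARKER = "_write_mode"  # hypothetical marker file name


   def init_output_dir(path: str, exactly_once: bool) -> None:
       """On first write, record whether the directory is single-writer
       (exactly-once) or multi-writer (at-least-once). Subsequent writers
       must not overwrite an existing marker."""
       p = Path(path)
       p.mkdir(parents=True, exist_ok=True)
       marker = p / MARKER
       if not marker.exists():
           mode = "exactly-once" if exactly_once else "at-least-once"
           marker.write_text(json.dumps({"mode": mode}))


   def read_mode(path: str) -> str:
       """Readers use the marker, so the read path is consistent regardless
       of which query reads the directory."""
       marker = Path(path) / MARKER
       if marker.exists():
           return json.loads(marker.read_text())["mode"]
       # Legacy directory without a marker: treat metadata as authoritative.
       return "exactly-once"
   ```

   With this, a directory first written by a multi-writer query stays marked as at-least-once, and a later reader needs no extra option to read it correctly.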
   
   Btw I've made more improvements to the file stream source and file stream 
sink, but I had to agree that the effort largely duplicates data lake 
solutions. Once you start addressing the issues one by one, you realize 
these are exactly the problems data lake solutions have already fixed. 
That's why I stopped working on the file stream source and file stream sink, 
though I guess ETL into data lake solutions is still a valid use case, and 
for that the long-standing issue on the file stream source should be fixed - 
https://github.com/apache/spark/pull/28422

