HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558433816

Maybe we can differentiate two major cases:

1) The downstream query that reads the output directory is also Spark (and leverages the metadata)

In this case, technically we would never be able to delete any entries in the metadata if we want to ensure the downstream query provides the same result across multiple runs (unless inputs are added in real time). We know that's only the ideal: if the streaming query runs for a long time and writes a gigantic number/size of files, we would want to get rid of some part to gain speed and save storage, fully understanding that we are throwing out some inputs, which will affect the result of the query.

Assume we decided to get rid of some output files. How do we do it safely? The only safe way is to remove them from the metadata first, and then delete the actual files. (The downstream query relies on the metadata to get the list of files, so if we don't make sure they are removed from the metadata first, the downstream query will try to read a file which no longer exists, and fail, depending on the option.) That means the running streaming query should deal with the deletion, as we don't have any official offline tool to modify the metadata, and it is difficult to work out "how" to let the streaming query know which files to delete. That's why I simply picked "retention", which is a generally acceptable approach (Kafka also applies a retention policy by default).

2) We never let Spark read the output directory; we let other frameworks read the directory

In this case we don't need to build metadata, though this means end users will need to deal with an "at-least-once" guarantee. Given the file sink doesn't overwrite files, it may leave corrupted records in partial output as well.
If that's acceptable, we may be able to add an option to "disable" metadata, though there were some comments expressing concern about doing so: https://github.com/apache/spark/pull/24128#issuecomment-474109068

So I guess there are not many options here, and I guess I picked the viable one, but I'd really appreciate more ideas!
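
To make the ordering in case 1 concrete, here is a minimal sketch of "remove from the metadata first, then delete the actual files" driven by a retention period. It is not the code in this PR and does not use Spark's real metadata log format: the JSON-lines layout, the `path` and `modified_ms` fields, and the `apply_retention` helper are all hypothetical, chosen only to illustrate the order of operations.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List


def apply_retention(metadata_path: Path, retention_ms: int, now_ms: int) -> List[str]:
    """Drop expired entries from a (hypothetical) metadata log, then delete their files.

    Each line of the log is assumed to be a JSON object with the keys
    "path" and "modified_ms". This is an illustrative format only.
    """
    lines = metadata_path.read_text().splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]

    kept = [e for e in entries if now_ms - e["modified_ms"] <= retention_ms]
    expired = [e for e in entries if now_ms - e["modified_ms"] > retention_ms]

    # Step 1: publish the trimmed metadata first. Writing to a temp file in the
    # same directory and renaming it over the log means a reader sees either
    # the old list or the new list, never a half-written one.
    fd, tmp_name = tempfile.mkstemp(dir=str(metadata_path.parent))
    with os.fdopen(fd, "w") as tmp:
        for entry in kept:
            tmp.write(json.dumps(entry) + "\n")
    os.replace(tmp_name, str(metadata_path))

    # Step 2: only now delete the data files. A reader that lists files via the
    # metadata can no longer be pointed at a file that is about to disappear.
    deleted = []
    for entry in expired:
        try:
            os.remove(entry["path"])
        except FileNotFoundError:
            pass  # already gone, e.g. a previous run stopped after step 1
        deleted.append(entry["path"])
    return deleted
```

If the process stops between the two steps, the result is orphaned files that the metadata no longer lists, which wastes storage but does not break a reader that trusts the metadata. Doing the steps in the opposite order would risk the failure described above.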