HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558433816
 
 
   Maybe we can differentiate two major cases:
   
   1) The downstream query that reads the output directory is also Spark (and 
leverages the metadata)
   
   In this case, technically we can never delete any entries in the metadata 
if we want to ensure the downstream query produces the same result across 
multiple runs (unless inputs are added in real time). 
   
   We know that's only the ideal - if the streaming query runs for a long time 
and writes a gigantic number/size of files, we would want to get rid of some 
of them to gain speed and save storage, fully understanding that we are 
throwing out some inputs, which will affect the result of the query.
   
   Assume we decided to get rid of some output files. How do we do it safely? 
The only safe way is to remove them from the metadata first, and then delete 
the actual files. (The downstream query relies on the metadata to get the list 
of files, so if we don't make sure they are removed from the metadata first, 
the downstream query will try to read files which no longer exist, and will 
fail - depending on the option.) 
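
   To illustrate the ordering, here is a minimal sketch, not Spark's actual 
implementation. It assumes a hypothetical JSON-lines manifest with one 
`{"path": ...}` entry per output file; the function name `drop_outputs` and 
the manifest layout are made up for illustration and are not the real 
metadata log format.

```python
import json
import os
import tempfile


def drop_outputs(manifest_path, doomed):
    """Remove `doomed` paths from a (hypothetical) manifest, then delete the files.

    Returns the list of paths that remain in the manifest.
    """
    with open(manifest_path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    kept = [e for e in entries if e["path"] not in doomed]

    # Step 1: publish the shrunken manifest first. Writing to a temp file and
    # renaming it means a reader sees either the old list or the new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(manifest_path) or ".")
    with os.fdopen(fd, "w") as f:
        for e in kept:
            f.write(json.dumps(e) + "\n")
    os.replace(tmp, manifest_path)

    # Step 2: only now delete the data files that are no longer listed.
    for path in doomed:
        if os.path.exists(path):
            os.remove(path)

    return [e["path"] for e in kept]
```

   A reader that loaded the old manifest before step 1 could still hit a 
missing file; the sketch only guarantees that the manifest never lists a file 
that has already been deleted.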
   
   That means the running streaming query should deal with the deletion, as we 
don't have any official offline tool to modify the metadata, and it is hard to 
see "how" to let the streaming query know which files to delete. That's why I 
simply picked "retention", which is a generally acceptable approach (Kafka 
also applies a retention policy by default).
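
   A rough sketch of what a retention check could look like, assuming each 
metadata entry carries a `path` and a `modificationTime` in milliseconds. The 
function `split_by_retention` and both field names are hypothetical here, and 
this is not the code in the PR.

```python
def split_by_retention(entries, now_ms, retention_ms):
    """Split entries into (kept, expired) by age relative to `now_ms`.

    An entry is expired when it is strictly older than `retention_ms`.
    """
    kept, expired = [], []
    for entry in entries:
        age_ms = now_ms - entry["modificationTime"]
        if age_ms > retention_ms:
            expired.append(entry)
        else:
            kept.append(entry)
    return kept, expired


# Example: with a retention of 1000 ms at time 5000, only the first file expires.
entries = [
    {"path": "part-0000", "modificationTime": 1000},
    {"path": "part-0001", "modificationTime": 4500},
]
kept, expired = split_by_retention(entries, now_ms=5000, retention_ms=1000)
```

   The streaming query only needs the current time and one option to decide, 
so nobody has to tell it which individual files to drop.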
   
   2) We never let Spark read the output directory - we let other frameworks 
read the directory
   
   In this case we don't need to build the metadata - though this means end 
users will need to deal with an "at-least-once" guarantee. Given that the file 
sink doesn't overwrite files, it may leave corrupted records in partial output 
as well. If that's acceptable, we may be able to add an option to "disable" 
the metadata, though there were some comments expressing worry about doing it: 
https://github.com/apache/spark/pull/24128#issuecomment-474109068
   
   So I guess there aren't many options here and I guess I picked the viable 
one, but I'd really appreciate more ideas!
