[ https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim reassigned SPARK-27188:
------------------------------------

    Assignee: Jungtaek Lim

> FileStreamSink: provide a new option to have retention on output files
> ----------------------------------------------------------------------
>
>                 Key: SPARK-27188
>                 URL: https://issues.apache.org/jira/browse/SPARK-27188
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>
> In SPARK-24295 we noted that many end users struggle with a huge
> FileStreamSink metadata log. Unfortunately, because arbitrary readers rely
> on the metadata log to determine which files are safe to read (to ensure
> 'exactly-once'), pruning the metadata log is not trivial to implement.
> FileStreamSink could check for deleted output files and drop them when
> compacting metadata, but that operation would add overhead to the running
> query. (I'll try to address this via another issue though.)
> We can still let end users specify a time-to-live (TTL) for output files and
> filter expired files out of the metadata so that it does not grow linearly.
> Filtered-out files will also no longer be visible to reader queries that
> use File(Stream)Source.
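
The TTL-based filtering proposed in the issue can be illustrated with a minimal sketch. This is not Spark's implementation: `SinkFileEntry`, `retain_within_ttl`, and the millisecond timestamps are hypothetical names and assumptions chosen for illustration only. The idea is that when the metadata log is compacted, entries for files older than the user-supplied TTL are dropped, so the log stops growing linearly and readers no longer see those files.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SinkFileEntry:
    """Hypothetical stand-in for one output-file record in the sink metadata log."""
    path: str
    modification_time_ms: int


def retain_within_ttl(entries: List[SinkFileEntry],
                      now_ms: int,
                      ttl_ms: int) -> List[SinkFileEntry]:
    """Return only the entries whose age does not exceed the TTL.

    Assumption: an entry's age is measured from its modification time.
    Entries that are filtered out would simply be omitted from the next
    compacted metadata batch; the files themselves are not deleted here.
    """
    return [e for e in entries if now_ms - e.modification_time_ms <= ttl_ms]


# Example: with a TTL of 500 ms at time 1000, only the newer file is kept.
log = [
    SinkFileEntry("part-00000", modification_time_ms=0),
    SinkFileEntry("part-00001", modification_time_ms=900),
]
kept = retain_within_ttl(log, now_ms=1000, ttl_ms=500)
```

Note the trade-off the issue mentions: a reader that consults the filtered metadata will no longer see expired files, so the TTL should be chosen to exceed the time any downstream query needs to consume the output.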