[ https://issues.apache.org/jira/browse/SPARK-35565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim resolved SPARK-35565.
----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32702
[https://github.com/apache/spark/pull/32702]

> Add a config for ignoring metadata directory of file stream sink
> ----------------------------------------------------------------
>
>                 Key: SPARK-35565
>                 URL: https://issues.apache.org/jira/browse/SPARK-35565
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.2.0
>
>
> FileStreamSink produces a metadata directory that logs the output files of
> each micro-batch. When we read from the output path, Spark looks at the
> metadata and ignores any files not listed in the log.
> Normally this works well. But for some use cases, we may need to ignore the
> metadata when reading the output path. For example, when we change the
> streaming query and must run it with a new checkpoint directory, we cannot
> use the previous metadata. If we create new metadata too, then when we read
> the output path later, Spark reads only the files listed in the new
> metadata. The files written before we switched to the new checkpoint and
> metadata are ignored by Spark.
> We could write to a different output directory every time, but that is a
> bad idea because it produces many directories unnecessarily.
> It seems we need a config for ignoring the metadata of FileStreamSink when
> reading the output path.
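
To make the problem in the description concrete, here is a minimal sketch in plain Python. It is not Spark code. It only models a reader that trusts a per-batch metadata log, and shows what an "ignore metadata" switch changes. Only the `_spark_metadata` directory name mirrors Spark. The log format (a JSON list of file names per batch), the helpers `write_batch`, `reset_metadata`, and `list_input_files`, and the `ignore_metadata` flag are all hypothetical simplifications. The actual config key is not given in this message; see pull request 32702 for it.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from pathlib import Path

METADATA_DIR = "_spark_metadata"  # directory name used by Spark's file stream sink


def write_batch(output: Path, batch_id: int, files: dict[str, str]) -> None:
    """Write data files, then log their names for this micro-batch.

    The log format is a simplification invented for this sketch.
    """
    log_dir = output / METADATA_DIR
    log_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (output / name).write_text(content)
    (log_dir / str(batch_id)).write_text(json.dumps(sorted(files)))


def reset_metadata(output: Path) -> None:
    """Model restarting the query with a new checkpoint and fresh metadata."""
    shutil.rmtree(output / METADATA_DIR, ignore_errors=True)


def list_input_files(output: Path, ignore_metadata: bool = False) -> list[str]:
    """Return the data files a reader would pick up from the output path."""
    log_dir = output / METADATA_DIR
    if log_dir.is_dir() and not ignore_metadata:
        # Trust the log: only files recorded in it are visible.
        logged: set[str] = set()
        for entry in log_dir.iterdir():
            logged.update(json.loads(entry.read_text()))
        return sorted(logged)
    # Ignore the log: fall back to a plain directory listing.
    return sorted(p.name for p in output.iterdir() if p.is_file())


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp)
        # The old query writes two files under the old metadata.
        write_batch(output, 0, {"part-0": "a", "part-1": "b"})
        # The query changes: new checkpoint, new metadata, same output path.
        reset_metadata(output)
        write_batch(output, 0, {"part-2": "c"})
        return (
            list_input_files(output),
            list_input_files(output, ignore_metadata=True),
        )


if __name__ == "__main__":
    with_log, without_log = demo()
    print(with_log)     # ['part-2']
    print(without_log)  # ['part-0', 'part-1', 'part-2']
```

With the log honored, the files written before the restart are invisible to the reader. With the log ignored, a plain listing returns all three files, which is the behavior the issue asks to make configurable.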