[ https://issues.apache.org/jira/browse/SPARK-30915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu resolved SPARK-30915.
----------------------------------
    Fix Version/s: 3.1.0
         Assignee: Jungtaek Lim
       Resolution: Fixed

> FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-30915
>                 URL: https://issues.apache.org/jira/browse/SPARK-30915
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.1.0
>
>
> FileStreamSink.addBatch checks the latest batch ID before writing outputs, so
> that it can skip a batch that was already committed.
> Comparing the current batch ID with the latest batch ID is valid, but the
> getLatest() method is designed to return both the batch ID and the content,
> which means that the latest metadata log file is read and deserialized. This
> introduces heavy latency when the latest batch is a compacted batch.
> We could instead just find the metadata log file for the latest batch ID and
> do only the minimal check, without reading its content.
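
The idea in the description can be illustrated with a minimal sketch. It is not the Spark patch: it uses plain Python and the local filesystem, and the function names (parse_batch_id, latest_batch_id, should_skip_batch) are hypothetical. It assumes that each metadata log file is named after its batch ID and that compacted batches carry a ".compact" suffix. The point is only that the latest batch ID can be derived from file names, so no log file has to be opened or deserialized.

```python
import os
import re

# Assumption: log files are named "<batchId>" or "<batchId>.compact".
BATCH_FILE = re.compile(r"([0-9]+)(\.compact)?")


def parse_batch_id(file_name):
    """Return the batch ID encoded in a file name, or None for other files."""
    match = BATCH_FILE.fullmatch(file_name)
    return int(match.group(1)) if match else None


def latest_batch_id(log_dir):
    """Find the highest batch ID by listing file names only.

    No file is opened, so the cost does not depend on how large the
    latest (possibly compacted) log file is.
    """
    ids = [parse_batch_id(name) for name in os.listdir(log_dir)]
    ids = [batch_id for batch_id in ids if batch_id is not None]
    return max(ids) if ids else None


def should_skip_batch(log_dir, batch_id):
    """Return True if the batch was already committed and can be skipped."""
    latest = latest_batch_id(log_dir)
    return latest is not None and batch_id <= latest
```

With files named 0, 1, 2.compact and 3 in the log directory, should_skip_batch returns True for batch 3 and False for batch 4, and none of the files is read.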