[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-40849:
------------------------------------

    Assignee: Apache Spark

> Async log purge
> ---------------
>
>                 Key: SPARK-40849
>                 URL: https://issues.apache.org/jira/browse/SPARK-40849
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Boyang Jerry Peng
>            Assignee: Apache Spark
>            Priority: Major
>
> Purging old entries in both the offset log and commit log will be done
> asynchronously.
>
> For every micro-batch, older entries in both the offset log and the commit
> log are deleted so that the two logs do not grow without bound. See the
> logic here:
>
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]
>
> The time spent performing these log purges is grouped with the “walCommit”
> execution time in the StreamingProgressListener metrics. Around two thirds
> of the “walCommit” execution time is spent performing these purge
> operations, so making them asynchronous will also reduce latency. Also, we
> do not necessarily need to perform the purges every micro-batch. When the
> purges are executed asynchronously, they do not block micro-batch
> execution, and we do not need to start another purge until the current one
> has finished. The purges can happen essentially in the background. We will
> just have to synchronize the purges with the offset WAL commits and
> completion commits so that we do not have concurrent modifications of the
> offset log and commit log.
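
The scheme described in the issue (background purges, at most one in flight, and no concurrent modification of the logs) can be illustrated with a minimal Python sketch. This is not Spark's implementation: `MetadataLog`, `AsyncLogPurger`, `maybe_purge`, and `min_batches_to_retain` are hypothetical names invented for illustration, the logs are plain in-memory dictionaries, and the rule that a purge removes every batch id strictly below a threshold is an assumption of the sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MetadataLog:
    """Toy in-memory stand-in for an offset log or commit log (hypothetical)."""

    def __init__(self):
        self._entries = {}
        # One lock per log so writes and purges never modify it concurrently.
        self._lock = threading.Lock()

    def add(self, batch_id, metadata):
        with self._lock:
            self._entries[batch_id] = metadata

    def purge(self, threshold_batch_id):
        # Delete every entry whose batch id is strictly below the threshold.
        with self._lock:
            stale = [b for b in self._entries if b < threshold_batch_id]
            for batch_id in stale:
                del self._entries[batch_id]

    def batch_ids(self):
        with self._lock:
            return sorted(self._entries)


class AsyncLogPurger:
    """Runs purges on a background thread, at most one at a time."""

    def __init__(self, logs, min_batches_to_retain):
        self._logs = logs
        self._min_batches_to_retain = min_batches_to_retain
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._in_flight = threading.Lock()  # held while a purge is running

    def maybe_purge(self, current_batch_id):
        """Schedule a purge; return its Future, or None if nothing was scheduled."""
        threshold = current_batch_id - self._min_batches_to_retain
        if threshold <= 0:
            return None  # not enough batches yet, nothing to delete
        # Do not start another purge until the current one has finished.
        if not self._in_flight.acquire(blocking=False):
            return None
        try:
            return self._executor.submit(self._run, threshold)
        except BaseException:
            self._in_flight.release()
            raise

    def _run(self, threshold):
        try:
            for log in self._logs:
                log.purge(threshold)
        finally:
            self._in_flight.release()

    def close(self):
        # Wait for any purge that is still running, then stop the worker.
        self._executor.shutdown(wait=True)


if __name__ == "__main__":
    offset_log, commit_log = MetadataLog(), MetadataLog()
    purger = AsyncLogPurger([offset_log, commit_log], min_batches_to_retain=2)
    for batch_id in range(5):
        offset_log.add(batch_id, "offsets")   # stands in for the offset WAL commit
        commit_log.add(batch_id, "commit")    # stands in for the completion commit
        purger.maybe_purge(batch_id)          # returns immediately
    purger.close()
    print(offset_log.batch_ids(), commit_log.batch_ids())
```

Because `maybe_purge` skips a batch while an earlier purge is still running, the number of entries left after the loop can vary from run to run; a later purge removes whatever an earlier one skipped. The per-log lock is the simplest way to keep purges from interleaving with commits here, and a real implementation might use a different mechanism.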