[ 
https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40849:
--------------------------------------
    Description: 
Purging old entries in both the offset log and commit log will be done 
asynchronously.

 

For every micro-batch, older entries in both the offset log and the commit log 
are deleted so that the two logs do not grow without bound.  The current logic 
is here:

 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]
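
As a rough illustration of the per-batch purge described above, here is a 
minimal sketch in Python.  It is not the code at the linked line: MetadataLog, 
run_batches and min_batches_to_retain are hypothetical names, and the logs are 
plain in-memory dictionaries rather than files in a checkpoint directory.

```python
class MetadataLog:
    """Toy in-memory stand-in for an offset log or commit log, keyed by batch id."""

    def __init__(self):
        self.entries = {}

    def add(self, batch_id, metadata):
        self.entries[batch_id] = metadata

    def purge(self, threshold_batch_id):
        """Drop every entry whose batch id is below the threshold."""
        for batch_id in [b for b in self.entries if b < threshold_batch_id]:
            del self.entries[batch_id]


def run_batches(num_batches, min_batches_to_retain):
    """Run a toy micro-batch loop that purges both logs after every batch."""
    offset_log, commit_log = MetadataLog(), MetadataLog()
    for batch_id in range(num_batches):
        offset_log.add(batch_id, f"offsets-{batch_id}")
        # ... the micro-batch itself would run here ...
        commit_log.add(batch_id, f"commit-{batch_id}")
        # Synchronous purge: it runs inline, so every batch waits for it.
        threshold = batch_id - min_batches_to_retain + 1
        offset_log.purge(threshold)
        commit_log.purge(threshold)
    return offset_log, commit_log


offset_log, commit_log = run_batches(num_batches=10, min_batches_to_retain=3)
print(sorted(offset_log.entries), sorted(commit_log.entries))
```

Because the purge sits inside the loop, its cost is paid on every batch, which 
is the overhead this issue proposes to move off the critical path.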
 

 

The time spent performing these log purges is counted as part of the 
“walCommit” execution time in the StreamingProgressListener metrics.  Around 
two thirds of the “walCommit” execution time is spent on these purge 
operations, so making them asynchronous will also reduce latency.  In addition, 
the purges do not necessarily need to run every micro-batch.  When they are 
executed asynchronously, they do not block micro-batch execution, and a new 
purge does not need to start until the current one has finished.  The purges 
can essentially happen in the background.  They only need to be synchronized 
with the offset WAL commits and the completion commits so that the offset log 
and commit log are never modified concurrently.
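
The proposal above could be sketched roughly as follows.  This is only an 
illustration in Python under stated assumptions, not the planned Spark 
implementation: AsyncLogPurger and its methods are hypothetical names, the logs 
are plain dictionaries, and maybe_purge is assumed to be called only from the 
single thread that drives the micro-batches.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncLogPurger:
    """Hypothetical helper: purges logs in the background, one purge at a time."""

    def __init__(self, logs):
        self._logs = logs                        # dicts keyed by batch id
        self._log_lock = threading.Lock()        # serializes log writes and purges
        self._purge_running = threading.Event()  # set while a purge is in flight
        self._executor = ThreadPoolExecutor(max_workers=1)

    def write(self, log, batch_id, metadata):
        """Offset WAL commit or completion commit, excluded from running purges."""
        with self._log_lock:
            log[batch_id] = metadata

    def maybe_purge(self, threshold_batch_id):
        """Start a background purge unless one is already running.

        Returns the Future of the new purge, or None if the request was skipped.
        Assumes a single calling thread, so check-then-set is not racy here.
        """
        if self._purge_running.is_set():
            return None
        self._purge_running.set()
        return self._executor.submit(self._purge, threshold_batch_id)

    def _purge(self, threshold_batch_id):
        try:
            with self._log_lock:
                for log in self._logs:
                    for batch_id in [b for b in log if b < threshold_batch_id]:
                        del log[batch_id]
        finally:
            self._purge_running.clear()

    def close(self):
        self._executor.shutdown(wait=True)


offset_log, commit_log = {}, {}
purger = AsyncLogPurger([offset_log, commit_log])
for batch_id in range(10):
    purger.write(offset_log, batch_id, f"offsets-{batch_id}")
    purger.write(commit_log, batch_id, f"commit-{batch_id}")
    purger.maybe_purge(batch_id - 2)  # does not block; skipped if one is running
purger.close()                         # waits for any purge still in flight
```

In this sketch a skipped purge is simply dropped; the next successful purge 
uses a later threshold and removes the older entries as well, so nothing is 
lost by not purging on every micro-batch.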

  was:Purging old entries in both the offset log and commit log will be done 
asynchronously


> Async log purge
> ---------------
>
>                 Key: SPARK-40849
>                 URL: https://issues.apache.org/jira/browse/SPARK-40849
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Boyang Jerry Peng
>            Priority: Major
>


