Tathagata Das created SPARK-24156:
-------------------------------------

             Summary: Enable no-data micro batches for more eager streaming state clean up
                 Key: SPARK-24156
                 URL: https://issues.apache.org/jira/browse/SPARK-24156
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Tathagata Das
            Assignee: Tathagata Das


Currently, MicroBatchExecution in Structured Streaming runs a batch only when 
there is new data to process. This is sensible in most cases, as we don't want 
to use resources unnecessarily when there is nothing new to process. However, 
for some stateful streaming queries, this delays state cleanup as well as any 
output that depends on that cleanup. For example, consider a streaming 
aggregation query with watermark-based state cleanup. The watermark is updated 
after every batch with new data completes. The updated value is used in the 
next batch to clean up state and to output finalized aggregates in append mode. 
However, if no new data arrives, the next batch does not run until it does, and 
cleanup and output are delayed unnecessarily. This is true for all stateful 
streaming operators: aggregation, deduplication, joins, and mapGroupsWithState.
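
To make the delay concrete, here is a toy model in plain Python. It is only a 
sketch of the behavior described above, not Spark's implementation: the class 
ToyWindowedCount, the function run_query, the flag no_data_batches, the 
10-second window and the 5-second watermark delay are all assumptions made up 
for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyWindowedCount:
    """Toy append-mode windowed count with a watermark (hypothetical model)."""
    window: int = 10          # tumbling window length in seconds (assumed)
    delay: int = 5            # allowed lateness in seconds (assumed)
    watermark: int = 0        # value the NEXT batch will use
    watermark_used: int = 0   # value the LAST batch actually used
    max_event_time: int = 0
    state: Dict[int, int] = field(default_factory=dict)  # window start -> count

    def run_batch(self, event_times: List[int]) -> List[Tuple[int, int]]:
        self.watermark_used = self.watermark
        # Fold new rows into state, skipping rows for already-finalized windows.
        for t in event_times:
            start = t - t % self.window
            if start + self.window > self.watermark:
                self.state[start] = self.state.get(start, 0) + 1
            self.max_event_time = max(self.max_event_time, t)
        # Emit and evict every window that ends at or before the watermark.
        done = sorted(s for s in self.state if s + self.window <= self.watermark)
        emitted = [(s, self.state.pop(s)) for s in done]
        # The watermark advances only after the batch completes, so the new
        # value cannot take effect until some later batch runs.
        self.watermark = max(self.watermark, self.max_event_time - self.delay)
        return emitted

    def pending_cleanup(self) -> bool:
        # True when the watermark moved since the last batch used it.
        return self.watermark > self.watermark_used


def run_query(triggers: List[List[int]], no_data_batches: bool):
    """Run one batch per trigger; empty triggers run only if allowed and useful."""
    q = ToyWindowedCount()
    output: List[Tuple[int, int]] = []
    for events in triggers:
        if events or (no_data_batches and q.pending_cleanup()):
            output.extend(q.run_batch(events))
    return output, q.state


triggers = [[1, 4, 17], [], []]   # one batch of data, then silence
lazy_out, lazy_state = run_query(triggers, no_data_batches=False)
eager_out, eager_state = run_query(triggers, no_data_batches=True)
forced_out, forced_state = run_query([[1, 4, 17], [18]], no_data_batches=False)
```

In this model, lazy_out stays empty and window [0, 10) stays in state for as 
long as the source is silent, even though the watermark has already passed the 
end of that window. With no_data_batches=True, a single extra batch emits the 
finalized count and evicts the state. The forced run shows the workaround of 
adding one more row just to trigger another batch.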

This issue tracks the work to enable no-data batches in MicroBatchExecution. 
The major challenge is that all the tests of the relevant stateful operations 
add dummy data to force another batch in order to test state cleanup, so a lot 
of tests will have to change. My plan is therefore to enable no-data batches 
for the different stateful operators one at a time.



