[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

krishna (Jira) Fri, 28 Jan 2022 08:35:04 -0800


    [ https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483838#comment-17483838 ]


krishna commented on SPARK-24156:
---------------------------------

Hi [~kcsrms] [~tdas],

  I am having the same issue. Has it been resolved? Is there a specific 
version I need to use?

 
  I am struggling with an issue, and I am not sure whether my understanding 
is wrong or this is a bug in Spark.
 
 #  I read a stream from Event Hubs (extract).
 #  I pivot and aggregate that DataFrame (transformation). This is a 
watermarked aggregation.
 #  I write the aggregation to a Delta table in append mode with a trigger.

However, the message most recently published to Event Hubs is not written to 
Delta, even after it has fallen out of the watermark window.
 
 My understanding is that the data should be inserted into the Delta table 
once the event time plus the watermark delay has passed.
 
 

> Enable no-data micro batches for more eager streaming state clean up 
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24156
>                 URL: https://issues.apache.org/jira/browse/SPARK-24156
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we don't want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState.
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.



