[ 
https://issues.apache.org/jira/browse/SPARK-39805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567866#comment-17567866
 ] 

Apache Spark commented on SPARK-39805:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37213

> Deprecate Trigger.Once and Promote Trigger.AvailableNow
> -------------------------------------------------------
>
>                 Key: SPARK-39805
>                 URL: https://issues.apache.org/jira/browse/SPARK-39805
>             Project: Spark
>          Issue Type: Task
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> Quoting the discussion in spark dev@: 
> [link|https://lists.apache.org/thread/2xnxlxhw245cmspd8nd17cq5doj2c7hc]
> Rationale:
> Trigger.Once is expected to read all data that has become available since
> the last trigger and process it. This holds when the last run terminated
> gracefully, but streaming queries are not always terminated gracefully.
> The last run may have written the offset for a new batch before it
> terminated, in which case a new run of Trigger.Once processes only the
> data planned in that unfinished batch and does not process new data.
> The behavior is not deterministic from the users' point of view, as end users 
> wouldn't know whether the last run wrote the offset or not, unless they look 
> into the query's checkpoint by themselves.
> While Trigger.AvailableNow was introduced to solve the scalability issue of
> Trigger.Once, it also tries to process all data available at the point in
> time it is triggered, which consistently matches the expected behavior of
> Trigger.Once.
> Another issue with Trigger.Once is that it does not trigger a no-data batch
> immediately. A watermark calculated in batch N takes effect in batch N + 1.
> If the query is scheduled to run once per day, the output produced by the
> new watermark does not appear until the next day's run.
> Trigger.AvailableNow, by contrast, also runs the no-data batch before the
> query terminates.
> There was no strong feedback in the discussion thread, but given that only
> a very small number of contributors (including committers/PMC members) are
> active in the SS area, we have to go with lazy consensus.
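
The unfinished-batch case described above can be illustrated with a toy
model. This is only a sketch, not Spark code: the Checkpoint class, its
fields, and both run_* functions are hypothetical names invented for the
illustration, offsets are plain integers, and Trigger.AvailableNow is
simplified to a single catch-up batch (the real trigger may split the work
into several batches, and the no-data batch is not modeled).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Checkpoint:
    """Toy stand-in for a query checkpoint (not Spark's real layout)."""
    committed: int = 0                 # offset covered by finished batches
    planned_end: Optional[int] = None  # end offset logged for an unfinished batch


def run_once(cp: Checkpoint, available: int) -> List[Tuple[int, int]]:
    """Model of Trigger.Once: execute exactly one batch, then stop."""
    # Re-run the logged batch if one exists; otherwise plan a new one.
    end = cp.planned_end if cp.planned_end is not None else available
    batch = (cp.committed, end)
    cp.committed, cp.planned_end = end, None
    return [batch]


def run_available_now(cp: Checkpoint, available: int) -> List[Tuple[int, int]]:
    """Model of Trigger.AvailableNow: run until `available` is reached."""
    batches = []
    if cp.planned_end is not None:     # finish the unfinished batch first
        batches.append((cp.committed, cp.planned_end))
        cp.committed, cp.planned_end = cp.planned_end, None
    if cp.committed < available:       # then catch up to the data seen at start
        batches.append((cp.committed, available))
        cp.committed = available
    return batches


def crashed_checkpoint() -> Checkpoint:
    """Previous run logged a batch ending at offset 100, then died."""
    return Checkpoint(committed=50, planned_end=100)


# 150 offsets are available when the new run starts.
once = run_once(crashed_checkpoint(), available=150)
available_now = run_available_now(crashed_checkpoint(), available=150)
print(once)           # [(50, 100)]
print(available_now)  # [(50, 100), (100, 150)]
```

In this model the Trigger.Once run stops at offset 100 and leaves offsets
100 to 150 for the next scheduled run, while the Trigger.AvailableNow run
reaches 150. In an actual query the switch is a one-line change:
Trigger.AvailableNow() instead of Trigger.Once() in Scala, or
.trigger(availableNow=True) instead of .trigger(once=True) in PySpark.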



