Hi dev,

I would like to hear opinions on deprecating Trigger.Once and promoting Trigger.AvailableNow as its replacement [1] in Structured Streaming. (This does not mean we would remove Trigger.Once now or in the near future. That would probably require a separate discussion at a later time.)

Rationale:

Trigger.Once is expected to read all data that has become available since the last trigger and process it. This holds when the last run terminated gracefully, but there are cases where a streaming query is not terminated gracefully. The last run may have written the offset for a new batch before terminating; in that case, a new run of Trigger.Once only processes the data included in that unfinished batch and does not process newer data. From the users' point of view the behavior is not deterministic, since end users cannot know whether the last run wrote the offset unless they inspect the query's checkpoint themselves.

Trigger.AvailableNow was introduced to solve the scalability issue of Trigger.Once, but it also ensures that the query tries to process all data available at the time it is triggered, which consistently matches the behavior expected of Trigger.Once.

Another issue with Trigger.Once is that it does not trigger a no-data batch immediately. A watermark calculated in batch N takes effect in batch N + 1. If the query is scheduled to run once per day, the output driven by the new watermark only shows up in the next day's run. Trigger.AvailableNow, by contrast, also handles the no-data batch before the query terminates.

Please review and let us know if you have any feedback or concerns about the proposal.

Thanks!
Jungtaek Lim

[1] https://issues.apache.org/jira/browse/SPARK-36533
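
P.S. In case it helps the discussion, below is a toy Python model of the ungraceful-termination scenario described above. It is only a sketch: `Checkpoint`, `run_once`, and `run_available_now` are hypothetical names invented for illustration, the offsets are made-up numbers, and none of this is Spark's actual implementation or checkpoint layout. It assumes the previous run planned a batch up to offset 100 and died before committing it, and that 250 records are available when the next run starts.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a query checkpoint (not Spark's real layout)."""
    planned: list = field(default_factory=list)  # end offset of each planned batch
    committed: int = 0                           # number of batches fully processed

    def processed_up_to(self) -> int:
        # End offset of the last committed batch, or 0 if nothing was committed.
        return self.planned[self.committed - 1] if self.committed else 0


def run_batch(cp: Checkpoint, end_offset: int) -> None:
    """Plan a batch (unless one is already planned but uncommitted), then commit it."""
    if len(cp.planned) == cp.committed:
        cp.planned.append(end_offset)
    cp.committed += 1


def run_once(cp: Checkpoint, available: int) -> int:
    """Toy model of Trigger.Once: at most one batch per run."""
    if len(cp.planned) > cp.committed:
        # The previous run wrote the offset but never committed the batch:
        # replay that batch with its original end offset, then stop.
        run_batch(cp, cp.planned[-1])
    elif available > cp.processed_up_to():
        run_batch(cp, available)
    return cp.processed_up_to()


def run_available_now(cp: Checkpoint, available: int) -> int:
    """Toy model of Trigger.AvailableNow: continue until `available` is reached."""
    if len(cp.planned) > cp.committed:
        run_batch(cp, cp.planned[-1])  # finish the leftover batch first
    if available > cp.processed_up_to():
        run_batch(cp, available)       # then catch up to the data seen at start
    return cp.processed_up_to()


def crashed_checkpoint() -> Checkpoint:
    # Batch 0 was planned up to offset 100, but the run died before the commit.
    return Checkpoint(planned=[100], committed=0)


once_result = run_once(crashed_checkpoint(), available=250)
available_now_result = run_available_now(crashed_checkpoint(), available=250)
print("Trigger.Once model processed up to offset", once_result)
print("Trigger.AvailableNow model processed up to offset", available_now_result)
```

In this model the Trigger.Once run stops at offset 100 and leaves the remaining data for a later run, while the Trigger.AvailableNow run reaches offset 250. Starting from a clean checkpoint, both models reach 250, which is why the difference only shows up after an ungraceful termination.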