Hi dev,

I would like to hear opinions on deprecating Trigger.Once and promoting
Trigger.AvailableNow as its replacement [1] in Structured Streaming.
(This does not mean we would remove Trigger.Once now or in the near future.
Removal would probably require a separate discussion at a later time.)

Rationale:

The expected behavior of Trigger.Once is to read all data that has become
available since the last trigger and process it. This holds when the last
run was terminated gracefully, but streaming queries are not always
terminated gracefully. The last run may have written the offset for a new
batch before it terminated. In that case, a new run of Trigger.Once only
processes the data already planned for that latest unfinished batch and
does not process any newer data.

The behavior is not deterministic from the users' point of view, as end
users cannot know whether the last run wrote the offset or not unless they
inspect the query's checkpoint themselves.
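
To make the scenario above concrete, here is a toy model in plain Python.
It is only a sketch of the behavior described in this mail, not Spark's
actual checkpoint format or code: the Checkpoint class, the integer
offsets, and both run_* functions are hypothetical names made up for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Toy stand-in for a query checkpoint (not Spark's real format)."""
    offsets: list = field(default_factory=list)  # end offset planned per batch
    commits: int = 0                             # batches fully processed

    def processed_up_to(self):
        return self.offsets[self.commits - 1] if self.commits else 0


def run_single_batch(cp, available, crash_after_offset_write=False):
    """Trigger.Once-like run: execute at most one micro-batch."""
    unfinished = len(cp.offsets) > cp.commits
    if not unfinished:
        if available <= cp.processed_up_to():
            return cp.processed_up_to()      # no new data, no batch
        cp.offsets.append(available)         # plan a new batch
        if crash_after_offset_write:
            return cp.processed_up_to()      # died before processing it
    # Either a freshly planned batch or a recovered one: process and commit.
    cp.commits += 1
    return cp.processed_up_to()


def run_available_now(cp, available):
    """AvailableNow-like run: keep going until the data that was
    available at start has been processed."""
    while True:
        done = run_single_batch(cp, available)
        if done >= available:
            return done


def crashed_checkpoint():
    """Batch 0 finished at offset 10; batch 1 was planned up to offset 20,
    but the run died before processing it."""
    cp = Checkpoint()
    run_single_batch(cp, available=10)
    run_single_batch(cp, available=20, crash_after_offset_write=True)
    return cp


# By the time of the next run, the source has data up to offset 35.
once_result = run_single_batch(crashed_checkpoint(), available=35)
available_now_result = run_available_now(crashed_checkpoint(), available=35)
print(once_result, available_now_result)  # 20 35
```

In this model the single-batch run stops at offset 20, the end of the
batch planned by the crashed run, while the AvailableNow-like run goes on
to offset 35. Which of the two outcomes a user gets from the single-batch
run depends only on whether the crashed run had already written the
offset, which is not visible without looking at the checkpoint.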

While Trigger.AvailableNow was introduced to solve the scalability issue
of Trigger.Once, it also tries to process all data available at the time
it is triggered, so it consistently delivers the behavior users expect
from Trigger.Once.
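
For reference, switching a PySpark query over is a one-argument change on
the writer. The sketch below assumes a Spark version whose
DataStreamWriter.trigger accepts availableNow (3.3.0 or later in PySpark).
The function name, the parquet sink, and the path parameters are
placeholders for illustration.

```python
def start_available_now_query(stream_df, output_path, checkpoint_path):
    """Start a query that processes the data available at start, then stops.

    stream_df is expected to be a streaming DataFrame. The sink format and
    the two paths are illustrative assumptions.
    """
    return (
        stream_df.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        # Previously: .trigger(once=True)
        .trigger(availableNow=True)
        .start()
    )
```

A scheduled job would typically call awaitTermination() on the returned
query so that the process exits once the run has finished.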

Another issue with Trigger.Once is that it does not trigger a no-data
batch immediately. When the watermark is calculated in batch N, it takes
effect in batch N + 1. If the query is scheduled to run once per day, the
output produced by the new watermark only shows up in the next day's run.
Trigger.AvailableNow, by contrast, also runs the no-data batch before the
query terminates.

Please review and let us know if you have any feedback or concerns about
the proposal.

Thanks!
Jungtaek Lim

1. https://issues.apache.org/jira/browse/SPARK-36533
