Dang I was hoping it was the second one. In our case the data is too large
to run the whole backfill for the aggregation in a single batch (the
shuffle is too big). We currently resort to manually batching (i.e. not
streaming) the backlog (anything older than the watermark) when we need to
reprocess, because we can't really know for sure our batches are processed
in the correct event time order when starting from scratch.

I'm not against deprecating Trigger.Once, just wanted to chime in that
someone was using it! I'm itching to upgrade and try out the new stuff.

Adam

On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Thanks for the input, Adam! Replying inline.
>
> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford <adam...@gmail.com> wrote:
>
>> We use Trigger.Once a lot, usually for backfilling data for new streams.
>> I feel like I could see a continuing use case for "ignore trigger limits
>> for this batch" (ignoring the whole issue with re-running the last failed
>> batch vs a new batch), but we haven't actually been able to upgrade yet and
>> try out Trigger.AvailableNow, so that could end up replacing all our use
>> cases.
>>
>> One question I did have is how it does (or is supposed to) handle
>> watermarking. Is the watermark determined for each batch independently like
>> a normal stream, or is it kept constant for all batches in a single
>> AvailableNow run? For example, we have a stateful job that we need to rerun
>> occasionally, and it takes ~6 batches to backfill all the data before
>> catching up to live data. With a Trigger.Once we know we won't accidentally
>> drop any data due to the watermark when backfilling, because it's a single
>> batch with no watermark yet. Would the same hold true if we backfill with
>> Trigger.AvailableNow instead?
>>
>
> The behavior is the former one. Each batch advances the watermark and it's
> immediately reflected on the next batch.
>
> The number of batches Trigger.AvailableNow will execute depends on the
> data source and the source option. For example, if you use Kafka data
> source and use Trigger.AvailableNow without specifying any source option on
> limiting the size, Trigger.AvailableNow will process all newly available
> data as a single microbatch. It may not be still a single microbatch - it
> would also handle the batch already logged in WAL first if any, as well as
> handle no-data batch after the run of all microbatches. But I guess these
> additional batches wouldn't hurt your case.
>
> If the data source doesn't allow processing all available data within a
> single microbatch (depending on the implementation of default read limit),
> you could probably either 1) set source options regarding to limit size as
> an unrealistic one to enforce a single batch or 2) set the delay of
> watermark as an unrealistic one. Both of the workarounds require you to use
> different source options/watermark configuration for backfill vs normal run
> - I agree it wouldn’t be a smooth one.
>
> This proposal does not aim to remove Trigger.Once in near future. As long
> as we deprecate Trigger.Once, we would get some reports for use cases
> Trigger.Once may work better (like your case) for the time period across
> several minor releases, and then we can really decide. (IMHO, handling
> backfill with Trigger.Once sounds to me as a workaround. Backfill may
> warrant its own design to deal with.)
>
>
>>
>> Adam
>>
>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Bump to get a chance to expose the proposal to wider audiences.
>>>
>>> Given that there are not many active contributors/maintainers in area
>>> Structured Streaming, I'd consider the discussion as "lazy consensus" to
>>> avoid being stuck. I'll give a final reminder early next week, and move
>>> forward if there are no outstanding objections.
>>>
>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I would like to hear voices about deprecating Trigger.Once, and
>>>> promoting Trigger.AvailableNow as a replacement [1] in Structured 
>>>> Streaming.
>>>> (It doesn't mean we remove Trigger.Once now or near future. It probably
>>>> requires another discussion at some time.)
>>>>
>>>> Rationalization:
>>>>
>>>> The expected behavior of Trigger.Once is like reading all available
>>>> data after the last trigger and processing them. This holds true when the
>>>> last run was gracefully terminated, but there are cases streaming queries
>>>> to not be terminated gracefully. There is a possibility the last run may
>>>> write the offset for the new batch before termination, then a new run of
>>>> Trigger.Once only processes the data which was built in the latest
>>>> unfinished batch and doesn't process new data.
>>>>
>>>> The behavior is not deterministic from the users' point of view, as end
>>>> users wouldn't know whether the last run wrote the offset or not, unless
>>>> they look into the query's checkpoint by themselves.
>>>>
>>>> While Trigger.AvailableNow came to solve the scalability issue on
>>>> Trigger.Once, it also ensures that it tries to process all available data
>>>> at the point of time it is triggered, which consistently works as expected
>>>> behavior of Trigger.Once.
>>>>
>>>> Another issue on Trigger.Once is that it does not trigger a no-data
>>>> batch immediately. When the watermark is calculated in batch N, it takes
>>>> effect in batch N + 1. If the query is scheduled to be run per day, you can
>>>> see the output from the new watermark in the query run the next day. Thanks
>>>> to the behavior of Trigger.AvailableNow, it handles no-data batch as well
>>>> before termination of the query.
>>>>
>>>> Please review and let us know if you have any feedback or concerns on
>>>> the proposal.
>>>>
>>>> Thanks!
>>>> Jungtaek Lim
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>>>
>>>
>>
>> --
>> Adam Binford
>>
>

-- 
Adam Binford

Reply via email to