Hello!

In a streaming app, you have two choices: wait forever and never produce
any output, or use some method to decide that an aggregation is "done".

In Beam, the way you decide that an aggregation is "done" is the watermark.
When the watermark indicates that no more data is expected for an
aggregation, the aggregation is done. For example, GROUP BY <minute> is
"done" when no more data is expected to arrive for that minute. At this
point, your result is produced. Data for that minute may still arrive
afterward, and by default it is ignored (dropped). The watermark is
determined by the IO connector, using the best heuristic available to it.
You can configure "allowed lateness" for an aggregation so that
out-of-order data arriving within that bound is still included.
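
To make the dropping rule concrete, here is a toy simulation in plain Python. It is not Beam code and does not use any Beam API: the `aggregate` function, the `(event_time, watermark)` input format, and the exact comparison used to decide that a window has expired are all assumptions made for illustration. It only sketches the idea that an element is dropped once the watermark has passed the end of its window plus the allowed lateness.

```python
from collections import defaultdict

WINDOW = 60  # seconds: fixed one-minute windows, like GROUP BY <minute>


def aggregate(events, allowed_lateness=0):
    """Count events per one-minute window and report dropped events.

    `events` is a list of (event_time, watermark) pairs in arrival order,
    where `watermark` is the (hypothetical) watermark value at the moment
    the element arrives. Returns (counts_by_window_start, dropped_times).
    """
    counts = defaultdict(int)
    dropped = []
    for event_time, watermark in events:
        window_start = event_time - event_time % WINDOW
        window_end = window_start + WINDOW
        # Simplified rule (an assumption of this sketch): the window has
        # expired once the watermark reaches its end plus allowed lateness.
        if watermark >= window_end + allowed_lateness:
            dropped.append(event_time)
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# The element with event time 50 arrives when the watermark is already 65,
# i.e. after the first minute [0, 60) has been declared "done".
events = [(5, 0), (30, 20), (70, 65), (50, 65), (90, 80)]

strict = aggregate(events)                        # no allowed lateness
lenient = aggregate(events, allowed_lateness=30)  # tolerate 30s of lateness

print(strict)
print(lenient)
```

With no allowed lateness the late element is dropped; with 30 seconds of allowed lateness it is still counted in its minute. In a real pipeline the trade-off is that a larger allowed lateness keeps window state around longer.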

Kenn

On Thu, Apr 22, 2021 at 1:26 PM Tao Li <t...@zillow.com> wrote:

> Hi Beam community,
>
>
>
> I am wondering if there is a risk of losing late data from a Beam stream
> app due to watermarking?
>
>
>
> I just went through this design doc and noticed the “droppable” definition
> there:
> https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#
>
>
>
> Can you please confirm if it’s possible for us to lose some data in a
> stream app in practice? If that’s possible, what would be the best practice
> to avoid data loss? Thanks!
>
>
>
