Thank you, Luke and Robert. Sorry for hitting dev@; I criss-crossed and meant 
to hit user@. Since we're here, though, could you clarify your two points?

1) I am under the impression that the ~4,000-sliding-windows approach (30 days 
every 10 minutes) will re-evaluate my Combine aggregation every 10 minutes, 
whereas with the two-window approach my Combine aggregation would evolve 
incrementally, only merging new results into the existing accumulation.
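
For concreteness, here is a minimal sketch (Java SDK) of how I read the first 
approach; events is an assumed PCollection<KV<String, Long>>, and Sum stands 
in for my real Combine:

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // 30 days / 10 minutes = 4,320 overlapping windows cover any instant
    // (the ~4,000 sub-aggregations mentioned below).
    PCollection<KV<String, Long>> totals = events
        .apply(Window.<KV<String, Long>>into(
            SlidingWindows.of(Duration.standardDays(30))
                .every(Duration.standardMinutes(10))))
        // As I understand it, each window keeps its own intermediate
        // accumulator (not the raw elements) until the window closes.
        .apply(Sum.longsPerKey());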

If there is a cross-window optimization that would allow iterative combining 
_across windows_, then given the order-of-magnitude difference in scale at 
play, is it safe to treat such an 'internal optimization detail' as part of 
the platform contract (Dataflow's, say)? Otherwise it would be hard for a 
production system that will live into the future to lean on it.

2) When you say that "regardless of how the problem is structured" there are 
4,000 stored 'sub-aggregations', even in the two-window approach--why is that 
so? Isn't the volume of panes produced by a trigger a function of which keys 
have actually received new values *in the window*? (I sketch the triggering I 
have in mind below.)
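
For point 2, the two-window variant I have in mind looks roughly like this 
(again a sketch; the processing-time trigger is my assumption, and imports 
are as in the snippet above plus the trigger classes):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;

    // Two 60-day windows sliding every 30 days, so at any moment one of
    // them contains at least 30 days of data.
    PCollection<KV<String, Long>> running = events
        .apply(Window.<KV<String, Long>>into(
                SlidingWindows.of(Duration.standardDays(60))
                    .every(Duration.standardDays(30)))
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10))))
            .withAllowedLateness(Duration.ZERO)
            // Accumulating panes: each firing merges new values into the
            // running aggregate rather than recomputing from scratch.
            .accumulatingFiredPanes())
        .apply(Sum.longsPerKey());

My expectation is that a pane here fires per key per window only when that 
key has seen new elements, which is why the 4,000 figure surprised me.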

Thanks for your help in understanding these details. I want to make good use 
of Beam and hope to contribute back at some point (docs, writing, etc.) once 
I've come to terms with all of these pieces.

On 2019/10/29 20:39:18, Robert Bradshaw <rober...@google.com> wrote: 
> No matter how the problem is structured, computing 30 day aggregations
> for every 10 minute window requires storing at least 30day/10min =
> ~4000 sub-aggregations. In Beam, the elements themselves are not
> stored in every window, only the intermediate aggregates.
> 
> I second Luke's suggestion to try it out and see if this is indeed a
> prohibitive bottleneck.
> 
> On Tue, Oct 29, 2019 at 1:29 PM Luke Cwik <lc...@google.com> wrote:
> >
> > You should first try the obvious answer of using a sliding window of 30 
> > days every 10 minutes before you try the 60 days every 30 days.
> > Beam has some optimizations which will assign a value to multiple windows 
> > and only process that value once even if it's in many windows. If that 
> > doesn't perform well, then come back to dev@ and look to optimize.
> >
> > On Tue, Oct 29, 2019 at 1:22 PM Aaron Dixon <atdi...@gmail.com> wrote:
> >>
> >> Hi, I am new to Beam.
> >>
> >> I would like to accumulate data over a 30-day period and perform a running 
> >> aggregation over this data, say every 10 minutes.
> >>
> >> I could use a sliding window of 30 days every 10 minutes (triggering at 
> >> end of window) but this seems grossly inefficient (both in terms of # of 
> >> windows at play and # of events duplicated across these windows).
> >>
> >> A more efficient strategy seems to be to use a sliding window of 60 days 
> >> every 30 days -- triggering every 10 minutes -- so that I'm guaranteed to 
> >> have 30 days worth of data aggregated/combined in at least one of the 2 
> >> at-play sliding windows.
> >>
> >> The last piece of this puzzle, however, would be to do a final global 
> >> aggregation over only the keys from the latest trigger of the earlier 
> >> sliding window.
> >>
> >> But Beam does not seem to offer a way to orchestrate this, even though 
> >> it seems like it would be a pretty common or fundamental ask.
> >>
> >> One thought I had was to re-window in a way that would isolate keys 
> >> triggered at the same time in the same window, but I don't see any 
> >> contracts from Beam that would allow an approach like that.
> >>
> >> What am I missing?
> >>
> >>
> 