On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Tyler,
>
> thank you very much for this excellent write-up and the super nice
> visualizations!
> You are discussing a lot of the things that we have been thinking about as
> well from a different perspective.
> IMO, yours and the Flink model are pretty much aligned although we use a
> different terminology (which is not yet completely established). So there
> should be room for unification ;-)
>

Good to hear, thanks for giving it a look. :-)


> Allow me a few words on the current state in Flink. In the upcoming 1.3.0
> release, we will have support for group windows (TUMBLE, HOP, SESSION), OVER
> RANGE/ROW windows (without FOLLOWING), and non-windowed GROUP BY
> aggregations. The group windows are triggered by the watermark, while the
> OVER windows and non-windowed aggregations emit for each input record
> (AtCount(1)). The window aggregations do not support early or late firing
> (late records are dropped), so no updates here. However, the non-windowed
> aggregates produce updates (in acc and acc/retract mode). Based on this we
> will work on better control for late updates and early firing, as well as
> joins, in the next release.
>

Impressive. At this rate there's a good chance we'll just be playing catch-up
and thanking you for building everything. ;-) Do you have ideas for what
you want your early/late updates control to look like? That's one of the
areas I'd like to see better defined for us. And how deep are you going
with joins?


> Reading the document, I did not find any major difference in our concepts.
> In fact, we are aiming to support the cases you are describing as well.
> I have a question though. Would you classify an OVER aggregation as a
> stream -> stream or stream -> table operation? It collects records to
> aggregate them, but emits one record for each input row. Depending on the
> window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
> emit the result record when the input record is received.
>

I would call it a composite stream → stream operation (since SQL, like the
standard Beam/Flink APIs, is a higher-level set of constructs than raw
stream/table operations), consisting of a stream → table windowed grouping
followed by a table → stream triggering on every element, basically as you
described in the previous paragraph.
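[Editor's note: a minimal sketch of that composition, hypothetical Python, not the Beam or Flink API. It shows an OVER-style running aggregation as a stream → table grouping whose state is re-triggered to the output stream on every element, i.e., AtCount(1).]

```python
# Hypothetical sketch: OVER-style aggregation as stream -> table grouping
# plus a table -> stream trigger that fires on every element (AtCount(1)).

def over_sum(records):
    """records: (key, value) pairs in arrival order.
    Returns one (key, running_sum) output per input record."""
    table = {}  # the grouped state: key -> running sum
    out = []    # the triggered output stream
    for key, value in records:
        table[key] = table.get(key, 0) + value  # stream -> table
        out.append((key, table[key]))           # table -> stream, per element
    return out

print(over_sum([('a', 1), ('a', 2), ('b', 5), ('a', 3)]))
# -> [('a', 1), ('a', 3), ('b', 5), ('a', 6)]
```

One output record per input record, exactly the behavior described for the no-FOLLOWING case above.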

-Tyler


>
> I'm looking forward to the second part.
>
> Cheers, Fabian
>
>
>
> 2017-05-09 0:34 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
>
> > Any thoughts here, Fabian? I'm planning to start sending out some more
> > emails towards the end of the week.
> >
> > -Tyler
> >
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <taki...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff
> > > up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fhue...@gmail.com>
> > wrote:
> > >
> > >> Hi Tyler,
> > >>
> > >> thanks for pushing this effort and including the Flink list.
> > >> I haven't managed to read the doc yet, but just wanted to thank you for
> > >> the write-up and let you know that I'm very interested in this
> > >> discussion.
> > >>
> > >> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> > >> getting as many contributions as possible merged before the release is
> > >> forked off. Once that has happened, I'll have more time to read and
> > >> comment.
> > >>
> > >> Thanks,
> > >> Fabian
> > >>
> > >>
> > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> > >>
> > >> > Good point, when you start talking about anything less than a full
> > >> > join, triggers get involved to describe how one actually achieves the
> > >> > desired semantics, and they may end up being tied to just one of the
> > >> > inputs (e.g., you may only care about the watermark for one side of
> > >> > the join). Am expecting us to address these sorts of details more
> > >> > precisely in doc #2.
> > >> >
> > >> > -Tyler
> > >> >
> > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <k...@google.com.invalid>
> > >> > wrote:
> > >> >
> > >> > > There's something to be said about having different triggering
> > >> > > depending on which side of a join data comes from, perhaps?
> > >> > >
> > >> > > (delightful doc, as usual)
> > >> > >
> > >> > > Kenn
> > >> > >
> > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <taki...@google.com.invalid>
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > >> > > > basically flatten + GBK. Flatten is a non-grouping operation that
> > >> > > > merges the input streams into a single output stream. GBK then
> > >> > > > groups the data within that single union stream as you might
> > >> > > > otherwise expect, yielding a single table. So I think it doesn't
> > >> > > > really impact things much. Grouping, aggregation, window merging,
> > >> > > > etc. all just act upon the merged stream and generate what is
> > >> > > > effectively a merged table.
> > >> > > >
> > >> > > > -Tyler
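[Editor's note: a minimal sketch of the CoGBK-as-flatten-plus-GBK idea described above. This is hypothetical Python, not the Beam API; the tagging scheme is invented for illustration.]

```python
# Hypothetical sketch: CoGBK modeled as flatten (merge the input streams,
# tagging each element with its origin) followed by a single GBK over the
# merged union stream, yielding one merged table.

def co_gbk(**tagged_streams):
    """tagged_streams: name -> list of (key, value) pairs.
    Returns key -> {name: [values]}, the single grouped table."""
    # Flatten: merge all input streams into one, keeping an origin tag.
    merged = [(key, (tag, value))
              for tag, stream in tagged_streams.items()
              for key, value in stream]
    # GBK: group the union stream by key, then fan the values back out
    # by their origin tag.
    result = {}
    for key, (tag, value) in merged:
        result.setdefault(key, {t: [] for t in tagged_streams})[tag].append(value)
    return result

grouped = co_gbk(clicks=[('u1', 'ad3')], views=[('u1', 'p7'), ('u2', 'p9')])
# grouped['u1'] holds values from both streams; grouping, aggregation,
# and window merging would all act on this single merged table.
```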
> > >> > > >
> > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > The doc is a good read.
> > >> > > > >
> > >> > > > > I think you do a great job of explaining table -> stream,
> > >> > > > > stream -> stream, and stream -> table when there is only one
> > >> > > > > stream. But when there are multiple streams reading/writing to
> > >> > > > > a table, how does that impact what occurs? For example, with
> > >> > > > > CoGBK you have multiple streams writing to a table, how does
> > >> > > > > that impact window merging?
> > >> > > > >
> > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <taki...@google.com.invalid>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > >> > > > > >
> > >> > > > > > Apologies for the big cross-post, but I thought this might be
> > >> > > > > > something all three communities would find relevant.
> > >> > > > > >
> > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > >> > > > > > Calcite, thanks to Mingmin Xu. As you can imagine, we need to
> > >> > > > > > come to some conclusion about how to elegantly support the
> > >> > > > > > full suite of streaming functionality in the Beam model via
> > >> > > > > > Calcite SQL. You folks in the Flink community have been
> > >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> > >> > > > > > others, thank you! :-), but from my understanding we still
> > >> > > > > > don't have a full spec for how to support robust streaming in
> > >> > > > > > SQL (including, but not limited to, a triggers analogue such
> > >> > > > > > as EMIT).
> > >> > > > > >
> > >> > > > > > I've been spending a lot of time thinking about this and have
> > >> > > > > > some opinions about how I think it should look that I've
> > >> > > > > > already written down, so I volunteered to try to drive forward
> > >> > > > > > agreement on a general streaming SQL spec between our three
> > >> > > > > > communities (well, technically I volunteered to do that w/
> > >> > > > > > Beam and Calcite, but I figured you Flink folks might want to
> > >> > > > > > join in since you're going that direction already anyway and
> > >> > > > > > will have useful insights :-).
> > >> > > > > >
> > >> > > > > > My plan was to do this by sharing two docs:
> > >> > > > > >
> > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > >> > > > > >    context, and really only mentions SQL in passing. But it
> > >> > > > > >    describes the relationship between the Beam Model and the
> > >> > > > > >    "streams & tables" way of thinking, which turns out to be
> > >> > > > > >    useful in understanding what robust streaming in SQL might
> > >> > > > > >    look like. Many of you probably already know some or all of
> > >> > > > > >    what's in here, but I felt it was necessary to have it all
> > >> > > > > >    written down in order to justify some of the proposals I
> > >> > > > > >    wanted to make in the second doc.
> > >> > > > > >
> > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc
> > >> > > > > >    is that it would become a general specification for what
> > >> > > > > >    robust streaming SQL in Calcite should look like. It would
> > >> > > > > >    start out as a basic proposal of what things *could* look
> > >> > > > > >    like (combining both what things look like now as well as a
> > >> > > > > >    set of proposed changes for the future), and we could all
> > >> > > > > >    iterate on it together until we get to something we're
> > >> > > > > >    happy with.
> > >> > > > > >
> > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > >> > > > > > monster, so I figured I'd share it and let folks hack at it
> > >> > > > > > with comments if they have any, while I try to get the second
> > >> > > > > > doc ready in the meantime. As part of getting doc #2 ready,
> > >> > > > > > I'll be starting a separate thread to try to gather input on
> > >> > > > > > what things are already in flight for streaming SQL across the
> > >> > > > > > various communities, to make sure the proposal captures
> > >> > > > > > everything that's going on as accurately as it can.
> > >> > > > > >
> > >> > > > > > If you have any questions or comments, I'm interested to hear
> > >> > > > > > them. Otherwise, here's doc #1, "The Beam Model : Streams &
> > >> > > > > > Tables":
> > >> > > > > >
> > >> > > > > >   http://s.apache.org/beam-streams-tables
> > >> > > > > >
> > >> > > > > > -Tyler
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
