Sorry, I misunderstood. As for ways to tell the system that it can make progress: the more the merrier. There's no single "best" mechanism; it depends on the business problem. A good engine should support several, including fully-ordered columns, punctuation, and slack, and let users choose on a per-stream (per-topic) or even per-query basis.
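To make the slack option concrete, here is a toy sketch (my own illustrative Python, not any engine's API): the emitted row-time bound trails the highest event time seen by a fixed slack, so rows that arrive up to `slack` time units out of order are still accepted.

```python
# Hypothetical sketch of a slack-based progress signal. The bound
# (punctuation) trails the maximum event time seen by a fixed slack,
# tolerating rows up to `slack` time units out of order. Names are mine.
class SlackBound:
    def __init__(self, slack):
        self.slack = slack
        self.max_time = None

    def observe(self, event_time):
        """Record an event and return the current row-time bound."""
        if self.max_time is None or event_time > self.max_time:
            self.max_time = event_time
        return self.max_time - self.slack

bound = SlackBound(slack=5)
print(bound.observe(100))  # 95: rows with time <= 95 may now be finalized
print(bound.observe(103))  # 98
print(bound.observe(101))  # 98: a late-ish row never moves the bound back
```

Note that, as discussed below, the validator does not need to know about any of this; it only needs to know that progress is possible.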
On Tue, Feb 23, 2016 at 9:03 AM, Milinda Pathirage <[email protected]> wrote:
> Hi Julian,
>
> I agree with you. Calcite should stay away from physical properties of streams as much as possible. I was just trying to clarify the confusion regarding punctuations and watermarks. My last question was not related to Calcite, but rather to Flink and other implementations. Sorry for the confusion.
>
> Milinda
>
> On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <[email protected]> wrote:
>
>> As the author of the streaming SQL specification, I don't care at all how the system deduces that it is able to make progress, just as the authors of the SQL standard don't care whether a vendor chooses to store records sorted and/or compressed.
>>
>> All the streaming SQL validator/optimizer needs to know is that, say, orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod is non-monotonic, so it can allow streaming aggregations on orderTime and orderId, and disallow them on paymentMethod.
>>
>> This allows streaming engines to add novel mechanisms without having to change the definition of streaming SQL.
>>
>> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage <[email protected]> wrote:
>> > Thank you, Julian, for the document.
>> >
>> > [1] is also a good read on punctuation. What I understood from reading [1] and the MillWheel paper is that a low watermark (or row-time bound) is a property maintained by operators, and operators derive the low watermark by processing punctuations.
>> >
>> > One other thing mentioned in MillWheel is the fact that Google's input streams contain punctuations to communicate stream progress. If punctuations are not present in the input stream, we will have to generate them during ingest based on a slack or some similar technique. What do you think about this?
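On the low-watermark point above, the MillWheel-style rule can be illustrated in a few lines (hypothetical code of mine, not from MillWheel, Flink, or any implementation): an operator's low watermark is the minimum of the latest punctuations seen across its input channels.

```python
# Toy illustration of the rule described above: an operator with several
# input channels may only advance its own low watermark to the minimum of
# the latest punctuation seen on each input. Channel names are invented.
def low_watermark(channel_bounds):
    """channel_bounds maps input channel -> latest punctuation seen (or None)."""
    if not channel_bounds or any(v is None for v in channel_bounds.values()):
        return None  # no progress guarantee until every input has punctuated
    return min(channel_bounds.values())

print(low_watermark({"orders": 120, "shipments": 95}))    # 95
print(low_watermark({"orders": 120, "shipments": None}))  # None
```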
>> >
>> > Thanks
>> > Milinda
>> >
>> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
>> >
>> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <[email protected]> wrote:
>> >
>> >> I’ve updated the Streaming reference guide as Fabian requested: http://calcite.apache.org/docs/stream.html
>> >>
>> >> Julian
>> >>
>> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <[email protected]> wrote:
>> >> >
>> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].
>> >> >
>> >> > The news item[2] gets you to both slides and video.
>> >> >
>> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4] are accurate, all queries are heavily based on sliding windows, and they use a syntax for those windows that is very different to standard SQL. I think we can deal with their use cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.
>> >> >
>> >> > At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting periodic totals even if there were no data in a given period. Can they re-state that use case here, so we can discuss?
>> >> >
>> >> > Julian
>> >> >
>> >> > [1] http://calcite.apache.org/docs/stream.html
>> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
>> >> > [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29
>> >> > [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
>> >> >
>> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <[email protected]> wrote:
>> >> >>
>> >> >> Hi Fabian,
>> >> >>
>> >> >> We did some work on stream joins [1]. I tested stream-to-relation joins with Samza, but not stream-to-stream joins. We never updated the streaming documentation, though; I'll send a pull request with some documentation on joins.
>> >> >>
>> >> >> Thanks
>> >> >> Milinda
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> >> >>
>> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I agree, the Streaming page is a very good starting point for this discussion. As suggested by Julian, I created CALCITE-1090 to update the page such that it reflects the current state of the discussion (adding HOP and TUMBLE functions, punctuations). I can also help with that, e.g., by contributing figures, examples, or text, reviewing, or any other way.
>> >> >>>
>> >> >>> From my point of view, the semantics of the window types and the other operators in the Streaming document are very good. What is missing are joins (windowed stream-stream joins, stream-table joins, stream-table joins with table updates), as already noted in the todo section.
>> >> >>>
>> >> >>> Regarding the handling of late-arriving events, I am not sure if this is a purely QoS issue, as the result of a query might depend on the chosen strategy. Also, users might want to pick different strategies for different operations, so late-arriver strategies are not necessarily end-to-end but can be operator-specific. However, I think these details should be discussed in a separate thread.
>> >> >>>
>> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink community.
>> >> >>> We are currently preparing our codebase and will start to work on support for structured queries on streams in the next weeks. Flink will support two query interfaces, a SQL interface and a LINQ-style Table API [1]. Both will be optimized and translated by Calcite. As a first step, we want to add simple stream transformations such as selection and projection to both interfaces. Next, we will add windowing support (starting with tumbling and hopping windows) to the Table API (as said before, our plans here are well aligned with Julian's suggestions). Once this is done, we would extend the SQL interface to support windows, which is hopefully as simple as using a parser that accepts window syntax.
>> >> >>>
>> >> >>> So from our point of view, fixing the semantics of windows and extending the optimizer accordingly is more urgent than agreeing on a syntax (although the Table API syntax could be inspired by Calcite's StreamSQL syntax [2]). I can also help implementing the missing features in Calcite.
>> >> >>>
>> >> >>> Having a reference implementation with tests would be awesome and definitely help.
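As a footnote on the window semantics discussed above, tumbling and hopping window assignment can be sketched in a few lines (illustrative Python of my own; the function names echo the proposed TUMBLE/HOP SQL operators, but this is not Calcite's or Flink's implementation):

```python
def tumble(ts, size):
    """Start of the single tumbling window of width `size` containing ts."""
    return ts - ts % size

def hop(ts, slide, size):
    """Starts of every hopping window of width `size`, advancing by `slide`,
    that contains ts. Tumbling is the special case slide == size."""
    first = ((ts - size) // slide + 1) * slide  # earliest start > ts - size
    last = (ts // slide) * slide                # latest start <= ts
    return list(range(first, last + 1, slide))

print(tumble(7, 3))  # 6: the row at time 7 falls in window [6, 9)
print(hop(7, 2, 6))  # [2, 4, 6]: three overlapping windows contain time 7
```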
>> >> >>>
>> >> >>> Best, Fabian
>> >> >>>
>> >> >>> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> >> >>> [2] https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> >> >>>
>> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>:
>> >> >>>
>> >> >>>> Fabian,
>> >> >>>>
>> >> >>>> Apologies for the late reply.
>> >> >>>>
>> >> >>>> I would rather that the specification for streaming SQL was not too prescriptive about how late events are handled. Approaches 1, 2 and 3 are all viable, and engines can differentiate themselves by the strength of their support for this. But for the SQL to be considered valid, I think the validator just needs to know that it can make progress.
>> >> >>>>
>> >> >>>> There is a large area of functionality I’d call “quality of service” (QoS). This includes latency, reliability, at-least-once, at-most-once or in-order guarantees, as well as the late-row handling this thread is concerned with. What the QoS metrics have in common is that they are end-to-end. To deliver a high QoS to the consumer, the producer needs to conform to a high QoS. The QoS is beyond the control of the SQL statement. (Although you can ask what a SQL statement is able to deliver, given the upstream QoS guarantees.) QoS is best managed by the whole system, and in my opinion this is the biggest reason to have a DSMS.
>> >> >>>>
>> >> >>>> For this reason, I would be inclined to put QoS constraints on the stream definition, not on the query.
For example, taking latency as the QoS metric of interest, you could flag the Orders stream as “at most 10 ms latency between the record’s timestamp and the wall-clock time of the server receiving the records, and any records arriving after that time are logged and discarded”, and that QoS constraint applies to both producers and consumers.
>> >> >>>>
>> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what is the expected latency of Q?” or tell the planner “produce an implementation of Q with a latency of no more than 15 ms, and if you cannot achieve that latency, fail”. You could even register Q in the system and tell the system to tighten up the latency of any upstream streams and the standing queries that populate them. But it’s not valid to say “execute Q with a latency of 15 ms”: the system may not be able to achieve it.
>> >> >>>>
>> >> >>>> In summary: I would allow latency, late-row handling and other QoS annotations in the query, but it’s not the most natural or powerful place to put them.
>> >> >>>>
>> >> >>>> Julian
>> >> >>>>
>> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote:
>> >> >>>>>
>> >> >>>>> Excellent! I missed the punctuations in the todo list.
>> >> >>>>>
>> >> >>>>> What kind of strategies do you have in mind to handle events that arrive too late? I see
>> >> >>>>> 1. dropping of late events
>> >> >>>>> 2. computing an updated window result for each late-arriving element (implies that the window state is stored for a certain period before it is discarded)
>> >> >>>>> 3. computing a delta to the previous window result for each late-arriving element (requires window state as well; not applicable to all aggregation types)
>> >> >>>>>
>> >> >>>>> It would be nice if strategies to handle late-arrivers could be defined in the query.
>> >> >>>>>
>> >> >>>>> I think the plans of the Flink community are quite well aligned with your ideas for SQL on Streams.
>> >> >>>>> Should we start by updating / extending the Stream document on the Calcite website to include the new window definitions (TUMBLE, HOP) and a discussion of punctuations/watermarks/time bounds?
>> >> >>>>>
>> >> >>>>> Fabian
>> >> >>>>>
>> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>:
>> >> >>>>>
>> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I cited just one example, calls them punctuation, and a couple of recent papers out of Mountain View doesn't change that.
>> >> >>>>>>
>> >> >>>>>> There are some fine distinctions between punctuation, heartbeats, watermarks and rowtime bounds, mostly in terms of how they are generated and propagated, that matter little when planning the query.
>> >> >>>>>>
>> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]> wrote:
>> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", which is the same thing. (Actually, I prefer to call it "rowtime bound" because it feels more like a dynamic constraint than a piece of data, but the literature[1] calls them punctuation.)
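The three late-arrival strategies Fabian lists above can be contrasted on a windowed SUM in a toy sketch (hypothetical code of mine; as he notes, strategy 3's delta works only for aggregates that can be updated incrementally):

```python
# Toy comparison of the three late-arrival strategies for a windowed SUM.
# Retained window state is assumed for strategies 2 and 3; names are mine.
def handle_late(strategy, window_sum, late_value):
    """Return what the operator emits for a late row, or None if dropped."""
    if strategy == "drop":          # 1. discard the late row
        return None
    if strategy == "update":        # 2. re-emit the corrected window result
        return window_sum + late_value
    if strategy == "delta":         # 3. emit only the correction
        return late_value           #    (works for SUM, not e.g. MEDIAN)
    raise ValueError(strategy)

print(handle_late("drop", 100, 7))    # None
print(handle_late("update", 100, 7))  # 107
print(handle_late("delta", 100, 7))   # 7
```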
>> >> >>>>>>>> >> >> >>>>>>> >> >> >>>>>>> Some of the literature calls them punctuation, other literature >> [1] >> >> >>>> calls >> >> >>>>>>> them watermarks. >> >> >>>>>>> >> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >> >> >>>>>> >> >> >>>> >> >> >>>> >> >> >>> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> Milinda Pathirage >> >> >> >> >> >> PhD Student | Research Assistant >> >> >> School of Informatics and Computing | Data to Insight Center >> >> >> Indiana University >> >> >> >> >> >> twitter: milindalakmal >> >> >> skype: milinda.pathirage >> >> >> blog: http://milinda.pathirage.org >> >> > >> >> >> >> >> > >> > >> > -- >> > Milinda Pathirage >> > >> > PhD Student | Research Assistant >> > School of Informatics and Computing | Data to Insight Center >> > Indiana University >> > >> > twitter: milindalakmal >> > skype: milinda.pathirage >> > blog: http://milinda.pathirage.org >> > > > > -- > Milinda Pathirage > > PhD Student | Research Assistant > School of Informatics and Computing | Data to Insight Center > Indiana University > > twitter: milindalakmal > skype: milinda.pathirage > blog: http://milinda.pathirage.org
