I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].
The news item[2] gets you to both slides and video. In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4] are accurate, all queries are heavily based on sliding windows, and they use a syntax for those windows that is very different to standard SQL. I think we can deal with their use cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance. At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting periodic totals even if there were no data in a given period. Can they re-state that use case here, so we can discuss? Julian [1] http://calcite.apache.org/docs/stream.html [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/ [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29 [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf > On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <[email protected]> wrote: > > Hi Fabian, > > We did some work on stream joins [1]. I tested stream-to-relation joins > with Samza. But not stream-to-stream joins. But never updated the streaming > documentation. I'll send a pull request with some documentation on joins. > > Thanks > Milinda > > [1] https://issues.apache.org/jira/browse/CALCITE-968 > > On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote: > >> Hi, >> >> I agree, the Streaming page is a very good starting point for this >> discussion. As suggested by Julian, I created CALCITE-1090 to update the >> page such that it reflects the current state of the discussion (adding HOP >> and TUMBLE functions, punctuations). I can also help with that, e.g., by >> contributing figures, examples, or text, reviewing, or any other way. >> >> From my point of view, the semantics of the window types and the other >> operators in the Streaming document are very good. What is missing are >> joins (windowed stream-stream joins, stream-table joins, stream-table joins >> with table updates) as already noted in the todo section. >> >> Regarding the handling of late-arriving events, I am not sure if this is a >> purely QoS issue as the result of a query might depend on the chosen >> strategy. Also users might want to pick different strategies for different >> operations, so late-arriver strategies are not necessarily end-to-end but >> can be operator specific. However, I think these details should be >> discussed in a separate thread. >> >> I'd like to add a few words about the StreamSQL roadmap of the Flink >> community. >> We are currently preparing our codebase and will start to work on support >> for structured queries on streams in the next weeks. Flink will support two >> query interface, a SQL interface and a LINQ-style Table API [1]. Both will >> be optimized and translated by Calcite. As a first step, we want to add >> simple stream transformations such as selection and projection to both >> interfaces. Next, we will add windowing support (starting with tumbling and >> hopping windows) to the Table API (as is said before, our plans here are >> well aligned with Julian's suggestions). Once this is done, we would extend >> the SQL interface to support windows which is hopefully as simple as using >> a parser that accepts window syntax. >> >> So from our point of view, fixing the semantics of windows and extending >> the optimizer accordingly is more urgent than agreeing on a syntax >> (although the Table API syntax could be inspired by Calcite's StreamSQL >> syntax [2]). I can also help implementing the missing features in Calcite. >> >> Having a reference implementation with tests would be awesome and >> definitely help. >> >> Best, Fabian >> >> [1] >> >> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html >> [2] >> >> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E >> >> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>: >> >>> Fabian, >>> >>> Apologies for the late reply. >>> >>> I would rather that the specification for streaming SQL was not too >>> prescriptive for how late events were handled. Approaches 1, 2 and 3 are >>> all viable, and engines can differentiate themselves by the strength of >>> their support for this. But for the SQL to be considered valid, I think >> the >>> validator just needs to know that it can make progress. >>> >>> There is a large area of functionality I’d call “quality of service” >>> (QoS). This includes latency, reliability, at-least-once, at-most-once or >>> in-order guarantees, as well as the late-row-handling this thread is >>> concerned with. What the QoS metrics have in common is that they are >>> end-to-end. To deliver a high QoS to the consumer, the producer needs to >>> conform to a high QoS. The QoS is beyond the control of the SQL >> statement. >>> (Although you can ask what a SQL statement is able to deliver, given the >>> upstream QoS guarantees.) QoS is best managed by the whole system, and in >>> my opinion this is the biggest reason to have a DSMS. >>> >>> For this reason, I would be inclined to put QoS constraints on the stream >>> definition, not on the query. For example, taking latency as the QoS >> metric >>> of interest, you could flag the Orders stream as “at most 10 ms latency >>> between the record’s timestamp and the wall-clock time of the server >>> receiving the records, and any records arriving after that time are >> logged >>> and discarded”, and that QoS constraint applies to both producers and >>> consumers. >>> >>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what >> is >>> the expected latency of Q?” or tell the planner “produce an >> implementation >>> of Q with a latency of no more than 15 ms, and if you cannot achieve that >>> latency, fail”. You could even register Q in the system and tell the >> system >>> to tighten up the latency of any upstream streams and the standing >> queries >>> that populate them. But it’s not valid to say “execute Q with a latency >> of >>> 15 ms”: the system may not be able to achieve it. >>> >>> In summary: I would allow latency and late-row-handling and other QoS >>> annotations in the query but it’s not the most natural or powerful place >> to >>> put them. >>> >>> Julian >>> >>> >>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote: >>>> >>>> Excellent! I missed the punctuations in the todo list. >>>> >>>> What kind of strategies do you have in mind to handle events that >> arrive >>>> too late? I see >>>> 1. dropping of late events >>>> 2. computing an updated window result for each late arriving >>>> element (implies that the window state is stored for a certain period >>>> before it is discarded) >>>> 3. computing a delta to the previous window result for each late >> arriving >>>> element (requires window state as well, not applicable to all >> aggregation >>>> types) >>>> >>>> It would be nice if strategies to handle late-arrivers could be defined >>> in >>>> the query. >>>> >>>> I think the plans of the Flink community are quite well aligned with >> your >>>> ideas for SQL on Streams. >>>> Should we start by updating / extending the Stream document on the >>> Calcite >>>> website to include the new window definitions (TUMBLE, HOP) and a >>>> discussion of punctuations/watermarks/time bounds? >>>> >>>> Fabian >>>> >>>> >>>> >>>> >>>> >>>> >>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>: >>>> >>>>> Let me rephrase: The *majority* of the literature, of which I cited >>>>> just one example, calls them punctuation, and a couple of recent >>>>> papers out of Mountain View doesn't change that. >>>>> >>>>> There are some fine distinctions between punctuation, heartbeats, >>>>> watermarks and rowtime bounds, mostly in terms of how they are >>>>> generated and propagated, that matter little when planning the query. >>>>> >>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]> >>> wrote: >>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]> >> wrote: >>>>>> >>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", >> which >>>>>>> is the same thing. (Actually, I prefer to call it "rowtime bound" >>>>>>> because it is feels more like a dynamic constraint than a piece of >>>>>>> data, but the literature[1] calls them punctuation.) >>>>>>> >>>>>> >>>>>> Some of the literature calls them punctuation, other literature [1] >>> calls >>>>>> them watermarks. >>>>>> >>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >>>>> >>> >>> >> > > > > -- > Milinda Pathirage > > PhD Student | Research Assistant > School of Informatics and Computing | Data to Insight Center > Indiana University > > twitter: milindalakmal > skype: milinda.pathirage > blog: http://milinda.pathirage.org
