Sorry, I misunderstood. As for ways to tell the system that it can make progress: the more the merrier. There's no single "best" mechanism; it depends on the business problem. A good engine should support several, including fully-ordered columns, punctuation, and slack, and let users choose on a per-stream (per-topic) or even per-query basis.
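To make the slack option concrete, here is a toy sketch (my own illustrative Python, not any engine's API): the emitted row-time bound trails the highest event time seen by a fixed slack, so rows that arrive up to `slack` time units out of order are still accepted.

```python
# Hypothetical sketch of a slack-based progress signal. The bound
# (punctuation) trails the maximum event time seen by a fixed slack,
# tolerating rows up to `slack` time units out of order. Names are mine.
class SlackBound:
    def __init__(self, slack):
        self.slack = slack
        self.max_time = None

    def observe(self, event_time):
        """Record an event and return the current row-time bound."""
        if self.max_time is None or event_time > self.max_time:
            self.max_time = event_time
        return self.max_time - self.slack

bound = SlackBound(slack=5)
print(bound.observe(100))  # 95: rows with time <= 95 may now be finalized
print(bound.observe(103))  # 98
print(bound.observe(101))  # 98: a late-ish row never moves the bound back
```

Note that, as discussed below, the validator does not need to know about any of this; it only needs to know that progress is possible.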
On Tue, Feb 23, 2016 at 9:03 AM, Milinda Pathirage <[email protected]> wrote:
> Hi Julian,
>
> I agree with you. Calcite should stay away from physical properties of streams as much as possible. I was just trying to clarify the confusion regarding punctuations and watermarks. My last question was not related to Calcite, but rather to Flink and other implementations. Sorry for the confusion.
>
> Milinda
>
> On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <[email protected]> wrote:
>
>> As the author of the streaming SQL specification, I don't care at all how the system deduces that it is able to make progress, just as the authors of the SQL standard don't care whether a vendor chooses to store records sorted and/or compressed.
>>
>> All the streaming SQL validator/optimizer needs to know is that, say, orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod is non-monotonic, so it can allow streaming aggregations on orderTime and orderId, and disallow them on paymentMethod.
>>
>> This allows streaming engines to add novel mechanisms without having to change the definition of streaming SQL.
>>
>> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage <[email protected]> wrote:
>> > Thank you, Julian, for the document.
>> >
>> > [1] is also a good read on punctuation. What I understood from reading [1] and the MillWheel paper is that a low watermark (or row-time bound) is a property maintained by operators, and operators derive the low watermark by processing punctuations.
>> >
>> > One other thing mentioned in MillWheel is the fact that Google's input streams contain punctuations to communicate stream progress. If punctuations are not present in the input stream, we will have to generate them during ingest based on a slack or some similar technique. What do you think about this?
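On the low-watermark point above, the MillWheel-style rule can be illustrated in a few lines (hypothetical code of mine, not from MillWheel, Flink, or any implementation): an operator's low watermark is the minimum of the latest punctuations seen across its input channels.

```python
# Toy illustration of the rule described above: an operator with several
# input channels may only advance its own low watermark to the minimum of
# the latest punctuation seen on each input. Channel names are invented.
def low_watermark(channel_bounds):
    """channel_bounds maps input channel -> latest punctuation seen (or None)."""
    if not channel_bounds or any(v is None for v in channel_bounds.values()):
        return None  # no progress guarantee until every input has punctuated
    return min(channel_bounds.values())

print(low_watermark({"orders": 120, "shipments": 95}))    # 95
print(low_watermark({"orders": 120, "shipments": None}))  # None
```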
>> >
>> > Thanks
>> > Milinda
>> >
>> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
>> >
>> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <[email protected]> wrote:
>> >
>> >> I’ve updated the Streaming reference guide as Fabian requested: http://calcite.apache.org/docs/stream.html
>> >>
>> >> Julian
>> >>
>> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <[email protected]> wrote:
>> >> >
>> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].
>> >> >
>> >> > The news item[2] gets you to both slides and video.
>> >> >
>> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4] are accurate, all queries are heavily based on sliding windows, and they use a syntax for those windows that is very different to standard SQL. I think we can deal with their use cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.
>> >> >
>> >> > At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting periodic totals even if there were no data in a given period. Can they re-state that use case here, so we can discuss?
>> >> >
>> >> > Julian
>> >> >
>> >> > [1] http://calcite.apache.org/docs/stream.html
>> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
>> >> > [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29
>> >> > [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
>> >> >
>> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <[email protected]> wrote:
>> >> >>
>> >> >> Hi Fabian,
>> >> >>
>> >> >> We did some work on stream joins [1]. I tested stream-to-relation joins with Samza, but not stream-to-stream joins. We never updated the streaming documentation, though; I'll send a pull request with some documentation on joins.
>> >> >>
>> >> >> Thanks
>> >> >> Milinda
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> >> >>
>> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I agree, the Streaming page is a very good starting point for this discussion. As suggested by Julian, I created CALCITE-1090 to update the page such that it reflects the current state of the discussion (adding HOP and TUMBLE functions, punctuations). I can also help with that, e.g., by contributing figures, examples, or text, reviewing, or any other way.
>> >> >>>
>> >> >>> From my point of view, the semantics of the window types and the other operators in the Streaming document are very good. What is missing are joins (windowed stream-stream joins, stream-table joins, stream-table joins with table updates), as already noted in the todo section.
>> >> >>>
>> >> >>> Regarding the handling of late-arriving events, I am not sure if this is a purely QoS issue, as the result of a query might depend on the chosen strategy. Also, users might want to pick different strategies for different operations, so late-arriver strategies are not necessarily end-to-end but can be operator-specific. However, I think these details should be discussed in a separate thread.
>> >> >>>
>> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink community.
>> >> >>> We are currently preparing our codebase and will start to work on support for structured queries on streams in the next weeks. Flink will support two query interfaces, a SQL interface and a LINQ-style Table API [1]. Both will be optimized and translated by Calcite. As a first step, we want to add simple stream transformations such as selection and projection to both interfaces. Next, we will add windowing support (starting with tumbling and hopping windows) to the Table API (as said before, our plans here are well aligned with Julian's suggestions). Once this is done, we would extend the SQL interface to support windows, which is hopefully as simple as using a parser that accepts window syntax.
>> >> >>>
>> >> >>> So from our point of view, fixing the semantics of windows and extending the optimizer accordingly is more urgent than agreeing on a syntax (although the Table API syntax could be inspired by Calcite's StreamSQL syntax [2]). I can also help implementing the missing features in Calcite.
>> >> >>>
>> >> >>> Having a reference implementation with tests would be awesome and definitely help.
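As a footnote on the window semantics discussed above, tumbling and hopping window assignment can be sketched in a few lines (illustrative Python of my own; the function names echo the proposed TUMBLE/HOP SQL operators, but this is not Calcite's or Flink's implementation):

```python
def tumble(ts, size):
    """Start of the single tumbling window of width `size` containing ts."""
    return ts - ts % size

def hop(ts, slide, size):
    """Starts of every hopping window of width `size`, advancing by `slide`,
    that contains ts. Tumbling is the special case slide == size."""
    first = ((ts - size) // slide + 1) * slide  # earliest start > ts - size
    last = (ts // slide) * slide                # latest start <= ts
    return list(range(first, last + 1, slide))

print(tumble(7, 3))  # 6: the row at time 7 falls in window [6, 9)
print(hop(7, 2, 6))  # [2, 4, 6]: three overlapping windows contain time 7
```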
>> >> >>>
>> >> >>> Best, Fabian
>> >> >>>
>> >> >>> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> >> >>> [2] https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> >> >>>
>> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>:
>> >> >>>
>> >> >>>> Fabian,
>> >> >>>>
>> >> >>>> Apologies for the late reply.
>> >> >>>>
>> >> >>>> I would rather that the specification for streaming SQL was not too prescriptive about how late events are handled. Approaches 1, 2 and 3 are all viable, and engines can differentiate themselves by the strength of their support for this. But for the SQL to be considered valid, I think the validator just needs to know that it can make progress.
>> >> >>>>
>> >> >>>> There is a large area of functionality I’d call “quality of service” (QoS). This includes latency, reliability, at-least-once, at-most-once or in-order guarantees, as well as the late-row handling this thread is concerned with. What the QoS metrics have in common is that they are end-to-end. To deliver a high QoS to the consumer, the producer needs to conform to a high QoS. The QoS is beyond the control of the SQL statement. (Although you can ask what a SQL statement is able to deliver, given the upstream QoS guarantees.) QoS is best managed by the whole system, and in my opinion this is the biggest reason to have a DSMS.
>> >> >>>>
>> >> >>>> For this reason, I would be inclined to put QoS constraints on the stream definition, not on the query.
For example, taking latency as the QoS metric of interest, you could flag the Orders stream as “at most 10 ms latency between the record’s timestamp and the wall-clock time of the server receiving the records, and any records arriving after that time are logged and discarded”, and that QoS constraint applies to both producers and consumers.
>> >> >>>>
>> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what is the expected latency of Q?” or tell the planner “produce an implementation of Q with a latency of no more than 15 ms, and if you cannot achieve that latency, fail”. You could even register Q in the system and tell the system to tighten up the latency of any upstream streams and the standing queries that populate them. But it’s not valid to say “execute Q with a latency of 15 ms”: the system may not be able to achieve it.
>> >> >>>>
>> >> >>>> In summary: I would allow latency, late-row handling and other QoS annotations in the query, but it’s not the most natural or powerful place to put them.
>> >> >>>>
>> >> >>>> Julian
>> >> >>>>
>> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote:
>> >> >>>>>
>> >> >>>>> Excellent! I missed the punctuations in the todo list.
>> >> >>>>>
>> >> >>>>> What kind of strategies do you have in mind to handle events that arrive too late? I see
>> >> >>>>> 1. dropping of late events
>> >> >>>>> 2. computing an updated window result for each late-arriving element (implies that the window state is stored for a certain period before it is discarded)
>> >> >>>>> 3. computing a delta to the previous window result for each late-arriving element (requires window state as well; not applicable to all aggregation types)
>> >> >>>>>
>> >> >>>>> It would be nice if strategies to handle late-arrivers could be defined in the query.
>> >> >>>>>
>> >> >>>>> I think the plans of the Flink community are quite well aligned with your ideas for SQL on Streams.
>> >> >>>>> Should we start by updating / extending the Stream document on the Calcite website to include the new window definitions (TUMBLE, HOP) and a discussion of punctuations/watermarks/time bounds?
>> >> >>>>>
>> >> >>>>> Fabian
>> >> >>>>>
>> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>:
>> >> >>>>>
>> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I cited just one example, calls them punctuation, and a couple of recent papers out of Mountain View doesn't change that.
>> >> >>>>>>
>> >> >>>>>> There are some fine distinctions between punctuation, heartbeats, watermarks and rowtime bounds, mostly in terms of how they are generated and propagated, that matter little when planning the query.
>> >> >>>>>>
>> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]> wrote:
>> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", which is the same thing. (Actually, I prefer to call it "rowtime bound" because it feels more like a dynamic constraint than a piece of data, but the literature[1] calls them punctuation.)
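The three late-arrival strategies Fabian lists above can be contrasted on a windowed SUM in a toy sketch (hypothetical code of mine; as he notes, strategy 3's delta works only for aggregates that can be updated incrementally):

```python
# Toy comparison of the three late-arrival strategies for a windowed SUM.
# Retained window state is assumed for strategies 2 and 3; names are mine.
def handle_late(strategy, window_sum, late_value):
    """Return what the operator emits for a late row, or None if dropped."""
    if strategy == "drop":          # 1. discard the late row
        return None
    if strategy == "update":        # 2. re-emit the corrected window result
        return window_sum + late_value
    if strategy == "delta":         # 3. emit only the correction
        return late_value           #    (works for SUM, not e.g. MEDIAN)
    raise ValueError(strategy)

print(handle_late("drop", 100, 7))    # None
print(handle_late("update", 100, 7))  # 107
print(handle_late("delta", 100, 7))   # 7
```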
>> >> >>>>>>>> >> >> >>>>>>> >> >> >>>>>>> Some of the literature calls them punctuation, other literature >> [1] >> >> >>>> calls >> >> >>>>>>> them watermarks. >> >> >>>>>>> >> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >> >> >>>>>> >> >> >>>> >> >> >>>> >> >> >>> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> Milinda Pathirage >> >> >> >> >> >> PhD Student | Research Assistant >> >> >> School of Informatics and Computing | Data to Insight Center >> >> >> Indiana University >> >> >> >> >> >> twitter: milindalakmal >> >> >> skype: milinda.pathirage >> >> >> blog: http://milinda.pathirage.org >> >> > >> >> >> >> >> > >> > >> > -- >> > Milinda Pathirage >> > >> > PhD Student | Research Assistant >> > School of Informatics and Computing | Data to Insight Center >> > Indiana University >> > >> > twitter: milindalakmal >> > skype: milinda.pathirage >> > blog: http://milinda.pathirage.org >> > > > > -- > Milinda Pathirage > > PhD Student | Research Assistant > School of Informatics and Computing | Data to Insight Center > Indiana University > > twitter: milindalakmal > skype: milinda.pathirage > blog: http://milinda.pathirage.org
