I remembered another reason I didn’t make the automaton deterministic. For MATCH_RECOGNIZE, any given row can match multiple symbols. Thus it’s not like regex matching, where the 3rd character of the string is either a or b. The third row can match symbols A and B simultaneously. To go deterministic, we’d not only have to create the powerset of states, we’d also have to label our transitions with the powerset of symbols. “When in state (S1, S3, S4), if we see a row that matches symbols A and B but not C, we transition to state (S2, S4, S5)”.
We can figure out which of the 2^N elements in the states powerset are reachable, and build only those, but we can’t do the same with the 2^M elements in the symbol powerset. We’d have to build the whole symbol powerset. That sounded painful, so I decided to stay in non-deterministic land. As for how I translated “repeat”. I used some online resources, but I don’t recall what they were. I’m not surprised that there are bugs, but it should be easy to find and fix them. Since the automaton is capable of matching regexes, I thought we could find some existing regex tests and use those to shake out bugs. Julian > On Dec 28, 2018, at 4:29 PM, Julian Hyde <jh...@apache.org> wrote: > > I hesitated to make the automaton deterministic because the DFA can have > exponentially more states than its corresponding NFA, and I was concerned > that this worst case might occur in real-world queries. I could imagine a > query with 20 symbols and 1 million states, or 30 symbols and 1 billion > states. Are you confident that the DFA will always have a reasonable size? > > Whether or not we go to DFA, removing epsilon transitions does seem > worthwhile. I think that would cause only a small increase in the number of > states. > > I’m going to stop working on this code, until you reach a stopping point. > There’s no point in us treading on each other’s feet. It’s a shame, because I > was looking forward to working on this code over the Xmas break. > > Julian > > >> On Dec 28, 2018, at 3:02 PM, Julian Feinauer <j.feina...@pragmaticminds.de> >> wrote: >> >> Hi Julian, >> >> as it got really confusing for me with the eps-NFA (and all the concurrent >> partial matches in the matcher) I added a class DFA that transforms the >> epsilon-NFA from an automaton to a DFA and reimplemented the Matcher based >> on it. >> I think it is running now but it fails on tests that contain a "repeat" >> statement (X{a,b}) in AutomatonTest. >> The constructed DFA from these statements is wrong. >> Do you have a reference of the transformation in your Thompson construction? >> I found nothing on quick googling. >> I would like to check the implementation in the AutomatonBuilder:202 ff as I >> did not find a Bug in the Matcher or the DFA code. >> Otherwise I would try to check it by expanding the repeat with symbols, ors >> and optionals. >> >> I also fixed the coding style (sorry for that, I totally missed that) so it >> should be better to review my code now (as of commit >> https://github.com/julianhyde/calcite/pull/15/commits/358ca1c5928b57cc96c8b39be8d017872d870dcf >> ). >> >> Best >> JulianF >> >> Am 28.12.18, 02:19 schrieb "Julian Hyde" <jh...@apache.org>: >> >> I think we should get one example working end-to-end before moving to the >> next. (By “example” I mean a SQL query on a standard data set that exercises >> one new feature, say FINAL.) Right now nothing works end-to-end because we >> don’t have the basic code generation working. >> >> I agree that assigning symbols to matched rows is a hard problem. I think >> the best approach is to first figure out whether there is a match (that’s >> what Automaton does) and then, in a second pass, assign a symbol to each >> row. The second pass might be significantly slower, but only occurs less >> often. Also, AFAICT, symbol assignment is only required for the CLASSIFIER() >> function. So I was going to defer that task. >> >> Yes, I am actively working on this code. I see that your branch has >> significant changes because you use a different coding style (e.g. different >> indentation). Please change your code back to the existing style. There is >> no reason to make the task even more difficult than it already is. >> >> Julian >> >> >>> On Dec 27, 2018, at 1:19 PM, Julian Feinauer <j.feina...@pragmaticminds.de> >>> wrote: >>> >>> Hi Julian, >>> >>> regarding "^" and "$" it seems like Zhiqiang already introduced the fields >>> strictStart and strictEnd in org.apache.calcite.rel.core.Match. But I agree >>> with you and already had the same idea. And I'll go over to you last commit >>> to start my branch off. >>> >>> I made some progress in my branch [1]. I get it to compile and I get the >>> test `JdbcTest.testMatch` to run and to fail (but no longer throw an >>> exception, at least). >>> I fixed several things at several places and I think the code generation is >>> now working (NOT working good) for Matcher and Emitter. >>> But there are “crucial” points where I’d like to have your advice (or >>> someone else familiar with these topics): >>> >>> First, I’m unsure how the FINAL function should be implemented (it’s no >>> regular operator and I did not find any reference on how to deal with this) >>> so I “shortcutet” it by a reference to the ABS function which is “noop” in >>> the test case, see RexImplTable:385. >>> >>> I also have no real idea about the implementation of PREV / LAST Operators. >>> I think there are some similarities to Window Aggregates and the PRECEDING >>> / FOLLOWING operators, like RexWindowBound. >>> >>> But currently I started working on a refactoring of the Matcher. Currently >>> it only returns the rows that matched but not the respective symbols the >>> rows where matched to. They are necessary for the emitter. I'm unsure >>> whether to keep it based on an NFA or it is easier with a DFA. >>> >>> Before I continue and dig more through the code base it would be good for >>> me to have some kind of feedback whether I’m going in the right direction >>> and the things I do are of any value or if I misunderstood or >>> misinterpreted some parts. >>> >>> JulianF >>> >>> PS.: Are you actively working on the branch? We should synchronize to avoid >>> duplicate work. >>> >>> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize >>> >>> Am 27.12.18, 21:21 schrieb "Julian Hyde" <jh...@apache.org>: >>> >>> I think you can implement “^” by adding a special BEGIN state to the >>> automaton. Each automaton should be in this state on creation, and there is >>> no inbound transition (i.e. no way to get back into this state). >>> >>> And you can implement “$” by adding a special end-of-data symbol (you >>> might as well call it “$”) that is sent to each partition’s automaton when >>> the input ends. >>> >>> These seem to be elegant solutions because most of the work is in Pattern >>> and Automaton, and can be unit-tested in AutomatonTest. Just a little extra >>> plumbing needs to be added to the runtime in order to use it. >>> >>> As you have noticed my branch >>> https://github.com/julianhyde/calcite/tree/1935-match-recognize/ >>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is >>> broken as of the latest commit. Consider starting your branch from the >>> previous commit >>> https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7 >>> >>> <https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>. >>> We can sync up when my branch is working. >>> >>> Julian >>> >>> >>>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer >>>> <j.feina...@pragmaticminds.de> wrote: >>>> >>>> Hi Julian, >>>> >>>> I used [1] as reference. Anchors are explicitly stated as part of the >>>> syntax and explained as: >>>> >>>>> Anchors work in terms of positions rather than rows. They match a >>>>> position either at the start or end of a partition. >>>>> ^ matches the position before the first row in the partition. >>>>> $ matches the position after the last row in the partition. >>>>> As an example, PATTERN (^A+$) will match only if all rows in a partition >>>>> satisfy the condition for A. The resulting match spans the entire >>>>> partition. >>>> >>>> Regarding patterns, I think it should not be a big change, as the anchors >>>> are defined with respect to partition boundaries. So technically they do >>>> not have to see "beyond" boundaries but should simply "see" boundaries. >>>> So all we need should be an "outside partition" state which CAN be used as >>>> starting or ending state (basically symbols "^" and "$" should reference >>>> that). >>>> >>>> I'll see if I find a solution based on your code... I'll do the work in my >>>> branch [2] based on your branch [3]. >>>> >>>> Best >>>> JulianF >>>> >>>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956 >>>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize >>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize >>>> >>>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <jh...@apache.org>: >>>> >>>> You are correct that my 1935-match-recognize branch doesn’t compile (as of >>>> 1a552a9). I committed and pushed in the middle of a change because I had >>>> done a non-trivial rebase. >>>> >>>> I haven’t missed a file; the two compilation errors were intended to >>>> remind me where to start work again. I am working on generating code to >>>> emit rows, and to populate measures and predicates from the input row. If >>>> you can make progress on that, that would be awesome. >>>> >>>> Are anchors (“^” and “$”) supported by Oracle? If so can you point me to >>>> the spec/examples. I am surprised that anything to do with patterns needs >>>> see beyond the boundaries of the current partition. I had assumed that >>>> each partition has its own state machine and it will be difficult to >>>> change that. >>>> >>>> Julian >>>> >>>> >>>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer >>>>> <j.feina...@pragmaticminds.de> wrote: >>>>> >>>>> Hey, >>>>> >>>>> it's once again me, JulianF. >>>>> I started work on the Automaton / Matcher and implemented OR and OPTIONAL >>>>> ("?") to get started with the code. >>>>> I would highly appreciate if you (Julian H) could check this code (I made >>>>> a PR to your branch). >>>>> Then, what else did you consider as necessary for the implementation? >>>>> I thought about anchors ("^", "$") but this would need a little bit of >>>>> extra changes in the PartitionStates, as far as I see it (to check when >>>>> we "enter" a partition and when we "leave". >>>>> >>>>> Best >>>>> JulianF >>>>> >>>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" >>>>> <j.feina...@pragmaticminds.de>: >>>>> >>>>> Hi Julian, >>>>> >>>>> as I already declared my interest in MATCH_RECOGNIZE and offered my help, >>>>> I plan to do some things in the next one or two weeks. >>>>> Thus, I wanted to start based on your branch (“1935-match-recognize”). >>>>> >>>>> I have some problems getting it to run. >>>>> Is it possible that there are some files missing in the commit or are >>>>> there some things to consider? >>>>> >>>>> Thanks! >>>>> Julian (F) >>>>> >>>>> On 2018/11/26 20:09:00, Julian Hyde >>>>> <j...@apache.org<mailto:j...@apache.org>> wrote: >>>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I wrote a >>>>>> standalone class called Automaton that allows you to build patterns >>>>>> (basically regular expressions, but sufficient for the PATTERN >>>>>> sub-clause of MATCH_RECOGNIZE), and execute them in a unit test.> >>>>>> >>>>>> Would someone like to help me develop this? We have support for “*” >>>>>> (zero or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded >>>>>> repeats) but need “|” (or) and several others. It should be fairly >>>>>> straightforward test-driven development: add tests to AutomatonTest.java >>>>>> [1], then change Automaton, AutomatonBuilder, Pattern or Matcher until >>>>>> they pass.> >>>>>> >>>>>> We also need lots of SQL tests. Could someone write queries against >>>>>> Oracle’s “ticker” table and paste the queries and results into match.iq?> >>>>>> >>>>>> See CALCITE-1935 [2], and my branch [3].> >>>>>> >>>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my >>>>>> branch, so this will be a joint effort when it is finished.> >>>>>> >>>>>> Julian> >>>>>> >>>>>> [1] >>>>>> https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java >>>>>> >>>>>> <https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>> >>>>>> >>>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 >>>>>> <https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>> >>>>>> >>>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ >>>>>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>> >>>>>> >>>>>> [4] >>>>>> https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3 >>>>>> >>>>>> <https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>> >>>>>> >>>>>> >>>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer >>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>> wrote:> >>>>>>>> >>>>>>> Sorry, this is an old mail which got sent accidentally again by my mail >>>>>>> program.> >>>>>>> Please ignore this and excuse this.> >>>>>>>> >>>>>>> Julian> >>>>>>>> >>>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" >>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>:> >>>>>>>> >>>>>>> Hi Julian,> >>>>>>>> >>>>>>> I decided to reply to this (old) email, because here some facts are >>>>>>> noted.> >>>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation >>>>>>> yesterday.> >>>>>>>> >>>>>>> So I recall that you and Zhigiang He did something on this.> >>>>>>> I would like to have such a feature in Calcite (as stated in the other >>>>>>> mail) and could try to go into this a bit with a colleague of mine and >>>>>>> give a bit of support on this topic (In fact, it sounds like fun to >>>>>>> us…).> >>>>>>> Perhaps theres also the chance to learn something from Flinks >>>>>>> implementation, as you already had some contacts with them, I think?> >>>>>>>> >>>>>>> Best> >>>>>>> Julian> >>>>>>>> >>>>>>> On 2018/07/23 17:53:57, Julian Hyde >>>>>>> <j....@apache.org<mailto:j....@apache.org>> wrote:> >>>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE. We >>>>>>>> support it in the parser and validator, but there is no runtime >>>>>>>> implementation. It’s a shame, because MATCH_RECOGNIZE is an incredibly >>>>>>>> powerful SQL feature for both traditional SQL (it’s in Oracle 12c) and >>>>>>>> for continuous query (aka complex event processing - CEP).>> >>>>>>>>> >>>>>>>> I figure it’s time to change that. My plan is to implement it >>>>>>>> incrementally, getting simple queries working to start with, then >>>>>>>> allow people to add more complex queries.>> >>>>>>>>> >>>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2]. The >>>>>>>> idea is that if you supply an Enumerable of input data, a finite state >>>>>>>> machine to figure out when a sequence of rows makes a match >>>>>>>> (represented by a transition function: (state, row) -> state), and a >>>>>>>> function to convert a matched set of rows to a set of output rows. The >>>>>>>> match method is fairly straightforward, and I almost have it >>>>>>>> finished.>> >>>>>>>>> >>>>>>>> The complexity is in generating the finite state machine, emitter >>>>>>>> function, and so forth.>> >>>>>>>>> >>>>>>>> Can someone help me with this task? If your idea of fun is >>>>>>>> implementing database algorithms, this is about as much fun as it >>>>>>>> gets. You learned about finite state machines in college - this is >>>>>>>> your chance to actually write one!>> >>>>>>>>> >>>>>>>> This might be a good joint project with the Flink community. I know >>>>>>>> Flink are thinking of implementing CEP, and the algorithm we write >>>>>>>> here could be shared with Flink (for use via Flink SQL or via the >>>>>>>> Flink API).>> >>>>>>>>> >>>>>>>> Julian>> >>>>>>>>> >>>>>>>> [1] https://github.com/julianhyde/calcite/commits/1935-match-recognize >>>>>>>> <https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>> >>>>>>>>> >>>>>>>> [2] >>>>>>>> https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73 >>>>>>>> >>>>>>>> <https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >>> >> >> >> >