I hesitated to make the automaton deterministic because the DFA can have exponentially more states than its corresponding NFA, and I was concerned that this worst case might occur in real-world queries. I could imagine a query with 20 symbols and 1 million states, or 30 symbols and 1 billion states. Are you confident that the DFA will always have a reasonable size?
Whether or not we go to DFA, removing epsilon transitions does seem worthwhile. I think that would cause only a small increase in the number of states. I’m going to stop working on this code, until you reach a stopping point. There’s no point in us treading on each other’s feet. It’s a shame, because I was looking forward to working on this code over the Xmas break. Julian > On Dec 28, 2018, at 3:02 PM, Julian Feinauer <j.feina...@pragmaticminds.de> > wrote: > > Hi Julian, > > as it got really confusing for me with the eps-NFA (and all the concurrent > partial matches in the matcher) I added a class DFA that transforms the > epsilon-NFA from an automaton to a DFA and reimplemented the Matcher based on > it. > I think it is running now but it fails on tests that contain a "repeat" > statement (X{a,b}) in AutomatonTest. > The constructed DFA from these statements is wrong. > Do you have a reference of the transformation in your Thompson construction? > I found nothing on quick googling. > I would like to check the implementation in the AutomatonBuilder:202 ff as I > did not find a Bug in the Matcher or the DFA code. > Otherwise I would try to check it by expanding the repeat with symbols, ors > and optionals. > > I also fixed the coding style (sorry for that, I totally missed that) so it > should be better to review my code now (as of commit > https://github.com/julianhyde/calcite/pull/15/commits/358ca1c5928b57cc96c8b39be8d017872d870dcf > ). > > Best > JulianF > > Am 28.12.18, 02:19 schrieb "Julian Hyde" <jh...@apache.org>: > > I think we should get one example working end-to-end before moving to the > next. (By “example” I mean a SQL query on a standard data set that exercises > one new feature, say FINAL.) Right now nothing works end-to-end because we > don’t have the basic code generation working. > > I agree that assigning symbols to matched rows is a hard problem. I think > the best approach is to first figure out whether there is a match (that’s > what Automaton does) and then, in a second pass, assign a symbol to each row. > The second pass might be significantly slower, but only occurs less often. > Also, AFAICT, symbol assignment is only required for the CLASSIFIER() > function. So I was going to defer that task. > > Yes, I am actively working on this code. I see that your branch has > significant changes because you use a different coding style (e.g. different > indentation). Please change your code back to the existing style. There is no > reason to make the task even more difficult than it already is. > > Julian > > >> On Dec 27, 2018, at 1:19 PM, Julian Feinauer <j.feina...@pragmaticminds.de> >> wrote: >> >> Hi Julian, >> >> regarding "^" and "$" it seems like Zhiqiang already introduced the fields >> strictStart and strictEnd in org.apache.calcite.rel.core.Match. But I agree >> with you and already had the same idea. And I'll go over to you last commit >> to start my branch off. >> >> I made some progress in my branch [1]. I get it to compile and I get the >> test `JdbcTest.testMatch` to run and to fail (but no longer throw an >> exception, at least). >> I fixed several things at several places and I think the code generation is >> now working (NOT working good) for Matcher and Emitter. >> But there are “crucial” points where I’d like to have your advice (or >> someone else familiar with these topics): >> >> First, I’m unsure how the FINAL function should be implemented (it’s no >> regular operator and I did not find any reference on how to deal with this) >> so I “shortcutet” it by a reference to the ABS function which is “noop” in >> the test case, see RexImplTable:385. >> >> I also have no real idea about the implementation of PREV / LAST Operators. >> I think there are some similarities to Window Aggregates and the PRECEDING / >> FOLLOWING operators, like RexWindowBound. >> >> But currently I started working on a refactoring of the Matcher. Currently >> it only returns the rows that matched but not the respective symbols the >> rows where matched to. They are necessary for the emitter. I'm unsure >> whether to keep it based on an NFA or it is easier with a DFA. >> >> Before I continue and dig more through the code base it would be good for me >> to have some kind of feedback whether I’m going in the right direction and >> the things I do are of any value or if I misunderstood or misinterpreted >> some parts. >> >> JulianF >> >> PS.: Are you actively working on the branch? We should synchronize to avoid >> duplicate work. >> >> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize >> >> Am 27.12.18, 21:21 schrieb "Julian Hyde" <jh...@apache.org>: >> >> I think you can implement “^” by adding a special BEGIN state to the >> automaton. Each automaton should be in this state on creation, and there is >> no inbound transition (i.e. no way to get back into this state). >> >> And you can implement “$” by adding a special end-of-data symbol (you >> might as well call it “$”) that is sent to each partition’s automaton when >> the input ends. >> >> These seem to be elegant solutions because most of the work is in Pattern >> and Automaton, and can be unit-tested in AutomatonTest. Just a little extra >> plumbing needs to be added to the runtime in order to use it. >> >> As you have noticed my branch >> https://github.com/julianhyde/calcite/tree/1935-match-recognize/ >> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is broken >> as of the latest commit. Consider starting your branch from the previous >> commit >> https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7 >> >> <https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>. >> We can sync up when my branch is working. >> >> Julian >> >> >>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer <j.feina...@pragmaticminds.de> >>> wrote: >>> >>> Hi Julian, >>> >>> I used [1] as reference. Anchors are explicitly stated as part of the >>> syntax and explained as: >>> >>>> Anchors work in terms of positions rather than rows. They match a position >>>> either at the start or end of a partition. >>>> ^ matches the position before the first row in the partition. >>>> $ matches the position after the last row in the partition. >>>> As an example, PATTERN (^A+$) will match only if all rows in a partition >>>> satisfy the condition for A. The resulting match spans the entire >>>> partition. >>> >>> Regarding patterns, I think it should not be a big change, as the anchors >>> are defined with respect to partition boundaries. So technically they do >>> not have to see "beyond" boundaries but should simply "see" boundaries. >>> So all we need should be an "outside partition" state which CAN be used as >>> starting or ending state (basically symbols "^" and "$" should reference >>> that). >>> >>> I'll see if I find a solution based on your code... I'll do the work in my >>> branch [2] based on your branch [3]. >>> >>> Best >>> JulianF >>> >>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956 >>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize >>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize >>> >>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <jh...@apache.org>: >>> >>> You are correct that my 1935-match-recognize branch doesn’t compile (as of >>> 1a552a9). I committed and pushed in the middle of a change because I had >>> done a non-trivial rebase. >>> >>> I haven’t missed a file; the two compilation errors were intended to >>> remind me where to start work again. I am working on generating code to >>> emit rows, and to populate measures and predicates from the input row. If >>> you can make progress on that, that would be awesome. >>> >>> Are anchors (“^” and “$”) supported by Oracle? If so can you point me to >>> the spec/examples. I am surprised that anything to do with patterns needs >>> see beyond the boundaries of the current partition. I had assumed that each >>> partition has its own state machine and it will be difficult to change that. >>> >>> Julian >>> >>> >>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer >>>> <j.feina...@pragmaticminds.de> wrote: >>>> >>>> Hey, >>>> >>>> it's once again me, JulianF. >>>> I started work on the Automaton / Matcher and implemented OR and OPTIONAL >>>> ("?") to get started with the code. >>>> I would highly appreciate if you (Julian H) could check this code (I made >>>> a PR to your branch). >>>> Then, what else did you consider as necessary for the implementation? >>>> I thought about anchors ("^", "$") but this would need a little bit of >>>> extra changes in the PartitionStates, as far as I see it (to check when we >>>> "enter" a partition and when we "leave". >>>> >>>> Best >>>> JulianF >>>> >>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" >>>> <j.feina...@pragmaticminds.de>: >>>> >>>> Hi Julian, >>>> >>>> as I already declared my interest in MATCH_RECOGNIZE and offered my help, >>>> I plan to do some things in the next one or two weeks. >>>> Thus, I wanted to start based on your branch (“1935-match-recognize”). >>>> >>>> I have some problems getting it to run. >>>> Is it possible that there are some files missing in the commit or are >>>> there some things to consider? >>>> >>>> Thanks! >>>> Julian (F) >>>> >>>> On 2018/11/26 20:09:00, Julian Hyde >>>> <j...@apache.org<mailto:j...@apache.org>> wrote: >>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I wrote a >>>>> standalone class called Automaton that allows you to build patterns >>>>> (basically regular expressions, but sufficient for the PATTERN sub-clause >>>>> of MATCH_RECOGNIZE), and execute them in a unit test.> >>>>> >>>>> Would someone like to help me develop this? We have support for “*” (zero >>>>> or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded repeats) >>>>> but need “|” (or) and several others. It should be fairly straightforward >>>>> test-driven development: add tests to AutomatonTest.java [1], then change >>>>> Automaton, AutomatonBuilder, Pattern or Matcher until they pass.> >>>>> >>>>> We also need lots of SQL tests. Could someone write queries against >>>>> Oracle’s “ticker” table and paste the queries and results into match.iq?> >>>>> >>>>> See CALCITE-1935 [2], and my branch [3].> >>>>> >>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my >>>>> branch, so this will be a joint effort when it is finished.> >>>>> >>>>> Julian> >>>>> >>>>> [1] >>>>> https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java >>>>> >>>>> <https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>> >>>>> >>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 >>>>> <https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>> >>>>> >>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ >>>>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>> >>>>> >>>>> [4] >>>>> https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3 >>>>> >>>>> <https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>> >>>>> >>>>> >>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer >>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>> wrote:> >>>>>>> >>>>>> Sorry, this is an old mail which got sent accidentally again by my mail >>>>>> program.> >>>>>> Please ignore this and excuse this.> >>>>>>> >>>>>> Julian> >>>>>>> >>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" >>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>:> >>>>>>> >>>>>> Hi Julian,> >>>>>>> >>>>>> I decided to reply to this (old) email, because here some facts are >>>>>> noted.> >>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation >>>>>> yesterday.> >>>>>>> >>>>>> So I recall that you and Zhigiang He did something on this.> >>>>>> I would like to have such a feature in Calcite (as stated in the other >>>>>> mail) and could try to go into this a bit with a colleague of mine and >>>>>> give a bit of support on this topic (In fact, it sounds like fun to >>>>>> us…).> >>>>>> Perhaps theres also the chance to learn something from Flinks >>>>>> implementation, as you already had some contacts with them, I think?> >>>>>>> >>>>>> Best> >>>>>> Julian> >>>>>>> >>>>>> On 2018/07/23 17:53:57, Julian Hyde >>>>>> <j....@apache.org<mailto:j....@apache.org>> wrote:> >>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE. We >>>>>>> support it in the parser and validator, but there is no runtime >>>>>>> implementation. It’s a shame, because MATCH_RECOGNIZE is an incredibly >>>>>>> powerful SQL feature for both traditional SQL (it’s in Oracle 12c) and >>>>>>> for continuous query (aka complex event processing - CEP).>> >>>>>>>> >>>>>>> I figure it’s time to change that. My plan is to implement it >>>>>>> incrementally, getting simple queries working to start with, then allow >>>>>>> people to add more complex queries.>> >>>>>>>> >>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2]. The idea >>>>>>> is that if you supply an Enumerable of input data, a finite state >>>>>>> machine to figure out when a sequence of rows makes a match >>>>>>> (represented by a transition function: (state, row) -> state), and a >>>>>>> function to convert a matched set of rows to a set of output rows. The >>>>>>> match method is fairly straightforward, and I almost have it finished.>> >>>>>>>> >>>>>>> The complexity is in generating the finite state machine, emitter >>>>>>> function, and so forth.>> >>>>>>>> >>>>>>> Can someone help me with this task? If your idea of fun is implementing >>>>>>> database algorithms, this is about as much fun as it gets. You learned >>>>>>> about finite state machines in college - this is your chance to >>>>>>> actually write one!>> >>>>>>>> >>>>>>> This might be a good joint project with the Flink community. I know >>>>>>> Flink are thinking of implementing CEP, and the algorithm we write here >>>>>>> could be shared with Flink (for use via Flink SQL or via the Flink >>>>>>> API).>> >>>>>>>> >>>>>>> Julian>> >>>>>>>> >>>>>>> [1] https://github.com/julianhyde/calcite/commits/1935-match-recognize >>>>>>> <https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>> >>>>>>>> >>>>>>> [2] >>>>>>> https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73 >>>>>>> >>>>>>> <https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >>> >> >> >> > > >