Re: MATCH_RECOGNIZE

Julian Hyde Fri, 28 Dec 2018 16:59:37 -0800

I remembered another reason I didn’t make the automaton deterministic. For 
MATCH_RECOGNIZE, any given row can match multiple symbols. Thus it’s not like 
regex matching, where the 3rd character of the string is either a or b. The 
third row can match symbols A and B simultaneously. To go deterministic, we’d 
not only have to create the powerset of states, we’d also have to label our 
transitions with the powerset of symbols. “When in state (S1, S3, S4), if we 
see a row that matches symbols A and B but not C, we transition to state (S2, 
S4, S5)”.


We can figure out which of the 2^N elements in the states powerset are 
reachable, and build only those, but we can’t do the same with the 2^M  
elements in the symbol powerset. We’d have to build the whole symbol powerset. 
That sounded painful, so I decided to stay in non-deterministic land.

As for how I translated “repeat”. I used some online resources, but I don’t 
recall what they were. I’m not surprised that there are bugs, but it should be 
easy to find and fix them. Since the automaton is capable of matching regexes, 
I thought we could find some existing regex tests and use those to shake out 
bugs.

Julian


> On Dec 28, 2018, at 4:29 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> I hesitated to make the automaton deterministic because the DFA can have 
> exponentially more states than its corresponding NFA, and I was concerned 
> that this worst case might occur in real-world queries. I could imagine a 
> query with 20 symbols and 1 million states, or 30 symbols and 1 billion 
> states. Are you confident that the DFA will always have a reasonable size?
> 
> Whether or not we go to DFA, removing epsilon transitions does seem 
> worthwhile. I think that would cause only a small increase in the number of 
> states.
> 
> I’m going to stop working on this code, until you reach a stopping point. 
> There’s no point in us treading on each other’s feet. It’s a shame, because I 
> was looking forward to working on this code over the Xmas break.
> 
> Julian
> 
> 
>> On Dec 28, 2018, at 3:02 PM, Julian Feinauer <j.feina...@pragmaticminds.de> 
>> wrote:
>> 
>> Hi Julian,
>> 
>> as it got really confusing for me with the eps-NFA (and all the concurrent 
>> partial matches in the matcher) I added a class DFA that transforms the 
>> epsilon-NFA from an automaton to a DFA and reimplemented the Matcher based 
>> on it.
>> I think it is running now but it fails on tests that contain a "repeat" 
>> statement (X{a,b}) in AutomatonTest.
>> The constructed DFA from these statements is wrong.
>> Do you have a reference of the transformation in your Thompson construction? 
>> I found nothing on quick googling.
>> I would like to check the implementation in the AutomatonBuilder:202 ff as I 
>> did not find a Bug in the Matcher or the DFA code.
>> Otherwise I would try to check it by expanding the repeat with symbols, ors 
>> and optionals.
>> 
>> I also fixed the coding style (sorry for that, I totally missed that) so it 
>> should be better to review my code now (as of commit 
>> https://github.com/julianhyde/calcite/pull/15/commits/358ca1c5928b57cc96c8b39be8d017872d870dcf
>>  ).
>> 
>> Best
>> JulianF
>> 
>> Am 28.12.18, 02:19 schrieb "Julian Hyde" <jh...@apache.org>:
>> 
>>   I think we should get one example working end-to-end before moving to the 
>> next. (By “example” I mean a SQL query on a standard data set that exercises 
>> one new feature, say FINAL.) Right now nothing works end-to-end because we 
>> don’t have the basic code generation working.
>> 
>>   I agree that assigning symbols to matched rows is a hard problem. I think 
>> the best approach is to first figure out whether there is a match (that’s 
>> what Automaton does) and then, in a second pass, assign a symbol to each 
>> row. The second pass might be significantly slower, but only occurs less 
>> often. Also, AFAICT, symbol assignment is only required for the CLASSIFIER() 
>> function. So I was going to defer that task.
>> 
>>   Yes, I am actively working on this code. I see that your branch has 
>> significant changes because you use a different coding style (e.g. different 
>> indentation). Please change your code back to the existing style. There is 
>> no reason to make the task even more difficult than it already is.
>> 
>>   Julian
>> 
>> 
>>> On Dec 27, 2018, at 1:19 PM, Julian Feinauer <j.feina...@pragmaticminds.de> 
>>> wrote:
>>> 
>>> Hi Julian,
>>> 
>>> regarding "^" and "$" it seems like Zhiqiang already introduced the fields 
>>> strictStart and strictEnd in org.apache.calcite.rel.core.Match. But I agree 
>>> with you and already had the same idea. And I'll go over to you last commit 
>>> to start my branch off.
>>> 
>>> I made some progress in my branch [1]. I get it to compile and I get the 
>>> test `JdbcTest.testMatch` to run and to fail (but no longer throw an 
>>> exception, at least).
>>> I fixed several things at several places and I think the code generation is 
>>> now working (NOT working good) for Matcher and Emitter.
>>> But there are “crucial” points where I’d like to have your advice (or 
>>> someone else familiar with these topics):
>>> 
>>> First, I’m unsure how the FINAL function should be implemented (it’s no 
>>> regular operator and I did not find any reference on how to deal with this) 
>>> so I “shortcutet” it by a reference to the ABS function which is “noop” in 
>>> the test case, see RexImplTable:385.
>>> 
>>> I also have no real idea about the implementation of PREV / LAST Operators. 
>>> I think there are some similarities to Window Aggregates and the PRECEDING 
>>> / FOLLOWING operators, like RexWindowBound.
>>> 
>>> But currently I started working on a refactoring of the Matcher. Currently 
>>> it only returns the rows that matched but not the respective symbols the 
>>> rows where matched to. They are necessary for the emitter. I'm unsure 
>>> whether to keep it based on an NFA or it is easier with a DFA. 
>>> 
>>> Before I continue and dig more through the code base it would be good for 
>>> me to have some kind of feedback whether I’m going in the right direction 
>>> and the things I do are of any value or if I misunderstood or 
>>> misinterpreted some parts.
>>> 
>>> JulianF
>>> 
>>> PS.: Are you actively working on the branch? We should synchronize to avoid 
>>> duplicate work.
>>> 
>>> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
>>> 
>>> Am 27.12.18, 21:21 schrieb "Julian Hyde" <jh...@apache.org>:
>>> 
>>>  I think you can implement “^” by adding a special BEGIN state to the 
>>> automaton. Each automaton should be in this state on creation, and there is 
>>> no inbound transition (i.e. no way to get back into this state).
>>> 
>>>  And you can implement “$” by adding a special end-of-data symbol (you 
>>> might as well call it “$”) that is sent to each partition’s automaton when 
>>> the input ends.
>>> 
>>>  These seem to be elegant solutions because most of the work is in Pattern 
>>> and Automaton, and can be unit-tested in AutomatonTest. Just a little extra 
>>> plumbing needs to be added to the runtime in order to use it.
>>> 
>>>  As you have noticed my branch 
>>> https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
>>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is 
>>> broken as of the latest commit. Consider starting your branch from the 
>>> previous commit 
>>> https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7
>>>  
>>> <https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>.
>>>  We can sync up when my branch is working.
>>> 
>>>  Julian
>>> 
>>> 
>>>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer 
>>>> <j.feina...@pragmaticminds.de> wrote:
>>>> 
>>>> Hi Julian,
>>>> 
>>>> I used [1] as reference. Anchors are explicitly stated as part of the 
>>>> syntax and explained as:
>>>> 
>>>>> Anchors work in terms of positions rather than rows. They match a 
>>>>> position either at the start or end of a partition.
>>>>> ^ matches the position before the first row in the partition.
>>>>> $ matches the position after the last row in the partition.
>>>>> As an example, PATTERN (^A+$) will match only if all rows in a partition 
>>>>> satisfy the condition for A. The resulting match spans the entire 
>>>>> partition.
>>>> 
>>>> Regarding patterns, I think it should not be a big change, as the anchors 
>>>> are defined with respect to partition boundaries. So technically they do 
>>>> not have to see "beyond" boundaries but should simply "see" boundaries.
>>>> So all we need should be an "outside partition" state which CAN be used as 
>>>> starting or ending state (basically symbols "^" and "$" should reference 
>>>> that).
>>>> 
>>>> I'll see if I find a solution based on your code... I'll do the work in my 
>>>> branch [2] based on your branch [3].
>>>> 
>>>> Best
>>>> JulianF
>>>> 
>>>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956
>>>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize
>>>> 
>>>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <jh...@apache.org>:
>>>> 
>>>> You are correct that my 1935-match-recognize branch doesn’t compile (as of 
>>>> 1a552a9). I committed and pushed in the middle of a change because I had 
>>>> done a non-trivial rebase.
>>>> 
>>>> I haven’t missed a file; the two compilation errors were intended to 
>>>> remind me where to start work again. I am working on generating code to 
>>>> emit rows, and to populate measures and predicates from the input row. If 
>>>> you can make progress on that, that would be awesome.
>>>> 
>>>> Are anchors (“^” and “$”) supported by Oracle? If so can you point me to 
>>>> the spec/examples. I am surprised that anything to do with patterns needs 
>>>> see beyond the boundaries of the current partition. I had assumed that 
>>>> each partition has its own state machine and it will be difficult to 
>>>> change that.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer 
>>>>> <j.feina...@pragmaticminds.de> wrote:
>>>>> 
>>>>> Hey,
>>>>> 
>>>>> it's once again me, JulianF.
>>>>> I started work on the Automaton / Matcher and implemented OR and OPTIONAL 
>>>>> ("?") to get started with the code.
>>>>> I would highly appreciate if you (Julian H) could check this code (I made 
>>>>> a PR to your branch).
>>>>> Then, what else did you consider as necessary for the implementation?
>>>>> I thought about anchors ("^", "$") but this would need a little bit of 
>>>>> extra changes in the PartitionStates, as far as I see it (to check when 
>>>>> we "enter" a partition and when we "leave".
>>>>> 
>>>>> Best
>>>>> JulianF
>>>>> 
>>>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" 
>>>>> <j.feina...@pragmaticminds.de>:
>>>>> 
>>>>> Hi Julian,
>>>>> 
>>>>> as I already declared my interest in MATCH_RECOGNIZE and offered my help, 
>>>>> I plan to do some things in the next one or two weeks.
>>>>> Thus, I wanted to start based on your branch (“1935-match-recognize”).
>>>>> 
>>>>> I have some problems getting it to run.
>>>>> Is it possible that there are some files missing in the commit or are 
>>>>> there some things to consider?
>>>>> 
>>>>> Thanks!
>>>>> Julian (F)
>>>>> 
>>>>> On 2018/11/26 20:09:00, Julian Hyde 
>>>>> <j...@apache.org<mailto:j...@apache.org>> wrote:
>>>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I wrote a 
>>>>>> standalone class called Automaton that allows you to build patterns 
>>>>>> (basically regular expressions, but sufficient for the PATTERN 
>>>>>> sub-clause of MATCH_RECOGNIZE), and execute them in a unit test.>
>>>>>> 
>>>>>> Would someone like to help me develop this? We have support for “*” 
>>>>>> (zero or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded 
>>>>>> repeats) but need “|” (or) and several others. It should be fairly 
>>>>>> straightforward test-driven development: add tests to AutomatonTest.java 
>>>>>> [1], then change Automaton, AutomatonBuilder, Pattern or Matcher until 
>>>>>> they pass.>
>>>>>> 
>>>>>> We also need lots of SQL tests. Could someone write queries against 
>>>>>> Oracle’s “ticker” table and paste the queries and results into match.iq?>
>>>>>> 
>>>>>> See CALCITE-1935 [2], and my branch [3].>
>>>>>> 
>>>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my 
>>>>>> branch, so this will be a joint effort when it is finished.>
>>>>>> 
>>>>>> Julian>
>>>>>> 
>>>>>> [1] 
>>>>>> https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java
>>>>>>  
>>>>>> <https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>>
>>>>>> 
>>>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 
>>>>>> <https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>>
>>>>>> 
>>>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
>>>>>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>>
>>>>>> 
>>>>>> [4] 
>>>>>> https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3
>>>>>>  
>>>>>> <https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>>
>>>>>> 
>>>>>> 
>>>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer 
>>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>> wrote:>
>>>>>>>> 
>>>>>>> Sorry, this is an old mail which got sent accidentally again by my mail 
>>>>>>> program.>
>>>>>>> Please ignore this and excuse this.>
>>>>>>>> 
>>>>>>> Julian>
>>>>>>>> 
>>>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" 
>>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>:>
>>>>>>>> 
>>>>>>> Hi Julian,>
>>>>>>>> 
>>>>>>> I decided to reply to this (old) email, because here some facts are 
>>>>>>> noted.>
>>>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation 
>>>>>>> yesterday.>
>>>>>>>> 
>>>>>>> So I recall that you and Zhigiang He did something on this.>
>>>>>>> I would like to have such a feature in Calcite (as stated in the other 
>>>>>>> mail) and could try to go into this a bit with a colleague of mine and 
>>>>>>> give a bit of support on this topic (In fact, it sounds like fun to 
>>>>>>> us…).>
>>>>>>> Perhaps theres also the chance to learn something from Flinks 
>>>>>>> implementation, as you already had some contacts with them, I think?>
>>>>>>>> 
>>>>>>> Best>
>>>>>>> Julian>
>>>>>>>> 
>>>>>>> On 2018/07/23 17:53:57, Julian Hyde 
>>>>>>> <j....@apache.org<mailto:j....@apache.org>> wrote:>
>>>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE. We 
>>>>>>>> support it in the parser and validator, but there is no runtime 
>>>>>>>> implementation. It’s a shame, because MATCH_RECOGNIZE is an incredibly 
>>>>>>>> powerful SQL feature for both traditional SQL (it’s in Oracle 12c) and 
>>>>>>>> for continuous query (aka complex event processing - CEP).>>
>>>>>>>>> 
>>>>>>>> I figure it’s time to change that. My plan is to implement it 
>>>>>>>> incrementally, getting simple queries working to start with, then 
>>>>>>>> allow people to add more complex queries.>>
>>>>>>>>> 
>>>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2]. The 
>>>>>>>> idea is that if you supply an Enumerable of input data, a finite state 
>>>>>>>> machine to figure out when a sequence of rows makes a match 
>>>>>>>> (represented by a transition function: (state, row) -> state), and a 
>>>>>>>> function to convert a matched set of rows to a set of output rows. The 
>>>>>>>> match method is fairly straightforward, and I almost have it 
>>>>>>>> finished.>>
>>>>>>>>> 
>>>>>>>> The complexity is in generating the finite state machine, emitter 
>>>>>>>> function, and so forth.>>
>>>>>>>>> 
>>>>>>>> Can someone help me with this task? If your idea of fun is 
>>>>>>>> implementing database algorithms, this is about as much fun as it 
>>>>>>>> gets. You learned about finite state machines in college - this is 
>>>>>>>> your chance to actually write one!>>
>>>>>>>>> 
>>>>>>>> This might be a good joint project with the Flink community. I know 
>>>>>>>> Flink are thinking of implementing CEP, and the algorithm we write 
>>>>>>>> here could be shared with Flink (for use via Flink SQL or via the 
>>>>>>>> Flink API).>>
>>>>>>>>> 
>>>>>>>> Julian>>
>>>>>>>>> 
>>>>>>>> [1] https://github.com/julianhyde/calcite/commits/1935-match-recognize 
>>>>>>>> <https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>>
>>>>>>>>> 
>>>>>>>> [2] 
>>>>>>>> https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73
>>>>>>>>  
>>>>>>>> <https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>>
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>

Re: MATCH_RECOGNIZE

Reply via email to