Re: MATCH_RECOGNIZE

Julian Hyde Fri, 28 Dec 2018 16:30:22 -0800

I hesitated to make the automaton deterministic because the DFA can have 
exponentially more states than its corresponding NFA, and I was concerned that 
this worst case might occur in real-world queries. I could imagine a query with 
20 symbols and 1 million states, or 30 symbols and 1 billion states. Are you 
confident that the DFA will always have a reasonable size?


Whether or not we go to DFA, removing epsilon transitions does seem worthwhile. 
I think that would cause only a small increase in the number of states.

I’m going to stop working on this code, until you reach a stopping point. 
There’s no point in us treading on each other’s feet. It’s a shame, because I 
was looking forward to working on this code over the Xmas break.

Julian


> On Dec 28, 2018, at 3:02 PM, Julian Feinauer <j.feina...@pragmaticminds.de> 
> wrote:
> 
> Hi Julian,
> 
> as it got really confusing for me with the eps-NFA (and all the concurrent 
> partial matches in the matcher) I added a class DFA that transforms the 
> epsilon-NFA from an automaton to a DFA and reimplemented the Matcher based on 
> it.
> I think it is running now but it fails on tests that contain a "repeat" 
> statement (X{a,b}) in AutomatonTest.
> The constructed DFA from these statements is wrong.
> Do you have a reference of the transformation in your Thompson construction? 
> I found nothing on quick googling.
> I would like to check the implementation in the AutomatonBuilder:202 ff as I 
> did not find a Bug in the Matcher or the DFA code.
> Otherwise I would try to check it by expanding the repeat with symbols, ors 
> and optionals.
> 
> I also fixed the coding style (sorry for that, I totally missed that) so it 
> should be better to review my code now (as of commit 
> https://github.com/julianhyde/calcite/pull/15/commits/358ca1c5928b57cc96c8b39be8d017872d870dcf
>  ).
> 
> Best
> JulianF
> 
> Am 28.12.18, 02:19 schrieb "Julian Hyde" <jh...@apache.org>:
> 
>    I think we should get one example working end-to-end before moving to the 
> next. (By “example” I mean a SQL query on a standard data set that exercises 
> one new feature, say FINAL.) Right now nothing works end-to-end because we 
> don’t have the basic code generation working.
> 
>    I agree that assigning symbols to matched rows is a hard problem. I think 
> the best approach is to first figure out whether there is a match (that’s 
> what Automaton does) and then, in a second pass, assign a symbol to each row. 
> The second pass might be significantly slower, but only occurs less often. 
> Also, AFAICT, symbol assignment is only required for the CLASSIFIER() 
> function. So I was going to defer that task.
> 
>    Yes, I am actively working on this code. I see that your branch has 
> significant changes because you use a different coding style (e.g. different 
> indentation). Please change your code back to the existing style. There is no 
> reason to make the task even more difficult than it already is.
> 
>    Julian
> 
> 
>> On Dec 27, 2018, at 1:19 PM, Julian Feinauer <j.feina...@pragmaticminds.de> 
>> wrote:
>> 
>> Hi Julian,
>> 
>> regarding "^" and "$" it seems like Zhiqiang already introduced the fields 
>> strictStart and strictEnd in org.apache.calcite.rel.core.Match. But I agree 
>> with you and already had the same idea. And I'll go over to you last commit 
>> to start my branch off.
>> 
>> I made some progress in my branch [1]. I get it to compile and I get the 
>> test `JdbcTest.testMatch` to run and to fail (but no longer throw an 
>> exception, at least).
>> I fixed several things at several places and I think the code generation is 
>> now working (NOT working good) for Matcher and Emitter.
>> But there are “crucial” points where I’d like to have your advice (or 
>> someone else familiar with these topics):
>> 
>> First, I’m unsure how the FINAL function should be implemented (it’s no 
>> regular operator and I did not find any reference on how to deal with this) 
>> so I “shortcutet” it by a reference to the ABS function which is “noop” in 
>> the test case, see RexImplTable:385.
>> 
>> I also have no real idea about the implementation of PREV / LAST Operators. 
>> I think there are some similarities to Window Aggregates and the PRECEDING / 
>> FOLLOWING operators, like RexWindowBound.
>> 
>> But currently I started working on a refactoring of the Matcher. Currently 
>> it only returns the rows that matched but not the respective symbols the 
>> rows where matched to. They are necessary for the emitter. I'm unsure 
>> whether to keep it based on an NFA or it is easier with a DFA. 
>> 
>> Before I continue and dig more through the code base it would be good for me 
>> to have some kind of feedback whether I’m going in the right direction and 
>> the things I do are of any value or if I misunderstood or misinterpreted 
>> some parts.
>> 
>> JulianF
>> 
>> PS.: Are you actively working on the branch? We should synchronize to avoid 
>> duplicate work.
>> 
>> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
>> 
>> Am 27.12.18, 21:21 schrieb "Julian Hyde" <jh...@apache.org>:
>> 
>>   I think you can implement “^” by adding a special BEGIN state to the 
>> automaton. Each automaton should be in this state on creation, and there is 
>> no inbound transition (i.e. no way to get back into this state).
>> 
>>   And you can implement “$” by adding a special end-of-data symbol (you 
>> might as well call it “$”) that is sent to each partition’s automaton when 
>> the input ends.
>> 
>>   These seem to be elegant solutions because most of the work is in Pattern 
>> and Automaton, and can be unit-tested in AutomatonTest. Just a little extra 
>> plumbing needs to be added to the runtime in order to use it.
>> 
>>   As you have noticed my branch 
>> https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is broken 
>> as of the latest commit. Consider starting your branch from the previous 
>> commit 
>> https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7
>>  
>> <https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>.
>>  We can sync up when my branch is working.
>> 
>>   Julian
>> 
>> 
>>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer <j.feina...@pragmaticminds.de> 
>>> wrote:
>>> 
>>> Hi Julian,
>>> 
>>> I used [1] as reference. Anchors are explicitly stated as part of the 
>>> syntax and explained as:
>>> 
>>>> Anchors work in terms of positions rather than rows. They match a position 
>>>> either at the start or end of a partition.
>>>>  ^ matches the position before the first row in the partition.
>>>>  $ matches the position after the last row in the partition.
>>>> As an example, PATTERN (^A+$) will match only if all rows in a partition 
>>>> satisfy the condition for A. The resulting match spans the entire 
>>>> partition.
>>> 
>>> Regarding patterns, I think it should not be a big change, as the anchors 
>>> are defined with respect to partition boundaries. So technically they do 
>>> not have to see "beyond" boundaries but should simply "see" boundaries.
>>> So all we need should be an "outside partition" state which CAN be used as 
>>> starting or ending state (basically symbols "^" and "$" should reference 
>>> that).
>>> 
>>> I'll see if I find a solution based on your code... I'll do the work in my 
>>> branch [2] based on your branch [3].
>>> 
>>> Best
>>> JulianF
>>> 
>>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956
>>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize
>>> 
>>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <jh...@apache.org>:
>>> 
>>>  You are correct that my 1935-match-recognize branch doesn’t compile (as of 
>>> 1a552a9). I committed and pushed in the middle of a change because I had 
>>> done a non-trivial rebase.
>>> 
>>>  I haven’t missed a file; the two compilation errors were intended to 
>>> remind me where to start work again. I am working on generating code to 
>>> emit rows, and to populate measures and predicates from the input row. If 
>>> you can make progress on that, that would be awesome.
>>> 
>>>  Are anchors (“^” and “$”) supported by Oracle? If so can you point me to 
>>> the spec/examples. I am surprised that anything to do with patterns needs 
>>> see beyond the boundaries of the current partition. I had assumed that each 
>>> partition has its own state machine and it will be difficult to change that.
>>> 
>>>  Julian
>>> 
>>> 
>>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer 
>>>> <j.feina...@pragmaticminds.de> wrote:
>>>> 
>>>> Hey,
>>>> 
>>>> it's once again me, JulianF.
>>>> I started work on the Automaton / Matcher and implemented OR and OPTIONAL 
>>>> ("?") to get started with the code.
>>>> I would highly appreciate if you (Julian H) could check this code (I made 
>>>> a PR to your branch).
>>>> Then, what else did you consider as necessary for the implementation?
>>>> I thought about anchors ("^", "$") but this would need a little bit of 
>>>> extra changes in the PartitionStates, as far as I see it (to check when we 
>>>> "enter" a partition and when we "leave".
>>>> 
>>>> Best
>>>> JulianF
>>>> 
>>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" 
>>>> <j.feina...@pragmaticminds.de>:
>>>> 
>>>> Hi Julian,
>>>> 
>>>> as I already declared my interest in MATCH_RECOGNIZE and offered my help, 
>>>> I plan to do some things in the next one or two weeks.
>>>> Thus, I wanted to start based on your branch (“1935-match-recognize”).
>>>> 
>>>> I have some problems getting it to run.
>>>> Is it possible that there are some files missing in the commit or are 
>>>> there some things to consider?
>>>> 
>>>> Thanks!
>>>> Julian (F)
>>>> 
>>>> On 2018/11/26 20:09:00, Julian Hyde 
>>>> <j...@apache.org<mailto:j...@apache.org>> wrote:
>>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I wrote a 
>>>>> standalone class called Automaton that allows you to build patterns 
>>>>> (basically regular expressions, but sufficient for the PATTERN sub-clause 
>>>>> of MATCH_RECOGNIZE), and execute them in a unit test.>
>>>>> 
>>>>> Would someone like to help me develop this? We have support for “*” (zero 
>>>>> or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded repeats) 
>>>>> but need “|” (or) and several others. It should be fairly straightforward 
>>>>> test-driven development: add tests to AutomatonTest.java [1], then change 
>>>>> Automaton, AutomatonBuilder, Pattern or Matcher until they pass.>
>>>>> 
>>>>> We also need lots of SQL tests. Could someone write queries against 
>>>>> Oracle’s “ticker” table and paste the queries and results into match.iq?>
>>>>> 
>>>>> See CALCITE-1935 [2], and my branch [3].>
>>>>> 
>>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my 
>>>>> branch, so this will be a joint effort when it is finished.>
>>>>> 
>>>>> Julian>
>>>>> 
>>>>> [1] 
>>>>> https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java
>>>>>  
>>>>> <https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>>
>>>>> 
>>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 
>>>>> <https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>>
>>>>> 
>>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
>>>>> <https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>>
>>>>> 
>>>>> [4] 
>>>>> https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3
>>>>>  
>>>>> <https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>>
>>>>> 
>>>>> 
>>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer 
>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>> wrote:>
>>>>>>> 
>>>>>> Sorry, this is an old mail which got sent accidentally again by my mail 
>>>>>> program.>
>>>>>> Please ignore this and excuse this.>
>>>>>>> 
>>>>>> Julian>
>>>>>>> 
>>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" 
>>>>>> <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>:>
>>>>>>> 
>>>>>> Hi Julian,>
>>>>>>> 
>>>>>> I decided to reply to this (old) email, because here some facts are 
>>>>>> noted.>
>>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation 
>>>>>> yesterday.>
>>>>>>> 
>>>>>> So I recall that you and Zhigiang He did something on this.>
>>>>>> I would like to have such a feature in Calcite (as stated in the other 
>>>>>> mail) and could try to go into this a bit with a colleague of mine and 
>>>>>> give a bit of support on this topic (In fact, it sounds like fun to 
>>>>>> us…).>
>>>>>> Perhaps theres also the chance to learn something from Flinks 
>>>>>> implementation, as you already had some contacts with them, I think?>
>>>>>>> 
>>>>>> Best>
>>>>>> Julian>
>>>>>>> 
>>>>>> On 2018/07/23 17:53:57, Julian Hyde 
>>>>>> <j....@apache.org<mailto:j....@apache.org>> wrote:>
>>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE. We 
>>>>>>> support it in the parser and validator, but there is no runtime 
>>>>>>> implementation. It’s a shame, because MATCH_RECOGNIZE is an incredibly 
>>>>>>> powerful SQL feature for both traditional SQL (it’s in Oracle 12c) and 
>>>>>>> for continuous query (aka complex event processing - CEP).>>
>>>>>>>> 
>>>>>>> I figure it’s time to change that. My plan is to implement it 
>>>>>>> incrementally, getting simple queries working to start with, then allow 
>>>>>>> people to add more complex queries.>>
>>>>>>>> 
>>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2]. The idea 
>>>>>>> is that if you supply an Enumerable of input data, a finite state 
>>>>>>> machine to figure out when a sequence of rows makes a match 
>>>>>>> (represented by a transition function: (state, row) -> state), and a 
>>>>>>> function to convert a matched set of rows to a set of output rows. The 
>>>>>>> match method is fairly straightforward, and I almost have it finished.>>
>>>>>>>> 
>>>>>>> The complexity is in generating the finite state machine, emitter 
>>>>>>> function, and so forth.>>
>>>>>>>> 
>>>>>>> Can someone help me with this task? If your idea of fun is implementing 
>>>>>>> database algorithms, this is about as much fun as it gets. You learned 
>>>>>>> about finite state machines in college - this is your chance to 
>>>>>>> actually write one!>>
>>>>>>>> 
>>>>>>> This might be a good joint project with the Flink community. I know 
>>>>>>> Flink are thinking of implementing CEP, and the algorithm we write here 
>>>>>>> could be shared with Flink (for use via Flink SQL or via the Flink 
>>>>>>> API).>>
>>>>>>>> 
>>>>>>> Julian>>
>>>>>>>> 
>>>>>>> [1] https://github.com/julianhyde/calcite/commits/1935-match-recognize 
>>>>>>> <https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>>
>>>>>>>> 
>>>>>>> [2] 
>>>>>>> https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73
>>>>>>>  
>>>>>>> <https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>>
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
>

Re: MATCH_RECOGNIZE

Reply via email to