Clarification: "A.Z*.A" with Z = not A, matched against "ABABAB", will match ABA twice, but will not match ABABA.
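To make the note above concrete, here is a small Python simulation of the suffix-by-suffix matching described in the quoted message below. It is only a sketch of the behaviour as described in this thread, not the actual MatchPath code: the function names (match_at, matchpath) and the symbols dictionary are made up for illustration, and the rule that a starred or plussed symbol never consumes the last row of the suffix is an assumption inferred from the "123456" trace, where X* stops at 5 even though X is always true.

```python
def match_at(rows, pattern, symbols):
    """Return the length of the match starting at rows[0], or None.

    There is no backtracking: each pattern symbol is applied once, in order.
    """
    pos = 0
    for token in pattern.split("."):
        quantifier = token[-1] if token[-1] in "*+" else ""
        test = symbols[token.rstrip("*+")]
        if not quantifier:
            # A plain symbol must match exactly one row.
            if pos >= len(rows) or not test(rows[pos]):
                return None
            pos += 1
            continue
        start = pos
        # Assumption inferred from the trace in this thread: a quantified
        # symbol advances as far as it can, but never consumes the last row.
        while pos < len(rows) - 1 and test(rows[pos]):
            pos += 1
        if quantifier == "+" and pos == start:
            return None  # "+" needs at least one matching row
    return pos


def matchpath(rows, pattern, symbols):
    """Try every suffix of rows in turn and collect one result per match."""
    matches = []
    for i in range(len(rows)):
        length = match_at(rows[i:], pattern, symbols)
        if length is not None:
            matches.append(rows[i:i + length])
    return matches


# Hypothetical symbol definitions mirroring the examples in the thread.
symbols = {
    "A": lambda r: r == "A",
    "B": lambda r: r == "B",
    "X": lambda r: True,      # always true
    "Z": lambda r: r != "A",  # not A
}

print(matchpath("AAB", "A+.B", symbols))       # ['AAB', 'AB']
print(matchpath("ABABA", "A.X*.A", symbols))   # ['ABABA', 'ABA']
print(matchpath("ABABAB", "A.X*.A", symbols))  # []
print(matchpath("ABABAB", "A.Z*.A", symbols))  # ['ABA', 'ABA']
```

Under these assumptions the simulation reproduces every example in the quoted message, including the two ABA matches mentioned above. Because a symbol never gives rows back once it has advanced, "A.X*.A" cannot fall back to a shorter run of X when the final A fails.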
2014-09-05 16:48 GMT+02:00 Furcy Pin <furcy....@flaminem.com>:

> Hi all,
>
> I've just spent some time trying to understand how the 'regex' syntax
> works for matchpath.
> At first I thought it worked like a usual regex, but that was very
> misleading, as it doesn't. (Perhaps Aster NPath does.)
>
> The first thing it does (as I understand it) is collect the whole set of
> rows in the group declared with DISTRIBUTE BY,
> sorted according to SORT BY.
> It then iterates on that set. I like to think of it as a string (e.g. "AABC").
>
> The UDF will then try to match each suffix of the string (e.g. "AABC",
> "ABC", "BC", "C") one by one, and return one row for each match.
>
> The matching iterates on each symbol of the pattern and, for each of them,
> advances as far as it can in the string.
>
>
> Here are some examples to help people understand how it works.
>
> String > Pattern = Matches
> "AAB" > "A.B" = "AB"
> "AAB" > "A+.B" = "AAB", "AB"
> "BB" > "B.A*.B" = "BB"
> "BAAB" > "B.A*.B" = "BAAB"
>
> The next example is trickier: let's consider that X is a symbol that is
> always true:
> "ABABA" > "A.X*.A" = "ABABA", "ABA"
> "ABABAB" > "A.X*.A" = nothing
>
> To understand what happens in more depth, let's number the letters:
> "ABABAB" > "A.X*.A"
> "123456" > "7.8*.9"
>
> The algorithm will proceed as follows:
> Trying 123456:
>   1 (which is an A) is matched by symbol 7 (A)
>   2345 (BABA) is matched by symbol 8* (X*)
>   6 (B) is *not* matched by 9 (A)
>   duh.
> Trying 23456:
>   2 (B) is not matched by symbol 7 (A)
>   duh.
> Trying 3456:
>   3 (A) is matched by symbol 7 (A)
>   45 (BA) is matched by symbol 8* (X*)
>   6 (B) is *not* matched by 9 (A)
>   duh.
> etc.
>
> So, if you want to match people with two events of type A with anything in
> between (which would be matched by the *classic* regex "A.*A"):
> you shall not use the pattern "A.A", because it looks for two consecutive
> events;
> you shall not use the pattern "A.X*.A" with X matching anything (too
> greedy);
> you may use the pattern "A.Z*.A" with Z = not A, but matched against the
> string "ABABAB", it will only match ABA (non-greedy match) and not ABABA.
>
> I'm still looking for the pattern that does the same thing as the classic
> regex "A.*A"
> (for instance, if you want to measure the duration between the first and
> the last event of type A).
>
> I believe greedy matching requires an automaton to be done efficiently,
> which is why you can't greedy match correctly with the current MatchPath
> implementation.
>
>
> Furcy
>
>
> 2014-09-04 2:32 GMT+02:00 Lefty Leverenz <leftylever...@gmail.com>:
>
>> MatchPath.java still exists in Hive trunk and release 0.13.1
>> (ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/MatchPath.java).
>>
>> -- Lefty
>>
>>
>> On Wed, Sep 3, 2014 at 12:39 PM, Muhammad Asif Abbasi <
>> asif.abb...@gmail.com> wrote:
>>
>>> Hi Furcy,
>>>
>>> Many thanks for your email :)
>>>
>>> My latest info was that the rename took place due to objections by
>>> Teradata, but I didn't know if they had actually requested to take it off
>>> the distribution entirely.
>>>
>>> Does anybody else have an idea on the licensing aspect of this? What
>>> exactly has Teradata patented? Is it the technique to parse the rows in
>>> such a manner? Any tips/techniques would be highly appreciated.
>>>
>>> Regards,
>>> Asif Abbasi
>>>
>>>
>>> On Wed, Sep 3, 2014 at 5:30 PM, Furcy Pin <furcy....@flaminem.com>
>>> wrote:
>>>
>>>> Hi Muhammad,
>>>>
>>>> From what I googled a few months ago on the subject, the MatchPath UDF
>>>> has been removed from Cloudera and Hortonworks releases because Teradata
>>>> claims it violates one of their patents (apparently renaming it did not
>>>> suffice).
>>>>
>>>> I guess that if you really need it, it might be possible to add it
>>>> yourself as an external UDF since the code is still available out there,
>>>> but I have no idea
>>>> whether Teradata would have the right to come after you (or not?) if
>>>> you do.
>>>>
>>>> By the way, if anyone has news on the current situation with MatchPath
>>>> and Teradata, that would be welcome.
>>>>
>>>> Furcy
>>>>
>>>>
>>>> 2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi <asif.abb...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Many thanks for sending these links. Looking forward to more
>>>>> documentation around this.
>>>>>
>>>>> BTW, why does "hive-exec-0.13.0.2.1.1.0-385.jar" not have any class
>>>>> files for the MatchPath UDF? Have they been moved out to a separate JAR
>>>>> file?
>>>>> I can see that "hive-exec-0.13.0.jar" has the appropriate class
>>>>> files, and I have tried to use them. They work well with the demo data
>>>>> set, but we certainly need more documentation around this.
>>>>>
>>>>> Regards,
>>>>> Asif Abbasi
>>>>>
>>>>>
>>>>> On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz <
>>>>> leftylever...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for pointing out that we still need documentation for this in
>>>>>> the wiki. (I've added a doc comment to HIVE-5087
>>>>>> <https://issues.apache.org/jira/browse/HIVE-5087>.)
>>>>>> In the meantime, googling "Hive npath" turned up these sources of
>>>>>> information:
>>>>>>
>>>>>>    - https://github.com/hbutani/SQLWindowing/wiki
>>>>>>    - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
>>>>>>      (slides 20-21)
>>>>>>    - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/
>>>>>>
>>>>>> -- Lefty
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi <
>>>>>> asif.abb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am trying to use the MatchPath UDF (previously called NPath). Does
>>>>>>> anybody have a document around its syntax and usage?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Asif