Re: MatchPath UDF usage info ?

Furcy Pin Fri, 05 Sep 2014 08:00:22 -0700

Clarification:
"A.Z*.A" with Z = not A, matched against "ABABAB", will match ABA twice but
will not match ABABA.
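
For readers used to ordinary regular expressions, the difference can be
checked with Python's re module by treating each event as one character.
This only illustrates classic regex semantics, not MatchPath itself; the
lookahead is just a trick to show the match found from every start position.

```python
import re

events = "ABABAB"

# Classic greedy regex: ".*" grabs everything, then backtracks until the
# final A fits.
classic = re.search(r"A.*A", events).group()

# "Z = not A" between two A's, tried from every start position
# (the lookahead makes overlapping matches visible).
not_a = re.findall(r"(?=(A[^A]*A))", events)

print(classic)  # ABABA
print(not_a)    # ['ABA', 'ABA']
```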




2014-09-05 16:48 GMT+02:00 Furcy Pin <furcy....@flaminem.com>:

> Hi all,
>
> I've just spent some time trying to understand how the 'regex' syntax
> works for matchpath.
> At first I thought it worked like a usual regex, but that was misleading:
> it doesn't. (Perhaps Aster NPath does.)
>
> The first thing it does (as I understand it) is collect the whole set of
> rows belonging to the group declared with DISTRIBUTE BY,
> sorted according to SORT BY.
> It then iterates over that set. I like to think of it as a string (e.g. "AABC").
>
> The UDF then tries to match each suffix of the string (e.g. "AABC",
> "ABC", "BC", "C") one by one, and returns one row for each match.
>
> The matching iterates over each symbol of the pattern and, for each of them,
> advances as far as it can in the string.
>
>
> Here are some examples to help people understand how it works.
>
> String  > Pattern = Matches
> "AAB"  > "A.B" = "AB"
> "AAB"  > "A+.B" = "AAB","AB"
> "BB"    > "B.A*.B" = "BB"
> "BAAB"  > "B.A*.B" = "BAAB"
>
> The next example is trickier: let's consider a symbol X that is
> always true:
> "ABABA"  > "A.X*.A" = "ABABA", "ABA"
> "ABABAB"  > "A.X*.A" = nothing
>
> To understand more precisely what happens, let's number the letters:
> "ABABAB"  > "A.X*.A"
> "123456"  > "7.8*.9"
>
> The algorithm will proceed as follows:
> Trying 123456:
> 1 (which is an A) is matched by symbol 7 (A)
> 2345 (BABA) is matched by symbol 8* (X*)
> 6 (B) is *not* matched by 9 (A)
> duh.
> Trying 23456:
> 2 (B) is not matched by symbol 7 (A)
> duh.
> Trying 3456:
> 3 (A) is matched by symbol 7 (A)
> 45 (BA) is matched by symbol 8* (X*)
> 6 (B) is *not* matched by 9 (A)
> duh.
> etc.
>
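The walk-through above can be modelled with a short Python sketch. It is a
toy model of the behaviour described in this thread, not a port of
MatchPath.java: rows are single characters, match_from and match_all are
names made up for the illustration, and the rule that a repeated symbol stops
before the last row is an assumption inferred from the traces above.

```python
def match_from(rows, start, pattern, symbols):
    """Return the end index (exclusive) of a match starting at `start`, or None."""
    pos = start
    for token in pattern.split("."):
        quant = token[-1] if token[-1] in "*+" else ""
        test = symbols[token.rstrip("*+")]
        if quant != "*":
            # Plain symbol, or the mandatory first row of a "+" symbol.
            if pos >= len(rows) or not test(rows[pos]):
                return None
            pos += 1
        if quant:
            # Greedy loop with no backtracking. Assumption taken from the
            # traces in this thread: it stops before the last row.
            while pos < len(rows) - 1 and test(rows[pos]):
                pos += 1
    return pos


def match_all(rows, pattern, symbols):
    """Try every suffix of `rows` and collect one result per match."""
    results = []
    for start in range(len(rows)):
        end = match_from(rows, start, pattern, symbols)
        if end is not None:
            results.append(rows[start:end])
    return results


symbols = {
    "A": lambda r: r == "A",
    "B": lambda r: r == "B",
    "X": lambda r: True,      # always true
    "Z": lambda r: r != "A",  # not A
}

print(match_all("AAB", "A+.B", symbols))       # ['AAB', 'AB']
print(match_all("ABABA", "A.X*.A", symbols))   # ['ABABA', 'ABA']
print(match_all("ABABAB", "A.X*.A", symbols))  # []
print(match_all("ABABAB", "A.Z*.A", symbols))  # ['ABA', 'ABA']
```

Under these assumptions the sketch gives the same results as the examples
listed earlier in this message.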
> So, if you want to match people with two events of type A with anything in
> between (which would be matched by the *classic* regex "A.*A"):
> - you should not use the pattern "A.A", because it looks for two
>   consecutive events;
> - you should not use the pattern "A.X*.A" with X matching anything (too
>   greedy);
> - you may use the pattern "A.Z*.A" with Z = not A, but matched against the
>   string "ABABAB" it will only match ABA (non-greedy match) and not ABABA.
>
> I'm still looking for the pattern that does the same thing as the classic
> regex "A.*A"
> (for instance if you want to measure the duration between the first and
> the last event of type A)
>
> I believe greedy matching requires an automaton to be done efficiently,
> which is why you can't greedy match correctly with the current MatchPath
> implementation.
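For comparison, classic greedy semantics can be obtained with backtracking:
a repeated symbol takes as many rows as it can, then gives rows back one at
a time until the rest of the pattern fits. The sketch below is not MatchPath
code; match_backtracking is a hypothetical helper and rows are single
characters. Backtracking like this can get expensive on long partitions,
which is the efficiency concern mentioned above.

```python
def match_backtracking(rows, start, tokens, symbols):
    """Return the end index (exclusive) of a greedy match from `start`, or None."""
    if not tokens:
        return start
    token, rest = tokens[0], tokens[1:]
    test = symbols[token.rstrip("*+")]
    if token[-1] not in "*+":
        # Plain symbol: must match exactly one row.
        if start < len(rows) and test(rows[start]):
            return match_backtracking(rows, start + 1, rest, symbols)
        return None
    # Find how far the repeated symbol can run.
    limit = start
    while limit < len(rows) and test(rows[limit]):
        limit += 1
    minimum = start + (1 if token[-1] == "+" else 0)
    # Try the longest run first, then give rows back one at a time.
    for pos in range(limit, minimum - 1, -1):
        end = match_backtracking(rows, pos, rest, symbols)
        if end is not None:
            return end
    return None


symbols = {"A": lambda r: r == "A", "X": lambda r: True}
end = match_backtracking("ABABAB", 0, "A.X*.A".split("."), symbols)
print("ABABAB"[:end])  # ABABA
```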
>
>
> Furcy
>
>
> 2014-09-04 2:32 GMT+02:00 Lefty Leverenz <leftylever...@gmail.com>:
>
>> MatchPath.java still exists in Hive trunk and release 0.13.1
>> (ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/MatchPath.java).
>>
>> -- Lefty
>>
>>
>> On Wed, Sep 3, 2014 at 12:39 PM, Muhammad Asif Abbasi <
>> asif.abb...@gmail.com> wrote:
>>
>>> Hi Furcy,
>>>
>>> Many thanks for your email :)
>>>
>>> My latest info was that the rename took place due to objections by
>>> Teradata, but didn't know if they had actually requested to take it off the
>>> distribution entirely.
>>>
>>> Does anybody else have an idea on the licensing aspect of this? What
>>> exactly has Teradata patented? Is it the technique to parse the rows in
>>> such a manner? Any tips/techniques would be highly appreciated.
>>>
>>> Regards,
>>> Asif Abbasi
>>>
>>>
>>>
>>>
>>> On Wed, Sep 3, 2014 at 5:30 PM, Furcy Pin <furcy....@flaminem.com>
>>> wrote:
>>>
>>>> Hi Muhammad,
>>>>
>>>> From what I googled a few months ago on the subject, the MatchPath UDF
>>>> has been removed from Cloudera and Hortonworks releases because Teradata
>>>> claims it violates one of their patents (apparently renaming it did not
>>>> suffice).
>>>>
>>>> I guess that if you really need it, it might be possible to add it
>>>> yourself as an external UDF since the code is still available out there,
>>>> but I have no idea whether Teradata would have the right to come after
>>>> you (or not?) if you do.
>>>>
>>>> By the way, if anyone has news on the current situation with MatchPath
>>>> and Teradata, that would be welcome.
>>>>
>>>> Furcy
>>>>
>>>>
>>>>
>>>>
>>>> 2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi <asif.abb...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Many thanks for sending these links. Looking forward to more
>>>>> documentation around this.
>>>>>
>>>>> BTW, why does "hive-exec-0.13.0.2.1.1.0-385.jar" not have any class
>>>>> files for the MatchPath UDF? Have they been moved out to a separate JAR
>>>>> file?
>>>>> I can see that "hive-exec-0.13.0.jar" has the appropriate class
>>>>> files, and have tried to use them. They work well with the demo data set
>>>>> but we certainly need more documentation around this.
>>>>>
>>>>> Regards,
>>>>> Asif Abbasi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz <
>>>>> leftylever...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for pointing out that we still need documentation for this in
>>>>>> the wiki.  (I've added a doc comment to HIVE-5087
>>>>>> <https://issues.apache.org/jira/browse/HIVE-5087>.)  In the
>>>>>> meantime, googling "Hive npath" turned up these sources of information:
>>>>>>
>>>>>>    - https://github.com/hbutani/SQLWindowing/wiki
>>>>>>    - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
>>>>>>      (slides 20-21)
>>>>>>    - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/
>>>>>>
>>>>>>
>>>>>> -- Lefty
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi <
>>>>>> asif.abb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am trying to use the MatchPath UDF (previously called NPath). Does
>>>>>>> anybody have a document on its syntax and usage?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Asif
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
