Re: MatchPath UDF usage info ?

2014-09-05 Thread Furcy Pin
A clarification:
"A.Z*.A" with Z = not A, matched against "ABABAB", will match ABA twice but
will not match ABABA.




Re: MatchPath UDF usage info ?

2014-09-05 Thread Furcy Pin
Hi all,

I've just spent some time trying to understand how the 'regex' syntax
works for matchpath.
At first I thought it worked like a usual regex, but that was misleading
because it doesn't. (Perhaps Aster nPath does.)

The first thing it does (as I understand it) is collect the whole set of
rows belonging to the group declared with DISTRIBUTE BY, sorted according
to SORT BY.
It then iterates over that set. I like to think of it as a string (e.g. "AABC").

The UDF will then try to match each suffix of the string (e.g. "AABC",
"ABC", "BC", "C") one by one, and return one row for each match.

The matching iterates over each symbol of the pattern and, for each of them,
advances as far as it can in the string.


Here are some examples to help people understand how it works.

String    > Pattern  = Matches
"AAB"     > "A.B"    = "AB"
"AAB"     > "A+.B"   = "AAB", "AB"
"BB"      > "B.A*.B" = "BB"
"BAAB"    > "B.A*.B" = "BAAB"

The next example is trickier: let's say X is a symbol that is always
true:
"ABABA"   > "A.X*.A" = "ABABA", "ABA"
"ABABAB"  > "A.X*.A" = nothing

To understand what happens in more detail, let's number the letters:
"ABABAB"  > "A.X*.A"
"123456"  > "7.8*.9"

The algorithm will proceed as follows:
Trying 123456:
1 (which is an A) is matched by symbol 7 (A)
2345 (BABA) is matched by symbol 8* (X*)
6 (B) is *not* matched by 9 (A)
duh.
Trying 23456:
2 (B) is not matched by symbol 7 (A)
duh.
Trying 3456:
3 (A) is matched by symbol 7 (A)
45 (BA) is matched by symbol 8* (X*)
6 (B) is *not* matched by 9 (A)
duh.
etc.
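
The behaviour traced above can be reproduced with a small Python sketch.
This is not the Hive implementation: it is a toy model inferred from the
examples in this message. It assumes that a starred or plus symbol consumes
as many rows as it can while leaving one row for each later symbol that
needs one, and that there is no backtracking. The names parse, match_at and
matchpath are made up for illustration.

```python
def parse(pattern):
    """Split 'A.X*.A' into [('A', ''), ('X', '*'), ('A', '')]."""
    components = []
    for token in pattern.split("."):
        if token[-1] in "*+":
            components.append((token[:-1], token[-1]))
        else:
            components.append((token, ""))
    return components


def match_at(rows, start, components, symbols):
    """Try to match the pattern against the suffix rows[start:]; no backtracking."""
    pos = start
    for k, (name, quantifier) in enumerate(components):
        test = symbols[name]
        if quantifier == "":
            # A plain symbol must match exactly one row.
            if pos >= len(rows) or not test(rows[pos]):
                return None
            pos += 1
            continue
        # Assumption: keep one row for each later symbol that must match a row.
        reserve = sum(1 for _, q in components[k + 1:] if q != "*")
        first = pos
        while pos < len(rows) - reserve and test(rows[pos]):
            pos += 1
        if quantifier == "+" and pos == first:
            return None
    return rows[start:pos]


def matchpath(rows, pattern, symbols):
    """Return one match per suffix of rows that the pattern accepts."""
    components = parse(pattern)
    matches = []
    for start in range(len(rows)):
        found = match_at(rows, start, components, symbols)
        if found is not None:
            matches.append(found)
    return matches


symbols = {
    "A": lambda row: row == "A",
    "B": lambda row: row == "B",
    "X": lambda row: True,        # always true
    "Z": lambda row: row != "A",  # not A
}

print(matchpath("AAB", "A+.B", symbols))       # ['AAB', 'AB']
print(matchpath("ABABA", "A.X*.A", symbols))   # ['ABABA', 'ABA']
print(matchpath("ABABAB", "A.X*.A", symbols))  # []
print(matchpath("ABABAB", "A.Z*.A", symbols))  # ['ABA', 'ABA']
```

Under these assumptions the sketch gives the same results as the example
table above, which suggests the "advance as far as possible, never go back"
reading is a reasonable mental model, though the real code may differ in
details.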

So, if you want to match people with two events of type A with anything in
between (which would be matched by the *classic* regex "A.*A"):
- you should not use the pattern "A.A", because it looks for two
  consecutive events;
- you should not use the pattern "A.X*.A" with X matching anything (too
  greedy);
- you may use the pattern "A.Z*.A" with Z = not A, but matched against the
  string "ABABAB" it will only match ABA (non-greedy match) and not ABABA.

I'm still looking for the pattern that does the same thing as the classic
regex "A.*A" (for instance, if you want to measure the duration between the
first and the last event of type A).
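
For comparison, the classic regex behaviour referred to here can be checked
with Python's standard re module. This is only a sketch of what "A.*A" does
on an ordinary string, using the same illustrative "ABABAB" with one letter
per event; it says nothing about how MatchPath itself could be made to do it.

```python
import re

events = "ABABAB"

# Greedy: '.*' takes as much as possible, then backtracks to the last 'A'.
greedy = re.search(r"A.*A", events)
# Non-greedy: '.*?' stops at the first 'A' that completes the match.
lazy = re.search(r"A.*?A", events)

print(greedy.group())                    # ABABA
print(lazy.group())                      # ABA
print(greedy.start(), greedy.end() - 1)  # 0 4
```

The last line gives the positions of the first and the last A, which is what
a "duration between first and last event of type A" measurement would need.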

I believe greedy matching requires an automaton to be done efficiently,
which is why you can't greedy match correctly with the current MatchPath
implementation.


Furcy


Re: MatchPath UDF usage info ?

2014-09-03 Thread Lefty Leverenz
MatchPath.java still exists in Hive trunk and release 0.13.1
(ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/MatchPath.java).

-- Lefty



Re: MatchPath UDF usage info ?

2014-09-03 Thread Muhammad Asif Abbasi
Hi Furcy,

Many thanks for your email :)

My latest info was that the rename took place due to objections by
Teradata, but I didn't know whether they had actually requested that it be
taken out of the distribution entirely.

Does anybody else have an idea on the licensing aspect of this? What
exactly has Teradata patented? Is it the technique of parsing the rows in
such a manner? Any tips/techniques would be highly appreciated.

Regards,
Asif Abbasi





Re: MatchPath UDF usage info ?

2014-09-03 Thread Furcy Pin
Hi Muhammad,

From what I googled a few months ago on the subject, the MatchPath UDF has
been removed from the Cloudera and Hortonworks releases because Teradata
claims it violates one of their patents (apparently renaming it did not
suffice).

I guess that if you really need it, it might be possible to add it yourself
as an external UDF, since the code is still available out there, but I have
no idea whether Teradata would have the right to come after you if you do.

By the way, if anyone has news on the current situation with MatchPath and
Teradata, that would be welcome.

Furcy





Re: MatchPath UDF usage info ?

2014-09-03 Thread Muhammad Asif Abbasi
Hi,

Many thanks for sending these links. Looking forward to more documentation
around this.

BTW, why does "hive-exec-0.13.0.2.1.1.0-385.jar" not have any class files
for the MatchPath UDF? Have they been moved out to a separate JAR file?
I can see that "hive-exec-0.13.0.jar" has the appropriate class files, and
I have tried to use them. They work well with the demo data set, but we
certainly need more documentation around this.

Regards,
Asif Abbasi





Re: MatchPath UDF usage info ?

2014-08-25 Thread Lefty Leverenz
Thanks for pointing out that we still need documentation for this in the
wiki.  (I've added a doc comment to HIVE-5087.)  In the meantime,
googling "Hive npath" turned up these sources of information:

   - https://github.com/hbutani/SQLWindowing/wiki
   - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
     (slides 20-21)
   - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/


-- Lefty



MatchPath UDF usage info ?

2014-08-25 Thread Muhammad Asif Abbasi
Hi All,

I am trying to use the MatchPath UDF (previously called NPath). Does anybody
have a document about its syntax and usage?

Regards,
Asif