[ 
https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892990#action_12892990
 ] 

Mayank Lahiri commented on HIVE-1481:
-------------------------------------

Nice catch. :)

> ngrams() UDAF for estimating top-k n-gram frequencies
> -----------------------------------------------------
>
>                 Key: HIVE-1481
>                 URL: https://issues.apache.org/jira/browse/HIVE-1481
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1481.1.patch
>
>
> [ngrams|http://en.wikipedia.org/wiki/N-gram] are fixed-length subsequences of 
> a longer sequence. This patch will add a new ngrams() UDAF to heuristically 
> estimate the top-k most frequent n-grams in a set of sequences.
> _Example_: *top bigrams in natural language text*
> Say you have a column with movie or product reviews from users expressed as 
> natural language strings. You want to find the top 10 most frequent word 
> pairs. First, pipe the text through the sentences() UDF in HIVE-1438, which 
> tokenizes natural language text into an array of sentences, where each 
> sentence is an array of words.
> SELECT sentences("I hated this movie. I hated watching it and this movie made 
> me unhappy.") FROM reviews;
> _gives_:
> [  ["I", "hated", "this", "movie"], ["I", "hated", "watching", "it", "and", 
> "this", "movie", "made", "me", "unhappy"] ]
> SELECT ngrams(sentences("I hated this movie. I hated watching it and this 
> movie made me unhappy."), 2, 5) FROM reviews;
> _gives the *5* most frequent *2-grams*_:
> [ { ngram: ["I", "hated"] , estfrequency: 2 },
>   { ngram: ["this", "movie"], estfrequency: 2},
>   { ngram: ["hated", "this"], estfrequency: 1},
>   { ngram: ["hated", "watching"], estfrequency: 1},
>   { ngram: ["made", "me"], estfrequency: 1} ]
> ngrams() can also be used for finding common sequences of URL accesses, for example, 
> or n-grams in any data that can be represented as sequences of strings. More 
> examples will be put up in a separate wiki page after this UDAF is fully 
> developed.
> The algorithm is a heuristic. For relatively small "k" values, in the range 
> of 10-1000, the heuristic appears to perform well, with frequency counts 
> coming within 5% of their true values, and always undercounting. Again, more 
> results will be posted on a separate wiki page.
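
For anyone who wants to sanity-check the example in the description, the exact bigram counts can be reproduced with a few lines of Python. This is only a minimal sketch of exact counting, not the heuristic in the patch; the function name exact_top_ngrams is made up for illustration, and it assumes n-grams are taken within each sentence and do not cross sentence boundaries.

```python
from collections import Counter

def exact_top_ngrams(sentences, n, k):
    """Count every n-gram exactly and return the k most frequent ones.

    Hypothetical helper for illustration only; this is NOT the heuristic
    used by the ngrams() UDAF, which estimates counts instead.
    """
    counts = Counter()
    for words in sentences:
        # Slide a window of n words over each sentence.
        # Assumption: n-grams do not span sentence boundaries.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

# The tokenized output of sentences() from the example in the description.
sentences = [
    ["I", "hated", "this", "movie"],
    ["I", "hated", "watching", "it", "and", "this", "movie", "made", "me", "unhappy"],
]

top = exact_top_ngrams(sentences, 2, 5)
for ngram, freq in top:
    print(list(ngram), freq)
```

Only ("I", "hated") and ("this", "movie") occur twice; every other bigram occurs once, so which three of them fill the remaining slots depends on how ties are broken and may differ from the list in the description.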
