ngrams() UDAF for estimating top-k n-gram frequencies
-----------------------------------------------------
Key: HIVE-1481
URL: https://issues.apache.org/jira/browse/HIVE-1481
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.7.0
Reporter: Mayank Lahiri
Assignee: Mayank Lahiri
Fix For: 0.7.0
[N-grams|http://en.wikipedia.org/wiki/N-gram] are fixed-length subsequences of a
longer sequence. This patch will add a new ngrams() UDAF to heuristically
estimate the top-k most frequent n-grams in a set of sequences.
_Example_: *top bigrams in natural language text*
Say you have a column with movie or product reviews from users expressed as
natural language strings. You want to find the top 10 most frequent word pairs.
First, pipe the text through the sentences() UDF from HIVE-1438, which tokenizes
natural language text into an array of sentences, where each sentence is an
array of words.
SELECT sentences("I hated this movie. I hated watching it and this movie made
me unhappy.") FROM reviews;
_gives_:
[ ["I", "hated", "this", "movie"], ["I", "hated", "watching", "it", "and",
"this", "movie", "made", "me", "unhappy"] ]
SELECT ngrams(sentences("I hated this movie. I hated watching it and this movie
made me unhappy."), 2, 5) FROM reviews;
_gives the *5* most frequent *2-grams*_:
[ { ngram: ["I", "hated"], estfrequency: 2 },
{ ngram: ["this", "movie"], estfrequency: 2 },
{ ngram: ["hated", "this"], estfrequency: 1 },
{ ngram: ["hated", "watching"], estfrequency: 1 },
{ ngram: ["made", "me"], estfrequency: 1 } ]
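For comparison, the exact counts that ngrams() is estimating in the example above can be computed by brute force. The following is only a minimal reference sketch in Python, not the UDAF's algorithm; the function name top_k_ngrams is hypothetical, and it assumes that n-grams never span sentence boundaries.

```python
from collections import Counter

def top_k_ngrams(sentences, n, k):
    """Exact top-k n-gram counts (hypothetical helper, not the UDAF's heuristic).

    Assumes n-grams are taken within each sentence and never span two sentences.
    """
    counts = Counter()
    for words in sentences:
        # A sentence shorter than n yields an empty range, so no n-grams.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

# The tokenized output of sentences() from the example above.
sentences = [
    ["I", "hated", "this", "movie"],
    ["I", "hated", "watching", "it", "and", "this", "movie", "made", "me", "unhappy"],
]

top5 = top_k_ngrams(sentences, 2, 5)
for ngram, freq in top5:
    print(list(ngram), freq)
```

Only ["I", "hated"] and ["this", "movie"] occur twice; every other bigram occurs once, so which of them fill the remaining three slots depends on how ties are broken.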
ngrams() can also be used to find common sequences of URL accesses, for example,
or n-grams in any data that can be represented as sequences of strings. More
examples will be posted on a separate wiki page once this UDAF is fully
developed.
The algorithm is a heuristic. For relatively small values of "k", in the range
of 10-1000, it appears to perform well: estimated frequency counts come within
5% of their true values, and the heuristic always undercounts. Again, more
results will be posted on a separate wiki page.
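One way to read the accuracy claim is as a bound on the relative undercount, (true - estimated) / true, which should fall between 0 and 0.05 for every reported n-gram. The Python sketch below checks that property; the helper names and the sample counts are hypothetical and for illustration only, not measured results.

```python
def undercount_error(true_count, est_count):
    """Relative undercount: 0.0 means exact, 0.05 means 5% below the true value."""
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    return (true_count - est_count) / true_count

def within_claim(true_counts, est_counts, tolerance=0.05):
    """True if every estimate is <= its true count and within `tolerance` of it."""
    return all(
        0.0 <= undercount_error(true_counts[gram], est) <= tolerance
        for gram, est in est_counts.items()
    )

# Hypothetical counts for illustration only, not measured results.
true_counts = {("this", "movie"): 200, ("I", "hated"): 100}
est_counts = {("this", "movie"): 194, ("I", "hated"): 100}

print(within_claim(true_counts, est_counts))
```

An estimate above the true count makes the relative undercount negative, so the check fails for overcounts as well as for undercounts larger than the tolerance.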