[ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892990#action_12892990 ]
Mayank Lahiri commented on HIVE-1481:
-------------------------------------

Nice catch. :)

> ngrams() UDAF for estimating top-k n-gram frequencies
> -----------------------------------------------------
>
>                 Key: HIVE-1481
>                 URL: https://issues.apache.org/jira/browse/HIVE-1481
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1481.1.patch
>
>
> [ngrams|http://en.wikipedia.org/wiki/N-gram] are fixed-length subsequences of
> a longer sequence. This patch will add a new ngrams() UDAF to heuristically
> estimate the top-k most frequent n-grams in a set of sequences.
> _Example_: *top bigrams in natural language text*
> Say you have a column with movie or product reviews from users expressed as
> natural language strings. You want to find the top 10 most frequent word
> pairs. First, pipe the text through the sentences() UDF in HIVE-1438, which
> tokenizes natural language text into an array of sentences, where each
> sentence is an array of words.
> SELECT sentences("I hated this movie. I hated watching it and this movie made
> me unhappy.") FROM reviews;
> _gives_:
> [ ["I", "hated", "this", "movie"], ["I", "hated", "watching", "it", "and",
> "this", "movie", "made", "me", "unhappy"] ]
> SELECT ngrams(sentences("I hated this movie. I hated watching it and this
> movie made me unhappy."), 2, 5) FROM reviews;
> _gives the *5* most frequent *2-grams*_:
> [ { ngram: ["I", "hated"], estfrequency: 2 },
> { ngram: ["this", "movie"], estfrequency: 2 },
> { ngram: ["hated", "this"], estfrequency: 1 },
> { ngram: ["hated", "watching"], estfrequency: 1 },
> { ngram: ["made", "me"], estfrequency: 1 } ]
> It can also be used for finding common sequences of URL accesses, for example,
> or n-grams in any data that can be represented as sequences of strings.
> More examples will be put up in a separate wiki page after this UDAF is fully
> developed.
> The algorithm is a heuristic. For relatively small "k" values, in the range
> of 10-1000, the heuristic appears to perform well, with estimated frequency
> counts coming within 5% of their true values and erring only on the side of
> undercounting. Again, more results will be posted on a separate wiki page.
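
For comparison with the estimates described above, here is a minimal sketch of the exact count that ngrams() approximates, written in Python rather than HiveQL. It is not the heuristic from the patch, which this message does not describe; it simply counts every n-gram in memory. The names exact_top_ngrams and relative_undercount are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter


def exact_top_ngrams(sentences, n, k):
    """Exactly count n-grams within each sentence and return the k most frequent.

    `sentences` is a list of word lists, the same shape as the output of
    sentences() in the example above. This is an exact reference count,
    not the heuristic used by the ngrams() UDAF.
    """
    counts = Counter()
    for words in sentences:
        # Slide a window of length n; n-grams never span sentence boundaries.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    # most_common sorts by count, descending; ties keep first-seen order.
    return counts.most_common(k)


def relative_undercount(estimated, exact):
    """Fraction by which an estimate falls short of the exact count."""
    return (exact - estimated) / exact


sentences = [
    ["I", "hated", "this", "movie"],
    ["I", "hated", "watching", "it", "and", "this", "movie", "made", "me", "unhappy"],
]
top5 = exact_top_ngrams(sentences, 2, 5)
# The first two entries are ("I", "hated") and ("this", "movie"), each with count 2.
```

The remaining three entries all have count 1, and which of the tied bigrams fill those slots depends on tie-breaking, so they need not match the list in the example above. Under the accuracy claim in the description, relative_undercount(estfrequency, exact) would be expected to fall between 0 and 0.05 for k in the range 10-1000.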