[ https://issues.apache.org/jira/browse/HIVEMALL-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takuya Kitazawa closed HIVEMALL-146.
------------------------------------
    Resolution: Done

> Implement yet another UDF to generate n-grams from a list of words
> ------------------------------------------------------------------
>
>                 Key: HIVEMALL-146
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-146
>             Project: Hivemall
>          Issue Type: New Feature
>            Reporter: Takuya Kitazawa
>            Assignee: Takuya Kitazawa
>
> Hive has an {{ngrams()}} function to obtain n-grams from a list of words:
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-ngrams()andcontext_ngrams():N-gramfrequencyestimation
> While the existing function returns an "estimated" top-k list of frequent
> n-grams, NLP applications sometimes need an "exact" list of n-grams that
> includes all of the 1-, 2-, ..., n-grams. For example, for the input
> \["machine", "learning"\], we might expect to get the following result:
> \["machine", "learning", "machine learning"\].
> Hence, this ticket requests the implementation of another UDF similar to
> {{ngrams()}}. The implementation could be similar to {{getNgrams()}} in the
> Stanford CoreNLP library:
> https://github.com/stanfordnlp/CoreNLP/blob/d6318a0cb06dba635550477bc843952cc5a5f868/src/edu/stanford/nlp/util/StringUtils.java#L2132-L2142

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
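
For illustration only, the "exact" n-gram listing requested in the ticket could be sketched in plain Java as below. This is not the UDF that was committed for HIVEMALL-146, and it is not the CoreNLP code: the class name NgramSketch, the method name allNgrams, and the minSize/maxSize parameters are hypothetical names chosen for this sketch. It assumes the n-grams are joined with a single space and listed shortest first, which matches the ordering of the example in the ticket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramSketch {

    // Hypothetical helper (not Hivemall's API): returns every n-gram whose
    // size is between minSize and maxSize inclusive. Shorter n-grams come
    // first; the words of each n-gram are joined by a single space.
    public static List<String> allNgrams(List<String> words, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("require 1 <= minSize <= maxSize");
        }
        List<String> result = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            // Slide a window of n words over the input list.
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("machine", "learning");
        // Prints: [machine, learning, machine learning]
        System.out.println(allNgrams(words, 1, 2));
    }
}
```

Unlike Hive's ngrams(), this sketch does no frequency estimation and returns every n-gram exactly once per position, so the output grows with both the input length and maxSize.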