[jira] Created: (HIVE-1438) sentences() UDF for natural language tokenization

Mayank Lahiri (JIRA) Fri, 25 Jun 2010 12:55:13 -0700

sentences() UDF for natural language tokenization
-------------------------------------------------


                 Key: HIVE-1438
                 URL: https://issues.apache.org/jira/browse/HIVE-1438
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
    Affects Versions: 0.7.0
            Reporter: Mayank Lahiri
            Assignee: Mayank Lahiri
             Fix For: 0.7.0


Create a generic UDF that tokenizes free-form natural language text into 
sentences and words for more advanced processing, while stripping unnecessary 
punctuation and being fully locale-aware. Fortunately, most of this 
functionality is already built into Java in the form of the i18n BreakIterator 
class (java.text.BreakIterator), so this UDF will just connect it to Hive. For 
example:

> SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
[ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

or

> SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
[["Je","m'apelle","hive"]]

Notice that punctuation is kept only where it is part of a word (the 
apostrophe in "m'apelle") and dropped everywhere else. Breaking at sentences 
(and thus the nested array return type) is important for tasks like counting 
the frequency of n-grams in text, which should not cross sentence boundaries.
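
To make the idea concrete, here is a minimal standalone sketch of the 
tokenization using java.text.BreakIterator, followed by a bigram count that 
stays inside sentence boundaries. This is not the UDF itself: the class and 
method names (SentencesSketch, sentences, bigramCounts) are hypothetical, the 
Hive GenericUDF wiring is left out, and the rule "drop any token without a 
letter or digit" is an assumption about how punctuation might be stripped.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the actual Hive UDF implementation.
class SentencesSketch {

    // True if the token contains at least one letter or digit.
    static boolean hasLetterOrDigit(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    // Split text into sentences, then each sentence into words. Tokens made
    // up only of punctuation or whitespace are dropped (assumed rule).
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentIt = BreakIterator.getSentenceInstance(locale);
        sentIt.setText(text);
        int sStart = sentIt.first();
        for (int sEnd = sentIt.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentIt.next()) {
            String sentence = text.substring(sStart, sEnd);
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                if (hasLetterOrDigit(token)) {
                    words.add(token);
                }
            }
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Count bigrams per sentence, so no bigram spans a sentence boundary.
    static Map<String, Integer> bigramCounts(List<List<String>> sents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> words : sents) {
            for (int i = 0; i + 1 < words.size(); i++) {
                String bigram = words.get(i) + " " + words.get(i + 1);
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> s =
            sentences("Hello there! This is a UDF.", Locale.ENGLISH);
        System.out.println(s); // prints [[Hello, there], [This, is, a, UDF]]
        System.out.println(bigramCounts(s));
    }
}
```

Because the words are grouped by sentence, the bigram "there This" is never 
counted, which is the reason for the nested array return type.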

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
