[jira] Commented: (HIVE-1438) sentences() UDF for natural language tokenization

John Sichi (JIRA) Mon, 12 Jul 2010 15:45:17 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887573#action_12887573
 ]


John Sichi commented on HIVE-1438:
----------------------------------

+1.  Will commit if tests pass.

(Some optimization for the case where the locale is constant would be nice, but 
we can leave that for a followup.)


> sentences() UDF for natural language tokenization
> -------------------------------------------------
>
>                 Key: HIVE-1438
>                 URL: https://issues.apache.org/jira/browse/HIVE-1438
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1438.1.patch, HIVE-1438.2.patch
>
>
> Create a generic UDF that tokenizes free-form natural language text into 
> sentences and words for more advanced processing, while stripping unnecessary 
> punctuation and being fully internationalization-aware. Fortunately, most of
> this functionality is already built into Java in the form of the i18n
> BreakIterator class, so this UDF will just connect it to Hive. For example:
> > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
> [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]
> or
> > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
> [["Je","m'apelle","hive"]]
> Notice how punctuation is maintained only where appropriate. Breaking at 
> sentences (and thus the nested array return type) is important for tasks like 
> counting the frequency of n-grams in text, which should not cross sentence 
> boundaries.
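
For readers who want to see the idea concretely, here is a minimal sketch of the two-level BreakIterator approach described above. It is not the code from the attached patches: the class name SentencesSketch, the method names, and the rule for dropping punctuation (keep only tokens that contain a letter or digit) are assumptions made for illustration, and the Hive GenericUDF wiring is left out entirely.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone sketch; not the HIVE-1438 implementation.
final class SentencesSketch {

    // Split text into sentences, then split each sentence into words.
    // Tokens without any letter or digit (whitespace, punctuation) are dropped.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);

        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
             sStart = sEnd, sEnd = sentenceIt.next()) {
            String sentence = text.substring(sStart, sEnd);
            List<String> words = new ArrayList<>();

            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                 wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                if (containsLetterOrDigit(token)) {
                    words.add(token);
                }
            }
            // Skip "sentences" that held only punctuation or whitespace.
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Assumed filtering rule: a token counts as a word if any code point
    // in it is a letter or a digit.
    static boolean containsLetterOrDigit(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

Because the result is nested per sentence, an n-gram counter can slide its window over each inner list independently, so no n-gram spans a sentence boundary. The sketch creates both iterators on every call; when the locale is constant they could be created once and reused.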

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
