[ https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861584#comment-16861584 ]
Tim Allison commented on OPENNLP-1261:
--------------------------------------

Looks like an improvement all around. Performance doesn't tank as much with crazily large chunks of text, though it still degrades a bit. I wonder if this is caused by saturation: some features are boosted to and then flatline at 1.0, or the opposite, with that many observations? Overall, though, this appears to be faster, more accurate on very short texts, and much more accurate on noisy text.

Key parts of reports: "Accuracy Across Languages -- Detector/Noise/Length"

+1

> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1261
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Major
>        Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect component fails to detect the language correctly, even though the text is only written in one language.
>
> This issue was tracked down to the context generator, where the counts of the ngrams are ignored.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
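To illustrate the root cause described above, here is a minimal sketch (not the actual OpenNLP context-generator API; class and method names are hypothetical) of the difference between collapsing ngrams into a set, which discards how often each one occurs, and keeping per-ngram counts so that long texts still yield proportional feature weights:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a character-ngram context generator that keeps
// occurrence counts rather than collapsing ngrams into a set.
public class NGramContextSketch {

    // Count every character ngram of the given size in the text.
    public static Map<String, Integer> countedContext(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            // merge() increments the count, so repeated ngrams are not lost
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> ctx = countedContext("aaab", 2);
        // "aa" occurs twice; a set-based generator would record it once,
        // which is the information loss the issue describes
        System.out.println(ctx);
    }
}
```

On a 100k-char input, a set-based generator quickly saturates (nearly every common ngram is present exactly once as a feature), which matches the flatlining behavior speculated about in the comment.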