[ https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861584#comment-16861584 ]
Tim Allison commented on OPENNLP-1261:
--------------------------------------

Looks like an improvement all around. Performance doesn't tank as much with crazily large chunks of text, though it still degrades a bit. I wonder if this is caused by saturation: some features are boosted to and then flatline at 1.0, or the opposite, with that many observations? Overall, though, this appears to be faster, more accurate on very short texts, and much more accurate on noisy text.

Key parts of reports: "Accuracy Across Languages -- Detector/Noise/Length"

+1

> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1261
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Major
>        Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect component fails to detect the language correctly, even though the text is only written in one language.
>
> This issue was tracked down to the context generator, where the counts of the ngrams are ignored.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
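To illustrate the root cause described above, here is a minimal sketch (not the actual OpenNLP context-generator API; class and method names are hypothetical) of the difference between collapsing ngrams into a set, which discards how often each one occurs, and keeping per-ngram counts so that long texts still yield proportional feature weights:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a character-ngram context generator that keeps
// occurrence counts rather than collapsing ngrams into a set.
public class NGramContextSketch {

    // Count every character ngram of the given size in the text.
    public static Map<String, Integer> countedContext(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            // merge() increments the count, so repeated ngrams are not lost
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> ctx = countedContext("aaab", 2);
        // "aa" occurs twice; a set-based generator would record it once,
        // which is the information loss the issue describes
        System.out.println(ctx);
    }
}
```

On a 100k-char input, a set-based generator quickly saturates (nearly every common ngram is present exactly once as a feature), which matches the flatlining behavior speculated about in the comment.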