[ https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862168#comment-16862168 ]
Tim Allison commented on OPENNLP-1261:
--------------------------------------

I don't know if you want to include this in 1261, but I experimented with a few potential improvements to [~joern]'s 1261 branch. I've started a 1261-sandbox branch on my fork. The modifications are simply proof of concept. I'm not sure how/if you want to change your API to include these potential areas for improvement.

a) baseline 1261
||Length||MS||Accuracy||
|10|113|0.67|
|20|109|0.86|
|30|115|0.86|
|40|88|0.94|
|50|92|0.95|
|100|175|0.97|
|150|229|0.97|
|200|351|0.97|
|500|750|0.98|
|1000|1558|1.00|
|5000|7649|1.00|
|10000|14839|1.00|
|20000|27709|1.00|

b) lowercase once, and lowercase codepoints (not chars) [849cd27|https://github.com/tballison/opennlp/commit/849cd2746d21c0de7e9889d0e94c03e59e11cd6a]
||Length||MS||Accuracy||
|10|109|0.67|
|20|120|0.86|
|30|104|0.86|
|40|90|0.94|
|50|85|0.95|
|100|153|0.97|
|150|243|0.97|
|200|274|0.97|
|500|612|0.98|
|1000|1333|1.00|
|5000|6186|1.00|
|10000|12467|1.00|
|20000|24448|1.00|

c) b _and_ Map<String, Integer>
||Length||MS||Accuracy||
|10|135|0.67|
|20|145|0.86|
|30|109|0.86|
|40|98|0.94|
|50|123|0.95|
|100|152|0.97|
|150|213|0.97|
|200|268|0.97|
|500|631|0.98|
|1000|1151|1.00|
|5000|5080|1.00|
|10000|9421|1.00|
|20000|18591|1.00|

d) c but with Map<String, MutableInt> [3695c12|https://github.com/tballison/opennlp/commit/3695c12d6e222376a37b6f2ac1c31a0b717d6b88]
||Length||MS||Accuracy||
|10|139|0.67|
|20|115|0.86|
|30|98|0.86|
|40|95|0.94|
|50|94|0.95|
|100|144|0.97|
|150|209|0.97|
|200|266|0.97|
|500|620|0.98|
|1000|1161|1.00|
|5000|4762|1.00|
|10000|8608|1.00|
|20000|16117|1.00|

> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1261
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Major
>         Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect component fails to detect the language correctly, even though the text is only written in one language.
> This issue was tracked down to the context generator, where the count of the ngrams are ignored.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
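
For readers who want a concrete picture of variants (b) and (d) in the comment above, here is a minimal sketch. It is not the code from the linked commits: the class name `NgramCountSketch`, the nested `MutableInt`, and the methods `lowerCaseCodePoints` and `countNgrams` are all hypothetical names chosen for illustration. It assumes the idea is to lowercase the whole input once, by code point rather than by char, and then to count n-grams in a map whose values are mutated in place instead of being re-boxed as `Integer` on every increment.

```java
import java.util.HashMap;
import java.util.Map;

class NgramCountSketch {

    /** Hypothetical mutable counter: incremented in place, so no Integer boxing per update. */
    static final class MutableInt {
        int value;
    }

    /** Lowercases the input once, code point by code point, so surrogate pairs stay intact. */
    static int[] lowerCaseCodePoints(CharSequence text) {
        return text.codePoints().map(Character::toLowerCase).toArray();
    }

    /** Counts every n-gram (measured in code points) of the lowercased text. */
    static Map<String, MutableInt> countNgrams(CharSequence text, int n) {
        int[] cps = lowerCaseCodePoints(text);
        Map<String, MutableInt> counts = new HashMap<>();
        for (int i = 0; i + n <= cps.length; i++) {
            // Build the n-gram from code points, not chars.
            String gram = new String(cps, i, n);
            counts.computeIfAbsent(gram, k -> new MutableInt()).value++;
        }
        return counts;
    }
}
```

Calling `NgramCountSketch.countNgrams("ABab", 2)` would, under these assumptions, lowercase the text to "abab" and record a count of 2 for "ab" and 1 for "ba". Whether the real context generator should expose counts in this shape is exactly the open API question raised in the comment.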