[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836719#comment-16836719 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

In addition to speed and error rate on different lengths of text, I'd also like to see how sensitive the confidence scores are to noised text.

I took opennlp's subset of the leipzig corpus and randomly selected text of char lengths: 50, 100, 200, 500, 1000, 10000 and 100000. For each text, I also randomly added noise at a rate of 5%, 10%, 20%, 30%, 50% and 90% -- single character random selection from codepoints 0-1,000,000. I added a pseudo language "num" that is composed solely of (Arabic) numbers, spaces and commas.

I slightly modified yalder to allow simpler loading of all models (core and extras) -- see my {{load_all_langs}} branch.

As [~kkrugler] observed, yalder is much, much faster than opennlp:

||Detector||Length||Millis||Avg(ms)||Stdev||
|YalderDetector|50|1046|1.22|0.55|
|YalderDetector|100|1057|1.24|0.48|
|YalderDetector|200|1166|1.37|0.66|
|YalderDetector|500|1070|1.25|0.52|
|YalderDetector|1000|1123|1.31|0.53|
|YalderDetector|10000|1184|1.39|0.52|
|YalderDetector|100000|2495|2.92|3.12|
|OptimaizeLangDetector|50|1039|1.22|0.52|
|OptimaizeLangDetector|100|1054|1.23|0.5|
|OptimaizeLangDetector|200|1085|1.27|0.51|
|OptimaizeLangDetector|500|1142|1.34|0.54|
|OptimaizeLangDetector|1000|1202|1.41|0.57|
|OptimaizeLangDetector|10000|1983|2.32|0.82|
|OptimaizeLangDetector|100000|10465|12.25|9.09|
|OpenNLPLangDetector|50|1019|1.19|0.41|
|OpenNLPLangDetector|100|1193|1.4|0.51|
|OpenNLPLangDetector|200|1400|1.64|0.54|
|OpenNLPLangDetector|500|1968|2.3|0.63|
|OpenNLPLangDetector|1000|2992|3.5|1.14|
|OpenNLPLangDetector|10000|15450|18.09|12.47|
|OpenNLPLangDetector|100000|108240|126.74|52.4|

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>

--
This message was sent by
Atlassian JIRA (v7.6.3#76005)
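
The sampling and noising steps described in the comment above could be sketched roughly as follows, in Python. The comment does not include the code that was actually used, so several details here are assumptions: noise replaces characters rather than inserting extra ones, each character is noised independently at the given rate, and surrogate codepoints are skipped so the result stays encodable as UTF-8. The names {{sample_snippet}}, {{add_noise}} and {{random_codepoint}} are hypothetical.

```python
import random

# Values taken from the comment above.
LENGTHS = [50, 100, 200, 500, 1000, 10000, 100000]
NOISE_RATES = [0.05, 0.10, 0.20, 0.30, 0.50, 0.90]
MAX_CODEPOINT = 1_000_000  # upper bound of the noise codepoint range

# Assumption: lone surrogates are excluded so the noised text is valid Unicode.
SURROGATE_START, SURROGATE_END = 0xD800, 0xDFFF


def random_codepoint(rng):
    """Return one random character from codepoints 0..MAX_CODEPOINT, skipping surrogates."""
    while True:
        cp = rng.randint(0, MAX_CODEPOINT)
        if not (SURROGATE_START <= cp <= SURROGATE_END):
            return chr(cp)


def sample_snippet(text, length, rng):
    """Pick a random window of `length` characters; return the whole text if it is shorter."""
    if len(text) <= length:
        return text
    start = rng.randrange(len(text) - length + 1)
    return text[start:start + length]


def add_noise(text, rate, rng):
    """Replace each character with a random codepoint with probability `rate` (assumed scheme)."""
    return "".join(
        random_codepoint(rng) if rng.random() < rate else ch
        for ch in text
    )
```

A run over the grid would then call {{add_noise(sample_snippet(text, n, rng), r, rng)}} for each {{n}} in {{LENGTHS}} and each {{r}} in {{NOISE_RATES}}, and pass the result to each detector. Seeding {{random.Random}} keeps the noised inputs identical across detectors.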