[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836719#comment-16836719
 ] 

Tim Allison commented on TIKA-2790:
-----------------------------------

In addition to speed and error rate on different lengths of text, I'd also like 
to see how sensitive the confidence scores are to noisy text.

I took OpenNLP's subset of the Leipzig corpus and randomly selected texts with 
character lengths of 50, 100, 200, 500, 1000, 10000 and 100000.

For each text, I also randomly added noise at rates of 5%, 10%, 20%, 30%, 50% 
and 90%; each noise character is a single codepoint selected at random from 
the range 0-1,000,000.
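
For readers who want to reproduce the noising step, here is a minimal sketch of one way it could be done. It is not the code used for this comment: the class and method names ({{NoiseInjector}}, {{addNoise}}) are hypothetical, and it assumes that noise replaces existing characters rather than being inserted, and that surrogate codepoints are skipped so the result is always a valid string.

```java
import java.util.Random;

// Hypothetical helper, not part of Tika or tika-eval.
class NoiseInjector {
    // Exclusive upper bound for noise codepoints (the comment says 0-1,000,000).
    static final int MAX_CODEPOINT = 1_000_000;

    // Replaces each codepoint of text with a random one with probability rate.
    // Assumption: noise is substituted in place, so the codepoint count is unchanged.
    static String addNoise(String text, double rate, Random rnd) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (rnd.nextDouble() < rate) {
                cp = randomCodePoint(rnd);
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    // Draws a codepoint in [0, MAX_CODEPOINT), skipping the surrogate range
    // (an assumption made here so that the output stays well-formed UTF-16).
    static int randomCodePoint(Random rnd) {
        int cp;
        do {
            cp = rnd.nextInt(MAX_CODEPOINT);
        } while (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
        return cp;
    }
}
```

With this sketch, {{NoiseInjector.addNoise(text, 0.10, new Random(42))}} would noise roughly 10% of the codepoints in {{text}}; a fixed seed keeps runs repeatable across detectors.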

I added a pseudo language "num" that is composed solely of (Arabic) numbers, 
spaces and commas.

I slightly modified yalder to allow simpler loading of all models (core and 
extras) -- see my {{load_all_langs}} branch.

As [~kkrugler] observed, yalder is much, much faster than opennlp:

||Detector||Length (chars)||Total (ms)||Avg (ms)||Stdev (ms)||
|YalderDetector|50|1046|1.22|0.55|
|YalderDetector|100|1057|1.24|0.48|
|YalderDetector|200|1166|1.37|0.66|
|YalderDetector|500|1070|1.25|0.52|
|YalderDetector|1000|1123|1.31|0.53|
|YalderDetector|10000|1184|1.39|0.52|
|YalderDetector|100000|2495|2.92|3.12|
|OptimaizeLangDetector|50|1039|1.22|0.52|
|OptimaizeLangDetector|100|1054|1.23|0.5|
|OptimaizeLangDetector|200|1085|1.27|0.51|
|OptimaizeLangDetector|500|1142|1.34|0.54|
|OptimaizeLangDetector|1000|1202|1.41|0.57|
|OptimaizeLangDetector|10000|1983|2.32|0.82|
|OptimaizeLangDetector|100000|10465|12.25|9.09|
|OpenNLPLangDetector|50|1019|1.19|0.41|
|OpenNLPLangDetector|100|1193|1.4|0.51|
|OpenNLPLangDetector|200|1400|1.64|0.54|
|OpenNLPLangDetector|500|1968|2.3|0.63|
|OpenNLPLangDetector|1000|2992|3.5|1.14|
|OpenNLPLangDetector|10000|15450|18.09|12.47|
|OpenNLPLangDetector|100000|108240|126.74|52.4|
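
As a rough illustration of how the total, average and standard deviation columns could be derived from per-text timings, here is a small sketch. It is not the harness used for these numbers: {{TimingStats}} is a hypothetical class, the detector is stood in for by a plain {{Consumer<String>}} rather than any real Tika interface, and the sample (n-1) standard deviation is an assumption.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for summarizing per-text detection times.
class TimingStats {
    final double totalMs;
    final double avgMs;
    final double stdevMs;

    TimingStats(double[] elapsedMs) {
        double sum = 0;
        for (double v : elapsedMs) {
            sum += v;
        }
        double mean = elapsedMs.length == 0 ? 0 : sum / elapsedMs.length;
        double squares = 0;
        for (double v : elapsedMs) {
            squares += (v - mean) * (v - mean);
        }
        totalMs = sum;
        avgMs = mean;
        // Sample standard deviation (n - 1); an assumption, not confirmed by the comment.
        stdevMs = elapsedMs.length > 1 ? Math.sqrt(squares / (elapsedMs.length - 1)) : 0;
    }

    // Times one detector call per text; the Consumer is a placeholder for a real detector.
    static TimingStats time(Consumer<String> detector, List<String> texts) {
        double[] elapsed = new double[texts.size()];
        for (int i = 0; i < texts.size(); i++) {
            long start = System.nanoTime();
            detector.accept(texts.get(i));
            elapsed[i] = (System.nanoTime() - start) / 1e6; // nanoseconds to milliseconds
        }
        return new TimingStats(elapsed);
    }
}
```

A real comparison would also want a warm-up pass before timing, since JIT compilation and model loading can dominate the first few calls.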

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
