[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858666#comment-16858666
 ] 

Tim Allison commented on TIKA-2790:
-----------------------------------

I did some more digging and ran some experiments with Yalder's early stopping 
feature turned off.  To save time, I ran these only against the 'FRA' documents 
in the Leipzig corpus.  The results are in the table below.

I also instrumented Yalder to report how many ngrams/ngrammable characters it 
had processed when it stopped early.  On average (across all languages), it 
read only the first 60 ngrammable characters.
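
For anyone unfamiliar with early stopping, here is a minimal sketch of the general idea. This is not Yalder's code: the function name `detect`, the toy profiles, and the `min_ngrams`/`margin` parameters are all made up for illustration. It scores ngrams one at a time and quits as soon as one language leads the runner-up by a fixed margin, returning how many ngrams it actually read.

```python
# Toy ngram language detector with optional early stopping.
# Hypothetical illustration only; not Yalder's algorithm or API.

# Toy profiles (assumed data): language -> {trigram: weight}
PROFILES = {
    "fra": {"le ": 1.0, " de": 1.0, "es ": 1.0},
    "eng": {"the": 1.0, " th": 1.0, "he ": 1.0},
}


def detect(text, profiles, n=3, min_ngrams=20, margin=5.0, early_stop=True):
    """Return (best_language, ngrams_processed)."""
    scores = {lang: 0.0 for lang in profiles}
    processed = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        for lang, profile in profiles.items():
            scores[lang] += profile.get(gram, 0.0)
        processed += 1
        # Stop once enough ngrams are seen and the leader is clearly ahead.
        if early_stop and processed >= min_ngrams:
            ranked = sorted(scores.values(), reverse=True)
            runner_up = ranked[1] if len(ranked) > 1 else 0.0
            if ranked[0] - runner_up >= margin:
                break
    best = max(scores, key=scores.get)
    return best, processed


if __name__ == "__main__":
    sample = "le chat de la maison " * 50
    print(detect(sample, PROFILES, early_stop=True))
    print(detect(sample, PROFILES, early_stop=False))
```

With early stopping on, the work done depends on how quickly a leader emerges rather than on the input length; with it off, every ngram in the input is scored.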

||Tool||Length||Total Time (ms)||Tool||Length||Total Time (ms)||
|Yalder|50|95|YalderNoStop|50|95|
|Yalder|100|44|YalderNoStop|100|79|
|Yalder|200|34|YalderNoStop|200|132|
|Yalder|500|32|YalderNoStop|500|287|
|Yalder|1,000|54|YalderNoStop|1,000|528|
|Yalder|10,000|40|YalderNoStop|10,000|5,013|
|Yalder|100,000|60|YalderNoStop|100,000|47,893|
|Optimaize|50|28|Optimaize|50|29|
|Optimaize|100|22|Optimaize|100|23|
|Optimaize|200|25|Optimaize|200|25|
|Optimaize|500|36|Optimaize|500|27|
|Optimaize|1,000|43|Optimaize|1,000|52|
|Optimaize|10,000|189|Optimaize|10,000|240|
|Optimaize|100,000|1,522|Optimaize|100,000|1,322|
|OpenNLP|50|86|OpenNLP|50|64|
|OpenNLP|100|31|OpenNLP|100|33|
|OpenNLP|200|93|OpenNLP|200|54|
|OpenNLP|500|108|OpenNLP|500|110|
|OpenNLP|1,000|192|OpenNLP|1,000|206|
|OpenNLP|10,000|1,567|OpenNLP|10,000|1,651|
|OpenNLP|100,000|14,696|OpenNLP|100,000|15,585|
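
To make the Yalder vs. YalderNoStop comparison easier to read, the slowdown factor per length can be computed directly from the rows above. This is a small sketch; the names `ROWS` and `slowdown` are just for illustration, and the numbers are copied from the table.

```python
# (length, Yalder total ms, YalderNoStop total ms), copied from the table
ROWS = [
    (50, 95, 95),
    (100, 44, 79),
    (200, 34, 132),
    (500, 32, 287),
    (1_000, 54, 528),
    (10_000, 40, 5_013),
    (100_000, 60, 47_893),
]


def slowdown(rows):
    """Return (length, no_stop_ms / early_stop_ms) rounded to one decimal."""
    return [(length, round(no_stop / stop, 1)) for length, stop, no_stop in rows]


if __name__ == "__main__":
    for length, factor in slowdown(ROWS):
        print(f"{length:>7}: {factor}x")
```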

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>




