[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858666#comment-16858666 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

I did some more digging and ran some experiments with Yalder's early-stopping feature turned off. To save time, I ran these only against the 'FRA' documents in the Leipzig corpus. The results are in the table below.

I also instrumented Yalder to report the number of ngrams/number of ngrammable characters it had processed when it stopped early. I found that, on average (across all languages), it was reading the first 60 ngrammable characters.

||Tool||Length||Total Time (ms)||Tool||Length||Total Time (ms)||
|Yalder|50|95|YalderNoStop|50|95|
|Yalder|100|44|YalderNoStop|100|79|
|Yalder|200|34|YalderNoStop|200|132|
|Yalder|500|32|YalderNoStop|500|287|
|Yalder|1,000|54|YalderNoStop|1,000|528|
|Yalder|10,000|40|YalderNoStop|10,000|5,013|
|Yalder|100,000|60|YalderNoStop|100,000|47,893|
|Optimaize|50|28|Optimaize|50|29|
|Optimaize|100|22|Optimaize|100|23|
|Optimaize|200|25|Optimaize|200|25|
|Optimaize|500|36|Optimaize|500|27|
|Optimaize|1,000|43|Optimaize|1,000|52|
|Optimaize|10,000|189|Optimaize|10,000|240|
|Optimaize|100,000|1,522|Optimaize|100,000|1,322|
|OpenNLP|50|86|OpenNLP|50|64|
|OpenNLP|100|31|OpenNLP|100|33|
|OpenNLP|200|93|OpenNLP|200|54|
|OpenNLP|500|108|OpenNLP|500|110|
|OpenNLP|1,000|192|OpenNLP|1,000|206|
|OpenNLP|10,000|1,567|OpenNLP|10,000|1,651|
|OpenNLP|100,000|14,696|OpenNLP|100,000|15,585|

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png,
>                      langid_20190509.zip, langid_20190510.zip, langid_20190514.zip,
>                      langid_20190514_plus_minus_1.zip, timeVsLength.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)