[jira] [Comment Edited] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

Tim Allison (JIRA) Tue, 16 Jul 2019 06:00:20 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886096#comment-16886096 ]


Tim Allison edited comment on TIKA-2790 at 7/16/19 12:59 PM:
-------------------------------------------------------------

I'm attaching a final "out of the box" 
[comparison|https://issues.apache.org/jira/secure/attachment/12974822/rollups_20190716.zip] 
against the 103 langs in the standard distro of OpenNLP.  For each detector, 
the scoring only considers a language if that detector claims that it can 
identify it.  Nevertheless, of course, if one detector has models for 200 
languages, and another is targeted to the 103 langs specifically, there will 
be some...differences...ymmv.  
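
To make the scoring rule concrete, here is a minimal sketch of it in Python.  
This is not the code that produced the attached rollups; the function name 
{{score_detectors}}, the shape of the {{samples}} and {{detectors}} arguments, 
and the toy detectors in the usage note are all hypothetical and exist only to 
illustrate "skip a language unless the detector claims it".

```python
def score_detectors(samples, detectors):
    """Score each detector only on languages it claims to support.

    samples:   iterable of (text, true_lang) pairs
    detectors: dict of name -> (supported_langs, detect_fn), where
               supported_langs is a set of language codes and
               detect_fn(text) returns a single language code.
    Returns a dict of name -> (correct, total).
    """
    results = {}
    for name, (supported, detect) in detectors.items():
        correct = 0
        total = 0
        for text, true_lang in samples:
            # A detector is not penalized for a language it has no model for.
            if true_lang not in supported:
                continue
            total += 1
            if detect(text) == true_lang:
                correct += 1
        results[name] = (correct, total)
    return results
```

For example, a detector that only claims {{fra}} and {{eng}} is never scored 
on a {{spa}} sample, while a detector that also claims {{spa}} is.  A detector 
with more models can still be pulled toward a language outside the test set, 
which is one source of the differences mentioned above.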

The custom, probing OpenNLP-based lang detector for tika-eval uses a model with 
121 languages.

This shows that the tika-eval lang id's performance does not degrade on long 
pieces of text the way OpenNLP 1.9.1's lang detector does, and it is far 
faster than OpenNLP 1.9.1.  There are two areas for improvement in the custom 
tika-eval detector: short text and noisy text.  Optimaize is still much better 
on both, although, to be fair, Optimaize has fewer language models.  Yalder, 
of course, is still the fastest, by far.


was (Author: [email protected]):
I'm attaching a final "out of the box" comparison against the 103 langs in the 
standard distro of OpenNLP.  The scoring only considers the languages if a 
given detector claims that it can identify it.  Nevertheless, of course, if one 
detector has models for 200 languages, and another is targeted to the 103 langs 
specifically, there will be some...differences...ymmv.  

The custom, probing OpenNLP-based lang detector for tika-eval uses a model with 
121 languages.

This shows that the tika-eval lang id's performance does not degrade like 
OpenNLP's 1.9.1's lang detector does on long pieces of text, and it is far 
faster than OpenNLP 1.9.1.  There are two areas for improvement in the custom 
tika-eval detector: short text and noisy text -- Optimaize is still much better 
on both -- although, to be fair, Optimaize has fewer language models.  Yalder, 
of course, is still the fastest, by far.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, rollups_20190716.zip, timeVsLength.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
