[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886096#comment-16886096
]
Tim Allison edited comment on TIKA-2790 at 7/16/19 12:59 PM:
-------------------------------------------------------------
I'm attaching a final "out of the box"
[comparison|https://issues.apache.org/jira/secure/attachment/12974822/rollups_20190716.zip]
against the 103 langs in the standard distro of OpenNLP. For each detector, the
scoring only considers a language if that detector claims it can identify it.
Nevertheless, of course, if one detector has models for 200 languages, and
another is targeted to the 103 langs specifically, there will be
some...differences...ymmv.
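To make the scoring rule concrete, here is a rough sketch of how I read it, not the actual tika-eval comparison code. The function name, the sample format, and the toy detector are all made up for illustration; the only idea taken from the comment is that a sample is skipped when the detector does not claim its language.

```python
from collections import defaultdict


def score_detector(samples, supported, detect):
    """Per-language accuracy over languages the detector claims to support.

    samples:   iterable of (text, true_lang) pairs (hypothetical format)
    supported: set of language codes the detector claims it can identify
    detect:    callable mapping text -> predicted language code
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, true_lang in samples:
        if true_lang not in supported:
            continue  # the detector never claimed this language, so skip it
        total[true_lang] += 1
        if detect(text) == true_lang:
            correct[true_lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy detector (an assumption for the example): it only knows "fra" and "eng".
def toy_detect(text):
    return "fra" if "bonjour" in text else "eng"


samples = [
    ("bonjour le monde", "fra"),
    ("hello world", "eng"),
    ("hallo welt", "deu"),  # skipped below: "deu" is not in the supported set
]
scores = score_detector(samples, {"fra", "eng"}, toy_detect)
print(scores)
```

Note that the filter is only on the true language. A detector with many more models can still answer with a language outside the 103, and that counts as a miss, which is one plausible source of the "differences" mentioned above.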
The custom, probing OpenNLP-based lang detector for tika-eval uses a model with
121 languages.
This shows that the tika-eval lang id's performance does not degrade like
OpenNLP 1.9.1's lang detector does on long pieces of text, and it is far
faster than OpenNLP 1.9.1. There are two areas for improvement in the custom
tika-eval detector: short text and noisy text -- Optimaize is still much better
on both -- although, to be fair, Optimaize has fewer language models. Yalder,
of course, is still the fastest, by far.
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png,
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip,
> langid_20190514_plus_minus_1.zip, rollups_20190716.zip, timeVsLength.png
>
>