[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343
 ] 

Ken Krugler commented on TIKA-2790:
-----------------------------------

My concern with OpenNLP is that during a web crawl, even with the current 
lightweight detection algorithm, the detection can add a lot of processing 
time. OpenNLP is generally not known as being "lightweight" :) But we could 
give it a try, for sure.

Note that OpenNLP's language detector reports ISO 639-3 (three-letter) codes. 
Having a more robust representation of languages in the language detector API 
would be a good thing in general (e.g. a three-letter code plus an optional 
region code, so you can differentiate Mandarin Chinese in Taiwan from Mandarin 
Chinese in China or Singapore).
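
To make that suggestion concrete, here is a rough sketch of what such a 
representation might look like. The LanguageTag class and its methods are 
hypothetical names made up for illustration; they are not part of the Tika or 
OpenNLP APIs, and it assumes a three-letter ISO 639 code plus an optional 
two-letter region code.

```java
import java.util.Locale;
import java.util.Objects;
import java.util.Optional;

/**
 * Hypothetical value type (not a Tika or OpenNLP class): a three-letter
 * ISO 639 language code plus an optional two-letter region code.
 */
final class LanguageTag {
    private final String language; // e.g. "cmn" for Mandarin Chinese
    private final String region;   // e.g. "TW"; null when no region is known

    private LanguageTag(String language, String region) {
        this.language = language;
        this.region = region;
    }

    /** Language only, no region. */
    static LanguageTag of(String language) {
        return of(language, null);
    }

    /** Language plus an optional region (pass null to omit the region). */
    static LanguageTag of(String language, String region) {
        Objects.requireNonNull(language, "language");
        String lang = language.toLowerCase(Locale.ROOT);
        if (!lang.matches("[a-z]{3}")) {
            throw new IllegalArgumentException("Expected a three-letter code: " + language);
        }
        String reg = null;
        if (region != null) {
            reg = region.toUpperCase(Locale.ROOT);
            if (!reg.matches("[A-Z]{2}")) {
                throw new IllegalArgumentException("Expected a two-letter region: " + region);
            }
        }
        return new LanguageTag(lang, reg);
    }

    String getLanguage() {
        return language;
    }

    Optional<String> getRegion() {
        return Optional.ofNullable(region);
    }

    /** True when both tags name the same language, ignoring the region. */
    boolean sameLanguage(LanguageTag other) {
        return language.equals(other.language);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof LanguageTag)) {
            return false;
        }
        LanguageTag other = (LanguageTag) o;
        return language.equals(other.language) && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(language, region);
    }

    /** Renders as "cmn-TW", or just "cmn" when there is no region. */
    @Override
    public String toString() {
        return region == null ? language : language + "-" + region;
    }
}
```

With something like this, LanguageTag.of("cmn", "TW") and 
LanguageTag.of("cmn", "CN") would compare as different tags while 
sameLanguage() still reports them as the same language, and a detector that 
only knows the language could return LanguageTag.of("cmn").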

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>




