[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343 ]
Ken Krugler commented on TIKA-2790:
-----------------------------------

My concern with OpenNLP is that during a web crawl, even with the current lightweight detection algorithm, the detection can add a lot of processing time. OpenNLP is generally not known as being "lightweight" :) But we could give it a try, for sure.

Note that OpenNLP uses ISO 639-2 (three letter codes). Having a more robust representation of languages in the language detector API would be a good thing in general (e.g. 639-2 code plus an optional locale code, so you can differentiate Mandarin Chinese in Taiwan from Mandarin Chinese in China or Singapore).

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
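To make the "639-2 code plus an optional locale code" idea from the comment concrete, here is a minimal sketch of what such a value type might look like. The class name `DetectedLanguage` and all of its methods are hypothetical and are not part of Tika's or OpenNLP's API. The sketch also assumes the optional locale part is a two-letter region code, and it uses "zho" (the ISO 639-2/T code for Chinese) as the example.

```java
import java.util.Locale;
import java.util.Objects;
import java.util.Optional;

/**
 * Hypothetical value type (not an existing Tika class): a three-letter
 * language code plus an optional two-letter region code.
 */
final class DetectedLanguage {
    private final String languageCode; // three letters, lower-cased, e.g. "zho"
    private final String region;       // two letters, upper-cased, e.g. "TW"; null if unknown

    private DetectedLanguage(String languageCode, String region) {
        if (languageCode == null || !languageCode.matches("[A-Za-z]{3}")) {
            throw new IllegalArgumentException("expected a three-letter code: " + languageCode);
        }
        if (region != null && !region.matches("[A-Za-z]{2}")) {
            throw new IllegalArgumentException("expected a two-letter region: " + region);
        }
        this.languageCode = languageCode.toLowerCase(Locale.ROOT);
        this.region = region == null ? null : region.toUpperCase(Locale.ROOT);
    }

    /** Language only, region unknown. */
    static DetectedLanguage of(String languageCode) {
        return new DetectedLanguage(languageCode, null);
    }

    /** Language plus region, e.g. of("zho", "TW"). */
    static DetectedLanguage of(String languageCode, String region) {
        return new DetectedLanguage(languageCode, Objects.requireNonNull(region));
    }

    String getLanguageCode() {
        return languageCode;
    }

    Optional<String> getRegion() {
        return Optional.ofNullable(region);
    }

    /** True when both values name the same language, whatever the region. */
    boolean sameLanguage(DetectedLanguage other) {
        return languageCode.equals(other.languageCode);
    }

    /** Renders "zho-TW" when a region is present, otherwise just "zho". */
    String toTag() {
        return region == null ? languageCode : languageCode + "-" + region;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DetectedLanguage)) {
            return false;
        }
        DetectedLanguage that = (DetectedLanguage) o;
        return languageCode.equals(that.languageCode) && Objects.equals(region, that.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(languageCode, region);
    }
}
```

With a type along these lines, `DetectedLanguage.of("zho", "TW")` and `DetectedLanguage.of("zho", "CN")` compare as the same language via `sameLanguage` but remain distinct under `equals`, which is the Taiwan versus China distinction the comment asks for.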