[ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729595#comment-14729595
 ] 

Ken Krugler commented on TIKA-1723:
-----------------------------------

The biggest remaining issue before I commit is how to deal with language names
(aka language tags). I've got a LanguageNames class (it should probably be
renamed LanguageTags) that wraps parts of Java's Locale class to help with
converting between strings and formal locales, and with fuzzy comparison. But
some of what belongs in that class requires functionality that Locale doesn't
provide (e.g. what is the suppress-script setting for a locale?), and other
functionality requires some decision making. For example, if you request 'zh'
as one of the language profiles and the detector has zh-Latn-CN, is that a
match, so that pinyin (e.g. "beijing") gets flagged as Chinese?
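
To make the decision concrete, here is a minimal sketch of the two possible
answers. It is not the LanguageNames class from the patch; the class name
LanguageTagMatch, the strictScript flag, and the small suppress-script table
are all hypothetical, introduced only for illustration. Only
Locale.forLanguageTag and the Locale getters are standard Java. The table
holds a few Suppress-Script values as listed in the IANA Language Subtag
Registry; a real implementation would need the full registry, since Locale
does not expose it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LanguageTagMatch {
    // Hypothetical, deliberately partial suppress-script table.
    // Note that 'zh' has no entry: the registry defines no default script for it.
    private static final Map<String, String> SUPPRESS_SCRIPT = new HashMap<>();
    static {
        SUPPRESS_SCRIPT.put("en", "Latn");
        SUPPRESS_SCRIPT.put("ru", "Cyrl");
        SUPPRESS_SCRIPT.put("ja", "Jpan");
    }

    // Explicit script if present, else the suppress-script default, else "".
    static String effectiveScript(Locale locale) {
        String script = locale.getScript();
        if (!script.isEmpty()) {
            return script;
        }
        String implied = SUPPRESS_SCRIPT.get(locale.getLanguage());
        return implied == null ? "" : implied;
    }

    // Does the requested tag (e.g. "zh") accept the detector's tag (e.g. "zh-Latn-CN")?
    // strictScript = false: a request with no explicit or implied script accepts any script.
    // strictScript = true:  such a request only accepts a tag that also has no script.
    static boolean matches(String requested, String detected, boolean strictScript) {
        Locale req = Locale.forLanguageTag(requested);
        Locale det = Locale.forLanguageTag(detected);

        if (!req.getLanguage().equals(det.getLanguage())) {
            return false;
        }

        String reqScript = effectiveScript(req);
        String detScript = effectiveScript(det);
        if (!reqScript.isEmpty()) {
            // A known script on the request side must agree with the detector's.
            if (!reqScript.equals(detScript)) {
                return false;
            }
        } else if (strictScript && !detScript.isEmpty()) {
            return false;
        }

        // A region on the request narrows the match; no region accepts any region.
        String reqRegion = req.getCountry();
        return reqRegion.isEmpty() || reqRegion.equals(det.getCountry());
    }

    public static void main(String[] args) {
        System.out.println(matches("zh", "zh-Latn-CN", false));
        System.out.println(matches("zh", "zh-Latn-CN", true));
    }
}
```

With strictScript set to false, 'zh' accepts zh-Latn-CN and pinyin would be
reported as Chinese; with it set to true, the same pair is rejected. Which of
the two is the right default is exactly the open question.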

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, 
> TIKA-1723v2.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, has more languages 
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.



