Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel Tue, 24 Oct 2017 02:41:52 -0700

Hi Yossi,

why not port the plugin to use the newer LanguageDetector API?

http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html

The upgrade to Tika 1.16 is already in progress (NUTCH-2439).

Sebastian

On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> Hi,
> 
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> 
> 1.    LanguageIdentifier is deprecated in Tika.
> 2.    It does not support CJK languages (and, I suspect, a lot of other
> languages; see
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn't even fail gracefully with them: in my experience, Chinese
> was recognized as Italian.
> 
> Since in Tika LanguageIdentifier was superseded by
> org.apache.tika.language.detect.LanguageDetector, it seems obvious to make
> that change in the plugin as well. However, the design of LanguageDetector
> is terrible: it makes the implementation non-reentrant, meaning the full
> language model would have to be reloaded on each call to the detector.
> 
> For my needs, I have modified the plugin to use
> com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
> LanguageDetector uses internally (at least by default). My question is
> whether that is a change that should be made to the official plugin.
> 
> Thanks,
> 
>                Yossi.
> 
>
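
The reentrancy concern in the quoted message (a stateful detector that would force a model reload on every call) is often worked around by loading the models once and keeping one detector instance per thread. The following is only a minimal sketch of that pattern using the JDK alone. LanguageModels, StatefulDetector and DetectorHolder are hypothetical stand-ins invented for illustration, not the Tika or Optimaize classes, and the toy "model" is a three-entry word table. Whether the real detectors can share loaded models this way would need to be checked against their documentation.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive, immutable set of language models. */
final class LanguageModels {
    /** Counts how often the (simulated) costly load has run. */
    static final AtomicInteger LOADS = new AtomicInteger();

    private final Map<String, String> markers;

    private LanguageModels(Map<String, String> markers) {
        this.markers = markers;
    }

    /** Simulates the costly model load; intended to run once per JVM. */
    static LanguageModels load() {
        LOADS.incrementAndGet();
        return new LanguageModels(Map.of("the", "en", "der", "de", "le", "fr"));
    }

    String lookup(String token) {
        return markers.getOrDefault(token, "unknown");
    }
}

/** Hypothetical stateful (non-reentrant) detector: buffers text, then detects. */
final class StatefulDetector {
    private final LanguageModels models;
    private final StringBuilder buffer = new StringBuilder();

    StatefulDetector(LanguageModels models) {
        this.models = models;
    }

    void reset() {
        buffer.setLength(0);
    }

    void addText(CharSequence text) {
        buffer.append(text).append(' ');
    }

    /** Returns the language of the first token found in the toy model. */
    String detect() {
        for (String token : buffer.toString().toLowerCase(Locale.ROOT).split("\\s+")) {
            String lang = models.lookup(token);
            if (!lang.equals("unknown")) {
                return lang;
            }
        }
        return "unknown";
    }
}

/** Shares the loaded models and keeps one stateful detector per thread. */
final class DetectorHolder {
    // Loaded once and only read afterwards, so safe to share across threads.
    private static final LanguageModels MODELS = LanguageModels.load();

    // Each thread gets its own detector, so buffered text is never shared.
    private static final ThreadLocal<StatefulDetector> DETECTOR =
            ThreadLocal.withInitial(() -> new StatefulDetector(MODELS));

    private DetectorHolder() {
    }

    static String detect(String text) {
        StatefulDetector detector = DETECTOR.get();
        detector.reset(); // clear anything left over from the previous call
        detector.addText(text);
        return detector.detect();
    }
}
```

A caller would simply invoke DetectorHolder.detect(text) for each document. The point of the design is that the expensive load happens once in a static initializer, while the mutable buffer stays confined to a single thread.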
