Hi

The language-identifier plugin uses
org.apache.tika.language.LanguageIdentifier for extracting the language from
the document text. There are two issues with that:

1.      LanguageIdentifier is deprecated in Tika.
2.      It does not support CJK languages (and, I suspect, many other
languages; see
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
and it does not even fail gracefully on them: in my experience, Chinese
was recognized as Italian.

Since Tika superseded LanguageIdentifier with
org.apache.tika.language.detect.LanguageDetector, the obvious fix is to make
the same change in the plugin. However, LanguageDetector is badly designed:
its implementation is not reentrant, so the full language model would have
to be reloaded on each call to the detector.

For my needs, I have modified the plugin to use
com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
LanguageDetector uses internally (at least by default). My question is
whether this change should be made to the official plugin.
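
To make the load-once idea concrete, here is a minimal sketch of the pattern
I have in mind. It is not the plugin code: SharedDetector and its loader are
hypothetical names, and the Function<String, String> only stands in for a
real detector, which I am assuming is safe to share between threads once it
has been built.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical holder that builds an expensive detector once and reuses it.
 * The Function<String, String> (text in, language code out) is a stand-in
 * for a real detector and is assumed to be thread-safe once built.
 */
final class SharedDetector {
    private final Supplier<Function<String, String>> loader;
    private volatile Function<String, String> detector;

    SharedDetector(Supplier<Function<String, String>> loader) {
        this.loader = loader;
    }

    /** Detects the language of the text, loading the model on first use only. */
    String detect(String text) {
        Function<String, String> d = detector;
        if (d == null) {
            synchronized (this) {
                d = detector;
                if (d == null) {
                    // The expensive model load happens here, exactly once.
                    d = loader.get();
                    detector = d;
                }
            }
        }
        return d.apply(text);
    }
}
```

In a plugin, the loader would be whatever builds the real detector from its
language profiles, and every document would go through the same
SharedDetector instance instead of triggering a reload.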

Thanks,

               Yossi.
