Hi
The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to extract the language from the document text. There are two issues with that:

1. LanguageIdentifier is deprecated in Tika.
2. It does not support CJK languages (and, I suspect, many others; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it does not even fail gracefully on them: in my experience, Chinese text was recognized as Italian.

Since LanguageIdentifier was superseded in Tika by org.apache.tika.language.detect.LanguageDetector, the obvious fix is to make the same change in the plugin. However, the design of LanguageDetector is terrible: it makes the implementation non-reentrant, meaning the full language model would have to be reloaded on each call to the detector.

For my needs, I have modified the plugin to use com.optimaize.langdetect.LanguageDetector directly, which is what Tika's LanguageDetector uses internally (at least by default).

My question is whether that change should be made to the official plugin.

Thanks,
Yossi.
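
P.S. To illustrate the cost difference between reloading the model on every call and loading it once, here is a minimal, self-contained sketch. It is not the plugin code and does not use the Tika or Optimaize APIs: the Detector interface, loadDetector(), and the toy "Han script means zh" rule are all hypothetical stand-ins, and MODEL_LOADS exists only to make the number of loads visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedDetectorSketch {

    /** Hypothetical stand-in for a language detector; not a Tika or Optimaize type. */
    interface Detector {
        String detect(String text);
    }

    /** Counts how many times the (pretend) language model has been loaded. */
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    /** Hypothetical expensive load; a real detector would read its language profiles here. */
    static Detector loadDetector() {
        MODEL_LOADS.incrementAndGet();
        // Toy rule for illustration only: any Han ideograph means "zh", otherwise "en".
        return text -> text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                ? "zh" : "en";
    }

    /** Initialization-on-demand holder: the JVM loads the detector once, thread-safely. */
    private static final class Holder {
        static final Detector INSTANCE = loadDetector();
    }

    /** The pattern to avoid: every call pays for a full model load. */
    static String detectReloading(String text) {
        return loadDetector().detect(text);
    }

    /** The pattern I ended up with: all calls share one loaded detector. */
    static String detectShared(String text) {
        return Holder.INSTANCE.detect(text);
    }

    public static void main(String[] args) {
        detectReloading("first document");
        detectReloading("second document");
        System.out.println("loads after two reloading calls: " + MODEL_LOADS.get());

        detectShared("first document");
        detectShared("second document");
        System.out.println("loads after two more shared calls: " + MODEL_LOADS.get());
    }
}
```

In a real plugin the shared instance would presumably be created when the plugin is configured rather than in a static holder, but the idea is the same: pay for the model once, not per document.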