Hi Yossi,

why not port it to use
http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
The upgrade to Tika 1.16 is already in progress (NUTCH-2439).

Sebastian

On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> Hi
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
>
> 1. LanguageIdentifier is deprecated in Tika.
> 2. It does not support CJK languages (and I suspect a lot of other
>    languages -
>    https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>    and it doesn't even fail gracefully with them - in my experience
>    Chinese was recognized as Italian.
>
> Since in Tika LanguageIdentifier was superseded by
> org.apache.tika.language.detect.LanguageDetector, it seems obvious to make
> that change in the plugin as well. However, because the design of
> LanguageDetector is terrible, it makes the implementation not reentrant,
> meaning the full language model would have to be reloaded on each call to
> the detector.
>
> For my needs, I have modified the plugin to use
> com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
> LanguageDetector uses internally (at least by default). My question is
> whether that is a change that should be made to the official plugin.
>
> Thanks,
> Yossi.
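
On the reload concern in the quoted message: one possible workaround for a stateful, non-reentrant detector is to keep one instance per thread, so the model is loaded once per thread rather than on every call. Below is a minimal sketch of that pattern only. StatefulDetector and DetectorPool are hypothetical stand-ins invented for illustration (not Tika's or Optimaize's API), and the "detection" is a toy rule; whether this fits the plugin depends on how the real detector loads and shares its models.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a stateful detector; NOT Tika's or Optimaize's API. */
class StatefulDetector {
    /** Counts how often the (simulated) expensive model load happens. */
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    private final StringBuilder buffer = new StringBuilder();

    StatefulDetector() {
        // Stands in for the expensive language-model load.
        MODEL_LOADS.incrementAndGet();
    }

    /** Clears text accumulated by earlier calls. */
    void reset() {
        buffer.setLength(0);
    }

    void addText(CharSequence text) {
        buffer.append(text);
    }

    /** Toy rule for illustration: any Han character yields "zh", otherwise "und". */
    String detect() {
        for (int i = 0; i < buffer.length(); i++) {
            if (Character.UnicodeScript.of(buffer.charAt(i)) == Character.UnicodeScript.HAN) {
                return "zh";
            }
        }
        return "und";
    }
}

/** Hypothetical helper: one detector per thread, created lazily and then reused. */
class DetectorPool {
    private static final ThreadLocal<StatefulDetector> PER_THREAD =
            ThreadLocal.withInitial(StatefulDetector::new);

    static String detect(CharSequence text) {
        StatefulDetector detector = PER_THREAD.get(); // loads the model only on first use per thread
        detector.reset();                             // drop state left over from the previous document
        detector.addText(text);
        return detector.detect();
    }
}
```

Each thread pays the load cost once; every later call to DetectorPool.detect(...) on that thread reuses the same instance after a reset.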