Hi Yossi,

sorry, while fast-reading I thought it was about the old LanguageIdentifier.

> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? Shouldn't the API/interface allow you to get an instance
and call loadModels()?

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin language-identifier-optimaize.

Btw., I recently had a look at various open-source language identifier
implementations and would prefer langid (a port from Python/C) because it's
faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git

Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's C++).

Thanks,
Sebastian

On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> Hi Sebastian,
>
> Please reread the second paragraph of my email 😊.
> In short, it is not possible to initialize the detector in setConf and then
> reuse it, and initializing it per call would be extremely slow.
>
> Yossi.
>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 12:41
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> why not port it to use
>>
>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>>
>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>
>> Sebastian
>>
>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>> Hi
>>>
>>> The language-identifier plugin uses
>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>> language from the document text. There are two issues with that:
>>>
>>> 1. LanguageIdentifier is deprecated in Tika.
>>> 2. It does not support CJK languages (and I suspect a lot of other
>>>    languages -
>>>    https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>>>    and it doesn't even fail gracefully with them - in my experience
>>>    Chinese was recognized as Italian.
>>>
>>> Since in Tika LanguageIdentifier was superseded by
>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>> make that change in the plugin as well. However, because the design of
>>> LanguageDetector is terrible, it makes the implementation not
>>> reentrant, meaning the full language model would have to be reloaded
>>> on each call to the detector.
>>>
>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>> Tika's LanguageDetector uses internally (at least by default). My
>>> question is whether that is a change that should be made to the official
>>> plugin.
>>>
>>> Thanks,
>>>
>>> Yossi.
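
For reference, the life cycle argued about in this thread (load the model once in setConf, then reuse it on every call instead of reloading it per document) could look roughly like the sketch below. Everything in it is hypothetical: ToyLanguageModel and ReusableDetectorFilter are illustrative stand-ins, not classes from Tika, optimaize, or Nutch, and the "model" is only a tiny marker-word table. Whether a real detector can safely be shared this way depends on that detector's own thread-safety guarantees.

```java
/** Hypothetical stand-in for an expensive language model (not a Tika or optimaize class). */
final class ToyLanguageModel {
    /** Counts how often the "expensive" load ran, to make reuse visible. */
    static int loadCount = 0;

    // Marker word -> language code; a real model would hold n-gram profiles instead.
    private static final String[][] MARKERS = {
        {"the", "en"}, {"der", "de"}, {"il", "it"}
    };

    private ToyLanguageModel() { }

    /** Simulates the costly one-time model load. */
    static ToyLanguageModel load() {
        loadCount++;
        return new ToyLanguageModel();
    }

    /** Read-only lookup: no state is modified, so the instance can be reused across calls. */
    String detect(String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            for (String[] marker : MARKERS) {
                if (marker[0].equals(token)) {
                    return marker[1];
                }
            }
        }
        return "unknown";
    }
}

/** Hypothetical plugin-like class: the model is loaded in setConf and reused afterwards. */
final class ReusableDetectorFilter {
    private ToyLanguageModel model;

    /** Called once when the component is configured. */
    void setConf() {
        model = ToyLanguageModel.load();
    }

    /** Called once per document; does not reload the model. */
    String identify(String text) {
        return model.detect(text);
    }
}

final class SetConfReuseDemo {
    public static void main(String[] args) {
        ReusableDetectorFilter filter = new ReusableDetectorFilter();
        filter.setConf();
        System.out.println(filter.identify("the quick brown fox"));
        System.out.println(filter.identify("der schnelle braune Fuchs"));
        System.out.println("model loads: " + ToyLanguageModel.loadCount);
    }
}
```

The point of the sketch is only that detect() reads shared state without mutating it, which is what makes a single instance reusable; a detector that accumulates per-document state inside the instance would need either a reset between calls or one instance per call.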