Hello,

Not sure what the problem is, but buried deep in our parser we also use Optimaize, previously lang-detect. We load the models once, inside a static block, and create a new Detector instance for every record we parse. This is very fast.
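
A minimal sketch of that pattern, using only the JDK. It is not the actual parser code and does not call the Optimaize API: the class LangDetectSketch, the toy word-list "model", the modelLoads counter and detectLanguage() are all hypothetical stand-ins. The only point is the split between a model that is loaded once in a static block and a cheap detector object that is created per record.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LangDetectSketch {

    // Shared, read-only "model": language code -> marker words.
    // Hypothetical stand-in for real language profiles.
    static final Map<String, String[]> MODELS;

    // Counts how often the models were loaded (illustration only).
    static int modelLoads = 0;

    // The expensive part runs once, when the class is initialized.
    static {
        Map<String, String[]> m = new HashMap<>();
        m.put("en", new String[] {"the", "and", "is"});
        m.put("de", new String[] {"der", "und", "ist"});
        MODELS = Collections.unmodifiableMap(m);
        modelLoads++;
    }

    // Per-record detector: owns only the text of one record and
    // reads the shared models without modifying them.
    static final class Detector {
        private final StringBuilder text = new StringBuilder();

        void append(String s) {
            text.append(s).append(' ');
        }

        String detect() {
            String[] tokens = text.toString().toLowerCase(Locale.ROOT).split("\\s+");
            String best = "unknown";
            int bestScore = 0;
            for (Map.Entry<String, String[]> e : MODELS.entrySet()) {
                int score = 0;
                for (String token : tokens) {
                    for (String marker : e.getValue()) {
                        if (marker.equals(token)) {
                            score++;
                        }
                    }
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

    // Called once per parsed record: a new Detector every time, no model reload.
    static String detectLanguage(String recordText) {
        Detector detector = new Detector();
        detector.append(recordText);
        return detector.detect();
    }
}
```

Because the shared map is never modified after the static block and each Detector is confined to a single call, detectLanguage() in this sketch can be called from several parser threads at once without a ThreadLocal.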

Regards,
Markus

-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 24th October 2017 14:11
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> > does not separate the Detector object, which contains the model and
> > should be reused, from the text writer object, which should be request
> > specific.
>
> But shouldn't a call of reset() make it ready for re-use (the Detector
> object including the writer)?
>
> But I agree that a reentrant function maybe easier to integrate. Nutch
> plugins also need to be thread-safe, esp. parsers and parse filters if
> running in a multi-threaded parsing fetcher.
> Without a reentrant function and without a 100% stateless detector, the
> only way is to use a ThreadLocal instance of the detector. At a first
> glance, the optimaize detecter seems to be stateless.
>
> > I chose optimaize mainly because Tika did. Using langid instead should
> > be very simple, but the fact that the project has not seen a single
> > commit in the last 4 years, and the usage numbers are also quite low,
> > gives me pause...
>
> Of course, maintenance or community around a project is an important
> factor. CLD2 is also not really maintained, plus the models are fixed, no
> code available to retrain them.
>
> > what I have done locally
>
> In any case, would be great if you would open an issue on Jira and a pull
> request on github.
> Which way to go may be discussed further.
>
> Thanks,
> Sebastian
>
>
> On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > Why not LanguageDetector: The API does not separate the Detector object,
> > which contains the model and should be reused, from the text writer
> > object, which should be request specific. The same API Object instance
> > contains references to both. In code terms, both loadModels() and
> > addText() are non-static members of LanguageDetector.
> >
> > Developing another language-identifier-optimaize is basically what I
> > have done locally, but it seems to me having both in the Nutch
> > repository would just be confusing for users. 99% of the code would also
> > be duplicated (the relevant code is about 5 lines).
> >
> > I chose optimaize mainly because Tika did. Using langid instead should
> > be very simple, but the fact that the project has not seen a single
> > commit in the last 4 years, and the usage numbers are also quite low,
> > gives me pause...
> >
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 13:18
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> sorry while fast-reading I've thought it's about the old
> >> LanguageIdentifier.
> >>
> >>> it is not possible to initialize the detector in setConf and then
> >>> reuse it
> >>
> >> Could explain why? The API/interface should allow to get an instance
> >> and call loadModels() or not?
> >>
> >>>>> For my needs, I have modified the plugin to use
> >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>
> >> Of course, that's also possible. Or just add a plugin
> >> language-identifier-optimaize.
> >>
> >> Btw., I recently had a look on various open source language identifier
> >> implementations would prefer langid (a port from Python/C) because it's
> >> faster and has a better precision:
> >>   https://github.com/carrotsearch/langid-java.git
> >>   https://github.com/saffsd/langid.c.git
> >>   https://github.com/saffsd/langid.py.git
> >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten
> >> (but it's C++).
> >>
> >> Thanks,
> >> Sebastian
> >>
> >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> >>> Hi Sebastian,
> >>>
> >>> Please reread the second paragraph of my email.
> >>> In short, it is not possible to initialize the detector in setConf and
> >>> then reuse it, and initializing it per call would be extremely slow.
> >>>
> >>> Yossi.
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >>>> Sent: 24 October 2017 12:41
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> >>>> plugin
> >>>>
> >>>> Hi Yossi,
> >>>>
> >>>> why not port it to use
> >>>>
> >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> >>>>
> >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>>>
> >>>> Sebastian
> >>>>
> >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>>>> Hi
> >>>>>
> >>>>> The language-identifier plugin uses
> >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>>>> language from the document text. There are two issues with that:
> >>>>>
> >>>>> 1. LanguageIdentifier is deprecated in Tika.
> >>>>> 2. It does not support CJK language (and I suspect a lot of other
> >>>>>    languages -
> >>>>>    https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> >>>>>    and it doesn't even fail gracefully with them - in my experience
> >>>>>    Chinese was recognized as Italian.
> >>>>>
> >>>>> Since in Tika LanguageIdentifier was superseded by
> >>>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious
> >>>>> to make that change in the plugin as well. However, because the
> >>>>> design of LanguageDetector is terrible, it makes the implementation
> >>>>> not reentrant, meaning the full language model would have to be
> >>>>> reloaded on each call to the detector.
> >>>>>
> >>>>> For my needs, I have modified the plugin to use
> >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>>>> Tika's LanguageDetector uses internally (at least by default). My
> >>>>> question is whether that is a change that should be made to the
> >>>>> official plugin.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Yossi.