Hi Yossi,

> does not separate the Detector object, which contains the model and should be
> reused, from the text writer object, which should be request specific.

But shouldn't a call to reset() make it ready for reuse (the Detector object
including the writer)?

But I agree that a reentrant function may be easier to integrate. Nutch plugins
also need to be thread-safe, especially parsers and parse filters when running
in a multi-threaded parsing fetcher. Without a reentrant function and without a
100% stateless detector, the only way is to use a ThreadLocal instance of the
detector.

At first glance, the optimaize detector seems to be stateless.

> I chose optimaize mainly because Tika did. Using langid instead should be
> very simple, but the fact that the project has not seen a single commit in
> the last 4 years, and the usage numbers are also quite low, gives me pause...

Of course, maintenance or community around a project is an important factor.
CLD2 is also not really maintained, plus the models are fixed, with no code
available to retrain them.

> what I have done locally

In any case, it would be great if you would open an issue on Jira and a pull
request on GitHub. Which way to go can be discussed further.

Thanks,
Sebastian

On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> Why not LanguageDetector: The API does not separate the Detector object,
> which contains the model and should be reused, from the text writer object,
> which should be request specific. The same API Object instance contains
> references to both. In code terms, both loadModels() and addText() are
> non-static members of LanguageDetector.
>
> Developing another language-identifier-optimaize is basically what I have
> done locally, but it seems to me having both in the Nutch repository would
> just be confusing for users. 99% of the code would also be duplicated (the
> relevant code is about 5 lines).
>
> I chose optimaize mainly because Tika did. Using langid instead should be
> very simple, but the fact that the project has not seen a single commit in
> the last 4 years, and the usage numbers are also quite low, gives me pause...
>
>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 13:18
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> sorry while fast-reading I've thought it's about the old LanguageIdentifier.
>>
>>> it is not possible to initialize the detector in setConf and then reuse it
>>
>> Could explain why? The API/interface should allow to get an instance and call
>> loadModels() or not?
>>
>>>>> For my needs, I have modified the plugin to use
>>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>
>> Of course, that's also possible. Or just add a plugin
>> language-identifier-optimaize.
>>
>> Btw., I recently had a look on various open source language identifier
>> implementations would prefer langid (a port from Python/C) because it's
>> faster and has a better precision:
>> https://github.com/carrotsearch/langid-java.git
>> https://github.com/saffsd/langid.c.git
>> https://github.com/saffsd/langid.py.git
>> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but
>> it's C++).
>>
>> Thanks,
>> Sebastian
>>
>> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
>>> Hi Sebastian,
>>>
>>> Please reread the second paragraph of my email 😊.
>>> In short, it is not possible to initialize the detector in setConf and then
>>> reuse it, and initializing it per call would be extremely slow.
>>>
>>> Yossi.
>>>
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>>>> Sent: 24 October 2017 12:41
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>>>
>>>> Hi Yossi,
>>>>
>>>> why not port it to use
>>>>
>>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>>>>
>>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>>>
>>>> Sebastian
>>>>
>>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>>>> Hi
>>>>>
>>>>> The language-identifier plugin uses
>>>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>>>> language from the document text. There are two issues with that:
>>>>>
>>>>> 1. LanguageIdentifier is deprecated in Tika.
>>>>> 2. It does not support CJK language (and I suspect a lot of other
>>>>> languages -
>>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>>>>> and it doesn't even fail gracefully with them - in my experience
>>>>> Chinese was recognized as Italian.
>>>>>
>>>>> Since in Tika LanguageIdentifier was superseded by
>>>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>>>> make that change in the plugin as well. However, because the design of
>>>>> LanguageDetector is terrible, it makes the implementation not
>>>>> reentrant, meaning the full language model would have to be reloaded
>>>>> on each call to the detector.
>>>>>
>>>>> For my needs, I have modified the plugin to use
>>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>>>> Tika's LanguageDetector uses internally (at least by default). My
>>>>> question is whether that is a change that should be made to the official
>>>>> plugin.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yossi.
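
To illustrate the ThreadLocal option mentioned in the reply at the top of this
thread, here is a minimal sketch. It deliberately does not use the Tika or
optimaize APIs: StatefulDetector, its toy word-to-language "model", and
ThreadLocalDetectorSketch are hypothetical stand-ins invented for this example.
The assumption is only that the detector holds per-request text (so it is not
reentrant) while the model itself is read-only and expensive to load.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stateful detector: shared read-only model plus per-request text. */
class StatefulDetector {
    private final Map<String, String> model;                // shared, never modified
    private final StringBuilder text = new StringBuilder(); // per-request state

    StatefulDetector(Map<String, String> model) {
        this.model = model;
    }

    /** Drops the text of the previous request so the instance can be reused. */
    void reset() {
        text.setLength(0);
    }

    void addText(String s) {
        text.append(s).append(' ');
    }

    /** Toy detection: returns the language of the first token found in the model. */
    String detect() {
        for (String token : text.toString().toLowerCase(Locale.ROOT).split("\\s+")) {
            String lang = model.get(token);
            if (lang != null) {
                return lang;
            }
        }
        return "unknown";
    }
}

class ThreadLocalDetectorSketch {
    /** Counts model loads, only to show that the model is loaded once. */
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    /** The expensive part: loaded once per JVM and shared by all threads. */
    private static final Map<String, String> MODEL = loadModel();

    /** The cheap, stateful part: one instance per thread, created lazily. */
    private static final ThreadLocal<StatefulDetector> DETECTOR =
        ThreadLocal.withInitial(() -> new StatefulDetector(MODEL));

    private static Map<String, String> loadModel() {
        MODEL_LOADS.incrementAndGet();
        return Map.of("the", "en", "der", "de"); // stand-in for real language profiles
    }

    /** Safe to call from several threads: each thread uses its own detector. */
    static String detect(String input) {
        StatefulDetector detector = DETECTOR.get();
        detector.reset(); // clear whatever the previous call on this thread added
        detector.addText(input);
        return detector.detect();
    }
}
```

A parse filter could then call ThreadLocalDetectorSketch.detect(text) from any
parsing thread. Whether this is needed at all depends on whether the real
detector turns out to be stateless; if it is, a single shared instance would be
simpler.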