Hello,

Sorry, I didn't say that as a Nutch committer. Our parser at Openindex has Optimaize deep under the hood, and it is fast!

Regards,
Markus

-----Original message-----
> From: Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Tuesday 24th October 2017 14:46
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Markus,
>
> Can you please explain what you mean by "our parser", because I'm pretty
> sure the language-identifier plugin is not using Optimaize.
>
> Thanks,
> Yossi.
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 24 October 2017 15:25
> > To: user@nutch.apache.org
> > Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> >
> > Hello,
> >
> > Not sure what the problem is, but buried deep in our parser we also use
> > Optimaize, previously lang-detect. We load models once, inside a static
> > block, and create a new Detector instance for every record we parse.
> > This is very fast.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From: Sebastian Nagel <wastl.na...@googlemail.com>
> > > Sent: Tuesday 24th October 2017 14:11
> > > To: user@nutch.apache.org
> > > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> > >
> > > Hi Yossi,
> > >
> > > > does not separate the Detector object, which contains the model and
> > > > should be reused, from the text writer object, which should be request
> > > > specific.
> > >
> > > But shouldn't a call of reset() make it ready for re-use (the Detector
> > > object including the writer)?
> > >
> > > But I agree that a reentrant function may be easier to integrate. Nutch
> > > plugins also need to be thread-safe, esp. parsers and parse filters if
> > > running in a multi-threaded parsing fetcher.
> > > Without a reentrant function and without a 100% stateless detector,
> > > the only way is to use a ThreadLocal instance of the detector. At first
> > > glance, the optimaize detector seems to be stateless.
> > >
> > > > I chose optimaize mainly because Tika did. Using langid instead
> > > > should be very simple, but the fact that the project has not seen a
> > > > single commit in the last 4 years, and the usage numbers are also quite
> > > > low, gives me pause...
> > >
> > > Of course, maintenance or community around a project is an important
> > > factor. CLD2 is also not really maintained, plus the models are fixed, no
> > > code available to retrain them.
> > >
> > > > what I have done locally
> > >
> > > In any case, would be great if you would open an issue on Jira and a pull
> > > request on github.
> > > Which way to go may be discussed further.
> > >
> > > Thanks,
> > > Sebastian
> > >
> > >
> > > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > > Why not LanguageDetector: The API does not separate the Detector object,
> > > > which contains the model and should be reused, from the text writer object,
> > > > which should be request specific. The same API Object instance contains
> > > > references to both. In code terms, both loadModels() and addText() are
> > > > non-static members of LanguageDetector.
> > > >
> > > > Developing another language-identifier-optimaize is basically what I
> > > > have done locally, but it seems to me having both in the Nutch repository
> > > > would just be confusing for users. 99% of the code would also be duplicated
> > > > (the relevant code is about 5 lines).
> > > >
> > > > I chose optimaize mainly because Tika did. Using langid instead should
> > > > be very simple, but the fact that the project has not seen a single commit
> > > > in the last 4 years, and the usage numbers are also quite low, gives me
> > > > pause...
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > > >> Sent: 24 October 2017 13:18
> > > >> To: user@nutch.apache.org
> > > >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> > > >>
> > > >> Hi Yossi,
> > > >>
> > > >> sorry, while fast-reading I thought it was about the old
> > > >> LanguageIdentifier.
> > > >>
> > > >>> it is not possible to initialize the detector in setConf and then
> > > >>> reuse it
> > > >>
> > > >> Could you explain why? The API/interface should allow you to get an
> > > >> instance and call loadModels(), or not?
> > > >>
> > > >>>>> For my needs, I have modified the plugin to use
> > > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > > >>>>> what
> > > >>
> > > >> Of course, that's also possible. Or just add a plugin
> > > >> language-identifier-optimaize.
> > > >>
> > > >> Btw., I recently had a look at various open source language
> > > >> identifier implementations and would prefer langid (a port from
> > > >> Python/C) because it's faster and has a better precision:
> > > >> https://github.com/carrotsearch/langid-java.git
> > > >> https://github.com/saffsd/langid.c.git
> > > >> https://github.com/saffsd/langid.py.git
> > > >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is
> > > >> unbeaten (but it's C++).
> > > >>
> > > >> Thanks,
> > > >> Sebastian
> > > >>
> > > >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > > >>> Hi Sebastian,
> > > >>>
> > > >>> Please reread the second paragraph of my email.
> > > >>> In short, it is not possible to initialize the detector in setConf
> > > >>> and then reuse it, and initializing it per call would be extremely
> > > >>> slow.
> > > >>>
> > > >>> Yossi.
> > > >>>
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > > >>>> Sent: 24 October 2017 12:41
> > > >>>> To: user@nutch.apache.org
> > > >>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> > > >>>>
> > > >>>> Hi Yossi,
> > > >>>>
> > > >>>> why not port it to use
> > > >>>>
> > > >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> > > >>>>
> > > >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> > > >>>>
> > > >>>> Sebastian
> > > >>>>
> > > >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > > >>>>> Hi
> > > >>>>>
> > > >>>>> The language-identifier plugin uses
> > > >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> > > >>>>> language from the document text. There are two issues with that:
> > > >>>>>
> > > >>>>> 1. LanguageIdentifier is deprecated in Tika.
> > > >>>>> 2. It does not support CJK language (and I suspect a lot of other
> > > >>>>> languages -
> > > >>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> > > >>>>> and it doesn't even fail gracefully with them - in my experience
> > > >>>>> Chinese was recognized as Italian.
> > > >>>>>
> > > >>>>> Since in Tika LanguageIdentifier was superseded by
> > > >>>>> org.apache.tika.language.detect.LanguageDetector, it seems
> > > >>>>> obvious to make that change in the plugin as well. However,
> > > >>>>> because the design of LanguageDetector is terrible, it makes the
> > > >>>>> implementation not reentrant, meaning the full language model
> > > >>>>> would have to be reloaded on each call to the detector.
> > > >>>>>
> > > >>>>> For my needs, I have modified the plugin to use
> > > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > > >>>>> what Tika's LanguageDetector uses internally (at least by
> > > >>>>> default). My question is whether that is a change that should be
> > > >>>>> made to the official plugin.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Yossi.
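
For readers of the archive, the pattern discussed in this thread (load the models once in a static block, then create a cheap detector per record) can be sketched roughly as below. This is only an illustration under stated assumptions: LangDetectSketch, Detector, loadModels() and the toy marker-word "models" are hypothetical stand-ins written with the JDK only. It is not the Optimaize, Tika or Nutch API, and not the Openindex parser code.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: shared, read-only models plus per-record detector state.
final class LangDetectSketch {
    // Counts how often the (pretend) expensive model load runs.
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    // Shared by all threads; never modified after class initialization.
    static final Map<String, String> MODELS;

    static {
        // Runs exactly once, when the class is initialized.
        MODELS = loadModels();
    }

    // Toy stand-in for loading language profiles: one marker word per language.
    private static Map<String, String> loadModels() {
        MODEL_LOADS.incrementAndGet();
        Map<String, String> m = new LinkedHashMap<>();
        m.put("en", "the");
        m.put("de", "der");
        m.put("nl", "het");
        return Collections.unmodifiableMap(m);
    }

    // Request-specific object: holds only the text of one record.
    static final class Detector {
        private final StringBuilder text = new StringBuilder();

        void addText(String s) {
            text.append(s).append(' ');
        }

        String detect() {
            String padded = " " + text.toString().toLowerCase() + " ";
            for (Map.Entry<String, String> e : MODELS.entrySet()) {
                if (padded.contains(" " + e.getValue() + " ")) {
                    return e.getKey();
                }
            }
            return "unknown";
        }
    }

    // Reentrant entry point: no state is shared between calls except MODELS.
    static String detect(String recordText) {
        Detector d = new Detector();
        d.addText(recordText);
        return d.detect();
    }

    public static void main(String[] args) {
        System.out.println(detect("The cat sat on the mat"));
        System.out.println("model loads: " + MODEL_LOADS.get());
    }
}
```

Because each call builds its own Detector and only reads the shared map, detect() can be called from several parser threads without a ThreadLocal. A ThreadLocal would only be needed if the detector object itself kept mutable state between calls.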