Hi Markus,

Can you please explain what you mean by "our parser"? I'm pretty sure the language-identifier plugin is not using Optimaize.
Thanks,
Yossi.

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 24 October 2017 15:25
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hello,
>
> Not sure what the problem is, but buried deep in our parser we also use
> Optimaize, previously lang-detect. We load the models once, inside a static
> block, and create a new Detector instance for every record we parse. This
> is very fast.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Sebastian Nagel <wastl.na...@googlemail.com>
> > Sent: Tuesday 24th October 2017 14:11
> > To: user@nutch.apache.org
> > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > plugin
> >
> > Hi Yossi,
> >
> > > does not separate the Detector object, which contains the model and
> > > should be reused, from the text writer object, which should be
> > > request specific.
> >
> > But shouldn't a call of reset() make it ready for re-use (the Detector
> > object including the writer)?
> >
> > But I agree that a reentrant function may be easier to integrate. Nutch
> > plugins also need to be thread-safe, esp. parsers and parse filters if
> > running in a multi-threaded parsing fetcher.
> > Without a reentrant function and without a 100% stateless detector, the
> > only way is to use a ThreadLocal instance of the detector. At first
> > glance, the optimaize detector seems to be stateless.
> >
> > > I chose optimaize mainly because Tika did. Using langid instead
> > > should be very simple, but the fact that the project has not seen a
> > > single commit in the last 4 years, and the usage numbers are also
> > > quite low, gives me pause...
> >
> > Of course, maintenance or community around a project is an important
> > factor. CLD2 is also not really maintained, plus the models are fixed;
> > no code is available to retrain them.
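The pattern discussed in the thread - load the models once, keep one detector instance per thread - can be sketched as follows. This is a minimal, self-contained illustration: `CostlyDetector` is a hypothetical stand-in for any detector whose model loading is expensive, not a real Nutch, Tika, or optimaize class.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a detector whose model loading is expensive.
// The real classes (Tika's LanguageDetector, optimaize's LanguageDetector)
// differ; this only illustrates the threading pattern.
class CostlyDetector {
    static final AtomicInteger LOADS = new AtomicInteger();

    CostlyDetector() {
        LOADS.incrementAndGet(); // pretend this loads the language models
    }

    String detect(String text) {
        // placeholder logic only; a real detector scores n-gram profiles
        return text.contains("bonjour") ? "fr" : "en";
    }
}

public class ThreadLocalDetectorDemo {
    // One detector per thread: created lazily on first use, then reused,
    // so a multi-threaded parsing fetcher pays the model-loading cost
    // once per thread instead of once per document.
    private static final ThreadLocal<CostlyDetector> DETECTOR =
            ThreadLocal.withInitial(CostlyDetector::new);

    static String identify(String text) {
        return DETECTOR.get().detect(text);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            identify("document number " + i); // many records, same thread
        }
        System.out.println(identify("bonjour le monde")); // fr
        System.out.println(identify("hello world"));      // en
        System.out.println(CostlyDetector.LOADS.get());   // 1
    }
}
```

With this shape, the per-document cost in a parse filter is just the `detect()` call; each parsing thread loads the models once.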
> >
> > > what I have done locally
> >
> > In any case, it would be great if you would open an issue on Jira and a
> > pull request on GitHub. Which way to go may be discussed further.
> >
> > Thanks,
> > Sebastian
> >
> >
> > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > Why not LanguageDetector: the API does not separate the Detector
> > > object, which contains the model and should be reused, from the text
> > > writer object, which should be request specific. The same API object
> > > instance contains references to both. In code terms, both loadModels()
> > > and addText() are non-static members of LanguageDetector.
> > >
> > > Developing another language-identifier-optimaize is basically what I
> > > have done locally, but it seems to me that having both in the Nutch
> > > repository would just be confusing for users. 99% of the code would
> > > also be duplicated (the relevant code is about 5 lines).
> > >
> > > I chose optimaize mainly because Tika did. Using langid instead should
> > > be very simple, but the fact that the project has not seen a single
> > > commit in the last 4 years, and the usage numbers are also quite low,
> > > gives me pause...
> > >
> > >
> > >> -----Original Message-----
> > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >> Sent: 24 October 2017 13:18
> > >> To: user@nutch.apache.org
> > >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > >> plugin
> > >>
> > >> Hi Yossi,
> > >>
> > >> sorry, while fast-reading I thought it was about the old
> > >> LanguageIdentifier.
> > >>
> > >>> it is not possible to initialize the detector in setConf and then
> > >>> reuse it
> > >>
> > >> Could you explain why? The API/interface should allow to get an
> > >> instance and call loadModels(), or not?
> > >>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > >>>>> what
> > >>
> > >> Of course, that's also possible.
> > >> Or just add a plugin language-identifier-optimaize.
> > >>
> > >> Btw., I recently had a look at various open source language
> > >> identifier implementations and would prefer langid (a port from
> > >> Python/C) because it's faster and has better precision:
> > >> https://github.com/carrotsearch/langid-java.git
> > >> https://github.com/saffsd/langid.c.git
> > >> https://github.com/saffsd/langid.py.git
> > >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten
> > >> (but it's C++).
> > >>
> > >> Thanks,
> > >> Sebastian
> > >>
> > >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > >>> Hi Sebastian,
> > >>>
> > >>> Please reread the second paragraph of my email.
> > >>> In short, it is not possible to initialize the detector in setConf
> > >>> and then reuse it, and initializing it per call would be extremely
> > >>> slow.
> > >>>
> > >>> Yossi.
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >>>> Sent: 24 October 2017 12:41
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: Re: Usage of Tika LanguageIdentifier in
> > >>>> language-identifier plugin
> > >>>>
> > >>>> Hi Yossi,
> > >>>>
> > >>>> why not port it to use
> > >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> > >>>>
> > >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> > >>>>
> > >>>> Sebastian
> > >>>>
> > >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> The language-identifier plugin uses
> > >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> > >>>>> language from the document text. There are two issues with that:
> > >>>>>
> > >>>>> 1. LanguageIdentifier is deprecated in Tika.
> > >>>>> 2. It does not support CJK languages (and I suspect a lot of
> > >>>>> other languages - see
> > >>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> > >>>>> and it doesn't even fail gracefully with them - in my experience,
> > >>>>> Chinese was recognized as Italian.
> > >>>>>
> > >>>>> Since in Tika LanguageIdentifier was superseded by
> > >>>>> org.apache.tika.language.detect.LanguageDetector, it seems
> > >>>>> obvious to make that change in the plugin as well. However,
> > >>>>> because the design of LanguageDetector is terrible, the
> > >>>>> implementation is not reentrant, meaning the full language model
> > >>>>> would have to be reloaded on each call to the detector.
> > >>>>>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> > >>>>> Tika's LanguageDetector uses internally (at least by default). My
> > >>>>> question is whether that is a change that should be made to the
> > >>>>> official plugin.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Yossi.
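For reference, the handful of relevant lines Yossi mentions might look roughly like this against the optimaize API. This is a hedged sketch based on the optimaize language-detector README (the library Tika 1.16 wraps by default), not code taken from the actual patch:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.io.IOException;
import java.util.List;

public class OptimaizeDirectSketch {
    // Built once: the profiles are the expensive part, and the built
    // detector is stateless, so a single instance can be shared.
    private static final LanguageDetector DETECTOR;
    private static final TextObjectFactory TEXT_FACTORY =
            CommonTextObjectFactories.forDetectingOnLargeText();

    static {
        try {
            List<LanguageProfile> profiles =
                    new LanguageProfileReader().readAllBuiltIn();
            DETECTOR = LanguageDetectorBuilder.create(NgramExtractors.standard())
                    .withProfiles(profiles)
                    .build();
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static String identify(String text) {
        Optional<LdLocale> lang = DETECTOR.detect(TEXT_FACTORY.forText(text));
        return lang.isPresent() ? lang.get().getLanguage() : "unknown";
    }
}
```

Because the shared detector is stateless, this avoids both the per-call model reload of Tika's LanguageDetector and the need for a ThreadLocal.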