RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Markus Jelsma Tue, 24 Oct 2017 05:51:14 -0700

Hello,

Sorry, I didn't say that as a Nutch committer. Our parser at Openindex has
Optimaize deep under the hood, and it is fast!
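
As a rough illustration of the pattern described in the quoted message below
(models loaded once in a static block, a fresh detector per record), here is a
minimal sketch. It does not use the Optimaize API: all class and method names
are hypothetical, and the "models" are toy marker words standing in for real
language profiles.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from Optimaize, Tika, or Nutch.
final class LoadOnceDetectorSketch {

    // Counts how often the models were loaded, for illustration only.
    static int modelLoads = 0;

    // Shared, read-only "models": language code -> marker word (toy data).
    private static final Map<String, String> MODELS;

    static {
        // Stand-in for the expensive step; runs once, at class initialization.
        Map<String, String> m = new LinkedHashMap<>();
        m.put("en", " the ");
        m.put("nl", " het ");
        m.put("de", " der ");
        MODELS = Collections.unmodifiableMap(m);
        modelLoads++;
    }

    // Cheap per-record object: it only holds the text of one record.
    static final class Detector {
        private final StringBuilder text = new StringBuilder(" ");

        void append(String s) {
            text.append(s.toLowerCase(Locale.ROOT)).append(' ');
        }

        // Returns the language whose marker occurs most often, or "unknown".
        String detect() {
            String best = "unknown";
            int bestCount = 0;
            for (Map.Entry<String, String> e : MODELS.entrySet()) {
                int count = countOccurrences(text.toString(), e.getValue());
                if (count > bestCount) {
                    bestCount = count;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

    private static int countOccurrences(String haystack, String needle) {
        int count = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + 1)) {
            count++;
        }
        return count;
    }

    // New Detector per record; the shared models are never reloaded.
    static String detectLanguage(String recordText) {
        Detector detector = new Detector();
        detector.append(recordText);
        return detector.detect();
    }

    public static void main(String[] args) {
        System.out.println(detectLanguage("The cat sat on the mat"));
        System.out.println(detectLanguage("Het huis en het boek"));
        System.out.println("model loads: " + modelLoads);
    }
}
```

Because the shared map is immutable and each Detector is confined to one call,
detectLanguage() can be called from several parser threads without locking.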


Regards,
Markus
 
-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Tuesday 24th October 2017 14:46
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Markus,
> 
> Can you please explain what you mean by "our parser"? I'm pretty sure the
> language-identifier plugin is not using Optimaize.
> 
> Thanks,
>       Yossi.
> 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 24 October 2017 15:25
> > To: user@nutch.apache.org
> > Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> > 
> > Hello,
> > 
> > Not sure what the problem is, but buried deep in our parser we also use
> > Optimaize (previously lang-detect). We load the models once, inside a
> > static block, and create a new Detector instance for every record we
> > parse. This is very fast.
> > 
> > Regards,
> > Markus
> > 
> > -----Original message-----
> > > From:Sebastian Nagel <wastl.na...@googlemail.com>
> > > Sent: Tuesday 24th October 2017 14:11
> > > To: user@nutch.apache.org
> > > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > > plugin
> > >
> > > Hi Yossi,
> > >
> > > > > does not separate the Detector object, which contains the model and
> > > > > should be reused, from the text writer object, which should be
> > > > > request specific.
> > >
> > > > But shouldn't a call of reset() make it ready for re-use (the Detector
> > > > object including the writer)?
> > >
> > > > But I agree that a reentrant function may be easier to integrate. Nutch
> > > > plugins also need to be thread-safe, esp. parsers and parse filters if
> > > > running in a multi-threaded parsing fetcher.
> > > > Without a reentrant function and without a 100% stateless detector,
> > > > the only way is to use a ThreadLocal instance of the detector. At first
> > > > glance, the optimaize detector seems to be stateless.
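
The ThreadLocal approach mentioned above could look roughly like this. It is a
minimal sketch, assuming a detector that keeps state between calls:
StatefulDetector and its methods are hypothetical stand-ins, not the Tika or
Optimaize API, and the script check is a toy rule rather than real language
identification.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of one detector instance per thread.
final class ThreadLocalDetectorSketch {

    // Stand-in for a detector that is NOT safe to share between threads,
    // because it accumulates text in an internal buffer.
    static final class StatefulDetector {
        static final AtomicInteger created = new AtomicInteger();
        private final StringBuilder buffer = new StringBuilder();

        StatefulDetector() {
            // In a real detector the costly model loading would happen here.
            created.incrementAndGet();
        }

        void reset() {
            buffer.setLength(0);
        }

        void addText(String s) {
            buffer.append(s);
        }

        // Toy rule: reports whether the buffered text contains Han characters.
        String detect() {
            boolean han = buffer.codePoints()
                    .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
            return han ? "han-script" : "other";
        }
    }

    // Each thread lazily gets its own detector and keeps it for later calls.
    private static final ThreadLocal<StatefulDetector> DETECTOR =
            ThreadLocal.withInitial(StatefulDetector::new);

    static String detect(String text) {
        StatefulDetector detector = DETECTOR.get();
        detector.reset(); // clear whatever the previous record left behind
        detector.addText(text);
        return detector.detect();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            System.out.println(detect("\u4f60\u597d\u4e16\u754c"));
            System.out.println(detect("hello world"));
        };
        Thread first = new Thread(task);
        Thread second = new Thread(task);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("detectors created: " + StatefulDetector.created.get());
    }
}
```

The trade-off is one detector (and one copy of whatever it holds) per parsing
thread, which is only attractive when the detector cannot be shared safely.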
> > >
> > > > > I chose optimaize mainly because Tika did. Using langid instead
> > > > > should be very simple, but the fact that the project has not seen a
> > > > > single commit in the last 4 years, and the usage numbers are also
> > > > > quite low, gives me pause...
> > >
> > > > Of course, maintenance or community around a project is an important
> > > > factor. CLD2 is also not really maintained, and its models are fixed:
> > > > no code is available to retrain them.
> > >
> > > > what I have done locally
> > >
> > > > In any case, it would be great if you would open an issue on Jira and
> > > > a pull request on GitHub.
> > > > Which way to go may be discussed further.
> > >
> > > Thanks,
> > > Sebastian
> > >
> > >
> > > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > > > Why not LanguageDetector: The API does not separate the Detector
> > > > > object, which contains the model and should be reused, from the text
> > > > > writer object, which should be request specific. The same API Object
> > > > > instance contains references to both. In code terms, both
> > > > > loadModels() and addText() are non-static members of
> > > > > LanguageDetector.
> > > >
> > > > > Developing another language-identifier-optimaize is basically what I
> > > > > have done locally, but it seems to me having both in the Nutch
> > > > > repository would just be confusing for users. 99% of the code would
> > > > > also be duplicated (the relevant code is about 5 lines).
> > > >
> > > > > I chose optimaize mainly because Tika did. Using langid instead
> > > > > should be very simple, but the fact that the project has not seen a
> > > > > single commit in the last 4 years, and the usage numbers are also
> > > > > quite low, gives me pause...
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > > >> Sent: 24 October 2017 13:18
> > > >> To: user@nutch.apache.org
> > > >> Subject: Re: Usage of Tika LanguageIdentifier in
> > > >> language-identifier plugin
> > > >>
> > > >> Hi Yossi,
> > > >>
> > > > >> sorry, while reading quickly I thought this was about the old
> > > > >> LanguageIdentifier.
> > > >>
> > > >>> it is not possible to initialize the detector in setConf and then
> > > >>> reuse it
> > > >>
> > > > >> Could you explain why? The API/interface should allow getting an
> > > > >> instance and calling loadModels(), shouldn't it?
> > > >>
> > > >>>>> For my needs, I have modified the plugin to use
> > > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > > >>>>> what
> > > >>
> > > > >> Of course, that's also possible. Or just add a plugin
> > > > >> language-identifier-optimaize.
> > > >>
> > > > >> Btw., I recently had a look at various open source language
> > > > >> identifier implementations and would prefer langid (a port from
> > > > >> Python/C) because it's faster and has better precision:
> > > >>   https://github.com/carrotsearch/langid-java.git
> > > >>   https://github.com/saffsd/langid.c.git
> > > >>   https://github.com/saffsd/langid.py.git
> > > > >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is
> > > > >> unbeaten (but it's C++).
> > > >>
> > > >> Thanks,
> > > >> Sebastian
> > > >>
> > > >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > > >>> Hi Sebastian,
> > > >>>
> > > > >>> Please reread the second paragraph of my email.
> > > > >>> In short, it is not possible to initialize the detector in setConf
> > > > >>> and then reuse it, and initializing it per call would be extremely
> > > > >>> slow.
> > > >>>
> > > >>>       Yossi.
> > > >>>
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > > >>>> Sent: 24 October 2017 12:41
> > > >>>> To: user@nutch.apache.org
> > > >>>> Subject: Re: Usage of Tika LanguageIdentifier in
> > > >>>> language-identifier plugin
> > > >>>>
> > > >>>> Hi Yossi,
> > > >>>>
> > > > >>>> why not port it to use
> > > > >>>>
> > > > >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> > > >>>>
> > > >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> > > >>>>
> > > >>>> Sebastian
> > > >>>>
> > > >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > > >>>>> Hi
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> The language-identifier plugin uses
> > > >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> > > >>>>> language from the document text. There are two issues with that:
> > > >>>>>
> > > >>>>> 1.  LanguageIdentifier is deprecated in Tika.
> > > >>>>> 2.  It does not support CJK language (and I suspect a lot of other
> > > >>>>> languages -
> > > >>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implement
> > > >>>>> ed_Lan guages _and_their_ISO_636_Codes), and it doesn't even
> > > >>>>> fail gracefully with them - in my experience Chinese was
> > > >>>>> recognized as Italian.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Since in Tika LanguageIdentifier was superseded by
> > > >>>>> org.apache.tika.language.detect.LanguageDetector, it seems
> > > >>>>> obvious to make that change in the plugin as well. However,
> > > >>>>> because the design of LanguageDetector is terrible, it makes the
> > > >>>>> implementation not reentrant, meaning the full language model
> > > >>>>> would have to be reloaded on each call to the detector.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> For my needs, I have modified the plugin to use
> > > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > > >>>>> what Tika's LanguageDetector uses internally (at least by
> > > > >>>>> default). My question is whether that is a change that should be
> > > > >>>>> made to the official plugin.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>>                Yossi.
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >
> > > >
> > >
> > >
> 
>
