Hi Markus,

Can you please explain what you mean by "our parser"? I'm pretty sure the language-identifier plugin is not using Optimaize.
Thanks,
Yossi.

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 24 October 2017 15:25
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hello,
>
> Not sure what the problem is, but buried deep in our parser we also use
> Optimaize, previously lang-detect. We load the models once, inside a static
> block, and create a new Detector instance for every record we parse. This
> is very fast.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Sebastian Nagel <wastl.na...@googlemail.com>
> > Sent: Tuesday 24th October 2017 14:11
> > To: user@nutch.apache.org
> > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > plugin
> >
> > Hi Yossi,
> >
> > > does not separate the Detector object, which contains the model and
> > > should be reused, from the text writer object, which should be
> > > request specific.
> >
> > But shouldn't a call of reset() make it ready for re-use (the Detector
> > object including the writer)?
> >
> > But I agree that a reentrant function may be easier to integrate. Nutch
> > plugins also need to be thread-safe, esp. parsers and parse filters if
> > running in a multi-threaded parsing fetcher.
> > Without a reentrant function and without a 100% stateless detector, the
> > only way is to use a ThreadLocal instance of the detector. At first
> > glance, the optimaize detector seems to be stateless.
> >
> > > I chose optimaize mainly because Tika did. Using langid instead
> > > should be very simple, but the fact that the project has not seen a
> > > single commit in the last 4 years, and the usage numbers are also
> > > quite low, gives me pause...
> >
> > Of course, maintenance or community around a project is an important
> > factor. CLD2 is also not really maintained, plus the models are fixed;
> > no code is available to retrain them.
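The pattern discussed in the thread - load the models once, keep one detector instance per thread - can be sketched as follows. This is a minimal, self-contained illustration: `CostlyDetector` is a hypothetical stand-in for any detector whose model loading is expensive, not a real Nutch, Tika, or optimaize class.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a detector whose model loading is expensive.
// The real classes (Tika's LanguageDetector, optimaize's LanguageDetector)
// differ; this only illustrates the threading pattern.
class CostlyDetector {
    static final AtomicInteger LOADS = new AtomicInteger();

    CostlyDetector() {
        LOADS.incrementAndGet(); // pretend this loads the language models
    }

    String detect(String text) {
        // placeholder logic only; a real detector scores n-gram profiles
        return text.contains("bonjour") ? "fr" : "en";
    }
}

public class ThreadLocalDetectorDemo {
    // One detector per thread: created lazily on first use, then reused,
    // so a multi-threaded parsing fetcher pays the model-loading cost
    // once per thread instead of once per document.
    private static final ThreadLocal<CostlyDetector> DETECTOR =
            ThreadLocal.withInitial(CostlyDetector::new);

    static String identify(String text) {
        return DETECTOR.get().detect(text);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            identify("document number " + i); // many records, same thread
        }
        System.out.println(identify("bonjour le monde")); // fr
        System.out.println(identify("hello world"));      // en
        System.out.println(CostlyDetector.LOADS.get());   // 1
    }
}
```

With this shape, the per-document cost in a parse filter is just the `detect()` call; each parsing thread loads the models once.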
> >
> > > what I have done locally
> >
> > In any case, it would be great if you would open an issue on Jira and a
> > pull request on GitHub. Which way to go may be discussed further.
> >
> > Thanks,
> > Sebastian
> >
> >
> > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > Why not LanguageDetector: the API does not separate the Detector
> > > object, which contains the model and should be reused, from the text
> > > writer object, which should be request specific. The same API object
> > > instance contains references to both. In code terms, both loadModels()
> > > and addText() are non-static members of LanguageDetector.
> > >
> > > Developing another language-identifier-optimaize is basically what I
> > > have done locally, but it seems to me that having both in the Nutch
> > > repository would just be confusing for users. 99% of the code would
> > > also be duplicated (the relevant code is about 5 lines).
> > >
> > > I chose optimaize mainly because Tika did. Using langid instead should
> > > be very simple, but the fact that the project has not seen a single
> > > commit in the last 4 years, and the usage numbers are also quite low,
> > > gives me pause...
> > >
> > >
> > >> -----Original Message-----
> > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >> Sent: 24 October 2017 13:18
> > >> To: user@nutch.apache.org
> > >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > >> plugin
> > >>
> > >> Hi Yossi,
> > >>
> > >> sorry, while fast-reading I thought it was about the old
> > >> LanguageIdentifier.
> > >>
> > >>> it is not possible to initialize the detector in setConf and then
> > >>> reuse it
> > >>
> > >> Could you explain why? The API/interface should allow to get an
> > >> instance and call loadModels(), or not?
> > >>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > >>>>> what
> > >>
> > >> Of course, that's also possible.
> > >> Or just add a plugin language-identifier-optimaize.
> > >>
> > >> Btw., I recently had a look at various open source language
> > >> identifier implementations and would prefer langid (a port from
> > >> Python/C) because it's faster and has better precision:
> > >> https://github.com/carrotsearch/langid-java.git
> > >> https://github.com/saffsd/langid.c.git
> > >> https://github.com/saffsd/langid.py.git
> > >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten
> > >> (but it's C++).
> > >>
> > >> Thanks,
> > >> Sebastian
> > >>
> > >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > >>> Hi Sebastian,
> > >>>
> > >>> Please reread the second paragraph of my email.
> > >>> In short, it is not possible to initialize the detector in setConf
> > >>> and then reuse it, and initializing it per call would be extremely
> > >>> slow.
> > >>>
> > >>> Yossi.
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >>>> Sent: 24 October 2017 12:41
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: Re: Usage of Tika LanguageIdentifier in
> > >>>> language-identifier plugin
> > >>>>
> > >>>> Hi Yossi,
> > >>>>
> > >>>> why not port it to use
> > >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> > >>>>
> > >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> > >>>>
> > >>>> Sebastian
> > >>>>
> > >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> The language-identifier plugin uses
> > >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> > >>>>> language from the document text. There are two issues with that:
> > >>>>>
> > >>>>> 1. LanguageIdentifier is deprecated in Tika.
> > >>>>> 2. It does not support CJK languages (and I suspect a lot of
> > >>>>> other languages - see
> > >>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> > >>>>> and it doesn't even fail gracefully with them - in my experience,
> > >>>>> Chinese was recognized as Italian.
> > >>>>>
> > >>>>> Since in Tika LanguageIdentifier was superseded by
> > >>>>> org.apache.tika.language.detect.LanguageDetector, it seems
> > >>>>> obvious to make that change in the plugin as well. However,
> > >>>>> because the design of LanguageDetector is terrible, the
> > >>>>> implementation is not reentrant, meaning the full language model
> > >>>>> would have to be reloaded on each call to the detector.
> > >>>>>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> > >>>>> Tika's LanguageDetector uses internally (at least by default). My
> > >>>>> question is whether that is a change that should be made to the
> > >>>>> official plugin.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Yossi.
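For reference, the handful of relevant lines Yossi mentions might look roughly like this against the optimaize API. This is a hedged sketch based on the optimaize language-detector README (the library Tika 1.16 wraps by default), not code taken from the actual patch:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.io.IOException;
import java.util.List;

public class OptimaizeDirectSketch {
    // Built once: the profiles are the expensive part, and the built
    // detector is stateless, so a single instance can be shared.
    private static final LanguageDetector DETECTOR;
    private static final TextObjectFactory TEXT_FACTORY =
            CommonTextObjectFactories.forDetectingOnLargeText();

    static {
        try {
            List<LanguageProfile> profiles =
                    new LanguageProfileReader().readAllBuiltIn();
            DETECTOR = LanguageDetectorBuilder.create(NgramExtractors.standard())
                    .withProfiles(profiles)
                    .build();
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static String identify(String text) {
        Optional<LdLocale> lang = DETECTOR.detect(TEXT_FACTORY.forText(text));
        return lang.isPresent() ? lang.get().getLanguage() : "unknown";
    }
}
```

Because the shared detector is stateless, this avoids both the per-call model reload of Tika's LanguageDetector and the need for a ThreadLocal.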