RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Markus Jelsma Tue, 24 Oct 2017 05:25:14 -0700

Hello,

Not sure what the problem is, but buried deep in our parser we also use 
Optimaize (previously lang-detect). We load the models once, inside a static 
block, and create a new Detector instance for every record we parse. This is 
very fast.
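
A rough sketch of that pattern, for illustration only. The class names 
SharedModels and RecordDetector are made up, and the toy stop-word lookup 
merely stands in for real language profiles; none of this is the Optimaize 
API. The point is only that the expensive load happens once at class 
initialization, while the per-record object holds nothing but request state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the language models; NOT the Optimaize API. */
final class SharedModels {
    /** Read-only after class initialization, so safe to share across threads. */
    static final Map<String, String[]> STOPWORDS;
    /** Counts how often the (normally expensive) load ran. */
    static int loadCount = 0;

    static {
        // In a real parser this is the slow step, e.g. reading profiles from disk.
        Map<String, String[]> m = new HashMap<>();
        m.put("en", new String[] {"the", "and", "is", "of"});
        m.put("de", new String[] {"der", "und", "ist", "das"});
        STOPWORDS = Collections.unmodifiableMap(m);
        loadCount++;
    }

    private SharedModels() {}
}

/** Cheap per-record object: holds only the text of the current record. */
final class RecordDetector {
    private final StringBuilder text = new StringBuilder();

    void append(String chunk) {
        text.append(chunk).append(' ');
    }

    /** Returns the language with the most stop-word hits, or "unknown". */
    String detect() {
        String[] tokens = text.toString().toLowerCase(Locale.ROOT).split("\\W+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, String[]> e : SharedModels.STOPWORDS.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                for (String word : e.getValue()) {
                    if (word.equals(token)) {
                        score++;
                    }
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}

final class StaticModelSketch {
    public static void main(String[] args) {
        String[] records = {
            "the cat is on the roof and the dog is asleep",
            "der Hund ist alt und das Haus ist neu"
        };
        for (String record : records) {
            RecordDetector detector = new RecordDetector(); // new instance per record
            detector.append(record);
            System.out.println(detector.detect());
        }
        System.out.println("model loads: " + SharedModels.loadCount);
    }
}
```

Because the shared map is never modified after the static block has run, no 
ThreadLocal is needed in this sketch; whether the same holds for a real 
detector depends on that detector being stateless.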


Regards,
Markus
 
-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 24th October 2017 14:11
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> > does not separate the Detector object, which contains the model and should 
> > be reused, from the
> > text writer object, which should be request specific.
> 
> But shouldn't a call of reset() make it ready for re-use (the Detector object 
> including the writer)?
> 
> But I agree that a reentrant function may be easier to integrate. Nutch 
> plugins also need to be
> thread-safe, esp. parsers and parse filters if running in a multi-threaded 
> parsing fetcher.
> Without a reentrant function and without a 100% stateless detector, the only 
> way is to use a
> ThreadLocal instance of the detector. At first glance, the Optimaize 
> detector seems to be stateless.
> 
> > I chose optimaize mainly because Tika did. Using langid instead should be 
> > very simple, but the
> > fact that the project has not seen a single commit in the last 4 years, and 
> > the usage numbers are
> > also quite low, gives me pause...
> 
> Of course, maintenance or community around a project is an important factor. 
> CLD2 is also not really
> maintained, plus the models are fixed, no code available to retrain them.
> 
> > what I have done locally
> 
> In any case, would be great if you would open an issue on Jira and a pull 
> request on github.
> Which way to go may be discussed further.
> 
> Thanks,
> Sebastian
> 
> 
> On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > Why not LanguageDetector: The API does not separate the Detector object, 
> > which contains the model and should be reused, from the text writer object, 
> > which should be request specific. The same API Object instance contains 
> > references to both. In code terms, both loadModels() and addText() are 
> > non-static members of LanguageDetector.
> > 
> > Developing another language-identifier-optimaize is basically what I have 
> > done locally, but it seems to me having both in the Nutch repository would 
> > just be confusing for users. 99% of the code would also be duplicated (the 
> > relevant code is about 5 lines).
> > 
> > I chose optimaize mainly because Tika did. Using langid instead should be 
> > very simple, but the fact that the project has not seen a single commit in 
> > the last 4 years, and the usage numbers are also quite low, gives me 
> > pause...
> > 
> > 
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 13:18
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> sorry, while skimming I thought it was about the old 
> >> LanguageIdentifier.
> >>
> >>> it is not possible to initialize the detector in setConf and then reuse it
> >>
> >> Could you explain why? The API/interface should allow getting an instance
> >> and calling loadModels(), shouldn't it?
> >>
> >>>>> For my needs, I have modified the plugin to use
> >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>
> >> Of course, that's also possible. Or just add a plugin
> >> language-identifier-optimaize.
> >>
> >> Btw., I recently had a look at various open source language identifier
> >> implementations and would prefer langid (a port from Python/C) because
> >> it's faster and has better precision:
> >>   https://github.com/carrotsearch/langid-java.git
> >>   https://github.com/saffsd/langid.c.git
> >>   https://github.com/saffsd/langid.py.git
> >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but 
> >> it's
> >> C++).
> >>
> >> Thanks,
> >> Sebastian
> >>
> >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> >>> Hi Sebastian,
> >>>
> >>> Please reread the second paragraph of my email.
> >>> In short, it is not possible to initialize the detector in setConf and 
> >>> then reuse it, and initializing it per call would be extremely slow.
> >>>
> >>>   Yossi.
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >>>> Sent: 24 October 2017 12:41
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier 
> >>>> plugin
> >>>>
> >>>> Hi Yossi,
> >>>>
> >>>> why not port it to use
> >>>>
> >>>>
> >>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> >>>>
> >>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>>>
> >>>> Sebastian
> >>>>
> >>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>>>> Hi
> >>>>>
> >>>>>
> >>>>>
> >>>>> The language-identifier plugin uses
> >>>>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>>>> language from the document text. There are two issues with that:
> >>>>>
> >>>>> 1.      LanguageIdentifier is deprecated in Tika.
> >>>>> 2.      It does not support CJK languages (and I suspect a lot of other
> >>>>> languages -
> >>>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> >>>>> and it doesn't even fail gracefully with them - in my experience
> >>>>> Chinese was recognized as Italian.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Since in Tika LanguageIdentifier was superseded by
> >>>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>>>> make that change in the plugin as well. However, because the design of
> >>>>> LanguageDetector is terrible, it makes the implementation not
> >>>>> reentrant, meaning the full language model would have to be reloaded
> >>>>> on each call to the detector.
> >>>>>
> >>>>>
> >>>>>
> >>>>> For my needs, I have modified the plugin to use
> >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>>>> Tika's LanguageDetector uses internally (at least by default). My
> >>>>> question is whether that is a change that should be made to the
> >>>>> official plugin.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>                Yossi.
> >>>>>
> >>>>>
> >>>
> >>>
> > 
> > 
> 
>
