Why not LanguageDetector: the API does not separate the detector object, which
contains the model and should be reused, from the text writer object, which
should be request-specific. A single instance of the API object holds references
to both. In code terms, both loadModels() and addText() are non-static members
of LanguageDetector.
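
As a rough illustration of the separation described above, here is a minimal,
self-contained sketch. Every name in it (ReuseSketch, LanguageModel,
TextRequest) and the toy stop-word "model" are hypothetical and made up for
illustration; none of this is the Tika or optimaize API. The only point is that
the expensive state is loaded once and shared, while the accumulated text lives
in a cheap per-request object.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types come from Tika or optimaize.
class ReuseSketch {

    /** Expensive, immutable state: built once (e.g. in setConf) and shared. */
    static final class LanguageModel {
        private final Map<String, Set<String>> stopWords;

        private LanguageModel(Map<String, Set<String>> stopWords) {
            this.stopWords = stopWords;
        }

        /** Stands in for the slow model-loading step (toy data only). */
        static LanguageModel load() {
            Map<String, Set<String>> m = new LinkedHashMap<>();
            m.put("en", Set.of("the", "and", "of", "is"));
            m.put("de", Set.of("der", "die", "und", "das"));
            return new LanguageModel(m);
        }

        /** Read-only lookup, so one instance can serve many requests. */
        String classify(CharSequence text) {
            String[] words = text.toString().toLowerCase(Locale.ROOT).split("\\s+");
            String best = "unknown";
            int bestHits = 0;
            for (Map.Entry<String, Set<String>> e : stopWords.entrySet()) {
                int hits = 0;
                for (String w : words) {
                    if (e.getValue().contains(w)) {
                        hits++;
                    }
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

    /** Cheap, mutable, per-request state: only the accumulated text. */
    static final class TextRequest {
        private final LanguageModel model;
        private final StringBuilder buffer = new StringBuilder();

        TextRequest(LanguageModel model) {
            this.model = model;
        }

        TextRequest addText(String s) {
            buffer.append(s).append(' ');
            return this;
        }

        String detect() {
            return model.classify(buffer);
        }
    }

    public static void main(String[] args) {
        LanguageModel model = LanguageModel.load(); // loaded once
        // Each document gets its own TextRequest; the model is reused.
        System.out.println(new TextRequest(model).addText("the cat and the dog").detect());
        System.out.println(new TextRequest(model).addText("der Hund und die Katze").detect());
    }
}
```

With an API shaped like this, the model could be created once in setConf and a
fresh TextRequest created per document, which is what a single object exposing
both loadModels() and addText() does not allow.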

Developing a separate language-identifier-optimaize plugin is basically what I
have done locally, but it seems to me that having both in the Nutch repository
would just be confusing for users. 99% of the code would also be duplicated
(the relevant code is about 5 lines).

I chose optimaize mainly because Tika did. Using langid instead should be very
simple, but the project has not seen a single commit in the last 4 years and
its usage numbers are quite low, which gives me pause...


> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> sorry, while skimming I thought it was about the old LanguageIdentifier.
> 
> > it is not possible to initialize the detector in setConf and then reuse it
> 
> Could you explain why? The API/interface should allow getting an instance and
> calling loadModels(), shouldn't it?
> 
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> 
> Of course, that's also possible. Or just add a plugin
> language-identifier-optimaize.
> 
> Btw., I recently had a look at various open source language identifier
> implementations and would prefer langid (a port from Python/C) because it's
> faster and has better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
> 
> Thanks,
> Sebastian
> 
> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > Hi Sebastian,
> >
> > Please reread the second paragraph of my email 😊.
> > In short, it is not possible to initialize the detector in setConf and then
> > reuse it, and initializing it per call would be extremely slow.
> >
> >     Yossi.
> >
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 12:41
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> why not port it to use
> >>
> >>
> >> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> >>
> >> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>
> >> Sebastian
> >>
> >> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>> Hi
> >>>
> >>>
> >>>
> >>> The language-identifier plugin uses
> >>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>> language from the document text. There are two issues with that:
> >>>
> >>> 1.        LanguageIdentifier is deprecated in Tika.
> >>> 2.        It does not support CJK languages (and I suspect a lot of other
> >>> languages -
> >>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> >>> and it doesn't even fail gracefully with them - in my experience
> >>> Chinese was recognized as Italian.
> >>>
> >>>
> >>>
> >>> Since in Tika LanguageIdentifier was superseded by
> >>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>> make that change in the plugin as well. However, because the design of
> >>> LanguageDetector is terrible, it makes the implementation not
> >>> reentrant, meaning the full language model would have to be reloaded
> >>> on each call to the detector.
> >>>
> >>>
> >>>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>> Tika's LanguageDetector uses internally (at least by default). My
> >>> question is whether that is a change that should be made to the official
> >>> plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>                Yossi.
> >>>
> >>>
> >
> >

