Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel Tue, 24 Oct 2017 03:18:23 -0700

Hi Yossi,

sorry, while skimming I thought it was about the old LanguageIdentifier.


> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? The API/interface should allow you to get an instance
and call loadModels() once, or not?
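
A rough sketch of the pattern I have in mind: load once in setConf, reuse
on every call. This is not the real Tika or Nutch API. Detector, ToyDetector
and LanguageFilterSketch are made-up stand-ins. Only the name loadModels()
is taken from the Tika interface; reset(), addText() and detect() are
assumed here for illustration. Whether the real detector can be reused this
way is exactly the open question.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a detector whose models are expensive to load. */
interface Detector {
    Detector loadModels();      // expensive; should run only once
    void reset();               // assumed: clears text accumulated so far
    void addText(String text);  // assumed: feeds text to the detector
    String detect();            // assumed: returns a language code
}

/** Toy implementation that counts model loads so reuse can be observed. */
final class ToyDetector implements Detector {
    static final AtomicInteger LOADS = new AtomicInteger();
    private final StringBuilder buffer = new StringBuilder();
    private boolean loaded;

    public Detector loadModels() {
        LOADS.incrementAndGet();
        loaded = true;
        return this;
    }

    public void reset() {
        buffer.setLength(0);
    }

    public void addText(String text) {
        buffer.append(text);
    }

    public String detect() {
        if (!loaded) {
            throw new IllegalStateException("models not loaded");
        }
        // Toy rule, not a real language model: any Han character means "zh".
        boolean hasHan = buffer.codePoints().anyMatch(
            cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
        return hasHan ? "zh" : "en";
    }
}

/** Sketch of a plugin that creates the detector once and reuses it. */
final class LanguageFilterSketch {
    private Detector detector;

    /** Called once per configuration: pay the model-loading cost here. */
    void setConf() {
        detector = new ToyDetector().loadModels();
    }

    /**
     * Called per document. Synchronized because the detector keeps
     * per-document state between addText() and detect().
     */
    synchronized String identify(String text) {
        detector.reset();
        detector.addText(text);
        return detector.detect();
    }
}
```

If reset() on the real implementation throws the models away, this pattern
breaks down and the plugin would need its own wrapper or a different library.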

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin 
language-identifier-optimaize.

Btw., I recently had a look at various open source language identifier
implementations and would prefer langid (a port from Python/C) because
it's faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git
Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's 
C++).

Thanks,
Sebastian

On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> Hi Sebastian,
> 
> Please reread the second paragraph of my email 😊.
> In short, it is not possible to initialize the detector in setConf and then 
> reuse it, and initializing it per call would be extremely slow.
> 
>       Yossi.
> 
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 12:41
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> why not port it to use
>>
>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>>
>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>
>> Sebastian
>>
>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>> Hi
>>>
>>>
>>>
>>> The language-identifier plugin uses
>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>> language from the document text. There are two issues with that:
>>>
>>> 1.  LanguageIdentifier is deprecated in Tika.
>>> 2.  It does not support CJK languages (and I suspect a lot of other
>>> languages -
>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>>> and it doesn't even fail gracefully with them - in my experience
>>> Chinese was recognized as Italian.
>>>
>>>
>>>
>>> Since in Tika LanguageIdentifier was superseded by
>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>> make that change in the plugin as well. However, because the design of
>>> LanguageDetector is terrible, it makes the implementation not
>>> reentrant, meaning the full language model would have to be reloaded
>>> on each call to the detector.
>>>
>>>
>>>
>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>> Tika's LanguageDetector uses internally (at least by default). My
>>> question is whether that is a change that should be made to the official 
>>> plugin.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>                Yossi.
>>>
>>>
> 
>
