Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel Tue, 24 Oct 2017 05:11:30 -0700

Hi Yossi,

> does not separate the Detector object, which contains the model and should
> be reused, from the text writer object, which should be request specific.


But shouldn't a call to reset() make it ready for re-use (the Detector object
including the writer)?
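
To make the question concrete, here is a minimal sketch of the reset-and-reuse
pattern. It assumes a hypothetical ToyDetector stand-in, not the real Tika
class: only the method names loadModels(), addText() and reset() are taken
from this thread, while detect(), the two-word "model" and the ReusePlugin
class are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stateful detector (NOT the Tika class).
class ToyDetector {
    private Map<String, String> model;                       // expensive part, meant to be shared
    private final StringBuilder text = new StringBuilder();  // per-request state

    ToyDetector loadModels() {
        // Imagine a slow model load here; this toy model knows two words.
        model = new HashMap<>();
        model.put("the", "en");
        model.put("der", "de");
        return this;
    }

    void addText(CharSequence s) { text.append(s).append(' '); }

    // Clears the per-request text but keeps the loaded model.
    void reset() { text.setLength(0); }

    String detect() {
        for (String token : text.toString().toLowerCase().split("\\s+")) {
            String lang = model.get(token);
            if (lang != null) return lang;
        }
        return "unknown";
    }
}

// Hypothetical plugin-like class: load once, reset before every document.
class ReusePlugin {
    private ToyDetector detector;

    void setConf() { detector = new ToyDetector().loadModels(); }

    String identify(String content) {
        detector.reset();          // drop text left over from the previous call
        detector.addText(content);
        return detector.detect();
    }
}
```

In this sketch the model is loaded once in setConf() and every identify()
call starts from an empty buffer. It is only safe as long as a single thread
calls identify(), because the text buffer is shared instance state.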

But I agree that a reentrant function may be easier to integrate. Nutch
plugins also need to be thread-safe, esp. parsers and parse filters when
running in a multi-threaded parsing fetcher. Without a reentrant function and
without a 100% stateless detector, the only way is to use a ThreadLocal
instance of the detector. At first glance, the optimaize detector seems to be
stateless.
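
A rough sketch of the ThreadLocal approach, assuming a hypothetical
StatefulDetector stand-in (not a real library class) whose constructor
represents the expensive model load. The class names, the toy detection rule
and the instance counter are illustrative assumptions only.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stateful detector; the constructor stands in for a slow model load.
class StatefulDetector {
    static final AtomicInteger instances = new AtomicInteger();  // counts "model loads"
    private final StringBuilder buffer = new StringBuilder();    // per-request state

    StatefulDetector() { instances.incrementAndGet(); }

    void reset() { buffer.setLength(0); }

    void addText(CharSequence s) { buffer.append(s); }

    // Toy rule for illustration: text containing " und " counts as German.
    String detect() { return buffer.indexOf(" und ") >= 0 ? "de" : "en"; }
}

// Hypothetical filter: each thread lazily creates and then reuses its own detector.
class ThreadLocalLanguageFilter {
    private static final ThreadLocal<StatefulDetector> DETECTOR =
            ThreadLocal.withInitial(StatefulDetector::new);

    static String identify(String text) {
        StatefulDetector d = DETECTOR.get();  // this thread's private instance
        d.reset();
        d.addText(text);
        return d.detect();
    }
}
```

Each parsing thread pays the load cost once and never shares the text buffer
with another thread. The price is one model copy per thread, so a truly
stateless detector shared by all threads would use less memory.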

> I chose optimaize mainly because Tika did. Using langid instead should be
> very simple, but the fact that the project has not seen a single commit in
> the last 4 years, and the usage numbers are also quite low, gives me pause...

Of course, the maintenance of and community around a project are important
factors. CLD2 is also not really maintained, and its models are fixed: no
code is available to retrain them.

> what I have done locally

In any case, it would be great if you could open an issue on Jira and a pull
request on GitHub. Which way to go may be discussed further.

Thanks,
Sebastian


On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> Why not LanguageDetector: The API does not separate the Detector object, 
> which contains the model and should be reused, from the text writer object, 
> which should be request specific. The same API Object instance contains 
> references to both. In code terms, both loadModels() and addText() are 
> non-static members of LanguageDetector.
> 
> Developing another language-identifier-optimaize is basically what I have 
> done locally, but it seems to me having both in the Nutch repository would 
> just be confusing for users. 99% of the code would also be duplicated (the 
> relevant code is about 5 lines).
> 
> I chose optimaize mainly because Tika did. Using langid instead should be 
> very simple, but the fact that the project has not seen a single commit in 
> the last 4 years, and the usage numbers are also quite low, gives me pause...
> 
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 13:18
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> sorry, while skimming I thought it was about the old LanguageIdentifier.
>>
>>> it is not possible to initialize the detector in setConf and then reuse it
>>
>> Could you explain why? The API/interface should allow you to get an
>> instance and call loadModels(), or not?
>>
>>>>> For my needs, I have modified the plugin to use
>>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>
>> Of course, that's also possible. Or just add a plugin
>> language-identifier-optimaize.
>>
>> Btw., I recently had a look at various open source language identifier
>> implementations and would prefer langid (a port from Python/C) because
>> it's faster and has better precision:
>>   https://github.com/carrotsearch/langid-java.git
>>   https://github.com/saffsd/langid.c.git
>>   https://github.com/saffsd/langid.py.git
>> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but
>> it's C++).
>>
>> Thanks,
>> Sebastian
>>
>> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
>>> Hi Sebastian,
>>>
>>> Please reread the second paragraph of my email 😊.
>>> In short, it is not possible to initialize the detector in setConf and
>>> then reuse it, and initializing it per call would be extremely slow.
>>>
>>>     Yossi.
>>>
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>>>> Sent: 24 October 2017 12:41
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>>>
>>>> Hi Yossi,
>>>>
>>>> why not port it to use
>>>>
>>>>
>>>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>>>>
>>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>>>
>>>> Sebastian
>>>>
>>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>>>> Hi
>>>>>
>>>>>
>>>>>
>>>>> The language-identifier plugin uses
>>>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>>>> language from the document text. There are two issues with that:
>>>>>
>>>>> 1. LanguageIdentifier is deprecated in Tika.
>>>>> 2. It does not support CJK languages (and I suspect a lot of other
>>>>>    languages -
>>>>>    https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>>>>>    and it doesn't even fail gracefully with them - in my experience
>>>>>    Chinese was recognized as Italian.
>>>>>
>>>>>
>>>>>
>>>>> Since in Tika LanguageIdentifier was superseded by
>>>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>>>> make that change in the plugin as well. However, because the design of
>>>>> LanguageDetector is terrible, it makes the implementation not
>>>>> reentrant, meaning the full language model would have to be reloaded
>>>>> on each call to the detector.
>>>>>
>>>>>
>>>>>
>>>>> For my needs, I have modified the plugin to use
>>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>>>> Tika's LanguageDetector uses internally (at least by default). My
>>>>> question is whether that is a change that should be made to the
>>>>> official plugin.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>                Yossi.
>>>>>
>>>>>
>>>
>>>
> 
>
