October 2017 15:25
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hello,
>
> Not sure what the problem is, but buried deep in our parser we also use
> Optimaize, previously lang-detect. We load models once, inside a s
e.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Markus,
>
> Can you please explain what you mean by "our parser", because I'm pretty
> sure the language-identifier plugin is not using Optimaize.
>
> Thanks,
>
-
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 24th October 2017 14:11
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> > does not separate the Detector object, wh
at the project has not seen a single commit in
> the last 4 years, and the usage numbers are also quite low, gives me pause...
>
>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 13:18
>> To: u
stl.na...@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> Sorry, while fast-reading I thought it was about the old LanguageIdentifier.
>
> Yossi.
>
>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 12:41
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
stl.na...@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> why not port it to use
>
> http://tika.apache.org/1.16/api/org/apache/tika/languag
Hi Yossi,
why not port it to use
http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
Sebastian
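A note on the port suggested above: elsewhere in this thread it is mentioned that detector models are loaded only once. One common way to get that behaviour in Java is the initialization-on-demand holder idiom, sketched below. This is only an illustration under stated assumptions: the `Detector` interface, the `loadModels()` helper and the toy Han-script check are hypothetical stand-ins, not Tika or Optimaize types, and a real port would wrap whichever detector the plugin ends up using.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LanguageDetectorHolder {

    /** Hypothetical stand-in for a real detector; NOT a Tika or Optimaize type. */
    public interface Detector {
        String detect(String text);
    }

    /** Counts how often the (expensive) model loading actually runs. */
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    /** Hypothetical loader; a real one would read language profiles here. */
    private static Detector loadModels() {
        LOAD_COUNT.incrementAndGet();
        // Toy placeholder logic: report "zh" if any Han character is present,
        // otherwise "unknown". A real detector would use statistical models.
        return text -> text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                ? "zh" : "unknown";
    }

    /** The JVM initializes this class lazily and exactly once, thread-safely. */
    private static final class Holder {
        static final Detector INSTANCE = loadModels();
    }

    /** Returns the shared detector, loading the models on first use only. */
    public static Detector get() {
        return Holder.INSTANCE;
    }
}
```

Callers would use `LanguageDetectorHolder.get().detect(text)` for every document; the models are loaded on the first call and reused afterwards. Whether a single shared instance is safe depends on the thread-safety of the real detector, which this sketch does not address.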
On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> Hi
>
>
>
> The language-identifier plugin uses
>
Hi
The language-identifier plugin uses
org.apache.tika.language.LanguageIdentifier for extracting the language from
the document text. There are two issues with that:
1. LanguageIdentifier is deprecated in Tika.
2. It does not support CJK languages (and I suspect a lot of other