RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi Markus,

Can you please explain what you mean by "our parser", because I'm pretty 
sure the language-identifier plugin is not using Optimaize.

Thanks,
Yossi.

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 24 October 2017 15:25
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hello,
> 
> Not sure what the problem is, but buried deep in our parser we also use
> Optimaize, previously lang-detect. We load models once, inside a static block,
> and create a new Detector instance for every record we parse. This is very
> fast.
> 
> Regards,
> Markus
> 
> -Original message-
> > From:Sebastian Nagel <wastl.na...@googlemail.com>
> > Sent: Tuesday 24th October 2017 14:11
> > To: user@nutch.apache.org
> > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > plugin
> >
> > Hi Yossi,
> >
> > > does not separate the Detector object, which contains the model and
> > > should be reused, from the text writer object, which should be request
> specific.
> >
> > But shouldn't a call of reset() make it ready for re-use (the Detector 
> > object
> including the writer)?
> >
> > But I agree that a reentrant function may be easier to integrate. Nutch
> > plugins also need to be thread-safe, esp. parsers and parse filters if
> > running in a multi-threaded parsing fetcher.
> > Without a reentrant function and without a 100% stateless detector,
> > the only way is to use a ThreadLocal instance of the detector. At first
> > glance, the optimaize detector seems to be stateless.
> >
> > > I chose optimaize mainly because Tika did. Using langid instead
> > > should be very simple, but the fact that the project has not seen a
> > > single commit in the last 4 years, and the usage numbers are also quite 
> > > low,
> gives me pause...
> >
> > Of course, maintenance or community around a project is an important
> > factor. CLD2 is also not really maintained, plus the models are fixed, no 
> > code
> available to retrain them.
> >
> > > what I have done locally
> >
> > In any case, would be great if you would open an issue on Jira and a pull
> request on github.
> > Which way to go may be discussed further.
> >
> > Thanks,
> > Sebastian
> >
> >
> > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > Why not LanguageDetector: The API does not separate the Detector object,
> which contains the model and should be reused, from the text writer object,
> which should be request specific. The same API Object instance contains
> references to both. In code terms, both loadModels() and addText() are non-
> static members of LanguageDetector.
> > >
> > > Developing another language-identifier-optimaize is basically what I have
> done locally, but it seems to me having both in the Nutch repository would 
> just
> be confusing for users. 99% of the code would also be duplicated (the relevant
> code is about 5 lines).
> > >
> > > I chose optimaize mainly because Tika did. Using langid instead should be
> very simple, but the fact that the project has not seen a single commit in 
> the last
> 4 years, and the usage numbers are also quite low, gives me pause...
> > >
> > >
> > >> -Original Message-
> > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >> Sent: 24 October 2017 13:18
> > >> To: user@nutch.apache.org
> > >> Subject: Re: Usage of Tika LanguageIdentifier in
> > >> language-identifier plugin
> > >>
> > >> Hi Yossi,
> > >>
> > >> sorry while fast-reading I've thought it's about the old 
> > >> LanguageIdentifier.
> > >>
> > >>> it is not possible to initialize the detector in setConf and then
> > >>> reuse it
> > >>
> > >> Could explain why? The API/interface should allow to get an
> > >> instance and call
> > >> loadModels() or not?
> > >>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > >>>>> what
> > >>
> > >> Of course, that's also possible. Or just add a plugin
> > >> language-identifier- optimaize.
> > >>
> > >> Btw., I recently had a look on various open source language
> > >

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Markus Jelsma
Hello,

Sorry, I didn't say that as a Nutch committer. Our parser at Openindex has 
Optimaize deep under the hood, and it is fast!

Regards,
Markus
 
-Original message-
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Tuesday 24th October 2017 14:46
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Markus,
> 
> Can you please explain what do you mean by "our parser", because I'm pretty 
> sure the language-identifier plugin is not using Optimaize.
> 
> Thanks,
>   Yossi.
> 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 24 October 2017 15:25
> > To: user@nutch.apache.org
> > Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> > 
> > Hello,
> > 
> > Not sure what the problem is, but buried deep in our parser we also use
> > Optimaize, previously lang-detect. We load models once, inside a static
> > block, and create a new Detector instance for every record we parse. This
> > is very fast.
> > 
> > Regards,
> > Markus
> > 
> > -Original message-----
> > > From:Sebastian Nagel <wastl.na...@googlemail.com>
> > > Sent: Tuesday 24th October 2017 14:11
> > > To: user@nutch.apache.org
> > > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > > plugin
> > >
> > > Hi Yossi,
> > >
> > > > does not separate the Detector object, which contains the model and
> > > > should be reused, from the text writer object, which should be request
> > specific.
> > >
> > > But shouldn't a call of reset() make it ready for re-use (the Detector 
> > > object
> > including the writer)?
> > >
> > > But I agree that a reentrant function may be easier to integrate. Nutch
> > > plugins also need to be thread-safe, esp. parsers and parse filters if
> > > running in a multi-threaded parsing fetcher.
> > > Without a reentrant function and without a 100% stateless detector,
> > > the only way is to use a ThreadLocal instance of the detector. At first
> > > glance, the optimaize detector seems to be stateless.
> > >
> > > > I chose optimaize mainly because Tika did. Using langid instead
> > > > should be very simple, but the fact that the project has not seen a
> > > > single commit in the last 4 years, and the usage numbers are also quite 
> > > > low,
> > gives me pause...
> > >
> > > Of course, maintenance or community around a project is an important
> > > factor. CLD2 is also not really maintained, plus the models are fixed, no 
> > > code
> > available to retrain them.
> > >
> > > > what I have done locally
> > >
> > > In any case, would be great if you would open an issue on Jira and a pull
> > request on github.
> > > Which way to go may be discussed further.
> > >
> > > Thanks,
> > > Sebastian
> > >
> > >
> > > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > > Why not LanguageDetector: The API does not separate the Detector object,
> > which contains the model and should be reused, from the text writer object,
> > which should be request specific. The same API Object instance contains
> > references to both. In code terms, both loadModels() and addText() are non-
> > static members of LanguageDetector.
> > > >
> > > > Developing another language-identifier-optimaize is basically what I 
> > > > have
> > done locally, but it seems to me having both in the Nutch repository would 
> > just
> > be confusing for users. 99% of the code would also be duplicated (the 
> > relevant
> > code is about 5 lines).
> > > >
> > > > I chose optimaize mainly because Tika did. Using langid instead should 
> > > > be
> > very simple, but the fact that the project has not seen a single commit in 
> > the last
> > 4 years, and the usage numbers are also quite low, gives me pause...
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > > >> Sent: 24 October 2017 13:18
> > > >> To: user@nutch.apache.org
> > > >> Subject: Re: Usage of Tika LanguageIdentifier in
> > > >> language-identifier plugin
> > > >>
> > > > >>

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Markus Jelsma
Hello,

Not sure what the problem is, but buried deep in our parser we also use 
Optimaize (previously lang-detect). We load the models once, inside a static 
block, and create a new Detector instance for every record we parse. This is 
very fast.
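The pattern described above (load the models once in a static block, then create a cheap detector per record) might look roughly like the sketch below. It uses only the JDK; the class `SharedModelSketch`, its toy word-list "profiles", and the nested `Detector` are hypothetical stand-ins for illustration and are not the Optimaize API or the Openindex parser code.

```java
import java.util.List;
import java.util.Map;

class SharedModelSketch {
    // Hypothetical stand-in for expensive language profiles, shared by all detectors.
    static final Map<String, List<String>> PROFILES;
    // Counts how often the (simulated) expensive load ran.
    static int loadCount = 0;

    static {
        // Runs once per class load; stands in for reading all profiles from disk.
        PROFILES = Map.of(
            "en", List.of("the", "and", "is"),
            "de", List.of("der", "und", "ist"));
        loadCount++;
    }

    // Cheap per-record object: it only reads the shared, immutable profiles.
    static final class Detector {
        String detect(String text) {
            String best = "unknown";
            int bestHits = 0;
            String[] words = text.toLowerCase().split("\\s+");
            for (Map.Entry<String, List<String>> profile : PROFILES.entrySet()) {
                int hits = 0;
                for (String word : words) {
                    if (profile.getValue().contains(word)) {
                        hits++;
                    }
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = profile.getKey();
                }
            }
            return best;
        }
    }

    // One new Detector per record; the profiles are never reloaded.
    static String parseRecord(String text) {
        return new Detector().detect(text);
    }
}
```

Because each `Detector` holds no mutable state of its own, creating one per record costs almost nothing, and the shared profiles can be read from several threads at once.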

Regards,
Markus
 
-Original message-
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 24th October 2017 14:11
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> > does not separate the Detector object, which contains the model and should 
> > be reused, from the
> > text writer object, which should be request specific.
> 
> But shouldn't a call of reset() make it ready for re-use (the Detector object 
> including the writer)?
> 
> But I agree that a reentrant function may be easier to integrate. Nutch
> plugins also need to be thread-safe, esp. parsers and parse filters if
> running in a multi-threaded parsing fetcher.
> Without a reentrant function and without a 100% stateless detector, the
> only way is to use a ThreadLocal instance of the detector. At first glance,
> the optimaize detector seems to be stateless.
> 
> > I chose optimaize mainly because Tika did. Using langid instead should be 
> > very simple, but the
> > fact that the project has not seen a single commit in the last 4 years, and 
> > the usage numbers are
> > also quite low, gives me pause...
> 
> Of course, maintenance or community around a project is an important factor. 
> CLD2 is also not really
> maintained, plus the models are fixed, no code available to retrain them.
> 
> > what I have done locally
> 
> In any case, would be great if you would open an issue on Jira and a pull 
> request on github.
> Which way to go may be discussed further.
> 
> Thanks,
> Sebastian
> 
> 
> On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > Why not LanguageDetector: The API does not separate the Detector object, 
> > which contains the model and should be reused, from the text writer object, 
> > which should be request specific. The same API Object instance contains 
> > references to both. In code terms, both loadModels() and addText() are 
> > non-static members of LanguageDetector.
> > 
> > Developing another language-identifier-optimaize is basically what I have 
> > done locally, but it seems to me having both in the Nutch repository would 
> > just be confusing for users. 99% of the code would also be duplicated (the 
> > relevant code is about 5 lines).
> > 
> > I chose optimaize mainly because Tika did. Using langid instead should be 
> > very simple, but the fact that the project has not seen a single commit in 
> > the last 4 years, and the usage numbers are also quite low, gives me 
> > pause...
> > 
> > 
> >> -Original Message-
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 13:18
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> sorry while fast-reading I've thought it's about the old 
> >> LanguageIdentifier.
> >>
> >>> it is not possible to initialize the detector in setConf and then reuse it
> >>
> >> Could explain why? The API/interface should allow to get an instance and 
> >> call
> >> loadModels() or not?
> >>
> >>>>> For my needs, I have modified the plugin to use
> >>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>
> >> Of course, that's also possible. Or just add a plugin language-identifier-
> >> optimaize.
> >>
> >> Btw., I recently had a look on various open source language identifier
> >> implementations would prefer
> >> langid (a port from Python/C) because it's faster and has a better 
> >> precision:
> >>   https://github.com/carrotsearch/langid-java.git
> >>   https://github.com/saffsd/langid.c.git
> >>   https://github.com/saffsd/langid.py.git
> >> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but 
> >> it's
> >> C++).
> >>
> >> Thanks,
> >> Sebastian
> >>
> >> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> >>> Hi Sebastian,
> >>>
> >>> Please reread the second paragraph of my email .
> >>> In short, it is not possible to initialize the detector in setConf and 
> >>> then reuse it,
>

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
Hi Yossi,

> does not separate the Detector object, which contains the model and should be 
> reused, from the
> text writer object, which should be request specific.

But shouldn't a call to reset() make it ready for reuse (the Detector object 
including the writer)?

But I agree that a reentrant function may be easier to integrate. Nutch plugins 
also need to be thread-safe, esp. parsers and parse filters if running in a 
multi-threaded parsing fetcher. Without a reentrant function and without a 100% 
stateless detector, the only way is to use a ThreadLocal instance of the 
detector. At first glance, the optimaize detector seems to be stateless.
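As a rough illustration of the ThreadLocal approach mentioned above, here is a minimal JDK-only sketch. `StatefulDetector` is a hypothetical stand-in for a detector that buffers text between `addText()` and `detect()`; its toy detection rule and the `instancesCreated` counter exist only to make the behaviour visible and do not reflect Tika's or Optimaize's actual implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ThreadLocalDetectorSketch {
    // Counts constructions, i.e. how often the (simulated) model load happens.
    static final AtomicInteger instancesCreated = new AtomicInteger();

    // Hypothetical stateful detector: not safe to share between threads.
    static final class StatefulDetector {
        private final StringBuilder buffer = new StringBuilder();

        StatefulDetector() {
            // Stands in for an expensive loadModels() call.
            instancesCreated.incrementAndGet();
        }

        void reset() {
            buffer.setLength(0);
        }

        void addText(String text) {
            buffer.append(text);
        }

        String detect() {
            // Toy rule for illustration only.
            return buffer.indexOf(" und ") >= 0 ? "de" : "en";
        }
    }

    // Each thread lazily gets its own detector and keeps it for later calls.
    private static final ThreadLocal<StatefulDetector> DETECTOR =
        ThreadLocal.withInitial(StatefulDetector::new);

    static String detect(String text) {
        StatefulDetector detector = DETECTOR.get();
        detector.reset(); // discard text left over from the previous document
        detector.addText(text);
        return detector.detect();
    }
}
```

The cost is one detector (and one model load) per parsing thread rather than per document, which is usually acceptable when the thread pool is small and long-lived.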

> I chose optimaize mainly because Tika did. Using langid instead should be 
> very simple, but the
> fact that the project has not seen a single commit in the last 4 years, and 
> the usage numbers are
> also quite low, gives me pause...

Of course, maintenance or community around a project is an important factor. 
CLD2 is also not really maintained, plus the models are fixed: there is no 
code available to retrain them.

> what I have done locally

In any case, it would be great if you would open an issue on Jira and a pull 
request on GitHub.
Which way to go may be discussed further.

Thanks,
Sebastian


On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> Why not LanguageDetector: The API does not separate the Detector object, 
> which contains the model and should be reused, from the text writer object, 
> which should be request specific. The same API Object instance contains 
> references to both. In code terms, both loadModels() and addText() are 
> non-static members of LanguageDetector.
> 
> Developing another language-identifier-optimaize is basically what I have 
> done locally, but it seems to me having both in the Nutch repository would 
> just be confusing for users. 99% of the code would also be duplicated (the 
> relevant code is about 5 lines).
> 
> I chose optimaize mainly because Tika did. Using langid instead should be 
> very simple, but the fact that the project has not seen a single commit in 
> the last 4 years, and the usage numbers are also quite low, gives me pause...
> 
> 
>> -Original Message-
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 13:18
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> sorry while fast-reading I've thought it's about the old LanguageIdentifier.
>>
>>> it is not possible to initialize the detector in setConf and then reuse it
>>
>> Could explain why? The API/interface should allow to get an instance and call
>> loadModels() or not?
>>
>>>>> For my needs, I have modified the plugin to use
>>>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>
>> Of course, that's also possible. Or just add a plugin language-identifier-
>> optimaize.
>>
>> Btw., I recently had a look on various open source language identifier
>> implementations would prefer
>> langid (a port from Python/C) because it's faster and has a better precision:
>>   https://github.com/carrotsearch/langid-java.git
>>   https://github.com/saffsd/langid.c.git
>>   https://github.com/saffsd/langid.py.git
>> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but 
>> it's
>> C++).
>>
>> Thanks,
>> Sebastian
>>
>> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
>>> Hi Sebastian,
>>>
>>> Please reread the second paragraph of my email .
>>> In short, it is not possible to initialize the detector in setConf and then 
>>> reuse it,
>> and initializing it per call would be extremely slow.
>>>
>>> Yossi.
>>>
>>>
>>>> -Original Message-
>>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>>>> Sent: 24 October 2017 12:41
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>>>
>>>> Hi Yossi,
>>>>
>>>> why not port it to use
>>>>
>>>>
>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
>>>> tector.html
>>>>
>>>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>>>
>>>> Sebastian
>>>>
>>>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>>>> Hi
>>>>>
>>>>>
>>>>>
>>>>> The language-identifier plugin uses
>>>>> 

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Why not LanguageDetector: The API does not separate the Detector object, which 
contains the model and should be reused, from the text writer object, which 
should be request specific. The same API Object instance contains references to 
both. In code terms, both loadModels() and addText() are non-static members of 
LanguageDetector.
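To make the distinction above concrete, the following JDK-only sketch shows the kind of separation being asked for: an immutable model object that is loaded once and shared, and a small request-specific object that holds the text. The names `Model`, `TextWriter` and `newWriter()` are hypothetical and chosen for illustration; they are not part of Tika or Optimaize, and the toy word lookup merely stands in for real detection.

```java
import java.util.Set;

class SeparatedDesignSketch {
    // Shared, immutable part: stands in for the loaded language model.
    static final class Model {
        private final Set<String> germanWords;

        Model(Set<String> germanWords) {
            this.germanWords = Set.copyOf(germanWords);
        }

        // Creating the request-specific part is cheap and loads nothing.
        TextWriter newWriter() {
            return new TextWriter(this);
        }
    }

    // Request-specific part: owns the text of exactly one document.
    static final class TextWriter {
        private final Model model;
        private final StringBuilder text = new StringBuilder();

        TextWriter(Model model) {
            this.model = model;
        }

        TextWriter append(String chunk) {
            text.append(chunk).append(' ');
            return this;
        }

        String detect() {
            for (String word : text.toString().toLowerCase().split("\\s+")) {
                if (model.germanWords.contains(word)) {
                    return "de";
                }
            }
            return "en";
        }
    }
}
```

With this split, a plugin could build the `Model` once at configuration time and call `newWriter()` per document, so no per-request state ever lives in the shared object.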

Developing another language-identifier-optimaize is basically what I have done 
locally, but it seems to me having both in the Nutch repository would just be 
confusing for users. 99% of the code would also be duplicated (the relevant 
code is about 5 lines).

I chose optimaize mainly because Tika did. Using langid instead should be very 
simple, but the fact that the project has not seen a single commit in the last 
4 years, and the usage numbers are also quite low, gives me pause...


> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> sorry while fast-reading I've thought it's about the old LanguageIdentifier.
> 
> > it is not possible to initialize the detector in setConf and then reuse it
> 
> Could explain why? The API/interface should allow to get an instance and call
> loadModels() or not?
> 
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> 
> Of course, that's also possible. Or just add a plugin language-identifier-
> optimaize.
> 
> Btw., I recently had a look on various open source language identifier
> implementations would prefer
> langid (a port from Python/C) because it's faster and has a better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
> 
> Thanks,
> Sebastian
> 
> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > Hi Sebastian,
> >
> > Please reread the second paragraph of my email .
> > In short, it is not possible to initialize the detector in setConf and then 
> > reuse it,
> and initializing it per call would be extremely slow.
> >
> > Yossi.
> >
> >
> >> -Original Message-
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 12:41
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> why not port it to use
> >>
> >>
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
> >> tector.html
> >>
> >> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>
> >> Sebastian
> >>
> >> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>> Hi
> >>>
> >>>
> >>>
> >>> The language-identifier plugin uses
> >>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>> language from the document text. There are two issues with that:
> >>>
> >>> 1.LanguageIdentifier is deprecated in Tika.
> >>> 2.It does not support CJK language (and I suspect a lot of other
> >>> languages -
> >>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
> >>> guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
> >>> with them - in my experience Chinese was recognized as Italian.
> >>>
> >>>
> >>>
> >>> Since in Tika LanguageIdentifier was superseded by
> >>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>> make that change in the plugin as well. However, because the design of
> >>> LanguageDetector is terrible, it makes the implementation not
> >>> reentrant, meaning the full language model would have to be reloaded
> >>> on each call to the detector.
> >>>
> >>>
> >>>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>> Tika's LanguageDetector uses internally (at least by default). My
> >>> question is whether that is a change that should be made to the official
> plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>Yossi.
> >>>
> >>>
> >
> >




Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
Hi Yossi,

Sorry, while reading quickly I thought it was about the old LanguageIdentifier.

> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? The API/interface should allow you to get an instance 
and call loadModels(), or not?

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin 
language-identifier-optimaize.

Btw., I recently had a look at various open source language identifier 
implementations and would prefer langid (a port from Python/C) because it's 
faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git
Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's 
C++).

Thanks,
Sebastian

On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> Hi Sebastian,
> 
> Please reread the second paragraph of my email .
> In short, it is not possible to initialize the detector in setConf and then 
> reuse it, and initializing it per call would be extremely slow.
> 
>   Yossi.
> 
> 
>> -Original Message-
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: 24 October 2017 12:41
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> why not port it to use
>>
>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
>> tector.html
>>
>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>
>> Sebastian
>>
>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>> Hi
>>>
>>>
>>>
>>> The language-identifier plugin uses
>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>> language from the document text. There are two issues with that:
>>>
>>> 1.  LanguageIdentifier is deprecated in Tika.
>>> 2.  It does not support CJK language (and I suspect a lot of other
>>> languages -
>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
>>> guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
>>> with them - in my experience Chinese was recognized as Italian.
>>>
>>>
>>>
>>> Since in Tika LanguageIdentifier was superseded by
>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>> make that change in the plugin as well. However, because the design of
>>> LanguageDetector is terrible, it makes the implementation not
>>> reentrant, meaning the full language model would have to be reloaded
>>> on each call to the detector.
>>>
>>>
>>>
>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>> Tika's LanguageDetector uses internally (at least by default). My
>>> question is whether that is a change that should be made to the official 
>>> plugin.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>Yossi.
>>>
>>>
> 
> 



RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi Sebastian,

Please reread the second paragraph of my email.
In short, it is not possible to initialize the detector in setConf and then 
reuse it, and initializing it per call would be extremely slow.

Yossi.


> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> why not port it to use
> 
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
> tector.html
> 
> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> 
> Sebastian
> 
> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > Hi
> >
> >
> >
> > The language-identifier plugin uses
> > org.apache.tika.language.LanguageIdentifier for extracting the
> > language from the document text. There are two issues with that:
> >
> > 1.  LanguageIdentifier is deprecated in Tika.
> > 2.  It does not support CJK language (and I suspect a lot of other
> > languages -
> > https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
> > guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
> > with them - in my experience Chinese was recognized as Italian.
> >
> >
> >
> > Since in Tika LanguageIdentifier was superseded by
> > org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> > make that change in the plugin as well. However, because the design of
> > LanguageDetector is terrible, it makes the implementation not
> > reentrant, meaning the full language model would have to be reloaded
> > on each call to the detector.
> >
> >
> >
> > For my needs, I have modified the plugin to use
> > com.optimaize.langdetect.LanguageDetector directly, which is what
> > Tika's LanguageDetector uses internally (at least by default). My
> > question is whether that is a change that should be made to the official 
> > plugin.
> >
> >
> >
> > Thanks,
> >
> >Yossi.
> >
> >




Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
Hi Yossi,

why not port it to use
   
http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html

The upgrade to Tika 1.16 is already in progress (NUTCH-2439).

Sebastian

On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> Hi
> 
>  
> 
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> 
> 1.  LanguageIdentifier is deprecated in Tika.
> 2.  It does not support CJK languages (and I suspect a lot of other
> languages -
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn't even fail gracefully with them - in my experience Chinese
> was recognized as Italian.
> 
>  
> 
> Since in Tika LanguageIdentifier was superseded by
> org.apache.tika.language.detect.LanguageDetector, it seems obvious to make
> that change in the plugin as well. However, because the design of
> LanguageDetector is terrible, it makes the implementation not reentrant,
> meaning the full language model would have to be reloaded on each call to
> the detector.
> 
>  
> 
> For my needs, I have modified the plugin to use
> com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
> LanguageDetector uses internally (at least by default). My question is
> whether that is a change that should be made to the official plugin. 
> 
>  
> 
> Thanks,
> 
>Yossi.
> 
>