Hi, we see this too with Japanese, where just a few kanji can spoil the detection. The only solution I see is to create a better model.
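
Until a better model exists, one possible stopgap is to check the text for kana before trusting a zh-* result, since hiragana and katakana do not normally appear in Chinese text. The sketch below is only an illustration: `KanaCheck` and `containsKana` are hypothetical names, not part of Tika or the Optimaize detector, and it relies only on the JDK's `Character.UnicodeScript`.

```scala
import java.lang.Character.UnicodeScript

// Hypothetical helper, not part of Tika or Optimaize.
object KanaCheck {
  // True if the text has at least one hiragana or katakana character.
  // Kana live in the Basic Multilingual Plane, so a per-char scan is enough.
  def containsKana(text: String): Boolean =
    text.exists { ch =>
      val script = UnicodeScript.of(ch.toInt)
      script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA
    }
}
```

If the detector returns zh-CN or zh-TW and `KanaCheck.containsKana(text)` is true, the caller could treat the text as Japanese instead. This does not help with Japanese text written only in kanji, so it is a workaround rather than a fix for the model.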
Markus

-----Original message-----
> From: [email protected] <[email protected]>
> Sent: Friday 6th April 2018 12:51
> To: [email protected]
> Subject: Re: Tika detects short Japanese sentences as Chinese
>
> Hi Ken, yes it's OptimaizeLangDetector.
> Should I post it to optimaize mailing list?
>
> On 2018/04/05 18:42:25, Ken Krugler <[email protected]> wrote:
> > Hi Artur,
> >
> > Is the detector that you get back from getDefaultLanguageDetector the
> > OptimaizeLangDetector?
> >
> > -- Ken
> >
> >
> > > On Apr 3, 2018, at 2:55 AM, Artur Rashitov <[email protected]> wrote:
> > >
> > > Given the following code:
> > >
> > > val japanese = "私はガラスを食べられます。それは私を傷つけません。"
> > > LanguageDetector.getDefaultLanguageDetector.loadModels().detectAll(japanese)
> > >
> > > it produces [zh-CN: MEDIUM (0.579961), zh-TW: MEDIUM (0.405015)]
> > > And the same thing for many short Japanese sentences.
> > >
> > > Apache Tika 1.17
> >
> > --------------------------------------------
> > http://about.me/kkrugler
> > +1 530-210-6378
