Re: using CJKTokenizerFactory for Japanese language

Koji Sekiguchi Thu, 11 Nov 2010 14:51:10 -0800

(10/11/12 1:49), Kumar Pandey wrote:

I am exploring support for the Japanese language in Solr.
Solr seems to provide CJKTokenizerFactory.
How useful is this module? Has anyone been using this in production for
Japanese language?


CJKTokenizer is used in a lot of places in Japan.

One shortfall it seems to have, from what I have been able to read up on, is
that it can generate a lot of false matches. For example, matching Kyoto when
searching for Tokyo, etc.


Yep, it is a well-known problem.
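
To illustrate the false-match problem, here is a minimal sketch in Python. It
is not the Lucene implementation; it only assumes that the tokenizer emits
overlapping two-character tokens (bigrams) for CJK text. The function names
cjk_bigrams and matches are hypothetical, chosen for this example. With that
assumption, the text 東京都 ("Tokyo Metropolis") yields the tokens 東京 (Tokyo)
and 京都 (Kyoto), so a query for one city can hit text about the other.

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens (assumed bigram behaviour)."""
    if len(text) < 2:
        # A single character (or empty string) cannot form a bigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query, document):
    """True if every bigram of the query occurs among the document's bigrams."""
    doc_tokens = set(cjk_bigrams(document))
    return all(token in doc_tokens for token in cjk_bigrams(query))


doc = "東京都"  # "Tokyo Metropolis"; the text is about Tokyo, not Kyoto
print(cjk_bigrams(doc))
print(matches("京都", doc))  # query for Kyoto against text about Tokyo
```

A morphological analyzer would instead split on word boundaries, so the token
京都 would not be produced from 東京都 in the first place.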

I did not see many questions related to this module, so I wonder if people
are actively using it.
If not, are there any other solutions on the market that are recommended by
Solr users?


You may want to look at morphological analyzers. There are several of them in
Japan.
Search Google for MeCab, Sen, or GoSen. Or, in Lucene, there is a patch for
a morphological-style analyzer:

https://issues.apache.org/jira/browse/LUCENE-2522

Koji

--
http://www.rondhuit.com/en/
