Hi,

I remember reading on this list a while ago that Solr only tokenizes on
whitespace, even when using CJKAnalyzer. That would make Solr unusable for
Chinese or any other language that doesn't use whitespace as a word separator.

1) I remember reading about a workaround, but unfortunately I can't find the
post that mentioned it. Could someone give me pointers on how to address this
issue?

2) Suppose I have fixed this issue and have properly analyzed and indexed the
Chinese documents. My documents are in multiple languages, so I plan to use a
separate field for each language: text_en, text_zh, text_ja, text_fr, etc.,
with each field associated with the appropriate analyzer.
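
Concretely, I'm picturing something like this in schema.xml (just a sketch; I'm
assuming a CJK tokenizer factory along the lines of solr.CJKTokenizerFactory is
available in my Solr version, so please correct me if the class name is wrong):

  <fieldType name="text_cjk" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <!-- one field per language; Chinese and Japanese share the CJK type -->
  <field name="text_en" type="text"     indexed="true" stored="true"/>
  <field name="text_zh" type="text_cjk" indexed="true" stored="true"/>
  <field name="text_ja" type="text_cjk" indexed="true" stored="true"/>
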
My problem now is how to handle the query string. I don't know what language a
query is in, so I can't select the appropriate analyzer for it. If I just run
the standard analyzer over the query string, any query in Chinese won't be
tokenized correctly; as far as I understand, the standard analyzer breaks
Chinese text into single characters, which wouldn't match the bigrams that
CJKAnalyzer puts in the index. So would the whole system still work in this
case?
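
To make the concern concrete (field names are from the sketch above, and I'm
guessing at the query syntax): a query with an explicit field prefix would
presumably be analyzed with that field's analyzer,

  q=text_zh:网络搜索

but for an unqualified query string,

  q=网络搜索

I don't see how Solr would know to apply the CJK analysis instead of whatever
analyzer the default field has.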

Handling multi-language search must be a pretty common use case. What is the
recommended way of dealing with this problem?

Thanks.
Andy
