Hi Andy,

Is the "I don't know what language the query is in" something you could change by...

- asking the user
- deriving it from the HTTP request headers
- identifying the query language (if queries are long enough and "texty")
- ...
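A minimal sketch of the third option, identifying the query language so the query can be routed to the matching per-language field. This is a hypothetical heuristic, not a Solr feature: it only distinguishes CJK queries from Latin-script ones by Unicode block, and the field names `text_zh`/`text_en` are just the examples from the original mail. A real deployment would use a proper language-identification library.

```java
// Hypothetical sketch: pick the per-language field (and hence analyzer)
// for a query by inspecting its characters' Unicode blocks.
public class QueryLanguageGuesser {

    /** True if any code point falls in a CJK-related Unicode block. */
    static boolean containsCjk(String query) {
        return query.codePoints().anyMatch(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
        });
    }

    /** Route the query to a field name; field names are Andy's examples. */
    static String pickField(String query) {
        return containsCjk(query) ? "text_zh" : "text_en";
    }

    public static void main(String[] args) {
        System.out.println(pickField("solr tokenizer")); // text_en
        System.out.println(pickField("中文搜索"));         // text_zh
    }
}
```

This only covers the CJK-vs-Latin split; distinguishing, say, French from English queries needs real language identification, which is unreliable on very short queries (hence the "long enough and texty" caveat above).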
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Andy <angelf...@yahoo.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, March 15, 2011 9:07:36 PM
> Subject: Tokenizing Chinese & multi-language search
>
> Hi,
>
> I remember reading on this list a while ago that Solr will only tokenize on
> whitespace, even when using CJKAnalyzer. That would make Solr unusable on
> Chinese or any other language that doesn't use whitespace as a separator.
>
> 1) I remember reading about a workaround. Unfortunately I can't find the
> post that mentioned it. Could someone give me pointers on how to address
> this issue?
>
> 2) Let's say I have fixed this issue and have properly analyzed and indexed
> the Chinese documents. My documents are in multiple languages. I plan to use
> separate fields for documents in different languages: text_en, text_zh,
> text_ja, text_fr, etc. Each field will be associated with the appropriate
> analyzer.
>
> My problem now is how to deal with the query string. I don't know what
> language the query is in, so I won't be able to select the appropriate
> analyzer for the query string. If I just use the standard analyzer on the
> query string, any query that's in Chinese won't be tokenized correctly. So
> would the whole system still work in this case?
>
> This must be a pretty common use case, handling multi-language search. What
> is the recommended way of dealing with this problem?
>
> Thanks.
> Andy
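On the tokenization question quoted above: Lucene's CJKAnalyzer does not actually depend on whitespace for CJK text; it emits overlapping character bigrams, so adjacent pairs of ideographs become tokens. A standalone sketch of that idea (illustration only, not the actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the overlapping-bigram approach used for CJK text:
// with no whitespace to split on, every adjacent pair of characters
// becomes a token, so matching queries overlap on shared bigrams.
public class BigramSketch {

    /** Emit overlapping two-character tokens from a run of CJK text. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文搜索")); // [中文, 文搜, 搜索]
    }
}
```

The same bigramming must be applied at query time for matches to line up, which is exactly why picking the right analyzer for the query string (the problem in Andy's point 2) matters.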