Hi Otis,

It doesn't look like the last two options would work for me. So I guess my best bet is to ask the user to specify the language when they type in the query.
Once I get that information from the user, how do I dynamically pick an analyzer for the query string?

Thanks
Andy

--- On Tue, 3/15/11, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> Subject: Re: Tokenizing Chinese & multi-language search
> To: solr-user@lucene.apache.org
> Date: Tuesday, March 15, 2011, 11:51 PM
>
> Hi Andy,
>
> Is the "I don't know what language the query is in" something you could change by...
> - asking the user
> - deriving from HTTP request headers
> - identifying the query language (if queries are long enough and "texty")
> - ...
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
> > From: Andy <angelf...@yahoo.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, March 15, 2011 9:07:36 PM
> > Subject: Tokenizing Chinese & multi-language search
> >
> > Hi,
> >
> > I remember reading in this list a while ago that Solr will only tokenize on whitespace even when using CJKAnalyzer. That would make Solr unusable on Chinese or any other languages that don't use whitespace as a separator.
> >
> > 1) I remember reading about a workaround. Unfortunately I can't find the post that mentioned it. Could someone give me pointers on how to address this issue?
> >
> > 2) Let's say I have fixed this issue and have properly analyzed and indexed the Chinese documents. My documents are in multiple languages. I plan to use separate fields for documents in different languages: text_en, text_zh, text_ja, text_fr, etc. Each field will be associated with the appropriate analyzer.
> >
> > My problem now is how to deal with the query string. I don't know what language the query is in, so I won't be able to select the appropriate analyzer for the query string.
> > If I just use the standard analyzer on the query string, any query that's in Chinese won't be tokenized correctly. So would the whole system still work in this case?
> >
> > This must be a pretty common use case, handling multi-language search. What is the recommended way of dealing with this problem?
> >
> > Thanks.
> > Andy
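Since Solr ties each analyzer to a field in schema.xml, one way to "pick an analyzer" at query time is simply to route the query to the language-specific field the user indicated; Solr then applies that field's query analyzer automatically. A minimal client-side sketch of the idea, assuming the text_en/text_zh/text_ja/text_fr fields from this thread and a placeholder "text" fallback field (the helper name and URL are illustrative, not part of Solr's API):

```python
from urllib.parse import urlencode

# Map the user-supplied language code to the language-specific
# Solr field (each field has its own analyzer in schema.xml).
LANG_FIELDS = {
    "en": "text_en",
    "zh": "text_zh",
    "ja": "text_ja",
    "fr": "text_fr",
}

def build_query_params(user_query, lang):
    # Fall back to a hypothetical general "text" field when the
    # language isn't one we have a dedicated field for.
    field = LANG_FIELDS.get(lang, "text")
    # Scope the query to that field; Solr tokenizes it with the
    # field's configured query-time analyzer.
    return urlencode({"q": f"{field}:({user_query})"})

# e.g. a Chinese query the user tagged as "zh" would be sent to
# something like http://localhost:8983/solr/select?<params>
params = build_query_params("中文搜索", "zh")
```

The same routing could be done server-side with a per-language requestHandler, but mapping language to field in the client keeps the schema the single place where analyzers are configured.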