Hi Otis,

It doesn't look like the last two options would work for me. So I guess my best bet is to ask the user to specify the language when they type in the query.
Once I get that information from the user, how do I dynamically pick an analyzer for the query string?

Thanks
Andy

--- On Tue, 3/15/11, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> Subject: Re: Tokenizing Chinese & multi-language search
> To: solr-user@lucene.apache.org
> Date: Tuesday, March 15, 2011, 11:51 PM
>
> Hi Andy,
>
> Is the "I don't know what language the query is in" something you could change by...
> - asking the user
> - deriving from HTTP request headers
> - identifying the query language (if queries are long enough and "texty")
> - ...
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
> > From: Andy <angelf...@yahoo.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, March 15, 2011 9:07:36 PM
> > Subject: Tokenizing Chinese & multi-language search
> >
> > Hi,
> >
> > I remember reading in this list a while ago that Solr will only tokenize on whitespace even when using CJKAnalyzer. That would make Solr unusable on Chinese or any other languages that don't use whitespace as a separator.
> >
> > 1) I remember reading about a workaround. Unfortunately I can't find the post that mentioned it. Could someone give me pointers on how to address this issue?
> >
> > 2) Let's say I have fixed this issue and have properly analyzed and indexed the Chinese documents. My documents are in multiple languages. I plan to use separate fields for documents in different languages: text_en, text_zh, text_ja, text_fr, etc. Each field will be associated with the appropriate analyzer.
> >
> > My problem now is how to deal with the query string. I don't know what language the query is in, so I won't be able to select the appropriate analyzer for the query string.
> > If I just use the standard analyzer on the query string, any query that's in Chinese won't be tokenized correctly. So would the whole system still work in this case?
> >
> > This must be a pretty common use case, handling multi-language search. What is the recommended way of dealing with this problem?
> >
> > Thanks.
> > Andy
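Since Solr ties each analyzer to a field in schema.xml, one way to "pick an analyzer" at query time is simply to route the query to the language-specific field the user indicated; Solr then applies that field's query analyzer automatically. A minimal client-side sketch of the idea, assuming the text_en/text_zh/text_ja/text_fr fields from this thread and a placeholder "text" fallback field (the helper name and URL are illustrative, not part of Solr's API):

```python
from urllib.parse import urlencode

# Map the user-supplied language code to the language-specific
# Solr field (each field has its own analyzer in schema.xml).
LANG_FIELDS = {
    "en": "text_en",
    "zh": "text_zh",
    "ja": "text_ja",
    "fr": "text_fr",
}

def build_query_params(user_query, lang):
    # Fall back to a hypothetical general "text" field when the
    # language isn't one we have a dedicated field for.
    field = LANG_FIELDS.get(lang, "text")
    # Scope the query to that field; Solr tokenizes it with the
    # field's configured query-time analyzer.
    return urlencode({"q": f"{field}:({user_query})"})

# e.g. a Chinese query the user tagged as "zh" would be sent to
# something like http://localhost:8983/solr/select?<params>
params = build_query_params("中文搜索", "zh")
```

The same routing could be done server-side with a per-language requestHandler, but mapping language to field in the client keeps the schema the single place where analyzers are configured.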