Hi Andy,

Is the "I don't know what language the query is in" something you could change by...

- asking the user
- deriving it from the HTTP request headers
- identifying the query language (if queries are long enough and "texty")
- ...
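A minimal sketch of the third option, identifying the query language so the query can be routed to the matching per-language field. This is a hypothetical heuristic, not a Solr feature: it only distinguishes CJK queries from Latin-script ones by Unicode block, and the field names `text_zh`/`text_en` are just the examples from the original mail. A real deployment would use a proper language-identification library.

```java
// Hypothetical sketch: pick the per-language field (and hence analyzer)
// for a query by inspecting its characters' Unicode blocks.
public class QueryLanguageGuesser {

    /** True if any code point falls in a CJK-related Unicode block. */
    static boolean containsCjk(String query) {
        return query.codePoints().anyMatch(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
        });
    }

    /** Route the query to a field name; field names are Andy's examples. */
    static String pickField(String query) {
        return containsCjk(query) ? "text_zh" : "text_en";
    }

    public static void main(String[] args) {
        System.out.println(pickField("solr tokenizer")); // text_en
        System.out.println(pickField("中文搜索"));         // text_zh
    }
}
```

This only covers the CJK-vs-Latin split; distinguishing, say, French from English queries needs real language identification, which is unreliable on very short queries (hence the "long enough and texty" caveat above).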
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Andy <angelf...@yahoo.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, March 15, 2011 9:07:36 PM
> Subject: Tokenizing Chinese & multi-language search
>
> Hi,
>
> I remember reading on this list a while ago that Solr will only tokenize on
> whitespace, even when using CJKAnalyzer. That would make Solr unusable on
> Chinese or any other language that doesn't use whitespace as a separator.
>
> 1) I remember reading about a workaround. Unfortunately I can't find the
> post that mentioned it. Could someone give me pointers on how to address
> this issue?
>
> 2) Let's say I have fixed this issue and have properly analyzed and indexed
> the Chinese documents. My documents are in multiple languages. I plan to use
> separate fields for documents in different languages: text_en, text_zh,
> text_ja, text_fr, etc. Each field will be associated with the appropriate
> analyzer.
>
> My problem now is how to deal with the query string. I don't know what
> language the query is in, so I won't be able to select the appropriate
> analyzer for the query string. If I just use the standard analyzer on the
> query string, any query that's in Chinese won't be tokenized correctly. So
> would the whole system still work in this case?
>
> This must be a pretty common use case, handling multi-language search. What
> is the recommended way of dealing with this problem?
>
> Thanks.
> Andy
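On the tokenization question quoted above: Lucene's CJKAnalyzer does not actually depend on whitespace for CJK text; it emits overlapping character bigrams, so adjacent pairs of ideographs become tokens. A standalone sketch of that idea (illustration only, not the actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the overlapping-bigram approach used for CJK text:
// with no whitespace to split on, every adjacent pair of characters
// becomes a token, so matching queries overlap on shared bigrams.
public class BigramSketch {

    /** Emit overlapping two-character tokens from a run of CJK text. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文搜索")); // [中文, 文搜, 搜索]
    }
}
```

The same bigramming must be applied at query time for matches to line up, which is exactly why picking the right analyzer for the query string (the problem in Andy's point 2) matters.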