Hi Tim,

I'm working on a similar project with some differences, and maybe we can share our knowledge in this area:

1) I have no problem with the Chinese characters. You can try this link: http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD Solr can find the record even when the phrase 中国 (meaning "China") is in the middle of a sentence.

2) My problem relates more to other Asian languages, Thai and Arabic being two examples. I read at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters that solr.ICUTokenizerFactory can overcome this problem, and I am exploring that approach at the moment.

Simon.

On Sat, Jun 21, 2014 at 7:37 AM, T. Kuro Kurosaka <k...@healthline.com> wrote:
> On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
>
>> Let's say a predominantly English document contains a Chinese sentence.
>> If the English field uses the WhitespaceTokenizer with a basic
>> WordDelimiterFilter, the Chinese sentence could be tokenized as one big
>> token (if it doesn't have any punctuation, of course) and will be
>> effectively unsearchable... barring use of wildcards.
>
> In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
> generate a token per Han character, so they are searchable, though
> precision suffers. But in your scenario Chinese text is rare, so some
> precision loss may not be a real issue.
>
> Kuro
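For anyone following along, the ICUTokenizerFactory approach mentioned above is wired up in schema.xml as an analyzer on a field type. This is only a sketch; the field type name `text_icu` is my own choice, and note that ICU analysis lives in the analysis-extras contrib, so its jars must be on Solr's classpath first:

```xml
<!-- Hypothetical field type using the ICU tokenizer, which applies
     Unicode (UAX#29) word-break rules plus dictionary-based segmentation
     for scripts such as Thai that have no spaces between words. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- ICUFoldingFilterFactory normalizes case, accents, and width
         (useful for Arabic presentation forms and full-width Latin). -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, the Analysis screen in the Solr admin UI is a quick way to confirm that a Thai or mixed-script sentence is actually being split into multiple tokens rather than passing through as one.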