Hi Tim,

I'm working on a similar project with some differences, and maybe we can share our knowledge in this area:

1) I have no problem with the Chinese characters. You can try this link: http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD Solr can find the record even when the phrase 中国 (meaning "China") is in the middle of a sentence.

2) My problem relates more to other Asian languages, Thai and Arabic being two examples. I read at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters that solr.ICUTokenizerFactory can overcome this problem, and I am exploring that approach at the moment.

Simon.

On Sat, Jun 21, 2014 at 7:37 AM, T. Kuro Kurosaka <k...@healthline.com> wrote:
> On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
>
>> Let's say a predominantly English document contains a Chinese sentence.
>> If the English field uses the WhitespaceTokenizer with a basic
>> WordDelimiterFilter, the Chinese sentence could be tokenized as one big
>> token (if it doesn't have any punctuation, of course) and will be
>> effectively unsearchable... barring use of wildcards.
>
> In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
> generate a token per Han character, so they are searchable, though
> precision suffers. But in your scenario Chinese text is rare, so some
> precision loss may not be a real issue.
>
> Kuro
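For anyone following along, the ICUTokenizerFactory approach mentioned above is wired up in schema.xml as an analyzer on a field type. This is only a sketch; the field type name `text_icu` is my own choice, and note that ICU analysis lives in the analysis-extras contrib, so its jars must be on Solr's classpath first:

```xml
<!-- Hypothetical field type using the ICU tokenizer, which applies
     Unicode (UAX#29) word-break rules plus dictionary-based segmentation
     for scripts such as Thai that have no spaces between words. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- ICUFoldingFilterFactory normalizes case, accents, and width
         (useful for Arabic presentation forms and full-width Latin). -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, the Analysis screen in the Solr admin UI is a quick way to confirm that a Thai or mixed-script sentence is actually being split into multiple tokens rather than passing through as one.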