Alex,

Thank you for the quick response, and apologies for my delay. Yes, we'll use edismax. That won't solve the issue of multilingual documents, though...I don't think...unless we index every document as every language. Say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't contain any punctuation, of course) and will be effectively unsearchable, barring the use of wildcards.

So what we're looking for is a basic, reasonably reliable field configuration to handle all languages as a fallback. We were thinking, perhaps, the ICUTokenizer with the ICUFoldingFilter and maybe a multilingual stopword list. We do want the language-specific handling for most cases, and the basic langid + field-per-language setup with edismax will get us that. Any thoughts?
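For the record, a fallback fieldType along these lines might look like the sketch below. The ICU factories require Solr's analysis-extras module on the classpath; the field type name and the stopword file name (`stopwords_multilingual.txt`) are just placeholders for illustration, not anything agreed on yet:

```xml
<!-- Sketch of a catch-all multilingual field: ICU word segmentation
     (handles non-whitespace scripts) plus ICU case/accent folding,
     with an optional combined multilingual stopword list. -->
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_multilingual.txt"/>
  </analyzer>
</fieldType>
```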
Thank you, again.

Best,
Tim

I don't think the text_all field would work too well for a multilingual setup. Any reason you cannot use edismax to search over a bunch of fields instead?

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

________________________________
From: Allison, Timothy B.
Sent: Wednesday, June 18, 2014 9:31 PM
To: solr-user@lucene.apache.org
Subject: ICUTokenizer or StandardTokenizer or ??? for "text_all" type field that might include non-whitespace langs

All,

In one index I’m working with, the setup is the typical langid mapping to language-specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages, including non-whitespace languages. We’ll be using the ICUFoldingFilter in the analysis chain, but what should we use as the tokenizer for the “text_all” field? My inclination is to go with the ICUTokenizer. Are there any reasons to prefer the StandardTokenizer or another tokenizer for this field?

Thank you.

Best,
Tim
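The multi-field edismax setup Alex suggests could be baked into a request handler roughly as follows. This is only a sketch: the per-language field names (text_en, text_fr, text_zh) and the tie value are illustrative assumptions, not part of the thread:

```xml
<!-- Sketch: edismax searching the per-language fields populated by
     langid, with text_all as the catch-all fallback. Field names and
     the tie-breaker value are placeholders. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_en text_fr text_zh text_all</str>
    <str name="tie">0.1</str>
  </lst>
</requestHandler>
```

With a setup like this, a query is scored against whichever language-specific field matches best, while text_all catches terms (such as an embedded Chinese sentence in an English document) that the language-specific analysis would otherwise mangle.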