Re: Is StandardAnalyzer good enough for multi languages...

Steve Rowe Tue, 08 Jan 2013 22:26:15 -0800

Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) 
specializations by default for Thai, Lao, Myanmar and Hebrew.  See 
DefaultICUTokenizerConfig.  It's filled with exactly the opposite of what you 
were describing.

ICUTokenizerFactory's customizability has been enhanced in to-be-released 
Lucene/Solr 4.1: <https://issues.apache.org/jira/browse/SOLR-4123> - you can 
provide per-script RuleBasedBreakIterator specification files at runtime. 
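
A rough sketch of what that might look like in a Solr schema, assuming the
"rulefiles" attribute from SOLR-4123 (comma-separated pairs of a four-letter
ISO 15924 script code and a resource path).  The field type name and the
.rbbi file names here are made up for illustration:

```xml
<!-- Hypothetical field type; only the tokenizer line matters here. -->
<fieldType name="text_icu_custom" class="solr.TextField">
  <analyzer>
    <!-- Each entry maps a script code to a RuleBasedBreakIterator rules
         file; these file names are placeholders, not shipped files. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
```

Scripts without an entry should keep the defaults from
DefaultICUTokenizerConfig.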

On Jan 9, 2013, at 12:12 AM, Trejkaz <[email protected]> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <[email protected]> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be 
>> of interest to you, along with the token filters in that same module. - Steve
> 
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.
> 
