Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) 
specializations by default for Thai, Lao, Myanmar and Hebrew.  See 
DefaultICUTokenizerConfig.  It's filled with exactly the opposite of what you 
were describing. 

ICUTokenizerFactory's customizability has been enhanced in to-be-released 
Lucene/Solr 4.1: <https://issues.apache.org/jira/browse/SOLR-4123> - you can 
provide per-script RuleBasedBreakIterator specification files at runtime. 
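
To make "per-script" concrete: as far as I understand it, ICUTokenizer first 
divides the input into runs of a single script and then picks a break iterator 
for each run. Below is a rough sketch of just that first step. It is not Lucene 
code: ScriptRuns and Run are hypothetical names, and it uses only the JDK's 
Character.UnicodeScript, so treat it as an illustration of the idea rather than 
how the icu module actually does it.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits text into single-script runs.
final class ScriptRuns {

    /** One maximal run of text belonging to a single script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /**
     * Splits s into script runs. COMMON and INHERITED code points (spaces,
     * punctuation, combining marks) stay attached to the run in progress.
     */
    static List<Run> split(String s) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run in progress, if known
        int start = 0;                // index where the run in progress begins
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            UnicodeScript sc = UnicodeScript.of(cp);
            boolean neutral = sc == UnicodeScript.COMMON
                    || sc == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current == null) {
                    current = sc; // first real script seen in this run
                } else if (sc != current) {
                    // Script changed: close the previous run, start a new one.
                    runs.add(new Run(current, s.substring(start, i)));
                    start = i;
                    current = sc;
                }
            }
            i += Character.charCount(cp);
        }
        if (start < s.length()) {
            // A run made only of neutral characters is reported as COMMON.
            UnicodeScript last = (current == null) ? UnicodeScript.COMMON : current;
            runs.add(new Run(last, s.substring(start)));
        }
        return runs;
    }
}
```

Each Run could then be handed to whatever segmenter suits its script (a 
dictionary-based one for Thai, say, and UAX#29 rules for Latin), which is the 
kind of per-script choice the tokenizer config exposes.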

On Jan 9, 2013, at 12:12 AM, Trejkaz <trej...@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <sar...@gmail.com> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be 
>> of interest to you, along with the token filters in that same module. - Steve
> 
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.
> 


