Dude. Go look. It allows for per-script specialization, with (non-UAX#29) specializations by default for Thai, Lao, Myanmar and Hebrew. See DefaultICUTokenizerConfig. It's filled with exactly the opposite of what you were describing.
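To make "per-script specialization" concrete, here is a rough, JDK-only sketch of the general idea: split the text into runs of a single script, then pick a segmentation strategy per run. It is not Lucene's code and does not use the ICU API. The class `ScriptRuns`, its `Run` type, and the `strategyFor` helper are hypothetical names made up for illustration, and the list of specialized scripts simply mirrors the ones named above.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy script-run splitter, NOT the ICUTokenizer implementation.
final class ScriptRuns {

    /** A stretch of text that belongs to a single script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }

        @Override
        public String toString() {
            return script + ":" + text;
        }
    }

    /** Splits input into runs; COMMON/INHERITED characters stay with the current run. */
    static List<Run> split(String input) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null;
        int start = 0;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            boolean neutral = s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current == null) {
                    current = s;                       // first real script seen
                } else if (s != current) {
                    runs.add(new Run(current, input.substring(start, i)));
                    start = i;                         // a new run begins here
                    current = s;
                }
            }
            i += Character.charCount(cp);
        }
        if (start < input.length()) {
            UnicodeScript last = (current == null) ? UnicodeScript.COMMON : current;
            runs.add(new Run(last, input.substring(start)));
        }
        return runs;
    }

    /** Hypothetical dispatch: which kind of rules a run would get. */
    static String strategyFor(UnicodeScript script) {
        switch (script) {
            case THAI:
            case LAO:
            case MYANMAR:
            case HEBREW:
                return "script-specific rules";        // scripts named in this message
            default:
                return "default UAX #29 rules";
        }
    }

    public static void main(String[] args) {
        for (Run run : split("abc \u0e01\u0e02 \u05d0\u05d1")) {
            System.out.println(run + " -> " + strategyFor(run.script));
        }
    }
}
```

The point of the sketch is only the shape of the design: word-break behavior is chosen per script run, so a script that UAX #29 handles poorly can get its own rules without affecting the rest of the text.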
ICUTokenizerFactory's customizability has been enhanced in the to-be-released Lucene/Solr 4.1: <https://issues.apache.org/jira/browse/SOLR-4123>. You can provide per-script RuleBasedBreakIterator specification files at runtime.

On Jan 9, 2013, at 12:12 AM, Trejkaz <trej...@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <sar...@gmail.com> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be
>> of interest to you, along with the token filters in that same module. - Steve
>
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.