On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom <[email protected]> wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for
> Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it
> does with CJK, which for C and J appears to be breaking into unigrams. Is
> this correct?
>
The Han ideographs are segmented into unigrams (this is the UAX#29 default behavior). I don't know off the top of my head what the rules are for Japanese kana...
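
To make "segmented into unigrams" concrete, here is a rough sketch of the output shape for Han text. It is hand-written against the JDK only and is not the ICUTokenizer or its rule set; the class name HanUnigramSketch and its tokenize method are made up for this example, and treating every other letter or digit run as one token is a simplifying assumption. Kana is deliberately not special-cased.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the "one token per Han ideograph"
// result described above. It does NOT call Lucene or ICU.
public class HanUnigramSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder(); // current non-Han letter/digit run

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                // A Han ideograph ends any pending run and becomes its own token.
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                // Assumption: other letters and digits are simply grouped into runs.
                run.appendCodePoint(cp);
            } else {
                // Whitespace and punctuation act as separators.
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Each ideograph is emitted separately; "Lucene" stays one token.
        System.out.println(tokenize("我是中国人 Lucene"));
    }
}
```

To see what the real tokenizer does with a given string, including kana, run the text through ICUTokenizer itself and inspect the terms it emits.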
