Re: ICUTokenizer and CJK

Robert Muir Tue, 23 Nov 2010 03:08:17 -0800

On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom <[email protected]> wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for 
> Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it 
> does with CJK, which for C and J appears to be breaking into unigrams. Is 
> this correct?
>


The Han ideographs are segmented into unigrams (this is the UAX #29
default behavior). I don't know off the top of my head what the rules
are for Japanese kana...
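
To illustrate what "segmented into unigrams" means in practice, here is a
minimal sketch in plain Java. It does not call ICUTokenizer and is not how
the tokenizer is implemented; it only mimics the described behavior for Han.
The class HanUnigramSketch and the method tokenize are hypothetical names,
and using Character.isIdeographic as the test for "Han ideograph" is an
assumption made for the example. Kana handling is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

public class HanUnigramSketch {

    // Hypothetical helper, for illustration only: every ideographic code
    // point becomes its own token, while runs of other letters/digits are
    // kept together as a single token. Everything else acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.isIdeographic(cp); // assumption: stands in for "Han ideograph"
            if (han || !Character.isLetterOrDigit(cp)) {
                // Flush any pending non-Han word before handling this code point.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (han) {
                    tokens.add(new String(Character.toChars(cp))); // one token per ideograph
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escapes spell four Han ideographs: U+4E2D U+6587 U+5206 U+8BCD.
        List<String> tokens = tokenize("Lucene \u4e2d\u6587\u5206\u8bcd");
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

With that input the sketch returns five tokens: "Lucene" followed by one
token for each of the four ideographs.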

