(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> Wait, standardtokenizer already handles CJK and will put each CJK char into
> it's own token?  Really? I had no idea!  Is that documented anywhere, or you
> just have to look at the source to see it?
>

Yes, you are right: the documentation should have been more explicit.
In previous releases it doesn't say anything about how it tokenizes
CJK. But it does tokenize CJK this way, emitting each character as its
own token with the "CJ" token type.
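To make the behavior concrete, here is a toy sketch (in Python, not
Lucene code) that mimics what was just described: each CJK character
becomes its own token, while runs of ASCII letters and digits form
single tokens. The character ranges are a rough approximation for
illustration only, not the exact set the tokenizer uses.

```python
import re

# Rough CJK ranges (Han + kana) -- an approximation for illustration,
# not the exact character classes Lucene's grammar defines.
CJK = '[\u4e00-\u9fff\u3040-\u30ff]'

# Try a single CJK character first; otherwise take a run of ASCII
# letters/digits as one token.
TOKEN = re.compile(CJK + '|[0-9A-Za-z]+')

def toy_tokenize(text):
    """Emit each CJK char as its own token, other word runs whole."""
    return TOKEN.findall(text)

print(toy_tokenize("Lucene搜索引擎"))  # ['Lucene', '搜', '索', '引', '擎']
```

So a mixed string is split into one token per CJK character, which is
why CJK phrase search against an old StandardTokenizer index behaves
like a character-bigram-free, per-character match.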

I think the documentation issue is "fixed" in branch_3x and trunk:

 * As of Lucene version 3.1, this class implements the Word Break rules from the
 * Unicode Text Segmentation algorithm, as specified in
 * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
(from 
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report to learn exactly how it tokenizes text.
You can also just use this demo app to see how the new one works:
http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
