(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> Wait, StandardTokenizer already handles CJK and will put each CJK char into
> its own token? Really? I had no idea! Is that documented anywhere, or do you
> just have to look at the source to see it?
>
Yes, you are right, the documentation should have been more explicit: in previous releases it doesn't say anything about how CJK is tokenized. But StandardTokenizer does tokenize CJK this way, emitting each character as its own token tagged with the "CJ" token type.

I think the documentation issue is "fixed" in branch_3x and trunk:

 * As of Lucene version 3.1, this class implements the Word Break rules from the
 * Unicode Text Segmentation algorithm, as specified in
 * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.

(from http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report, and then you know how it tokenizes text.

You can also just use this demo app to see how the new one works:
http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
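To make the pre-3.1 behavior concrete, here is a rough Python sketch of what the old StandardTokenizer output looks like: each CJK character becomes its own "CJ"-typed token, while runs of Latin letters/digits are grouped. This is not Lucene's actual code, just a model of the observable behavior, and the CJK character ranges below are simplified (the real "CJ" character set differs slightly):

```python
def is_cjk(ch):
    # Simplified check: CJK Unified Ideographs plus Hiragana/Katakana.
    # StandardTokenizer's real "CJ" range covers more blocks than this.
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF    # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF)  # Hiragana + Katakana

def tokenize(text):
    """Emit (token, type) pairs: each CJK char alone as type "CJ",
    runs of other alphanumerics grouped as "ALPHANUM" -- a sketch of
    StandardTokenizer's pre-3.1 behavior, not its actual implementation."""
    tokens, word = [], []
    for ch in text:
        if is_cjk(ch):
            if word:                      # flush any pending Latin run
                tokens.append((''.join(word), 'ALPHANUM'))
                word = []
            tokens.append((ch, 'CJ'))     # one token per CJK character
        elif ch.isalnum():
            word.append(ch)
        else:                             # whitespace/punctuation ends a token
            if word:
                tokens.append((''.join(word), 'ALPHANUM'))
                word = []
    if word:
        tokens.append((''.join(word), 'ALPHANUM'))
    return tokens

print(tokenize("Lucene搜索引擎"))
# each of 搜 索 引 擎 comes out as its own "CJ" token
```

Running it on the mixed string "Lucene搜索引擎" yields one ALPHANUM token for "Lucene" followed by four single-character CJ tokens, which is exactly why downstream filters (or a query-time n-gram step) have to do any real CJK word grouping themselves.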