[ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361804 ]
Steven Rowe commented on LUCENE-478:
------------------------------------

There are six classes of issues:

1. A character range in StandardTokenizer.jj that is missing from John's
   list, and should be left as-is in StandardTokenizer.jj (in the <CJ>
   section):

   1.a. [ U+3100 - U+312F ] BoPoMoFo (a.k.a. ZhuYin): phonetic
        transcription symbols used in Taiwan; not used in mainland China.

2. A character range in StandardTokenizer.jj that is also in John's list,
   but in the <LETTER> section rather than in the <CJ> section, and should
   be left as-is:

   2.a. [ U+1100 - U+11FF ] Korean Jamo (phonetic symbols)

3. A character range in StandardTokenizer.jj that is not present in John's
   list, and that should be removed from the <KOREAN> section in
   StandardTokenizer.jj:

   3.a. [ U+D7A4 - U+D7AF ] Non-character range at the end of the
        pre-composed Hangul (Korean) block

4. A character range in John's list that is missing from
   StandardTokenizer.jj, but which was not present in Unicode 3.0, and so
   strictly should not be included when running on Java 1.4. Since this is
   a non-character range in Unicode 3.0, however, I think it should be
   included in StandardTokenizer.jj (in the <CJ> section) for future
   compatibility with Java 1.5 and Unicode 4.0:

   4.a. [ U+31F0 - U+31FF ] Japanese Katakana phonetic extensions; these
        were introduced in Unicode version 3.2 (see
        http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )

5. Character ranges in John's list that are missing from
   StandardTokenizer.jj, and that should be added to the newly re-labeled
   <CJ> section:

   5.a. [ U+FF65 - U+FF9F ] Half-width Japanese Katakana (phonetic symbols)

   5.b. [ U+3D2E - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
        CJK Ideograph Extension A. This range was introduced in
        Unicode 3.0.

6. A character range in John's list that is missing from
   StandardTokenizer.jj, and that should be added to the <LETTER> section,
   since it, like the [ U+1100 - U+11FF ] range already included there, is
   a range of Korean Jamo (phonetic symbols):

   6.a. [ U+FFA0 - U+FFDC ] Half-width Korean Jamo (phonetic symbols)

> CJK char list
> -------------
>
>          Key: LUCENE-478
>          URL: http://issues.apache.org/jira/browse/LUCENE-478
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>     Versions: 1.4
>     Reporter: John Wang
>     Priority: Minor
>
> Seems the character list in the CJK section of the StandardTokenizer.jj is
> not quite complete. Following is a more complete list:
> < CJK:                                          // non-alphabets
>       [
>        "\u1100"-"\u11ff",
>        "\u3040"-"\u30ff",
>        "\u3130"-"\u318f",
>        "\u31f0"-"\u31ff",
>        "\u3300"-"\u337f",
>        "\u3400"-"\u4dbf",
>        "\u4e00"-"\u9fff",
>        "\uac00"-"\ud7a3",
>        "\uf900"-"\ufaff",
>        "\uff65"-"\uffdc"
>       ]
>   >
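
For anyone who wants to sanity-check the boundaries in items 1 through 6, here is a rough Java sketch of how the proposed assignments would sort individual code points. It is not the tokenizer and not a patch: the class name CjkRangeSketch and the method sectionFor are made up for illustration, the table holds only the ranges named in the comment, and the KOREAN entry [ U+AC00 - U+D7A3 ] is an assumption taken from John's list rather than from StandardTokenizer.jj itself.

```java
/**
 * Illustration only: maps a code point to the token section that the comment
 * proposes for it. Covers just the ranges discussed in items 1 through 6,
 * so most letters (and most CJK ideographs) deliberately return null here.
 */
final class CjkRangeSketch {

    /** An inclusive code point range tagged with a section name. */
    static final class Range {
        final int start;
        final int end;
        final String section;

        Range(int start, int end, String section) {
            this.start = start;
            this.end = end;
            this.section = section;
        }

        boolean contains(int codePoint) {
            return codePoint >= start && codePoint <= end;
        }
    }

    static final Range[] RANGES = {
        new Range(0x3100, 0x312F, "CJ"),     // 1.a BoPoMoFo: leave as-is
        new Range(0x1100, 0x11FF, "LETTER"), // 2.a Korean Jamo: leave as-is
        // 3.a: U+D7A4 - U+D7AF is dropped. The start of this range is an
        // assumption taken from John's list, not from the grammar file.
        new Range(0xAC00, 0xD7A3, "KOREAN"),
        new Range(0x31F0, 0x31FF, "CJ"),     // 4.a Katakana phonetic extensions
        new Range(0xFF65, 0xFF9F, "CJ"),     // 5.a half-width Katakana
        new Range(0x3D2E, 0x4DB5, "CJ"),     // 5.b CJK Ideograph Extension A
        new Range(0xFFA0, 0xFFDC, "LETTER"), // 6.a half-width Korean Jamo
    };

    /** Returns the proposed section name, or null if no listed range matches. */
    static String sectionFor(int codePoint) {
        for (Range r : RANGES) {
            if (r.contains(codePoint)) {
                return r.section;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // One sample on each side of the boundaries the comment calls out.
        int[] samples = {0x312F, 0xD7A3, 0xD7A4, 0x4DB5, 0x4DB6, 0xFF9F, 0xFFA0};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %s", cp, sectionFor(cp)));
        }
    }
}
```

The point of the table is the two places where John's single ranges get split: "\uff65"-"\uffdc" becomes a <CJ> part and a <LETTER> part at U+FFA0, and "\u3400"-"\u4dbf" stops at U+4DB5 so the trailing non-characters are left out.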