Hi Scott, I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively.
Steve On 07/18/2008 at 5:03 PM, Scott Smith wrote: > org.apache.lucene.analysis.cjk.CJKTokenizer is in the > "contrib" portion of lucene, so I'm not sure if this is the > right place to mention this or not. I was doing some > detailed analysis of how this tokenizer worked and noticed > the following behavior (which I would classify as a bug). > > If you pass the word "construccion" to the tokenizer, it returns a > single token: "construccion". That seems correct. If you pass the word > "construcción" to this tokenizer, it will generate three tokens: > "construcci", "ó", and "n". This is happens because the accented "o" is > not treated as a Latin-1 character. Splitting the word seems like a bug > and violates the "does a decent job for most European languages" > statement. > > The fix seems straight forward. I replaced the following 2 > lines (in the CJKTokenizer class): > > if ((ub == Character.UnicodeBlock.BASIC_LATIN) > || (ub == > Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)) > > With > > if ((ub == Character.UnicodeBlock.BASIC_LATIN) // chars 0x00-0x7f > || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) // > char 0x80-0xff > || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A) // > char 0x100-0x17f > || (ub == > Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)) > > Am I missing something or does this seem like a reasonable > thing to want to do? --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]