org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of 
Lucene, so I'm not sure if this is the right place to mention this or not.  I 
was doing some detailed analysis of how this tokenizer works and noticed the 
following behavior, which I would classify as a bug.

 

If you pass the word "construccion" to the tokenizer, it returns a single 
token: "construccion".  That seems correct.  If you pass the word 
"construcción", however, it generates three tokens: "construcci", "ó", and 
"n".  This happens because the accented "o" falls in the LATIN_1_SUPPLEMENT 
Unicode block, which the tokenizer does not treat as Latin text.  Splitting 
the word seems like a bug and violates the "does a decent job for most 
European languages" statement.
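To see why the split happens, you can check which Unicode block each character of the word belongs to.  A small standalone demo of my own (not part of the tokenizer):

```java
public class UnicodeBlockDemo {
    public static void main(String[] args) {
        // Every character except the accented 'ó' is in BASIC_LATIN;
        // 'ó' (U+00F3) falls in LATIN_1_SUPPLEMENT, which the tokenizer's
        // condition does not accept, so the token is broken at that character.
        for (char c : "construcci\u00F3n".toCharArray()) {
            System.out.println(c + " -> " + Character.UnicodeBlock.of(c));
        }
    }
}
```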

 

The fix seems straightforward.  I replaced the following two lines (in the 
CJKTokenizer class):

 

            if ((ub == Character.UnicodeBlock.BASIC_LATIN)
                 || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
 

with:

 

            if ((ub == Character.UnicodeBlock.BASIC_LATIN)               // chars 0x00-0x7f
                 || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT)    // chars 0x80-0xff
                 || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A)      // chars 0x100-0x17f
                 || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))

 

Am I missing something or does this seem like a reasonable thing to want to do?
