RE: Bug in CJKTokenizer

Steven A Rowe Fri, 18 Jul 2008 14:32:22 -0700

Hi Scott,

I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and 
LATIN_EXTENDED_ADDITIONAL?  AFAICT, among other things, these cover some 
eastern European languages and Vietnamese, respectively.
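
For illustration, here is a minimal sketch of what the broader check might
look like as a standalone helper. The class and method names
(LatinBlockCheck, isLatinLike) are hypothetical and are not part of
CJKTokenizer; only the Character.UnicodeBlock constants come from the JDK.
I have not tested this against the tokenizer itself.

```java
// Hypothetical standalone helper; not code taken from CJKTokenizer.
final class LatinBlockCheck {

    // Returns true if c belongs to one of the Unicode blocks that the
    // proposed condition would treat as Latin (non-CJK) text.
    static boolean isLatinLike(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.BASIC_LATIN                   // 0x0000-0x007f
            || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT            // 0x0080-0x00ff
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_A              // 0x0100-0x017f
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_B              // 0x0180-0x024f
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_ADDITIONAL     // 0x1e00-0x1eff
            || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static void main(String[] args) {
        // "construcción", with the accented o written as an escape so the
        // source file encoding does not matter.
        String word = "construcci\u00F3n";
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            System.out.println(c + " -> " + isLatinLike(c));
        }
    }
}
```

With a check like this, every character of "construcción" falls into an
accepted block, so the word would no longer be split at the accented "o".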


Steve

On 07/18/2008 at 5:03 PM, Scott Smith wrote:
> org.apache.lucene.analysis.cjk.CJKTokenizer is in the
> "contrib" portion of lucene, so I'm not sure if this is the
> right place to mention this or not.  I was doing some
> detailed analysis of how this tokenizer worked and noticed
> the following behavior (which I would classify as a bug).
> 
> If you pass the word "construccion" to the tokenizer, it returns a
> single token: "construccion".  That seems correct. If you pass the word
> "construcción" to this tokenizer, it will generate three tokens:
> "construcci", "ó", and "n".  This happens because the accented "o" is
> not treated as a Latin-1 character.  Splitting the word seems like a bug
> and violates the "does a decent job for most European languages"
> statement.
> 
> The fix seems straightforward.  I replaced the following 2
> lines (in the CJKTokenizer class):
> 
>             if ((ub == Character.UnicodeBlock.BASIC_LATIN)
>                  || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
> 
> With
> 
>             if ((ub == Character.UnicodeBlock.BASIC_LATIN)               // chars 0x00-0x7f
>                  || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT)    // chars 0x80-0xff
>                  || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A)      // chars 0x100-0x17f
>                  || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
> 
> Am I missing something or does this seem like a reasonable
> thing to want to do?
