RE: StandardTokenizer and Korean grouping with alphanum

Steven A Rowe Mon, 22 Sep 2008 07:51:57 -0700

Hi Daniel,

On 09/22/2008 at 12:49 AM, Daniel Noll wrote:
> I have a question about Korean tokenisation.  Currently there
> is a rule in StandardTokenizerImpl.jflex which looks like this:
> 
> ALPHANUM   = ({LETTER}|{DIGIT}|{KOREAN})+


LUCENE-1126 <https://issues.apache.org/jira/browse/LUCENE-1126> changed 
StandaradTokenizerImpl.jflex, for trunk and the looming 2.4 release.  ALPHANUM 
now looks like:

  THAI       = [\u0E00-\u0E59]

  // basic word: a sequence of digits & letters
  // (includes Thai to enable ThaiAnalyzer to function)
  ALPHANUM   = ({LETTER}|{THAI}|[:digit:])+

  // From the JFlex manual: "the expression that matches everything of
  // <a> not matched by <b> is !(!<a>|<b>)"
  LETTER     = !(![:letter:]|{CJ})

In JFlex grammars, [:letter:] is translated to the set of chars that return 
true for Character.isLetter(); this includes Chinese, Japanese and Korean 
characters.  Similarly, [:digit:] -> Character.isDigit().  Although the grammar 
looks different, the result is the same: Korean characters are still grouped 
together with digits, as you noted, rather than like Chinese and Japanese, for 
which each character is separately tokenized.

Korean has been treated differently from Chinese and Japanese since LUCENE-461 
<https://issues.apache.org/jira/browse/LUCENE-461>.  The grouping of Hangul 
with digits was introduced in this issue.

> I'm wondering if there was some good reason why it isn't:
> 
> ALPHANUM   = (({LETTER}|{DIGIT})+|{KOREAN}+)

Since LUCENE-1126 removed separate handling for Korean, it would have to be 
re-introduced in order to make a change like this. 

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: StandardTokenizer and Korean grouping with alphanum

Reply via email to