Hi Daniel,
On 09/22/2008 at 12:49 AM, Daniel Noll wrote:
> I have a question about Korean tokenisation. Currently there
> is a rule in StandardTokenizerImpl.jflex which looks like this:
>
> ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+
LUCENE-1126 <https://issues.apache.org/jira/browse/LUCENE-1126> changed
StandaradTokenizerImpl.jflex, for trunk and the looming 2.4 release. ALPHANUM
now looks like:
THAI = [\u0E00-\u0E59]
// basic word: a sequence of digits & letters
// (includes Thai to enable ThaiAnalyzer to function)
ALPHANUM = ({LETTER}|{THAI}|[:digit:])+
// From the JFlex manual: "the expression that matches everything of
// <a> not matched by <b> is !(!<a>|<b>)"
LETTER = !(![:letter:]|{CJ})
In JFlex grammars, [:letter:] is translated to the set of chars that return
true for Character.isLetter(); this includes Chinese, Japanese and Korean
characters. Similarly, [:digit:] -> Character.isDigit(). Although the grammar
looks different, the result is the same: Korean characters are still grouped
together with digits, as you noted, rather than like Chinese and Japanese, for
which each character is separately tokenized.
Korean has been treated differently from Chinese and Japanese since LUCENE-461
<https://issues.apache.org/jira/browse/LUCENE-461>. The grouping of Hangul
with digits was introduced in this issue.
> I'm wondering if there was some good reason why it isn't:
>
> ALPHANUM = (({LETTER}|{DIGIT})+|{KOREAN}+)
Since LUCENE-1126 removed separate handling for Korean, it would have to be
re-introduced in order to make a change like this.
Steve
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]