[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Steven Rowe (JIRA) Wed, 03 Sep 2008 12:34:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106
 ]

Steven Rowe commented on LUCENE-1126:
-------------------------------------

Yeah, I see this too.

The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the 
unpatched grammar's {LETTER} definition, which contains the huge range 
{{\u0100-\u1fff}}, much of which does not consist of letters.  The patched 
grammar instead substitutes the Unicode 3.0 {{Letter}} general category (via 
JFlex's [:letter:]), which excludes some characters in the Thai range: 
non-spacing marks, a currency symbol, numerals, etc.

ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize 
the contiguous text (i.e. without whitespace) provided by StandardTokenizer.
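
As a rough illustration of that second segmentation step, here is a minimal sketch that splits a run of Thai text with java.text.BreakIterator. It is not ThaiWordFilter's actual code: the class name, the method name and the sample string are made up for the example, and the boundaries found depend on the Thai locale data available to the running JVM.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiBreakSketch {
    // Splits text at the word boundaries reported by a Thai-locale BreakIterator.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> segments = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Arbitrary sample that ends with the three characters from the failing test.
        String sample = "\u0e01\u0e32\u0e23\u0e17\u0e35\u0e48";
        for (String segment : segment(sample)) {
            System.out.println(segment);
        }
    }
}
```

The segments always concatenate back to the input, so whatever the upstream tokenizer drops (such as a combining mark) never reaches this step.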

The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets 
{{"\u0e17"}}, because {{\u0e35}} is a non-spacing mark, which the patched 
StandardTokenizer doesn't pass to ThaiWordFilter.
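
The category mismatch can be checked against the JDK's own Unicode tables. The following is only a sketch: the class name and the {{category}} helper are illustrative and not part of Lucene or the patch, and the sample characters beyond the three in the failing test (the baht sign and Thai digit zero) were picked to represent the "currency symbol" and "numerals" mentioned above.

```java
public class ThaiCategoryCheck {
    // Maps a char to a short label for the few general categories relevant here.
    static String category(char c) {
        switch (Character.getType(c)) {
            case Character.OTHER_LETTER:         return "Lo";
            case Character.NON_SPACING_MARK:     return "Mn";
            case Character.CURRENCY_SYMBOL:      return "Sc";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            default:                             return "other";
        }
    }

    public static void main(String[] args) {
        // U+0E17, U+0E35, U+0E48 come from the failing test; U+0E3F and U+0E50 are extra samples.
        char[] samples = { '\u0e17', '\u0e35', '\u0e48', '\u0e3f', '\u0e50' };
        for (char c : samples) {
            System.out.printf("U+%04X category=%s isLetter=%b%n",
                    (int) c, category(c), Character.isLetter(c));
        }
    }
}
```

Only the first sample is a letter as far as Character.isLetter() is concerned, which matches the truncated {{"\u0e17"}} token seen in the test.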

Because of this problem, I guess I'm -1 on applying the patch I provided.

One solution would be to switch from using the {{Letter}} general category to 
the derived property {{Alphabetic}}, which covers the {{Letter}} general 
category plus many (though not all) characters from the {{Mark}} categories 
(see Annex C of [the Unicode Regular Expressions Technical 
Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties] 
under "alpha" for discussion of this).  The current version of JFlex does not 
support Unicode property references in its syntax, though, so simplifying -- 
and correcting -- the grammar may have to wait for the next version of JFlex, 
which will support syntax like {{\p{Alphabetic}}}.
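
For comparison outside JFlex, later JDKs (Java 7 onward) expose the Alphabetic derived property through Character.isAlphabetic(int). The sketch below contrasts it with Character.isLetter() on the three characters from the failing test. The class and helper names are illustrative, and the results depend on the Unicode version of the running JVM, so marks are worth checking one by one rather than assuming they all carry the property.

```java
public class AlphabeticVsLetter {
    // True when a code point has the Alphabetic property but is not in a Letter category.
    static boolean gainedByAlphabetic(int codePoint) {
        return Character.isAlphabetic(codePoint) && !Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        // The three code points from the string the failing test expects.
        int[] samples = { 0x0E17, 0x0E35, 0x0E48 };
        for (int cp : samples) {
            System.out.printf("U+%04X isLetter=%b isAlphabetic=%b%n",
                    cp, Character.isLetter(cp), Character.isAlphabetic(cp));
        }
    }
}
```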

> Simplify StandardTokenizer JFlex grammar
> ----------------------------------------
>
>                 Key: LUCENE-1126
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1126
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1126.patch
>
>
> Summary of thread entitled "Fullwidth alphanumeric characters, plus a 
> question on Korean ranges" begun by Daniel Noll on java-user, and carried 
> over to java-dev:
> On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> > I wish the tokeniser could just use Character.isLetter and
> > Character.isDigit instead of having to know all the ranges itself, since
> > the JRE already has all this information.  Character.isLetter does
> > return true for CJK characters though, so the ranges would still come in
> > handy for determining what kind of letter they are.  I don't suppose
> > JFlex has a way to do this...
> The DIGIT macro could be replaced by JFlex's predefined character class 
> [:digit:], which has the same semantics as java.lang.Character.isDigit().
> Although JFlex's predefined character class [:letter:] (same semantics as 
> java.lang.Character.isLetter()) includes CJK characters, there is a way to 
> handle this using JFlex's regex negation syntax {{!}}.  From [the JFlex 
> documentation|http://jflex.de/manual.html]:
> bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is 
> !(!{{a}}|{{b}}) 
> So to exclude CJ characters from the LETTER macro:
> {code}
>     LETTER = ! ( ! [:letter:] | {CJ} )
> {code}
>  
> Since [:letter:] includes all of the Korean ranges, there's no reason 
> (AFAICT) to treat them separately; unlike Chinese and Japanese characters, 
> which are individually tokenized, the Korean characters should participate in 
> the same token boundary rules as all of the other letters.
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 
> supports, and Unicode 5.0, the latest version, and there are lots of new and 
> modified letter and digit ranges.  This stuff gets tweaked all the time, and 
> I don't think Lucene should be in the business of trying to track it, or take 
> a position on which Unicode version users' data should conform to.  
> Switching to using JFlex's [:letter:] and [:digit:] predefined character 
> classes ties (most of) these decisions to the user's choice of JVM version, 
> and this seems much more reasonable to me than the current status quo.
> I will attach a patch shortly.

