[ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106 ]
Steven Rowe commented on LUCENE-1126:
-------------------------------------

Yeah, I see this too.

The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the unpatched grammar's {LETTER} definition, which contains the huge range {{\u0100-\u1fff}}, much of which is not actually letters. The patched grammar instead substitutes the Unicode 3.0 {{Letter}} general category (via JFlex's [:letter:]), which excludes some characters in the Thai range: non-spacing marks, a currency symbol, numerals, etc.

ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize the contiguous text (i.e. without whitespace) provided by StandardTokenizer. The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets {{"\u0e17"}}, because {{\u0e35}} is a non-spacing mark, which the patched StandardTokenizer doesn't pass to ThaiWordFilter.

Because of this problem, I guess I'm -1 on applying the patch I provided.

One solution would be to switch from using the {{Letter}} general category to the derived property {{Alphabetic}}, which includes both general categories {{Letter}} and {{Mark}}. (see Annex C of [the Unicode Regular Expressions Technical Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties] under "alpha" for discussion of this). The current version of JFlex does not support Unicode property references in its syntax, though, so simplifying -- and correcting -- the grammar may have to wait for the next version of JFlex, which will support syntax like {{\p{Alphabetic}}}.
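To make the failure above concrete, here is a minimal Java sketch that checks the general categories of the characters in the failing test string and mimics a letter-only match. The class and method names (ThaiCategories, category, leadingRun) are hypothetical, and the loop is only a simplified stand-in for the grammar's LETTER rule, not the actual StandardTokenizer code.

```java
final class ThaiCategories {

    /** Short label for the general categories relevant to the Thai range. */
    static String category(char c) {
        switch (Character.getType(c)) {
            case Character.OTHER_LETTER:         return "Lo";
            case Character.NON_SPACING_MARK:     return "Mn";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            case Character.CURRENCY_SYMBOL:      return "Sc";
            default:                             return "other";
        }
    }

    /** True for the three Mark general categories (Mn, Mc, Me). */
    static boolean isMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /**
     * Returns the leading run of characters accepted by the rule.
     * With allowMarks == false only Character.isLetter() characters match
     * (roughly what [:letter:] does); with allowMarks == true, marks are
     * accepted as well.
     */
    static String leadingRun(String text, boolean allowMarks) {
        int end = 0;
        while (end < text.length()) {
            char c = text.charAt(end);
            boolean accepted = Character.isLetter(c) || (allowMarks && isMark(c));
            if (!accepted) {
                break;
            }
            end++;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String word = "\u0e17\u0e35\u0e48";
        for (char c : word.toCharArray()) {
            System.out.println(Integer.toHexString(c) + " " + category(c));
        }
        System.out.println(leadingRun(word, false).length()); // letter-only match
        System.out.println(leadingRun(word, true).length());  // letters plus marks
    }
}
```

With the letter-only rule the run stops at the first non-spacing mark, which matches the truncated token the failing test sees; accepting marks as well keeps the three characters together.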
> Simplify StandardTokenizer JFlex grammar
> ----------------------------------------
>
>                 Key: LUCENE-1126
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1126
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1126.patch
>
>
> Summary of thread entitled "Fullwidth alphanumeric characters, plus a
> question on Korean ranges" begun by Daniel Noll on java-user, and carried
> over to java-dev:
> On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> > I wish the tokeniser could just use Character.isLetter and
> > Character.isDigit instead of having to know all the ranges itself, since
> > the JRE already has all this information. Character.isLetter does
> > return true for CJK characters though, so the ranges would still come in
> > handy for determining what kind of letter they are. I don't suppose
> > JFlex has a way to do this...
> The DIGIT macro could be replaced by JFlex's predefined character class
> [:digit:], which has the same semantics as java.lang.Character.isDigit().
> Although JFlex's predefined character class [:letter:] (same semantics as
> java.lang.Character.isLetter()) includes CJK characters, there is a way to
> handle this using JFlex's regex negation syntax {{!}}. From [the JFlex
> documentation|http://jflex.de/manual.html]:
> bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is
> !(!{{a}}|{{b}})
> So to exclude CJ characters from the LETTER macro:
> {code}
> LETTER = ! ( ! [:letter:] | {CJ} )
> {code}
>
> Since [:letter:] includes all of the Korean ranges, there's no reason
> (AFAICT) to treat them separately; unlike Chinese and Japanese characters,
> which are individually tokenized, the Korean characters should participate in
> the same token boundary rules as all of the other letters.
>
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2
> supports, and Unicode 5.0, the latest version, and there are lots of new and
> modified letter and digit ranges. This stuff gets tweaked all the time, and
> I don't think Lucene should be in the business of trying to track it, or take
> a position on which Unicode version users' data should conform to.
> Switching to using JFlex's [:letter:] and [:digit:] predefined character
> classes ties (most of) these decisions to the user's choice of JVM version,
> and this seems much more reasonable to me than the current status quo.
> I will attach a patch shortly.
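The negation idiom quoted in the issue description, !(!a|b), is set difference by De Morgan's law: "a and not b". A rough Java sketch of the same predicate follows. The isCJ method here is a hypothetical stand-in built from a few Character.UnicodeBlock constants; it is not the grammar's actual {CJ} macro, whose ranges are not reproduced in this message.

```java
final class LetterMinusCJ {

    /**
     * Hypothetical approximation of the {CJ} macro: a handful of Chinese and
     * Japanese blocks chosen for illustration only.
     */
    static boolean isCJ(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Mirrors LETTER = ! ( ! [:letter:] | {CJ} ), i.e. letter AND NOT CJ. */
    static boolean isLetterNotCJ(char c) {
        boolean a = Character.isLetter(c); // same semantics as [:letter:]
        boolean b = isCJ(c);
        return !(!a || b);
    }

    public static void main(String[] args) {
        char[] samples = { 'a', '7', '\u4e2d', '\u3042', '\uac00' };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " " + isLetterNotCJ(c));
        }
    }
}
```

Because Character.isLetter() consults the running JVM's Unicode tables, the result for newly assigned letters follows the JVM version, which is the trade-off the description argues for. Hangul syllables such as U+AC00 stay in the letter set, in line with the point about not treating Korean ranges separately.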