[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893 ]
Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which the original StandardAnalyzer tokenized as a single <NUM>. If you look at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )

you'll see that, as explained in the comment, every other segment must have at least one digit. But according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't). Below is the text split by punctuation characters. It looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards / rules / rule02 . html , 7194 , 7227 , type
H               P A     P H      P A    P H    P A    P H?*  (starting from the beginning)
                                   H?*  P A    P H    P A    (starting from the end)

* (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>; JFlex behaved "more correctly" here. Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what the JavaCC-based version would do. I will attach a corresponding patch in a second.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.
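
For readers who want to check the segment argument from the comment by hand, here is a minimal sketch that transcribes both <NUM> definitions into java.util.regex patterns and tests whether a whole string matches. It is not the JFlex grammar from the patch. It assumes simplified ASCII-only stand-ins for ALPHANUM, HAS_DIGIT and P (the real grammar's character classes are wider), it uses a backtracking regex engine rather than a generated scanner, and the class and method names (NumRuleSketch, matchesStrict, matchesRelaxed) are made up for illustration.

```java
import java.util.regex.Pattern;

public class NumRuleSketch {
    // ASSUMPTION: ASCII-only approximations of the grammar's character classes.
    static final String AN = "[A-Za-z0-9]+";                  // ALPHANUM
    static final String HD = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*"; // HAS_DIGIT
    static final String P  = "[_\\-/.,]";                     // P (punctuation)

    // Concatenate sub-patterns inside a non-capturing group.
    static String seq(String... parts) {
        return "(?:" + String.join("", parts) + ")";
    }

    // One or more repetitions of the concatenated sub-patterns.
    static String plus(String... parts) {
        return seq(parts) + "+";
    }

    // Original rule: segments alternate, every other one must contain a digit.
    static final Pattern STRICT = Pattern.compile(String.join("|",
            seq(AN, P, HD),
            seq(HD, P, AN),
            seq(AN, plus(P, HD, P, AN)),
            seq(HD, plus(P, AN, P, HD)),
            seq(AN, P, HD, plus(P, AN, P, HD)),
            seq(HD, P, AN, plus(P, HD, P, AN))));

    // Relaxed rule, transcribed as written in the comment:
    // ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*
    static final String SEGMENTS = "(?:" + P + "(?:" + HD + "|" + AN + "))*";
    static final Pattern RELAXED = Pattern.compile(SEGMENTS + HD + SEGMENTS);

    // True only if the whole string is one match of the original rule.
    static boolean matchesStrict(String s) {
        return STRICT.matcher(s).matches();
    }

    // True only if the whole string is one match of the relaxed rule.
    static boolean matchesRelaxed(String s) {
        return RELAXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type",
            "78academyawards/rules/rule02.html,7194,7227,type",
            "978-0-94045043-1,86408,86424,type",
            "62.46,37004,37009,type"
        };
        for (String s : samples) {
            System.out.println(s + " strict=" + matchesStrict(s)
                    + " relaxed=" + matchesRelaxed(s));
        }
    }
}
```

Under these assumptions, the strict pattern rejects 78academyawards/rules/rule02.html,7194,7227,type as a whole because its seven segments cannot be arranged so that every second one has a digit, while the relaxed pattern accepts it. A whole-string regex match only approximates what a longest-match scanner would emit, so treat this as a way to reason about the rules, not as a replacement for running the analyzer tests.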