[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Stanislaw Osinski (JIRA) Wed, 01 Aug 2007 00:54:19 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893 ]


Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

When digging deeper into the issues of compatibility with the original 
StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look 
at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )

you'll see that, as explained in the comment, every other segment must have at 
least one digit. As I understand it, though, this rule should not match the 
above text as a whole (and with JFlex it doesn't). 
Below is the text split by punctuation characters, and it looks like there is 
no way of splitting this text into alternating segments, every second of which 
must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards / rules / rule02 . html , 7194 , 7227 , type
H               P A     P H      P A    P H    P A    P H?*   (starting from the beginning)
                                   H?*  P A    P H    P A     (starting from the end)

* (would have to be H, but there are no digits in the substrings "type" and "html")

I have no idea why JavaCC matched the whole text as a <NUM>; JFlex behaved 
"more correctly" here. 
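
To make the argument above easy to re-check, here is a minimal Python sketch of the strict <NUM> rule. It is not the JavaCC or JFlex grammar itself. It assumes that <P> is one of the characters _ - / . , and that <ALPHANUM> is a non-empty run of letters and digits, with <HAS_DIGIT> being such a run containing at least one digit. The function name matches_strict_num is made up for this illustration.

```python
import re


def matches_strict_num(text):
    """Check the 'every other segment must have a digit' rule on a whole string.

    Assumptions (not taken from the actual grammar files): <P> is one of
    "_-/.,", <ALPHANUM> is a non-empty run of letters/digits, and <HAS_DIGIT>
    is such a run that contains at least one digit.
    """
    segs = re.split(r"[_\-/.,]", text)
    # Every alternative of the rule has at least two segments, and each
    # segment must be alphanumeric (an empty segment fails isalnum()).
    if len(segs) < 2 or not all(s.isalnum() for s in segs):
        return False
    # <ALPHANUM> also accepts segments with digits, so only the positions that
    # must be <HAS_DIGIT> constrain the match. Try both possible alignments:
    # digits required at even positions, or at odd positions.
    return any(
        all(any(c.isdigit() for c in s) for s in segs[start::2])
        for start in (0, 1)
    )


sample = "78academyawards/rules/rule02.html,7194,7227,type"
print(matches_strict_num(sample))
```

Under these assumptions neither alignment works for the sample text: "type" breaks one alignment and "rules" and "html" break the other, which agrees with the JFlex behaviour described above.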

Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be 
aware of most of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts 
along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type    (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what the JavaCC-based 
version would do. I will attach a corresponding patch in a second.
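
As a rough cross-check of the relaxed rule, the Python sketch below implements the intent stated in its comment ("there must be at least one segment with a digit") rather than the JFlex macro literally. It assumes that <P> is one of the characters _ - / . , and that segments are non-empty runs of letters and digits. The function name matches_relaxed_num is made up for this illustration.

```python
import re


def matches_relaxed_num(text):
    """Check the relaxed rule: at least one segment must contain a digit.

    Assumptions (not taken from the actual grammar files): <P> is one of
    "_-/.,", and every segment between punctuation characters must be a
    non-empty run of letters and digits.
    """
    segs = re.split(r"[_\-/.,]", text)
    if not all(s.isalnum() for s in segs):
        return False
    return any(c.isdigit() for s in segs for c in s)


samples = [
    "2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type",
    "78academyawards/rules/rule02.html,7194,7227,type",
    "978-0-94045043-1,86408,86424,type",
    "62.46,37004,37009,type",
]
for s in samples:
    print(s, matches_relaxed_num(s))
```

Under these assumptions all four sample texts are accepted as a whole, while a text with no digit in any segment (for example "rules/html") is rejected.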



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and some simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

