[
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516752
]
Michael McCandless commented on LUCENE-966:
-------------------------------------------
I tracked down at least some of the differences between the JavaCC and
JFlex versions of StandardAnalyzer.
I think we should resolve these before committing.
I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs.
Here are the categories of differences that I saw:
* Only the type differs on a filename-like token:
OLD: (2004.jpg,34461,34469,type=<HOST>)
NEW: (2004.jpg,34461,34469,type=<NUM>)
In this case the old StandardAnalyzer called "2004.jpg" a HOST and
the new one calls it a NUM. Seems like neither one is right!
* Only the type differs on a number token:
OLD: (62.46,37004,37009,type=<HOST>)
NEW: (62.46,37004,37009,type=<NUM>)
The new tokenizer looks right here. I guess the decimal point
confuses the JavaCC (old) one.
* Different number of tokens produced for a number-like token:
OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
NEW: (978-0-94045043,86408,86422,type=<NUM>)
(1,86423,86424,type=<ALPHANUM>)
The new one split off the final "-1" as its own token, and called
it ALPHANUM not NUM. I think the old behavior is correct.
* Different number of tokens produced for filename:
OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
(html,7223,7227,type=<ALPHANUM>)
I think the old one is better, though it should not be called a
NUM (maybe we need a new "FILENAME" token type?).
* Same as above, but split on final '_' instead of '.' ('-' also
shows this behavior):
OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
NEW: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
(ozabs,2118,2123,type=<ALPHANUM>)
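
For anyone who wants to repeat the comparison, here is a rough sketch of how the two token dumps could be diffed and sorted into the categories above. It is not the code I actually ran: the class name TokenDumpDiff, the TYPE/BOUNDS/EXTRA labels and the assumption that each token is dumped as one "(text,start,end,type=<TYPE>)" line are all made up for illustration, and it deliberately uses no Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares two token dumps of "(text,start,end,type=<TYPE>)" lines. */
final class TokenDumpDiff {

    /** One dumped token. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final String type;

        Tok(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // The text group is greedy so that commas inside the token text still parse;
    // the trailing start, end and type fields anchor the match.
    private static final Pattern LINE =
        Pattern.compile("\\((.*),(\\d+),(\\d+),type=(<[A-Z]+>)\\)");

    /** Parses one dump line; rejects anything that is not in the assumed format. */
    static Tok parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a token line: " + line);
        }
        return new Tok(m.group(1), Integer.parseInt(m.group(2)),
                       Integer.parseInt(m.group(3)), m.group(4));
    }

    /**
     * Walks both token lists in offset order. Tokens covering the same span are
     * compared by type; where the spans disagree, tokens are grouped on each side
     * until both sides end at the same offset, and the group is reported once.
     */
    static List<String> diff(List<Tok> oldToks, List<Tok> newToks) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < oldToks.size() && j < newToks.size()) {
            int i0 = i;
            int j0 = j;
            int oEnd = oldToks.get(i++).end;
            int nEnd = newToks.get(j++).end;
            // Extend whichever side ends earlier until the end offsets line up.
            while (oEnd != nEnd) {
                if (oEnd < nEnd && i < oldToks.size()) {
                    oEnd = oldToks.get(i++).end;
                } else if (nEnd < oEnd && j < newToks.size()) {
                    nEnd = newToks.get(j++).end;
                } else {
                    break; // the side that needed extending ran out of tokens
                }
            }
            Tok o = oldToks.get(i0);
            Tok n = newToks.get(j0);
            boolean sameSpan = (i - i0 == 1) && (j - j0 == 1)
                && o.start == n.start && oEnd == nEnd;
            if (!sameSpan) {
                out.add("BOUNDS at offset " + o.start + ": " + (i - i0)
                    + " old token(s) vs " + (j - j0) + " new token(s)");
            } else if (!o.type.equals(n.type)) {
                out.add("TYPE " + o.text + ": " + o.type + " -> " + n.type);
            }
        }
        // Whatever is left on one side has no counterpart on the other.
        if (i < oldToks.size()) {
            out.add("EXTRA " + (oldToks.size() - i) + " old token(s)");
        }
        if (j < newToks.size()) {
            out.add("EXTRA " + (newToks.size() - j) + " new token(s)");
        }
        return out;
    }
}
```

Fed the "2004.jpg" and "978-0-94045043-1" lines from above, diff() would report one TYPE entry (HOST vs NUM) and one BOUNDS entry (1 old token vs 2 new tokens at offset 86408), which matches the first and third categories in the list.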
> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.