[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516752 ]
Michael McCandless commented on LUCENE-966:
-------------------------------------------

I tracked down at least some differences between the JavaCC vs JFlex
versions of StandardAnalyzer. I think we should resolve these before
committing.

I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs. Here are the categories of differences that I saw:

  * Only the type differs on a filename-like token:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>)

    In this case the old StandardAnalyzer called "2004.jpg" a HOST
    and the new one calls it a NUM. Seems like neither one is right!

  * Only the type differs on a number token:

      OLD: (62.46,37004,37009,type=<HOST>)
      NEW: (62.46,37004,37009,type=<NUM>)

    The new tokenizer looks right here. I guess the decimal point
    confuses the JavaCC (old) one.

  * Different number of tokens produced for number-like token:

      OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
      NEW: (978-0-94045043,86408,86422,type=<NUM>)
           (1,86423,86424,type=<ALPHANUM>)

    The new one split off the final "-1" as its own token, and called
    it ALPHANUM not NUM. I think the old behavior is correct.

  * Different number of tokens produced for filename:

      OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
      NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
           (html,7223,7227,type=<ALPHANUM>)

    I think the old one is better, though it should not be called a
    NUM (maybe we need a new "FILENAME" token type?).

  * Same as above, but split on final '_' instead of '.' ('-' also
    shows this behavior):

      OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
      NEW: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
           (ozabs,2118,2123,type=<ALPHANUM>)

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
>                      jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.
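
The comparison described in the comment (dump every token from both tokenizers, then diff the dumps) could be automated roughly as follows. This is only a sketch under stated assumptions: it does not call Lucene at all, it assumes each dumped token is on its own line in the (text,start,end,type=<TYPE>) form quoted in the comment, and the names TokenDumpDiff, Tok, parse and diff are hypothetical helpers that are not part of Lucene or of the attached patches. It separates the two kinds of difference reported in the comment: same span with a different type, and a different number of tokens covering the same span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Lucene or of the LUCENE-966 patch.
class TokenDumpDiff {

    // One dumped token, as printed in the comment: (text,start,end,type=<TYPE>)
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final String type;

        Tok(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Greedy (.*) lets the token text itself contain commas.
    private static final Pattern TOKEN =
        Pattern.compile("\\((.*),(\\d+),(\\d+),type=<([A-Z]+)>\\)");

    static List<Tok> parse(List<String> lines) {
        List<Tok> toks = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TOKEN.matcher(line.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("Bad token line: " + line);
            }
            toks.add(new Tok(m.group(1), Integer.parseInt(m.group(2)),
                             Integer.parseInt(m.group(3)), m.group(4)));
        }
        return toks;
    }

    // Walks both dumps in parallel and reports TYPE and COUNT differences.
    static List<String> diff(List<Tok> oldToks, List<Tok> newToks) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < oldToks.size() && j < newToks.size()) {
            Tok o = oldToks.get(i), n = newToks.get(j);
            if (o.start == n.start && o.end == n.end) {
                // Same span: only the type can differ.
                if (!o.type.equals(n.type)) {
                    out.add("TYPE " + o.text + ": " + o.type + " -> " + n.type);
                }
                i++;
                j++;
                continue;
            }
            // Spans disagree: consume tokens on the side that is behind
            // until the end offsets line up again.
            int i0 = i, j0 = j;
            int oEnd = o.end, nEnd = n.end;
            i++;
            j++;
            while (oEnd != nEnd) {
                if (oEnd < nEnd) {
                    if (i >= oldToks.size()) break;
                    oEnd = oldToks.get(i++).end;
                } else {
                    if (j >= newToks.size()) break;
                    nEnd = newToks.get(j++).end;
                }
            }
            out.add("COUNT " + (i - i0) + " old vs " + (j - j0)
                    + " new at offset " + o.start);
        }
        if (i < oldToks.size() || j < newToks.size()) {
            out.add("TRAILING " + (oldToks.size() - i) + " old, "
                    + (newToks.size() - j) + " new");
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample lines copied from the comment above.
        List<Tok> oldToks = parse(Arrays.asList(
            "(2004.jpg,34461,34469,type=<HOST>)",
            "(978-0-94045043-1,86408,86424,type=<NUM>)"));
        List<Tok> newToks = parse(Arrays.asList(
            "(2004.jpg,34461,34469,type=<NUM>)",
            "(978-0-94045043,86408,86422,type=<NUM>)",
            "(1,86423,86424,type=<ALPHANUM>)"));
        for (String line : diff(oldToks, newToks)) {
            System.out.println(line);
        }
    }
}
```

Grouping by character offsets rather than by line position keeps one split token from making every later token look different, which a plain line-by-line comparison would do.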