[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516775 ]
Michael McCandless commented on LUCENE-966:
-------------------------------------------

I agree, let's try to match the tokens of the old StandardAnalyzer exactly, so that we have a much faster drop-in replacement.

The speedups from JFlex are amazing: based on a quick test, with JFlex plus the patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower than WhitespaceAnalyzer, even though it's doing so much more ...

> Finally, when it comes to the initialization time of the new
> tokenizer -- according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.

Yeah, I'm baffled why it's that much slower. On 100-token docs I definitely see LUCENE-969 making things 84% faster, but "only" 36% faster if I use the full Wikipedia docs (which are much larger than 100 tokens on average). If we tested even smaller docs, I think the gains would be even larger. When I ran under the profiler, StandardTokenizerImpl <init>(java.io.Reader) was way at the top. Maybe it's the cost of new'ing the 16 KB buffer each time?

In any event I think it's OK, so long as we get LUCENE-969 in and document the importance of using the reusableTokenStream() API for better performance.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
>                      jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.
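
To illustrate the guess above about the per-instance 16 KB buffer, and why a reuse API such as reusableTokenStream() would help most on small documents, here is a minimal stdlib-only sketch. It is not Lucene or JFlex code: SketchScanner, ReuseSketch, and their methods are hypothetical names, and the whitespace token counting is only a stand-in for real tokenization. The point is just that the buffer is allocated once per scanner instance, so resetting one scanner avoids one allocation per document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a generated scanner; NOT Lucene's StandardTokenizerImpl. */
final class SketchScanner {
    /** Counts how many scanners (and therefore 16 KB buffers) were allocated. */
    static int instancesCreated = 0;

    // Allocated once per instance; this is the cost the comment speculates about.
    private final char[] buffer = new char[16 * 1024];
    private Reader input;

    SketchScanner(Reader input) {
        instancesCreated++;
        this.input = input;
    }

    /** Points the existing scanner at new input while keeping its buffer. */
    void reset(Reader input) {
        this.input = input;
    }

    /** Counts whitespace-separated tokens in the current input. */
    int countTokens() {
        int count = 0;
        boolean inToken = false;
        try {
            int n;
            while ((n = input.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean ws = Character.isWhitespace(buffer[i]);
                    if (!ws && !inToken) {
                        count++; // first character of a new token
                    }
                    inToken = !ws;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}

final class ReuseSketch {
    /** One scanner per document: a fresh 16 KB buffer every time. */
    static int countFresh(String[] docs) {
        int total = 0;
        for (String doc : docs) {
            total += new SketchScanner(new StringReader(doc)).countTokens();
        }
        return total;
    }

    /** One scanner for all documents: the buffer is allocated once, then reset. */
    static int countReused(String[] docs) {
        SketchScanner scanner = new SketchScanner(new StringReader(""));
        int total = 0;
        for (String doc : docs) {
            scanner.reset(new StringReader(doc));
            total += scanner.countTokens();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] docs = {"a b c", "d e", "  f  "};
        System.out.println("fresh:  " + countFresh(docs));
        System.out.println("reused: " + countReused(docs));
        System.out.println("scanners allocated in total: " + SketchScanner.instancesCreated);
    }
}
```

Both methods return the same token count; only the number of buffer allocations differs. With short documents the fixed allocation is a larger share of the per-document work, which would be consistent with the bigger relative gain reported on 100-token docs, though this sketch does not measure anything.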