[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516745
 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
to measure the net performance gains.

I measured the time to just tokenize all of Wikipedia with
StandardAnalyzer, using contrib/benchmark plus the patch from
LUCENE-967 (test details are described in LUCENE-969).

With the jflex patch alone it takes 646 sec (best of 2 runs); when I
then merge in the patch from LUCENE-969 it takes 455 sec.  Subtracting
the time to just load all Wikipedia docs (112 sec) from both numbers,
that gives a net additional speedup of 36% (534 sec -> 343 sec) from
using LUCENE-969 in addition to jflex.
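
For anyone who wants to re-check the arithmetic, here is a minimal
sketch in Java.  The class and method names are made up for
illustration and are not part of any patch; the only inputs are the
three timings quoted above.

```java
public class NetSpeedup {
    /**
     * Fraction of time saved after subtracting a fixed overhead
     * (here: the time to just load the docs) from both runs.
     * Hypothetical helper, not part of Lucene or contrib/benchmark.
     */
    public static double netSpeedup(double beforeSec, double afterSec, double overheadSec) {
        double netBefore = beforeSec - overheadSec; // 646 - 112 = 534
        double netAfter = afterSec - overheadSec;   // 455 - 112 = 343
        return (netBefore - netAfter) / netBefore;  // 191 / 534
    }

    public static void main(String[] args) {
        double s = netSpeedup(646, 455, 112);
        System.out.println("net time saved: " + Math.round(s * 100) + "%");
    }
}
```

Note the percentage is relative to the jflex-only net time (534 sec),
i.e. it is "time saved", not a throughput ratio.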

A couple of other things I noticed:

  * The init cost of jflex (StandardTokenizerImpl) seems to be fairly
    high: when I repeat the above test with smallish docs (100 tokens
    each) instead, the gain is around 84%.  I think this makes it
    important to commit the new reusableTokenStream() in LUCENE-969.

  * I'm seeing different token counts with the jflex StandardAnalyzer
    vs the current one, so the two apparently do not tokenize
    identically.  I will track down which tokens differ and post
    back...
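
One simple way to track down such a difference is to collect the token
texts from both analyzers into lists and find the first position where
they disagree.  A rough sketch, assuming the tokens have already been
dumped to lists of strings; the class name, method name and sample
tokens below are all made up, and the sample data is NOT real output
from either analyzer:

```java
import java.util.Arrays;
import java.util.List;

public class TokenDiff {
    /**
     * Returns the index of the first position where the two token lists
     * differ, or -1 if they are identical.  If one list is a strict prefix
     * of the other, returns the length of the shorter list.
     */
    public static int firstDifference(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : n;
    }

    public static void main(String[] args) {
        // Hypothetical token lists, only to exercise the helper.
        List<String> current = Arrays.asList("foo", "bar-baz", "qux");
        List<String> jflex = Arrays.asList("foo", "bar", "baz", "qux");
        System.out.println("counts: " + current.size() + " vs " + jflex.size()
                + ", first difference at token " + firstDifference(current, jflex));
    }
}
```

Looking at the text around the first mismatch is usually enough to see
which rule behaves differently.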


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

