[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Michael Busch (JIRA) Mon, 30 Jul 2007 13:13:21 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516510
 ]


Michael Busch commented on LUCENE-969:
--------------------------------------

Hi Mike,

this is just an idea to keep Token.java simpler, but I haven't really thought 
about all the consequences. So feel free to tell me that it's a bad idea ;)

Could you add a new class TermBuffer including the char[] array and your 
resize() logic that would implement CharSequence? Then you could get rid of the 
duplicate constructors and setters for String and char[], because String also 
implements CharSequence. And CharSequence has the method charAt(int index), so 
it should be almost as fast as directly accessing the char array in case the 
TermBuffer is used. You would need to change the existing constructors and 
setter to take a CharSequence object instead of a String, but this is not an 
API change as users can still pass in a String object. And then you would just 
need to add a new constructor with offset and length and a similiar setter. 
Thoughts?

> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -----------------------------------------------------------------
>
>                 Key: LUCENE-969
>                 URL: https://issues.apache.org/jira/browse/LUCENE-969
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-969.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>   - Re-use a single Token instance during indexing instead of creating
>     a new one for every term.  To do this, I added a new method "Token
>     next(Token result)" (Doron's suggestion) which means TokenStream
>     may use the "Token result" as the returned Token, but is not
>     required to (ie, can still return an entirely different Token if
>     that is more convenient).  I added default implementations for
>     both next() methods in TokenStream.java so that a TokenStream can
>     choose to implement only one of the next() methods.
>   - Use "char[] termBuffer" in Token instead of the "String
>     termText".
>     Token now maintains a char[] termBuffer for holding the term's
>     text.  Tokenizers & filters should retrieve this buffer and
>     directly alter it to put the term text in or change the term
>     text.
>     I only deprecated the termText() method.  I still allow the ctors
>     that pass in String termText, as well as setTermText(String), but
>     added a NOTE about performance cost of using these methods.  I
>     think it's OK to keep these as convenience methods?
>     After the next release, when we can remove the deprecated API, we
>     should clean up Token.java to no longer maintain "either String or
>     char[]" (and the initTermBuffer() private method) and always use
>     the char[] termBuffer instead.
>   - Re-use TokenStream instances across Fields & Documents instead of
>     creating a new one for each doc.  To do this I added an optional
>     "reusableTokenStream(...)" to Analyzer which just defaults to
>     calling tokenStream(...), and then I implemented this for the core
>     analyzers.
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
> The changes above give 21% speedup (742 seconds -> 585 seconds) for
> LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
> all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
> IO system (best of 2 runs).
> If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
> I'm just running with this alg and recording the elapsed time:
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=50000
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
> See this thread for discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added unit test).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Reply via email to