On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce. This would then mean no
>> objects are created as you step through Tokens in the TokenStream
>> ... so this would give the best performance.
>How much better I wonder? Small object allocation & reclaiming is
>supposed to be very good in current JVMs.
Sorry, I cannot give you exact numbers right now, but I know for sure that we
decided to move the "real analysis" into a separate phase that runs before
anything enters the Lucene TokenStream and indexing, because of the String in
Token, and then to do just simple whitespace tokenisation during indexing. We
did not do this just for fun; there was a real benefit in it.
The performance issue here is in the transformations made on tokens during
analysis (where this applies). You gave a very nice example, stemming, which
itself generates new Strings. Another nice example is NGram generation in
SpellChecker, which generates rather huge numbers of small objects.
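To make the NGram point concrete, here is a minimal sketch in plain Java. It
is not the SpellChecker code; the class NGramSketch and the GramConsumer
callback are hypothetical names made up for illustration. The first method
allocates one String per gram, the second hands out windows into a shared
char[] and allocates nothing per gram.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    /** Hypothetical callback: receives each gram as a window into a shared buffer. */
    public interface GramConsumer {
        void accept(char[] buffer, int offset, int length);
    }

    /** String-based variant: allocates one new String object per gram. */
    public static List<String> gramsAsStrings(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    /**
     * char[]-based variant: no per-gram allocation, the consumer only sees
     * (buffer, offset, length). Returns the number of grams produced.
     */
    public static int gramsInPlace(char[] buffer, int length, int n,
                                   GramConsumer consumer) {
        int count = 0;
        for (int i = 0; i + n <= length; i++) {
            consumer.accept(buffer, i, n);
            count++;
        }
        return count;
    }
}
```

The catch with the second variant is that the consumer must use or copy the
window before the buffer is overwritten, which is exactly the kind of contract
every filter in an Analyzer chain would have to respect.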
The simplest model, tokenize (without modifying) and index, ironically also
benefits from char[]: things then go really fast in general, so a new String()
on the way gets noticed in the profiler. While testing the new indexing code
from Mike, we also changed our vanilla Tokenizer to use termBuffer, and there
was again something like a 10-15% boost.
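Roughly what I mean by a Tokenizer on top of a term buffer, as a sketch only.
This is neither our production code nor the Lucene API: the class
ReusableWhitespaceTokenizer is hypothetical, and its termBuffer()/termLength()
accessors are only modelled on the idea of exposing the term as char[] plus a
length. One char[] is reused for every token, so stepping through the input
creates no objects except when the buffer has to grow.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

public class ReusableWhitespaceTokenizer {
    private final Reader input;
    private char[] termBuffer = new char[16]; // reused for every token
    private int termLength;

    public ReusableWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    /**
     * Advances to the next whitespace-delimited token.
     * Returns false once the input is exhausted.
     */
    public boolean next() throws IOException {
        termLength = 0;
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace((char) c)) {
                if (termLength > 0) {
                    return true; // end of the current token
                }
                // otherwise still skipping leading whitespace
            } else {
                if (termLength == termBuffer.length) {
                    // grow only when a token does not fit
                    termBuffer = Arrays.copyOf(termBuffer, termLength * 2);
                }
                termBuffer[termLength++] = (char) c;
            }
        }
        return termLength > 0; // last token may end at EOF
    }

    /** Current term characters; only the first termLength() chars are valid. */
    public char[] termBuffer() {
        return termBuffer;
    }

    public int termLength() {
        return termLength;
    }
}
```

A caller loops with while (tokenizer.next()) and reads termBuffer() up to
termLength() before calling next() again. Calling new String(...) on every
token there would bring back the allocation this is meant to avoid.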
It has been a while since then, so I do not know the exact numbers, but I have
learned this many times the hard way: nothing beats char[] when it comes to
text processing.
To stop bothering you people: IMHO, there is hard work to be done in the
Analyzer chain before Token gets ready for prime time in Lucene core, and this
is the place where String overproduction hurts.