On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> OK, I ran some benchmarks here.
> The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> Wikipedia content using LowerCaseTokenizer + StopFilter +
> PorterStemFilter. I think it's worth pursuing!
Did you try it without token reuse (i.e., reusing tokens only when
mutating, not when creating new tokens in the tokenizer)?
It would be interesting to see how much of the gain is attributable to
Token reuse alone (after the core filters have been optimized to use
the char[] setters, etc.).
We've had issues in the past with filters mishandling token properties:
1) filters creating a new token from an old token, but forgetting to
set positionIncrement
2) legacy filters losing "new" information such as payloads when
creating tokens, because those properties didn't exist when the filter
was written.
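For illustration, here's a minimal sketch of a filter with both bugs
at once, against the 2.2-era TokenStream API (BrokenStemFilter and its
trivial stem() are made up for the example):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public final class BrokenStemFilter extends TokenFilter {
      public BrokenStemFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        // Bug #1: the new Token defaults to positionIncrement=1,
        // silently dropping whatever increment t carried (e.g. one
        // set by an upstream StopFilter).
        // Bug #2: t's payload is never copied over, so it's lost.
        return new Token(stem(t.termText()), t.startOffset(),
                         t.endOffset());
      }

      // Placeholder "stemmer" so the sketch compiles.
      private static String stem(String s) {
        return s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
      }
    }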
#1 is solved by token mutation, because there are now setters for the
term text (before, the filter author was forced to create a new token
unless they could access the package-private String).
#2 can now be solved by clone(), another relatively new addition.
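Concretely, the same filter could do either of these (again just a
sketch, reusing the stem() placeholder from above):

    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      // Mutating in place keeps positionIncrement, offsets, type,
      // and payload intact for free.
      t.setTermText(stem(t.termText()));
      return t;
    }

or, if the original token must be left untouched:

    Token copy = (Token) t.clone();  // clone() carries the payload along
    copy.setTermText(stem(copy.termText()));
    return copy;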
So what new problems might crop up with token reuse?
- a filter reusing a token but not zeroing out something new, like a
payload, because that property didn't exist when the filter was
authored (the opposite of the earlier problem)
Would a Token.clear() be needed for use by (primarily) tokenizers?
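Something like this, say (purely hypothetical; just sketching what a
Token.clear() might reset):

    // Inside Token: reset anything a downstream consumer could still
    // observe, so a reused token never leaks the previous token's state.
    public void clear() {
      payload = null;
      positionIncrement = 1;
      // termText, startOffset, endOffset and type are overwritten by
      // the tokenizer on every call anyway, so resetting them here is
      // arguably optional.
    }

A reusing tokenizer would then call clear() before filling in the new
term text and offsets.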
-Yonik