[
https://issues.apache.org/jira/browse/LUCENE-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507233
]
Mark Miller commented on LUCENE-937:
------------------------------------
I am testing on Java 1.5
I tested with LinkedList using get, LinkedList using iterator, ArrayList()
(defaults to 10), ArrayList(16), ArrayList(30), ArrayList(60)
For a handful of tokens, all methods are about the same speed. (3-10 tokens)
LinkedList, ArrayList(16) and ArrayList are a smidgen faster than
ArrayList(30-60) though.
At 15-40 tokens all of the methods are still about the same speed, though
ArrayList(30) may be a hair faster (than even ArrayList(10))
At 50-300 tokens (weighted towards 50), you see a 47% increase in speed using
ArrayList(16 or 30) and ArrayList(60), the other methods are about the same
speed.
At 150-900 tokens (weighted towards 150), you see a 100% increase in speed
using ArrayList(30) -- ArrayList(60) is no better,
the other methods take twice as long (even ArrayList(10)).
I used Reuters data, shortening it and stiching it together for the various
sizes.
The first test was the average speed of about 21,000 runs on 21,000 different
docs.
The second test was the average speed of about 21,000 runs on 21,000 different
docs.
The third test was the average speed of about 15,000 runs on 15,000 different
docs.
The fourth test was the average speed of about 5,000 runs on 5,000 different
docs.
I don't completely understand why, but ArrayList(30) or ArrayList(16) blow
everything else out of the water. ArrayList(16) is a smidgen faster at very low
token counts, while ArrayList(30) is a smidgen faster above 50 or so tokens.
The differences are almost negligible though.
ArrayList(30) or ArrayList(16) are approx the same speed as LinkedList (get and
iterator are always approx the same speed)
at low numbers and get better and better VERY quickly as the number of tokens
goes up.
I'd recommend ArrayList(16) as the best choice for this class. It is no worse
than LinkedList on very small documents, and much better than LinkedList on
small to large documents.
A friend was recently telling me that ArrayList defaulted to 16, but it does
not -- it defaults to 10. He must have been confused and known that it *should*
default to 16. 16 is a much better default number than 10 due to the
exponential growth when doubling the array on resize. It may take a bit more
memory, but it gets a LOT more speed.
- Mark
> Make CachingTokenFilter faster
> ------------------------------
>
> Key: LUCENE-937
> URL: https://issues.apache.org/jira/browse/LUCENE-937
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Mark Miller
> Priority: Minor
> Attachments: CachingTokenFilter.patch
>
>
> The wrong data structure was used for the CachingTokenFilter. It should be an
> ArrayList rather than a LinkedList. There is a noticeable difference in speed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]