OK, I ran some benchmarks here.
The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
Wikipedia content using LowerCaseTokenizer + StopFilter +
PorterStemFilter. I think it's worth pursuing!
Here are the optimizations I tested:
* Change core analyzers to reuse a single Token instance and reuse
the char[] termBuffer (using a new method "boolean next(Token t)" so it's
backwards compatible); see the first sketch after this list.
* For the StopFilter I created a new helper class (CharArraySet) to
create a hash set that can key off of char[]'s without having to
new a String; see the second sketch below.
* Fix the analyzer to reuse the same tokenizer across documents &
fields, rather than new'ing one every time (this is what the
LowercaseStopPorterAnalyzer further down does).
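To make the first item concrete, here's roughly what a filter looks
like under the proposed reuse API. This is an illustration only, not
code from the patch: the filter name is made up, and
termBuffer()/termLength() stand in for whatever accessors the reusable
Token ends up with:

  import java.io.IOException;
  import org.apache.lucene.analysis.*;

  // Illustration only -- not from the patch. Shows the shape of the
  // proposed "boolean next(Token t)" API: each filter fills in the
  // caller's single Token, mutating its char[] termBuffer in place
  // instead of allocating a new Token + String per term.
  public class LowerCaseReuseFilter extends TokenFilter {

    public LowerCaseReuseFilter(TokenStream input) {
      super(input);
    }

    public boolean next(Token t) throws IOException {
      if (!input.next(t))       // upstream fills the same Token in place
        return false;
      // termBuffer()/termLength() are assumed accessors on the reusable Token
      final char[] buf = t.termBuffer();
      final int len = t.termLength();
      for (int i = 0; i < len; i++)
        buf[i] = Character.toLowerCase(buf[i]);
      return true;
    }
  }

Presumably the existing "Token next()" stays as a fallback that
allocates, which is what keeps old streams backwards compatible.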
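And here's the rough idea behind the second item (again just a
simplified sketch; the real CharArraySet in the patch will differ):
an open-addressing hash set that hashes the char[] contents directly:

  // Simplified sketch of the CharArraySet idea: membership tests key
  // off char[] contents, so StopFilter never allocates a String per token.
  public class CharArraySetSketch {

    private final char[][] entries;
    private final int mask;

    public CharArraySetSketch(String[] words) {
      int size = 8;
      while (size < words.length * 2)   // keep the table at least half empty
        size *= 2;
      entries = new char[size][];
      mask = size - 1;
      for (int i = 0; i < words.length; i++)
        add(words[i].toCharArray());
    }

    private void add(char[] w) {
      int slot = hash(w, w.length) & mask;
      while (entries[slot] != null)     // linear probing
        slot = (slot + 1) & mask;
      entries[slot] = w;
    }

    // The hot call: membership test with no String allocation.
    public boolean contains(char[] buf, int len) {
      int slot = hash(buf, len) & mask;
      while (entries[slot] != null) {
        char[] e = entries[slot];
        if (e.length == len && regionMatches(e, buf, len))
          return true;
        slot = (slot + 1) & mask;
      }
      return false;
    }

    private static boolean regionMatches(char[] e, char[] buf, int len) {
      for (int i = 0; i < len; i++)
        if (e[i] != buf[i])
          return false;
      return true;
    }

    private static int hash(char[] buf, int len) {
      int code = 0;
      for (int i = 0; i < len; i++)
        code = code * 31 + buf[i];
      return code;
    }
  }

StopFilter can then test stopWords.contains(t.termBuffer(),
t.termLength()) for each token without creating any garbage.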
I ran tests with "java -server -Xmx1024M", running on an Intel Core 2
Duo box with Debian Linux (2.6.18 kernel) and a RAID 5 I/O system.
I index all text (every single term) in Wikipedia, pulling from a
single line file (I'm using the patch from LUCENE-947 that adds
line-file creation & indexing to contrib/benchmark).
First I create a single large file that has one doc per line from
Wikipedia content, using this alg:
docs.dir=enwiki
doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
line.file.out=/lucene/wikifull.txt
doc.maker.forever=false
{WriteLineDoc()}: *
Resulting file is 8.4 GB and 3.2 million docs. Then I indexed it
using this alg:
analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
directory=FSDirectory
ram.flush.mb=64
max.field.length=2147483647
compound=false
max.buffered=70000
doc.add.log.step=5000
docs.file=/lucene/wikifull.txt
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
doc.tokenized=true
doc.maker.forever=false
ResetSystemErase
CreateIndex
{ "All"
{AddDoc}: *
}
CloseIndex
RepSumByPref All
RepSumByPref AddDoc
Resulting index is 2.2 GB.
The LowercaseStopPorterAnalyzer just looks like this:
import java.io.Reader;
import org.apache.lucene.analysis.*;

public class LowercaseStopPorterAnalyzer extends Analyzer {
  Tokenizer tokenizer;
  TokenStream stream;

  public final TokenStream tokenStream(String fieldName, Reader reader) {
    if (tokenizer == null) {
      // First call: build the chain once, then reuse it for every doc & field.
      tokenizer = new LowerCaseTokenizer(reader);
      stream = new PorterStemFilter(new StopFilter(tokenizer,
                                    StopAnalyzer.ENGLISH_STOP_WORDS));
    } else {
      // Later calls: just point the cached tokenizer at the new reader
      // (relies on the Tokenizer.reset(Reader) re-use from this patch).
      tokenizer.reset(reader);
    }
    return stream;
  }
}
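Nothing special is needed to use it outside contrib/benchmark; it
drops in like any other analyzer, something like this (path and field
name made up). Note that it caches a single tokenizer/stream pair, so
you'd want one analyzer instance per indexing thread:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class IndexSketch {
    public static void main(String[] args) throws Exception {
      // Create a new index that analyzes all added text with the
      // reuse-friendly analyzer above.
      IndexWriter writer = new IndexWriter("/tmp/wiki-index",
                                           new LowercaseStopPorterAnalyzer(),
                                           true);
      Document doc = new Document();
      doc.add(new Field("body", "one line of wikipedia text",
                        Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      writer.close();
    }
  }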
I then record the elapsed time reported by the "All" task. I ran each
test twice and took the faster time:
JDK 5 Trunk 21 min 41 sec
JDK 5 New 18 min 54 sec
-> 12.8% faster
JDK 6 Trunk 21 min 43 sec
JDK 6 New 17 min 59 sec
-> 17.2% faster
It's rather odd that we see better gains in JDK 6 ... I had expected
the reverse (assuming GC performance is better in JDK 6 than JDK 5).
I also think it's quite cool that we can index all of Wikipedia in 18
minutes :) That works out to ~8 MB/sec (8.4 GB / ~18 min = ~8600 MB /
1080 sec).
I will open an issue...
Mike