I've just run some stats on the overhead of tokenizing/highlighting text.
It looks like it's the tokenizing that's the main problem, and it is CPU-bound.

I ran three tests, all on the same index/machine: Pentium 3 800MHz, 360MB index,
Lucene 1.3 final, JDK 1.4.1, Porter-stemmer-based analyzer.

For each test I processed the same set of docs - on average about 1k in size.
Each test looped 100 times around the same set of 60 docs, processing a total of
6,058,300 bytes of content.


The first test measured the time to simply get the text using docs.get(fieldName) - this
was very quick. Here's the output:
----- Reading:0 ms avg per doc,10ms total

The second test measured the time to retrieve and tokenize the text using the
analyzer. This was proportionally much larger than a straight read,
but still reasonable for small docs:
----- Tokenizing:5 ms avg per doc,33439ms total
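
For what it's worth, the read and tokenize tests were essentially variations on a loop 
like the one below. This is only a rough sketch - the class, method and field names are 
illustrative, not the actual test code:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    // Rough sketch of the read + tokenize timing loop (illustrative names only).
    public class TokenizeTimer {
        public static long timeTokenizing(IndexReader reader, Analyzer analyzer,
                                          String fieldName, int loops) throws IOException {
            long start = System.currentTimeMillis();
            for (int loop = 0; loop < loops; loop++) {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (reader.isDeleted(i)) {
                        continue;                      // skip deleted docs
                    }
                    Document doc = reader.document(i); // the cheap part: read the stored text
                    String text = doc.get(fieldName);
                    TokenStream stream =
                        analyzer.tokenStream(fieldName, new StringReader(text));
                    while (stream.next() != null) {
                        // the expensive part: drive the tokenizer, discarding the tokens
                    }
                    stream.close();
                }
            }
            return System.currentTimeMillis() - start;
        }
    }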

The third test measured the time to retrieve and highlight the text (using the same
analyzer and selecting the 3 "best" fragments). This was only marginally
slower than tokenizing alone:
----- Highlighting:5 ms avg per doc,35020ms total
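
The highlighting test just swaps the token-discarding loop above for a call into the 
highlighter. Roughly like this, assuming the contrib/sandbox Highlighter and QueryScorer 
(again, illustrative names rather than the actual test code):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // Rough sketch of the highlighting step (assumes the contrib Highlighter package).
    public class HighlightTimer {
        public static String[] highlight(Analyzer analyzer, Query query,
                                         String fieldName, String text) throws IOException {
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(text));
            // Tokenize the text and pick the 3 "best" fragments for this query.
            return highlighter.getBestFragments(stream, text, 3);
        }
    }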

The conclusion?
* Reading text from the index is very quick (6 meg of document text in 10ms!)
* Tokenizing the text is much slower, but this is only noticeable if you're processing a 
LOT of text.
(My docs were an average of 1k in size, so each only took 5ms)
* The time taken by the highlighting code is almost all down to tokenizing.

Bruce,
Could a short-term (and possibly compromised) solution to your performance problem be 
to offer only the first 3k of these large 200k docs to 
the highlighter, in order to minimize the amount of tokenization required? Arguably the 
most relevant bit of a document is typically in the first 1k anyway.
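
Reusing the imports from the highlighting sketch above, the cap could be something as 
simple as this (the 3k figure and the method name are only illustrative):

    // Illustrative only: cap the text handed to the highlighter at ~3k characters.
    public static String[] highlightCapped(Analyzer analyzer, Query query,
                                           String fieldName, String text) throws IOException {
        int maxChars = 3 * 1024;
        String capped = (text.length() > maxChars) ? text.substring(0, maxChars) : text;
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(capped));
        // Only the first 3k is ever tokenized, so the cost no longer grows with doc size.
        return highlighter.getBestFragments(stream, capped, 3);
    }
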
Also, for the purposes of the search index, are you doing all you can to strip out the 
duplicated text (the ">>" quoted comments etc.) from the 
reply posts typically found in your forums?
My timings seem in line with your estimates - a 1k doc takes 5ms so a 200k doc is 
close to a second!


Cheers
Mark
