More Like This Query updated plus benchmarks

markharw00d Sun, 29 Feb 2004 14:54:37 -0800

I have updated the MoreLikeThis query generator to address a few issues.
The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java
I have added comments at the top of the class to describe the changes.


I was interested in the benefits of the new TermVector code, so I benchmarked
its effect on the average time to generate a "MoreLikeThis" Query object for
example docs of varying size, from indexes with and without TermVector support:

For avg example doc size of 250 bytes:
VectorIndex       21 ms
NoVectorIndex     37 ms

For avg example doc size of 1,000 bytes:
VectorIndex       25 ms
NoVectorIndex     48 ms

For avg example doc size of 16,000 bytes:
VectorIndex      235 ms
NoVectorIndex    356 ms

For avg example doc size of 150,000 bytes:
VectorIndex      533 ms
NoVectorIndex   1809 ms


TermVector support is beneficial and its effects are more noticeable in larger
docs. However, once you get into 200k-sized docs you will probably want to look
at ways to improve performance.
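
To make the size of the gap easier to read, the ratio of the two timings can be
worked out for each doc size. The snippet below is only a small sketch using the
figures from this post; the class and method names are made up for illustration
and are not part of MoreLikeThis.java.

```java
import java.util.Locale;

// Illustrative only: the timings are copied from the benchmark figures above.
class TermVectorSpeedup {

    // How many times longer query generation took without term vectors.
    static double ratio(long noVectorMs, long vectorMs) {
        return (double) noVectorMs / vectorMs;
    }

    public static void main(String[] args) {
        int[] docBytes = {250, 1000, 16000, 150000};
        long[] vectorMs = {21, 25, 235, 533};
        long[] noVectorMs = {37, 48, 356, 1809};
        for (int i = 0; i < docBytes.length; i++) {
            System.out.printf(Locale.ROOT, "%7d bytes: %.2fx%n",
                    docBytes[i], ratio(noVectorMs[i], vectorMs[i]));
        }
    }
}
```

On these numbers the non-vector index is roughly 1.8x, 1.9x, 1.5x and 3.4x
slower, so the gap is widest by far at 150,000 bytes.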

A tokenizing size limit is an obvious way to optimise performance for large docs
without term vectors. This cuts down on tokenizing time but may reduce the
quality of results. I introduced a default "5000" term limit on tokenization and
this cut the 1809 ms in the above results down to 612 ms.

I haven't been able to test the quality of results produced by this query (my
150k docs were made by concatenating several smaller docs of different subject
matter together). Looking at the query terms produced, however, it seems to
compare reasonably with the vector-produced one (20 of the 25 terms are common
to both):

* 5k tokenize limit query=: colchest our essex home us we you from flower uk site 
click your ship compani new servic page 01206 fashion gift here music florist busi 

* Full vector query=: colchest our essex you flower we us click home school from your 
suffolk florist site about here servic uk new deliveri gift page an 01206
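
The tokenizing limit described above could look something like the sketch
below. It is not the code in MoreLikeThis.java: the class name, the method name
and the use of java.util.StringTokenizer in place of a Lucene Analyzer are all
assumptions made to keep the example self-contained. The point is only that term
counting stops once maxTokens tokens have been read, so the rest of a large doc
is never tokenized.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper for illustration, not taken from MoreLikeThis.java.
class CappedTermCounter {

    // Counts term frequencies, reading at most maxTokens tokens from the text.
    static Map<String, Integer> countTerms(String text, int maxTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        // StringTokenizer splits lazily on whitespace; it stands in for a real Analyzer.
        StringTokenizer tokens = new StringTokenizer(text);
        int seen = 0;
        while (seen < maxTokens && tokens.hasMoreTokens()) {
            String term = tokens.nextToken().toLowerCase(Locale.ROOT);
            freqs.merge(term, 1, Integer::sum);
            seen++;
        }
        return freqs;
    }

    public static void main(String[] args) {
        // With a limit of 3 only "a b a" is read; "c", the last "a" and "d" are ignored.
        System.out.println(countTerms("a b a c a d", 3));
    }
}
```

With the default of 5000 mentioned above, the call would be
countTerms(docText, 5000). Terms that only appear late in a long doc never reach
the frequency map, which is where the possible loss of quality comes from.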



I'm not currently sure what the approach would be for optimising the performance
of TermVector-backed queries when using large example docs.


On a related subject: now that I understand the TermVector feature better (and
have found that there is no position data), I can't see a way that it is of any
benefit in optimising the highlighter code. I'd previously thought term sequence
was in there.


Cheers
Mark