I have updated the MoreLikeThis query generator to address a few issues. The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java I have added comments at the top of the class to describe the changes.
I was interested in the benefits of the new TermVector code, so I benchmarked its effect on the average time to generate a "MoreLikeThis" Query object for example docs of varying sizes, from indexes with and without TermVector support:

Avg example doc size   VectorIndex   NoVectorIndex
250 bytes              21 ms         37 ms
1,000 bytes            25 ms         48 ms
16,000 bytes           235 ms        356 ms
150,000 bytes          533 ms        1809 ms

TermVector support is beneficial and its effects are more noticeable in larger docs. However, once you get into 200k sized docs you probably want to look at ways to improve performance.

A tokenizing size limit is an obvious way to optimise performance for large docs without term vectors. This cuts down on tokenizing time but may reduce the quality of results. I introduced a default "5000" term limit on tokenization and this cut the 1809 ms in the above results down to 612 ms.

I haven't been able to test the quality of results produced by this query (my 150k docs were made by concatenating several smaller docs of different subject matter together). Looking at the query terms produced, however, it seems to compare reasonably with the vector-produced one:

* 5k tokenize limit query=: colchest our essex home us we you from flower uk site click your ship compani new servic page 01206 fashion gift here music florist busi
* Full vector query=: colchest our essex you flower we us click home school from your suffolk florist site about here servic uk new deliveri gift page an 01206

I'm not currently sure what the approach would be for optimising performance of TermVector-backed queries when using large example docs.

On a related subject: now that I understand the TermVector feature better (and found there is no position data) I can't see a way that it is of any benefit to optimising the highlighter code.
I'd previously thought term sequence was in there.

Cheers
Mark
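
For anyone who wants to see the tokenizing limit idea in isolation, here is a minimal sketch. It is not the code from MoreLikeThis.java: the class name TokenLimitSketch, the method termFrequencies and the simple split on non-alphanumeric characters are all illustrative assumptions, with the split standing in for a real Lucene Analyzer. Only the default of 5000 is taken from the post.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of MoreLikeThis.java: it illustrates the token cap only.
class TokenLimitSketch {

    // Mirrors the default limit of 5000 terms mentioned in the post.
    static final int DEFAULT_MAX_TOKENS = 5000;

    /**
     * Counts term frequencies in the text, stopping once maxTokens tokens
     * have been read. The text is scanned incrementally, so anything past
     * the cap is never tokenized at all, which is where the time is saved.
     * Splitting on non-alphanumerics is an assumed stand-in for an Analyzer.
     */
    static Map<String, Integer> termFrequencies(String text, int maxTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        int seen = 0;
        int i = 0;
        int n = text.length();
        while (i < n && seen < maxTokens) {
            // Skip separator characters before the next token.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                freqs.merge(term, 1, Integer::sum);
                seen++;
            }
        }
        return freqs;
    }
}
```

The trade-off is the one described above: terms that only occur after the cap never reach the frequency map, so they cannot end up in the generated query.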