As a test, I tried to compare a few documents on various topics (a few on linux, and another on the U.S. constitution) to a source document on linux using a query formed by MoreLikeThis.
1. Looking at the hits, they have the same score. I'd expect them to be different, based on their relevance to the source document. Any ideas? 2. I can't figure out how the search picks up the article on the US constitution as similar. Any hints? I can tweak the search a bit to limit this by increasing the min. term frequency via setMinTermFreq() but given that the interesting terms are so vastly different, I wouldn't have thought this necessary. This is my output. I can paste my source code in too if needed. Thanks ================================================================= read file c:/tmp/similarity2/linux.txt read file c:/tmp/similarity2/linuxkernel.txt read file c:/tmp/similarity2/linuxtoo.txt read file c:/tmp/similarity2/constitution.txt index size 4 Field option:title Field option:name MLT parms: maxQueryTerms : 25 minWordLen : 3 maxWordLen : 0 fieldNames : title boost : false minTermFreq : 5 minDocFreq : 1 query formed for source doc linux.txt is title:linux title:can title:want title:system title:operating title:our title:information title:you title:linus title:released title:found title:kernel title:use title:its title:about title:page title:more Interesting terms for c:/tmp/similarity2/linux.txt :linux can want system operating our information you linus released found kernel use its about page more Interesting terms for c:/tmp/similarity2/linuxtoo.txt :linux you free software can like your operating system journal other systems development commercial productivity aug-22-08 many web even which video network popular platforms source Interesting terms for c:/tmp/similarity2/linuxkernel.txt :linux kernel version torvalds support edit code system stable gpl retrieved linus sco released changes series drivers also only number operating were binary new has Interesting terms for c:/tmp/similarity2/constitution.txt :shall president states united votes senate vice office electors congress law person them have may from which number time all state other number of matches : 4 1. [c:/tmp/similarity2/linux.txt] score: 0.46413344 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< why the same score 2. [c:/tmp/similarity2/linuxkernel.txt] score: 0.46413344 3. [c:/tmp/similarity2/linuxtoo.txt] score: 0.46413344 4. [c:/tmp/similarity2/constitution.txt] score: 0.46413344 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< why ???