As a test, I compared a few documents on different topics (a few on Linux,
and one on the U.S. Constitution) against a source document on Linux, using
a query generated by MoreLikeThis.

1. The hits all have the same score. I'd expect the scores to differ
according to each document's relevance to the source document. Any ideas?
2. I can't figure out why the search picks up the article on the U.S.
Constitution as similar. Any hints? I can limit this somewhat by raising the
minimum term frequency via setMinTermFreq(), but given that the interesting
terms are so vastly different, I wouldn't have thought that necessary.
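
To make clear what I understand the term filters to do, here is a rough
sketch in plain Java. It is not Lucene's implementation: the class and method
names (TermFilterSketch, interestingTerms) are made up for illustration, the
tokenizer is a crude regex split, and terms are ranked by raw frequency
rather than by a tf-idf weight. It only mirrors the meaning of the
minTermFreq, minWordLen, maxWordLen and maxQueryTerms settings listed in the
output below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFilterSketch {

    // Hypothetical helper (not a Lucene API): count terms in the text, drop
    // those that fail the length or frequency filters, and return at most
    // maxQueryTerms of them, most frequent first (ties broken alphabetically).
    static List<String> interestingTerms(String text, int minTermFreq,
                                         int minWordLen, int maxWordLen,
                                         int maxQueryTerms) {
        Map<String, Integer> freq = new HashMap<>();
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.isEmpty()) continue;
            if (tok.length() < minWordLen) continue;
            // A maxWordLen of 0 is treated as "no upper limit".
            if (maxWordLen > 0 && tok.length() > maxWordLen) continue;
            freq.merge(tok, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= minTermFreq) kept.add(e);
        }
        kept.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });

        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : kept) {
            if (terms.size() == maxQueryTerms) break;
            terms.add(e.getKey());
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "linux linux linux kernel kernel the the the the is";
        // minTermFreq=2, minWordLen=3, maxWordLen=0 (no limit), maxQueryTerms=25
        System.out.println(interestingTerms(text, 2, 3, 0, 25));
        // prints [the, linux, kernel]
    }
}
```

Note that in this sketch a common word such as "the" passes every filter,
since nothing here applies a stop-word list.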

This is my output. I can paste my source code in too if needed.

Thanks

=================================================================



read file c:/tmp/similarity2/linux.txt
read file c:/tmp/similarity2/linuxkernel.txt
read file c:/tmp/similarity2/linuxtoo.txt
read file c:/tmp/similarity2/constitution.txt
index size 4
Field option:title
Field option:name
MLT parms:
    maxQueryTerms  : 25
    minWordLen     : 3
    maxWordLen     : 0
    fieldNames     : title
    boost          : false
    minTermFreq    : 5
    minDocFreq     : 1

query formed for source doc linux.txt is title:linux title:can title:want
title:system title:operating title:our title:information title:you
title:linus title:released title:found title:kernel title:use title:its
title:about title:page title:more
Interesting terms for c:/tmp/similarity2/linux.txt :linux can want system
operating our information you linus released found kernel use its about page
more
Interesting terms for c:/tmp/similarity2/linuxtoo.txt :linux you free
software can like your operating system journal other systems development
commercial productivity aug-22-08 many web even which video network popular
platforms source
Interesting terms for c:/tmp/similarity2/linuxkernel.txt :linux kernel
version torvalds support edit code system stable gpl retrieved linus sco
released changes series drivers also only number operating were binary new
has
Interesting terms for c:/tmp/similarity2/constitution.txt :shall president
states united votes senate vice office electors congress law person them
have may from which number time all state other


number of matches : 4
1.  [c:/tmp/similarity2/linux.txt]  score: 0.46413344   <<< why the same score?
2.  [c:/tmp/similarity2/linuxkernel.txt]  score: 0.46413344
3.  [c:/tmp/similarity2/linuxtoo.txt]  score: 0.46413344
4.  [c:/tmp/similarity2/constitution.txt]  score: 0.46413344   <<< why ???
