Hi Andrew,

No idea, I'm afraid, but could you send the output with mlt.interestingTerms=details? That would at least show which terms MoreLikeThis uses, for comparison with the TermVectorComponent output you've already pasted.
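
In case it helps, here is a minimal sketch of how the request could be rewritten so that it asks for details rather than a plain list. It only builds the URL and does not send anything; the helper name with_interesting_terms_details is made up for illustration, and the URL is the one from your original message.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_interesting_terms_details(url):
    """Return `url` with mlt.interestingTerms forced to 'details'.

    Hypothetical helper, not part of Solr or any client library.
    """
    parts = urlsplit(url)
    # Drop any existing mlt.interestingTerms value, keep everything else.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "mlt.interestingTerms"
    ]
    params.append(("mlt.interestingTerms", "details"))
    # Keep ':' unescaped so the q=id:... clause stays readable.
    return urlunsplit(parts._replace(query=urlencode(params, safe=":")))


MLT_URL = (
    "http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0"
    "&mlt.interestingTerms=list&mlt.match.include=false"
    "&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1"
)

print(with_interesting_terms_details(MLT_URL))
```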

Chantal

Andrew Clegg wrote:
Any ideas on this? Is it worth sending a bug report?

Those links are live, by the way, in case anyone wants to verify that MLT is
returning suggestions with very low tf.idf.

Cheers,

Andrew.


Andrew Clegg wrote:
Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the hits in the results is "and" (I don't do any stopword removal
on this field).

However if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
with *much* higher tf.idf scores, e.g.:

<lst name="aquaspirillum">
<int name="tf">1</int>
<int name="df">10</int>
<double name="tf-idf">0.1</double>
</lst>

that *don't* appear in the MoreLikeThis list. (I tried adding
&mlt.maxwl=999 to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better
candidate for a MoreLikeThis query than something with tf.idf = 7.46E-4?
Or does MoreLikeThis do some other heuristic magic to select good
candidates, and sometimes get it wrong?
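
For reference, the two components may not be computing the same quantity. The tf-idf value printed by the TermVectorComponent matches tf / df (1 / 10 = 0.1 for "aquaspirillum" above). Lucene's MoreLikeThis, as far as I understand it, ranks candidate terms by tf * idf using the Similarity's idf, which for the classic default is 1 + ln(numDocs / (df + 1)). The sketch below only puts the two formulas side by side; the second formula and the num_docs value are assumptions, not something taken from this index, so it may or may not account for the ranking seen here.

```python
import math


def tvc_tf_idf(tf, df):
    """tf-idf as reported by the TermVectorComponent output above: tf / df."""
    return tf / df


def classic_mlt_score(tf, df, num_docs):
    """tf * idf with the classic Lucene idf, 1 + ln(numDocs / (df + 1)).

    Assumed to be what MoreLikeThis uses to rank terms; not confirmed
    against the Solr version behind this index.
    """
    return tf * (1.0 + math.log(num_docs / (df + 1)))


# tf and df for "aquaspirillum" come from the term vector output above.
# num_docs is a hypothetical index size, chosen only for illustration.
num_docs = 10000
tf, df = 1, 10

print("TermVectorComponent tf-idf:", tvc_tf_idf(tf, df))
print("classic tf * idf (assumed):", classic_mlt_score(tf, df, num_docs))
```

The two scores are on different scales and weight raw term frequency differently, so the values shown by the TermVectorComponent cannot be compared directly with whatever MoreLikeThis uses internally.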

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,

Andrew.

--
:: http://biotext.org.uk/ ::



