Greets,

Lucene has a MoreLikeThisQuery in contrib:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

It functions by selecting a handful of high-value (i.e. rare) terms out of a document and building up a composite ORQuery based on those.

The thing that's always bothered me about its results is that it gets thrown off by things like proper names. Proper names are often very rare, and thus highly discriminatory terms. They often pass all the heuristics that MoreLikeThisQuery uses: low doc_freq() (meaning occurs in few documents), long token length (more than 5 characters), etc.

The problem is that if you have e.g. two authors with the same (uncommon) last name, but these authors write on entirely different subjects, MoreLikeThisQuery will often conflate them.

However, there is a potential remedy available if we use clustering. Say that the heuristics yield this collection of terms:

    economics
    capital
    interest
    investment
    addison

One of these things is not like the others. :) Meaning, if you look at all those terms in a vector space, most of them will be clustered together, but one will be way far away.

What I'd like to do is identify the cluster that best represents the document, and exclude any terms outside of that cluster when building the MoreLikeThisQuery.

What kind of a data structure would we need to achieve that?

Marvin Humphrey
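
One possible answer to the data structure question above, as a minimal sketch rather than anything Lucene provides: give each candidate term a sparse vector mapping doc id to term frequency (roughly what a postings list already holds), then compare terms by cosine similarity. The sketch below is not a real clustering algorithm. It is a simpler stand-in that drops any term whose mean similarity to the other candidates falls below a cutoff. The names `cosine`, `filter_outlier_terms`, and `min_mean_sim`, the 0.2 cutoff, and the toy postings are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {doc_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(doc, 0.0) for doc, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_outlier_terms(term_vectors, min_mean_sim=0.2):
    """Keep terms whose mean cosine similarity to the other candidate
    terms is at least min_mean_sim (an arbitrary, illustrative cutoff)."""
    terms = list(term_vectors)
    if len(terms) < 2:
        return terms
    kept = []
    for t in terms:
        sims = [cosine(term_vectors[t], term_vectors[o])
                for o in terms if o != t]
        if sum(sims) / len(sims) >= min_mean_sim:
            kept.append(t)
    return kept


# Made-up postings: term -> {doc_id: term frequency}.
# "addison" mostly occurs in documents the other terms never touch.
term_vectors = {
    "economics":  {1: 3, 2: 2, 3: 1},
    "capital":    {1: 2, 2: 3, 3: 2},
    "interest":   {1: 1, 2: 2, 3: 3},
    "investment": {1: 2, 2: 1, 3: 2},
    "addison":    {1: 1, 7: 4, 8: 5},
}

print(filter_outlier_terms(term_vectors))
```

With this toy data the four economics-related terms co-occur in the same documents and survive the filter, while "addison" is dropped. On a real index the vectors would come from the postings for each candidate term, and a proper clustering step (k-means, say, or hierarchical clustering over the same vectors) could replace the mean-similarity cutoff.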
