On 16.03.2010 06:17, Marvin Humphrey wrote:
> Greets,
>
> Lucene has a MoreLikeThisQuery in contrib:
>
>   http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
>
> It functions by selecting a handful of high-value (i.e. rare) terms out of a
> document and building up a composite ORQuery based on those.

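Just to pin down what "selecting a handful of high-value terms" amounts to, here is a rough sketch in Python. It is not Lucene's code: `select_terms` and the toy corpus are made up for illustration, and the plain tf * log(N/df) weight is my assumption, not necessarily the exact formula MoreLikeThis uses.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, max_terms=5):
    """Return the highest-weighted terms of one document (hypothetical helper).

    Weight is tf * log(N / df), an assumed stand-in for the real scoring.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(doc_tokens)
    scores = {
        term: count * math.log(n_docs / df[term])
        for term, count in tf.items()
        if df[term] > 0  # skip terms the corpus has never seen
    }
    # Highest weight first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


# Toy corpus, invented for this example.
corpus = [
    ["economics", "capital", "the"],
    ["the", "interest"],
    ["the", "addison", "economics"],
]
top_terms = select_terms(corpus[2], corpus, max_terms=2)
```

In this toy corpus the rare surname gets the highest weight, so it is the first term to go into the OR query.
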
Yes, it looks like it's simply an approximation to the cosine similarity of
the term vectors.

> The problem is that if you have e.g. two authors with the same (uncommon) last
> name, but these authors write on entirely different subjects,
> MoreLikeThisQuery will often conflate them.

I think a remedy for this problem is a dimensionality reduction of the
term-document matrix, like the one LSA performs. Maybe I'm going to experiment
with that a little in the next few weeks. What's the easiest way to get at the
term-document matrix, either during or after indexing?

> However, there is a potential remedy available if we use clustering. Say that
> the heuristics yield this collection of terms:
>
>   economics capital interest investment addison
>
> One of these things is not like the others. :) Meaning, if you look at all
> those terms in a vector space, most of them will be clustered together, but
> one will be way far away.
>
> What I'd like to do is identify the cluster that best represents the document,
> and exclude any terms outside of that cluster when building the
> MoreLikeThisQuery.

I'm not sure clustering really helps here. Suppose the search terms split
evenly between two clusters, both of which are relevant to the query. Do you
really want to exclude one of the clusters?

Nick

-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/
