On 16.03.2010 06:17, Marvin Humphrey wrote:
> Greets,
> 
> Lucene has a MoreLikeThisQuery in contrib:
> 
>   
> http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
> 
> It functions by selecting a handful of high-value (i.e. rare) terms out of a
> document and building up a composite ORQuery based on those. 

Yes, it looks like it's simply an approximation to the cosine similarity
of the term vectors.
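
To make that reading concrete, here is a rough sketch of cosine similarity over sparse term-weight vectors, plus a truncation to the k heaviest terms of the kind a MoreLikeThis-style query performs. The helper names (cosine, top_terms) and all the weights are made up for illustration; none of this is Lucene's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_terms(vec, k):
    """Keep only the k highest-weighted terms of a vector."""
    keep = sorted(vec, key=vec.get, reverse=True)[:k]
    return {t: vec[t] for t in keep}

# Hypothetical tf-idf style weights, invented for this example.
doc = {"economics": 2.1, "capital": 1.8, "interest": 1.5,
       "investment": 1.7, "addison": 3.0, "the": 0.1}
other = {"economics": 1.9, "capital": 1.2, "investment": 2.0, "market": 1.4}
namesake = {"addison": 3.2, "poetry": 2.5, "essay": 1.9}

full = cosine(doc, other)                  # all terms of the source document
approx = cosine(top_terms(doc, 3), other)  # only the three heaviest terms
# The rare surname alone is enough to give the unrelated document a score.
namesake_score = cosine(top_terms(doc, 3), namesake)
```

In this toy data the surname is the heaviest term, so it survives the truncation and matches the unrelated document, which is the conflation described below.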

> The problem is that if you have e.g. two authors with the same (uncommon) last
> name, but these authors write on entirely different subjects,
> MoreLikeThisQuery will often conflate them.

I think a remedy for this problem is a dimensionality reduction of the
term-document matrix, as LSA does. I may experiment with that a little
over the next few weeks. What's the easiest way to get at the
term-document matrix, either during or after indexing?
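
For reference, a minimal pure-Python sketch of the kind of reduction meant here: build a small term-document count matrix, take the top k eigenpairs of its document-by-document Gram matrix by power iteration with deflation, and use them as k-dimensional document coordinates (the same coordinates a rank-k SVD would give, up to sign). The toy documents and the function names are assumptions for illustration; a real experiment would use a proper SVD routine and weighted counts.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-uniform start vector
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(work, v)))
        pairs.append((max(lam, 0.0), v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_doc_vectors(docs, k):
    """Rank-k document coordinates from a raw-count term-document matrix."""
    vocab = sorted({t for d in docs for t in d})
    a = [[d.count(t) for d in docs] for t in vocab]  # rows: terms, cols: docs
    n = len(docs)
    gram = [[sum(r[i] * r[j] for r in a) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus, invented for this example.
docs = [
    ["economics", "capital", "investment", "addison"],
    ["economics", "interest", "investment"],
    ["capital", "interest", "market"],
    ["addison", "poetry", "essay"],
    ["poetry", "essay", "verse"],
]
latent = lsa_doc_vectors(docs, 2)
same_topic = cos(latent[0], latent[1])  # two economics documents
namesake = cos(latent[0], latent[3])    # linked only by the shared surname
```

In this corpus the two economics documents end up much closer to each other in the reduced space than the first document is to the one that shares only the surname. Whether that carries over to a real index is exactly what the experiment would have to show.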

> However, there is a potential remedy available if we use clustering.  Say that
> the heuristics yield this collection of terms:
> 
>     economics capital interest investment addison 
>   
> One of these things is not like the others.  :)  Meaning, if you look at all
> those terms in a vector space, most of them will be clustered together, but
> one will be way far away.
> 
> What I'd like to do is identify the cluster that best represents the document,
> and exclude any terms outside of that cluster when building the
> MoreLikeThisQuery.   

I'm not sure clustering really helps here. Suppose that each half of the
search terms is from one of two clusters both of which are relevant to
the query. Do you really want to exclude one of the clusters?
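
A small sketch of the concern, assuming each term has already been assigned a cluster label somehow (the labels and the keep_dominant_cluster helper are hypothetical):

```python
from collections import Counter

def keep_dominant_cluster(term_clusters):
    """Keep only the terms whose cluster label is the most common one."""
    counts = Counter(term_clusters.values())
    dominant, _ = counts.most_common(1)[0]
    return [t for t, c in term_clusters.items() if c == dominant]

# The case from the quoted example: one stray term among related ones.
outlier_case = {"economics": "finance", "capital": "finance",
                "interest": "finance", "investment": "finance",
                "addison": "names"}

# The case in question: two equally relevant topics in one document.
split_case = {"economics": "finance", "capital": "finance",
              "investment": "finance", "genome": "biotech",
              "protein": "biotech", "sequencing": "biotech"}

kept_outlier = keep_dominant_cluster(outlier_case)
kept_split = keep_dominant_cluster(split_case)
```

In the first case only the stray surname is dropped. In the second, half of the query terms disappear even though both topics describe the document.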

Nick


-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/
