Related Article question

sdeck Fri, 06 Jul 2007 16:35:41 -0700

Hello all,
  I have been trying out MoreLikeThis and many other similarity-type
queries, but I still run into problems with related content not being
matched up.


Let me give an example, as well as some questions that hopefully someone can
answer, to help me refine my work.

Example:
1) Document A may have the title "Oden and Durant Are being recruited", and
Document B may have the title "Trailblazers look at Oden and Durant".
 Both Document A and B talk about the recruitment of Oden and Durant, just
in fairly different ways.  One may emphasize Oden over Durant, or vice versa.
 The way the MoreLikeThis and similarity queries seem to work is that they
take terms and see if a lot of them match up in the documents. So, if Durant
is in doc A 10 times and in doc B 10 times, then the similarity will be
higher.
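
To make the term-matching idea concrete, here is a minimal sketch in plain
Java. It is not Lucene code and not how MoreLikeThis is actually implemented;
the class and method names are made up for illustration. It counts terms in
two texts and takes the cosine of the two count vectors, which is one common
way to express "a lot of the same terms show up in both".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity over raw term counts.
class TermOverlapSketch {

    // Lowercase the text, split on anything that is not a letter, count each term.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0.0;
        for (int v : counts.values()) {
            sum += (double) v * v;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-count vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termCounts("Oden and Durant Are being recruited");
        Map<String, Integer> b = termCounts("Trailblazers look at Oden and Durant");
        // Each title has six distinct terms and they share three of them,
        // so the score comes out at roughly 0.5.
        System.out.println(cosine(a, b));
    }
}
```

With only raw counts, two texts on the same topic that use different
vocabulary share few terms and so score low, which matches what I am seeing.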

Here is my problem, though. I run these MoreLikeThis and other similarity
queries, and many of those types of articles do not get matched, because a
lot of the terms are not the same even though they are talking about the
same topic.

Here is what I wonder:
1) Should I somehow give more boost to a full name, or other names, or
titles to help matching? Or does that hinder things?
2) How does shorter content versus longer content work? I may only get
around 5-6 sentences in one document, but a full page in another, yet they
are still talking about the same thing.
3) How would term vectors help, versus not storing term vectors?
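
To frame questions 1 and 2, here is a rough sketch of the classic TF-IDF
style weighting as I understand Lucene's default similarity to work: term
frequency enters as a square root, rarer terms get a larger idf, a field
boost multiplies the score, and longer fields are scaled down by
1/sqrt(number of terms). Treat the formulas as my assumption about the
defaults, not as a statement of the library's API; the class and method
names are hypothetical, and the other factors of the full score are left out.

```java
// Hypothetical sketch of classic TF-IDF style weighting; not Lucene's API.
class ClassicWeightsSketch {

    // Term frequency factor: grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Length normalization: each match counts for less in a longer field.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Contribution of one matching term in one field (assumed simplified form).
    static double termScore(int freq, int docFreq, int numDocs,
                            int fieldLength, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(fieldLength);
    }

    public static void main(String[] args) {
        // Same term, same count, in a 100-term blurb versus a 900-term page.
        double shortDoc = termScore(2, 9, 1000, 100, 1.0);
        double longDoc = termScore(2, 9, 1000, 900, 1.0);
        System.out.println("short: " + shortDoc + ", long: " + longDoc);
        // Doubling the boost (for example on a title field) doubles the score.
        System.out.println("boosted: " + termScore(2, 9, 1000, 100, 2.0));
    }
}
```

Under these assumptions, two mentions of "Durant" in a 5-6 sentence blurb
weigh about three times as much as two mentions in a page nine times as
long, and a boost on names or titles scales their contribution linearly.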

To help further, here is how the system is set up: I have one main index.  I
run a search of the web and collect more documents. Before adding these to
the main index, I run a MoreLikeThis query against the main index for
each of the new documents to be inserted.  That way, I can keep a separate
record of which articles are related to each other for faster lookups.  I
also run a MoreLikeThis query against the new index, just to see which
recently searched articles are similar to each other.
It would seem that document frequency and term counts will not really work
in these sorts of scenarios.
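
The workflow above can be sketched roughly as follows. This is an in-memory
stand-in, not my actual Lucene setup: the class name, the Jaccard overlap
measure, and the threshold are all assumptions chosen to keep the example
small. Each new document is compared against everything already in the
"main index", matches above the threshold are recorded in a separate
related-articles map, and only then is the document added.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for "check against the main index, then insert".
class RelatedIndexSketch {
    private final Map<String, Set<String>> mainIndex = new LinkedHashMap<>(); // id -> terms
    private final Map<String, List<String>> related = new HashMap<>();        // id -> related ids
    private final double threshold;

    RelatedIndexSketch(double threshold) {
        this.threshold = threshold;
    }

    // Lowercase the text and collect its distinct terms.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard overlap: shared terms divided by all distinct terms in either set.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    // Compare the new document with every indexed one, record matches, then add it.
    List<String> addDocument(String id, String text) {
        Set<String> docTerms = terms(text);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : mainIndex.entrySet()) {
            if (jaccard(docTerms, entry.getValue()) >= threshold) {
                matches.add(entry.getKey());
                related.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(id);
            }
        }
        related.put(id, new ArrayList<>(matches));
        mainIndex.put(id, docTerms);
        return matches;
    }

    // Fast lookup of what was recorded as related to a given article.
    List<String> relatedTo(String id) {
        return related.getOrDefault(id, Collections.emptyList());
    }

    public static void main(String[] args) {
        RelatedIndexSketch index = new RelatedIndexSketch(0.3);
        index.addDocument("A", "Oden and Durant Are being recruited");
        index.addDocument("B", "Trailblazers look at Oden and Durant");
        index.addDocument("C", "Stock markets fall sharply");
        System.out.println("Related to A: " + index.relatedTo("A"));
    }
}
```

The threshold is the knob that trades missed matches against false ones,
which is the same trade-off behind the percentages I mention below.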

Not sure if I am explaining my problem as well as I can, but I would love
some kind of reference on how to do related article searching and how I can
refine my results. Right now, I would say about 60-70% get correctly mapped
into related articles, and about 10-20% get incorrectly mapped as a related
article (similar terms, but perhaps not enough content, and the article is
not actually about any of the others).

Any help would be appreciated.
Thanks
Scott
-- 
View this message in context: 
http://www.nabble.com/Related-Article-question-tf4038641.html#a11474031
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

