Document comparison

2005-02-18 Thread Matt Chaput
Is there a simple, efficient way to compute similarity of documents 
indexed with Lucene?

My first, naive idea is to use the entire contents of one document as a 
query to the second document, and use the score as a similarity 
measurement. But I think I'm probably way off base with that.

Can any IR pros set me straight? Thanks very much.
Matt
--
Matt Chaput
Word Monkey
Side Effects Software Inc.
A goddamned ray of sunshine all the goddamned time
-- Sparkle Hayter
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Document comparison

2005-02-18 Thread Matt Chaput
> My first, naive idea is to use the entire contents of one document as
> a query to the second document,

Sorry, I meant use the entire contents of one document as a query *on
the rest of the corpus*.



Re: Document comparison

2005-02-18 Thread Otis Gospodnetic
Matt,

Erik and I have some code for this in Lucene in Action, but David
Spencer did this since the book was published:

  http://www.lucenebook.com/blog/announcements/more_like_this.html

Otis




Re: Document comparison

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
> Matt,
>
> Erik and I have some code for this in Lucene in Action, but David
> Spencer did this since the book was published:
>
>   http://www.lucenebook.com/blog/announcements/more_like_this.html

If you want an informal way of doing it, you're right: just feed the
words of the source doc to a query. The doc for the code is at this
easy-to-remember URL:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/SimilarityQueries.html#formSimilarQuery(java.lang.String,%20org.apache.lucene.analysis.Analyzer,%20java.lang.String,%20java.util.Set)
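
For readers who want to see the "feed the words to a query" idea in isolation, here is a minimal sketch in plain Java with no Lucene dependency. It is not the sandbox SimilarityQueries code. The class name SimilarQuerySketch, the method buildQueryText, and the crude regex tokenizer are all assumptions made for illustration; a real setup would tokenize with the same Analyzer used at index time.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or its sandbox.
class SimilarQuerySketch {

    // Returns the distinct, non-stop-word terms of the text, space-separated,
    // in order of first appearance.
    static String buildQueryText(String text, Set<String> stopWords) {
        Set<String> terms = new LinkedHashSet<>();
        // Assumption: split on anything that is not a letter or digit.
        // A real index would define tokens through its Analyzer instead.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return String.join(" ", terms);
    }
}
```

The resulting string is meant to be handed to a query parser whose default operator is OR, so that documents sharing more of the terms tend to score higher. For a long source document the term list gets large, which is one reason to prefer an approach that keeps only the most significant terms.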

Follow Otis's link above to my weblog for the code.
The MoreLikeThis stuff is similar but more sophisticated.
Also, if you want the IR way, I think you'd do a cosine measure. I know
carrot2 has the code; this might be it:

http://www.searchmorph.com/pub/carrot2/jd/com/chilang/carrot/filter/cluster/rough/measure/CosineCoefficient.html
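
As a rough illustration of the cosine measure, the sketch below compares two texts by the cosine of the angle between their raw term-count vectors. It is plain Java and is not the carrot2 CosineCoefficient class. The names CosineSketch, termFrequencies, and cosine are hypothetical, and the regex tokenizer is an assumption standing in for a real Analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not the carrot2 or Lucene implementation.
class CosineSketch {

    // Counts how often each token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Assumption: tokens are runs of letters or digits, lowercased.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term-count vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

Calling cosine(termFrequencies(docA), termFrequencies(docB)) gives a value between 0 and 1 for non-negative counts. Raw counts are used here only to keep the sketch short; an IR setup would more commonly weight each term, for example by tf-idf, before taking the cosine.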


Re: Document comparison

2005-02-18 Thread Matt Chaput
> Matt,
>
> Erik and I have some code for this in Lucene in Action, but David
> Spencer did this since the book was published:
>
>   http://www.lucenebook.com/blog/announcements/more_like_this.html
>
> Otis

Awesome awesome awesome! Thanks very much.