Re: Document comparison
Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html Otis Awesome awesome awesome! Thanks very much. -- Matt Chaput Word Monkey Side Effects Software Inc. "A goddamned ray of sunshine all the goddamned time" -- Sparkle Hayter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document comparison
Otis Gospodnetic wrote: Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html If you want an informal way of doing it you're right, just feed the words of the source doc to a query. The doc for the code it is at this easy to remember URL: http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/SimilarityQueries.html#formSimilarQuery(java.lang.String,%20org.apache.lucene.analysis.Analyzer,%20java.lang.String,%20java.util.Set) Follow Otis's link above to my weblog for the code. The "MoreLikeThis" stuff is similar but more sophisticated. Also if you want the IR way I think you'd do a "cosine measure". I know carrot2 has the code - this might be it: http://www.searchmorph.com/pub/carrot2/jd/com/chilang/carrot/filter/cluster/rough/measure/CosineCoefficient.html Otis --- Matt Chaput <[EMAIL PROTECTED]> wrote: Is there a simple, efficient way to compute similarity of documents indexed with Lucene? My first, naive idea is to use the entire contents of one document as a query to the second document, and use the score as a similarity measurement. But I think I'm probably way off base with that. Can any IR pros set me straight? Thanks very much. Matt -- Matt Chaput Word Monkey Side Effects Software Inc. "A goddamned ray of sunshine all the goddamned time" -- Sparkle Hayter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document comparison
Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html Otis --- Matt Chaput <[EMAIL PROTECTED]> wrote: > Is there a simple, efficient way to compute similarity of documents > indexed with Lucene? > > My first, naive idea is to use the entire contents of one document as > a > query to the second document, and use the score as a similarity > measurement. But I think I'm probably way off base with that. > > Can any IR pros set me straight? Thanks very much. > > Matt > > > -- > Matt Chaput > Word Monkey > Side Effects Software Inc. > > "A goddamned ray of sunshine all the goddamned time" > -- Sparkle Hayter > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document comparison
My first, naive idea is to use the entire contents of one document as a query to the second document, Sorry, I meant use the entire contents of one document as a query *on the rest of the corpus*. -- Matt Chaput Word Monkey Side Effects Software Inc. "A goddamned ray of sunshine all the goddamned time" -- Sparkle Hayter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Document comparison
Is there a simple, efficient way to compute similarity of documents indexed with Lucene? My first, naive idea is to use the entire contents of one document as a query to the second document, and use the score as a similarity measurement. But I think I'm probably way off base with that. Can any IR pros set me straight? Thanks very much. Matt -- Matt Chaput Word Monkey Side Effects Software Inc. "A goddamned ray of sunshine all the goddamned time" -- Sparkle Hayter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]