Re: Document Similarity Algorithm at Solr/Lucene

2013-08-07 Thread Lance Norskog
Block-quoting and plagiarism are two different questions. Block-quoting is simple: break the text apart into sentences or even paragraphs and make them separate documents. Make facets of the post-analysis text. Now just pull counts of facets and block quotes will be clear. Mahout has a
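A minimal sketch of the sentence-splitting idea above, in plain Python rather than Solr. The helper names (`normalize`, `split_sentences`, `shared_sentences`) and the `min_words` cutoff are assumptions for illustration, and the regex normalization is only a crude stand-in for a real analysis chain. Counting how many documents share a normalized sentence plays the role of the facet count:

```python
import re
from collections import defaultdict

def normalize(sentence):
    # Crude stand-in for post-analysis text: lowercase, drop punctuation.
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shared_sentences(docs, min_words=5):
    # Map each normalized sentence to the set of documents containing it,
    # then keep only sentences seen in more than one document.
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if len(key.split()) >= min_words:  # skip very short sentences
                seen[key].add(doc_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}

docs = {
    "a": "The quick brown fox jumps over the lazy dog. Something else entirely here today.",
    "b": "Intro sentence goes right here now. The quick brown fox jumps over the lazy dog. Bye.",
}
quoted = shared_sentences(docs)
```

Here `quoted` maps each sentence that occurs in more than one document to the ids of those documents. This only catches verbatim (post-normalization) block quotes, not paraphrased plagiarism.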

RE: Document Similarity Algorithm at Solr/Lucene

2013-08-05 Thread Alexey Kozhemiakin
Subject: Re: Document Similarity Algorithm at Solr/Lucene BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use underneath? 2013/7/24 Roman Chyla roman.ch...@gmail.com This paper contains an excellent algorithm for plagiarism detection, but beware the published version had

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-25 Thread Furkan KAMACI
BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use underneath? 2013/7/24 Roman Chyla roman.ch...@gmail.com This paper contains an excellent algorithm for plagiarism detection, but beware the published version had a mistake in the algorithm - look for corrections - I

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Otis Gospodnetic
Sent: Tuesday, July 23, 2013 6:16 AM To: solr-user@lucene.apache.org Subject: Re: Document Similarity Algorithm at Solr/Lucene Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, I

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Roman Chyla
This paper contains an excellent algorithm for plagiarism detection, but beware the published version had a mistake in the algorithm - look for corrections - I can't find them now, but I know they have been published (perhaps by one of the co-authors). You could do it with solr, to create an index

Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Hi, sometimes a large part of a document may exist in another document, as in student plagiarism or a blog post quoted in another blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect it?

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1]: http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi, sometimes a large part of a document may exist in another document, as in student plagiarism or quotation of a

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Actually, I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1]: http://wiki.apache.org/solr/MoreLikeThis 2013/7/23

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Jack Krupansky
that the top results will be more relevant. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Tuesday, July 23, 2013 6:16 AM To: solr-user@lucene.apache.org Subject: Re: Document Similarity Algorithm at Solr/Lucene Actually I need a specialized algorithm. I want to use

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shawn Heisey
On 7/23/2013 3:33 AM, Furkan KAMACI wrote: Sometimes a large part of a document may exist in another document, as in student plagiarism or a blog post quoted in another blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect it? Solr is designed

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
If you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO), I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine-tune an existing algorithm that is flexible enough If I were to do

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com If you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO), I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote: Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com if you need a specialized