Something that also gives false positives is OK, because we can check by hand before the final decision on the document.
I need more specific directions with some examples, because we have little time to implement this.

On Mon, Jul 11, 2011 at 10:35 AM, Em <[email protected]> wrote:

> Hi Luca,
>
> how about quoting another researcher's work? Are you also interested in
> the amount of quotes in respect to the whole document? I think it is not
> impossible to let an algorithm find out whether some subsequences in
> both documents are correctly marked, but it might be hard. Depending on
> your business-case you might find out that there will be a lot of
> false-positives when judging someone's work as plagiarism.
>
> Another idea to find out similarity between the content of two documents
> is implemented in Nutch. Fortunately I found a piece of documentation in
> the solr-api-docs where you can read about it:
>
> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>
> You could do something like that for content-blocks of a document
> (several sentences or a fixed window of words). This way you are able to
> find out similarities between documents where the author has rewritten a
> part of another researcher's work.
> This way you are able to find out phrases where the
> longest-common-subsequence is small but a human would see the
> similarities between both documents and the possibility of plagiarism.
>
> Regards,
> Em
>
> On 11.07.2011 09:15, Luca Natti wrote:
> > yes, i'm interested in plagiarism applied to research papers, university
> > notes, theses.
> > Any theory and *best* snippets of code/examples are very appreciated!
> > thanks in advance for your help,
> >
> > On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
> >
> >> If 'puzzling' means direct plagiarism, then some sort of
> >> longest-common-subsequence might be a better metric.
> >>
> >> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
> >> me.
> >>
> >> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
> >>> You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
> >>>
> >>> On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
> >>>> Is there a way to compute similarity between docs?
> >>>> And similarity by paragraphs?
> >>>>
> >>>> What we want to tell is if a research paper is original or made by
> >>>> "puzzling" other works.
> >>>>
> >>>> thanks!
> >>
> >> --
> >>
> >> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
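
Since the thread asks for a concrete example of comparing documents paragraph by paragraph, here is a minimal sketch in Python. It is an assumption about what might fit, not anything proposed above: it uses plain word shingles (fixed windows of n words) with Jaccard overlap, which is neither the TextProfileSignature algorithm nor latent semantic analysis. The function names, the window size `n=3`, and the `threshold=0.3` cutoff are all hypothetical choices for illustration. It flags paragraph pairs for manual review, so false positives are expected.

```python
import re
from itertools import product


def shingles(text, n=3):
    """Return the set of n-word shingles (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def paragraphs(doc):
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]


def suspicious_pairs(doc_a, doc_b, n=3, threshold=0.3):
    """Yield (index_a, index_b, score) for paragraph pairs at or above threshold.

    The threshold is an arbitrary illustrative value; tune it on real data.
    """
    sets_a = [shingles(p, n) for p in paragraphs(doc_a)]
    sets_b = [shingles(p, n) for p in paragraphs(doc_b)]
    for (i, sa), (j, sb) in product(enumerate(sets_a), enumerate(sets_b)):
        score = jaccard(sa, sb)
        if score >= threshold:
            yield i, j, score


# Made-up sample texts, only to show the call.
source = (
    "Latent semantic analysis maps documents into a low dimensional space.\n\n"
    "It was introduced for information retrieval."
)
suspect = (
    "Our own idea is new.\n\n"
    "Latent semantic analysis maps documents into a low dimensional concept space."
)

hits = list(suspicious_pairs(source, suspect))
for i, j, score in hits:
    print(f"source paragraph {i} ~ suspect paragraph {j}: {score:.2f}")
```

This compares every paragraph of one document with every paragraph of the other, so it is quadratic in the number of paragraphs and only suitable for checking a paper against a small set of candidate sources. Exact word shingles catch copied or lightly edited passages; heavily reworded text would need something closer to the fuzzy-signature or LSA ideas mentioned in the thread.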
