Re: Plagiarism - document similarity

Luca Natti Mon, 11 Jul 2011 03:01:40 -0700

Something that also gives false positives is OK,
because we can check by hand for the final decision on the document.


I need more specific directions with some examples,
because we have little time to implement this.


On Mon, Jul 11, 2011 at 10:35 AM, Em <[email protected]> wrote:

> Hi Luca,
>
> How about quoting another researcher's work? Are you also interested in
> the amount of quoted text relative to the whole document? I think it is
> not impossible to let an algorithm find out whether some subsequences in
> both documents are correctly marked as quotes, but it might be hard.
> Depending on your business case you might find that there will be a lot
> of false positives when judging someone's work as plagiarism.
>
> Another idea to find out similarity between the content of two documents
> is implemented in Nutch. Fortunately I found a piece of documentation in
> the solr-api-docs where you can read about it:
>
> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>
> You could do something like that for content-blocks of a document
> (several sentences or a fixed window of words). This way you are able to
> find out similarities between documents where the author has rewritten a
> part of another researcher's work.
> This way you are able to find passages where the
> longest common subsequence is small but a human would still see the
> similarities between both documents and the possibility of plagiarism.
>
> Regards,
> Em
>
> On 11.07.2011 09:15, Luca Natti wrote:
> > Yes, I'm interested in plagiarism applied to research papers, university
> > notes, theses.
> > Any theory and *best* snippets of code/examples are very much appreciated!
> > Thanks in advance for your help,
> >
> >
> > On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
> >
> >> If 'puzzling' means direct plagiarism, then some sort of
> >> longest-common-subsequence might be a better metric.
> >>
> >> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term
> >> for me.
> >>
> >> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
> >>> You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
> >>>
> >>> On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
> >>>> Is there a way to compute similarity between docs?
> >>>> And similarity by paragraphs?
> >>>>
> >>>> What we want to tell is whether a research paper is original or made
> >>>> by "puzzling" together other works.
> >>>>
> >>>> thanks!
> >>>>
> >>>
> >>
> >> --
> >>
> >> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
> >>
> >
>
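
For the longest-common-subsequence metric suggested in the quoted thread, here is a minimal sketch in Python. It works on whitespace-separated words rather than characters, and the names lcs_length and lcs_ratio as well as the choice to normalise by the shorter text are my own assumptions, not something proposed in the thread. Treat it as a starting point for paragraph-by-paragraph comparison, not a finished detector.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.

    Standard dynamic programme, keeping only the previous row,
    so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(text_a, text_b):
    """LCS length divided by the length of the shorter text (in words).

    Hypothetical normalisation: 1.0 means the shorter text appears,
    in order, inside the longer one; 0.0 means no shared words.
    """
    a = text_a.lower().split()
    b = text_b.lower().split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))
```

Calling lcs_ratio(paragraph, candidate_source) for every pair of paragraphs gives a score per pair; pairs above a cutoff you pick by experiment would go to manual review. The quadratic cost means this is only practical on paragraph-sized inputs.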

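For the idea of comparing content blocks (a fixed window of words) mentioned in the quoted thread, the following is a rough sketch. Note that it does not reproduce Solr's TextProfileSignature; it swaps in a simpler, well-known technique, Jaccard similarity over word 3-grams ("shingles"), applied per window. All function names, the window size, the step and the threshold are illustrative assumptions that would need tuning on real papers.

```python
import re


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def windows(toks, size=50, step=25):
    """Overlapping fixed-size windows of words (assumed sizes).

    A text shorter than `size` yields a single window; words after
    the last full window are ignored in this sketch.
    """
    if not toks:
        return []
    last_start = max(len(toks) - size, 0)
    return [toks[i:i + size] for i in range(0, last_start + 1, step)]


def shingles(toks, k=3):
    """Set of all runs of k consecutive words."""
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def suspicious_windows(doc, source, size=50, step=25, k=3, threshold=0.3):
    """Return (window_index, best_score) for windows of `doc` whose
    shingle set overlaps some window of `source` at or above the
    (hypothetical) threshold."""
    src = [shingles(w, k) for w in windows(tokens(source), size, step)]
    hits = []
    for i, w in enumerate(windows(tokens(doc), size, step)):
        s = shingles(w, k)
        best = max((jaccard(s, t) for t in src), default=0.0)
        if best >= threshold:
            hits.append((i, best))
    return hits
```

Since false positives are acceptable and the final decision is made by hand, a low threshold that flags too many windows is probably the safer starting point. Comparing every document against every other this way is slow for a large collection; indexing the shingles (for example in Lucene/Solr) would be the next step.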