Hi Luca,

How do you want to handle legitimate quotations of another researcher's
work? Are you also interested in the proportion of quoted text relative
to the whole document? I think it is possible to let an algorithm check
whether subsequences shared by both documents are correctly marked as
quotations, but it might be hard. Depending on your business case, you
may get a lot of false positives when judging someone's work as
plagiarism.

Another way to measure the similarity between the content of two
documents is implemented in Nutch. Fortunately, I found some
documentation for it in the Solr API docs:
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
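To give a rough idea of how such a fuzzy signature works, here is a
minimal Python sketch that loosely follows the steps described on that
page (tokenize, drop short tokens, quantize the token frequencies, hash
the resulting profile). It is my simplified reading, not the Solr/Nutch
code; the function name profile_signature and its parameters are
hypothetical, and the exact rounding may differ from the real
implementation.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Return an MD5 hex digest of a quantized word-frequency profile.

    Assumption: a simplified approximation of the idea behind
    TextProfileSignature, not a port of it.
    """
    # Lower-case the text and keep runs of letters/digits as tokens.
    tokens = re.findall(r"[^\W_]+", text.lower())
    # Discard tokens of min_token_len characters or fewer.
    tokens = [t for t in tokens if len(t) > min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    max_freq = max(counts.values())
    # Quantization step; forced to at least 2 when any token repeats,
    # so that tokens occurring only once are dropped as noise.
    quant = max(1, round(max_freq * quant_rate))
    if quant < 2 and max_freq > 1:
        quant = 2

    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant  # round down to a multiple
        if quantized >= quant:
            profile.append((token, quantized))

    # Order by decreasing quantized frequency, then alphabetically so
    # the result does not depend on the order of words in the input.
    profile.sort(key=lambda item: (-item[1], item[0]))
    serialized = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


a = "solr solr solr lucene lucene nutch"
b = "solr solr solr lucene lucene mahout"
c = "hadoop hadoop hive hive"
same = profile_signature(a) == profile_signature(b)
different = profile_signature(a) != profile_signature(c)
```

Two texts that differ only in rare words end up with the same
signature, so small edits do not change the result.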

You could do something like that for content blocks of a document
(several sentences or a fixed window of words). This lets you find
similarities between documents where the author has rewritten part of
another researcher's work: passages where the longest common
subsequence is small, but a human would still see the similarity
between both documents and the possibility of plagiarism.
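As a sketch of the block-wise idea, the following Python snippet cuts
both documents into fixed windows of words and compares every pair of
windows. Note that it uses plain Jaccard overlap of the word sets as a
simple stand-in for a per-block signature; the names word_windows,
jaccard and suspicious_blocks, as well as the window size and the
threshold, are assumptions chosen for illustration only.

```python
import re


def word_windows(text, size=8, step=4):
    """Split text into overlapping windows of `size` words."""
    words = re.findall(r"[^\W_]+", text.lower())
    if len(words) <= size:
        return [words] if words else []
    # A trailing remainder that does not fill a whole step is ignored.
    return [words[i:i + size] for i in range(0, len(words) - size + 1, step)]


def jaccard(a, b):
    """Jaccard similarity of two word lists, ignoring order and repeats."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def suspicious_blocks(doc_a, doc_b, size=8, step=4, threshold=0.6):
    """Return (index_a, index_b, score) for window pairs above threshold."""
    hits = []
    windows_b = word_windows(doc_b, size, step)
    for i, win_a in enumerate(word_windows(doc_a, size, step)):
        for j, win_b in enumerate(windows_b):
            score = jaccard(win_a, win_b)
            if score >= threshold:
                hits.append((i, j, score))
    return hits


original = "the quick brown fox jumps over the lazy dog today"
rewritten = "over the lazy dog the quick brown fox jumps today"
hits = suspicious_blocks(original, rewritten)
```

Because word order is ignored inside a window, a reordered passage
still scores high even though its longest common subsequence with the
original is short. Comparing all window pairs is quadratic, so for
larger collections you would want to index the blocks instead.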

Regards,
Em

On 11.07.2011 09:15, Luca Natti wrote:
> yes, i'm interested in plagiarism applied to research papers, university
> notes, thesis.
> Any theory and *best* snippets of code/examples is very appreciated!
> thanks in advance for your help,
> 
> 
> On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
> 
>> If 'puzzling' means direct plagiarism, then some sort of
>> longest-common-subsequence might be a better metric.
>>
>> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
>> me.
>>
>> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
>>> You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
>>>
>>> On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
>>>> Is there a  way to compute similarity between docs?
>>>> And similarity by paragraphs?
>>>>
>>>> What We want to tell is if a research paper is original or made by
>>>> "puzzling" other works.
>>>>
>>>> thanks!
>>>>
>>>
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>
> 
