Re: [Caml-list] Ocaml clone detector
Andrej Bauer a écrit : > As far as student plagiarism goes, we found out that for Java > programs, you can pretty much detect frauds by erasing everything from > the programs except parentheses ( ) { } and then comparing the > remaining strings for editing distance. My explanation is that > students who copy code don't want to spend much time on it. In order > to chance the parenthesis they would have to understand the structure > of the program, which they don't. This reminds me of: http://www.cs.ucdavis.edu/~su/publications/icse07.pdf Cheers, -- Stéphane ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote: An amazing and simple technology to detect plagiarism is compression-based similarity distance. It is a side-effect of state-of-the-art compression algorithms that can be used to compute a distance for many kind of documents (it seems to work at least for program sources, books, music, DNA etc): take any two files A and B, compress A, compress B, and compress the concatenation of A and B, i.e. AB; take the size of these compressed files c(A), c(B) and c(AB); the similarity distance is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)). Indeed, if documents A and B share information, the compression of AB will be much shorter than c(A) + c(B). Also see Alex Aiken's "MOSS" (measure of software similarity). It's online, language-specific, works for a variety of languages. Don't know how its algorithm compares to the one here. I suspect it's different insofar the one you describe is language-independent. John Clements ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
An amazing and simple technology to detect plagiarism is compression-based similarity distance. It is a side-effect of state-of-the-art compression algorithms that can be used to compute a distance for many kind of documents (it seems to work at least for program sources, books, music, DNA etc): take any two files A and B, compress A, compress B, and compress the concatenation of A and B, i.e. AB; take the size of these compressed files c(A), c(B) and c(AB); the similarity distance is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)). Indeed, if documents A and B share information, the compression of AB will be much shorter than c(A) + c(B). A good article (in French unfortunately) can be found at: http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes where a link points to "Baldr", a free Java application written by a French CS professor to compute this distance pairwise for a set of source codes and sort the result. What appears is that the distance between two documents will be much smaller in case of plagiarism than for any other two (even if good students will tend to produce close source codes for the same exam). I wrote a small Ocaml program (baldml) to perform the same task (but without GUI): ./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir where you can choose the compression algorithm among bz2 or gzip, specify a Str-style regexp to filter the files (Ocaml files in the example, but I use it as well for C exam) and the number of sorted lines you want in the output among the n(n-1)/2 unique pairs of the n matching files recursively found in the directory "dir". I can provide the 100 lines of code it if anyone is interested. Hope this helps, -- Nicolas ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
As far as student plagiarism goes, we found out that for Java programs, you can pretty much detect frauds by erasing everything from the programs except parentheses ( ) { } and then comparing the remaining strings for editing distance. My explanation is that students who copy code don't want to spend much time on it. In order to chance the parenthesis they would have to understand the structure of the program, which they don't. Andrej ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
Jacques Carette wrote: > Erik de Castro Lopo wrote: > > Maybe it would help if you explained what a "clone detector" > > is. > > > A "clone" is software-engineering speak for "duplicated code". Exactly > what qualifies as 'duplicated code' and how to efficiently find such > (without too many false positives nor false negatives) is still fairly > active research. This is a huge issue in languages without decent > abstraction features, and less so otherwise (or so it seems). Thanks for the explanation. I can think of two situations where such a clone detector may be useful; for finding similar chunks of code so that they may be refactored for software QA management and for detecting plaguarism in student programming assignments. I suspect that these two kinds of clone detectors would actually be quite different. Cheers, Erik -- -- Erik de Castro Lopo http://www.mega-nerd.com/ ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
Erik de Castro Lopo wrote: > Maybe it would help if you explained what a "clone detector" > is. > A "clone" is software-engineering speak for "duplicated code". Exactly what qualifies as 'duplicated code' and how to efficiently find such (without too many false positives nor false negatives) is still fairly active research. This is a huge issue in languages without decent abstraction features, and less so otherwise (or so it seems). Jacques ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Ocaml clone detector
Kihong Heo wrote: > I want to know there is a clone detector for Ocaml program. Maybe it would help if you explained what a "clone detector" is. Erik -- -- Erik de Castro Lopo http://www.mega-nerd.com/ ___ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs