On Tue, Dec 13, 2016 at 03:08:02PM -0700, Swift Griggs wrote:
> Let's say one wants to make general statement that "This code is 30%
> the same as that code!" Another example would be someone wants to
> make the statement that "XX% of this code really came from project
> X." In my case I'm only interested in "honest" code, not trying to
> catch someone stealing/permuting existing code. Oh, and everything I
> care about is in C.
>
> My questions are:
>
> * Are there tools that already do this?
I don't know if there is a tool that does what you want---please let
us know what you find.

You might have a look at 'spdiff', a program that infers "semantic
patches" that can be applied with the Coccinelle program, spatch.  I'm
not sure how you would assign a "sameness" to the result.  Shortness
of the patch, maybe?

You might look at my ARFE tools, which I have put in NetBSD's othersrc
repo.  ARFE uses a dynamic programming algorithm to align one text
with another, seeking to match like characters or lexical items
("tokens") with like while minimizing the amount of unmatched text
("residue").  Imperfect matches and residue are added up to produce a
"score" for the alignment.  The algorithm, a variant of Hirschberg's
algorithm, seeks to minimize that score.

ARFE understands some common tokens like numbers, whitespace, and
C-like identifiers.  ARFE does not "understand" nested structures,
yet.  Also, it does not favor an alignment where every instance of a
token x in the first text is replaced by the token y in the second
text over an alignment where x has assorted replacements.  I cannot
make up my mind whether it would be more difficult to make ARFE
understand nested structures, or to favor alignments where one token
always replaces another.  Since you are concerned with comparing C
programs, you would want to do both, and you would want to respect
the scope rules.

Speaking of nested structures, there are algorithms for aligning
trees rather than strings.  You could conceivably compare the
abstract syntax trees produced by two C programs, and judge their
sameness that way.

> I know that this is essentially an AI problem and thus can get
> complex in a hurry.

I wouldn't call it an AI problem, myself.  It's an optimization
problem.  Or maybe that is just the way I choose to think of it.
:-)

I suspect that it is easier to produce a tool that produces useful
results on many (but not all) texts consisting of tokens and nested
structures that are common on the web, than to produce a tool that
produces a perfect result on, say, every compilable C program.

Dave

-- 
David Young             dyo...@pobox.com    Urbana, IL    (217) 721-9981