Re: Semantic/fuzzy-logic code comparison tool?
On Tue, Dec 13, 2016 at 03:08:02PM -0700, Swift Griggs wrote:
> Let's say one wants to make a general statement that "This code is 30%
> the same as that code!" Another example would be someone wants to
> make the statement that "XX% of this code really came from project
> X." In my case I'm only interested in "honest" code, not trying to
> catch someone stealing/permuting existing code. Oh, and everything I
> care about is in C.
>
> My questions are:
>
> * Are there tools that already do this?

I don't know of a tool that does what you want---please let us know
what you find.

You might have a look at 'spdiff', a program that infers "semantic
patches" that can be applied with the Coccinelle program, spatch. I'm
not sure how you would assign a "sameness" to the result. Shortness of
the patch, maybe?

You might look at my ARFE tools, which I have put in NetBSD's othersrc
repo. ARFE uses a dynamic programming algorithm to align one text with
another, seeking to match like characters or lexical items ("tokens")
with like while minimizing the amount of unmatched text ("residue").
Imperfect matches and residue are added up to produce a "score" for the
alignment. The algorithm, a variant of Hirschberg's algorithm, seeks to
minimize the score.

ARFE understands some common tokens like numbers, whitespace, and
C-like identifiers. ARFE does not "understand" nested structures, yet.
Also, it does not favor an alignment where every instance of a token x
in the first text is replaced by the token y in a second text, over an
alignment where x has assorted replacements. I cannot make up my mind
whether it would be more difficult to make ARFE understand nested
structures, or to favor alignments where one token always replaced
another. Since you are concerned with comparing C programs, you would
want to do both, and you would want to respect the scope rules.

Speaking of nested structures, there are algorithms for aligning trees
rather than strings. You could conceivably compare the abstract syntax
trees produced by two C programs, and judge their sameness that way.

> I know that this is essentially an AI problem and thus can get
> complex in a hurry.

I wouldn't call it an AI problem, myself. It's an optimization problem.
Or maybe that is just the way I choose to think of it. :-)

I suspect that it is easier to produce a tool that produces useful
results on many (but not all) texts consisting of tokens and nested
structures that are common on the web, than to produce a tool that
produces a perfect result on, say, every compilable C program.

Dave

--
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981
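The alignment idea Dave describes (match like tokens with like, charge for mismatches and residue, minimize the total) can be sketched as a small dynamic program. This is not ARFE's code, only an illustration of the scoring scheme; the quadratic table here is exactly what a Hirschberg-style variant would compute in linear space:

```python
def align_score(xs, ys, mismatch=1, residue=1):
    """Minimal-cost alignment of two token lists: equal tokens match
    for free, substitutions cost `mismatch`, and unmatched tokens
    ("residue") cost `residue` each.  Plain O(n*m) dynamic program."""
    n, m = len(xs), len(ys)
    prev = [j * residue for j in range(m + 1)]   # align xs[:0] vs ys[:j]
    for i in range(1, n + 1):
        cur = [i * residue] + [0] * m            # align xs[:i] vs ys[:0]
        for j in range(1, m + 1):
            sub = prev[j - 1] + (0 if xs[i - 1] == ys[j - 1] else mismatch)
            cur[j] = min(sub,                    # match or substitute
                         prev[j] + residue,      # xs[i-1] left unmatched
                         cur[j - 1] + residue)   # ys[j-1] left unmatched
        prev = cur
    return prev[m]

def sameness(xs, ys):
    """Fold the alignment score into a rough 0..1 'sameness' figure,
    where 0 means all residue and 1 means a perfect alignment."""
    worst = len(xs) + len(ys)                    # everything residue
    return 1.0 - align_score(xs, ys) / worst if worst else 1.0
```

For example, `sameness(list("abc"), list("abc"))` is 1.0, while one substituted token lowers the figure in proportion to the total length.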
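As a crude stand-in for aligning actual abstract syntax trees, one could compare multisets of (token, nesting depth) pairs. A real tool would use a C parser and a tree-edit-distance algorithm; this hypothetical sketch (assuming balanced braces) only shows the flavor of judging "sameness" by nested structure:

```python
from collections import Counter

def depth_tokens(code):
    """Pair each token with its brace-nesting depth.  Assumes the
    input has balanced braces; braces themselves are not counted."""
    depth, out = 0, []
    for tok in code.replace("{", " { ").replace("}", " } ").split():
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        else:
            out.append((tok, depth))
    return Counter(out)

def tree_similarity(a, b):
    """Fraction of (token, depth) pairs the two texts share,
    relative to the larger multiset."""
    ca, cb = depth_tokens(a), depth_tokens(b)
    shared = sum((ca & cb).values())
    total = max(sum(ca.values()), sum(cb.values()))
    return shared / total if total else 1.0
```

Two fragments that differ only in one renamed variable score high but below 1.0, since the renamed occurrences fall out of the shared multiset while their depths still agree.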
Re: A single-board computer for NetBSD
On Sat 2016-12-10 01:34 , Martin Cermak wrote:
> On Tue 2016-12-06 21:38 , Martin Husemann wrote:
> > On Tue, Dec 06, 2016 at 08:38:17PM +0100, Martin Cermak wrote:
> > > On Tue 2016-12-06 18:49 , co...@sdf.org wrote:
> > > > I have the same. I did not replace the USB stick, and in the
> > > > original msdos partition I placed a netbsd.elf32 file with the
> > > > same name as the linux kernel had (it was 'vmlinux.64' or so).
> > >
> > > That works for me, the kernel boots; how about the root
> > > partition? Did you reuse the existing ext2 partition?
> > > Or did you do something else?
> >
> > I would replace the ext2 partition with a ffs one (u-boot should
> > not care about anything but the msdos one) and install onto that.
>
> Got it!
>
> https://paste.fedoraproject.org/502816/32982214/
>
> Thanks everyone,
>
> Martin

Guys, is there some public tracker for the second CPU enablement
effort? Today's kernel ERLITE.201612131250Z doesn't seem to have it
yet. Other than that, the black box seems to run like a charm!

Martin
Semantic/fuzzy-logic code comparison tool?
Let's say one wants to make a general statement that "This code is 30%
the same as that code!" Another example would be someone wants to make
the statement that "XX% of this code really came from project X." In
my case I'm only interested in "honest" code, not trying to catch
someone stealing/permuting existing code. Oh, and everything I care
about is in C.

My questions are:

* Are there tools that already do this?

* What do you do about whitespace, simple variable permutation, and
  formatting issues? I.e., times when a tiny thing changes the
  "checksum" of your content but it's essentially still the same code.

I know that this is essentially an AI problem and thus can get complex
in a hurry. I was writing some scripts to take a swing at some kind of
prototype (and I even made some early progress), but then I thought
"surely someone's already done this, genius." Anyone know of any place
to start, here?

I know it's awfully arbitrary and subjective. However, as long as the
algorithm isn't partisan and generates reproducible and at least
somewhat defensible results, I can live with the subjectivity.

-Swift

Now for those that might be somewhat interested, this is what I
started with on tissue paper (just notes). Feel free to critique if
you have ideas or know of preexisting stuff I should look at. I'd
rather not invent this wheel.

* Substitute all whitespace with a single space, yeah, for sure.
  Forget about wrapping characters, too (CR, LF, etc.).

* Possibly use something like soundex on variables? Hmm, how to detect
  when the same variable is used under a new name? Leading/trailing
  characters?

* Count braces and nesting levels? Does this generate a unique enough
  pattern? Add it to an overall heuristic score a la Bayesian style?

* How to solve the problem of old code in a new location? Also when
  it's slightly permuted?

* What will I use for quanta/units to analyze? Going by lines is dumb
  since it implies whitespace (which is ignored). By function? By sets
  of braces or parens? By scope? Multiple types of quanta?

* I'll start with multiple scripts. Each one builds its own score
  based on a different technique. Then we aggregate the scores and see
  which ones are most useful/accurate for my use cases. Then see if
  any track together or diverge in different cases.

* What about old K&R code that's simply been updated with a newer
  function declaration and C99 or C11 stuff? Should be able to use a
  regex to detect this?

* Probably better to write the tool in script; too much string
  handling to dork with it in C.

* If one file is 100k and another 50k, make sure that the tools never
  assert a difference of less than 50%? What if file B is just 2x a
  bunch of code still found in file A? Grrr... think...

Those were just rough notes with my ideas.
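The first note above (collapse whitespace, then score) can be sketched in a few lines of script. This is only a starting-point illustration, with `difflib.SequenceMatcher` standing in for whatever alignment the finished tool would actually use:

```python
import difflib
import re

def normalize(src):
    """Collapse every whitespace run (spaces, tabs, CR, LF) to a
    single space, per the first note, and trim the ends."""
    return re.sub(r"\s+", " ", src).strip()

def score(a, b):
    """Rough 'sameness' percentage over the normalized texts."""
    sm = difflib.SequenceMatcher(None, normalize(a), normalize(b))
    return 100.0 * sm.ratio()
```

On the size-imbalance worry in the last note: `ratio()` is defined as 2*M/T, where M is the number of matched characters and T the combined length, so a 50k file wholly contained in a 100k file can score at most about 67%, never 100%.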