Re: Semantic/fuzzy-logic code comparison tool?

2016-12-13 Thread David Young
On Tue, Dec 13, 2016 at 03:08:02PM -0700, Swift Griggs wrote:
> Let's say one wants to make a general statement that "This code is 30%
> the same as that code!" Another example would be someone wanting to
> make the statement that "XX% of this code really came from project
> X." In my case I'm only interested in "honest" code, not trying to
> catch someone stealing/permuting existing code. Oh, and everything I
> care about is in C.
> 
> My questions are:
> 
> * Are there tools that already do this?

I don't know if there is a tool that does what you want---please let
us know what you find.

You might have a look at 'spdiff', a program that infers "semantic
patches" that can be applied with the Coccinelle program, spatch.  I'm
not sure how you would assign a "sameness" to the result.  Shortness of
the patch, maybe?

You might look at my ARFE tools, which I have put in NetBSD's othersrc
repo.  ARFE uses a dynamic programming algorithm to align one text with
another, seeking to match like characters or lexical items ("tokens")
with like while minimizing the amount of unmatched text ("residue").
Imperfect matches and residue are added up to produce a "score" for the
alignment.  The algorithm, a variant of Hirschberg's algorithm, seeks to
minimize the score.  ARFE understands some common tokens like numbers,
whitespace, and C-like identifiers.
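
To give the flavor of the approach (this is just an illustrative
sketch in Python with invented unit costs, not ARFE's actual code),
the recurrence for such an alignment looks roughly like this;
Hirschberg's refinement computes the same score in linear space:

    import re

    def tokenize(text):
        # numbers, C-like identifiers, or any single non-space character
        return re.findall(r'[A-Za-z_]\w*|\d+|\S', text)

    def align_cost(a, b, mismatch=1, gap=1):
        """Minimum cost of aligning token lists a and b."""
        n, m = len(a), len(b)
        # cost[i][j] = best cost of aligning a[:i] with b[:j]
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i * gap
        for j in range(1, m + 1):
            cost[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if a[i - 1] == b[j - 1] else mismatch
                cost[i][j] = min(cost[i - 1][j - 1] + sub,  # (mis)match
                                 cost[i - 1][j] + gap,      # residue in a
                                 cost[i][j - 1] + gap)      # residue in b
        return cost[n][m]

    def sameness(s, t):
        # 1.0 for identical token streams, 0.0 for nothing in common
        a, b = tokenize(s), tokenize(t)
        return 1.0 - align_cost(a, b) / (max(len(a), len(b)) or 1)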

ARFE does not "understand" nested structures, yet.  Also, it does not
favor an alignment where every instance of a token x in the first text
is replaced by the same token y in the second text, over an alignment
where x has assorted replacements.  I cannot make up my mind whether it
would be more difficult to make ARFE understand nested structures, or
to favor alignments where one token always replaces another.  Since you
are concerned with comparing C programs, you would want to do both, and
you would want to respect the scope rules.
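
The "consistent replacement" preference could be stated as a check on
an alignment's matched pairs.  A toy sketch in Python (ignoring scope,
which a real tool would have to track):

    # Given (x, y) pairs of matched tokens, test whether every x in the
    # first text always maps to the same y in the second, and vice versa.
    def consistent(pairs):
        fwd, rev = {}, {}
        for x, y in pairs:
            if fwd.setdefault(x, y) != y or rev.setdefault(y, x) != x:
                return False
        return True

    print(consistent([('foo', 'bar'), ('foo', 'bar')]))  # True
    print(consistent([('foo', 'bar'), ('foo', 'baz')]))  # False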

Speaking of nested structures, there are algorithms for aligning trees
rather than strings.  You could conceivably compare the abstract syntax
trees produced from two C programs, and judge their sameness that way.
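
As a toy illustration (hand-made (label, children) tuples rather than
real ASTs; a real tool would parse the C and use a proper tree edit
distance such as Zhang-Shasha's):

    def tree_size(t):
        label, children = t
        return 1 + sum(tree_size(c) for c in children)

    def common(t1, t2):
        # greedy top-down match; a real aligner would also handle
        # insertions, deletions, and reordered children
        (l1, c1), (l2, c2) = t1, t2
        score = 1 if l1 == l2 else 0
        for a, b in zip(c1, c2):
            score += common(a, b)
        return score

    def tree_sameness(t1, t2):
        return 2.0 * common(t1, t2) / (tree_size(t1) + tree_size(t2))

    # "if (x) return 0;" vs. "if (y) return 1;" -- both reduce to the
    # same shape once identifiers and constants are abstracted
    a = ('if', [('id', []), ('return', [('const', [])])])
    b = ('if', [('id', []), ('return', [('const', [])])])
    print(tree_sameness(a, b))  # 1.0: same shape and labels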

> I know that this is essentially an AI problem and thus can get
> complex in a hurry.

I wouldn't call it an AI problem, myself.  It's an optimization problem.
Or maybe that is just the way I choose to think of it. :-)

I suspect that it is easier to produce a tool that produces useful
results on many (but not all) texts consisting of tokens and nested
structures that are common on the web, than to produce a tool that
produces a perfect result on, say, every compilable C program.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: A single-board computer for NetBSD

2016-12-13 Thread Martin Cermak
On  Sat  2016-12-10  01:34 , Martin Cermak wrote:
> On  Tue  2016-12-06  21:38 , Martin Husemann wrote:
> > On Tue, Dec 06, 2016 at 08:38:17PM +0100, Martin Cermak wrote:
> > > On  Tue  2016-12-06  18:49 , co...@sdf.org wrote:
> > > > I have the same. I did not replace the USB stick, and in the
> > > > original msdos partition I placed a netbsd.elf32 file with the
> > > > same name as the linux kernel had (it was 'vmlinux.64' or so).
> > > 
> > > That works for me, the kernel boots; how about the root
> > > partition?  Did you reuse the existing ext2 partition?
> > > Or did you do something else?
> > 
> > I would replace the ext2 partition with a ffs one (u-boot should not care
> > about anything but the msdos one) and install onto that.
> > 
> 
> Got it!
> 
> https://paste.fedoraproject.org/502816/32982214/
> 
> Thanks everyone,
> 
> Martin

Guys, is there some public tracker for the second CPU enablement
effort?  Today's kernel ERLITE.201612131250Z doesn't seem to have
it yet.  Other than that, the black box seems to run like a charm!

Martin


Semantic/fuzzy-logic code comparison tool?

2016-12-13 Thread Swift Griggs


Let's say one wants to make a general statement that "This code is 30% 
the same as that code!" Another example would be someone wanting to make 
the statement that "XX% of this code really came from project X." In my 
case I'm only interested in "honest" code, not trying to catch someone 
stealing/permuting existing code. Oh, and everything I care about is in C.


My questions are:

* Are there tools that already do this?

* What do you do about whitespace, simple variable permutation, and
  formatting issues? I.e., times when a tiny change alters the "checksum"
  of your content but it's essentially still the same code.

I know that this is essentially an AI problem and thus can get complex in 
a hurry. I was writing some scripts to take a swing at some kind of 
prototype (and I even made some early progress), but then I thought 
"surely someone's already done this, genius."


Anyone know of a good place to start here? I know it's awfully arbitrary and 
subjective. However, as long as the algorithm isn't partisan and generates 
reproducible and at least somewhat defensible results, I can live with the 
subjectivity.


-Swift


Now, for those that might be somewhat interested, this is what I started 
with on tissue paper (just notes). Feel free to critique if you have ideas 
or know of preexisting stuff I should look at. I'd rather not reinvent 
this wheel.


* Replace all whitespace with a single space, yeah, for sure. Forget
  about wrapping characters, too (CR, LF, etc.). (A rough sketch of this
  appears after these notes.)

* Possibly use something like soundex on variables? Hmm, how to detect
  when the same variable is used under a new name? Leading/trailing
  characters? (The sketch after these notes tries positional renaming
  instead.)

* Count braces and nesting levels? Does this generate a unique enough
  pattern? Add it to an overall heuristic score, Bayesian-style?

* How to solve the problem of old code with a new location? Also when it's
  slightly permuted?

* What will I use for quanta/units to analyze? Going by lines is dumb
  since it implies whitespace (which is ignored). By function? By sets of
  braces or parens? By scope? Multiple types of quanta? Hmm.

* I'll start with multiple scripts. Each one builds its own score based
  on a different technique. Then we aggregate the scores and see which
  ones are most useful/accurate for my use cases. Then see if any track
  together or diverge in different cases.

* What about old K&R code that's simply been updated with a newer function
  declaration and C99 or C11 stuff? Should be able to use a regex to
  detect this?

* Probably better to write the tool in a scripting language; too much
  string handling to dork with in C.

* If one file is 100k and another is 50k, make sure that the tools never
  assert a difference of less than 50%? What if file B is just 2x a bunch
  of code still found in file A? Grrr... think...

Those were just rough notes with my ideas.
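
Here is a rough sketch (Python, with a deliberately simplified
tokenizer and keyword list) of the first two notes: collapse
whitespace, and canonicalize identifiers by order of first appearance
instead of soundex, so that the same variable under a new name
compares equal:

    import re

    C_KEYWORDS = {'if', 'else', 'for', 'while', 'return', 'int', 'char',
                  'void', 'struct', 'static', 'const', 'unsigned', 'long'}

    def canonicalize(code):
        # collapse all whitespace runs (spaces, tabs, CR, LF) to one space
        code = re.sub(r'\s+', ' ', code)
        tokens = re.findall(r'[A-Za-z_]\w*|\d+|\S', code)
        names, out = {}, []
        for tok in tokens:
            if re.fullmatch(r'[A-Za-z_]\w*', tok) and tok not in C_KEYWORDS:
                # the same variable under a new name gets the same placeholder
                names.setdefault(tok, 'v%d' % len(names))
                tok = names[tok]
            out.append(tok)
        return out

    a = canonicalize('int total = 0;\nfor (i = 0; i < n; i++) total += x[i];')
    b = canonicalize('int sum=0; for (j=0; j<cnt; j++) sum += vals[j];')
    print(a == b)  # True: renamed variables and reflowed whitespace match

The token lists it returns could then feed whatever quanta and scoring
scheme the other notes settle on.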