[FRIAM] hi friam - how do I calculate the fractal dimension of repetitive text?

Giles Bowkett Fri, 10 Dec 2010 18:28:23 -0800

Howdy all - it's been a long time, but I was just planning a return to NM to
see family for the holiday and reading Mandelbrot's book on finance, and the
pairing reminded me of FRIAM.


I have something I'm puzzling with, which you guys might find interesting. I
want to calculate the fractal dimension of repetitive text, specifically
code.

I'm building a system to do automated refactoring. It's essentially a
compiler. I see that as a business project, but I don't see any actual
business in the fractal dimensionality. That part is just a mild personal
obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a
teenager and it changed my life.

So anyway, consider a code base at a tech company. Typically, these code
bases are measured in lines of code. It's a one-dimensional measurement, so
in a sense, a "line" whose length is determined by the number of lines of
code. However, lines of code exhibit self-similarity, and the
self-similarity increases with the clumsiness of the syntax of the
programming language, and with the incompetence or indifference of the
programmers who wrote it. This comes from tons of personal experience, and
from the fact that the less competent a programmer is, or the less they care
about the company, or the more confusing a language's syntax is, the more
likely a programmer is to just copy and paste some code, instead of actually
figuring out what it does.

I think the "line" formed by measuring the number of these lines of code
would more strictly be considered a curve, as in the Koch curve, which is
a highly self-similar curve with a fractal or Hausdorff dimension greater
than one; its fractal dimension comes from its self-similar filigrees.
Fractals 101.

http://en.wikipedia.org/wiki/Koch_snowflake
http://en.wikipedia.org/wiki/Hausdorff_dimension

You create the Koch curve by taking a line segment, replacing its middle
third with a triangular bump, and then repeating the process on each of the
four resulting segments. Start from a triangle instead of a single line and
you end up with this thing that looks like a snowflake or a fuzzy Star of
David.
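
For reference, the Koch curve's dimension can be worked out directly: each
refinement step turns one segment into 4 segments, each 1/3 as long, and the
similarity dimension is log(copies) / log(shrink factor). The sketch below is
only illustrative; the function names are made up for this example, and the
formula applies to exactly self-similar shapes, which a code base is not.

```javascript
// Similarity dimension of an exactly self-similar shape: each step replaces
// one piece with `copies` pieces, each `shrinkFactor` times smaller.
function similarityDimension(copies, shrinkFactor) {
  return Math.log(copies) / Math.log(shrinkFactor);
}

// Segment count and total length of the Koch curve after `steps` refinements,
// starting from a single segment of length 1.
function kochStats(steps) {
  return {
    segments: Math.pow(4, steps),     // every segment becomes 4 segments
    length: Math.pow(4 / 3, steps),   // total length grows by 4/3 per step
  };
}

console.log(similarityDimension(4, 3).toFixed(4)); // Koch curve: 1.2619
console.log(kochStats(3));
```

The total length keeps growing by a factor of 4/3 per step, which is why
"length" alone is a poor way to describe such a curve.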

Many programmers write their code by copying a line and inserting some minor
variation.

E.g., say I have this:

this.moduleOverlay.data.info.specifics.details.render(this.display, "hello world", this);

That's pretty much a real-world example. Say I need the application this
comes from to display "howdy world" a little later on. Even though I think
it's kind of a horrible thing to do, for expediency's sake, I might copy
this code, paste it, and just modify the literal:

this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy world", this);

Now imagine this happening very frequently in a company with a very high
rate of copy-paste. The company creates its entire code base by copying
lines and inserting deviations. You very soon have a very self-similar
corpus of text. The recursion in the Koch curve also has an analog,
because programmers will not just copy-paste line by line, but also
copy-paste giant blocks of code, instead of refactoring to objects. So
although the copying is not a recursive process in most cases, you do have
repetition at multiple scales.
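
A rough way to see the line-level version of this is to blank out the
literals and count how many lines collapse to the same "template". The sketch
below is an assumption-laden illustration, not a real refactoring tool: the
names templateOf and copyPasteRatio are invented here, and the regular
expressions only handle simple quoted strings and plain numbers.

```javascript
// Replace string and numeric literals with placeholders, so that lines which
// differ only in a literal collapse to the same template.
function templateOf(line) {
  return line
    .trim()
    .replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, "<str>")
    .replace(/\b\d+(?:\.\d+)?\b/g, "<num>");
}

// Fraction of non-blank lines whose template occurs more than once.
function copyPasteRatio(source) {
  const lines = source.split("\n").filter((l) => l.trim() !== "");
  const counts = new Map();
  for (const line of lines) {
    const t = templateOf(line);
    counts.set(t, (counts.get(t) || 0) + 1);
  }
  const repeated = lines.filter((l) => counts.get(templateOf(l)) > 1).length;
  return lines.length === 0 ? 0 : repeated / lines.length;
}

const sample = [
  'this.moduleOverlay.data.info.specifics.details.render(this.display, "hello world", this);',
  'this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy world", this);',
  "var total = count + 1;",
].join("\n");

console.log(copyPasteRatio(sample)); // two of the three lines share a template
```

This only catches repetition at the scale of a single line; copied blocks
need something that understands structure rather than raw text.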

The result is that the code base, considered as a whole, has some degree of
fractal self-similarity. If we take lines of code as our metric, we're going
to have some kind of fractal dimensionality in the result, something greater
than 1 but less than 2, just like the coastline of Britain. (Again, see
http://en.wikipedia.org/wiki/Hausdorff_dimension)

In reality, however, lines of code is a ridiculous metric, because it's
meaningless. If somebody offers you a consulting gig and tells you they have
a thousand lines of code, is that highly repetitive code, or entirely unique
code? You could technically be talking about the same line repeated a
thousand times. (In fact I worked at a company in Santa Fe which was just
about that dumb, and I know if Carl is still on here, he knows who I'm
talking about.) In order to do any kind of useful analysis on code, you need
to turn the code into a parse tree, and analyze the tree. For instance, my
automated refactoring code is still in its infancy, but can detect simple
kinds of repetition and similarity, and does so by converting code into
parse trees, and then comparing the trees. It's downright tautological to
say that if you're looking at trees which contain subtrees, and those
subtrees are equal or similar to other subtrees (both subtrees within the
same tree and subtrees found in other trees), and further that the same
patterns of repetition shaped both the macro and micro scales of the tree,
you are talking about fractals.
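
To make the subtree comparison concrete, here is a minimal sketch of finding
subtrees that occur more than once. It is not the refactoring system described
above. The tree shape ([nodeType, ...children], with strings as leaves) and
the function names are hypothetical, and it only finds exactly equal subtrees,
not merely similar ones.

```javascript
// Count every subtree by its serialized form.
// Hypothetical tree shape: [nodeType, ...children]; leaves are plain strings.
function countSubtrees(tree, counts = new Map()) {
  if (!Array.isArray(tree)) return counts; // bare leaves are not counted
  const key = JSON.stringify(tree);
  counts.set(key, (counts.get(key) || 0) + 1);
  for (const child of tree.slice(1)) countSubtrees(child, counts);
  return counts;
}

// Subtrees that appear at least twice, with their occurrence counts.
function repeatedSubtrees(tree) {
  return [...countSubtrees(tree)]
    .filter(([, n]) => n > 1)
    .map(([key, n]) => ({ subtree: JSON.parse(key), count: n }));
}

// Two render calls that differ only in the string literal.
const call = (literal) => [
  "call",
  ["member", "details", "render"],
  ["args", ["ident", "display"], ["str", literal], ["ident", "this"]],
];
const program = ["program", call("hello world"), call("howdy world")];

console.log(repeatedSubtrees(program));
```

On this input the member expression and the two identifiers are reported as
repeated, while the call nodes are not, because their literals differ.
Catching those would require comparing trees with the literals abstracted
away.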

But I can't find anything on how to calculate the fractal dimension of a
tree! The best thing I've found is something on how to calculate the fractal
dimension of a network, which I'm guessing could be clumsily coerced into a
method for trees. There's plenty out there about how to use fractals to
**generate** an irregular tree, so I may be able to use something like that
and go backwards.
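
One of the network methods that carries over to trees without much coercion
is the "cluster growing" estimate: pick a seed node, count how many nodes lie
within r steps of it, and fit the slope of log(count) against log(r). The
sketch below is a guess at how that could look, not a vetted implementation.
The adjacency-list format and all function names are assumptions, and a
serious estimate would average over many seed nodes.

```javascript
// adjacency: Map from node id to an array of neighbouring node ids
// (undirected). Returns sizes where sizes[r - 1] is the number of nodes
// within r steps of the seed, the seed itself included.
function ballSizes(adjacency, seed, maxRadius) {
  const distance = new Map([[seed, 0]]);
  const queue = [seed];
  for (let i = 0; i < queue.length; i++) { // breadth-first search
    const node = queue[i];
    if (distance.get(node) === maxRadius) continue;
    for (const next of adjacency.get(node) || []) {
      if (!distance.has(next)) {
        distance.set(next, distance.get(node) + 1);
        queue.push(next);
      }
    }
  }
  const sizes = new Array(maxRadius).fill(0);
  for (const d of distance.values()) {
    for (let r = Math.max(d, 1); r <= maxRadius; r++) sizes[r - 1]++;
  }
  return sizes;
}

// Least-squares slope of log(y) against log(x); needs at least two points.
function logLogSlope(points) {
  const xs = points.map(([x]) => Math.log(x));
  const ys = points.map(([, y]) => Math.log(y));
  const mean = (v) => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0;
  let den = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) ** 2;
  }
  return num / den;
}

// Dimension estimate from a single seed; maxRadius must be at least 2.
function clusterGrowingDimension(adjacency, seed, maxRadius) {
  const sizes = ballSizes(adjacency, seed, maxRadius);
  return logLogSlope(sizes.map((m, i) => [i + 1, m]));
}

// A five-node chain 0-1-2-3-4, the simplest possible "tree".
const chain = new Map([[0, [1]], [1, [0, 2]], [2, [1, 3]], [3, [2, 4]], [4, [3]]]);
console.log(ballSizes(chain, 0, 3));
console.log(clusterGrowingDimension(chain, 0, 3));
```

One caveat: in a regularly branching tree, such as a complete binary tree,
the node count grows exponentially with r rather than as a power of r, so the
log-log fit has no stable slope. Whether real parse trees behave more like
that or more like a power law is exactly the open question.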

OK, and I found this:

http://www.trusoft-international.com/

But it costs about $250, and I want something Unix-y or RESTful so I can use
it programmatically.

-- 
Giles Bowkett
http://gilesbowkett.com

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org
