Re: [FRIAM] hi friam - how do I calculate the fractal dimension of repetitive text?

ERIC P. CHARLES Wed, 15 Dec 2010 18:06:46 -0800

Hey all,
I'm definitely just a lurker relative to this thread, but could you answer a
few quick questions so I can lurk more effectively?


The goal is to find the fractal dimension of a given bunch of computer code
(perhaps within a part of a program, perhaps within a database of several
thousand programs). 

At this point it is clear to me (and it wasn't for some odd reason before this
last set of examples) that you could be interested in the fractal dimension for
at least two distinct reasons: 1) you are interested in something about
computer code itself, 2) you are interested in something about what a given
person or group of people do when coding. 

If you were interested in "the code itself", then you might, for example, get a
sample of code widely agreed to be very pretty, and a sample widely agreed to
be very ugly, and compare fractal dimensionality. 

If you were interested in "what coders do", then you might, for example, watch
someone writing a program that had to perform many tasks, and see how the
fractal dimensionality changes as they tweak the code towards a final project.

Are these the types of comparisons you have in mind? Is one type of question
more interesting? Is there some third type of question motivating the project?
 
Thanks,
Eric

P.S. A third reason to be interested that just occurred to me would be
comparison across different languages. Surely some languages encourage more
fractal code than others... right?



On Wed, Dec 15, 2010 08:11 PM, Giles Bowkett <gil...@gmail.com> wrote:
>>Hey Russell,
>>
>
>You might find the algorithm as described is computationally intensive>
>
>>Quite an understatement!
>
>
>
>>so you might want to exploit
>
>some structural properties of the C++ code (eg using method boundaries
>and line boundaries to help in framing at the appropriate scales).
>>
>
>>I think that's the way to go, although I'll probably look at other languages
first. For example, is there a power law relationship between the similarity of
functions within a file and the similarity of files? My refactoring code has
some stuff to measure repetition and similarity between functions, and will
probably have more soon (very probably if you measure "soon" in years).
Basically I'll probably look at code as trees of tokens rather than as literal
strings, because the computation would otherwise just be massive.
>>
>
>>One possible hypothesis to test is that evolved software will exhibit
>a power law scaling, whereas engineered (and finely refactored) code
>does not. 
>>
>
>>Yeah, this is really my base assumption. I have this weird zoo of "evolved"
code, kind of an embarrassment of riches in the sample data department. I've
been collecting samples for a couple years now. I'm almost certain the power
law scaling is there, it's really almost just a question of how you define the
scales. measuring repetition from line to line is very easy to do, but code
also exhibits patterns which I'm not sure how I'd quantify. Example:
>>
>
>>File A:
>>
>
>>name = <http://info.name>;
>>address = info.address;
>>console.log(name);
>>console.log(address);
>>
>
>>
>>File B:
>>
>
>>zip = info.zip;
>>phone = info.phone;
>>console.log(zip);
>>console.log(phone);
>>
>
>
>>Obviously there's a lot of similarity here, both between individual lines and
between files. it's pretty easy to think of ways to capture these similarities.
>>
>
>>
>>File C:
>>
>
>>penguin = info.penguin;
>>console.log(penguin);
>>walrus = info.walrus;
>>console.log(walrus);
>>
>
>>This represents the equivalent of a mild topological transformation (if I'm
using my terms right), plus there's also this metalinguistic thing where the
theme of Files A and B are similar to each other, but different from File C,
and also that the grouping of terms by theme is a kind of similarity if we
introduce a File D which doesn't observe it:
>>
>
>>>
>>File D:
>>
>
>>orca = info.orca;
>>console.log(orca);
>>apartment = info.apartment;
>>console.log(apartment);
>
>>
>
>>You could actually capture that using the Python Natural Language ToolKit,
which has some access to semantic clustering of some kind. And you can obtain
it through latent semantic analysis, although that's nontrivial to say the
least.
>>
>
>>Anyway, bit of a tangent there.
>>
>
>>On Mon, Dec 13, 2010 at 3:12 PM, Russell Standish <<#>> wrote:
>I had an idea how you might do this.
>
>
>You can start by counting the number of repeated characters in your
>
>program. Then count all pairs of characters that
>
>appear more than once. Then all quadruplets, octuplets and so on. This
>
>will give you some scale dependent distribution of code similarity. If
>
>this turns out to exhibit a power law, you can get the Hausdorf
>
>dimension from that.
>
>
>A variant you could try is considering some threshold value (eg 10%),
>
>and consider a repeat to be any string that is within that threshold
>
>(x string length) as measured by the Hamming distance.
>
>
>You might find the algorithm as described is computationally intensive
>
>(it looks to be O(n^2) at first glance), so you might want to exploit
>
>some structural properties of the C++ code (eg using method boundaries
>
>and line boundaries to help in framing at the appropriate scales).
>
>
>The above technique could also be applied to genomic data, and also
>
>the genomes of digitial organisms from such systems as Tierra or Avida.
>
>
>One possible hypothesis to test is that evolved software will exhibit
>
>a power law scaling, whereas engineered (and finely refactored) code
>
>does not. This would be in line with Stephan Halloy's observations of
>
>scaling laws in natural systems - see
><http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html>
>
>
>.
>
>
>Cheers
>>
>>
>>
>
>
>On Fri, Dec 10, 2010 at 06:27:13PM -0800, Giles Bowkett wrote:
>
>> Howdy all - it's been a long time, but I was just planning a return to NM to
>
>> see family for the holiday and reading Mandelbrot's book on finance, and the
>
>> pairing reminded me of FRIAM.
>
>>
>
>> I have something I'm puzzling with, which you guys might find interesting. I
>
>> want to calculate the fractal dimension of repetitive text, specifically
>
>> code.
>
>>
>
>> I'm building a system to do automated refactoring. It's essentially a
>
>> compiler. I see this as a business project, but the fractal dimensionality,
>
>> I don't see any actual business in that. It's just a mild personal
>
>> obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a
>
>> teenager and it changed my life.
>
>>
>
>> So anyway, consider a code base at a tech company. Typically, these code
>
>> bases are measured in lines of code. It's a one-dimensional measurement, so
>
>> in a sense, a "line" whose length is determined by the number of lines of
>
>> code. However, lines of code exhibit self-similarity, and the
>
>> self-similarity increases with the clumsiness of the syntax of the
>
>> programming language, and with the incompetence or indifference of the
>
>> programmers who wrote it. This comes from tons of personal experience, and
>
>> from the fact that the less competent a programmer is, or the less they care
>
>> about the company, or the more confusing a language's syntax is, the more
>
>> likely a programmer is to just copy and paste some code, instead of actually
>
>> figuring out what it does.
>
>>
>
>> I think the "line" formed by measuring the number of these lines of code
>
>> would be more strictly be considered a curve, as in the Koch curve, which is
>
>> a highly self-similar curve with a fractal or Hausdorff dimension greater
>
>> than one; its fractal dimension comes from its self-similar filagrees.
>
>> Fractals 101.
>
>>
>
>> <http://en.wikipedia.org/wiki/Koch_snowflake>
>
>> <http://en.wikipedia.org/wiki/Hausdorff_dimension>
>
>>
>
>> You create the Koch curve by taking a line, inserting a deviation, and then
>
>> dividing the result into fourths, and repeating the process on each fourth.
>
>> You end up with this thing that looks like a snowflake or a fuzzy Star of
>
>> David.
>
>>
>
>> Many programmers write their code by copying a line and inserting some minor
>
>> variation.
>
>>
>
>> E.g., say I have this:
>
>>
>
>> this.moduleOverlay.data.info.specifics.details.render(this.display, "hello
>
>> world", this);
>
>>
>
>> That's pretty much a real-world example. Say I need the application this
>
>> comes from to display "howdy world" a little later on. Even though I think
>
>> it's kind of a horrible thing to do, for expediency's sake, I might copy
>
>> this code, paste it, and just modify the literal:
>
>>
>
>> this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy
>
>> world", this);
>
>>
>
>> Now imagine this happening very frequently in a company with a very high
>
>> rate of copy-paste. The company creates its entire code base by copying
>
>> lines and inserting deviations. You very soon have a very self-similar
>
>> corpus of text. The recursivity in the Koch curve also has an analog,
>
>> because programmers will not just copy-paste line by line, but also
>
>> copy-paste giant blocks of code, instead of refactoring to objects. So
>
>> although the copying is not a recursive process in most cases, you do have
>
>> repetition at multiple scales.
>
>>
>
>> The result is that the code base, considered as a whole, has some degree of
>
>> fractal self-similarity. If we take lines of code as our metric, we're going
>
>> to have some kind of fractal dimensionality in the result, something greater
>
>> than 1 but less than 2, just like the coastline of Britain. (again,
>
>> <http://en.wikipedia.org/wiki/Hausdorff_dimension>)
>
>>
>
>> In reality, however, lines of code is a ridiculous metric, because it's
>
>> meaningless. If somebody offers you a consulting gig and tells you they have
>
>> a thousand lines of code, is that highly repetitive code, or entirely unique
>
>> code? You could technically be talking about the same line repeated a
>
>> thousand times. (In fact I worked at a company in Santa Fe which was just
>
>> about that dumb, and I know if Carl is still on here, he knows who I'm
>
>> talking about.) In order to do any kind of useful analysis on code, you need
>
>> to turn the code into a parse tree, and analyze the tree. For instance, my
>
>> automated refactoring code is still in its infancy, but can detect simple
>
>> kinds of repetition and similarity, and does so by converting code into
>
>> parse trees, and then comparing the trees. It's downright tautological to
>
>> say that if you're looking at trees which contain subtrees, and those
>
>> subtrees are equal or similar to other subtrees (both subtrees within the
>
>> same tree and subtrees found in other trees), and further that the same
>
>> patterns of repetition shaped both the macro and micro scales of the tree,
>
>> you are talking about fractals.
>
>>
>
>> But I can't find anything on how to calculate the fractal dimension of a
>
>> tree! The best thing I've found is something on how to calculate the fractal
>
>> dimension of a network, which I'm guessing could be clumsily coerced into a
>
>> method for trees. There's plenty out there about how to use fractals to
>
>> **generate** an irregular tree, so I may be able to use something like that
>
>> and go backwards.
>
>>
>
>> OK, and I found this:
>
>>
>
>> <http://www.trusoft-international.com/>
>
>>
>
>> But it costs about $250, and I want something Unix-y or RESTful so I can use
>
>> it programmatically.
>
>>
>
>> --
>
>> Giles Bowkett
>
>> <http://gilesbowkett.com>
>
>
>
>
>>> ============================================================
>
>> FRIAM Applied Complexity Group listserv
>
>> Meets Fridays 9a-11:30 at cafe at St. John's College
>
>> lectures, archives, unsubscribe, maps at <http://www.friam.org>
>
>
>
>
>--
>
>
>----------------------------------------------------------------------------
>
>Prof Russell Standish                  Phone 0425 253119 (mobile)
>
>Mathematics
>
>UNSW SYDNEY 2052                         <#>
>
>Australia                                <http://www.hpcoders.com.au>
>
>----------------------------------------------------------------------------
>>
>>
>>
>
>
>============================================================
>
>FRIAM Applied Complexity Group listserv
>
>Meets Fridays 9a-11:30 at cafe at St. John's College
>
>lectures, archives, unsubscribe, maps at <http://www.friam.org>
>
>
>
>
>
>
>
>
>-- 
>Giles Bowkett
><http://gilesbowkett.com>
>
>
============================================================
>FRIAM Applied Complexity Group listserv
>Meets Fridays 9a-11:30 at cafe at St. John's College
>lectures, archives, unsubscribe, maps at http://www.friam.org
>

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Re: [FRIAM] hi friam - how do I calculate the fractal dimension of repetitive text?

Reply via email to