Derek M Jones <[EMAIL PROTECTED]> wrote:
        Figure 2 is evidence that there is a strong correlation between
        developer performance and certain kinds of operator occurrences
        in source code and no correlation for other kinds of measurements
        of operators occurrences.
        
Er, no.

Figure 2 (left), with r = 0.64, is a WEAK correlation.
Remember, to grasp the importance of a correlation, you have to square it.
R-squared for (% correct answers) against (% occurrence in operator pairs
in specified sample of C code) is just 41% (0.64 squared is about 0.41),
so a straight line accounts for well under half of the variance.

However, that's because the linear relationship measured by Pearson
correlation does *not* match what's actually happening in the data of
figure 2, which is a quadratic relationship with a maximum at about 10%.
And that says "up to a certain point, the commoner the operator the
better the performance, beyond that point, the commoner the operator
the WORSE the performance."
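
To illustrate why Pearson's r understates a relationship of that shape, here
is a minimal sketch in Python.  The data are SYNTHETIC, not taken from the
paper: `xs` stands in for "% occurrence" and `ys` is an exact inverted
parabola peaking at x = 10.  The names `pearson`, `xs`, `ys`, and `PEAK` are
my own, chosen only for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (an assumption for illustration, NOT the paper's data):
# performance rises up to x = PEAK and falls beyond it.
PEAK = 10
xs = list(range(0, 16))
ys = [100 - (x - PEAK) ** 2 for x in xs]

# Straight-line view: correlate y with x directly.
r_linear = pearson(xs, ys)

# Quadratic view: correlate y with squared distance from the peak.
r_peak = pearson([(x - PEAK) ** 2 for x in xs], ys)

print(f"linear:    r = {r_linear:.2f}, r^2 = {r_linear ** 2:.2f}")
print(f"quadratic: r = {r_peak:.2f}, r^2 = {r_peak ** 2:.2f}")
```

Here y is completely determined by x, yet the straight-line r is well short
of 1, while correlating against squared distance from the peak recovers the
full relationship.  The sketch assumes the location of the peak is known; with
real data one would fit the quadratic instead.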

Figure 2 (right) shows, well, basically all Figure 2 (right) really
shows is that ==, !=, +, and - are common and all the other operators
are rare.  But you can get a nice quadratic relationship out of that too.

But there's an easily overlooked point here:
    the horizontal axes of these figures are based on operator occurrences
    in a certain body of C code.  We have NO evidence that any of the
    subjects have been exposed to ANY of that body of code, and the paper
    gives NO evidence that that body of code is similar to the code that
    the subjects (or other C developers in general) have been exposed to.

Let's give an example of that.  I just counted the operators in a program
I happen to use a lot.  The program has 250 000 lines in .c files; I did
not count .h files.  Of those 250 000 lines, 110 000 lines are SLOC; they
contain a total of 133 000 operators (I counted assignment operators and
also counted ?: as an operator).  37 000 were the simple assignment
operator.  Counting only the operators listed in the paper's table 2,
the program has 91 000 operators, about 10% of the code volume used in
the paper.
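
For anyone who wants to repeat this kind of count on their own code, a crude
sketch in Python follows.  It is NOT the tool used for the numbers above; the
names `count_operators`, `OPERATORS`, and `NOISE_RE` are hypothetical.  It
strips comments, string and character literals, and preprocessor lines, then
matches operator spellings longest first.  It does not parse C, so it cannot
tell unary from binary uses of `-`, `*`, or `&`, and it does not handle
preprocessor line continuations.

```python
import re
from collections import Counter

# Longest spellings first, so "&&" is not read as two "&" tokens.
OPERATORS = [
    "<<=", ">>=",
    "<<", ">>", "<=", ">=", "==", "!=", "&&", "||", "->", "++", "--",
    "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=",
    "+", "-", "*", "/", "%", "&", "|", "^", "<", ">", "=", "!", "~", "?",
]
OP_RE = re.compile("|".join(re.escape(op) for op in OPERATORS))

# Text that must not contribute operators.
NOISE_RE = re.compile(r"""
    /\*.*?\*/            # block comment
  | //[^\n]*             # line comment
  | "(?:\\.|[^"\\])*"    # string literal
  | '(?:\\.|[^'\\])*'    # character literal
""", re.DOTALL | re.VERBOSE)

PREPROCESSOR_RE = re.compile(r"^[ \t]*#.*$", re.MULTILINE)

def count_operators(source):
    """Return a Counter of operator spellings found in C source text."""
    text = NOISE_RE.sub(" ", source)
    text = PREPROCESSOR_RE.sub("", text)
    return Counter(OP_RE.findall(text))

sample = '''
#include <stdio.h>
int f(int *p, int n) {
    /* a - b in a comment is not counted */
    if (n > 0 && *p == 1 || n - 1 > 2)
        return n & 3;
    return -1;
}
'''

counts = count_operators(sample)
for op, n in counts.most_common():
    print(f"{op:3} {n}")
```

Note that the `*` in the declaration `int *p` and the unary `-` in `-1` are
both counted, which is exactly the kind of ambiguity that can distort a
table of operator frequencies.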

In table 2,
    + was twice as common as -
    * was twice as common as /
    & was commoner than      *
    && was commoner than     ||
    == was four times as common as >

In the program I checked,
    - was four times as common as +
    * was ten times as common as  /
    * was four times as common as &
    || was a tiny bit commoner than &&
    > was three times as common as ==

The only conclusion I want to draw from this is that the horizontal
axes in Jones's Figure 2 CANNOT be taken as surrogates for the amount
of exposure any developer has had to C operators; the relative frequencies
in different programs can be drastically different.  That is, the
horizontal axes are about one body of code, while the vertical axes are
about developers exposed to a *different* body of code, quite likely
with different characteristics.

(Looking at table 2 one becomes a little suspicious:  high frequency
operators usually occur often in pairs.  But not '&'.  Were some of
the occurrences of '&' really occurrences of the unary address-of operator?)
 
----------------------------------------------------------------------
PPIG Discuss List (discuss@ppig.org)
Discuss admin: http://limitlessmail.net/mailman/listinfo/discuss
Announce admin: http://limitlessmail.net/mailman/listinfo/announce
PPIG Discuss archive: http://www.mail-archive.com/discuss%40ppig.org/
