Re: findin sloc changes between two tags

Paul Sander Tue, 19 Feb 2008 10:56:05 -0800


On Feb 19, 2008, at 12:35 AM, yeti wrote:

On Feb 19, 1:06 pm, Paul Sander <[EMAIL PROTECTED]> wrote:

On Feb 18, 2008, at 8:40 PM, yeti wrote:

On Feb 19, 4:38 am, Paul Sander <[EMAIL PROTECTED]> wrote:
For this particular metric, I usually run the two versions through a
beautifier with standard settings, then diff the output of that.

On Feb 18, 2008, at 10:17 AM, Rick Genter wrote:

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
On Behalf Of Ted Stern

But that regexp handles only C++ comments.  I don't know of a way to
recognize /* ... [text containing newlines] ... */.  Possibly another
diff utility has that option (xxdiff, tkdiff?).

You could write an awk or perl script to filter the multiline comments
out, save the output to a file, then diff those files. I, however,
consider comments to be as important as (or even more important than)
non-comments in source code, and don't understand the use case.

Hi guys,

Thanks for all those answers. I thought, however, that this would be a
fairly common problem and that there might be a standard solution.
Keeping your suggestions in mind, I did

cvs diff -wlcbBC20 -r rev1 -r rev2 my_file.c \
  | perl -0777 -pe 's{/\*.*?\*/}{}gs' \
  | diffstat >> FileToHoldInfo.txt

The idea is to get enough context lines, then eliminate the comments
from the diff output, and finally use diffstat to gather stats. Do you
think this is the correct way?


I think that this method will work only if the comments are
completely enclosed within the context displayed by the diff
program.  It will fail (i.e., produce incorrect output), for example,
if a short sentence is added to the end of a 50-line comment.  Or to
the beginning of one.  Or to the middle of a 100-line comment.  It
also fails if someone arbitrarily inserts or removes newlines in the
code itself.
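The failure mode is easy to reproduce.  The sketch below is only an
illustration in Python, not part of the original pipeline: it applies
the same pattern as the Perl one-liner to a whole file and to a
made-up stretch of diff context that begins in the middle of a comment.
The sample strings are hypothetical.

```python
import re

# Same pattern as the Perl one-liner: non-greedy match across newlines.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_comments(text):
    """Remove every complete /* ... */ span found in text."""
    return COMMENT.sub("", text)


# Whole file: the comment is complete, so it is removed.
full_file = "/* line one\n   line two\n   line three */\nint x;\n"

# Hypothetical diff context that starts inside the comment:
# the opening "/*" lies outside the displayed lines.
diff_context = "   line two\n   line three */\nint x;\n"

stripped_file = strip_comments(full_file)
stripped_context = strip_comments(diff_context)

print(repr(stripped_file))
print(repr(stripped_context))
```

In the second case the pattern never sees an opening "/*", so the
comment text passes through untouched and would still be counted by
diffstat.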

This is where beautifiers such as the "indent" program come in.  A
beautifier normalizes the format of the source code based on the syntax
of the programming language and policies specified on its command line.
It leaves comments in place, so additional filtering (like your Perl
one-liner above) might be necessary.

After the two versions have been reduced to standard formats, you can
apply the diff program with minimal arguments.  Its output can be
used to count the number of lines inserted, deleted, and changed.
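The counting step can be sketched as follows.  This is a minimal
Python illustration that assumes both versions have already been run
through a beautifier and had their comments filtered; it uses Python's
difflib rather than the diff program, and the function name
count_changes and the rule for pairing lines inside a replaced block
are assumptions made for the example.

```python
import difflib


def count_changes(old_text, new_text):
    """Count inserted, deleted and changed lines between two texts."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    counts = {"inserted": 0, "deleted": 0, "changed": 0}
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            counts["inserted"] += j2 - j1
        elif tag == "delete":
            counts["deleted"] += i2 - i1
        elif tag == "replace":
            # Assumption: pair old and new lines one-to-one as "changed";
            # any surplus on either side counts as inserted or deleted.
            paired = min(i2 - i1, j2 - j1)
            counts["changed"] += paired
            counts["inserted"] += (j2 - j1) - paired
            counts["deleted"] += (i2 - i1) - paired
    return counts


# Hypothetical, already-normalized versions of a file.
old_version = "int a;\nint b;\nint c;\n"
new_version = "int a;\nlong b;\nint c;\nint d;\n"

result = count_changes(old_version, new_version)
print(result)
```

How a replaced block is split between "changed" and "inserted" is a
policy decision, so whichever rule is chosen should be applied
consistently across all the files being measured.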


Yes, you are right. I'm assuming that most comments would be at most
20 lines long, though one can just as well use -C50 to make it work for
50-line comments, and so on. The regexp can be modified to remove blank
lines. But now I have detected another problem :-(

Your algorithm also won't handle cases where users arbitrarily
reformat the code. In C, for example, the following styles are common:

1a. Insert newlines between terms in complex boolean expressions.

1b. Make expressions as wide as possible and insert newlines only to
avoid wraparound issues within terms.

2a. Surround all curly braces with newlines so that they always
appear alone on lines of code.

2b. Place open curly braces at the ends of lines, and combine open
and close braces with "else" keywords on a single line.

Beautifiers cut through this cosmetic stuff, immunizing the metrics
from arbitrary reformats. On the other hand, they don't handle
certain cases where users insert or remove optional artifacts, like
inserting braces where they are allowed but not required.
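Both halves of that point can be illustrated with a very crude
stand-in for a beautifier.  The Python sketch below is not "indent"
and makes no attempt to handle strings, comments, or preprocessor
lines; it simply splits the source into word and punctuation tokens so
that line breaks and spacing stop mattering.  The function name
tokens and the sample snippets are assumptions made for the example.

```python
import re


def tokens(source):
    """Split C-like source into word and single-punctuation tokens.

    Whitespace and newlines are discarded, so layout has no effect.
    """
    return re.findall(r"\w+|[^\w\s]", source)


# Style 2a: every brace alone on its own line.
style_2a = "if (x)\n{\n    y = 1;\n}\nelse\n{\n    y = 2;\n}\n"

# Style 2b: braces share lines with the keywords.
style_2b = "if (x) {\n    y = 1;\n} else {\n    y = 2;\n}\n"

# Optional braces: same behavior, but the token stream differs.
without_braces = "if (x)\n    y = 1;\n"
with_braces = "if (x) {\n    y = 1;\n}\n"

same_after_normalizing = tokens(style_2a) == tokens(style_2b)
braces_still_differ = tokens(without_braces) != tokens(with_braces)

print(same_after_normalizing, braces_still_differ)
```

The two brace styles compare equal once layout is ignored, while the
optional braces still show up as a difference, which is the case a
formatter alone does not cover.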

If I check out two different versions of the file and apply unix diff
over them the results are very different from those obtained using cvs
diff on two revisions. cvs diff is showing 256 modifications (!) in
the code when there are no modifications at all. There are about 700
additions (+) but cvs diff is showing only 424 (+). I think cvs diff
is confusing some additions with modifications. However unix diff on
files gives correct results.
I wonder why cvs diff is showing incorrect results. Is this a known
problem? If so, are there any workarounds for it?

There are a lot of differencing algorithms out there. Some of them
minimize the number of edits between versions, others minimize the
size of the edits. Additionally, CVS has access to the individual
deltas between versions, and it may be combining them in ways you
don't expect (rather than constructing the two selected versions and
running diff on them).
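The output format can also shift lines between the "added" and
"modified" columns, independently of the algorithm.  The cvs diff
command quoted earlier includes -c (context format), which marks lines
in a replaced block with "!", whereas other formats report the same
block as deletions plus additions.  Whether this accounts for the
numbers reported above is only a guess; the Python sketch below merely
shows the effect using difflib on made-up input, and the helper name
tally is an assumption for the example.

```python
import difflib

# Hypothetical old and new versions of a file, as lists of lines.
old = ["int a;", "int b;", "int c;"]
new = ["int a;", "long b;", "long b2;", "int c;"]


def tally(diff_lines):
    """Count '+', '-' and '!' markers, skipping file and hunk headers."""
    counts = {"+": 0, "-": 0, "!": 0}
    for line in diff_lines:
        if line.startswith(("***", "---", "+++", "@@")):
            continue
        if line[:1] in counts:
            counts[line[:1]] += 1
    return counts


unified_counts = tally(difflib.unified_diff(old, new, lineterm=""))
context_counts = tally(difflib.context_diff(old, new, lineterm=""))

print("unified:", unified_counts)
print("context:", context_counts)
```

The same edit is reported as one deletion and two additions in one
format and as three modified lines in the other, so totals taken from
different formats are not directly comparable.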
