I'm interested in the comments on CPD/Simian performance, since I was responsible for the current algorithm in CPD. On my dodgy old 833MHz laptop, CPD does the entire JDK 1.4 in about 1 minute 30 seconds, all but 18 seconds of which is spent reading the files (OK, reading the files and building ANTLR parse trees): it is I/O bound. I can eat into those 18 seconds, but the rest is out of my hands[1].

For example, here is what happens when you just time the line and character counts, in native code:
brian [EMAIL PROTECTED] /c/j2sdk1.4.0/src
$ time find . -name '*.java' | xargs wc -l | grep xyzzy [2]


real    1m19.289s
user    0m1.670s
sys     0m3.791s

That's 79 seconds just to *count* the chars, and it hardly dented my CPU (about 5.5 seconds of user and sys time combined). A CPU twice as fast won't make much difference to this.
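
The same wall-clock versus CPU comparison can be made from inside a program. What follows is only a minimal sketch in Python, not anything taken from CPD; the function name `time_file_scan` and its parameters are made up for illustration. It reads every matching file under a directory, counts newlines and bytes, and reports elapsed time next to CPU time.

```python
import os
import time


def time_file_scan(root, suffix=".java"):
    """Read every matching file under root; return counts plus wall and CPU seconds."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # user + sys CPU time of this process
    lines = 0
    size = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            with open(os.path.join(dirpath, name), "rb") as handle:
                data = handle.read()
            lines += data.count(b"\n")  # same notion of a line as wc -l
            size += len(data)           # bytes, as wc -c would report
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return {"lines": lines, "bytes": size, "wall": wall, "cpu": cpu}
```

If `wall` comes out many times larger than `cpu`, the run spent most of its time waiting on the disk rather than computing. A second run over the same tree may look very different if the operating system has cached the files.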

Have you compared the current CPD with Simian? I did my rewrite when I saw the times quoted in the O'Reilly article, which is, I guess, when most people tried it. To quote Tom Copeland: "A job that took the old CPD 2 hours now takes 8 seconds. It's pretty sweet." (He's referring to the time *after* the files are read.)

I am aware of a faster algorithm that can be used incrementally, but I haven't implemented it yet.
<http://sourceforge.net/forum/forum.php?thread_id=885972&forum_id=188192>


-Baz

[1] CPD in its current form would be faster with suffix tree sorts instead of quicksorts, and if punctuation ( { ; } ) wasn't counted as a place where matching could begin. It would also be faster if I didn't use a parse tree, but just heuristics for ignoring comments and whitespace (I originally wrote my CPD that way, in Perl). However, none of these optimizations are considered worthwhile because the whole thing is I/O bound.
[2] grep xyzzy in this case is just stopping wc -l outputting to stdout, so that you're not seeing slowdown due to flushes on that stream.
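
To make footnote [1] concrete, here is a rough sketch in Python of the general idea: strip comments and whitespace with heuristics instead of a parser, refuse to start a match on a punctuation-only line, and sort the start positions so that repeated runs end up next to each other. This is not CPD's code. All names (`normalize`, `find_duplicates`, `min_lines`) are invented for the example, and it uses an ordinary comparison sort, not a suffix tree.

```python
import re

# Heuristics only: these patterns do not understand string literals,
# so a "//" or "/*" inside a string is wrongly treated as a comment.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")
PUNCTUATION_ONLY = re.compile(r"^[{};]+$")


def normalize(source):
    """Drop comments and all whitespace, returning the non-empty lines."""
    source = BLOCK_COMMENT.sub("", source)
    source = LINE_COMMENT.sub("", source)
    stripped = ("".join(line.split()) for line in source.splitlines())
    return [line for line in stripped if line]


def find_duplicates(lines, min_lines=3):
    """Return (first_start, second_start, length) for repeated runs of lines."""
    # Punctuation-only lines are not allowed to begin a match.
    starts = [i for i, line in enumerate(lines)
              if not PUNCTUATION_ONLY.match(line)]
    # Sort positions by the text that follows them, so that positions
    # followed by identical runs become neighbours in the sorted order.
    starts.sort(key=lambda i: lines[i:])
    matches = []
    for a, b in zip(starts, starts[1:]):
        length = 0
        while (a + length < len(lines) and b + length < len(lines)
               and lines[a + length] == lines[b + length]):
            length += 1
        if length >= min_lines:
            matches.append((min(a, b), max(a, b), length))
    return matches
```

Called as `find_duplicates(normalize(text))`, this returns positions in the normalized line list, not original line numbers. It compares only neighbouring positions in the sorted order and does not merge or de-overlap the runs it reports, so treat it as an illustration of the sorting idea and nothing more.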


Aslak Hellesøy wrote:

It seems Maven's JIRA still doesn't notify this list, so here I go...

Simian (http://www.redhillconsulting.com.au/products/simian/index.html)
is a great little tool that detects duplicate source code. Very much
like PMD's CPD (http://pmd.sourceforge.net/cpd.html), but a _lot_ faster.

I have written a Maven report plugin for Simian that I'd like to contribute to Maven. Have a look at some sample reports:

http://www.picocontainer.org/simian-report.html
http://www.nanocontainer.org/simian-report.html

It would be really nice to have the Simian Report included in the standard reports, as it reveals refactoring candidates and is something
every sound project should have!

It's all in JIRA: http://jira.codehaus.org/secure/ViewIssue.jspa?key=MAVEN-516


Cheers,
Aslak


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]








