For example, here is what happens when you just time the line and character counts, in native code:
brian [EMAIL PROTECTED] /c/j2sdk1.4.0/src
$ time find . -name '*.java' | xargs wc -l | grep xyzzy [2]
real    1m19.289s
user    0m1.670s
sys     0m3.791s
That's 79 seconds of wall-clock time just to *count* the chars, against about 5.5 seconds of user plus sys time, so it hardly dented my CPU. A CPU twice as fast won't make much difference to this.
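To make the arithmetic behind that explicit, here is a small Python sketch that converts the three figures from the `time` output above into seconds and works out what share of the wall-clock time the CPU was actually busy. The helper name `to_seconds` is made up for illustration; only the three timings come from the run above.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a bash `time` stamp such as '1m19.289s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# The three figures reported by `time` in the run above.
real = to_seconds("1m19.289s")
user = to_seconds("0m1.670s")
sys_time = to_seconds("0m3.791s")

# Share of the elapsed time spent on the CPU (user + kernel).
cpu_share = (user + sys_time) / real
print(f"CPU busy for {user + sys_time:.3f}s of {real:.3f}s ({cpu_share:.1%})")
# prints: CPU busy for 5.461s of 79.289s (6.9%)
```

Roughly 93% of the elapsed time went to something other than computing, which is consistent with the job waiting on disk.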
Have you compared the current CPD with Simian? I did my rewrite when I saw the times quoted in the O'Reilly article, which is, I guess, when most people tried it. To quote Tom Copeland: "A job that took the old CPD 2 hours now takes 8 seconds. It's pretty sweet." (He's referring to the time *after* the files are read here.)
I am aware of a faster algorithm that can be used incrementally, but I haven't implemented it yet.
<http://sourceforge.net/forum/forum.php?thread_id=885972&forum_id=188192>
-Baz
[1] CPD in its current form would be faster with suffix tree sorts instead of quicksorts, and if punctuation ( { ; } ) weren't counted as a place where matching could begin. It would also be faster if I didn't use a parse tree, but just heuristics for ignoring comments + whitespace (I originally wrote my CPD that way, in Perl). However, none of these optimizations are considered worthwhile because the whole thing is I/O bound.
[2] grep xyzzy in this case is just stopping wc -l outputting to stdout, so that you're not seeing slowdown due to flushes on that stream.
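For anyone curious what the heuristic approach mentioned in [1] might look like, here is a rough Python sketch. It is not CPD's code (CPD is Java and uses a parse tree); the function names and the naive suffix sort are illustrative assumptions only. It strips comments and whitespace with regexes, skips punctuation-only lines as starting points for a match, and finds the longest repeated run by sorting suffixes of the line list.

```python
import re

def normalize(source: str) -> list[str]:
    """Heuristically strip Java-style comments and all whitespace."""
    # Heuristic only: comment markers inside string literals will be
    # mistaken for real comments.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    lines = ("".join(line.split()) for line in source.splitlines())
    return [line for line in lines if line]

def longest_duplicate(lines: list[str]) -> list[str]:
    """Return the longest run of lines that occurs at least twice."""
    # Only lines containing a word character may start a match, so
    # punctuation-only lines such as '{' or '}' are skipped as starts.
    starts = [i for i, line in enumerate(lines) if re.search(r"\w", line)]
    # Naive suffix sort: simple but quadratic in the worst case.
    starts.sort(key=lambda i: lines[i:])
    best: list[str] = []
    for a, b in zip(starts, starts[1:]):
        n = 0
        while (a + n < len(lines) and b + n < len(lines)
               and lines[a + n] == lines[b + n]):
            n += 1
        if n > len(best):
            best = lines[a:a + n]
    return best

sample = """
int a = 1; // first
foo(a);
bar();
/* block
comment */
int b = 2;
foo(a);
bar();
"""
print(longest_duplicate(normalize(sample)))
# prints: ['foo(a);', 'bar();']
```

The sketch allows the two occurrences to overlap and compares whole lines rather than tokens, both of which a real detector would want to handle differently.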
Aslak Hellesøy wrote:
It seems Maven's JIRA still doesn't notify this list, so here I go...
Simian (http://www.redhillconsulting.com.au/products/simian/index.html) is a great little tool that detects duplicate source code. Very much like PMD's CPD (http://pmd.sourceforge.net/cpd.html), but a _lot_ faster.
I have written a Maven report plugin for Simian that I'd like to contribute to Maven. Have a look at some sample reports:
http://www.picocontainer.org/simian-report.html
http://www.nanocontainer.org/simian-report.html
It would be really nice to have the Simian Report included in the standard reports, as it reveals refactoring candidates and is something every sound project should have!
It's all in JIRA: http://jira.codehaus.org/secure/ViewIssue.jspa?key=MAVEN-516
Cheers, Aslak
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]