On Fri, Mar 13, 2009 at 5:28 AM, Paul Sanders <p.sand...@dsl.pipex.com> wrote:
> Bill said something in passing on this issue which I think is important. To
> paraphrase: If you care about performance, don't use the Cocoa RegEx stuff
> to parse large amounts of data.

I disagree :), and I have numbers to back it up. (RegexKitLite was used to do the regex processing in the examples below.)

    [subjectString componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
    [subjectString componentsSeparatedByRegex:[NSString stringWithUTF8String:"(?:\\r\n|[\n\\v\\f\\r\302\205\\p{Zl}\\p{Zp}\\t ])"]]

    componentsSeparatedByCharactersInSet: Time used: 4114134.0, per: 1.493509760680, count: 2754675
    componentsSeparatedByRegex:           Time used: 1577230.0, per: 0.572564821621, count: 2754675

In this case, regexes beat the system method by: 4114134.0 / 1577230.0 = 2.61.

---

    [subjectString componentsSeparatedByString:@"\n"]
    [subjectString componentsSeparatedByRegex:[NSString stringWithUTF8String:"(?:\\r\n|[\n\\v\\f\\r\302\205\\p{Zl}\\p{Zp}])"]]

    componentsSeparatedByString: Time used: 548741.0, per: 2.181959521253, count: 251490
    componentsSeparatedByRegex:  Time used: 320646.0, per: 1.274985088870, count: 251490

In this case, regexes beat the system method by: 548741.0 / 320646.0 = 1.71.

> I think this observation is true whether
> you use GC or not. GC just makes it worse. I'd like to see a pure-C
> benchmark of the original test, perhaps just from the command line using
> egrep. I suspect the results would be startling.

How about perl instead? (I don't think egrep is a fair test; it doesn't have to 'do anything' with the results, like create a new string from them.) This is a rough perl equivalent of my original problem:

    $text = "";
    $cnt = 0;
    while(<>) { $text .= $_; }
    for($loops = 0; $loops < 1; $loops++) {
      my @results;
      # Note: the pattern has no capture group, so $1 is undef here;
      # /(\S+)/g would capture each token.
      while($text =~ /\S+/g) { push(@results, $1); $cnt++; }
    }

    shell% time /usr/bin/perl pl_rkl.pl BIG.txt
    2.159u 0.030s 0:02.22 98.1% 0+0k 0+0io 0pf+0w
    shell% time rkl_tests
    1.874u 0.073s 0:01.97 98.4% 0+0k 0+0io 0pf+0w

Now, the perl example could be improved (notably the part that sucks in the text), but I think it's fair to say that this isn't quite the result you'd intuitively expect.

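The pure-C baseline asked about above could be sketched roughly as follows. This is only an illustration under stated assumptions: `copy_tokens` and `free_tokens` are hypothetical names (not part of Cocoa or RegexKitLite), whitespace is whatever C's `isspace` reports in the current locale (so the Unicode separators such as \p{Zl}, \p{Zp} and U+0085 in the regexes above are not handled), and each token is copied into its own heap string so that the loop has to 'do something' with its results. No timings are claimed for it.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: release everything copy_tokens() allocated. */
void free_tokens(char **tokens, size_t count)
{
    for (size_t i = 0; i < count; i++) free(tokens[i]);
    free(tokens);
}

/* Hypothetical helper: split `text` on runs of isspace() characters
 * (roughly what /\S+/g walks over) and copy every token into its own
 * heap-allocated string. On success stores the array in *out_tokens and
 * returns the token count; returns (size_t)-1 if an allocation fails. */
size_t copy_tokens(const char *text, char ***out_tokens)
{
    size_t count = 0, capacity = 16;
    char **tokens = malloc(capacity * sizeof *tokens);
    if (tokens == NULL) return (size_t)-1;

    const char *p = text;
    while (*p != '\0') {
        /* Skip the run of separators. */
        while (*p != '\0' && isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;

        /* Scan to the end of the token. */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p)) p++;
        size_t len = (size_t)(p - start);

        /* Grow the result array when it is full. */
        if (count == capacity) {
            capacity *= 2;
            char **grown = realloc(tokens, capacity * sizeof *tokens);
            if (grown == NULL) { free_tokens(tokens, count); return (size_t)-1; }
            tokens = grown;
        }

        /* Copy the token, as a string-producing API would have to. */
        char *copy = malloc(len + 1);
        if (copy == NULL) { free_tokens(tokens, count); return (size_t)-1; }
        memcpy(copy, start, len);
        copy[len] = '\0';
        tokens[count++] = copy;
    }

    *out_tokens = tokens;
    return count;
}
```

Calling `copy_tokens(buffer, &tokens)` on a NUL-terminated buffer returns the token count; wrapping that call in `clock()` from `<time.h>` would give a figure to set beside the ones above.
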
Naturally, these results aren't representative of every use case, but I think they go to show that processing strings in Cocoa with regexes can be competitive with other solutions out there.

> Having said all of which, I think the original test is not unfair and I
> agree with a lot of the points people have made in support of that view.
> It's always painful to have to step outside the Cocoa frameworks, and (off
> topic) it seems that GC can make it more so. I for one will not be using
> it.

_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)