On Fri, Mar 13, 2009 at 5:28 AM, Paul Sanders <p.sand...@dsl.pipex.com> wrote:
> Bill said something in passing on this issue which I think is important.  To
> paraphrase: If you care about performance, don't use the Cocoa RegEx stuff
> to parse large amounts of data.

I disagree :), and I have numbers to back it up:

(RegexKitLite was used to do the regex processing in the examples below)

[subjectString componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
[subjectString componentsSeparatedByRegex:[NSString stringWithUTF8String:"(?:\\r\n|[\n\\v\\f\\r\302\205\\p{Zl}\\p{Zp}\\t ])"]]

componentsSeparatedByCharactersInSet: Time used: 4114134.0, per: 1.493509760680, count: 2754675
componentsSeparatedByRegex: Time used: 1577230.0, per: 0.572564821621, count: 2754675

In this case, the regex beats the system method by a factor of 4114134.0 / 1577230.0 = 2.61 (rounded).

---

[subjectString componentsSeparatedByString:@"\n"]
[subjectString componentsSeparatedByRegex:[NSString stringWithUTF8String:"(?:\\r\n|[\n\\v\\f\\r\302\205\\p{Zl}\\p{Zp}])"]]

componentsSeparatedByString: Time used: 548741.0, per: 2.181959521253, count: 251490
componentsSeparatedByRegex: Time used: 320646.0, per: 1.274985088870, count: 251490

In this case, the regex beats the system method by a factor of 548741.0 / 320646.0 = 1.71 (rounded).
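
The "per" and speedup figures in both runs are plain divisions, and they can be rechecked with a few lines of C (which also compiles as Objective-C). This is only a sketch: the function names `per_item` and `speedup` are made up for illustration, and since the output does not state the unit of "Time used", the results are in that same unspecified unit.

```c
#include <assert.h>

/* Cost per returned component: total time divided by the number of
 * components. The unit is whatever "Time used" was reported in. */
double per_item(double time_used, long count)
{
    assert(count > 0);
    return time_used / (double)count;
}

/* How many times faster the regex method ran than the system method. */
double speedup(double system_time, double regex_time)
{
    assert(regex_time > 0.0);
    return system_time / regex_time;
}
```

With the numbers reported above, `speedup(4114134.0, 1577230.0)` comes out at roughly 2.608 and `speedup(548741.0, 320646.0)` at roughly 1.711, while `per_item(4114134.0, 2754675)` reproduces the 1.4935 "per" value of the first run.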

> I think this observation is true whether
> you use GC or not.  GC just makes it worse.  I'd like to see a pure-C
> benchmark of the original test, perhaps just from the command line using
> egrep.  I suspect the results would be startling.

How about Perl instead? I don't think egrep is a fair test, since it
doesn't have to 'do anything' with the results, such as creating a new
string from each match. This is a rough Perl equivalent of my original
problem:

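# Slurp all of the input into one string.
$text = ""; $cnt = 0;
while(<>) { $text .= $_; }
# Collect every run of non-whitespace; the capture group is what fills $1.
for($loops = 0; $loops < 1; $loops++) {
  my @results;
  while($text =~ /(\S+)/g) { push(@results, $1); $cnt++; }
}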

shell% time /usr/bin/perl pl_rkl.pl BIG.txt
2.159u 0.030s 0:02.22 98.1%     0+0k 0+0io 0pf+0w
shell% time rkl_tests
1.874u 0.073s 0:01.97 98.4%     0+0k 0+0io 0pf+0w

Now, the Perl example could be improved (notably the part that slurps
in the text), but I think it's fair to say that this isn't quite the
result you'd intuitively expect: rkl_tests finishes slightly ahead of
perl (1.874s vs. 2.159s of user time). Naturally, these results aren't
representative of every use case, but they do show that processing
strings in Cocoa with regexes can be competitive with other solutions
out there.
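
For anyone who still wants a pure-C data point, the Perl loop above can be mirrored with the POSIX <regex.h> API. What follows is a minimal, untimed sketch and not a benchmark result: `split_tokens` is a hypothetical helper name, `[^[:space:]]+` stands in for Perl's `\S+`, and each match is copied into a freshly allocated string so that the code has to 'do something' with its results.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Split `text` into runs of non-whitespace. On success, stores an array
 * of malloc'd copies in *out and returns the token count; returns -1 on
 * a compile or allocation failure. The caller frees each token and the
 * array itself. */
long split_tokens(const char *text, char ***out)
{
    regex_t re;
    regmatch_t m;
    char **tokens = NULL;
    long count = 0, cap = 0;

    assert(text != NULL && out != NULL);
    if (regcomp(&re, "[^[:space:]]+", REG_EXTENDED) != 0)
        return -1;

    /* regexec reports offsets relative to the pointer it is given, so
     * step past each match and search the remainder again. */
    while (regexec(&re, text, 1, &m, 0) == 0) {
        size_t len = (size_t)(m.rm_eo - m.rm_so);

        if (count == cap) {
            long new_cap = cap ? cap * 2 : 16;
            char **grown = realloc(tokens, (size_t)new_cap * sizeof *grown);
            if (grown == NULL)
                goto fail;
            tokens = grown;
            cap = new_cap;
        }

        char *copy = malloc(len + 1);
        if (copy == NULL)
            goto fail;
        memcpy(copy, text + m.rm_so, len);
        copy[len] = '\0';
        tokens[count++] = copy;

        text += m.rm_eo;
    }

    regfree(&re);
    *out = tokens;
    return count;

fail:
    while (count > 0)
        free(tokens[--count]);
    free(tokens);
    regfree(&re);
    return -1;
}
```

To compare it against the runs above, it would need to be fed the same BIG.txt and timed the same way; nothing here predicts how that would turn out.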

> Having said all of which, I think the original test is not unfair and I
> agree with a lot of the points people have made in support of that view.
> It's always painful to have to step outside the Cocoa frameworks, and (off
> topic) it seems that GC can make it more so.  I for one will not be using
> it.