On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
Below is an example snippet of code to test the performance of
regex matches. I need to parse a large log and extract data
from it, and I've noticed a huge increase in the loop's running
time when reading and using regex.
...
auto alert = regex(r"^Alert ([0-9]+)");
while ((line = file.readln()) !is null)
{
    auto m = match(line, alert);
    if (m)
    {
        alerts++;
    }
    counter++;
}
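For comparison, the loop above can be pulled into a self-contained helper. This is a sketch, not the poster's actual program: the `countAlerts` name is made up, and `matchFirst` is used in place of the older `match` since both return an empty capture on failure.

```d
import std.regex : matchFirst, regex;

// Count lines matching the alert pattern; mirrors the loop in the post.
// Works over any range of lines (e.g. File.byLine or a string[]).
size_t countAlerts(R)(R lines)
{
    auto alert = regex(r"^Alert ([0-9]+)");
    size_t alerts;
    foreach (line; lines)
    {
        if (!matchFirst(line, alert).empty)
            alerts++;
    }
    return alerts;
}
```

Feeding it `File("log").byLine` reproduces the original loop without the manual `readln` bookkeeping.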
...
Using the above example, I parse about 700K lines per second
(I'm reading from an SSD). If I comment out the regex match,
I read at 4.5M lines per second. Considering I need to use
about 8 regex matches and extract data, this figure drops
further, to about 100K lines per second.
Is there anything i can do to speed up regex matching in such a
scenario? Are there any tips you can share to speed things up?
Thanks.
I'm working with some string-heavy applications, so I was
curious about this myself. I'm new to D, but I did some heavy
data analysis on chat files a while back.
Not knowing anything about your data or what other queries you
might want to run on it: matching the start of the string with
std.algorithm.startsWith() and splitting the line on a
delimiter outperforms regex matching in my admittedly arbitrary
test code. I tested three scenarios at 10,000,000 rounds each:
Match everything: ~39 seconds for match, ~8 seconds for startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split.
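Roughly what I mean by the startsWith/split approach, as a sketch. The `parseAlertId` helper and the -1 sentinel are my inventions, and it assumes space-delimited lines like "Alert 123 ..."; adjust the delimiter to your actual log format.

```d
import std.algorithm.searching : startsWith;
import std.array : split;
import std.conv : to;

// Non-regex parse of a line like "Alert 123 something happened":
// reject cheaply with startsWith, then split and convert field 1.
// Returns -1 if the line is not an alert line. Note: a malformed
// numeric field would make to!int throw a ConvException.
int parseAlertId(const(char)[] line)
{
    if (!line.startsWith("Alert "))
        return -1;
    auto fields = line.split(" ");
    if (fields.length < 2)
        return -1;
    return fields[1].to!int;
}
```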
Even if you need a regex to match the middle, it might be
worthwhile to filter the list with startsWith if you're matching
a fixed string at the start of the line. Again, it depends on the
frequency of hits and how the data is structured.
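The filter-then-regex idea might look like this. The `isAlert` name is hypothetical; the prefix and pattern are from the original post. The point is that startsWith rejects most non-matching lines before the regex engine ever runs.

```d
import std.algorithm.searching : startsWith;
import std.regex : Regex, matchFirst, regex;

// Cheap literal prefilter in front of the (more expensive) regex,
// which still does the actual extraction for lines that pass.
bool isAlert(Line)(Line line, Regex!char alertRe)
{
    return line.startsWith("Alert ")
        && !matchFirst(line, alertRe).empty;
}
```

Build the regex once outside the loop (e.g. `auto re = regex(r"^Alert ([0-9]+)");`) and pass it in, so the prefilter is the only per-line cost on misses.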
Split is probably not the best way to slice the match, but I
don't have time tonight to try other slicing methods.