On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
Below is an example snippet that tests the performance of regex matches. I need to parse a large log and extract data from it, and I've noticed that the loop takes much longer once I run a regex match on each line I read.

        ...
        auto alert = regex(r"^Alert ([0-9]+)");

        while ((line = file.readln()) !is null)
        {
                auto m = match(line, alert);

                if (m)
                {
                        alerts++;
                }

                counter++;
        }
        ...

Using the above example I parse about 700K lines per second (I'm reading from an SSD). If I comment out the regex match call, I read 4.5M lines per second. Since I need about 8 regex matches to extract the data, the rate drops further, to about 100K lines per second.

Is there anything I can do to speed up regex matching in such a scenario? Are there any tips you can share to speed things up?

Thanks.

I'm working with some string-heavy applications so I was curious about this myself. I'm new to D, but I did some heavy data analysis on chat files a while back.

Not knowing anything about your data or what other queries you might want to do on it, matching the first part of the string with std.algorithm.startsWith() and splitting the line on a delimiter outperforms regex matching on my admittedly arbitrary test code. I tested two extremes at 10,000,000 rounds:

Match everything: ~39 seconds for match, ~8 seconds for startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split.
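
The test code behind these timings was not posted, so the following is only a minimal sketch of what a startsWith/split check might look like. The helper name alertField and the single-space delimiter are assumptions made for illustration. Unlike the regex, it does not verify that the field consists of digits.

```d
import std.algorithm.searching : startsWith;
import std.array : split;

/// Hypothetical helper: returns the second space-delimited field of an
/// "Alert ..." line, or null if the line does not start with the prefix.
string alertField(string line)
{
    // Cheap literal test standing in for the anchored "^Alert " part.
    if (!line.startsWith("Alert "))
        return null;

    // split allocates a new array of slices on every call.
    auto fields = line.split(" ");
    return fields.length > 1 ? fields[1] : null;
}

void main()
{
    assert(alertField("Alert 42 disk full") == "42");
    assert(alertField("Info 7 ok") is null);
}
```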

Even if you need a regex to match the middle, it might be worthwhile to filter the list with startsWith if you're matching a fixed string at the start of the line. Again, it depends on the frequency of hits and how the data is structured.
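
The filtering idea can be sketched as follows, assuming the pattern from the original question and a hypothetical function name countAlerts. Lines that lack the literal prefix never reach the regex engine. How much this helps depends on how many lines actually carry the prefix.

```d
import std.algorithm.searching : startsWith;
import std.regex : matchFirst, regex;

/// Hypothetical helper: counts lines matching the alert pattern, running
/// the regex only on lines that begin with the fixed prefix.
size_t countAlerts(string[] lines)
{
    auto alert = regex(r"^Alert ([0-9]+)");
    size_t alerts = 0;

    foreach (line; lines)
    {
        // Reject lines that cannot match before paying for the regex.
        if (!line.startsWith("Alert "))
            continue;

        if (!matchFirst(line, alert).empty)
            ++alerts;
    }
    return alerts;
}

void main()
{
    auto sample = ["Alert 42 disk full", "Info 7 ok", "Alert none"];
    // Only the first line has the prefix followed by digits.
    assert(countAlerts(sample) == 1);
}
```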

Split is probably not the best way to slice the match, but I don't have time tonight to try other slicing methods.
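
One possible alternative to split, offered as an untimed sketch rather than a measured improvement: slice the line directly after the prefix and scan the leading digits by hand, which avoids allocating an array of fields. The helper name parseAlert is hypothetical.

```d
import std.algorithm.searching : startsWith;
import std.ascii : isDigit;
import std.conv : to;

/// Hypothetical helper: extracts the number that follows "Alert ".
/// Returns false if the prefix or the digits are missing.
bool parseAlert(string line, out ulong id)
{
    enum prefix = "Alert ";
    if (!line.startsWith(prefix))
        return false;

    // Slicing does not copy; rest refers to the same memory as line.
    auto rest = line[prefix.length .. $];

    // Count the leading run of ASCII digits.
    size_t n = 0;
    while (n < rest.length && isDigit(rest[n]))
        ++n;
    if (n == 0)
        return false;

    // to!ulong throws if the number does not fit in a ulong.
    id = rest[0 .. n].to!ulong;
    return true;
}

void main()
{
    ulong id;
    assert(parseAlert("Alert 42 disk full", id) && id == 42);
    assert(!parseAlert("Alert none", id));
    assert(!parseAlert("Info 7 ok", id));
}
```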

