18-Jun-2013 22:53, Gary Willoughby writes:
Below is an example snippet of code to test the performance of regex
matches. I need to parse a large log and extract data from it, and I've
noticed a huge increase in loop time when reading and using regex.

     ...
     auto alert = regex(r"^Alert ([0-9]+)");

     while ((line = file.readln()) !is null)
     {
         auto m = match(line, alert);

         if (m)
         {
             alerts++;
         }

         counter++;
     }
     ...


Depending on your data size, read more into memory in one go and then run a regex compiled with the "g" (global) option on it. The difference is that each time you call `match`, the regex engine has to allocate state for the matcher. This amounts to a malloc/free per line, and millions of allocations per second are simply out of reach.

I've long wondered what can be done to speed up such one-off matches, since most of the tuning effort went into the matching machinery itself, not the start/stop paths. The fix would be to add a TLS cache of memory in std.regex, so that each call to match reuses the same memory. This was partly postponed until the supposedly soon-to-come Allocators design, which never arrived.
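In user-code terms the idea looks roughly like this (a sketch only, not actual std.regex internals; `getScratch` is a hypothetical helper — D module-level variables are thread-local by default, so no locking is needed):

```d
// Sketch of the proposed fix, not real std.regex code: keep the matcher's
// working memory in a thread-local cache and reuse it across calls.
ubyte[] scratch; // module-level variable => TLS in D, one instance per thread

ubyte[] getScratch(size_t n)
{
    if (scratch.length < n)
        scratch.length = n; // grow once, then reuse: no malloc/free per match
    return scratch[0 .. n];
}
```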

Alternatives.

The simplest alternative would be to use an Appender!char object and fill it up with, say, 100 lines at a time.
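A sketch of that first option (untested, in the same spirit as the buffer example below; `countAlerts` is an illustrative helper, and the batch size of 100 is the figure mentioned above):

```d
import std.array : appender;
import std.regex;

// Batch lines into one buffer and run a single global match per batch,
// instead of paying one match() allocation per line.
size_t countAlerts(R)(R lines)
{
    auto alert = regex(r"^Alert ([0-9]+)", "gm"); // "m": ^ anchors at each line start
    auto buf = appender!(char[])();
    size_t alerts = 0, pending = 0;

    void flush()
    {
        foreach (m; match(buf.data, alert))
            ++alerts;
        buf.clear();
        pending = 0;
    }

    foreach (line; lines)
    {
        buf.put(line);
        if (++pending == 100) // one regex pass per ~100 lines
            flush();
    }
    flush(); // leftover partial batch
    return alerts;
}
```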

The second option is to read a big buffer, find the last newline in it, and run a global match on that chunk. Then copy the (hopefully small) tail to the beginning of the buffer and read into the rest.

The example below assumes a line can't be bigger than the buffer size.

(it's a sketch, untested)

// imports assumed: std.regex, std.string (lastIndexOf), std.algorithm (copy)
alert = regex(..., "gm");
ubyte[] buffer = new ubyte[64*1024];
ubyte[] slice = buffer[];
auto hasData = true;
while(hasData){
        size_t got = file.rawRead(slice).length;
        // valid data = tail from prev iteration + what was just read
        auto usable = got + (buffer.length - slice.length);
        if(got < slice.length)
                hasData = false;
        auto lastLF = lastIndexOf(cast(char[])buffer[0..usable], '\n') + 1;
        if(lastLF == 0){ // no complete line in the buffer
                if(hasData)
                        assert(false, "line longer than buffer"); // per assumption above
                lastLF = usable; // EOF: process the unterminated tail as-is
        }
        char[] piece = cast(char[])buffer[0..lastLF];
        foreach(m; match(piece, alert))
        {
                //old code here
        }
        copy(buffer[lastLF..usable], buffer[]); //copy leftover to front
        slice = buffer[usable-lastLF..$]; //read into the rest of buffer
}

This is how I'd go about it ATM if speed is important.

Using the above example I parse about 700K lines per second (I'm reading
from an SSD). If I comment out the regex match function, I read at 4.5M
lines per second. Considering I need to use about 8 regex matches and
extract data, this figure drops further to about 100K lines per second.

You mean testing 8 separate patterns? That would multiply the aforementioned malloc/free overhead by 8. One idea is to try combining them all into a single regex with alternations ('|').
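A sketch of that (the two patterns here are made-up stand-ins for the real eight):

```d
import std.regex;

// One combined pattern instead of eight separate ones: a single match()
// call (and a single matcher allocation) per line, with the capture groups
// telling you which alternative fired.
auto combined = regex(r"^Alert ([0-9]+)|^Warning ([0-9]+)");

// Usage: after a successful match, exactly one of captures[1]/captures[2]
// is non-empty, identifying the branch that matched.
```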


Is there anything I can do to speed up regex matching in such a
scenario? Are there any tips you can share to speed things up?


There are also compiler flags to fiddle with (dmd):
 -O -release -inline -noboundscheck

Thanks.


--
Dmitry Olshansky
