Re: Tips on making regex more performant?

2013-06-18 Thread 1100110

On 06/18/2013 01:53 PM, Gary Willoughby wrote:

Below is an example snippet of code to test for performance of regex
matches. I need to parse a large log and extract data from it and i've
noticed a huge increase in time of the loop when reading and using regex.

 ...
 auto alert = regex(r"^Alert ([0-9]+)");

 while ((line = file.readln()) !is null)
 {
 auto m = match(line, alert);

 if (m)
 {
 alerts++;
 }

 counter++;
 }
 ...

Using the above example i parse about 700K lines per second (i'm reading
from an SSD). If i comment out the regex match function, i read at 4.5M
lines per second. Considering i need to use about 8 regex matches and
extract data, this figure further drops to about 100K lines per second.

Is there anything i can do to speed up regex matching in such a
scenario? Are there any tips you can share to speed things up?

Thanks.


enum alert = ctRegex!r"^Alert ([0-9]+)";

And then use it the same way.


Re: Tips on making regex more performant?

2013-06-18 Thread Gary Willoughby

enum alert = ctRegex!r"^Alert ([0-9]+)";

And then use it the same way.


Thanks. Hmmm.. i get 500K (worse performance) using that. :/

Any more tips?


Re: Tips on making regex more performant?

2013-06-18 Thread bearophile

Gary Willoughby:


Any more tips?


Try using the LDC2 compiler, with the ctRegex, and using the 
correct compilation flags (like ldmd2 -O -release -inline 
-noboundscheck).


If that fails one solution is to write your own finite state 
machine where the state is encoded by the program counter, using 
gotos. There is a well known simple to use C library that 
generates such code automatically. But it's probably better to 
rethink your problem and re-frame it.


Bye,
bearophile


Re: Tips on making regex more performant?

2013-06-18 Thread Jacob Carlborg

On 2013-06-18 21:22, Gary Willoughby wrote:


Thanks. Hmmm.. i get 500K (worse performance) using that. :/


D has basically the fastest regular expression library/module. It's 
faster than V8.



Any more tips?


How about reading in larger chunks than single lines?

--
/Jacob Carlborg


Re: Tips on making regex more performant?

2013-06-18 Thread Dmitry Olshansky

18-Jun-2013 22:53, Gary Willoughby пишет:

Below is an example snippet of code to test for performance of regex
matches. I need to parse a large log and extract data from it and i've
noticed a huge increase in time of the loop when reading and using regex.

 ...
 auto alert = regex(r"^Alert ([0-9]+)");

 while ((line = file.readln()) !is null)
 {
 auto m = match(line, alert);

 if (m)
 {
 alerts++;
 }

 counter++;
 }
 ...



Depending on you data size reading more into memory in one go and then 
use regex  with "g" option on it. The difference is that each time you 
call `match` regex engine has to allocate state for the matcher. This 
leads to basically doing malloc/free per line. Millions of allocations 
per second are surely out of reach.


I've been long wondering what I can do to speed up such one-off matches 
as most of tuning had gone into the machine itself, not start/stop 
paths. What can be done to fix this is adding TLS cache of memory in 
std.regex, so that it reuses the same memory on each call to match.
Partially this was postponed till supposedly soon to come Allocators 
design but it never did.


Alternatives.

Simplest alternative would be to use some Appender!char object and fill 
it up with say 100 lines.


Second option is read up a big buffer, find a newline from the end of it 
and run global match on this chunk. Copy the (hopefully small) tail to 
the beginning of buffer and read into the rest.


The example below assumes you can't have line bigger then some buffer size.

(it's a sketch, untested)

alert = regex(..., "gm");
ubyte[] buffer = new ubyte[64*1024];
ubyte[] slice = buffer[];
auto hasData = true;
while(hasData){
size_t got = file.rawRead(slice).length;
// + tail from prev iteration
auto usable = got + buffer.length - slice.length;
if(got < slice.length)
hasData = false;
auto idx = indexOf(retro(buffer[0..got]), '\n');
if(idx < 0){
slice = buffer[];
continue;
}
size_t lastLF = got-idx;
char[] piece = cast(char[])buffer[0..lastLF];
foreach(m; match(piece, alert))
{
//old code here
}
copy(buffer[lastLF..$], buffer[]); //copy leftover to front
slice = buffer[$-lastLF..$]; //read into the rest of buffer
}

This is how I'd go about it ATM if speed is important.


Using the above example i parse about 700K lines per second (i'm reading
from an SSD). If i comment out the regex match function, i read at 4.5M
lines per second. Considering i need to use about 8 regex matches and
extract data, this figure further drops to about 100K lines per second.


Like test 8 separate patterns? This would multiply aforementioned 
malloc/free by 8. One idea is to try putting them all in a single regex 
with alternations '|'.




Is there anything i can do to speed up regex matching in such a
scenario? Are there any tips you can share to speed things up?



There are also compiler flags to fiddle with (dmd):
 -O -release -inline -noboundscheck


Thanks.



--
Dmitry Olshansky


Re: Tips on making regex more performant?

2013-06-18 Thread Dmitry Olshansky

18-Jun-2013 23:22, Gary Willoughby пишет:

enum alert = ctRegex!r"^Alert ([0-9]+)";

And then use it the same way.


Thanks. Hmmm.. i get 500K (worse performance) using that. :/


My bet would be allocations are taking the bulk of time. Any chance to 
compile it with -profile?




Any more tips?




--
Dmitry Olshansky


Re: Tips on making regex more performant?

2013-06-18 Thread Dmitry Olshansky

19-Jun-2013 00:34, Jacob Carlborg пишет:

On 2013-06-18 21:22, Gary Willoughby wrote:


Thanks. Hmmm.. i get 500K (worse performance) using that. :/


D has basically the fastest regular expression library/module. It's
faster than V8.



As much as I'm appeased to hear this it's isn't simply "faster then V8". 
V8 one is very fast in general case (JIT compiler aids in that) and I 
think would come out as winer more often then std.regex.


What is closer to truth is that typically std.regex is within about 5 
top well-known engines and it's beats the rest of competition on some 
patterns including in certain popular benchmarks such as regex-dna.


For instance it typically obliterates PCRE, infinitely so on Unicode 
patterns (I never had patience to let it finish) .


With that being said there are many things to improve in std.regex 
speed-wise, sadly I haven't been able to tell about it at DConf.



--
Dmitry Olshansky


Re: Tips on making regex more performant?

2013-06-18 Thread KJS

On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
Below is an example snippet of code to test for performance of 
regex matches. I need to parse a large log and extract data 
from it and i've noticed a huge increase in time of the loop 
when reading and using regex.


...
auto alert = regex(r"^Alert ([0-9]+)");

while ((line = file.readln()) !is null)
{
auto m = match(line, alert);

if (m)
{
alerts++;
}

counter++;
}
...

Using the above example i parse about 700K lines per second 
(i'm reading from an SSD). If i comment out the regex match 
function, i read at 4.5M lines per second. Considering i need 
to use about 8 regex matches and extract data, this figure 
further drops to about 100K lines per second.


Is there anything i can do to speed up regex matching in such a 
scenario? Are there any tips you can share to speed things up?


Thanks.


I'm working with some string-heavy applications so I was curious 
about this myself. I'm new to D, but I did some heavy data 
analysis on chat files a while back.


Not knowing anything about your data or what other queries you 
might want to do on it, matching the first part of the string 
with std.algorithm.startsWith() and splitting the line on a 
delimiter outperforms regex matching on my admittedly arbitrary 
test code. I tested two extremes at 10,000,000 rounds:


Match everything: ~39 seconds for match, ~8 seconds for 
startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second 
for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second 
for startsWith/split.


Even if you need a regex to match the middle, it might be 
worthwhile to filter the list with startsWith if you're matching 
a fixed string at the start of the line. Again, it depends on the 
frequency of hits and how the data is structured.


Split is probably not the best way to slice the match, but I 
don't have time tonight to try other slicing methods.





Re: Tips on making regex more performant?

2013-06-19 Thread Jacob Carlborg

On 2013-06-18 23:29, Dmitry Olshansky wrote:


As much as I'm appeased to hear this it's isn't simply "faster then V8".
V8 one is very fast in general case (JIT compiler aids in that) and I
think would come out as winer more often then std.regex.

What is closer to truth is that typically std.regex is within about 5
top well-known engines and it's beats the rest of competition on some
patterns including in certain popular benchmarks such as regex-dna.

For instance it typically obliterates PCRE, infinitely so on Unicode
patterns (I never had patience to let it finish) .

With that being said there are many things to improve in std.regex
speed-wise, sadly I haven't been able to tell about it at DConf.


I see, thanks for the correction.

--
/Jacob Carlborg


Re: Tips on making regex more performant?

2013-06-20 Thread Gary Willoughby
I'm working with some string-heavy applications so I was 
curious about this myself. I'm new to D, but I did some heavy 
data analysis on chat files a while back.


Not knowing anything about your data or what other queries you 
might want to do on it, matching the first part of the string 
with std.algorithm.startsWith() and splitting the line on a 
delimiter outperforms regex matching on my admittedly arbitrary 
test code.


This is the approach i used and it's very very fast. Using 
std.algorithm.startsWith() and simple delimiter based string 
slicing i've solved it. It's so fast there is hardly any slow 
down.


As a side note, i also use regex matching for the same data in 
debug blocks to assert the slicing method matches what i would of 
got from the regex, which is nice.