Regex performance
This might be worth looking into. Dmitry? http://jblewitt.com/blog/?p=462 Andrei
Re: Regex performance
On Saturday, 24 March 2012 at 23:06:54 UTC, Andrei Alexandrescu wrote: This might be worth looking into. Dmitry? http://jblewitt.com/blog/?p=462 Andrei A difference of that amount is likely expecting something like regex("Blah") to not have to create a new regex struct each time, something which I'm guessing Ruby does (as do other standard libraries like .NET).
Re: Regex performance
"Nick Sabalausky" wrote in message news:jklomm$2seb$1...@digitalmars.com... > > Yea, I agree that's what it sounds like. I tried to post a response, but > I'm just getting this result (and yes, this is with JS enabled): > > > Asirra validation failed! > ticket = start ASIRRAVALIDATION ir=cd ir data= > start RESULT ir=1cd ir 1 data=Failend Resource id #62cd ir 0 data= > start DEBUG ir=cd ir data=exceptions.Exception: invalid ticket formatend > Resource id #62cd ir 0 data= > end Resource id #62XML: > > Fail > exceptions.Exception: invalid ticket format > If it's working for anyone > else, maybe you could post it for me?: > Erm, the formatting got fucked up: Asirra validation failed! ticket = start ASIRRAVALIDATION ir= cd ir data= start RESULT ir=1 cd ir 1 data=Fail end Resource id #62 cd ir 0 data= start DEBUG ir= cd ir data=exceptions.Exception: invalid ticket format end Resource id #62 cd ir 0 data= end Resource id #62 XML: Fail exceptions.Exception: invalid ticket format
Re: Regex performance
"Kapps" wrote in message news:yudtvjsuhhimrhqai...@forum.dlang.org... > On Saturday, 24 March 2012 at 23:06:54 UTC, Andrei Alexandrescu wrote: >> This might be worth looking into. Dmitry? >> >> http://jblewitt.com/blog/?p=462 >> >> >> Andrei > > A difference of that amount is likely expecting something like > regex("Blah") to not have to create a new regex struct each time, > something which I'm guessing Ruby does (as do other standard libraries > like .NET). Yea, I agree that's what it sounds like. I tried to post a response, but I'm just getting this result (and yes, this is with JS enabled): Asirra validation failed! ticket = start ASIRRAVALIDATION ir=cd ir data= start RESULT ir=1cd ir 1 data=Failend Resource id #62cd ir 0 data= start DEBUG ir=cd ir data=exceptions.Exception: invalid ticket formatend Resource id #62cd ir 0 data= end Resource id #62XML: Fail exceptions.Exception: invalid ticket format If it's working for anyone else, maybe you could post it for me?: A few things on the D verison: - Make sure you're using a recent version of DMD. The regex engine was overhauled fairly recently (I forget exactly which version, but the latest, 2.058 definitely has it, along with some bugfixes.) - Make sure you're using "std.regex", not the deprecated "std.regexp". - It sounds like this may be your main problem: Make sure you're not re-creating the same regex multiple times: // Bad: foreach(str; strings) { auto result = match(str, regex("abc.*def")); } // Good: auto myRegex = regex("abc.*def"); foreach(str; strings) { auto result = match(str, myRegex); } Some regex engines cache the regex, but D's does't ATM. I think that'll likely get fixed though. - Even better yet, if your regex string is a literal (or otherwise known or computable at compile-time) as above, use the compile-time version instead: auto myRegex = ctRegex!"abc.*def"; // [...same 'foreach' loop as before...]
Re: Regex performance
On 25.03.2012 4:26, Nick Sabalausky wrote: "Nick Sabalausky" wrote in message news:jklomm$2seb$1...@digitalmars.com... Yea, I agree that's what it sounds like. I tried to post a response, but I'm just getting this result (and yes, this is with JS enabled): Asirra validation failed! ticket = start ASIRRAVALIDATION ir=cd ir data= start RESULT ir=1cd ir 1 data=Failend Resource id #62cd ir 0 data= start DEBUG ir=cd ir data=exceptions.Exception: invalid ticket formatend Resource id #62cd ir 0 data= end Resource id #62XML: Fail exceptions.Exception: invalid ticket format If it's working for anyone No luck here as well. I'm really curious of that guy's actual problem though. First bet is, of course, non-cached regex or regexp. -- Dmitry Olshansky
Re: Regex performance
Hello everyone, I'm the author of the blog post. First of all, thanks so much for the interest in my problem. I had no idea that the D community was so active (a fact that pleases me greatly). A quick update. I've written a small benchmark based on my real code and I'm now getting *significantly* better performance from my D code. I'm currently trying to figure out what I'm doing differently in my original program. At this point I am assuming that I have an error in my code which causes the D program to do much more work that its Ruby counterpart (although I am currently unable to find it). When I know more I will let you know. James Blewitt
Re: Regex performance
On Sunday, 25 March 2012 at 16:31:40 UTC, James Blewitt wrote: I'm currently trying to figure out what I'm doing differently in my original program. At this point I am assuming that I have an error in my code which causes the D program to do much more work that its Ruby counterpart (although I am currently unable to find it). When I know more I will let you know. James Blewitt That was the same type of thing I was seeing with very simple regex expressions. The regex was on the order of 30 times slower than hand code for finding words in strings. The ctRegex is on the order of 13x slower than hand code. The times below are from parallel processing on 100MB of text files, just finding the word boundaries. I uploaded that tests in https://github.com/jnorwood/wc_test I believe in all these cases the files are being cached by the os, since I was able to see the same measurements from a ramdisk done with imdisk. So in these cases the file reads are about 30ms of the result. The rest is cpu time, finding the words. This is with default 7 threads finished wcp_wcPointer! time: 98 ms finished wcp_wcCtRegex! time: 1300 ms finished wcp_wcRegex! time: 2946 ms finished wcp_wcRegex2! time: 2687 ms finished wcp_wcSlices! time: 157 ms finished wcp_wcStdAscii! time: 225 ms This is processing the same data with 1 thread finished wcp_wcPointer! time: 188 ms finished wcp_wcCtRegex! time: 2219 ms finished wcp_wcRegex! time: 5951 ms finished wcp_wcRegex2! time: 5502 ms finished wcp_wcSlices! time: 318 ms finished wcp_wcStdAscii! time: 446 ms And this is processing the same data with 13 threads finished wcp_wcPointer! time: 93 ms finished wcp_wcCtRegex! time: 1110 ms finished wcp_wcRegex! time: 2531 ms finished wcp_wcRegex2! time: 2321 ms finished wcp_wcSlices! time: 136 ms finished wcp_wcStdAscii! time: 200 ms The only change in the program that is uploaded is to add the suggested defaultPoolThreads(13); at the start of main to change the ThreadPool default thread count.
Re: Regex performance
On 26.03.2012 20:00, Jay Norwood wrote: On Sunday, 25 March 2012 at 16:31:40 UTC, James Blewitt wrote: I'm currently trying to figure out what I'm doing differently in my original program. At this point I am assuming that I have an error in my code which causes the D program to do much more work that its Ruby counterpart (although I am currently unable to find it). When I know more I will let you know. James Blewitt That was the same type of thing I was seeing with very simple regex expressions. The regex was on the order of 30 times slower than hand code for finding words in strings. This is a sad fact of life, the general tool can't beat highly specialized things. Ideally it can be on par though. Even in the best case ctRegex has to do a lot of things a simple == '\n' doesn't do, like storing boundaries of match. That's something to keep in mind. By the way, regex does fine job on (semi-)fixed strings of length >= 3-4, often easily beating plain find/indexOf. I haven't tested Boyer-Moore version of find, that should be faster then regex for sure. The ctRegex is on the order of 13x slower than hand code. The times below are from parallel processing on 100MB of text files, just finding the word boundaries. I uploaded that tests in https://github.com/jnorwood/wc_test I believe in all these cases the files are being cached by the os, since I was able to see the same measurements from a ramdisk done with imdisk. So in these cases the file reads are about 30ms of the result. The rest is cpu time, finding the words. This is with default 7 threads finished wcp_wcPointer! time: 98 ms finished wcp_wcCtRegex! time: 1300 ms finished wcp_wcRegex! time: 2946 ms finished wcp_wcRegex2! time: 2687 ms finished wcp_wcSlices! time: 157 ms finished wcp_wcStdAscii! time: 225 ms This is processing the same data with 1 thread finished wcp_wcPointer! time: 188 ms finished wcp_wcCtRegex! time: 2219 ms finished wcp_wcRegex! time: 5951 ms finished wcp_wcRegex2! time: 5502 ms finished wcp_wcSlices! time: 318 ms finished wcp_wcStdAscii! time: 446 ms And this is processing the same data with 13 threads finished wcp_wcPointer! time: 93 ms finished wcp_wcCtRegex! time: 1110 ms finished wcp_wcRegex! time: 2531 ms finished wcp_wcRegex2! time: 2321 ms finished wcp_wcSlices! time: 136 ms finished wcp_wcStdAscii! time: 200 ms The only change in the program that is uploaded is to add the suggested defaultPoolThreads(13); at the start of main to change the ThreadPool default thread count. -- Dmitry Olshansky
Re: Regex performance
Hello everybody, Thanks once again for the interest in my problem. I have posted the details and source code that recreates (at least for me) the poor performance. I didn't know how to post the code to the forum, so I posted it to my blog instead (see post update): http://jblewitt.com/blog/?p=462 Again, if I'm doing something stupid in my code (which is possible) then I apologise in advance. I'll take a look at the ctRegex as soon as I can. Regards, James
Re: Regex performance
On 27.03.2012 0:27, James Blewitt wrote: Hello everybody, Thanks once again for the interest in my problem. I have posted the details and source code that recreates (at least for me) the poor performance. I didn't know how to post the code to the forum, so I posted it to my blog instead (see post update): http://jblewitt.com/blog/?p=462 Again, if I'm doing something stupid in my code (which is possible) then I apologise in advance. No need to apologize, but you are using 2.054, which is unfashionable :) More importantly 2.054 contains old and rusty version of std.regex, the new version was included in 2.057+. BTW The current release is 2.058. I'll take a look at the ctRegex as soon as I can. Yup, just update compiler+phobos. Regards, James -- Dmitry Olshansky
Re: Regex performance
On 03/26/2012 02:41 PM, Dmitry Olshansky wrote: > On 27.03.2012 0:27, James Blewitt wrote: >> Hello everybody, >> >> Thanks once again for the interest in my problem. I have posted the >> details and source code that recreates (at least for me) the poor >> performance. >> I didn't know how to post the code to the forum, so I posted it to my >> blog instead (see post update): >> >> http://jblewitt.com/blog/?p=462 >> >> Again, if I'm doing something stupid in my code (which is possible) then >> I apologise in advance. > > No need to apologize, but you are using 2.054, which is unfashionable :) > More importantly 2.054 contains old and rusty version of std.regex, the > new version was included in 2.057+. > BTW The current release is 2.058. > >> >> I'll take a look at the ctRegex as soon as I can. >> > > Yup, just update compiler+phobos. > >> Regards, >> James > > My unofficial results comparing 2.056 to 2.058 on 64 bits: shakespeare.txt, 2.056 -> 1868 msecs shakespeare.txt, 2.058 -> 632 msecs data.csv, 2.056 -> 51953 msecs data.csv, 2.058 -> 1329 msecs That last line is pretty impressive. :) Ali
Re: Regex performance
On 27 March 2012 11:05, Ali Çehreli wrote: > My unofficial results comparing 2.056 to 2.058 on 64 bits: > > shakespeare.txt, 2.056 -> 1868 msecs > shakespeare.txt, 2.058 -> 632 msecs > > data.csv, 2.056 -> 51953 msecs > data.csv, 2.058 -> 1329 msecs > > That last line is pretty impressive. :) Dmitry did impressive work over those few version of Phobos/DMD. The performance is even more impressive when you consider that std.regex supports things like named matching and lookbehind that often slow down a regex (also kinda removes the "regular" from the name regular expression, technically) -- James Miller
Re: Regex performance
On Monday, 26 March 2012 at 22:05:34 UTC, Ali Çehreli wrote: My unofficial results comparing 2.056 to 2.058 on 64 bits: shakespeare.txt, 2.056 -> 1868 msecs shakespeare.txt, 2.058 -> 632 msecs data.csv, 2.056 -> 51953 msecs data.csv, 2.058 -> 1329 msecs That last line is pretty impressive. :) Ali Unofficial 2.056/2.058/Ruby 1.9.3 Windows 32bit data.csv: data.csv, 2.056 -> 76351 msecs data.csv, 2.058 -> 2573 msecs data.csv, 1.9.3 -> 9170 msecs Also I had to modify line 48 of the ruby file not knowing what I'm doing: if text.force_encoding("UTF-8") =~ /#{rule}/u Couldn't build it with ctRegex (Some Error, then ran out of memory).
Re: Regex performance
Great! Thanks for the support everyone. What a performance jump between v2.054 and v2.058! James
Re: Regex performance
On 3/27/12 1:57 AM, James Blewitt wrote: Great! Thanks for the support everyone. What a performance jump between v2.054 and v2.058! James Hi James -- you may want to link this discussion from your blog. Cheers, Andrei