https://issues.dlang.org/show_bug.cgi?id=18600
Issue ID: 18600 Summary: Regex performance enhancement for repeated matchFirst calls Product: D Version: D2 Hardware: x86 OS: Mac OS X Status: NEW Severity: enhancement Priority: P1 Component: phobos Assignee: nob...@puremagic.com Reporter: jrdemail2000-dl...@yahoo.com This is a follow-up to https://issues.dlang.org/show_bug.cgi?id=18114, which identified a regex performance regression introduced in the 2.078 release. A number of improvements in 2.079 cut the performance regression meaningfuly enough to close the issue, but a meaningful delta remained, from 10-25% in my tests, depending on the regex issued. This indicates headroom for improvement exists. Subsequently, Dmitry Olshansky identified another performance opportunity. The test cases make repeated calls to matchFirst. This is the usage in eBay's TSV Utilities where this was identified. In this usage, a regex match is made against every line of input in turn. Each call to matchFirst instantiates a new version of the regex engine, a relatively expensive operation. Caching and reusing the engine on subsequent calls would be faster. This optimization has been available for a while, however, an increase in instantiation cost may have contributed to the performance regression identified in 2.078. In Dmitry's tests, caching the regex engine showed approximately 3x improvement. This would be an improvement significantly beyond the regression introduced in 2.078. My performance tests from issue 18114 (see issue 18114 for test details): | Regex | 2.077.1 | 2.078.0-beta1 | 2.079.0 | |-----------------------+--------------+---------------+--------------| | 'abc' | 3819.314 ms | 6202.892 ms | 5077.937 ms | | '(aa)+(cc)+g' | 5457.678 ms | 8209.269 ms | 6672.057 ms | | '(aa|ax).+[gxb][cyw]' | 10121.181 ms | 12448.443 ms | 11254.978 ms | A material improvement over 2.078.0-beta1, but still 10-25% off 2.077.1, depending on the regex. Dimitry's tests caching the engine, using the second regex: # Before ▶ ./find_regex '(aa)+(cc)+g' strings.txt 106437 matches; 5538.671 milliseconds # After ▶ ./find_regex '(aa)+(cc)+g' strings.txt 106437 matches; 1880.550 milliseconds This would be approximately a 3x improvement. --