P.S. Just to put some numbers to it, here are timings from `psysh` on $html100, which contains the (Parsoid-format) HTML for the [[en:Barack Obama]] article on Wikipedia:
```
>>> strlen($html100)
=> 2592386
>>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008648 seconds on average (0.008236 median; 8.648343 total) to complete.
>>> timeit -n1000 preg_match_all( '/b()/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008438 seconds on average (0.008127 median; 8.437881 total) to complete.
>>> timeit -n1000 preg_match_all( '/b()()/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012069 seconds on average (0.011589 median; 12.069407 total) to complete.
>>> timeit -n1000 preg_match_all( '/(?=(b))/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012134 seconds on average (0.011483 median; 12.134265 total) to complete.
>>> timeit -n1000 preg_match_all( '/(?=()b())/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.016513 seconds on average (0.016039 median; 16.513011 total) to complete.
```

So this isn't a good way to determine the cost of the string copy in the $matches array. (The string copy is really trivial in this particular case anyway.)
--scott

On Thu, Mar 21, 2019 at 5:16 PM C. Scott Ananian <canan...@wikimedia.org> wrote:

> Quick status update. I tried to prototype this in pure PHP in the
> wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
> around each captured expression (replacing the capture parens) to
> effectively bypass the string copy and return a bunch of zero-length
> strings. That didn't succeed in speeding up remex-html on my pet benchmark
> because (1) the (?= ... ) appears to deoptimize the regexp match, and (2)
> it turns out there's a substantial cost to each capture (presumably all
> those two-element arrays which Nikita flagged before as a future issue) and
> so doubling the total number of captures by using () () instead of (....)
> slowed the match down.
>
> So bad news: my benchmarking shortcut didn't work.
> Potential good news: I
> guess that underlines why this feature is necessary and can't just be
> emulated.
>
> I'm going to try this benchmark again tomorrow but by rebuilding PHP from
> source using Nikita's proposed patch so that I can actually get an
> apples-to-apples comparison.
> --scott
>
> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com> wrote:
>
>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org>
>> wrote:
>>
>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com>
>>> wrote:
>>>
>>>> After thinking about this some more, while this may be a minor
>>>> performance improvement, it still does more work than necessary. In
>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>>> here) needs one new two-element array for each subpattern. If the captured
>>>> strings are short, this is where the main cost is going to be.
>>>>
>>>
>>> The primary use of this feature is when the captured strings are *long*,
>>> as that's when we most want to avoid copying a substring.
>>>
>>>
>>>> I'm wondering if we shouldn't consider a new object oriented API for
>>>> PCRE which can return a match object where subpattern positions and
>>>> contents can be queried via method calls, so you only pay for the parts
>>>> that you do access.
>>>>
>>>
>>> Seems like this is letting the perfect be the enemy of the good. The
>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>>> just stores an integer where there would otherwise be an expensive
>>> substring. Furthermore, since the array structure is left mostly alone, it
>>> would be not-too-hard to support earlier-PHP versions, with something like:
>>>
>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE : 0;
>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>> $matchOneOffset = $m[1][1];
>>>
>>> If you introduce a whole new OO accessor object, it starts becoming very
>>> hard to write backward-compatible code.
>>> --scott
>>>
>>
>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
>> implement this feature. It would be good to have some confirmation that
>> this is really a significant performance improvement before we land it
>> though.
>>
>> Nikita
>>
>
>
> --
> (http://cscott.net)
>

--
(http://cscott.net)
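
For anyone who wants to see the emulation trick from the quoted status update in isolation, here is a minimal sketch. The sample string and the tag pattern are made up for illustration (they are not the real remex-html regexps), and it uses only the existing PREG_OFFSET_CAPTURE flag, not the proposed PREG_LENGTH_CAPTURE.

```php
<?php
// Hypothetical sample input; the benchmark above used a 2.5 MB HTML string.
$subject = '<section>text</section>';

// Ordinary capture: $m[1] is [substring, offset], so the substring is copied.
preg_match('/<(\w+)>/', $subject, $m, PREG_OFFSET_CAPTURE);

// Emulation: wrap the whole pattern in (?= ... ) and replace (X) with ()X().
// Groups 1 and 2 are zero-length; their offsets mark the start and end of X.
preg_match('/(?=<()\w+()>)/', $subject, $e, PREG_OFFSET_CAPTURE);

$start  = $e[1][1];           // offset where the emulated capture begins
$length = $e[2][1] - $start;  // end offset minus start offset

// The substring is only materialized if and when it is actually needed.
$name = substr($subject, $start, $length);
```

Note that the emulated version produces two capture entries where the ordinary one produces one, which is the per-capture overhead the timings above run into.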