Re: [PHP-DEV] Offset-only results from preg_match

C. Scott Ananian Thu, 21 Mar 2019 14:22:02 -0700

P.S. Just to put some numbers to it, here are timings from `psysh` on
$html100, which contains the (Parsoid format) HTML for the
[[en:Barack Obama]] article on Wikipedia.


```
>>> strlen($html100)
=> 2592386
>>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008648 seconds on average (0.008236 median; 8.648343 total)
to complete.
>>> timeit -n1000 preg_match_all( '/b()/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008438 seconds on average (0.008127 median; 8.437881 total)
to complete.
>>> timeit -n1000 preg_match_all( '/b()()/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012069 seconds on average (0.011589 median; 12.069407 total)
to complete.
>>> timeit -n1000 preg_match_all( '/(?=(b))/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012134 seconds on average (0.011483 median; 12.134265 total)
to complete.
>>> timeit -n1000 preg_match_all( '/(?=()b())/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.016513 seconds on average (0.016039 median; 16.513011 total)
to complete.
```

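So emulating offset-only capture with empty groups and a lookahead isn't a
good way to determine the cost of the string copy in the $matches array:
each extra capture group, and the (?= ... ) wrapper itself, adds more time
than the copy costs. (The string copy is really trivial in this particular
case anyway, since each match is a single byte.)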
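
For anyone who wants to try this outside psysh, here is a minimal sketch of
the empty-group trick and a rough timing loop over the same five patterns.
The helper names spanOf() and timeIt() are made up for illustration, and the
subject string is a small synthetic stand-in for $html100, so absolute
numbers will not match the ones above.

```php
<?php
// Sketch only: spanOf() and timeIt() are hypothetical helpers, and
// $subject is a small synthetic stand-in for the 2.5 MB $html100.

/**
 * Find the first match of $inner at or after $offset and return
 * [start, length] without capturing the matched substring.
 * The (?= ... ) wrapper keeps group 0 empty; the two empty groups
 * record where the inner pattern starts and ends.
 * Assumes $inner has no unescaped '/' and no capture groups of its own.
 */
function spanOf(string $inner, string $subject, int $offset = 0): ?array
{
    $re = '/(?=()(?:' . $inner . ')())/';
    if (preg_match($re, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null; // no match, or a PCRE error
    }
    $start = $m[1][1]; // offset of the first empty group
    $end = $m[2][1];   // offset of the second empty group
    return [$start, $end - $start];
}

/** Average wall-clock seconds per call over $runs calls. */
function timeIt(callable $fn, int $runs = 100): float
{
    $t0 = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $t0) / $runs;
}

// '<b>bold</b>' starts at byte 2 and is 11 bytes long.
var_dump(spanOf('<b>.*?<\/b>', 'xx<b>bold</b>yy')); // [2, 11]

$subject = str_repeat('<p>abc <b>bold</b> text</p>', 1000);
$patterns = ['/(b)/', '/b()/', '/b()()/', '/(?=(b))/', '/(?=()b())/'];
foreach ($patterns as $re) {
    $avg = timeIt(function () use ($re, $subject) {
        preg_match_all($re, $subject, $m, PREG_OFFSET_CAPTURE);
    });
    printf("%-14s %.6f s\n", $re, $avg);
}
```

The (?: ... ) around the inner pattern is only there so that an alternation
inside it cannot swallow the empty groups.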
  --scott

On Thu, Mar 21, 2019 at 5:16 PM C. Scott Ananian <canan...@wikimedia.org>
wrote:

> Quick status update.  I tried to prototype this in pure PHP in the
> wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
> around each captured expression (replacing the capture parens) to
> effectively bypass the string copy and return a bunch of zero-length
> strings.  That didn't succeed in speeding up remex-html on my pet benchmark
> because (1) the (?= ... ) appears to deoptimize the regexp match, and (2)
> it turns out there's a substantial cost to each capture (presumably all
> those two-element arrays which Nikita flagged before as a future issue) and
> so doubling the total number of captures by using ()  () instead of (....)
> slowed the match down.
>
> So bad news: my benchmarking shortcut didn't work. Potential good news: I
> guess that underlines why this feature is necessary and can't just be
> emulated.
>
> I'm going to try this benchmark again tomorrow but by rebuilding PHP from
> source using Nikita's proposed patch so that I can actually get an
> apples-to-apples comparison.
>    --scott
>
> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com> wrote:
>
>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org>
>> wrote:
>>
>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com>
>>> wrote:
>>>
>>>> After thinking about this some more, while this may be a minor
>>>> performance improvement, it still does more work than necessary. In
>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>>> here) needs one new two-element array for each subpattern. If the captured
>>>> strings are short, this is where the main cost is going to be.
>>>>
>>>
>>> The primary use of this feature is when the captured strings are *long*,
>>> as that's when we most want to avoid copying a substring.
>>>
>>>
>>>> I'm wondering if we shouldn't consider a new object oriented API for
>>>> PCRE which can return a match object where subpattern positions and
>>>> contents can be queried via method calls, so you only pay for the parts
>>>> that you do access.
>>>>
>>>
>>> Seems like this is letting the perfect be the enemy of the good.  The
>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>>> just stores an integer where there would otherwise be an expensive
>>> substring.  Furthermore, since the array structure is left mostly alone, it
>>> would be not-too-hard to support earlier-PHP versions, with something like:
>>>
>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
>>> : 0;
>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>> $matchOneOffset = $m[1][1];
>>>
>>> If you introduce a whole new OO accessor object, it starts becoming very
>>> hard to write backward-compatible code.
>>>  --scott
>>>
>>
>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
>> implement this feature. It would be good to have some confirmation that
>> this is really a significant performance improvement before we land it
>> though.
>>
>> Nikita
>>
>
>
> --
> (http://cscott.net)
>


-- 
(http://cscott.net)
