Edit report at http://bugs.php.net/bug.php?id=51531&edit=1

 ID:               51531
 Comment by:       mrjminer at gmail dot com
 Reported by:      mrjminer at gmail dot com
 Summary:          Adding additional backreferencing indicators for use
                   with PREG_OFFSET_CAPTURE
 Status:           Open
 Type:             Feature/Change Request
 Package:          *Regular Expressions
 Operating System: All (AFAIK)
 PHP Version:      Irrelevant

 New Comment:

By the way, "backreferencing indicator" is not a technical term, as far
as I know.  I mean something along the lines of how '?:' indicates no
backreference should be captured.



Thanks for reading!


Previous Comments:
------------------------------------------------------------------------
[2010-04-11 03:43:16] mrjminer at gmail dot com

Description:
------------
This suggestion is related to preg_match_all() when it is used with the
PREG_OFFSET_CAPTURE flag.



When PREG_OFFSET_CAPTURE is specified as a flag, each subpattern match
is returned in the $matches array as a pair: the matched string and the
offset of that string.  Yet there are instances where I only need one
of these pieces of information for a particular subpattern, but want
the other piece (or both pieces) for a different subpattern within the
same expression.  In these instances, resources are wasted storing
undesired information in the $matches array.
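
To make the current behavior concrete, here is a minimal sketch of what
PREG_OFFSET_CAPTURE stores for every group.  The pattern and the subject
string are made up for illustration and are not part of the BBCode case
below.

```php
<?php
// Hypothetical sample input, chosen only to keep the output short.
$subject = 'ab-cd';

preg_match_all('/(\w)(\w)/', $subject, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

// With PREG_SET_ORDER, $matches[match][group] is a two-element array:
// index 0 holds the matched string, index 1 holds its byte offset.
foreach ($matches as $i => $set) {
    foreach ($set as $group => [$text, $offset]) {
        echo "match $i, group $group: '$text' at offset $offset\n";
    }
}
```

Every group carries both values, whether or not the caller needs them.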



My suggestion is to add two additional indicators for backreference
capturing that can be used when the PREG_OFFSET_CAPTURE flag is
specified.  These indicators would tell the engine to set either the
offset or the subpattern string in the $matches array to null.  I
believe this change would reduce the space required to hold the
information in $matches, while extending the typical use of
preg_match_all() with PREG_OFFSET_CAPTURE (the same could also be done
for preg_split() with PREG_SPLIT_OFFSET_CAPTURE).
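
As a rough illustration of the result shape being requested, the sketch
below nulls the unwanted half of each pair in userland after the match.
The helper trim_captures() and its $want argument are hypothetical names
invented for this example; nothing like them exists in PHP.  Note that
this only shows what $matches would look like; it saves nothing during
the match itself, which is the point of the request.

```php
<?php
// Hypothetical helper, not part of PHP.  $want maps a group index to
// 'string' or 'offset'; groups that are not listed keep both values.
function trim_captures(array $sets, array $want): array
{
    foreach ($sets as &$set) {
        foreach ($set as $group => &$pair) {
            $keep = $want[$group] ?? null;
            if ($keep === 'string') {
                $pair[1] = null;   // drop the offset
            } elseif ($keep === 'offset') {
                $pair[0] = null;   // drop the matched string
            }
        }
        unset($pair);
    }
    unset($set);
    return $sets;
}

// Hypothetical sample input.
preg_match_all('/(\w)=(\w)/', 'k=v', $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

// Keep only the string of group 1 and only the offset of group 2.
$trimmed = trim_captures($matches, [1 => 'string', 2 => 'offset']);
var_dump($trimmed[0]);
```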

Test script:
---------------
Take, for instance, the following preg_match_all expressions to match
opening tags of BBCode:



1.

preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=\\s*?[^\\s])/iu',$bbc,$openers,PREG_SET_ORDER|PREG_OFFSET_CAPTURE);

foreach($openers as $key => $val) {
        foreach($val as $key2 => $val2) {
                foreach($val2 as $key3 => $val3) {
                        echo '$openers['.$key.']['.$key2.']['.$key3.'] = '.$val3.'<br>';
                }
        }
}



2.

preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=(\\s*?[^\\s]))/iu',$bbc,$openers,PREG_SET_ORDER|PREG_OFFSET_CAPTURE);

foreach($openers as $key => $val) {
        foreach($val as $key2 => $val2) {
                foreach($val2 as $key3 => $val3) {
                        echo '$openers['.$key.']['.$key2.']['.$key3.'] = '.$val3.'<br>';
                }
        }
}

Expected result:
----------------
In expression 1, the subpattern '(?=\\s*?[^\\s])' is used to check for
basic validity of an opening tag.  The beginning of the contents of the
opening tag then has to be computed as the offset of the whole match
($matches[#][0][1]) plus the length of the whole match
(strlen($matches[#][0][0])):
$contentstartposition = $matches[#][0][1] + strlen($matches[#][0][0]).



In expression 2, the subpattern '(?=(\\s*?[^\\s]))' is used to check for
basic validity of an opening tag AND to capture the position where the
content starts, which avoids the addition and the strlen() call:
$contentstartposition = $matches[#][3][1].
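
The two ways of finding the content start position can be compared side
by side.  This is only a sketch: the sample $bbc string and the variable
names $m1, $m2, $computed and $captured are assumptions made for the
example, while the two patterns are the ones from the test script.

```php
<?php
// Hypothetical sample input; not taken from the report.
$bbc = '[b]bold[/b] [url=http://example.com]link[/url]';

$flags = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;

// Expression 1: the lookahead is not captured.
preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=\\s*?[^\\s])/iu', $bbc, $m1, $flags);

// Expression 2: the lookahead contents are captured as group 3.
preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=(\\s*?[^\\s]))/iu', $bbc, $m2, $flags);

foreach ($m1 as $i => $set) {
    // Expression 1: offset of the whole match plus its length.
    $computed = $set[0][1] + strlen($set[0][0]);
    // Expression 2: offset of group 3, read directly from the result.
    $captured = $m2[$i][3][1];
    echo "{$set[0][0]}: computed=$computed, captured=$captured\n";
}
```

Both values should agree for each opening tag; the difference is only in
where the work (or the storage) ends up.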



In terms of processing power involved, expression 2 is superior to
expression 1, as it is merely relaying information already gathered and
known by the engine instead of performing addition and a strlen(). 
However, in terms of the resources required to store the match
information, expression 1 is superior to expression 2 and still ensures
a valid tag is found (but will require additional processing to get a
piece of information returned by expression 2).



The commonalities between these two expressions:

- Neither requires the offsets for subpattern [1] or [2], merely their
contents (for parsing / filtering).  The unneeded offsets are returned
anyway, at the expense of the memory needed to store them.  The only
way to obtain just the contents without using that memory is to spend
significant processing resources parsing for the same contents that the
subpattern match already returns in $matches.

- Neither requires the contents of the last subpattern (captured or not)
-- the offset is the only desired portion.  In expression 1, the offset
must be obtained by compromising processing resources; in expression 2,
the offset is obtained by compromising memory resources.



If there were additional indicators to restrict the returned value in
$matches for each subpattern, the $matches array could require
substantially fewer resources to store, while retaining its current
functionality and becoming usable in situations where it is not
feasible to trade an increased use of memory resources for a decreased
use of CPU resources.



Thanks for your time!



------------------------------------------------------------------------


