ID:               40395
 Updated by:       [EMAIL PROTECTED]
 Reported By:      jfrim at idirect dot com
-Status:           Assigned
+Status:           Closed
 Bug Type:         Documentation problem
 Operating System: *
 PHP Version:      *
 Assigned To:      nlopess
 New Comment:

This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation
better.




Previous Comments:
------------------------------------------------------------------------

[2007-02-09 19:28:57] jfrim at idirect dot com

The code from my [2007-02-08 19:59:04] post shows only 0x00 and 0x22
being escaped.  Maybe single-quote (0x27) only gets escaped depending
on the PHP.INI settings?  I may check into this later.

If nothing is changed in the PHP code, the best work-around I could
come up with is this:

Always use the "e" modifier, and in the replacement string for
preg_replace(), surround the back-reference with a pair of
str_replace(), one to handle 0x00 and one for 0x22.

Example:
<?php
echo
preg_replace('/([\\x00-\\xFF])/e',"'0x'.sprintf('%02X',ord(str_replace('\\\\0',\"\\0\",str_replace('\\\\\"','\"','\\1'))))",$inputstring);
?>

This example takes a string and turns each byte into "0x" followed by
the two-digit hex code.  Note the first str_replace() turns the \0 into
a proper NULL, and the second nested str_replace() turns the \" into
just " .

It's a very dirty work-around, because preg_replace() is useless
without the "e" modifier (adds processing overhead), and str_replace()
has to be called twice (adds processing overhead again), and the number
of backslashes in the source code is tremendous and can get confusing!

And we still have a potential problem remaining.  If exactly which
characters get escaped depends on the PHP.INI settings (i.e. regarding
double-quote and single-quote), then it's impossible for this dirty
work-around to be portable, unless the entire thing is encapsulated in
an if() or switch() block.  That's REALLY dirty!

The reason why stripslashes() can't be used on the back-reference is
because the backslash character, if matched in the pattern, is NOT
escaped when returned in the back-reference!  stripslashes() ends up
returning an empty string when only a single backslash is passed to it.
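
A minimal sketch of that stripslashes() behaviour, for anyone who wants
to check it locally.  The variable names are only illustrative and are
not taken from the code above:

```php
<?php
// One literal backslash (0x5C), as a matched but unescaped
// back-reference would deliver it.
$lone_backslash = "\\";

// stripslashes() treats the backslash as an escape prefix with nothing
// after it, so the byte is dropped entirely.
$stripped = stripslashes($lone_backslash);
var_dump(strlen($stripped));      // int(0)

// By contrast, backslash + "0" is decoded back to a single NUL byte.
$decoded_nul = stripslashes("\\0");
var_dump(bin2hex($decoded_nul));  // string(2) "00"
```

So a literal backslash in the input is silently lost if stripslashes()
is applied to the back-reference.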


If we can't change the behaviour of preg_replace() without breaking
compatibility, then I suggest introducing a new function called
something like preg_replace_ex() or preg_replace_binsafe() or
something, which fixes the bug properly.

The ideal bug fix would be for the back-reference to never escape any
returned characters, since the input string fed to preg_replace() is
NOT in an escaped context, and you should not mix escaped data with
unescaped data.

------------------------------------------------------------------------

[2007-02-09 17:37:51] [EMAIL PROTECTED]

ok, so after talking with Andrei, we came up with the decision to
document it rather than changing the behaviour (e.g. because of bug
#5676).
BTW, probably you'll want to consider using preg_replace_callback().
note to self: need to review again the escaped chars (at least NULL,
single-quote and double-quote are)
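
A rough sketch of what the preg_replace_callback() route could look
like for the byte-to-hex example from this report.  The function name
bytes_to_hex() is made up for illustration, and the closure syntax
assumes a PHP version that supports anonymous functions:

```php
<?php
// Hypothetical helper, not part of PHP or of the original report.
function bytes_to_hex($input)
{
    // The "s" modifier lets "." match newlines too, so every byte of
    // the input is visited exactly once.
    return preg_replace_callback(
        '/./s',
        function ($m) {
            // $m[0] holds the matched byte as it appeared in the input;
            // no evaluated replacement string is involved, so there is
            // nothing to unescape.
            return sprintf('0x%02X', ord($m[0]));
        },
        $input
    );
}

// NUL, double-quote, backslash and single-quote all come through intact.
echo bytes_to_hex("a\0\"\\'"), "\n"; // 0x610x000x220x5C0x27
```

Because the callback receives the match as a plain PHP string, neither
the "e" modifier nor the nested str_replace() calls are needed.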

------------------------------------------------------------------------

[2007-02-08 21:55:58] jfrim at idirect dot com

Another reason why it would be best to return NULL and DOUBLE-QUOTE
(0x00 and 0x22 respectively) in regular expression back-references
WITHOUT being escaped:


If this bug was fixed by escaping the backslash as well...

...Then the context of the resulting output string would be a mix of
escaped and non-escaped data.  (Since the input string is non-escaped,
but back-references are escaped.)  This would make it impossible to
safely un-escape without risk of data corruption.  The only way to
handle this would be to use the "e" modifier in the regular expression
and embed stripslashes() into the replacement string.  That's extra
processing overhead, and basically makes the entire preg_replace()
function useless without the "e" modifier.  It also defeats any
possible purposes as to why the back-references are escaped in the
first place.  Boo to this solution!


Alternatively, if this bug was fixed by returning NULL and DOUBLE-QUOTE
without being escaped...

When using preg_replace, the resulting string will always be in a
non-encoded context.  If a slash-encoded string is ever desired, the
entire thing can be wrapped in addslashes() by the user, without ever
risking destroying the integrity of the data.
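
As a small illustration of that last point, assuming a plain
preg_replace() call without the "e" modifier (the sample string and
variable names are invented for this sketch):

```php
<?php
// Raw input containing a double-quote, a NUL byte and a backslash.
$raw = "say \"hi\"\0 with a \\ backslash";

// An ordinary replacement: the result is still plain, unescaped data.
$result = preg_replace('/hi/', 'hello', $raw);

// The whole result is escaped in one step, and only because the caller
// asked for it.
$encoded = addslashes($result);

// Since everything was escaped consistently, decoding is lossless.
var_dump(stripslashes($encoded) === $result); // bool(true)
```

The round trip only works because the string is either fully escaped or
not escaped at all, never a mixture.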

------------------------------------------------------------------------

[2007-02-08 19:59:04] jfrim at idirect dot com

The following code demonstrates 0x00 and 0x22 being escaped, without
0x5C being escaped.
It creates an 8-bit ASCII text output, with the character value (in
DECIMAL) enclosed within braces (except for escaped chars, in which
case it ends up as "92"), followed by the actual character, then a
CRLF, for all 256 characters.

Note how the backslash (0x5C, decimal 92) is NOT escaped, and contrary
to what [EMAIL PROTECTED] posted, the single-quote (0x27, decimal 39) is
NOT escaped either.  (The double-quote (0x22, decimal 34) is escaped
instead.)

<?php
header('Content-Type: text/plain; charset=US-ASCII');
header('Content-Disposition: inline; filename=PCRE.txt');
header('Pragma: no-cache');
header('Expires: 0');
header('Cache-Control: no-cache; must-revalidate');
$teststring='';
for ($i=0; $i<=255; $i++) {
        $teststring.=chr($i);
}
echo
preg_replace('/([\\x00-\\xFF])/e',"'{'.ord('\\1').'}\\1'.chr(13).chr(10)",$teststring);
?>

------------------------------------------------------------------------

[2007-02-08 19:47:10] jfrim at idirect dot com

I have verified that along with 0x00 being escaped, 0x22 (the
double-quote character) is also escaped.  No other byte values are
affected.

Even if the documentation was changed to reflect this escaped behaviour
of 0x00 and 0x22, there would still be a bug with this behaviour since
0x5C (the backslash character) is NOT escaped!

This would create a discrepancy problem if the input string to a
preg_replace() contained a literal backslash followed by a number zero,
or a backslash followed by a double-quote.  There would be no way to
tell from the resulting preg_replace'd data if those sequences are
escaped NULLs and escaped double-quotes, or if those were literal
sequences in the input string.
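
The ambiguity can be reproduced without preg_replace() at all.  The
function below is only a stand-in that mimics the escaping described in
this report (NUL and double-quote get a backslash, the backslash itself
does not); it is an assumption for illustration, not the actual PCRE
extension code:

```php
<?php
// Hypothetical model of the reported behaviour.
function escape_without_backslash($s)
{
    return strtr($s, array("\0" => '\\0', '"' => '\\"'));
}

$real_nul       = "\0";   // one byte: 0x00
$literal_bslash = '\\0';  // two bytes: a backslash, then the digit zero

$a = escape_without_backslash($real_nul);
$b = escape_without_backslash($literal_bslash);

// Two different inputs collapse to the same two-byte output.
var_dump($a === $b);      // bool(true)
var_dump(bin2hex($a));    // string(4) "5c30"
```

Once both inputs map to the same output, no later step can tell them
apart.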

So the only way to fix this bug is to either...
...A: Escape the backslash as well, and change the documentation to
state that 0x00, 0x22, and 0x5C are escaped, or...
...B: Do not escape any characters.

I would say method B is preferred, since no stripslashes() would have
to be performed on the resulting output from a preg_replace(), and it's
far more intuitive to always know that a regular expression
back-reference will always contain the exact byte value that was
matched, without having to worry about special exceptions.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/40395

-- 
Edit this bug report at http://bugs.php.net/?id=40395&edit=1
