 ID:               33334
 User updated by:  kloske at tpg dot com dot au
 Reported By:      kloske at tpg dot com dot au
 Status:           Bogus
 Bug Type:         PCRE related
 Operating System: Linux
 PHP Version:      4.3.10
 New Comment:

Okay, the PCRE people have gotten back to me: run directly against PCRE,
my test case produces the correct, expected behavior and does not fail.

So now we're left with a test case which fails in PHP yet works on
PCRE.

For a more stark example, consider the following PHP code:

$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$s = "\"some text\",\"test \\\"";
preg_match($r, $s, $m);
var_dump($m);

$m should be empty, since $s does not match $r, yet the following is
returned:

array(2) {
  [0]=>
  string(20) ""some text","test \""
  [1]=>
  string(1) "\"
}

Note that the last element of the array contains a single backslash,
indicating that the last choice that matched was a backslash, which is
NOT ONE OF THE THREE CHOICES.
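
For what it's worth, printing the pattern shows what PCRE is actually
handed once PHP has processed the double-quoted string. This is only a
sketch reusing $r from the test case above; the one-character check on a
lone backslash ($lone_backslash, $hits) is my own addition for
illustration.

```php
<?php
// Same pattern literal as in the test case above.
$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";

// PHP has already consumed one level of backslashes by this point.
echo $r, "\n";   // prints: /^"([^\"]|\\|\")*"$/

// To PCRE, [^\"] is simply "any character except a double quote",
// so a lone backslash is accepted by the first alternative.
$lone_backslash = '\\';                            // one character
$hits = preg_match('/^[^\"]$/', $lone_backslash);
var_dump($hits);                                   // int(1)
```

In other words, the backslash in the last array element would have been
matched by the character class, not by either of the escaped
alternatives.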

So, the PCRE people explained that they were not familiar with PHP, but
wondered whether it might be an escaping issue.

Does PHP require you to DOUBLE escape a regex? That is, to match a
sequence of two backslashes in a row, do you need to write "\\\\\\\\"?
I've tried doing this and it seems to give the expected behavior, yet the
manual does not mention this fact, and worse, the user comments seem to
indicate that you should not double escape (since no one is trying to
match two backslashes in a row anywhere).

I'd say this is a documentation deficiency more than anything: it should
be made clear that the pattern is first unescaped by PHP as a string
literal, so it needs to be escaped a second time for correct
interpretation by PCRE wherever you want a literal backslash, or in any
other situation where PCRE itself needs something escaped.
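
A sketch of the test case with both levels of escaping applied, assuming
the intent was "a quoted string in which a backslash may only appear as
part of \\ or \"". The variable names and the second subject string are
mine, not from the original report; single-quoted strings are used only
to keep the quoting readable.

```php
<?php
// What PCRE should receive:  ^"([^\\"]|\\\\|\\")*"$
// In a single-quoted PHP string each of those backslashes is doubled.
$fixed = '/^"([^\\\\"]|\\\\\\\\|\\\\")*"$/';

$bad  = '"some text","test \\"';   // subject from the report: "some text","test \"
$good = '"test \\" ok"';           // hypothetical well-formed input: "test \" ok"

var_dump(preg_match($fixed, $bad));    // int(0)
var_dump(preg_match($fixed, $good));   // int(1)
```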

To recap, this is what you apparently need to write in PHP to match two
literal backslashes next to each other:

"\\\\\\\\"

Gotta love it!

Because:

The number of backslashes is halved when PHP parses the string literal
(eight become four). PHP then passes that string literally to PCRE,
which halves the number of backslashes again, to the final figure of two
backslashes!

Simple when you understand, not even hinted at in the PHP
documentation.
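
The two halvings can be checked directly. A minimal sketch; the variable
names and the two subject strings are my own choices for the example.

```php
<?php
// Step 1: PHP turns the 8 backslashes in the source into 4 characters.
$body = "\\\\\\\\";
var_dump(strlen($body));                          // int(4)

// Step 2: PCRE reads those 4 characters as 2 escaped backslashes.
$two = 'a\\\\b';                                  // a, \, \, b
$one = 'a\\b';                                    // a, \, b
var_dump(preg_match('/' . $body . '/', $two));    // int(1)
var_dump(preg_match('/' . $body . '/', $one));    // int(0)
```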


Previous Comments:
------------------------------------------------------------------------

[2005-06-15 11:22:32] kloske at tpg dot com dot au

As a simpler test case, this literal text string:

"test","string\"

matches the following REGEX pattern:

^"([^\"]|\\|\")*"$

Treating the REGEX as a pattern GENERATOR rather than a matcher, there
is no way for that pattern to generate the string above.

I've reported this to the PCRE people and will keep you all posted as
to the reply.

------------------------------------------------------------------------

[2005-06-15 01:18:47] kloske at tpg dot com dot au

Thank you for that information - it is much appreciated. I will take
this up with the PCRE people, as I still believe this to be incorrect
behavior.

FYI, the documentation I was reading was the regex man pages on both
Solaris and Linux. My peers are people who've studied regular
expressions (as have I). Based on the definitions we've all seen in our
respective studies (though none of us has studied PCRE specifically as
an implementation), they agreed that the behavior we saw violates the
matching conditions specified by the test case's regular expression.

That is, based on your quote about greediness from the PCRE pages: I do
not want it to match a minimum number of times, I want it to match as
much as possible. Note the word possible; this regex did not allow it to
match as much as it did. In other words, it became very greedy indeed,
to the point of matching text it wasn't allowed to!

------------------------------------------------------------------------

[2005-06-14 17:35:48] [EMAIL PROTECTED]

I have no idea what manuals you are reading or which peers you are
talking to, but in perl-style regular expressions the '?' character is
overloaded and has different meanings in different contexts.  Type "man
perlre" at your Unix prompt and you will see:

       By default, a quantified subpattern is "greedy", that is, it
       will match as many times as possible (given a particular
       starting location) while still allowing the rest of the pattern
       to match.  If you want it to match the minimum number of times
       possible, follow the quantifier with a "?".
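
To illustrate the quoted passage, here is a small sketch with
preg_match. The patterns and the subject string are made up for the
example and are not taken from the bug report.

```php
<?php
$subject = 'aXbXb';

// Greedy: .* takes as much as it can while still letting the final b match.
preg_match('/a.*b/', $subject, $greedy);
var_dump($greedy[0]);   // string(5) "aXbXb"

// Lazy: .*? stops at the first place the rest of the pattern can match.
preg_match('/a.*?b/', $subject, $lazy);
var_dump($lazy[0]);     // string(3) "aXb"
```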

If you still don't understand this, take it up with the developers of
the PCRE library over at http://pcre.org since that is the code PHP
uses.  Even if somebody here agreed that there is a bug, it would have
to be fixed by the PCRE folks.

------------------------------------------------------------------------

[2005-06-14 16:46:09] [EMAIL PROTECTED]

It really is bogus: PHP uses the PCRE library underneath the preg_*
functions. If there is any bug (IMO there is no bug), then it's in
PCRE, so report it to the authors of that library.


------------------------------------------------------------------------

[2005-06-14 12:23:04] kloske at tpg dot com dot au

I do not believe this bug to be bogus or resolved.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/33334

-- 
Edit this bug report at http://bugs.php.net/?id=33334&edit=1
