ID:               33334
Updated by:       [EMAIL PROTECTED]
Reported By:      kloske at tpg dot com dot au
Status:           Bogus
Bug Type:         PCRE related
Operating System: Linux
PHP Version:      4.3.10
New Comment:

It would be a hell of a lot easier to read your regexes if you would use
single quotes, e.g. for your above example:

$r = '/^"([^\\"]|\\\\|\\")*"$/';
$s = '"some text","test \\"';
preg_match($r, $s, $m);
var_dump($m);

And this stuff is documented.


Previous Comments:
------------------------------------------------------------------------

[2005-06-15 12:03:40] kloske at tpg dot com dot au

Okay, the PCRE people have gotten back to me. PCRE itself produces the
correct, expected behavior: my test case does not fail there.

So now we're left with a test case which fails in PHP yet works in PCRE.

For a more stark example, consider the following PHP code:

$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$s = "\"some text\",\"test \\\"";
preg_match($r, $s, $m);
var_dump($m);

$m should be empty, since $s does not match $r, yet the following is
returned:

array(2) {
  [0]=>
  string(20) ""some text","test \""
  [1]=>
  string(1) "\"
}

Note that the last element of the array contains a single backslash,
indicating that the last choice that matched was a backslash, which is
NOT ONE OF THE THREE CHOICES.

So, the PCRE people explained that they were not familiar with PHP but
wondered if it is an escaping issue.

Does PHP require you to DOUBLE escape a regex? I.e., to match a sequence
of two backslashes in a row, do you need to write "\\\\\\\\"?

I've tried doing this and it seems to give the expected behavior, yet
the manual does not mention this fact, and worse, the user comments seem
to indicate that you should not double escape (since no one is trying to
match two backslashes in a row anywhere).

I'd say this is a documentation deficiency more than anything. It should
be made clear that the pattern is escaped first as a PHP string, and then
needs to be escaped again for correct interpretation by PCRE if you are
trying to include a literal backslash, or in other situations where PCRE
needs to escape things.

To recap, this is what you apparently need to write in PHP to match a
literal of two backslashes next to each other:

"\\\\\\\\"

Gotta love it!

Because: the number of backslashes is halved when PHP encodes it as a
string, then PHP passes it literally to PCRE, which halves the number of
backslashes again, to the final figure of two backslashes!

Simple when you understand, not even hinted at in the PHP documentation.

------------------------------------------------------------------------

[2005-06-15 11:22:32] kloske at tpg dot com dot au

As a simpler test case, this literal text string:

"test","string\"

matches the following REGEX pattern:

^"([^\"]|\\|\")*"$

Reversing the sense of the REGEX to being a pattern GENERATOR, there is
no way for that REGEX pattern to generate the string above.

I've reported this to the PCRE people and will keep you all posted as to
the reply.

------------------------------------------------------------------------

[2005-06-15 01:18:47] kloske at tpg dot com dot au

Thank you for that information - it is much appreciated. I will take
this up with the PCRE people, as I still believe this to be incorrect
behavior.

FYI, the documentation I was reading was the regex man pages on both
Solaris and Linux. My peers were people who've studied regular
expressions (as have I), and agreed that, based on the definitions we've
all seen in our respective studies (though none of us have studied PCRE
specifically as an implementation), the behavior we saw was a violation
of the matching conditions specified in the test case's regular
expression.

I.e., based on your greedy quote from the PCRE pages, I do not want it to
match a minimum number of times, I want it to match as much as possible.
Note the word possible; this regex did not allow it to match as much as
it did - i.e., it became very greedy indeed, to the point of matching
text it wasn't allowed to!

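For reference, the two escaping layers described in the
[2005-06-15 12:03:40] comment can be checked with a short script. This is
a minimal sketch, not part of the original report. The $double, $single
and $s values are copied from the thread; the $strict pattern is only an
assumption about what the reporter intended (a quoted string whose body is
made of non-backslash, non-quote characters, an escaped backslash, or an
escaped quote), and all other variable names are illustrative.

```php
<?php
// The reporter's double-quoted literal and the single-quoted rewrite from
// the reply: after PHP's own string parsing they are the same pattern.
$double = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$single = '/^"([^\\"]|\\\\|\\")*"$/';
var_dump($double === $single);   // bool(true)

// What PCRE actually receives once PHP has processed the literal:
echo $single, "\n";              // /^"([^\"]|\\|\")*"$/

// The 20-character subject from the report: "some text","test \"
$s = '"some text","test \\"';

// PCRE reads [^\"] as "any character except a double quote", so a lone
// backslash is accepted by the first alternative and the subject matches.
$loose = preg_match($single, $s, $m);
var_dump($loose);                // int(1)
var_dump($m[1]);                 // string(1) "\"

// Assumed intent (hypothetical pattern, not from the report): each literal
// backslash needs \\ at the PCRE level, which is \\\\ in PHP source.
$strict = '/^"([^\\\\"]|\\\\\\\\|\\\\")*"$/';
echo $strict, "\n";              // /^"([^\\"]|\\\\|\\")*"$/

// With both layers escaped, the subject from the report no longer matches.
$strictHit = preg_match($strict, $s);
var_dump($strictHit);            // int(0)

// A well-formed quoted string containing an escaped quote still matches.
$valid = preg_match($strict, '"test \\" ok"');
var_dump($valid);                // int(1)
```

Echoing the pattern before passing it to preg_match() is a cheap way to
see exactly what PCRE will be asked to interpret.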
------------------------------------------------------------------------

[2005-06-14 17:35:48] [EMAIL PROTECTED]

I have no idea what manuals you are reading or which peers you are
talking to, but in Perl-style regular expressions the '?' character is
overloaded and has different meanings in different contexts. Type "man
perlre" at your Unix prompt and you will see:

    By default, a quantified subpattern is "greedy", that is, it will
    match as many times as possible (given a particular starting
    location) while still allowing the rest of the pattern to match. If
    you want it to match the minimum number of times possible, follow
    the quantifier with a "?".

If you still don't understand this, take it up with the developers of
the PCRE library over at http://pcre.org since that is the code PHP
uses. Even if somebody here agreed that there is a bug, it would have to
be fixed by the PCRE folks.

------------------------------------------------------------------------

[2005-06-14 16:46:09] [EMAIL PROTECTED]

It really is bogus: PHP uses the PCRE library underneath the preg_*
functions. If there is any bug (IMO there is no bug), then it's in PCRE,
so report this to the authors of that.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/33334