ID:               40871
 Updated by:       [EMAIL PROTECTED]
 Reported By:      ismith at motorola dot com
-Status:           Assigned
+Status:           Bogus
 Bug Type:         PCRE related
 Operating System: Windows Server 2003 SP1
 PHP Version:      5.2.1
 Assigned To:      andrei
 New Comment:

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

I would really like to keep UTF-8 validation and escaping of bad
sequences out of PCRE. Yes, it does return an error when it runs into a
bad UTF-8 sequence, but that is all it can do. It does not return the
location of the error. Yes, we could return the subject string if we see
PCRE_ERROR_BADUTF8, but I do not believe it makes sense to do so, since
there is still an error condition. It is very likely that you're passing
the same bad UTF-8 string to other functions as well, so one could make
an argument that this validation and escaping should be done
everywhere, which unfortunately is not going to happen and which is why
we have PHP 6 in the works.

If you are working with UTF-8 strings, I suggest you validate them with
your own function before passing them around to PHP extensions.
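
A minimal sketch of such a user-land check, assuming the PCRE extension
was built with UTF-8 support. The name is_valid_utf8() is hypothetical
and not part of PHP; it simply reuses the check that PCRE already runs
on the subject when the /u modifier is set.

```php
<?php
// Hypothetical helper: returns true when $s is well-formed UTF-8.
// With the /u modifier, preg_match() returns false on a malformed
// subject instead of reporting a match, so an empty pattern is enough.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$good = "caf\xC3\xA9";  // "café" encoded as UTF-8
$bad  = "caf\xE9";      // Latin-1 byte, not a valid UTF-8 sequence

var_dump(is_valid_utf8($good));
var_dump(is_valid_utf8($bad));
```

A string that fails the check can then be rejected or cleaned up before
it reaches preg_replace() or any other extension function.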


Previous Comments:
------------------------------------------------------------------------

[2007-04-26 09:14:19] [EMAIL PROTECTED]

Nuno, Andrei, wake up.
Is it worth/possible to do something about it or should I mark it as
"won't fix"?

------------------------------------------------------------------------

[2007-03-22 23:03:41] [EMAIL PROTECTED]

In PHP 6, PHP always passes well-formed UTF-8 strings to PCRE, because
the strings are previously processed by ICU. In PHP 4/5, well.. It's
hard to leave it up to the user-land app to deal with these kinds of
complex things, but should we really interfere with the string? I
dunno.. but my point is that maintaining BC is more important at this
time..

------------------------------------------------------------------------

[2007-03-22 00:29:24] [EMAIL PROTECTED]

Did you see this:

http://us3.php.net/manual/en/function.preg-last-error.php

The error is not getting lost. There's just not much we can do about it
aside from returning it to the user.
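
For illustration, a short sketch of how a caller can pick that error up.
It assumes a current PHP release, where preg_replace() returns NULL on
failure; the $status variable is only there for the example.

```php
<?php
$subject = "abc\xFFdef";  // the byte 0xFF can never occur in UTF-8
$result  = preg_replace('/b/u', 'B', $subject);

if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    // The subject was rejected before any matching took place.
    $status = 'bad-utf8';
} else {
    $status = 'ok';
}

echo $status, "\n";
```

preg_last_error() reports the outcome of the most recent preg_* call
only, so it has to be checked before the next one is made.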

------------------------------------------------------------------------

[2007-03-21 22:47:02] [EMAIL PROTECTED]

Andrei, do you think there is something we can do about it?

------------------------------------------------------------------------

[2007-03-21 17:45:27] ismith at motorola dot com

Further info:

I emailed the PCRE maintainer, and he said that since PCRE doesn't do
the replacement part, PCRE itself isn't dumping the text.  Apparently
when PCRE sees bad UTF-8, it returns an error code (I believe
PCRE_ERROR_BADUTF8).

I think the text is getting lost by php_pcre_replace_impl.  If
pcre_exec returns PCRE_ERROR_NOMATCH, it saves all the unmatched text in
the result; but if pcre_exec returns some other error code, it looks to
me like it's dumping the result (which matches what I'm seeing).

I don't see how PHP can do much other than what it's doing; without a
match count back from pcre_exec, it can't process the replacements in
any case.
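
The two return paths described above can be compared from user land. A
small sketch, assuming a current PHP release where preg_replace()
returns NULL on error (the value returned by 5.2.1 may differ); the
variable names are illustrative only.

```php
<?php
// No match: the subject comes back unchanged.
$noMatch = preg_replace('/x/u', 'y', "abc");

// Malformed UTF-8 in the subject: nothing comes back at all.
$onError = preg_replace('/x/u', 'y', "abc\xFF");

var_dump($noMatch);
var_dump($onError);
```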

My feeling is that PCRE should not return an error code in this case,
but work around the bad UTF-8 character, which would be more in keeping
with the Unicode standard.  I'll discuss this further with the PCRE
folks.  OTOH, maybe MediaWiki should do UTF-8 cleanup on the string
before giving it to PHP.
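
One possible shape for such a cleanup step, as a sketch only. The helper
name strip_invalid_utf8() is hypothetical, and dropping the offending
bytes is just one policy; replacing them with U+FFFD would be another.
The pattern is applied without the /u modifier, so PCRE works on raw
bytes and never rejects the subject.

```php
<?php
// Hypothetical helper: keeps only well-formed UTF-8 sequences from $s
// and drops every other byte.
function strip_invalid_utf8($s)
{
    // Byte-level pattern listing the well-formed UTF-8 sequences.
    // The /x modifier permits the layout and the # comments.
    $pattern = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    /x';

    // Collect every well-formed sequence; bytes that match none of
    // the alternatives are skipped by preg_match_all().
    preg_match_all($pattern, $s, $matches);
    return implode('', $matches[0]);
}

$cleaned = strip_invalid_utf8("abc\xFFdef");
$result  = preg_replace('/b/u', 'B', $cleaned);
```

After the cleanup the subject is well-formed, so the /u replacement no
longer fails on it.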

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/40871
