From:             masakielastic at gmail dot com
Operating system: 
PHP version:      5.5.1
Package:          Strings related
Bug Type:         Feature/Change Request
Bug description:  improvement for counting ill-formed byte sequences

Description:
------------
Consider how many substitute characters (U+FFFD) should be produced when the
valid range of the second byte of a UTF-8 sequence is narrower than the usual
0x80 - 0xBF (for example, 0xA0 - 0xBF).

// Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
// U+0800 - U+0FFF      E0          A0 - BF      80 - BF
// U+D000 - U+D7FF      ED          80 - 9F      80 - BF
// U+10000 - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF  F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended policy described in "Table 3-8. Use of U+FFFD in
UTF-8 Conversion" of The Unicode Standard, "\xE0\x80" should be converted to
"\xEF\xBF\xBD"."\xEF\xBF\xBD" (two substitute characters), because 0x80 is not
a valid second byte after 0xE0 and each byte is therefore replaced separately.
The actual result is "\xEF\xBF\xBD" (one substitute character).

One solution is to introduce a macro that checks the second byte against the
range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php

Test script:
---------------
<?php
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    // Round-trip through htmlspecialchars() so that ENT_SUBSTITUTE replaces
    // ill-formed byte sequences with U+FFFD.
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)

-- 
Edit bug report at https://bugs.php.net/bug.php?id=65323&edit=1
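
The second-byte check proposed in the description can be sketched in userland
PHP as follows. This is only an illustration of the counting rule, not the
linked C patch: the function names second_byte_range() and scrub_utf8() are
hypothetical and do not exist in PHP, and the replacement policy assumed here
is one U+FFFD per maximal ill-formed subpart.

```php
<?php
// Hypothetical helper (not a PHP built-in): allowed [min, max] range of the
// second byte for a given lead byte, following the table in the description.
function second_byte_range($lead)
{
    switch ($lead) {
        case 0xE0: return array(0xA0, 0xBF);
        case 0xED: return array(0x80, 0x9F);
        case 0xF0: return array(0x90, 0xBF);
        case 0xF4: return array(0x80, 0x8F);
        default:   return array(0x80, 0xBF);
    }
}

// Hypothetical scrubber: copies well-formed sequences and replaces each
// maximal ill-formed subpart with one U+FFFD ("\xEF\xBF\xBD").
function scrub_utf8($s)
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                       // ASCII: copy as-is
            $out .= $s[$i];
            $i++;
            continue;
        }
        if ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;                         // two-byte sequence
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;                         // three-byte sequence
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;                         // four-byte sequence
        } else {                               // stray trail byte or invalid lead
            $out .= "\xEF\xBF\xBD";
            $i++;
            continue;
        }

        // The second byte is checked against the lead-specific range;
        // later trail bytes use the generic 0x80 - 0xBF range.
        list($min, $max) = second_byte_range($b);
        $j = $i + 1;
        $found = 0;
        while ($found < $need && $j < $len) {
            $t = ord($s[$j]);
            if ($t < $min || $t > $max) {
                break;
            }
            $j++;
            $found++;
            $min = 0x80;
            $max = 0xBF;
        }

        if ($found === $need) {
            $out .= substr($s, $i, $need + 1); // well-formed: copy
        } else {
            $out .= "\xEF\xBF\xBD";            // ill-formed prefix: one U+FFFD
        }
        $i = $j;                               // resume at the first unconsumed byte
    }
    return $out;
}
```

Under these assumptions, scrub_utf8("\xE0\x80") yields two substitute
characters and scrub_utf8("\xE0\x80\x80") yields three, matching the expected
result of the test script, while a truncated but otherwise valid prefix such
as "\xE0\xA0" yields a single one.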