From:             masakielastic at gmail dot com
Operating system: 
PHP version:      5.5.1
Package:          Strings related
Bug Type:         Feature/Change Request
Bug description:  Improvement for counting ill-formed byte sequences

Description:
------------
Consider how many substitute characters (U+FFFD) should be emitted when
the first byte of a UTF-8 sequence narrows the valid range of the second
byte (for example, to 0xA0 - 0xBF after 0xE0):

//      Code Points   First Byte Second Byte Third Byte Fourth Byte
//   U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
//   U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
//  U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
// U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF

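If you follow the recommended practice described in "Table 3-8. Use of
U+FFFD in UTF-8 Conversion" of The Unicode Standard,
"\xE0\x80" should be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD":
0xE0 is a valid first byte, but 0x80 is outside the 0xA0 - 0xBF range it
allows, so each of the two bytes is replaced separately.
The actual result is a single "\xEF\xBF\xBD".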
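
To make the counting rule above concrete, here is a minimal sketch in C.
It is not taken from the PHP source or from the linked patch. The helper
names utf8_lead_info() and count_substitutes() are hypothetical, and the
sketch assumes the "one U+FFFD per maximal subpart" reading of the
Unicode recommendation.

```c
#include <stddef.h>

/* Hypothetical helper (not a PHP API): for a lead byte, store the range
 * allowed for the SECOND byte in *lo / *hi and return the number of
 * continuation bytes expected, or -1 if the byte cannot start a sequence. */
static int utf8_lead_info(unsigned char lead, unsigned char *lo, unsigned char *hi)
{
    *lo = 0x80;
    *hi = 0xBF;
    if (lead >= 0xC2 && lead <= 0xDF) return 1;
    if (lead == 0xE0) { *lo = 0xA0; return 2; }  /* rejects overlong forms */
    if (lead == 0xED) { *hi = 0x9F; return 2; }  /* rejects surrogates */
    if (lead >= 0xE1 && lead <= 0xEF) return 2;
    if (lead == 0xF0) { *lo = 0x90; return 3; }  /* rejects overlong forms */
    if (lead == 0xF4) { *hi = 0x8F; return 3; }  /* rejects > U+10FFFF */
    if (lead >= 0xF1 && lead <= 0xF3) return 3;
    return -1;
}

/* Count the U+FFFD characters a converter would emit for s[0..len),
 * replacing each maximal ill-formed subpart with exactly one U+FFFD. */
static size_t count_substitutes(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        unsigned char lo, hi;
        int need, got = 0;

        if (s[i] < 0x80) {          /* ASCII: always well-formed */
            i++;
            continue;
        }
        need = utf8_lead_info(s[i], &lo, &hi);
        i++;                        /* consume the lead byte */
        if (need < 0) {             /* stray continuation or invalid lead */
            count++;
            continue;
        }
        /* Consume continuation bytes only while they are in range; the
         * narrowed range applies to the second byte only. */
        while (got < need && i < len && s[i] >= lo && s[i] <= hi) {
            i++;
            got++;
            lo = 0x80;
            hi = 0xBF;
        }
        if (got < need) {
            count++;                /* truncated or ill-formed sequence */
        }
    }
    return count;
}
```

With this sketch, count_substitutes() returns 2 for the bytes E0 80 and 3
for E0 80 80, which matches the expected results in this report.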

One solution is to introduce a macro that validates the second byte
against the range permitted by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php
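
For readers who do not want to open the patch, a macro of this kind might
look roughly like the sketch below. The name UTF8_SECOND_BYTE_VALID and
its exact shape are assumptions made for illustration; the macro in the
linked patch may be written differently. It assumes that the first byte
has already been identified as a multi-byte lead (0xC2 - 0xF4) and that
both arguments are unsigned byte values.

```c
/* Hypothetical macro, for illustration only (not copied from the patch).
 * Evaluates to non-zero when `second` is a valid second byte after the
 * lead byte `lead`. The four special leads narrow the usual 0x80 - 0xBF
 * continuation range. Note that the arguments are evaluated more than
 * once, so they must not have side effects. */
#define UTF8_SECOND_BYTE_VALID(lead, second)                        \
    ((lead) == 0xE0 ? ((second) >= 0xA0 && (second) <= 0xBF) :      \
     (lead) == 0xED ? ((second) >= 0x80 && (second) <= 0x9F) :      \
     (lead) == 0xF0 ? ((second) >= 0x90 && (second) <= 0xBF) :      \
     (lead) == 0xF4 ? ((second) >= 0x80 && (second) <= 0x8F) :      \
                      ((second) >= 0x80 && (second) <= 0xBF))
```

A decoder could call such a check right after reading the lead byte and,
when it fails, emit U+FFFD for the lead byte alone and resume decoding at
the second byte instead of consuming both.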

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)

-- 
Edit bug report at https://bugs.php.net/bug.php?id=65323&edit=1