Req #65323 [Fbk]: improvement for counting ill-formed byte sequences

cataphract Thu, 01 Aug 2013 00:07:20 -0700


 ID:                 65323
 Updated by:         cataphr...@php.net
 Reported by:        masakielastic at gmail dot com
 Summary:            improvement for counting ill-formed byte sequences
 Status:             Feedback
 Type:               Feature/Change Request
 Package:            Strings related
 PHP Version:        5.5.1
 Assigned To:        cataphract
 Block user comment: N
 Private report:     N

 New Comment:

Unfortunately, it's also significantly slower... I have to look more closely.


Previous Comments:
------------------------------------------------------------------------
[2013-07-31 23:00:34] cataphr...@php.net

Can you test this branch?

https://github.com/cataphract/php-src/compare/bug65323

I basically rewrote the parser; it was getting too complicated.

------------------------------------------------------------------------
[2013-07-26 00:43:34] yohg...@php.net

Thank you for the report.
This seems good.

We are also discussing mb_scrub() as an alias of mb_convert_encoding(), i.e. 
calling the converter internally the way mb_convert_encoding() does.

A fix for the mbfl converter has been committed on the master branch.
We would appreciate it if you could check the current implementation.

------------------------------------------------------------------------
[2013-07-24 11:20:38] masakielastic at gmail dot com

"Table 3-8. Use of U+FFFD in UTF-8 Conversion" of The Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

------------------------------------------------------------------------
[2013-07-24 10:59:34] masakielastic at gmail dot com

Description:
------------
Consider how many substitute characters (U+FFFD) are emitted
when the valid range of the second byte of a UTF-8 sequence is narrower
than 0x80 - 0xBF (for example 0xA0 - 0xBF):

//      Code Points   First Byte Second Byte Third Byte Fourth Byte
//   U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
//   U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
//  U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
// U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF

If you follow the recommended policy described in "Table 3-8. Use of U+FFFD in 
UTF-8 Conversion" of The Unicode Standard,
"\xE0\x80" should be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD".
The actual result is "\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte against the 
range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php
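
For illustration only, the second-byte check described above could be sketched 
in C roughly as follows. This is not the code from the linked patch or from 
php-src. The names utf8_second_byte_ok, utf8_expected_len and count_ufffd are 
hypothetical, and the ranges are the ones listed in the table above. The 
counting function assumes the "one U+FFFD per maximal subpart" policy of 
Table 3-8.

```c
#include <stddef.h>

/* Hypothetical helper: is `second` a valid second byte after `first`?
 * Only E0, ED, F0 and F4 narrow the default 0x80 - 0xBF range. */
static int utf8_second_byte_ok(unsigned char first, unsigned char second)
{
    unsigned char lo = 0x80, hi = 0xBF;

    switch (first) {
    case 0xE0: lo = 0xA0; break; /* rejects overlong 3-byte forms */
    case 0xED: hi = 0x9F; break; /* rejects surrogates */
    case 0xF0: lo = 0x90; break; /* rejects overlong 4-byte forms */
    case 0xF4: hi = 0x8F; break; /* rejects values above U+10FFFF */
    default: break;
    }
    return second >= lo && second <= hi;
}

/* Hypothetical helper: sequence length implied by a lead byte,
 * or 0 if the byte can never start a well-formed sequence. */
static size_t utf8_expected_len(unsigned char first)
{
    if (first < 0x80) return 1;
    if (first >= 0xC2 && first <= 0xDF) return 2;
    if (first >= 0xE0 && first <= 0xEF) return 3;
    if (first >= 0xF0 && first <= 0xF4) return 4;
    return 0;
}

/* Count the U+FFFD characters a converter would emit for `s`,
 * assuming one replacement per maximal ill-formed subpart. */
size_t count_ufffd(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t need = utf8_expected_len(s[i]);
        size_t got = 1; /* bytes accepted so far, including the lead byte */

        if (need == 0) { /* stray continuation byte or invalid lead byte */
            count++;
            i++;
            continue;
        }
        if (need > 1 && i + 1 < len && utf8_second_byte_ok(s[i], s[i + 1])) {
            got = 2;
            /* third and fourth bytes always use the full 0x80 - 0xBF range */
            while (got < need && i + got < len
                   && s[i + got] >= 0x80 && s[i + got] <= 0xBF) {
                got++;
            }
        }
        if (got < need) {
            count++; /* the accepted prefix becomes a single U+FFFD */
        }
        i += got;
    }
    return count;
}
```

Under these assumptions, count_ufffd() returns 2 for "\xE0\x80" and 3 for 
"\xE0\x80\x80", which matches the expected result in the test script below: 
0x80 is rejected as a second byte after 0xE0, so the lead byte and each 
following byte are replaced separately.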

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(
        htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')
    );
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)


------------------------------------------------------------------------


