Edit report at https://bugs.php.net/bug.php?id=65045&edit=1

 ID:                 65045
 Updated by:         hirok...@php.net
 Reported by:        masakielastic at gmail dot com
 Summary:            mb_convert_encoding breaks well-formed character
-Status:             Verified
+Status:             Feedback
 Type:               Bug
 Package:            mbstring related
 Operating System:   Mac OSX
 PHP Version:        5.5.0RC3
-Assigned To:        
+Assigned To:        hirokawa
 Block user comment: N
 Private report:     N

 New Comment:

This problem is caused by an issue in libmbfl's handling of ill-formed UTF-8.
libmbfl is maintained at https://github.com/moriyoshi/libmbfl.
Please try the newest version of libmbfl from GitHub.


Previous Comments:
------------------------------------------------------------------------
[2013-06-22 14:02:28] a...@php.net

Related To: Bug #65081

------------------------------------------------------------------------
[2013-06-17 12:30:10] a...@php.net

I can reproduce that on Windows too, so the issue is probably not limited to
OS X. Here's a slightly modified snippet:

<?php

$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$exp1 = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";

if (true !== mb_substitute_character(0xFFFD)) {
        die("can't set substitute char\n");
}

print_hex($str1);
$s = mb_convert_encoding($str1, 'UTF-8', mb_detect_encoding($str1));
print_hex($s);

function print_hex($s)
{
        for ($i = 0; $i < strlen($s); $i++) {
                echo "0x", dechex(ord($s[$i])), " ";
        }
        echo "\n";
}

?>

And the output (pipes added manually as UTF-8 character separators):

0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2

0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xf0 0xa4 0xad 0xa2

As one can see, the first original invalid 3-byte sequence and the second valid
4-byte sequence are replaced with "0xef 0xbf 0xbd"; only the last one remains.
However, looking at the code, only libmbfl is involved
there: http://lxr.php.net/xref/PHP_5_5/ext/mbstring/mbstring.c#3011 . I'm not
sure yet whether I have overlooked something; I'll have to write a C snippet.
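
For illustration only: the pipes in the output above were added by hand. A minimal sketch of a helper that produces the same grouping automatically is shown here. The name hex_groups_sketch is hypothetical and not part of the original snippet. It groups bytes by lead byte for display and does not validate anything.

```php
<?php
// Hypothetical display helper (not from the original report).
// It starts a new group at every byte that is not a UTF-8 continuation
// byte (0x80-0xBF); it does not check that a group is well-formed.
function hex_groups_sketch($s)
{
    // One non-continuation byte plus any continuation bytes after it,
    // or a stray run of continuation bytes at the start of the input.
    preg_match_all('/[^\x80-\xBF][\x80-\xBF]*|[\x80-\xBF]+/s', $s, $m);

    $groups = array();
    foreach ($m[0] as $chunk) {
        $hex = array();
        foreach (str_split($chunk) as $byte) {
            $hex[] = sprintf('0x%02x', ord($byte));
        }
        $groups[] = implode(' ', $hex);
    }

    return implode(' | ', $groups);
}

$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
echo hex_groups_sketch($str1), "\n";
// prints: 0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2
```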

------------------------------------------------------------------------
[2013-06-16 23:17:01] masakielastic at gmail dot com

Description:
------------
When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in order
to replace ill-formed byte sequences with the substitute character (U+FFFD),
mb_convert_encoding also replaces the character following the ill-formed byte
sequence with the substitute character. In addition, mb_convert_encoding
deletes a trailing ill-formed byte sequence instead of replacing it with the
substitute character.

A comprehensive test case for 2- to 4-byte characters is
here: https://gist.github.com/masakielastic/5793665 .

Test script:
---------------
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);
var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')),
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)
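
For illustration only: the expected behaviour described in this report (one U+FFFD per truncated sequence, with well-formed neighbours left untouched) can be sketched in plain PHP without mbstring. The function name scrub_utf8_sketch is hypothetical. The pattern is an assumption based on the usual well-formed UTF-8 byte ranges and is not how libmbfl is implemented.

```php
<?php
// Hypothetical helper, not an mbstring or libmbfl API.
// Keeps well-formed UTF-8 characters and replaces each truncated sequence
// or stray byte with U+FFFD ("\xEF\xBF\xBD").
function scrub_utf8_sketch($bytes)
{
    $pattern = '/
        (                                  # group 1: one well-formed character
            [\x00-\x7F]
          | [\xC2-\xDF][\x80-\xBF]
          | \xE0[\xA0-\xBF][\x80-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
          | \xED[\x80-\x9F][\x80-\xBF]
          | \xF0[\x90-\xBF][\x80-\xBF]{2}
          | [\xF1-\xF3][\x80-\xBF]{3}
          | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )
      | \xE0[\xA0-\xBF]                    # truncated 3-byte sequences
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]
      | \xED[\x80-\x9F]
      | \xF0[\x90-\xBF][\x80-\xBF]?        # truncated 4-byte sequences
      | [\xF1-\xF3][\x80-\xBF]{1,2}
      | \xF4[\x80-\x8F][\x80-\xBF]?
      | .                                  # any other stray byte
    /xs';

    return preg_replace_callback($pattern, function ($m) {
        // Group 1 is only set when a well-formed character matched.
        return isset($m[1]) && $m[1] !== '' ? $m[1] : "\xEF\xBF\xBD";
    }, $bytes);
}

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
var_dump($expected === scrub_utf8_sketch($str));
```

With this sketch, the ill-formed prefix becomes a single U+FFFD and both following copies of U+24B62 survive, which is the result the test script above expects from mb_convert_encoding.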


------------------------------------------------------------------------


