Edit report at https://bugs.php.net/bug.php?id=65045&edit=1
ID:                 65045
Updated by:         hirok...@php.net
Reported by:        masakielastic at gmail dot com
Summary:            mb_convert_encoding breaks well-formed character
-Status:            Feedback
+Status:            Closed
Type:               Bug
Package:            mbstring related
Operating System:   Mac OSX
PHP Version:        5.5.0RC3
Assigned To:        hirokawa
Block user comment: N
Private report:     N

New Comment:

Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c6a7549efcca62346687b0fda5b408b963f5ab2d
Log: fixed #65045: mb_convert_encoding breaks well-formed character.

Previous Comments:
------------------------------------------------------------------------
[2013-06-30 02:49:42] hirok...@php.net

This problem is caused by an issue in libmbfl's handling of ill-formed UTF-8.
libmbfl is maintained at https://github.com/moriyoshi/libmbfl.
Please try the newest version of libmbfl from GitHub.

------------------------------------------------------------------------
[2013-06-22 14:02:28] a...@php.net

Related To: Bug #65081

------------------------------------------------------------------------
[2013-06-17 12:30:10] a...@php.net

I can reproduce that on Windows too, so the issue is probably not limited to OS X. Here's a slightly modified snippet:

<?php

$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$exp1 = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";

if (true !== mb_substitute_character(0xFFFD)) {
    die("can't set substitute char\n");
}

print_hex($str1);
$s = mb_convert_encoding($str1, 'UTF-8', mb_detect_encoding($str1));
print_hex($s);

function print_hex($s)
{
    for ($i = 0; $i < strlen($s); $i++) {
        echo "0x", dechex(ord($s[$i])), " ";
    }
    echo "\n";
}

?>

And the output (pipes added manually as UTF-8 character separators):

0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2
0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xf0 0xa4 0xad 0xa2

As one can see, the first original invalid 3-byte sequence and the second valid 4-byte sequence are replaced with "0xef 0xbf 0xbd"; the last one remains. However, looking at the code, only libmbfl is involved there: http://lxr.php.net/xref/PHP_5_5/ext/mbstring/mbstring.c#3011 . I'm not sure yet whether I have overlooked something; I'll have to write a C snippet.

------------------------------------------------------------------------
[2013-06-16 23:17:01] masakielastic at gmail dot com

Description:
------------
When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in order to replace ill-formed byte sequences with the substitute character (U+FFFD), mb_convert_encoding also replaces the character following an ill-formed byte sequence with the substitute character.

mb_convert_encoding also deletes a trailing ill-formed byte sequence instead of replacing it with the substitute character.

A comprehensive test case for 2-4 byte characters is here: https://gist.github.com/masakielastic/5793665 .

Test script:
---------------
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);

var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')),
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)

------------------------------------------------------------------------