Bug #52810 [Opn->Bgs]: substr() and $string[n] corrupt multi-byte UTF-8 strings

cataphract Fri, 10 Sep 2010 04:55:15 -0700

Edit report at http://bugs.php.net/bug.php?id=52810&edit=1


 ID:                 52810
 Updated by:         cataphr...@php.net
 Reported by:        trane at gol dot com
 Summary:            substr() and $string[n] corrupt multi-byte UTF-8
                     strings
-Status:             Open
+Status:             Bogus
 Type:               Bug
 Package:            Strings related
 Operating System:   OS X 10.6.4
 PHP Version:        Irrelevant
 Block user comment: N

 New Comment:

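This is not a bug.

substr(), $str[n], and $str{n} all treat the string as a byte array, so
an offset that falls inside a multi-byte UTF-8 sequence returns one byte
of that sequence rather than a whole character. If you want to get the
n-th Unicode code point, use mb_substr() and pass the string's encoding.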
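
To illustrate the difference between byte offsets and code-point
offsets, here is a minimal sketch. It assumes the mbstring extension is
loaded and that the input is valid UTF-8; the sample string and the
variable names are made up for the example and are not taken from the
report.

```php
<?php
// Assumption: mbstring is available and $s holds valid UTF-8.
// "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" is "日本語": 3 code points, 9 bytes.
$s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Byte-oriented access: offset 3 is the first byte of the second character.
$byte = substr($s, 3, 1); // "\xE6", an incomplete UTF-8 sequence
$same = $s[3];            // the same single byte

// Code-point-oriented access: offset 1 is the second character.
$char = mb_substr($s, 1, 1, 'UTF-8'); // "本"

printf("%d bytes, %d code points\n", strlen($s), mb_strlen($s, 'UTF-8'));
// prints "9 bytes, 3 code points"
echo bin2hex($byte), "\n"; // prints "e6"
echo bin2hex($char), "\n"; // prints "e69cac"
```

A lone lead byte such as "\xE6" is not valid UTF-8 on its own, which is
why a browser or terminal typically shows a replacement character for
it. Passing the encoding explicitly avoids depending on whatever
mb_internal_encoding() happens to be set to.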


Previous Comments:
------------------------------------------------------------------------
[2010-09-10 12:46:44] trane at gol dot com

Description:
------------
(PHP 5.3.2 (cli) (built: Aug  7 2010 00:04:41)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies)



When trying to extract a single character from a UTF-8-encoded Japanese
string, instead of the expected character, one gets the dreaded
black-diamond-question-mark-of-death (U+FFFD, the Unicode replacement
character).





Test script:
---------------
$s_string = "éå²¡ã¯è¸ãæãã§ãã";

echo $s_string[3], "<p />";

// expected output is è¸

// actual output is ï¿½

print_r($s_string[3]);

// expected output is è¸

// actual output is ï¿½

echo "<p />";

$sub = substr($s_string, 3, 1);

echo $sub, "<p />";

// expected output is è¸

// actual output is ï¿½

Expected result:
----------------
Expected output is 蒸





Actual result:
--------------
Actual output is ï¿½




------------------------------------------------------------------------


