Edit report at https://bugs.php.net/bug.php?id=63079&edit=1
 ID:                 63079
 Comment by:         Matti dot jarvinen at nitroid dot fi
 Reported by:        astatutov at gmail dot com
 Summary:            String access by character is not multibyte-safe
 Status:             Open
 Type:               Bug
 Package:            Strings related
 PHP Version:        Irrelevant
 Block user comment: N
 Private report:     N

 New Comment:

Under "String access and modification by character" at
http://php.net/manual/en/language.types.string.php there is no mention that the [] syntax is not multibyte-safe.

At least make this a documentation issue.


Previous Comments:
------------------------------------------------------------------------
[2012-09-13 19:27:35] astatutov at gmail dot com

*mbstring* just determines which module reads this option. It doesn't say which module the option affects. For example, mbstring.func_overload affects the whole of PHP, because it overrides native functions. The option mbstring.http_input changes default PHP behavior when reading an HTTP request, and so on.

So why can't mbstring.func_overload or, say, mbstring.op_overload override the string access operation?

------------------------------------------------------------------------
[2012-09-13 14:12:32] larue...@php.net

As the option name itself says, it is *mbstring*.internal_encoding, not php.internal_encoding...

------------------------------------------------------------------------
[2012-09-13 11:57:39] astatutov at gmail dot com

> you should use mb_* to deal with multi-byte characters

I know. I mentioned it in the description. The option mbstring.func_overload does it for me. But the bracket operator is still unusable: the documentation states that it accesses a character, while it doesn't.

And I believe it's not a documentation problem. Every modern language I know of that can work with utf-8 does so transparently for the developer. The aim of mbstring is the same, isn't it?

A developer who sets mbstring.internal_encoding to utf-8 will expect the INTERNAL string access operator to support it. This is what the term "predictable behavior" means.

------------------------------------------------------------------------
[2012-09-13 10:48:25] larue...@php.net

yeah, it's not. you should use mb_* to deal with multi-byte characters

------------------------------------------------------------------------
[2012-09-13 09:58:55] astatutov at gmail dot com

Description:
------------
I know there is a section named "Details of the String Type" in the documentation. But there is still another section that states "Think of a string as an array of characters for this purpose". It is very convenient to think so.

We use the mbstring extension to work entirely in utf-8, and the mbstring.func_overload option lets us almost forget about the differences between regular and multibyte strings. We just write our application, thinking about its native logic, not PHP's internal logic. This is a high-level programming language, by the way. We use strlen, substr, etc. just as we do with regular strings.

And BANG! The string bracket operator returns bytes, not characters! I think it's unpredictable behavior, even if it were well documented (but it's not).

Considering that the use of utf-8 is growing everywhere and maybe even PHP 6 will support it by default, why not implement multibyte support for bracket operations now, in the mbstring extension? Of course, it must be configurable to stay backward compatible.

I know we can use substr as a replacement for the string access operation, but it's very slow and it's wrong in general.

I also know this is not the first bug on this subject. There was #51919, for example, which was closed and marked as not a bug. But I propose to look at this problem from the point of view of the language logic, not the implementation.

Sorry if I've missed something else.

Test script:
---------------
$str = "KÄ t";
echo $str[1];

Expected result:
----------------
Ä

Actual result:
--------------
�

------------------------------------------------------------------------
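
The byte-versus-character behavior described in this report can be reproduced with a short script. This is a minimal sketch, not code from the report: it assumes PHP 7.0 or later with the mbstring extension loaded, it uses its own sample string "KÄt" written with a Unicode escape so that the file encoding does not matter, and the helper name char_at() is hypothetical.

```php
<?php
// Illustrative sketch only. char_at() is a hypothetical helper name,
// not part of PHP or of the mbstring extension.
// Assumes PHP >= 7.0 and that the mbstring extension is loaded.

function char_at(string $s, int $index, string $encoding = 'UTF-8'): string
{
    // mb_substr() counts characters in the given encoding, not bytes.
    return mb_substr($s, $index, 1, $encoding);
}

// "KÄt": in UTF-8, "Ä" (U+00C4) is encoded as the two bytes 0xC3 0x84.
$str = "K\u{00C4}t";

echo strlen($str), "\n";             // 4 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n"; // 3 (characters)
echo bin2hex($str[1]), "\n";         // c3 (only the first byte of "Ä")
echo char_at($str, 1), "\n";         // Ä
```

The bracket operator indexes bytes, so $str[1] yields half of a two-byte sequence, which a UTF-8 terminal renders as a replacement character. Passing the encoding explicitly to mb_substr() keeps the sketch independent of the mbstring.internal_encoding setting discussed in the comments.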