 ID:               37571
 User updated by:  jdolecek at NetBSD dot org
 Reported By:      jdolecek at NetBSD dot org
-Status:           Bogus
+Status:           Open
 Bug Type:         WDDX related
 Operating System: Any
 PHP Version:      5.1.4
 New Comment:

You probably don't understand the problem. I'm not talking about a problem encoding iso-8859-1 text, but about a problem encoding text in _UTF-8_.

A UTF-8 stream legally contains bytes in the 128-160 range. Hopefully we agree here.

WDDX uses iscntrl() to determine whether it should record a character in <char code="XX"/> form. So it takes each byte of a multibyte UTF-8 sequence, and if _that single byte of the sequence_ is determined to be a control character according to the current locale, it turns that component of the multibyte sequence into a <char code="XX"/> construct. So it turns a perfectly valid UTF-8 stream into an invalid text stream, where some UTF-8 sequences are valid and some are not.

The problem is that it uses iscntrl(), while it arguably should enforce valid UTF-8 input and use something along the lines of iswcntrl(). But this would change the interface and likely break existing code using WDDX which depends on passing iso-8859-1 text as input to the serializer.

Using iscntrl() + isascii() definitely solves the problem in the least obtrusive way AFAICS.


Previous Comments:
------------------------------------------------------------------------

[2006-05-24 06:46:22] [EMAIL PROTECTED]

Latin 1 doesn't define those characters in the 128-160 range... so it's perfectly correct not to encode them to UTF-8. You simply need to make sure you have valid text in the first place.

------------------------------------------------------------------------

[2006-05-23 22:50:20] jdolecek at NetBSD dot org

Description:
------------
WDDX cannot be used to encode certain UTF-8-encoded iso-8859-1 text. Particularly those iso-8859-1 characters which, after conversion to UTF-8, generate a sequence of bytes with values in the 128-160 range, which are recognized as control characters.

Control characters are turned into a <char code="XX"/> sequence by WDDX. wddx_deserialize() expects a UTF-8 encoded string, and implicitly converts the text back to iso-8859-1 before deserializing the structure.
This is done _before_ the <char code="XX"/> is replaced by the character. The < is thus recognized as part of the UTF-8 sequence, the two-byte sequence is recoded to a single-byte character, and the result contains invalid XML (the fragment 'char code="XX"/>'). Deserialization thus fails silently.

I.e.:
1. iso-8859-1 is Z (ord(Z) > 128)
2. UTF-8 string is XY
3. WDDX serializes that as X<char code="ord(Y)"/>
4. deserializer converts UTF-8 input to iso-8859-1 before starting deserialization, result is Bchar code="ord(Y)"/>
5. deserializer detects invalid XML and aborts the decode, returns empty string

Fix: Only recode ASCII control characters to the <char code="XX"/> sequence:

--- wddx.c.orig 2006-05-24 00:39:34.000000000 +0200
+++ wddx.c
@@ -399,7 +399,8 @@ static void php_wddx_serialize_string(wd
                                break;
 
                        default:
-                               if (iscntrl((int)*(unsigned char *)p)) {
+                               if (iscntrl((int)*(unsigned char *)p)
+                                   && isascii((int)*(unsigned char *)p)) {
                                        FLUSH_BUF();
                                        sprintf(control_buf, WDDX_CHAR, *p);
                                        php_wddx_add_chunk(packet, control_buf);

Note - this patch also makes the problem of Bug #37569 go away, but that patch is still useful to apply for code clarity.

This bug is probably the same problem as Bug #35241.

Reproduce code:
---------------
On UNIX with an iso-8859-1 locale or Windows with a Windows-1250 locale:

var_dump( wddx_deserialize(wddx_serialize_value(utf8_encode(chr(200)))) );

Expected result:
----------------
string(1) "Č"

Actual result:
--------------
string(0) ""

------------------------------------------------------------------------
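
To illustrate the difference the patch makes, here is a minimal standalone sketch in C. It is not the code from wddx.c: latin1_iscntrl(), is_ascii_byte() and escape_controls() are hypothetical helpers, and latin1_iscntrl() only assumes what iscntrl() reports under an iso-8859-1 locale, so that the result does not depend on the locale of the machine running it. The %02X format is likewise an assumption modeled on the <char code="XX"/> form described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for iscntrl() under an iso-8859-1 locale
 * (an assumption): C0 controls, DEL, and the C1 range 0x80-0x9F. */
static int latin1_iscntrl(unsigned char c)
{
    return c < 0x20 || c == 0x7f || (c >= 0x80 && c <= 0x9f);
}

/* Same condition isascii() checks: only 0x00-0x7F qualify. */
static int is_ascii_byte(unsigned char c)
{
    return c < 0x80;
}

/* Copy `in` to `out`, replacing control bytes with <char code="XX"/>.
 * ascii_only == 0 models the current behaviour (every "control" byte
 * is escaped); ascii_only != 0 models the proposed fix (bytes >= 0x80
 * are never escaped).
 * Returns the number of bytes written, excluding the terminating NUL,
 * or -1 if `out` is too small. */
static int escape_controls(const char *in, char *out, size_t outsz,
                           int ascii_only)
{
    size_t used = 0;
    const unsigned char *p;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    for (p = (const unsigned char *)in; *p; p++) {
        int esc = latin1_iscntrl(*p) && (!ascii_only || is_ascii_byte(*p));
        int n;

        if (esc)
            n = snprintf(out + used, outsz - used,
                         "<char code=\"%02X\"/>", (unsigned)*p);
        else
            n = snprintf(out + used, outsz - used, "%c", *p);

        if (n < 0 || (size_t)n >= outsz - used)
            return -1;      /* output truncated or encoding error */
        used += (size_t)n;
    }
    return (int)used;
}
```

With ascii_only set to 0, the two-byte input "\xC3\x88" (the UTF-8 form of chr(200)) comes out as \xC3 followed by <char code="88"/>, which is no longer valid UTF-8. With ascii_only set to 1 it passes through unchanged, while a genuine ASCII control such as \x01 is still escaped.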