ID:               37571
 User updated by:  jdolecek at NetBSD dot org
 Reported By:      jdolecek at NetBSD dot org
-Status:           Bogus
+Status:           Open
 Bug Type:         WDDX related
 Operating System: Any
 PHP Version:      5.1.4
 New Comment:

You probably don't understand the problem. I'm not talking about a
problem encoding iso-8859-1 text, but about a problem encoding text in
_UTF-8_.

A UTF-8 stream legally contains bytes in the 128-160
range. Hopefully we agree here.

WDDX uses iscntrl() to decide whether to write a character in
<char code="XX"/> form. So it takes each byte of a multibyte
UTF-8 sequence, and if _that single byte of the sequence_ is
classified as a control character under the current locale, it
turns that component of the multibyte sequence into a <char code="XX"/>
construct.

So it turns a perfectly valid UTF-8 stream into an invalid text stream,
in which some UTF-8 sequences are still valid and some are not.
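
To illustrate the effect, here is a minimal sketch and not the actual wddx.c code. Both latin1_iscntrl() and serialize_bytewise() are hypothetical names. latin1_iscntrl() stands in for iscntrl() under an iso-8859-1 locale, on the assumption that such a locale classifies 0x80-0x9F as control, and the two-digit hex format of the code attribute is also an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stand-in for iscntrl() under an iso-8859-1 locale, where
 * 0x00-0x1F, 0x7F and the C1 range 0x80-0x9F count as control bytes. */
static int latin1_iscntrl(unsigned char c)
{
    return c < 0x20 || (c >= 0x7F && c <= 0x9F);
}

/* Hypothetical byte-wise serializer: escapes every "control" byte as
 * <char code="XX"/> with no knowledge of multibyte sequences.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t serialize_bytewise(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(out_size > 0);
    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        int w;

        if (latin1_iscntrl(*p))
            w = snprintf(out + n, out_size - n,
                         "<char code=\"%02X\"/>", (unsigned)*p);
        else
            w = snprintf(out + n, out_size - n, "%c", *p);

        if (w < 0 || (size_t)w >= out_size - n) {
            out[n] = '\0';  /* buffer full: drop the partial write and stop */
            break;
        }
        n += (size_t)w;
    }
    return n;
}
```

Feeding it the two bytes C3 88 (the UTF-8 form of iso-8859-1 character 200) yields the byte C3 followed by <char code="88"/>, which is no longer valid UTF-8.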

The problem is that it uses iscntrl(), while it arguably should enforce
valid UTF-8 input and use something along the lines of iswcntrl(). But
this would change the interface and likely break existing code using
WDDX which depends on passing iso-8859-1 text as input to the serializer.

Using iscntrl() + isascii() definitely solves the problem in the least
obtrusive way AFAICS.
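
A minimal sketch of the combined check, for illustration only. needs_char_escape() and count_escapes() are hypothetical helpers, not functions from wddx.c. The ASCII test is written out by hand as c < 0x80, which is what isascii() checks, because isascii() is a POSIX extension rather than ISO C.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same range test as POSIX isascii(): true for bytes 0x00-0x7F. */
static int is_ascii_byte(unsigned char c)
{
    return c < 0x80;
}

/* Hypothetical predicate: escape only ASCII control bytes. Bytes >= 0x80
 * are never escaped, whatever the current locale says about them, so
 * UTF-8 lead and continuation bytes pass through untouched. */
static int needs_char_escape(unsigned char c)
{
    return is_ascii_byte(c) && iscntrl(c);
}

/* Counts how many bytes of a NUL-terminated string would be escaped. */
static size_t count_escapes(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if (needs_char_escape(*p))
            count++;
    }
    return count;
}
```

With this predicate the bytes C3 88 are left alone, while an ASCII control byte such as 0x01 is still escaped.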


Previous Comments:
------------------------------------------------------------------------

[2006-05-24 06:46:22] [EMAIL PROTECTED]

Latin 1 doesn't define those characters in the 128-160 range... so it's
perfectly correct not to encode them to UTF-8. You simply need to make
sure you have valid text in the first place.

------------------------------------------------------------------------

[2006-05-23 22:50:20] jdolecek at NetBSD dot org

Description:
------------
WDDX cannot be used to encode certain UTF-8-encoded iso-8859-1 text,
particularly those iso-8859-1 characters which, after conversion to
UTF-8, produce a sequence containing bytes with values in the 128-160
range, which are recognized as control characters. Control characters
are turned into a <char code="XX"/> sequence by WDDX.

wddx_deserialize() expects a UTF-8 encoded string, and implicitly
converts the text back to iso-8859-1 before deserializing the
structure. This is done _before_ the <char code="XX"/> is replaced by
the character. The < is thus treated as part of the UTF-8 sequence, the
two-byte sequence is recoded to a single-byte character, and the result
contains invalid XML (the fragment 'char code="XX"/>'). Deserialization
thus fails silently.

I.e.:
1. iso-8859-1 character is Z (ord(Z) > 128)
2. UTF-8 string is XY
3. WDDX serializes that as X<char code="ord(Y)"/>
4. deserializer converts UTF-8 input to iso-8859-1 before
   starting deserialization, result is Bchar code="ord(Y)"/>
   (B being the bogus character decoded from the pair X<)
5. deserializer detects invalid XML and aborts the decode,
   returns empty string
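
Step 4 can be sketched as follows. This is an assumption about how a lenient decoder behaves, written to match the description above; it is not PHP's actual conversion routine, and naive_utf8_to_latin1() is a hypothetical name. The decoder trusts the lead byte and never checks that the next byte is a real continuation byte.

```c
#include <assert.h>
#include <string.h>

/* Assumed lenient UTF-8 to iso-8859-1 decoder. It handles one- and
 * two-byte sequences only, and after a two-byte lead byte (110xxxxx) it
 * consumes the next byte without checking that it has the 10xxxxxx form.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t naive_utf8_to_latin1(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(out_size > 0);
    while (*p && n + 1 < out_size) {
        if ((*p & 0xE0) == 0xC0 && p[1] != '\0') {
            /* Combine 5 payload bits of the lead byte with the low 6 bits
             * of whatever byte follows, valid continuation or not. */
            out[n++] = (char)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            out[n++] = (char)*p++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Given the byte C3 followed by <char code="88"/>, this decoder merges C3 and < into the single byte FC and leaves char code="88"/> behind, which is the invalid XML fragment from step 4. A valid pair such as C3 88 decodes to C8 as intended.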

Fix:

Only recode ASCII control characters to a <char code="XX"/> sequence:

--- wddx.c.orig 2006-05-24 00:39:34.000000000 +0200
+++ wddx.c
@@ -399,7 +399,8 @@ static void php_wddx_serialize_string(wd
                                        break;

                                default:
-                                       if (iscntrl((int)*(unsigned char *)p)) {
+                                       if (iscntrl((int)*(unsigned char *)p)
+                                           && isascii((int)*(unsigned char *)p)) {
                                                FLUSH_BUF();
                                                sprintf(control_buf, WDDX_CHAR, *p);
                                                php_wddx_add_chunk(packet, control_buf);

Note - this patch also makes the problem of Bug #37569 go away, but that
patch is still useful to apply for code clarity.

This bug is probably the same problem as Bug #35241.


Reproduce code:
---------------
On UNIX with iso-8859-1 locale or Windows with Windows-1250 locale:

var_dump(
    wddx_deserialize(wddx_serialize_value(utf8_encode(chr(200))))
    );


Expected result:
----------------
string(1) "Č"

Actual result:
--------------
string(0) ""



------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=37571&edit=1
