ID:               35241
 Updated by:       [EMAIL PROTECTED]
 Reported By:      mikx at mikx dot de
-Status:           Open
+Status:           Bogus
 Bug Type:         WDDX related
 Operating System: Linux, Windows
 PHP Version:      5CVS-2005-11-16 (snap)
 New Comment:

To handle UTF-8 data you need to call utf8_encode() on the data
itself and add an XML header identifying the data as UTF-8.
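
A minimal sketch of one reading of the advice above, assuming the input
is ISO-8859-1 and the packet is assembled by hand in the same shape as
the packets in the reproduce code. build_utf8_wddx_packet() is a
hypothetical helper, not part of the WDDX extension, and the result of
deserializing its output is not asserted here.

```php
<?php
// Hypothetical helper (not part of the WDDX extension): wraps a single
// string value in a WDDX packet whose XML header declares UTF-8.
function build_utf8_wddx_packet($latin1)
{
    // Assumption: $latin1 holds ISO-8859-1 bytes.
    // utf8_encode() converts ISO-8859-1 to UTF-8.
    $utf8 = utf8_encode($latin1);

    // Escape &, < and > so the value cannot break the XML markup.
    $escaped = htmlspecialchars($utf8, ENT_NOQUOTES, 'UTF-8');

    return '<?xml version="1.0" encoding="UTF-8"?>'
         . "<wddxPacket version='1.0'><header/><data><string>"
         . $escaped
         . "</string></data></wddxPacket>";
}

// "abc-" followed by a-umlaut, o-umlaut, u-umlaut as ISO-8859-1 bytes.
$packet = build_utf8_wddx_packet("abc-\xE4\xF6\xFC");

// Only deserialize when the WDDX extension is actually loaded.
if (function_exists('wddx_deserialize')) {
    var_dump(wddx_deserialize($packet));
}
```

Writing the input with explicit byte escapes keeps the example
independent of the encoding the script file itself is saved in.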


Previous Comments:
------------------------------------------------------------------------

[2005-11-16 16:07:59] mikx at mikx dot de

Tried the snapshot for Windows you linked to (PHP Version
5.1.0RC5-dev). Result for the testcase is exactly the same as with
5.0.5.

------------------------------------------------------------------------

[2005-11-16 15:53:55] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip



------------------------------------------------------------------------

[2005-11-16 15:20:42] mikx at mikx dot de

Description:
------------
The behavior of wddx_deserialize seems inconsistent, or at least
unpredictable given the documentation. It varies not only between PHP 4
and 5 but also with the packet data. I am not sure whether this is a
bug or expected behavior. I am aware of bug #34928, so please don't
just treat this as bogus.

The following script behaves as described on PHP 5.0.5 on Windows,
5.0.4 on Linux (I currently have no 5.0.5 Linux test case available) and
PHP 4.3.9 on Linux. At least the Windows version is a complete default
installation.

Please clarify what exactly wddx serialize and deserialize do
(encoding), why the documentation encourages applying an additional
utf8_encode to non-ASCII characters on serialize, and how the entire
process can be influenced (e.g. which config settings are used).
setlocale() and putenv("locale=xyz") have no effect.

Currently wddx_serialize adds no character set information and keeps
whatever you supply as a string inside the resulting wddx file. So if
you send an extended character in ISO-8859-1 or UTF-8 it will be the
same in the resulting wddx packet.

The deserializer seems to always convert the packet to ISO-8859-1
unless you explicitly set information in the XML file that it is
already ISO-8859-1 (even if there is UTF-8 content in it). 

If the documentation's advice to always utf8_encode a string before
passing it to serialize is correct, you would have to double-encode
a UTF-8 string. But that seems like a dirty workaround.


From my perspective both wddx_serialize and wddx_deserialize should
add/respect the encoding information in the XML file and accept an
additional parameter to enforce an input or output encoding or override
the default behavior.

Currently I am trying to deserialize wddx packets produced with PHP4 in
PHP5. They are stored in a database, first in MySQL4 (latin1 encoded)
and now migrated to MySQL5 (utf8 encoded). What is the proper way to
handle that? Applying utf8_encode to the packet (producing a
double-encoded packet) before passing it to wddx_deserialize (which
implicitly applies a utf8_decode to that data) seems like an evil hack
in an undocumented area.

This seems like a common migration path to me, so please specify
clearly what to expect and what to do.
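
For the migration case above, one possible way to avoid encoding a
packet twice is to check whether the stored bytes are already
well-formed UTF-8 before converting. This is only a sketch under that
assumption: is_valid_utf8() and to_utf8_once() are hypothetical helpers,
and the check is a heuristic, since a latin1 string can by coincidence
also be valid UTF-8.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
function is_valid_utf8($bytes)
{
    // With the /u modifier, preg_match() does not match a subject
    // that contains malformed UTF-8, so only valid input yields 1.
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: convert to UTF-8 only if not already UTF-8.
function to_utf8_once($bytes)
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return is_valid_utf8($bytes) ? $bytes : utf8_encode($bytes);
}

// "abc-" plus a-umlaut, o-umlaut, u-umlaut in each encoding.
$latin1 = "abc-\xE4\xF6\xFC";
$utf8   = "abc-\xC3\xA4\xC3\xB6\xC3\xBC";

// Both inputs end up as the same UTF-8 byte sequence.
var_dump(to_utf8_once($latin1) === $utf8);
var_dump(to_utf8_once($utf8) === $utf8);
```

Pure ASCII input is valid UTF-8 and passes through unchanged. Whether
the deserializer then returns the expected characters still depends on
the behavior described in this report.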






Reproduce code:
---------------
<?php

header("Content-type: text/html; charset=UTF-8"); 

echo "ISO-8859-1 specified, ISO-8859-1 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\"
encoding=\"ISO-8859-1\"?><wddxPacket
version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\"
encoding=\"UTF-8\"?><wddxPacket
version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";;

echo "Nothing specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<wddxPacket
version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";;

echo "ISO-8859-1 specified, UTF-8 data<br>";
echo "produces utf-8 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\"
encoding=\"ISO-8859-1\"?><wddxPacket
version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";;

echo "UTF-8 specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\"
encoding=\"UTF-8\"?><wddxPacket
version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";;

echo "Nothing specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<wddxPacket
version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";;

?>



------------------------------------------------------------------------

