ID:               43896
 Comment by:       tallyce at gmail dot com
 Reported By:      arnaud dot lb at gmail dot com
 Status:           Open
 Bug Type:         *General Issues
 Operating System: Any
 PHP Version:      5.2.5
 New Comment:

See also bugs 43294 and 43549, which seem to be the same thing.

This is really starting to bite now. Could this please be fixed, or
could someone suggest how we can reliably process incoming user data
in UTF-8 given this behaviour change?


Previous Comments:
------------------------------------------------------------------------

[2008-01-24 12:29:58] arnaud dot lb at gmail dot com

I made a patch for this bug:

http://s3.amazonaws.com/arnaud.lb/php_htmlentities_utf.patch

The internal get_next_char() function returns a status of FAILURE 
when it encounters an invalid or incomplete sequence, which causes 
the htmlspecialchars and htmlentities functions to return an empty 
string.

This patch modifies the behavior of these functions to skip invalid 
sequences without discarding the whole string. It involves very 
few changes and makes the behavior of these functions more 
consistent with previous PHP versions.

It also adds a few tests to htmlentities-utf.phpt.
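
For illustration only, the skipping behavior can be approximated in 
userland. This is a rough sketch and not the patch itself: 
escape_skipping_invalid_utf8() is a hypothetical helper name, and the 
byte-level pattern is my assumption of what counts as a valid sequence 
(well-formed UTF-8 as in RFC 3629).

```php
<?php
// Hypothetical userland helper, NOT part of PHP or of the patch.
// It keeps only well-formed UTF-8 sequences, then escapes the result.
function escape_skipping_invalid_utf8($input)
{
    // One well-formed UTF-8 character, matched byte by byte (no /u flag).
    $validChar = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';

    // Bytes that do not belong to any match are simply left out.
    preg_match_all($validChar, $input, $matches);
    $clean = implode('', $matches[0]);

    return htmlspecialchars($clean, ENT_COMPAT, 'UTF-8');
}

var_dump(escape_skipping_invalid_utf8("Voil\xE0"));
```

With the string from the original report, the lone 0xE0 byte is dropped 
and the rest of the string survives.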

------------------------------------------------------------------------

[2008-01-20 02:12:01] arnaud dot lb at gmail dot com

Description:
------------
htmlspecialchars/htmlentities return an empty string when the input 
contains an invalid Unicode sequence.

I think these functions should just skip the invalid sequences or 
encode them byte by byte (e.g. 0xE9 => &#xE9;), instead of 
discarding the whole string.

Sometimes you have to display arbitrary strings of unknown encoding. 
So you make them safer using htmlspecialchars($string, 
ENT_COMPAT, "site_encoding, utf-8 in my case"), but if there is at 
least one invalid sequence in the string, it returns an empty 
string :/
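
A rough sketch of the "byte by byte" alternative, to show what I mean. 
The names escape_encoding_invalid_bytes() and encode_piece() are 
hypothetical, and the definition of a valid sequence (well-formed 
UTF-8 as in RFC 3629) is an assumption on my part, not what PHP does 
internally.

```php
<?php
// Hypothetical callback: escape a valid character, or turn a stray
// byte into a hexadecimal numeric entity.
function encode_piece($m)
{
    // Group 2 is only present when the "any single byte" branch matched.
    if (isset($m[2])) {
        return sprintf('&#x%02X;', ord($m[2]));
    }
    return htmlspecialchars($m[1], ENT_COMPAT, 'UTF-8');
}

// Hypothetical userland helper, NOT part of PHP.
function escape_encoding_invalid_bytes($input)
{
    // Group 1: one well-formed UTF-8 character. Group 2: any other byte.
    $pattern = '/((?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}))|(.)/s';

    return preg_replace_callback($pattern, 'encode_piece', $input);
}

var_dump(escape_encoding_invalid_bytes("Voil\xE0"));
```

This keeps every byte of the input visible in some form, at the cost of 
guessing that a stray byte maps to the code point with the same value 
(true for ISO-8859-1, not for every legacy encoding).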

Reproduce code:
---------------
$string = "Voil\xE0"; // "Voilà", in ISO-8859-15

var_dump(htmlspecialchars($string, ENT_COMPAT, "utf-8"));


Expected result:
----------------
string(4) "Voil"

OR 

string(10) "Voil&#xE0;"

Actual result:
--------------
string(0) ""


------------------------------------------------------------------------

